Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread Jörn Franke
I would import the data via Sqoop and put it on HDFS. Sqoop has mechanisms to handle the lack of reliability of JDBC. Then you can process the data via Spark. You could also use JdbcRDD, but I do not recommend it, because you do not want to pull data out of the database all the time when
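
A minimal sketch of the Spark half of that pipeline, assuming the Sqoop export landed as delimited text under a hypothetical HDFS path:

    // Read the files Sqoop wrote to HDFS, then process them in Spark.
    // The path and record layout are assumptions for illustration.
    val lines = sc.textFile("hdfs:///user/etl/sqoop-export/orders")
    val fields = lines.map(_.split(','))  // Sqoop's default field delimiter is a comma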

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread ayan guha
I may be wrong here, but beeline is basically a client library. So you "connect" to STS and/or HS2 using beeline. Spark connecting over JDBC is a different discussion and is in no way related to beeline. When you read data from a DB (Oracle, DB2, etc.) you do not use beeline, but a jdbc connection to

Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-21 Thread Nirav Patel
I have an RDD[(String, MyObj)] which is the result of a Join + Map operation. It has no partitioner info. I run reduceByKey without passing any Partitioner or partition count. I observed that the aggregated output for a given key is sometimes incorrect, maybe 1 out of 5 times. It looks like reduce
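
One way to take partitioning out of the equation while debugging, sketched below with stand-in names; note that the function passed to reduceByKey must be associative and commutative, since values are merged within each partition before partitions are merged:

    import org.apache.spark.HashPartitioner

    // joinedAndMapped is a hypothetical RDD[(String, MyObj)];
    // merge is a hypothetical associative, commutative combiner on MyObj.
    val reduced = joinedAndMapped.reduceByKey(new HashPartitioner(64), (a, b) => a.merge(b))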

Feature importance or variable importance

2016-06-21 Thread pseudo oduesp
Hi, I am a PySpark user and I want to extract variable importance from a random forest model for plotting. How can I do that? Thanks
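
With the spark.ml (DataFrame-based) API, the fitted model exposes a featureImportances vector; a Scala sketch (the PySpark API mirrors it), with the training DataFrame and column names assumed:

    import org.apache.spark.ml.classification.RandomForestClassifier

    val rf = new RandomForestClassifier()
      .setLabelCol("label")          // hypothetical column names
      .setFeaturesCol("features")
    val model = rf.fit(trainingDF)   // trainingDF is a hypothetical DataFrame

    // A vector with one importance score per feature; collect it to plot.
    println(model.featureImportances)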

Fwd: 'numBins' property not honoured in BinaryClassificationMetrics class when spark.default.parallelism is not set to 1

2016-06-21 Thread Sneha Shukla
Hi, I'm trying to use the BinaryClassificationMetrics class to compute the PR curve, as below: import org.apache.avro.generic.GenericRecord import org.apache.hadoop.conf.Configuration import org.apache.hadoop.mapred.JobConf import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
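
For reference, the two-argument constructor at issue, where numBins down-samples the curve; the score/label pairs below are toy values:

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.rdd.RDD

    // (predicted score, true label) pairs
    val scoreAndLabels: RDD[(Double, Double)] =
      sc.parallelize(Seq((0.9, 1.0), (0.7, 1.0), (0.4, 0.0), (0.2, 0.0)))

    // numBins caps the number of points on the curve
    val metrics = new BinaryClassificationMetrics(scoreAndLabels, numBins = 10)
    val prCurve = metrics.pr()  // RDD of (recall, precision) points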

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread Michael Segel
Sorry, I think you misunderstood. Spark can read from JDBC sources, so saying that using beeline as a way to access data is not a Spark application isn't really true. Would you say the same if you were pulling data into Spark from Oracle or DB2? There are a couple of different design patterns and

Re: Does saveAsHadoopFile depend on master?

2016-06-21 Thread Jeff Zhang
Please check the driver and executor logs; there should be log entries about where the data is written. On Wed, Jun 22, 2016 at 2:03 AM, Pierre Villard wrote: > Hi, > > I have a Spark job writing files to HDFS using .saveAsHadoopFile method. > > If I run my job in

Re: Build Spark 2.0 succeeded but could not run it on YARN

2016-06-21 Thread Wu Gang
Hi Ted, I didn't type any command, it just threw that exception after it launched. Thanks! On Mon, Jun 20, 2016 at 7:18 PM, Ted Yu wrote: > What operations did you run in the Spark shell ? > > It would be easier for other people to reproduce using your code snippet. > >

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread ayan guha
1. Yes, in the sense that you control the number of executors from the Spark application config. 2. Any IO will be done from the executors (never ever on the driver, unless you explicitly call collect()). For example, a connection to a DB is opened once per worker (and used by the local executors). Also, if you run a

Getting a DataFrame back as result from SparkIMain

2016-06-21 Thread Jayant Shekhar
Hi, I have written a program using SparkIMain which creates an RDD, and I am looking for a way to access that RDD in my normal Spark/Scala code for further processing. The code below binds the SparkContext: sparkIMain.bind("sc", "org.apache.spark.SparkContext", sparkContext,
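
One possible approach (an assumption on my part, based on the IMain API that SparkIMain extends, not something from this thread): interpret the code that defines the RDD, then pull the named value back out with valueOfTerm and cast it:

    // Bind the context, define an RDD inside the interpreter,
    // then retrieve it by name for use in the host program.
    sparkIMain.bind("sc", "org.apache.spark.SparkContext", sparkContext)
    sparkIMain.interpret("val myRdd = sc.parallelize(1 to 100)")
    val rdd = sparkIMain.valueOfTerm("myRdd")
      .map(_.asInstanceOf[org.apache.spark.rdd.RDD[Int]])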

Re: Union of multiple RDDs

2016-06-21 Thread Michael Segel
By repartition I think you mean coalesce() where you would get one parquet file per partition? And this would be a new immutable copy so that you would want to write this new RDD to a different HDFS directory? -Mike > On Jun 21, 2016, at 8:06 AM, Eugene Morozov

Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread Michael Segel
OK, it's the end of the day and I'm trying to make sure I understand where things are running. I have an application where I have to query a bunch of sources, creating some RDDs, and then I need to join across the RDDs and some other lookup tables. Yarn has two modes… client and

Spark-Cassandra connector

2016-06-21 Thread Joaquin Alzola
Hi List, I am trying to install the Spark-Cassandra connector through Maven or sbt but neither works. Both of them try to connect to the Internet (to which I have no connection) to download certain files. Is there a way to install the files manually? I downloaded from the maven repository -->

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Colin Kincaid Williams
Thanks @Cody, I will try that out. In the interim, I tried to validate my HBase cluster by running a random write test and saw 30-40K writes per second. This suggests there is noticeable room for improvement. On Tue, Jun 21, 2016 at 8:32 PM, Cody Koeninger wrote: > Take HBase

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Cody Koeninger
Take HBase out of the equation and just measure what your read performance is by doing something like createDirectStream(...).foreach(_.println) not take() or print() On Tue, Jun 21, 2016 at 3:19 PM, Colin Kincaid Williams wrote: > @Cody I was able to bring my processing time
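
A sketch of that baseline for the Spark 1.x direct stream, with the broker list and topic as placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))

    // Force every record to be read; print()/take() only touch a few elements.
    stream.foreachRDD(rdd => rdd.foreach(println))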

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Colin Kincaid Williams
@Cody I was able to bring my processing time down to a second by setting maxRatePerPartition as discussed. My bad that I didn't recognize it as the cause of my scheduling delay. Since then I've tried experimenting with a larger Spark Context duration. I've been trying to get some noticeable

Re: cast only some columns

2016-06-21 Thread Michael Armbrust
Use `withColumn`. It will replace a column if you give it the same name. On Tue, Jun 21, 2016 at 4:16 AM, pseudo oduesp wrote: > Hi , > with fillna we can select some columns to perform replace some values > with chosing columns with dict > {columns :values } > but
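
Applied to the original question (casting a few of 300 columns), a sketch where df and the column names are hypothetical:

    import org.apache.spark.sql.functions.col

    // withColumn with an existing column name replaces that column in place.
    val toCast = Seq("price", "qty", "discount", "tax")  // hypothetical columns
    val casted = toCast.foldLeft(df) { (d, c) =>
      d.withColumn(c, col(c).cast("double"))
    }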

How to do some pre-processing of the SQL in the Thrift server?

2016-06-21 Thread Timothy Potter
I'm using the Spark Thrift server to execute SQL queries over JDBC. I'm wondering if it's possible to plugin a class to do some pre-processing on the SQL statement before it gets passed to the SQLContext for actual execution? I scanned over the code and it doesn't look like this is supported but I

Does saveAsHadoopFile depend on master?

2016-06-21 Thread Pierre Villard
Hi, I have a Spark job writing files to HDFS using the .saveAsHadoopFile method. If I run my job in local/client mode, it works as expected and I get all my files written to HDFS. However, if I change to yarn/cluster mode, I don't see any error logs (the job is successful) and there are no files
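
One thing worth checking (my assumption, not the thread's diagnosis): whether the destination path is fully qualified, since a bare path resolves against the default filesystem of whichever node the driver runs on. A sketch, where pairRdd, the namenode address, and the key/value types are placeholders:

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapred.TextOutputFormat

    // Fully qualify the URI so it does not depend on where the driver runs.
    pairRdd.saveAsHadoopFile(
      "hdfs://namenode:8020/user/pierre/output",
      classOf[Text],
      classOf[IntWritable],
      classOf[TextOutputFormat[Text, IntWritable]])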

Re: Labeledpoint

2016-06-21 Thread Ndjido Ardo BAR
To answer your question more accurately, the model.fit(df) method takes in a DataFrame of Row(label=double, features=Vectors.dense([...])). cheers, Ardo. On Tue, Jun 21, 2016 at 6:44 PM, Ndjido Ardo BAR wrote: > Hi, > > You can use an RDD of LabeledPoints to fit your model.
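
A sketch of that input shape in Scala (Spark 1.x, where spark.ml uses mllib's vector type), with toy values:

    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.mllib.linalg.Vectors

    // Two columns: a double label and a features vector.
    val training = sqlContext.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0))
    )).toDF("label", "features")

    val model = new RandomForestClassifier().fit(training)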

Re: Labeledpoint

2016-06-21 Thread Ndjido Ardo BAR
Hi, You can use an RDD of LabeledPoints to fit your model. Check the doc for more examples: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=transform#pyspark.ml.classification.RandomForestClassificationModel.transform cheers, Ardo. On Tue, Jun 21, 2016 at 6:12 PM, pseudo

Labeledpoint

2016-06-21 Thread pseudo oduesp
Hi, I am a PySpark user and I want to test RandomForest. I have a dataframe with 100 columns. Should I give an RDD or a DataFrame to the algorithm? I transformed my dataframe to only two columns, label and features: df.label df.features 0(517,(0,1,2,333,56 ... 1

Can Spark Streaming checkpoint only metadata ?

2016-06-21 Thread Natu Lauchande
Hi, I wonder if it is possible to checkpoint only the metadata and not the data in RDDs and DataFrames. Thanks, Natu

Re: Union of multiple RDDs

2016-06-21 Thread Eugene Morozov
Apurva, I'd say you have to apply repartition just once, to the RDD that is the union of all your files. And it has to be done right before you do anything else. If something in your files is not needed, then the sooner you project it out, the better. Hope this helps. -- Be well! Jean Morozov On Tue,
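
In other words (a sketch; the paths and partition count are made up):

    // Union everything first, then repartition once.
    val paths = Seq("/data/part1", "/data/part2", "/data/part3")  // placeholders
    val combined = sc.union(paths.map(sc.textFile(_)))
    val compacted = combined.repartition(200)  // a single shuffle, sized to the data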

Re: Number of consumers in Kafka with Spark Streaming

2016-06-21 Thread Cody Koeninger
If you're using the direct stream, and don't have speculative execution turned on, there is one executor consumer created per partition, plus a driver consumer for getting the latest offsets. If you have fewer executors than partitions, not all of those consumers will be running at the same time.

Number of consumers in Kafka with Spark Streaming

2016-06-21 Thread Guillermo Ortiz
I use Spark Streaming with Kafka and I'd like to know how many consumers are created. I guess as many as there are partitions in Kafka, but I'm not sure. Is there a way to know the name of the groupId that Spark presents to Kafka?

)

2016-06-21 Thread pseudo oduesp
Hi, please help me to resolve this issue.

Re: [Spark + MLlib] How to prevent negative values in Linear regression?

2016-06-21 Thread Sean Owen
Just clamp the predicted value to 0? But you may also want to reconsider your model in this case; maybe a simple linear model is not appropriate. On Tue, Jun 21, 2016 at 2:03 PM, diplomatic Guru wrote: > Hello Sean, > > Absolutely, there is nothing wrong with predicting
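
The clamp is a one-liner over the model's outputs; a sketch with a hypothetical predictions RDD:

    // predictions: hypothetical RDD[Double] of raw model outputs
    val nonNegative = predictions.map(p => math.max(0.0, p))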

Union of multiple RDDs

2016-06-21 Thread Apurva Nandan
Hello, I am trying to combine several small text files (each file is approximately hundreds of MBs to 2-3 GB) into one big Parquet file. I am loading each one of them and trying to take a union; however, this leads to an enormous number of partitions, as union keeps on adding the partitions of the input

FullOuterJoin on Spark

2016-06-21 Thread Rychnovsky, Dusan
Hi, can somebody please explain the way FullOuterJoin works on Spark? Does each intersection get fully loaded to memory? My problem is as follows: I have two large data-sets: * a list of web pages, * a list of domain-names with specific rules for processing pages from that domain. I am
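
For reference, fullOuterJoin on pair RDDs co-partitions the two sides by key via a shuffle rather than loading either side wholesale into memory, although all the values for a single key do have to fit on one executor. A toy sketch:

    val pages = sc.parallelize(Seq(("example.com", "page1"), ("foo.org", "page2")))
    val rules = sc.parallelize(Seq(("example.com", "ruleA"), ("bar.net", "ruleB")))

    // RDD[(String, (Option[String], Option[String]))]
    val joined = pages.fullOuterJoin(rules)
    // ("example.com", (Some("page1"), Some("ruleA")))
    // ("foo.org",     (Some("page2"), None))
    // ("bar.net",     (None,          Some("ruleB")))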

Fwd: [Spark + MLlib] How to prevent negative values in Linear regression?

2016-06-21 Thread diplomatic Guru
Hello Sean, Absolutely, there is nothing wrong with predicting negative values, but for my scenario I do not want to predict any negative value (also, all the data that is fed into the model is positive). Is there any way I could stop it predicting negative values? I assume it is not possible, but

Re: Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-21 Thread Gene Pang
Hi, It looks like this is not related to Alluxio. Have you tried running the same job with different storage? Maybe you could increase the Spark JVM heap size to see if that helps your issue? Hope that helps, Gene On Wed, Jun 15, 2016 at 8:52 PM, Chanh Le wrote: > Hi

Re: [Spark + MLlib] How to prevent negative values in Linear regression?

2016-06-21 Thread Sean Owen
There's nothing inherently wrong with a regression predicting a negative value. What is the issue, more specifically? On Tue, Jun 21, 2016 at 1:38 PM, diplomatic Guru wrote: > Hello all, > > I have a job for forecasting using linear regression, but sometimes I'm >

[Spark + MLlib] How to prevent negative values in Linear regression?

2016-06-21 Thread diplomatic Guru
Hello all, I have a job for forecasting using linear regression, but sometimes I'm getting a negative prediction. How do I prevent this? Thanks.

cast only some columns

2016-06-21 Thread pseudo oduesp
Hi, with fillna we can choose columns in which to replace some values, by passing a dict of {column: value}. But how can I do the same with cast? I have a data frame with 300 columns and I want to cast just 4 from the list of columns, but with a select query like this:

Re: scala.NotImplementedError: put() should not be called on an EmptyStateMap while doing stateful computation on spark streaming

2016-06-21 Thread Ted Yu
Are you using 1.6.1 ? If not, does the problem persist when you use 1.6.1 ? Thanks > On Jun 20, 2016, at 11:16 PM, umanga wrote: > > I am getting following warning while running stateful computation. The state > consists of BloomFilter (stream-lib) as Value and Integer

Re: scala.NotImplementedError: put() should not be called on an EmptyStateMap while doing stateful computation on spark streaming

2016-06-21 Thread umanga
Further description: Environment: Spark cluster running in standalone mode with 1 master and 5 slaves, each with 4 vCPUs and 8GB RAM. Data is being streamed from a 3-node Kafka cluster (managed by a 3-node ZK cluster). Checkpointing is being done on the Hadoop cluster, plus we are also saving state in HBase

Re: Unable to acquire bytes of memory

2016-06-21 Thread Akhil Das
Looks like this issue https://issues.apache.org/jira/browse/SPARK-10309 On Mon, Jun 20, 2016 at 4:27 PM, pseudo oduesp wrote: > Hi , > i don t have no idea why i get this error > > > > Py4JJavaError: An error occurred while calling o69143.parquet. > :

Re: Saving data using tempTable versus save() method

2016-06-21 Thread Mich Talebzadeh
Thanks Robin. This is data from Hive (source) to Hive (target) via Spark. The database in Hive is called oraclehadoop (mainly used to import data from Oracle in the first place). I am very sceptical of these Spark methods for storing data in a Hive database. In all probability they just

Re: Unsubscribe

2016-06-21 Thread Akhil Das
You need to send an email to user-unsubscr...@spark.apache.org for unsubscribing. Read more over here http://spark.apache.org/community.html On Mon, Jun 20, 2016 at 1:10 PM, Ram Krishna wrote: > Hi Sir, > > Please unsubscribe me > > -- > Regards, > Ram Krishna KT > >

Re: Saving data using tempTable versus save() method

2016-06-21 Thread Robin East
If you are able to trace the underlying Oracle session you can see whether a commit has been called or not. > On 21 Jun 2016, at 09:57, Robin East wrote: > > I’m not sure - I don’t know what those APIs do under the hood. It simply rang > a bell with something I have

Re: Saving data using tempTable versus save() method

2016-06-21 Thread Robin East
I’m not sure - I don’t know what those APIs do under the hood. It simply rang a bell with something I have fallen foul of in the past (not with Spark though) - I have wasted many hours forgetting to commit and then scratching my head as to why my data was not persisting. > On 21 Jun 2016, at

Re: Running JavaBased Implementationof StreamingKmeans

2016-06-21 Thread Biplob Biswas
Hi, Can someone please look into this and tell me what's wrong, and why am I not getting any output? Thanks & Regards Biplob Biswas On Sun, Jun 19, 2016 at 1:29 PM, Biplob Biswas wrote: > Hi, > > Thanks for that input, I tried doing that but apparently that's not

Re: Saving data using tempTable versus save() method

2016-06-21 Thread Mich Talebzadeh
That is a very interesting point. I am not sure. How can I do that with sorted.save("oraclehadoop.sales2")? Something like .. commit? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Saving data using tempTable versus save() method

2016-06-21 Thread Robin East
random thought - do you need an explicit commit with the 2nd method? > On 20 Jun 2016, at 21:35, Mich Talebzadeh wrote: > > Hi, > > I have a DF based on a table and sorted and shown below > > This is fine and when I register as tempTable I can populate the

read.parquet or read.load

2016-06-21 Thread pseudo oduesp
Hi, really, I am frustrated with Parquet files. Each time I get an error like "Could not read footer: java.lang.RuntimeException:" or an error occurring in o127.load. Why do we have a lot of issues with this format? Thanks

Running spark executor process with username in standalone mode

2016-06-21 Thread Florian Philippon
Hello guys, I would like to know if there is a way, in standalone mode, to run Spark executor processes using the user/group of the user that submitted the job. I found an old open wish that exactly describes my need, but I want to be sure that it's still not possible before thinking

scala.NotImplementedError: put() should not be called on an EmptyStateMap while doing stateful computation on spark streaming

2016-06-21 Thread umanga
I am getting the following warning while running a stateful computation. The state consists of a BloomFilter (stream-lib) as value and an Integer as key. The program runs smoothly for a few minutes and after that I get this warning, and the streaming app becomes unstable (processing time increases