Re: Spark + Kafka all messages being used in 1 batch

2016-03-05 Thread Supreeth
Try setting spark.streaming.kafka.maxRatePerPartition; this can help control the number of messages read from Kafka per partition on the Spark Streaming consumer. -S > On Mar 5, 2016, at 10:02 PM, Vinti Maheshwari wrote: > > Hello, > > I am trying to figure out why my
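
A minimal sketch of applying that setting, assuming the direct Kafka stream of Spark 1.x (app name, rate and batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Cap reads at 1000 messages per second per Kafka partition so the
// first batch cannot drain the entire backlog at once.
val conf = new SparkConf()
  .setAppName("kafka-rate-limited")
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
val ssc = new StreamingContext(conf, Seconds(10))

With a 10-second batch interval this bounds each batch at roughly 10,000 messages per partition.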

Add the sql record having same field.

2016-03-05 Thread Angel Angel
Hello, I have one table with 2 fields in it: 1) item_id and 2) count. I want to sum the count field per item (i.e. group the item_ids). Example (column names as typed in the original message):

Input
itea_ID  Count
500      2
200      6
500      4
100      3
200      6

Required output
Itea_id  Count
500      6
200      12
100      3

I used the command. The Resut=
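
A sketch of one way to do this with the DataFrames API (sqlContext is assumed to be in scope, e.g. from a spark-shell session; column names follow the question, including its "itea_ID" spelling):

import org.apache.spark.sql.functions.sum

// Recreate the sample data from the question.
val df = sqlContext.createDataFrame(Seq(
  (500, 2), (200, 6), (500, 4), (100, 3), (200, 6)
)).toDF("itea_ID", "Count")

// Group by item id and sum the counts: 500 -> 6, 200 -> 12, 100 -> 3.
val result = df.groupBy("itea_ID").agg(sum("Count").alias("Count"))
result.show()

The SQL equivalent, after registering df as a temp table named "items", would be SELECT itea_ID, SUM(Count) AS Count FROM items GROUP BY itea_ID.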

Spark + Kafka all messages being used in 1 batch

2016-03-05 Thread Vinti Maheshwari
Hello, I am trying to figure out why my Kafka+Spark job is running slow. I found that Spark is consuming all the messages out of Kafka into a single batch and not sending any messages to the other batches. 2016/03/05 21:57:05

MLLib + Streaming

2016-03-05 Thread Lan Jiang
Hi there, I hope someone can clarify this for me. It seems that some of the MLlib algorithms, such as KMeans, Linear Regression and Logistic Regression, have a streaming version that can do online machine learning. But does that mean the other MLlib algorithms cannot be used in Spark Streaming
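
For reference, a minimal sketch of one of those streaming variants, StreamingKMeans (the socket source, dimensions and parameters are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-kmeans").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// Parse a text stream of space-separated doubles into feature vectors.
val training = ssc.socketTextStream("localhost", 9999)
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// Online k-means: cluster centers are updated as each batch arrives.
val model = new StreamingKMeans()
  .setK(3)
  .setDecayFactor(1.0)
  .setRandomCenters(dim = 2, weight = 0.0)
model.trainOn(training)

ssc.start()
ssc.awaitTermination()

Note that a model trained offline with a non-streaming algorithm can generally still be applied inside a stream, e.g. by calling its predict method on each batch.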

Re: How can I pass a Data Frame from object to another class

2016-03-05 Thread Ted Yu
Looking at the methods you call on HiveContext, they seem to belong to SQLContext. For SQLContext, you can use the method below in FirstQuery to retrieve the SQLContext: def getOrCreate(sparkContext: SparkContext): SQLContext = { FYI On Sat, Mar 5, 2016 at 3:37 PM, Mich Talebzadeh
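
A minimal sketch of that suggestion, with the class and method names taken from the question (the query itself is illustrative):

import org.apache.spark.sql.{DataFrame, SQLContext}

class FirstQuery {
  // Recover the existing SQLContext from the SparkContext behind the
  // DataFrame that was passed in, instead of constructing a new one.
  def firstquerym(rs: DataFrame): Unit = {
    val sqlContext = SQLContext.getOrCreate(rs.rdd.sparkContext)
    rs.registerTempTable("rs")
    sqlContext.sql("SELECT COUNT(*) FROM rs").show()
  }
}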

Re: Dynamic partitions reallocations with added worker nodes

2016-03-05 Thread Eugene Morozov
Ted, thanks for the reply. Yeah, there were just three nodes with HDFS and Spark workers colocated. There was actually one more node with the Spark master (standalone) and the namenode. And I've thrown in one more Spark worker node, which sees the whole HDFS just fine, but doesn't have a colocated datanode process.

Re: Dynamic partitions reallocations with added worker nodes

2016-03-05 Thread Ted Yu
bq. I haven't added one more HDFS node to a hadoop cluster Does each of the three nodes colocate with an HDFS data node? The absence of a 4th data node might have something to do with the partition allocation. Can you show your code snippet? Thanks On Sat, Mar 5, 2016 at 2:54 PM, Eugene Morozov

Dynamic partitions reallocations with added worker nodes

2016-03-05 Thread Eugene Morozov
Hi, My cluster (standalone deployment), consisting of 3 worker nodes, was in the middle of computations when I added one more worker node. I can see that the new worker is registered with the master and that my job actually gets one more executor. I have configured default parallelism as 12 and thus I see
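
For context, partitions computed before a new worker joins are not moved automatically; one way to force a redistribution is an explicit repartition (a sketch; the data and target partition count are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rebalance"))

// 12 initial partitions, matching the default parallelism in the question.
val rdd = sc.parallelize(1 to 1000000, numSlices = 12)

// Shuffle the data so the newly added executor also receives a share
// (16 is an illustrative target for the enlarged cluster).
val rebalanced = rdd.repartition(16)
println(rebalanced.partitions.length) // 16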

Re: spark driver in docker

2016-03-05 Thread Timothy Chen
Will need more information to help you: what are the commands you used to launch the slave/master, and what error message did you see in the driver logs? Tim > On Mar 5, 2016, at 4:34 AM, Mailing List wrote: > > I am trying to do the same but till now no luck... > I have

Re: 1.6.0 spark.sql datetime conversion problem

2016-03-05 Thread Jan Štěrba
I don't know what's wrong, but I can suggest looking up the source of the UDF and debugging from there. I would think this is some JDK API caveat and not a Spark bug -- Jan Sterba https://twitter.com/honzasterba | http://flickr.com/honzasterba | http://500px.com/honzasterba On Fri, Mar 4, 2016 at

How can I pass a Data Frame from object to another class

2016-03-05 Thread Mich Talebzadeh
Hi, I can use sbt to compile and run the following code. It works without any problem. I want to divide this into the object and another class. I would like to build the result set from the joined tables, identified by the Data Frame 'rs', and then call the method "firstquerym" in the class FirstQuery to do the

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-05 Thread Igor Berman
It's not safe to use the direct committer with append mode; you may lose your data. On 4 March 2016 at 22:59, Jelez Raditchkov wrote: > Working on a streaming job with DirectParquetOutputCommitter to S3 > I need to use PartitionBy and hence SaveMode.Append > > Apparently when
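
For reference, a sketch of the combination under discussion (df and sqlContext are assumed; the S3 path is a placeholder, and the committer class's package differs across Spark versions, so verify the name for yours):

import org.apache.spark.sql.SaveMode

// Select the direct committer via configuration (class name as of
// Spark 1.6; in 1.5 it lived under org.apache.spark.sql.parquet).
sqlContext.setConf(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")

// The risky pattern: a direct write (no _temporary staging directory)
// combined with append mode -- a failed task can leave partial files.
df.write
  .partitionBy("date")
  .mode(SaveMode.Append)
  .parquet("s3n://bucket/path") // placeholder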

problems with my code that simulates task failure - task is not resubmitted

2016-03-05 Thread Gil Vernik
I have some code that simulates task failure in speculative mode. I compile the code to a jar and execute it with ./bin/spark-submit --class com.test.SparkTest --jars --driver-memory 2g --executor-memory 1g --master local[4] --conf spark.speculation=true --conf spark.task.maxFailures=4
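
A minimal sketch of one way to simulate this kind of failure (the failure condition and class name are illustrative, not the poster's actual code):

import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object SparkTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("task-failure-test")
      // Note: plain local[N] masters do not retry failed tasks; the
      // local[N, F] form allows up to F failures per task.
      .setMaster("local[4, 4]")
      .set("spark.speculation", "true")
      .set("spark.task.maxFailures", "4")
    val sc = new SparkContext(conf)

    val result = sc.parallelize(1 to 100, 4).map { x =>
      // Fail the first attempt of every task; resubmitted attempts
      // (attemptNumber > 0) succeed, so the job finishes after retries.
      if (TaskContext.get().attemptNumber() == 0) {
        throw new RuntimeException("simulated task failure")
      }
      x * 2
    }
    println(result.sum())
    sc.stop()
  }
}

If the job above was launched with a plain local[4] master, that alone would explain tasks not being resubmitted.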

Re: Spark Streaming - Travis CI and GitHub custom receiver - continuous data but empty RDD?

2016-03-05 Thread Ted Yu
bq. reportError("Exception while streaming travis", e) I assume there was none of the above in your job. What Spark release are you using? Thanks On Sat, Mar 5, 2016 at 4:57 AM, Dominik Safaric wrote: > Dear all, > > Lately, as a part of a scientific

Spark SQL drops the HIVE table in "overwrite" mode while writing into table

2016-03-05 Thread Dhaval Modi
Hi Team, I am facing an issue while writing a dataframe back to a HIVE table. When using the "SaveMode.Overwrite" option, the table gets dropped and Spark is unable to recreate it, thus throwing an error. JIRA: https://issues.apache.org/jira/browse/SPARK-13699 E.g.
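
For reference, a sketch of the pattern described (the DataFrame and table name are illustrative):

import org.apache.spark.sql.SaveMode

// In the reported issue, overwrite mode drops the existing Hive table
// first; when the subsequent write fails, the table is not recreated.
df.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")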

Re: Mapper side join with DataFrames API

2016-03-05 Thread Deepak Gopalakrishnan
Hello guys, No help yet. Could someone reply to the question above on SO? Thanks Deepak On Fri, Mar 4, 2016 at 5:32 PM, Deepak Gopalakrishnan wrote: > Have added this to SO, can you guys share any thoughts ? > > >
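
For anyone landing on this thread: the usual way to get a map-side join with the DataFrames API is to broadcast the small side (a sketch; largeDf and smallDf are assumed):

import org.apache.spark.sql.functions.broadcast

// Hint that smallDf fits in executor memory, so the join is planned
// as a broadcast hash join and largeDf is never shuffled.
val joined = largeDf.join(broadcast(smallDf), "key")

Tables smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default) are broadcast automatically without the hint.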

Spark Streaming - Travis CI and GitHub custom receiver - continuous data but empty RDD?

2016-03-05 Thread Dominik Safaric
Dear all, Lately, as part of a scientific research project, I've been developing an application that streams (or at least should stream) data from Travis CI and GitHub, using their REST APIs. The purpose of this is to gain insight into the commit-build relationship, in order to further perform numerous
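
For readers following along, a skeleton of the custom receiver API involved (the polling loop is a placeholder, not the author's actual Travis/GitHub client):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class TravisReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Poll the REST API on a background thread and push each record to
    // Spark via store(); the body is a placeholder for real HTTP calls.
    new Thread("travis-poller") {
      override def run(): Unit = {
        try {
          while (!isStopped()) {
            store("fetched record") // placeholder for an API response
            Thread.sleep(1000)
          }
        } catch {
          case e: Exception => reportError("Exception while streaming travis", e)
        }
      }
    }.start()
  }

  override def onStop(): Unit = {} // the polling loop exits via isStopped()
}

An empty RDD on every batch typically means store() is never reached, so it is worth logging inside the loop and, as above, routing failures through reportError.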

Re: spark driver in docker

2016-03-05 Thread Mailing List
I am trying to do the same but so far no luck... I have everything running inside Docker containers, including the Mesos master, Mesos slave, Marathon, and the Spark Mesos cluster dispatcher. But when I try to submit the job using spark-submit as a Docker container, it fails... By the way, this setup is on

Re: OOM When Running with Mesos Fine-grained Mode

2016-03-05 Thread Tamas Szuromi
Hey, We had the same issue with Spark 1.5.x, and it disappeared after we upgraded to 1.6. Tamas On Saturday, 5 March 2016, SLiZn Liu wrote: > Hi Spark Mailing List, > > I’m running terabytes of text files with Spark on Mesos, the job runs fine > until we decided to switch to

Re: spark driver in docker

2016-03-05 Thread Tamas Szuromi
Hi, Have a look at http://spark.apache.org/docs/latest/configuration.html to see what ports need to be exposed. With Mesos we had a lot of problems with container networking, but yes, --net=host is a shortcut. Tamas On 4 March 2016 at 22:37, yanlin wang wrote: > We would like
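
A sketch of pinning the otherwise-random ports so the container can publish them when --net=host is not used (port numbers are illustrative; the full list is on the configuration page above):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.host", "driver.example.com") // must be reachable by executors
  .set("spark.driver.port", "7001")
  .set("spark.blockManager.port", "51000")
  .set("spark.ui.port", "4040")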