Re: spark streaming doubt

2015-07-13 Thread Shushant Arora
For the second question, I am comparing 2 situations of processing a KafkaRDD. Case I: when I use foreachPartition to process the Kafka stream, I am not able to see any stream job timing interval like "Time: 142905487 ms" displayed on the driver console at the start of each stream batch. But it processed
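A minimal sketch of the direct-stream foreachPartition pattern being compared (broker, topic, and batch interval are assumed values, not from the thread). One likely explanation for the missing header: the "Time: ... ms" banner is emitted by output operations such as print(); foreachRDD/foreachPartition alone does not print it.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ForeachPartitionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ForeachPartitionSketch")
    val ssc = new StreamingContext(conf, Seconds(1)) // assumed 1 s batch interval

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // hypothetical broker
    val topics = Set("mytopic")                                     // hypothetical topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Case I: process each partition directly on the executors. Note that
    // foreachRDD/foreachPartition does not print the "Time: ... ms" batch
    // header -- that banner comes from output operations like print().
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach { record =>
          // per-record processing would run here, on the executor
          ()
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}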

Re: spark streaming doubt

2015-07-13 Thread Cody Koeninger
Regarding your first question, having more partitions than you do executors usually means you'll have better utilization, because the workload will be distributed more evenly. There's some degree of per-task overhead, but as long as you don't have a huge imbalance between number of tasks and
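As a hedged sketch of the sizing in question (instance and core counts taken from Shushant's example; spark.executor.instances assumes a YARN deployment):

import org.apache.spark.SparkConf

// With 10 executors of 3 cores each, at most 10 * 3 = 30 of the 300
// partition tasks run concurrently; the remaining tasks in the batch
// queue behind them until cores free up.
val conf = new SparkConf()
  .setAppName("UtilizationSketch")
  .set("spark.executor.instances", "10") // assumed YARN deployment
  .set("spark.executor.cores", "3")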

Re: spark streaming doubt

2015-07-13 Thread Aniruddh Sharma
Hi Shushant/Cody, for question 1, the following is my understanding (I am not 100% sure; I have asked TD this question in different words for confirmation, which is not confirmed as of now). In accordance with tasks created in

spark streaming doubt

2015-07-11 Thread Shushant Arora
1. Spark Streaming 1.3 creates as many RDD partitions as there are Kafka partitions in the topic. Say I have 300 partitions in the topic and 10 executors, each with 3 cores; does that mean at a time only 10*3 = 30 partitions are processed, and then the next 30, and so on, since executors launch tasks per RDD
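A small sketch of the 1:1 mapping being asked about, assuming the 1.3 direct (receiverless) API with hypothetical broker and topic names; each batch's KafkaRDD gets one partition per Kafka partition, so a 300-partition topic yields 300 tasks per batch.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("PartitionCountSketch"), Seconds(1))

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc,
  Map("metadata.broker.list" -> "broker1:9092"), // hypothetical broker
  Set("mytopic"))                                // hypothetical 300-partition topic

stream.foreachRDD { rdd =>
  // One RDD partition per Kafka partition: expect 300 here.
  println(s"partitions this batch: ${rdd.partitions.size}")
}

ssc.start()
ssc.awaitTermination()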

Re: spark streaming doubt

2015-05-20 Thread Akhil Das
One receiver basically runs on 1 core, so if your single node has 4 cores, there are still 3 cores left for the processing (for the executors). And yes, the receiver remains on the same machine unless some failure happens. Thanks Best Regards
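A minimal receiver-based sketch of that core accounting (ZooKeeper quorum, group, and topic are assumed values): with local[4], the single receiver pins one core and the remaining three handle processing.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Single node with 4 cores: one is taken by the receiver, three remain
// for processing the received batches.
val conf = new SparkConf().setMaster("local[4]").setAppName("ReceiverCoreSketch")
val ssc = new StreamingContext(conf, Seconds(1))

// High-level (receiver-based) Kafka stream; the receiver occupies one core.
val stream = KafkaUtils.createStream(
  ssc,
  "zk1:2181",          // hypothetical ZooKeeper quorum
  "my-consumer-group", // hypothetical consumer group
  Map("mytopic" -> 1)) // topic -> number of consumer threads in this receiver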

Re: spark streaming doubt

2015-05-20 Thread Shushant Arora
So can I explicitly specify the number of receivers and executors in receiver-based streaming? Can you share a sample program, if any? Also, in low-level non-receiver-based streaming, will data be fetched and processed by the same worker executor node? Also, if I have concurrent jobs set to 1, then in low-level fetching

Re: spark streaming doubt

2015-05-20 Thread Akhil Das
On Wed, May 20, 2015 at 1:12 PM, Shushant Arora shushantaror...@gmail.com wrote: So can I explicitly specify the number of receivers and executors in receiver-based streaming? Can you share a sample program, if any? - You can look at the lowlevel consumer repo
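Not code from the linked repo, but a common high-level-consumer pattern for explicitly choosing the number of receivers (all names assumed): create several receiver streams and union them into one DStream.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("MultiReceiverSketch"), Seconds(1))

// Each createStream call launches its own receiver (one core each);
// union the resulting streams into a single DStream for processing.
val numReceivers = 3 // assumed count
val streams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("mytopic" -> 1))
}
val unified = ssc.union(streams)
unified.count().print()

ssc.start()
ssc.awaitTermination()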

spark streaming doubt

2015-05-19 Thread Shushant Arora
What happens in a streaming application if one job is not yet finished when the stream interval is reached? Does it start the next job, or wait for the first to finish while the remaining jobs keep accumulating in a queue? Say I have a streaming application with a stream interval of 1 sec, but my job takes 2 min to

Re: spark streaming doubt

2015-05-19 Thread Akhil Das
It will be a single job running at a time by default (you can also configure spark.streaming.concurrentJobs to run jobs in parallel, which is not recommended in production). Now, with your batch duration being 1 sec and processing time being 2 minutes, if you are using a receiver based

Re: spark streaming doubt

2015-05-19 Thread Akhil Das
spark.streaming.concurrentJobs takes an integer value, not a boolean. If you set it to 2, then 2 jobs will run in parallel. The default value is 1, and the next job will start once the current one completes. Actually, in the current implementation of Spark Streaming and under default configuration,
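For reference, a sketch of setting it (value as discussed in the thread; generally not recommended for production):

import org.apache.spark.SparkConf

// spark.streaming.concurrentJobs takes an integer (default 1).
// With 2, two batch jobs may run in parallel instead of queueing.
val conf = new SparkConf()
  .setAppName("ConcurrentJobsSketch")
  .set("spark.streaming.concurrentJobs", "2")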

Re: spark streaming doubt

2015-05-19 Thread Shushant Arora
So for Kafka + Spark Streaming, receiver-based streaming uses the high-level API and non-receiver-based streaming uses the low-level API. 1. In high-level receiver-based streaming, does it register consumers at each job start (whenever a new job is launched by the streaming application, say every second)? 2. No

Re: spark streaming doubt

2015-05-19 Thread Akhil Das
On Tue, May 19, 2015 at 8:10 PM, Shushant Arora shushantaror...@gmail.com wrote: So for Kafka + Spark Streaming, receiver-based streaming uses the high-level API and non-receiver-based streaming uses the low-level API. 1. In high-level receiver-based streaming, does it register consumers at each job

Re: spark streaming doubt

2015-05-19 Thread Dibyendu Bhattacharya
Just to add, there is a receiver-based Kafka consumer which uses the Kafka low-level consumer API: http://spark-packages.org/package/dibbhatt/kafka-spark-consumer Regards, Dibyendu

Re: spark streaming doubt

2015-05-19 Thread Shushant Arora
Thanks Akhil and Dibyendu. In high-level receiver-based streaming, do executors run on the receivers themselves to get data locality? Or is the data always transferred to executor nodes, with the executor nodes differing in each run of a job while the receiver node remains the same (same machines) throughout the life of