For the second question:
I am comparing 2 situations of processing a KafkaRDD.
case I - When I used foreachPartition to process the Kafka stream, I am not able
to see any streaming job timing interval, like "Time: 142905487 ms",
displayed on the driver console at the start of each stream batch. But it processed
Regarding your first question: having more partitions than you do executors
usually means you'll have better utilization, because the workload will be
distributed more evenly. There's some degree of per-task overhead, but as
long as you don't have a huge imbalance between the number of tasks and
Hi Sushant/Cody,
For question 1, the following is my understanding (I am not 100% sure and
this is only my understanding; I have asked this question in other words
to TD for confirmation, which is not confirmed as of now).
1. Spark Streaming 1.3 creates as many RDD partitions as there are Kafka
partitions in the topic. Say I have 300 partitions in the topic and 10 executors,
each with 3 cores. Does that mean that at a time only 10*3 = 30 partitions
are processed, and then the next 30, and so on, since executors launch tasks per RDD
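If that understanding holds, the arithmetic in the question can be sketched as follows (a plain-Python illustration of the numbers above, not Spark API code): 10 executors with 3 cores each give 30 concurrent task slots, so a batch of 300 partitions would be processed in 10 successive "waves" of tasks.

```python
import math

# Hypothetical numbers from the question above.
kafka_partitions = 300       # one Spark task per Kafka partition
executors = 10
cores_per_executor = 3

# Tasks that can run simultaneously across the cluster.
concurrent_tasks = executors * cores_per_executor

# How many successive "waves" of tasks it takes to process one batch.
waves = math.ceil(kafka_partitions / concurrent_tasks)

print(concurrent_tasks, waves)  # 30 task slots, 10 waves
```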
One receiver basically runs on 1 core, so if your single node has 4
cores, there are still 3 cores left for the processing (for the executors). And
yes, the receiver remains on the same machine unless some failure happens.
Thanks
Best Regards
On Tue, May 19, 2015 at 10:57 PM, Shushant Arora
So can I explicitly specify the number of receivers and executors in receiver-based
streaming? Can you share a sample program, if any?
Also, in low-level non-receiver-based streaming, will data be fetched and processed
by the same worker executor node? Also, if I have concurrent jobs set to 1 - so
in low level
fetching
On Wed, May 20, 2015 at 1:12 PM, Shushant Arora shushantaror...@gmail.com
wrote:
So can I explicitly specify the number of receivers and executors in receiver-based
streaming? Can you share a sample program, if any?
- You can look at the low-level consumer repo
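Since a sample program was asked for: a minimal sketch of receiver-based Kafka streaming in PySpark (Spark 1.3+), assuming the KafkaUtils.createStream API. The number of receivers is controlled by how many input streams you create and union (each receiver occupies one core); the number of executors is set at submit time (e.g. --num-executors on YARN). The topic name, ZooKeeper quorum, and group id below are placeholders.

```python
# Sketch only: requires a Spark installation to actually run.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="receiver-based-kafka")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second batches

# Each createStream call registers one receiver; create several
# and union them to increase ingest parallelism.
num_receivers = 3
streams = [
    KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group",
                            {"mytopic": 1})  # 1 consumer thread per receiver
    for _ in range(num_receivers)
]
unified = ssc.union(*streams)

unified.count().pprint()

ssc.start()
ssc.awaitTermination()
```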
What happens if, in a streaming application, one job is not yet finished when
the stream interval is reached? Does it start the next job, or wait for the first to
finish, with the remaining jobs accumulating in a queue?
Say I have a streaming application with a stream interval of 1 sec, but my
job takes 2 min to
It will be a single job running at a time by default (you can also
configure spark.streaming.concurrentJobs to run jobs in parallel, which is
not recommended in production).
Now, with your batch duration being 1 sec and processing time being 2 minutes,
if you are using a receiver based
spark.streaming.concurrentJobs takes an integer value, not a boolean. If you
set it to 2, then 2 jobs will run in parallel. The default value is 1, and the next
job will start once the current one completes.
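The queueing behaviour described above can be illustrated with a toy simulation (plain Python, not Spark): batches are generated every interval, but with spark.streaming.concurrentJobs at its default of 1 only one batch is processed at a time, so a 1-second interval with 120-second processing lets the backlog grow rapidly.

```python
# Toy model of Spark Streaming's scheduling: batches are generated every
# `interval_s` seconds, but only `concurrent_jobs` are processed at once.
def queued_batches(interval_s, processing_s, elapsed_s, concurrent_jobs=1):
    generated = elapsed_s // interval_s
    # Each concurrent slot finishes one batch every `processing_s` seconds.
    completed = min(generated, (elapsed_s // processing_s) * concurrent_jobs)
    return generated - completed

# 1 s interval, 120 s per batch: after 10 minutes, 600 batches were
# generated but only 5 finished, leaving 595 waiting in the queue.
print(queued_batches(interval_s=1, processing_s=120, elapsed_s=600))
```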
Actually, in the current implementation of Spark Streaming and under
default configuration,
So for Kafka + Spark Streaming, receiver-based streaming uses the high-level API
and non-receiver-based streaming uses the low-level API.
1. In high-level receiver-based streaming, does it register consumers at
each job start (whenever a new job is launched by the streaming application, say
each second)?
2. No
On Tue, May 19, 2015 at 8:10 PM, Shushant Arora shushantaror...@gmail.com
wrote:
So for Kafka + Spark Streaming, receiver-based streaming uses the high-level API
and non-receiver-based streaming uses the low-level API.
1. In high-level receiver-based streaming, does it register consumers at
each job
Just to add, there is a receiver-based Kafka consumer which uses the Kafka Low
Level Consumer API.
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
Regards,
Dibyendu
On Tue, May 19, 2015 at 9:00 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
On Tue, May 19, 2015 at 8:10 PM,
Thanks Akhil and Dibyendu.
In high-level receiver-based streaming, do the executors run on the receiver
nodes themselves to get data localisation? Or is the data always transferred to
executor nodes, with the executor nodes differing in each run of a job while the
receiver node remains the same (same machines) throughout the life of