"writing to HBase for each partition in the RDDs from a given DStream is an
independent output operation"

This is not correct. "Writing to HBase for each partition in the RDDs from
a given DStream" is just a set of tasks within a single job, and those tasks
already run in parallel.

An output operation is a DStream action such as count, the saveAs* methods, or take.

For example, if "spark.streaming.concurrentJobs" is 1, and you call
DStream.count() twice. There will be two "count" Spark jobs and they will
run one by one. But if you set "spark.streaming.concurrentJobs" to 2, these
two "count" Spark jobs will run in parallel.

Moreover, "spark.streaming.concurrentJobs" is an internal configuration and
it may be changed in future.


Best Regards,
Shixiong Zhu

2015-09-26 3:34 GMT+08:00 Atul Kulkarni <atulskulka...@gmail.com>:

> Can someone please help, either by explaining or by pointing to
> documentation, how the number of executors needed relates to letting the
> concurrent jobs created by the above parameter actually run in parallel?
>
> On Thu, Sep 24, 2015 at 11:56 PM, Atul Kulkarni <atulskulka...@gmail.com>
> wrote:
>
>> Hi Folks,
>>
>> I am trying to speed up my spark streaming job, I found a presentation by
>> Tathagata Das that mentions to increase value of
>> "spark.streaming.concurrentJobs" if I have more than one output.
>>
>> In my Spark Streaming job I am reading from Kafka using the receiver-based
>> approach, transforming each line of data from Kafka, and storing it to
>> HBase. I do not intend to do any kind of collation at this stage. I believe
>> this can be parallelized by creating a separate job to write a different
>> set of lines from Kafka to HBase, and hence I set the above parameter to a
>> value > 1. Is my assumption correct that writing to HBase for each
>> partition in the RDDs from a given DStream is an independent output
>> operation and can be parallelized?
>>
>> If the assumption is correct: when I run the job, it creates multiple
>> (smaller) jobs, but they are executed one after another, not in parallel.
>> I am curious whether there is a requirement that the number of executors
>> be >= some particular number (a calculation based on how many partitions
>> remain after repartitioning, the union of DStreams, etc. - I don't know,
>> I am grasping at straws here).
>>
>> I would appreciate some help in this regard. Thanks in advance.
>>
>> --
>> Regards,
>> Atul Kulkarni
>>
>
>
>
> --
> Regards,
> Atul Kulkarni
>
