Can someone please help, either by explaining or by pointing to documentation,
how the number of executors relates to spark.streaming.concurrentJobs, and how
to get the concurrent jobs created by that parameter to actually run in
parallel? (A rough sketch of the job itself is included below the quoted
message.)

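To make the question concrete, this is roughly the configuration I mean - a
minimal fragment for illustration only; the app name, batch interval, and
submit flags are placeholders, not my actual job:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-to-hbase-streaming")
  // Allow the streaming scheduler to run up to 2 output operations (jobs)
  // of a batch at the same time instead of strictly one after another.
  .set("spark.streaming.concurrentJobs", "2")

val ssc = new StreamingContext(conf, Seconds(10))

// Whatever concurrency the scheduler allows is still bounded by the
// resources the application was submitted with, e.g. on YARN:
//   spark-submit --num-executors 4 --executor-cores 2 ...
// Each receiver-based input stream permanently occupies one of those cores,
// so the concurrent jobs share only the cores that are left over.
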
On Thu, Sep 24, 2015 at 11:56 PM, Atul Kulkarni <atulskulka...@gmail.com>
wrote:

> Hi Folks,
>
> I am trying to speed up my spark streaming job, I found a presentation by
> Tathagata Das that mentions to increase value of
> "spark.streaming.concurrentJobs" if I have more than one output.
>
> In my Spark Streaming job I am reading from Kafka using the receiver-based
> approach, transforming each line of data from Kafka, and storing it in
> HBase. I do not intend to do any kind of aggregation at this stage. I believe
> this can be parallelized by creating a separate job to write a different
> set of lines from Kafka to HBase, so I set the above parameter to a
> value > 1. Is my assumption correct that writing to HBase for each partition
> in the RDDs of a given DStream is an independent output operation and can
> be parallelized?
>
> If the assumption is correct: when I run the job, it does create multiple
> (smaller) jobs, but they are executed one after another, not in parallel. I
> am curious whether there is a requirement that the number of executors be >=
> some particular number (perhaps a calculation based on how many partitions
> remain after repartitioning the union of DStreams, etc. - I don't know, I am
> grasping at straws here.)
>
> I would appreciate some help in this regard. Thanks in advance.
>
> --
> Regards,
> Atul Kulkarni
>
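
For reference, here is the rough shape of the job described above - a
stripped-down sketch rather than the real code: the topic name, ZooKeeper
quorum, consumer group, receiver count, and the HBase write (a println
placeholder) are all made up for illustration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToHBaseSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-to-hbase")
      .set("spark.streaming.concurrentJobs", "2")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Several receiver-based Kafka streams, unioned and repartitioned so the
    // per-line transformation is spread across the executors.
    val kafkaStreams = (1 to 3).map { _ =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
    }
    val lines = ssc.union(kafkaStreams).repartition(8).map(_._2)

    // Two independent output operations on the same DStream. The hope is that
    // with spark.streaming.concurrentJobs > 1 their jobs for a given batch can
    // be scheduled in parallel rather than one after another.
    lines.filter(_.startsWith("A")).foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // The real job opens an HBase connection here and issues one Put per
        // line; printing is only a placeholder so the sketch compiles.
        partition.foreach(println)
      }
    }
    lines.filter(line => !line.startsWith("A")).foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Same placeholder for the second set of lines.
        partition.foreach(println)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}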



-- 
Regards,
Atul Kulkarni
