Can someone please help, either with an explanation or a pointer to documentation, on the relationship between the number of executors needed and how to get the concurrent jobs created by the above parameter to actually run in parallel? A stripped-down sketch of my setup follows.
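For concreteness, here is roughly the job structure (the class name, ZooKeeper quorum, topic, batch interval, filter predicates, and the saveToHBase helper are placeholders, not my real code):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToHBase {
  // Placeholder for the real HBase write; connection setup omitted.
  def saveToHBase(rows: Iterator[String]): Unit = rows.foreach(_ => ())

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("KafkaToHBase")
      // Allow the scheduler to run more than one streaming job at a time.
      .set("spark.streaming.concurrentJobs", "2")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Receiver-based Kafka stream: (key, message) pairs; keep the message.
    val lines = KafkaUtils
      .createStream(ssc, "zk1:2181", "my-group", Map("my-topic" -> 1))
      .map(_._2)

    // Two independent output operations; I expected the jobs they generate
    // to be scheduled in parallel once concurrentJobs > 1.
    lines.filter(line => line.startsWith("A")).foreachRDD { rdd =>
      rdd.foreachPartition(saveToHBase)
    }
    lines.filter(line => !line.startsWith("A")).foreachRDD { rdd =>
      rdd.foreachPartition(saveToHBase)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

From what I have read, each receiver permanently occupies one executor core, so I am guessing the two jobs can only overlap if the total cores across all executors exceed the receivers plus the tasks of at least two jobs. Is that the calculation I should be doing?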
On Thu, Sep 24, 2015 at 11:56 PM, Atul Kulkarni <atulskulka...@gmail.com> wrote:
> Hi Folks,
>
> I am trying to speed up my Spark Streaming job. I found a presentation by
> Tathagata Das that mentions increasing the value of
> "spark.streaming.concurrentJobs" if I have more than one output.
>
> In my Spark Streaming job I am reading from Kafka using the receiver-based
> approach, transforming each line of data from Kafka, and storing it to
> HBase. I do not intend to do any kind of collation at this stage. I believe
> this can be parallelized by creating a separate job to write a different
> set of lines from Kafka to HBase, and hence I set the above parameter to a
> value > 1. Is my assumption correct that writing to HBase for each
> partition in the RDDs from a given DStream is an independent output
> operation that can be parallelized?
>
> If the assumption is correct: when I run the job, it creates multiple
> (smaller) jobs, but they are executed one after another, not in parallel.
> I am curious whether there is a requirement that the number of executors
> be >= some particular number (a calculation based on how many repartitions
> after the union of DStreams, etc. I don't know, I am grasping at straws
> here.)
>
> I would appreciate some help in this regard. Thanks in advance.
>
> --
> Regards,
> Atul Kulkarni

--
Regards,
Atul Kulkarni