Hello, does anyone have any insight into what the issue could be here?
Thanks
Nikunj

On Fri, Sep 25, 2015 at 10:44 AM, N B <nb.nos...@gmail.com> wrote:

> Hi Akhil,
>
> I do have 25 partitions being created. I have set the
> spark.default.parallelism property to 25. The batch size is 30 seconds and
> the block interval is 1200 ms, which also gives us roughly 25 partitions
> from the input stream. I can see 25 partitions being created and used in
> the Spark UI as well. It's just that those tasks wait for cores on N1 to
> become free before being scheduled, while N2 sits idle.
>
> The cluster configuration is:
>
> N1: 2 workers, 6 cores and 16 GB memory each => 12 cores on the first node.
> N2: 2 workers, 8 cores and 16 GB memory each => 16 cores on the second node.
>
> That is a grand total of 28 cores. But it still does most of the processing
> on N1 (divided among the 2 workers running there), almost completely
> disregarding N2 until the final stage, where the data is written out to
> Elasticsearch. I am not sure I understand why it does not distribute more
> partitions to N2 to begin with and use it effectively. Since there are only
> 12 cores on N1 and 25 total partitions, shouldn't it send some of those
> partitions to N2 as well?
>
> Thanks
> Nikunj
>
>
> On Fri, Sep 25, 2015 at 5:28 AM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> The number of parallel tasks depends entirely on the number of partitions
>> you have. If you are not getting enough partitions (partitions > total
>> # of cores), then try a .repartition.
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Sep 25, 2015 at 1:44 PM, N B <nb.nos...@gmail.com> wrote:
>>
>>> Hello all,
>>>
>>> I have a Spark Streaming application that reads from a Flume stream,
>>> does quite a few maps/filters in addition to a few reduceByKeyAndWindow
>>> and join operations, and then writes the analyzed output to Elasticsearch
>>> inside a foreachRDD()...
>>>
>>> I recently started running this on a 2-node cluster (standalone), with
>>> the driver program submitting directly to the Spark master on the same
>>> host. The way I have divided the resources is as follows:
>>>
>>> N1: Spark master + driver + Flume + 2 Spark workers (16 GB + 6 cores per
>>> worker)
>>> N2: 2 Spark workers (16 GB + 8 cores per worker).
>>>
>>> The application works just fine, but it underuses N2 completely. It
>>> seems to use N1 (note that both executors on N1 get used) for all the
>>> analytics, but when it comes to writing to Elasticsearch, it does divide
>>> the data across all 4 executors, which then write to ES on a separate
>>> host.
>>>
>>> I am puzzled as to why the data is not distributed evenly into all 4
>>> executors from the get-go, and why it only does so in the final step of
>>> the pipeline, which seems counterproductive as well.
>>>
>>> CPU usage on N1 is near its peak, while on N2 it is < 10% of overall
>>> capacity.
>>>
>>> Any help in getting the resources on N1 and N2 utilized more evenly is
>>> welcome.
>>>
>>> Thanks in advance,
>>> Nikunj
>>>
>>>
>>
>
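For reference, a minimal sketch of the .repartition approach Akhil suggests above, applied right after ingesting the Flume stream so that the maps/filters/joins spread across executors on both nodes rather than staying near the receiver. The host name, port, application name, and the choice of 28 partitions (to match the cluster's total core count) are assumptions for illustration only, not taken from the thread:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumeToEsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("FlumeToEsSketch")                     // hypothetical app name
      // 30 s batch / 1200 ms block interval => ~25 blocks (input partitions) per batch
      .set("spark.streaming.blockInterval", "1200ms")
      .set("spark.default.parallelism", "25")

    // 30-second batches, as described in the thread
    val ssc = new StreamingContext(conf, Seconds(30))

    // Hypothetical host/port where the Flume receiver listens
    val flumeStream = FlumeUtils.createStream(ssc, "n1.example.com", 41414)

    // Repartition immediately after ingestion; 28 matches the cluster's
    // total core count (12 on N1 + 16 on N2), per Akhil's "partitions > total cores" rule of thumb
    val events = flumeStream.repartition(28)

    // ... maps/filters, reduceByKeyAndWindow, joins, and the foreachRDD()
    // that writes to Elasticsearch would follow here ...

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Whether this alone pulls work onto N2 depends on where the blocks land, but forcing a shuffle early in the pipeline is the usual first thing to try when one receiver node ends up doing most of the processing.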