Hi Miller,


Try the following:

1. Using *coalesce(numberOfPartitions)* to reduce the number of partitions and avoid idle cores.

2. Reducing executor memory as you increase the number of executors.

3. Triggering a GC, or switching from the default Java serialization to *Kryo* serialization (see the sketch below).
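
For what it's worth, here is a minimal sketch of how points 1 and 3 might look in Scala with the RDD API. The app name, partition counts and toy data are placeholders, not tuned values:

    import org.apache.spark.{SparkConf, SparkContext}

    // Point 3: switch from the default Java serialization to Kryo.
    val conf = new SparkConf()
      .setAppName("partition-tuning-sketch")   // placeholder app name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Point 1: coalesce down to roughly (number of executors x cores per
    // executor) partitions so that no core sits idle on tiny tasks.
    val data      = sc.parallelize(1 to 1000000, numSlices = 200)  // toy RDD
    val compacted = data.coalesce(40)                              // placeholder target
    println(compacted.getNumPartitions)

    sc.stop()

Point 2 (executor count and memory) is usually set at submit time, e.g.

    spark-submit --num-executors 10 --executor-memory 4g --executor-cores 4 your-job.jar

where all of the values above are placeholders to adjust for your cluster.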





Thanks,

Tejeshwar





*From:* Jeroen Miller [mailto:bluedasya...@gmail.com]
*Sent:* Thursday, September 28, 2017 2:11 PM
*To:* user@spark.apache.org
*Subject:* More instances = slower Spark job



Hello,



I am experiencing a disappointing performance issue with my Spark jobs as I scale up the number of instances.



The task is trivial: I am loading large (compressed) text files from S3, filtering out lines that do not match a regex, counting the number of remaining lines, and saving the resulting datasets as (compressed) text files on S3. Nothing that a simple grep couldn't do, except that the files are too large to be downloaded and processed locally.
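
(In code, the job is roughly the following Scala/RDD sketch; the s3a:// paths, the regex and the codec choice are placeholders for illustration only.)

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.hadoop.io.compress.GzipCodec

    val sc = new SparkContext(new SparkConf().setAppName("grep-on-s3"))

    val pattern = "some-regex".r                          // placeholder regex
    val kept = sc.textFile("s3a://my-bucket/input/*.gz")  // placeholder input path
      .filter(line => pattern.findFirstIn(line).isDefined)

    // Count the surviving lines, then write them back out compressed.
    println(s"Remaining lines: ${kept.count()}")
    kept.saveAsTextFile("s3a://my-bucket/output/", classOf[GzipCodec])  // placeholder output path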



On a single instance, I can process X GBs per hour. When scaling up to 10 instances, I noticed that processing the /same/ amount of data actually takes /longer/.



This is quite surprising as the task is really simple: I was expecting a significant speed-up. My naive idea was that each executor would process a fraction of the input file, count the remaining lines /locally/, and save its part of the processed file /independently/, so that no data shuffling would occur.



Obviously, this is not what is happening.



Can anyone shed some light on this or provide pointers to relevant information?



Regards,



Jeroen
