I'm running into an issue with a PySpark job where I'm seeing extremely variable job times (20 minutes to 2 hours for the same data) and very long shuffle write times (e.g. ~2.5 minutes to write 18 KB / 86 records).
Cluster setup is Amazon EMR 4.4.0 with Spark 1.6.0: an m4.2xlarge driver and a single m4.10xlarge (40 vCPU, 160 GB) executor node, with the following parameters:

--conf spark.driver.memory=10G --conf spark.default.parallelism=160 --conf spark.driver.maxResultSize=4g --num-executors 2 --executor-cores 20 --executor-memory 67G

What's odd is that sometimes the job will run in 20 minutes and sometimes it will take 2 hours, both runs with the same data. I'm using RDDs (not DataFrames). There's plenty of RAM, and I've looked at the GC logs (using CMS) and they look fine. The job reads some data from files, then does some maps/filters/joins/etc.; nothing too special.

The only thing I've noticed that looks odd is that the slow instances of the job have unusually long Shuffle Write times for some tasks. For example, in one .join operation, ~30 tasks out of 320 take 2.5 minutes each, with a GC time of 0.1 seconds, a Shuffle Read Size / Records of 12 KB / 30, and, most interestingly, a Write Time of 2.5 minutes for a Shuffle Write Size / Records of only 18 KB / 86. Looking at the event timeline for that stage, it's almost all yellow (Shuffle Write).

We've been running this same job on a different EMR cluster topology (12 m3.2xlarge nodes) and have not seen the slowdown described above; we've only observed it on the single m4.10xlarge, so it seems related to that setup. It might be worth repeating that this is PySpark with plain RDDs, no DataFrames. When I run top on the executor, I sometimes see lots (e.g. 60 or 70) of python processes (one per partition being processed, I assume?).

What I really don't understand is why the job runs fine (20 minutes) for a while and then, for the same data, takes so much longer (2 hours), and with such long shuffle write times.
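For reference, the parameters above correspond to a spark-submit invocation along these lines (the script name and any application arguments are placeholders, not the actual job):

```shell
# Sketch of the submission; my_job.py stands in for the real application script
spark-submit \
  --conf spark.driver.memory=10G \
  --conf spark.default.parallelism=160 \
  --conf spark.driver.maxResultSize=4g \
  --num-executors 2 \
  --executor-cores 20 \
  --executor-memory 67G \
  my_job.py
```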