Thanks for your answer, Imran. I haven't tried your suggestions yet, but setting spark.shuffle.blockTransferService=nio solved my issue. There is a JIRA for this: https://issues.apache.org/jira/browse/SPARK-6962.
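For anyone hitting the same problem, a sketch of how that workaround can be applied at submit time (the application class and jar names below are placeholders, not from this thread):

```shell
# Workaround from SPARK-6962: fall back to the NIO-based shuffle block
# transfer service instead of the default Netty-based one (Spark 1.x).
# com.example.MyApp and my-app.jar are placeholder names.
spark-submit \
  --class com.example.MyApp \
  --conf spark.shuffle.blockTransferService=nio \
  my-app.jar
```

The same property can also be set once for all jobs in conf/spark-defaults.conf instead of per submission.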
Zsolt

2015-04-14 21:57 GMT+02:00 Imran Rashid <iras...@cloudera.com>:

> is it possible that when you switch to the bigger data set, your data is
> skewed, and so that some tasks generate far more data? reduceByKey could
> result in a huge amount of data going to a small number of tasks. I'd
> suggest
>
> (a) seeing what happens if you don't collect() -- eg. instead try writing
> to hdfs with saveAsObjectFile.
> (b) take a look at what is happening on the executors with the long
> running tasks. You can get thread dumps via the UI (or you can log in to
> the boxes and use jstack). This might point to some of your code that is
> taking a long time, or it might point to spark internals.
>
> On Wed, Apr 8, 2015 at 3:45 AM, Zsolt Tóth <toth.zsolt....@gmail.com>
> wrote:
>
>> I use EMR 3.3.1 which comes with Java 7. Do you think that this may cause
>> the issue? Did you test it with Java 8?
>>
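For readers of the archive, suggestion (a) above amounts to replacing the driver-side collect() with a distributed write. A minimal Scala sketch, assuming an RDD of key/value pairs named `pairs` and a placeholder HDFS output path:

```scala
// Instead of pulling the entire reduced result back to the driver,
// which forces one JVM to hold all of it:
//   val result = pairs.reduceByKey(_ + _).collect()
// write it out from the executors in parallel, so no single process
// ever materializes the full dataset ("hdfs:///tmp/reduce-output" is
// a placeholder path, not from this thread):
pairs.reduceByKey(_ + _).saveAsObjectFile("hdfs:///tmp/reduce-output")
```

If the job finishes quickly this way but hangs with collect(), the bottleneck is driver-side collection rather than the shuffle itself.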