Thanks for your answer, Imran. I haven't tried your suggestions yet, but
setting spark.shuffle.blockTransferService=nio solved my issue. There is a
JIRA for this: https://issues.apache.org/jira/browse/SPARK-6962.
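
In case it helps others, this is roughly how I set it (a minimal sketch; the
app name is a placeholder, and passing --conf
spark.shuffle.blockTransferService=nio to spark-submit works just as well):

    import org.apache.spark.{SparkConf, SparkContext}

    // Switch the shuffle transfer service from the default (netty) to nio.
    val conf = new SparkConf()
      .setAppName("my-job")  // placeholder name
      .set("spark.shuffle.blockTransferService", "nio")
    val sc = new SparkContext(conf)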

Zsolt

2015-04-14 21:57 GMT+02:00 Imran Rashid <iras...@cloudera.com>:

> Is it possible that when you switch to the bigger data set, your data is
> skewed, so some tasks generate far more data?  reduceByKey could
> result in a huge amount of data going to a small number of tasks.  I'd
> suggest
>
> (a) seeing what happens if you don't collect() -- e.g., instead try
> writing to HDFS with saveAsObjectFile.
> (b) taking a look at what is happening on the executors with the
> long-running tasks.  You can get thread dumps via the UI (or you can log
> in to the boxes and use jstack).  This might point to some of your code
> that is taking a long time, or it might point to Spark internals.
>
> On Wed, Apr 8, 2015 at 3:45 AM, Zsolt Tóth <toth.zsolt....@gmail.com>
> wrote:
>
>> I use EMR 3.3.1, which comes with Java 7. Do you think this may cause
>> the issue? Did you test it with Java 8?
>>
>
>
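
For anyone who finds this thread later: a rough sketch of what Imran's
suggestions could look like in code. The RDD name, key type, and output path
below are placeholders, not taken from the original job:

    import org.apache.spark.SparkContext._  // pair-RDD functions on Spark < 1.3
    import org.apache.spark.rdd.RDD

    // Skew check: count records per key and look at the heaviest keys. If a
    // handful of keys dominate, reduceByKey funnels most of the data into a
    // small number of tasks.
    def heaviestKeys(pairs: RDD[(String, Int)]): Array[(Long, String)] =
      pairs.mapValues(_ => 1L)
           .reduceByKey(_ + _)
           .map { case (key, count) => (count, key) }
           .top(10)

    // (a) Write the reduced result to HDFS instead of collect()-ing it, so
    // the driver never has to hold the whole result in memory.
    def reduceAndSave(pairs: RDD[(String, Int)]): Unit =
      pairs.reduceByKey(_ + _)
           .saveAsObjectFile("hdfs:///tmp/reduced-output")  // placeholder path

For (b), the thread dumps should be reachable from the Executors tab of the
application UI (port 4040 on the driver by default); running jstack against
an executor's JVM pid on the box gives the same stacks.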
