Hi, I have a simple Spark application: it creates an input RDD with sc.textFile, then calls flatMapToPair, reduceByKey and map on it. The output RDD is small, a few MBs, and I call collect() on it.
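For context, the pipeline has roughly this shape. This is a simplified plain-Java analogue (a word-count-style sketch, not my actual job, which uses the Spark Java API on a cluster): flatMapToPair emits (key, 1) pairs, reduceByKey sums per key, and map formats the results.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Plain-Java analogue of: flatMapToPair -> reduceByKey -> map -> collect().
    // The real job runs distributed on Spark; this just shows the data flow.
    public static List<String> run(List<String> lines) {
        // flatMapToPair + reduceByKey: split each line into words and
        // sum a count of 1 per occurrence of each word (merge by key)
        Map<String, Integer> reduced = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
        // map: turn each (key, value) pair into a small output record
        return reduced.entrySet().stream()
                .map(e -> e.getKey() + ":" + e.getValue())
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c")));
    }
}
```

The point is that the reduced result is tiny compared to the input, so collect() on the final RDD should only pull a few MBs back to the driver.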
If the text file is ~50GB, the job finishes in a few minutes. However, if it is larger (~100GB), execution hangs at the end of the collect() stage. The UI shows one active job (collect), one completed stage (flatMapToPair), and one active stage (collect). The collect stage is at 880/892 tasks succeeded, and every task shown in the UI is either in SUCCESS or RUNNING state, so I think the problem occurs right as the job is about to finish. The driver and the containers log nothing for 15 minutes, then I get a connection timeout.

I run the job in yarn-cluster mode on Amazon EMR with Spark 1.2.1 and Hadoop 2.4.0. This happens every time I run the process with the larger input data, so I don't think it is just a transient connection issue. Is this a Spark bug, or is something wrong with my setup?

Zsolt