Hi, I have a simple Spark application: it creates an input RDD with sc.textFile, then calls flatMapToPair, reduceByKey and map on it. The output RDD is small, a few MBs, and I call collect() on it.
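For context, the pipeline has roughly this shape. This is a simplified plain-Java analogue (a word-count-style sketch, not my actual job, which uses the Spark Java API on a cluster): flatMapToPair emits (key, 1) pairs, reduceByKey sums per key, and map formats the results.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Plain-Java analogue of: flatMapToPair -> reduceByKey -> map -> collect().
    // The real job runs distributed on Spark; this just shows the data flow.
    public static List<String> run(List<String> lines) {
        // flatMapToPair + reduceByKey: split each line into words and
        // sum a count of 1 per occurrence of each word (merge by key)
        Map<String, Integer> reduced = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
        // map: turn each (key, value) pair into a small output record
        return reduced.entrySet().stream()
                .map(e -> e.getKey() + ":" + e.getValue())
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c")));
    }
}
```

The point is that the reduced result is tiny compared to the input, so collect() on the final RDD should only pull a few MBs back to the driver.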
If the text file is ~50GB, the job finishes in a few minutes. However, if it is larger (~100GB), execution hangs at the end of the collect() stage. The UI shows one active job (collect), one completed stage (flatMapToPair), and one active stage (collect). The collect stage is at 880/892 tasks succeeded, and every task shown in the UI is either in SUCCESS or RUNNING state, so I think the problem occurs right as the job is about to finish. The driver and the containers log nothing for 15 minutes, then I get a connection timeout.

I run the job in yarn-cluster mode on Amazon EMR with Spark 1.2.1 and Hadoop 2.4.0. This happens every time I run the process with the larger input data, so I don't think it is just a transient connection issue. Is this a Spark bug, or is something wrong with my setup?

Zsolt