Has anyone observed Spark worker threads stalling during a shuffle phase with
the following message (one per worker host) being echoed to the terminal on
the driver thread?

INFO spark.MapOutputTrackerActor: Asked to send map output locations for
shuffle 0 to [worker host]...


At this point Spark-related activity on the Hadoop cluster halts completely:
there's no network activity, disk IO or CPU activity, individual tasks stop
completing, and the job just sits in this state.  We then have to kill the
job, and a restart of the Spark server service is required.

Using identical jobs, we were able to bypass this halt point by increasing
the heap memory available to the workers, but it's odd that we don't get an
out-of-memory error, or any error at all.  Upping the memory available isn't
a very satisfying answer to what may be going on :)
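
For anyone wanting to try the same workaround, here is a minimal sketch of
the kind of heap increase meant above, assuming the standard
spark.executor.memory property set through SparkConf.  The master URL, app
name and the 8g value are placeholders for illustration only, not our actual
settings.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077") // placeholder standalone master URL
      .setAppName("shuffle-job")             // placeholder application name
      .set("spark.executor.memory", "8g")    // illustrative heap size per executor

    val sc = new SparkContext(conf)
    try {
      // ... the job containing the shuffle goes here ...
    } finally {
      sc.stop() // release the executors even if the job fails
    }
  }
}
```

In standalone mode the requested executor memory has to fit within what each
worker is allowed to hand out (SPARK_WORKER_MEMORY in conf/spark-env.sh), so
that limit may need raising as well.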

We're running Spark 0.9.0 on CDH 5.0 in standalone mode.

Thanks for any help or ideas you may have!

Cheers,
Jonathan



