So we upped the spark.akka.frameSize value to 128 (MB) and still observed the same behavior. It happens not necessarily when data is being sent back to the driver, but when there is a node-to-node shuffle within the cluster, for example during a groupByKey.
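For concreteness, here is roughly what we tried (a minimal sketch in Spark 0.9.x Scala syntax; the app name and toy data are hypothetical, and the groupByKey is just there to force the node-to-node shuffle in question):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // implicits for pair-RDD ops like groupByKey

    val conf = new SparkConf()
      .setAppName("shuffle-stall-repro")    // hypothetical app name
      .set("spark.akka.frameSize", "128")   // value is in MB; the default is 10
    val sc = new SparkContext(conf)         // master URL / deployment settings omitted

    // Toy pair RDD: groupByKey forces a full shuffle of map outputs between nodes.
    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1024, i))
    pairs.groupByKey().count()              // count() forces execution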
Is it possible we should focus on tuning these parameters instead: spark.storage.memoryFraction and spark.shuffle.memoryFraction? (A sketch of what that might look like is below the quoted thread.)

On Tue, May 20, 2014 at 12:09 AM, Aaron Davidson <ilike...@gmail.com> wrote:

> This is very likely because the serialized map output locations buffer
> exceeds the akka frame size. Please try setting "spark.akka.frameSize"
> (default 10 MB) to some higher number, like 64 or 128.
>
> In the newest version of Spark, this would throw a better error, for what
> it's worth.
>
> On Mon, May 19, 2014 at 8:39 PM, jonathan.keebler <jkeeble...@gmail.com> wrote:
>
>> Has anyone observed Spark worker threads stalling during a shuffle phase,
>> with the following message (one per worker host) being echoed to the
>> terminal on the driver thread?
>>
>> INFO spark.MapOutputTrackerActor: Asked to send map output locations for
>> shuffle 0 to [worker host]...
>>
>> At this point Spark-related activity on the Hadoop cluster completely
>> halts: there is no network activity, disk I/O, or CPU activity,
>> individual tasks are not completing, and the job just sits in this state.
>> At that point we kill the job, and a restart of the Spark server service
>> is required.
>>
>> Using identical jobs, we were able to bypass this halt point by
>> increasing the heap memory available to the workers, but it's odd that we
>> don't get an out-of-memory error, or any error at all. Upping the
>> available memory isn't a very satisfying answer to what may be going
>> on :)
>>
>> We're running Spark 0.9.0 on CDH 5.0 in standalone mode.
>>
>> Thanks for any help or ideas you may have!
>>
>> Cheers,
>> Jonathan
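For reference, the memoryFraction tuning I asked about above might look something like this (a sketch only; the values are assumptions, not recommendations. In 0.9.x the defaults are, I believe, 0.6 for storage and 0.3 for shuffle):

    // Shrink the cache's share of the heap to give shuffles more headroom.
    val conf = new SparkConf()
      .set("spark.storage.memoryFraction", "0.4")   // default 0.6
      .set("spark.shuffle.memoryFraction", "0.5")   // default 0.3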