Thanks for the suggestion, Andrew. We have also implemented our solution using reduceByKey, but we observe the same behavior. For example, if we do the following:
map1 -> groupByKey -> map2 -> saveAsTextFile

then the stalling occurs during the map1 + groupByKey execution. If we instead do

map1 -> reduceByKey -> map2 -> saveAsTextFile

then the reduceByKey finishes successfully, but the stalling occurs during the map2 + saveAsTextFile execution.

On Tue, May 20, 2014 at 4:22 PM, Andrew Ash [via Apache Spark User List] wrote:

> If the distribution of the keys in your groupByKey is skewed (some keys
> appear way more often than others), you should consider modifying your job
> to use reduceByKey instead wherever possible.
>
> On May 20, 2014 12:53 PM, "Jon Keebler" wrote:
>
>> So we upped the spark.akka.frameSize value to 128 MB and still observed
>> the same behavior. It's happening not necessarily when data is being sent
>> back to the driver, but when there is an inter-cluster shuffle, for
>> example during a groupByKey.
>>
>> Is it possible we should focus on tuning these parameters:
>> spark.storage.memoryFraction and spark.shuffle.memoryFraction?
>>
>> On Tue, May 20, 2014 at 12:09 AM, Aaron Davidson wrote:
>>
>>> This is very likely because the serialized map output locations buffer
>>> exceeds the Akka frame size. Please try setting "spark.akka.frameSize"
>>> (default 10 MB) to some higher number, like 64 or 128.
>>>
>>> In the newest version of Spark, this would throw a better error, for
>>> what it's worth.
>>>
>>> On Mon, May 19, 2014 at 8:39 PM, jonathan.keebler wrote:
>>>
>>>> Has anyone observed Spark worker threads stalling during a shuffle
>>>> phase, with the following message (one per worker host) being echoed
>>>> to the terminal on the driver thread?
>>>>
>>>> INFO spark.MapOutputTrackerActor: Asked to send map output locations for
>>>> shuffle 0 to [worker host]...
>>>>
>>>> At this point Spark-related activity on the Hadoop cluster completely
>>>> halts: there is no network activity, disk I/O, or CPU activity,
>>>> individual tasks are not completing, and the job just sits in this
>>>> state. We then kill the job, and a restart of the Spark server service
>>>> is required.
>>>>
>>>> Using identical jobs, we were able to bypass this halt point by
>>>> increasing the heap memory available to the workers, but it's odd that
>>>> we don't get an out-of-memory error or any error at all. Upping the
>>>> available memory isn't a very satisfying answer to what may be going
>>>> on :)
>>>>
>>>> We're running Spark 0.9.0 on CDH5.0 in stand-alone mode.
>>>>
>>>> Thanks for any help or ideas you may have!
>>>>
>>>> Cheers,
>>>> Jonathan
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-tp6067.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
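For reference, Aaron's frameSize suggestion is applied when constructing the context. A minimal sketch for a Spark 0.9-era PySpark job (the app name and the value 128 are illustrative choices, not from the thread; requires a Spark installation):

```python
from pyspark import SparkConf, SparkContext

# Raise the Akka frame size (in MB) so the serialized map output
# locations fit in a single message; the default on Spark 0.9 was 10.
conf = (SparkConf()
        .setAppName("shuffle-tuning")          # hypothetical app name
        .set("spark.akka.frameSize", "128"))   # Aaron suggested 64 or 128
sc = SparkContext(conf=conf)
```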
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-tp6067p6137.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
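The semantic difference Andrew points to can be illustrated outside Spark: reduceByKey combines values within each map partition before the shuffle, while groupByKey ships every record across the network. A plain-Python sketch of that idea (the partition layout and helper names are invented for illustration; this is not Spark API):

```python
from collections import defaultdict

# A skewed input spread across two "map" partitions: key "a" dominates.
partitions = [
    [("a", 1)] * 5 + [("b", 1)],
    [("a", 1)] * 4 + [("c", 1)],
]

def group_by_key_shuffle(parts):
    # groupByKey: every (key, value) record crosses the shuffle boundary.
    shuffled = [rec for part in parts for rec in part]
    grouped = defaultdict(list)
    for k, v in shuffled:
        grouped[k].append(v)
    return dict(grouped), len(shuffled)

def reduce_by_key_shuffle(parts, f):
    # reduceByKey: values are pre-combined per partition (map-side combine),
    # so at most one record per (partition, key) crosses the shuffle.
    shuffled = []
    for part in parts:
        combined = {}
        for k, v in part:
            combined[k] = f(combined[k], v) if k in combined else v
        shuffled.extend(combined.items())
    reduced = {}
    for k, v in shuffled:
        reduced[k] = f(reduced[k], v) if k in reduced else v
    return reduced, len(shuffled)

grouped, n_group = group_by_key_shuffle(partitions)
reduced, n_reduce = reduce_by_key_shuffle(partitions, lambda a, b: a + b)
print(n_group, n_reduce)   # 11 records shuffled vs 4
print(reduced)             # {'a': 9, 'b': 1, 'c': 1}
```

With skewed keys the gap grows with the data: groupByKey still ships one record per input value to the reducer holding the hot key, which is consistent with the stall simply moving to whichever stage still shuffles the raw values.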