Thanks for the suggestion, Andrew.  We have also implemented our solution
using reduceByKey, but we observe the same behavior.  For example, if we do
the following:

map1
groupByKey
map2
saveAsTextFile

Then the stalling will occur during the map1+groupByKey execution.

If we do

map1
reduceByKey
map2
saveAsTextFile

Then the reduceByKey finishes successfully, but the stalling will occur
during the map2+saveAsTextFile execution.


On Tue, May 20, 2014 at 4:22 PM, Andrew Ash [via Apache Spark User List] <
ml-node+s1001560n6134...@n3.nabble.com> wrote:

> If the distribution of the keys in your groupByKey is skewed (some keys
> appear way more often than others) you should consider modifying your job
> to use reduceByKey instead wherever possible.
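A rough illustration of why that helps with skew (pure Python over two
hypothetical map partitions, not Spark itself): reduceByKey can combine values
map-side before the shuffle, so a hot key contributes one record per partition
rather than one record per occurrence.

```python
from collections import Counter

# Two hypothetical map partitions, heavily skewed toward the key "hot"
partitions = [
    [("hot", 1)] * 1000 + [("cold", 1)],
    [("hot", 1)] * 1000 + [("warm", 1)],
]

# groupByKey: every (key, value) record crosses the network
records_group = sum(len(part) for part in partitions)

# reduceByKey: a map-side combine collapses each partition to one record
# per distinct key before the shuffle
def combine(part):
    totals = Counter()
    for k, v in part:
        totals[k] += v
    return totals

records_reduce = sum(len(combine(part)) for part in partitions)

print(records_group, records_reduce)  # 2002 vs. 4 records shuffled
```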
> On May 20, 2014 12:53 PM, "Jon Keebler" <[hidden email]> wrote:
>
>> So we upped the spark.akka.frameSize value to 128 MB and still observed
>> the same behavior.  It's happening not necessarily when data is being sent
>> back to the driver, but when there is a shuffle between workers in the
>> cluster, for example during a groupByKey.
>>
>> Is it possible we should focus on tuning these parameters:
>> spark.storage.memoryFraction & spark.shuffle.memoryFraction?
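For anyone trying the same thing, these are plain Spark properties; the values
below are illustrative only (in Spark 0.9 the defaults are 0.6 and 0.3, and
the two fractions should sum to well under 1.0 to leave heap for user objects):

```
# Illustrative values only -- tune for your workload
spark.storage.memoryFraction   0.5    # heap fraction for cached RDDs (default 0.6)
spark.shuffle.memoryFraction   0.4    # heap fraction for shuffle aggregation buffers (default 0.3)
```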
>>
>>
>> On Tue, May 20, 2014 at 12:09 AM, Aaron Davidson <[hidden email]> wrote:
>>
>>> This is very likely because the serialized map output locations buffer
>>> exceeds the Akka frame size. Please try setting "spark.akka.frameSize"
>>> (default 10 MB) to some higher number, like 64 or 128.
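For reference, this is an ordinary Spark property (the value is in MB), set
the same way as any other Spark configuration setting:

```
spark.akka.frameSize   128    # max Akka message size in MB (default 10)
```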
>>>
>>> In the newest version of Spark, this would throw a better error, for
>>> what it's worth.
>>>
>>>
>>>
>>> On Mon, May 19, 2014 at 8:39 PM, jonathan.keebler <[hidden email]> wrote:
>>>
>>>> Has anyone observed Spark worker threads stalling during a shuffle
>>>> phase with
>>>> the following message (one per worker host) being echoed to the
>>>> terminal on
>>>> the driver thread?
>>>>
>>>> INFO spark.MapOutputTrackerActor: Asked to send map output locations for
>>>> shuffle 0 to [worker host]...
>>>>
>>>>
>>>> At this point Spark-related activity on the Hadoop cluster completely
>>>> halts: there's no network activity, disk IO, or CPU activity, and
>>>> individual tasks are not completing; the job just sits in this state.
>>>> At that point we have to kill the job, and a restart of the Spark
>>>> server service is required.
>>>>
>>>> Using identical jobs we were able to bypass this halt point by
>>>> increasing the heap memory available to the workers, but it's odd we
>>>> don't get an out-of-memory error or any error at all.  Upping the
>>>> available memory isn't a very satisfying answer to what may be going
>>>> on :)
>>>>
>>>> We're running Spark 0.9.0 on CDH5.0 in stand-alone mode.
>>>>
>>>> Thanks for any help or ideas you may have!
>>>>
>>>> Cheers,
>>>> Jonathan
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-tp6067.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>
>>>
>>
>



