Re: Any ideas why a few tasks would stall

Sameer Farooqui Thu, 04 Dec 2014 00:33:07 -0800

Good point, Ankit.

Steve - You can click on the link for '27' in the first column to get a
break down of how much data is in each of those 116 cached partitions. But
really, you want to also understand how much data is in the 4 non-cached
partitions, as they may be huge. One thing you can try doing is
.repartition() on the RDD with something like 100 partitions and then cache
this new RDD. See if that spreads the load between the partitions more
evenly.


Let us know how it goes.

On Thu, Dec 4, 2014 at 12:16 AM, Ankit Soni <ankitso...@gmail.com> wrote:

> I ran into something similar before. 19/20 partitions would complete very
> quickly, and 1 would take the bulk of time and shuffle reads & writes. This
> was because the majority of partitions were empty, and 1 had all the data.
> Perhaps something similar is going on here - I would suggest taking a look
> at how much data each partition contains and try to achieve a roughly even
> distribution for best performance. In particular, if the RDDs are PairRDDs,
> partitions are assigned based on the hash of the key, so an even
> distribution of values among keys is required for even split of data across
> partitions.
>
> On December 2, 2014 at 4:15:25 PM, Steve Lewis (lordjoe2...@gmail.com)
> wrote:
>
> 1) I can go there but none of the links are clickable
> 2) when I see something like 116/120  partitions succeeded in the stages
> ui in the storage ui I see
> NOTE RDD 27 has 116 partitions cached - 4 not and those are exactly the
> number of machines which will not complete
> Also RDD 27 does not show up in the Stages UI
>
>    RDD Name Storage Level Cached Partitions Fraction Cached Size in Memory 
> Size
> in Tachyon Size on Disk   2
> <http://hwlogin.labs.uninett.no:4040/storage/rdd?id=2> Memory
> Deserialized 1x Replicated 1 100% 11.8 MB 0.0 B 0.0 B  14
> <http://hwlogin.labs.uninett.no:4040/storage/rdd?id=14> Memory
> Deserialized 1x Replicated 1 100% 122.7 MB 0.0 B 0.0 B  7
> <http://hwlogin.labs.uninett.no:4040/storage/rdd?id=7> Memory
> Deserialized 1x Replicated 120 100% 151.1 MB 0.0 B 0.0 B  1
> <http://hwlogin.labs.uninett.no:4040/storage/rdd?id=1> Memory
> Deserialized 1x Replicated 1 100% 65.6 MB 0.0 B 0.0 B  10
> <http://hwlogin.labs.uninett.no:4040/storage/rdd?id=10> Memory
> Deserialized 1x Replicated 24 100% 160.6 MB 0.0 B 0.0 B   27
> <http://hwlogin.labs.uninett.no:4040/storage/rdd?id=27> Memory
> Deserialized 1x Replicated 116 97%
>
> On Tue, Dec 2, 2014 at 3:43 PM, Sameer Farooqui <same...@databricks.com>
> wrote:
>
>> Have you tried taking thread dumps via the UI? There is a link to do so
>> on the Executors' page (typically under http://driver IP:4040/exectuors.
>>
>> By visualizing the thread call stack of the executors with slow running
>> tasks, you can see exactly what code is executing at an instant in time. If
>> you sample the executor several times in a short time period, you can
>> identify 'hot spots' or expensive sections in the user code.
>>
>> On Tue, Dec 2, 2014 at 3:03 PM, Steve Lewis <lordjoe2...@gmail.com>
>> wrote:
>>
>>>   I am working on a problem which will eventually involve many millions
>>> of function calls. A have a small sample with several thousand calls
>>> working but when I try to scale up the amount of data things stall. I use
>>> 120 partitions and 116 finish in very little time. The remaining 4 seem to
>>> do all the work and stall after a fixed number (about 1000) calls and even
>>> after hours make no more progress.
>>>
>>> This is my first large and complex job with spark and I would like any
>>> insight on how to debug  the issue or even better why it might exist. The
>>> cluster has 15 machines and I am setting executor memory at 16G.
>>>
>>> Also what other questions are relevant to solving the issue
>>>
>>
>>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>
>

Re: Any ideas why a few tasks would stall

Reply via email to