1) I can go there, but none of the links are clickable.
2) When the Stages UI shows something like 116/120 partitions succeeded, the
Storage UI shows the table below.
NOTE: RDD 27 has 116 partitions cached and 4 not cached, and those 4 match
exactly the number of machines which will not complete. Also, RDD 27 does not
show up in the Stages UI.

RDD Name  Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
2   <http://hwlogin.labs.uninett.no:4040/storage/rdd?id=2>   Memory Deserialized 1x Replicated    1  100%   11.8 MB  0.0 B  0.0 B
14  <http://hwlogin.labs.uninett.no:4040/storage/rdd?id=14>  Memory Deserialized 1x Replicated    1  100%  122.7 MB  0.0 B  0.0 B
7   <http://hwlogin.labs.uninett.no:4040/storage/rdd?id=7>   Memory Deserialized 1x Replicated  120  100%  151.1 MB  0.0 B  0.0 B
1   <http://hwlogin.labs.uninett.no:4040/storage/rdd?id=1>   Memory Deserialized 1x Replicated    1  100%   65.6 MB  0.0 B  0.0 B
10  <http://hwlogin.labs.uninett.no:4040/storage/rdd?id=10>  Memory Deserialized 1x Replicated   24  100%  160.6 MB  0.0 B  0.0 B
27  <http://hwlogin.labs.uninett.no:4040/storage/rdd?id=27>  Memory Deserialized 1x Replicated  116   97%
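For what it's worth, a handful of partitions doing all the work usually points to data skew. A minimal sketch of the check: in PySpark the per-partition record counts can be pulled with rdd.glom().map(len).collect(); the find_skewed helper below is hypothetical and just flags partitions far above the median.

```python
# Hypothetical skew check: given per-partition record counts
# (e.g. from rdd.glom().map(len).collect() in PySpark),
# flag partitions whose count is far above the median.
from statistics import median

def find_skewed(counts, factor=10):
    """Return indices of partitions holding more than factor * median records."""
    m = median(counts)
    return [i for i, c in enumerate(counts) if c > factor * max(m, 1)]

# Example shaped like the symptom above: 116 small partitions, 4 overloaded.
counts = [1000] * 116 + [500000] * 4
print(find_skewed(counts))  # indices of the 4 overloaded partitions
```

If the flagged partitions line up with the stalled tasks, repartitioning on a better-distributed key is the usual next step.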

On Tue, Dec 2, 2014 at 3:43 PM, Sameer Farooqui <same...@databricks.com>
wrote:

> Have you tried taking thread dumps via the UI? There is a link to do so on
> the Executors page (typically under http://<driver IP>:4040/executors).
>
> By visualizing the thread call stack of the executors with slow running
> tasks, you can see exactly what code is executing at an instant in time. If
> you sample the executor several times in a short time period, you can
> identify 'hot spots' or expensive sections in the user code.
>
> On Tue, Dec 2, 2014 at 3:03 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>
>> I am working on a problem which will eventually involve many millions of
>> function calls. I have a small sample with several thousand calls working,
>> but when I try to scale up the amount of data, things stall. I use 120
>> partitions; 116 finish in very little time, while the remaining 4 seem to
>> do all the work, stall after a fixed number of calls (about 1000), and
>> even after hours make no more progress.
>>
>> This is my first large and complex job with Spark, and I would like any
>> insight on how to debug the issue, or even better, why it might exist. The
>> cluster has 15 machines and I am setting executor memory at 16G.
>>
>> Also, what other questions are relevant to solving the issue?
>>
>
>
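The stack-sampling idea suggested above can be sketched in plain Python for illustration (Spark executors are JVMs, where `jstack <pid>` or the Thread Dump link in the Executors page gives the same information): grab every thread's stack several times and count which functions keep showing up.

```python
# Illustration only: sample all live thread stacks repeatedly and count
# how often each function appears; functions dominating the samples are
# the hot spots.
import collections
import sys
import threading
import time
import traceback

def sample_stacks(samples=20, interval=0.01):
    """Sample every live thread's stack and count function occurrences."""
    hot = collections.Counter()
    for _ in range(samples):
        for frame in sys._current_frames().values():
            for entry in traceback.extract_stack(frame):
                hot[entry.name] += 1
        time.sleep(interval)
    return hot

def busy():
    # Stand-in for a slow task: spin for half a second.
    end = time.time() + 0.5
    while time.time() < end:
        pass

worker = threading.Thread(target=busy)
worker.start()
hot = sample_stacks()
worker.join()
print(hot.most_common(3))  # the busy loop should rank among the hottest frames
```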


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
