Thanks - I found the same thing. Calling

    boolean forceShuffle = true;
    myRDD = myRDD.coalesce(120, forceShuffle);

worked - there were already 120 partitions, but forcing a shuffle distributes the work.
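The effect described above can be sketched in plain Python (this is only a rough stand-in for Spark's behavior, not Spark itself): coalesce without a shuffle only merges whole existing partitions, so one oversized partition stays on a single task, while forcing a shuffle moves individual records and rebalances them.

```python
import itertools

def coalesce_no_shuffle(partitions, n):
    """Merge whole partitions into n groups without moving individual
    records - roughly what coalesce(n, false) does."""
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i % n].extend(part)
    return groups

def coalesce_with_shuffle(partitions, n):
    """Redistribute individual records round-robin across n new
    partitions - roughly what forcing a shuffle achieves."""
    groups = [[] for _ in range(n)]
    records = itertools.chain.from_iterable(partitions)
    for i, record in enumerate(records):
        groups[i % n].append(record)
    return groups

# One partition holds nearly all the data, the rest are tiny.
skewed = [list(range(1000))] + [[0] for _ in range(3)]

no_shuffle = coalesce_no_shuffle(skewed, 4)
with_shuffle = coalesce_with_shuffle(skewed, 4)

print([len(p) for p in no_shuffle])    # -> [1000, 1, 1, 1]: skew survives
print([len(p) for p in with_shuffle])  # roughly even sizes
```

Without the shuffle the heavy partition is untouched, which matches the symptom of a few tasks doing all the work.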
I believe there is a bug in my code causing memory to accumulate as partitions grow in size. With a job over ten times larger I ran into other issues when raising the number of partitions to 10,000 - namely "too many open files".

On Thu, Dec 4, 2014 at 8:32 AM, Sameer Farooqui <same...@databricks.com> wrote:

> Good point, Ankit.
>
> Steve - You can click on the link for '27' in the first column to get a
> breakdown of how much data is in each of those 116 cached partitions. But
> really, you also want to understand how much data is in the 4 non-cached
> partitions, as they may be huge. One thing you can try is calling
> .repartition() on the RDD with something like 100 partitions and then
> caching this new RDD. See if that spreads the load between the partitions
> more evenly.
>
> Let us know how it goes.
>
> On Thu, Dec 4, 2014 at 12:16 AM, Ankit Soni <ankitso...@gmail.com> wrote:
>
>> I ran into something similar before. 19/20 partitions would complete very
>> quickly, and 1 would take the bulk of the time and of the shuffle reads
>> and writes. This was because the majority of partitions were empty, and 1
>> had all the data. Perhaps something similar is going on here - I would
>> suggest taking a look at how much data each partition contains and trying
>> to achieve a roughly even distribution for best performance. In
>> particular, if the RDDs are PairRDDs, partitions are assigned based on
>> the hash of the key, so an even distribution of values among keys is
>> required for an even split of data across partitions.
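Ankit's point about hash partitioning of PairRDDs can be demonstrated with a plain-Python stand-in (this uses Python's built-in hash, not Spark's HashPartitioner, and the key names are made up): when one "hot" key carries most of the records, every one of those records lands in the same partition no matter how many partitions exist.

```python
from collections import Counter

def partition_for(key, num_partitions):
    # Stand-in for hash-based partitioning: partition = hash(key) mod n.
    # Spark's HashPartitioner works on the same principle.
    return hash(key) % num_partitions

num_partitions = 20

# Skewed data: one hot key carries almost all the records,
# 100 other keys carry one record each.
records = [("hot", i) for i in range(10_000)] + \
          [(f"k{i}", i) for i in range(100)]

sizes = Counter(partition_for(k, num_partitions) for k, _ in records)
print(sorted(sizes.values(), reverse=True))
```

The largest partition always holds at least the 10,000 "hot" records, so adding partitions cannot fix this kind of skew - only changing the key distribution (or a custom partitioner) can.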
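On the "too many open files" error at 10,000 partitions: during a shuffle, each map task may open one file per reduce partition, so very high partition counts can exhaust the per-process file-descriptor limit. A sketch of the usual mitigations in the Spark 1.x era (exact limits and config placement depend on your deployment):

```shell
# Check and raise the per-process open-file limit on each worker node
ulimit -n            # show the current limit
ulimit -n 65535      # raise it (persistent change usually needs
                     # /etc/security/limits.conf or the service config)

# Spark 1.x settings that reduce the number of shuffle files,
# set in spark-defaults.conf or via --conf on spark-submit:
#   spark.shuffle.consolidateFiles=true   # hash shuffle: reuse files across map tasks
#   spark.shuffle.manager=sort            # sort-based shuffle: far fewer files
```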
>> On December 2, 2014 at 4:15:25 PM, Steve Lewis (lordjoe2...@gmail.com)
>> wrote:
>>
>> 1) I can go there, but none of the links are clickable.
>> 2) When I see something like 116/120 partitions succeeded in the Stages
>> UI, in the Storage UI I see:
>>
>> NOTE: RDD 27 has 116 partitions cached and 4 not, and those 4 match
>> exactly the machines which will not complete. Also, RDD 27 does not show
>> up in the Stages UI.
>>
>> RDD  Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
>>  2   Memory Deserialized 1x Replicated    1                100%              11.8 MB        0.0 B            0.0 B
>> 14   Memory Deserialized 1x Replicated    1                100%             122.7 MB        0.0 B            0.0 B
>>  7   Memory Deserialized 1x Replicated  120                100%             151.1 MB        0.0 B            0.0 B
>>  1   Memory Deserialized 1x Replicated    1                100%              65.6 MB        0.0 B            0.0 B
>> 10   Memory Deserialized 1x Replicated   24                100%             160.6 MB        0.0 B            0.0 B
>> 27   Memory Deserialized 1x Replicated  116                 97%
>>
>> (Each RDD number links to http://hwlogin.labs.uninett.no:4040/storage/rdd?id=<n>.)
>>
>> On Tue, Dec 2, 2014 at 3:43 PM, Sameer Farooqui <same...@databricks.com>
>> wrote:
>>
>>> Have you tried taking thread dumps via the UI? There is a link to do so
>>> on the Executors page (typically under http://<driver IP>:4040/executors).
>>>
>>> By visualizing the thread call stacks of the executors with slow-running
>>> tasks, you can see exactly what code is executing at an instant in time.
>>> If you sample an executor several times in a short time period, you can
>>> identify 'hot spots', or expensive sections, in the user code.
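The sampling Sameer describes can also be done from a shell on the executor host with jstack, which ships with the JDK. A minimal sketch - the PID value is a placeholder you must replace, and the process name shown for jps is typical for Spark executors but may differ in your deployment:

```shell
# Find the executor JVM's PID (CoarseGrainedExecutorBackend is the usual
# process name for Spark executors; adjust for your deployment)
jps -l | grep CoarseGrainedExecutorBackend

# Sample the thread stacks ~10 times over 10 seconds; stack frames that
# keep appearing across samples are the hot spots in the user code.
PID=12345   # placeholder - replace with the PID from jps above
for i in $(seq 1 10); do
  jstack "$PID" >> "executor-$PID-dumps.txt"
  sleep 1
done
```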
>>> On Tue, Dec 2, 2014 at 3:03 PM, Steve Lewis <lordjoe2...@gmail.com>
>>> wrote:
>>>
>>>> I am working on a problem which will eventually involve many millions
>>>> of function calls. I have a small sample with several thousand calls
>>>> working, but when I try to scale up the amount of data, things stall. I
>>>> use 120 partitions, and 116 finish in very little time. The remaining 4
>>>> seem to do all the work; they stall after a fixed number of calls
>>>> (about 1,000) and even after hours make no more progress.
>>>>
>>>> This is my first large and complex job with Spark, and I would like any
>>>> insight on how to debug the issue, or even better, why it might exist.
>>>> The cluster has 15 machines, and I am setting executor memory at 16G.
>>>>
>>>> Also, what other questions are relevant to solving the issue?

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
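For reference, a sketch of how the 16G executor-memory setting from the message above is typically passed at submit time. The main class, master URL, and jar name here are placeholders; only --executor-memory reflects a value actually mentioned in this thread:

```shell
# Placeholder class, master URL, and jar name - substitute your own.
spark-submit \
  --class com.example.MyJob \
  --master spark://master:7077 \
  --executor-memory 16g \
  my-job.jar
```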