I ran into something similar before. 19/20 partitions would complete very 
quickly, and 1 would take the bulk of the time and of the shuffle reads & writes. 
This was because the majority of partitions were empty, and 1 had all the data. 
Perhaps something similar is going on here - I would suggest taking a look at how 
much data each partition contains and trying to achieve a roughly even 
distribution for best performance. In particular, if the RDDs are PairRDDs, 
partitions are assigned based on the hash of the key, so an even distribution of 
values among keys is required for an even split of data across partitions.
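
For example, a quick check along these lines prints the record count in each 
partition (just a sketch; it assumes the skewed RDD is bound to a val named rdd):

    // Sketch only: rdd is a placeholder for whichever RDD feeds the slow stage.
    // Count the records held by each partition so skew is easy to spot.
    val perPartition = rdd
      .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()
    perPartition.sortBy(-_._2).foreach { case (idx, n) =>
      println(s"partition $idx: $n records")
    }

If one or a few partitions hold most of the records, that points to key skew.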

On December 2, 2014 at 4:15:25 PM, Steve Lewis (lordjoe2...@gmail.com) wrote:

1) I can go there but none of the links are clickable
2) when I see something like 116/120 partitions succeeded in the Stages UI, in 
the Storage UI I see:
NOTE RDD 27 has 116 partitions cached - 4 are not, and that is exactly the number 
of machines which will not complete.
Also RDD 27 does not show up in the Stages UI

RDD Name  Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
2         Memory Deserialized 1x Replicated  1                  100%             11.8 MB         0.0 B            0.0 B
14        Memory Deserialized 1x Replicated  1                  100%             122.7 MB        0.0 B            0.0 B
7         Memory Deserialized 1x Replicated  120                100%             151.1 MB        0.0 B            0.0 B
1         Memory Deserialized 1x Replicated  1                  100%             65.6 MB         0.0 B            0.0 B
10        Memory Deserialized 1x Replicated  24                 100%             160.6 MB        0.0 B            0.0 B
27        Memory Deserialized 1x Replicated  116                97%

On Tue, Dec 2, 2014 at 3:43 PM, Sameer Farooqui <same...@databricks.com> wrote:
Have you tried taking thread dumps via the UI? There is a link to do so on the 
Executors page (typically at http://<driver IP>:4040/executors).

By visualizing the thread call stack of the executors with slow-running tasks, 
you can see exactly what code is executing at an instant in time. If you sample 
the executor several times over a short period, you can identify 'hot spots', or 
expensive sections, in the user code.

On Tue, Dec 2, 2014 at 3:03 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
I am working on a problem which will eventually involve many millions of 
function calls. I have a small sample with several thousand calls working, but 
when I try to scale up the amount of data things stall. I use 120 partitions 
and 116 finish in very little time. The remaining 4 seem to do all the work, 
stall after a fixed number of calls (about 1000), and even after hours make no 
more progress.

This is my first large and complex job with Spark and I would like any insight 
on how to debug the issue, or even better why it might exist. The cluster has 
15 machines and I am setting executor memory at 16G.
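
For reference, the memory setting is applied roughly like this (a sketch; the 
app name is just a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of the job setup; "my-job" is a placeholder app name.
    val conf = new SparkConf()
      .setAppName("my-job")
      .set("spark.executor.memory", "16g")
    val sc = new SparkContext(conf)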

Also, what other questions are relevant to solving the issue?




--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
