Thanks - I found the same thing. Calling

    boolean forceShuffle = true;
    myRDD = myRDD.coalesce(120, forceShuffle);

worked - there were already 120 partitions, but forcing a shuffle distributes the work.
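The effect described above can be sketched in plain Python (this is only a rough stand-in for Spark's behavior, not Spark itself): coalesce without a shuffle only merges whole existing partitions, so one oversized partition stays on a single task, while forcing a shuffle moves individual records and rebalances them.

```python
import itertools

def coalesce_no_shuffle(partitions, n):
    """Merge whole partitions into n groups without moving individual
    records - roughly what coalesce(n, false) does."""
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i % n].extend(part)
    return groups

def coalesce_with_shuffle(partitions, n):
    """Redistribute individual records round-robin across n new
    partitions - roughly what forcing a shuffle achieves."""
    groups = [[] for _ in range(n)]
    records = itertools.chain.from_iterable(partitions)
    for i, record in enumerate(records):
        groups[i % n].append(record)
    return groups

# One partition holds nearly all the data, the rest are tiny.
skewed = [list(range(1000))] + [[0] for _ in range(3)]

no_shuffle = coalesce_no_shuffle(skewed, 4)
with_shuffle = coalesce_with_shuffle(skewed, 4)

print([len(p) for p in no_shuffle])    # -> [1000, 1, 1, 1]: skew survives
print([len(p) for p in with_shuffle])  # roughly even sizes
```

Without the shuffle the heavy partition is untouched, which matches the symptom of a few tasks doing all the work.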
I believe there is a bug in my code causing memory to accumulate as partitions grow in size. With a job over ten times larger I ran into other issues when raising the number of partitions to 10,000 - namely "too many open files".

On Thu, Dec 4, 2014 at 8:32 AM, Sameer Farooqui <same...@databricks.com> wrote:

> Good point, Ankit.
>
> Steve - You can click on the link for '27' in the first column to get a
> breakdown of how much data is in each of those 116 cached partitions. But
> really, you also want to understand how much data is in the 4 non-cached
> partitions, as they may be huge. One thing you can try is calling
> .repartition() on the RDD with something like 100 partitions and then
> caching this new RDD. See if that spreads the load between the partitions
> more evenly.
>
> Let us know how it goes.
>
> On Thu, Dec 4, 2014 at 12:16 AM, Ankit Soni <ankitso...@gmail.com> wrote:
>
>> I ran into something similar before. 19/20 partitions would complete very
>> quickly, and 1 would take the bulk of the time and of the shuffle reads
>> and writes. This was because the majority of partitions were empty, and 1
>> had all the data. Perhaps something similar is going on here - I would
>> suggest taking a look at how much data each partition contains and trying
>> to achieve a roughly even distribution for best performance. In
>> particular, if the RDDs are PairRDDs, partitions are assigned based on
>> the hash of the key, so an even distribution of values among keys is
>> required for an even split of data across partitions.
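Ankit's point about hash partitioning of PairRDDs can be demonstrated with a plain-Python stand-in (this uses Python's built-in hash, not Spark's HashPartitioner, and the key names are made up): when one "hot" key carries most of the records, every one of those records lands in the same partition no matter how many partitions exist.

```python
from collections import Counter

def partition_for(key, num_partitions):
    # Stand-in for hash-based partitioning: partition = hash(key) mod n.
    # Spark's HashPartitioner works on the same principle.
    return hash(key) % num_partitions

num_partitions = 20

# Skewed data: one hot key carries almost all the records,
# 100 other keys carry one record each.
records = [("hot", i) for i in range(10_000)] + \
          [(f"k{i}", i) for i in range(100)]

sizes = Counter(partition_for(k, num_partitions) for k, _ in records)
print(sorted(sizes.values(), reverse=True))
```

The largest partition always holds at least the 10,000 "hot" records, so adding partitions cannot fix this kind of skew - only changing the key distribution (or a custom partitioner) can.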
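On the "too many open files" error at 10,000 partitions: during a shuffle, each map task may open one file per reduce partition, so very high partition counts can exhaust the per-process file-descriptor limit. A sketch of the usual mitigations in the Spark 1.x era (exact limits and config placement depend on your deployment):

```shell
# Check and raise the per-process open-file limit on each worker node
ulimit -n            # show the current limit
ulimit -n 65535      # raise it (persistent change usually needs
                     # /etc/security/limits.conf or the service config)

# Spark 1.x settings that reduce the number of shuffle files,
# set in spark-defaults.conf or via --conf on spark-submit:
#   spark.shuffle.consolidateFiles=true   # hash shuffle: reuse files across map tasks
#   spark.shuffle.manager=sort            # sort-based shuffle: far fewer files
```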
>> On December 2, 2014 at 4:15:25 PM, Steve Lewis (lordjoe2...@gmail.com)
>> wrote:
>>
>> 1) I can go there, but none of the links are clickable.
>> 2) When I see something like 116/120 partitions succeeded in the Stages
>> UI, in the Storage UI I see:
>>
>> NOTE: RDD 27 has 116 partitions cached and 4 not, and those 4 match
>> exactly the machines which will not complete. Also, RDD 27 does not show
>> up in the Stages UI.
>>
>> RDD  Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
>>  2   Memory Deserialized 1x Replicated    1                100%              11.8 MB        0.0 B            0.0 B
>> 14   Memory Deserialized 1x Replicated    1                100%             122.7 MB        0.0 B            0.0 B
>>  7   Memory Deserialized 1x Replicated  120                100%             151.1 MB        0.0 B            0.0 B
>>  1   Memory Deserialized 1x Replicated    1                100%              65.6 MB        0.0 B            0.0 B
>> 10   Memory Deserialized 1x Replicated   24                100%             160.6 MB        0.0 B            0.0 B
>> 27   Memory Deserialized 1x Replicated  116                 97%
>>
>> (Each RDD number links to http://hwlogin.labs.uninett.no:4040/storage/rdd?id=<n>.)
>>
>> On Tue, Dec 2, 2014 at 3:43 PM, Sameer Farooqui <same...@databricks.com>
>> wrote:
>>
>>> Have you tried taking thread dumps via the UI? There is a link to do so
>>> on the Executors page (typically under http://<driver IP>:4040/executors).
>>>
>>> By visualizing the thread call stacks of the executors with slow-running
>>> tasks, you can see exactly what code is executing at an instant in time.
>>> If you sample an executor several times in a short time period, you can
>>> identify 'hot spots', or expensive sections, in the user code.
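The sampling Sameer describes can also be done from a shell on the executor host with jstack, which ships with the JDK. A minimal sketch - the PID value is a placeholder you must replace, and the process name shown for jps is typical for Spark executors but may differ in your deployment:

```shell
# Find the executor JVM's PID (CoarseGrainedExecutorBackend is the usual
# process name for Spark executors; adjust for your deployment)
jps -l | grep CoarseGrainedExecutorBackend

# Sample the thread stacks ~10 times over 10 seconds; stack frames that
# keep appearing across samples are the hot spots in the user code.
PID=12345   # placeholder - replace with the PID from jps above
for i in $(seq 1 10); do
  jstack "$PID" >> "executor-$PID-dumps.txt"
  sleep 1
done
```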
>>> On Tue, Dec 2, 2014 at 3:03 PM, Steve Lewis <lordjoe2...@gmail.com>
>>> wrote:
>>>
>>>> I am working on a problem which will eventually involve many millions
>>>> of function calls. I have a small sample with several thousand calls
>>>> working, but when I try to scale up the amount of data, things stall. I
>>>> use 120 partitions, and 116 finish in very little time. The remaining 4
>>>> seem to do all the work; they stall after a fixed number of calls
>>>> (about 1,000) and even after hours make no more progress.
>>>>
>>>> This is my first large and complex job with Spark, and I would like any
>>>> insight on how to debug the issue, or even better, why it might exist.
>>>> The cluster has 15 machines, and I am setting executor memory at 16G.
>>>>
>>>> Also, what other questions are relevant to solving the issue?

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
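For reference, a sketch of how the 16G executor-memory setting from the message above is typically passed at submit time. The main class, master URL, and jar name here are placeholders; only --executor-memory reflects a value actually mentioned in this thread:

```shell
# Placeholder class, master URL, and jar name - substitute your own.
spark-submit \
  --class com.example.MyJob \
  --master spark://master:7077 \
  --executor-memory 16g \
  my-job.jar
```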