Hi Steve et al.,
It is possible that there's just a lot of skew in your data, in which case
repartitioning is a good idea. Depending on how large your input data is
and how much skew you have, you may want to repartition to a larger number
of partitions. By the way, you can just call
I ran into something similar before. 19/20 partitions would complete very
quickly, and 1 would take the bulk of the time and shuffle reads/writes. This
was because the majority of partitions were empty and 1 had all the data.
Perhaps something similar is going on here - I would suggest taking a
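For what it's worth, that pattern is easy to reproduce outside Spark. The sketch below is plain Java (the class name `SkewDemo` and the helper methods are made up for illustration); it mimics how a hash partitioner such as Spark's `HashPartitioner` assigns keys to partitions. When every record shares one hot key, one partition receives everything and the other 19 stay empty - exactly the 1-slow-task symptom.

```java
import java.util.*;

// Illustration only - not Spark's actual code. Mimics hash partitioning:
// partition = nonNegativeMod(key.hashCode(), numPartitions).
public class SkewDemo {
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Count how many records land in each partition.
    static int[] partitionSizes(List<String> keys, int numPartitions) {
        int[] sizes = new int[numPartitions];
        for (String k : keys) sizes[partitionFor(k, numPartitions)]++;
        return sizes;
    }

    public static void main(String[] args) {
        // 1000 records that all share a single hot key.
        List<String> keys = Collections.nCopies(1000, "hotKey");
        int[] sizes = partitionSizes(keys, 20);
        // One partition holds all 1000 records; the other 19 are empty.
        System.out.println(Arrays.toString(sizes));
    }
}
```

With real keys of varying popularity the imbalance is softer, but the mechanism is the same: partition sizes follow the key distribution, not the partition count.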
Good point, Ankit.
Steve - You can click on the link for '27' in the first column to get a
breakdown of how much data is in each of those 116 cached partitions. But
really, you also want to understand how much data is in the 4 non-cached
partitions, as they may be huge. One thing you can try
Thanks - I found the same thing -
calling
boolean forceShuffle = true;
myRDD = myRDD.coalesce(120, forceShuffle);
worked - there were already 120 partitions, but forcing a shuffle distributes
the work evenly.
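For anyone curious why the shuffle flag matters: as I understand it, coalesce without a shuffle only merges existing partitions, so asking for the same partition count is a no-op, while shuffle = true redistributes the records. A toy model in plain Java (the class `CoalesceSketch` and the round-robin redistribution are assumptions for illustration, not Spark's actual implementation):

```java
import java.util.*;

// Toy model of coalesce(n) vs coalesce(n, shuffle = true).
// Works on partition *sizes* only; illustration, not Spark code.
public class CoalesceSketch {
    // No shuffle: existing partitions are only grouped together,
    // so a single huge partition stays huge.
    static int[] coalesceNoShuffle(int[] sizes, int target) {
        int[] out = new int[target];
        for (int i = 0; i < sizes.length; i++)
            out[i * target / sizes.length] += sizes[i];
        return out;
    }

    // Shuffle: records are redistributed (round-robin here)
    // across the target partitions, balancing the work.
    static int[] coalesceWithShuffle(int[] sizes, int target) {
        int total = 0;
        for (int s : sizes) total += s;
        int[] out = new int[target];
        for (int r = 0; r < total; r++) out[r % target]++;
        return out;
    }

    public static void main(String[] args) {
        int[] skewed = new int[120];
        skewed[0] = 1200; // all records in one of the 120 partitions
        // 120 -> 120 without shuffle changes nothing; with shuffle,
        // every partition ends up with 10 records.
        System.out.println(Arrays.toString(coalesceNoShuffle(skewed, 120)));
        System.out.println(Arrays.toString(coalesceWithShuffle(skewed, 120)));
    }
}
```

That matches what I saw: same partition count before and after, but only the shuffled version spreads the records out.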
I believe there is a bug in my code causing memory to accumulate as
partitions grow in
This did not work for me - that is, rdd.coalesce(200, forceShuffle). Does
anyone have ideas on how to distribute data evenly and co-locate
partitions of interest?
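One technique that has worked for others with a single hot key is "salting": append a small suffix to the hot key so its records hash to several partitions, then strip the salt after per-partition aggregation. A minimal sketch in plain Java (class and method names are made up; the modulo hashing mirrors a hash partitioner, not Spark internals):

```java
import java.util.*;

// Illustration of key salting to break up a hot key.
// "hotKey" becomes "hotKey#0" .. "hotKey#15", which hash to
// different partitions instead of all landing in one.
public class SaltedKeyDemo {
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Distribute `records` occurrences of one hot key by appending a
    // round-robin salt in [0, saltBuckets) before hashing.
    static int[] saltedSizes(String hotKey, int records, int saltBuckets,
                             int numPartitions) {
        int[] sizes = new int[numPartitions];
        for (int r = 0; r < records; r++) {
            String salted = hotKey + "#" + (r % saltBuckets);
            sizes[partitionFor(salted, numPartitions)]++;
        }
        return sizes;
    }

    public static void main(String[] args) {
        int[] sizes = saltedSizes("hotKey", 1200, 16, 20);
        int nonEmpty = 0;
        for (int s : sizes) if (s > 0) nonEmpty++;
        System.out.println(nonEmpty + " partitions share the hot key's records");
    }
}
```

The cost is a second aggregation step to merge the salted partials back under the original key, but that is usually much cheaper than one straggler task doing all the work.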
Have you tried taking thread dumps via the UI? There is a link to do so on
the Executors page (typically at http://<driver IP>:4040/executors).
By visualizing the thread call stacks of the executors with slow-running
tasks, you can see exactly what code is executing at an instant in time. If
you
1) I can go there, but none of the links are clickable.
2) When I see something like 116/120 partitions succeeded in the Stages UI,
in the Storage UI I see:
NOTE: RDD 27 has 116 partitions cached and 4 not cached - and those 4 are
exactly the machines which will not complete.
Also, RDD 27 does not show up