Hi all,

I am really struggling with the behavior of sortBy. I am running sortBy on
a fairly large dataset (~20GB) that I partitioned into 1200 partitions. The
output of the sortBy stage in the Spark UI shows that it ran with 1200
tasks.

However, when I run the next operation (e.g. filter or saveAsTextFile) I
find myself with only 7 partitions. The problem is that those partitions
are extremely skewed, with 99.99% of the data sitting in a single 12GB
partition and everything else in tiny partitions.

It appears (by writing to file) that the data is partitioned according to
the value that I used to sort on (as expected). The problem is that 99.99%
of the data has the same value and therefore ends up in the same partition.
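For reference, here is a stripped-down sketch of what I am doing (run from
spark-shell, so sc already exists; the input path and the key extraction are
placeholders, not my real code):

// ~20GB of records keyed on the field I sort by;
// almost all records share the same key value.
val records = sc.textFile("hdfs:///path/to/input")
                .map(line => (line.split(",")(0), line))

// sortBy with 1200 partitions: this sort stage shows 1200 tasks in the UI.
val sorted = records.sortBy(_._1, ascending = true, numPartitions = 1200)

// A quicker way to see the resulting layout than writing to file:
// count the records per partition.
sorted.mapPartitionsWithIndex { case (i, it) => Iterator((i, it.size)) }
      .collect()
      .foreach(println)

// The next stage (e.g. filter + saveAsTextFile) then runs over this skewed layout.
sorted.filter { case (_, line) => line.nonEmpty }
      .saveAsTextFile("hdfs:///path/to/output")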

I tried changing the number of partitions in the sortBy call, as well as
adding a repartition after the sortBy, but to no avail. Is there any way to
change this behavior? I fear not, as this is probably due to the way sortBy
is implemented, but I thought I would ask anyway.
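
Concretely, the two variations I tried look roughly like this (again simplified):

// 1. Asking sortBy for more partitions:
val sortedMore = records.sortBy(_._1, ascending = true, numPartitions = 4800)

// 2. Forcing a reshuffle after the sort:
val reshuffled = records.sortBy(_._1, ascending = true, numPartitions = 1200)
                        .repartition(1200)

// In both cases the downstream stage still ends up skewed as described above.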

In case it matters, I am running Spark 1.4.2 (DataStax Enterprise).

Thanks,

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini
