Hi all,

I am really struggling with the behavior of sortBy. I am running sortBy on a fairly large dataset (~20GB) that I partitioned into 1200 partitions. The Spark UI shows that the sortBy stage ran with 1200 tasks.
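For reference, here is roughly the shape of the job (the paths and the sort key below are illustrative placeholders, not my actual code):

    // Load ~20GB of input and sort by a key extracted from each record.
    val rdd = sc.textFile("hdfs:///path/to/input")

    // Request 1200 partitions; the UI confirms 1200 tasks for this stage.
    val sorted = rdd.sortBy(line => line.split(",")(0), numPartitions = 1200)

    // The next operation on the sorted RDD.
    sorted.filter(_.nonEmpty).saveAsTextFile("hdfs:///path/to/output")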
However, when I run the next operation (e.g. filter or saveAsTextFile), I find myself with only 7 partitions. The problem is that these partitions are extremely skewed: 99.99% of the data ends up in a single ~12GB partition, with everything else in tiny partitions. By writing to file, it appears that the data is partitioned according to the value I sorted on (as expected). The problem is that 99.99% of the data has the same value and therefore ends up in the same partition.

I tried changing the number of partitions in the sortBy, as well as a repartition after the sortBy, but to no avail. Is there any way to change this behavior? I fear not, as this is probably due to the way sortBy is implemented, but I thought I would ask anyway. Should it matter, I am running Spark 1.4.2 (DataStax Enterprise).

Thanks,

Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini