Hi all:

I am developing an algorithm that needs to group elements with the same key together as much as possible, while always using a fixed number of partitions. To do that, the algorithm sorts the elements by key. The problem is that the number of distinct keys influences the number of final partitions. For example, if I define 200 distinct keys and ask for 800 partitions in the /sortByKey/ function, the resulting number of partitions is 202.
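
To make it concrete, here is a minimal sketch of what I am doing (the data set below is just a stand-in for my real input, using the key and row counts from my example):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("sortByKeyPartitions")
    val sc = new SparkContext(conf)

    // 200 distinct keys, many rows per key (placeholder for my real data)
    val data = sc.parallelize(0 until 1000000).map(i => (i % 200, i.toString))

    // I ask for 800 partitions, but the sorted RDD ends up with ~202
    val sorted = data.sortByKey(numPartitions = 800)
    println(sorted.partitions.length)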

I took a look at the code and found this note:

Note that the actual number of partitions created by the RangePartitioner might not be the same as the `partitions` parameter, in the case where the number of sampled records is less than the value of `partitions`.

I have tried /repartition/ with a /RangePartitioner/, with the same result (as expected).
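
Roughly what I tried (assuming /partitionBy/ is the right way to apply the /RangePartitioner/ explicitly; `data` is the same keyed RDD as in the sketch above):

    import org.apache.spark.RangePartitioner

    // Same effect as sortByKey: the partitioner samples the keys and
    // ends up creating fewer partitions than the 800 requested
    val ranged = data.partitionBy(new RangePartitioner(800, data))
    println(ranged.partitions.length)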

Is there any function that can solve my problem, such as /repartitionAndSortWithinPartitions/? Is there any sequence of instructions that can help me? If not, I think this can become a real problem for sorting cases in which the number of rows is huge and the number of distinct keys is small.
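
For reference, this is the kind of call I am picturing, with a plain /HashPartitioner/ just as an illustration; I do not know whether it gives the grouping guarantees I need:

    import org.apache.spark.HashPartitioner

    // A fixed 800-partition HashPartitioner, sorting by key inside
    // each partition; the partition count should stay at exactly 800
    val candidate = data.repartitionAndSortWithinPartitions(new HashPartitioner(800))
    println(candidate.partitions.length)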

Thanks in advance,

Sergio R.
