Hi all:
I am developing an algorithm that needs to keep elements with the same
key together as much as possible, while always using a fixed number of
partitions. To do that, the algorithm sorts the elements by key. The
problem is that the number of distinct keys influences the final number
of partitions. For example, if I have 200 distinct keys and request 800
partitions in the /sortByKey/ function, the resulting number of
partitions is 202.
I have taken a look at the code and found this:
Note that the actual number of partitions created by the
RangePartitioner might not be the same as the `partitions` parameter,
in the case where the number of sampled records is less than the value
of `partitions`.
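To check that I understand that note, here is a toy model in plain
Python of what I think is happening. It is not Spark's actual code: the
function name range_bounds and the way it picks candidates are my own
assumptions. It takes evenly spaced candidates from a sorted sample as
partition upper bounds and skips duplicates, so the number of bounds can
never exceed the number of distinct keys.

```python
def range_bounds(sample, partitions):
    """Toy model (hypothetical, not Spark's implementation): choose up to
    partitions - 1 upper bounds from a sorted sample, skipping duplicates."""
    ordered = sorted(sample)
    n = len(ordered)
    bounds = []
    for i in range(1, partitions):
        # Evenly spaced candidate position in the sorted sample.
        candidate = ordered[min(n - 1, i * n // partitions)]
        # A bound equal to the previous one would create an empty range,
        # so it is dropped.
        if not bounds or candidate > bounds[-1]:
            bounds.append(candidate)
    return bounds


# 100,000 sampled rows but only 200 distinct keys.
sample = [i % 200 for i in range(100_000)]
bounds = range_bounds(sample, 800)

# One partition per bound, plus one for keys above the last bound.
num_partitions = len(bounds) + 1
print(num_partitions)
```

With 200 distinct keys this model can never yield more than 201
partitions, however many I request. That is close to, but not exactly,
the 202 I observe, so the real logic must differ in some detail.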
I have also tried /repartition/ with a /RangePartitioner/ and got the
same result (unsurprisingly).
Is there any function that can solve my problem, such as
/repartitionAndSortWithinPartitions/? Is there any sequence of
operations that can help me? If not, I think this can become a real
problem when sorting datasets in which the number of rows is huge and
the number of distinct keys is small.
Thanks in advance,
Sergio R.