repartitionAndSortWithinPartitions repartitions the RDD with the partitioner
you pass in and sorts the records by key within each partition. So each
partition is fully sorted, but the RDD as a whole is not sorted.

sortByKey is basically the same as repartitionAndSortWithinPartitions,
except that it uses a range partitioner, so the entire RDD ends up sorted.
However, since sortByKey uses a different partitioner than
repartitionAndSortWithinPartitions, you do not get much benefit from running
sortByKey after repartitionAndSortWithinPartitions (all the data
will get shuffled again).
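
To make the difference concrete, here is a minimal sketch, assuming a local
Spark context and a toy RDD of Int keys. The object name SortDemo, the helper
layouts, and the sample data are made up for illustration. It returns the keys
held by each partition after each operation.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object SortDemo {
  // Returns the keys of every partition after
  // (1) repartitionAndSortWithinPartitions and (2) sortByKey.
  def layouts(sc: SparkContext): (Seq[Seq[Int]], Seq[Seq[Int]]) = {
    val pairs = sc.parallelize(
      Seq(5 -> "e", 1 -> "a", 4 -> "d", 2 -> "b", 3 -> "c", 6 -> "f"))

    // Hash partitioning: each partition is sorted by key, but the key ranges
    // of different partitions interleave (for Int keys and 2 partitions,
    // even keys go to partition 0 and odd keys to partition 1).
    val perPartition = pairs
      .repartitionAndSortWithinPartitions(new HashPartitioner(2))
      .keys.glom().collect().map(_.toSeq).toSeq

    // Range partitioning: each partition is sorted AND partition i holds only
    // keys smaller than those in partition i + 1, so the whole RDD is sorted.
    val global = pairs
      .sortByKey(numPartitions = 2)
      .keys.glom().collect().map(_.toSeq).toSeq

    (perPartition, global)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sort-demo").setMaster("local[2]"))
    try {
      val (perPartition, global) = layouts(sc)
      println(s"repartitionAndSortWithinPartitions: $perPartition")
      println(s"sortByKey: $global")
    } finally sc.stop()
  }
}
```

Reading the partitions of the first result back to back does not give a
sorted sequence, while doing the same with the second result does.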

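On the secondary sort point quoted further down: one common way to get values
sorted per key out of repartitionAndSortWithinPartitions is to move the value
into a composite key and partition on the original key only. A rough sketch,
assuming (String, Int) records; FirstComponentPartitioner and SecondarySort
are hypothetical names, not part of Spark or spark-sorted.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical partitioner: routes a composite (key, value) key by its first
// component only, so all records of one original key share a partition.
class FirstComponentPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (k, _) => Math.floorMod(k.hashCode, numPartitions)
  }
}

object SecondarySort {
  // Within each partition, records come out ordered by key and then by value.
  def sortValuesPerKey(rdd: RDD[(String, Int)], numPartitions: Int): RDD[(String, Int)] =
    rdd
      .map { case (k, v) => ((k, v), ()) } // the value becomes part of the sort key
      .repartitionAndSortWithinPartitions(new FirstComponentPartitioner(numPartitions))
      .keys                                // drop the Unit placeholder
}
```

The sort relies on the implicit tuple Ordering for (String, Int), so a key's
values are contiguous and ascending inside its partition. As above, the
RDD as a whole is still not sorted.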

On Thu, Jul 14, 2016 at 1:59 PM, Punit Naik <naik.puni...@gmail.com> wrote:

> Hi Koert
>
> I have already used "repartitionAndSortWithinPartitions" for secondary
> sorting and it works fine. Just wanted to know whether it will sort the
> entire RDD or not.
>
> On Thu, Jul 14, 2016 at 11:25 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> repartitionAndSortWithinPartitions sorts by key, not by the values per
>> key, so it is not really a secondary sort by itself.
>>
>> for secondary sort also check out:
>> https://github.com/tresata/spark-sorted
>>
>>
>> On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik <naik.puni...@gmail.com>
>> wrote:
>>
>>> Hi guys
>>>
>>> In my spark/scala code I am implementing secondary sort. I wanted to
>>> know, when I call the "repartitionAndSortWithinPartitions" method, the
>>> whole (entire) RDD will be sorted or only the individual partitions will be
>>> sorted?
>>> If it's the latter case, will applying a "sortByKey" after
>>> "repartitionAndSortWithinPartitions" be faster now that the individual
>>> partitions are sorted?
>>>
>>> --
>>> Thank You
>>>
>>> Regards
>>>
>>> Punit Naik
>>>
>>
>>
>
>
