[Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD

Ruifeng Zheng Wed, 31 Jan 2018 23:09:08 -0800

Hi all,


       1. The Dataset API supports “sortWithinPartitions”, but the RDD API has 
no counterpart. (I know there is “repartitionAndSortWithinPartitions”, but I 
don’t want to repartition the RDD.) At the moment I have to convert the RDD to 
a Dataset just to use this function. Would it make sense to add 
“sortWithinPartitions” to RDD?
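
To make the request concrete, here is a minimal sketch of what a per-partition sort could look like when written by hand. It is plain Python over iterators so it runs without a cluster; the name `sort_partition` and the sample data are hypothetical, and the PySpark call in the closing comment assumes an existing `rdd`. It is not how a built-in operator would necessarily be implemented.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def sort_partition(
    records: Iterable[T],
    key: Callable[[T], object] = lambda r: r,
) -> Iterator[T]:
    """Sort the records of a single partition; nothing moves between partitions."""
    # sorted() materializes the whole partition in memory, so this sketch
    # cannot spill to disk the way a built-in operator could.
    return iter(sorted(records, key=key))


# Stand-in for an RDD with two partitions (hypothetical data).
partitions = [[3, 1, 2], [9, 7, 8]]
sorted_partitions = [list(sort_partition(p)) for p in partitions]
print(sorted_partitions)

# With PySpark this could be applied as (assumption: `rdd` already exists):
#   rdd.mapPartitions(sort_partition, preservesPartitioning=True)
```

The main limitation of this workaround is memory: each partition has to fit in memory to be sorted, which is one reason a dedicated operator might still be worth having.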

 

       2. With “aggregateByKey”/“reduceByKey”, I want to apply a special 
operation (such as compressing the aggregator) after the local aggregation on 
each partition. A similar case: when computing ‘ApproximatePercentile’ for 
different keys with “reduceByKey”, it may help to call 
‘QuantileSummaries#compress’ before the data is sent over the network. So I 
wonder whether it would be useful to add an ‘aggregateWithinPartitions’ to RDD?
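
As a rough illustration of the second idea, the sketch below aggregates one partition by key and then runs a post-processing step on each aggregator before anything would be shuffled. It is plain Python so it runs standalone; `aggregate_within_partition`, `finish`, and `toy_compress` are hypothetical names, and `toy_compress` is only a toy stand-in, not ‘QuantileSummaries#compress’.

```python
from typing import Callable, Dict, Hashable, Iterable, Iterator, List, Tuple, TypeVar

V = TypeVar("V")
U = TypeVar("U")


def aggregate_within_partition(
    records: Iterable[Tuple[Hashable, V]],
    zero: Callable[[], U],
    seq_op: Callable[[U, V], U],
    finish: Callable[[U], U],
) -> Iterator[Tuple[Hashable, U]]:
    """Aggregate one partition by key, then post-process each aggregator."""
    acc: Dict[Hashable, U] = {}
    for key, value in records:
        if key not in acc:
            acc[key] = zero()
        acc[key] = seq_op(acc[key], value)
    for key, aggregator in acc.items():
        # `finish` runs once per key, after local aggregation and before
        # any record would leave the partition.
        yield key, finish(aggregator)


def toy_compress(values: List[int]) -> List[int]:
    # Toy stand-in for a real summary compression: drop duplicates and sort.
    return sorted(set(values))


# One partition of (key, value) records (hypothetical data).
partition = [("a", 3), ("a", 3), ("b", 1), ("a", 5)]
local = dict(
    aggregate_within_partition(partition, list, lambda acc, v: acc + [v], toy_compress)
)
print(local)

# With PySpark this could be chained as (assumptions: `rdd` is a pair RDD and
# `comb_op` is a user-supplied function that merges two aggregators):
#   rdd.mapPartitions(
#       lambda it: aggregate_within_partition(it, list, seq_op, toy_compress)
#   ).reduceByKey(comb_op)
```

In this form the compression happens exactly once per key and partition, which is the hook that “aggregateByKey”/“reduceByKey” do not expose today.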

 

Regards,

Ruifeng
