Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD
On Wed, Jan 31, 2018 at 1:15 AM, Ruifeng Zhengwrote: > HI all: > > > >1, Dataset API supports operation “sortWithinPartitions”, but in RDD > API there is no counterpart (I know there is > “repartitionAndSortWithinPartitions”, but I don’t want to repartition the > RDD), I have to convert RDD to Dataset for this function. Would it make > sense to add a “sortWithinPartitions” for RDD? Are you concerned about a shuffle here ? If yes, if you make the partitioners same, you can avoid that and get sort within partition. > > > >2, In “aggregateByKey”/”reduceByKey”, I want to do some special > operation (like aggregator compression) after local aggregation on each > partitions. A similar case may be: compute ‘ApproximatePercentile’ for > different keys by ”reduceByKey”, it may be helpful if > ‘QuantileSummaries#compress’ is called before network communication. So I > wonder if it is useful to add a ‘aggregateWithinPartitions’ for RDD? > You could override serde of your 'C' to accomplish this ? The tricky part would be merging in case of spills and differentiating between map side and reduce side serialization. Regards, Mridul > > > Regards, > > Ruifeng > > > > > > > > - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD
Do you mean in-memory processing? It works fine if all partitions are small. But when some partition don’t fit in memory, it will cause OOM. 发件人: Reynold Xin <r...@databricks.com> 日期: 2018年2月1日 星期四 下午3:14 收件人: Ruifeng Zheng <ruife...@foxmail.com> 抄送: <dev@spark.apache.org> 主题: Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD You can just do that with mapPartitions pretty easily can’t you? On Wed, Jan 31, 2018 at 11:08 PM Ruifeng Zheng <ruife...@foxmail.com> wrote: HI all: 1, Dataset API supports operation “sortWithinPartitions”, but in RDD API there is no counterpart (I know there is “repartitionAndSortWithinPartitions”, but I don’t want to repartition the RDD), I have to convert RDD to Dataset for this function. Would it make sense to add a “sortWithinPartitions” for RDD? 2, In “aggregateByKey”/”reduceByKey”, I want to do some special operation (like aggregator compression) after local aggregation on each partitions. A similar case may be: compute ‘ApproximatePercentile’ for different keys by ”reduceByKey”, it may be helpful if ‘QuantileSummaries#compress’ is called before network communication. So I wonder if it is useful to add a ‘aggregateWithinPartitions’ for RDD? Regards, Ruifeng
Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD
You can just do that with mapPartitions pretty easily can’t you? On Wed, Jan 31, 2018 at 11:08 PM Ruifeng Zhengwrote: > HI all: > > > >1, Dataset API supports operation “sortWithinPartitions”, but in > RDD API there is no counterpart (I know there is > “repartitionAndSortWithinPartitions”, but I don’t want to repartition the > RDD), I have to convert RDD to Dataset for this function. Would it make > sense to add a “sortWithinPartitions” for RDD? > > > >2, In “aggregateByKey”/”reduceByKey”, I want to do some special > operation (like aggregator compression) after local aggregation on each > partitions. A similar case may be: compute ‘ApproximatePercentile’ for > different keys by ”reduceByKey”, it may be helpful if > ‘QuantileSummaries#compress’ is called before network communication. So I > wonder if it is useful to add a ‘aggregateWithinPartitions’ for RDD? > > > > Regards, > > Ruifeng > > > > > > > > >
[Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD
HI all: 1, Dataset API supports operation “sortWithinPartitions”, but in RDD API there is no counterpart (I know there is “repartitionAndSortWithinPartitions”, but I don’t want to repartition the RDD), I have to convert RDD to Dataset for this function. Would it make sense to add a “sortWithinPartitions” for RDD? 2, In “aggregateByKey”/”reduceByKey”, I want to do some special operation (like aggregator compression) after local aggregation on each partitions. A similar case may be: compute ‘ApproximatePercentile’ for different keys by ”reduceByKey”, it may be helpful if ‘QuantileSummaries#compress’ is called before network communication. So I wonder if it is useful to add a ‘aggregateWithinPartitions’ for RDD? Regards, Ruifeng