Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD

2018-02-01 Thread Mridul Muralidharan
On Wed, Jan 31, 2018 at 1:15 AM, Ruifeng Zheng  wrote:
> HI all:
>
>
>
>1, Dataset API supports operation “sortWithinPartitions”, but in RDD
> API there is no counterpart (I know there is
> “repartitionAndSortWithinPartitions”, but I don’t want to repartition the
> RDD), I have to convert RDD to Dataset for this function. Would it make
> sense to add a “sortWithinPartitions” for RDD?


Are you concerned about a shuffle here ?
If yes, if you make the partitioners same, you can avoid that and get
sort within partition.




>
>
>
>2, In “aggregateByKey”/”reduceByKey”, I want to do some special
> operation (like aggregator compression) after local aggregation on each
> partitions. A similar case may be: compute ‘ApproximatePercentile’ for
> different keys by ”reduceByKey”, it may be helpful if
> ‘QuantileSummaries#compress’ is called before network communication. So I
> wonder if it is useful to add a ‘aggregateWithinPartitions’ for RDD?
>


You could override serde of your 'C' to accomplish this ?
The tricky part would be merging in case of spills and differentiating
between map side and reduce side serialization.


Regards,
Mridul


>
>
> Regards,
>
> Ruifeng
>
>
>
>
>
>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD

2018-01-31 Thread Ruifeng Zheng
Do you mean in-memory processing? It works fine if all partitions are small. 
But when some partition don’t fit in memory, it will cause OOM. 

 

 

发件人: Reynold Xin <r...@databricks.com>
日期: 2018年2月1日 星期四 下午3:14
收件人: Ruifeng Zheng <ruife...@foxmail.com>
抄送: <dev@spark.apache.org>
主题: Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions 
for RDD

 

You can just do that with mapPartitions pretty easily can’t you?

 

On Wed, Jan 31, 2018 at 11:08 PM Ruifeng Zheng <ruife...@foxmail.com> wrote:

HI all:

 

   1, Dataset API supports operation “sortWithinPartitions”, but in RDD API 
there is no counterpart (I know there is “repartitionAndSortWithinPartitions”, 
but I don’t want to repartition the RDD), I have to convert RDD to Dataset for 
this function. Would it make sense to add a “sortWithinPartitions” for RDD?

 

   2, In “aggregateByKey”/”reduceByKey”, I want to do some special 
operation (like aggregator compression) after local aggregation on each 
partitions. A similar case may be: compute ‘ApproximatePercentile’ for 
different keys by ”reduceByKey”, it may be helpful if 
‘QuantileSummaries#compress’ is called before network communication. So I 
wonder if it is useful to add a ‘aggregateWithinPartitions’ for RDD?

 

Regards,

Ruifeng

 

 

 

 



Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD

2018-01-31 Thread Reynold Xin
You can just do that with mapPartitions pretty easily can’t you?

On Wed, Jan 31, 2018 at 11:08 PM Ruifeng Zheng  wrote:

> HI all:
>
>
>
>1, Dataset API supports operation “sortWithinPartitions”, but in
> RDD API there is no counterpart (I know there is
> “repartitionAndSortWithinPartitions”, but I don’t want to repartition the
> RDD), I have to convert RDD to Dataset for this function. Would it make
> sense to add a “sortWithinPartitions” for RDD?
>
>
>
>2, In “aggregateByKey”/”reduceByKey”, I want to do some special
> operation (like aggregator compression) after local aggregation on each
> partitions. A similar case may be: compute ‘ApproximatePercentile’ for
> different keys by ”reduceByKey”, it may be helpful if
> ‘QuantileSummaries#compress’ is called before network communication. So I
> wonder if it is useful to add a ‘aggregateWithinPartitions’ for RDD?
>
>
>
> Regards,
>
> Ruifeng
>
>
>
>
>
>
>
>
>


[Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD

2018-01-31 Thread Ruifeng Zheng
HI all:

 

   1, Dataset API supports operation “sortWithinPartitions”, but in RDD API 
there is no counterpart (I know there is “repartitionAndSortWithinPartitions”, 
but I don’t want to repartition the RDD), I have to convert RDD to Dataset for 
this function. Would it make sense to add a “sortWithinPartitions” for RDD?

 

   2, In “aggregateByKey”/”reduceByKey”, I want to do some special 
operation (like aggregator compression) after local aggregation on each 
partitions. A similar case may be: compute ‘ApproximatePercentile’ for 
different keys by ”reduceByKey”, it may be helpful if 
‘QuantileSummaries#compress’ is called before network communication. So I 
wonder if it is useful to add a ‘aggregateWithinPartitions’ for RDD?

 

Regards,

Ruifeng