reduceByKey(randomPartitioner, (a, b) => a + b) also gives an incorrect result.
Why does reduceByKey with a Partitioner exist, then?

On Wed, Jun 8, 2016 at 9:22 PM, 汪洋 <tiandiwo...@icloud.com> wrote:

> Hi Alexander,
>
> I think it is not guaranteed to be correct if an arbitrary Partitioner is
> passed in.
>
> I have created a notebook and you can check it out. (
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/7973071962862063/2110745399505739/58107563000366/latest.html
> )
>
> Best regards,
>
> Yang
>
> On June 9, 2016, at 11:42 AM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>
> Most of the RDD methods that shuffle data take a Partitioner as a parameter,
> but rdd.distinct has no such signature.
>
> Should I open a PR for that?
>
> /**
>  * Return a new RDD containing the distinct elements in this RDD.
>  */
> def distinct(partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
>   map(x => (x, null)).reduceByKey(partitioner, (x, y) => x).map(_._1)
> }
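The failure mode under discussion can be sketched without Spark at all: after the shuffle, reduceByKey combines values per key within each partition independently, so the result is only correct when the partitioner sends equal keys to the same partition. The `shuffle` and `reduceByKey` helpers below are simplified stand-ins for illustration, not Spark's implementation.

```scala
// Minimal model of a shuffle: a partitioner is just key -> partition id,
// applied exactly once per record.
def shuffle[K, V](data: Seq[(K, V)], numPartitions: Int, part: K => Int): Vector[Vector[(K, V)]] = {
  val buckets = Vector.fill(numPartitions)(Vector.newBuilder[(K, V)])
  data.foreach { kv => buckets(part(kv._1)) += kv }
  buckets.map(_.result())
}

// After the shuffle, values are reduced per key *within each partition only*.
def reduceByKey[K, V](partitions: Vector[Vector[(K, V)]], f: (V, V) => V): Seq[(K, V)] =
  partitions.flatMap(_.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).reduce(f)) })

val data = Seq(("a", 1), ("a", 1), ("b", 1))

// Hash-style partitioner: equal keys always land in the same partition,
// so the totals are correct: a -> 2, b -> 1.
val good = reduceByKey(shuffle(data, 2, (k: String) => math.abs(k.hashCode) % 2),
                       (a: Int, b: Int) => a + b)

// A "random" partitioner that ignores the key: the two copies of "a" can end
// up in different partitions and each is reduced separately, so the key "a"
// survives twice in the "reduced" output.
var next = 0
val randomish = (_: String) => { next += 1; next % 2 }
val bad = reduceByKey(shuffle(data, 2, randomish), (a: Int, b: Int) => a + b)
```

The same reasoning applies to the proposed `distinct(partitioner)`: since it is built on `reduceByKey(partitioner, (x, y) => x)`, an arbitrary partitioner can leave duplicate elements in the output.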