I don't know the actual implementation, but to me a shuffle still seems
necessary: each worker reads its data separately and reduces it to a set of
local distinct values, and these local results then need to be shuffled so
that equal values meet on the same node to produce the global distinct set.
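The two-phase pattern described above (local dedup per partition, then a hash
shuffle so equal values land on the same reducer) can be sketched in plain
Python. This is a simulation of the general pattern, not Spark's actual
source; the function names and the fixed reducer count are illustrative
assumptions.

```python
# Sketch of a shuffle-based distinct (assumption: mirrors the general
# map-side-combine + shuffle pattern, not Spark's real implementation).

def local_distinct(partition):
    # Map side: each worker deduplicates its own partition first,
    # shrinking the data that must cross the network.
    return set(partition)

def shuffle(local_sets, num_reducers):
    # Shuffle: hash-partition values so equal values end up on the
    # same reducer, regardless of which worker produced them.
    buckets = [set() for _ in range(num_reducers)]
    for local in local_sets:
        for value in local:
            buckets[hash(value) % num_reducers].add(value)
    return buckets

def distinct(partitions, num_reducers=2):
    local_sets = [local_distinct(p) for p in partitions]
    buckets = shuffle(local_sets, num_reducers)
    # Reduce side: each bucket is already duplicate-free, and no value
    # appears in two buckets, so concatenating them gives the answer.
    return sorted(v for bucket in buckets for v in bucket)

partitions = [[1, 2, 2, 3], [3, 3, 4], [1, 4, 5]]
print(distinct(partitions))  # [1, 2, 3, 4, 5]
```

Note that without the shuffle step, value 3 (present in two partitions)
would survive in two workers' local results, so the local pass alone cannot
guarantee global distinctness.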

On Sun, 23 Jan 2022, 17:39 ashok34...@yahoo.com.INVALID,
<ashok34...@yahoo.com.invalid> wrote:

> Hello,
>
> I know some operators in Spark are expensive because of shuffle.
>
> This document describes shuffle
>
> https://www.educba.com/spark-shuffle/
>
> and says
> More shufflings in numbers are not always bad. Memory constraints and
> other impossibilities can be overcome by shuffling.
>
> In RDD, the below are a few operations and examples of shuffle:
> – subtractByKey
> – groupBy
> – foldByKey
> – reduceByKey
> – aggregateByKey
> – transformations of a join of any type
> – distinct
> – cogroup
> I know some operations like reduceByKey are well known for causing a
> shuffle, but what I don't understand is why the distinct operation should
> cause a shuffle!
>
>
> Thank you
>
