DataFrames have an equivalent of partitionBy too: repartition on the join column hash-partitions a Dataset the way partitionBy with a HashPartitioner does for an RDD (partitionBy itself only exists on DataFrameWriter, for writing partitioned output).
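A minimal sketch of that (Scala; the column names and toy data are made up). Note that repartition is itself a shuffle, so it only pays off if you cache the result and join it more than once:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("repartition-sketch").getOrCreate()
import spark.implicits._

// Toy data; in practice these would come from your text files.
val left  = Seq(("a", 1), ("b", 2)).toDF("key", "lval")
val right = Seq(("a", 10), ("b", 20)).toDF("key", "rval")

// Hash-partition both sides on the join column (the Dataset analogue
// of RDD partitionBy with a HashPartitioner), cache, then join.
val leftP  = left.repartition($"key").cache()
val rightP = right.repartition($"key").cache()

leftP.join(rightP, "key").show()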

You can avoid the shuffle entirely if one of your datasets is small enough to
broadcast: Spark then performs a broadcast hash join, and only the small side
is copied out to the executors.
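Something like this (again a sketch with made-up names). The broadcast hint asks for a broadcast hash join, so the large side is joined in place with no exchange; Spark also does this automatically when it can estimate the small side at under spark.sql.autoBroadcastJoinThreshold (10 MB by default):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.appName("broadcast-sketch").getOrCreate()
import spark.implicits._

val big   = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "bval")
val small = Seq(("a", "x"), ("b", "y")).toDF("key", "sval")

// broadcast() hints the optimizer to use a broadcast hash join.
val joined = big.join(broadcast(small), "key")
joined.explain()   // plan should show BroadcastHashJoin, no shuffle of "big"
joined.show()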

On Thu., 4 Jul. 2019, 7:34 am Mkal, <diomf...@hotmail.com> wrote:

> Please keep in mind I'm fairly new to Spark.
> I have some Spark code where I load two text files as Datasets, and after
> some map and filter operations to bring the columns into a specific shape,
> I join the Datasets.
>
> The join takes place on a common column (of type string).
> Is there any way to avoid the exchange/shuffle before the join?
>
> As I understand it, the idea is that if I initially hash-partition the
> datasets on the join column, then the join only has to look within the
> same partitions to complete, thus avoiding a shuffle.
>
> In the RDD API, you can create a hash partitioner and use partitionBy when
> creating the RDDs (though I'm not sure this is a sure way to avoid the
> shuffle on the join). Is there any similar method for the Dataframe/Dataset
> API?
>
> I also would like to avoid repartition, repartitionByRange and bucketing,
> since I only intend to do one join and these also require shuffling
> beforehand.
