The Dataset API's analogue of the RDD partitionBy is repartition on the join column (partitionBy on the DataFrame side lives on DataFrameWriter and controls output layout when writing). You can avoid the shuffle entirely if one of your datasets is small enough to broadcast.
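A minimal sketch of the broadcast approach, assuming two DataFrames sharing a string join column named "key" (all names here are illustrative, not from the thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("BroadcastJoinSketch").getOrCreate()
import spark.implicits._

// Illustrative stand-ins for the two loaded-and-shaped datasets.
val large = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")
val small = Seq(("a", "x"), ("b", "y")).toDF("key", "label")

// broadcast() hints Spark to ship the small side to every executor,
// so the join runs as a BroadcastHashJoin with no exchange on the large side.
val joined = large.join(broadcast(small), "key")
joined.explain() // plan should show BroadcastHashJoin, no Exchange for `large`
```

Spark will also broadcast automatically when the small side is below spark.sql.autoBroadcastJoinThreshold, but the explicit hint makes the intent clear regardless of the statistics.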
On Thu., 4 Jul. 2019, 7:34 am Mkal, <diomf...@hotmail.com> wrote:
> Please keep in mind I'm fairly new to Spark.
> I have some Spark code where I load two text files as Datasets and, after
> some map and filter operations to bring the columns into a specific shape,
> I join the Datasets.
>
> The join takes place on a common column (of type string). Is there any way
> to avoid the exchange/shuffle before the join?
>
> As I understand it, the idea is that if I initially hash-partition the
> Datasets on the join column, the join only has to look within matching
> partitions, thus avoiding a shuffle.
>
> In the RDD API, you can create a HashPartitioner and use partitionBy when
> creating the RDDs (though I'm not sure this reliably avoids the shuffle on
> the join). Is there a similar method in the DataFrame/Dataset API?
>
> I would also like to avoid repartition, repartitionByRange, and bucketing,
> since I only intend to do one join and these also require shuffling
> beforehand.
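On the quoted question itself: for a single join of two large inputs there is no way to skip the exchange entirely. The Dataset analogue of the RDD partitionBy trick is repartition on the join column, and repartition is itself a shuffle, so it only pays off when the partitioned result is cached and reused across several joins. A sketch under that assumption (dfA, dfB, and the partition count of 200 are illustrative):

```scala
import org.apache.spark.sql.functions.col

// dfA and dfB stand in for the two shaped DataFrames from the question.
// Each repartition is a shuffle, so this only helps if partA/partB are
// cached and joined more than once.
val partA = dfA.repartition(200, col("key")).cache()
val partB = dfB.repartition(200, col("key")).cache()

// Both sides now share the same hash partitioning on "key", so this join
// (and any later joins on "key") should not need a fresh Exchange.
val joinedOnKey = partA.join(partB, "key")
joinedOnKey.explain()
```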