For RDDs, the shuffle is already skipped, but the sort is not. In spark-sorted we track both the partitioning and the sort order within partitions for key-value RDDs, so the sort can be avoided as well. See: https://github.com/tresata/spark-sorted
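To illustrate why knowing the sort order lets the sort step be dropped, here is a minimal plain-Python sketch of a merge join over two key-value sequences that are assumed to be already sorted by key, as they would be within one co-partitioned partition. This is not Spark or spark-sorted code; `merge_join` and `_next_group` are hypothetical names used only for illustration.

```python
from itertools import groupby
from operator import itemgetter


def _next_group(groups):
    """Return (key, [values]) for the next run of equal keys, or None at the end."""
    for key, rows in groups:
        # Materialize the values now: a groupby group is invalidated
        # as soon as the underlying iterator advances.
        return key, [value for _, value in rows]
    return None


def merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs.

    Assumption: both inputs are ALREADY sorted by key, so no sort is
    performed here. Each input is read once, front to back.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    l, r = _next_group(left_groups), _next_group(right_groups)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = _next_group(left_groups)    # left key has no match, skip it
        elif l[0] > r[0]:
            r = _next_group(right_groups)   # right key has no match, skip it
        else:
            # Same key on both sides: emit every pairing of the values.
            for lv in l[1]:
                for rv in r[1]:
                    yield l[0], (lv, rv)
            l, r = _next_group(left_groups), _next_group(right_groups)
```

If either input is not actually sorted, this sketch silently produces incomplete results, which is why the sort order has to be tracked reliably before the sort can be skipped.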
For Dataset/DataFrame, such optimizations are applied automatically; however, this currently does not always work for Dataset. See: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-19468

On Mar 3, 2017 11:06 AM, "Rohit Verma" <rohit.ve...@rokittech.com> wrote:

Sending this to the dev list. Can you please help me with some ideas for the question below?

Regards
Rohit

> On Feb 23, 2017, at 3:47 PM, Rohit Verma <rohit.ve...@rokittech.com> wrote:
>
> Hi
>
> When joining on two columns from different datasets, how can the join be optimized if both columns are pre-sorted within their datasets,
> so that when Spark does a sort merge join the sorting phase can be skipped?
>
> Regards
> Rohit