Re: Spark join over sorted columns of dataset.

2017-03-12 Thread Li Jin
I am not an expert on this, but here is what I think: Catalyst maintains information on whether a plan node is ordered. If your dataframe is the result of an order by, Catalyst will skip the sorting when it does a sort merge join. If your dataframe is created from storage, for instance ParquetRelation,

Re: Spark join over sorted columns of dataset.

2017-03-03 Thread Koert Kuipers
For RDD the shuffle is already skipped but the sort is not. In spark-sorted we track partitioning and sorting within partitions for key-value RDDs and can avoid the sort. See: https://github.com/tresata/spark-sorted For Dataset/DataFrame such optimizations are done automatically, however it's

Re: Spark join over sorted columns of dataset.

2017-03-03 Thread Rohit Verma
Sending it to the devs. Can you please help me by providing some ideas for the question below? Regards, Rohit > On Feb 23, 2017, at 3:47 PM, Rohit Verma wrote: > > Hi > > While joining two columns of different dataset, how to optimize join if both > the columns are pre sorted within

Spark join over sorted columns of dataset.

2017-02-23 Thread Rohit Verma
Hi, when joining two columns from different datasets, how can the join be optimized if both columns are already sorted within their datasets, so that when Spark does a sort merge join the sorting phase can be skipped? Regards, Rohit
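
For readers following the thread, here is a minimal sketch of why the sort phase is redundant when both inputs are already ordered on the join key. It is plain Python, not Spark code and not Spark's implementation; the function name merge_join_sorted and the sample data are made up for illustration. It assumes both inputs are lists of (key, value) pairs sorted ascending by key, and it does only the merge step of an inner sort merge join.

```python
from typing import Any, List, Tuple

Row = Tuple[Any, Any]


def merge_join_sorted(left: List[Row], right: List[Row]) -> List[Tuple[Any, Any, Any]]:
    """Inner-join two lists of (key, value) pairs in one linear pass.

    Assumption (not checked here): both lists are already sorted
    ascending by key, so no sort step is performed.
    """
    out: List[Tuple[Any, Any, Any]] = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1  # left key has no match yet, advance left
        elif lk > rk:
            j += 1  # right key has no match yet, advance right
        else:
            # Find the run of rows on the right sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for jj in range(j, j_end):
                    out.append((lk, left[i][1], right[jj][1]))
                i += 1
            j = j_end
    return out


# Hypothetical sample data, already sorted by key.
left_rows = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right_rows = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined = merge_join_sorted(left_rows, right_rows)
```

Each input is scanned once, so the cost is linear in the input sizes plus the number of output rows. Whether Spark actually omits its own sort for a given query depends on what the planner knows about the ordering and partitioning of each side, so inspecting the physical plan with explain() is the way to confirm it.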