Hi, I have a dataframe df1 and I partitioned it by col1,col2 and persisted it. Then I created new dataframe df2.
val df2 = df1.sortWithinPartitions("col1","col2","col3") df1.persist() df2.persist() df1.count() df2.count() now I expect that any group by statement using the "col1","col2","col3" should be way too fast in the df2 as compared to df1. I expect that there should not be any shuffle in df2 as data is already sorted and sum should be done on the same machine which has the partition. e.g. df1.groupBy("col1","col2","col3").sum("col4","col5").collect() should be suffled as we know that the data is not sorted. df2.groupBy("col1","col2","col3").sum("col4","col5").collect() shouldnt cause any shuffle as data within partition is sorted the way we need. However, this doesn't seem to be the case in 1.6.1. Am i missing something? Thanks