SortWithinPartitions on DataFrame

Darshan Singh Thu, 05 May 2016 13:23:46 -0700

Hi,

I have a dataframe df1 and I partitioned it by col1,col2 and persisted it.
Then I created new dataframe df2.


val df2 = df1.sortWithinPartitions("col1","col2","col3")

df1.persist()
df2.persist()
df1.count()
df2.count()

now I expect that any group by statement using the "col1","col2","col3"
should be way too fast in the df2 as compared to df1. I expect that there
should not be any shuffle in df2 as data is already sorted and sum should
be done on the same machine which has the partition.

e.g.
df1.groupBy("col1","col2","col3").sum("col4","col5").collect()
should be suffled as we know that the data is not sorted.

df2.groupBy("col1","col2","col3").sum("col4","col5").collect()
shouldnt cause any shuffle as data within partition is sorted the way we
need.

However, this doesn't seem to be the case in 1.6.1.


Am i missing something?

Thanks

SortWithinPartitions on DataFrame

Reply via email to