Thanks Davies — after adding a coalesce(1) to save the output as a single parquet file, head() returned the rows in the correct order.
On Sun, May 8, 2016 at 12:29 AM, Davies Liu <dav...@databricks.com> wrote:
> When you have multiple parquet files, the order of all the rows in
> them is not defined.
>
> On Sat, May 7, 2016 at 11:48 PM, Buntu Dev <buntu...@gmail.com> wrote:
> > I'm using the pyspark DataFrame API to sort by a specific column and then
> > save the dataframe as a parquet file. But the resulting parquet file
> > doesn't seem to be sorted.
> >
> > Applying the sort and calling head() on the result shows the correct rows,
> > sorted by the 'value' column in descending order, as shown below:
> >
> > ~~~~
> > >>> df = sqlContext.read.parquet("/some/file.parquet")
> > >>> df.printSchema()
> >
> > root
> >  |-- c1: string (nullable = true)
> >  |-- c2: string (nullable = true)
> >  |-- value: double (nullable = true)
> >
> > >>> df.sort(df.value.desc()).head(3)
> >
> > [Row(c1=u'546', c2=u'234', value=1020.25), Row(c1=u'3212', c2=u'6785',
> > value=890.6111111111111), Row(c1=u'546', c2=u'234', value=776.45)]
> > ~~~~
> >
> > But saving the sorted dataframe as parquet and fetching the first N rows
> > using head() doesn't return the results ordered by the 'value' column:
> >
> > ~~~~
> > >>> df = sqlContext.read.parquet("/some/file.parquet")
> > >>> df.sort(df.value.desc()).write.parquet("/sorted/file.parquet")
> > ...
> > >>> df2 = sqlContext.read.parquet("/sorted/file.parquet")
> > >>> df2.head(3)
> >
> > [Row(c1=u'444', c2=u'233', value=0.024120907), Row(c1=u'5672', c2=u'9098',
> > value=0.024120906), Row(c1=u'546', c2=u'234', value=0.024120905)]
> > ~~~~
> >
> > How do I go about sorting and saving a sorted dataframe?
> >
> > Thanks!