Thanks Davies, after I did a coalesce(1) to save as a single parquet file,
head() returned the correct order.
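In case it helps anyone else, here's a rough plain-Python sketch (not actual
Spark code) of why the global order disappears with multiple part files but
survives after coalesce(1). The part sizes and file ordering below are made up
for illustration; the point is just that readers don't see part files in any
guaranteed order:

```python
# Plain-Python illustration of multi-part-file reads (not Spark code).
values = sorted(range(20), reverse=True)  # globally sorted, descending

# Spark writes one part file per partition; each chunk here is
# internally sorted, and their concatenation in write order is too.
part_files = [values[0:7], values[7:14], values[14:20]]

# A reader may list the part files in any order (filesystem listing
# order is not guaranteed), so the concatenation loses the global sort.
shuffled_parts = [part_files[1], part_files[2], part_files[0]]
read_back = [v for part in shuffled_parts for v in part]
assert read_back != values  # global order lost

# coalesce(1) writes a single file: read order matches write order.
single_file = [v for part in part_files for v in part]
assert single_file == values  # order preserved
```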

On Sun, May 8, 2016 at 12:29 AM, Davies Liu <dav...@databricks.com> wrote:

> When you have multiple parquet files, the order of all the rows in
> them is not defined.
>
> On Sat, May 7, 2016 at 11:48 PM, Buntu Dev <buntu...@gmail.com> wrote:
> > I'm using the pyspark dataframe api to sort by a specific column and then
> > saving the dataframe as a parquet file. But the resulting parquet file
> > doesn't seem to be sorted.
> >
> > Applying sort and doing a head() on the results shows the correct results
> > sorted by 'value' column in desc order, as shown below:
> >
> > ~~~~~
> >>>df=sqlContext.read.parquet("/some/file.parquet")
> >>>df.printSchema()
> >
> > root
> >  |-- c1: string (nullable = true)
> >  |-- c2: string (nullable = true)
> >  |-- value: double (nullable = true)
> >
> >>>df.sort(df.value.desc()).head(3)
> >
> > [Row(c1=u'546', c2=u'234', value=1020.25), Row(c1=u'3212', c2=u'6785',
> > value=890.6111111111111), Row(c1=u'546', c2=u'234', value=776.45)]
> > ~~~~~~
> >
> > But saving the sorted dataframe as parquet and fetching the first N rows
> > using head() doesn't seem to return the results ordered by the 'value'
> > column:
> >
> > ~~~~
> >>>df=sqlContext.read.parquet("/some/file.parquet")
> >>>df.sort(df.value.desc()).write.parquet("/sorted/file.parquet")
> > ...
> >>>df2=sqlContext.read.parquet("/sorted/file.parquet")
> >>>df2.head(3)
> >
> > [Row(c1=u'444', c2=u'233', value=0.024120907), Row(c1=u'5672',
> > c2=u'9098', value=0.024120906), Row(c1=u'546', c2=u'234',
> > value=0.024120905)]
> > ~~~~
> >
> > How do I go about sorting and saving a sorted dataframe?
> >
> >
> > Thanks!
>
