I'm using the PySpark DataFrame API to sort by a specific column and then save
the DataFrame as a Parquet file. But the resulting Parquet file doesn't seem
to be sorted.

Applying the sort and calling head() on the result shows the rows correctly
sorted by the 'value' column in descending order, as shown below:

~~~~~
>>df=sqlContext.read.parquet("/some/file.parquet")
>>df.printSchema()

root
 |-- c1: string (nullable = true)
 |-- c2: string (nullable = true)
 |-- value: double (nullable = true)

>>df.sort(df.value.desc()).head(3)

[Row(c1=u'546', c2=u'234', value=1020.25), Row(c1=u'3212', c2=u'6785',
value=890.6111111111111), Row(c1=u'546', c2=u'234', value=776.45)]
~~~~~

But after saving the sorted DataFrame as Parquet and reading it back, fetching
the first N rows with head() no longer returns rows ordered by the 'value' column:

~~~~
>>df=sqlContext.read.parquet("/some/file.parquet")
>>df.sort(df.value.desc()).write.parquet("/sorted/file.parquet")
...
>>df2=sqlContext.read.parquet("/sorted/file.parquet")
>>df2.head(3)

[Row(c1=u'444', c2=u'233', value=0.024120907), Row(c1=u'5672', c2=u'9098',
value=0.024120906), Row(c1=u'546', c2=u'234', value=0.024120905)]
~~~~
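My guess (and it is only a guess) is that the sorted data gets split across
partitions, each partition is written as its own part-file, and reading the
directory back concatenates the part-files in no guaranteed order, so the order
within each file survives but the global order does not. Here is a pure-Python
sketch of that idea, not Spark itself; the partition size of 5 is arbitrary:

```python
import random

# Globally sorted data, descending, standing in for the sorted DataFrame.
data = sorted(range(20), reverse=True)

# Split into four "part-files" of 5 rows each, as a writer with 4 partitions might.
partitions = [data[i:i + 5] for i in range(0, len(data), 5)]

# A reader that makes no ordering guarantee may visit part-files in any order.
random.shuffle(partitions)
read_back = [x for part in partitions for x in part]

# Each individual part-file is still sorted descending...
print(all(p == sorted(p, reverse=True) for p in partitions))
# ...but the concatenation usually is not (unless the shuffle happened to
# land on the original file order).
print(read_back == sorted(read_back, reverse=True))
```

If that picture is right, it would explain why head() on the re-read DataFrame
returns rows that look locally ordered but not globally ordered.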

How do I go about sorting a DataFrame and saving it so that the sort order is preserved?


Thanks!
