When the output consists of multiple parquet files, the order of the rows
across those files is not defined, so reading the directory back does not
give you the sorted order.
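
If you need the rows back in order, you can either re-apply the sort after
reading the directory, or coalesce to a single partition before writing so
there is only one part file. Roughly, with the same sqlContext API and the
paths from your example (a sketch, not the only way to do it):

~~~~~
# Option 1: re-apply the sort after reading the part files back.
# Reading a directory of parquet files gives no ordering guarantee,
# so sort again before taking the head.
df2 = sqlContext.read.parquet("/sorted/file.parquet")
df2.sort(df2.value.desc()).head(3)

# Option 2: coalesce to a single partition before writing, so the
# output is a single part file.
# mode("overwrite") just replaces the earlier output; drop it if you
# write to a fresh path instead.
df = sqlContext.read.parquet("/some/file.parquet")
df.sort(df.value.desc()).coalesce(1) \
  .write.mode("overwrite").parquet("/sorted/file.parquet")
~~~~~

Note that coalesce(1) funnels the write through a single task, so it is only
practical when the sorted data comfortably fits on one node; re-sorting after
the read scales better.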

On Sat, May 7, 2016 at 11:48 PM, Buntu Dev <buntu...@gmail.com> wrote:
> I'm using the pyspark dataframe api to sort by a specific column and then save
> the dataframe as a parquet file. But the resulting parquet file doesn't seem
> to be sorted.
>
> Applying the sort and doing a head() on the result shows the rows correctly
> sorted by the 'value' column in descending order, as shown below:
>
> ~~~~~
>>>df=sqlContext.read.parquet("/some/file.parquet")
>>>df.printSchema()
>
> root
>  |-- c1: string (nullable = true)
>  |-- c2: string (nullable = true)
>  |-- value: double (nullable = true)
>
>>>df.sort(df.value.desc()).head(3)
>
> [Row(c1=u'546', c2=u'234', value=1020.25), Row(c1=u'3212', c2=u'6785',
> value=890.6111111111111), Row(c1=u'546', c2=u'234', value=776.45)]
> ~~~~~
>
> But saving the sorted dataframe as parquet and fetching the first N rows
> using head() doesn't seem to return the results ordered by the 'value' column:
>
> ~~~~
>>>df=sqlContext.read.parquet("/some/file.parquet")
>>>df.sort(df.value.desc()).write.parquet("/sorted/file.parquet")
> ...
>>>df2=sqlContext.read.parquet("/sorted/file.parquet")
>>>df2.head(3)
>
> [Row(c1=u'444', c2=u'233', value=0.024120907), Row(c1=u'5672', c2=u'9098',
> value=0.024120906), Row(c1=u'546', c2=u'234', value=0.024120905)]
> ~~~~
>
> How do I go about sorting and saving a sorted dataframe?
>
>
> Thanks!
