Re: pyspark dataframe sort issue

2016-05-08 Thread Buntu Dev
Thanks Davies. After I did a coalesce(1) to save the output as a single
parquet file, head() returned the rows in the correct order.
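
A minimal sketch of the workaround described above, for anyone who finds this thread later. The helper names `write_sorted_single_file` and `read_top_n` are hypothetical and not part of PySpark; the only DataFrame calls used are `sort`, `desc`, `coalesce`, `write.parquet`, `read.parquet` and `head`. It assumes the sorted data is small enough to pass through a single partition, and it sorts again on read instead of relying on file order.

```python
def write_sorted_single_file(df, column, path):
    """Sort df by `column` descending and write it as one parquet part file.

    coalesce(1) merges the sorted partitions into a single partition, so only
    one part file is written. This is the workaround reported in this thread;
    it funnels all rows through one task, so it only suits small outputs.
    """
    df.sort(df[column].desc()).coalesce(1).write.parquet(path)


def read_top_n(sql_context, path, column, n):
    """Read the parquet data back and sort explicitly before taking n rows.

    Sorting again at read time does not depend on how many part files exist.
    """
    df = sql_context.read.parquet(path)
    return df.sort(df[column].desc()).head(n)
```

With the names from the original post, this would be called as `write_sorted_single_file(df, "value", "/sorted/file.parquet")` followed by `read_top_n(sqlContext, "/sorted/file.parquet", "value", 3)`.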

On Sun, May 8, 2016 at 12:29 AM, Davies Liu  wrote:

> When you have multiple parquet files, the order of all the rows in
> them is not defined.
>
> On Sat, May 7, 2016 at 11:48 PM, Buntu Dev  wrote:
> > I'm using pyspark dataframe api to sort by specific column and then saving
> > the dataframe as parquet file. But the resulting parquet file doesn't seem
> > to be sorted.
> >
> > Applying sort and doing a head() on the results shows the correct results
> > sorted by 'value' column in desc order, as shown below:
> >
> > ~
> > >>> df = sqlContext.read.parquet("/some/file.parquet")
> > >>> df.printSchema()
> >
> > root
> >  |-- c1: string (nullable = true)
> >  |-- c2: string (nullable = true)
> >  |-- value: double (nullable = true)
> >
> > >>> df.sort(df.value.desc()).head(3)
> >
> > [Row(c1=u'546', c2=u'234', value=1020.25), Row(c1=u'3212', c2=u'6785',
> > value=890.6), Row(c1=u'546', c2=u'234', value=776.45)]
> > ~~
> >
> > But saving the sorted dataframe as parquet and fetching the first N rows
> > using head() doesn't seem to return the results ordered by 'value' column:
> >
> > 
> > >>> df = sqlContext.read.parquet("/some/file.parquet")
> > >>> df.sort(df.value.desc()).write.parquet("/sorted/file.parquet")
> > ...
> > >>> df2 = sqlContext.read.parquet("/sorted/file.parquet")
> > >>> df2.head(3)
> >
> > [Row(c1=u'444', c2=u'233', value=0.024120907), Row(c1=u'5672', c2=u'9098',
> > value=0.024120906), Row(c1=u'546', c2=u'234', value=0.024120905)]
> > 
> >
> > How do I go about sorting and saving a sorted dataframe?
> >
> >
> > Thanks!
>


Re: pyspark dataframe sort issue

2016-05-08 Thread Davies Liu
When you have multiple parquet files, the order of all the rows in
them is not defined.
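
A toy model in plain Python (not Spark) may help show why this happens. It assumes a sorted dataset is split into contiguous chunks, one per part file, and that a reader may pick the files up in any order; both function names are hypothetical, and the values are borrowed from the rows in the original post. Each file is sorted internally, but the concatenation is only globally sorted if the files happen to be read in writing order.

```python
def write_sorted_parts(values, num_parts):
    """Sort descending, then split into contiguous chunks, one per 'part file'."""
    ordered = sorted(values, reverse=True)
    size = -(-len(ordered) // num_parts)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def read_parts(parts, file_order):
    """Concatenate the part files in whatever order the reader visits them."""
    return [v for idx in file_order for v in parts[idx]]


values = [0.024120906, 1020.25, 776.45, 0.024120905, 890.6, 0.024120907]

parts = write_sorted_parts(values, num_parts=3)

# Reading the files in the order they were written keeps the global sort.
in_order = read_parts(parts, [0, 1, 2])

# Reading them in another order does not, even though each file is sorted.
out_of_order = read_parts(parts, [2, 0, 1])

# With a single part file there is only one possible read order.
single = read_parts(write_sorted_parts(values, num_parts=1), [0])

print(in_order[:3])
print(out_of_order[:3])
print(single[:3])
```

In this model the second read starts with the smallest values, which matches the symptom in the original post.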

On Sat, May 7, 2016 at 11:48 PM, Buntu Dev  wrote:
> I'm using pyspark dataframe api to sort by specific column and then saving
> the dataframe as parquet file. But the resulting parquet file doesn't seem
> to be sorted.
>
> Applying sort and doing a head() on the results shows the correct results
> sorted by 'value' column in desc order, as shown below:
>
> ~
> >>> df = sqlContext.read.parquet("/some/file.parquet")
> >>> df.printSchema()
>
> root
>  |-- c1: string (nullable = true)
>  |-- c2: string (nullable = true)
>  |-- value: double (nullable = true)
>
> >>> df.sort(df.value.desc()).head(3)
>
> [Row(c1=u'546', c2=u'234', value=1020.25), Row(c1=u'3212', c2=u'6785',
> value=890.6), Row(c1=u'546', c2=u'234', value=776.45)]
> ~~
>
> But saving the sorted dataframe as parquet and fetching the first N rows
> using head() doesn't seem to return the results ordered by 'value' column:
>
> 
> >>> df = sqlContext.read.parquet("/some/file.parquet")
> >>> df.sort(df.value.desc()).write.parquet("/sorted/file.parquet")
> ...
> >>> df2 = sqlContext.read.parquet("/sorted/file.parquet")
> >>> df2.head(3)
>
> [Row(c1=u'444', c2=u'233', value=0.024120907), Row(c1=u'5672', c2=u'9098',
> value=0.024120906), Row(c1=u'546', c2=u'234', value=0.024120905)]
> 
>
> How do I go about sorting and saving a sorted dataframe?
>
>
> Thanks!




pyspark dataframe sort issue

2016-05-08 Thread Buntu Dev
I'm using the pyspark dataframe api to sort by a specific column and then save
the dataframe as a parquet file. But the resulting parquet file doesn't seem
to be sorted.

Applying sort and doing a head() on the results shows the correct results
sorted by 'value' column in desc order, as shown below:

~
>>> df = sqlContext.read.parquet("/some/file.parquet")
>>> df.printSchema()

root
 |-- c1: string (nullable = true)
 |-- c2: string (nullable = true)
 |-- value: double (nullable = true)

>>> df.sort(df.value.desc()).head(3)

[Row(c1=u'546', c2=u'234', value=1020.25), Row(c1=u'3212', c2=u'6785',
value=890.6), Row(c1=u'546', c2=u'234', value=776.45)]
~~

But saving the sorted dataframe as parquet and fetching the first N rows
using head() doesn't seem to return the results ordered by 'value' column:


>>> df = sqlContext.read.parquet("/some/file.parquet")
>>> df.sort(df.value.desc()).write.parquet("/sorted/file.parquet")
...
>>> df2 = sqlContext.read.parquet("/sorted/file.parquet")
>>> df2.head(3)

[Row(c1=u'444', c2=u'233', value=0.024120907), Row(c1=u'5672', c2=u'9098',
value=0.024120906), Row(c1=u'546', c2=u'234', value=0.024120905)]


How do I go about sorting and saving a sorted dataframe?


Thanks!