Re: pyspark dataframe: row with a minimum value of a column for each group

Wei Chen Wed, 06 Jan 2016 11:35:18 -0800

Thank you. I have tried the window function as follows:

import pyspark.sql.functions as f
sqc = sqlContext
from pyspark.sql import Window
import pandas as pd


DF = pd.DataFrame({'a': [1,1,1,2,2,2,3,3,3],
                   'b': [1,2,3,1,2,3,1,2,3],
                   'c': [1,2,3,4,5,6,7,8,9]
                  })

df = sqc.createDataFrame(DF)

window = Window.partitionBy("a").orderBy("c")

df.select('a', 'b', 'c', f.min('c').over(window).alias('y')).show()

I got the following result which is understandable:

+---+---+---+---+
|  a|  b|  c|  y|
+---+---+---+---+
|  1|  1|  1|  1|
|  1|  2|  2|  1|
|  1|  3|  3|  1|
|  2|  1|  4|  4|
|  2|  2|  5|  4|
|  2|  3|  6|  4|
|  3|  1|  7|  7|
|  3|  2|  8|  7|
|  3|  3|  9|  7|
+---+---+---+---+


However if I change min to max, the result is not what is expected:

df.select('a', 'b', 'c', f.max('c').over(window).alias('y')).show() gives

+---+---+---+---+
|  a|  b|  c|  y|
+---+---+---+---+
|  1|  1|  1|  1|
|  1|  2|  2|  2|
|  1|  3|  3|  3|
|  2|  1|  4|  4|
|  2|  2|  5|  5|
|  2|  3|  6|  6|
|  3|  1|  7|  7|
|  3|  2|  8|  8|
|  3|  3|  9|  9|
+---+---+---+---+



Thanks,

Wei


On Tue, Jan 5, 2016 at 8:30 PM, ayan guha <guha.a...@gmail.com> wrote:

> Yes there is. It is called window function over partitions.
>
> Equivalent SQL would be:
>
> select * from
>          (select a,b,c, rank() over (partition by a order by b) r from df)
> x
> where r = 1
>
> You can register your DF as a temp table and use the sql form. Or, (>Spark
> 1.4) you can use window methods and their variants in Spark SQL module.
>
> HTH....
>
> On Wed, Jan 6, 2016 at 11:56 AM, Wei Chen <wei.chen.ri...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am trying to retrieve the rows with a minimum value of a column for
>> each group. For example: the following dataframe:
>>
>> a | b | c
>> ----------
>> 1 | 1 | 1
>> 1 | 2 | 2
>> 1 | 3 | 3
>> 2 | 1 | 4
>> 2 | 2 | 5
>> 2 | 3 | 6
>> 3 | 1 | 7
>> 3 | 2 | 8
>> 3 | 3 | 9
>> ----------
>>
>> I group by 'a', and want the rows with the smallest 'b', that is, I want
>> to return the following dataframe:
>>
>> a | b | c
>> ----------
>> 1 | 1 | 1
>> 2 | 1 | 4
>> 3 | 1 | 7
>> ----------
>>
>> The dataframe I have is huge so get the minimum value of b from each
>> group and joining on the original dataframe is very expensive. Is there a
>> better way to do this?
>>
>>
>> Thanks,
>> Wei
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984

Re: pyspark dataframe: row with a minimum value of a column for each group

Reply via email to