Try redefining your window without the orderBy part. In other words, rerun
your code with

window = Window.partitionBy("a")

The thing is that the window is defined differently in the two cases. In
your example, take the group where "a" is 1:

  - If you include the orderBy clause, the default frame runs from the start
of the partition up to the current row, so it is a rolling window:
       - the 1st max is computed on the first row of the group,
       - the 2nd max is computed on the first 2 rows of the group,
       - the 3rd max is computed on the first 3 rows of the group.
    That is also why min still looked right: the running min of a column
sorted in ascending order equals the group min, while the running max is
just the current value of "c".

  - If you don't include the orderBy clause, the frame is the whole
partition, so the aggregate is computed on a constant window of all 3 rows
(a sketch of both definitions follows).
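
A minimal sketch of both definitions, reusing the df you build in your
quoted code below (the rowsBetween note in the last comment is optional,
not something you need here):

import pyspark.sql.functions as f
from pyspark.sql import Window

# Rolling window: with orderBy, the default frame is "start of the
# partition up to the current row", so max() grows row by row.
w_running = Window.partitionBy("a").orderBy("c")

# Constant window: without orderBy, the frame is the whole partition,
# so max() is the group-wide maximum on every row.
w_group = Window.partitionBy("a")

df.select('a', 'b', 'c', f.max('c').over(w_group).alias('y')).show()

# (To keep the ordering and still aggregate over the whole group, the
# frame can also be widened explicitly with rowsBetween.)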

On Wed, Jan 6, 2016 at 2:34 PM, Wei Chen <wei.chen.ri...@gmail.com> wrote:

> Thank you. I have tried the window function as follows:
>
> import pyspark.sql.functions as f
> sqc = sqlContext
> from pyspark.sql import Window
> import pandas as pd
>
> DF = pd.DataFrame({'a': [1,1,1,2,2,2,3,3,3],
>                    'b': [1,2,3,1,2,3,1,2,3],
>                    'c': [1,2,3,4,5,6,7,8,9]
>                   })
>
> df = sqc.createDataFrame(DF)
>
> window = Window.partitionBy("a").orderBy("c")
>
> df.select('a', 'b', 'c', f.min('c').over(window).alias('y')).show()
>
> I got the following result, which is understandable:
>
> +---+---+---+---+
> |  a|  b|  c|  y|
> +---+---+---+---+
> |  1|  1|  1|  1|
> |  1|  2|  2|  1|
> |  1|  3|  3|  1|
> |  2|  1|  4|  4|
> |  2|  2|  5|  4|
> |  2|  3|  6|  4|
> |  3|  1|  7|  7|
> |  3|  2|  8|  7|
> |  3|  3|  9|  7|
> +---+---+---+---+
>
>
> However, if I change min to max, the result is not what I expected:
>
> df.select('a', 'b', 'c', f.max('c').over(window).alias('y')).show() gives
>
> +---+---+---+---+
> |  a|  b|  c|  y|
> +---+---+---+---+
> |  1|  1|  1|  1|
> |  1|  2|  2|  2|
> |  1|  3|  3|  3|
> |  2|  1|  4|  4|
> |  2|  2|  5|  5|
> |  2|  3|  6|  6|
> |  3|  1|  7|  7|
> |  3|  2|  8|  8|
> |  3|  3|  9|  9|
> +---+---+---+---+
>
>
>
> Thanks,
>
> Wei
>
>
> On Tue, Jan 5, 2016 at 8:30 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Yes, there is. It is called a window function over partitions.
>>
>> Equivalent SQL would be:
>>
>> select * from
>>   (select a, b, c, rank() over (partition by a order by b) as r from df) x
>> where r = 1
>>
>> You can register your DF as a temp table and use the SQL form. Or, on
>> Spark 1.4 or later, you can use the window methods and their variants in
>> the Spark SQL module.
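>>
>> For example, a rough DataFrame-API sketch of the same rank-and-filter
>> idea, reusing the df from your message (rank comes from
>> pyspark.sql.functions):
>>
>> import pyspark.sql.functions as f
>> from pyspark.sql import Window
>>
>> # rank rows within each group of "a" by "b", then keep only rank 1
>> w = Window.partitionBy("a").orderBy("b")
>> result = (df.withColumn("r", f.rank().over(w))
>>             .where(f.col("r") == 1)
>>             .drop("r"))
>> result.show()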
>>
>> HTH....
>>
>> On Wed, Jan 6, 2016 at 11:56 AM, Wei Chen <wei.chen.ri...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to retrieve the row with the minimum value of a column for
>>> each group. For example, given the following dataframe:
>>>
>>> a | b | c
>>> ----------
>>> 1 | 1 | 1
>>> 1 | 2 | 2
>>> 1 | 3 | 3
>>> 2 | 1 | 4
>>> 2 | 2 | 5
>>> 2 | 3 | 6
>>> 3 | 1 | 7
>>> 3 | 2 | 8
>>> 3 | 3 | 9
>>> ----------
>>>
>>> I group by 'a' and want the row with the smallest 'b' in each group;
>>> that is, I want to return the following dataframe:
>>>
>>> a | b | c
>>> ----------
>>> 1 | 1 | 1
>>> 2 | 1 | 4
>>> 3 | 1 | 7
>>> ----------
>>>
>>> The dataframe I have is huge, so getting the minimum value of b for each
>>> group and joining it back onto the original dataframe is very expensive.
>>> Is there a better way to do this?
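>>>
>>> For reference, that expensive approach is roughly the sketch below,
>>> using the column names from the example above (and assuming a Spark
>>> version whose join accepts a list of column names):
>>>
>>> import pyspark.sql.functions as f
>>>
>>> # group-wise minimum of b, joined back onto the original dataframe
>>> mins = df.groupBy("a").agg(f.min("b").alias("b"))
>>> result = df.join(mins, on=["a", "b"])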
>>>
>>>
>>> Thanks,
>>> Wei
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> Wei Chen, Ph.D.
> Astronomer and Data Scientist
> Phone: (832)646-7124
> Email: wei.chen.ri...@gmail.com
> LinkedIn: https://www.linkedin.com/in/weichen1984
>
