Re: pyspark dataframe: row with a minimum value of a column for each group

2016-01-06 Thread Wei Chen

Thank you. I have tried the window function as follows:

    import pyspark.sql.functions as f
    sqc = sqlContext
    from pyspark.sql import Window
    import pandas as pd
    DF = pd.DataFrame({'a': [1,1,1,2,2,2,3,3,3], 'b': [1,2,3,1,2,3,1,2,3], 'c': [1,2,3,4,5,6,7,8,9]

Re: pyspark dataframe: row with a minimum value of a column for each group

2016-01-06 Thread Kristina Rogale Plazonic

Try redefining your window without the orderBy part. In other words, rerun your code with

    window = Window.partitionBy("a")

The window is defined differently in these two cases. In your example, in the group where "a" is 1: - If you include the "orderBy" option, it is a rolling

Re: pyspark dataframe: row with a minimum value of a column for each group

2016-01-05 Thread ayan guha

Yes, there is. It is called a window function over partitions. The equivalent SQL would be:

    select * from
      (select a, b, c, rank() over (partition by a order by b) r from df) x
    where r = 1

You can register your DF as a temp table and use the SQL form. Or (Spark 1.4 and later) you can use window methods

pyspark dataframe: row with a minimum value of a column for each group

2016-01-05 Thread Wei Chen

Hi, I am trying to retrieve the rows with a minimum value of a column for each group. For example, given the following dataframe:

a | b | c
---------
1 | 1 | 1
1 | 2 | 2
1 | 3 | 3
2 | 1 | 4
2 | 2 | 5
2 | 3 | 6
3 | 1 | 7
3 | 2 | 8
3 | 3 | 9
---------

I group by 'a', and want the rows with the