Try redefining your window without the orderBy part. In other words, rerun
your code with:
window = Window.partitionBy("a")

The thing is that the window is defined differently in these two cases. In
your example, in the group where "a" is 1:

- If you include the orderBy option, it is a running (cumulative) window:
  - the 1st min is computed on the first row in this group,
  - the 2nd min is computed on the first 2 rows in this group,
  - the 3rd min is computed on the first 3 rows in this group.
- If you don't include the orderBy option, min is computed on a constant
  window of width 3 (the whole group).

That is why min looks right but max does not: with orderBy, max over the
running window is always just the current row's value. A short sketch of
both cases is included after the quoted thread below.

On Wed, Jan 6, 2016 at 2:34 PM, Wei Chen <wei.chen.ri...@gmail.com> wrote:

> Thank you. I have tried the window function as follows:
>
> import pyspark.sql.functions as f
> sqc = sqlContext
> from pyspark.sql import Window
> import pandas as pd
>
> DF = pd.DataFrame({'a': [1,1,1,2,2,2,3,3,3],
>                    'b': [1,2,3,1,2,3,1,2,3],
>                    'c': [1,2,3,4,5,6,7,8,9]
>                    })
>
> df = sqc.createDataFrame(DF)
>
> window = Window.partitionBy("a").orderBy("c")
>
> df.select('a', 'b', 'c', f.min('c').over(window).alias('y')).show()
>
> I got the following result, which is understandable:
>
> +---+---+---+---+
> |  a|  b|  c|  y|
> +---+---+---+---+
> |  1|  1|  1|  1|
> |  1|  2|  2|  1|
> |  1|  3|  3|  1|
> |  2|  1|  4|  4|
> |  2|  2|  5|  4|
> |  2|  3|  6|  4|
> |  3|  1|  7|  7|
> |  3|  2|  8|  7|
> |  3|  3|  9|  7|
> +---+---+---+---+
>
> However, if I change min to max, the result is not what I expected:
>
> df.select('a', 'b', 'c', f.max('c').over(window).alias('y')).show() gives
>
> +---+---+---+---+
> |  a|  b|  c|  y|
> +---+---+---+---+
> |  1|  1|  1|  1|
> |  1|  2|  2|  2|
> |  1|  3|  3|  3|
> |  2|  1|  4|  4|
> |  2|  2|  5|  5|
> |  2|  3|  6|  6|
> |  3|  1|  7|  7|
> |  3|  2|  8|  8|
> |  3|  3|  9|  9|
> +---+---+---+---+
>
> Thanks,
> Wei
>
> On Tue, Jan 5, 2016 at 8:30 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Yes, there is. It is called a window function over partitions.
>>
>> The equivalent SQL would be:
>>
>> select * from
>>   (select a, b, c, rank() over (partition by a order by b) r from df) x
>> where r = 1
>>
>> You can register your DF as a temp table and use the SQL form. Or, in
>> Spark 1.4+, you can use the window methods and their variants in the
>> Spark SQL module.
>>
>> HTH....
>>
>> On Wed, Jan 6, 2016 at 11:56 AM, Wei Chen <wei.chen.ri...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to retrieve the rows with the minimum value of a column for
>>> each group. For example, given the following dataframe:
>>>
>>> a | b | c
>>> ----------
>>> 1 | 1 | 1
>>> 1 | 2 | 2
>>> 1 | 3 | 3
>>> 2 | 1 | 4
>>> 2 | 2 | 5
>>> 2 | 3 | 6
>>> 3 | 1 | 7
>>> 3 | 2 | 8
>>> 3 | 3 | 9
>>> ----------
>>>
>>> I group by 'a' and want the rows with the smallest 'b', that is, I want
>>> to return the following dataframe:
>>>
>>> a | b | c
>>> ----------
>>> 1 | 1 | 1
>>> 2 | 1 | 4
>>> 3 | 1 | 7
>>> ----------
>>>
>>> The dataframe I have is huge, so getting the minimum value of b for
>>> each group and joining it back on the original dataframe is very
>>> expensive. Is there a better way to do this?
>>>
>>> Thanks,
>>> Wei
>>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>
> --
> Wei Chen, Ph.D.
> Astronomer and Data Scientist
> Phone: (832)646-7124
> Email: wei.chen.ri...@gmail.com
> LinkedIn: https://www.linkedin.com/in/weichen1984
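
For anyone landing on this thread later, here is a minimal PySpark sketch
of both points discussed above, using the same toy data as Wei's example.
It assumes a live sqlContext (a HiveContext on older Spark versions, where
window functions required Hive support); treat it as a sketch of the
approach, not the exact code anyone in the thread ran.

import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql import Window

# Same toy data as in the thread.
DF = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'b': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'c': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df = sqlContext.createDataFrame(DF)

# 1. Per-group max: leave out orderBy so the frame is the whole partition
#    rather than a running window that ends at the current row.
whole_group = Window.partitionBy("a")
df.select('a', 'b', 'c', f.max('c').over(whole_group).alias('y')).show()

# 2. Rows with the smallest 'b' per group, without a groupBy + join:
#    rank rows within each partition by 'b', then keep only rank 1.
by_b = Window.partitionBy("a").orderBy("b")
df.withColumn('r', f.rank().over(by_b)) \
  .where(f.col('r') == 1) \
  .drop('r') \
  .show()

The second snippet is just the DataFrame form of ayan's SQL: the rank
column plays the role of r in his subquery, and filtering on it avoids
the expensive aggregate-then-join that Wei wanted to sidestep.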