Thank you. I have tried the window function as follows:

    import pyspark.sql.functions as f
    from pyspark.sql import Window
    import pandas as pd

    sqc = sqlContext
    DF = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'b': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                       'c': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
    df = sqc.createDataFrame(DF)
    window = Window.partitionBy("a").orderBy("c")
    df.select('a', 'b', 'c', f.min('c').over(window).alias('y')).show()

I got the following result, which is understandable:

    +---+---+---+---+
    |  a|  b|  c|  y|
    +---+---+---+---+
    |  1|  1|  1|  1|
    |  1|  2|  2|  1|
    |  1|  3|  3|  1|
    |  2|  1|  4|  4|
    |  2|  2|  5|  4|
    |  2|  3|  6|  4|
    |  3|  1|  7|  7|
    |  3|  2|  8|  7|
    |  3|  3|  9|  7|
    +---+---+---+---+

However, if I change min to max, the result is not what I expected:

    df.select('a', 'b', 'c', f.max('c').over(window).alias('y')).show()

gives

    +---+---+---+---+
    |  a|  b|  c|  y|
    +---+---+---+---+
    |  1|  1|  1|  1|
    |  1|  2|  2|  2|
    |  1|  3|  3|  3|
    |  2|  1|  4|  4|
    |  2|  2|  5|  5|
    |  2|  3|  6|  6|
    |  3|  1|  7|  7|
    |  3|  2|  8|  8|
    |  3|  3|  9|  9|
    +---+---+---+---+

Thanks,
Wei

On Tue, Jan 5, 2016 at 8:30 PM, ayan guha <guha.a...@gmail.com> wrote:

> Yes, there is. It is called a window function over partitions.
>
> The equivalent SQL would be:
>
>     select * from
>       (select a, b, c, rank() over (partition by a order by b) r from df) x
>     where r = 1
>
> You can register your DF as a temp table and use the SQL form. Or
> (Spark 1.4+) you can use the window methods and their variants in the
> Spark SQL module.
>
> HTH....
>
> On Wed, Jan 6, 2016 at 11:56 AM, Wei Chen <wei.chen.ri...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am trying to retrieve the rows with the minimum value of a column for
>> each group. For example, given the following dataframe:
>>
>>     a | b | c
>>     ----------
>>     1 | 1 | 1
>>     1 | 2 | 2
>>     1 | 3 | 3
>>     2 | 1 | 4
>>     2 | 2 | 5
>>     2 | 3 | 6
>>     3 | 1 | 7
>>     3 | 2 | 8
>>     3 | 3 | 9
>>     ----------
>>
>> I group by 'a' and want the rows with the smallest 'b', that is, I want
>> to return the following dataframe:
>>
>>     a | b | c
>>     ----------
>>     1 | 1 | 1
>>     2 | 1 | 4
>>     3 | 1 | 7
>>     ----------
>>
>> The dataframe I have is huge, so getting the minimum value of b from
>> each group and joining on the original dataframe is very expensive.
>> Is there a better way to do this?
>>
>> Thanks,
>> Wei
>
> --
> Best Regards,
> Ayan Guha

--
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984