Thank you. I have tried the window function as follows:
import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql import Window

sqc = sqlContext  # the SQLContext available in the PySpark shell
DF = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'b': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'c': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df = sqc.createDataFrame(DF)  # convert the pandas frame to a Spark DataFrame
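A minimal sketch of applying such a window to this frame (the orderBy("b")
clause, the f.min aggregate and the min_b column name are assumptions):

window = Window.partitionBy("a").orderBy("b")
# attach the windowed minimum of 'b' to every row
df.withColumn("min_b", f.min("b").over(window)).show()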
Try redefining your window without the orderBy part. In other words, rerun your
code with
window = Window.partitionBy("a")
The thing is that the window is defined differently in these two cases. In
your example, in the group where "a" is 1:
- If you include the orderBy("b") option, the default frame runs from the
  start of the partition up to the current row, so an aggregate over the
  window is a rolling (cumulative) value that can change from row to row.
- If you leave orderBy out, the frame is the whole partition, so the
  aggregate is the same group-wide value on every row.
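For concreteness, a sketch of the two definitions side by side, reusing df and
the imports from the message above (the min_b column name is only for
illustration):

# frame = whole partition: min_b is the group minimum on every row
w_group = Window.partitionBy("a")
df.withColumn("min_b", f.min("b").over(w_group)).show()

# frame = start of partition up to the current row: min_b is a running minimum
w_running = Window.partitionBy("a").orderBy("b")
df.withColumn("min_b", f.min("b").over(w_running)).show()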
Yes, there is. It is called a window function over partitions.
Equivalent SQL would be:
select * from
(select a,b,c, rank() over (partition by a order by b) r from df) x
where r = 1
You can register your DF as a temp table and use the SQL form. Or, in Spark 1.4
or later, you can use the DataFrame window methods.
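Both forms, sketched against the example frame (assuming it is available as a
Spark DataFrame named df and that the SQLContext is named sqlContext):

# SQL form: register the frame as a temp table and run the query above
# (use createOrReplaceTempView instead on Spark 2.0+)
df.registerTempTable("df")
sqlContext.sql("""
  select * from
  (select a, b, c, rank() over (partition by a order by b) r from df) x
  where r = 1
""").show()

# DataFrame form (Spark 1.4+): the same query via the Window API
from pyspark.sql import Window
import pyspark.sql.functions as f

w = Window.partitionBy("a").orderBy("b")
df.withColumn("r", f.rank().over(w)).where(f.col("r") == 1).drop("r").show()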
Hi,
I am trying to retrieve the rows with the minimum value of a column for each
group. For example, given the following dataframe:
a | b | c
--
1 | 1 | 1
1 | 2 | 2
1 | 3 | 3
2 | 1 | 4
2 | 2 | 5
2 | 3 | 6
3 | 1 | 7
3 | 2 | 8
3 | 3 | 9
--
I group by 'a', and want the rows with the minimum value of 'b' in each group.
Is there a way to do this directly with DataFrames?
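For illustration, one way to express the desired result (a sketch, assuming the
table above has been loaded as a Spark DataFrame named df; the groupBy/join
approach here is just one possibility):

import pyspark.sql.functions as f

# per-group minimum of 'b', then join back to recover the full rows
mins = df.groupBy("a").agg(f.min("b").alias("b"))
df.join(mins, on=["a", "b"]).show()  # expected rows: (1,1,1), (2,1,4), (3,1,7)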