Re: [pyspark 2.4.3] nested windows function performance

2019-10-21 Thread Georg Heiler
No, because each window triggers another shuffle (you partition by different columns every time). Instead, could you choose a single window (w3, which has more columns and is therefore the most fine-grained) and then filter out records to achieve the same result? Or alternatively: df.groupBy(a,b,c).agg(sort_array(collect_list(foo,bar,baz))) and then
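The idea of grouping once at the finest granularity can be illustrated without a cluster. The sketch below is plain Python, not PySpark, and it is not the code from this thread: the sample rows, the function name nested_sums, and the restriction to two key columns are assumptions made for illustration. It shows that for sums, the coarser total per col1 can be derived from the partial sums per (col1, col2), so the raw rows only need to be grouped once.

```python
from collections import defaultdict

# Hypothetical rows standing in for the data frame: (col1, col2, val).
rows = [
    ("a", "x", 1),
    ("a", "x", 2),
    ("a", "y", 3),
    ("b", "x", 4),
]


def nested_sums(rows):
    """Attach the sum per col1 and the sum per (col1, col2) to every row."""
    # Single pass over the raw rows at the finest granularity (col1, col2).
    fine = defaultdict(int)
    for col1, col2, val in rows:
        fine[(col1, col2)] += val

    # Roll the fine-grained partial sums up to col1.
    # This touches only the aggregated keys, not the raw rows again.
    coarse = defaultdict(int)
    for (col1, _col2), subtotal in fine.items():
        coarse[col1] += subtotal

    # Counterparts of the two window columns: sum1 per col1, sum2 per (col1, col2).
    return [
        (col1, col2, val, coarse[col1], fine[(col1, col2)])
        for col1, col2, val in rows
    ]


result = nested_sums(rows)
for row in result:
    print(row)
```

This only works for aggregates that can be rebuilt from partial results, such as sums and counts. Whether the equivalent Spark plan (an aggregate followed by a join back to the rows) is actually faster than nested windows would depend on the data and should be checked against the physical plan.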

Re: [pyspark 2.4.3] nested windows function performance

2019-10-21 Thread Rishi Shah
Hi All,
Any suggestions?
Thanks,
-Rishi
On Sun, Oct 20, 2019 at 12:56 AM Rishi Shah wrote:
> Hi All,
>
> I have a use case where I need to perform nested windowing functions on a
> data frame to get final set of columns. Example:
>
> w1 = Window.partitionBy('col1')
> df =

[pyspark 2.4.3] nested windows function performance

2019-10-19 Thread Rishi Shah
Hi All,
I have a use case where I need to perform nested windowing functions on a data frame to get the final set of columns. Example:
w1 = Window.partitionBy('col1')
df = df.withColumn('sum1', F.sum('val').over(w1))
w2 = Window.partitionBy('col1', 'col2')
df = df.withColumn('sum2', F.sum('val').over(w2))
w3 =