Hi Devs,
I am seeing some behavior with window functions that is a bit unintuitive
and would like to get some clarification.
When using an aggregation function over a window, the frame boundaries seem
to change depending on whether the window is ordered.
Example:
(1)
from pyspark.sql import Window
from pyspark.sql.functions import mean

df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
w1 = Window.partitionBy('id')
df.withColumn('v2', mean(df.v).over(w1)).show()
+---+---+---+
| id| v| v2|
+---+---+---+
| 0| 1|2.0|
| 0| 2|2.0|
| 0| 3|2.0|
+---+---+---+
(2)
df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
w2 = Window.partitionBy('id').orderBy('v')
df.withColumn('v2', mean(df.v).over(w2)).show()
+---+---+---+
| id| v| v2|
+---+---+---+
| 0| 1|1.0|
| 0| 2|1.5|
| 0| 3|2.0|
+---+---+---+
It seems that orderBy('v') in example (2) also changes the frame boundaries
from (unboundedPreceding, unboundedFollowing) to
(unboundedPreceding, currentRow).
I found this behavior a bit unintuitive. Is this by design, and if so,
what is the specific rule for how orderBy() interacts with the frame
boundaries?
Thanks,
Li