Hi Devs,
I am seeing some behavior with window functions that is a bit unintuitive
and would like to get some clarification.
When using an aggregation function over a window, the frame boundaries seem
to change depending on whether the window is ordered.
Example:
(1)
from pyspark.sql import Window
from pyspark.sql.functions import mean

df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
w1 = Window.partitionBy('id')
df.withColumn('v2', mean(df.v).over(w1)).show()
+---+---+---+
| id| v| v2|
+---+---+---+
| 0| 1|2.0|
| 0| 2|2.0|
| 0| 3|2.0|
+---+---+---+
(2)
df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
w2 = Window.partitionBy('id').orderBy('v')
df.withColumn('v2', mean(df.v).over(w2)).show()
+---+---+---+
| id| v| v2|
+---+---+---+
| 0| 1|1.0|
| 0| 2|1.5|
| 0| 3|2.0|
+---+---+---+
It seems that orderBy('v') in example (2) also changes the frame boundaries
from (unboundedPreceding, unboundedFollowing) to
(unboundedPreceding, currentRow).
I found this behavior a bit unintuitive. Is this by design, and if so,
what is the specific rule for how orderBy() interacts with the frame
boundaries?
Thanks,
Li