[ 
https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889518#comment-16889518
 ] 

Marco Gaido edited comment on SPARK-28225 at 7/20/19 2:43 PM:
--------------------------------------------------------------

Let me cite PostgreSQL documentation to explain you the behavior:

??When an aggregate function is used as a window function, it aggregates over 
the rows within the current row's window frame. An aggregate used with ORDER BY 
and the default window frame definition produces a "running sum" type of 
behavior, which may or may not be what's wanted. To obtain aggregation over the 
whole partition, omit ORDER BY or use ROWS BETWEEN UNBOUNDED PRECEDING AND 
UNBOUNDED FOLLOWING. Other frame specifications can be used to obtain other 
effects.??

So the returned values seem correct to me.


was (Author: mgaido):
Let me cite PostgreSQL documentation to explain you the behavior:

{noformat}
When an aggregate function is used as a window function, it aggregates over the 
rows within the current row's window frame. An aggregate used with ORDER BY and 
the default window frame definition produces a "running sum" type of behavior, 
which may or may not be what's wanted. To obtain aggregation over the whole 
partition, omit ORDER BY or use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED 
FOLLOWING. Other frame specifications can be used to obtain other effects.
{noformat}

So the returned values seem correct to me.


> Unexpected behavior for Window functions
> ----------------------------------------
>
>                 Key: SPARK-28225
>                 URL: https://issues.apache.org/jira/browse/SPARK-28225
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Andrew Leverentz
>            Priority: Major
>
> I've noticed some odd behavior when combining the "first" aggregate function 
> with an ordered Window.
> In particular, I'm working with columns created using the syntax
> {code}
> first($"y", ignoreNulls = true).over(Window.orderBy($"x"))
> {code}
> Below, I'm including some code which reproduces this issue in a Databricks 
> notebook.
> *Code:*
> {code:java}
> import org.apache.spark.sql.functions.first
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType,StructField,IntegerType}
> val schema = StructType(Seq(
>   StructField("x", IntegerType, false),
>   StructField("y", IntegerType, true),
>   StructField("z", IntegerType, true)
> ))
> val input =
>   spark.createDataFrame(sc.parallelize(Seq(
>     Row(101, null, 11),
>     Row(102, null, 12),
>     Row(103, null, 13),
>     Row(203, 24, null),
>     Row(201, 26, null),
>     Row(202, 25, null)
>   )), schema = schema)
> input.show
> val output = input
>   .withColumn("u1", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u2", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u3", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u4", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
>   .withColumn("u5", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u6", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u7", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u8", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
> output.show
> {code}
> *Expectation:*
> Based on my understanding of how ordered-Window and aggregate functions work, 
> the results I expected to see were:
>  * u1 = u2 = constant value of 26
>  * u3 = u4 = constant value of 24
>  * u5 = u6 = constant value of 11
>  * u7 = u8 = constant value of 13
> However, columns u1, u2, u7, and u8 contain some unexpected nulls. 
> *Results:*
> {code:java}
> +---+----+----+----+----+---+---+---+---+----+----+
> |  x|   y|   z|  u1|  u2| u3| u4| u5| u6|  u7|  u8|
> +---+----+----+----+----+---+---+---+---+----+----+
> |203|  24|null|  26|  26| 24| 24| 11| 11|null|null|
> |202|  25|null|  26|  26| 24| 24| 11| 11|null|null|
> |201|  26|null|  26|  26| 24| 24| 11| 11|null|null|
> |103|null|  13|null|null| 24| 24| 11| 11|  13|  13|
> |102|null|  12|null|null| 24| 24| 11| 11|  13|  13|
> |101|null|  11|null|null| 24| 24| 11| 11|  13|  13|
> +---+----+----+----+----+---+---+---+---+----+----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to