Lior Chaga created SPARK-22840:
----------------------------------

             Summary: Incorrect results when using distinct on window
                 Key: SPARK-22840
                 URL: https://issues.apache.org/jira/browse/SPARK-22840
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Lior Chaga

Given the following schema:
{code}
root
 |-- id: string (nullable = true)
 |-- start_time: long (nullable = true)
 |-- stats: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- calibratedRecsHistory: double (nullable = true)
 |    |    |-- eventTime: long (nullable = true)
 |    |    |-- itemId: long (nullable = true)
 |    |    |-- recsHistory: long (nullable = true)
{code}

The data contains multiple rows per id and start_time, and the stats array for a given id and start_time is identical in all of those rows.

I've noticed inconsistent results when using a window with FIRST(stats) ordered DESC versus LAST(stats) ordered ASC. Specifically, the latter (LAST with ASC) produces more results.

This is the query that shows the difference:
{code}
SELECT DISTINCT
    id,
    LAST(stats) over w
FROM sample
WINDOW w AS (PARTITION BY id SORT BY start_time ASC)
except
SELECT DISTINCT
    id,
    FIRST(stats) over w
FROM sample
WINDOW w AS (PARTITION BY id SORT BY start_time DESC)
{code}

Each of the subqueries should return the stats for the latest start_time, partitioned by id.

Swapping the two subqueries around the except returns nothing, so every row produced by FIRST with DESC also appears in the output of LAST with ASC, but not the other way around. The query with FIRST and DESC produces correct results.

The data for sample is attached in sample.parquet.zip.
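
As an independent cross-check outside Spark, the expected output of each subquery (the stats of the latest start_time for every id) can be computed in plain Python. This is only a sketch under stated assumptions: the function name latest_stats_per_id and the sample rows are hypothetical and are not taken from sample.parquet, stats is represented as a hashable tuple standing in for the array of structs, and start_time is assumed to be non-null.

```python
from typing import Dict, Hashable, Iterable, Tuple

# A row is (id, start_time, stats). stats is a hashable tuple here,
# standing in for the array<struct> column of the real schema.
Row = Tuple[str, int, Hashable]


def latest_stats_per_id(rows: Iterable[Row]) -> Dict[str, Hashable]:
    """Return, for each id, the stats of the row with the greatest start_time."""
    best: Dict[str, Tuple[int, Hashable]] = {}
    for row_id, start_time, stats in rows:
        # Keep the stats seen at the largest start_time so far for this id.
        if row_id not in best or start_time > best[row_id][0]:
            best[row_id] = (start_time, stats)
    return {row_id: stats for row_id, (_, stats) in best.items()}


# Hypothetical rows (assumption, not real data): duplicates per
# (id, start_time) carry identical stats, as described in the report.
sample_rows = [
    ("a", 1, ("s-a-1",)),
    ("a", 1, ("s-a-1",)),
    ("a", 2, ("s-a-2",)),
    ("b", 5, ("s-b-5",)),
    ("b", 3, ("s-b-3",)),
]

expected = latest_stats_per_id(sample_rows)
print(expected)
```

If both window queries behaved as described, each would yield exactly one row per id matching this mapping, and the except would be empty in either direction.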