Lior Chaga created SPARK-22840:
----------------------------------

             Summary: Incorrect results when using distinct on window
                 Key: SPARK-22840
                 URL: https://issues.apache.org/jira/browse/SPARK-22840
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Lior Chaga


Given the following schema:
{code}
root
 |-- id: string (nullable = true)
 |-- start_time: long (nullable = true)
 |-- stats: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- calibratedRecsHistory: double (nullable = true)
 |    |    |-- eventTime: long (nullable = true)
 |    |    |-- itemId: long (nullable = true)
 |    |    |-- recsHistory: long (nullable = true)
{code}

Data contains multiple rows per id and start_time, with all stats elements for 
a specific id and start_time is identical in all rows, I've noticed 
inconsistent results when using window with FIRST(stats) DESC, and LAST(stats) 
ASC.
Specifically, the latter (LAST with ASC) produces more results.

This is the query for seeing that:

{code}
SELECT DISTINCT
        id ,
        LAST(stats) over w
        FROM sample
        WINDOW w AS (PARTITION BY id  SORT BY start_time DESC)
except        
SELECT DISTINCT
        id ,
        FIRST(stats) over w 
        FROM sample
        WINDOW w AS (PARTITION BY id  SORT BY start_time ASC)
{code}

Each of the subqueries should return the stats for the latest start_time, 
partitioned by id.
Changing the order of the subqueries returns nothing...
The query with FIRST and ASC produces correct results.

the data for sample is attached in sample.parquet.zip



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to