[ 
https://issues.apache.org/jira/browse/SPARK-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lior Chaga updated SPARK-22840:
-------------------------------
    Description: 
Given the following schema:
{code}
root
 |-- id: string (nullable = true)
 |-- start_time: long (nullable = true)
 |-- stats: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- calibratedRecsHistory: double (nullable = true)
 |    |    |-- eventTime: long (nullable = true)
 |    |    |-- itemId: long (nullable = true)
 |    |    |-- recsHistory: long (nullable = true)
{code}

Data contains multiple rows per id and start_time, with all stats elements for 
a specific id and start_time is identical in all rows, I've noticed 
inconsistent results when using window with FIRST(stats) DESC, and LAST(stats) 
ASC.
Specifically, the latter (LAST with ASC) produces more results.

This is the query for seeing that:

{code}
SELECT DISTINCT
        id ,
        LAST(stats) over w
        FROM sample
        WINDOW w AS (PARTITION BY id  SORT BY start_time DESC)
except        
SELECT DISTINCT
        id ,
        FIRST(stats) over w 
        FROM sample
        WINDOW w AS (PARTITION BY id  SORT BY start_time ASC)
{code}

Each of the subqueries should return the stats for the latest start_time, 
partitioned by id.
Changing the order of the subqueries returns nothing...
The query with FIRST and ASC produces correct results.

the data for sample is attached in [^sample.parquet.zip]

  was:
Given the following schema:
{code}
root
 |-- id: string (nullable = true)
 |-- start_time: long (nullable = true)
 |-- stats: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- calibratedRecsHistory: double (nullable = true)
 |    |    |-- eventTime: long (nullable = true)
 |    |    |-- itemId: long (nullable = true)
 |    |    |-- recsHistory: long (nullable = true)
{code}

Data contains multiple rows per id and start_time, with all stats elements for 
a specific id and start_time is identical in all rows, I've noticed 
inconsistent results when using window with FIRST(stats) DESC, and LAST(stats) 
ASC.
Specifically, the latter (LAST with ASC) produces more results.

This is the query for seeing that:

{code}
SELECT DISTINCT
        id ,
        LAST(stats) over w
        FROM sample
        WINDOW w AS (PARTITION BY id  SORT BY start_time DESC)
except        
SELECT DISTINCT
        id ,
        FIRST(stats) over w 
        FROM sample
        WINDOW w AS (PARTITION BY id  SORT BY start_time ASC)
{code}

Each of the subqueries should return the stats for the latest start_time, 
partitioned by id.
Changing the order of the subqueries returns nothing...
The query with FIRST and ASC produces correct results.

the data for sample is attached in sample.parquet.zip


> Incorrect results when using distinct on window
> -----------------------------------------------
>
>                 Key: SPARK-22840
>                 URL: https://issues.apache.org/jira/browse/SPARK-22840
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Lior Chaga
>         Attachments: sample.parquet.zip
>
>
> Given the following schema:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- start_time: long (nullable = true)
>  |-- stats: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- calibratedRecsHistory: double (nullable = true)
>  |    |    |-- eventTime: long (nullable = true)
>  |    |    |-- itemId: long (nullable = true)
>  |    |    |-- recsHistory: long (nullable = true)
> {code}
> Data contains multiple rows per id and start_time, with all stats elements 
> for a specific id and start_time is identical in all rows, I've noticed 
> inconsistent results when using window with FIRST(stats) DESC, and 
> LAST(stats) ASC.
> Specifically, the latter (LAST with ASC) produces more results.
> This is the query for seeing that:
> {code}
> SELECT DISTINCT
>         id ,
>         LAST(stats) over w
>         FROM sample
>         WINDOW w AS (PARTITION BY id  SORT BY start_time DESC)
> except        
> SELECT DISTINCT
>         id ,
>         FIRST(stats) over w 
>         FROM sample
>         WINDOW w AS (PARTITION BY id  SORT BY start_time ASC)
> {code}
> Each of the subqueries should return the stats for the latest start_time, 
> partitioned by id.
> Changing the order of the subqueries returns nothing...
> The query with FIRST and ASC produces correct results.
> the data for sample is attached in [^sample.parquet.zip]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to