[ https://issues.apache.org/jira/browse/SPARK-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303281#comment-16303281 ]

Denys Zadorozhnyi edited comment on SPARK-22840 at 12/25/17 2:14 PM:
---------------------------------------------------------------------

[~lio...@taboola.com] [~liorchaga] If you don't specify a frame for the window, it defaults to {{RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW}}, which makes the {{LAST}} function always return the current row's value (see [https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#rows-between-and-range-between-clauses]).
If you explicitly specify the frame, you should get the result you are expecting:
{code}
SELECT DISTINCT
        id,
        LAST(stats) OVER w
FROM sample
WINDOW w AS (PARTITION BY id SORT BY start_time DESC
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
EXCEPT
SELECT DISTINCT
        id,
        FIRST(stats) OVER w
FROM sample
WINDOW w AS (PARTITION BY id SORT BY start_time ASC
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
{code}


> Incorrect results when using distinct on window
> -----------------------------------------------
>
>                 Key: SPARK-22840
>                 URL: https://issues.apache.org/jira/browse/SPARK-22840
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Lior Chaga
>         Attachments: sample.parquet.zip
>
>
> Given the following schema:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- start_time: long (nullable = true)
>  |-- stats: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- calibratedRecsHistory: double (nullable = true)
>  |    |    |-- eventTime: long (nullable = true)
>  |    |    |-- itemId: long (nullable = true)
>  |    |    |-- recsHistory: long (nullable = true)
> {code}
> The data contains multiple rows per id and start_time, with all stats elements 
> for a specific id and start_time identical across all rows. I've noticed 
> inconsistent results when using a window with FIRST(stats) DESC and 
> LAST(stats) ASC.
> Specifically, the latter (LAST with ASC) produces more results.
> This is the query that shows it:
> {code}
> SELECT DISTINCT
>         id ,
>         LAST(stats) over w
>         FROM sample
>         WINDOW w AS (PARTITION BY id  SORT BY start_time DESC)
> except        
> SELECT DISTINCT
>         id ,
>         FIRST(stats) over w 
>         FROM sample
>         WINDOW w AS (PARTITION BY id  SORT BY start_time ASC)
> {code}
> Each of the subqueries should return the stats for the latest start_time, 
> partitioned by id.
> Changing the order of the subqueries returns nothing...
> The query with FIRST and ASC produces correct results.
> The data for the sample is attached in [^sample.parquet.zip].


