[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685384#comment-17685384 ]

Gera Shegalov edited comment on SPARK-41793 at 2/9/23 6:26 PM:
---------------------------------------------------------------

Another interpretation, under which the pre-3.4 count of 1 is actually correct,
is that regardless of whether the window frame bound values overflow, the
current row is always part of the window it defines. Whether that should be the
case can be clarified in the docs.

UPDATE: consider the range frame from the query plan above,
{{RangeFrame, -10.23, 6.79}}, applied to each row's value of b:
{code}
spark-sql> select b - 10.23 as lower, b, b + 6.79 as upper from test_table;
11342371013783243717493546650944533.24	11342371013783243717493546650944543.47	11342371013783243717493546650944550.26
999999999999999999999999999999999989.76	999999999999999999999999999999999999.99	NULL
{code}

Each frame can be viewed as the union of three pieces:
[lower; b)
b
(b; upper]

Only (b; upper] is undefined, because upper overflows DECIMAL(38,2) and becomes
NULL. Both [lower; b) and b itself are well defined, and the singleton b
contributes at least one row: the current row.

Either way, not counting the current row is a counterintuitive result.
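A minimal Python sketch of this interval reasoning, outside of Spark. The DECIMAL(38,2) maximum, the modeling of an overflowed bound as None (NULL), and the choice to treat a NULL bound as unbounded are assumptions made for illustration, not Spark internals:

{code}
from decimal import Decimal

# Assumed max for DECIMAL(38,2): 36 integer digits, 2 fractional digits.
DEC38_2_MAX = Decimal("9" * 36 + ".99")


def frame_bounds(b, lower_off=Decimal("-10.23"), upper_off=Decimal("6.79")):
    """Compute the frame [b + lower_off, b + upper_off].

    A bound that overflows DECIMAL(38,2) is modeled as None (NULL).
    """
    lower = b + lower_off
    upper = b + upper_off
    return (
        lower if abs(lower) <= DEC38_2_MAX else None,
        upper if abs(upper) <= DEC38_2_MAX else None,
    )


def count_in_frame(rows, b):
    """Count partition rows that fall into b's range frame.

    Treating a NULL (overflowed) bound as unbounded on that side keeps the
    current row inside its own frame, matching the pre-3.4 count of 1;
    Spark 3.4 instead yields an empty frame (count 0) for the second row.
    """
    lower, upper = frame_bounds(b)
    return sum(
        1
        for r in rows
        if (lower is None or r >= lower) and (upper is None or r <= upper)
    )


rows = [
    Decimal("11342371013783243717493546650944543.47"),
    Decimal("999999999999999999999999999999999999.99"),
]
for b in rows:
    print(b, count_in_frame(rows, b))  # each row counts itself: 1 and 1
{code}

The same conclusion holds if a NULL upper bound is instead clamped to b (counting only the defined part of the union): the singleton b still qualifies, so neither interpretation produces a count of 0.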



was (Author: jira.shegalov):
Another interpretation of why the pre-3.4 count of 1 may be actually correct 
could be that regardless of whether the window frame bound values overflow or 
not  the current row is always part of the window it defines. Whether or not it 
should be the case can be clarified in the doc.

> Incorrect result for window frames defined by a range clause on large 
> decimals 
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-41793
>                 URL: https://issues.apache.org/jira/browse/SPARK-41793
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Gera Shegalov
>            Priority: Blocker
>              Labels: correctness
>
> Context 
> https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686
> The following windowing query on a simple two-row input should produce two 
> non-empty windows as a result
> {code}
> from pprint import pprint
> data = [
>   ('9223372036854775807', '11342371013783243717493546650944543.47'),
>   ('9223372036854775807', '999999999999999999999999999999999999.99')
> ]
> df1 = spark.createDataFrame(data, 'a STRING, b STRING')
> df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
> df2.createOrReplaceTempView('test_table')
> df = sql('''
>   SELECT 
>     COUNT(1) OVER (
>       PARTITION BY a 
>       ORDER BY b ASC 
>       RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
>     ) AS CNT_1 
>   FROM 
>     test_table
>   ''')
> res = df.collect()
> df.explain(True)
> pprint(res)
> {code}
> Spark 3.4.0-SNAPSHOT output:
> {code}
> [Row(CNT_1=1), Row(CNT_1=0)]
> {code}
> Spark 3.3.1 output as expected:
> {code}
> [Row(CNT_1=1), Row(CNT_1=1)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
