[jira] [Comment Edited] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals

2023-02-09 Thread Gera Shegalov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685384#comment-17685384
 ] 

Gera Shegalov edited comment on SPARK-41793 at 2/9/23 6:26 PM:
---

Another interpretation of why the pre-3.4 count of 1 may be actually correct 
could be that regardless of whether the window frame bound values overflow or 
not  the current row is always part of the window it defines. Whether or not it 
should be the case can be clarified in the doc.

UPDATE: if we consider the range of this query plan above {{RangeFrame, -10.23, 
6.79)}}.
{code}
spark-sql> select b - 10.23 as lower, b, b + 6.79 as upper from test_table;
11342371013783243717493546650944533.24  11342371013783243717493546650944543.47  
11342371013783243717493546650944550.26
9989.76 .99 
NULL
{code}

as consisting of the union of 
[lower; b) 
b 
(b; upper]

only the (b; upper] is undefined

the first [lower; b) and b are defined and have at least one row.  

either way not counting the current row is a counterintuitive result



was (Author: jira.shegalov):
Another interpretation of why the pre-3.4 count of 1 may be actually correct 
could be that regardless of whether the window frame bound values overflow or 
not  the current row is always part of the window it defines. Whether or not it 
should be the case can be clarified in the doc.

> Incorrect result for window frames defined by a range clause on large 
> decimals 
> ---
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gera Shegalov
>Priority: Blocker
>  Labels: correctness
>
> Context 
> https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686
> The following windowing query on a simple two-row input should produce two 
> non-empty windows as a result
> {code}
> from pprint import pprint
> data = [
>   ('9223372036854775807', '11342371013783243717493546650944543.47'),
>   ('9223372036854775807', '.99')
> ]
> df1 = spark.createDataFrame(data, 'a STRING, b STRING')
> df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
> df2.createOrReplaceTempView('test_table')
> df = sql('''
>   SELECT 
> COUNT(1) OVER (
>   PARTITION BY a 
>   ORDER BY b ASC 
>   RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
> ) AS CNT_1 
>   FROM 
> test_table
>   ''')
> res = df.collect()
> df.explain(True)
> pprint(res)
> {code}
> Spark 3.4.0-SNAPSHOT output:
> {code}
> [Row(CNT_1=1), Row(CNT_1=0)]
> {code}
> Spark 3.3.1 output as expected:
> {code}
> Row(CNT_1=1), Row(CNT_1=1)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals

2023-02-06 Thread Gera Shegalov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684903#comment-17684903
 ] 

Gera Shegalov edited comment on SPARK-41793 at 2/6/23 7:38 PM:
---

if the consensus is that it's not a correctness bug in 3.4,  then this fix 
should probably be documented and backported to maintenance branches?


was (Author: jira.shegalov):
if the consensus is that it's not a correctness bug in 3.4,  then this fix 
should probably be documented and probably backported to maintenance branches?

> Incorrect result for window frames defined by a range clause on large 
> decimals 
> ---
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gera Shegalov
>Priority: Blocker
>  Labels: correctness
>
> Context 
> https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686
> The following windowing query on a simple two-row input should produce two 
> non-empty windows as a result
> {code}
> from pprint import pprint
> data = [
>   ('9223372036854775807', '11342371013783243717493546650944543.47'),
>   ('9223372036854775807', '.99')
> ]
> df1 = spark.createDataFrame(data, 'a STRING, b STRING')
> df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
> df2.createOrReplaceTempView('test_table')
> df = sql('''
>   SELECT 
> COUNT(1) OVER (
>   PARTITION BY a 
>   ORDER BY b ASC 
>   RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
> ) AS CNT_1 
>   FROM 
> test_table
>   ''')
> res = df.collect()
> df.explain(True)
> pprint(res)
> {code}
> Spark 3.4.0-SNAPSHOT output:
> {code}
> [Row(CNT_1=1), Row(CNT_1=0)]
> {code}
> Spark 3.3.1 output as expected:
> {code}
> Row(CNT_1=1), Row(CNT_1=1)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals

2023-02-03 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683753#comment-17683753
 ] 

XiDuo You edited comment on SPARK-41793 at 2/3/23 9:03 AM:
---

I'm not sure this is a correctness bug but I think it's more like a right 
behavior change.

An example, what happened if run this query ?
{code:java}
select 1 where where (1 + .99) > 
.99;{code}
Without ANSI, it returns empty, so the size of count should return 0.


was (Author: ulysses):
I'm not sure this is a correctness bug but I think it's more like a right 
behavior change.

An example, what happened if run this query ?
{code:java}
select 1 + .99 {code}
Without ANSI, it returns null, so the size of count should return 0.

> Incorrect result for window frames defined by a range clause on large 
> decimals 
> ---
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gera Shegalov
>Priority: Blocker
>  Labels: correctness
>
> Context 
> https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686
> The following windowing query on a simple two-row input should produce two 
> non-empty windows as a result
> {code}
> from pprint import pprint
> data = [
>   ('9223372036854775807', '11342371013783243717493546650944543.47'),
>   ('9223372036854775807', '.99')
> ]
> df1 = spark.createDataFrame(data, 'a STRING, b STRING')
> df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
> df2.createOrReplaceTempView('test_table')
> df = sql('''
>   SELECT 
> COUNT(1) OVER (
>   PARTITION BY a 
>   ORDER BY b ASC 
>   RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
> ) AS CNT_1 
>   FROM 
> test_table
>   ''')
> res = df.collect()
> df.explain(True)
> pprint(res)
> {code}
> Spark 3.4.0-SNAPSHOT output:
> {code}
> [Row(CNT_1=1), Row(CNT_1=0)]
> {code}
> Spark 3.3.1 output as expected:
> {code}
> Row(CNT_1=1), Row(CNT_1=1)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org