[jira] [Comment Edited] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685384#comment-17685384 ] Gera Shegalov edited comment on SPARK-41793 at 2/9/23 6:26 PM: --- Another interpretation of why the pre-3.4 count of 1 may be actually correct could be that regardless of whether the window frame bound values overflow or not the current row is always part of the window it defines. Whether or not it should be the case can be clarified in the doc. UPDATE: if we consider the range of this query plan above {{RangeFrame, -10.23, 6.79)}}. {code} spark-sql> select b - 10.23 as lower, b, b + 6.79 as upper from test_table; 11342371013783243717493546650944533.24 11342371013783243717493546650944543.47 11342371013783243717493546650944550.26 9989.76 .99 NULL {code} as consisting of the union of [lower; b) b (b; upper] only the (b; upper] is undefined the first [lower; b) and b are defined and have at least one row. either way not counting the current row is a counterintuitive result was (Author: jira.shegalov): Another interpretation of why the pre-3.4 count of 1 may be actually correct could be that regardless of whether the window frame bound values overflow or not the current row is always part of the window it defines. Whether or not it should be the case can be clarified in the doc. > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684903#comment-17684903 ] Gera Shegalov edited comment on SPARK-41793 at 2/6/23 7:38 PM: --- if the consensus is that it's not a correctness bug in 3.4, then this fix should probably be documented and backported to maintenance branches? was (Author: jira.shegalov): if the consensus is that it's not a correctness bug in 3.4, then this fix should probably be documented and probably backported to maintenance branches? > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683753#comment-17683753 ] XiDuo You edited comment on SPARK-41793 at 2/3/23 9:03 AM: --- I'm not sure this is a correctness bug but I think it's more like a right behavior change. An example, what happened if run this query ? {code:java} select 1 where where (1 + .99) > .99;{code} Without ANSI, it returns empty, so the size of count should return 0. was (Author: ulysses): I'm not sure this is a correctness bug but I think it's more like a right behavior change. An example, what happened if run this query ? {code:java} select 1 + .99 {code} Without ANSI, it returns null, so the size of count should return 0. > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org