[jira] [Commented] (SPARK-33726) Duplicate field names causes wrong answers during aggregation

    [ https://issues.apache.org/jira/browse/SPARK-33726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277644#comment-17277644 ]

Apache Spark commented on SPARK-33726:
--------------------------------------

User 'yliou' has created a pull request for this issue:
https://github.com/apache/spark/pull/31447

> Duplicate field names causes wrong answers during aggregation
> -------------------------------------------------------------
>
>                 Key: SPARK-33726
>                 URL: https://issues.apache.org/jira/browse/SPARK-33726
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4, 3.0.1
>            Reporter: Yian Liou
>            Assignee: Yian Liou
>            Priority: Major
>              Labels: correctness
>             Fix For: 2.4.8, 3.0.2, 3.1.1
>
>
> We saw this bug at Workday.
> Duplicate field names for different fields can cause org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch#allocate to return a fixed batch when it should have returned a variable batch leading to wrong results.
> This example produces wrong results in the spark shell:
> scala> sql("with T as (select id as a, -id as x from range(3)), U as (select id as b, cast(id as string) as x from range(3)) select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x").show
>
> |*x*|*x*|*ma*|*mb*|
> |-2|2|0|null|
> |-1|1|null|1|
> |0|0|0|0|
>
> instead of correct output :
>
> |*x*|*x*|*ma*|*mb*|
> |0|0|0|0|
> |-2|2|2|2|
> |-1|1|1|1|
>
> The issue can be solved by iterating over the fields themselves instead of field names.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
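
The root cause named in the description (deciding the batch kind by looking fields up through their names) can be illustrated without Spark. The following is a minimal sketch, not the Spark source: `Field`, `BatchKindSketch`, and both method names are hypothetical. It also assumes that a name-based lookup goes through a map in which a later duplicate name overwrites an earlier one. Under that assumption, the grouping keys from the report (`U.x` is a string, `T.x` is a bigint) both resolve to the bigint field, so the name-based check wrongly concludes that every key is fixed length.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a schema field; this is not Spark's StructField.
final class Field {
    final String name;
    final boolean fixedLength; // e.g. true for bigint, false for string

    Field(String name, boolean fixedLength) {
        this.name = name;
        this.fixedLength = fixedLength;
    }
}

final class BatchKindSketch {
    // Name-based check: each field is resolved again through its name.
    // Assumption of this sketch: the lookup map keeps only one field per name.
    static boolean allFixedLengthByName(List<Field> schema) {
        Map<String, Field> byName = new HashMap<>();
        for (Field f : schema) {
            byName.put(f.name, f); // a later duplicate overwrites an earlier one
        }
        boolean allFixed = true;
        for (Field f : schema) {
            allFixed = allFixed && byName.get(f.name).fixedLength;
        }
        return allFixed;
    }

    // Field-based check: iterate over the fields themselves, as the report suggests.
    static boolean allFixedLengthByField(List<Field> schema) {
        boolean allFixed = true;
        for (Field f : schema) {
            allFixed = allFixed && f.fixedLength;
        }
        return allFixed;
    }

    public static void main(String[] args) {
        // Grouping keys from the report: U.x (string) and T.x (bigint) share the name "x".
        List<Field> keys = Arrays.asList(new Field("x", false), new Field("x", true));
        System.out.println(allFixedLengthByName(keys));  // prints "true": a fixed batch would be chosen
        System.out.println(allFixedLengthByField(keys)); // prints "false": a variable batch is needed
    }
}
```

With unique field names the two checks agree; they diverge only when a fixed-length field and a variable-length field share a name, which matches the condition described in the report.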
[jira] [Commented] (SPARK-33726) Duplicate field names causes wrong answers during aggregation

    [ https://issues.apache.org/jira/browse/SPARK-33726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271600#comment-17271600 ]

Apache Spark commented on SPARK-33726:
--------------------------------------

User 'yliou' has created a pull request for this issue:
https://github.com/apache/spark/pull/31327

> Duplicate field names causes wrong answers during aggregation
> -------------------------------------------------------------
>
>                 Key: SPARK-33726
>                 URL: https://issues.apache.org/jira/browse/SPARK-33726
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4, 3.0.1
>            Reporter: Yian Liou
>            Assignee: Yian Liou
>            Priority: Major
>              Labels: correctness
>             Fix For: 3.0.2, 3.1.1
>
>
> We saw this bug at Workday.
> Duplicate field names for different fields can cause org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch#allocate to return a fixed batch when it should have returned a variable batch leading to wrong results.
> This example produces wrong results in the spark shell:
> scala> sql("with T as (select id as a, -id as x from range(3)), U as (select id as b, cast(id as string) as x from range(3)) select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x").show
>
> |*x*|*x*|*ma*|*mb*|
> |-2|2|0|null|
> |-1|1|null|1|
> |0|0|0|0|
>
> instead of correct output :
>
> |*x*|*x*|*ma*|*mb*|
> |0|0|0|0|
> |-2|2|2|2|
> |-1|1|1|1|
>
> The issue can be solved by iterating over the fields themselves instead of field names.
[jira] [Commented] (SPARK-33726) Duplicate field names causes wrong answers during aggregation

    [ https://issues.apache.org/jira/browse/SPARK-33726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249858#comment-17249858 ]

Apache Spark commented on SPARK-33726:
--------------------------------------

User 'yliou' has created a pull request for this issue:
https://github.com/apache/spark/pull/30788

> Duplicate field names causes wrong answers during aggregation
> -------------------------------------------------------------
>
>                 Key: SPARK-33726
>                 URL: https://issues.apache.org/jira/browse/SPARK-33726
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4, 3.0.1
>            Reporter: Yian Liou
>            Priority: Major
>              Labels: correctness
>
> We saw this bug at Workday.
> Duplicate field names for different fields can cause org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch#allocate to return a fixed batch when it should have returned a variable batch leading to wrong results.
> This example produces wrong results in the spark shell:
> scala> sql("with T as (select id as a, -id as x from range(3)), U as (select id as b, cast(id as string) as x from range(3)) select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x").show
>
> |*x*|*x*|*ma*|*mb*|
> |-2|2|0|null|
> |-1|1|null|1|
> |0|0|0|0|
>
> instead of correct output :
>
> |*x*|*x*|*ma*|*mb*|
> |0|0|0|0|
> |-2|2|2|2|
> |-1|1|1|1|
>
> The issue can be solved by iterating over the fields themselves instead of field names.
[jira] [Commented] (SPARK-33726) Duplicate field names causes wrong answers during aggregation

    [ https://issues.apache.org/jira/browse/SPARK-33726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246812#comment-17246812 ]

Yian Liou commented on SPARK-33726:
-----------------------------------

Will create a PR for the issue.

> Duplicate field names causes wrong answers during aggregation
> -------------------------------------------------------------
>
>                 Key: SPARK-33726
>                 URL: https://issues.apache.org/jira/browse/SPARK-33726
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4, 3.0.1
>            Reporter: Yian Liou
>            Priority: Major
>
> We saw this bug at Workday.
> Duplicate field names for different fields can cause org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch#allocate to return a fixed batch when it should have returned a variable batch leading to wrong results.
> This example produces wrong results in the spark shell:
> scala> sql("with T as (select id as a, -id as x from range(3)), U as (select id as b, cast(id as string) as x from range(3)) select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x").show
>
> |*x*|*x*|*ma*|*mb*|
> |-2|2|0|null|
> |-1|1|null|1|
> |0|0|0|0|
>
> instead of correct output :
>
> |*x*|*x*|*ma*|*mb*|
> |0|0|0|0|
> |-2|2|2|2|
> |-1|1|1|1|
>
> The issue can be solved by iterating over the fields themselves instead of field names.