[jira] [Assigned] (SPARK-39906) Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'
[ https://issues.apache.org/jira/browse/SPARK-39906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39906:
------------------------------------

    Assignee: Apache Spark

> Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-39906
>                 URL: https://issues.apache.org/jira/browse/SPARK-39906
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.4.0
>            Reporter: BingKun Pan
>            Assignee: Apache Spark
>            Priority: Minor
>             Fix For: 3.4.0
>
> 1.
> ./2_Run Build modules catalyst, hive-thriftserver.txt:2022-07-27T01:23:12.1294533Z [warn] sbt 0.13 shell syntax is deprecated; use slash syntax instead: examples / Test / package, avro / Test / package, network-shuffle / Test / package, sketch / Test / package, unsafe / Test / package, launcher / Test / package, network-yarn / Test / package, streaming / Test / package, catalyst / Test / package, hive-thriftserver / Test / package, kvstore / Test / package, core / Test / package, ganglia-lgpl / Test / package, hadoop-cloud / Test / package, streaming-kinesis-asl-assembly / Test / package, assembly / Test / package, network-common / Test / package, sql / Test / package, streaming-kafka-0-10-assembly / Test / package, mllib / Test / package, streaming-kinesis-asl / Test / package, docker-integration-tests / Test / package, kubernetes / Test / package, yarn / Test / package, tags / Test / package, graphx / Test / package, token-provider-kafka-0-10 / Test / package, mesos / Test / package, streaming-kafka-0-10 / Test / package, hive / Test / package, tools / Test / package, mllib-local / Test / package, repl / Test / package, sql-kafka-0-10 / Test / package, Test / package
> ./Run Build modules pyspark-core, pyspark-streaming, pyspark-ml/11_Run tests.txt:2022-07-27T01:25:30.0840251Z [warn] sbt 0.13 shell syntax is deprecated; use slash syntax instead: network-yarn / Test / package, network-shuffle / Test / package, sketch / Test / package, yarn / Test / package, sql / Test / package, core / Test / package, hive-thriftserver / Test / package, ganglia-lgpl / Test / package, streaming / Test / package, streaming-kinesis-asl / Test / package, docker-integration-tests / Test / package, kubernetes / Test / package, launcher / Test / package, streaming-kinesis-asl-assembly / Test / package, tags / Test / package, assembly / Test / package, mllib-local / Test / package, token-provider-kafka-0-10 / Test / package, repl / Test / package, graphx / Test / package, sql-kafka-0-10 / Test / package, mesos / Test / package, streaming-kafka-0-10 / Test / package, streaming-kafka-0-10-assembly / Test / package, examples / Test / package, tools / Test / package, avro / Test / package, hadoop-cloud / Test / package, mllib / Test / package, kvstore / Test / package, hive / Test / package, catalyst / Test / package, network-common / Test / package, unsafe / Test / package, Test / package
[jira] [Assigned] (SPARK-39906) Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'
[ https://issues.apache.org/jira/browse/SPARK-39906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39906:
------------------------------------

    Assignee: (was: Apache Spark)

> Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-39906
>                 URL: https://issues.apache.org/jira/browse/SPARK-39906
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.4.0
>            Reporter: BingKun Pan
>            Priority: Minor
>             Fix For: 3.4.0
[jira] [Commented] (SPARK-39906) Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'
[ https://issues.apache.org/jira/browse/SPARK-39906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572248#comment-17572248 ]

Apache Spark commented on SPARK-39906:
--------------------------------------

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37326

> Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-39906
>                 URL: https://issues.apache.org/jira/browse/SPARK-39906
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.4.0
>            Reporter: BingKun Pan
>            Priority: Minor
>             Fix For: 3.4.0
[jira] [Created] (SPARK-39906) Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'
BingKun Pan created SPARK-39906:
-----------------------------------

             Summary: Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'
                 Key: SPARK-39906
                 URL: https://issues.apache.org/jira/browse/SPARK-39906
             Project: Spark
          Issue Type: Improvement
          Components: Build
    Affects Versions: 3.4.0
            Reporter: BingKun Pan
             Fix For: 3.4.0
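For context, the deprecated sbt 0.13 shell syntax separates configuration and task with a colon, while sbt 1.x prefers the slash form that the warning suggests. A minimal before/after sketch, using module names taken from the warning text:

```
# sbt 0.13 shell syntax (triggers the deprecation warning under sbt 1.x):
catalyst/test:package
hive-thriftserver/test:package

# equivalent sbt 1.x slash syntax:
catalyst / Test / package
hive-thriftserver / Test / package
```

Eliminating the warning amounts to updating the build scripts to issue commands in the slash form.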
[jira] [Assigned] (SPARK-39904) Rename inferDate to preferDate and fix an issue when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-39904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39904:
------------------------------------

    Assignee: Apache Spark

> Rename inferDate to preferDate and fix an issue when inferring schema
> ---------------------------------------------------------------------
>
>                 Key: SPARK-39904
>                 URL: https://issues.apache.org/jira/browse/SPARK-39904
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Ivan Sadikov
>            Assignee: Apache Spark
>            Priority: Major
>
> Follow-up for https://issues.apache.org/jira/browse/SPARK-39469.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39904) Rename inferDate to preferDate and fix an issue when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-39904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39904:
------------------------------------

    Assignee: (was: Apache Spark)

> Rename inferDate to preferDate and fix an issue when inferring schema
> ---------------------------------------------------------------------
>
>                 Key: SPARK-39904
>                 URL: https://issues.apache.org/jira/browse/SPARK-39904
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Ivan Sadikov
>            Priority: Major
>
> Follow-up for https://issues.apache.org/jira/browse/SPARK-39469.
[jira] [Commented] (SPARK-39904) Rename inferDate to preferDate and fix an issue when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-39904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572240#comment-17572240 ]

Apache Spark commented on SPARK-39904:
--------------------------------------

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/37327

> Rename inferDate to preferDate and fix an issue when inferring schema
> ---------------------------------------------------------------------
>
>                 Key: SPARK-39904
>                 URL: https://issues.apache.org/jira/browse/SPARK-39904
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Ivan Sadikov
>            Priority: Major
>
> Follow-up for https://issues.apache.org/jira/browse/SPARK-39469.
[jira] [Assigned] (SPARK-39844) Restrict adding DEFAULT columns for existing tables to allowlist of supported data source types
[ https://issues.apache.org/jira/browse/SPARK-39844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang reassigned SPARK-39844:
--------------------------------------

    Assignee: Daniel

> Restrict adding DEFAULT columns for existing tables to allowlist of supported data source types
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-39844
>                 URL: https://issues.apache.org/jira/browse/SPARK-39844
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Daniel
>            Assignee: Daniel
>            Priority: Major
[jira] [Resolved] (SPARK-39844) Restrict adding DEFAULT columns for existing tables to allowlist of supported data source types
[ https://issues.apache.org/jira/browse/SPARK-39844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang resolved SPARK-39844.
------------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37256
[https://github.com/apache/spark/pull/37256]

> Restrict adding DEFAULT columns for existing tables to allowlist of supported data source types
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-39844
>                 URL: https://issues.apache.org/jira/browse/SPARK-39844
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Daniel
>            Assignee: Daniel
>            Priority: Major
>             Fix For: 3.4.0
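The change gates the ALTER TABLE ... ADD COLUMN ... DEFAULT feature behind a list of data sources known to support it. A hypothetical sketch of such an allowlist check (the provider set and function names are illustrative, not Spark's actual implementation):

```python
# Illustrative allowlist of data source providers permitted to add
# DEFAULT columns to existing tables; the exact set is an assumption.
ALLOWED_PROVIDERS = {"csv", "json", "orc", "parquet"}

def supports_default_columns(provider: str) -> bool:
    """Return True only for providers on the allowlist."""
    return provider.lower() in ALLOWED_PROVIDERS
```

Requests against providers outside the allowlist would be rejected with an analysis error rather than silently producing a column whose default the source cannot honor.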
[jira] [Assigned] (SPARK-39905) Remove checkErrorClass()
[ https://issues.apache.org/jira/browse/SPARK-39905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39905:
------------------------------------

    Assignee: Apache Spark (was: Max Gekk)

> Remove checkErrorClass()
> ------------------------
>
>                 Key: SPARK-39905
>                 URL: https://issues.apache.org/jira/browse/SPARK-39905
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Max Gekk
>            Assignee: Apache Spark
>            Priority: Major
>
> Replace all invocations of checkErrorClass() with checkError() and remove checkErrorClass().
[jira] [Assigned] (SPARK-39905) Remove checkErrorClass()
[ https://issues.apache.org/jira/browse/SPARK-39905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39905:
------------------------------------

    Assignee: Max Gekk (was: Apache Spark)

> Remove checkErrorClass()
> ------------------------
>
>                 Key: SPARK-39905
>                 URL: https://issues.apache.org/jira/browse/SPARK-39905
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>
> Replace all invocations of checkErrorClass() with checkError() and remove checkErrorClass().
[jira] [Commented] (SPARK-39905) Remove checkErrorClass()
[ https://issues.apache.org/jira/browse/SPARK-39905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572238#comment-17572238 ]

Apache Spark commented on SPARK-39905:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37322

> Remove checkErrorClass()
> ------------------------
>
>                 Key: SPARK-39905
>                 URL: https://issues.apache.org/jira/browse/SPARK-39905
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>
> Replace all invocations of checkErrorClass() with checkError() and remove checkErrorClass().
[jira] [Created] (SPARK-39905) Remove checkErrorClass()
Max Gekk created SPARK-39905:
--------------------------------

             Summary: Remove checkErrorClass()
                 Key: SPARK-39905
                 URL: https://issues.apache.org/jira/browse/SPARK-39905
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Max Gekk
            Assignee: Max Gekk

Replace all invocations of checkErrorClass() with checkError() and remove checkErrorClass().
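The consolidation works because a helper that checks only the error class is a special case of one that checks the class plus its message parameters. A hypothetical Python sketch of the idea (Spark's actual helpers are Scala test utilities; the names and attribute shapes here are illustrative):

```python
def check_error(exc, error_class, parameters=None):
    """General assertion: verify the error class and, when given, the message parameters."""
    assert exc.error_class == error_class, (
        f"expected error class {error_class!r}, got {exc.error_class!r}"
    )
    if parameters is not None:
        assert exc.parameters == parameters

# The old checkErrorClass() was equivalent to calling check_error with no
# parameters argument, so every call site can migrate to check_error and
# the narrower helper can be deleted.
```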
[jira] [Updated] (SPARK-39904) Rename inferDate to preferDate and fix an issue when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-39904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Sadikov updated SPARK-39904:
---------------------------------
    Description: Follow-up for https://issues.apache.org/jira/browse/SPARK-39469.

> Rename inferDate to preferDate and fix an issue when inferring schema
> ---------------------------------------------------------------------
>
>                 Key: SPARK-39904
>                 URL: https://issues.apache.org/jira/browse/SPARK-39904
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Ivan Sadikov
>            Priority: Major
>
> Follow-up for https://issues.apache.org/jira/browse/SPARK-39469.
[jira] [Created] (SPARK-39904) Rename inferDate to preferDate and fix an issue when inferring schema
Ivan Sadikov created SPARK-39904:
------------------------------------

             Summary: Rename inferDate to preferDate and fix an issue when inferring schema
                 Key: SPARK-39904
                 URL: https://issues.apache.org/jira/browse/SPARK-39904
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Ivan Sadikov
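The option being renamed controls whether CSV schema inference prefers DateType over StringType when every value in a column parses as a date; "prefer" describes that behavior better than "infer". A minimal Python sketch of the preference (illustrative only, not Spark's inference code; a fixed yyyy-MM-dd pattern is assumed):

```python
from datetime import datetime

def infer_column_type(values, prefer_date=True):
    """Return 'date' only when preferred and every value parses as yyyy-MM-dd; else 'string'."""
    if prefer_date:
        try:
            for v in values:
                datetime.strptime(v, "%Y-%m-%d")
            return "date"
        except ValueError:
            # A single non-date value demotes the whole column to string.
            return "string"
    return "string"
```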
[jira] [Resolved] (SPARK-39899) Incorrect passing of message parameters in InvalidUDFClassException
[ https://issues.apache.org/jira/browse/SPARK-39899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk resolved SPARK-39899.
------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37323
[https://github.com/apache/spark/pull/37323]

> Incorrect passing of message parameters in InvalidUDFClassException
> -------------------------------------------------------------------
>
>                 Key: SPARK-39899
>                 URL: https://issues.apache.org/jira/browse/SPARK-39899
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>             Fix For: 3.4.0
>
> In fact, messageParameters is not passed to AnalysisException; it is used only to form the error message.
[jira] [Resolved] (SPARK-39889) Use different error classes for numeric/interval divided by 0
[ https://issues.apache.org/jira/browse/SPARK-39889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk resolved SPARK-39889.
------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37313
[https://github.com/apache/spark/pull/37313]

> Use different error classes for numeric/interval divided by 0
> -------------------------------------------------------------
>
>                 Key: SPARK-39889
>                 URL: https://issues.apache.org/jira/browse/SPARK-39889
>             Project: Spark
>          Issue Type: Task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Gengliang Wang
>            Assignee: Gengliang Wang
>            Priority: Major
>             Fix For: 3.4.0
>
> Currently, when numbers are divided by 0 under ANSI mode, the error message is like
> {quote}[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "ansi_mode" to "false" (except for ANSI interval type) to bypass this error.{quote}
> The "(except for ANSI interval type)" part is confusing. We should remove it and add a new error class "INTERVAL_DIVIDED_BY_ZERO".
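The error message above contrasts two behaviors: under ANSI mode a division by zero raises a DIVIDE_BY_ZERO error, while `try_divide` tolerates the zero divisor and returns NULL. A minimal Python sketch of those two semantics (function names are illustrative, not a Spark API):

```python
def ansi_divide(a, b):
    """ANSI-mode semantics: a zero divisor is an error."""
    if b == 0:
        raise ArithmeticError("[DIVIDE_BY_ZERO] Division by zero.")
    return a / b

def try_divide(a, b):
    """try_divide semantics: a zero divisor yields None (NULL) instead of an error."""
    return None if b == 0 else a / b
```

Splitting interval division into its own error class lets the numeric message drop the confusing "(except for ANSI interval type)" caveat, since interval division by zero fails regardless of the ANSI setting.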
[jira] [Commented] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full
[ https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572212#comment-17572212 ] wenweijian commented on SPARK-34788: I got this FileNotFoundException when the hdfs(/tmp/logs/root/bucket-logs-tfile) disk is full. when I delele some file in that dir, the exception disappeared. logs: {code:java} org.apache.spark.shuffle.FetchFailedException: Error in reading FileSegmentManagedBuffer[file=/home/install/hadoop/data2/hadoop-3.3.0/nm-local-dir/usercache/root/appcache/application_1658221318180_0002/blockmgr-001c6d1a-bda3-4f8e-8998-42e3a8fbaaa9/0d/shuffle_0_7622_0.data,offset=57062119,length=26794] at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:770) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:649) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70) at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:200) at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:128) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.io.IOException: Error in reading FileSegmentManagedBuffer[file=/home/install/hadoop/data2/hadoop-3.3.0/nm-local-dir/usercache/root/appcache/application_1658221318180_0002/blockmgr-001c6d1a-bda3-4f8e-8998-42e3a8fbaaa9/0d/shuffle_0_7622_0.data,offset=57062119,length=26794] at org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:112) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:637) ... 23 more Caused by: java.io.FileNotFoundException: /home/install/hadoop/data2/hadoop-3.3.0/nm-local-dir/usercache/root/appcache/application_1658221318180_0002/blockmgr-001c6d1a-bda3-4f8e-8998-42e3a8fbaaa9/0d/shuffle_0_7622_0.data (No such file or directory) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.(FileInputStream.java:138) at org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:101) ... 
24 more {code} > Spark throws FileNotFoundException instead of IOException when disk is full > --- > > Key: SPARK-34788 > URL: https://issues.apache.org/jira/browse/SPARK-34788 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: wuyi >Priority: Major > > When the disk is full, Spark throws FileNotFoundException instead of > IOException with the hint. It's quite a confusing error to users: > {code:java} > 9/03/26 09:03:45 ERROR ShuffleBlockFetcherIterator: Failed to create input > stream from local block > java.io.IOException: Error in reading > FileSegmentManagedBuffer{file=/local_disk0/spark-c2f26f02-2572-4764-815a-cbba65ddb315/executor-b4b76a4c-788c-4cb6-b904-664a883be1aa/blockmgr-36804371-24fe-4131-a3dc-00b7f98f3a3e/11/shuffle_113_1029_0.data, > offset=110254956, length=1875458} > at > org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:111) > at > org.apache.spark.storage.ShuffleBlo
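The confusion reported above — a disk-full condition surfacing as a bare FileNotFoundException — can be mitigated by attaching a free-space hint at the point the file is opened. The sketch below is a hypothetical Python helper, not Spark code (the real fix would live in FileSegmentManagedBuffer.createInputStream on the JVM side); the function name and message format are assumptions for illustration:

```python
import os
import shutil


def open_shuffle_segment(path: str):
    """Open a shuffle data file, adding a disk-usage hint when it is missing.

    A missing shuffle file often means the disk filled up and the write was
    dropped, so surface free-space information instead of a bare
    FileNotFoundError.
    """
    try:
        return open(path, "rb")
    except FileNotFoundError as e:
        parent = os.path.dirname(path) or "."
        hint = ""
        if os.path.isdir(parent):
            usage = shutil.disk_usage(parent)
            if usage.free == 0:
                hint = " (disk is full; free some space and retry)"
            else:
                hint = f" ({usage.free} bytes free on the volume)"
        raise IOError(f"Error reading shuffle segment {path}{hint}") from e
```

The caller then sees an IOError whose message distinguishes "file never written because the disk was full" from a genuinely missing path.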
[jira] [Resolved] (SPARK-39890) Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering
[ https://issues.apache.org/jira/browse/SPARK-39890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-39890. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37318 [https://github.com/apache/spark/pull/37318] > Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering > --- > > Key: SPARK-39890 > URL: https://issues.apache.org/jira/browse/SPARK-39890 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Minor > Fix For: 3.4.0 > > > AliasAwareOutputOrdering can save a sort if the project inside > TakeOrderedAndProjectExec has an alias for the sort order. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
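The optimization described in SPARK-39890 can be illustrated outside Spark: if TakeOrderedAndProjectExec sorts by `a` and then projects `a AS b`, the output is still sorted, now under the name `b`, so a downstream sort on `b` can be elided. A toy sketch of propagating an ordering through a projection's alias map (the helper name and shape are hypothetical, not Spark's AliasAwareOutputOrdering API):

```python
def project_ordering(ordering, aliases):
    """Rewrite a sort order through a projection's alias map.

    ordering: list of column names the child output is sorted by.
    aliases:  dict mapping child column -> output alias (the projection).
    Returns the ordering in terms of output columns, or None as soon as a
    sort column is not preserved by the projection (ordering is lost).
    """
    result = []
    for col in ordering:
        if col not in aliases:
            return None  # ordering not recoverable after this projection
        result.append(aliases[col])
    return result
```

With `project_ordering(["a"], {"a": "b"})` the ordering survives as `["b"]`, whereas a projection that drops `a` returns None and would force a re-sort.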
[jira] [Assigned] (SPARK-39890) Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering
[ https://issues.apache.org/jira/browse/SPARK-39890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-39890: --- Assignee: XiDuo You > Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering > --- > > Key: SPARK-39890 > URL: https://issues.apache.org/jira/browse/SPARK-39890 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Fix For: 3.4.0 > > > AliasAwareOutputOrdering can save a sort if the project inside > TakeOrderedAndProjectExec has an alias for the sort order. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39903) Reenable TPC-DS q72 in GitHub Actions
Hyukjin Kwon created SPARK-39903: Summary: Reenable TPC-DS q72 in GitHub Actions Key: SPARK-39903 URL: https://issues.apache.org/jira/browse/SPARK-39903 Project: Spark Issue Type: Test Components: Tests Affects Versions: 3.3.0, 3.4.0 Reporter: Hyukjin Kwon https://github.com/apache/spark/pull/37289 disabled TPC-DS q72 in GitHub Actions. We should reenable this to recover the test coverage. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39900) Issue with querying dataframe produced by 'binaryFile' format using 'not' operator
[ https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572201#comment-17572201 ] Hyukjin Kwon commented on SPARK-39900: -- Please go ahead for a PR [~Zing] > Issue with querying dataframe produced by 'binaryFile' format using 'not' > operator > -- > > Key: SPARK-39900 > URL: https://issues.apache.org/jira/browse/SPARK-39900 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0 >Reporter: Benoit Roy >Priority: Minor > > When creating a dataframe using the binaryFile format I am encountering weird > result when filtering/query with the 'not' operator. > > Here's a repo that will help describe and reproduce the issue. > [https://github.com/cccs-br/spark-binaryfile-issue] > {code:java} > g...@github.com:cccs-br/spark-binaryfile-issue.git {code} > > Here's a very simple test case that illustrate what's going on: > [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala] > TLDR; > {code:java} >test("binary file dataframe") { > // load files in directly into df using 'binaryFile' format. > // > // - src/test/resources/files/ > // - test1.csv > // - test2.json > // - test3.txt > val df = spark > .read > .format("binaryFile") > .load("src/test/resources/files") > df.createOrReplaceTempView("files") > // This works as expected. > val like_count = spark.sql("select * from files where path like > '%.csv'").count() > assert(like_count === 1) > // This does not work as expected. > val not_like_count = spark.sql("select * from files where path not like > '%.csv'").count() > assert(not_like_count === 2) > // This used to work in 3.2.1 > // df.filter(col("path").endsWith(".csv") === false).show() > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
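For reference, the counts the test above expects follow directly from SQL LIKE semantics: exactly one of the three paths matches `'%.csv'`, so NOT LIKE must match the other two. That can be checked with a plain-Python approximation, no Spark needed (`sql_like` is a hypothetical helper mapping LIKE wildcards onto fnmatch):

```python
from fnmatch import fnmatch

# Paths mirroring the three test files from the reproduction repo.
paths = [
    "src/test/resources/files/test1.csv",
    "src/test/resources/files/test2.json",
    "src/test/resources/files/test3.txt",
]


def sql_like(value: str, pattern: str) -> bool:
    """Approximate SQL LIKE with fnmatch: '%' -> '*', '_' -> '?'."""
    return fnmatch(value, pattern.replace("%", "*").replace("_", "?"))


like_count = sum(sql_like(p, "%.csv") for p in paths)          # expect 1
not_like_count = sum(not sql_like(p, "%.csv") for p in paths)  # expect 2
```

Any result other than 1 and 2 from the corresponding Spark SQL queries over the binaryFile dataframe indicates the filter pushdown bug being reported.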
[jira] [Commented] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572193#comment-17572193 ] Sumeet commented on SPARK-39902: An example of this change can be seen while viewing the Iceberg scans on SparkUI. h2. Before this change: !Screen Shot 2022-07-27 at 6.39.48 PM.png|width=415,height=211! h2. After this change: !Screen Shot 2022-07-27 at 6.38.56 PM.png|width=430,height=216! > Add Scan details to spark plan scan node in SparkUI > --- > > Key: SPARK-39902 > URL: https://issues.apache.org/jira/browse/SPARK-39902 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.3.1 >Reporter: Sumeet >Priority: Major > Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot > 2022-07-27 at 6.00.50 PM.png, Screen Shot 2022-07-27 at 6.38.56 PM.png, > Screen Shot 2022-07-27 at 6.39.48 PM.png > > > Hi, > For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" > as opposed to "Scan ". > Add a method "String name()" to the Scan interface, that "BatchScanExec" can > invoke to set the node name the plan. This nodeName will be eventually used > by "SparkPlanGraphNode" to display it in the header of the UI node. > > DSv1 > !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212! > > DSv2 > !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumeet updated SPARK-39902: --- Attachment: Screen Shot 2022-07-27 at 6.39.48 PM.png > Add Scan details to spark plan scan node in SparkUI > --- > > Key: SPARK-39902 > URL: https://issues.apache.org/jira/browse/SPARK-39902 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.3.1 >Reporter: Sumeet >Priority: Major > Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot > 2022-07-27 at 6.00.50 PM.png, Screen Shot 2022-07-27 at 6.38.56 PM.png, > Screen Shot 2022-07-27 at 6.39.48 PM.png > > > Hi, > For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" > as opposed to "Scan ". > Add a method "String name()" to the Scan interface, that "BatchScanExec" can > invoke to set the node name the plan. This nodeName will be eventually used > by "SparkPlanGraphNode" to display it in the header of the UI node. > > DSv1 > !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212! > > DSv2 > !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumeet updated SPARK-39902: --- Attachment: Screen Shot 2022-07-27 at 6.38.56 PM.png > Add Scan details to spark plan scan node in SparkUI > --- > > Key: SPARK-39902 > URL: https://issues.apache.org/jira/browse/SPARK-39902 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.3.1 >Reporter: Sumeet >Priority: Major > Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot > 2022-07-27 at 6.00.50 PM.png, Screen Shot 2022-07-27 at 6.38.56 PM.png, > Screen Shot 2022-07-27 at 6.39.48 PM.png > > > Hi, > For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" > as opposed to "Scan ". > Add a method "String name()" to the Scan interface, that "BatchScanExec" can > invoke to set the node name the plan. This nodeName will be eventually used > by "SparkPlanGraphNode" to display it in the header of the UI node. > > DSv1 > !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212! > > DSv2 > !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572192#comment-17572192 ] Apache Spark commented on SPARK-39902: -- User 'sumeetgajjar' has created a pull request for this issue: https://github.com/apache/spark/pull/37325 > Add Scan details to spark plan scan node in SparkUI > --- > > Key: SPARK-39902 > URL: https://issues.apache.org/jira/browse/SPARK-39902 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.3.1 >Reporter: Sumeet >Priority: Major > Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot > 2022-07-27 at 6.00.50 PM.png > > > Hi, > For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" > as opposed to "Scan ". > Add a method "String name()" to the Scan interface, that "BatchScanExec" can > invoke to set the node name the plan. This nodeName will be eventually used > by "SparkPlanGraphNode" to display it in the header of the UI node. > > DSv1 > !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212! > > DSv2 > !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39902: Assignee: Apache Spark > Add Scan details to spark plan scan node in SparkUI > --- > > Key: SPARK-39902 > URL: https://issues.apache.org/jira/browse/SPARK-39902 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.3.1 >Reporter: Sumeet >Assignee: Apache Spark >Priority: Major > Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot > 2022-07-27 at 6.00.50 PM.png > > > Hi, > For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" > as opposed to "Scan ". > Add a method "String name()" to the Scan interface, that "BatchScanExec" can > invoke to set the node name the plan. This nodeName will be eventually used > by "SparkPlanGraphNode" to display it in the header of the UI node. > > DSv1 > !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212! > > DSv2 > !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39902: Assignee: (was: Apache Spark) > Add Scan details to spark plan scan node in SparkUI > --- > > Key: SPARK-39902 > URL: https://issues.apache.org/jira/browse/SPARK-39902 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.3.1 >Reporter: Sumeet >Priority: Major > Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot > 2022-07-27 at 6.00.50 PM.png > > > Hi, > For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" > as opposed to "Scan ". > Add a method "String name()" to the Scan interface, that "BatchScanExec" can > invoke to set the node name the plan. This nodeName will be eventually used > by "SparkPlanGraphNode" to display it in the header of the UI node. > > DSv1 > !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212! > > DSv2 > !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumeet updated SPARK-39902: --- Description: Hi, For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" as opposed to "Scan ". Add a method "String name()" to the Scan interface, that "BatchScanExec" can invoke to set the node name the plan. This nodeName will be eventually used by "SparkPlanGraphNode" to display it in the header of the UI node. DSv1 !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212! DSv2 !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277! was: Hi, For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" as opposed to "Scan ". Add a method "String name()" to the Scan interface, that "BatchScanExec" can invoke to set the node name the plan. This nodeName will be eventually used by "SparkPlanGraphNode" to display it in the header of the UI node. DSv1 > Add Scan details to spark plan scan node in SparkUI > --- > > Key: SPARK-39902 > URL: https://issues.apache.org/jira/browse/SPARK-39902 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.3.1 >Reporter: Sumeet >Priority: Major > Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot > 2022-07-27 at 6.00.50 PM.png > > > Hi, > For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" > as opposed to "Scan ". > Add a method "String name()" to the Scan interface, that "BatchScanExec" can > invoke to set the node name the plan. This nodeName will be eventually used > by "SparkPlanGraphNode" to display it in the header of the UI node. > > DSv1 > !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212! > > DSv2 > !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277! 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumeet updated SPARK-39902: --- Attachment: Screen Shot 2022-07-27 at 6.00.50 PM.png > Add Scan details to spark plan scan node in SparkUI > --- > > Key: SPARK-39902 > URL: https://issues.apache.org/jira/browse/SPARK-39902 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.3.1 >Reporter: Sumeet >Priority: Major > Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot > 2022-07-27 at 6.00.50 PM.png > > > Hi, > For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" > as opposed to "Scan ". > Add a method "String name()" to the Scan interface, that "BatchScanExec" can > invoke to set the node name the plan. This nodeName will be eventually used > by "SparkPlanGraphNode" to display it in the header of the UI node. > > DSv1 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumeet updated SPARK-39902: --- Attachment: Screen Shot 2022-07-27 at 6.00.27 PM.png > Add Scan details to spark plan scan node in SparkUI > --- > > Key: SPARK-39902 > URL: https://issues.apache.org/jira/browse/SPARK-39902 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.3.1 >Reporter: Sumeet >Priority: Major > Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png > > > Hi, > For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" > as opposed to "Scan ". > Add a method "String name()" to the Scan interface, that "BatchScanExec" can > invoke to set the node name the plan. This nodeName will be eventually used > by "SparkPlanGraphNode" to display it in the header of the UI node. > > DSv1 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumeet updated SPARK-39902: --- Attachment: (was: Screen Shot 2022-07-27 at 6.00.27 PM.png) > Add Scan details to spark plan scan node in SparkUI > --- > > Key: SPARK-39902 > URL: https://issues.apache.org/jira/browse/SPARK-39902 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.3.1 >Reporter: Sumeet >Priority: Major > > Hi, > For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" > as opposed to "Scan ". > Add a method "String name()" to the Scan interface, that "BatchScanExec" can > invoke to set the node name the plan. This nodeName will be eventually used > by "SparkPlanGraphNode" to display it in the header of the UI node. > > DSv1 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumeet updated SPARK-39902: --- Attachment: Screen Shot 2022-07-27 at 6.00.27 PM.png > Add Scan details to spark plan scan node in SparkUI > --- > > Key: SPARK-39902 > URL: https://issues.apache.org/jira/browse/SPARK-39902 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.3.1 >Reporter: Sumeet >Priority: Major > > Hi, > For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" > as opposed to "Scan ". > Add a method "String name()" to the Scan interface, that "BatchScanExec" can > invoke to set the node name the plan. This nodeName will be eventually used > by "SparkPlanGraphNode" to display it in the header of the UI node. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumeet updated SPARK-39902: --- Description: Hi, For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" as opposed to "Scan ". Add a method "String name()" to the Scan interface, that "BatchScanExec" can invoke to set the node name the plan. This nodeName will be eventually used by "SparkPlanGraphNode" to display it in the header of the UI node. DSv1 was: Hi, For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" as opposed to "Scan ". Add a method "String name()" to the Scan interface, that "BatchScanExec" can invoke to set the node name the plan. This nodeName will be eventually used by "SparkPlanGraphNode" to display it in the header of the UI node. > Add Scan details to spark plan scan node in SparkUI > --- > > Key: SPARK-39902 > URL: https://issues.apache.org/jira/browse/SPARK-39902 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.3.1 >Reporter: Sumeet >Priority: Major > > Hi, > For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" > as opposed to "Scan ". > Add a method "String name()" to the Scan interface, that "BatchScanExec" can > invoke to set the node name the plan. This nodeName will be eventually used > by "SparkPlanGraphNode" to display it in the header of the UI node. > > DSv1 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI
[ https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572186#comment-17572186 ] Sumeet commented on SPARK-39902: I'm working on it and will publish a patch soon. > Add Scan details to spark plan scan node in SparkUI > --- > > Key: SPARK-39902 > URL: https://issues.apache.org/jira/browse/SPARK-39902 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.3.1 >Reporter: Sumeet >Priority: Major > > Hi, > For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" > as opposed to "Scan ". > Add a method "String name()" to the Scan interface, that "BatchScanExec" can > invoke to set the node name the plan. This nodeName will be eventually used > by "SparkPlanGraphNode" to display it in the header of the UI node. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI
Sumeet created SPARK-39902: -- Summary: Add Scan details to spark plan scan node in SparkUI Key: SPARK-39902 URL: https://issues.apache.org/jira/browse/SPARK-39902 Project: Spark Issue Type: Improvement Components: SQL, Web UI Affects Versions: 3.3.1 Reporter: Sumeet Hi, For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" as opposed to "Scan ". Add a method "String name()" to the Scan interface that "BatchScanExec" can invoke to set the node name in the plan. This nodeName will eventually be used by "SparkPlanGraphNode" to display it in the header of the UI node. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
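The shape of the proposal can be sketched compactly. The actual change adds `String name()` to the Java DSv2 Scan interface; the Python rendering below is hypothetical, and `IcebergScan` is only an illustrative custom source:

```python
class Scan:
    """Hypothetical Python rendering of the DSv2 Scan interface."""

    def name(self) -> str:
        # Default keeps today's behavior: a generic label.
        return "BatchScan"


class IcebergScan(Scan):
    """Illustrative custom source overriding name() for a readable UI node."""

    def name(self) -> str:
        return "IcebergScan"


def node_header(scan: Scan) -> str:
    # What SparkPlanGraphNode would render in the plan node's header.
    return scan.name()
```

A default implementation on the interface keeps existing sources source-compatible: only sources that override `name()` change what the UI displays.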
[jira] [Commented] (SPARK-38496) Improve the test coverage for pyspark/sql module
[ https://issues.apache.org/jira/browse/SPARK-38496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572180#comment-17572180 ] Haejoon Lee commented on SPARK-38496: - I'm working on this > Improve the test coverage for pyspark/sql module > > > Key: SPARK-38496 > URL: https://issues.apache.org/jira/browse/SPARK-38496 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Currently, sql module has 90% of test coverage. > We could improve the test coverage by adding the missing tests for sql module. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38493) Improve the test coverage for pyspark/pandas module
[ https://issues.apache.org/jira/browse/SPARK-38493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38493. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37294 [https://github.com/apache/spark/pull/37294] > Improve the test coverage for pyspark/pandas module > --- > > Key: SPARK-38493 > URL: https://issues.apache.org/jira/browse/SPARK-38493 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > Currently, pandas module (pandas API on Spark) has 94% of test coverage. > We could improve the test coverage by adding the missing tests for pandas > module. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39839) Handle special case of null variable-length Decimal with non-zero offsetAndSize in UnsafeRow structural integrity check
[ https://issues.apache.org/jira/browse/SPARK-39839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-39839. -- Fix Version/s: 3.3.1 3.2.3 3.4.0 Resolution: Fixed Issue resolved by pull request 37252 [https://github.com/apache/spark/pull/37252] > Handle special case of null variable-length Decimal with non-zero > offsetAndSize in UnsafeRow structural integrity check > --- > > Key: SPARK-39839 > URL: https://issues.apache.org/jira/browse/SPARK-39839 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.2.0, 3.3.0 >Reporter: Kris Mok >Assignee: Kris Mok >Priority: Major > Fix For: 3.3.1, 3.2.3, 3.4.0 > > > The {{UnsafeRow}} structural integrity check in > {{UnsafeRowUtils.validateStructuralIntegrity}} is added in Spark 3.1.0. It’s > supposed to validate that a given {{UnsafeRow}} conforms to the format that > the {{UnsafeRowWriter}} would have produced. > Currently the check expects all fields that are marked as null should also > have its field (i.e. the fixed-length part) set to all zeros. It needs to be > updated to handle a special case for variable-length {{{}Decimal{}}}s, where > the {{UnsafeRowWriter}} may mark a field as null but also leave the > fixed-length part of the field as {{OffsetAndSize(offset=current_offset, > size=0)}}. This may happen when the {{Decimal}} being written is either a > real {{null}} or has overflowed the specified precision. > Logic in {{UnsafeRowWriter}}: > in general: > {code:scala} > public void setNullAt(int ordinal) { > BitSetMethods.set(getBuffer(), startingOffset, ordinal); // set null bit > write(ordinal, 0L); // also zero out > the fixed-length field > } {code} > special case for {{DecimalType}}: > {code:scala} > // Make sure Decimal object has the same scale as DecimalType. > // Note that we may pass in null Decimal object to set null for it. 
> if (input == null || !input.changePrecision(precision, scale)) { > BitSetMethods.set(getBuffer(), startingOffset, ordinal); // set null > bit > // keep the offset for future update > setOffsetAndSize(ordinal, 0); // doesn't > zero out the fixed-length field > } {code} > The special case is introduced to allow all {{DecimalType}}s (including both > fixed-length and variable-length ones) to be mutable – thus need to leave > space for the variable-length field even if it’s currently null. > Note that this special case in {{UnsafeRowWriter}} has been there since Spark > 1.6.0, where as the integrity check was added in Spark 3.1.0. The check was > originally added for Structured Streaming’s checkpoint evolution validation, > so that a newer version of Spark can check whether or not an older checkpoint > file for Structured Streaming queries can be supported, and/or if the > contents of the checkpoint file is corrupted. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
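The relaxed rule can be stated compactly: a null field's 8-byte fixed-length word must be zero, unless the field is a variable-length decimal, in which case any offset is acceptable as long as the recorded size is zero. The predicate below is a hypothetical Python sketch of that check, not the actual `UnsafeRowUtils` code; it assumes UnsafeRow's `(offset << 32) | size` packing for the fixed-length word:

```python
def valid_null_field(is_null: bool, fixed_slot: int,
                     is_var_len_decimal: bool) -> bool:
    """Validate one field's fixed-length word the way the relaxed check must."""
    if not is_null:
        return True  # non-null fields are validated elsewhere
    if fixed_slot == 0:
        return True  # the general setNullAt path zeroes the slot
    if is_var_len_decimal:
        size = fixed_slot & 0xFFFFFFFF
        return size == 0  # null decimal keeps its offset; size must be 0
    return False  # any other non-zero slot on a null field is corruption
```

This mirrors why the original check rejected valid rows: `UnsafeRowWriter` has emitted the `(offset, size=0)` form for null variable-length decimals since Spark 1.6.0, long before the integrity check existed.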
[jira] [Assigned] (SPARK-38493) Improve the test coverage for pyspark/pandas module
[ https://issues.apache.org/jira/browse/SPARK-38493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38493: Assignee: Haejoon Lee > Improve the test coverage for pyspark/pandas module > --- > > Key: SPARK-38493 > URL: https://issues.apache.org/jira/browse/SPARK-38493 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > Currently, pandas module (pandas API on Spark) has 94% of test coverage. > We could improve the test coverage by adding the missing tests for pandas > module. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39839) Handle special case of null variable-length Decimal with non-zero offsetAndSize in UnsafeRow structural integrity check
[ https://issues.apache.org/jira/browse/SPARK-39839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-39839: Assignee: Kris Mok > Handle special case of null variable-length Decimal with non-zero > offsetAndSize in UnsafeRow structural integrity check > --- > > Key: SPARK-39839 > URL: https://issues.apache.org/jira/browse/SPARK-39839 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.2.0, 3.3.0 >Reporter: Kris Mok >Assignee: Kris Mok >Priority: Major > > The {{UnsafeRow}} structural integrity check in > {{UnsafeRowUtils.validateStructuralIntegrity}} is added in Spark 3.1.0. It’s > supposed to validate that a given {{UnsafeRow}} conforms to the format that > the {{UnsafeRowWriter}} would have produced. > Currently the check expects all fields that are marked as null should also > have its field (i.e. the fixed-length part) set to all zeros. It needs to be > updated to handle a special case for variable-length {{{}Decimal{}}}s, where > the {{UnsafeRowWriter}} may mark a field as null but also leave the > fixed-length part of the field as {{OffsetAndSize(offset=current_offset, > size=0)}}. This may happen when the {{Decimal}} being written is either a > real {{null}} or has overflowed the specified precision. > Logic in {{UnsafeRowWriter}}: > in general: > {code:scala} > public void setNullAt(int ordinal) { > BitSetMethods.set(getBuffer(), startingOffset, ordinal); // set null bit > write(ordinal, 0L); // also zero out > the fixed-length field > } {code} > special case for {{DecimalType}}: > {code:scala} > // Make sure Decimal object has the same scale as DecimalType. > // Note that we may pass in null Decimal object to set null for it. 
> if (input == null || !input.changePrecision(precision, scale)) { > BitSetMethods.set(getBuffer(), startingOffset, ordinal); // set null > bit > // keep the offset for future update > setOffsetAndSize(ordinal, 0); // doesn't > zero out the fixed-length field > } {code} > The special case is introduced to allow all {{DecimalType}}s (including both > fixed-length and variable-length ones) to be mutable – thus it needs to leave > space for the variable-length field even if it’s currently null. > Note that this special case in {{UnsafeRowWriter}} has been there since Spark > 1.6.0, whereas the integrity check was added in Spark 3.1.0. The check was > originally added for Structured Streaming’s checkpoint evolution validation, > so that a newer version of Spark can check whether an older checkpoint > file for Structured Streaming queries can be supported, and/or if the > contents of the checkpoint file are corrupted. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
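The relaxed validation described above can be sketched in plain Java. This is an illustrative model, not Spark's actual `UnsafeRowUtils` code: `isValidNullFieldSlot` and its parameters are hypothetical names. It relies only on the stated encoding that a variable-length field's fixed-length slot packs `(offset << 32) | size`.

```java
// Hypothetical sketch of the relaxed null-field check: a null field normally
// requires an all-zero fixed-length slot, but a null variable-length Decimal
// may instead carry OffsetAndSize(offset=current_offset, size=0).
public class NullFieldCheck {
    static boolean isValidNullFieldSlot(long fixedSlot, boolean isVariableLengthDecimal) {
        if (fixedSlot == 0L) {
            return true; // the common case: null bit set, slot fully zeroed
        }
        if (isVariableLengthDecimal) {
            // UnsafeRowWriter keeps the offset so the field stays mutable in place;
            // only the size (low 32 bits) must be zero.
            int size = (int) fixedSlot;
            return size == 0;
        }
        return false; // any other non-zero slot for a null field is corruption
    }

    public static void main(String[] args) {
        long offsetAndSizeZero = 64L << 32; // offset=64, size=0
        System.out.println(isValidNullFieldSlot(0L, false));                // true
        System.out.println(isValidNullFieldSlot(offsetAndSizeZero, true));  // true
        System.out.println(isValidNullFieldSlot(offsetAndSizeZero, false)); // false
    }
}
```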
[jira] [Commented] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572169#comment-17572169 ] zhiming she commented on SPARK-39743: - [~hyukjin.kwon] Can you mark this issue as `Resolved`? > Unable to set zstd compression level while writing parquet files > > > Key: SPARK-39743 > URL: https://issues.apache.org/jira/browse/SPARK-39743 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Priority: Minor > > While writing zstd compressed parquet files, the following setting > `spark.io.compression.zstd.level` does not have any effect with regard to > the compression level of zstd. > All files seem to be written with the default zstd compression level, and the > config option seems to be ignored. > Using the zstd cli tool, we confirmed that setting a higher compression level > for the same file tested in spark resulted in a smaller file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
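The reporter's zstd CLI observation reflects a general codec property: a higher compression level trades CPU time for smaller output. A minimal, self-contained illustration using `java.util.zip.Deflater` as a stand-in codec (the JDK has no built-in zstd bindings, so this only demonstrates the level/size relationship, not zstd itself):

```java
import java.util.zip.Deflater;

// Shows that a higher codec level yields smaller (or equal) output on
// compressible data. Deflater stands in for zstd purely for illustration.
public class LevelDemo {
    static int compressedSize(byte[] input, int level) {
        Deflater d = new Deflater(level);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 64];
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(buf); // accumulate compressed bytes produced
        }
        d.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] data = "spark ".repeat(10_000).getBytes();
        int fast = compressedSize(data, Deflater.BEST_SPEED);       // level 1
        int best = compressedSize(data, Deflater.BEST_COMPRESSION); // level 9
        // For this repetitive input, best should not exceed fast.
        System.out.println(fast + " vs " + best + " bytes");
    }
}
```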
[jira] [Commented] (SPARK-39900) Issue with querying dataframe produced by 'binaryFile' format using 'not' operator
[ https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572166#comment-17572166 ] shezm commented on SPARK-39900: --- I can try to fix this issue. > Issue with querying dataframe produced by 'binaryFile' format using 'not' > operator > -- > > Key: SPARK-39900 > URL: https://issues.apache.org/jira/browse/SPARK-39900 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0 >Reporter: Benoit Roy >Priority: Minor > > When creating a dataframe using the binaryFile format I am encountering weird > results when filtering/querying with the 'not' operator. > > Here's a repo that will help describe and reproduce the issue. > [https://github.com/cccs-br/spark-binaryfile-issue] > {code:java} > g...@github.com:cccs-br/spark-binaryfile-issue.git {code} > > Here's a very simple test case that illustrates what's going on: > [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala] > TLDR; > {code:java} >    test("binary file dataframe") { > // load files in directly into df using 'binaryFile' format. > // > // - src/test/resources/files/ > // - test1.csv > // - test2.json > // - test3.txt > val df = spark > .read > .format("binaryFile") > .load("src/test/resources/files") > df.createOrReplaceTempView("files") > // This works as expected. > val like_count = spark.sql("select * from files where path like > '%.csv'").count() > assert(like_count === 1) > // This does not work as expected. > val not_like_count = spark.sql("select * from files where path not like > '%.csv'").count() > assert(not_like_count === 2) > // This used to work in 3.2.1 > // df.filter(col("path").endsWith(".csv") === false).show() > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
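For reference, the counts the test above expects follow directly from standard SQL `LIKE` semantics. A small self-contained sketch (plain Java, no Spark; the naive `like` helper is illustrative only) confirms that of the three file paths, one matches `'%.csv'` and two match `NOT LIKE '%.csv'`:

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch of the expected SQL LIKE semantics from the reported test case:
// '%.csv' should match exactly one of the three files, and its negation two.
public class LikeDemo {
    // Minimal LIKE-to-regex conversion ('%' -> '.*', '_' -> '.'); illustrative only.
    static boolean like(String value, String pattern) {
        String regex = Pattern.quote(pattern)
                .replace("%", "\\E.*\\Q")
                .replace("_", "\\E.\\Q");
        return value.matches(regex);
    }

    public static void main(String[] args) {
        List<String> paths = List.of(
                "src/test/resources/files/test1.csv",
                "src/test/resources/files/test2.json",
                "src/test/resources/files/test3.txt");
        long likeCount = paths.stream().filter(p -> like(p, "%.csv")).count();
        long notLikeCount = paths.stream().filter(p -> !like(p, "%.csv")).count();
        System.out.println(likeCount + " " + notLikeCount); // expected: 1 2
    }
}
```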
[jira] [Commented] (SPARK-39722) Make Dataset.showString() public
[ https://issues.apache.org/jira/browse/SPARK-39722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572165#comment-17572165 ] Erik Krogen commented on SPARK-39722: - General +1 from me. We have some internal code that does exactly the {{Console.out}} redirection hack you described. > Make Dataset.showString() public > > > Key: SPARK-39722 > URL: https://issues.apache.org/jira/browse/SPARK-39722 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.8, 3.3.0 >Reporter: Jatin Sharma >Priority: Trivial > > Currently, we have {{.show}} APIs on a Dataset, but they print directly to > stdout. > But there are a lot of cases where we might need to get a String > representation of the show output. For example > * We have a logging framework to which we need to push the representation of > a df > * We have to send the string over a REST call from the driver > * We want to send the string to stderr instead of stdout > For such cases, currently one needs to do a hack by changing the Console.out > temporarily and catching the representation in a ByteArrayOutputStream or > similar, then extracting the string from it. > Strictly only printing to stdout seems like a limiting choice. > > Solution: > We expose APIs to return the String representation back. We already have the > .{{{}showString{}}} method internally. > > We could mirror the current {{.show}} APIS with a corresponding > {{.showString}} (and rename the internal private function to something else > if required) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
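The stdout-redirection workaround described above can be sketched in plain Java (Spark is not involved here; the captured `Runnable` stands in for a `df.show()` call):

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Sketch of the redirection hack: temporarily swap System.out for a buffer,
// run the code that prints, then restore the real stdout and read the buffer.
public class CaptureShow {
    static String captureStdout(Runnable body) {
        PrintStream original = System.out;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            System.setOut(new PrintStream(buffer, true));
            body.run(); // anything printed here lands in the buffer
        } finally {
            System.setOut(original); // always restore the real stdout
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // In Spark this Runnable would be () -> df.show().
        String captured = captureStdout(() -> System.out.println("+---+\n| id|\n+---+"));
        System.err.println("captured " + captured.length() + " chars");
    }
}
```

A public `showString` API would make this dance unnecessary, which is the point of the ticket.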
[jira] [Created] (SPARK-39901) Reconsider design of ignoreCorruptFiles feature
Josh Rosen created SPARK-39901: -- Summary: Reconsider design of ignoreCorruptFiles feature Key: SPARK-39901 URL: https://issues.apache.org/jira/browse/SPARK-39901 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Josh Rosen I'm filing this ticket as a followup to the discussion at [https://github.com/apache/spark/pull/36775#issuecomment-1148136217] regarding the `ignoreCorruptFiles` feature: the current implementation is biased towards considering a broad range of IOExceptions to be corruption, but this is likely overly broad and might mis-identify transient errors as corruption (causing non-corrupt data to be erroneously discarded). SPARK-39389 fixes one instance of that problem, but we are still vulnerable to similar issues because of the overall design of this feature. I think we should reconsider the design of this feature: maybe we should switch the default behavior so that only an explicit allowlist of known corruption exceptions can cause files to be skipped. This could be done through involvement of other parts of the code, e.g. rewrapping exceptions into a `CorruptFileException` so higher layers can positively identify corruption. Any changes to behavior here could potentially impact users' jobs, so we'd need to think carefully about when we want to make the change (in a 3.x release? 4.x?) and how we want to provide escape hatches (e.g. configs to revert to the old behavior). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
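The allowlist-plus-rewrapping idea sketched in the ticket could look roughly like this. All names here (`CorruptionAllowlist`, `CorruptFileException`, the choice of `EOFException` as a known-corruption type) are hypothetical illustrations, not Spark code:

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.Set;

// Sketch: instead of treating any IOException as corruption, only rewrap
// exception types on an explicit allowlist; everything else is assumed
// transient and propagated so no data is silently discarded.
public class CorruptionAllowlist {
    static class CorruptFileException extends RuntimeException {
        CorruptFileException(Throwable cause) { super(cause); }
    }

    // Known-corruption types (illustrative choice).
    static final Set<Class<? extends IOException>> KNOWN_CORRUPTION =
            Set.of(EOFException.class);

    static void rethrow(IOException e) {
        if (KNOWN_CORRUPTION.contains(e.getClass())) {
            throw new CorruptFileException(e); // higher layers may skip the file
        }
        throw new RuntimeException(e);         // transient: fail, don't skip
    }
}
```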
[jira] [Resolved] (SPARK-39898) Upgrade kubernetes-client to 5.12.3
[ https://issues.apache.org/jira/browse/SPARK-39898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-39898. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37321 [https://github.com/apache/spark/pull/37321] > Upgrade kubernetes-client to 5.12.3 > --- > > Key: SPARK-39898 > URL: https://issues.apache.org/jira/browse/SPARK-39898 > Project: Spark > Issue Type: Bug > Components: Build, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39898) Upgrade kubernetes-client to 5.12.3
[ https://issues.apache.org/jira/browse/SPARK-39898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-39898: - Assignee: Dongjoon Hyun > Upgrade kubernetes-client to 5.12.3 > --- > > Key: SPARK-39898 > URL: https://issues.apache.org/jira/browse/SPARK-39898 > Project: Spark > Issue Type: Bug > Components: Build, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39857) V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate
[ https://issues.apache.org/jira/browse/SPARK-39857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572125#comment-17572125 ] Apache Spark commented on SPARK-39857: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/37324 > V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate > -- > > Key: SPARK-39857 > URL: https://issues.apache.org/jira/browse/SPARK-39857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.4.0 > > > When building V2 In Predicate in V2ExpressionBuilder, InSet.dataType (which > is BooleanType) is used to build the LiteralValue, InSet.child.dataType > should be used instead. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39857) V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate
[ https://issues.apache.org/jira/browse/SPARK-39857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572124#comment-17572124 ] Apache Spark commented on SPARK-39857: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/37324 > V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate > -- > > Key: SPARK-39857 > URL: https://issues.apache.org/jira/browse/SPARK-39857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.4.0 > > > When building V2 In Predicate in V2ExpressionBuilder, InSet.dataType (which > is BooleanType) is used to build the LiteralValue, InSet.child.dataType > should be used instead. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
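The type mix-up described in SPARK-39857 above can be modeled in a few lines. The stand-in types below (`DataType`, `LiteralValue`, `InSet`) are simplified sketches of the Catalyst classes, not their real definitions: the predicate itself evaluates to a boolean, but each literal inside it must carry the child expression's type.

```java
// Sketch of the reported bug: an IN predicate's own dataType is BOOLEAN,
// but the literal values inside it must use the child's dataType (e.g. INT).
public class InLiteralTypes {
    enum DataType { BOOLEAN, INT }

    record LiteralValue(Object value, DataType type) {}

    record InSet(DataType childDataType) {
        DataType dataType() { return DataType.BOOLEAN; } // type of the predicate itself
    }

    static LiteralValue buildLiteral(InSet in, Object v, boolean buggy) {
        // The buggy variant mirrors the issue: using InSet.dataType()
        // instead of InSet.child.dataType.
        DataType t = buggy ? in.dataType() : in.childDataType();
        return new LiteralValue(v, t);
    }

    public static void main(String[] args) {
        InSet in = new InSet(DataType.INT);
        System.out.println(buildLiteral(in, 1, true).type());  // BOOLEAN (wrong)
        System.out.println(buildLiteral(in, 1, false).type()); // INT (correct)
    }
}
```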
[jira] [Updated] (SPARK-39900) Issue with querying dataframe produced by 'binaryFile' format using 'not' operator
[ https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Roy updated SPARK-39900: --- Summary: Issue with querying dataframe produced by 'binaryFile' format using 'not' operator (was: Querying dataframe produced by 'binaryFile' format using 'not' operator) > Issue with querying dataframe produced by 'binaryFile' format using 'not' > operator > -- > > Key: SPARK-39900 > URL: https://issues.apache.org/jira/browse/SPARK-39900 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0 >Reporter: Benoit Roy >Priority: Minor > > When creating a dataframe using the binaryFile format I am encountering weird > result when filtering/query with the 'not' operator. > > Here's a repo that will help describe and reproduce the issue. > [https://github.com/cccs-br/spark-binaryfile-issue] > {code:java} > g...@github.com:cccs-br/spark-binaryfile-issue.git {code} > > Here's a very simple test case that illustrate what's going on: > [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala] > TLDR; > {code:java} >test("binary file dataframe") { > // load files in directly into df using 'binaryFile' format. > // > // - src/test/resources/files/ > // - test1.csv > // - test2.json > // - test3.txt > val df = spark > .read > .format("binaryFile") > .load("src/test/resources/files") > df.createOrReplaceTempView("files") > // This works as expected. > val like_count = spark.sql("select * from files where path like > '%.csv'").count() > assert(like_count === 1) > // This does not work as expected. 
> val not_like_count = spark.sql("select * from files where path not like > '%.csv'").count() > assert(not_like_count === 2) > // This used to work in 3.2.1 > // df.filter(col("path").endsWith(".csv") === false).show() > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39900) Querying dataframe produced by 'binaryFile' format using 'not' operator
[ https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Roy updated SPARK-39900: --- Description: When creating a dataframe using the binaryFile format I am encountering weird result when filtering/query with the 'not' operator. Here's a repo that will help describe and reproduce the issue. [https://github.com/cccs-br/spark-binaryfile-issue] {code:java} g...@github.com:cccs-br/spark-binaryfile-issue.git {code} Here's a very simple test case that illustrate what's going on: [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala] {code:java} test("binary file dataframe") { // load files in directly into df using 'binaryFile' format. // // - src/test/resources/files/ // - test1.csv // - test2.json // - test3.txt val df = spark .read .format("binaryFile") .load("src/test/resources/files") df.createOrReplaceTempView("files") // This works as expected. val like_count = spark.sql("select * from files where path like '%.csv'").count() assert(like_count === 1) // This does not work as expected. val not_like_count = spark.sql("select * from files where path not like '%.csv'").count() assert(not_like_count === 2) // This used to work in 3.2.1 // df.filter(col("path").endsWith(".csv") === false).show() }{code} was: When creating a dataframe using the binaryFile format I am encountering weird result when filtering/query with the 'not' operator. Here's a repo that will help describe and reproduce the issue. [https://github.com/cccs-br/spark-binaryfile-issue] {code:java} g...@github.com:cccs-br/spark-binaryfile-issue.git {code} Here's a very simple test case that illustrate what's going on: [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala] {code:java} test("binary file dataframe") { // load files in directly into df using 'binaryFile' format. 
val df = spark .read .format("binaryFile") .load("src/test/resources/files") df.createOrReplaceTempView("files") // This works as expected. val like_count = spark.sql("select * from files where path like '%.csv'").count() assert(like_count === 1) // This does not work as expected. val not_like_count = spark.sql("select * from files where path not like '%.csv'").count() assert(not_like_count === 2) // This used to work in 3.2.1 // df.filter(col("path").endsWith(".csv") === false).show() }{code} > Querying dataframe produced by 'binaryFile' format using 'not' operator > --- > > Key: SPARK-39900 > URL: https://issues.apache.org/jira/browse/SPARK-39900 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0 >Reporter: Benoit Roy >Priority: Minor > > When creating a dataframe using the binaryFile format I am encountering weird > result when filtering/query with the 'not' operator. > > Here's a repo that will help describe and reproduce the issue. > [https://github.com/cccs-br/spark-binaryfile-issue] > {code:java} > g...@github.com:cccs-br/spark-binaryfile-issue.git {code} > > Here's a very simple test case that illustrate what's going on: > [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala] > {code:java} >test("binary file dataframe") { > // load files in directly into df using 'binaryFile' format. > // > // - src/test/resources/files/ > // - test1.csv > // - test2.json > // - test3.txt > val df = spark > .read > .format("binaryFile") > .load("src/test/resources/files") > df.createOrReplaceTempView("files") > // This works as expected. > val like_count = spark.sql("select * from files where path like > '%.csv'").count() > assert(like_count === 1) > // This does not work as expected. 
> val not_like_count = spark.sql("select * from files where path not like > '%.csv'").count() > assert(not_like_count === 2) > // This used to work in 3.2.1 > // df.filter(col("path").endsWith(".csv") === false).show() > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39900) Querying dataframe produced by 'binaryFile' format using 'not' operator
[ https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Roy updated SPARK-39900: --- Description: When creating a dataframe using the binaryFile format I am encountering weird result when filtering/query with the 'not' operator. Here's a repo that will help describe and reproduce the issue. [https://github.com/cccs-br/spark-binaryfile-issue] {code:java} g...@github.com:cccs-br/spark-binaryfile-issue.git {code} Here's a very simple test case that illustrate what's going on: [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala] TLDR; {code:java} test("binary file dataframe") { // load files in directly into df using 'binaryFile' format. // // - src/test/resources/files/ // - test1.csv // - test2.json // - test3.txt val df = spark .read .format("binaryFile") .load("src/test/resources/files") df.createOrReplaceTempView("files") // This works as expected. val like_count = spark.sql("select * from files where path like '%.csv'").count() assert(like_count === 1) // This does not work as expected. val not_like_count = spark.sql("select * from files where path not like '%.csv'").count() assert(not_like_count === 2) // This used to work in 3.2.1 // df.filter(col("path").endsWith(".csv") === false).show() }{code} was: When creating a dataframe using the binaryFile format I am encountering weird result when filtering/query with the 'not' operator. Here's a repo that will help describe and reproduce the issue. [https://github.com/cccs-br/spark-binaryfile-issue] {code:java} g...@github.com:cccs-br/spark-binaryfile-issue.git {code} Here's a very simple test case that illustrate what's going on: [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala] {code:java} test("binary file dataframe") { // load files in directly into df using 'binaryFile' format. 
// // - src/test/resources/files/ // - test1.csv // - test2.json // - test3.txt val df = spark .read .format("binaryFile") .load("src/test/resources/files") df.createOrReplaceTempView("files") // This works as expected. val like_count = spark.sql("select * from files where path like '%.csv'").count() assert(like_count === 1) // This does not work as expected. val not_like_count = spark.sql("select * from files where path not like '%.csv'").count() assert(not_like_count === 2) // This used to work in 3.2.1 // df.filter(col("path").endsWith(".csv") === false).show() }{code} > Querying dataframe produced by 'binaryFile' format using 'not' operator > --- > > Key: SPARK-39900 > URL: https://issues.apache.org/jira/browse/SPARK-39900 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0 >Reporter: Benoit Roy >Priority: Minor > > When creating a dataframe using the binaryFile format I am encountering weird > result when filtering/query with the 'not' operator. > > Here's a repo that will help describe and reproduce the issue. > [https://github.com/cccs-br/spark-binaryfile-issue] > {code:java} > g...@github.com:cccs-br/spark-binaryfile-issue.git {code} > > Here's a very simple test case that illustrate what's going on: > [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala] > TLDR; > {code:java} >test("binary file dataframe") { > // load files in directly into df using 'binaryFile' format. > // > // - src/test/resources/files/ > // - test1.csv > // - test2.json > // - test3.txt > val df = spark > .read > .format("binaryFile") > .load("src/test/resources/files") > df.createOrReplaceTempView("files") > // This works as expected. > val like_count = spark.sql("select * from files where path like > '%.csv'").count() > assert(like_count === 1) > // This does not work as expected. 
> val not_like_count = spark.sql("select * from files where path not like > '%.csv'").count() > assert(not_like_count === 2) > // This used to work in 3.2.1 > // df.filter(col("path").endsWith(".csv") === false).show() > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39900) Querying dataframe produced by 'binaryFile' format using 'not' operator
[ https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Roy updated SPARK-39900: --- Summary: Querying dataframe produced by 'binaryFile' format using 'not' operator (was: Incorrect result when query dataframe produced by 'binaryFile' format) > Querying dataframe produced by 'binaryFile' format using 'not' operator > --- > > Key: SPARK-39900 > URL: https://issues.apache.org/jira/browse/SPARK-39900 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0 >Reporter: Benoit Roy >Priority: Minor > > When creating a dataframe using the binaryFile format I am encountering weird > result when filtering/query with the 'not' operator. > > Here's a repo that will help describe and reproduce the issue. > [https://github.com/cccs-br/spark-binaryfile-issue] > {code:java} > g...@github.com:cccs-br/spark-binaryfile-issue.git {code} > > Here's a very simple test case that illustrate what's going on: > [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala] > {code:java} >test("binary file dataframe") { > // load files in directly into df using 'binaryFile' format. > val df = spark > .read > .format("binaryFile") > .load("src/test/resources/files") > df.createOrReplaceTempView("files") > // This works as expected. > val like_count = spark.sql("select * from files where path like > '%.csv'").count() > assert(like_count === 1) > // This does not work as expected. > val not_like_count = spark.sql("select * from files where path not like > '%.csv'").count() > assert(not_like_count === 2) > // This used to work in 3.2.1 > // df.filter(col("path").endsWith(".csv") === false).show() > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39900) Incorrect result when query dataframe produced by 'binaryFile' format
[ https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Roy updated SPARK-39900: --- Description: When creating a dataframe using the binaryFile format I am encountering weird result when filtering/query with the 'not' operator. Here's a repo that will help describe and reproduce the issue. [https://github.com/cccs-br/spark-binaryfile-issue] {code:java} g...@github.com:cccs-br/spark-binaryfile-issue.git {code} Here's a very simple test case that illustrate what's going on: [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala] {code:java} test("binary file dataframe") { // load files in directly into df using 'binaryFile' format. val df = spark .read .format("binaryFile") .load("src/test/resources/files") df.createOrReplaceTempView("files") // This works as expected. val like_count = spark.sql("select * from files where path like '%.csv'").count() assert(like_count === 1) // This does not work as expected. val not_like_count = spark.sql("select * from files where path not like '%.csv'").count() assert(not_like_count === 2) // This used to work in 3.2.1 // df.filter(col("path").endsWith(".csv") === false).show() }{code} was: When creating a dataframe using the binaryFile format I am encountering weird result when filtering/query with the 'not' operator. Here's a repo that will help describe and reproduce the issue. [https://github.com/cccs-br/spark-binaryfile-issue] {code:java} g...@github.com:cccs-br/spark-binaryfile-issue.git {code} Here's a very simple test case that illustrate what's going on: [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala] {code:java} test("binary file dataframe") { // load files in directly into df using 'binaryFile' format. val df = spark .read .format("binaryFile") .load("src/test/resources/files") df.createOrReplaceTempView("files") // This works as expected. 
val like_count = spark.sql("select * from files where path like '%.csv'").count() assert(like_count === 1) // This does not work as expected. val not_like_count = spark.sql("select * from files where path not like '%.csv'").count() assert(not_like_count === 2) // This used to work in 3.2.1 // df.filter(col("path").endsWith(".csv") === false).show() }{code} > Incorrect result when query dataframe produced by 'binaryFile' format > - > > Key: SPARK-39900 > URL: https://issues.apache.org/jira/browse/SPARK-39900 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0 >Reporter: Benoit Roy >Priority: Minor > > When creating a dataframe using the binaryFile format I am encountering weird > result when filtering/query with the 'not' operator. > > Here's a repo that will help describe and reproduce the issue. > [https://github.com/cccs-br/spark-binaryfile-issue] > {code:java} > g...@github.com:cccs-br/spark-binaryfile-issue.git {code} > > Here's a very simple test case that illustrate what's going on: > [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala] > {code:java} >test("binary file dataframe") { > // load files in directly into df using 'binaryFile' format. > val df = spark > .read > .format("binaryFile") > .load("src/test/resources/files") > df.createOrReplaceTempView("files") > // This works as expected. > val like_count = spark.sql("select * from files where path like > '%.csv'").count() > assert(like_count === 1) > // This does not work as expected. > val not_like_count = spark.sql("select * from files where path not like > '%.csv'").count() > assert(not_like_count === 2) > // This used to work in 3.2.1 > // df.filter(col("path").endsWith(".csv") === false).show() > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39900) Incorrect result when query dataframe produced by 'binaryFile' format
[ https://issues.apache.org/jira/browse/SPARK-39900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Roy updated SPARK-39900: --- Description: When creating a dataframe using the binaryFile format I am encountering weird result when filtering/query with the 'not' operator. Here's a repo that will help describe and reproduce the issue. [https://github.com/cccs-br/spark-binaryfile-issue] {code:java} g...@github.com:cccs-br/spark-binaryfile-issue.git {code} Here's a very simple test case that illustrate what's going on: [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala] {code:java} test("binary file dataframe") { // load files in directly into df using 'binaryFile' format. val df = spark .read .format("binaryFile") .load("src/test/resources/files") df.createOrReplaceTempView("files") // This works as expected. val like_count = spark.sql("select * from files where path like '%.csv'").count() assert(like_count === 1) // This does not work as expected. val not_like_count = spark.sql("select * from files where path not like '%.csv'").count() assert(not_like_count === 2) // This used to work in 3.2.1 // df.filter(col("path").endsWith(".csv") === false).show() }{code} was: When creating a dataframe using the binaryFile format. I am encountering weird result when filtering/query with the 'not' operator. Here's a repo that will help describe and reproduce the issue. 
[https://github.com/cccs-br/spark-binaryfile-issue] ``` g...@github.com:cccs-br/spark-binaryfile-issue.git ``` Here's a very simple test case: https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala > Incorrect result when query dataframe produced by 'binaryFile' format > - > > Key: SPARK-39900 > URL: https://issues.apache.org/jira/browse/SPARK-39900 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0 >Reporter: Benoit Roy >Priority: Minor > > When creating a dataframe using the binaryFile format I am encountering weird > result when filtering/query with the 'not' operator. > Here's a repo that will help describe and reproduce the issue. > [https://github.com/cccs-br/spark-binaryfile-issue] > > {code:java} > g...@github.com:cccs-br/spark-binaryfile-issue.git {code} > Here's a very simple test case that illustrate what's going on: > > [https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala] > {code:java} >test("binary file dataframe") { > // load files in directly into df using 'binaryFile' format. > val df = spark > .read > .format("binaryFile") > .load("src/test/resources/files") > df.createOrReplaceTempView("files") > // This works as expected. > val like_count = spark.sql("select * from files where path like > '%.csv'").count() > assert(like_count === 1) > // This does not work as expected. > val not_like_count = spark.sql("select * from files where path not like > '%.csv'").count() > assert(not_like_count === 2) > // This used to work in 3.2.1 > // df.filter(col("path").endsWith(".csv") === false).show() > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39900) Incorrect result when querying dataframe produced by 'binaryFile' format
Benoit Roy created SPARK-39900: -- Summary: Incorrect result when querying dataframe produced by 'binaryFile' format Key: SPARK-39900 URL: https://issues.apache.org/jira/browse/SPARK-39900 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0, 3.2.1 Reporter: Benoit Roy When creating a dataframe using the binaryFile format, I am encountering weird results when filtering/querying with the 'not' operator. Here's a repo that will help describe and reproduce the issue. [https://github.com/cccs-br/spark-binaryfile-issue] ``` g...@github.com:cccs-br/spark-binaryfile-issue.git ``` Here's a very simple test case: https://github.com/cccs-br/spark-binaryfile-issue/blob/main/src/test/scala/BinaryFileSuite.scala -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39899) Incorrect passing of message parameters in InvalidUDFClassException
[ https://issues.apache.org/jira/browse/SPARK-39899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572096#comment-17572096 ] Apache Spark commented on SPARK-39899: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37323 > Incorrect passing of message parameters in InvalidUDFClassException > --- > > Key: SPARK-39899 > URL: https://issues.apache.org/jira/browse/SPARK-39899 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > In fact, messageParameters is not passed to AnalysisException. It is used only > to form the error message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39899) Incorrect passing of message parameters in InvalidUDFClassException
[ https://issues.apache.org/jira/browse/SPARK-39899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39899: Assignee: Apache Spark (was: Max Gekk) > Incorrect passing of message parameters in InvalidUDFClassException > --- > > Key: SPARK-39899 > URL: https://issues.apache.org/jira/browse/SPARK-39899 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > In fact, messageParameters is not passed to AnalysisException. It is used only > to form the error message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39899) Incorrect passing of message parameters in InvalidUDFClassException
[ https://issues.apache.org/jira/browse/SPARK-39899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39899: Assignee: Max Gekk (was: Apache Spark) > Incorrect passing of message parameters in InvalidUDFClassException > --- > > Key: SPARK-39899 > URL: https://issues.apache.org/jira/browse/SPARK-39899 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > In fact, messageParameters is not passed to AnalysisException. It is used only > to form the error message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39899) Incorrect passing of message parameters in InvalidUDFClassException
[ https://issues.apache.org/jira/browse/SPARK-39899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572095#comment-17572095 ] Apache Spark commented on SPARK-39899: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37323 > Incorrect passing of message parameters in InvalidUDFClassException > --- > > Key: SPARK-39899 > URL: https://issues.apache.org/jira/browse/SPARK-39899 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > In fact, messageParameters is not passed to AnalysisException. It is used only > to form the error message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39899) Incorrect passing of message parameters in InvalidUDFClassException
Max Gekk created SPARK-39899: Summary: Incorrect passing of message parameters in InvalidUDFClassException Key: SPARK-39899 URL: https://issues.apache.org/jira/browse/SPARK-39899 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk In fact, messageParameters is not passed to AnalysisException. It is used only to form the error message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39864) ExecutionListenerManager's registration of the ExecutionListenerBus should be lazy
[ https://issues.apache.org/jira/browse/SPARK-39864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-39864: -- Affects Version/s: 3.4.0 (was: 2.0.0) > ExecutionListenerManager's registration of the ExecutionListenerBus should be > lazy > -- > > Key: SPARK-39864 > URL: https://issues.apache.org/jira/browse/SPARK-39864 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Fix For: 3.4.0 > > > Today, ExecutionListenerManager eagerly registers an ExecutionListenerBus > SparkListener when it is created, even if the SparkSession has no query > execution listeners registered. In applications with many short-lived > SparkSessions, this can cause a buildup of empty listeners on the shared > listener bus, increasing Spark listener processing times on the driver. > If we make the registration lazy then we avoid this driver-side listener > performance overhead. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
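The deferred-registration idea described above can be sketched in standalone Java. The class and field names below (ListenerBus, ExecutionListenerManagerSketch) are illustrative stand-ins, not Spark's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the shared Spark listener bus.
class ListenerBus {
    final List<Object> listeners = new ArrayList<>();
    void register(Object listener) { listeners.add(listener); }
}

// Sketch of a manager that attaches itself to the shared bus lazily:
// only when a query execution listener is actually registered.
class ExecutionListenerManagerSketch {
    private final ListenerBus sharedBus;
    private final List<Object> queryListeners = new ArrayList<>();
    private boolean attached = false; // defer bus registration until first use

    ExecutionListenerManagerSketch(ListenerBus bus) {
        this.sharedBus = bus; // note: no bus.register(...) here
    }

    void register(Object queryListener) {
        // Attach to the shared bus at most once, and only on first use,
        // so short-lived sessions with no listeners add no bus overhead.
        if (!attached) {
            sharedBus.register(this);
            attached = true;
        }
        queryListeners.add(queryListener);
    }

    public static void main(String[] args) {
        ListenerBus bus = new ListenerBus();
        ExecutionListenerManagerSketch mgr = new ExecutionListenerManagerSketch(bus);
        System.out.println(bus.listeners.size()); // 0: constructing the manager adds nothing
        mgr.register(new Object());
        mgr.register(new Object());
        System.out.println(bus.listeners.size()); // 1: attached exactly once, on first use
    }
}
```

With the eager variant, every constructed manager would appear on the shared bus; with this lazy variant, only sessions that actually register a query execution listener do.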
[jira] [Commented] (SPARK-39898) Upgrade kubernetes-client to 5.12.3
[ https://issues.apache.org/jira/browse/SPARK-39898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572016#comment-17572016 ] Apache Spark commented on SPARK-39898: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/37321 > Upgrade kubernetes-client to 5.12.3 > --- > > Key: SPARK-39898 > URL: https://issues.apache.org/jira/browse/SPARK-39898 > Project: Spark > Issue Type: Bug > Components: Build, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39898) Upgrade kubernetes-client to 5.12.3
[ https://issues.apache.org/jira/browse/SPARK-39898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572015#comment-17572015 ] Apache Spark commented on SPARK-39898: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/37321 > Upgrade kubernetes-client to 5.12.3 > --- > > Key: SPARK-39898 > URL: https://issues.apache.org/jira/browse/SPARK-39898 > Project: Spark > Issue Type: Bug > Components: Build, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39845) 0.0 and -0.0 are not consistent in set operations
[ https://issues.apache.org/jira/browse/SPARK-39845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Navin Kumar updated SPARK-39845: Description: This is a continuation of the issue described in SPARK-32110. When using Array set-based functions {{array_union}}, {{array_intersect}}, {{array_except}} and {{arrays_overlap}}, {{0.0}} and {{-0.0}} have inconsistent behavior. When parsed, {{-0.0}} is normalized to {{0.0}}. Therefore if I use {{array_union}} for example with these values directly, {{array(-0.0)}} becomes {{array(0.0)}}. See the example below using {{array_union}}: {code:java} scala> val df = spark.sql("SELECT array_union(array(0.0), array(-0.0))") df: org.apache.spark.sql.DataFrame = [array_union(array(0.0), array(0.0)): array] scala> df.collect() res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0)]) {code} In this case, {{0.0}} and {{-0.0}} are considered equal and the union of the arrays produces a single value: {{0.0}}. However, if I try this operation using a constructed dataframe, these values are not equal, and the result is an array with both {{0.0}} and {{-0.0}}. 
{code:java} scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b") df: org.apache.spark.sql.DataFrame = [a: array, b: array] scala> df.selectExpr("array_union(a, b)").collect() res3: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0, -0.0)]) {code} For {{arrays_overlap}}, here is a similar version of that inconsistency: {code:java} scala> val df = spark.sql("SELECT arrays_overlap(array(0.0), array(-0.0))") df: org.apache.spark.sql.DataFrame = [arrays_overlap(array(0.0), array(0.0)): boolean] scala> df.collect res4: Array[org.apache.spark.sql.Row] = Array([true]) {code} {code:java} scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b") df: org.apache.spark.sql.DataFrame = [a: array, b: array] scala> df.selectExpr("arrays_overlap(a, b)") res5: org.apache.spark.sql.DataFrame = [arrays_overlap(a, b): boolean] scala> df.selectExpr("arrays_overlap(a, b)").collect res6: Array[org.apache.spark.sql.Row] = Array([false]) {code} It looks like this is due to the fact that in the constructed dataframe case, the Double value is hashed by using {{java.lang.Double.doubleToLongBits}}, which will treat {{0.0}} and {{-0.0}} as distinct because of the sign bit. See here for more information: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala#L312-L321 I can also confirm that the same behavior occurs with FloatType and the use of {{java.lang.Float.floatToIntBits}} was: This is a continuation of the issue described in SPARK-32110. When using Array set-based functions {{array_union}}, {{array_intersect}}, {{array_except}} and {{arrays_overlap}}, {{0.0}} and {{-0.0}} have inconsistent behavior. When parsed, {{-0.0}} is normalized to {{0.0}}. Therefore if I use {{array_union}} for example with these values directly, {{array(-0.0)}} becomes {{array(0.0)}}. 
See the example below using {{array_union}}: {code:java} scala> val df = spark.sql("SELECT array_union(array(0.0), array(-0.0))") df: org.apache.spark.sql.DataFrame = [array_union(array(0.0), array(0.0)): array] scala> df.collect() res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0)]) {code} In this case, {{0.0}} and {{-0.0}} are considered equal and the union of the arrays produces a single value: {{0.0}}. However, if I try this operation using a constructed dataframe, these values are not equal, and the result is an array with both {{0.0}} and {{-0.0}}. {code:java} scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b") df: org.apache.spark.sql.DataFrame = [a: array, b: array] scala> df.selectExpr("array_union(a, b)").collect() res3: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0, -0.0)]) {code} For {{arrays_overlap}}, here is a similar version of that inconsistency: {code:java} scala> val df = spark.sql("SELECT arrays_overlap(array(0.0), array(-0.0))") df: org.apache.spark.sql.DataFrame = [arrays_overlap(array(0.0), array(0.0)): boolean] scala> df.collect res4: Array[org.apache.spark.sql.Row] = Array([true]) {code} {code:java} scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b") df: org.apache.spark.sql.DataFrame = [a: array, b: array] scala> df.selectExpr("arrays_overlap(a, b)") res5: org.apache.spark.sql.DataFrame = [arrays_overlap(a, b): boolean] scala> df.selectExpr("arrays_overlap(a, b)").collect res6: Array[org.apache.spark.sql.Row] = Array([false]) {code} It looks like this is due to the fact that in the constructed dataframe case, the Double value is hashed by using {{java.lang.Double.doubleToLongBits}}, which will treat {{0.0}} and {{-0.0}} as distinct because of the sign bit. See here for more information: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala#L312-L321
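The {{doubleToLongBits}} behavior referenced above can be observed without Spark; a minimal Java sketch of why bit-pattern hashing separates {{0.0}} and {{-0.0}} (IEEE 754 keeps a sign bit on zero) while numeric comparison does not:

```java
class ZeroBits {
    public static void main(String[] args) {
        // Numeric comparison treats the two zeros as equal.
        System.out.println(0.0 == -0.0); // true

        // But their bit patterns differ in the sign bit, so any hash
        // built on doubleToLongBits puts them in different buckets.
        System.out.println(Double.doubleToLongBits(0.0));                    // 0
        System.out.println(Long.toHexString(Double.doubleToLongBits(-0.0))); // 8000000000000000
        System.out.println(Double.doubleToLongBits(0.0)
                == Double.doubleToLongBits(-0.0));                           // false

        // Same story for FloatType via floatToIntBits.
        System.out.println(Float.floatToIntBits(0.0f)
                == Float.floatToIntBits(-0.0f));                             // false
    }
}
```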
[jira] [Assigned] (SPARK-39898) Upgrade kubernetes-client to 5.12.3
[ https://issues.apache.org/jira/browse/SPARK-39898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39898: Assignee: Apache Spark > Upgrade kubernetes-client to 5.12.3 > --- > > Key: SPARK-39898 > URL: https://issues.apache.org/jira/browse/SPARK-39898 > Project: Spark > Issue Type: Bug > Components: Build, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39898) Upgrade kubernetes-client to 5.12.3
[ https://issues.apache.org/jira/browse/SPARK-39898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39898: Assignee: (was: Apache Spark) > Upgrade kubernetes-client to 5.12.3 > --- > > Key: SPARK-39898 > URL: https://issues.apache.org/jira/browse/SPARK-39898 > Project: Spark > Issue Type: Bug > Components: Build, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39845) 0.0 and -0.0 are not consistent in set operations
[ https://issues.apache.org/jira/browse/SPARK-39845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Navin Kumar updated SPARK-39845: Description: This is a continuation of the issue described in SPARK-32110. When using Array set-based functions {{array_union}}, {{array_intersect}}, {{array_except}} and {{arrays_overlap}}, {{0.0}} and {{-0.0}} have inconsistent behavior. When parsed, {{-0.0}} is normalized to {{0.0}}. Therefore if I use {{array_union}} for example with these values directly, {{array(-0.0)}} becomes {{array(0.0)}}. See the example below using {{array_union}}: {code:java} scala> val df = spark.sql("SELECT array_union(array(0.0), array(-0.0))") df: org.apache.spark.sql.DataFrame = [array_union(array(0.0), array(0.0)): array] scala> df.collect() res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0)]) {code} In this case, {{0.0}} and {{-0.0}} are considered equal and the union of the arrays produces a single value: {{0.0}}. However, if I try this operation using a constructed dataframe, these values are not equal, and the result is an array with both {{0.0}} and {{-0.0}}. 
{code:java} scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b") df: org.apache.spark.sql.DataFrame = [a: array, b: array] scala> df.selectExpr("array_union(a, b)").collect() res3: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0, -0.0)]) {code} For {{arrays_overlap}}, here is a similar version of that inconsistency: {code:java} scala> val df = spark.sql("SELECT arrays_overlap(array(0.0), array(-0.0))") df: org.apache.spark.sql.DataFrame = [arrays_overlap(array(0.0), array(0.0)): boolean] scala> df.collect res4: Array[org.apache.spark.sql.Row] = Array([true]) {code} {code:java} scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b") df: org.apache.spark.sql.DataFrame = [a: array, b: array] scala> df.selectExpr("arrays_overlap(a, b)") res5: org.apache.spark.sql.DataFrame = [arrays_overlap(a, b): boolean] scala> df.selectExpr("arrays_overlap(a, b)").collect res6: Array[org.apache.spark.sql.Row] = Array([false]) {code} It looks like this is due to the fact that in the constructed dataframe case, the Double value is hashed by using {{java.lang.Double.doubleToLongBits}}, which will treat {{0.0}} and {{-0.0}} as distinct because of the sign bit. See here for more information: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala#L312-L321 I can also confirm that the same behavior occurs with FloatType and the use of {{java.lang.Float.floatToIntBits}}. was: This is a continuation of the issue described in SPARK-32110. When using Array set-based functions {{array_union}}, {{array_intersect}}, {{array_except}} and {{arrays_overlap}}, {{0.0}} and {{-0.0}} have inconsistent behavior. When parsed, {{-0.0}} is normalized to {{0.0}}. Therefore if I use {{array_union}} for example with these values directly, {{array(-0.0)}} becomes {{array(0.0)}}. 
See the example below using {{array_union}}: {code:java} scala> val df = spark.sql("SELECT array_union(array(0.0), array(-0.0))") df: org.apache.spark.sql.DataFrame = [array_union(array(0.0), array(0.0)): array] scala> df.collect() res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0)]) {code} In this case, {{0.0}} and {{-0.0}} are considered equal and the union of the arrays produces a single value: {{0.0}}. However, if I try this operation using a constructed dataframe, these values are not equal, and the result is an array with both {{0.0}} and {{-0.0}}. {code:java} scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b") df: org.apache.spark.sql.DataFrame = [a: array, b: array] scala> df.selectExpr("array_union(a, b)").collect() res3: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0, -0.0)]) {code} For {{arrays_overlap}}, here is a similar version of that inconsistency: {code:java} scala> val df = spark.sql("SELECT arrays_overlap(array(0.0), array(-0.0))") df: org.apache.spark.sql.DataFrame = [arrays_overlap(array(0.0), array(0.0)): boolean] scala> df.collect res4: Array[org.apache.spark.sql.Row] = Array([true]) {code} {code:java} scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b") df: org.apache.spark.sql.DataFrame = [a: array, b: array] scala> df.selectExpr("arrays_overlap(a, b)") res5: org.apache.spark.sql.DataFrame = [arrays_overlap(a, b): boolean] scala> df.selectExpr("arrays_overlap(a, b)").collect res6: Array[org.apache.spark.sql.Row] = Array([false]) {code} It looks like this is due to the fact that in the constructed dataframe case, the Double value is hashed by using {{java.lang.Double.doubleToLongBits}}, which will treat {{0.0}} and {{-0.0} as distinct because of the sign bit. See here for more information: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala#L312
[jira] [Updated] (SPARK-39898) Upgrade kubernetes-client to 5.12.3
[ https://issues.apache.org/jira/browse/SPARK-39898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-39898: -- Component/s: Build > Upgrade kubernetes-client to 5.12.3 > --- > > Key: SPARK-39898 > URL: https://issues.apache.org/jira/browse/SPARK-39898 > Project: Spark > Issue Type: Bug > Components: Build, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39898) Upgrade kubernetes-client to 5.12.3
Dongjoon Hyun created SPARK-39898: - Summary: Upgrade kubernetes-client to 5.12.3 Key: SPARK-39898 URL: https://issues.apache.org/jira/browse/SPARK-39898 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39885) Behavior differs between arrays_overlap and array_contains for negative 0.0
[ https://issues.apache.org/jira/browse/SPARK-39885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Vogelbacher updated SPARK-39885: -- Summary: Behavior differs between arrays_overlap and array_contains for negative 0.0 (was: Behavior differs between array_overlap and array_contains for negative 0.0) > Behavior differs between arrays_overlap and array_contains for negative 0.0 > --- > > Key: SPARK-39885 > URL: https://issues.apache.org/jira/browse/SPARK-39885 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.2 >Reporter: David Vogelbacher >Priority: Major > > {{array_contains([0.0], -0.0)}} will return true. {{arrays_overlap([0.0], > [-0.0])}} will return false. I think we generally want to treat -0.0 and 0.0 > as the same (see > https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala#L28) > However, the {{Double::equals}} method doesn't. Therefore, we should either > mark double as false in > [TypeUtils#typeWithProperEquals|https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala#L96], > or we should wrap it with our own equals method that handles this case. 
> Java code snippets showing the issue: > {code:java} > dataset = sparkSession.createDataFrame( > List.of(RowFactory.create(List.of(-0.0))), > > DataTypes.createStructType(ImmutableList.of(DataTypes.createStructField( > "doubleCol", > DataTypes.createArrayType(DataTypes.DoubleType), false; > Dataset df = dataset.withColumn( > "overlaps", > functions.arrays_overlap(functions.array(functions.lit(+0.0)), > dataset.col("doubleCol"))); > List result = df.collectAsList(); // [[WrappedArray(-0.0),false]] > {code} > {code:java} > dataset = sparkSession.createDataFrame( > List.of(RowFactory.create(-0.0)), > DataTypes.createStructType( > > ImmutableList.of(DataTypes.createStructField("doubleCol", > DataTypes.DoubleType, false; > Dataset df = dataset.withColumn( > "contains", > functions.array_contains(functions.array(functions.lit(+0.0)), > dataset.col("doubleCol"))); > List result = df.collectAsList(); // [[-0.0,true]] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
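The {{Double::equals}} discrepancy at the root of the report above can be reproduced in plain Java, with no Spark involved:

```java
class NegZeroEquals {
    public static void main(String[] args) {
        Double boxedPos = 0.0;
        Double boxedNeg = -0.0;

        // Primitive comparison: IEEE 754 says the two zeros are equal.
        System.out.println(0.0 == -0.0); // true

        // Boxed comparison: Double.equals compares bit patterns
        // (via doubleToLongBits), so the sign bit makes them unequal.
        System.out.println(boxedPos.equals(boxedNeg)); // false

        // Double.compare also distinguishes them and orders -0.0 < 0.0.
        System.out.println(Double.compare(-0.0, 0.0)); // -1
    }
}
```

This is why a code path that falls back to {{Double::equals}} (as {{arrays_overlap}} does when the element type reports "proper equals") disagrees with a path that uses Spark's ordering-based semantics, as in {{array_contains}}.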
[jira] [Updated] (SPARK-39885) Behavior differs between array_overlap and array_contains for negative 0.0
[ https://issues.apache.org/jira/browse/SPARK-39885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Vogelbacher updated SPARK-39885: -- Description: {{array_contains([0.0], -0.0)}} will return true. {{arrays_overlap([0.0], [-0.0])}} will return false. I think we generally want to treat -0.0 and 0.0 as the same (see https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala#L28) However, the {{Double::equals}} method doesn't. Therefore, we should either mark double as false in [TypeUtils#typeWithProperEquals|https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala#L96], or we should wrap it with our own equals method that handles this case. Java code snippets showing the issue: {code:java} dataset = sparkSession.createDataFrame( List.of(RowFactory.create(List.of(-0.0))), DataTypes.createStructType(ImmutableList.of(DataTypes.createStructField( "doubleCol", DataTypes.createArrayType(DataTypes.DoubleType), false; Dataset df = dataset.withColumn( "overlaps", functions.arrays_overlap(functions.array(functions.lit(+0.0)), dataset.col("doubleCol"))); List result = df.collectAsList(); // [[WrappedArray(-0.0),false]] {code} {code:java} dataset = sparkSession.createDataFrame( List.of(RowFactory.create(-0.0)), DataTypes.createStructType( ImmutableList.of(DataTypes.createStructField("doubleCol", DataTypes.DoubleType, false; Dataset df = dataset.withColumn( "contains", functions.array_contains(functions.array(functions.lit(+0.0)), dataset.col("doubleCol"))); List result = df.collectAsList(); // [[-0.0,true]] {code} was: {{array_contains([0.0], -0.0)}} will return true. {{arrays_overlap([0.0], [-0.0])}} will return false. 
I think we generally want to treat -0.0 and 0.0 as the same (see https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala#L28) However, the {{Double::equals}} method doesn't. Therefore, we should either mark double as false in [TypeUtils#typeWithProperEquals|https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala#L96], or we should wrap it with our own equals method that handles this case. Java code snippets showing the issue: {code:java} dataset = sparkSession.createDataFrame( List.of(RowFactory.create(List.of(-0.0))), DataTypes.createStructType(ImmutableList.of(DataTypes.createStructField( "doubleCol", DataTypes.createArrayType(DataTypes.DoubleType), false; Dataset df = dataset.withColumn( "overlaps", functions.arrays_overlap(functions.array(functions.lit(+0.0)), dataset.col("doubleCol"))); List result = df.collectAsList(); // [[WrappedArray(-0.0),false]] {code} {code:java} dataset = sparkSession.createDataFrame( List.of(RowFactory.create(-0.0)), DataTypes.createStructType( ImmutableList.of(DataTypes.createStructField("doubleCol", DataTypes.DoubleType, false; Dataset df = dataset.withColumn( "overlaps", functions.array_contains(functions.array(functions.lit(+0.0)), dataset.col("doubleCol"))); List result = df.collectAsList(); // [[-0.0,true]] {code} > Behavior differs between array_overlap and array_contains for negative 0.0 > -- > > Key: SPARK-39885 > URL: https://issues.apache.org/jira/browse/SPARK-39885 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.2 >Reporter: David Vogelbacher >Priority: Major > > {{array_contains([0.0], -0.0)}} will return true. {{array_overlaps([0.0], > [-0.0])}} will return false. 
I think we generally want to treat -0.0 and 0.0 > as the same (see > https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala#L28) > However, the {{Double::equals}} method doesn't. Therefore, we should either > mark double as false in > [TypeUtils#typeWithProperEquals|https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala#L96], > or we should wrap it with our own equals method that handles this case. > Java code snippets showing the issue: > {cod
[jira] [Created] (SPARK-39897) StackOverflowError in TaskMemoryManager
Andrew Ray created SPARK-39897: -- Summary: StackOverflowError in TaskMemoryManager Key: SPARK-39897 URL: https://issues.apache.org/jira/browse/SPARK-39897 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.7 Reporter: Andrew Ray I have observed the following error that looks to stem from TaskMemoryManager.allocatePage making a recursive call to itself when a page can not be allocated. I'm observing this in Spark 2.4 but since the relevant code is still the same in master this is likely still a potential point of failure in current versions. Prioritizing this as minor as this looks to be a very uncommon outcome as I can not find any other reports of a similar nature. {code:java} Py4JJavaError: An error occurred while calling o625.saveAsTable. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170) at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:177) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.StackOverflowError at java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1012) at java.util.concurrent.ConcurrentHashMap.putIfAbsent(ConcurrentHashMap.java:1535) at java.lang.ClassLoader.getClassLoadingLock(ClassLoader.java:457) at 
java.lang.ClassLoader.loadClass(ClassLoader.java:398) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) at java.util.ResourceBundle$RBClassLoader.loadClass(ResourceBundle.java:512) at java.util.ResourceBundle$Control.newBundle(ResourceBundle.java:2657) at java.util.ResourceBundle.loadBundle(ResourceBundle.java:1518) at java.util.ResourceBundle.findBundle(ResourceBundle.java:1482) at java.util.ResourceBundle.findBundle(ResourceBundle.java:1436) at java.util.ResourceBundle.findBundle(ResourceBundle.java:1436) at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1370) at java.util.ResourceBundle.getBundle(ResourceBundle.java:899) at sun.util.resources.LocaleData$1.run(LocaleData.java:167) a
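The hazard the report describes is an unbounded recursive retry: if allocatePage calls itself each time a page cannot be acquired, each failed attempt consumes a stack frame. Below is a minimal sketch, illustrative only and not Spark's actual code (`try_allocate` is a hypothetical callback), of the loop-based alternative: it retries in constant stack space and fails cleanly after a bound instead of overflowing the stack.

```python
class OutOfMemoryError(Exception):
    """Raised when no page can be acquired within the retry budget."""


def allocate_page_iterative(try_allocate, max_attempts=16):
    """Retry allocation in a loop instead of recursing.

    `try_allocate` returns a page, or None when memory could not be
    acquired (after, say, spilling other consumers).  A recursive retry
    with no depth bound can overflow the stack if allocation keeps
    failing; this loop cannot.
    """
    for _ in range(max_attempts):
        page = try_allocate()
        if page is not None:
            return page
    raise OutOfMemoryError(f"unable to allocate a page after {max_attempts} attempts")
```

The same effect can be achieved by bounding the recursion depth, but an explicit loop makes the retry budget obvious and keeps stack usage flat regardless of how many attempts fail.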
[jira] [Commented] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571995#comment-17571995 ] shezm commented on SPARK-39743: --- [~yeachan153] spark.io.compression.zstd.level pairs with {{spark.io.compression.codec}}; it only applies to Spark's internal data. If you want to set a different zstd level when writing parquet files, you can set `parquet.compression.codec.zstd.level` in the SparkConf, for example: {code:scala} val spark = SparkSession .builder() .master("local") .appName("spark example") .config("spark.sql.parquet.compression.codec", "zstd") .config("parquet.compression.codec.zstd.level", 10) // here .getOrCreate() val csvfile = spark.read.csv("file:///home/test_data/Reviews.csv") csvfile.coalesce(1).write.parquet("file:///home/test_data/nn_parq_10"){code} > Unable to set zstd compression level while writing parquet files > > > Key: SPARK-39743 > URL: https://issues.apache.org/jira/browse/SPARK-39743 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Priority: Minor > > While writing zstd compressed parquet files, the following setting > `spark.io.compression.zstd.level` does not have any effect on the > compression level of zstd. > All files seem to be written with the default zstd compression level, and the > config option seems to be ignored. > Using the zstd cli tool, we confirmed that setting a higher compression level > for the same file tested in Spark resulted in a smaller file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
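The behavior the reporter verified with the zstd CLI, that a higher numeric level produces smaller output, can be sanity-checked without Spark. The sketch below uses Python's stdlib zlib purely as a stand-in for zstd (an analogy only; this is not how Spark or Parquet invoke zstd): both codecs expose a numeric compression level that trades CPU time for smaller output on compressible data.

```python
import zlib


def compressed_size(data: bytes, level: int) -> int:
    # A compression "level" trades CPU time for (usually) smaller output,
    # which is the knob parquet.compression.codec.zstd.level controls for
    # zstd-compressed Parquet files.
    return len(zlib.compress(data, level))


data = b"spark,zstd,parquet\n" * 5_000  # highly compressible sample
assert compressed_size(data, 9) <= compressed_size(data, 1) < len(data)
```

The same comparison, done on real Parquet files written at different `parquet.compression.codec.zstd.level` values, is how one can confirm whether the writer-level setting is actually taking effect.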
[jira] [Comment Edited] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark.
[ https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571978#comment-17571978 ] pralabhkumar edited comment on SPARK-39375 at 7/27/22 2:51 PM: --- This is a really good proposal and the need of the hour (specifically since Livy is dormant and Toree is also not very active). This will hugely help in use cases related to Notebooks. Please let us know whether there is an ETA for the first version, or any plan to have further sub-tasks, so that other people can contribute to it. was (Author: pralabhkumar): This is a really good proposal and the need of the hour (specifically since Livy is dormant and Toree is also not very active). This will hugely help in use cases related to Notebooks. Please let us know whether there is an ETA for the first version, or any plan to have further tasks, so that other people can contribute to it. > SPIP: Spark Connect - A client and server interface for Apache Spark. > - > > Key: SPARK-39375 > URL: https://issues.apache.org/jira/browse/SPARK-39375 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Martin Grund >Priority: Major > Labels: SPIP > > Please find the full document for discussion here: [Spark Connect > SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj] > Below, we have just referenced the introduction. > h2. What are you trying to do? > While Spark is used extensively, it was designed nearly a decade ago, which, > in the age of serverless computing and ubiquitous programming language use, > poses a number of limitations. 
Most of the limitations stem from the tightly > coupled Spark driver architecture and fact that clusters are typically shared > across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark > driver runs both the client application and scheduler, which results in a > heavyweight architecture that requires proximity to the cluster. There is no > built-in capability to remotely connect to a Spark cluster in languages > other than SQL and users therefore rely on external solutions such as the > inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich > developer experience{*}: The current architecture and APIs do not cater for > interactive data exploration (as done with Notebooks), or allow for building > out rich developer experience common in modern code editors. (3) > {*}Stability{*}: with the current shared driver architecture, users causing > critical exceptions (e.g. OOM) bring the whole cluster down for all users. > (4) {*}Upgradability{*}: the current entangling of platform and client APIs > (e.g. first and third-party dependencies in the classpath) does not allow for > seamless upgrades between Spark versions (and with that, hinders new feature > adoption). > > We propose to overcome these challenges by building on the DataFrame API and > the underlying unresolved logical plans. The DataFrame API is widely used and > makes it very easy to iteratively express complex logic. We will introduce > {_}Spark Connect{_}, a remote option of the DataFrame API that separates the > client from the Spark server. With Spark Connect, Spark will become > decoupled, allowing for built-in remote connectivity: The decoupled client > SDK can be used to run interactive data exploration and connect to the server > for DataFrame operations. > > Spark Connect will benefit Spark developers in different ways: The decoupled > architecture will result in improved stability, as clients are separated from > the driver. 
From the Spark Connect client perspective, Spark will be (almost) > versionless, and thus enable seamless upgradability, as server APIs can > evolve without affecting the client API. The decoupled client-server > architecture can be leveraged to build close integrations with local > developer tooling. Finally, separating the client process from the Spark > server process will improve Spark’s overall security posture by avoiding the > tight coupling of the client inside the Spark runtime environment. > > Spark Connect will strengthen Spark’s position as the modern unified engine > for large-scale data analytics and expand applicability to use cases and > developers we could not reach with the current setup: Spark will become > ubiquitously usable as the DataFrame API can be used with (almost) any > programming language. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org Fo
[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark.
[ https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571978#comment-17571978 ] pralabhkumar commented on SPARK-39375: -- This is a really good proposal and the need of the hour (specifically since Livy is dormant and Toree is also not very active). This will hugely help in use cases related to Notebooks. Please let us know whether there is an ETA for the first version, or any plan to have further tasks, so that other people can contribute to it. > SPIP: Spark Connect - A client and server interface for Apache Spark. > - > > Key: SPARK-39375 > URL: https://issues.apache.org/jira/browse/SPARK-39375 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Martin Grund >Priority: Major > Labels: SPIP > > Please find the full document for discussion here: [Spark Connect > SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj] > Below, we have just referenced the introduction. > h2. What are you trying to do? > While Spark is used extensively, it was designed nearly a decade ago, which, > in the age of serverless computing and ubiquitous programming language use, > poses a number of limitations. Most of the limitations stem from the tightly > coupled Spark driver architecture and fact that clusters are typically shared > across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark > driver runs both the client application and scheduler, which results in a > heavyweight architecture that requires proximity to the cluster. There is no > built-in capability to remotely connect to a Spark cluster in languages > other than SQL and users therefore rely on external solutions such as the > inactive project [Apache Livy|https://livy.apache.org/]. 
(2) {*}Lack of rich > developer experience{*}: The current architecture and APIs do not cater for > interactive data exploration (as done with Notebooks), or allow for building > out rich developer experience common in modern code editors. (3) > {*}Stability{*}: with the current shared driver architecture, users causing > critical exceptions (e.g. OOM) bring the whole cluster down for all users. > (4) {*}Upgradability{*}: the current entangling of platform and client APIs > (e.g. first and third-party dependencies in the classpath) does not allow for > seamless upgrades between Spark versions (and with that, hinders new feature > adoption). > > We propose to overcome these challenges by building on the DataFrame API and > the underlying unresolved logical plans. The DataFrame API is widely used and > makes it very easy to iteratively express complex logic. We will introduce > {_}Spark Connect{_}, a remote option of the DataFrame API that separates the > client from the Spark server. With Spark Connect, Spark will become > decoupled, allowing for built-in remote connectivity: The decoupled client > SDK can be used to run interactive data exploration and connect to the server > for DataFrame operations. > > Spark Connect will benefit Spark developers in different ways: The decoupled > architecture will result in improved stability, as clients are separated from > the driver. From the Spark Connect client perspective, Spark will be (almost) > versionless, and thus enable seamless upgradability, as server APIs can > evolve without affecting the client API. The decoupled client-server > architecture can be leveraged to build close integrations with local > developer tooling. Finally, separating the client process from the Spark > server process will improve Spark’s overall security posture by avoiding the > tight coupling of the client inside the Spark runtime environment. 
> > Spark Connect will strengthen Spark’s position as the modern unified engine > for large-scale data analytics and expand applicability to use cases and > developers we could not reach with the current setup: Spark will become > ubiquitously usable as the DataFrame API can be used with (almost) any > programming language. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
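To make the decoupling the SPIP describes concrete, here is a purely illustrative, heavily simplified sketch (none of these names come from the actual Spark Connect design): a "client" builds an unresolved plan as plain, serializable data, and a "server" resolves and executes it. Only the plan format is shared, so either side can be upgraded independently, which is the property the SPIP calls (almost) versionless clients. A real implementation would serialize the plan (e.g. as protobuf) over a network connection rather than share in-process Python objects.

```python
# "Client" side: build an unresolved plan as plain, serializable data.
def read(table):           return {"op": "read", "table": table}
def ge(child, col, value): return {"op": "ge", "child": child, "col": col, "value": value}
def select(child, *cols):  return {"op": "select", "child": child, "cols": list(cols)}

# "Server" side: interpret the plan against its own catalog.  Only the
# plan format couples the two sides, so the server can evolve freely.
TABLES = {"users": [{"id": 1, "age": 42}, {"id": 2, "age": 17}]}

def execute(plan):
    op = plan["op"]
    if op == "read":
        return list(TABLES[plan["table"]])
    if op == "ge":  # filter: col >= value
        return [r for r in execute(plan["child"]) if r[plan["col"]] >= plan["value"]]
    if op == "select":
        return [{c: r[c] for c in plan["cols"]} for r in execute(plan["child"])]
    raise ValueError(f"unknown operator: {op}")

plan = select(ge(read("users"), "age", 18), "id")
assert execute(plan) == [{"id": 1}]
```

Because the client never loads engine classes, a crash or OOM while executing the plan stays on the server side, which is the stability argument made in the proposal.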
[jira] [Commented] (SPARK-39896) The structural integrity of the plan is broken after UnwrapCastInBinaryComparison
[ https://issues.apache.org/jira/browse/SPARK-39896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571950#comment-17571950 ] Yuming Wang commented on SPARK-39896: - cc [~fchen] > The structural integrity of the plan is broken after > UnwrapCastInBinaryComparison > - > > Key: SPARK-39896 > URL: https://issues.apache.org/jira/browse/SPARK-39896 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > sql("create table t1(a decimal(3, 0)) using parquet") > sql("insert into t1 values(100), (10), (1)") > sql("select * from t1 where a in(10, 10, 0, 1.00)").show > {code} > {noformat} > After applying rule > org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch > Operator Optimization before Inferring Filters, the structural integrity of > the plan is broken. > java.lang.RuntimeException: After applying rule > org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch > Operator Optimization before Inferring Filters, the structural integrity of > the plan is broken. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39896) The structural integrity of the plan is broken after UnwrapCastInBinaryComparison
Yuming Wang created SPARK-39896: --- Summary: The structural integrity of the plan is broken after UnwrapCastInBinaryComparison Key: SPARK-39896 URL: https://issues.apache.org/jira/browse/SPARK-39896 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Yuming Wang {code:scala} sql("create table t1(a decimal(3, 0)) using parquet") sql("insert into t1 values(100), (10), (1)") sql("select * from t1 where a in(10, 10, 0, 1.00)").show {code} {noformat} After applying rule org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken. java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken. at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
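For context on why unwrapping casts is delicate in this repro: the IN list mixes literals (10, 0, 1.00) whose types are wider than the decimal(3, 0) column, so a rule that moves a literal to the column's narrower type must do so only when the conversion is exact, and every rewritten comparison must keep both sides at the same type or the plan's integrity check fails. A hedged sketch of that exactness test, illustrative only and not the Catalyst implementation:

```python
from decimal import Decimal


def narrow_exact(lit: Decimal, scale: int):
    """Return `lit` rescaled to `scale` if the conversion is exact, else None.

    None signals the caller that the literal cannot be represented in the
    column's type, so the comparison must be rewritten some other way
    (e.g. folded to false) rather than left with mismatched types.
    """
    q = lit.quantize(Decimal(1).scaleb(-scale))
    return q if q == lit else None


# 1.00 fits a decimal(3, 0) column exactly; 1.50 does not.
assert narrow_exact(Decimal("1.00"), 0) == Decimal("1")
assert narrow_exact(Decimal("1.50"), 0) is None
```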
[jira] [Assigned] (SPARK-39880) V2 SHOW FUNCTIONS command should print qualified function name like v1
[ https://issues.apache.org/jira/browse/SPARK-39880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-39880: Assignee: Wenchen Fan > V2 SHOW FUNCTIONS command should print qualified function name like v1 > -- > > Key: SPARK-39880 > URL: https://issues.apache.org/jira/browse/SPARK-39880 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39880) V2 SHOW FUNCTIONS command should print qualified function name like v1
[ https://issues.apache.org/jira/browse/SPARK-39880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39880. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37301 [https://github.com/apache/spark/pull/37301] > V2 SHOW FUNCTIONS command should print qualified function name like v1 > -- > > Key: SPARK-39880 > URL: https://issues.apache.org/jira/browse/SPARK-39880 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39819) DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
[ https://issues.apache.org/jira/browse/SPARK-39819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571871#comment-17571871 ] Apache Spark commented on SPARK-39819: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/37320 > DS V2 aggregate push down can work with Top N or Paging (Sort with group > expressions) > - > > Key: SPARK-39819 > URL: https://issues.apache.org/jira/browse/SPARK-39819 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Currently, DS V2 aggregate push-down cannot work with Top N (order by ... > limit ...) or Paging (order by ... limit ... offset ...). > If it can work with Top N or Paging, it will deliver better performance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
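The optimization being proposed can be illustrated without Spark: when the ORDER BY expressions are among the grouping expressions, the entire aggregate plus `ORDER BY ... LIMIT n` can be evaluated at the data source, so only n rows travel back. A minimal sketch with assumed names (this is not the DS V2 API):

```python
import heapq
from collections import defaultdict


def aggregate_then_top_n(rows, key, value, n):
    """GROUP BY `key`, SUM `value`, then keep the top-n groups by key.

    Because the sort expression is a grouping key, the aggregate and the
    Top N can both run at the source; only n (key, sum) pairs are returned
    instead of every group.
    """
    sums = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[value]
    # nlargest sorts (key, sum) pairs, i.e. by the grouping key, descending.
    return heapq.nlargest(n, sums.items())
```

For example, `SELECT k, SUM(v) FROM t GROUP BY k ORDER BY k DESC LIMIT 2` maps onto `aggregate_then_top_n(rows, "k", "v", 2)`.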
[jira] [Commented] (SPARK-39887) Expression transform error
[ https://issues.apache.org/jira/browse/SPARK-39887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571865#comment-17571865 ] Apache Spark commented on SPARK-39887: -- User 'cfmcgrady' has created a pull request for this issue: https://github.com/apache/spark/pull/37319 > Expression transform error > -- > > Key: SPARK-39887 > URL: https://issues.apache.org/jira/browse/SPARK-39887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0, 3.2.2 >Reporter: zhuml >Priority: Major > > {code:java} > spark.sql( > """ > |select to_date(a) a, to_date(b) b from > |(select a, a as b from > |(select to_date(a) a from > | values ('2020-02-01') as t1(a) > | group by to_date(a)) t3 > |union all > |select a, b from > |(select to_date(a) a, to_date(b) b from > |values ('2020-01-01','2020-01-02') as t1(a, b) > | group by to_date(a), to_date(b)) t4) t5 > |group by to_date(a), to_date(b) > |""".stripMargin).show(){code} > result is (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-01) > expected (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-02) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39887) Expression transform error
[ https://issues.apache.org/jira/browse/SPARK-39887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39887: Assignee: Apache Spark > Expression transform error > -- > > Key: SPARK-39887 > URL: https://issues.apache.org/jira/browse/SPARK-39887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0, 3.2.2 >Reporter: zhuml >Assignee: Apache Spark >Priority: Major > > {code:java} > spark.sql( > """ > |select to_date(a) a, to_date(b) b from > |(select a, a as b from > |(select to_date(a) a from > | values ('2020-02-01') as t1(a) > | group by to_date(a)) t3 > |union all > |select a, b from > |(select to_date(a) a, to_date(b) b from > |values ('2020-01-01','2020-01-02') as t1(a, b) > | group by to_date(a), to_date(b)) t4) t5 > |group by to_date(a), to_date(b) > |""".stripMargin).show(){code} > result is (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-01) > expected (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-02) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39887) Expression transform error
[ https://issues.apache.org/jira/browse/SPARK-39887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39887: Assignee: (was: Apache Spark) > Expression transform error > -- > > Key: SPARK-39887 > URL: https://issues.apache.org/jira/browse/SPARK-39887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0, 3.2.2 >Reporter: zhuml >Priority: Major > > {code:java} > spark.sql( > """ > |select to_date(a) a, to_date(b) b from > |(select a, a as b from > |(select to_date(a) a from > | values ('2020-02-01') as t1(a) > | group by to_date(a)) t3 > |union all > |select a, b from > |(select to_date(a) a, to_date(b) b from > |values ('2020-01-01','2020-01-02') as t1(a, b) > | group by to_date(a), to_date(b)) t4) t5 > |group by to_date(a), to_date(b) > |""".stripMargin).show(){code} > result is (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-01) > expected (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-02) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39887) Expression transform error
[ https://issues.apache.org/jira/browse/SPARK-39887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571864#comment-17571864 ] Apache Spark commented on SPARK-39887: -- User 'cfmcgrady' has created a pull request for this issue: https://github.com/apache/spark/pull/37319 > Expression transform error > -- > > Key: SPARK-39887 > URL: https://issues.apache.org/jira/browse/SPARK-39887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0, 3.2.2 >Reporter: zhuml >Priority: Major > > {code:java} > spark.sql( > """ > |select to_date(a) a, to_date(b) b from > |(select a, a as b from > |(select to_date(a) a from > | values ('2020-02-01') as t1(a) > | group by to_date(a)) t3 > |union all > |select a, b from > |(select to_date(a) a, to_date(b) b from > |values ('2020-01-01','2020-01-02') as t1(a, b) > | group by to_date(a), to_date(b)) t4) t5 > |group by to_date(a), to_date(b) > |""".stripMargin).show(){code} > result is (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-01) > expected (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-02) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
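A guess at the mechanism behind the repro above, sketched abstractly (this is not the Catalyst code, only an illustration of the general hazard the symptoms suggest): when a rewrite substitutes sub-expressions by structural value and an upstream projection makes `b` a mere alias of `a`, the two outputs `to_date(a)` and `to_date(b)` can collapse into one expression, matching the wrong `(2020-01-01, 2020-01-01)` row in the report.

```python
# Expressions as tuples: ("col", name) or ("to_date", child).
def substitute(expr, mapping):
    """Replace sub-expressions via `mapping`, keyed by structural value."""
    if expr in mapping:
        return mapping[expr]
    if expr[0] == "to_date":
        return ("to_date", substitute(expr[1], mapping))
    return expr

# In one union branch, `b` is just an alias of `a`.  A rewrite that
# resolves aliases first maps BOTH columns to ("col", "a"), so the two
# output expressions become indistinguishable -- a collapse of the kind
# seen in the bug, where distinct columns must stay distinct.
alias_resolution = {("col", "b"): ("col", "a")}
out1 = substitute(("to_date", ("col", "a")), alias_resolution)
out2 = substitute(("to_date", ("col", "b")), alias_resolution)
assert out1 == out2  # two logically separate outputs, one expression
```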
[jira] [Updated] (SPARK-39819) DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
[ https://issues.apache.org/jira/browse/SPARK-39819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-39819: --- Summary: DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions) (was: DS V2 aggregate push down can work with Top N or Paging (Sort with group column)) > DS V2 aggregate push down can work with Top N or Paging (Sort with group > expressions) > - > > Key: SPARK-39819 > URL: https://issues.apache.org/jira/browse/SPARK-39819 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Currently, DS V2 aggregate push-down cannot work with Top N (order by ... > limit ...) or Paging (order by ... limit ... offset ...). > If it can work with Top N or Paging, it will deliver better performance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39890) Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering
[ https://issues.apache.org/jira/browse/SPARK-39890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39890: Assignee: (was: Apache Spark) > Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering > --- > > Key: SPARK-39890 > URL: https://issues.apache.org/jira/browse/SPARK-39890 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Minor > > AliasAwareOutputOrdering can save a sort if the project inside > TakeOrderedAndProjectExec has an alias for the sort order. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39890) Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering
[ https://issues.apache.org/jira/browse/SPARK-39890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571831#comment-17571831 ] Apache Spark commented on SPARK-39890: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/37318 > Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering > --- > > Key: SPARK-39890 > URL: https://issues.apache.org/jira/browse/SPARK-39890 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Minor > > AliasAwareOutputOrdering can save a sort if the project inside > TakeOrderedAndProjectExec has an alias for the sort order. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39890) Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering
[ https://issues.apache.org/jira/browse/SPARK-39890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39890: Assignee: Apache Spark > Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering > --- > > Key: SPARK-39890 > URL: https://issues.apache.org/jira/browse/SPARK-39890 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Minor > > AliasAwareOutputOrdering can save a sort if the project inside > TakeOrderedAndProjectExec has an alias for the sort order. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
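The idea in SPARK-39890, sketched abstractly with assumed names (not Spark's AliasAwareOutputOrdering implementation): if the projection aliases the sort-order expression, the operator's output is still sorted, just under the alias, so a downstream sort on the alias can be elided.

```python
def output_ordering(child_ordering, aliases):
    """Rewrite a child's sort ordering through a projection's aliases.

    `aliases` maps output name -> source expression (e.g. {"b": "a"}).
    The projected output keeps the child ordering, expressed in terms of
    the output names, so a consumer requiring that ordering needs no sort.
    """
    inverse = {src: out for out, src in aliases.items()}
    return [inverse.get(col, col) for col in child_ordering]


# ORDER BY a LIMIT n, then SELECT a AS b: the output is still sorted --
# by b -- so a downstream requirement of "ordered by b" is satisfied.
assert output_ordering(["a"], {"b": "a"}) == ["b"]
```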
[jira] [Created] (SPARK-39895) pyspark drop doesn't accept *cols
Santosh Pingale created SPARK-39895: --- Summary: pyspark drop doesn't accept *cols Key: SPARK-39895 URL: https://issues.apache.org/jira/browse/SPARK-39895 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.2, 3.3.0, 3.0.3 Reporter: Santosh Pingale The PySpark DataFrame drop method has the following signature: {color:#4c9aff}{{def drop(self, *cols: "ColumnOrName") -> "DataFrame":}}{color} However, when we try to pass multiple Column objects to the drop function, it raises TypeError {{each col in the param list should be a string}} *Minimal reproducible example:* {color:#4c9aff}values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), ("id_1", 3, 3), ("id_2", 4, 3)]{color} {color:#4c9aff}df = spark.createDataFrame(values, "id string, point int, count int"){color} |-- id: string (nullable = true)| |-- point: integer (nullable = true)| |-- count: integer (nullable = true)| {color:#4c9aff}{{df.drop(df.point, df.count)}}{color} {quote}{color:#505f79}/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py in drop(self, *cols){color} {color:#505f79}2537 for col in cols:{color} {color:#505f79}2538 if not isinstance(col, str):{color} {color:#505f79}-> 2539 raise TypeError("each col in the param list should be a string"){color} {color:#505f79}2540 jdf = self._jdf.drop(self._jseq(cols)){color} {color:#505f79}2541{color} {color:#505f79}TypeError: each col in the param list should be a string{color} {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
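A hedged sketch of the fix this report implies: normalize `*cols` so both strings and Column objects are accepted, as the `ColumnOrName` annotation promises. The `Column` class below is a stand-in for illustration; the real pyspark.sql.Column wraps a JVM expression rather than a plain name.

```python
class Column:
    """Stand-in for pyspark.sql.Column (illustrative only)."""
    def __init__(self, name):
        self.name = name


def drop(df_columns, *cols):
    """Return the columns left after dropping `cols` (strings or Columns)."""
    names = set()
    for col in cols:
        if isinstance(col, str):
            names.add(col)
        elif isinstance(col, Column):
            names.add(col.name)  # normalize Column -> name instead of raising
        else:
            raise TypeError(f"each col should be a string or Column, got {col!r}")
    return [c for c in df_columns if c not in names]


# Mixing Column objects and strings now works, matching the signature.
assert drop(["id", "point", "count"], Column("point"), "count") == ["id"]
```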
[jira] [Commented] (SPARK-39894) Combine the similar binary comparison in boolean expression.
[ https://issues.apache.org/jira/browse/SPARK-39894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571817#comment-17571817 ] Apache Spark commented on SPARK-39894: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/37317 > Combine the similar binary comparison in boolean expression. > > > Key: SPARK-39894 > URL: https://issues.apache.org/jira/browse/SPARK-39894 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > If a boolean expression has two similar binary comparisons connected with And, > i.e. 'a > 1 and 'a > 2, > we should simplify them to 'a > 2 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39894) Combine the similar binary comparison in boolean expression.
[ https://issues.apache.org/jira/browse/SPARK-39894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39894: Assignee: Apache Spark > Combine the similar binary comparison in boolean expression. > > > Key: SPARK-39894 > URL: https://issues.apache.org/jira/browse/SPARK-39894 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > If a boolean expression has two similar binary comparisons connected with And, > i.e. 'a > 1 and 'a > 2, > we should simplify them to 'a > 2 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39894) Combine the similar binary comparison in boolean expression.
[ https://issues.apache.org/jira/browse/SPARK-39894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571816#comment-17571816 ] Apache Spark commented on SPARK-39894: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/37317 > Combine the similar binary comparison in boolean expression. > > > Key: SPARK-39894 > URL: https://issues.apache.org/jira/browse/SPARK-39894 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > If a boolean expression has two similar binary comparisons connected with And, > i.e. 'a > 1 and 'a > 2, > we should simplify them to 'a > 2 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39894) Combine the similar binary comparison in boolean expression.
[ https://issues.apache.org/jira/browse/SPARK-39894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39894: Assignee: (was: Apache Spark) > Combine the similar binary comparison in boolean expression. > > > Key: SPARK-39894 > URL: https://issues.apache.org/jira/browse/SPARK-39894 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > If a boolean expression has two similar binary comparisons connected with And, > i.e. 'a > 1 and 'a > 2, > we should simplify them to 'a > 2 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
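The simplification proposed in SPARK-39894, sketched for one case (assumed representation; this is not the Catalyst rule): two '>' comparisons on the same attribute joined by And reduce to the one with the larger constant.

```python
def combine_and(p, q):
    """p, q are (col, op, const) triples; return the tighter predicate, or
    None when the pair is not combinable and both conjuncts must be kept."""
    col1, op1, c1 = p
    col2, op2, c2 = q
    if col1 == col2 and op1 == op2 == ">":
        return (col1, ">", max(c1, c2))   # a > 1 AND a > 2  =>  a > 2
    return None


assert combine_and(("a", ">", 1), ("a", ">", 2)) == ("a", ">", 2)
```

A full rule would handle the other comparison operators symmetrically (e.g. `<` takes the smaller constant, and Or takes the looser bound).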
[jira] [Assigned] (SPARK-39731) Correctness issue when parsing dates with yyyyMMdd format in CSV and JSON
[ https://issues.apache.org/jira/browse/SPARK-39731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-39731: --- Assignee: Ivan Sadikov > Correctness issue when parsing dates with yyyyMMdd format in CSV and JSON > - > > Key: SPARK-39731 > URL: https://issues.apache.org/jira/browse/SPARK-39731 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > > In Spark 3.x, when reading CSV data like this: > {code:java} > name,mydate > 1,2020011 > 2,20201203{code} > and specifying the date pattern as "yyyyMMdd", dates are not parsed correctly > with the CORRECTED time parser policy. > For example, > {code:java} > val df = spark.read.schema("name string, mydate date").option("dateFormat", > "yyyyMMdd").option("header", "true").csv("file:/tmp/test.csv") > df.show(false){code} > Returns: > {code:java} > +----+--------------+ > |name|mydate        | > +----+--------------+ > |1   |+2020011-01-01| > |2   |2020-12-03    | > +----+--------------+ {code} > whereas it used to return null instead of the invalid date in Spark 3.2 or below. > > The issue appears to be caused by this PR: > [https://github.com/apache/spark/pull/32959]. > > A similar issue can be observed in the JSON data source. > test.json > {code:java} > {"date": "2020011"} > {"date": "20201203"} {code} > > Running the commands > {code:java} > val df = spark.read.schema("date date").option("dateFormat", > "yyyyMMdd").json("file:/tmp/test.json") > df.show(false) {code} > returns > {code:java} > +--------------+ > |date          | > +--------------+ > |+2020011-01-01| > |2020-12-03    | > +--------------+{code} > but before the patch linked in the description it used to show: > {code:java} > +----------+ > |date      | > +----------+ > |7500-08-09| > |2020-12-03| > +----------+{code} > which is strange either way. I will try to address it in the PR. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39731) Correctness issue when parsing dates with yyyyMMdd format in CSV and JSON
[ https://issues.apache.org/jira/browse/SPARK-39731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-39731. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37147 [https://github.com/apache/spark/pull/37147] > Correctness issue when parsing dates with yyyyMMdd format in CSV and JSON > - > > Key: SPARK-39731 > URL: https://issues.apache.org/jira/browse/SPARK-39731 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > Fix For: 3.4.0 > > > In Spark 3.x, when reading CSV data like this: > {code:java} > name,mydate > 1,2020011 > 2,20201203{code} > and specifying the date pattern as "yyyyMMdd", dates are not parsed correctly > with the CORRECTED time parser policy. > For example, > {code:java} > val df = spark.read.schema("name string, mydate date").option("dateFormat", > "yyyyMMdd").option("header", "true").csv("file:/tmp/test.csv") > df.show(false){code} > Returns: > {code:java} > +----+--------------+ > |name|mydate        | > +----+--------------+ > |1   |+2020011-01-01| > |2   |2020-12-03    | > +----+--------------+ {code} > whereas it used to return null instead of the invalid date in Spark 3.2 or below. > > The issue appears to be caused by this PR: > [https://github.com/apache/spark/pull/32959]. > > A similar issue can be observed in the JSON data source. > test.json > {code:java} > {"date": "2020011"} > {"date": "20201203"} {code} > > Running the commands > {code:java} > val df = spark.read.schema("date date").option("dateFormat", > "yyyyMMdd").json("file:/tmp/test.json") > df.show(false) {code} > returns > {code:java} > +--------------+ > |date          | > +--------------+ > |+2020011-01-01| > |2020-12-03    | > +--------------+{code} > but before the patch linked in the description it used to show: > {code:java} > +----------+ > |date      | > +----------+ > |7500-08-09| > |2020-12-03| > +----------+{code} > which is strange either way. I will try to address it in the PR. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
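The lenient-versus-strict parsing difference behind SPARK-39731 can be illustrated outside of Spark with plain Python (this is not Spark's parser; the helper name is invented): a 7-digit value like "2020011" should not satisfy an 8-digit yyyyMMdd pattern, yet lenient parsers happily accept it.

```python
from datetime import datetime

# Illustration of lenient vs strict yyyyMMdd parsing (plain Python, not
# Spark's internals). The strict helper mirrors the "return null for
# malformed dates" behaviour the JIRA says Spark 3.2 and below had.

def parse_yyyymmdd_strict(s):
    """Return a date only if s is exactly 8 digits forming a valid
    yyyyMMdd date; otherwise return None."""
    if len(s) != 8 or not s.isdigit():
        return None  # reject values that cannot match the full pattern
    try:
        return datetime.strptime(s, "%Y%m%d").date()
    except ValueError:
        return None  # e.g. month 13 or day 32

# Lenient: strptime accepts the 7-digit string by treating "01" as the
# month and the trailing "1" as the day.
print(datetime.strptime("2020011", "%Y%m%d").date())  # 2020-01-01
# Strict: the malformed value is rejected, the well-formed one parses.
print(parse_yyyymmdd_strict("2020011"))   # None
print(parse_yyyymmdd_strict("20201203"))  # 2020-12-03
```

The `+2020011-01-01` result in the report comes from a different lenient path (treating all seven digits as a year), but the underlying problem is the same: the parser does not require the input to consume the pattern exactly.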
[jira] [Assigned] (SPARK-39893) Remove redundant aggregate if it is group only and all grouping and aggregate expressions are foldable
[ https://issues.apache.org/jira/browse/SPARK-39893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39893: Assignee: Apache Spark > Remove redundant aggregate if it is group only and all grouping and aggregate > expressions are foldable > -- > > Key: SPARK-39893 > URL: https://issues.apache.org/jira/browse/SPARK-39893 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Major > > If all groupingExpressions and aggregateExpressions in an aggregate are > foldable, we can remove this aggregate. > For example, for the query: > {code:java} > SELECT distinct 1001 as id , cast('2022-06-03' as date) AS DT FROM testData > {code} > the grouping expressions are *[1001, 2022-06-03]* > and the aggregate expressions are *[1001 AS id#274, 2022-06-03 AS DT#275]*, > so we can skip scanning table testData and remove the aggregate operation. > Before this PR: > {code:java} > Aggregate [1001, 2022-06-03], [1001 AS id#274, 2022-06-03 AS DT#275], > Statistics(sizeInBytes=16.0 EiB) > +- SerializeFromObject, Statistics(sizeInBytes=8.0 EiB) >+- ExternalRDD [obj#12], Statistics(sizeInBytes=8.0 EiB) > {code} > After this PR: > {code:java} > Project [1001 AS id#218, 2022-06-03 AS DT#219], Statistics(sizeInBytes=2.0 B) > +- OneRowRelation, Statistics(sizeInBytes=1.0 B) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39893) Remove redundant aggregate if it is group only and all grouping and aggregate expressions are foldable
[ https://issues.apache.org/jira/browse/SPARK-39893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571804#comment-17571804 ] Apache Spark commented on SPARK-39893: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/37316 > Remove redundant aggregate if it is group only and all grouping and aggregate > expressions are foldable > -- > > Key: SPARK-39893 > URL: https://issues.apache.org/jira/browse/SPARK-39893 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Priority: Major > > If all groupingExpressions and aggregateExpressions in an aggregate are > foldable, we can remove this aggregate. > For example, for the query: > {code:java} > SELECT distinct 1001 as id , cast('2022-06-03' as date) AS DT FROM testData > {code} > the grouping expressions are *[1001, 2022-06-03]* > and the aggregate expressions are *[1001 AS id#274, 2022-06-03 AS DT#275]*, > so we can skip scanning table testData and remove the aggregate operation. > Before this PR: > {code:java} > Aggregate [1001, 2022-06-03], [1001 AS id#274, 2022-06-03 AS DT#275], > Statistics(sizeInBytes=16.0 EiB) > +- SerializeFromObject, Statistics(sizeInBytes=8.0 EiB) >+- ExternalRDD [obj#12], Statistics(sizeInBytes=8.0 EiB) > {code} > After this PR: > {code:java} > Project [1001 AS id#218, 2022-06-03 AS DT#219], Statistics(sizeInBytes=2.0 B) > +- OneRowRelation, Statistics(sizeInBytes=1.0 B) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39893) Remove redundant aggregate if it is group only and all grouping and aggregate expressions are foldable
[ https://issues.apache.org/jira/browse/SPARK-39893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39893: Assignee: (was: Apache Spark) > Remove redundant aggregate if it is group only and all grouping and aggregate > expressions are foldable > -- > > Key: SPARK-39893 > URL: https://issues.apache.org/jira/browse/SPARK-39893 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Priority: Major > > If all groupingExpressions and aggregateExpressions in an aggregate are > foldable, we can remove this aggregate. > For example, for the query: > {code:java} > SELECT distinct 1001 as id , cast('2022-06-03' as date) AS DT FROM testData > {code} > the grouping expressions are *[1001, 2022-06-03]* > and the aggregate expressions are *[1001 AS id#274, 2022-06-03 AS DT#275]*, > so we can skip scanning table testData and remove the aggregate operation. > Before this PR: > {code:java} > Aggregate [1001, 2022-06-03], [1001 AS id#274, 2022-06-03 AS DT#275], > Statistics(sizeInBytes=16.0 EiB) > +- SerializeFromObject, Statistics(sizeInBytes=8.0 EiB) >+- ExternalRDD [obj#12], Statistics(sizeInBytes=8.0 EiB) > {code} > After this PR: > {code:java} > Project [1001 AS id#218, 2022-06-03 AS DT#219], Statistics(sizeInBytes=2.0 B) > +- OneRowRelation, Statistics(sizeInBytes=1.0 B) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39893) Remove redundant aggregate if it is group only and all grouping and aggregate expressions are foldable
[ https://issues.apache.org/jira/browse/SPARK-39893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571803#comment-17571803 ] Apache Spark commented on SPARK-39893: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/37316 > Remove redundant aggregate if it is group only and all grouping and aggregate > expressions are foldable > -- > > Key: SPARK-39893 > URL: https://issues.apache.org/jira/browse/SPARK-39893 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wan Kun >Priority: Major > > If all groupingExpressions and aggregateExpressions in an aggregate are > foldable, we can remove this aggregate. > For example, for the query: > {code:java} > SELECT distinct 1001 as id , cast('2022-06-03' as date) AS DT FROM testData > {code} > the grouping expressions are *[1001, 2022-06-03]* > and the aggregate expressions are *[1001 AS id#274, 2022-06-03 AS DT#275]*, > so we can skip scanning table testData and remove the aggregate operation. > Before this PR: > {code:java} > Aggregate [1001, 2022-06-03], [1001 AS id#274, 2022-06-03 AS DT#275], > Statistics(sizeInBytes=16.0 EiB) > +- SerializeFromObject, Statistics(sizeInBytes=8.0 EiB) >+- ExternalRDD [obj#12], Statistics(sizeInBytes=8.0 EiB) > {code} > After this PR: > {code:java} > Project [1001 AS id#218, 2022-06-03 AS DT#219], Statistics(sizeInBytes=2.0 B) > +- OneRowRelation, Statistics(sizeInBytes=1.0 B) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
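The SPARK-39893 rewrite can be modeled with a toy Python optimizer rule (illustrative only; the plan encoding and function names are invented, not Catalyst's): if a group-only Aggregate has nothing but foldable (constant) grouping and aggregate expressions, every input row falls into the same single group, so the Aggregate and its child can be replaced by a Project over OneRowRelation.

```python
# Toy model of "remove redundant aggregate when all grouping and aggregate
# expressions are foldable" (illustration, not Spark's Catalyst rule).
# Plans are tuples; expressions are constants (foldable) or "col:<name>"
# strings standing in for column references (non-foldable).

def is_foldable(expr):
    # Constants are foldable; "col:..." strings model column references.
    return not (isinstance(expr, str) and expr.startswith("col:"))

def remove_redundant_aggregate(plan):
    """plan: ('Aggregate', grouping_exprs, aggregate_exprs, child).
    Returns a Project over OneRowRelation when the rule fires, else the
    plan unchanged."""
    kind, grouping, aggregates, child = plan
    if kind == "Aggregate" and all(map(is_foldable, grouping + aggregates)):
        # All groups collapse to one row; the child never needs scanning.
        return ("Project", aggregates, ("OneRowRelation",))
    return plan

before = ("Aggregate", [1001, "2022-06-03"], [1001, "2022-06-03"],
          ("Scan", "testData"))
print(remove_redundant_aggregate(before))
# ('Project', [1001, '2022-06-03'], ('OneRowRelation',))
```

A real implementation must also confirm the Aggregate is group-only (no aggregate functions such as count or sum, which would still need the input rows); the toy model simply assumes that precondition.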