[jira] [Commented] (SPARK-34115) Long runtime on many environment variables
[ https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268420#comment-17268420 ]

Norbert Schultz commented on SPARK-34115:
Thank you!

> Long runtime on many environment variables
>
> Key: SPARK-34115
> URL: https://issues.apache.org/jira/browse/SPARK-34115
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.4.0, 2.4.7, 3.0.1
> Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
> Reporter: Norbert Schultz
> Assignee: Norbert Schultz
> Priority: Major
> Fix For: 3.0.2, 3.1.2
> Attachments: spark-bug-34115.tar.gz
>
> I am not sure if this is a bug report or a feature request. The code is the same in current versions of Spark, and maybe this ticket saves someone some time debugging.
> We migrated some older code to Spark 2.4.0, and suddenly the integration tests on our build machine were much slower than expected. On local machines it ran perfectly.
> In the end it turned out that Spark was wasting CPU cycles during DataFrame analysis in the following functions:
> * AnalysisHelper.assertNotAnalysisRule calling
> * Utils.isTesting
> Utils.isTesting traverses all environment variables.
> The offending build machine was a Kubernetes pod which automatically exposed all services as environment variables, so it had more than 3000 environment variables.
> Because Utils.isTesting is called very often through AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown, transformUp), this cost adds up quickly.
> Of course we will restrict the number of environment variables; on the other hand, Utils.isTesting could also use a lazy val for
> {code:java}
> sys.env.contains("SPARK_TESTING")
> {code}
> to make it less expensive.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
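The suggested fix (look the variable up once and memoize the answer) can be sketched outside Spark as follows. This is a minimal illustration, not Spark's actual code; the class and field names are hypothetical. A Scala `lazy val` gives the same effect; in Java a `static final` field initialized at class load plays that role:

```java
// EnvCheck.java - illustration of memoizing an environment-variable lookup.
public class EnvCheck {
    // Evaluated exactly once when the class is loaded; every later call
    // is a plain field read instead of a fresh lookup over the
    // (potentially very large) environment map.
    private static final boolean IS_TESTING =
        System.getenv().containsKey("SPARK_TESTING");

    public static boolean isTesting() {
        return IS_TESTING;
    }

    public static void main(String[] args) {
        // The cached answer must always agree with a direct lookup.
        boolean direct = System.getenv().containsKey("SPARK_TESTING");
        System.out.println(isTesting() == direct); // prints "true"
    }
}
```

The point is not that a single `containsKey` call is slow, but that an analysis rule invoked for every tree node multiplies whatever cost the lookup has by thousands of calls; memoization makes that multiplier harmless.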
[jira] [Assigned] (SPARK-34176) Java UT failed when testing sql/hive module independently in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-34176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34176: Assignee: Apache Spark
[jira] [Commented] (SPARK-34176) Java UT failed when testing sql/hive module independently in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-34176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268418#comment-17268418 ] Apache Spark commented on SPARK-34176: User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/31259
[jira] [Assigned] (SPARK-34176) Java UT failed when testing sql/hive module independently in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-34176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34176: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-34176) Java UT failed when testing sql/hive module independently in Scala 2.13
Yang Jie created SPARK-34176:

Summary: Java UT failed when testing sql/hive module independently in Scala 2.13
Key: SPARK-34176
URL: https://issues.apache.org/jira/browse/SPARK-34176
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.2.0
Reporter: Yang Jie

When running mvn test for the sql/hive module independently in Scala 2.13, a Java UT fails as follows:

# {code:java}
mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl sql/hive -am -DskipTests{code}
# {code:java}
mvn test -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl sql/hive{code}
# {code:java}
[ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 20.353 s <<< FAILURE! - in org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF Time elapsed: 18.548 s <<< ERROR!
java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
	at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
	at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
Caused by: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport
	at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
	at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
{code}
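Context on the error above: in Scala 2.13 the parallel collections were moved out of the standard library into the separate org.scala-lang.modules:scala-parallel-collections module, which is why scala.collection.parallel.TaskSupport is missing when the module's test classpath is assembled in isolation. A dependency along these lines (the version shown is illustrative, not what Spark pins) is the usual way to make the class available:

```xml
<!-- Parallel collections are a separate artifact in Scala 2.13;
     without it, scala.collection.parallel.TaskSupport is absent
     at test runtime. -->
<dependency>
  <groupId>org.scala-lang.modules</groupId>
  <artifactId>scala-parallel-collections_2.13</artifactId>
  <version>1.0.0</version>
  <scope>test</scope>
</dependency>
```

This is only a sketch of the likely fix; the actual change belongs in the affected module's pom so that standalone `-pl sql/hive` builds see the same classpath as full builds.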
[jira] [Created] (SPARK-34174) Add SHOW PARTITIONS as table valued function
angerszhu created SPARK-34174: Summary: Add SHOW PARTITIONS as table valued function Key: SPARK-34174 URL: https://issues.apache.org/jira/browse/SPARK-34174 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu
[jira] [Created] (SPARK-34175) Add SHOW VIEWS as table valued function
angerszhu created SPARK-34175: Summary: Add SHOW VIEWS as table valued function Key: SPARK-34175 URL: https://issues.apache.org/jira/browse/SPARK-34175 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu
[jira] [Created] (SPARK-34173) Add SHOW FUNCTIONS as table valued function
angerszhu created SPARK-34173: Summary: Add SHOW FUNCTIONS as table valued function Key: SPARK-34173 URL: https://issues.apache.org/jira/browse/SPARK-34173 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu
[jira] [Created] (SPARK-34172) Add SHOW DATABASES as table valued function
angerszhu created SPARK-34172: Summary: Add SHOW DATABASES as table valued function Key: SPARK-34172 URL: https://issues.apache.org/jira/browse/SPARK-34172 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu
[jira] [Created] (SPARK-34171) Add SHOW COLUMNS as table valued function
angerszhu created SPARK-34171: Summary: Add SHOW COLUMNS as table valued function Key: SPARK-34171 URL: https://issues.apache.org/jira/browse/SPARK-34171 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu
[jira] [Created] (SPARK-34170) Add SHOW TABLES EXTENDED as table valued function
angerszhu created SPARK-34170: Summary: Add SHOW TABLES EXTENDED as table valued function Key: SPARK-34170 URL: https://issues.apache.org/jira/browse/SPARK-34170 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu
[jira] [Updated] (SPARK-33630) Support save SHOW command result to table
[ https://issues.apache.org/jira/browse/SPARK-33630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-33630: Parent: SPARK-34169 Issue Type: Sub-task (was: Improvement)

> Support save SHOW command result to table
>
> Key: SPARK-33630
> URL: https://issues.apache.org/jira/browse/SPARK-33630
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Yuming Wang
> Priority: Major
>
> The Scala interface supports saving the SHOW command result to a table:
> {code:scala}
> spark.sql("SHOW TABLES IN default").write.saveAsTable("t")
> {code}
> But the SQL interface does not support it:
> {code:sql}
> spark-sql> create table t using parquet as SHOW TABLES IN default;
> Error in query:
> mismatched input 'SHOW' expecting {'(', 'FROM', 'MAP', 'REDUCE', 'SELECT', 'TABLE', 'VALUES', 'WITH'}(line 1, pos 32)
>
> == SQL ==
> create table t using parquet as SHOW TABLES IN default
> ^^^
> {code}
> It would be great if the SQL interface could support it.
[jira] [Created] (SPARK-34169) Add some useful auxiliary SQL as table valued function
angerszhu created SPARK-34169: Summary: Add some useful auxiliary SQL as table valued function Key: SPARK-34169 URL: https://issues.apache.org/jira/browse/SPARK-34169 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu

Support commands such as {{SHOW TABLES}} as table valued functions, so that users can easily reach metadata system information using SQL and analyze it directly.
[jira] [Assigned] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules
[ https://issues.apache.org/jira/browse/SPARK-34168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34168: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules
[ https://issues.apache.org/jira/browse/SPARK-34168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268408#comment-17268408 ] Apache Spark commented on SPARK-34168: User 'JkSelf' has created a pull request for this issue: https://github.com/apache/spark/pull/31258
[jira] [Assigned] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules
[ https://issues.apache.org/jira/browse/SPARK-34168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34168: Assignee: Apache Spark
[jira] [Assigned] (SPARK-33630) Support save SHOW command result to table
[ https://issues.apache.org/jira/browse/SPARK-33630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33630: Assignee: Apache Spark
[jira] [Assigned] (SPARK-33630) Support save SHOW command result to table
[ https://issues.apache.org/jira/browse/SPARK-33630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33630: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-33630) Support save SHOW command result to table
[ https://issues.apache.org/jira/browse/SPARK-33630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268407#comment-17268407 ] Apache Spark commented on SPARK-33630: User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/31257
[jira] [Created] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules
Ke Jia created SPARK-34168: Summary: Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules Key: SPARK-34168 URL: https://issues.apache.org/jira/browse/SPARK-34168 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0, 3.0.1 Reporter: Ke Jia

Currently AQE and DPP cannot both be applied at the same time. This change enables AQE together with DPP when the join is already a broadcast hash join at the beginning.
[jira] [Updated] (SPARK-34027) ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
[ https://issues.apache.org/jira/browse/SPARK-34027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34027: Fix Version/s: (was: 3.1.1) 3.1.2

> ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
>
> Key: SPARK-34027
> URL: https://issues.apache.org/jira/browse/SPARK-34027
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.1, 3.1.0, 3.2.0
> Reporter: Maxim Gekk
> Assignee: Apache Spark
> Priority: Major
> Labels: correctness
> Fix For: 3.0.2, 3.2.0, 3.1.2
>
> Here is the example to reproduce the issue:
> {code:sql}
> spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
> spark-sql> insert into tbl partition (part=0) select 0;
> spark-sql> cache table tbl;
> spark-sql> select * from tbl;
> 0	0
> spark-sql> show table extended like 'tbl' partition(part=0);
> default	tbl	false	Partition Values: [part=0]
> Location: file:/Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0
> ...
> {code}
> Add a new partition by copying the existing one:
> {code}
> cp -r /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=1
> {code}
> Recover and select the table:
> {code}
> spark-sql> alter table tbl recover partitions;
> spark-sql> select * from tbl;
> 0	0
> {code}
> We see only the old data.
[jira] [Updated] (SPARK-34162) Add PyArrow compatibility note for Python 3.9
[ https://issues.apache.org/jira/browse/SPARK-34162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34162: Fix Version/s: (was: 3.1.1) 3.1.2

> Add PyArrow compatibility note for Python 3.9
>
> Key: SPARK-34162
> URL: https://issues.apache.org/jira/browse/SPARK-34162
> Project: Spark
> Issue Type: Documentation
> Components: Documentation
> Affects Versions: 3.2.0, 3.1.1
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Fix For: 3.2.0, 3.1.2
[jira] [Updated] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite
[ https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34166: Fix Version/s: (was: 3.1.0) 3.1.2

> Fix flaky test in DecommissionWorkerSuite
>
> Key: SPARK-34166
> URL: https://issues.apache.org/jira/browse/SPARK-34166
> Project: Spark
> Issue Type: Test
> Components: Spark Core, Tests
> Affects Versions: 3.1.0, 3.2.0, 3.1.1
> Reporter: ulysses you
> Assignee: ulysses you
> Priority: Minor
> Fix For: 3.1.2
>
> The test `decommission workers ensure that shuffle output is regenerated even with shuffle service` assumes it has two executors and that both tasks can execute concurrently.
> The two tasks will execute serially if there is only one executor, and the test result is then unexpected. E.g.
> ```
> [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS)
> (DecommissionWorkerSuite.scala:190)
> ```
[jira] [Updated] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up
[ https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raza Jafri updated SPARK-34167: Description:

When reading a parquet file written with decimals of precision < 10 in a 64-bit representation, Spark tries to read them as an INT and fails.

Steps to reproduce: read the attached file, which has a single Decimal(8,2) column with 10 values:

{code:java}
scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show
...
Caused by: java.lang.NullPointerException
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
...
{code}

Here are my findings. The {{VectorizedParquetRecordReader}} reads the parquet file correctly because it bases the read on the [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150], which is a MessageType and has the underlying data stored correctly as {{INT64}}, whereas the {{OnHeapColumnVector}} is initialized based on the [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151], which comes from {{org.apache.spark.sql.parquet.row.requested_schema}} set by the reader: a StructType that only carries {{Decimal(_,_)}}. [https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L224]

Attached are two files, one with Decimal(8,2) and the other with Decimal(1,1), both written as decimals backed by INT64. Decimal(1,1) results in a different exception, but for the same reason.
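For background on why such a file is legal at all: the Parquet format specification allows a DECIMAL to be stored as INT32 when precision <= 9 and as INT64 when precision <= 18, so a Decimal(8,2) column may be backed by a 64-bit physical type even though 32 bits would suffice. A reader that sizes its column vector from the SQL precision alone will then disagree with the file. The sketch below (hypothetical helper names, not Spark or Parquet API) just encodes the spec's mapping:

```java
// Illustration of the Parquet-spec rules for DECIMAL physical storage.
public class DecimalStorage {
    /** Smallest standard physical type able to hold the given precision. */
    public static String minimalPhysicalType(int precision) {
        if (precision <= 9)  return "INT32";
        if (precision <= 18) return "INT64";
        return "FIXED_LEN_BYTE_ARRAY";
    }

    /** Whether a physical type is a legal backing for the precision. */
    public static boolean isLegalBacking(int precision, String physicalType) {
        switch (physicalType) {
            case "INT32": return precision <= 9;
            case "INT64": return precision <= 18;
            default:      return true; // byte arrays can hold any precision
        }
    }

    public static void main(String[] args) {
        // Decimal(8,2): INT32 is the minimal type, but INT64 is also legal,
        // which is exactly the file layout this issue is about.
        System.out.println(minimalPhysicalType(8));     // prints "INT32"
        System.out.println(isLegalBacking(8, "INT64")); // prints "true"
    }
}
```

So a correct vectorized reader has to consult the file's physical type (the MessageType), not just the logical Decimal(precision, scale), when choosing between int- and long-backed vectors.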
[jira] [Updated] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up
[ https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raza Jafri updated SPARK-34167: Attachment: part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet

> Attachments: part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet
[jira] [Updated] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up
[ https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raza Jafri updated SPARK-34167: --- Description: When reading a parquet file written with Decimals with precision < 10 as a 64-bit representation, Spark tries to read it as an INT and fails Steps to reproduce: Read the attached file that has a single Decimal(8,2) column with 10 values {code:java} scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show ... Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327) at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ... {code} Here are my findings. 
The *{{VectorizedParquetRecordReader}}* reads the parquet file correctly because it bases the read on the [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150], which is a MessageType and correctly records the underlying data as {{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized from the [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151], which comes from {{org.apache.spark.sql.parquet.row.requested_schema}}, set by the reader; that is a {{StructType}} and only has {{Decimal(_,_)}} in it. [https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L224] Attached are two files, one with Decimal(8,2), the other with Decimal(1,1), both written as Decimal backed by INT64 was: When reading a parquet file written with Decimals with precision < 10 as a 64-bit representation, Spark tries to read it as an INT and fails Steps to reproduce: Read the attached file that has a single Decimal(8,2) column with 10 values {code:java} scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show ... Caused by: java.lang.NullPointerException at
[jira] [Updated] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up
[ https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raza Jafri updated SPARK-34167: --- Description: When reading a parquet file written with Decimals with precision < 10 as a 64-bit representation, Spark tries to read it as an INT and fails Steps to reproduce: Read the attached file that has a single Decimal(8,2) column with 10 values {code:java} scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show ... Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327) at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ... {code} Here are my findings. 
The *{{VectorizedParquetRecordReader}}* reads the parquet file correctly because it bases the read on the [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150], which is a MessageType and correctly records the underlying data as {{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized from the [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151], which comes from {{org.apache.spark.sql.parquet.row.requested_schema}}, set by the reader; that is a {{StructType}} and only has {{Decimal(_,_)}} in it. [https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L224] Attached are two files, one with Decimal(8,2), the other with Decimal(1,2), both written as Decimal backed by INT64 was: When reading a parquet file written with Decimals with precision < 10 as a 64-bit representation, Spark tries to read it as an INT and fails Steps to reproduce: Read the attached file that has a single Decimal(8,2) column with 10 values {code:java} scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show ... Caused by: java.lang.NullPointerException at
[jira] [Updated] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up
[ https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raza Jafri updated SPARK-34167: --- Description: When reading a parquet file written with Decimals with precision < 10 as a 64-bit representation, Spark tries to read it as an INT and fails Steps to reproduce: Read the attached file that has a single Decimal(8,2) column with 10 values {code:java} scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show ... Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327) at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ... {code} Here are my findings. 
The *{{VectorizedParquetRecordReader}}* reads the parquet file correctly because it bases the read on the [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150], which is a MessageType and correctly records the underlying data as {{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized from the [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151], which comes from {{org.apache.spark.sql.parquet.row.requested_schema}}, set by the reader; that is a {{StructType}} and only has {{Decimal(_,_)}} in it. [https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L224] was: When reading a parquet file written with Decimals with precision < 10 as a 64-bit representation, Spark tries to read it as an INT and fails Steps to reproduce: Read the attached file that has a single Decimal(8,2) column with 10 values {code:java} scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show ... Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327) at
[jira] [Updated] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up
[ https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raza Jafri updated SPARK-34167: --- Attachment: part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet > Reading parquet with Decimal(8,2) written as a Decimal64 blows up > - > > Key: SPARK-34167 > URL: https://issues.apache.org/jira/browse/SPARK-34167 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.1 >Reporter: Raza Jafri >Priority: Major > Attachments: > part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet > > > When reading a parquet file written with Decimals with precision < 10 as a > 64-bit representation, Spark tries to read it as an INT and fails > > Steps to reproduce: > Read the attached file that has a single Decimal(8,2) column with 10 values > {code:java} > scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show > ... > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:127) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > ... > {code} > > > Here are my findings. 
The *{{VectorizedParquetRecordReader}}* reads the > parquet file correctly because it bases the read on the > [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150] > which is a MessageType and has the underlying data stored correctly as > {{INT64}} whereas the *{{OnHeapColumnVector}}* is initialized based on the > [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151] > which is coming from {{org.apache.spark.sql.parquet.row.requested_schema}} > that is set by the reader which is a {{StructType}} and only has > {{Decimal(_,_)}} in it. >
[jira] [Created] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up
Raza Jafri created SPARK-34167: -- Summary: Reading parquet with Decimal(8,2) written as a Decimal64 blows up Key: SPARK-34167 URL: https://issues.apache.org/jira/browse/SPARK-34167 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 3.0.1 Reporter: Raza Jafri Attachments: part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet When reading a parquet file written with Decimals with precision < 10 as a 64-bit representation, Spark tries to read it as an INT and fails Steps to reproduce: Read the attached file that has a single Decimal(8,2) column with 10 values {code:java} scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show ... Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327) at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ... {code} Here are my findings. 
The *{{VectorizedParquetRecordReader}}* reads the parquet file correctly because it bases the read on the [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150], which is a MessageType and correctly records the underlying data as {{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized from the [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151], which comes from {{org.apache.spark.sql.parquet.row.requested_schema}}, set by the reader; that is a {{StructType}} and only has {{Decimal(_,_)}} in it. [https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L224] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
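A minimal sketch of why {{putLong}} dereferences null in the trace above. This is a purely illustrative Python model, not the actual {{OnHeapColumnVector}} code: the assumption is that the vector allocates exactly one backing array based on the Catalyst type's precision, so for Decimal(8,2) only the int buffer exists and the long buffer is never initialized.

```python
# Illustrative model (assumption): the column vector picks its backing array
# from the Catalyst decimal precision; put_long then touches long_data, which
# is still None for Decimal(8,2) -- the analogue of the Java NPE at
# OnHeapColumnVector.putLong.

class ColumnVectorSketch:
    def __init__(self, precision: int):
        self.int_data = None
        self.long_data = None
        if precision <= 9:        # Decimal(8,2): int-backed vector
            self.int_data = []
        elif precision <= 18:     # Decimal(10..18): long-backed vector
            self.long_data = []

    def put_long(self, value: int) -> None:
        # The reader decodes INT64 values from the file and calls this:
        self.long_data.append(value)

vec = ColumnVectorSketch(precision=8)
try:
    vec.put_long(12345)
    failed = False
except AttributeError:            # None.append -- mirrors the NPE
    failed = True
```

With precision 12 the same call succeeds, matching the observation that only precision < 10 decimals stored as INT64 hit the bug.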
[jira] [Resolved] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite
[ https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34166. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 31255 [https://github.com/apache/spark/pull/31255] > Fix flaky test in DecommissionWorkerSuite > - > > Key: SPARK-34166 > URL: https://issues.apache.org/jira/browse/SPARK-34166 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.1.0 > > > The test `decommission workers ensure that shuffle output is regenerated even > with shuffle service` assumes it has two executors and that both tasks can > execute concurrently. > The two tasks will execute serially if there is only one executor, and the > test result is then unexpected. E.g. > ``` > [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, > 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) > (DecommissionWorkerSuite.scala:190) > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
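The failure message above lists one entry per task attempt. A short parsing sketch (hypothetical helper, not suite code; the entry format is assumed to be stage:stageAttempt:partition:taskAttempt-RESULT) makes the extra fifth event visible as the serial retry of partition 1:

```python
# Decode the attempt list from the assertion failure. Format assumed from
# the message: stage:stageAttempt:partition:taskAttempt-RESULT.
failure_events = [
    "0:0:0:0-SUCCESS", "0:0:1:0-FAILED", "0:0:1:1-SUCCESS",
    "0:1:0:0-SUCCESS", "1:0:0:0-SUCCESS",
]

def parse_event(entry: str) -> dict:
    ids, result = entry.rsplit("-", 1)
    stage, stage_attempt, partition, attempt = (int(x) for x in ids.split(":"))
    return {"stage": stage, "stage_attempt": stage_attempt,
            "partition": partition, "attempt": attempt, "result": result}

events = [parse_event(e) for e in failure_events]

# With only one executor the two map tasks ran serially, so partition 1's
# first attempt failed and was retried -- 5 events where the test expects 4.
retried = [e for e in events if e["attempt"] > 0]
```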
[jira] [Assigned] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite
[ https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34166: - Assignee: ulysses you > Fix flaky test in DecommissionWorkerSuite > - > > Key: SPARK-34166 > URL: https://issues.apache.org/jira/browse/SPARK-34166 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > > The test `decommission workers ensure that shuffle output is regenerated even > with shuffle service` assumes it has two executors and that both tasks can > execute concurrently. > The two tasks will execute serially if there is only one executor, and the > test result is then unexpected. E.g. > ``` > [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, > 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) > (DecommissionWorkerSuite.scala:190) > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite
[ https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34166: -- Issue Type: Test (was: Improvement) > Fix flaky test in DecommissionWorkerSuite > - > > Key: SPARK-34166 > URL: https://issues.apache.org/jira/browse/SPARK-34166 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: ulysses you >Priority: Minor > > The test `decommission workers ensure that shuffle output is regenerated even > with shuffle service` assumes it has two executors and that both tasks can > execute concurrently. > The two tasks will execute serially if there is only one executor, and the > test result is then unexpected. E.g. > ``` > [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, > 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) > (DecommissionWorkerSuite.scala:190) > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite
[ https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34166: -- Affects Version/s: 3.1.1 3.1.0 > Fix flaky test in DecommissionWorkerSuite > - > > Key: SPARK-34166 > URL: https://issues.apache.org/jira/browse/SPARK-34166 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: ulysses you >Priority: Minor > > The test `decommission workers ensure that shuffle output is regenerated even > with shuffle service` assumes it has two executors and that both tasks can > execute concurrently. > The two tasks will execute serially if there is only one executor, and the > test result is then unexpected. E.g. > ``` > [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, > 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) > (DecommissionWorkerSuite.scala:190) > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite
[ https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34166: -- Component/s: Spark Core > Fix flaky test in DecommissionWorkerSuite > - > > Key: SPARK-34166 > URL: https://issues.apache.org/jira/browse/SPARK-34166 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 3.2.0 >Reporter: ulysses you >Priority: Minor > > The test `decommission workers ensure that shuffle output is regenerated even > with shuffle service` assumes it has two executors and that both tasks can > execute concurrently. > The two tasks execute serially if there is only one executor, and the test > result is then unexpected. E.g. > ``` > [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, > 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) > (DecommissionWorkerSuite.scala:190) > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite
[ https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34166: Assignee: (was: Apache Spark) > Fix flaky test in DecommissionWorkerSuite > - > > Key: SPARK-34166 > URL: https://issues.apache.org/jira/browse/SPARK-34166 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.2.0 >Reporter: ulysses you >Priority: Minor > > The test `decommission workers ensure that shuffle output is regenerated even > with shuffle service` assumes it has two executors and that both tasks can > execute concurrently. > The two tasks execute serially if there is only one executor, and the test > result is then unexpected. E.g. > ``` > [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, > 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) > (DecommissionWorkerSuite.scala:190) > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite
[ https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34166: Assignee: Apache Spark > Fix flaky test in DecommissionWorkerSuite > - > > Key: SPARK-34166 > URL: https://issues.apache.org/jira/browse/SPARK-34166 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: Apache Spark >Priority: Minor > > The test `decommission workers ensure that shuffle output is regenerated even > with shuffle service` assumes it has two executors and that both tasks can > execute concurrently. > The two tasks execute serially if there is only one executor, and the test > result is then unexpected. E.g. > ``` > [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, > 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) > (DecommissionWorkerSuite.scala:190) > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite
[ https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268362#comment-17268362 ] Apache Spark commented on SPARK-34166: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/31255 > Fix flaky test in DecommissionWorkerSuite > - > > Key: SPARK-34166 > URL: https://issues.apache.org/jira/browse/SPARK-34166 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.2.0 >Reporter: ulysses you >Priority: Minor > > The test `decommission workers ensure that shuffle output is regenerated even > with shuffle service` assumes it has two executors and that both tasks can > execute concurrently. > The two tasks execute serially if there is only one executor, and the test > result is then unexpected. E.g. > ``` > [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, > 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) > (DecommissionWorkerSuite.scala:190) > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite
[ https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-34166: Description: The test `decommission workers ensure that shuffle output is regenerated even with shuffle service` assumes it has two executors and that both tasks can execute concurrently. The two tasks execute serially if there is only one executor, and the test result is then unexpected. E.g. ``` [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) (DecommissionWorkerSuite.scala:190) ``` was: The test `decommission workers ensure that shuffle output is regenerated even with shuffle service` assumes it has two executor and both of two tasks can execute concurrently. The two tasks will execute serially if there only have one executor. The result is test is unexpceted. E.g. ``` [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) (DecommissionWorkerSuite.scala:190) ``` > Fix flaky test in DecommissionWorkerSuite > - > > Key: SPARK-34166 > URL: https://issues.apache.org/jira/browse/SPARK-34166 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.2.0 >Reporter: ulysses you >Priority: Minor > > The test `decommission workers ensure that shuffle output is regenerated even > with shuffle service` assumes it has two executors and that both tasks can > execute concurrently. > The two tasks execute serially if there is only one executor, and the test > result is then unexpected. E.g.
> ``` > [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, > 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) > (DecommissionWorkerSuite.scala:190) > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite
ulysses you created SPARK-34166: --- Summary: Fix flaky test in DecommissionWorkerSuite Key: SPARK-34166 URL: https://issues.apache.org/jira/browse/SPARK-34166 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.2.0 Reporter: ulysses you The test `decommission workers ensure that shuffle output is regenerated even with shuffle service` assumes it has two executors and that both tasks can execute concurrently. The two tasks execute serially if there is only one executor, and the test result is then unexpected. E.g. ``` [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) (DecommissionWorkerSuite.scala:190) ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
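The deflaking approach the report implies, namely making sure two executors are actually registered before asserting on task counts, can be sketched as follows. This is a sketch only, not the merged patch: `TestUtils.waitUntilExecutorsUp` is a helper from Spark's test tree (assumed to block until the executors register or the wait times out), and the 60-second timeout is an arbitrary illustrative choice.

```scala
import org.apache.spark.{SparkContext, TestUtils}

// Sketch: gate the flaky assertion on both executors being up, so the
// two tasks of the first stage can really run concurrently instead of
// serially on a single executor (which shifts the expected task count).
def withTwoLiveExecutors(sc: SparkContext)(body: => Unit): Unit = {
  TestUtils.waitUntilExecutorsUp(sc, 2, 60000) // assumed to fail the test on timeout
  body
}
```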
[jira] [Commented] (SPARK-34165) Add countDistinct option to Dataset#summary
[ https://issues.apache.org/jira/browse/SPARK-34165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268356#comment-17268356 ] Apache Spark commented on SPARK-34165: -- User 'MrPowers' has created a pull request for this issue: https://github.com/apache/spark/pull/31254 > Add countDistinct option to Dataset#summary > --- > > Key: SPARK-34165 > URL: https://issues.apache.org/jira/browse/SPARK-34165 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Matthew Powers >Priority: Minor > > The Dataset#summary function supports options like count, mean, min, and max. > It's a great little function for lightweight exploratory data analysis. > A count distinct of each column is a common exploratory data analysis > workflow. This should be easy to add (piggybacking off the existing > countDistinct code), entirely backwards compatible, and will help a lot of > users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34165) Add countDistinct option to Dataset#summary
[ https://issues.apache.org/jira/browse/SPARK-34165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34165: Assignee: (was: Apache Spark) > Add countDistinct option to Dataset#summary > --- > > Key: SPARK-34165 > URL: https://issues.apache.org/jira/browse/SPARK-34165 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Matthew Powers >Priority: Minor > > The Dataset#summary function supports options like count, mean, min, and max. > It's a great little function for lightweight exploratory data analysis. > A count distinct of each column is a common exploratory data analysis > workflow. This should be easy to add (piggybacking off the existing > countDistinct code), entirely backwards compatible, and will help a lot of > users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34165) Add countDistinct option to Dataset#summary
[ https://issues.apache.org/jira/browse/SPARK-34165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34165: Assignee: Apache Spark > Add countDistinct option to Dataset#summary > --- > > Key: SPARK-34165 > URL: https://issues.apache.org/jira/browse/SPARK-34165 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Matthew Powers >Assignee: Apache Spark >Priority: Minor > > The Dataset#summary function supports options like count, mean, min, and max. > It's a great little function for lightweight exploratory data analysis. > A count distinct of each column is a common exploratory data analysis > workflow. This should be easy to add (piggybacking off the existing > countDistinct code), entirely backwards compatible, and will help a lot of > users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34165) Add countDistinct option to Dataset#summary
[ https://issues.apache.org/jira/browse/SPARK-34165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268355#comment-17268355 ] Apache Spark commented on SPARK-34165: -- User 'MrPowers' has created a pull request for this issue: https://github.com/apache/spark/pull/31254 > Add countDistinct option to Dataset#summary > --- > > Key: SPARK-34165 > URL: https://issues.apache.org/jira/browse/SPARK-34165 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Matthew Powers >Priority: Minor > > The Dataset#summary function supports options like count, mean, min, and max. > It's a great little function for lightweight exploratory data analysis. > A count distinct of each column is a common exploratory data analysis > workflow. This should be easy to add (piggybacking off the existing > countDistinct code), entirely backwards compatible, and will help a lot of > users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34115) Long runtime on many environment variables
[ https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268344#comment-17268344 ] Dongjoon Hyun commented on SPARK-34115: --- Thank you! > Long runtime on many environment variables > -- > > Key: SPARK-34115 > URL: https://issues.apache.org/jira/browse/SPARK-34115 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.0, 2.4.7, 3.0.1 > Environment: Spark 2.4.0 local[2] on a Kubernetes Pod >Reporter: Norbert Schultz >Assignee: Norbert Schultz >Priority: Major > Fix For: 3.0.2, 3.1.2 > > Attachments: spark-bug-34115.tar.gz > > > I am not sure if this is a bug report or a feature request. The code is > the same in current versions of Spark, so maybe this ticket saves someone > some debugging time. > We migrated some older code to Spark 2.4.0, and suddenly the integration > tests on our build machine were much slower than expected. > On local machines they ran perfectly. > In the end it turned out that Spark was wasting CPU cycles during DataFrame > analysis in the following functions: > * AnalysisHelper.assertNotAnalysisRule, calling > * Utils.isTesting > Utils.isTesting traverses all environment variables. > The offending build machine was a Kubernetes Pod which automatically exposed > all services as environment variables, so it had more than 3000 environment > variables. > This adds up because Utils.isTesting is called very often through > AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and > transformUp). > > Of course we will restrict the number of environment variables; on the other > hand, Utils.isTesting could also use a lazy val for > > {code:java} > sys.env.contains("SPARK_TESTING") {code} > > so it is not that expensive. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
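The suggestion at the end of the ticket, caching the environment lookup in a lazy val so the scan over all environment variables happens only once, would look roughly like this. This is a sketch of the idea, not the patch that was merged; `UtilsSketch` is a hypothetical name used for illustration.

```scala
// Sketch of the proposed fix: a Scala `lazy val` is evaluated on first
// access and then cached, so repeated calls from
// AnalysisHelper.assertNotAnalysisRule become a cheap field read instead
// of a fresh scan over 3000+ environment variables.
object UtilsSketch {
  private lazy val testing: Boolean = sys.env.contains("SPARK_TESTING")

  def isTesting: Boolean = testing
}
```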
[jira] [Resolved] (SPARK-34162) Add PyArrow compatibility note for Python 3.9
[ https://issues.apache.org/jira/browse/SPARK-34162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34162. --- Fix Version/s: 3.1.1 3.2.0 Assignee: Dongjoon Hyun (was: Apache Spark) Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/31251 > Add PyArrow compatibility note for Python 3.9 > - > > Key: SPARK-34162 > URL: https://issues.apache.org/jira/browse/SPARK-34162 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.2.0, 3.1.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.2.0, 3.1.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34098) HadoopVersionInfoSuite failed when running maven test in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-34098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268339#comment-17268339 ] Yang Jie commented on SPARK-34098: -- This looks like a network issue: if SPARK_VERSIONS_SUITE_IVY_PATH is exported, the above problems do not seem to appear. The SparkSubmitUtils.resolveMavenCoordinates method fails with a certain probability. > HadoopVersionInfoSuite failed when running maven test in Scala 2.13 > -- > > Key: SPARK-34098 > URL: https://issues.apache.org/jira/browse/SPARK-34098 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > > {code:java} > mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -pl sql/hive -Pscala-2.13 > -DwildcardSuites=org.apache.spark.sql.hive.client.HadoopVersionInfoSuite > -Dtest=none > {code} > Run independently, all HadoopVersionInfoSuite cases passed, but executing > > {code:java} > mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -pl sql/hive -Pscala-2.13 > {code} > > made HadoopVersionInfoSuite fail as follows: > {code:java} > HadoopVersionInfoSuite: 22:32:30.310 WARN > org.apache.spark.sql.hive.client.IsolatedClientLoader: Failed to resolve > Hadoop artifacts for the version 2.7.4. We will change the hadoop version > from 2.7.4 to 2.7.4 and try again. It is recommended to set jars used by Hive > metastore client through spark.sql.hive.metastore.jars in the production > environment. 
- SPARK-32256: Hadoop VersionInfo should be preloaded *** FAILED > *** java.lang.RuntimeException: [unresolved dependency: > org.apache.hive#hive-metastore;2.0.1: not found, unresolved dependency: > org.apache.hive#hive-exec;2.0.1: not found, unresolved dependency: > org.apache.hive#hive-common;2.0.1: not found, unresolved dependency: > org.apache.hive#hive-serde;2.0.1: not found, unresolved dependency: > com.google.guava#guava;14.0.1: not found, unresolved dependency: > org.apache.hadoop#hadoop-client;2.7.4: not found] at > org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1423) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.$anonfun$downloadVersion$2(IsolatedClientLoader.scala:122) > at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:122) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:75) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:63) > at > org.apache.spark.sql.hive.client.HadoopVersionInfoSuite.$anonfun$new$1(HadoopVersionInfoSuite.scala:46) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at > org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at > org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > This needs some investigation. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34098) HadoopVersionInfoSuite failed when running maven test in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-34098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268339#comment-17268339 ] Yang Jie edited comment on SPARK-34098 at 1/20/21, 3:04 AM: This looks like a network issue: if SPARK_VERSIONS_SUITE_IVY_PATH is exported, the above problems do not seem to appear. The SparkSubmitUtils.resolveMavenCoordinates method fails with a certain probability if SPARK_VERSIONS_SUITE_IVY_PATH is not exported was (Author: luciferyang): Looks like a network issue, if SPARK_VERSIONS_SUITE_IVY_PATH exported, the above problems seems not appear. The SparkSubmitUtils.resolveMavenCoordinates method has a certain failure probability. > HadoopVersionInfoSuite failed when running maven test in Scala 2.13 > -- > > Key: SPARK-34098 > URL: https://issues.apache.org/jira/browse/SPARK-34098 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > > {code:java} > mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -pl sql/hive -Pscala-2.13 > -DwildcardSuites=org.apache.spark.sql.hive.client.HadoopVersionInfoSuite > -Dtest=none > {code} > Run independently, all HadoopVersionInfoSuite cases passed, but executing > > {code:java} > mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -pl sql/hive -Pscala-2.13 > {code} > > made HadoopVersionInfoSuite fail as follows: > {code:java} > HadoopVersionInfoSuite: 22:32:30.310 WARN > org.apache.spark.sql.hive.client.IsolatedClientLoader: Failed to resolve > Hadoop artifacts for the version 2.7.4. We will change the hadoop version > from 2.7.4 to 2.7.4 and try again. It is recommended to set jars used by Hive > metastore client through spark.sql.hive.metastore.jars in the production > environment. 
- SPARK-32256: Hadoop VersionInfo should be preloaded *** FAILED > *** java.lang.RuntimeException: [unresolved dependency: > org.apache.hive#hive-metastore;2.0.1: not found, unresolved dependency: > org.apache.hive#hive-exec;2.0.1: not found, unresolved dependency: > org.apache.hive#hive-common;2.0.1: not found, unresolved dependency: > org.apache.hive#hive-serde;2.0.1: not found, unresolved dependency: > com.google.guava#guava;14.0.1: not found, unresolved dependency: > org.apache.hadoop#hadoop-client;2.7.4: not found] at > org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1423) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.$anonfun$downloadVersion$2(IsolatedClientLoader.scala:122) > at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:122) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:75) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:63) > at > org.apache.spark.sql.hive.client.HadoopVersionInfoSuite.$anonfun$new$1(HadoopVersionInfoSuite.scala:46) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at > org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at > org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > This needs some investigation. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
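The workaround mentioned in the comment above can be sketched as a small shell recipe: point the suite at a fixed local ivy cache via SPARK_VERSIONS_SUITE_IVY_PATH so artifact resolution does not hit the network on every run. The path below is an example choice, not a required location, and the commented Maven invocation is abbreviated from the ticket.

```shell
# Sketch: pre-create a stable local ivy cache and export the variable
# the suite honors. The directory name here is an arbitrary example.
export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2-spark-versions"
mkdir -p "$SPARK_VERSIONS_SUITE_IVY_PATH"
# Then re-run the failing module, e.g.:
# mvn clean test -Phive -pl sql/hive -Pscala-2.13
```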
[jira] [Created] (SPARK-34165) Add countDistinct option to Dataset#summary
Matthew Powers created SPARK-34165: -- Summary: Add countDistinct option to Dataset#summary Key: SPARK-34165 URL: https://issues.apache.org/jira/browse/SPARK-34165 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.2.0 Reporter: Matthew Powers The Dataset#summary function supports options like count, mean, min, and max. It's a great little function for lightweight exploratory data analysis. A count distinct of each column is a common exploratory data analysis workflow. This should be easy to add (piggybacking off the existing countDistinct code), entirely backwards compatible, and will help a lot of users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
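Until such an option lands in Dataset#summary, the statistic the ticket describes can be computed directly with the existing aggregate functions. A minimal sketch, assuming only an arbitrary input DataFrame:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, countDistinct}

// Sketch: one distinct count per column, the statistic proposed for
// Dataset#summary, computed in a single aggregation pass.
def distinctCounts(df: DataFrame): DataFrame =
  df.select(df.columns.map(c => countDistinct(col(c)).as(c)): _*)
```

For large tables, swapping `countDistinct` for `approx_count_distinct` gives a much cheaper estimate of the same statistic.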
[jira] [Resolved] (SPARK-33940) allow configuring the max column name length in csv writer
[ https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33940. -- Fix Version/s: 3.1.2 Resolution: Fixed Issue resolved by pull request 31246 [https://github.com/apache/spark/pull/31246] > allow configuring the max column name length in csv writer > -- > > Key: SPARK-33940 > URL: https://issues.apache.org/jira/browse/SPARK-33940 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Nan Zhu >Assignee: Nan Zhu >Priority: Major > Fix For: 3.1.2 > > > csv writer actually has an implicit limit on column name length due to > univocity-parser, > > when we initialize a writer > [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211,] > it calls toIdentifierGroupArray which calls valueOf in NormalizedString.java > eventually > ([https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209)] > > in that stringCache.get, it has a maxStringLength cap > [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104] > which is 1024 by default > > we do not expose this as configurable option, leading to NPE when we have a > column name larger than 1024, > > ``` > [info] Cause: java.lang.NullPointerException: > [info] at > com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349) > [info] at > com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444) > [info] at > com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410) > [info] at > org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87) > [info] at > 
org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58) > [info] at > org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CsvOutputWriter.scala:44) > [info] at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86) > [info] at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126) > [info] at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:111) > [info] at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269) > [info] at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210) > ``` > > it could be reproduced by a simple unit test > > ``` > val row1 = Row("a") > val superLongHeader = (0 until 1025).map(_ => "c").mkString("") > val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader) > df.repartition(1) > .write > .option("header", "true") > .option("maxColumnNameLength", 1025) > .csv(dataPath) > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33940) allow configuring the max column name length in csv writer
[ https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33940: Assignee: Nan Zhu > allow configuring the max column name length in csv writer > -- > > Key: SPARK-33940 > URL: https://issues.apache.org/jira/browse/SPARK-33940 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Nan Zhu >Assignee: Nan Zhu >Priority: Major > > csv writer actually has an implicit limit on column name length due to > univocity-parser, > > when we initialize a writer > [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211,] > it calls toIdentifierGroupArray which calls valueOf in NormalizedString.java > eventually > ([https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209)] > > in that stringCache.get, it has a maxStringLength cap > [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104] > which is 1024 by default > > we do not expose this as configurable option, leading to NPE when we have a > column name larger than 1024, > > ``` > [info] Cause: java.lang.NullPointerException: > [info] at > com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349) > [info] at > com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444) > [info] at > com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410) > [info] at > org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87) > [info] at > org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58) > [info] at > 
org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CsvOutputWriter.scala:44) > [info] at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86) > [info] at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126) > [info] at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:111) > [info] at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269) > [info] at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210) > ``` > > it could be reproduced by a simple unit test > > ``` > val row1 = Row("a") > val superLongHeader = (0 until 1025).map(_ => "c").mkString("") > val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader) > df.repartition(1) > .write > .option("header", "true") > .option("maxColumnNameLength", 1025) > .csv(dataPath) > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34164) Improve write side varchar check to visit only last few trailing spaces
[ https://issues.apache.org/jira/browse/SPARK-34164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34164: Assignee: (was: Apache Spark) > Improve write side varchar check to visit only last few trailing spaces > -- > > Key: SPARK-34164 > URL: https://issues.apache.org/jira/browse/SPARK-34164 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Kent Yao >Priority: Major > > For varchar(N), we currently trim all trailing spaces first and then check whether > the remaining length exceeds N; it is not necessary to visit all of them, only > those after position N. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34164) Improve write side varchar check to visit only last few trailing spaces
[ https://issues.apache.org/jira/browse/SPARK-34164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268326#comment-17268326 ] Apache Spark commented on SPARK-34164: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/31253 > Improve write side varchar check to visit only last few trailing spaces > -- > > Key: SPARK-34164 > URL: https://issues.apache.org/jira/browse/SPARK-34164 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Kent Yao >Priority: Major > > For varchar(N), we currently trim all trailing spaces first and then check whether > the remaining length exceeds N; it is not necessary to visit all of them, only > those after position N. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34164) Improve write side varchar check to visit only last few tailing spaces
[ https://issues.apache.org/jira/browse/SPARK-34164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34164: Assignee: Apache Spark > Improve write side varchar check to visit only last few tailing spaces > -- > > Key: SPARK-34164 > URL: https://issues.apache.org/jira/browse/SPARK-34164 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > For varchar(N), we currently trim all spaces first to check whether the > remained length exceeds, it not necessary to visit them all but at most to > those after N. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34052) A cached view should become invalid after a table is dropped
[ https://issues.apache.org/jira/browse/SPARK-34052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34052. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31107 [https://github.com/apache/spark/pull/31107] > A cached view should become invalid after a table is dropped > > > Key: SPARK-34052 > URL: https://issues.apache.org/jira/browse/SPARK-34052 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.0 > > > It seems a view doesn't become invalid after a DSv2 table is dropped or > replaced. This is different from V1 and may cause correctness issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34052) A cached view should become invalid after a table is dropped
[ https://issues.apache.org/jira/browse/SPARK-34052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-34052: --- Assignee: Chao Sun > A cached view should become invalid after a table is dropped > > > Key: SPARK-34052 > URL: https://issues.apache.org/jira/browse/SPARK-34052 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > It seems a view doesn't become invalid after a DSv2 table is dropped or > replaced. This is different from V1 and may cause correctness issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34164) Improve write side varchar check to visit only last few trailing spaces
Kent Yao created SPARK-34164: Summary: Improve write side varchar check to visit only last few trailing spaces Key: SPARK-34164 URL: https://issues.apache.org/jira/browse/SPARK-34164 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0, 3.2.0 Reporter: Kent Yao For varchar(N), we currently trim all trailing spaces first to check whether the remaining length exceeds N; it is not necessary to visit them all, only those after position N. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
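The optimization described in SPARK-34164 can be sketched as a small string check. This is an illustrative sketch of the idea only, not Spark's actual implementation: rather than trimming all trailing spaces and then measuring the remaining length, it is enough to inspect only the characters beyond position N.

```python
# Illustrative sketch of the varchar(N) write-side check described above;
# not Spark's actual implementation. A value fits varchar(N) exactly when
# every character after position N is a trailing space, so only the tail
# beyond N needs to be visited.
def fits_varchar(value: str, n: int) -> bool:
    if len(value) <= n:
        return True
    # Visit only the tail beyond N instead of scanning the whole string.
    return all(ch == ' ' for ch in value[n:])
```

For a value with thousands of trailing spaces, this visits at most the characters past N instead of trimming the entire string first, which is the improvement the ticket proposes.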
[jira] [Resolved] (SPARK-34056) Unify v1 and v2 ALTER TABLE .. RECOVER PARTITIONS tests
[ https://issues.apache.org/jira/browse/SPARK-34056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34056. - Resolution: Fixed Issue resolved by pull request 31105 [https://github.com/apache/spark/pull/31105] > Unify v1 and v2 ALTER TABLE .. RECOVER PARTITIONS tests > --- > > Key: SPARK-34056 > URL: https://issues.apache.org/jira/browse/SPARK-34056 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Extract the ALTER TABLE .. RECOVER PARTITIONS tests to a common place to run > them for V1 and V2 datasources. Some tests can be placed in V1- and > V2-specific test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34143) Adding partitions to fully partitioned v2 table
[ https://issues.apache.org/jira/browse/SPARK-34143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34143: - Fix Version/s: 3.1.2 > Adding partitions to fully partitioned v2 table > --- > > Key: SPARK-34143 > URL: https://issues.apache.org/jira/browse/SPARK-34143 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0, 3.1.2 > > > The test below fails: > {code:scala} > withNamespaceAndTable("ns", "tbl") { t => > sql(s"CREATE TABLE $t (p0 INT, p1 STRING) $defaultUsing PARTITIONED BY > (p0, p1)") > sql(s"ALTER TABLE $t ADD PARTITION (p0 = 0, p1 = 'abc')") > checkPartitions(t, Map("p0" -> "0", "p1" -> "abc")) > checkAnswer(sql(s"SELECT * FROM $t"), Row(0, "abc")) > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34121) Intersect operator missing rowCount when CBO enabled
[ https://issues.apache.org/jira/browse/SPARK-34121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34121. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31240 [https://github.com/apache/spark/pull/31240] > Intersect operator missing rowCount when CBO enabled > > > Key: SPARK-34121 > URL: https://issues.apache.org/jira/browse/SPARK-34121 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala#L60 > missing the row count. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34121) Intersect operator missing rowCount when CBO enabled
[ https://issues.apache.org/jira/browse/SPARK-34121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34121: Assignee: angerszhu > Intersect operator missing rowCount when CBO enabled > > > Key: SPARK-34121 > URL: https://issues.apache.org/jira/browse/SPARK-34121 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: angerszhu >Priority: Major > > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala#L60 > missing the row count. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
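For context on the SPARK-34121 entries above: whatever exact formula the linked fix uses, an INTERSECT result can never contain more rows than either of its inputs, so a simple conservative estimate can be sketched as follows (illustrative only, not the estimator code from the PR):

```python
# Conservative row-count bound for an INTERSECT node: the result is a
# subset of both children, so min(left, right) is always a valid upper
# bound. Illustrative only; not the BasicStatsPlanVisitor code.
def intersect_row_count_bound(left_rows: int, right_rows: int) -> int:
    return min(left_rows, right_rows)
```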
[jira] [Assigned] (SPARK-34115) Long runtime on many environment variables
[ https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34115: Assignee: Norbert Schultz > Long runtime on many environment variables > -- > > Key: SPARK-34115 > URL: https://issues.apache.org/jira/browse/SPARK-34115 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.0, 2.4.7, 3.0.1 > Environment: Spark 2.4.0 local[2] on a Kubernetes Pod >Reporter: Norbert Schultz >Assignee: Norbert Schultz >Priority: Major > Fix For: 3.0.2, 3.1.2 > > Attachments: spark-bug-34115.tar.gz > > > I am not sure if this is a bug report or a feature request. The code is > the same in current versions of Spark, and maybe this ticket saves someone > some time debugging. > We migrated some older code to Spark 2.4.0, and suddenly the integration > tests on our build machine were much slower than expected. > On local machines it was running perfectly. > In the end it turned out that Spark was wasting CPU cycles during DataFrame > analysis in the following functions: > * AnalysisHelper.assertNotAnalysisRule calling > * Utils.isTesting > Utils.isTesting traverses all environment variables. > The offending build machine was a Kubernetes Pod which automatically exposed > all services as environment variables, so it had more than 3000 environment > variables. > Since Utils.isTesting is called very often through > AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and > transformUp), this overhead adds up quickly. > > Of course we will restrict the number of environment variables; on the other > hand, Utils.isTesting could also use a lazy val for > > {code:java} > sys.env.contains("SPARK_TESTING") {code} > > so that it is not that expensive. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34115) Long runtime on many environment variables
[ https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34115. -- Fix Version/s: 3.1.2 3.0.2 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/31244 > Long runtime on many environment variables > -- > > Key: SPARK-34115 > URL: https://issues.apache.org/jira/browse/SPARK-34115 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.0, 2.4.7, 3.0.1 > Environment: Spark 2.4.0 local[2] on a Kubernetes Pod >Reporter: Norbert Schultz >Priority: Major > Fix For: 3.0.2, 3.1.2 > > Attachments: spark-bug-34115.tar.gz > > > I am not sure if this is a bug report or a feature request. The code is > the same in current versions of Spark, and maybe this ticket saves someone > some time debugging. > We migrated some older code to Spark 2.4.0, and suddenly the integration > tests on our build machine were much slower than expected. > On local machines it was running perfectly. > In the end it turned out that Spark was wasting CPU cycles during DataFrame > analysis in the following functions: > * AnalysisHelper.assertNotAnalysisRule calling > * Utils.isTesting > Utils.isTesting traverses all environment variables. > The offending build machine was a Kubernetes Pod which automatically exposed > all services as environment variables, so it had more than 3000 environment > variables. > Since Utils.isTesting is called very often through > AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and > transformUp), this overhead adds up quickly. > > Of course we will restrict the number of environment variables; on the other > hand, Utils.isTesting could also use a lazy val for > > {code:java} > sys.env.contains("SPARK_TESTING") {code} > > so that it is not that expensive. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
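The lazy-val suggestion from the SPARK-34115 ticket can be illustrated in Python: compute the environment lookup once and reuse the result, instead of re-scanning the environment on every call. A hedged sketch of the idea only (the actual fix in the linked PR is in Scala and uses a lazy val):

```python
import functools
import os

# Sketch of the fix idea: memoize the SPARK_TESTING lookup so hot paths
# such as AnalysisHelper.assertNotAnalysisRule pay the environment-scan
# cost at most once, no matter how many environment variables exist.
# (The Scala fix uses a lazy val; lru_cache plays the same role here.)
@functools.lru_cache(maxsize=1)
def is_testing() -> bool:
    return "SPARK_TESTING" in os.environ
```

As with a lazy val, the cached result is fixed after the first call, which is exactly the desired semantics here since the environment does not change during a Spark session.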
[jira] [Updated] (SPARK-34121) Intersect operator missing rowCount when CBO enabled
[ https://issues.apache.org/jira/browse/SPARK-34121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-34121: Description: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala#L60 missing the row count. > Intersect operator missing rowCount when CBO enabled > > > Key: SPARK-34121 > URL: https://issues.apache.org/jira/browse/SPARK-34121 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala#L60 > missing the row count. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33888) JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis
[ https://issues.apache.org/jira/browse/SPARK-33888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268291#comment-17268291 ] Apache Spark commented on SPARK-33888: -- User 'skestle' has created a pull request for this issue: https://github.com/apache/spark/pull/31252 > JDBC SQL TIME type represents incorrectly as TimestampType, it should be > physical Int in millis > --- > > Key: SPARK-33888 > URL: https://issues.apache.org/jira/browse/SPARK-33888 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3, 3.0.0, 3.0.1 >Reporter: Duc Hoa Nguyen >Assignee: Apache Spark >Priority: Minor > Fix For: 3.2.0 > > > Currently, for JDBC, the SQL TIME type is represented incorrectly as Spark > TimestampType. It should be represented as a physical int in millis: a time > of day, with no reference to a particular calendar, time zone or date, with a > precision of one millisecond, storing the number of milliseconds after > midnight, 00:00:00.000. > We encountered the issue of the Avro logical type `TimeMillis` not being > converted correctly to the Spark `Timestamp` struct type using the > `SchemaConverters`; it converts to a regular `int` instead. Reproducible by > ingesting data from a MySQL table with a column of TIME type: the Spark JDBC > dataframe will get the correct type (Timestamp), but enforcing our avro > schema (`{"type": "int"," logicalType": "time-millis"}`) externally will fail > to apply with the following exception: > {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type > for schema of int}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33888) JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis
[ https://issues.apache.org/jira/browse/SPARK-33888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268285#comment-17268285 ] Stephen Kestle commented on SPARK-33888: I've searched for "scale" and only Oracle seems to use it (although it defaults to 0 if not present). The tests, however, define it for other number types (SMALL_INT etc.), which should have a defined scale: * [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L901] * [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L938] I think that these probably aren't necessary. > JDBC SQL TIME type represents incorrectly as TimestampType, it should be > physical Int in millis > --- > > Key: SPARK-33888 > URL: https://issues.apache.org/jira/browse/SPARK-33888 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3, 3.0.0, 3.0.1 >Reporter: Duc Hoa Nguyen >Assignee: Apache Spark >Priority: Minor > Fix For: 3.2.0 > > > Currently, for JDBC, the SQL TIME type is represented incorrectly as Spark > TimestampType. It should be represented as a physical int in millis: a time > of day, with no reference to a particular calendar, time zone or date, with a > precision of one millisecond, storing the number of milliseconds after > midnight, 00:00:00.000. > We encountered the issue of the Avro logical type `TimeMillis` not being > converted correctly to the Spark `Timestamp` struct type using the > `SchemaConverters`; it converts to a regular `int` instead. Reproducible by > ingesting data from a MySQL table with a column of TIME type: the Spark JDBC > dataframe will get the correct type (Timestamp), but enforcing our avro > schema (`{"type": "int"," logicalType": "time-millis"}`) externally will fail > to apply with the following exception: > {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type > for schema of int}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
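The `time-millis` arithmetic behind the SPARK-33888 discussion above is simple: a physical int holding the number of milliseconds after midnight. A minimal sketch of the encoding and its inverse (helper names are illustrative, not from Spark or Avro libraries):

```python
import datetime

# Encode a time of day as the Avro `time-millis` physical value: the
# number of milliseconds after midnight, 00:00:00.000, with no calendar,
# time zone, or date attached. Helper names are illustrative.
def time_to_millis(t: datetime.time) -> int:
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1000 + t.microsecond // 1000

def millis_to_time(ms: int) -> datetime.time:
    seconds, millis = divmod(ms, 1000)
    minutes, sec = divmod(seconds, 60)
    hours, minute = divmod(minutes, 60)
    return datetime.time(hours, minute, sec, millis * 1000)
```

This makes concrete why a `java.sql.Timestamp` cannot satisfy an Avro schema of physical type `int`: the external value must be this plain integer, not a timestamp object.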
[jira] [Updated] (SPARK-34163) Spark Structured Streaming - Kafka avro transformation on optional field Failed
[ https://issues.apache.org/jira/browse/SPARK-34163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Kizhakkel Jose updated SPARK-34163: - Description: Hello All, I have a spark structured streaming job to inject data from Kafka where message from Kafka is avro type. Some of the fields are optional in the data. And I have to perform transformation if those optional fields are present in the data. So I tried to check whether the column exists by : {color:#0747A6}def has_column(dataframe, col): """ This function checks the existence of a given column in the given DataFrame :param dataframe: the dataframe :type dataframe: DataFrame :param col: the column name :type col: str :return: true if the column exists else false :rtype: boolean """ try: dataframe[col] return True except AnalysisException: return False{color} But it seems not working when its a streaming dataframe, but when the dataframe is normal dataframe, and when a column is not present the above check returns false, therefore I can ignore the transformation on the missing column. But on Streaming dataframe *has_column* always returns true and therefore the transformation get executed and cause exception. What is the right approach to check existence of column in a streaming dataframe before performing transformation? Why streaming dataframe and normal dataframe differ in behavior? How to skip transformation on a column if it doesn't exists? was: Hello All, I have a spark structured streaming job to inject data from Kafka where message from Kafka is avro type. Some of the fields are optional in the data. And I have to perform transformation if those optional fields are present in the data. 
So I tried to check whether the column exists by : {color:#0747A6}def has_column(dataframe, col): """ This function checks the existence of a given column in the given DataFrame :param dataframe: the dataframe :type dataframe: DataFrame :param col: the column name :type col: str :return: true if the column exists else false :rtype: boolean """ try: dataframe[col] return True except AnalysisException: return False{color} But it seems not working when its a streaming dataframe, but when the dataframe is normal dataframe, and when a column is not present the above check returns false, therefore I can ignore the transformation on the missing column. But on Streaming dataframe *has_column* always returns true and therefore the transformation get executed and cause exception. What is the right approach to check existence of column in a streaming dataframe before performing transformation? > Spark Structured Streaming - Kafka avro transformation on optional field > Failed > --- > > Key: SPARK-34163 > URL: https://issues.apache.org/jira/browse/SPARK-34163 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.7 >Reporter: Felix Kizhakkel Jose >Priority: Major > > Hello All, > I have a spark structured streaming job to inject data from Kafka where > message from Kafka is avro type. > Some of the fields are optional in the data. And I have to perform > transformation if those optional fields are present in the data. 
> So I tried to check whether the column exists by : > {color:#0747A6}def has_column(dataframe, col): > """ > This function checks the existence of a given column in the given > DataFrame > :param dataframe: the dataframe > :type dataframe: DataFrame > :param col: the column name > :type col: str > :return: true if the column exists else false > :rtype: boolean > """ > try: > dataframe[col] > return True > except AnalysisException: > return False{color} > But it seems not working when its a streaming dataframe, but when the > dataframe is normal dataframe, and when a column is not present the above > check returns false, therefore I can ignore the transformation on the missing > column. > But on Streaming dataframe *has_column* always returns true and therefore the > transformation get executed and cause exception. What is the right approach > to check existence of column in a streaming dataframe before performing > transformation? > Why streaming dataframe and normal dataframe differ in behavior? How to skip > transformation on a column if it doesn't exists? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34163) Spark Structured Streaming - Kafka avro transformation on optional field Failed
[ https://issues.apache.org/jira/browse/SPARK-34163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Kizhakkel Jose updated SPARK-34163: - Description: Hello All, I have a spark structured streaming job to inject data from Kafka where message from Kafka is avro type. Some of the fields are optional in the data. And I have to perform transformation if those optional fields are present in the data. So I tried to check whether the column exists by : {color:#0747A6}def has_column(dataframe, col): """ This function checks the existence of a given column in the given DataFrame :param dataframe: the dataframe :type dataframe: DataFrame :param col: the column name :type col: str :return: true if the column exists else false :rtype: boolean """ try: dataframe[col] return True except AnalysisException: return False{color} But it seems not working when its a streaming dataframe, but when the dataframe is normal dataframe, and when a column is not present the above check returns false, therefore I can ignore the transformation on the missing column. But on Streaming dataframe *has_column* always returns true and therefore the transformation get executed and cause exception. What is the right approach to check existence of column in a streaming dataframe before performing transformation? was: Hello All, I have a spark structured streaming job to inject data from Kafka where message from Kafka is avro type. Some of the fields are optional in the data. And I have to perform transformation if those optional fields are present in the data. 
So I tried to check whether the column exists by : *def has_column(dataframe, col): """ This function checks the existence of a given column in the given DataFrame :param dataframe: the dataframe :type dataframe: DataFrame :param col: the column name :type col: str :return: true if the column exists else false :rtype: boolean """ try: dataframe[col] return True except AnalysisException: return False* But it seems not working when its a streaming dataframe, but when the dataframe is normal dataframe, and when a column is not present the above check returns false, therefore I can ignore the transformation on the missing column. But on Streaming dataframe *has_column* always returns true and therefore the transformation get executed and cause exception. What is the right approach to check existence of column in a streaming dataframe before performing transformation? > Spark Structured Streaming - Kafka avro transformation on optional field > Failed > --- > > Key: SPARK-34163 > URL: https://issues.apache.org/jira/browse/SPARK-34163 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.7 >Reporter: Felix Kizhakkel Jose >Priority: Major > > Hello All, > I have a spark structured streaming job to inject data from Kafka where > message from Kafka is avro type. > Some of the fields are optional in the data. And I have to perform > transformation if those optional fields are present in the data. 
> So I tried to check whether the column exists by : > {color:#0747A6}def has_column(dataframe, col): > """ > This function checks the existence of a given column in the given > DataFrame > :param dataframe: the dataframe > :type dataframe: DataFrame > :param col: the column name > :type col: str > :return: true if the column exists else false > :rtype: boolean > """ > try: > dataframe[col] > return True > except AnalysisException: > return False{color} > But it seems not working when its a streaming dataframe, but when the > dataframe is normal dataframe, and when a column is not present the above > check returns false, therefore I can ignore the transformation on the missing > column. > But on Streaming dataframe *has_column* always returns true and therefore the > transformation get executed and cause exception. What is the right approach > to check existence of column in a streaming dataframe before performing > transformation? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34163) Spark Structured Streaming - Kafka avro transformation on optional field Failed
Felix Kizhakkel Jose created SPARK-34163: Summary: Spark Structured Streaming - Kafka avro transformation on optional field Failed Key: SPARK-34163 URL: https://issues.apache.org/jira/browse/SPARK-34163 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.7 Reporter: Felix Kizhakkel Jose Hello All, I have a Spark Structured Streaming job to ingest data from Kafka, where the message from Kafka is Avro-typed. Some of the fields are optional in the data, and I have to perform a transformation if those optional fields are present. So I tried to check whether the column exists with: *def has_column(dataframe, col): """ This function checks the existence of a given column in the given DataFrame :param dataframe: the dataframe :type dataframe: DataFrame :param col: the column name :type col: str :return: true if the column exists else false :rtype: boolean """ try: dataframe[col] return True except AnalysisException: return False* This works when the dataframe is a normal dataframe: when a column is not present, the check returns false, so I can skip the transformation on the missing column. But on a streaming dataframe, *has_column* always returns true, so the transformation gets executed and causes an exception. What is the right approach to check the existence of a column in a streaming dataframe before performing a transformation? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
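A commonly suggested alternative to the try/except check quoted above is to test membership in the DataFrame's column list, which does not depend on an AnalysisException being raised eagerly (the behavior that differs for streaming DataFrames). A hedged sketch; whether this covers the reporter's exact case (e.g. nested Avro fields) is an assumption:

```python
# Alternative existence check: consult the column list instead of relying
# on dataframe[col] raising AnalysisException (which, per the report,
# does not happen eagerly for streaming DataFrames). Works for any object
# exposing a `.columns` sequence, e.g. pyspark.sql.DataFrame; note it
# only sees top-level columns, not nested struct fields.
def has_column(dataframe, col: str) -> bool:
    return col in dataframe.columns
```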
[jira] [Commented] (SPARK-31786) Exception on submitting Spark-Pi to Kubernetes 1.17.3
[ https://issues.apache.org/jira/browse/SPARK-31786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268257#comment-17268257 ] Dongjoon Hyun commented on SPARK-31786: --- Hi, [~smurarka]. Apache Jira is not designed for Q&A. You had better use the official mailing list if you have a question unrelated to this JIRA. > Exception on submitting Spark-Pi to Kubernetes 1.17.3 > - > > Key: SPARK-31786 > URL: https://issues.apache.org/jira/browse/SPARK-31786 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.5, 3.0.0 >Reporter: Maciej Bryński >Assignee: Dongjoon Hyun >Priority: Blocker > Fix For: 3.0.0 > > > Hi, > I'm getting an exception when submitting the Spark-Pi app to a Kubernetes > cluster. > Kubernetes version: 1.17.3 > JDK version: openjdk version "1.8.0_252" > Exception: > {code} > ./bin/spark-submit --master k8s://https://172.31.23.60:8443 --deploy-mode > cluster --name spark-pi --conf > spark.kubernetes.container.image=spark-py:2.4.5 --conf > spark.kubernetes.executor.request.cores=0.1 --conf > spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf > spark.executor.instances=1 local:///opt/spark/examples/src/main/python/pi.py > log4j:WARN No appenders could be found for logger > (io.fabric8.kubernetes.client.Config). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] > for kind: [Pod] with name: [null] in namespace: [default] failed. 
> at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:337) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:330) > at > org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:141) > at > org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:140) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543) > at > org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:140) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:250) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:241) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.net.SocketException: Broken pipe (Write failed) > at 
java.net.SocketOutputStream.socketWrite0(Native Method) > at > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) > at java.net.SocketOutputStream.write(SocketOutputStream.java:155) > at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431) > at sun.security.ssl.OutputRecord.write(OutputRecord.java:417) > at > sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:894) > at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:865) > at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123) > at okio.Okio$1.write(Okio.java:79) > at okio.AsyncTimeout$1.write(AsyncTimeout.java:180) > at okio.RealBufferedSink.flush(RealBufferedSink.java:224) > at okhttp3.internal.http2.Http2Writer.settings(Http2Writer.java:203) > at >
[jira] [Assigned] (SPARK-34162) Add PyArrow compatibility note for Python 3.9
[ https://issues.apache.org/jira/browse/SPARK-34162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34162: Assignee: Apache Spark > Add PyArrow compatibility note for Python 3.9 > - > > Key: SPARK-34162 > URL: https://issues.apache.org/jira/browse/SPARK-34162 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.2.0, 3.1.1 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34162) Add PyArrow compatibility note for Python 3.9
[ https://issues.apache.org/jira/browse/SPARK-34162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34162: Assignee: (was: Apache Spark) > Add PyArrow compatibility note for Python 3.9 > - > > Key: SPARK-34162 > URL: https://issues.apache.org/jira/browse/SPARK-34162 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.2.0, 3.1.1 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34162) Add PyArrow compatibility note for Python 3.9
[ https://issues.apache.org/jira/browse/SPARK-34162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268250#comment-17268250 ] Apache Spark commented on SPARK-34162: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/31251 > Add PyArrow compatibility note for Python 3.9 > - > > Key: SPARK-34162 > URL: https://issues.apache.org/jira/browse/SPARK-34162 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.2.0, 3.1.1 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34162) Add PyArrow compatibility note for Python 3.9
Dongjoon Hyun created SPARK-34162: - Summary: Add PyArrow compatibility note for Python 3.9 Key: SPARK-34162 URL: https://issues.apache.org/jira/browse/SPARK-34162 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 3.2.0, 3.1.1 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31786) Exception on submitting Spark-Pi to Kubernetes 1.17.3
[ https://issues.apache.org/jira/browse/SPARK-31786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268214#comment-17268214 ] Sachit Murarka commented on SPARK-31786: [~dongjoon] To run Spark on Kubernetes, I see the following in [entrypoint.sh|https://t.co/IPbK5sEIqM?amp=1]: {code} case "$1" in driver) shift 1 CMD=( "$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client {code} Could you please explain why deploy-mode client is specified in [entrypoint.sh|https://t.co/IPbK5sEIqM?amp=1]? I am running spark-submit with deploy mode cluster, yet entrypoint.sh specifies client mode. > Exception on submitting Spark-Pi to Kubernetes 1.17.3 > - > > Key: SPARK-31786 > URL: https://issues.apache.org/jira/browse/SPARK-31786 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.5, 3.0.0 >Reporter: Maciej Bryński >Assignee: Dongjoon Hyun >Priority: Blocker > Fix For: 3.0.0 > > > Hi, > I'm getting exception when submitting Spark-Pi app to Kubernetes cluster. > Kubernetes version: 1.17.3 > JDK version: openjdk version "1.8.0_252" > Exception: > {code} > ./bin/spark-submit --master k8s://https://172.31.23.60:8443 --deploy-mode > cluster --name spark-pi --conf > spark.kubernetes.container.image=spark-py:2.4.5 --conf > spark.kubernetes.executor.request.cores=0.1 --conf > spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf > spark.executor.instances=1 local:///opt/spark/examples/src/main/python/pi.py > log4j:WARN No appenders could be found for logger > (io.fabric8.kubernetes.client.Config). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] > for kind: [Pod] with name: [null] in namespace: [default] failed. 
> at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:337) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:330) > at > org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:141) > at > org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:140) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543) > at > org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:140) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:250) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:241) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.net.SocketException: Broken pipe (Write failed) > at 
java.net.SocketOutputStream.socketWrite0(Native Method) > at > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) > at java.net.SocketOutputStream.write(SocketOutputStream.java:155) > at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431) > at sun.security.ssl.OutputRecord.write(OutputRecord.java:417) > at > sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:894) > at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:865) > at
[jira] [Commented] (SPARK-34099) Refactor table caching in `DataSourceV2Strategy`
[ https://issues.apache.org/jira/browse/SPARK-34099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268210#comment-17268210 ] Apache Spark commented on SPARK-34099: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31250 > Refactor table caching in `DataSourceV2Strategy` > > > Key: SPARK-34099 > URL: https://issues.apache.org/jira/browse/SPARK-34099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Currently, invalidateCache() performs 3 calls to the Cache Manager to refresh > table cache. We can perform only one call recacheByPlan(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
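The refactor described above — collapsing three separate Cache Manager calls into a single recacheByPlan() — can be sketched as follows. This is a minimal illustration of the design choice, not Spark's actual CacheManager API; all names here (CacheManager, recache_by_plan, and so on) are invented stand-ins.

```python
# Toy cache manager illustrating the SPARK-34099 idea: the same end state,
# reached with one round trip instead of three. Names are illustrative only.
class CacheManager:
    def __init__(self):
        self.cache = {}   # table name -> materialized data
        self.calls = 0    # count round trips for illustration

    def uncache(self, name):
        self.calls += 1
        self.cache.pop(name, None)

    def refresh(self, name):
        self.calls += 1   # placeholder for a metadata refresh round trip

    def cache_plan(self, name, data):
        self.calls += 1
        self.cache[name] = data

    def recache_by_plan(self, name, data):
        # One call that drops the stale entry and re-materializes it.
        self.calls += 1
        self.cache[name] = data


old, new = CacheManager(), CacheManager()
# Before: three round trips to refresh a table's cache.
old.uncache("t"); old.refresh("t"); old.cache_plan("t", [1, 2])
# After: the same effect with a single call.
new.recache_by_plan("t", [1, 2])
```

Both managers end up with identical cache contents; only the number of calls differs.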
[jira] [Commented] (SPARK-33792) NullPointerException in HDFSBackedStateStoreProvider
[ https://issues.apache.org/jira/browse/SPARK-33792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268202#comment-17268202 ] Jungtaek Lim commented on SPARK-33792: -- I have no idea about the possibility, but the new stack trace concerns me. If I understand correctly, unless there's another possibility, it is likely saying that a key or value in the map is null, which doesn't look possible. I'm not sure the failure is consistent, as you described it as happening "time to time". Does the problematic checkpoint always fail consistently? Or does it simply work on another run? Reproducing the issue is quite important for solving the problem. Please also add the Java vendor & version to the issue description. Also, if the issue is vendor-specific (as you've mentioned EMR), I'm not sure we'd like to investigate it here. Please check whether the problem also appears with "Apache" Spark 2.4.4. > NullPointerException in HDFSBackedStateStoreProvider > > > Key: SPARK-33792 > URL: https://issues.apache.org/jira/browse/SPARK-33792 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4 > Environment: emr-5.28.0 >Reporter: Gabor Barna >Priority: Major > > Hi, > We are getting NPEs with spark structured streaming from time-to-time. 
Here's > the stacktrace: > {code:java} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore$$anonfun$iterator$1.apply(HDFSBackedStateStoreProvider.scala:164) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore$$anonfun$iterator$1.apply(HDFSBackedStateStoreProvider.scala:163) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:220) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.sql.kinesis.KinesisWriteTask.execute(KinesisWriteTask.scala:50) > at > org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply$mcV$sp(KinesisWriter.scala:40) > at > org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply(KinesisWriter.scala:40) > at > org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply(KinesisWriter.scala:40) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at > org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(KinesisWriter.scala:40) > at > 
org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(KinesisWriter.scala:38) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) {code} > As you can see we are using [https://github.com/qubole/kinesis-sql/] for > writing the kinesis queue, the state backend is HDFS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To
[jira] [Commented] (SPARK-34078) Provide async variants for Dataset APIs
[ https://issues.apache.org/jira/browse/SPARK-34078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268201#comment-17268201 ] Yesheng Ma commented on SPARK-34078: Thanks! I'm looking into this and will prepare a diff shortly. > Provide async variants for Dataset APIs > --- > > Key: SPARK-34078 > URL: https://issues.apache.org/jira/browse/SPARK-34078 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Yesheng Ma >Priority: Major > > Spark RDDs have async variants such as `collectAsync`, which comes handy when > we want to cancel a job. However, Dataset APIs are lacking such APIs, which > makes it very painful to cancel a Dataset/SQL job. > > The proposed change was to add async variants so that we can directly cancel > a Dataset/SQL query via a future programmatically. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
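The pattern the ticket asks for — getting a future back from a long-running action so the job can be cancelled programmatically — can be sketched with Python's concurrent.futures. Note this is only an analogy for the proposed Dataset API, not Spark's implementation; run_query is a stand-in for a long-running Dataset/SQL action.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Sketch of the "async variant" idea: submit a (stand-in) query and receive
# a Future. A pending Future can be cancelled before it ever runs, which is
# exactly the control the ticket wants for Dataset/SQL jobs.
def run_query(n):
    time.sleep(0.05)          # simulate a long-running action
    return sum(range(n))

with ThreadPoolExecutor(max_workers=1) as pool:
    running = pool.submit(run_query, 1000)  # occupies the single worker
    queued = pool.submit(run_query, 1000)   # still pending, so cancellable
    cancelled = queued.cancel()             # True: the queued job never runs
    result = running.result()               # block for the first job
```

The first future must be waited on (or cancelled best-effort); the second is rejected before it starts, which is the behaviour `collectAsync`-style APIs expose on RDDs.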
[jira] [Assigned] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning
[ https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34105: Assignee: (was: Apache Spark) > In addition to killing exlcuded/flakey executors which should support > decommissioning > - > > Key: SPARK-34105 > URL: https://issues.apache.org/jira/browse/SPARK-34105 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Holden Karau >Priority: Major > > Decommissioning will give the executor a chance to migrate it's files to a > more stable node. > > Note: we want SPARK-34104 to be integrated as well so that flaky executors > which can not decommission are eventually killed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning
[ https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34105: Assignee: Apache Spark > In addition to killing exlcuded/flakey executors which should support > decommissioning > - > > Key: SPARK-34105 > URL: https://issues.apache.org/jira/browse/SPARK-34105 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Major > > Decommissioning will give the executor a chance to migrate it's files to a > more stable node. > > Note: we want SPARK-34104 to be integrated as well so that flaky executors > which can not decommission are eventually killed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning
[ https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268196#comment-17268196 ] Apache Spark commented on SPARK-34105: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/31249 > In addition to killing exlcuded/flakey executors which should support > decommissioning > - > > Key: SPARK-34105 > URL: https://issues.apache.org/jira/browse/SPARK-34105 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Holden Karau >Priority: Major > > Decommissioning will give the executor a chance to migrate it's files to a > more stable node. > > Note: we want SPARK-34104 to be integrated as well so that flaky executors > which can not decommission are eventually killed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning
[ https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268197#comment-17268197 ] Apache Spark commented on SPARK-34105: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/31249 > In addition to killing exlcuded/flakey executors which should support > decommissioning > - > > Key: SPARK-34105 > URL: https://issues.apache.org/jira/browse/SPARK-34105 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Holden Karau >Priority: Major > > Decommissioning will give the executor a chance to migrate it's files to a > more stable node. > > Note: we want SPARK-34104 to be integrated as well so that flaky executors > which can not decommission are eventually killed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34104) Allow users to specify a maximum decommissioning time
[ https://issues.apache.org/jira/browse/SPARK-34104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268195#comment-17268195 ] Apache Spark commented on SPARK-34104: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/31249 > Allow users to specify a maximum decommissioning time > - > > Key: SPARK-34104 > URL: https://issues.apache.org/jira/browse/SPARK-34104 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > > We currently have the ability for users to set the predicted time of the > cluster manager or cloud provider to terminate a decommissioning executor, > but for nodes where Spark it's self is triggering decommissioning we should > add the ability of users to specify a maximum time we want to allow the > executor to decommission. > > This is important especially if we start to in more places (like with > excluded executors that are found to be flaky, that may or may not be able to > decommission successfully). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
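The proposal above — a user-specified maximum decommissioning time — boils down to bounding block migration with a deadline and force-killing when it expires. The sketch below is purely illustrative (it is not Spark's decommissioning code; decommission and migrate_one are invented names).

```python
import time

# Illustrative sketch: migrate blocks until finished or until a
# user-configured deadline expires, then report that the executor
# should be force-killed instead of waiting forever.
def decommission(blocks_to_migrate, migrate_one, max_seconds):
    deadline = time.monotonic() + max_seconds
    migrated = []
    for block in blocks_to_migrate:
        if time.monotonic() >= deadline:
            return migrated, "killed"        # deadline hit: give up
        migrated.append(migrate_one(block))
    return migrated, "decommissioned"        # finished within the budget


done, status = decommission([1, 2, 3], lambda b: b, max_seconds=5.0)
cut, cut_status = decommission([1, 2, 3], lambda b: b, max_seconds=0.0)
```

With a generous budget every block migrates; with a zero budget the loop bails out immediately, which is the safety net the ticket wants for flaky executors that cannot decommission.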
[jira] [Assigned] (SPARK-34104) Allow users to specify a maximum decommissioning time
[ https://issues.apache.org/jira/browse/SPARK-34104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34104: Assignee: Holden Karau (was: Apache Spark) > Allow users to specify a maximum decommissioning time > - > > Key: SPARK-34104 > URL: https://issues.apache.org/jira/browse/SPARK-34104 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > > We currently have the ability for users to set the predicted time of the > cluster manager or cloud provider to terminate a decommissioning executor, > but for nodes where Spark it's self is triggering decommissioning we should > add the ability of users to specify a maximum time we want to allow the > executor to decommission. > > This is important especially if we start to in more places (like with > excluded executors that are found to be flaky, that may or may not be able to > decommission successfully). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34104) Allow users to specify a maximum decommissioning time
[ https://issues.apache.org/jira/browse/SPARK-34104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34104: Assignee: Apache Spark (was: Holden Karau) > Allow users to specify a maximum decommissioning time > - > > Key: SPARK-34104 > URL: https://issues.apache.org/jira/browse/SPARK-34104 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Major > > We currently have the ability for users to set the predicted time of the > cluster manager or cloud provider to terminate a decommissioning executor, > but for nodes where Spark it's self is triggering decommissioning we should > add the ability of users to specify a maximum time we want to allow the > executor to decommission. > > This is important especially if we start to in more places (like with > excluded executors that are found to be flaky, that may or may not be able to > decommission successfully). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34161) Check re-caching of v2 table dependents after table altering
[ https://issues.apache.org/jira/browse/SPARK-34161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-34161: --- Description: Add tests to unified test suites and check that dependants of v2 table are still cached after table altering. (was: Add tests to unified tests and check that dependants of v2 table are still cached after table altering.) > Check re-caching of v2 table dependents after table altering > > > Key: SPARK-34161 > URL: https://issues.apache.org/jira/browse/SPARK-34161 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Add tests to unified test suites and check that dependants of v2 table are > still cached after table altering. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34161) Check re-caching of v2 table dependents after table altering
Maxim Gekk created SPARK-34161: -- Summary: Check re-caching of v2 table dependents after table altering Key: SPARK-34161 URL: https://issues.apache.org/jira/browse/SPARK-34161 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk Add tests to unified tests and check that dependants of v2 table are still cached after table altering. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34143) Adding partitions to fully partitioned v2 table
[ https://issues.apache.org/jira/browse/SPARK-34143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268156#comment-17268156 ] Apache Spark commented on SPARK-34143: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31247 > Adding partitions to fully partitioned v2 table > --- > > Key: SPARK-34143 > URL: https://issues.apache.org/jira/browse/SPARK-34143 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > The test below fails: > {code:scala} > withNamespaceAndTable("ns", "tbl") { t => > sql(s"CREATE TABLE $t (p0 INT, p1 STRING) $defaultUsing PARTITIONED BY > (p0, p1)") > sql(s"ALTER TABLE $t ADD PARTITION (p0 = 0, p1 = 'abc')") > checkPartitions(t, Map("p0" -> "0", "p1" -> "abc")) > checkAnswer(sql(s"SELECT * FROM $t"), Row(0, "abc")) > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
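The corner case this test exercises — a table partitioned by all of its columns — can be modelled with a toy table. Here every row consists entirely of partition values, so a scan must synthesize rows from the partition specs alone. FullyPartitionedTable is an invented illustration, not Spark's v2 table implementation.

```python
# Toy model of a v2 table partitioned by ALL of its columns: rows carry no
# data beyond the partition values, so SELECT * is reconstructed purely from
# the registered partitions. Illustrative only.
class FullyPartitionedTable:
    def __init__(self, columns):
        self.columns = columns
        self.partitions = []          # each partition is a dict col -> value

    def add_partition(self, **spec):
        assert set(spec) == set(self.columns), "all columns are partition keys"
        self.partitions.append(spec)

    def scan(self):
        # SELECT *: one row per partition, built from the partition values.
        return [tuple(p[c] for c in self.columns) for p in self.partitions]


tbl = FullyPartitionedTable(["p0", "p1"])
tbl.add_partition(p0=0, p1="abc")
```

This mirrors the failing Scala test: after `ALTER TABLE ... ADD PARTITION (p0 = 0, p1 = 'abc')`, a full scan should return the row (0, "abc").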
[jira] [Commented] (SPARK-34143) Adding partitions to fully partitioned v2 table
[ https://issues.apache.org/jira/browse/SPARK-34143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268153#comment-17268153 ] Apache Spark commented on SPARK-34143: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31247 > Adding partitions to fully partitioned v2 table > --- > > Key: SPARK-34143 > URL: https://issues.apache.org/jira/browse/SPARK-34143 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > The test below fails: > {code:scala} > withNamespaceAndTable("ns", "tbl") { t => > sql(s"CREATE TABLE $t (p0 INT, p1 STRING) $defaultUsing PARTITIONED BY > (p0, p1)") > sql(s"ALTER TABLE $t ADD PARTITION (p0 = 0, p1 = 'abc')") > checkPartitions(t, Map("p0" -> "0", "p1" -> "abc")) > checkAnswer(sql(s"SELECT * FROM $t"), Row(0, "abc")) > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33940) allow configuring the max column name length in csv writer
[ https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268099#comment-17268099 ] Apache Spark commented on SPARK-33940: -- User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/31246 > allow configuring the max column name length in csv writer > -- > > Key: SPARK-33940 > URL: https://issues.apache.org/jira/browse/SPARK-33940 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Nan Zhu >Priority: Major > > csv writer actually has an implicit limit on column name length due to > univocity-parser, > > when we initialize a writer > [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211,] > it calls toIdentifierGroupArray which calls valueOf in NormalizedString.java > eventually > ([https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209)] > > in that stringCache.get, it has a maxStringLength cap > [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104] > which is 1024 by default > > we do not expose this as configurable option, leading to NPE when we have a > column name larger than 1024, > > ``` > [info] Cause: java.lang.NullPointerException: > [info] at > com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349) > [info] at > com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444) > [info] at > com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410) > [info] at > org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87) > [info] at > 
org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58) > [info] at > org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CsvOutputWriter.scala:44) > [info] at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86) > [info] at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126) > [info] at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:111) > [info] at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269) > [info] at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210) > ``` > > it could be reproduced by a simple unit test > > ``` > val row1 = Row("a") > val superLongHeader = (0 until 1025).map(_ => "c").mkString("") > val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader) > df.repartition(1) > .write > .option("header", "true") > .option("maxColumnNameLength", 1025) > .csv(dataPath) > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
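The failure mode the ticket describes — a hard-coded cap in a string cache silently turning over-long column names into null, which later dereference crashes on — can be imitated in a few lines of Python. This is an analogue of the univocity StringCache behaviour, not its actual code; AttributeError stands in for the reported NullPointerException.

```python
# Analogue of univocity's StringCache with a maximum string length:
# names longer than max_len bypass the cache and come back as None,
# and an unchecked later use of that None crashes.
class StringCache:
    def __init__(self, max_len=1024):
        self.max_len = max_len
        self._cache = {}

    def get(self, s):
        if len(s) > self.max_len:
            return None                              # silently dropped
        return self._cache.setdefault(s, s.lower())  # "normalized" form


def write_header(cache, name):
    normalized = cache.get(name)
    return normalized.upper()   # crashes when normalized is None


cache = StringCache()
ok = write_header(cache, "c" * 1024)       # exactly at the cap: fine
try:
    write_header(cache, "c" * 1025)        # one past the cap: crash
    failed = False
except AttributeError:
    failed = True
```

Exposing max_len as a writer option, as the ticket proposes, would let callers with very long column names raise the cap instead of hitting the crash.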
[jira] [Commented] (SPARK-33507) Improve and fix cache behavior in v1 and v2
[ https://issues.apache.org/jira/browse/SPARK-33507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268063#comment-17268063 ] Chao Sun commented on SPARK-33507: -- [~aokolnychyi] Could you elaborate on the question? Currently Spark doesn't support caching streaming tables yet. > Improve and fix cache behavior in v1 and v2 > --- > > Key: SPARK-33507 > URL: https://issues.apache.org/jira/browse/SPARK-33507 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Critical > > This is an umbrella JIRA to track fixes & improvements for caching behavior > in Spark datasource v1 and v2, which includes: > - fix existing cache behavior in v1 and v2. > - fix inconsistent cache behavior between v1 and v2 > - implement missing features in v2 to align with those in v1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34160) pyspark.ml.stat.Summarizer should allow sparse vector results
Ophir Yoktan created SPARK-34160: Summary: pyspark.ml.stat.Summarizer should allow sparse vector results Key: SPARK-34160 URL: https://issues.apache.org/jira/browse/SPARK-34160 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 3.0.1 Reporter: Ophir Yoktan Currently pyspark.ml.stat.Summarizer returns a dense vector, even if the input is sparse. The Summarizer should either deduce the relevant type from the input, or add a parameter that forces sparse output. Code to reproduce:
{code:python}
import pyspark
from pyspark.sql.functions import col
from pyspark.ml.stat import Summarizer
from pyspark.ml.linalg import SparseVector, DenseVector

sc = pyspark.SparkContext.getOrCreate()
sql_context = pyspark.SQLContext(sc)

df = sc.parallelize([(SparseVector(100, {1: 1.0}),)]).toDF(['v'])
print(df.head())
print(df.select(Summarizer.mean(col('v'))).head())
{code}
Output:
{code}
Row(v=SparseVector(100, {1: 1.0}))
Row(mean(v)=DenseVector([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
{code}
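To see why the mean of sparse inputs can itself stay sparse, here is a minimal pure-Python sketch (not the pyspark.ml API — `sparse_mean` is a hypothetical helper for illustration): only indices that are non-zero in at least one input vector need an entry in the result, so a 100-dimensional mean with a single non-zero component can be stored as one entry instead of 100.

```python
def sparse_mean(size, vectors):
    """Mean of sparse vectors given as {index: value} dicts.

    `size` is the (hypothetical) dimensionality; the result keeps only
    indices that are non-zero somewhere, i.e. it stays sparse.
    """
    totals = {}
    for vec in vectors:
        for idx, val in vec.items():
            totals[idx] = totals.get(idx, 0.0) + val
    n = len(vectors)
    return {idx: total / n for idx, total in totals.items()}

# Two 100-dimensional vectors, each with a single non-zero entry at index 1.
mean = sparse_mean(100, [{1: 1.0}, {1: 3.0}])
print(mean)  # {1: 2.0} -- one stored entry instead of 100
```

This is the behavior the ticket asks Summarizer to (optionally) preserve, rather than densifying the result unconditionally.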
[jira] [Commented] (SPARK-33792) NullPointerException in HDFSBackedStateStoreProvider
[ https://issues.apache.org/jira/browse/SPARK-33792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267972#comment-17267972 ] Gabor Barna commented on SPARK-33792: - Another flavor of probably the same issue {code:java} Caused by: java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1011) at java.util.concurrent.ConcurrentHashMap.putAll(ConcurrentHashMap.java:1084) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:204) at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:371) at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} > NullPointerException in HDFSBackedStateStoreProvider > > > Key: SPARK-33792 > URL: https://issues.apache.org/jira/browse/SPARK-33792 > Project: Spark > Issue Type: 
Bug > Components: Structured Streaming >Affects Versions: 2.4.4 > Environment: emr-5.28.0 >Reporter: Gabor Barna >Priority: Major > > Hi, > We are getting NPEs with spark structured streaming from time-to-time. Here's > the stacktrace: > {code:java} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore$$anonfun$iterator$1.apply(HDFSBackedStateStoreProvider.scala:164) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore$$anonfun$iterator$1.apply(HDFSBackedStateStoreProvider.scala:163) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:220) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.sql.kinesis.KinesisWriteTask.execute(KinesisWriteTask.scala:50) > at > org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply$mcV$sp(KinesisWriter.scala:40) > at > org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply(KinesisWriter.scala:40) > at > org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply(KinesisWriter.scala:40) > at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at > org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(KinesisWriter.scala:40) > at > org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(KinesisWriter.scala:38) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935) >
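The second stack trace above fails inside `java.util.concurrent.ConcurrentHashMap.putVal`, which throws `NullPointerException` for null keys or values; `putAll` calls `putVal` per entry. So a single null key or value in the state map being merged in `HDFSBackedStateStoreProvider.getStore` would surface as exactly this trace. A minimal Python analogue of that contract (the `NoNullMap` class is purely illustrative, not Spark code):

```python
# ConcurrentHashMap forbids null keys and values; putVal throws
# NullPointerException and putAll calls putVal per entry. Emulate that
# contract with a dict subclass to show how one bad entry in a loaded
# state map would abort the whole merge.

class NoNullMap(dict):
    """Dict that, like ConcurrentHashMap, rejects None keys and values."""

    def __setitem__(self, key, value):
        if key is None or value is None:
            # Analogue of the NPE raised by ConcurrentHashMap.putVal
            raise ValueError("null keys/values are not permitted")
        super().__setitem__(key, value)

    def put_all(self, other):
        # Analogue of ConcurrentHashMap.putAll: one putVal per entry
        for k, v in other.items():
            self[k] = v

store = NoNullMap()
loaded_state = {"k1": b"row1", "k2": None}  # one corrupt/missing value
try:
    store.put_all(loaded_state)
except ValueError as e:
    print("rejected:", e)
```

Under this reading, the fix is to ensure the state files loaded from HDFS never yield null entries (or to filter them) before the map is merged.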
[jira] [Commented] (SPARK-34157) Unify output of SHOW TABLES and pass output attributes properly
[ https://issues.apache.org/jira/browse/SPARK-34157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267949#comment-17267949 ] Apache Spark commented on SPARK-34157: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/31245 > Unify output of SHOW TABLES and pass output attributes properly > --- > > Key: SPARK-34157 > URL: https://issues.apache.org/jira/browse/SPARK-34157 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiaan.geng >Priority: Major >
[jira] [Assigned] (SPARK-34157) Unify output of SHOW TABLES and pass output attributes properly
[ https://issues.apache.org/jira/browse/SPARK-34157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34157: Assignee: Apache Spark > Unify output of SHOW TABLES and pass output attributes properly > --- > > Key: SPARK-34157 > URL: https://issues.apache.org/jira/browse/SPARK-34157 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-34157) Unify output of SHOW TABLES and pass output attributes properly
[ https://issues.apache.org/jira/browse/SPARK-34157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34157: Assignee: (was: Apache Spark) > Unify output of SHOW TABLES and pass output attributes properly > --- > > Key: SPARK-34157 > URL: https://issues.apache.org/jira/browse/SPARK-34157 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiaan.geng >Priority: Major >
[jira] [Updated] (SPARK-34156) Unify the output of DDL and pass output attributes properly
[ https://issues.apache.org/jira/browse/SPARK-34156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-34156: --- Description: The current implementation of some DDL commands does not unify the output and does not pass the output attributes properly to the physical command. (was: The current implement of some DDL not unify the output and not ) > Unify the output of DDL and pass output attributes properly > --- > > Key: SPARK-34156 > URL: https://issues.apache.org/jira/browse/SPARK-34156 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiaan.geng >Priority: Major > > The current implementation of some DDL commands does not unify the output and does not pass the output attributes properly to the physical command.
[jira] [Updated] (SPARK-34156) Unify the output of DDL and pass output attributes properly
[ https://issues.apache.org/jira/browse/SPARK-34156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-34156: --- Description: The current implement of some DDL not unify the output and not (was: The current implement of some DDL ) > Unify the output of DDL and pass output attributes properly > --- > > Key: SPARK-34156 > URL: https://issues.apache.org/jira/browse/SPARK-34156 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiaan.geng >Priority: Major > > The current implement of some DDL not unify the output and not