[jira] [Commented] (SPARK-34115) Long runtime on many environment variables

2021-01-19 Thread Norbert Schultz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268420#comment-17268420
 ] 

Norbert Schultz commented on SPARK-34115:
-

Thank you!

 

> Long runtime on many environment variables
> --
>
> Key: SPARK-34115
> URL: https://issues.apache.org/jira/browse/SPARK-34115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 2.4.7, 3.0.1
> Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
>Reporter: Norbert Schultz
>Assignee: Norbert Schultz
>Priority: Major
> Fix For: 3.0.2, 3.1.2
>
> Attachments: spark-bug-34115.tar.gz
>
>
> I am not sure if this is a bug report or a feature request. The code is 
> the same in current versions of Spark, and maybe this ticket saves someone 
> some debugging time.
> We migrated some older code to Spark 2.4.0, and suddenly the integration 
> tests on our build machine were much slower than expected.
> On local machines they ran perfectly.
> In the end it turned out that Spark was wasting CPU cycles during DataFrame 
> analysis in the following functions:
>  * AnalysisHelper.assertNotAnalysisRule, which calls
>  * Utils.isTesting
> Utils.isTesting traverses all environment variables.
> The offending build machine was a Kubernetes Pod which automatically exposed 
> all services as environment variables, so it had more than 3000 environment 
> variables.
> Utils.isTesting is called very often through 
> AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
> transformUp), so this overhead adds up quickly.
>  
> Of course we will restrict the number of environment variables; on the other 
> hand, Utils.isTesting could also cache
>  
> {code:java}
> sys.env.contains("SPARK_TESTING") {code}
>  
> in a lazy val so the check is not that expensive.
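> A minimal sketch of that suggestion (illustrative only, with a simplified 
> Utils object; the actual method and the eventual fix may differ):
> {code:scala}
> object Utils {
>   // Evaluate sys.env only once: constructing it on every call traverses all
>   // environment variables, which is what became expensive with 3000+ of them.
>   private lazy val sparkTestingSet: Boolean = sys.env.contains("SPARK_TESTING")
>
>   def isTesting: Boolean = sparkTestingSet
> }
> {code}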



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34176) Java UT failed when testing sql/hive module independently in Scala 2.13

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34176:


Assignee: Apache Spark

> Java UT failed when testing sql/hive module independently in Scala 2.13
> ---
>
> Key: SPARK-34176
> URL: https://issues.apache.org/jira/browse/SPARK-34176
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> When executing mvn test on the sql/hive module independently with Scala 2.13, 
> a Java UT fails as follows: 
>  # 
> {code:java}
> mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl  sql/hive -am -DskipTests{code}
>  # 
> {code:java}
> mvn test -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl 
> -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl 
>  sql/hive{code}
>  # 
> {code:java}
> [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 20.353 s <<< FAILURE! - in org.apache.spark.sql.hive.JavaDataFrameSuite
> [ERROR] org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF  Time elapsed: 
> 18.548 s  <<< ERROR!
> java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
>   at 
> org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
>   at 
> org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
> Caused by: java.lang.ClassNotFoundException: 
> scala.collection.parallel.TaskSupport
>   at 
> org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
>   at 
> org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34176) Java UT failed when testing sql/hive module independently in Scala 2.13

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268418#comment-17268418
 ] 

Apache Spark commented on SPARK-34176:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31259

> Java UT failed when testing sql/hive module independently in Scala 2.13
> ---
>
> Key: SPARK-34176
> URL: https://issues.apache.org/jira/browse/SPARK-34176
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> When executing mvn test on the sql/hive module independently with Scala 2.13, 
> a Java UT fails as follows: 
>  # 
> {code:java}
> mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl  sql/hive -am -DskipTests{code}
>  # 
> {code:java}
> mvn test -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl 
> -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl 
>  sql/hive{code}
>  # 
> {code:java}
> [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 20.353 s <<< FAILURE! - in org.apache.spark.sql.hive.JavaDataFrameSuite
> [ERROR] org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF  Time elapsed: 
> 18.548 s  <<< ERROR!
> java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
>   at 
> org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
>   at 
> org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
> Caused by: java.lang.ClassNotFoundException: 
> scala.collection.parallel.TaskSupport
>   at 
> org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
>   at 
> org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34176) Java UT failed when testing sql/hive module independently in Scala 2.13

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34176:


Assignee: (was: Apache Spark)

> Java UT failed when testing sql/hive module independently in Scala 2.13
> ---
>
> Key: SPARK-34176
> URL: https://issues.apache.org/jira/browse/SPARK-34176
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> When executing mvn test on the sql/hive module independently with Scala 2.13, 
> a Java UT fails as follows: 
>  # 
> {code:java}
> mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl  sql/hive -am -DskipTests{code}
>  # 
> {code:java}
> mvn test -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl 
> -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl 
>  sql/hive{code}
>  # 
> {code:java}
> [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 20.353 s <<< FAILURE! - in org.apache.spark.sql.hive.JavaDataFrameSuite
> [ERROR] org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF  Time elapsed: 
> 18.548 s  <<< ERROR!
> java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
>   at 
> org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
>   at 
> org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
> Caused by: java.lang.ClassNotFoundException: 
> scala.collection.parallel.TaskSupport
>   at 
> org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
>   at 
> org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34176) Java UT failed when testing sql/hive module independently in Scala 2.13

2021-01-19 Thread Yang Jie (Jira)
Yang Jie created SPARK-34176:


 Summary: Java UT failed when testing sql/hive module independently 
in Scala 2.13
 Key: SPARK-34176
 URL: https://issues.apache.org/jira/browse/SPARK-34176
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yang Jie


When executing mvn test on the sql/hive module independently with Scala 2.13, a 
Java UT fails as follows: 
 # 
{code:java}
mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
-Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
-Pscala-2.13 -pl  sql/hive -am -DskipTests{code}

 # 
{code:java}
mvn test -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl 
-Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl  
sql/hive{code}

 # 
{code:java}
[ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 20.353 
s <<< FAILURE! - in org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF  Time elapsed: 
18.548 s  <<< ERROR!
java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
at 
org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
at 
org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
Caused by: java.lang.ClassNotFoundException: 
scala.collection.parallel.TaskSupport
at 
org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
at 
org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
{code}
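For context: on Scala 2.13, scala.collection.parallel (including TaskSupport) 
lives in the separate org.scala-lang.modules:scala-parallel-collections module 
rather than in the standard library as on 2.12, so it has to be present on the 
test classpath. A small snippet that exercises the class, assuming that module 
is available (illustration only, not the failing test itself):
{code:scala}
import scala.collection.parallel.CollectionConverters._
import scala.collection.parallel.ForkJoinTaskSupport
import java.util.concurrent.ForkJoinPool

// .par and ForkJoinTaskSupport resolve only when scala-parallel-collections
// is on the classpath for Scala 2.13 builds.
val data = (1 to 100).par
data.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(2))
println(data.map(_ * 2).sum)
{code}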

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34174) Add SHOW PARTITIONS as table valued function

2021-01-19 Thread angerszhu (Jira)
angerszhu created SPARK-34174:
-

 Summary: Add SHOW PARTITIONS as table valued function
 Key: SPARK-34174
 URL: https://issues.apache.org/jira/browse/SPARK-34174
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34175) Add SHOW VIEWS as table valued function

2021-01-19 Thread angerszhu (Jira)
angerszhu created SPARK-34175:
-

 Summary: Add SHOW VIEWS as table valued function
 Key: SPARK-34175
 URL: https://issues.apache.org/jira/browse/SPARK-34175
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34173) Add SHOW FUNCTIONS as table valued function

2021-01-19 Thread angerszhu (Jira)
angerszhu created SPARK-34173:
-

 Summary: Add SHOW FUNCTIONS as table valued function
 Key: SPARK-34173
 URL: https://issues.apache.org/jira/browse/SPARK-34173
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34172) Add SHOW DATABASES as table valued function

2021-01-19 Thread angerszhu (Jira)
angerszhu created SPARK-34172:
-

 Summary: Add SHOW DATABASES as table valued function
 Key: SPARK-34172
 URL: https://issues.apache.org/jira/browse/SPARK-34172
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34171) Add SHOW COLUMNS as table valued function

2021-01-19 Thread angerszhu (Jira)
angerszhu created SPARK-34171:
-

 Summary: Add SHOW COLUMNS as table valued function
 Key: SPARK-34171
 URL: https://issues.apache.org/jira/browse/SPARK-34171
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34170) Add SHOW TABLES EXTENDED as table valued function

2021-01-19 Thread angerszhu (Jira)
angerszhu created SPARK-34170:
-

 Summary: Add SHOW TABLES EXTENDED as table valued function
 Key: SPARK-34170
 URL: https://issues.apache.org/jira/browse/SPARK-34170
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33630) Support save SHOW command result to table

2021-01-19 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-33630:
--
Parent: SPARK-34169
Issue Type: Sub-task  (was: Improvement)

> Support save SHOW command result to table
> -
>
> Key: SPARK-33630
> URL: https://issues.apache.org/jira/browse/SPARK-33630
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> The Scala interface supports saving a SHOW command result to a table:
> {code:scala}
> spark.sql("SHOW TABLES IN default").write.saveAsTable("t")
> {code}
> But the SQL interface does not support it:
> {code:sql}
> spark-sql> create table t using parquet as SHOW TABLES IN default;
> Error in query:
> mismatched input 'SHOW' expecting {'(', 'FROM', 'MAP', 'REDUCE', 'SELECT', 
> 'TABLE', 'VALUES', 'WITH'}(line 1, pos 32)
> == SQL ==
> create table t using parquet as SHOW TABLES IN default
> ^^^
> {code}
> It would be great if the SQL interface could support it as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34169) Add some useful auxiliary SQL as table valued function

2021-01-19 Thread angerszhu (Jira)
angerszhu created SPARK-34169:
-

 Summary: Add some useful auxiliary SQL as table valued function 
 Key: SPARK-34169
 URL: https://issues.apache.org/jira/browse/SPARK-34169
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu


Support commands such as {{SHOW TABLES}} as table valued functions, 
so that users can easily reach metadata system information with SQL and analyze 
it directly.
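As a rough illustration of the idea (the function name, argument, and output 
column below are hypothetical; the concrete design is left to the sub-tasks of 
this umbrella ticket):
{code:scala}
// Hypothetical: once SHOW TABLES is exposed as a table valued function, its
// output is an ordinary relation that can be filtered, joined, or saved.
spark.sql("SELECT * FROM show_tables('default') WHERE tableName LIKE 't%'")
  .write.saveAsTable("table_inventory")
{code}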



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34168:


Assignee: (was: Apache Spark)

> Support DPP in AQE When the join is Broadcast hash join before applying the 
> AQE rules
> -
>
> Key: SPARK-34168
> URL: https://issues.apache.org/jira/browse/SPARK-34168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Ke Jia
>Priority: Major
>
> Currently AQE and DPP cannot be applied at the same time. This PR will enable 
> AQE and DPP when the join is a broadcast hash join at the beginning, i.e. 
> before the AQE rules are applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268408#comment-17268408
 ] 

Apache Spark commented on SPARK-34168:
--

User 'JkSelf' has created a pull request for this issue:
https://github.com/apache/spark/pull/31258

> Support DPP in AQE When the join is Broadcast hash join before applying the 
> AQE rules
> -
>
> Key: SPARK-34168
> URL: https://issues.apache.org/jira/browse/SPARK-34168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Ke Jia
>Priority: Major
>
> Currently AQE and DPP cannot be applied at the same time. This PR will enable 
> AQE and DPP when the join is a broadcast hash join at the beginning, i.e. 
> before the AQE rules are applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34168:


Assignee: Apache Spark

> Support DPP in AQE When the join is Broadcast hash join before applying the 
> AQE rules
> -
>
> Key: SPARK-34168
> URL: https://issues.apache.org/jira/browse/SPARK-34168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Ke Jia
>Assignee: Apache Spark
>Priority: Major
>
> Currently AQE and DPP cannot be applied at the same time. This PR will enable 
> AQE and DPP when the join is a broadcast hash join at the beginning, i.e. 
> before the AQE rules are applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33630) Support save SHOW command result to table

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33630:


Assignee: Apache Spark

> Support save SHOW command result to table
> -
>
> Key: SPARK-33630
> URL: https://issues.apache.org/jira/browse/SPARK-33630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> The Scala interface supports saving a SHOW command result to a table:
> {code:scala}
> spark.sql("SHOW TABLES IN default").write.saveAsTable("t")
> {code}
> But the SQL interface does not support it:
> {code:sql}
> spark-sql> create table t using parquet as SHOW TABLES IN default;
> Error in query:
> mismatched input 'SHOW' expecting {'(', 'FROM', 'MAP', 'REDUCE', 'SELECT', 
> 'TABLE', 'VALUES', 'WITH'}(line 1, pos 32)
> == SQL ==
> create table t using parquet as SHOW TABLES IN default
> ^^^
> {code}
> It would be great if the SQL interface could support it as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33630) Support save SHOW command result to table

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33630:


Assignee: (was: Apache Spark)

> Support save SHOW command result to table
> -
>
> Key: SPARK-33630
> URL: https://issues.apache.org/jira/browse/SPARK-33630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> The Scala interface supports saving a SHOW command result to a table:
> {code:scala}
> spark.sql("SHOW TABLES IN default").write.saveAsTable("t")
> {code}
> But the SQL interface does not support it:
> {code:sql}
> spark-sql> create table t using parquet as SHOW TABLES IN default;
> Error in query:
> mismatched input 'SHOW' expecting {'(', 'FROM', 'MAP', 'REDUCE', 'SELECT', 
> 'TABLE', 'VALUES', 'WITH'}(line 1, pos 32)
> == SQL ==
> create table t using parquet as SHOW TABLES IN default
> ^^^
> {code}
> It would be great if the SQL interface could support it as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33630) Support save SHOW command result to table

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268407#comment-17268407
 ] 

Apache Spark commented on SPARK-33630:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/31257

> Support save SHOW command result to table
> -
>
> Key: SPARK-33630
> URL: https://issues.apache.org/jira/browse/SPARK-33630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> The Scala interface supports saving a SHOW command result to a table:
> {code:scala}
> spark.sql("SHOW TABLES IN default").write.saveAsTable("t")
> {code}
> But the SQL interface does not support it:
> {code:sql}
> spark-sql> create table t using parquet as SHOW TABLES IN default;
> Error in query:
> mismatched input 'SHOW' expecting {'(', 'FROM', 'MAP', 'REDUCE', 'SELECT', 
> 'TABLE', 'VALUES', 'WITH'}(line 1, pos 32)
> == SQL ==
> create table t using parquet as SHOW TABLES IN default
> ^^^
> {code}
> It would be great if the SQL interface could support it as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules

2021-01-19 Thread Ke Jia (Jira)
Ke Jia created SPARK-34168:
--

 Summary: Support DPP in AQE When the join is Broadcast hash join 
before applying the AQE rules
 Key: SPARK-34168
 URL: https://issues.apache.org/jira/browse/SPARK-34168
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.1, 3.0.0
Reporter: Ke Jia


Currently AQE and DPP cannot be applied at the same time. This PR will enable 
AQE and DPP when the join is a broadcast hash join at the beginning, i.e. 
before the AQE rules are applied.
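For reference, the two features are controlled by existing SQL confs; as the 
description says, they currently cannot be applied to the same query, and this 
change targets the case where the join is already a broadcast hash join before 
the AQE rules run (snippet is illustrative only):
{code:scala}
// Existing switches for the two features involved; enabling both together is
// exactly the combination this ticket wants to make effective for broadcast
// hash joins.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
{code}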



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34027) ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache

2021-01-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34027:
--
Fix Version/s: (was: 3.1.1)
   3.1.2

> ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
> ---
>
> Key: SPARK-34027
> URL: https://issues.apache.org/jira/browse/SPARK-34027
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.2, 3.2.0, 3.1.2
>
>
> Here is the example to reproduce the issue:
> {code:sql}
> spark-sql> create table tbl (col int, part int) using parquet partitioned by 
> (part);
> spark-sql> insert into tbl partition (part=0) select 0;
> spark-sql> cache table tbl;
> spark-sql> select * from tbl;
> 0 0
> spark-sql> show table extended like 'tbl' partition(part=0);
> default   tbl false   Partition Values: [part=0]
> Location: 
> file:/Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0
> ...
> {code}
> Add new partition by copying the existing one:
> {code}
> cp -r 
> /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0
>  
> /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=1
> {code}
>  Recover and select the table:
> {code}
> spark-sql> alter table tbl recover partitions;
> spark-sql> select * from tbl;
> 0 0
> {code}
> We see only old data.
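> A manual check that is consistent with the diagnosis above (not part of this 
> ticket) is to refresh the cached relation explicitly after recovering the 
> partitions and read the table again:
> {code:scala}
> // Illustration only: REFRESH TABLE invalidates and reloads the cached data,
> // which is what ALTER TABLE ... RECOVER PARTITIONS is expected to do here.
> spark.sql("ALTER TABLE tbl RECOVER PARTITIONS")
> spark.sql("REFRESH TABLE tbl")
> spark.sql("SELECT * FROM tbl").show()
> {code}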



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34162) Add PyArrow compatibility note for Python 3.9

2021-01-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34162:
--
Fix Version/s: (was: 3.1.1)
   3.1.2

> Add PyArrow compatibility note for Python 3.9
> -
>
> Key: SPARK-34162
> URL: https://issues.apache.org/jira/browse/SPARK-34162
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite

2021-01-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34166:
--
Fix Version/s: (was: 3.1.0)
   3.1.2

> Fix flaky test in DecommissionWorkerSuite
> -
>
> Key: SPARK-34166
> URL: https://issues.apache.org/jira/browse/SPARK-34166
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.1.2
>
>
> The test `decommission workers ensure that shuffle output is regenerated even 
> with shuffle service` assumes it has two executors and that both tasks can 
> execute concurrently.
> The two tasks will execute serially if there is only one executor, and then 
> the test fails unexpectedly, e.g.: 
> ```
> [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 
> 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) 
> (DecommissionWorkerSuite.scala:190)
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-19 Thread Raza Jafri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raza Jafri updated SPARK-34167:
---
Description: 
When reading a parquet file written with Decimals with precision < 10 as a 
64-bit representation, Spark tries to read it as an INT and fails

 

Steps to reproduce:

Read the attached file that has a single Decimal(8,2) column with 10 values
{code:java}
scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show

...
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
  at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:127)
  at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
...

{code}
 

 

Here are my findings. The *{{VectorizedParquetRecordReader}}* reads in the 
parquet file correctly because it bases the read on the 
[requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150], 
which is a MessageType and has the underlying data stored correctly as 
{{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized based on the 
[batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151], 
which comes from {{org.apache.spark.sql.parquet.row.requested_schema}}, set 
by the reader, which is a {{StructType}} and only has {{Decimal(_,_)}} in it.

[https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L224]

 

Attached are two files, one with Decimal(8,2) and the other with Decimal(1,1), 
both written as Decimal backed by INT64. Decimal(1,1) results in a different 
exception, but for the same reason.
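Since the stack trace above goes through the vectorized Parquet reader 
(OnHeapColumnVector.putLong), one way to narrow the problem down further (not 
verified in this ticket) is to compare against the non-vectorized parquet-mr 
reader:
{code:scala}
// Illustration only: disables the vectorized reader so the file is decoded by
// a different code path than the one throwing the NPE above.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show()
{code}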

 

 

  was:
When reading a parquet file written with Decimals with precision < 10 as a 
64-bit representation, Spark tries to read it as an INT and fails

 

Steps to reproduce:

Read the attached file that has a single Decimal(8,2) column with 10 values
{code:java}
scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show

...

[jira] [Updated] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-19 Thread Raza Jafri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raza Jafri updated SPARK-34167:
---
Attachment: 
part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet

> Reading parquet with Decimal(8,2) written as a Decimal64 blows up
> -
>
> Key: SPARK-34167
> URL: https://issues.apache.org/jira/browse/SPARK-34167
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Raza Jafri
>Priority: Major
> Attachments: 
> part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, 
> part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet
>
>
> When reading a parquet file written with Decimals with precision < 10 as a 
> 64-bit representation, Spark tries to read it as an INT and fails
>  
> Steps to reproduce:
> Read the attached file that has a single Decimal(8,2) column with 10 values
> {code:java}
> scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show
> ...
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> ...
> {code}
>  
>  
> Here are my findings. The *{{VectorizedParquetRecordReader}}* reads in the 
> parquet file correctly because it bases the read on the 
> [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150], 
> which is a MessageType and has the underlying data stored correctly as 
> {{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized based on the 
> [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151], 
> which comes from {{org.apache.spark.sql.parquet.row.requested_schema}}, 
> set by the reader 

[jira] [Updated] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-19 Thread Raza Jafri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raza Jafri updated SPARK-34167:
---
Description: 
When reading a parquet file written with Decimals with precision < 10 as a 
64-bit representation, Spark tries to read it as an INT and fails

 

Steps to reproduce:

Read the attached file that has a single Decimal(8,2) column with 10 values
{code:java}
scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show

...
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
  at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:127)
  at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
...

{code}
 

 

Here are my findings. The *{{VectorizedParquetRecordReader}}* reads in the 
parquet file correctly because it bases the read on the 
[requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150], 
which is a MessageType and has the underlying data stored correctly as 
{{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized based on the 
[batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151], 
which comes from {{org.apache.spark.sql.parquet.row.requested_schema}}, set 
by the reader, which is a {{StructType}} and only has {{Decimal(_,_)}} in it.

[https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L224]

 

Attached are two files, one with Decimal(8,2) and the other with Decimal(1,1), 
both written as Decimal backed by INT64

 

 

  was:
When reading a parquet file written with Decimals with precision < 10 as a 
64-bit representation, Spark tries to read it as an INT and fails

 

Steps to reproduce:

Read the attached file that has a single Decimal(8,2) column with 10 values
{code:java}
scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show

...
Caused by: java.lang.NullPointerException
  at 

[jira] [Updated] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-19 Thread Raza Jafri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raza Jafri updated SPARK-34167:
---
Description: 
When reading a parquet file written with Decimals with precision < 10 as a 
64-bit representation, Spark tries to read it as an INT and fails

 

Steps to reproduce:

Read the attached file that has a single Decimal(8,2) column with 10 values
{code:java}
scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show

...
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
  at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:127)
  at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
...

{code}
 

 

Here are my findings. The *{{VectorizedParquetRecordReader}}* reads in the 
parquet file correctly because it bases the read on the 
[requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150], 
which is a MessageType and has the underlying data stored correctly as 
{{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized based on the 
[batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151], 
which comes from {{org.apache.spark.sql.parquet.row.requested_schema}}, set 
by the reader, which is a {{StructType}} and only has {{Decimal(_,_)}} in it.

[https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L224]

 

Attached are two files, one with Decimal(8,2) and the other with Decimal(1,2), 
both written as Decimal backed by INT64

 

 

  was:
When reading a parquet file written with Decimals with precision < 10 as a 
64-bit representation, Spark tries to read it as an INT and fails

 

Steps to reproduce:

Read the attached file that has a single Decimal(8,2) column with 10 values
{code:java}
scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show

...
Caused by: java.lang.NullPointerException
  at 

[jira] [Updated] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-19 Thread Raza Jafri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raza Jafri updated SPARK-34167:
---
Description: 
When reading a parquet file written with Decimals with precision < 10 as a 
64-bit representation, Spark tries to read it as an INT and fails

 

Steps to reproduce:

Read the attached file that has a single Decimal(8,2) column with 10 values
{code:java}
scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show

...
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
  at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:127)
  at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
...

{code}
 

 

Here are my findings. The *{{VectorizedParquetRecordReader}}* reads in the 
parquet file correctly because it bases the read on the 
[requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150], 
which is a MessageType and has the underlying data stored correctly as 
{{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized based on the 
[batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151], 
which comes from {{org.apache.spark.sql.parquet.row.requested_schema}}, set 
by the reader, which is a {{StructType}} and only has {{Decimal(_,_)}} in it.

[https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L224]

 

 

 

 

  was:
When reading a parquet file written with Decimals with precision < 10 as a 
64-bit representation, Spark tries to read it as an INT and fails

 

Steps to reproduce:


Read the attached file that has a single Decimal(8,2) column with 10 values 
{code:java}
scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show

...
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
  at 

[jira] [Updated] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-19 Thread Raza Jafri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raza Jafri updated SPARK-34167:
---
Attachment: 
part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet

> Reading parquet with Decimal(8,2) written as a Decimal64 blows up
> -
>
> Key: SPARK-34167
> URL: https://issues.apache.org/jira/browse/SPARK-34167
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Raza Jafri
>Priority: Major
> Attachments: 
> part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet
>
>
> When reading a parquet file written with Decimals with precision < 10 as a 
> 64-bit representation, Spark tries to read it as an INT and fails
>  
> Steps to reproduce:
> Read the attached file that has a single Decimal(8,2) column with 10 values
> {code:java}
> scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show
> ...
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> ...
> {code}
>  
>  
> Here are my findings. The *{{VectorizedParquetRecordReader}}* reads the 
> parquet file correctly because it bases the read on the 
> [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150]
>  which is a MessageType and has the underlying data correctly stored as 
> {{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized based on the 
> [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151]
>  which comes from {{org.apache.spark.sql.parquet.row.requested_schema}}, set by 
> the reader; that schema is a {{StructType}} and only has 
> {{Decimal(_,_)}} in it.
> 

[jira] [Created] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-19 Thread Raza Jafri (Jira)
Raza Jafri created SPARK-34167:
--

 Summary: Reading parquet with Decimal(8,2) written as a Decimal64 
blows up
 Key: SPARK-34167
 URL: https://issues.apache.org/jira/browse/SPARK-34167
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 3.0.1
Reporter: Raza Jafri
 Attachments: 
part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet

When reading a parquet file in which Decimals with precision < 10 were written 
in a 64-bit representation, Spark tries to read the column as an INT and fails.

 

Steps to reproduce:


Read the attached file that has a single Decimal(8,2) column with 10 values 
{code:java}
scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show

...
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
  at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:127)
  at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
...

{code}
 

 

Here are my findings. The *{{VectorizedParquetRecordReader}}* reads the 
parquet file correctly because it bases the read on the 
[requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150]
 which is a MessageType and has the underlying data correctly stored as 
{{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized based on the 
[batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151]
 which comes from {{org.apache.spark.sql.parquet.row.requested_schema}}, set by 
the reader; that schema is a {{StructType}} and only has {{Decimal(_,_)}} in it.

[https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L224]
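A possible mitigation while the vectorized path is investigated (untested against the 
attached file, so treat it as an assumption rather than a confirmed workaround): 
disabling the vectorized Parquet reader falls back to the row-based reader, which 
converts INT64-backed decimals through a different code path.

{code:scala}
// Sketch only: force the non-vectorized Parquet reader for this read.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show()
{code}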


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite

2021-01-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34166.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 31255
[https://github.com/apache/spark/pull/31255]

> Fix flaky test in DecommissionWorkerSuite
> -
>
> Key: SPARK-34166
> URL: https://issues.apache.org/jira/browse/SPARK-34166
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.1.0
>
>
> The test `decommission workers ensure that shuffle output is regenerated even 
> with shuffle service` assumes it has two executors and that both tasks can 
> execute concurrently.
> The two tasks execute serially if there is only one executor, so the test 
> result is unexpected. E.g. 
> ```
> [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 
> 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) 
> (DecommissionWorkerSuite.scala:190)
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite

2021-01-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34166:
-

Assignee: ulysses you

> Fix flaky test in DecommissionWorkerSuite
> -
>
> Key: SPARK-34166
> URL: https://issues.apache.org/jira/browse/SPARK-34166
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
>
> The test `decommission workers ensure that shuffle output is regenerated even 
> with shuffle service` assumes it has two executors and that both tasks can 
> execute concurrently.
> The two tasks execute serially if there is only one executor, so the test 
> result is unexpected. E.g. 
> ```
> [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 
> 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) 
> (DecommissionWorkerSuite.scala:190)
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite

2021-01-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34166:
--
Issue Type: Test  (was: Improvement)

> Fix flaky test in DecommissionWorkerSuite
> -
>
> Key: SPARK-34166
> URL: https://issues.apache.org/jira/browse/SPARK-34166
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: ulysses you
>Priority: Minor
>
> The test `decommission workers ensure that shuffle output is regenerated even 
> with shuffle service` assumes it has two executors and that both tasks can 
> execute concurrently.
> The two tasks execute serially if there is only one executor, so the test 
> result is unexpected. E.g. 
> ```
> [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 
> 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) 
> (DecommissionWorkerSuite.scala:190)
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite

2021-01-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34166:
--
Affects Version/s: 3.1.1
   3.1.0

> Fix flaky test in DecommissionWorkerSuite
> -
>
> Key: SPARK-34166
> URL: https://issues.apache.org/jira/browse/SPARK-34166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: ulysses you
>Priority: Minor
>
> The test `decommission workers ensure that shuffle output is regenerated even 
> with shuffle service` assumes it has two executors and that both tasks can 
> execute concurrently.
> The two tasks execute serially if there is only one executor, so the test 
> result is unexpected. E.g. 
> ```
> [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 
> 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) 
> (DecommissionWorkerSuite.scala:190)
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite

2021-01-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34166:
--
Component/s: Spark Core

> Fix flaky test in DecommissionWorkerSuite
> -
>
> Key: SPARK-34166
> URL: https://issues.apache.org/jira/browse/SPARK-34166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> The test `decommission workers ensure that shuffle output is regenerated even 
> with shuffle service` assumes it has two executors and that both tasks can 
> execute concurrently.
> The two tasks execute serially if there is only one executor, so the test 
> result is unexpected. E.g. 
> ```
> [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 
> 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) 
> (DecommissionWorkerSuite.scala:190)
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34166:


Assignee: (was: Apache Spark)

> Fix flaky test in DecommissionWorkerSuite
> -
>
> Key: SPARK-34166
> URL: https://issues.apache.org/jira/browse/SPARK-34166
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> The test `decommission workers ensure that shuffle output is regenerated even 
> with shuffle service` assumes it has two executors and that both tasks can 
> execute concurrently.
> The two tasks execute serially if there is only one executor, so the test 
> result is unexpected. E.g. 
> ```
> [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 
> 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) 
> (DecommissionWorkerSuite.scala:190)
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34166:


Assignee: Apache Spark

> Fix flaky test in DecommissionWorkerSuite
> -
>
> Key: SPARK-34166
> URL: https://issues.apache.org/jira/browse/SPARK-34166
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>
> The test `decommission workers ensure that shuffle output is regenerated even 
> with shuffle service` assumes it has two executors and that both tasks can 
> execute concurrently.
> The two tasks execute serially if there is only one executor, so the test 
> result is unexpected. E.g. 
> ```
> [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 
> 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) 
> (DecommissionWorkerSuite.scala:190)
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268362#comment-17268362
 ] 

Apache Spark commented on SPARK-34166:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/31255

> Fix flaky test in DecommissionWorkerSuite
> -
>
> Key: SPARK-34166
> URL: https://issues.apache.org/jira/browse/SPARK-34166
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> The test `decommission workers ensure that shuffle output is regenerated even 
> with shuffle service` assumes it has two executors and that both tasks can 
> execute concurrently.
> The two tasks execute serially if there is only one executor, so the test 
> result is unexpected. E.g. 
> ```
> [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 
> 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) 
> (DecommissionWorkerSuite.scala:190)
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite

2021-01-19 Thread ulysses you (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ulysses you updated SPARK-34166:

Description: 
The test `decommission workers ensure that shuffle output is regenerated even 
with shuffle service` assumes it has two executors and that both tasks can 
execute concurrently.

The two tasks execute serially if there is only one executor, so the test 
result is unexpected. E.g. 

```

[info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 
0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) 
(DecommissionWorkerSuite.scala:190)

```

  was:
The test `decommission workers ensure that shuffle output is regenerated even 
with shuffle service` assumes it has two executor and both of two tasks can 
execute concurrently.

The two tasks will execute serially if there only have one executor. The result 
is test is unexpceted. E.g. 

```

[info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 
0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) 
(DecommissionWorkerSuite.scala:190)

```


> Fix flaky test in DecommissionWorkerSuite
> -
>
> Key: SPARK-34166
> URL: https://issues.apache.org/jira/browse/SPARK-34166
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> The test `decommission workers ensure that shuffle output is regenerated even 
> with shuffle service` assumes it has two executors and that both tasks can 
> execute concurrently.
> The two tasks execute serially if there is only one executor, so the test 
> result is unexpected. E.g. 
> ```
> [info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 
> 0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) 
> (DecommissionWorkerSuite.scala:190)
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34166) Fix flaky test in DecommissionWorkerSuite

2021-01-19 Thread ulysses you (Jira)
ulysses you created SPARK-34166:
---

 Summary: Fix flaky test in DecommissionWorkerSuite
 Key: SPARK-34166
 URL: https://issues.apache.org/jira/browse/SPARK-34166
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.2.0
Reporter: ulysses you


The test `decommission workers ensure that shuffle output is regenerated even 
with shuffle service` assumes it has two executors and that both tasks can 
execute concurrently.

The two tasks execute serially if there is only one executor, so the test 
result is unexpected. E.g. 

```

[info] 5 did not equal 4 Expected 4 tasks but got List(0:0:0:0-SUCCESS, 
0:0:1:0-FAILED, 0:0:1:1-SUCCESS, 0:1:0:0-SUCCESS, 1:0:0:0-SUCCESS) 
(DecommissionWorkerSuite.scala:190)

```
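A sketch of the kind of guard that would make the test deterministic (the helper and 
its use of getExecutorMemoryStatus are illustrative, not the merged patch): block 
until both executors have registered before submitting the job, so the two tasks 
really can run concurrently.

{code:scala}
import org.apache.spark.SparkContext

// Illustrative helper: poll until `expected` executors are registered.
// getExecutorMemoryStatus also lists the driver's block manager, hence the "+ 1"
// (an assumption about the cluster setup used by this suite).
def waitForExecutors(sc: SparkContext, expected: Int, timeoutMs: Long = 60000L): Unit = {
  val deadline = System.currentTimeMillis() + timeoutMs
  while (sc.getExecutorMemoryStatus.size < expected + 1) {
    if (System.currentTimeMillis() > deadline) {
      throw new IllegalStateException(
        s"Only ${sc.getExecutorMemoryStatus.size - 1} executor(s) registered after ${timeoutMs}ms")
    }
    Thread.sleep(100)
  }
}

// Usage inside the test, before running the shuffle job:
// waitForExecutors(sc, expected = 2)
{code}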



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34165) Add countDistinct option to Dataset#summary

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268356#comment-17268356
 ] 

Apache Spark commented on SPARK-34165:
--

User 'MrPowers' has created a pull request for this issue:
https://github.com/apache/spark/pull/31254

> Add countDistinct option to Dataset#summary
> ---
>
> Key: SPARK-34165
> URL: https://issues.apache.org/jira/browse/SPARK-34165
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Matthew Powers
>Priority: Minor
>
> The Dataset#summary function supports options like count, mean, min, and max. 
>  It's a great little function for lightweight exploratory data analysis.
> A count distinct of each column is a common exploratory data analysis 
> workflow.  This should be easy to add (piggybacking off the existing 
> countDistinct code), entirely backwards compatible, and will help a lot of 
> users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34165) Add countDistinct option to Dataset#summary

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34165:


Assignee: (was: Apache Spark)

> Add countDistinct option to Dataset#summary
> ---
>
> Key: SPARK-34165
> URL: https://issues.apache.org/jira/browse/SPARK-34165
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Matthew Powers
>Priority: Minor
>
> The Dataset#summary function supports options like count, mean, min, and max. 
>  It's a great little function for lightweight exploratory data analysis.
> A count distinct of each column is a common exploratory data analysis 
> workflow.  This should be easy to add (piggybacking off the existing 
> countDistinct code), entirely backwards compatible, and will help a lot of 
> users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34165) Add countDistinct option to Dataset#summary

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34165:


Assignee: Apache Spark

> Add countDistinct option to Dataset#summary
> ---
>
> Key: SPARK-34165
> URL: https://issues.apache.org/jira/browse/SPARK-34165
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Matthew Powers
>Assignee: Apache Spark
>Priority: Minor
>
> The Dataset#summary function supports options like count, mean, min, and max. 
>  It's a great little function for lightweight exploratory data analysis.
> A count distinct of each column is a common exploratory data analysis 
> workflow.  This should be easy to add (piggybacking off the existing 
> countDistinct code), entirely backwards compatible, and will help a lot of 
> users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34165) Add countDistinct option to Dataset#summary

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268355#comment-17268355
 ] 

Apache Spark commented on SPARK-34165:
--

User 'MrPowers' has created a pull request for this issue:
https://github.com/apache/spark/pull/31254

> Add countDistinct option to Dataset#summary
> ---
>
> Key: SPARK-34165
> URL: https://issues.apache.org/jira/browse/SPARK-34165
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Matthew Powers
>Priority: Minor
>
> The Dataset#summary function supports options like count, mean, min, and max. 
>  It's a great little function for lightweight exploratory data analysis.
> A count distinct of each column is a common exploratory data analysis 
> workflow.  This should be easy to add (piggybacking off the existing 
> countDistinct code), entirely backwards compatible, and will help a lot of 
> users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34115) Long runtime on many environment variables

2021-01-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268344#comment-17268344
 ] 

Dongjoon Hyun commented on SPARK-34115:
---

Thank you!

> Long runtime on many environment variables
> --
>
> Key: SPARK-34115
> URL: https://issues.apache.org/jira/browse/SPARK-34115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 2.4.7, 3.0.1
> Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
>Reporter: Norbert Schultz
>Assignee: Norbert Schultz
>Priority: Major
> Fix For: 3.0.2, 3.1.2
>
> Attachments: spark-bug-34115.tar.gz
>
>
> I am not sure if this is a bug report or a feature request. The code is is 
> the same in current versions of Spark and maybe this ticket saves someone 
> some time for debugging.
> We migrated some older code to Spark 2.4.0, and suddently the integration 
> tests on our build machine were much slower than expected.
> On local machines it was running perfectly.
> At the end it turned out, that Spark was wasting CPU Cycles during DataFrame 
> analyzing in the following functions
>  * AnalysisHelper.assertNotAnalysisRule calling
>  * Utils.isTesting
> Utils.isTesting is traversing all environment variables.
> The offending build machine was a Kubernetes Pod which automatically exposed 
> all services as environment variables, so it had more than 3000 environment 
> variables.
> As Utils.isTesting is called very often throgh 
> AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown, 
> transformUp).
>  
> Of course we will restrict the number of environment variables, on the other 
> side Utils.isTesting could also use a lazy val for
>  
> {code:java}
> sys.env.contains("SPARK_TESTING") {code}
>  
> to not make it that expensive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34162) Add PyArrow compatibility note for Python 3.9

2021-01-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34162.
---
Fix Version/s: 3.1.1
   3.2.0
 Assignee: Dongjoon Hyun  (was: Apache Spark)
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/31251

> Add PyArrow compatibility note for Python 3.9
> -
>
> Key: SPARK-34162
> URL: https://issues.apache.org/jira/browse/SPARK-34162
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34098) HadoopVersionInfoSuite failed when running Maven tests in Scala 2.13

2021-01-19 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268339#comment-17268339
 ] 

Yang Jie commented on SPARK-34098:
--

This looks like a network issue; if SPARK_VERSIONS_SUITE_IVY_PATH is exported, 
the above problems do not seem to appear.

The SparkSubmitUtils.resolveMavenCoordinates method has a certain probability 
of failure.

> HadoopVersionInfoSuite failed when running Maven tests in Scala 2.13
> --
>
> Key: SPARK-34098
> URL: https://issues.apache.org/jira/browse/SPARK-34098
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
>  
> {code:java}
> mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -pl sql/hive -Pscala-2.13  
> -DwildcardSuites=org.apache.spark.sql.hive.client.HadoopVersionInfoSuite 
> -Dtest=none 
> {code}
> Testing HadoopVersionInfoSuite independently, all cases pass, but executing
>  
> {code:java}
> mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -pl sql/hive -Pscala-2.13 
> {code}
>  
> HadoopVersionInfoSuite fails as follows:
> {code:java}
> HadoopVersionInfoSuite: 22:32:30.310 WARN 
> org.apache.spark.sql.hive.client.IsolatedClientLoader: Failed to resolve 
> Hadoop artifacts for the version 2.7.4. We will change the hadoop version 
> from 2.7.4 to 2.7.4 and try again. It is recommended to set jars used by Hive 
> metastore client through spark.sql.hive.metastore.jars in the production 
> environment. - SPARK-32256: Hadoop VersionInfo should be preloaded *** FAILED 
> *** java.lang.RuntimeException: [unresolved dependency: 
> org.apache.hive#hive-metastore;2.0.1: not found, unresolved dependency: 
> org.apache.hive#hive-exec;2.0.1: not found, unresolved dependency: 
> org.apache.hive#hive-common;2.0.1: not found, unresolved dependency: 
> org.apache.hive#hive-serde;2.0.1: not found, unresolved dependency: 
> com.google.guava#guava;14.0.1: not found, unresolved dependency: 
> org.apache.hadoop#hadoop-client;2.7.4: not found] at 
> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1423)
>  at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.$anonfun$downloadVersion$2(IsolatedClientLoader.scala:122)
>  at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:122)
>  at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:75)
>  at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:63)
>  at 
> org.apache.spark.sql.hive.client.HadoopVersionInfoSuite.$anonfun$new$1(HadoopVersionInfoSuite.scala:46)
>  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at 
> org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at 
> org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> {code}
> This needs some investigation.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34098) HadoopVersionInfoSuite failed when running Maven tests in Scala 2.13

2021-01-19 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268339#comment-17268339
 ] 

Yang Jie edited comment on SPARK-34098 at 1/20/21, 3:04 AM:


This looks like a network issue; if SPARK_VERSIONS_SUITE_IVY_PATH is exported, 
the above problems do not seem to appear.

The SparkSubmitUtils.resolveMavenCoordinates method has a certain probability 
of failure if SPARK_VERSIONS_SUITE_IVY_PATH is not exported.


was (Author: luciferyang):
Looks like a network issue,  if SPARK_VERSIONS_SUITE_IVY_PATH exported, the 
above problems seems not appear.

The SparkSubmitUtils.resolveMavenCoordinates method  has a certain failure 
probability.

> HadoopVersionInfoSuite failed when running Maven tests in Scala 2.13
> --
>
> Key: SPARK-34098
> URL: https://issues.apache.org/jira/browse/SPARK-34098
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
>  
> {code:java}
> mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -pl sql/hive -Pscala-2.13  
> -DwildcardSuites=org.apache.spark.sql.hive.client.HadoopVersionInfoSuite 
> -Dtest=none 
> {code}
> Testing HadoopVersionInfoSuite independently, all cases pass, but executing
>  
> {code:java}
> mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -pl sql/hive -Pscala-2.13 
> {code}
>  
> HadoopVersionInfoSuite fails as follows:
> {code:java}
> HadoopVersionInfoSuite: 22:32:30.310 WARN 
> org.apache.spark.sql.hive.client.IsolatedClientLoader: Failed to resolve 
> Hadoop artifacts for the version 2.7.4. We will change the hadoop version 
> from 2.7.4 to 2.7.4 and try again. It is recommended to set jars used by Hive 
> metastore client through spark.sql.hive.metastore.jars in the production 
> environment. - SPARK-32256: Hadoop VersionInfo should be preloaded *** FAILED 
> *** java.lang.RuntimeException: [unresolved dependency: 
> org.apache.hive#hive-metastore;2.0.1: not found, unresolved dependency: 
> org.apache.hive#hive-exec;2.0.1: not found, unresolved dependency: 
> org.apache.hive#hive-common;2.0.1: not found, unresolved dependency: 
> org.apache.hive#hive-serde;2.0.1: not found, unresolved dependency: 
> com.google.guava#guava;14.0.1: not found, unresolved dependency: 
> org.apache.hadoop#hadoop-client;2.7.4: not found] at 
> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1423)
>  at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.$anonfun$downloadVersion$2(IsolatedClientLoader.scala:122)
>  at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:122)
>  at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:75)
>  at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:63)
>  at 
> org.apache.spark.sql.hive.client.HadoopVersionInfoSuite.$anonfun$new$1(HadoopVersionInfoSuite.scala:46)
>  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at 
> org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at 
> org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> {code}
> This needs some investigation.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34165) Add countDistinct option to Dataset#summary

2021-01-19 Thread Matthew Powers (Jira)
Matthew Powers created SPARK-34165:
--

 Summary: Add countDistinct option to Dataset#summary
 Key: SPARK-34165
 URL: https://issues.apache.org/jira/browse/SPARK-34165
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Matthew Powers


The Dataset#summary function supports options like count, mean, min, and max.  
It's a great little function for lightweight exploratory data analysis.

A count distinct of each column is a common exploratory data analysis workflow. 
 This should be easy to add (piggybacking off the existing countDistinct code), 
entirely backwards compatible, and will help a lot of users.
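A quick sketch of what this could look like from the user's side (the statistic name 
"count_distinct" is an assumption; the final API may spell it differently):

{code:scala}
import org.apache.spark.sql.functions.{col, countDistinct}
import spark.implicits._  // assumes a SparkSession named `spark`, e.g. in spark-shell

val df = Seq(("a", 1), ("b", 1), ("a", 2)).toDF("letter", "number")

// Today: a distinct count per column needs an explicit aggregation.
df.select(df.columns.map(c => countDistinct(col(c)).as(c)): _*).show()

// Proposed: expose it through Dataset#summary next to count/mean/min/max.
df.summary("count", "count_distinct").show()
{code}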



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33940) allow configuring the max column name length in csv writer

2021-01-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33940.
--
Fix Version/s: 3.1.2
   Resolution: Fixed

Issue resolved by pull request 31246
[https://github.com/apache/spark/pull/31246]

> allow configuring the max column name length in csv writer
> --
>
> Key: SPARK-33940
> URL: https://issues.apache.org/jira/browse/SPARK-33940
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nan Zhu
>Assignee: Nan Zhu
>Priority: Major
> Fix For: 3.1.2
>
>
> The csv writer actually has an implicit limit on column name length due to 
> univocity-parsers. 
>  
> When we initialize a writer 
> [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211,]
>  it calls toIdentifierGroupArray, which eventually calls valueOf in 
> NormalizedString.java 
> ([https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209)].
>  
> In that stringCache.get there is a maxStringLength cap 
> [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104]
>  which is 1024 by default.
>  
> We do not expose this as a configurable option, leading to an NPE when a 
> column name is longer than 1024 characters: 
>  
> ```
> [info]   Cause: java.lang.NullPointerException:
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
> [info]   at 
> org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CsvOutputWriter.scala:44)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
> [info]   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
> [info]   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
> ```
>  
> It can be reproduced by a simple unit test:
>  
> ```
> val row1 = Row("a")
> val superLongHeader = (0 until 1025).map(_ => "c").mkString("")
> val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader)
> df.repartition(1)
>  .write
>  .option("header", "true")
>  .option("maxColumnNameLength", 1025)
>  .csv(dataPath)
> ```
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33940) allow configuring the max column name length in csv writer

2021-01-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33940:


Assignee: Nan Zhu

> allow configuring the max column name length in csv writer
> --
>
> Key: SPARK-33940
> URL: https://issues.apache.org/jira/browse/SPARK-33940
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nan Zhu
>Assignee: Nan Zhu
>Priority: Major
>
> The csv writer actually has an implicit limit on column name length due to 
> univocity-parsers. 
>  
> When we initialize a writer 
> [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211,]
>  it calls toIdentifierGroupArray, which eventually calls valueOf in 
> NormalizedString.java 
> ([https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209)].
>  
> In that stringCache.get there is a maxStringLength cap 
> [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104]
>  which is 1024 by default.
>  
> We do not expose this as a configurable option, leading to an NPE when a 
> column name is longer than 1024 characters: 
>  
> ```
> [info]   Cause: java.lang.NullPointerException:
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
> [info]   at 
> org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CsvOutputWriter.scala:44)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
> [info]   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
> [info]   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
> ```
>  
> It can be reproduced by a simple unit test:
>  
> ```
> val row1 = Row("a")
> val superLongHeader = (0 until 1025).map(_ => "c").mkString("")
> val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader)
> df.repartition(1)
>  .write
>  .option("header", "true")
>  .option("maxColumnNameLength", 1025)
>  .csv(dataPath)
> ```
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34164) Improve write side varchar check to visit only last few trailing spaces

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34164:


Assignee: (was: Apache Spark)

> Improve write side varchar check to visit only last few trailing spaces
> --
>
> Key: SPARK-34164
> URL: https://issues.apache.org/jira/browse/SPARK-34164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> For varchar(N), we currently trim all trailing spaces first to check whether 
> the remaining length exceeds N. It is not necessary to visit all of them, only 
> those after position N.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34164) Improve write side varchar check to visit only last few trailing spaces

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268326#comment-17268326
 ] 

Apache Spark commented on SPARK-34164:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31253

> Improve write side varchar check to visit only last few trailing spaces
> --
>
> Key: SPARK-34164
> URL: https://issues.apache.org/jira/browse/SPARK-34164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> For varchar(N), we currently trim all trailing spaces first to check whether 
> the remaining length exceeds N. It is not necessary to visit all of them, only 
> those after position N.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34164) Improve write side varchar check to visit only last few trailing spaces

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34164:


Assignee: Apache Spark

> Improve write side varchar check to visit only last few trailing spaces
> --
>
> Key: SPARK-34164
> URL: https://issues.apache.org/jira/browse/SPARK-34164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> For varchar(N), we currently trim all trailing spaces first to check whether 
> the remaining length exceeds N. It is not necessary to visit all of them, only 
> those after position N.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34052) A cached view should become invalid after a table is dropped

2021-01-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34052.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31107
[https://github.com/apache/spark/pull/31107]

> A cached view should become invalid after a table is dropped
> 
>
> Key: SPARK-34052
> URL: https://issues.apache.org/jira/browse/SPARK-34052
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> It seems a view doesn't become invalid after a DSv2 table is dropped or 
> replaced. This is different from V1 and may cause correctness issues.
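A hypothetical repro sketch (catalog, namespace, table and view names are made up; 
`testcat` stands for any configured DSv2 catalog):

{code:scala}
spark.sql("CREATE TABLE testcat.ns.t (id INT)")
spark.sql("CACHE TABLE v AS SELECT id FROM testcat.ns.t")
spark.sql("DROP TABLE testcat.ns.t")

// Expected after the fix: the cached view becomes invalid once the table is gone.
// Observed before the fix: the cached data could still be served.
spark.sql("SELECT * FROM v").show()
{code}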



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34052) A cached view should become invalid after a table is dropped

2021-01-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34052:
---

Assignee: Chao Sun

> A cached view should become invalid after a table is dropped
> 
>
> Key: SPARK-34052
> URL: https://issues.apache.org/jira/browse/SPARK-34052
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> It seems a view doesn't become invalid after a DSv2 table is dropped or 
> replaced. This is different from V1 and may cause correctness issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34164) Improve write side varchar check to visit only last few trailing spaces

2021-01-19 Thread Kent Yao (Jira)
Kent Yao created SPARK-34164:


 Summary: Improve write side varchar check to visit only last few 
trailing spaces
 Key: SPARK-34164
 URL: https://issues.apache.org/jira/browse/SPARK-34164
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0, 3.2.0
Reporter: Kent Yao


For varchar(N), we currently trim all trailing spaces first to check whether 
the remaining length exceeds N. It is not necessary to visit all of them, only 
those after position N.
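A minimal sketch of the idea (the method name and types are illustrative, not Spark's 
actual char/varchar utility code): only the characters at or after position N need to 
be inspected, because any non-space character there already proves that the trimmed 
length exceeds N.

{code:scala}
// Returns true if `s` cannot fit varchar(n) even after trimming trailing spaces.
def violatesVarchar(s: String, n: Int): Boolean = {
  var i = s.length - 1
  // Walk the tail only: stop once we pass position n or hit a non-space character.
  while (i >= n) {
    if (s.charAt(i) != ' ') return true
    i -= 1
  }
  false
}

violatesVarchar("abc  ", 3)   // false: only spaces at positions >= 3
violatesVarchar("abcd ", 3)   // true: 'd' at position 3 survives trimming
{code}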



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34056) Unify v1 and v2 ALTER TABLE .. RECOVER PARTITIONS tests

2021-01-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34056.
-
Resolution: Fixed

Issue resolved by pull request 31105
[https://github.com/apache/spark/pull/31105]

> Unify v1 and v2 ALTER TABLE .. RECOVER PARTITIONS tests
> ---
>
> Key: SPARK-34056
> URL: https://issues.apache.org/jira/browse/SPARK-34056
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Extract the ALTER TABLE .. RECOVER PARTITIONS tests to a common place to run 
> them for V1 and V2 datasources. Some tests can be placed in V1- and 
> V2-specific test suites.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34143) Adding partitions to fully partitioned v2 table

2021-01-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34143:
-
Fix Version/s: 3.1.2

> Adding partitions to fully partitioned v2 table
> ---
>
> Key: SPARK-34143
> URL: https://issues.apache.org/jira/browse/SPARK-34143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> The test below fails:
> {code:scala}
> withNamespaceAndTable("ns", "tbl") { t =>
>   sql(s"CREATE TABLE $t (p0 INT, p1 STRING) $defaultUsing PARTITIONED BY 
> (p0, p1)")
>   sql(s"ALTER TABLE $t ADD PARTITION (p0 = 0, p1 = 'abc')")
>   checkPartitions(t, Map("p0" -> "0", "p1" -> "abc"))
>   checkAnswer(sql(s"SELECT * FROM $t"), Row(0, "abc"))
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34121) Intersect operator missing rowCount when CBO enabled

2021-01-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34121.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31240
[https://github.com/apache/spark/pull/31240]

> Intersect operator missing rowCount when CBO enabled
> 
>
> Key: SPARK-34121
> URL: https://issues.apache.org/jira/browse/SPARK-34121
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala#L60
>  is missing the row count.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34121) Intersect operator missing rowCount when CBO enabled

2021-01-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34121:


Assignee: angerszhu

> Intersect operator missing rowCount when CBO enabled
> 
>
> Key: SPARK-34121
> URL: https://issues.apache.org/jira/browse/SPARK-34121
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: angerszhu
>Priority: Major
>
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala#L60
>  is missing the row count.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34115) Long runtime on many environment variables

2021-01-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34115:


Assignee: Norbert Schultz

> Long runtime on many environment variables
> --
>
> Key: SPARK-34115
> URL: https://issues.apache.org/jira/browse/SPARK-34115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 2.4.7, 3.0.1
> Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
>Reporter: Norbert Schultz
>Assignee: Norbert Schultz
>Priority: Major
> Fix For: 3.0.2, 3.1.2
>
> Attachments: spark-bug-34115.tar.gz
>
>
> I am not sure if this is a bug report or a feature request. The code is is 
> the same in current versions of Spark and maybe this ticket saves someone 
> some time for debugging.
> We migrated some older code to Spark 2.4.0, and suddently the integration 
> tests on our build machine were much slower than expected.
> On local machines it was running perfectly.
> At the end it turned out, that Spark was wasting CPU Cycles during DataFrame 
> analyzing in the following functions
>  * AnalysisHelper.assertNotAnalysisRule calling
>  * Utils.isTesting
> Utils.isTesting is traversing all environment variables.
> The offending build machine was a Kubernetes Pod which automatically exposed 
> all services as environment variables, so it had more than 3000 environment 
> variables.
> As Utils.isTesting is called very often throgh 
> AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown, 
> transformUp).
>  
> Of course we will restrict the number of environment variables, on the other 
> side Utils.isTesting could also use a lazy val for
>  
> {code:java}
> sys.env.contains("SPARK_TESTING") {code}
>  
> to not make it that expensive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34115) Long runtime on many environment variables

2021-01-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34115.
--
Fix Version/s: 3.1.2
   3.0.2
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/31244

> Long runtime on many environment variables
> --
>
> Key: SPARK-34115
> URL: https://issues.apache.org/jira/browse/SPARK-34115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 2.4.7, 3.0.1
> Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
>Reporter: Norbert Schultz
>Priority: Major
> Fix For: 3.0.2, 3.1.2
>
> Attachments: spark-bug-34115.tar.gz
>
>
> I am not sure if this is a bug report or a feature request. The code is 
> the same in current versions of Spark, and maybe this ticket saves someone 
> some debugging time.
> We migrated some older code to Spark 2.4.0, and suddenly the integration 
> tests on our build machine were much slower than expected.
> On local machines everything ran perfectly.
> In the end it turned out that Spark was wasting CPU cycles during DataFrame 
> analysis in the following functions:
>  * AnalysisHelper.assertNotAnalysisRule calling
>  * Utils.isTesting
> Utils.isTesting traverses all environment variables.
> The offending build machine was a Kubernetes Pod which automatically exposed 
> all services as environment variables, so it had more than 3000 environment 
> variables.
> Utils.isTesting is called very often through 
> AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
> transformUp), so this cost adds up quickly.
>  
> Of course we will restrict the number of environment variables; on the other 
> hand, Utils.isTesting could also use a lazy val for
>  
> {code:java}
> sys.env.contains("SPARK_TESTING") {code}
>  
> to avoid making it that expensive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34121) Intersect operator missing rowCount when CBO enabled

2021-01-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34121:

Description: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala#L60
 is missing the row count.

> Intersect operator missing rowCount when CBO enabled
> 
>
> Key: SPARK-34121
> URL: https://issues.apache.org/jira/browse/SPARK-34121
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala#L60
>  is missing the row count.
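
A hedged sketch of one way a row count could be propagated for Intersect under CBO (an illustration of the idea, not the actual patch): the result can never contain more rows than the smaller input, so the minimum of the children's row counts is a safe upper bound.

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.Statistics

// Combine child statistics for an Intersect node, keeping a rowCount estimate
// instead of dropping it (sketch only; follows catalyst's Statistics fields).
def intersectStats(left: Statistics, right: Statistics): Statistics = {
  val rowCount = for {
    l <- left.rowCount
    r <- right.rowCount
  } yield l.min(r)
  Statistics(sizeInBytes = left.sizeInBytes.min(right.sizeInBytes), rowCount = rowCount)
}
{code}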



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33888) JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268291#comment-17268291
 ] 

Apache Spark commented on SPARK-33888:
--

User 'skestle' has created a pull request for this issue:
https://github.com/apache/spark/pull/31252

> JDBC SQL TIME type represents incorrectly as TimestampType, it should be 
> physical Int in millis
> ---
>
> Key: SPARK-33888
> URL: https://issues.apache.org/jira/browse/SPARK-33888
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3, 3.0.0, 3.0.1
>Reporter: Duc Hoa Nguyen
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.2.0
>
>
> Currently, for JDBC, the SQL TIME type is represented incorrectly as Spark 
> TimestampType. It should be represented as a physical int in millis: a time 
> of day, with no reference to a particular calendar, time zone or date, with 
> a precision of one millisecond, stored as the number of milliseconds after 
> midnight, 00:00:00.000.
> We encountered the issue with the Avro logical type `TimeMillis` not being 
> converted correctly to the Spark `Timestamp` struct type using the 
> `SchemaConverters`; it converts to a regular `int` instead. Reproducible by 
> ingesting data from a MySQL table with a column of TIME type: the Spark JDBC 
> dataframe will get the correct type (Timestamp), but enforcing our avro 
> schema (`{"type": "int", "logicalType": "time-millis"}`) externally will 
> fail to apply with the following exception:
> {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type 
> for schema of int}}
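
A hedged sketch of what the requested physical representation means for a JDBC java.sql.Time value (illustrative only; this mirrors the Avro time-millis encoding described above, not code from any patch):

{code:scala}
import java.sql.Time

// Milliseconds after midnight, 00:00:00.000, for a SQL TIME value.
def timeToMillisOfDay(t: Time): Int =
  (t.toLocalTime.toNanoOfDay / 1000000L).toInt

// e.g. timeToMillisOfDay(Time.valueOf("01:02:03")) == 3723000
{code}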



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33888) JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268290#comment-17268290
 ] 

Apache Spark commented on SPARK-33888:
--

User 'skestle' has created a pull request for this issue:
https://github.com/apache/spark/pull/31252

> JDBC SQL TIME type represents incorrectly as TimestampType, it should be 
> physical Int in millis
> ---
>
> Key: SPARK-33888
> URL: https://issues.apache.org/jira/browse/SPARK-33888
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3, 3.0.0, 3.0.1
>Reporter: Duc Hoa Nguyen
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.2.0
>
>
> Currently, for JDBC, the SQL TIME type is represented incorrectly as Spark 
> TimestampType. It should be represented as a physical int in millis: a time 
> of day, with no reference to a particular calendar, time zone or date, with 
> a precision of one millisecond, stored as the number of milliseconds after 
> midnight, 00:00:00.000.
> We encountered the issue with the Avro logical type `TimeMillis` not being 
> converted correctly to the Spark `Timestamp` struct type using the 
> `SchemaConverters`; it converts to a regular `int` instead. Reproducible by 
> ingesting data from a MySQL table with a column of TIME type: the Spark JDBC 
> dataframe will get the correct type (Timestamp), but enforcing our avro 
> schema (`{"type": "int", "logicalType": "time-millis"}`) externally will 
> fail to apply with the following exception:
> {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type 
> for schema of int}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33888) JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis

2021-01-19 Thread Stephen Kestle (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268285#comment-17268285
 ] 

Stephen Kestle commented on SPARK-33888:


I've searched for "scale" and only Oracle seems to use it (although it defaults 
to 0 if not present).

The tests, however, define it for other number types (SMALL_INT etc.), which 
should have a defined scale:
 * 
[https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L901]
 * 
[https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L938]

I think that these probably aren't necessary.

> JDBC SQL TIME type represents incorrectly as TimestampType, it should be 
> physical Int in millis
> ---
>
> Key: SPARK-33888
> URL: https://issues.apache.org/jira/browse/SPARK-33888
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3, 3.0.0, 3.0.1
>Reporter: Duc Hoa Nguyen
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.2.0
>
>
> Currently, for JDBC, the SQL TIME type is represented incorrectly as Spark 
> TimestampType. It should be represented as a physical int in millis: a time 
> of day, with no reference to a particular calendar, time zone or date, with 
> a precision of one millisecond, stored as the number of milliseconds after 
> midnight, 00:00:00.000.
> We encountered the issue with the Avro logical type `TimeMillis` not being 
> converted correctly to the Spark `Timestamp` struct type using the 
> `SchemaConverters`; it converts to a regular `int` instead. Reproducible by 
> ingesting data from a MySQL table with a column of TIME type: the Spark JDBC 
> dataframe will get the correct type (Timestamp), but enforcing our avro 
> schema (`{"type": "int", "logicalType": "time-millis"}`) externally will 
> fail to apply with the following exception:
> {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type 
> for schema of int}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34163) Spark Structured Streaming - Kafka avro transformation on optional field Failed

2021-01-19 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated SPARK-34163:
-
Description: 
Hello All,
I have a Spark Structured Streaming job that ingests data from Kafka, where the 
messages from Kafka are Avro-encoded.
Some of the fields are optional in the data, and I have to perform a 
transformation only if those optional fields are present.
So I tried to check whether a column exists with:


{color:#0747A6}def has_column(dataframe, col):
    """
    This function checks the existence of a given column in the given DataFrame

    :param dataframe: the dataframe
    :type dataframe: DataFrame
    :param col: the column name
    :type col: str
    :return: true if the column exists else false
    :rtype: boolean
    """
    try:
        dataframe[col]
        return True
    except AnalysisException:
        return False{color}


This works on a normal (batch) dataframe: when a column is not present, the 
check returns false, so I can skip the transformation on the missing column.

On a streaming dataframe, however, *has_column* always returns true, so the 
transformation gets executed and causes an exception. What is the right 
approach to check for the existence of a column in a streaming dataframe 
before performing a transformation?

Why do streaming and normal dataframes differ in behavior? How can I skip a 
transformation on a column if it doesn't exist?

  was:
Hello All,
I have a Spark Structured Streaming job that ingests data from Kafka, where the 
messages from Kafka are Avro-encoded.
Some of the fields are optional in the data, and I have to perform a 
transformation only if those optional fields are present.
So I tried to check whether a column exists with:


{color:#0747A6}def has_column(dataframe, col):
    """
    This function checks the existence of a given column in the given DataFrame

    :param dataframe: the dataframe
    :type dataframe: DataFrame
    :param col: the column name
    :type col: str
    :return: true if the column exists else false
    :rtype: boolean
    """
    try:
        dataframe[col]
        return True
    except AnalysisException:
        return False{color}


This works on a normal (batch) dataframe: when a column is not present, the 
check returns false, so I can skip the transformation on the missing column.

On a streaming dataframe, however, *has_column* always returns true, so the 
transformation gets executed and causes an exception. What is the right 
approach to check for the existence of a column in a streaming dataframe 
before performing a transformation?


> Spark Structured Streaming - Kafka avro transformation on optional field 
> Failed
> ---
>
> Key: SPARK-34163
> URL: https://issues.apache.org/jira/browse/SPARK-34163
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.7
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hello All,
> I have a Spark Structured Streaming job that ingests data from Kafka, where 
> the messages from Kafka are Avro-encoded.
> Some of the fields are optional in the data, and I have to perform a 
> transformation only if those optional fields are present.
> So I tried to check whether a column exists with:
> {color:#0747A6}def has_column(dataframe, col):
>     """
>     This function checks the existence of a given column in the given 
> DataFrame
>     :param dataframe: the dataframe
>     :type dataframe: DataFrame
>     :param col: the column name
>     :type col: str
>     :return: true if the column exists else false
>     :rtype: boolean
>     """
>     try:
>         dataframe[col]
>         return True
>     except AnalysisException:
>         return False{color}
> This works on a normal (batch) dataframe: when a column is not present, the 
> check returns false, so I can skip the transformation on the missing column. 
> On a streaming dataframe, however, *has_column* always returns true, so the 
> transformation gets executed and causes an exception. What is the right 
> approach to check for the existence of a column in a streaming dataframe 
> before performing a transformation?
> Why do streaming and normal dataframes differ in behavior? How can I skip a 
> transformation on a column if it doesn't exist?
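
A hedged alternative sketch (shown in Scala for illustration; the ticket's code is PySpark): a DataFrame's schema is known statically even for streaming sources, so top-level column existence can be checked against the schema's field names rather than by attempting a selection.

{code:scala}
import org.apache.spark.sql.DataFrame

// Returns true when the top-level column name appears in the DataFrame's schema.
def hasColumn(df: DataFrame, colName: String): Boolean =
  df.schema.fieldNames.contains(colName)
{code}

In PySpark the analogous check would be against {{dataframe.schema.fieldNames()}}; whether that behaves differently for this streaming source is exactly the question the ticket raises.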



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34163) Spark Structured Streaming - Kafka avro transformation on optional field Failed

2021-01-19 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated SPARK-34163:
-
Description: 
Hello All,
I have a Spark Structured Streaming job that ingests data from Kafka, where the 
messages from Kafka are Avro-encoded.
Some of the fields are optional in the data, and I have to perform a 
transformation only if those optional fields are present.
So I tried to check whether a column exists with:


{color:#0747A6}def has_column(dataframe, col):
    """
    This function checks the existence of a given column in the given DataFrame

    :param dataframe: the dataframe
    :type dataframe: DataFrame
    :param col: the column name
    :type col: str
    :return: true if the column exists else false
    :rtype: boolean
    """
    try:
        dataframe[col]
        return True
    except AnalysisException:
        return False{color}


This works on a normal (batch) dataframe: when a column is not present, the 
check returns false, so I can skip the transformation on the missing column.

On a streaming dataframe, however, *has_column* always returns true, so the 
transformation gets executed and causes an exception. What is the right 
approach to check for the existence of a column in a streaming dataframe 
before performing a transformation?

  was:
Hello All,
I have a Spark Structured Streaming job that ingests data from Kafka, where the 
messages from Kafka are Avro-encoded.
Some of the fields are optional in the data, and I have to perform a 
transformation only if those optional fields are present.
So I tried to check whether a column exists with:


*def has_column(dataframe, col):
    """
    This function checks the existence of a given column in the given DataFrame

    :param dataframe: the dataframe
    :type dataframe: DataFrame
    :param col: the column name
    :type col: str
    :return: true if the column exists else false
    :rtype: boolean
    """
    try:
        dataframe[col]
        return True
    except AnalysisException:
        return False*


This works on a normal (batch) dataframe: when a column is not present, the 
check returns false, so I can skip the transformation on the missing column.

On a streaming dataframe, however, *has_column* always returns true, so the 
transformation gets executed and causes an exception. What is the right 
approach to check for the existence of a column in a streaming dataframe 
before performing a transformation?


> Spark Structured Streaming - Kafka avro transformation on optional field 
> Failed
> ---
>
> Key: SPARK-34163
> URL: https://issues.apache.org/jira/browse/SPARK-34163
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.7
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hello All,
> I have a Spark Structured Streaming job that ingests data from Kafka, where 
> the messages from Kafka are Avro-encoded.
> Some of the fields are optional in the data, and I have to perform a 
> transformation only if those optional fields are present.
> So I tried to check whether a column exists with:
> {color:#0747A6}def has_column(dataframe, col):
>     """
>     This function checks the existence of a given column in the given 
> DataFrame
>     :param dataframe: the dataframe
>     :type dataframe: DataFrame
>     :param col: the column name
>     :type col: str
>     :return: true if the column exists else false
>     :rtype: boolean
>     """
>     try:
>         dataframe[col]
>         return True
>     except AnalysisException:
>         return False{color}
> This works on a normal (batch) dataframe: when a column is not present, the 
> check returns false, so I can skip the transformation on the missing column. 
> On a streaming dataframe, however, *has_column* always returns true, so the 
> transformation gets executed and causes an exception. What is the right 
> approach to check for the existence of a column in a streaming dataframe 
> before performing a transformation?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34163) Spark Structured Streaming - Kafka avro transformation on optional field Failed

2021-01-19 Thread Felix Kizhakkel Jose (Jira)
Felix Kizhakkel Jose created SPARK-34163:


 Summary: Spark Structured Streaming - Kafka avro transformation on 
optional field Failed
 Key: SPARK-34163
 URL: https://issues.apache.org/jira/browse/SPARK-34163
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.7
Reporter: Felix Kizhakkel Jose


Hello All,
I have a Spark Structured Streaming job that ingests data from Kafka, where the 
messages from Kafka are Avro-encoded.
Some of the fields are optional in the data, and I have to perform a 
transformation only if those optional fields are present.
So I tried to check whether a column exists with:


*def has_column(dataframe, col):
    """
    This function checks the existence of a given column in the given DataFrame

    :param dataframe: the dataframe
    :type dataframe: DataFrame
    :param col: the column name
    :type col: str
    :return: true if the column exists else false
    :rtype: boolean
    """
    try:
        dataframe[col]
        return True
    except AnalysisException:
        return False*


This works on a normal (batch) dataframe: when a column is not present, the 
check returns false, so I can skip the transformation on the missing column.

On a streaming dataframe, however, *has_column* always returns true, so the 
transformation gets executed and causes an exception. What is the right 
approach to check for the existence of a column in a streaming dataframe 
before performing a transformation?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31786) Exception on submitting Spark-Pi to Kubernetes 1.17.3

2021-01-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268257#comment-17268257
 ] 

Dongjoon Hyun commented on SPARK-31786:
---

Hi, [~smurarka]. Apache Jira is not designed for Q&A. You had better use the 
official mailing list if you have a question that is unrelated to this JIRA.

> Exception on submitting Spark-Pi to Kubernetes 1.17.3
> -
>
> Key: SPARK-31786
> URL: https://issues.apache.org/jira/browse/SPARK-31786
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Maciej Bryński
>Assignee: Dongjoon Hyun
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Hi,
> I'm getting exception when submitting Spark-Pi app to Kubernetes cluster.
> Kubernetes version: 1.17.3
> JDK version: openjdk version "1.8.0_252"
> Exception:
> {code}
>  ./bin/spark-submit --master k8s://https://172.31.23.60:8443 --deploy-mode 
> cluster --name spark-pi --conf 
> spark.kubernetes.container.image=spark-py:2.4.5 --conf 
> spark.kubernetes.executor.request.cores=0.1 --conf 
> spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf 
> spark.executor.instances=1 local:///opt/spark/examples/src/main/python/pi.py
> log4j:WARN No appenders could be found for logger 
> (io.fabric8.kubernetes.client.Config).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Exception in thread "main" 
> io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  
> for kind: [Pod]  with name: [null]  in namespace: [default]  failed.
> at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
> at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
> at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:337)
> at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:330)
> at 
> org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:141)
> at 
> org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:140)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
> at 
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:140)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:250)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:241)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.net.SocketException: Broken pipe (Write failed)
> at java.net.SocketOutputStream.socketWrite0(Native Method)
> at 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431)
> at sun.security.ssl.OutputRecord.write(OutputRecord.java:417)
> at 
> sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:894)
> at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:865)
> at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123)
> at okio.Okio$1.write(Okio.java:79)
> at okio.AsyncTimeout$1.write(AsyncTimeout.java:180)
> at okio.RealBufferedSink.flush(RealBufferedSink.java:224)
> at okhttp3.internal.http2.Http2Writer.settings(Http2Writer.java:203)
> at 
> 

[jira] [Assigned] (SPARK-34162) Add PyArrow compatibility note for Python 3.9

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34162:


Assignee: Apache Spark

> Add PyArrow compatibility note for Python 3.9
> -
>
> Key: SPARK-34162
> URL: https://issues.apache.org/jira/browse/SPARK-34162
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34162) Add PyArrow compatibility note for Python 3.9

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34162:


Assignee: (was: Apache Spark)

> Add PyArrow compatibility note for Python 3.9
> -
>
> Key: SPARK-34162
> URL: https://issues.apache.org/jira/browse/SPARK-34162
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34162) Add PyArrow compatibility note for Python 3.9

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268250#comment-17268250
 ] 

Apache Spark commented on SPARK-34162:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/31251

> Add PyArrow compatibility note for Python 3.9
> -
>
> Key: SPARK-34162
> URL: https://issues.apache.org/jira/browse/SPARK-34162
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34162) Add PyArrow compatibility note for Python 3.9

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34162:


Assignee: Apache Spark

> Add PyArrow compatibility note for Python 3.9
> -
>
> Key: SPARK-34162
> URL: https://issues.apache.org/jira/browse/SPARK-34162
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34162) Add PyArrow compatibility note for Python 3.9

2021-01-19 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-34162:
-

 Summary: Add PyArrow compatibility note for Python 3.9
 Key: SPARK-34162
 URL: https://issues.apache.org/jira/browse/SPARK-34162
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.2.0, 3.1.1
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31786) Exception on submitting Spark-Pi to Kubernetes 1.17.3

2021-01-19 Thread Sachit Murarka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268214#comment-17268214
 ] 

Sachit Murarka commented on SPARK-31786:


[~dongjoon]
To run Spark on Kubernetes, I see the following in the entrypoint script:
{{case "$1" in driver) shift 1 CMD=( "$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client}}

Could you please suggest why deploy-mode client is mentioned in 
[entrypoint.sh|https://t.co/IPbK5sEIqM?amp=1]? I am running spark-submit with 
deploy mode cluster, but inside this 
[entrypoint.sh|https://t.co/IPbK5sEIqM?amp=1] it is mentioned like that.

> Exception on submitting Spark-Pi to Kubernetes 1.17.3
> -
>
> Key: SPARK-31786
> URL: https://issues.apache.org/jira/browse/SPARK-31786
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Maciej Bryński
>Assignee: Dongjoon Hyun
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Hi,
> I'm getting exception when submitting Spark-Pi app to Kubernetes cluster.
> Kubernetes version: 1.17.3
> JDK version: openjdk version "1.8.0_252"
> Exception:
> {code}
>  ./bin/spark-submit --master k8s://https://172.31.23.60:8443 --deploy-mode 
> cluster --name spark-pi --conf 
> spark.kubernetes.container.image=spark-py:2.4.5 --conf 
> spark.kubernetes.executor.request.cores=0.1 --conf 
> spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf 
> spark.executor.instances=1 local:///opt/spark/examples/src/main/python/pi.py
> log4j:WARN No appenders could be found for logger 
> (io.fabric8.kubernetes.client.Config).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Exception in thread "main" 
> io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  
> for kind: [Pod]  with name: [null]  in namespace: [default]  failed.
> at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
> at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
> at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:337)
> at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:330)
> at 
> org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:141)
> at 
> org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:140)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
> at 
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:140)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:250)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:241)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.net.SocketException: Broken pipe (Write failed)
> at java.net.SocketOutputStream.socketWrite0(Native Method)
> at 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431)
> at sun.security.ssl.OutputRecord.write(OutputRecord.java:417)
> at 
> sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:894)
> at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:865)
> at 

[jira] [Commented] (SPARK-34099) Refactor table caching in `DataSourceV2Strategy`

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268210#comment-17268210
 ] 

Apache Spark commented on SPARK-34099:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31250

> Refactor table caching in `DataSourceV2Strategy`
> 
>
> Key: SPARK-34099
> URL: https://issues.apache.org/jira/browse/SPARK-34099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, invalidateCache() performs 3 calls to the Cache Manager to refresh 
> the table cache. We can perform only one call, recacheByPlan(). 
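
A hedged sketch of the refactoring idea (illustrative only, not the actual patch): collapse the separate cache-manager calls into a single recacheByPlan() on the session's cache manager.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// One call: recacheByPlan() refreshes every cache entry whose plan refers to
// the given table plan, instead of issuing separate invalidate/refresh calls.
def invalidateCache(spark: SparkSession, tablePlan: LogicalPlan): Unit =
  spark.sharedState.cacheManager.recacheByPlan(spark, tablePlan)
{code}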



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33792) NullPointerException in HDFSBackedStateStoreProvider

2021-01-19 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268202#comment-17268202
 ] 

Jungtaek Lim commented on SPARK-33792:
--

I have no idea how this could happen, but the new stack trace concerns me. If I 
understand correctly, unless there is another possibility, it is likely saying 
that a key or value in the map is null, which does not look possible.

I'm not sure the failure is consistent, since you described it as happening 
"time to time". Does the problematic checkpoint always fail consistently, or 
does it simply work on another run? Reproducing the issue is quite important 
for solving the problem.

Please also add the Java vendor & version to the issue description. And if the 
issue is vendor specific (as you've mentioned EMR), I'm not sure we'd like to 
investigate it here. Please check whether the problem also appears with 
"Apache" Spark 2.4.4.

> NullPointerException in HDFSBackedStateStoreProvider
> 
>
> Key: SPARK-33792
> URL: https://issues.apache.org/jira/browse/SPARK-33792
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4
> Environment: emr-5.28.0
>Reporter: Gabor Barna
>Priority: Major
>
> Hi,
> We are getting NPEs with spark structured streaming from time-to-time. Here's 
> the stacktrace:
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore$$anonfun$iterator$1.apply(HDFSBackedStateStoreProvider.scala:164)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore$$anonfun$iterator$1.apply(HDFSBackedStateStoreProvider.scala:163)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:220)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> org.apache.spark.sql.kinesis.KinesisWriteTask.execute(KinesisWriteTask.scala:50)
>   at 
> org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply$mcV$sp(KinesisWriter.scala:40)
>   at 
> org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply(KinesisWriter.scala:40)
>   at 
> org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply(KinesisWriter.scala:40)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at 
> org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(KinesisWriter.scala:40)
>   at 
> org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(KinesisWriter.scala:38)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748) {code}
> As you can see we are using [https://github.com/qubole/kinesis-sql/] for 
> writing the kinesis queue, the state backend is HDFS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To 

[jira] [Commented] (SPARK-34078) Provide async variants for Dataset APIs

2021-01-19 Thread Yesheng Ma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268201#comment-17268201
 ] 

Yesheng Ma commented on SPARK-34078:


Thanks! I'm looking into this and will prepare a diff shortly.

> Provide async variants for Dataset APIs
> ---
>
> Key: SPARK-34078
> URL: https://issues.apache.org/jira/browse/SPARK-34078
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Yesheng Ma
>Priority: Major
>
> Spark RDDs have async variants such as `collectAsync`, which come in handy 
> when we want to cancel a job. However, the Dataset API lacks such variants, 
> which makes it very painful to cancel a Dataset/SQL job.
>  
> The proposed change is to add async variants so that we can programmatically 
> cancel a Dataset/SQL query via a future.
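
A hedged sketch of the existing RDD-level workaround the description alludes to (illustrative only): dropping to the underlying RDD yields a FutureAction, which is the handle that makes programmatic cancellation possible today.

{code:scala}
import org.apache.spark.FutureAction
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("async-cancel-sketch")
  .getOrCreate()

// Kick off the job asynchronously through the RDD API and keep the handle.
val ds = spark.range(0L, 10000000L)
val future: FutureAction[Seq[java.lang.Long]] = ds.rdd.collectAsync()

// Later, e.g. on a timeout or a user request, the whole job can be cancelled:
// future.cancel()
{code}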



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34105:


Assignee: (was: Apache Spark)

> In addition to killing exlcuded/flakey executors which should support 
> decommissioning
> -
>
> Key: SPARK-34105
> URL: https://issues.apache.org/jira/browse/SPARK-34105
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>
> Decommissioning will give the executor a chance to migrate its files to a 
> more stable node.
>  
> Note: we want SPARK-34104 to be integrated as well so that flaky executors 
> which cannot decommission are eventually killed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34105:


Assignee: Apache Spark

> In addition to killing exlcuded/flakey executors which should support 
> decommissioning
> -
>
> Key: SPARK-34105
> URL: https://issues.apache.org/jira/browse/SPARK-34105
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>
> Decommissioning will give the executor a chance to migrate its files to a 
> more stable node.
>  
> Note: we want SPARK-34104 to be integrated as well so that flaky executors 
> which cannot decommission are eventually killed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268196#comment-17268196
 ] 

Apache Spark commented on SPARK-34105:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31249

> In addition to killing exlcuded/flakey executors which should support 
> decommissioning
> -
>
> Key: SPARK-34105
> URL: https://issues.apache.org/jira/browse/SPARK-34105
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>
> Decommissioning will give the executor a chance to migrate its files to a 
> more stable node.
>  
> Note: we want SPARK-34104 to be integrated as well so that flaky executors 
> which cannot decommission are eventually killed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268197#comment-17268197
 ] 

Apache Spark commented on SPARK-34105:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31249

> In addition to killing exlcuded/flakey executors which should support 
> decommissioning
> -
>
> Key: SPARK-34105
> URL: https://issues.apache.org/jira/browse/SPARK-34105
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>
> Decommissioning will give the executor a chance to migrate its files to a 
> more stable node.
>  
> Note: we want SPARK-34104 to be integrated as well so that flaky executors 
> which cannot decommission are eventually killed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34104) Allow users to specify a maximum decommissioning time

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268195#comment-17268195
 ] 

Apache Spark commented on SPARK-34104:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31249

> Allow users to specify a maximum decommissioning time
> -
>
> Key: SPARK-34104
> URL: https://issues.apache.org/jira/browse/SPARK-34104
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> We currently have the ability for users to set the predicted time at which 
> the cluster manager or cloud provider will terminate a decommissioning 
> executor, but for nodes where Spark itself is triggering decommissioning we 
> should add the ability for users to specify a maximum time we want to allow 
> the executor to decommission.
>  
> This is especially important if we start to trigger decommissioning in more 
> places (like with excluded executors that are found to be flaky, which may 
> or may not be able to decommission successfully).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34104) Allow users to specify a maximum decommissioning time

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34104:


Assignee: Holden Karau  (was: Apache Spark)

> Allow users to specify a maximum decommissioning time
> -
>
> Key: SPARK-34104
> URL: https://issues.apache.org/jira/browse/SPARK-34104
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> We currently have the ability for users to set the predicted time at which 
> the cluster manager or cloud provider will terminate a decommissioning 
> executor, but for nodes where Spark itself is triggering decommissioning we 
> should add the ability for users to specify a maximum time we want to allow 
> the executor to decommission.
>  
> This is especially important if we start to trigger decommissioning in more 
> places (like with excluded executors that are found to be flaky, which may 
> or may not be able to decommission successfully).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34104) Allow users to specify a maximum decommissioning time

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34104:


Assignee: Apache Spark  (was: Holden Karau)

> Allow users to specify a maximum decommissioning time
> -
>
> Key: SPARK-34104
> URL: https://issues.apache.org/jira/browse/SPARK-34104
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>
> We currently have the ability for users to set the predicted time at which 
> the cluster manager or cloud provider will terminate a decommissioning 
> executor, but for nodes where Spark itself is triggering decommissioning we 
> should add the ability for users to specify a maximum time we want to allow 
> the executor to decommission.
>  
> This is especially important if we start to trigger decommissioning in more 
> places (like with excluded executors that are found to be flaky, which may 
> or may not be able to decommission successfully).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34161) Check re-caching of v2 table dependents after table altering

2021-01-19 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-34161:
---
Description: Add tests to unified test suites and check that dependants of 
v2 table are still cached after table altering.  (was: Add tests to unified 
tests and check that dependants of v2 table are still cached after table 
altering.)

> Check re-caching of v2 table dependents after table altering
> 
>
> Key: SPARK-34161
> URL: https://issues.apache.org/jira/browse/SPARK-34161
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Add tests to unified test suites and check that dependants of v2 table are 
> still cached after table altering.
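
A hedged sketch of the kind of check the ticket describes (the {{testcat}} catalog and the {{foo}} provider below are placeholder assumptions, not names taken from this ticket): cache a view that depends on a v2 table, alter the table, and assert the dependent is still cached.

{code:scala}
// Assumed to run inside a Spark SQL test suite where sql(...) and spark are in scope.
sql("CREATE TABLE testcat.ns.tbl (id INT, data STRING) USING foo")
sql("CREATE TEMPORARY VIEW v AS SELECT id, data FROM testcat.ns.tbl")
sql("CACHE TABLE v")

// Alter the underlying v2 table, then verify the dependent view is still cached.
sql("ALTER TABLE testcat.ns.tbl ADD COLUMNS (extra INT)")
assert(spark.catalog.isCached("v"))
{code}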



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34161) Check re-caching of v2 table dependents after table altering

2021-01-19 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-34161:
--

 Summary: Check re-caching of v2 table dependents after table 
altering
 Key: SPARK-34161
 URL: https://issues.apache.org/jira/browse/SPARK-34161
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk


Add tests to unified tests and check that dependants of v2 table are still 
cached after table altering.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34143) Adding partitions to fully partitioned v2 table

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268156#comment-17268156
 ] 

Apache Spark commented on SPARK-34143:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31247

> Adding partitions to fully partitioned v2 table
> ---
>
> Key: SPARK-34143
> URL: https://issues.apache.org/jira/browse/SPARK-34143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> The test below fails:
> {code:scala}
> withNamespaceAndTable("ns", "tbl") { t =>
>   sql(s"CREATE TABLE $t (p0 INT, p1 STRING) $defaultUsing PARTITIONED BY 
> (p0, p1)")
>   sql(s"ALTER TABLE $t ADD PARTITION (p0 = 0, p1 = 'abc')")
>   checkPartitions(t, Map("p0" -> "0", "p1" -> "abc"))
>   checkAnswer(sql(s"SELECT * FROM $t"), Row(0, "abc"))
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34143) Adding partitions to fully partitioned v2 table

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268153#comment-17268153
 ] 

Apache Spark commented on SPARK-34143:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31247

> Adding partitions to fully partitioned v2 table
> ---
>
> Key: SPARK-34143
> URL: https://issues.apache.org/jira/browse/SPARK-34143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> The test below fails:
> {code:scala}
> withNamespaceAndTable("ns", "tbl") { t =>
>   sql(s"CREATE TABLE $t (p0 INT, p1 STRING) $defaultUsing PARTITIONED BY 
> (p0, p1)")
>   sql(s"ALTER TABLE $t ADD PARTITION (p0 = 0, p1 = 'abc')")
>   checkPartitions(t, Map("p0" -> "0", "p1" -> "abc"))
>   checkAnswer(sql(s"SELECT * FROM $t"), Row(0, "abc"))
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33940) allow configuring the max column name length in csv writer

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268099#comment-17268099
 ] 

Apache Spark commented on SPARK-33940:
--

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/31246

> allow configuring the max column name length in csv writer
> --
>
> Key: SPARK-33940
> URL: https://issues.apache.org/jira/browse/SPARK-33940
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nan Zhu
>Priority: Major
>
> The CSV writer actually has an implicit limit on column name length due to 
> univocity-parsers.
>  
> When we initialize a writer 
> [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211,]
>  it calls toIdentifierGroupArray, which eventually calls valueOf in 
> NormalizedString.java 
> ([https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209)]
>  
> In that stringCache.get there is a maxStringLength cap 
> [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104]
>  which is 1024 by default.
>  
> We do not expose this as a configurable option, leading to an NPE when we 
> have a column name longer than 1024 characters:
> ```
> [info]   Cause: java.lang.NullPointerException:
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
> [info]   at 
> org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CsvOutputWriter.scala:44)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
> [info]   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
> [info]   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
> ```
>  
> It can be reproduced by a simple unit test:
>  
> ```
> val row1 = Row("a")
> val superLongHeader = (0 until 1025).map(_ => "c").mkString("")
> val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader)
> df.repartition(1)
>  .write
>  .option("header", "true")
>  .option("maxColumnNameLength", 1025)
>  .csv(dataPath)
> ```
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33507) Improve and fix cache behavior in v1 and v2

2021-01-19 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268063#comment-17268063
 ] 

Chao Sun commented on SPARK-33507:
--

[~aokolnychyi] could you elaborate on the question? Currently, Spark doesn't 
support caching streaming tables yet.

> Improve and fix cache behavior in v1 and v2
> ---
>
> Key: SPARK-33507
> URL: https://issues.apache.org/jira/browse/SPARK-33507
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Critical
>
> This is an umbrella JIRA to track fixes & improvements for caching behavior 
> in Spark datasource v1 and v2, which includes:
>   - fix existing cache behavior in v1 and v2.
>   - fix inconsistent cache behavior between v1 and v2
>   - implement missing features in v2 to align with those in v1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34160) pyspark.ml.stat.Summarizer should allow sparse vector results

2021-01-19 Thread Ophir Yoktan (Jira)
Ophir Yoktan created SPARK-34160:


 Summary: pyspark.ml.stat.Summarizer should allow sparse vector 
results
 Key: SPARK-34160
 URL: https://issues.apache.org/jira/browse/SPARK-34160
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 3.0.1
Reporter: Ophir Yoktan


Currently pyspark.ml.stat.Summarizer returns a dense vector, even if the 
input is sparse.

The Summarizer should either deduce the relevant type from the input, or add a 
parameter that forces sparse output.

Code to reproduce:

{code:python}
import pyspark
from pyspark.sql.functions import col
from pyspark.ml.stat import Summarizer
from pyspark.ml.linalg import SparseVector, DenseVector

sc = pyspark.SparkContext.getOrCreate()
sql_context = pyspark.SQLContext(sc)

df = sc.parallelize([(SparseVector(100, {1: 1.0}),)]).toDF(['v'])
print(df.head())
print(df.select(Summarizer.mean(col('v'))).head())
{code}

Output:

{{Row(v=SparseVector(100, \{1: 1.0})) }}
{{Row(mean(v)=DenseVector([0.0, 1.0,}}
{{0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0]))}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33792) NullPointerException in HDFSBackedStateStoreProvider

2021-01-19 Thread Gabor Barna (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267972#comment-17267972
 ] 

Gabor Barna commented on SPARK-33792:
-

Another flavor of what is probably the same issue:
{code:java}
Caused by: java.lang.NullPointerException
at 
java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1011)
at 
java.util.concurrent.ConcurrentHashMap.putAll(ConcurrentHashMap.java:1084)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:204)
at 
org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:371)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748) {code}
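
For context, a minimal sketch (hypothetical checkpoint path) of the kind of stateful 
query that goes through HDFSBackedStateStoreProvider.getStore at the start of every 
micro-batch, which is where the NPE in the stack trace above is thrown:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("state-store-sketch").getOrCreate()

// Any stateful operator (dropDuplicates, streaming aggregation, etc.) loads
// its per-partition state through HDFSBackedStateStoreProvider (the default
// state store) on each micro-batch.
val query = spark.readStream
  .format("rate")
  .load()
  .dropDuplicates("value")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/state-store-sketch")  // hypothetical path
  .start()

query.awaitTermination()
{code}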

> NullPointerException in HDFSBackedStateStoreProvider
> 
>
> Key: SPARK-33792
> URL: https://issues.apache.org/jira/browse/SPARK-33792
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4
> Environment: emr-5.28.0
>Reporter: Gabor Barna
>Priority: Major
>
> Hi,
> We are getting NPEs with Spark Structured Streaming from time to time. Here's 
> the stack trace:
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore$$anonfun$iterator$1.apply(HDFSBackedStateStoreProvider.scala:164)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore$$anonfun$iterator$1.apply(HDFSBackedStateStoreProvider.scala:163)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:220)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> org.apache.spark.sql.kinesis.KinesisWriteTask.execute(KinesisWriteTask.scala:50)
>   at 
> org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply$mcV$sp(KinesisWriter.scala:40)
>   at 
> org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply(KinesisWriter.scala:40)
>   at 
> org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply(KinesisWriter.scala:40)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at 
> org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(KinesisWriter.scala:40)
>   at 
> org.apache.spark.sql.kinesis.KinesisWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(KinesisWriter.scala:38)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
>

[jira] [Commented] (SPARK-34157) Unify output of SHOW TABLES and pass output attributes properly

2021-01-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267949#comment-17267949
 ] 

Apache Spark commented on SPARK-34157:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/31245
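
For context, a rough sketch of how the current output attributes can be inspected 
(the column names shown are an assumption for the v1 session catalog; v2 catalogs 
report a different set, which is the inconsistency this ticket addresses):

{code:scala}
// Inspect the schema that SHOW TABLES currently exposes.
spark.sql("SHOW TABLES").printSchema()
// With the v1 session catalog this prints approximately:
// root
//  |-- database: string
//  |-- tableName: string
//  |-- isTemporary: boolean
{code}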

> Unify output of SHOW TABLES and pass output attributes properly
> ---
>
> Key: SPARK-34157
> URL: https://issues.apache.org/jira/browse/SPARK-34157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34157) Unify output of SHOW TABLES and pass output attributes properly

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34157:


Assignee: Apache Spark

> Unify output of SHOW TABLES and pass output attributes properly
> ---
>
> Key: SPARK-34157
> URL: https://issues.apache.org/jira/browse/SPARK-34157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34157) Unify output of SHOW TABLES and pass output attributes properly

2021-01-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34157:


Assignee: (was: Apache Spark)

> Unify output of SHOW TABLES and pass output attributes properly
> ---
>
> Key: SPARK-34157
> URL: https://issues.apache.org/jira/browse/SPARK-34157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34156) Unify the output of DDL and pass output attributes properly

2021-01-19 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-34156:
---
Description: The current implementation of some DDL commands does not unify 
the output and does not pass the output properly to the physical command.  
(was: The current implement of some DDL not unify the output and not )

> Unify the output of DDL and pass output attributes properly
> ---
>
> Key: SPARK-34156
> URL: https://issues.apache.org/jira/browse/SPARK-34156
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current implementation of some DDL commands does not unify the output 
> and does not pass the output properly to the physical command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34156) Unify the output of DDL and pass output attributes properly

2021-01-19 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-34156:
---
Description: The current implement of some DDL not unify the output and not 
  (was: The current implement of some DDL )

> Unify the output of DDL and pass output attributes properly
> ---
>
> Key: SPARK-34156
> URL: https://issues.apache.org/jira/browse/SPARK-34156
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current implement of some DDL not unify the output and not 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


