[jira] [Assigned] (SPARK-37733) Change log level of tests to WARN

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37733:


Assignee: Gengliang Wang  (was: Apache Spark)

> Change log level of tests to WARN
> -
>
> Key: SPARK-37733
> URL: https://issues.apache.org/jira/browse/SPARK-37733
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37733) Change log level of tests to WARN

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37733:


Assignee: Apache Spark  (was: Gengliang Wang)

> Change log level of tests to WARN
> -
>
> Key: SPARK-37733
> URL: https://issues.apache.org/jira/browse/SPARK-37733
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37733) Change log level of tests to WARN

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464919#comment-17464919
 ] 

Apache Spark commented on SPARK-37733:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35011

> Change log level of tests to WARN
> -
>
> Key: SPARK-37733
> URL: https://issues.apache.org/jira/browse/SPARK-37733
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37733) Change log level of tests to WARN

2021-12-23 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-37733:
---
Component/s: Build
 (was: Project Infra)

> Change log level of tests to WARN
> -
>
> Key: SPARK-37733
> URL: https://issues.apache.org/jira/browse/SPARK-37733
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37733) Change log level of tests to WARN

2021-12-23 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-37733:
--

 Summary: Change log level of tests to WARN
 Key: SPARK-37733
 URL: https://issues.apache.org/jira/browse/SPARK-37733
 Project: Spark
  Issue Type: Task
  Components: Project Infra
Affects Versions: 3.3.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37732) Improve the implement of JDBCV2Suite

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464906#comment-17464906
 ] 

Apache Spark commented on SPARK-37732:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/35010

> Improve the implement of JDBCV2Suite
> 
>
> Key: SPARK-37732
> URL: https://issues.apache.org/jira/browse/SPARK-37732
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> When I read the implementation of JDBCV2Suite, I found that the code could be improved.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37732) Improve the implement of JDBCV2Suite

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37732:


Assignee: Apache Spark

> Improve the implement of JDBCV2Suite
> 
>
> Key: SPARK-37732
> URL: https://issues.apache.org/jira/browse/SPARK-37732
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> When I read the implementation of JDBCV2Suite, I found that the code could be improved.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37732) Improve the implement of JDBCV2Suite

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37732:


Assignee: (was: Apache Spark)

> Improve the implement of JDBCV2Suite
> 
>
> Key: SPARK-37732
> URL: https://issues.apache.org/jira/browse/SPARK-37732
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> When I read the implementation of JDBCV2Suite, I found that the code could be improved.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37732) Improve the implement of JDBCV2Suite

2021-12-23 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-37732:
---
Description: When I read the implementation of JDBCV2Suite, I found that the 
code could be improved.  (was: When I read the )

> Improve the implement of JDBCV2Suite
> 
>
> Key: SPARK-37732
> URL: https://issues.apache.org/jira/browse/SPARK-37732
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> When I read the implementation of JDBCV2Suite, I found that the code could be improved.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37732) Improve the implement of JDBCV2Suite

2021-12-23 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-37732:
---
Description: When I read the 

> Improve the implement of JDBCV2Suite
> 
>
> Key: SPARK-37732
> URL: https://issues.apache.org/jira/browse/SPARK-37732
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> When I read the 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37732) Improve the implement of JDBCV2Suite

2021-12-23 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-37732:
--

 Summary: Improve the implement of JDBCV2Suite
 Key: SPARK-37732
 URL: https://issues.apache.org/jira/browse/SPARK-37732
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: jiaan.geng






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37527) Translate more standard aggregate functions for pushdown

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464902#comment-17464902
 ] 

Apache Spark commented on SPARK-37527:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/35009

> Translate more standard aggregate functions for pushdown
> 
>
> Key: SPARK-37527
> URL: https://issues.apache.org/jira/browse/SPARK-37527
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark's aggregate pushdown translates some standard aggregate 
> functions so that they can be compiled into SQL for a specific database.
> After this work, users could override JdbcDialect.compileAggregate to 
> implement aggregate functions supported by a particular database.
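As an editorial aside (not part of the issue), a minimal sketch of what such an override could look like; the dialect name and JDBC URL prefix are made up, and the signatures follow the Spark 3.2-era API as I understand it:

{code}
import org.apache.spark.sql.connector.expressions.aggregate.AggregateFunc
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect for a database reachable at jdbc:mydb:// URLs.
object MyDbDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb")

  override def compileAggregate(aggFunction: AggregateFunc): Option[String] = {
    // Reuse Spark's built-in translation (MIN/MAX/COUNT/SUM/AVG). A real
    // dialect could also pattern-match on further AggregateFunc subclasses
    // here and emit database-specific SQL, or return None when the function
    // cannot be pushed down.
    super.compileAggregate(aggFunction)
  }
}

// Registering the dialect makes it apply to matching JDBC URLs.
JdbcDialects.registerDialect(MyDbDialect)
{code}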



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37727) Show ignored confs & hide warnings for conf already set in SparkSession.builder.getOrCreate

2021-12-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37727.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35001
[https://github.com/apache/spark/pull/35001]

> Show ignored confs & hide warnings for conf already set in 
> SparkSession.builder.getOrCreate
> ---
>
> Key: SPARK-37727
> URL: https://issues.apache.org/jira/browse/SPARK-37727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, {{SparkSession.builder.getOrCreate()}} is too noisy even when 
> duplicate configurations are set, and users cannot tell which configurations 
> they need to fix. See the example below:
> {code}
> ./bin/spark-shell --conf spark.abc=abc
> {code}
> {code}
> import org.apache.spark.sql.SparkSession
> spark.sparkContext.setLogLevel("DEBUG")
> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> {code}
> {code}
> ...
> 21:12:40.601 [main] WARN  org.apache.spark.sql.SparkSession - Using an 
> existing SparkSession; some spark core configurations may not take effect.
> {code}
> This is straightforward when there are few configurations, but it is difficult 
> for users to figure out when there are many, especially when these 
> configurations are defined in property files like {{spark-defaults.conf}} that 
> are sometimes maintained separately by system admins.
> See also https://github.com/apache/spark/pull/34757#discussion_r769248275
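As an editorial aside (not from the issue), one way to check whether a builder option actually took effect on the already-running session; {{spark.abc}} here is just the placeholder key used above:

{code}
import org.apache.spark.sql.SparkSession

// getOrCreate() returns the existing session rather than a new one...
val session = SparkSession.builder.config("spark.abc", "abc").getOrCreate()

// ...so reading the runtime conf shows whether the requested value was
// applied or silently ignored.
println(session.conf.getOption("spark.abc"))
{code}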



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37727) Show ignored confs & hide warnings for conf already set in SparkSession.builder.getOrCreate

2021-12-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37727:


Assignee: Hyukjin Kwon

> Show ignored confs & hide warnings for conf already set in 
> SparkSession.builder.getOrCreate
> ---
>
> Key: SPARK-37727
> URL: https://issues.apache.org/jira/browse/SPARK-37727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, {{SparkSession.builder.getOrCreate()}} is too noisy even when 
> duplicate configurations are set, and users cannot tell which configurations 
> they need to fix. See the example below:
> {code}
> ./bin/spark-shell --conf spark.abc=abc
> {code}
> {code}
> import org.apache.spark.sql.SparkSession
> spark.sparkContext.setLogLevel("DEBUG")
> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> {code}
> {code}
> ...
> 21:12:40.601 [main] WARN  org.apache.spark.sql.SparkSession - Using an 
> existing SparkSession; some spark core configurations may not take effect.
> {code}
> This is straightforward when there are few configurations, but it is difficult 
> for users to figure out when there are many, especially when these 
> configurations are defined in property files like {{spark-defaults.conf}} that 
> are sometimes maintained separately by system admins.
> See also https://github.com/apache/spark/pull/34757#discussion_r769248275



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37478) Unify v1 and v2 DROP NAMESPACE tests

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464877#comment-17464877
 ] 

Apache Spark commented on SPARK-37478:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/35007

> Unify v1 and v2 DROP NAMESPACE tests
> 
>
> Key: SPARK-37478
> URL: https://issues.apache.org/jira/browse/SPARK-37478
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37720) /1624610690382/state/0/8/1272.delta does not exist at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.updateFromDeltaFile

2021-12-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37720.
--
Resolution: Invalid

This is likely a version mismatch somewhere in your environment between the 
Python side and Spark core.

For questions like this, please ask on the mailing list instead.

> /1624610690382/state/0/8/1272.delta does not exist at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.updateFromDeltaFile
> -
>
> Key: SPARK-37720
> URL: https://issues.apache.org/jira/browse/SPARK-37720
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.5, 3.0.1, 3.1.2
> Environment: Hadoop 2.8.5, Spark 3.1.2 and Spark 2.4.5
>Reporter: lai fangmin
>Priority: Major
>  Labels: issues
>
> hi,
>  I am using structured streaming with checkpointing. Has anybody seen the 
> following error happen occasionally? Spark 3.1.2 and Spark 2.4.5, Hadoop 2.8.5
> An error occurred while calling o126.getResult. : 
> org.apache.spark.SparkException: Exception thrown in awaitResult: at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301) at 
> org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:97)
>  at 
> org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:93)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
> py4j.Gateway.invoke(Gateway.java:282) at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
> py4j.commands.CallCommand.execute(CallCommand.java:79) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 8.3 in stage 1.0 
> (TID 103) (SYSOPS00260773 executor 6): java.lang.IllegalStateException: Error 
> reading delta file hdfs://24610690382/state/0/8/1272.delta of 
> HDFSStateStoreProvider[id = (op=0,part=8),dir = 
> hdfs://24610690382/state/0/8]: hdfs://1624610690382/state/0/8/1272.delta does 
> not exist at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:461)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.$anonfun$loadMap$4(HDFSBackedStateStoreProvider.scala:417)
>  at scala.runtime.java8.JFunction1$mcVJ$sp.apply(JFunction1$mcVJ$sp.java:23) 
> at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:74) at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.$anonfun$loadMap$2(HDFSBackedStateStoreProvider.scala:416)
>  at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:597) at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.loadMap(HDFSBackedStateStoreProvider.scala:389)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getLoadedMapForStore(HDFSBackedStateStoreProvider.scala:236)
>  at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220)
>  at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:469)
>  at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:125)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at 
> org.apache.spark.sql.execution.SQLExecutionRDD.compute(SQLExecutionRDD.scala:55)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at 
> 

[jira] [Commented] (SPARK-37302) Explicitly download the dependencies of guava and jetty-io in test-dependencies.sh

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464868#comment-17464868
 ] 

Apache Spark commented on SPARK-37302:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/35006

> Explicitly download the dependencies of guava and jetty-io in 
> test-dependencies.sh
> --
>
> Key: SPARK-37302
> URL: https://issues.apache.org/jira/browse/SPARK-37302
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.1, 3.3.0
>
>
> dev/run-tests.py fails if Scala 2.13 is used and guava or jetty-io is not in 
> both the Maven and Coursier local repositories.
> {code:java}
> $ rm -rf ~/.m2/repository/*
> $ # For Linux
> $ rm -rf ~/.cache/coursier/v1/*
> $ # For macOS
> $ rm -rf ~/Library/Caches/Coursier/v1/*
> $ dev/change-scala-version.sh 2.13
> $ dev/test-dependencies.sh
> $ build/sbt -Pscala-2.13 clean compile
> ...
> [error] 
> /home/kou/work/oss/spark-scala-2.13/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java:24:1:
>   error: package com.google.common.primitives does not exist
> [error] import com.google.common.primitives.Ints;
> [error]^
> [error] 
> /home/kou/work/oss/spark-scala-2.13/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java:30:1:
>   error: package com.google.common.annotations does not exist
> [error] import com.google.common.annotations.VisibleForTesting;
> [error] ^
> [error] 
> /home/kou/work/oss/spark-scala-2.13/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java:31:1:
>   error: package com.google.common.base does not exist
> [error] import com.google.common.base.Preconditions;
> ...
> {code}
> {code:java}
> [error] 
> /home/kou/work/oss/spark-scala-2.13/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala:87:25:
>  Class org.eclipse.jetty.io.ByteBufferPool not found - continuing with a stub.
> [error] val connector = new ServerConnector(
> [error] ^
> [error] 
> /home/kou/work/oss/spark-scala-2.13/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala:87:21:
>  multiple constructors for ServerConnector with alternatives:
> [error]   (x$1: org.eclipse.jetty.server.Server,x$2: 
> java.util.concurrent.Executor,x$3: 
> org.eclipse.jetty.util.thread.Scheduler,x$4: 
> org.eclipse.jetty.io.ByteBufferPool,x$5: Int,x$6: Int,x$7: 
> org.eclipse.jetty.server.ConnectionFactory*)org.eclipse.jetty.server.ServerConnector
>  
> [error]   (x$1: org.eclipse.jetty.server.Server,x$2: 
> org.eclipse.jetty.util.ssl.SslContextFactory,x$3: 
> org.eclipse.jetty.server.ConnectionFactory*)org.eclipse.jetty.server.ServerConnector
>  
> [error]   (x$1: org.eclipse.jetty.server.Server,x$2: 
> org.eclipse.jetty.server.ConnectionFactory*)org.eclipse.jetty.server.ServerConnector
>  
> [error]   (x$1: org.eclipse.jetty.server.Server,x$2: Int,x$3: Int,x$4: 
> org.eclipse.jetty.server.ConnectionFactory*)org.eclipse.jetty.server.ServerConnector
> [error]  cannot be invoked with (org.eclipse.jetty.server.Server, Null, 
> org.eclipse.jetty.util.thread.ScheduledExecutorScheduler, Null, Int, Int, 
> org.eclipse.jetty.server.HttpConnectionFactory)
> [error] val connector = new ServerConnector(
> [error] ^
> [error] 
> /home/kou/work/oss/spark-scala-2.13/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:207:13:
>  Class org.eclipse.jetty.io.ClientConnectionFactory not found - continuing 
> with a stub.
> [error] new HttpClient(new HttpClientTransportOverHTTP(numSelectors), 
> null)
> [error] ^
> [error] 
> /home/kou/work/oss/spark-scala-2.13/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:287:25:
>  multiple constructors for ServerConnector with alternatives:
> [error]   (x$1: org.eclipse.jetty.server.Server,x$2: 
> java.util.concurrent.Executor,x$3: 
> org.eclipse.jetty.util.thread.Scheduler,x$4: 
> org.eclipse.jetty.io.ByteBufferPool,x$5: Int,x$6: Int,x$7: 
> org.eclipse.jetty.server.ConnectionFactory*)org.eclipse.jetty.server.ServerConnector
>  
> [error]   (x$1: org.eclipse.jetty.server.Server,x$2: 
> org.eclipse.jetty.util.ssl.SslContextFactory,x$3: 
> org.eclipse.jetty.server.ConnectionFactory*)org.eclipse.jetty.server.ServerConnector
>  
> [error]   (x$1: org.eclipse.jetty.server.Server,x$2: 
> org.eclipse.jetty.server.ConnectionFactory*)org.eclipse.jetty.server.ServerConnector
>  
> [error]   (x$1: org.eclipse.jetty.server.Server,x$2: Int,x$3: Int,x$4: 

[jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-12-23 Thread vicviz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464859#comment-17464859
 ] 

vicviz commented on SPARK-34648:


I also encountered this problem on 2.0.2.

But it works well and reads fast on 3.0.0. Upgrading your version is a good 
choice.

> Reading Parquet Files in Spark Extremely Slow for Large Number of Files?
> 
>
> Key: SPARK-34648
> URL: https://issues.apache.org/jira/browse/SPARK-34648
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Pankaj Bhootra
>Priority: Major
>
> Hello Team
> I am new to Spark and this question may be a possible duplicate of the issue 
> highlighted here: https://issues.apache.org/jira/browse/SPARK-9347 
> We have a large dataset partitioned by calendar date, and within each date 
> partition, we are storing the data as *parquet* files in 128 parts.
> We are trying to run aggregation on this dataset for 366 dates at a time with 
> Spark SQL on spark version 2.3.0, hence our Spark job is reading 
> 366*128=46848 partitions, all of which are parquet files. There is currently 
> no *_metadata* or *_common_metadata* file(s) available for this dataset.
> The problem we are facing is that when we try to run *spark.read.parquet* on 
> the above 46848 partitions, our data reads are extremely slow. It takes a 
> long time to run even a simple map task (no shuffling) without any 
> aggregation or group by.
> I read through the above issue and I think I generally understand the ideas 
> around the *_common_metadata* file. But the above issue was raised for 
> Spark 1.3.1, and for Spark 2.3.0 I have not found any documentation related 
> to this metadata file so far.
> I would like to clarify:
>  # What's the latest best practice for reading a large number of parquet files 
> efficiently?
>  # Does this involve using any additional options with spark.read.parquet? 
> How would that work?
>  # Are there other possible reasons for slow data reads apart from reading 
> metadata for every part? We are basically trying to migrate our existing 
> Spark pipeline from csv files to parquet, but from my hands-on experience so 
> far, parquet's read time seems slower than csv's. This seems contradictory 
> to the popular opinion that parquet performs better in terms of both 
> computation and storage.
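As an editorial aside (not an answer from the thread), the kind of options the questions above refer to look roughly like this in Scala; the path and filter column are made up, and whether these settings help depends on the data layout:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("parquet-read").getOrCreate()

// Schema merging reads a footer per file; leaving it off (the default)
// avoids extra metadata work across tens of thousands of part files.
val df = spark.read
  .option("mergeSchema", "false")
  .parquet("/data/events")   // hypothetical partitioned root path

// Filtering on the partition column lets Spark prune date partitions
// instead of listing and scanning all of them.
val january = df.filter(col("date") >= "2021-01-01" && col("date") < "2021-02-01")
january.count()
{code}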



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-32001

2021-12-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37391:
-
Fix Version/s: 3.1.3
   3.2.1

> SIGNIFICANT bottleneck introduced by fix for SPARK-32001
> 
>
> Key: SPARK-37391
> URL: https://issues.apache.org/jira/browse/SPARK-37391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
> Environment: N/A
>Reporter: Danny Guinther
>Assignee: Danny Guinther
>Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
> Attachments: so-much-blocking.jpg, spark-regression-dashes.jpg
>
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
>  ) does not seem to have considered the reality that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections, and 
> that strikes me as a *significant* user-impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >3.1.x because of this significant bottleneck.
>  
> Thanks in advance for your help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-32001

2021-12-23 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-37391.

Fix Version/s: 3.3.0
 Assignee: Danny Guinther
   Resolution: Fixed

Issue resolved in https://github.com/apache/spark/pull/34745 for Spark 3.3.0.

> SIGNIFICANT bottleneck introduced by fix for SPARK-32001
> 
>
> Key: SPARK-37391
> URL: https://issues.apache.org/jira/browse/SPARK-37391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
> Environment: N/A
>Reporter: Danny Guinther
>Assignee: Danny Guinther
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: so-much-blocking.jpg, spark-regression-dashes.jpg
>
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
>  ) does not seem to have considered the reality that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections, and 
> that strikes me as a *significant* user-impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >3.1.x because of this significant bottleneck.
>  
> Thanks in advance for your help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37657) Support str and timestamp for (Series|DataFrame).describe()

2021-12-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37657.
--
Fix Version/s: 3.3.0
 Assignee: Haejoon Lee
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/34931

> Support str and timestamp for (Series|DataFrame).describe()
> ---
>
> Key: SPARK-37657
> URL: https://issues.apache.org/jira/browse/SPARK-37657
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.3.0
>
>
> Originally reported in the Koalas issue: 
> [https://github.com/databricks/koalas/issues/1888]
>  
> The `(Series|DataFrame).describe()` in pandas API on Spark doesn't work 
> properly when the DataFrame has no numeric column.
>  
>  
> {code:java}
> >>> df = ps.DataFrame({'a': ["a", "b", "c"]})
> >>> df.describe()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../python/pyspark/pandas/frame.py", line 7582, in describe
> raise ValueError("Cannot describe a DataFrame without columns")
> ValueError: Cannot describe a DataFrame without columns 
> {code}
>  
> As it works fine in pandas, we should fix it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37716) Allow LateralJoin node to host non-deterministic expressions when the outer query is a single row relation

2021-12-23 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-37716:
-
Description: 
After https://issues.apache.org/jira/browse/SPARK-37199, Analyzer will block 
LateralJoin that has non-deterministic lateral subqueries. But when the outer 
query is a single row relation, this should be allowed

For example:
Query:
{code:java}
SELECT * FROM VALUES(0) t(x) JOIN LATERAL (SELECT rand(0) + x AS y); {code}
Result:
{code:java}
org.apache.spark.sql.AnalysisException: nondeterministic expressions are only 
allowed in
Project, Filter, Aggregate or Window{code}
 

  was:
After https://issues.apache.org/jira/browse/SPARK-37199, Analyzer will block 
LateralJoin that has non-deterministic lateral subqueries. But when the outer 
query is a single row relation, this should be allowed

For example:
Query:
{code:java}
SELECT * FROM VALUES(0) t(x) JOIN LATERAL (SELECT rand(0) + x AS y); {code}
Result:
{code:java}
org.apache.spark.sql.AnalysisException: nondeterministic expressions are only 
allowed in
Project, Filter, Aggregate, Window, or Generate{code}
 


> Allow LateralJoin node to host non-deterministic expressions when the outer 
> query is a single row relation
> --
>
> Key: SPARK-37716
> URL: https://issues.apache.org/jira/browse/SPARK-37716
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Allison Wang
>Priority: Major
>
> After https://issues.apache.org/jira/browse/SPARK-37199, the Analyzer will block 
> a LateralJoin that has non-deterministic lateral subqueries. But when the outer 
> query is a single-row relation, this should be allowed.
> For example:
> Query:
> {code:java}
> SELECT * FROM VALUES(0) t(x) JOIN LATERAL (SELECT rand(0) + x AS y); {code}
> Result:
> {code:java}
> org.apache.spark.sql.AnalysisException: nondeterministic expressions are only 
> allowed in
> Project, Filter, Aggregate or Window{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37716) Allow LateralJoin node to host non-deterministic expressions when the outer query is a single row relation

2021-12-23 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-37716:
-
Description: 
After https://issues.apache.org/jira/browse/SPARK-37199, Analyzer will block 
LateralJoin that has non-deterministic lateral subqueries. But when the outer 
query is a single row relation, this should be allowed

For example:
Query:
{code:java}
SELECT * FROM VALUES(0) t(x) JOIN LATERAL (SELECT rand(0) + x AS y); {code}
Result:
{code:java}
org.apache.spark.sql.AnalysisException: nondeterministic expressions are only 
allowed in
Project, Filter, Aggregate, Window, or Generate{code}
 

  was:
After https://issues.apache.org/jira/browse/SPARK-37199, Analyzer will block 
LateralJoin that has non-deterministic lateral subqueries. This should be 
allowed.

For example:
Query:

 
{code:java}
SELECT t1.* FROM t1 JOIN LATERAL (SELECT rand(0) + c2 AS c3); {code}
Result:
{code:java}
org.apache.spark.sql.AnalysisException: nondeterministic expressions are only 
allowed in
Project, Filter, Aggregate, Window, or Generate, but found:
 lateralsubquery(spark_catalog.default.t1.c2)
in operator LateralJoin lateral-subquery#1 [c2#3], Inner {code}
 


> Allow LateralJoin node to host non-deterministic expressions when the outer 
> query is a single row relation
> --
>
> Key: SPARK-37716
> URL: https://issues.apache.org/jira/browse/SPARK-37716
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Allison Wang
>Priority: Major
>
> After https://issues.apache.org/jira/browse/SPARK-37199, the Analyzer will block 
> a LateralJoin that has non-deterministic lateral subqueries. But when the outer 
> query is a single-row relation, this should be allowed.
> For example:
> Query:
> {code:java}
> SELECT * FROM VALUES(0) t(x) JOIN LATERAL (SELECT rand(0) + x AS y); {code}
> Result:
> {code:java}
> org.apache.spark.sql.AnalysisException: nondeterministic expressions are only 
> allowed in
> Project, Filter, Aggregate, Window, or Generate{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37716) Allow LateralJoin node to host non-deterministic expressions when the outer query is a single row relation

2021-12-23 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-37716:
-
Summary: Allow LateralJoin node to host non-deterministic expressions when 
the outer query is a single row relation  (was: Allow LateralJoin node to host 
non-deterministic expressions)

> Allow LateralJoin node to host non-deterministic expressions when the outer 
> query is a single row relation
> --
>
> Key: SPARK-37716
> URL: https://issues.apache.org/jira/browse/SPARK-37716
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Allison Wang
>Priority: Major
>
> After https://issues.apache.org/jira/browse/SPARK-37199, the Analyzer will block 
> a LateralJoin that has non-deterministic lateral subqueries. This should be 
> allowed.
> For example:
> Query:
>  
> {code:java}
> SELECT t1.* FROM t1 JOIN LATERAL (SELECT rand(0) + c2 AS c3); {code}
> Result:
> {code:java}
> org.apache.spark.sql.AnalysisException: nondeterministic expressions are only 
> allowed in
> Project, Filter, Aggregate, Window, or Generate, but found:
>  lateralsubquery(spark_catalog.default.t1.c2)
> in operator LateralJoin lateral-subquery#1 [c2#3], Inner {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8582:
---

Assignee: Apache Spark  (was: Shixiong Zhu)

> Optimize checkpointing to avoid computing an RDD twice
> --
>
> Key: SPARK-8582
> URL: https://issues.apache.org/jira/browse/SPARK-8582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Major
>  Labels: bulk-closed
>
> In Spark, checkpointing allows the user to truncate the lineage of his RDD 
> and save the intermediate contents to HDFS for fault tolerance. However, this 
> is not currently implemented super efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during 
> the action that triggered the checkpointing in the first place, and once 
> while we checkpoint (we iterate through an RDD's partitions and write them to 
> disk). See this line for more detail: 
> https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.
> Instead, we should have a `CheckpointingIterator` that writes checkpoint 
> data to HDFS while we run the action. This will speed up many usages of 
> `RDD#checkpoint` by 2X.
> (Alternatively, the user can just cache the RDD before checkpointing it, but 
> this is not always viable for very large input data. It's also not a great 
> API to use in general.)
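For reference, the cache-before-checkpoint workaround mentioned above looks roughly like this; the master, checkpoint directory, and sample computation are placeholders:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("checkpoint-demo")
val sc = new SparkContext(conf)
sc.setCheckpointDir("/tmp/checkpoints")   // hypothetical directory

val rdd = sc.parallelize(1 to 1000000).map(_ * 2)

// Cache first so the action below and the checkpoint write both read the
// cached partitions instead of recomputing the lineage a second time.
rdd.cache()
rdd.checkpoint()
rdd.count()   // action: fills the cache and triggers the checkpoint write
{code}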



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464794#comment-17464794
 ] 

Apache Spark commented on SPARK-8582:
-

User 'agrawaldevesh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35005

> Optimize checkpointing to avoid computing an RDD twice
> --
>
> Key: SPARK-8582
> URL: https://issues.apache.org/jira/browse/SPARK-8582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: bulk-closed
>
> In Spark, checkpointing allows the user to truncate the lineage of his RDD 
> and save the intermediate contents to HDFS for fault tolerance. However, this 
> is not currently implemented super efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during 
> the action that triggered the checkpointing in the first place, and once 
> while we checkpoint (we iterate through an RDD's partitions and write them to 
> disk). See this line for more detail: 
> https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.
> Instead, we should have a `CheckpointingIterator` that writes checkpoint 
> data to HDFS while we run the action. This will speed up many usages of 
> `RDD#checkpoint` by 2X.
> (Alternatively, the user can just cache the RDD before checkpointing it, but 
> this is not always viable for very large input data. It's also not a great 
> API to use in general.)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8582:
---

Assignee: Shixiong Zhu  (was: Apache Spark)

> Optimize checkpointing to avoid computing an RDD twice
> --
>
> Key: SPARK-8582
> URL: https://issues.apache.org/jira/browse/SPARK-8582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: bulk-closed
>
> In Spark, checkpointing allows the user to truncate the lineage of his RDD 
> and save the intermediate contents to HDFS for fault tolerance. However, this 
> is not currently implemented super efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during 
> the action that triggered the checkpointing in the first place, and once 
> while we checkpoint (we iterate through an RDD's partitions and write them to 
> disk). See this line for more detail: 
> https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.
> Instead, we should have a `CheckpointingIterator` that writes checkpoint 
> data to HDFS while we run the action. This will speed up many usages of 
> `RDD#checkpoint` by 2X.
> (Alternatively, the user can just cache the RDD before checkpointing it, but 
> this is not always viable for very large input data. It's also not a great 
> API to use in general.)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice

2021-12-23 Thread Devesh Agrawal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devesh Agrawal reopened SPARK-8582:
---

This issue hasn't been fixed satisfactorily and I am making one more attempt at 
it: [https://github.com/apache/spark/pull/35005]

> Optimize checkpointing to avoid computing an RDD twice
> --
>
> Key: SPARK-8582
> URL: https://issues.apache.org/jira/browse/SPARK-8582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: bulk-closed
>
> In Spark, checkpointing allows the user to truncate the lineage of his RDD 
> and save the intermediate contents to HDFS for fault tolerance. However, this 
> is not currently implemented super efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during 
> the action that triggered the checkpointing in the first place, and once 
> while we checkpoint (we iterate through an RDD's partitions and write them to 
> disk). See this line for more detail: 
> https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.
> Instead, we should have a `CheckpointingIterator` that writes checkpoint 
> data to HDFS while we run the action. This will speed up many usages of 
> `RDD#checkpoint` by 2X.
> (Alternatively, the user can just cache the RDD before checkpointing it, but 
> this is not always viable for very large input data. It's also not a great 
> API to use in general.)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464790#comment-17464790
 ] 

Apache Spark commented on SPARK-8582:
-

User 'agrawaldevesh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35005

> Optimize checkpointing to avoid computing an RDD twice
> --
>
> Key: SPARK-8582
> URL: https://issues.apache.org/jira/browse/SPARK-8582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: bulk-closed
>
> In Spark, checkpointing allows the user to truncate the lineage of his RDD 
> and save the intermediate contents to HDFS for fault tolerance. However, this 
> is not currently implemented super efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during 
> the action that triggered the checkpointing in the first place, and once 
> while we checkpoint (we iterate through an RDD's partitions and write them to 
> disk). See this line for more detail: 
> https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.
> Instead, we should have a `CheckpointingIterator` that writes checkpoint 
> data to HDFS while we run the action. This will speed up many usages of 
> `RDD#checkpoint` by 2X.
> (Alternatively, the user can just cache the RDD before checkpointing it, but 
> this is not always viable for very large input data. It's also not a great 
> API to use in general.)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37700) Add LoggingSuite and some improvements

2021-12-23 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-37700.
-
Resolution: Invalid

> Add LoggingSuite and some improvements
> --
>
> Key: SPARK-37700
> URL: https://issues.apache.org/jira/browse/SPARK-37700
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> LoggingSuite was wrongly removed in a previous PR. We should add it back.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37729) SparkSession.setLogLevel not working in Spark Shell

2021-12-23 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-37729:
---

Assignee: L. C. Hsieh

> SparkSession.setLogLevel not working in Spark Shell
> ---
>
> Key: SPARK-37729
> URL: https://issues.apache.org/jira/browse/SPARK-37729
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: L. C. Hsieh
>Priority: Major
>
> In Spark 3.2:
> {code}
> scala> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.SparkSession
> scala> spark.sparkContext.setLogLevel("FATAL")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> res1: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@7dafb9f9
> scala> spark.sparkContext.setLogLevel("WARN")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> 21/12/23 21:08:18 WARN SparkSession$Builder: Using an existing SparkSession; 
> some spark core configurations may not take effect.
> res3: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@7dafb9f9
> scala> spark.sparkContext.setLogLevel("FATAL")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> res5: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@7dafb9f9
> {code}
> In the current master:
> {code}
> scala> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.SparkSession
> scala> spark.sparkContext.setLogLevel("FATAL")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> res1: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@3e8a1137
> scala> spark.sparkContext.setLogLevel("WARN")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> res3: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@3e8a1137
> scala> spark.sparkContext.setLogLevel("FATAL")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> res5: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@3e8a1137
> {code}
> Seems like it works when you set it via {{setLogLevel}} initially, but it 
> cannot be changed afterward.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464750#comment-17464750
 ] 

Apache Spark commented on SPARK-6305:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/34965

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.3.0
>
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars to the 
> classpath. Since there are shaded jars, it must be done during the build.
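For context (editorial, not from the issue), switching an application build to log4j 2 usually means swapping the binding jars along these lines in sbt; the version is only an example:

{code}
// build.sbt -- illustrative only
libraryDependencies ++= Seq(
  "org.apache.logging.log4j" % "log4j-api"        % "2.17.1",
  "org.apache.logging.log4j" % "log4j-core"       % "2.17.1",
  // replaces the old slf4j-log4j12 binding so slf4j calls route to log4j 2
  "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.17.1"
)
{code}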



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37729) SparkSession.setLogLevel not working in Spark Shell

2021-12-23 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464729#comment-17464729
 ] 

L. C. Hsieh commented on SPARK-37729:
-

[~hyukjin.kwon] Thanks. Let me check it.

> SparkSession.setLogLevel not working in Spark Shell
> ---
>
> Key: SPARK-37729
> URL: https://issues.apache.org/jira/browse/SPARK-37729
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In Spark 3.2:
> {code}
> scala> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.SparkSession
> scala> spark.sparkContext.setLogLevel("FATAL")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> res1: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@7dafb9f9
> scala> spark.sparkContext.setLogLevel("WARN")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> 21/12/23 21:08:18 WARN SparkSession$Builder: Using an existing SparkSession; 
> some spark core configurations may not take effect.
> res3: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@7dafb9f9
> scala> spark.sparkContext.setLogLevel("FATAL")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> res5: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@7dafb9f9
> {code}
> In the current master:
> {code}
> scala> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.SparkSession
> scala> spark.sparkContext.setLogLevel("FATAL")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> res1: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@3e8a1137
> scala> spark.sparkContext.setLogLevel("WARN")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> res3: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@3e8a1137
> scala> spark.sparkContext.setLogLevel("FATAL")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> res5: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@3e8a1137
> {code}
> Seems like it works when you set it via {{setLogLevel}} initially, but it 
> cannot be changed afterward.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37731) refactor and cleanup function lookup in Analyzer

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37731:


Assignee: Apache Spark

> refactor and cleanup function lookup in Analyzer
> 
>
> Key: SPARK-37731
> URL: https://issues.apache.org/jira/browse/SPARK-37731
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37731) refactor and cleanup function lookup in Analyzer

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464643#comment-17464643
 ] 

Apache Spark commented on SPARK-37731:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/35004

> refactor and cleanup function lookup in Analyzer
> 
>
> Key: SPARK-37731
> URL: https://issues.apache.org/jira/browse/SPARK-37731
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37731) refactor and cleanup function lookup in Analyzer

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37731:


Assignee: (was: Apache Spark)

> refactor and cleanup function lookup in Analyzer
> 
>
> Key: SPARK-37731
> URL: https://issues.apache.org/jira/browse/SPARK-37731
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37193) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer joins

2021-12-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37193.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34464
[https://github.com/apache/spark/pull/34464]

> DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer 
> joins
> --
>
> Key: SPARK-37193
> URL: https://issues.apache.org/jira/browse/SPARK-37193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Fix For: 3.3.0
>
>
> {{DynamicJoinSelection.shouldDemoteBroadcastHashJoin}} prevents AQE from 
> converting a sort-merge join into a broadcast join, because SMJ is faster when 
> the side that would be broadcast has a lot of empty partitions.
>  This makes sense for inner joins, which can short-circuit if one side is 
> empty.
>  For (left/right) outer joins, the streaming side still has to be processed, so 
> demoting the broadcast join doesn't have the same advantage.
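> As a rough, hypothetical illustration (not from the original report; the 
> dataset and column names below are made up), this is the shape of query the 
> change targets: a left outer join whose broadcastable side ends up with many 
> empty partitions after filtering.
> {code:java}
> // spark-shell sketch; assumes an active SparkSession named `spark`
> import spark.implicits._
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> 
> val big = spark.range(0, 1000000).toDF("id")                      // streaming side
> val small = spark.range(0, 1000).toDF("id").filter($"id" > 10000) // empty after the filter
> 
> // For an inner join, demoting the broadcast is fine: an empty side means an empty result.
> // For a left outer join, every row of `big` must still be produced, so broadcasting
> // the small side remains attractive.
> big.join(small, Seq("id"), "left_outer").explain()
> {code}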



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37193) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer joins

2021-12-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37193:
---

Assignee: Eugene Koifman

> DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer 
> joins
> --
>
> Key: SPARK-37193
> URL: https://issues.apache.org/jira/browse/SPARK-37193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> {{DynamicJoinSelection.shouldDemoteBroadcastHashJoin}} prevents AQE from 
> converting a sort-merge join into a broadcast join, because SMJ is faster when 
> the side that would be broadcast has a lot of empty partitions.
>  This makes sense for inner joins, which can short-circuit if one side is 
> empty.
>  For (left/right) outer joins, the streaming side still has to be processed, so 
> demoting the broadcast join doesn't have the same advantage.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37644) Support datasource v2 complete aggregate pushdown

2021-12-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37644:
---

Assignee: jiaan.geng

> Support datasource v2 complete aggregate pushdown 
> --
>
> Key: SPARK-37644
> URL: https://issues.apache.org/jira/browse/SPARK-37644
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> Currently, Spark supports aggregate pushdown with partial-agg and final-agg. 
> For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg 
> steps by running the aggregate completely in the database.
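> As a sketch only (the JDBC URL and table below are placeholders, not part of 
> this ticket), the kind of query that benefits is a plain GROUP BY over a JDBC 
> source:
> {code:java}
> // spark-shell sketch; assumes an active SparkSession named `spark` and a
> // reachable database containing a `sales` table (both are assumptions).
> import org.apache.spark.sql.functions.sum
> 
> val sales = spark.read
>   .format("jdbc")
>   .option("url", "jdbc:postgresql://db-host:5432/shop")  // placeholder URL
>   .option("dbtable", "sales")
>   .load()
> 
> // With partial/final pushdown, Spark still runs a partial and a final HashAggregate.
> // With complete pushdown, the whole aggregate can be evaluated by the database and
> // Spark only reads the already-aggregated rows.
> sales.groupBy("region").agg(sum("amount")).explain()
> {code}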



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37644) Support datasource v2 complete aggregate pushdown

2021-12-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37644.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34904
[https://github.com/apache/spark/pull/34904]

> Support datasource v2 complete aggregate pushdown 
> --
>
> Key: SPARK-37644
> URL: https://issues.apache.org/jira/browse/SPARK-37644
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, Spark supports aggregate pushdown with partial-agg and final-agg. 
> For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg 
> steps by running the aggregate completely in the database.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37731) refactor and cleanup function lookup in Analyzer

2021-12-23 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-37731:
---

 Summary: refactor and cleanup function lookup in Analyzer
 Key: SPARK-37731
 URL: https://issues.apache.org/jira/browse/SPARK-37731
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37718) Demo sql is incorrect

2021-12-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-37718:


Assignee: Peng

> Demo sql is incorrect
> -
>
> Key: SPARK-37718
> URL: https://issues.apache.org/jira/browse/SPARK-37718
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Peng
>Assignee: Peng
>Priority: Minor
>
> There is a sql statement in this section 
> https://spark.apache.org/docs/latest/sql-ref-null-semantics.html#null-semantics
>  that is incorrect.
> {code:java}
> SELECT * FROM person GROUP BY age HAVING max(age) > 18; {code}
> should be
> {code:java}
> SELECT age, count(*) FROM person GROUP BY age HAVING max(age) > 18;{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37718) Demo sql is incorrect

2021-12-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-37718.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34992
[https://github.com/apache/spark/pull/34992]

> Demo sql is incorrect
> -
>
> Key: SPARK-37718
> URL: https://issues.apache.org/jira/browse/SPARK-37718
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Peng
>Assignee: Peng
>Priority: Minor
> Fix For: 3.3.0
>
>
> There is a sql statement in this section 
> https://spark.apache.org/docs/latest/sql-ref-null-semantics.html#null-semantics
>  that is incorrect.
> {code:java}
> SELECT * FROM person GROUP BY age HAVING max(age) > 18; {code}
> should be
> {code:java}
> SELECT age, count(*) FROM person GROUP BY age HAVING max(age) > 18;{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37659) Fix FsHistoryProvider race condition between listing and deleting log info

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464582#comment-17464582
 ] 

Apache Spark commented on SPARK-37659:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/35003

> Fix FsHistoryProvider race condition between listing and deleting log info
> 
>
> Key: SPARK-37659
> URL: https://issues.apache.org/jira/browse/SPARK-37659
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.3.0
>
>
> After SPARK-29043, FsHistoryProvider lists the log info without waiting for 
> all `mergeApplicationListing` tasks to finish.
> However, the `LevelDBIterator` over the listed log info is not thread safe if 
> other threads delete the related log info at the same time.
> Here is the error message:
> {code:java}
> 21/12/15 14:12:02 ERROR FsHistoryProvider: Exception in checking for event 
> log updates
> java.util.NoSuchElementException: 
> 1^@__main__^@+hdfs://xxx/application_xxx.inprogress
> at org.apache.spark.util.kvstore.LevelDB.get(LevelDB.java:132)
> at 
> org.apache.spark.util.kvstore.LevelDBIterator.next(LevelDBIterator.java:137)
> at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
> at scala.collection.Iterator.foreach(Iterator.scala:941)
> at scala.collection.Iterator.foreach$(Iterator.scala:941)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
> at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
> at 
> scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
> at 
> scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
> at scala.collection.TraversableLike.to(TraversableLike.scala:678)
> at scala.collection.TraversableLike.to$(TraversableLike.scala:675)
> at scala.collection.AbstractTraversable.to(Traversable.scala:108)
> at scala.collection.TraversableOnce.toList(TraversableOnce.scala:299)
> at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:299)
> at scala.collection.AbstractTraversable.toList(Traversable.scala:108)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:588)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:299)
> {code}
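> To make the race easier to picture, here is a simplified stand-in (not the 
> actual FsHistoryProvider/KVStore code): the listing thread enumerates keys 
> first and resolves each value afterwards, so a concurrent delete between 
> those two steps surfaces as a NoSuchElementException.
> {code:java}
> import java.util.concurrent.ConcurrentHashMap
> 
> val store = new ConcurrentHashMap[String, String]()
> store.put("application_xxx.inprogress", "log info")
> 
> // Snapshot of keys taken by the listing thread.
> val keys = Seq("application_xxx.inprogress")
> 
> // Another thread (e.g. log cleaning) deletes the entry while the listing is still running.
> new Thread(() => store.remove("application_xxx.inprogress")).start()
> 
> // Depending on timing, resolving a deleted key fails, just like LevelDB.get above.
> val infos = keys.map { k =>
>   Option(store.get(k)).getOrElse(throw new java.util.NoSuchElementException(k))
> }
> {code}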



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37462) Avoid unnecessarily calculating the number of outstanding fetch requests and RPCs

2021-12-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-37462.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34711
[https://github.com/apache/spark/pull/34711]

>  Avoid unnecessarily calculating the number of outstanding fetch requests and 
> RPCs
> 
>
> Key: SPARK-37462
> URL: https://issues.apache.org/jira/browse/SPARK-37462
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0, 3.2.0
>Reporter: weixiuli
>Assignee: weixiuli
>Priority: Major
> Fix For: 3.3.0
>
>
> It is unnecessary to calculate the number of outstanding fetch requests and 
> RPCs when the IdleStateEvent is not IDLE or the last request has not timed out.
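> A minimal sketch of the guard this describes (method and parameter names are 
> assumptions, not the actual TransportChannelHandler code):
> {code:java}
> import io.netty.handler.timeout.{IdleState, IdleStateEvent}
> 
> // Only when the channel is really idle and the last request has timed out is it
> // worth scanning for outstanding fetches and RPCs.
> def shouldCountOutstanding(evt: IdleStateEvent,
>                            lastRequestTimeNs: Long,
>                            requestTimeoutNs: Long): Boolean =
>   evt.state() == IdleState.ALL_IDLE &&
>     (System.nanoTime() - lastRequestTimeNs) > requestTimeoutNs
> 
> def onIdleEvent(evt: IdleStateEvent,
>                 lastRequestTimeNs: Long,
>                 requestTimeoutNs: Long,
>                 countOutstanding: () => Long): Unit = {
>   if (shouldCountOutstanding(evt, lastRequestTimeNs, requestTimeoutNs) &&
>       countOutstanding() > 0) {
>     // close or log the idle channel that still has pending requests
>   }
> }
> {code}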



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37730) plot.hist throws AttributeError on pandas=1.3.5

2021-12-23 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-37730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michał Słapek updated SPARK-37730:
--
Description: 
plot.hist from PySpark throws AttributeError exception when pyspark.pandas is 
used with pandas=1.3.5.

Pandas in commit 
[https://github.com/pandas-dev/pandas/commit/029907c9d69a0260401b78a016a6c4515d8f1c40]
replaced MPLPlot._add_legend_handle with MPLPlot._append_legend_handles_labels.

I've attached PR on github which replaces use of MPLPlot._add_legend_handle in 
PySpark with MPLPlot._append_legend_handles_labels.

 

Code:

 
{code:java}
import pyspark.pandas as ps
from matplotlib import pyplot as plt

ps.set_option("plotting.backend", "matplotlib")

df = ps.DataFrame({'data': [4, 5, 5, 6, 8, 9]})
df['data'].plot.hist()

plt.show()
 {code}
 

 

Truncated traceback:
{code:java}
Traceback (most recent call last):                                              
  File "/home/develop/Documents/sparkbug/code.py", line 6, in 
    df['data'].plot.hist()
  ...
  File 
"/mnt/transient/develop/miniconda3/envs/testenv/lib/python3.9/site-packages/pyspark/pandas/plot/matplotlib.py",
 line 403, in _make_plot
    self._add_legend_handle(artists[0], label, index=i)
AttributeError: 'PandasOnSparkHistPlot' object has no attribute 
'_add_legend_handle' {code}

  was:
plot.hist from PySpark throws AttributeError exception when pyspark.pandas is 
used with pandas=1.3.5.

Pandas in commit 
[https://github.com/pandas-dev/pandas/commit/029907c9d69a0260401b78a016a6c4515d8f1c40]
replaced MPLPlot._add_legend_handle with MPLPlot._append_legend_handles_labels.

I've attached PR on github which replaces use of MPLPlot._add_legend_handle in 
PySpark with MPLPlot._append_legend_handles_labels.

Code:


{{import pyspark.pandas as ps}}
{{from matplotlib import pyplot as }}{{plt}}

{{ps.set_option("plotting.backend", "matplotlib")}}

{{{}df = ps.DataFrame({}}}{{{}{'data': [4, 5, 5, 6, 8, 9]}{}}}{{{}){}}}
{{df['data'].plot.hist()}}

{{plt.show()}}

 

Truncated traceback:

{{Traceback (most recent call last): }}
{{File "/home/develop/Documents/sparkbug/code.py", line 6, in }}
{{df['data'].plot.hist()}}
{{...}}
{{File 
"/mnt/transient/develop/miniconda3/envs/testenv/lib/python3.9/site-packages/pyspark/pandas/plot/matplotlib.py",
 line 403, in _make_plot}}
{{self._add_legend_handle(artists[0], label, index=i)}}
{{AttributeError: 'PandasOnSparkHistPlot' object has no attribute 
'_add_legend_handle'}}


> plot.hist throws AttributeError on pandas=1.3.5
> ---
>
> Key: SPARK-37730
> URL: https://issues.apache.org/jira/browse/SPARK-37730
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.3.0
> Environment: Conda environment.yml (also tested with 3.3.0-SNAPSHOT):
> {{name: testenv}}
> {{channels:}}
> {{  - conda-forge}}
> {{dependencies:}}
> {{  - python=3.9.9}}
> {{  }}
> {{  - numpy=1.21.5}}
> {{  - pandas=1.3.5}}
> {{  - matplotlib=3.5.1}}
> {{  }}
> {{  - pyspark=3.2.0}}
>  
>Reporter: Michał Słapek
>Priority: Major
>
> plot.hist from PySpark throws AttributeError exception when pyspark.pandas is 
> used with pandas=1.3.5.
> Pandas in commit 
> [https://github.com/pandas-dev/pandas/commit/029907c9d69a0260401b78a016a6c4515d8f1c40]
> replaced MPLPlot._add_legend_handle with 
> MPLPlot._append_legend_handles_labels.
> I've attached PR on github which replaces use of MPLPlot._add_legend_handle 
> in PySpark with MPLPlot._append_legend_handles_labels.
>  
> Code:
>  
> {code:java}
> import pyspark.pandas as ps
> from matplotlib import pyplot as plt
> ps.set_option("plotting.backend", "matplotlib")
> df = ps.DataFrame({'data': [4, 5, 5, 6, 8, 9]})
> df['data'].plot.hist()
> plt.show()
>  {code}
>  
>  
> Truncated traceback:
> {code:java}
> Traceback (most recent call last):                                            
>   
>   File "/home/develop/Documents/sparkbug/code.py", line 6, in 
>     df['data'].plot.hist()
>   ...
>   File 
> "/mnt/transient/develop/miniconda3/envs/testenv/lib/python3.9/site-packages/pyspark/pandas/plot/matplotlib.py",
>  line 403, in _make_plot
>     self._add_legend_handle(artists[0], label, index=i)
> AttributeError: 'PandasOnSparkHistPlot' object has no attribute 
> '_add_legend_handle' {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37462) Avoid unnecessarily calculating the number of outstanding fetch requests and RPCs

2021-12-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-37462:
-
Priority: Trivial  (was: Major)

>  Avoid unnecessarily calculating the number of outstanding fetch requests and 
> RPCs
> 
>
> Key: SPARK-37462
> URL: https://issues.apache.org/jira/browse/SPARK-37462
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0, 3.2.0
>Reporter: weixiuli
>Assignee: weixiuli
>Priority: Trivial
> Fix For: 3.3.0
>
>
> It is unnecessary to calculate the number of outstanding fetch requests and 
> RPCs when the IdleStateEvent is not IDLE or the last request has not timed out.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37462) Avoid unnecessarily calculating the number of outstanding fetch requests and RPCs

2021-12-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-37462:


Assignee: weixiuli

>  Avoid unnecessarily calculating the number of outstanding fetch requests and 
> RPCs
> 
>
> Key: SPARK-37462
> URL: https://issues.apache.org/jira/browse/SPARK-37462
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0, 3.2.0
>Reporter: weixiuli
>Assignee: weixiuli
>Priority: Major
>
> It is unnecessary to calculate the number of outstanding fetch requests and 
> RPCs when the IdleStateEvent is not IDLE or the last request has not timed out.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37668) 'Index' object has no attribute 'levels' in pyspark.pandas.frame.DataFrame.insert

2021-12-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37668.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34957
[https://github.com/apache/spark/pull/34957]

> 'Index' object has no attribute 'levels' in  
> pyspark.pandas.frame.DataFrame.insert
> --
>
> Key: SPARK-37668
> URL: https://issues.apache.org/jira/browse/SPARK-37668
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.3.0
>
>
>  [This piece of 
> code|https://github.com/apache/spark/blob/6e45b04db48008fa033b09df983d3bd1c4f790ea/python/pyspark/pandas/frame.py#L3991-L3993]
>  in {{pyspark.pandas.frame}} is going to fail at runtime, when 
> {{is_name_like_tuple}} evaluates to {{True}}
> {code:python}
> if is_name_like_tuple(column):
> if len(column) != len(self.columns.levels):
> {code}
> with 
> {code}
> 'Index' object has no attribute 'levels'
> {code}
> To be honest, I am not sure what the intended behavior is (initially, I 
> suspected that we should have 
> {code:python}
>  if len(column) != self.columns.nlevels
> {code}
> but {{nlevels}} is hard-coded to one, and wouldn't be consistent with Pandas 
> at all.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37668) 'Index' object has no attribute 'levels' in pyspark.pandas.frame.DataFrame.insert

2021-12-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37668:


Assignee: Haejoon Lee

> 'Index' object has no attribute 'levels' in  
> pyspark.pandas.frame.DataFrame.insert
> --
>
> Key: SPARK-37668
> URL: https://issues.apache.org/jira/browse/SPARK-37668
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Haejoon Lee
>Priority: Major
>
>  [This piece of 
> code|https://github.com/apache/spark/blob/6e45b04db48008fa033b09df983d3bd1c4f790ea/python/pyspark/pandas/frame.py#L3991-L3993]
>  in {{pyspark.pandas.frame}} is going to fail at runtime, when 
> {{is_name_like_tuple}} evaluates to {{True}}
> {code:python}
> if is_name_like_tuple(column):
> if len(column) != len(self.columns.levels):
> {code}
> with 
> {code}
> 'Index' object has no attribute 'levels'
> {code}
> To be honest, I am not sure what the intended behavior is (initially, I 
> suspected that we should have 
> {code:python}
>  if len(column) != self.columns.nlevels
> {code}
> but {{nlevels}} is hard-coded to one, and wouldn't be consistent with Pandas 
> at all.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37728) reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464554#comment-17464554
 ] 

Apache Spark commented on SPARK-37728:
--

User 'yym1995' has created a pull request for this issue:
https://github.com/apache/spark/pull/35002

> reading nested columns with ORC vectorized reader can cause 
> ArrayIndexOutOfBoundsException
> --
>
> Key: SPARK-37728
> URL: https://issues.apache.org/jira/browse/SPARK-37728
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Yimin Yang
>Priority: Major
>
> When spark.sql.orc.enableNestedColumnVectorizedReader is set to true, reading 
> nested columns of ORC files can cause ArrayIndexOutOfBoundsException. Here is 
> a simple reproduction:
> 1) create an ORC file which contains records of type Array<Array<String>>:
> {code:java}
> ./bin/spark-shell {code}
> {code:java}
> case class Item(record: Array[Array[String]])
> val data = new Array[Array[Array[String]]](100)
>     for (i <- 0 to 99) {
>       val temp = new Array[Array[String]](50)
>       for (j <- 0 to 49) {
>         temp(j) = new Array[String](1000)
>         for (k <- 0 to 999) {
>           temp(j)(k) = k.toString
>         }
>       }
>       data(i) = temp
>     }
> val rdd = spark.sparkContext.parallelize(data, 1)
> val df = rdd.map(x => Item(x)).toDF
> df.write.orc("file:///home/user_name/data") {code}
>  
> 2) read the orc with spark.sql.orc.enableNestedColumnVectorizedReader=true
> {code:java}
> ./bin/spark-shell --conf spark.sql.orc.enableVectorizedReader=true --conf 
> spark.sql.codegen.wholeStage=true --conf 
> spark.sql.orc.enableNestedColumnVectorizedReader=true --conf 
> spark.sql.orc.columnarReaderBatchSize=4096 {code}
> {code:java}
> val df = spark.read.orc("file:///home/user_name/data")
> df.show(100) {code}
>  
> Then Spark threw ArrayIndexOutOfBoundsException:
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2455)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2404)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2403)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2403)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1162)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1162)
>   at scala.Option.foreach(Option.scala:407)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1162)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2643)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2585)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2574)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:940)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2227)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2248)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:490)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:443)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3833)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2832)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3824)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3822)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2832)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:3053)
>   at 

[jira] [Assigned] (SPARK-37728) reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37728:


Assignee: Apache Spark

> reading nested columns with ORC vectorized reader can cause 
> ArrayIndexOutOfBoundsException
> --
>
> Key: SPARK-37728
> URL: https://issues.apache.org/jira/browse/SPARK-37728
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Yimin Yang
>Assignee: Apache Spark
>Priority: Major
>
> When spark.sql.orc.enableNestedColumnVectorizedReader is set to true, reading 
> nested columns of ORC files can cause ArrayIndexOutOfBoundsException. Here is 
> a simple reproduction:
> 1) create an ORC file which contains records of type Array<Array<String>>:
> {code:java}
> ./bin/spark-shell {code}
> {code:java}
> case class Item(record: Array[Array[String]])
> val data = new Array[Array[Array[String]]](100)
>     for (i <- 0 to 99) {
>       val temp = new Array[Array[String]](50)
>       for (j <- 0 to 49) {
>         temp(j) = new Array[String](1000)
>         for (k <- 0 to 999) {
>           temp(j)(k) = k.toString
>         }
>       }
>       data(i) = temp
>     }
> val rdd = spark.sparkContext.parallelize(data, 1)
> val df = rdd.map(x => Item(x)).toDF
> df.write.orc("file:///home/user_name/data") {code}
>  
> 2) read the orc with spark.sql.orc.enableNestedColumnVectorizedReader=true
> {code:java}
> ./bin/spark-shell --conf spark.sql.orc.enableVectorizedReader=true --conf 
> spark.sql.codegen.wholeStage=true --conf 
> spark.sql.orc.enableNestedColumnVectorizedReader=true --conf 
> spark.sql.orc.columnarReaderBatchSize=4096 {code}
> {code:java}
> val df = spark.read.orc("file:///home/user_name/data")
> df.show(100) {code}
>  
> Then Spark threw ArrayIndexOutOfBoundsException:
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2455)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2404)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2403)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2403)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1162)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1162)
>   at scala.Option.foreach(Option.scala:407)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1162)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2643)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2585)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2574)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:940)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2227)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2248)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:490)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:443)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3833)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2832)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3824)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3822)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2832)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:3053)
>   at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
>   at 

[jira] [Assigned] (SPARK-37728) reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37728:


Assignee: (was: Apache Spark)

> reading nested columns with ORC vectorized reader can cause 
> ArrayIndexOutOfBoundsException
> --
>
> Key: SPARK-37728
> URL: https://issues.apache.org/jira/browse/SPARK-37728
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Yimin Yang
>Priority: Major
>
> When spark.sql.orc.enableNestedColumnVectorizedReader is set to true, reading 
> nested columns of ORC files can cause ArrayIndexOutOfBoundsException. Here is 
> a simple reproduction:
> 1) create an ORC file which contains records of type Array<Array<String>>:
> {code:java}
> ./bin/spark-shell {code}
> {code:java}
> case class Item(record: Array[Array[String]])
> val data = new Array[Array[Array[String]]](100)
>     for (i <- 0 to 99) {
>       val temp = new Array[Array[String]](50)
>       for (j <- 0 to 49) {
>         temp(j) = new Array[String](1000)
>         for (k <- 0 to 999) {
>           temp(j)(k) = k.toString
>         }
>       }
>       data(i) = temp
>     }
> val rdd = spark.sparkContext.parallelize(data, 1)
> val df = rdd.map(x => Item(x)).toDF
> df.write.orc("file:///home/user_name/data") {code}
>  
> 2) read the orc with spark.sql.orc.enableNestedColumnVectorizedReader=true
> {code:java}
> ./bin/spark-shell --conf spark.sql.orc.enableVectorizedReader=true --conf 
> spark.sql.codegen.wholeStage=true --conf 
> spark.sql.orc.enableNestedColumnVectorizedReader=true --conf 
> spark.sql.orc.columnarReaderBatchSize=4096 {code}
> {code:java}
> val df = spark.read.orc("file:///home/user_name/data")
> df.show(100) {code}
>  
> Then Spark threw ArrayIndexOutOfBoundsException:
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2455)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2404)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2403)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2403)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1162)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1162)
>   at scala.Option.foreach(Option.scala:407)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1162)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2643)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2585)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2574)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:940)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2227)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2248)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:490)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:443)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3833)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2832)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3824)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3822)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2832)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:3053)
>   at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
>   at 

[jira] [Commented] (SPARK-37728) reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464553#comment-17464553
 ] 

Apache Spark commented on SPARK-37728:
--

User 'yym1995' has created a pull request for this issue:
https://github.com/apache/spark/pull/35002

> reading nested columns with ORC vectorized reader can cause 
> ArrayIndexOutOfBoundsException
> --
>
> Key: SPARK-37728
> URL: https://issues.apache.org/jira/browse/SPARK-37728
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Yimin Yang
>Priority: Major
>
> When spark.sql.orc.enableNestedColumnVectorizedReader is set to true, reading 
> nested columns of ORC files can cause ArrayIndexOutOfBoundsException. Here is 
> a simple reproduction:
> 1) create an ORC file which contains records of type Array<Array<String>>:
> {code:java}
> ./bin/spark-shell {code}
> {code:java}
> case class Item(record: Array[Array[String]])
> val data = new Array[Array[Array[String]]](100)
>     for (i <- 0 to 99) {
>       val temp = new Array[Array[String]](50)
>       for (j <- 0 to 49) {
>         temp(j) = new Array[String](1000)
>         for (k <- 0 to 999) {
>           temp(j)(k) = k.toString
>         }
>       }
>       data(i) = temp
>     }
> val rdd = spark.sparkContext.parallelize(data, 1)
> val df = rdd.map(x => Item(x)).toDF
> df.write.orc("file:///home/user_name/data") {code}
>  
> 2) read the orc with spark.sql.orc.enableNestedColumnVectorizedReader=true
> {code:java}
> ./bin/spark-shell --conf spark.sql.orc.enableVectorizedReader=true --conf 
> spark.sql.codegen.wholeStage=true --conf 
> spark.sql.orc.enableNestedColumnVectorizedReader=true --conf 
> spark.sql.orc.columnarReaderBatchSize=4096 {code}
> {code:java}
> val df = spark.read.orc("file:///home/user_name/data")
> df.show(100) {code}
>  
> Then Spark threw ArrayIndexOutOfBoundsException:
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2455)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2404)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2403)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2403)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1162)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1162)
>   at scala.Option.foreach(Option.scala:407)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1162)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2643)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2585)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2574)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:940)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2227)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2248)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:490)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:443)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3833)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2832)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3824)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3822)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2832)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:3053)
>   at 

[jira] [Assigned] (SPARK-37727) Show ignored confs & hide warnings for conf already set in SparkSession.builder.getOrCreate

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37727:


Assignee: (was: Apache Spark)

> Show ignored confs & hide warnings for conf already set in 
> SparkSession.builder.getOrCreate
> ---
>
> Key: SPARK-37727
> URL: https://issues.apache.org/jira/browse/SPARK-37727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, {{SparkSession.builder.getOrCreate()}} is too noisy even when 
> duplicate configurations are set, and users cannot tell which configurations 
> they need to fix. See the example below:
> {code}
> ./bin/spark-shell --conf spark.abc=abc
> {code}
> {code}
> import org.apache.spark.sql.SparkSession
> spark.sparkContext.setLogLevel("DEBUG")
> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> {code}
> {code}
> ...
> 21:12:40.601 [main] WARN  org.apache.spark.sql.SparkSession - Using an 
> existing SparkSession; some spark core configurations may not take effect.
> {code}
> This is straightforward when there are few configurations, but it is difficult 
> for users to figure out which ones matter when there are many, especially 
> when the configurations are defined in property files like 
> {{spark-defaults.conf}} that are sometimes maintained separately by system 
> admins.
> See also https://github.com/apache/spark/pull/34757#discussion_r769248275



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37727) Show ignored confs & hide warnings for conf already set in SparkSession.builder.getOrCreate

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37727:


Assignee: Apache Spark

> Show ignored confs & hide warnings for conf already set in 
> SparkSession.builder.getOrCreate
> ---
>
> Key: SPARK-37727
> URL: https://issues.apache.org/jira/browse/SPARK-37727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently, {{SparkSession.builder.getOrCreate()}} is too noisy even when 
> duplicate configurations are set, and users cannot tell which configurations 
> they need to fix. See the example below:
> {code}
> ./bin/spark-shell --conf spark.abc=abc
> {code}
> {code}
> import org.apache.spark.sql.SparkSession
> spark.sparkContext.setLogLevel("DEBUG")
> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> {code}
> {code}
> ...
> 21:12:40.601 [main] WARN  org.apache.spark.sql.SparkSession - Using an 
> existing SparkSession; some spark core configurations may not take effect.
> {code}
> This is straightforward when there are few configurations, but it is difficult 
> for users to figure out which ones matter when there are many, especially 
> when the configurations are defined in property files like 
> {{spark-defaults.conf}} that are sometimes maintained separately by system 
> admins.
> See also https://github.com/apache/spark/pull/34757#discussion_r769248275



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37727) Show ignored confs & hide warnings for conf already set in SparkSession.builder.getOrCreate

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464529#comment-17464529
 ] 

Apache Spark commented on SPARK-37727:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35001

> Show ignored confs & hide warnings for conf already set in 
> SparkSession.builder.getOrCreate
> ---
>
> Key: SPARK-37727
> URL: https://issues.apache.org/jira/browse/SPARK-37727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, {{SparkSession.builder.getOrCreate()}} is too noisy even when 
> duplicate configurations are set, and users cannot tell which configurations 
> they need to fix. See the example below:
> {code}
> ./bin/spark-shell --conf spark.abc=abc
> {code}
> {code}
> import org.apache.spark.sql.SparkSession
> spark.sparkContext.setLogLevel("DEBUG")
> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> {code}
> {code}
> ...
> 21:12:40.601 [main] WARN  org.apache.spark.sql.SparkSession - Using an 
> existing SparkSession; some spark core configurations may not take effect.
> {code}
> This is straightforward when there are few configurations, but it is difficult 
> for users to figure out which ones matter when there are many, especially 
> when the configurations are defined in property files like 
> {{spark-defaults.conf}} that are sometimes maintained separately by system 
> admins.
> See also https://github.com/apache/spark/pull/34757#discussion_r769248275



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37730) plot.hist throws AttributeError on pandas=1.3.5

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464523#comment-17464523
 ] 

Apache Spark commented on SPARK-37730:
--

User 'mslapek' has created a pull request for this issue:
https://github.com/apache/spark/pull/35000

> plot.hist throws AttributeError on pandas=1.3.5
> ---
>
> Key: SPARK-37730
> URL: https://issues.apache.org/jira/browse/SPARK-37730
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.3.0
> Environment: Conda environment.yml (also tested with 3.3.0-SNAPSHOT):
> {{name: testenv}}
> {{channels:}}
> {{  - conda-forge}}
> {{dependencies:}}
> {{  - python=3.9.9}}
> {{  }}
> {{  - numpy=1.21.5}}
> {{  - pandas=1.3.5}}
> {{  - matplotlib=3.5.1}}
> {{  }}
> {{  - pyspark=3.2.0}}
>  
>Reporter: Michał Słapek
>Priority: Major
>
> plot.hist from PySpark throws AttributeError exception when pyspark.pandas is 
> used with pandas=1.3.5.
> Pandas in commit 
> [https://github.com/pandas-dev/pandas/commit/029907c9d69a0260401b78a016a6c4515d8f1c40]
> replaced MPLPlot._add_legend_handle with 
> MPLPlot._append_legend_handles_labels.
> I've attached PR on github which replaces use of MPLPlot._add_legend_handle 
> in PySpark with MPLPlot._append_legend_handles_labels.
> Code:
> {{import pyspark.pandas as ps}}
> {{from matplotlib import pyplot as }}{{plt}}
> {{ps.set_option("plotting.backend", "matplotlib")}}
> {{{}df = ps.DataFrame({}}}{{{}{'data': [4, 5, 5, 6, 8, 9]}{}}}{{{}){}}}
> {{df['data'].plot.hist()}}
> {{plt.show()}}
>  
> Truncated traceback:
> {{Traceback (most recent call last): }}
> {{File "/home/develop/Documents/sparkbug/code.py", line 6, in }}
> {{df['data'].plot.hist()}}
> {{...}}
> {{File 
> "/mnt/transient/develop/miniconda3/envs/testenv/lib/python3.9/site-packages/pyspark/pandas/plot/matplotlib.py",
>  line 403, in _make_plot}}
> {{self._add_legend_handle(artists[0], label, index=i)}}
> {{AttributeError: 'PandasOnSparkHistPlot' object has no attribute 
> '_add_legend_handle'}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37730) plot.hist throws AttributeError on pandas=1.3.5

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37730:


Assignee: (was: Apache Spark)

> plot.hist throws AttributeError on pandas=1.3.5
> ---
>
> Key: SPARK-37730
> URL: https://issues.apache.org/jira/browse/SPARK-37730
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.3.0
> Environment: Conda environment.yml (also tested with 3.3.0-SNAPSHOT):
> {{name: testenv}}
> {{channels:}}
> {{  - conda-forge}}
> {{dependencies:}}
> {{  - python=3.9.9}}
> {{  }}
> {{  - numpy=1.21.5}}
> {{  - pandas=1.3.5}}
> {{  - matplotlib=3.5.1}}
> {{  }}
> {{  - pyspark=3.2.0}}
>  
>Reporter: Michał Słapek
>Priority: Major
>
> plot.hist from PySpark throws AttributeError exception when pyspark.pandas is 
> used with pandas=1.3.5.
> Pandas in commit 
> [https://github.com/pandas-dev/pandas/commit/029907c9d69a0260401b78a016a6c4515d8f1c40]
> replaced MPLPlot._add_legend_handle with 
> MPLPlot._append_legend_handles_labels.
> I've attached PR on github which replaces use of MPLPlot._add_legend_handle 
> in PySpark with MPLPlot._append_legend_handles_labels.
> Code:
> {{import pyspark.pandas as ps}}
> {{from matplotlib import pyplot as }}{{plt}}
> {{ps.set_option("plotting.backend", "matplotlib")}}
> {{{}df = ps.DataFrame({}}}{{{}{'data': [4, 5, 5, 6, 8, 9]}{}}}{{{}){}}}
> {{df['data'].plot.hist()}}
> {{plt.show()}}
>  
> Truncated traceback:
> {{Traceback (most recent call last): }}
> {{File "/home/develop/Documents/sparkbug/code.py", line 6, in }}
> {{df['data'].plot.hist()}}
> {{...}}
> {{File 
> "/mnt/transient/develop/miniconda3/envs/testenv/lib/python3.9/site-packages/pyspark/pandas/plot/matplotlib.py",
>  line 403, in _make_plot}}
> {{self._add_legend_handle(artists[0], label, index=i)}}
> {{AttributeError: 'PandasOnSparkHistPlot' object has no attribute 
> '_add_legend_handle'}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37730) plot.hist throws AttributeError on pandas=1.3.5

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37730:


Assignee: Apache Spark

> plot.hist throws AttributeError on pandas=1.3.5
> ---
>
> Key: SPARK-37730
> URL: https://issues.apache.org/jira/browse/SPARK-37730
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.3.0
> Environment: Conda environment.yml (also tested with 3.3.0-SNAPSHOT):
> {{name: testenv}}
> {{channels:}}
> {{  - conda-forge}}
> {{dependencies:}}
> {{  - python=3.9.9}}
> {{  }}
> {{  - numpy=1.21.5}}
> {{  - pandas=1.3.5}}
> {{  - matplotlib=3.5.1}}
> {{  }}
> {{  - pyspark=3.2.0}}
>  
>Reporter: Michał Słapek
>Assignee: Apache Spark
>Priority: Major
>
> plot.hist from PySpark throws AttributeError exception when pyspark.pandas is 
> used with pandas=1.3.5.
> Pandas in commit 
> [https://github.com/pandas-dev/pandas/commit/029907c9d69a0260401b78a016a6c4515d8f1c40]
> replaced MPLPlot._add_legend_handle with 
> MPLPlot._append_legend_handles_labels.
> I've attached PR on github which replaces use of MPLPlot._add_legend_handle 
> in PySpark with MPLPlot._append_legend_handles_labels.
> Code:
> {{import pyspark.pandas as ps}}
> {{from matplotlib import pyplot as }}{{plt}}
> {{ps.set_option("plotting.backend", "matplotlib")}}
> {{{}df = ps.DataFrame({}}}{{{}{'data': [4, 5, 5, 6, 8, 9]}{}}}{{{}){}}}
> {{df['data'].plot.hist()}}
> {{plt.show()}}
>  
> Truncated traceback:
> {{Traceback (most recent call last): }}
> {{File "/home/develop/Documents/sparkbug/code.py", line 6, in }}
> {{df['data'].plot.hist()}}
> {{...}}
> {{File 
> "/mnt/transient/develop/miniconda3/envs/testenv/lib/python3.9/site-packages/pyspark/pandas/plot/matplotlib.py",
>  line 403, in _make_plot}}
> {{self._add_legend_handle(artists[0], label, index=i)}}
> {{AttributeError: 'PandasOnSparkHistPlot' object has no attribute 
> '_add_legend_handle'}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37730) plot.hist throws AttributeError on pandas=1.3.5

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464522#comment-17464522
 ] 

Apache Spark commented on SPARK-37730:
--

User 'mslapek' has created a pull request for this issue:
https://github.com/apache/spark/pull/35000

> plot.hist throws AttributeError on pandas=1.3.5
> ---
>
> Key: SPARK-37730
> URL: https://issues.apache.org/jira/browse/SPARK-37730
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.3.0
> Environment: Conda environment.yml (also tested with 3.3.0-SNAPSHOT):
> {{name: testenv}}
> {{channels:}}
> {{  - conda-forge}}
> {{dependencies:}}
> {{  - python=3.9.9}}
> {{  }}
> {{  - numpy=1.21.5}}
> {{  - pandas=1.3.5}}
> {{  - matplotlib=3.5.1}}
> {{  }}
> {{  - pyspark=3.2.0}}
>  
>Reporter: Michał Słapek
>Priority: Major
>
> plot.hist from PySpark throws AttributeError exception when pyspark.pandas is 
> used with pandas=1.3.5.
> Pandas in commit 
> [https://github.com/pandas-dev/pandas/commit/029907c9d69a0260401b78a016a6c4515d8f1c40]
> replaced MPLPlot._add_legend_handle with 
> MPLPlot._append_legend_handles_labels.
> I've attached PR on github which replaces use of MPLPlot._add_legend_handle 
> in PySpark with MPLPlot._append_legend_handles_labels.
> Code:
> {{import pyspark.pandas as ps}}
> {{from matplotlib import pyplot as }}{{plt}}
> {{ps.set_option("plotting.backend", "matplotlib")}}
> {{{}df = ps.DataFrame({}}}{{{}{'data': [4, 5, 5, 6, 8, 9]}{}}}{{{}){}}}
> {{df['data'].plot.hist()}}
> {{plt.show()}}
>  
> Truncated traceback:
> {{Traceback (most recent call last): }}
> {{File "/home/develop/Documents/sparkbug/code.py", line 6, in }}
> {{df['data'].plot.hist()}}
> {{...}}
> {{File 
> "/mnt/transient/develop/miniconda3/envs/testenv/lib/python3.9/site-packages/pyspark/pandas/plot/matplotlib.py",
>  line 403, in _make_plot}}
> {{self._add_legend_handle(artists[0], label, index=i)}}
> {{AttributeError: 'PandasOnSparkHistPlot' object has no attribute 
> '_add_legend_handle'}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37730) plot.hist throws AttributeError on pandas=1.3.5

2021-12-23 Thread Jira
Michał Słapek created SPARK-37730:
-

 Summary: plot.hist throws AttributeError on pandas=1.3.5
 Key: SPARK-37730
 URL: https://issues.apache.org/jira/browse/SPARK-37730
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.0, 3.3.0
 Environment: Conda environment.yml (also tested with 3.3.0-SNAPSHOT):


{{name: testenv}}
{{channels:}}
{{  - conda-forge}}
{{dependencies:}}
{{  - python=3.9.9}}
{{  }}
{{  - numpy=1.21.5}}
{{  - pandas=1.3.5}}
{{  - matplotlib=3.5.1}}
{{  }}
{{  - pyspark=3.2.0}}

 
Reporter: Michał Słapek


plot.hist from PySpark throws an AttributeError when pyspark.pandas is used 
with pandas=1.3.5.

Pandas, in commit 
[https://github.com/pandas-dev/pandas/commit/029907c9d69a0260401b78a016a6c4515d8f1c40],
replaced MPLPlot._add_legend_handle with MPLPlot._append_legend_handles_labels.

I've attached a PR on GitHub which replaces the use of MPLPlot._add_legend_handle 
in PySpark with MPLPlot._append_legend_handles_labels.

Code:


{{import pyspark.pandas as ps}}
{{from matplotlib import pyplot as plt}}

{{ps.set_option("plotting.backend", "matplotlib")}}

{{df = ps.DataFrame({'data': [4, 5, 5, 6, 8, 9]})}}
{{df['data'].plot.hist()}}

{{plt.show()}}

 

Truncated traceback:

{{Traceback (most recent call last): }}
{{File "/home/develop/Documents/sparkbug/code.py", line 6, in }}
{{df['data'].plot.hist()}}
{{...}}
{{File 
"/mnt/transient/develop/miniconda3/envs/testenv/lib/python3.9/site-packages/pyspark/pandas/plot/matplotlib.py",
 line 403, in _make_plot}}
{{self._add_legend_handle(artists[0], label, index=i)}}
{{AttributeError: 'PandasOnSparkHistPlot' object has no attribute 
'_add_legend_handle'}}
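
Until a release contains the fix, a possible user-side workaround (an assumption on my 
part, not something proposed in the PR) is to convert the affected column to plain 
pandas before plotting, which sidesteps the pyspark.pandas matplotlib backend entirely. 
A minimal sketch; note that to_pandas() collects the column to the driver, so it is 
only reasonable for small data:

{code:python}
import pyspark.pandas as ps
from matplotlib import pyplot as plt

# Workaround sketch: plot through plain pandas instead of the pyspark.pandas
# matplotlib backend that still calls the removed pandas internal.
df = ps.DataFrame({'data': [4, 5, 5, 6, 8, 9]})
df['data'].to_pandas().plot.hist()  # to_pandas() collects the column to the driver
plt.show()
{code}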



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37714) ANSI mode: allow casting between numeric type and timestamp type

2021-12-23 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-37714.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34985
[https://github.com/apache/spark/pull/34985]

> ANSI mode: allow casting between numeric type and timestamp type 
> -
>
> Key: SPARK-37714
> URL: https://issues.apache.org/jira/browse/SPARK-37714
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> h3. What changes were proposed?
>  * By default, allow casting between numeric type and timestamp type under 
> ANSI mode
>  * Remove the user-facing configuration 
> {{spark.sql.ansi.allowCastBetweenDatetimeAndNumeric}}
> h3. Why are the changes needed?
> Same reason as mentioned in 
> [#34459|https://github.com/apache/spark/pull/34459]. It is for better 
> adoption of ANSI SQL mode since users are relying on it:
>  * After doing some data analysis, we found that many Spark SQL users are 
> actually using {{Cast(Timestamp as Numeric)}} and {{Cast(Numeric as Timestamp)}}.
>  * The Spark SQL connector for Tableau is using this feature for DateTime 
> math. e.g.
> {{CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS 
> TIMESTAMP)}}
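
For illustration (my own example, not from the ticket; it assumes a build that includes 
this change and that ANSI mode is turned on via {{spark.sql.ansi.enabled}}), the newly 
allowed casts look like this from PySpark; the timestamp literal below is interpreted 
in the session time zone:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

# Numeric -> timestamp: the integer is interpreted as seconds since the epoch.
spark.sql("SELECT CAST(1640995200 AS TIMESTAMP) AS ts").show()

# Timestamp -> numeric: converts back to seconds since the epoch.
spark.sql("SELECT CAST(TIMESTAMP'2022-01-01 00:00:00' AS BIGINT) AS secs").show()
{code}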



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37727) Show ignored confs & hide warnings for conf already set in SparkSession.builder.getOrCreate

2021-12-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37727:
-
Description: 
Currently, {{SparkSession.builder.getOrCreate()}} is too noisy even when 
duplicate configurations are set, and users cannot tell which configurations 
they need to fix. See the example below:

{code}
./bin/spark-shell --conf spark.abc=abc
{code}

{code}
import org.apache.spark.sql.SparkSession
spark.sparkContext.setLogLevel("DEBUG")
SparkSession.builder.config("spark.abc", "abc").getOrCreate
{code}

{code}
...
21:12:40.601 [main] WARN  org.apache.spark.sql.SparkSession - Using an existing 
SparkSession; some spark core configurations may not take effect.
{code}

This is straightforward when there are few configurations, but it is difficult 
for users to figure out when there are many, especially when these 
configurations are defined in property files like {{spark-default.conf}} that 
are sometimes maintained separately by system admins.

See also https://github.com/apache/spark/pull/34757#discussion_r769248275

  was:
Currently, {{SparkSession.builder.getOrCreate()}} is too noisy even when 
duplicate configurations are set, and users cannot tell which configurations 
they need to fix. See the example below:

{code}
./bin/spark-shell --conf spark.abc=abc
{code}

{code}
import org.apache.spark.sql.SparkSession
SparkSession.builder.config("spark.abc", "abc").getOrCreate
{code}

{code}

{code}

This is straightforward when there are few configurations, but it is difficult 
for users to figure out when there are many, especially when these 
configurations are defined in property files like {{spark-default.conf}} that 
are sometimes maintained separately by system admins.

See also https://github.com/apache/spark/pull/34757#discussion_r769248275


> Show ignored confs & hide warnings for conf already set in 
> SparkSession.builder.getOrCreate
> ---
>
> Key: SPARK-37727
> URL: https://issues.apache.org/jira/browse/SPARK-37727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, {{SparkSession.builder.getOrCreate()}} is too noisy even when 
> duplicate configurations are set, and users cannot tell which configurations 
> they need to fix. See the example below:
> {code}
> ./bin/spark-shell --conf spark.abc=abc
> {code}
> {code}
> import org.apache.spark.sql.SparkSession
> spark.sparkContext.setLogLevel("DEBUG")
> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> {code}
> {code}
> ...
> 21:12:40.601 [main] WARN  org.apache.spark.sql.SparkSession - Using an 
> existing SparkSession; some spark core configurations may not take effect.
> {code}
> This is straightforward when there are few configurations, but it is difficult 
> for users to figure out when there are many, especially when these 
> configurations are defined in property files like {{spark-default.conf}} that 
> are sometimes maintained separately by system admins.
> See also https://github.com/apache/spark/pull/34757#discussion_r769248275



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37729) SparkSession.setLogLevel not working in Spark Shell

2021-12-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464510#comment-17464510
 ] 

Hyukjin Kwon commented on SPARK-37729:
--

cc [~viirya] FYI. I think this is a regression from upgrading to Log4j 2. cc 
[~dongjoon] too, FYI

> SparkSession.setLogLevel not working in Spark Shell
> ---
>
> Key: SPARK-37729
> URL: https://issues.apache.org/jira/browse/SPARK-37729
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In Spark 3.2:
> {code}
> scala> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.SparkSession
> scala> spark.sparkContext.setLogLevel("FATAL")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> res1: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@7dafb9f9
> scala> spark.sparkContext.setLogLevel("WARN")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> 21/12/23 21:08:18 WARN SparkSession$Builder: Using an existing SparkSession; 
> some spark core configurations may not take effect.
> res3: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@7dafb9f9
> scala> spark.sparkContext.setLogLevel("FATAL")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> res5: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@7dafb9f9
> {code}
> In the current master:
> {code}
> scala> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.SparkSession
> scala> spark.sparkContext.setLogLevel("FATAL")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> res1: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@3e8a1137
> scala> spark.sparkContext.setLogLevel("WARN")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> res3: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@3e8a1137
> scala> spark.sparkContext.setLogLevel("FATAL")
> scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> res5: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@3e8a1137
> {code}
> It seems the level set via {{setLogLevel}} initially takes effect, but it 
> cannot be changed afterward.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37729) SparkSession.setLogLevel not working in Spark Shell

2021-12-23 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37729:


 Summary: SparkSession.setLogLevel not working in Spark Shell
 Key: SPARK-37729
 URL: https://issues.apache.org/jira/browse/SPARK-37729
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


In Spark 3.2:

{code}
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> spark.sparkContext.setLogLevel("FATAL")

scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
res1: org.apache.spark.sql.SparkSession = 
org.apache.spark.sql.SparkSession@7dafb9f9

scala> spark.sparkContext.setLogLevel("WARN")

scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
21/12/23 21:08:18 WARN SparkSession$Builder: Using an existing SparkSession; 
some spark core configurations may not take effect.
res3: org.apache.spark.sql.SparkSession = 
org.apache.spark.sql.SparkSession@7dafb9f9

scala> spark.sparkContext.setLogLevel("FATAL")

scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
res5: org.apache.spark.sql.SparkSession = 
org.apache.spark.sql.SparkSession@7dafb9f9
{code}

In the current master:

{code}
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> spark.sparkContext.setLogLevel("FATAL")

scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
res1: org.apache.spark.sql.SparkSession = 
org.apache.spark.sql.SparkSession@3e8a1137

scala> spark.sparkContext.setLogLevel("WARN")

scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
res3: org.apache.spark.sql.SparkSession = 
org.apache.spark.sql.SparkSession@3e8a1137

scala> spark.sparkContext.setLogLevel("FATAL")

scala> SparkSession.builder.config("spark.abc", "abc").getOrCreate
res5: org.apache.spark.sql.SparkSession = 
org.apache.spark.sql.SparkSession@3e8a1137
{code}

It seems the level set via {{setLogLevel}} initially takes effect, but it cannot 
be changed afterward.
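
One way to see whether the JVM-side root logger level actually changed after each call 
is to read it back through the Py4J gateway (a diagnostic sketch of mine, assuming 
Log4j 2 is on the classpath; {{_jvm}} is an internal PySpark attribute and is used here 
only for inspection):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def root_log_level(session):
    # Read the effective Log4j 2 root logger level via the JVM gateway.
    return str(session._jvm.org.apache.logging.log4j.LogManager.getRootLogger().getLevel())

spark.sparkContext.setLogLevel("FATAL")
print(root_log_level(spark))  # expected FATAL if the call took effect

spark.sparkContext.setLogLevel("WARN")
print(root_log_level(spark))  # expected WARN
{code}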



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37728) reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException

2021-12-23 Thread Yimin Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yimin Yang updated SPARK-37728:
---
Description: 
When spark.sql.orc.enableNestedColumnVectorizedReader is set to true, reading 
nested columns of ORC files can cause ArrayIndexOutOfBoundsException. Here is a 
simple reproduction:

1) create an ORC file which contains records of type Array<Array<String>>:
{code:java}
./bin/spark-shell {code}
{code:java}
case class Item(record: Array[Array[String]])

val data = new Array[Array[Array[String]]](100)
    for (i <- 0 to 99) {
      val temp = new Array[Array[String]](50)
      for (j <- 0 to 49) {
        temp(j) = new Array[String](1000)
        for (k <- 0 to 999) {
          temp(j)(k) = k.toString
        }
      }
      data(i) = temp
    }
val rdd = spark.sparkContext.parallelize(data, 1)
val df = rdd.map(x => Item(x)).toDF
df.write.orc("file:///home/user_name/data") {code}
 

2) read the orc with spark.sql.orc.enableNestedColumnVectorizedReader=true
{code:java}
./bin/spark-shell --conf spark.sql.orc.enableVectorizedReader=true --conf 
spark.sql.codegen.wholeStage=true --conf 
spark.sql.orc.enableNestedColumnVectorizedReader=true --conf 
spark.sql.orc.columnarReaderBatchSize=4096 {code}
{code:java}
val df = spark.read.orc("file:///home/user_name/data")
df.show(100) {code}
 

Then Spark threw ArrayIndexOutOfBoundsException:

Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2455)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2404)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2403)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2403)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1162)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1162)
  at scala.Option.foreach(Option.scala:407)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1162)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2643)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2585)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2574)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:940)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2227)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2248)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:490)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:443)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3833)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2832)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3824)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3822)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2832)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:3053)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:807)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:766)
  ... 47 elided
Caused by: java.lang.ArrayIndexOutOfBoundsException: 4096
  at 
org.apache.spark.sql.execution.datasources.orc.OrcArrayColumnVector.getArray(OrcArrayColumnVector.java:53)
  at 
org.apache.spark.sql.vectorized.ColumnarArray.getArray(ColumnarArray.java:170)
  at 
org.apache.spark.sql.vectorized.ColumnarArray.getArray(ColumnarArray.java:31)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 

[jira] [Updated] (SPARK-37728) reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException

2021-12-23 Thread Yimin Yang (Jira)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2403)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2403)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1162)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1162)
  at scala.Option.foreach(Option.scala:407)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1162)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2643)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2585)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2574)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:940)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2227)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2248)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:490)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:443)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3833)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2832)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3824)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3822)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2832)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:3053)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:807)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:766)
  ... 47 elided
Caused by: java.lang.ArrayIndexOutOfBoundsException: 4096
  at 
org.apache.spark.sql.execution.datasources.orc.OrcArrayColumnVector.getArray(OrcArrayColumnVector.java:53)
  at 
org.apache.spark.sql.vectorized.ColumnarArray.getArray(ColumnarArray.java:170)
  at 
org.apache.spark.sql.vectorized.ColumnarArray.getArray(ColumnarArray.java:31)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:363)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
  at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:136)
  at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

 


> reading nested columns with ORC vectorized reader can cause 
> ArrayIndexOutOfBoundsException
> --
>
> Key: SPARK-37728
> URL: https://issues.apache.org/jira/browse/SPARK-37728
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Yimin Yang
>Priority: Major
>
> When 

[jira] [Updated] (SPARK-37728) reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException

2021-12-23 Thread Yimin Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yimin Yang updated SPARK-37728:
---
Description: 
When spark.sql.orc.enableNestedColumnVectorizedReader is set to true, reading 
nested columns of ORC files can cause ArrayIndexOutOfBoundsException. Here is a 
simple reproduction:

1) create an ORC file which contains records of type Array<Array<String>>:
{code:java}
./bin/spark-shell {code}
{code:java}
case class Item(record: Array[Array[String]])

val data = new Array[Array[Array[String]]](100)
    for (i <- 0 to 99) {
      val temp = new Array[Array[String]](50)
      for (j <- 0 to 49) {
        temp(j) = new Array[String](1000)
        for (k <- 0 to 999) {
          temp(j)(k) = k.toString
        }
      }
      data(i) = temp
    }
val rdd = spark.sparkContext.parallelize(data, 1)
val df = rdd.map(x => Item(x)).toDF
df.write.orc("file:///home/user_name/data") {code}
 

2) read the orc with spark.sql.orc.enableNestedColumnVectorizedReader=true
{code:java}
./bin/spark-shell --conf spark.sql.orc.enableVectorizedReader=true --conf 
spark.sql.codegen.wholeStage=true --conf 
spark.sql.orc.enableNestedColumnVectorizedReader=true --conf 
spark.sql.orc.columnarReaderBatchSize=4096 {code}
{code:java}
val df = spark.read.orc("file:///home/user_name/data")
df.show(100) {code}
 

Then Spark threw ArrayIndexOutOfBoundsException:

Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2455)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2404)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2403)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2403)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1162)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1162)
  at scala.Option.foreach(Option.scala:407)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1162)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2643)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2585)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2574)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:940)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2227)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2248)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:490)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:443)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3833)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2832)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3824)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3822)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2832)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:3053)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:807)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:766)
  ... 47 elided
Caused by: java.lang.ArrayIndexOutOfBoundsException: 4096
  at 
org.apache.spark.sql.execution.datasources.orc.OrcArrayColumnVector.getArray(OrcArrayColumnVector.java:53)
  at 
org.apache.spark.sql.vectorized.ColumnarArray.getArray(ColumnarArray.java:170)
  at 
org.apache.spark.sql.vectorized.ColumnarArray.getArray(ColumnarArray.java:31)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 

[jira] [Created] (SPARK-37728) reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException

2021-12-23 Thread Yimin Yang (Jira)
Yimin Yang created SPARK-37728:
--

 Summary: reading nested columns with ORC vectorized reader can 
cause ArrayIndexOutOfBoundsException
 Key: SPARK-37728
 URL: https://issues.apache.org/jira/browse/SPARK-37728
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0, 3.1.2, 3.0.3
Reporter: Yimin Yang


When spark.sql.orc.enableNestedColumnVectorizedReader is set to true, reading 
nested columns of ORC files can cause ArrayIndexOutOfBoundsException. Here is a 
simple reproduction:

1) create an ORC file which contains records of type Array<Array<String>>:

 
{code:java}
./bin/spark-shell {code}
 

 

 
{code:java}
case class Item(record: Array[Array[String]])

val data = new Array[Array[Array[String]]](100)
    for (i <- 0 to 99) {
      val temp = new Array[Array[String]](50)
      for (j <- 0 to 49) {
        temp(j) = new Array[String](1000)
        for (k <- 0 to 999) {
          temp(j)(k) = k.toString
        }
      }
      data(i) = temp
    }
val rdd = spark.sparkContext.parallelize(data, 1)
val df = rdd.map(x => Item(x)).toDF
df.write.orc("file:///home/user_name/data") {code}
 

 

2) read the orc with spark.sql.orc.enableNestedColumnVectorizedReader=true

 
{code:java}
./bin/spark-shell --conf spark.sql.orc.enableVectorizedReader=true --conf 
spark.sql.codegen.wholeStage=true --conf 
spark.sql.orc.enableNestedColumnVectorizedReader=true --conf 
spark.sql.orc.columnarReaderBatchSize=4096 {code}
 

 

 
{code:java}
val df = spark.read.orc("file:///home/user_name/data")
df.show(100) {code}
 

 

Then Spark threw ArrayIndexOutOfBoundsException:

Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2455)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2404)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2403)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2403)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1162)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1162)
  at scala.Option.foreach(Option.scala:407)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1162)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2643)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2585)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2574)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:940)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2227)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2248)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:490)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:443)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3833)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2832)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3824)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3822)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2832)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:3053)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:807)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:766)
  ... 47 elided
Caused by: java.lang.ArrayIndexOutOfBoundsException: 4096
  at 
org.apache.spark.sql.execution.datasources.orc.OrcArrayColumnVector.getArray(OrcArrayColumnVector.java:53)
  at 
org.apache.spark.sql.vectorized.ColumnarArray.getArray(ColumnarArray.java:170)
  at 
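
A plausible mitigation until a fix lands (an assumption based on the description above, 
not something stated in the report) is to disable the nested-column vectorized reader 
for the affected reads:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fall back to the non-vectorized path for nested ORC columns.
spark.conf.set("spark.sql.orc.enableNestedColumnVectorizedReader", "false")

df = spark.read.orc("file:///home/user_name/data")
df.show(100)
{code}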

[jira] [Updated] (SPARK-37727) Show ignored confs & hide warnings for conf already set in SparkSession.builder.getOrCreate

2021-12-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37727:
-
Description: 
Currently, {{SparkSession.builder.getOrCreate()}} is too noisy even when 
duplicate configurations are set, and users cannot tell which configurations 
they need to fix. See the example below:

{code}
./bin/spark-shell --conf spark.abc=abc
{code}

{code}
import org.apache.spark.sql.SparkSession
SparkSession.builder.config("spark.abc", "abc").getOrCreate
{code}

{code}

{code}

This is straightforward when there are few configurations, but it is difficult 
for users to figure out when there are many, especially when these 
configurations are defined in property files like {{spark-default.conf}} that 
are sometimes maintained separately by system admins.

See also https://github.com/apache/spark/pull/34757#discussion_r769248275

  was:
Currently, {{SparkSession.builder.getOrCreate()}} is too noisy even when 
duplicate configurations are set, and users cannot tell which configurations 
they need to fix. See the example below:

{code}
./bin/spark-shell --conf spark.abc=abc
{code}

{code}
import org.apache.spark.sql.SparkSession
SparkSession.builder.config("spark.abc", "abc").getOrCreate
{code}

{code}
{code}

This is straightforward when there are few configurations, but it is difficult 
for users to figure out when there are many, especially when these 
configurations are defined in property files like {{spark-default.conf}} that 
are sometimes maintained separately by system admins.

See also https://github.com/apache/spark/pull/34757#discussion_r769248275


> Show ignored confs & hide warnings for conf already set in 
> SparkSession.builder.getOrCreate
> ---
>
> Key: SPARK-37727
> URL: https://issues.apache.org/jira/browse/SPARK-37727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, {{SparkSession.builder.getOrCreate()}} is too noisy even when 
> duplicate configurations are set, and users cannot tell which configurations 
> they need to fix. See the example below:
> {code}
> ./bin/spark-shell --conf spark.abc=abc
> {code}
> {code}
> import org.apache.spark.sql.SparkSession
> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> {code}
> {code}
> {code}
> This is straightforward when there are few configurations, but it is difficult 
> for users to figure out when there are many, especially when these 
> configurations are defined in property files like {{spark-default.conf}} that 
> are sometimes maintained separately by system admins.
> See also https://github.com/apache/spark/pull/34757#discussion_r769248275



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37727) Show ignored confs & hide warnings for conf already set in SparkSession.builder.getOrCreate

2021-12-23 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37727:


 Summary: Show ignored confs & hide warnings for conf already set 
in SparkSession.builder.getOrCreate
 Key: SPARK-37727
 URL: https://issues.apache.org/jira/browse/SPARK-37727
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


Currently, {{SparkSession.builder.getOrCreate()}} is too noisy even when 
duplicate configurations are set, and users cannot tell which configurations 
they need to fix. See the example below:

{code}
./bin/spark-shell --conf spark.abc=abc
{code}

{code}
import org.apache.spark.sql.SparkSession
SparkSession.builder.config("spark.abc", "abc").getOrCreate
{code}

{code}
{code}

This is straightforward when there are few configurations, but it is difficult 
for users to figure out when there are many, especially when these 
configurations are defined in property files like {{spark-default.conf}} that 
are sometimes maintained separately by system admins.

See also https://github.com/apache/spark/pull/34757#discussion_r769248275
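
For context, a user-side sketch (my own, not the proposed implementation) that makes 
the ignored options visible by diffing the requested options against the active 
session's conf:

{code:python}
from pyspark.sql import SparkSession

requested = {"spark.abc": "abc", "spark.sql.shuffle.partitions": "400"}

spark = SparkSession.builder.getOrCreate()  # may return an existing session

# Report options whose effective value differs from what was requested.
for key, wanted in requested.items():
    effective = spark.conf.get(key, None)
    if effective != wanted:
        print(f"ignored: {key} requested={wanted!r} effective={effective!r}")
{code}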



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37721) Failed to execute pyspark test in Win WSL

2021-12-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37721:


Assignee: Yikun Jiang

> Failed to execute pyspark test in Win WSL
> -
>
> Key: SPARK-37721
> URL: https://issues.apache.org/jira/browse/SPARK-37721
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
>  
> {code:java}
> Launching unittests with arguments python -m unittest 
> test_rdd.RDDTests.test_range in 
> /home/yikun/spark/python/pyspark/testsTraceback (most recent call last):
>   File "/mnt/d/Program Files/JetBrains/PyCharm 
> 2021.1.3/plugins/python/helpers/pycharm/_jb_unittest_runner.py", line 35, in 
> 
>     sys.exit(main(argv=args, module=None, 
> testRunner=unittestpy.TeamcityTestRunner, buffer=not JB_DISABLE_BUFFERING))
>   File "/usr/lib/python3.8/unittest/main.py", line 100, in __init__
>     self.parseArgs(argv)
>   File "/usr/lib/python3.8/unittest/main.py", line 147, in parseArgs
>     self.createTests()
>   File "/usr/lib/python3.8/unittest/main.py", line 158, in createTests
>     self.test = self.testLoader.loadTestsFromNames(self.testNames,
>   File "/usr/lib/python3.8/unittest/loader.py", line 220, in 
> loadTestsFromNames
>     suites = [self.loadTestsFromName(name, module) for name in names]
>   File "/usr/lib/python3.8/unittest/loader.py", line 220, in 
>     suites = [self.loadTestsFromName(name, module) for name in names]
>   File "/usr/lib/python3.8/unittest/loader.py", line 154, in loadTestsFromName
>     module = __import__(module_name)
>   File "/home/yikun/spark/python/pyspark/tests/test_rdd.py", line 37, in 
> 
>     from pyspark.testing.utils import ReusedPySparkTestCase, SPARK_HOME, 
> QuietTest
>   File "/home/yikun/spark/python/pyspark/testing/utils.py", line 47, in 
> 
>     SPARK_HOME = os.environ["SPARK_HOME"]#_find_spark_home()
>   File "/usr/lib/python3.8/os.py", line 675, in __getitem__
>     raise KeyError(key) from None
> KeyError: 'SPARK_HOME' {code}
> Looks like we should change "SPARK_HOME = os.environ["SPARK_HOME"]" to 
> "SPARK_HOME = _find_spark_home()"
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37721) Failed to execute pyspark test in Win WSL

2021-12-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37721.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34993
[https://github.com/apache/spark/pull/34993]

> Failed to execute pyspark test in Win WSL
> -
>
> Key: SPARK-37721
> URL: https://issues.apache.org/jira/browse/SPARK-37721
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>
>  
> {code:java}
> Launching unittests with arguments python -m unittest 
> test_rdd.RDDTests.test_range in 
> /home/yikun/spark/python/pyspark/tests
> Traceback (most recent call last):
>   File "/mnt/d/Program Files/JetBrains/PyCharm 
> 2021.1.3/plugins/python/helpers/pycharm/_jb_unittest_runner.py", line 35, in 
> 
>     sys.exit(main(argv=args, module=None, 
> testRunner=unittestpy.TeamcityTestRunner, buffer=not JB_DISABLE_BUFFERING))
>   File "/usr/lib/python3.8/unittest/main.py", line 100, in __init__
>     self.parseArgs(argv)
>   File "/usr/lib/python3.8/unittest/main.py", line 147, in parseArgs
>     self.createTests()
>   File "/usr/lib/python3.8/unittest/main.py", line 158, in createTests
>     self.test = self.testLoader.loadTestsFromNames(self.testNames,
>   File "/usr/lib/python3.8/unittest/loader.py", line 220, in 
> loadTestsFromNames
>     suites = [self.loadTestsFromName(name, module) for name in names]
>   File "/usr/lib/python3.8/unittest/loader.py", line 220, in 
>     suites = [self.loadTestsFromName(name, module) for name in names]
>   File "/usr/lib/python3.8/unittest/loader.py", line 154, in loadTestsFromName
>     module = __import__(module_name)
>   File "/home/yikun/spark/python/pyspark/tests/test_rdd.py", line 37, in 
> 
>     from pyspark.testing.utils import ReusedPySparkTestCase, SPARK_HOME, 
> QuietTest
>   File "/home/yikun/spark/python/pyspark/testing/utils.py", line 47, in 
> 
>     SPARK_HOME = os.environ["SPARK_HOME"]#_find_spark_home()
>   File "/usr/lib/python3.8/os.py", line 675, in __getitem__
>     raise KeyError(key) from None
> KeyError: 'SPARK_HOME' {code}
> Looks like we should change "SPARK_HOME = os.environ["SPARK_HOME"]" to 
> "SPARK_HOME = _find_spark_home()"
>  
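
A minimal sketch of the suggested change (my reading of it; the actual patch may import 
or structure this differently):

{code:python}
# pyspark/testing/utils.py (sketch)
from pyspark.find_spark_home import _find_spark_home

# Resolve SPARK_HOME even when the environment variable is not set,
# e.g. when tests are launched from an IDE under WSL.
SPARK_HOME = _find_spark_home()
{code}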



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37726) Add spill size metrics for sort merge join

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37726:


Assignee: (was: Apache Spark)

> Add spill size metrics for sort merge join
> --
>
> Key: SPARK-37726
> URL: https://issues.apache.org/jira/browse/SPARK-37726
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Priority: Trivial
>
> Sort merge join allows the buffered side to spill if it is too large to hold 
> in memory. It would be good to add a "spill size" SQL metric to sort merge 
> join, to track how often spills happen and how large they are when they do.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37726) Add spill size metrics for sort merge join

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37726:


Assignee: Apache Spark

> Add spill size metrics for sort merge join
> --
>
> Key: SPARK-37726
> URL: https://issues.apache.org/jira/browse/SPARK-37726
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Trivial
>
> Sort merge join allows the buffered side to spill if it is too large to hold 
> in memory. It would be good to add a "spill size" SQL metric to sort merge 
> join, to track how often spills happen and how large they are when they do.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37726) Add spill size metrics for sort merge join

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464411#comment-17464411
 ] 

Apache Spark commented on SPARK-37726:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/34999

> Add spill size metrics for sort merge join
> --
>
> Key: SPARK-37726
> URL: https://issues.apache.org/jira/browse/SPARK-37726
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Priority: Trivial
>
> Sort merge join allows the buffered side to spill if it is too large to hold 
> in memory. It would be good to add a "spill size" SQL metric to sort merge 
> join, to track how often spills happen and how large they are when they do.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37726) Add spill size metrics for sort merge join

2021-12-23 Thread Cheng Su (Jira)
Cheng Su created SPARK-37726:


 Summary: Add spill size metrics for sort merge join
 Key: SPARK-37726
 URL: https://issues.apache.org/jira/browse/SPARK-37726
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Cheng Su


Sort merge join allows the buffered side to spill if it is too large to hold in 
memory. It would be good to add a "spill size" SQL metric to sort merge join, to 
track how often spills happen and how large they are when they do.
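
For context, a minimal sketch (my own illustration, not from the ticket) of a query 
that goes through sort merge join and whose buffered side may spill under memory 
pressure, which is where such a metric would surface:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Force a sort merge join with the "merge" hint; whether the buffered side
# actually spills depends on the available execution memory.
left = spark.range(0, 10_000_000).withColumnRenamed("id", "k")
right = spark.range(0, 10_000_000).withColumnRenamed("id", "k")
left.hint("merge").join(right, "k").count()
{code}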



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37725) Spark master UI behind reverse proxy app storage redirects to master home page

2021-12-23 Thread Adrian Paraschiv (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Paraschiv updated SPARK-37725:
-
Description: 
1. Using a Spark v3.1.1 cluster with haproxy enabled as a reverse proxy, the web 
UI is at: [http://spark.internal.domain:8080/]
2. Start an application and enter the application-specific web UI (by clicking 
on the running application name): 
[http://spark.internal.domain:8080/app/?appId=app-20211202144454-0119]
3. Click on "Application Detail UI" and you will be redirected to 
[http://spark.internal.domain:8080|http://spark.internal.domain:8080/] 
[/proxy/app-20211202144454-0119/jobs/|http://sparkm-v3-mlm-op.itn.intraorange/proxy/app-20211202144454-0119/jobs/]
 (notice the "proxy" in the URL).
4. If you click on any link (e.g. "Stages" or "Storage"), the Spark master home 
page UI is served again, not "Stages" nor "Storage".
5. If you manually insert "proxy/app-20211202144454-0119/" and append stages/ or 
storage/ to the URL, it works.

  was:
1. Using a Spark v3.1.1 cluster with haproxy enabled as a reverse proxy, the web 
UI is at: http://spark.internal.domain:8080/
2. Start an application and enter the application-specific web UI (by clicking 
on the running application name): 
http://spark.internal.domain:8080/app/?appId=app-20211202144454-0119
3. Click on "Application Detail UI" and you will be redirected to 
http://sparkm-v3-mlm-op.itn.intraorange/proxy/app-20211202144454-0119/jobs/ 
(notice the "proxy" in the URL).
4. If you click on any link (e.g. "Stages" or "Storage"), the Spark master home 
page UI is served again, not "Stages" nor "Storage".
5. If you manually insert "proxy/app-20211202144454-0119/" and append stages/ or 
storage/ to the URL, it works.


> Spark master UI behind reverse proxy app storage redirects to master home page
> --
>
> Key: SPARK-37725
> URL: https://issues.apache.org/jira/browse/SPARK-37725
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Adrian Paraschiv
>Priority: Minor
>
> 1. Using a Spark v3.1.1 cluster with haproxy enabled as a reverse proxy, the web 
> UI is at: [http://spark.internal.domain:8080/]
> 2. Start an application and enter the application-specific web UI (by clicking 
> on the running application name): 
> [http://spark.internal.domain:8080/app/?appId=app-20211202144454-0119]
> 3. Click on "Application Detail UI" and you will be redirected to 
> [http://spark.internal.domain:8080|http://spark.internal.domain:8080/] 
> [/proxy/app-20211202144454-0119/jobs/|http://sparkm-v3-mlm-op.itn.intraorange/proxy/app-20211202144454-0119/jobs/]
>  (notice the "proxy" in the URL).
> 4. If you click on any link (e.g. "Stages" or "Storage"), the Spark master home 
> page UI is served again, not "Stages" nor "Storage".
> 5. If you manually insert "proxy/app-20211202144454-0119/" and append stages/ or 
> storage/ to the URL, it works.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37725) Spark master UI behind reverse proxy app storage redirects to master home page

2021-12-23 Thread Adrian Paraschiv (Jira)
Adrian Paraschiv created SPARK-37725:


 Summary: Spark master UI behind reverse proxy app storage 
redirects to master home page
 Key: SPARK-37725
 URL: https://issues.apache.org/jira/browse/SPARK-37725
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.1.1, 3.0.1
Reporter: Adrian Paraschiv


1. Using a Spark v3.1.1 cluster with haproxy enabled as a reverse proxy, the web 
UI is at: http://spark.internal.domain:8080/
2. Start an application and enter the application-specific web UI (by clicking 
on the running application name): 
http://spark.internal.domain:8080/app/?appId=app-20211202144454-0119
3. Click on "Application Detail UI" and you will be redirected to 
http://sparkm-v3-mlm-op.itn.intraorange/proxy/app-20211202144454-0119/jobs/ 
(notice the "proxy" in the URL).
4. If you click on any link (e.g. "Stages" or "Storage"), the Spark master home 
page UI is served again, not "Stages" nor "Storage".
5. If you manually insert "proxy/app-20211202144454-0119/" and append stages/ or 
storage/ to the URL, it works.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37496) Migrate ReplaceTableAsSelectStatement to v2 command

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464356#comment-17464356
 ] 

Apache Spark commented on SPARK-37496:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34997

> Migrate ReplaceTableAsSelectStatement to v2 command
> ---
>
> Key: SPARK-37496
> URL: https://issues.apache.org/jira/browse/SPARK-37496
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37690) Recursive view `df` detected (cycle: `df` -> `df`)

2021-12-23 Thread Robin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464354#comment-17464354
 ] 

Robin commented on SPARK-37690:
---

Someone 
[here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]
 has suggested this is an intentional breaking change introduced in Spark 3.1:

From [Migration Guide: SQL, Datasets and DataFrame - Spark 3.1.1 Documentation 
(apache.org)|https://spark.apache.org/docs/3.1.1/sql-migration-guide.html]:

> In Spark 3.1, the temporary view will have same behaviors with the permanent 
> view, i.e. capture and store runtime SQL configs, SQL text, catalog and 
> namespace. The captured view properties will be applied during the parsing 
> and analysis phases of the view resolution. To restore the behavior before 
> Spark 3.1, {*}you can set spark.sql.legacy.storeAnalyzedPlanForView to 
> true{*}.

 

I'd be grateful if someone could clarify. Worth noting that the example code works in 
Spark 3.1.2, just not in 3.2.0. It's not obvious to me that the above quote implies 
`createOrReplaceTempView` would fail in the example code posted in the issue.

> Recursive view `df` detected (cycle: `df` -> `df`)
> --
>
> Key: SPARK-37690
> URL: https://issues.apache.org/jira/browse/SPARK-37690
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Robin
>Priority: Major
>
> In Spark 3.2.0, you can no longer reuse the same name for a temporary view.  
> This change is backwards incompatible, and means a common way of running 
> pipelines of SQL queries no longer works.   The following is a simple 
> reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0: 
> {code:python}from pyspark.context import SparkContext 
> from pyspark.sql import SparkSession 
> sc = SparkContext.getOrCreate() 
> spark = SparkSession(sc) 
> sql = """ SELECT id as col_1, rand() AS col_2 FROM RANGE(10); """ 
> df = spark.sql(sql) 
> df.createOrReplaceTempView("df") 
> sql = """ SELECT * FROM df """ 
> df = spark.sql(sql) 
> df.createOrReplaceTempView("df") 
> sql = """ SELECT * FROM df """ 
> df = spark.sql(sql) {code}   
> The following error is now produced:   
> {code:python}AnalysisException: Recursive view `df` detected (cycle: `df` -> 
> `df`) 
> {code} 
> I'm reasonably sure this change is unintentional in 3.2.0 since it breaks a 
> lot of legacy code, and the `createOrReplaceTempView` method is named 
> explicitly such that replacing an existing view should be allowed.   An 
> internet search suggests other users have run into similar problems, e.g. 
> [here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]
>   
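
Not an authoritative answer, just a workaround sketch based on the migration-guide 
quote in the comment above: the legacy flag should restore the pre-3.1 view semantics, 
and a pipeline can alternatively avoid reusing the same view name:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option 1 (from the migration guide quoted above), if the legacy conf is
# available in your version: restore the pre-3.1 temporary-view semantics.
spark.conf.set("spark.sql.legacy.storeAnalyzedPlanForView", "true")

# Option 2: avoid the self-reference by giving each pipeline step its own view name.
spark.sql("SELECT id AS col_1, rand() AS col_2 FROM RANGE(10)").createOrReplaceTempView("df_step1")
spark.sql("SELECT * FROM df_step1").createOrReplaceTempView("df_step2")
spark.sql("SELECT * FROM df_step2").show()
{code}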



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37724) ANSI mode: disable ANSI reserved keywords by default

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37724:


Assignee: Gengliang Wang  (was: Apache Spark)

> ANSI mode: disable ANSI reserved keywords by default
> 
>
> Key: SPARK-37724
> URL: https://issues.apache.org/jira/browse/SPARK-37724
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> The reserved keywords restriction is a big blocker for many users who want to 
> try ANSI mode: they have to update their SQL queries just to pass the parser, 
> which has nothing to do with data quality and is simply extra trouble.
> By disabling the feature by default, I think we can get better adoption of 
> ANSI mode.
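
To make the impact concrete, a small illustration (my own, not from the ticket; it 
assumes a Spark version in which ANSI mode enforces reserved keywords): using a 
reserved word such as {{order}} as an identifier fails to parse unless it is 
back-quoted.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

# With reserved keywords enforced, this fails with a ParseException,
# because ORDER is an ANSI reserved word:
# spark.sql("SELECT 1 AS order").show()

# Back-quoting the identifier keeps the query valid either way.
spark.sql("SELECT 1 AS `order`").show()
{code}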



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37724) ANSI mode: disable ANSI reserved keywords by default

2021-12-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37724:


Assignee: Apache Spark  (was: Gengliang Wang)

> ANSI mode: disable ANSI reserved keywords by default
> 
>
> Key: SPARK-37724
> URL: https://issues.apache.org/jira/browse/SPARK-37724
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> The reserved keywords restriction is a big blocker for many users who want to 
> try ANSI mode: they have to update their SQL queries just to pass the parser, 
> which has nothing to do with data quality and is simply extra trouble.
> By disabling the feature by default, I think we can get better adoption of 
> ANSI mode.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37724) ANSI mode: disable ANSI reserved keywords by default

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464352#comment-17464352
 ] 

Apache Spark commented on SPARK-37724:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34996

> ANSI mode: disable ANSI reserved keywords by default
> 
>
> Key: SPARK-37724
> URL: https://issues.apache.org/jira/browse/SPARK-37724
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Reserved keywords are a major blocker for many users who want to try ANSI 
> mode. They have to rewrite their SQL queries just to get past the parser, 
> which has nothing to do with data quality and is simply extra trouble.
> By disabling the feature by default, I think we can get better adoption of 
> ANSI mode.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37724) ANSI mode: disable ANSI reserved keywords by default

2021-12-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464351#comment-17464351
 ] 

Apache Spark commented on SPARK-37724:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34996

> ANSI mode: disable ANSI reserved keywords by default
> 
>
> Key: SPARK-37724
> URL: https://issues.apache.org/jira/browse/SPARK-37724
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Reserved keywords are a major blocker for many users who want to try ANSI 
> mode. They have to rewrite their SQL queries just to get past the parser, 
> which has nothing to do with data quality and is simply extra trouble.
> By disabling the feature by default, I think we can get better adoption of 
> ANSI mode.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37724) ANSI mode: disable ANSI reserved keywords by default

2021-12-23 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-37724:
---
Summary: ANSI mode: disable ANSI reserved keywords by default  (was: ANSI 
mode: disable ANSI reserved keyworks by default)

> ANSI mode: disable ANSI reserved keywords by default
> 
>
> Key: SPARK-37724
> URL: https://issues.apache.org/jira/browse/SPARK-37724
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Reserved keywords are a major blocker for many users who want to try ANSI 
> mode. They have to rewrite their SQL queries just to get past the parser, 
> which has nothing to do with data quality and is simply extra trouble.
> By disabling the feature by default, I think we can get better adoption of 
> ANSI mode.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37724) ANSI mode: disable ANSI reserved keyworks by default

2021-12-23 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-37724:
--

 Summary: ANSI mode: disable ANSI reserved keyworks by default
 Key: SPARK-37724
 URL: https://issues.apache.org/jira/browse/SPARK-37724
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


Reserved keywords are a major blocker for many users who want to try ANSI 
mode. They have to rewrite their SQL queries just to get past the parser, 
which has nothing to do with data quality and is simply extra trouble.

By disabling the feature by default, I think we can get better adoption of 
ANSI mode.
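To make the rewrite burden concrete, here is a minimal, hypothetical example. The view and alias names are invented for illustration, and it assumes TABLE is on Spark's ANSI reserved-keyword list while remaining a plain identifier when enforcement is off; this is a sketch, not the project's own test case.

{code:python}
# Hypothetical PySpark sketch of the kind of query users would otherwise
# have to rewrite; names are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reserved-keyword-example").getOrCreate()
spark.range(5).createOrReplaceTempView("events")

# Legacy style: uses the ANSI reserved word `table` as a column alias.
# With reserved-keyword enforcement disabled (the proposed default) this parses.
spark.sql("SELECT id AS table FROM events").show()

# Under strict ANSI reserved keywords the alias would need backquotes instead:
spark.sql("SELECT id AS `table` FROM events").show()
{code}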



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org