[jira] [Updated] (SPARK-42929) make mapInPandas / mapInArrow support "is_barrier"
[ https://issues.apache.org/jira/browse/SPARK-42929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-42929: --- Labels: pull-request-available (was: ) > make mapInPandas / mapInArrow support "is_barrier" > -- > > Key: SPARK-42929 > URL: https://issues.apache.org/jira/browse/SPARK-42929 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > make mapInPandas / mapInArrow support "is_barrier" -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
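For context, the change above threads a barrier-mode flag through the map-in-pandas APIs. A minimal sketch, assuming the Spark 3.5 signatures where `mapInPandas`/`mapInArrow` accept a `barrier` keyword; the UDF itself is plain pandas, so it can be exercised without a SparkSession (the Spark call is shown only as a comment):

```python
from typing import Iterator
import pandas as pd

def double_values(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Each batch is a pandas DataFrame holding a slice of one Spark partition.
    for pdf in batches:
        yield pdf.assign(value=pdf["value"] * 2)

# With a SparkSession, the barrier flag would be passed like this (sketch):
#   df.mapInPandas(double_values, schema="value long", barrier=True)

# Exercise the UDF locally on a single fake batch:
out = pd.concat(double_values(iter([pd.DataFrame({"value": [1, 2, 3]})])))
print(out["value"].tolist())  # -> [2, 4, 6]
```

Barrier mode schedules all tasks of the stage together, which is what distributed-training workloads built on these APIs need.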
[jira] [Comment Edited] (SPARK-47194) Upgrade log4j2 to 2.23.0
[ https://issues.apache.org/jira/browse/SPARK-47194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821543#comment-17821543 ] Yang Jie edited comment on SPARK-47194 at 2/28/24 7:19 AM: --- It seems that the `-Dlog4j2.debug` option may not be working in 2.23.0, perhaps we should skip this upgrade. I have tested the following scenarios: 1. run `dev/make-distribution.sh --tgz` to build a Spark Client 2. add `log4j2.properties` and `spark-defaults.conf` with the same content as test case `Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties` ``` log4j2.properties # This log4j config file is for integration test SparkConfPropagateSuite. rootLogger.level = debug rootLogger.appenderRef.stdout.ref = console appender.console.type = Console appender.console.name = console appender.console.target = SYSTEM_ERR appender.console.layout.type = PatternLayout appender.console.layout.pattern = %d\{HH:mm:ss.SSS} %p %c: %maxLen\{%m} {512} %n%ex\{8}%n ``` ``` spark-defaults.conf spark.driver.extraJavaOptions -Dlog4j2.debug spark.executor.extraJavaOptions -Dlog4j2.debug spark.kubernetes.executor.deleteOnTermination false ``` 3. run `bin/run-example SparkPi` When using log4j 2.22.1, we can have the following log: ``` ... TRACE StatusLogger DefaultConfiguration cleaning Appenders from 1 LoggerConfigs. DEBUG StatusLogger Stopped org.apache.logging.log4j.core.config.DefaultConfiguration@384ad17b OK TRACE StatusLogger Reregistering MBeans after reconfigure. 
Selector=org.apache.logging.log4j.core.selector.ClassLoaderContextSelector@5852c06f TRACE StatusLogger Reregistering context (1/1): '5ffd2b27' org.apache.logging.log4j.core.LoggerContext@31190526 TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27' TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27,component=StatusLogger' TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27,component=ContextSelector' TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27,component=Loggers,name=*' TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27,component=Appenders,name=*' TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27,component=AsyncAppenders,name=*' TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27,component=AsyncLoggerRingBuffer' TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27,component=Loggers,name=*,subtype=RingBuffer' DEBUG StatusLogger Registering MBean org.apache.logging.log4j2:type=5ffd2b27 DEBUG StatusLogger Registering MBean org.apache.logging.log4j2:type=5ffd2b27,component=StatusLogger DEBUG StatusLogger Registering MBean org.apache.logging.log4j2:type=5ffd2b27,component=ContextSelector DEBUG StatusLogger Registering MBean org.apache.logging.log4j2:type=5ffd2b27,component=Loggers,name= DEBUG StatusLogger Registering MBean org.apache.logging.log4j2:type=5ffd2b27,component=Appenders,name=console TRACE StatusLogger Using default SystemClock for timestamps. DEBUG StatusLogger org.apache.logging.log4j.core.util.SystemClock supports precise timestamps. TRACE StatusLogger Using DummyNanoClock for nanosecond timestamps. 
DEBUG StatusLogger Reconfiguration complete for context[name=5ffd2b27] at URI /Users/yangjie01/Tools/4.0/spark-4.0.0-SNAPSHOT-bin-3.3.6/conf/log4j2.properties (org.apache.logging.log4j.core.LoggerContext@31190526) with optional ClassLoader: null DEBUG StatusLogger Shutdown hook enabled. Registering a new one. ... ``` But when using log4j 2.23.0, no logs related to `StatusLogger` are printed. So let's skip this upgrade.
[jira] [Resolved] (SPARK-47194) Upgrade log4j2 to 2.23.0
[ https://issues.apache.org/jira/browse/SPARK-47194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-47194. -- Resolution: Won't Fix It seems that the `-Dlog4j2.debug` option may not be working in 2.23.0, perhaps we should skip this upgrade. I have tested the following scenarios: 1. run `dev/make-distribution.sh --tgz` to build a Spark Client 2. add `log4j2.properties` and `spark-defaults.conf` with the same content as test case `Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties` ``` log4j2.properties # This log4j config file is for integration test SparkConfPropagateSuite. rootLogger.level = debug rootLogger.appenderRef.stdout.ref = console appender.console.type = Console appender.console.name = console appender.console.target = SYSTEM_ERR appender.console.layout.type = PatternLayout appender.console.layout.pattern = %d\{HH:mm:ss.SSS} %p %c: %maxLen\{%m}{512}%n%ex\{8}%n ``` ``` spark-defaults.conf spark.driver.extraJavaOptions -Dlog4j2.debug spark.executor.extraJavaOptions -Dlog4j2.debug spark.kubernetes.executor.deleteOnTermination false ``` 3. run `bin/run-example SparkPi` When using log4j 2.22.1, we can have the following log: ``` ... TRACE StatusLogger DefaultConfiguration cleaning Appenders from 1 LoggerConfigs. DEBUG StatusLogger Stopped org.apache.logging.log4j.core.config.DefaultConfiguration@384ad17b OK TRACE StatusLogger Reregistering MBeans after reconfigure. 
Selector=org.apache.logging.log4j.core.selector.ClassLoaderContextSelector@5852c06f TRACE StatusLogger Reregistering context (1/1): '5ffd2b27' org.apache.logging.log4j.core.LoggerContext@31190526 TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27' TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27,component=StatusLogger' TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27,component=ContextSelector' TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27,component=Loggers,name=*' TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27,component=Appenders,name=*' TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27,component=AsyncAppenders,name=*' TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27,component=AsyncLoggerRingBuffer' TRACE StatusLogger Unregistering but no MBeans found matching 'org.apache.logging.log4j2:type=5ffd2b27,component=Loggers,name=*,subtype=RingBuffer' DEBUG StatusLogger Registering MBean org.apache.logging.log4j2:type=5ffd2b27 DEBUG StatusLogger Registering MBean org.apache.logging.log4j2:type=5ffd2b27,component=StatusLogger DEBUG StatusLogger Registering MBean org.apache.logging.log4j2:type=5ffd2b27,component=ContextSelector DEBUG StatusLogger Registering MBean org.apache.logging.log4j2:type=5ffd2b27,component=Loggers,name= DEBUG StatusLogger Registering MBean org.apache.logging.log4j2:type=5ffd2b27,component=Appenders,name=console TRACE StatusLogger Using default SystemClock for timestamps. DEBUG StatusLogger org.apache.logging.log4j.core.util.SystemClock supports precise timestamps. TRACE StatusLogger Using DummyNanoClock for nanosecond timestamps. 
DEBUG StatusLogger Reconfiguration complete for context[name=5ffd2b27] at URI /Users/yangjie01/Tools/4.0/spark-4.0.0-SNAPSHOT-bin-3.3.6/conf/log4j2.properties (org.apache.logging.log4j.core.LoggerContext@31190526) with optional ClassLoader: null DEBUG StatusLogger Shutdown hook enabled. Registering a new one. ... ``` But when using log4j 2.23.0, no logs related to `StatusLogger` are printed. cc @dongjoon-hyun > Upgrade log4j2 to 2.23.0 > > > Key: SPARK-47194 > URL: https://issues.apache.org/jira/browse/SPARK-47194 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available >
[jira] [Assigned] (SPARK-47199) Add prefix into TemporaryDirectory to avoid flakiness
[ https://issues.apache.org/jira/browse/SPARK-47199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47199: - Assignee: Hyukjin Kwon > Add prefix into TemporaryDirectory to avoid flakiness > - > > Key: SPARK-47199 > URL: https://issues.apache.org/jira/browse/SPARK-47199 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > > Sometimes the tests fail because the temporary directory names are the same > (https://github.com/apache/spark/actions/runs/8066850485/job/22036007390). > {code} > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line ?, in > pyspark.sql.dataframe.DataFrame.writeStream > Failed example: > with tempfile.TemporaryDirectory() as d: > # Create a table with Rate source. > df.writeStream.toTable( > "my_table", checkpointLocation=d) > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.11/doctest.py", line 1353, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > with tempfile.TemporaryDirectory() as d: > File "/usr/lib/python3.11/tempfile.py", line 1043, in __exit__ > self.cleanup() > File "/usr/lib/python3.11/tempfile.py", line 1047, in cleanup > self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors) > File "/usr/lib/python3.11/tempfile.py", line 1029, in _rmtree > _rmtree(name, onerror=onerror) > File "/usr/lib/python3.11/shutil.py", line 738, in rmtree > onerror(os.rmdir, path, sys.exc_info()) > File "/usr/lib/python3.11/shutil.py", line 736, in rmtree > os.rmdir(path, dir_fd=dir_fd) > OSError: [Errno 39] Directory not empty: > '/__w/spark/spark/python/target/4f062b09-213f-4ac2-a10a-2d704990141b/tmp29irqweq' > {code}
[jira] [Resolved] (SPARK-47199) Add prefix into TemporaryDirectory to avoid flakiness
[ https://issues.apache.org/jira/browse/SPARK-47199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47199. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45298 [https://github.com/apache/spark/pull/45298] > Add prefix into TemporaryDirectory to avoid flakiness > - > > Key: SPARK-47199 > URL: https://issues.apache.org/jira/browse/SPARK-47199 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Sometimes the tests fail because the temporary directory names are the same > (https://github.com/apache/spark/actions/runs/8066850485/job/22036007390). > {code} > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line ?, in > pyspark.sql.dataframe.DataFrame.writeStream > Failed example: > with tempfile.TemporaryDirectory() as d: > # Create a table with Rate source. > df.writeStream.toTable( > "my_table", checkpointLocation=d) > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.11/doctest.py", line 1353, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > with tempfile.TemporaryDirectory() as d: > File "/usr/lib/python3.11/tempfile.py", line 1043, in __exit__ > self.cleanup() > File "/usr/lib/python3.11/tempfile.py", line 1047, in cleanup > self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors) > File "/usr/lib/python3.11/tempfile.py", line 1029, in _rmtree > _rmtree(name, onerror=onerror) > File "/usr/lib/python3.11/shutil.py", line 738, in rmtree > onerror(os.rmdir, path, sys.exc_info()) > File "/usr/lib/python3.11/shutil.py", line 736, in rmtree > os.rmdir(path, dir_fd=dir_fd) > OSError: [Errno 39] Directory not empty: > '/__w/spark/spark/python/target/4f062b09-213f-4ac2-a10a-2d704990141b/tmp29irqweq' > {code}
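The fix described above is straightforward: passing a distinctive `prefix` to `TemporaryDirectory` makes concurrently created directories easy to tell apart and keeps doctest runs from colliding on identical names. A minimal sketch (the prefix string here is illustrative, not necessarily the one used in the Spark PR):

```python
import os
import tempfile

# Each test gets a directory whose name starts with a recognizable prefix,
# e.g. /tmp/test_writeStreamXXXXXXXX, and the directory is removed on exit.
with tempfile.TemporaryDirectory(prefix="test_writeStream") as d:
    name = os.path.basename(d)
    print(name.startswith("test_writeStream"))  # -> True

# After the with-block, cleanup has run and the directory is gone.
print(os.path.exists(d))  # -> False
```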
[jira] [Resolved] (SPARK-45200) Spark 3.4.0 always using default log4j profile
[ https://issues.apache.org/jira/browse/SPARK-45200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-45200. -- Resolution: Not A Problem > Spark 3.4.0 always using default log4j profile > -- > > Key: SPARK-45200 > URL: https://issues.apache.org/jira/browse/SPARK-45200 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Jitin Dominic >Priority: Major > Labels: pull-request-available > > I've been using Spark core 3.2.2 and was upgrading to 3.4.0. > On execution of my Java code with 3.4.0, it generates an extra set of > logs, but I don't face this issue with 3.2.2. > > I noticed that the log says _Using Spark's default log4j profile: > org/apache/spark/log4j2-defaults.properties._ > > Is this a bug, or do we have a configuration to disable the use of the > default log4j profile? > I didn't see anything in the documentation. > > +*Output:*+ > {code:java} > Using Spark's default log4j profile: > org/apache/spark/log4j2-defaults.properties > 23/09/18 20:05:08 INFO SparkContext: Running Spark version 3.4.0 > 23/09/18 20:05:08 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 23/09/18 20:05:08 INFO ResourceUtils: > == > 23/09/18 20:05:08 INFO ResourceUtils: No custom resources configured for > spark.driver. 
> 23/09/18 20:05:08 INFO ResourceUtils: > == > 23/09/18 20:05:08 INFO SparkContext: Submitted application: XYZ > 23/09/18 20:05:08 INFO ResourceProfile: Default ResourceProfile created, > executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , > memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: > offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: > cpus, amount: 1.0) > 23/09/18 20:05:08 INFO ResourceProfile: Limiting resource is cpu > 23/09/18 20:05:08 INFO ResourceProfileManager: Added ResourceProfile id: 0 > 23/09/18 20:05:08 INFO SecurityManager: Changing view acls to: jd > 23/09/18 20:05:08 INFO SecurityManager: Changing modify acls to: jd > 23/09/18 20:05:08 INFO SecurityManager: Changing view acls groups to: > 23/09/18 20:05:08 INFO SecurityManager: Changing modify acls groups to: > 23/09/18 20:05:08 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: jd; groups with view > permissions: EMPTY; users with modify permissions: jd; groups with modify > permissions: EMPTY > 23/09/18 20:05:08 INFO Utils: Successfully started service 'sparkDriver' on > port 39155. 
> 23/09/18 20:05:08 INFO SparkEnv: Registering MapOutputTracker > 23/09/18 20:05:08 INFO SparkEnv: Registering BlockManagerMaster > 23/09/18 20:05:08 INFO BlockManagerMasterEndpoint: Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 23/09/18 20:05:08 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint > up > 23/09/18 20:05:08 INFO SparkEnv: Registering BlockManagerMasterHeartbeat > 23/09/18 20:05:08 INFO DiskBlockManager: Created local directory at > /tmp/blockmgr-226d599c-1511-4fae-b0e7-aae81b684012 > 23/09/18 20:05:08 INFO MemoryStore: MemoryStore started with capacity 2004.6 > MiB > 23/09/18 20:05:08 INFO SparkEnv: Registering OutputCommitCoordinator > 23/09/18 20:05:08 INFO JettyUtils: Start Jetty 0.0.0.0:4040 for SparkUI > 23/09/18 20:05:09 INFO Utils: Successfully started service 'SparkUI' on port > 4040. > 23/09/18 20:05:09 INFO Executor: Starting executor ID driver on host jd > 23/09/18 20:05:09 INFO Executor: Starting executor with user classpath > (userClassPathFirst = false): '' > 23/09/18 20:05:09 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 32819. 
> 23/09/18 20:05:09 INFO NettyBlockTransferService: Server created on jd:32819 > 23/09/18 20:05:09 INFO BlockManager: Using > org.apache.spark.storage.RandomBlockReplicationPolicy for block replication > policy > 23/09/18 20:05:09 INFO BlockManagerMaster: Registering BlockManager > BlockManagerId(driver, jd, 32819, None) > 23/09/18 20:05:09 INFO BlockManagerMasterEndpoint: Registering block manager > jd:32819 with 2004.6 MiB RAM, BlockManagerId(driver, jd, 32819, None) > 23/09/18 20:05:09 INFO BlockManagerMaster: Registered BlockManager > BlockManagerId(driver, jd, 32819, None) > 23/09/18 20:05:09 INFO BlockManager: Initialized BlockManager: > BlockManagerId(driver, jd, 32819, None) > {code} > > > > I'm using Spark core dependency in one of my Jars, the jar code contains the > following: > > *+Code:+* > {code:java} > SparkSession >
[jira] [Resolved] (SPARK-46766) ZSTD Buffer Pool Support For AVRO datasource
[ https://issues.apache.org/jira/browse/SPARK-46766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-46766. -- Fix Version/s: 4.0.0 Resolution: Fixed resolved by https://github.com/apache/spark/pull/44792 > ZSTD Buffer Pool Support For AVRO datasource > > > Key: SPARK-46766 > URL: https://issues.apache.org/jira/browse/SPARK-46766 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Assigned] (SPARK-46766) ZSTD Buffer Pool Support For AVRO datasource
[ https://issues.apache.org/jira/browse/SPARK-46766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-46766: Assignee: Kent Yao > ZSTD Buffer Pool Support For AVRO datasource > > > Key: SPARK-46766 > URL: https://issues.apache.org/jira/browse/SPARK-46766 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell
[ https://issues.apache.org/jira/browse/SPARK-47197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47197: --- Labels: pull-request-available (was: ) > Failed to connect HiveMetastore when using iceberg with HiveCatalog on > spark-sql or spark-shell > --- > > Key: SPARK-47197 > URL: https://issues.apache.org/jira/browse/SPARK-47197 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 3.2.3, 3.5.1 >Reporter: YUBI LEE >Priority: Major > Labels: pull-request-available > > I can't connect to kerberized HiveMetastore when using iceberg with > HiveCatalog on spark-sql or spark-shell. > I think this issue is caused by the fact that there is no way to get > HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell. > ([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83)] > > {code:java} > val currentToken = > UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias) > currentToken == null && UserGroupInformation.isSecurityEnabled && > hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty && > (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) > || > (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) > {code} > There should be a way to force to get HIVE_DELEGATION_TOKEN even when using > spark-sql or spark-shell. > Possible way is to get HIVE_DELEGATION_TOKEN if the configuration below is > set? > {code:java} > spark.security.credentials.hive.enabled true {code} > > {code:java} > 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) > (machine1.example.com executor 2): > org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive > Metastore > ... > Caused by: MetaException(message:Could not connect to meta store using any of > the URIs provided. 
Most recent failure: > org.apache.thrift.transport.TTransportException: GSS initiate failed {code} > > > {code:java} > spark-sql> select * from temp.test_hive_catalog; > ... > ... > 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) > (machine1.example.com executor 2): > org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive > Metastore > at > org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84) > at > org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34) > at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125) > at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56) > at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51) > at > org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122) > at > org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158) > at > org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97) > at > org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80) > at > org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47) > at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124) > at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111) > at > org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276) > at > org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86) > at > org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426) > at > org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456) > at > org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342) > at > 
org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342) > at > org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181) > at > org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:247) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:24
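The Scala guard quoted in the issue can be restated as a plain predicate to see why spark-sql/spark-shell never obtain the token. The sketch below is an illustrative Python re-statement only; every parameter is a hypothetical stand-in for the real Spark/Hadoop objects (`UserGroupInformation`, `SparkHadoopUtil`, etc.):

```python
def needs_hive_token(current_token, security_enabled, metastore_uris,
                     is_proxy_user, client_mode, has_keytab):
    # Mirrors the condition in HiveDelegationTokenProvider: fetch a token only
    # when none is cached, security is on, a metastore URI is configured, and
    # we are either a proxy user or a non-client-mode app without a keytab.
    return (current_token is None
            and security_enabled
            and metastore_uris.strip() != ""
            and (is_proxy_user or (not client_mode and not has_keytab)))

# spark-sql / spark-shell run in client mode, without a keytab and without a
# proxy user, so the guard is False and no HIVE_DELEGATION_TOKEN is fetched:
print(needs_hive_token(None, True, "thrift://metastore:9083",
                       False, True, False))  # -> False
```

This makes the reporter's point concrete: with kerberized Iceberg/HiveCatalog access, executors then fail with "GSS initiate failed" because no delegation token was ever distributed.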
[jira] [Assigned] (SPARK-47192) Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature)
[ https://issues.apache.org/jira/browse/SPARK-47192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-47192: Assignee: Serge Rielau > Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature) > -- > > Key: SPARK-47192 > URL: https://issues.apache.org/jira/browse/SPARK-47192 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Assignee: Serge Rielau >Priority: Major > Labels: pull-request-available > > Old: > > GRANT ROLE; > _LEGACY_ERROR_TEMP_0035 > Operation not allowed: grant role. (line 1, pos 0) > > New: > error class: HIVE_OPERATION_NOT_SUPPORTED > The Hive operation is not supported. (line 1, pos 0) > > sqlstate: 0A000
[jira] [Resolved] (SPARK-47192) Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature)
[ https://issues.apache.org/jira/browse/SPARK-47192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-47192. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45291 [https://github.com/apache/spark/pull/45291] > Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature) > -- > > Key: SPARK-47192 > URL: https://issues.apache.org/jira/browse/SPARK-47192 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Assignee: Serge Rielau >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Old: > > GRANT ROLE; > _LEGACY_ERROR_TEMP_0035 > Operation not allowed: grant role. (line 1, pos 0) > > New: > error class: HIVE_OPERATION_NOT_SUPPORTED > The Hive operation is not supported. (line 1, pos 0) > > sqlstate: 0A000
[jira] [Updated] (SPARK-47206) Add official image Dockerfile for Apache Spark 3.5.1
[ https://issues.apache.org/jira/browse/SPARK-47206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47206: --- Labels: pull-request-available (was: ) > Add official image Dockerfile for Apache Spark 3.5.1 > > > Key: SPARK-47206 > URL: https://issues.apache.org/jira/browse/SPARK-47206 > Project: Spark > Issue Type: Bug > Components: Spark Docker >Affects Versions: 3.5.1 >Reporter: Yikun Jiang >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-47206) Add official image Dockerfile for Apache Spark 3.5.1
[ https://issues.apache.org/jira/browse/SPARK-47206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang updated SPARK-47206: Summary: Add official image Dockerfile for Apache Spark 3.5.1 (was: [SPARK-45169] Add official image Dockerfile for Apache Spark 3.5.1) > Add official image Dockerfile for Apache Spark 3.5.1 > > > Key: SPARK-47206 > URL: https://issues.apache.org/jira/browse/SPARK-47206 > Project: Spark > Issue Type: Bug > Components: Spark Docker >Affects Versions: 3.5.1 >Reporter: Yikun Jiang >Priority: Major >
[jira] [Created] (SPARK-47206) [SPARK-45169] Add official image Dockerfile for Apache Spark 3.5.1
Yikun Jiang created SPARK-47206: --- Summary: [SPARK-45169] Add official image Dockerfile for Apache Spark 3.5.1 Key: SPARK-47206 URL: https://issues.apache.org/jira/browse/SPARK-47206 Project: Spark Issue Type: Bug Components: Spark Docker Affects Versions: 3.5.1 Reporter: Yikun Jiang
[jira] [Updated] (SPARK-47205) Upgrade docker-java to 3.3.5
[ https://issues.apache.org/jira/browse/SPARK-47205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47205: --- Labels: pull-request-available (was: ) > Upgrade docker-java to 3.3.5 > > > Key: SPARK-47205 > URL: https://issues.apache.org/jira/browse/SPARK-47205 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Docker >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-47155) Fix incorrect error class in create_data_source.py
[ https://issues.apache.org/jira/browse/SPARK-47155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47155: --- Labels: pull-request-available (was: ) > Fix incorrect error class in create_data_source.py > -- > > Key: SPARK-47155 > URL: https://issues.apache.org/jira/browse/SPARK-47155 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Chaoqin Li >Priority: Major > Labels: pull-request-available > > This is part of the effort of SPARK-44076, which enables Python users to develop > data sources in Python and makes them accessible to the wider Python community. > The error classes "PYTHON_DATA_SOURCE_CREATE_ERROR" and > "PYTHON_DATA_SOURCE_METHOD_NOT_IMPLEMENTED" used in create_data_source.py > don't exist, so an error-class-not-found error is thrown instead, which makes the > error message confusing to the customer. We need to add the corresponding error > classes to error-class.py or use existing error classes.
[jira] [Created] (SPARK-47205) Upgrade docker-java to 3.3.5
Kent Yao created SPARK-47205: Summary: Upgrade docker-java to 3.3.5 Key: SPARK-47205 URL: https://issues.apache.org/jira/browse/SPARK-47205 Project: Spark Issue Type: Dependency upgrade Components: Spark Docker Affects Versions: 4.0.0 Reporter: Kent Yao
[jira] [Comment Edited] (SPARK-21918) HiveClient shouldn't share Hive object between different thread
[ https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821516#comment-17821516 ] Goutam Ghosh edited comment on SPARK-21918 at 2/28/24 5:58 AM: --- Can the patch by [~angerszhuuu] be validated? was (Author: goutamghosh): Can the path by [~angerszhuuu] be validated ? > HiveClient shouldn't share Hive object between different thread > --- > > Key: SPARK-21918 > URL: https://issues.apache.org/jira/browse/SPARK-21918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hu Liu, >Priority: Major > Labels: bulk-closed > > I'm testing the spark thrift server and found that all the DDL statements are > run by user hive even if hive.server2.enable.doAs=true > The root cause is that the Hive object is shared between different threads in > HiveClientImpl > {code:java} > private def client: Hive = { > if (clientLoader.cachedHive != null) { > clientLoader.cachedHive.asInstanceOf[Hive] > } else { > val c = Hive.get(conf) > clientLoader.cachedHive = c > c > } > } > {code} > But in impersonation mode, we should share the Hive object only within the > thread so that the metastore client in Hive can be associated with the right > user. > We can fix this by passing the Hive object of the parent thread to the child > thread when running the SQL. > I already have an initial patch for review and I'm glad to work on it if > anyone could assign it to me. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread
[ https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821516#comment-17821516 ] Goutam Ghosh commented on SPARK-21918: -- Can the patch by [~angerszhuuu] be validated? > HiveClient shouldn't share Hive object between different thread > --- > > Key: SPARK-21918 > URL: https://issues.apache.org/jira/browse/SPARK-21918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hu Liu, >Priority: Major > Labels: bulk-closed > > I'm testing the spark thrift server and found that all the DDL statements are > run by user hive even if hive.server2.enable.doAs=true > The root cause is that the Hive object is shared between different threads in > HiveClientImpl > {code:java} > private def client: Hive = { > if (clientLoader.cachedHive != null) { > clientLoader.cachedHive.asInstanceOf[Hive] > } else { > val c = Hive.get(conf) > clientLoader.cachedHive = c > c > } > } > {code} > But in impersonation mode, we should share the Hive object only within the > thread so that the metastore client in Hive can be associated with the right > user. > We can fix this by passing the Hive object of the parent thread to the child > thread when running the SQL. > I already have an initial patch for review and I'm glad to work on it if > anyone could assign it to me. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
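The fix proposed in SPARK-21918 amounts to caching the client per thread rather than in a field shared by all threads. A minimal sketch of the idea using Python's `threading.local` (names are illustrative only; the real code is Scala inside HiveClientImpl, and the string stands in for a metastore connection):

```python
import threading

# Hedged sketch of per-thread caching: each thread gets its own cached
# "client", so in impersonation mode the metastore connection is bound to
# the calling user's thread instead of being shared process-wide.
class HiveClient:
    def __init__(self):
        self._local = threading.local()  # one slot per thread

    def client(self, user: str) -> str:
        if getattr(self._local, "hive", None) is None:
            # created lazily on first use within this thread
            self._local.hive = f"Hive(connection for {user})"
        return self._local.hive
```

With the original shared-field cache, whichever thread populated the cache first would leak its connection (and its user identity) to every other thread; thread-local storage removes that sharing.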
[jira] [Resolved] (SPARK-47191) avoid unnecessary relation lookup when uncaching table/view
[ https://issues.apache.org/jira/browse/SPARK-47191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47191. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45289 [https://github.com/apache/spark/pull/45289] > avoid unnecessary relation lookup when uncaching table/view > --- > > Key: SPARK-47191 > URL: https://issues.apache.org/jira/browse/SPARK-47191 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47191) avoid unnecessary relation lookup when uncaching table/view
[ https://issues.apache.org/jira/browse/SPARK-47191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-47191: --- Assignee: Wenchen Fan > avoid unnecessary relation lookup when uncaching table/view > --- > > Key: SPARK-47191 > URL: https://issues.apache.org/jira/browse/SPARK-47191 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47187) Fix hive compress output config does not work
[ https://issues.apache.org/jira/browse/SPARK-47187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-47187. --- Fix Version/s: 3.4.3 Resolution: Fixed Issue resolved by pull request 45286 [https://github.com/apache/spark/pull/45286] > Fix hive compress output config does not work > - > > Key: SPARK-47187 > URL: https://issues.apache.org/jira/browse/SPARK-47187 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.2 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 3.4.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47187) Fix hive compress output config does not work
[ https://issues.apache.org/jira/browse/SPARK-47187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-47187: - Assignee: XiDuo You > Fix hive compress output config does not work > - > > Key: SPARK-47187 > URL: https://issues.apache.org/jira/browse/SPARK-47187 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.2 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47202) AttributeError: module 'pandas' has no attribute 'Timstamp'
[ https://issues.apache.org/jira/browse/SPARK-47202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-47202: Assignee: Arzav Jain > AttributeError: module 'pandas' has no attribute 'Timstamp' > --- > > Key: SPARK-47202 > URL: https://issues.apache.org/jira/browse/SPARK-47202 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.1 >Reporter: Arzav Jain >Assignee: Arzav Jain >Priority: Minor > Labels: pull-request-available > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > When using the pyspark.sql.types.TimestampType, if your value is a > datetime.datetime object with a tzinfo, [this > typo|https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L996] > breaks things. > > I believe [this > commit|https://github.com/apache/spark/commit/46949e692e863992f4c50bdd482d5216d4fd9221] > introduced the bug 9 months ago > > Full stack trace below: > > {code:java} > File "/databricks/spark/python/pyspark/worker.py", line 1490, in main > process() File "/databricks/spark/python/pyspark/worker.py", line 1482, in > process serializer.dump_stream(out_iter, outfile) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 531, in > dump_stream return ArrowStreamSerializer.dump_stream( File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 107, in > dump_stream for batch in iterator: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 525, in > init_stream_yield_batches batch = self._create_batch(series) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 511, in > _create_batch arrs.append(self._create_array(s, t, > arrow_cast=self._arrow_cast)) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 284, in > _create_array series = conv(series) File > "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1060, in > return lambda pser: pser.apply( # type: ignore[return-value] 
File > "/databricks/python/lib/python3.10/site-packages/pandas/core/series.py", line > 4771, in apply return SeriesApply(self, func, convert_dtype, args, > kwargs).apply() File > "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line > 1123, in apply return self.apply_standard() File > "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line > 1174, in apply_standard mapped = lib.map_infer( File "pandas/_libs/lib.pyx", > line 2924, in pandas._libs.lib.map_infer File > "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1061, in > lambda x: conv(x) if x is not None else None # type: ignore[misc] > File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 889, in > convert_array return [ File > "/databricks/spark/python/pyspark/sql/pandas/types.py", line 890, in > _element_conv(v) if v is not None else None # type: ignore[misc] > File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1010, in > convert_struct return { File > "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1011, in > name: conv(v) if conv is not None and v is not None else v File > "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1032, in > convert_timestamp ts = pd.Timstamp(value) File > "/databricks/python/lib/python3.10/site-packages/pandas/__init__.py", line > 264, in __getattr__ raise AttributeError(f"module 'pandas' has no attribute > '{name}'") AttributeError: module 'pandas' has no attribute 'Timstamp' > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
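The fix for SPARK-47202 is the one-character change `pd.Timestamp(value)` in place of the misspelled `pd.Timstamp(value)`; the bug only triggers on tz-aware inputs because that is the branch containing the typo. A dependency-free sketch of the affected conversion step, using stdlib `datetime` as a stand-in for the pandas call (normalizing to UTC here is an assumption for illustration; the real converter localizes via pandas to the session timezone):

```python
from datetime import datetime, timezone

# Hedged stand-in for pyspark's convert_timestamp after the typo fix:
# tz-aware values are normalized and made naive; naive values pass through.
def convert_timestamp(value: datetime) -> datetime:
    if value.tzinfo is not None:
        # before the fix, this branch raised
        # AttributeError: module 'pandas' has no attribute 'Timstamp'
        return value.astimezone(timezone.utc).replace(tzinfo=None)
    return value
```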
[jira] [Resolved] (SPARK-47202) AttributeError: module 'pandas' has no attribute 'Timstamp'
[ https://issues.apache.org/jira/browse/SPARK-47202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47202. -- Fix Version/s: 3.5.2 4.0.0 Resolution: Fixed Issue resolved by pull request 45301 [https://github.com/apache/spark/pull/45301] > AttributeError: module 'pandas' has no attribute 'Timstamp' > --- > > Key: SPARK-47202 > URL: https://issues.apache.org/jira/browse/SPARK-47202 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.1 >Reporter: Arzav Jain >Assignee: Arzav Jain >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.2, 4.0.0 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > When using the pyspark.sql.types.TimestampType, if your value is a > datetime.datetime object with a tzinfo, [this > typo|https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L996] > breaks things. > > I believe [this > commit|https://github.com/apache/spark/commit/46949e692e863992f4c50bdd482d5216d4fd9221] > introduced the bug 9 months ago > > Full stack trace below: > > {code:java} > File "/databricks/spark/python/pyspark/worker.py", line 1490, in main > process() File "/databricks/spark/python/pyspark/worker.py", line 1482, in > process serializer.dump_stream(out_iter, outfile) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 531, in > dump_stream return ArrowStreamSerializer.dump_stream( File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 107, in > dump_stream for batch in iterator: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 525, in > init_stream_yield_batches batch = self._create_batch(series) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 511, in > _create_batch arrs.append(self._create_array(s, t, > arrow_cast=self._arrow_cast)) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 284, in > _create_array series = conv(series) File > 
"/databricks/spark/python/pyspark/sql/pandas/types.py", line 1060, in > return lambda pser: pser.apply( # type: ignore[return-value] File > "/databricks/python/lib/python3.10/site-packages/pandas/core/series.py", line > 4771, in apply return SeriesApply(self, func, convert_dtype, args, > kwargs).apply() File > "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line > 1123, in apply return self.apply_standard() File > "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line > 1174, in apply_standard mapped = lib.map_infer( File "pandas/_libs/lib.pyx", > line 2924, in pandas._libs.lib.map_infer File > "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1061, in > lambda x: conv(x) if x is not None else None # type: ignore[misc] > File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 889, in > convert_array return [ File > "/databricks/spark/python/pyspark/sql/pandas/types.py", line 890, in > _element_conv(v) if v is not None else None # type: ignore[misc] > File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1010, in > convert_struct return { File > "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1011, in > name: conv(v) if conv is not None and v is not None else v File > "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1032, in > convert_timestamp ts = pd.Timstamp(value) File > "/databricks/python/lib/python3.10/site-packages/pandas/__init__.py", line > 264, in __getattr__ raise AttributeError(f"module 'pandas' has no attribute > '{name}'") AttributeError: module 'pandas' has no attribute 'Timstamp' > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47144) Fix Spark Connect collation issue
[ https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47144. - Resolution: Fixed Issue resolved by pull request 45233 [https://github.com/apache/spark/pull/45233] > Fix Spark Connect collation issue > - > > Key: SPARK-47144 > URL: https://issues.apache.org/jira/browse/SPARK-47144 > Project: Spark > Issue Type: Bug > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Assignee: Nikola Mandic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > The collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" fails when > connecting to the server using Spark Connect: > {code:java} > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support > convert string(UCS_BASIC_LCASE) to connect proto types.{code} > When using the default collation "UCS_BASIC", the error does not occur. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47144) Fix Spark Connect collation issue
[ https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-47144: --- Assignee: Nikola Mandic > Fix Spark Connect collation issue > - > > Key: SPARK-47144 > URL: https://issues.apache.org/jira/browse/SPARK-47144 > Project: Spark > Issue Type: Bug > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Assignee: Nikola Mandic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > The collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" fails when > connecting to the server using Spark Connect: > {code:java} > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support > convert string(UCS_BASIC_LCASE) to connect proto types.{code} > When using the default collation "UCS_BASIC", the error does not occur. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47063) CAST long to timestamp has different behavior for codegen vs interpreted
[ https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-47063. -- Fix Version/s: 3.4.3 3.5.2 4.0.0 Resolution: Fixed Issue resolved by pull request 45294 [https://github.com/apache/spark/pull/45294] > CAST long to timestamp has different behavior for codegen vs interpreted > > > Key: SPARK-47063 > URL: https://issues.apache.org/jira/browse/SPARK-47063 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.2 >Reporter: Robert Joseph Evans >Assignee: Pablo Langa Blanco >Priority: Major > Labels: pull-request-available > Fix For: 3.4.3, 3.5.2, 4.0.0 > > > It probably impacts a lot more versions of the code than this, but I verified > it on 3.4.2. This also appears to be related to > https://issues.apache.org/jira/browse/SPARK-39209 > {code:java} > scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", > "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) > ++-++ > |v |ts |unix_micros(ts) | > ++-++ > |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 | > |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808| > |0 |1970-01-01 00:00:00 |0 | > |1990 |1970-01-01 00:33:10 |199000 | > ++-++ > scala> Seq(Long.MaxValue, Long.MinValue, 0L, > 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as > ts").selectExpr("*", "unix_micros(ts)").show(false) > ++---+---+ > |v |ts |unix_micros(ts)| > ++---+---+ > |9223372036854775807 |1969-12-31 23:59:59|-100 | > |-9223372036854775808|1970-01-01 00:00:00|0 | > |0 |1970-01-01 00:00:00|0 | > |1990 |1970-01-01 00:33:10|199000 | > ++---+---+ > {code} > It looks like InMemoryTableScanExec is not doing code generation for the > expressions, but the ProjectExec after the repartition is. > If I disable code gen I get the same answer in both cases. 
> {code:java} > scala> spark.conf.set("spark.sql.codegen.wholeStage", false) > scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN") > scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", > "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) > ++-++ > |v |ts |unix_micros(ts) | > ++-++ > |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 | > |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808| > |0 |1970-01-01 00:00:00 |0 | > |1990 |1970-01-01 00:33:10 |199000 | > ++-++ > scala> Seq(Long.MaxValue, Long.MinValue, 0L, > 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as > ts").selectExpr("*", "unix_micros(ts)").show(false) > ++-++ > |v |ts |unix_micros(ts) | > ++-++ > |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 | > |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808| > |0 |1970-01-01 00:00:00 |0 | > |1990 |1970-01-01 00:33:10 |199000 | > ++-++ > {code} > [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627] > Is the code used in codegen, but > [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687] > is what is used outside of code gen. > Apparently `SECONDS.toMicros` truncates the value on an overflow, but the > code
[jira] [Assigned] (SPARK-47063) CAST long to timestamp has different behavior for codegen vs interpreted
[ https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-47063: Assignee: Pablo Langa Blanco > CAST long to timestamp has different behavior for codegen vs interpreted > > > Key: SPARK-47063 > URL: https://issues.apache.org/jira/browse/SPARK-47063 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.2 >Reporter: Robert Joseph Evans >Assignee: Pablo Langa Blanco >Priority: Major > Labels: pull-request-available > > It probably impacts a lot more versions of the code than this, but I verified > it on 3.4.2. This also appears to be related to > https://issues.apache.org/jira/browse/SPARK-39209 > {code:java} > scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", > "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) > ++-++ > |v |ts |unix_micros(ts) | > ++-++ > |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 | > |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808| > |0 |1970-01-01 00:00:00 |0 | > |1990 |1970-01-01 00:33:10 |199000 | > ++-++ > scala> Seq(Long.MaxValue, Long.MinValue, 0L, > 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as > ts").selectExpr("*", "unix_micros(ts)").show(false) > ++---+---+ > |v |ts |unix_micros(ts)| > ++---+---+ > |9223372036854775807 |1969-12-31 23:59:59|-100 | > |-9223372036854775808|1970-01-01 00:00:00|0 | > |0 |1970-01-01 00:00:00|0 | > |1990 |1970-01-01 00:33:10|199000 | > ++---+---+ > {code} > It looks like InMemoryTableScanExec is not doing code generation for the > expressions, but the ProjectExec after the repartition is. > If I disable code gen I get the same answer in both cases. 
> {code:java} > scala> spark.conf.set("spark.sql.codegen.wholeStage", false) > scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN") > scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", > "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) > ++-++ > |v |ts |unix_micros(ts) | > ++-++ > |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 | > |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808| > |0 |1970-01-01 00:00:00 |0 | > |1990 |1970-01-01 00:33:10 |199000 | > ++-++ > scala> Seq(Long.MaxValue, Long.MinValue, 0L, > 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as > ts").selectExpr("*", "unix_micros(ts)").show(false) > ++-++ > |v |ts |unix_micros(ts) | > ++-++ > |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 | > |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808| > |0 |1970-01-01 00:00:00 |0 | > |1990 |1970-01-01 00:33:10 |199000 | > ++-++ > {code} > [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627] > Is the code used in codegen, but > [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687] > is what is used outside of code gen. > Apparently `SECONDS.toMicros` truncates the value on an overflow, but the > codegen does not. > {code:java} > scala> Long.MaxValue > res11: Long = 9223372036854775807 > scala> java.util.concurrent.TimeUnit.SECONDS.toMicros(Long.MaxValue) > res12: Long = 9223372036854775
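The divergence in SPARK-47063 comes down to overflow handling: `SECONDS.toMicros` saturates (clamps) at the Long range, while the unchecked 64-bit multiply emitted by codegen wraps around. A sketch of both behaviors (Python integers are arbitrary precision, so the Long semantics are modeled explicitly):

```python
# 64-bit signed Long bounds, as in the JVM
LONG_MAX = 2**63 - 1
LONG_MIN = -2**63

def to_micros_saturating(seconds: int) -> int:
    # mimics java.util.concurrent.TimeUnit.SECONDS.toMicros, which clamps
    # the product to the Long range on overflow (the interpreted path)
    product = seconds * 1_000_000
    return max(LONG_MIN, min(LONG_MAX, product))

def to_micros_wrapping(seconds: int) -> int:
    # mimics an unchecked 64-bit multiply, which wraps on overflow
    # (the behavior the codegen path exhibits)
    return (seconds * 1_000_000 + 2**63) % 2**64 - 2**63
```

For in-range inputs like 1990 the two agree; at the extremes they diverge exactly as the two `unix_micros` result tables above show.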
[jira] [Resolved] (SPARK-47204) [CORE] Check whether enabled checksum before delete checksum file
[ https://issues.apache.org/jira/browse/SPARK-47204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Binjie Yang resolved SPARK-47204. - Resolution: Not A Problem We already check whether the checksum file exists before trying to delete it. > [CORE] Check whether enabled checksum before delete checksum file > - > > Key: SPARK-47204 > URL: https://issues.apache.org/jira/browse/SPARK-47204 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Binjie Yang >Priority: Minor > Fix For: 4.0.0 > > > We should check whether the shuffle checksum feature is enabled before we try > to find and delete this file. Otherwise, we will see an "Error deleting > checksum" warning in the log. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47204) [CORE] Check whether enabled checksum before delete checksum file
Binjie Yang created SPARK-47204: --- Summary: [CORE] Check whether enabled checksum before delete checksum file Key: SPARK-47204 URL: https://issues.apache.org/jira/browse/SPARK-47204 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.5.0 Reporter: Binjie Yang Fix For: 4.0.0 We should check whether the shuffle checksum feature is enabled before we try to find and delete this file. Otherwise, we will see an "Error deleting checksum" warning in the log. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
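The eventual resolution of SPARK-47204 ("Not A Problem") notes that deletion is already guarded by an existence check. A minimal sketch of such a guard, combining the feature-flag check the ticket asked for with the existence check the resolution cites (illustrative only, not Spark's actual shuffle cleanup code):

```python
import os

# Hedged sketch: skip deletion when the checksum feature is off or the file
# was never written, so no "Error deleting checksum" warning is logged for
# a file that does not exist.
def delete_checksum_file(path: str, checksum_enabled: bool) -> bool:
    if not checksum_enabled:
        return False          # feature off: no checksum file expected
    if not os.path.exists(path):
        return False          # nothing to delete, nothing to warn about
    os.remove(path)
    return True
```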
[jira] [Updated] (SPARK-47202) AttributeError: module 'pandas' has no attribute 'Timstamp'
[ https://issues.apache.org/jira/browse/SPARK-47202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47202: --- Labels: pull-request-available (was: ) > AttributeError: module 'pandas' has no attribute 'Timstamp' > --- > > Key: SPARK-47202 > URL: https://issues.apache.org/jira/browse/SPARK-47202 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.1 >Reporter: Arzav Jain >Priority: Minor > Labels: pull-request-available > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > When using the pyspark.sql.types.TimestampType, if your value is a > datetime.datetime object with a tzinfo, [this > typo|https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L996] > breaks things. > > I believe [this > commit|https://github.com/apache/spark/commit/46949e692e863992f4c50bdd482d5216d4fd9221] > introduced the bug 9 months ago > > Full stack trace below: > > {code:java} > File "/databricks/spark/python/pyspark/worker.py", line 1490, in main > process() File "/databricks/spark/python/pyspark/worker.py", line 1482, in > process serializer.dump_stream(out_iter, outfile) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 531, in > dump_stream return ArrowStreamSerializer.dump_stream( File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 107, in > dump_stream for batch in iterator: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 525, in > init_stream_yield_batches batch = self._create_batch(series) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 511, in > _create_batch arrs.append(self._create_array(s, t, > arrow_cast=self._arrow_cast)) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 284, in > _create_array series = conv(series) File > "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1060, in > return lambda pser: pser.apply( # type: ignore[return-value] 
File > "/databricks/python/lib/python3.10/site-packages/pandas/core/series.py", line > 4771, in apply return SeriesApply(self, func, convert_dtype, args, > kwargs).apply() File > "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line > 1123, in apply return self.apply_standard() File > "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line > 1174, in apply_standard mapped = lib.map_infer( File "pandas/_libs/lib.pyx", > line 2924, in pandas._libs.lib.map_infer File > "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1061, in > lambda x: conv(x) if x is not None else None # type: ignore[misc] > File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 889, in > convert_array return [ File > "/databricks/spark/python/pyspark/sql/pandas/types.py", line 890, in > _element_conv(v) if v is not None else None # type: ignore[misc] > File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1010, in > convert_struct return { File > "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1011, in > name: conv(v) if conv is not None and v is not None else v File > "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1032, in > convert_timestamp ts = pd.Timstamp(value) File > "/databricks/python/lib/python3.10/site-packages/pandas/__init__.py", line > 264, in __getattr__ raise AttributeError(f"module 'pandas' has no attribute > '{name}'") AttributeError: module 'pandas' has no attribute 'Timstamp' > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47202) AttributeError: module 'pandas' has no attribute 'Timstamp'
[ https://issues.apache.org/jira/browse/SPARK-47202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arzav Jain updated SPARK-47202: --- Description: When using the pyspark.sql.types.TimestampType, if your value is a datetime.datetime object with a tzinfo, [this typo|https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L996] breaks things. I believe [this commit|https://github.com/apache/spark/commit/46949e692e863992f4c50bdd482d5216d4fd9221] introduced the bug 9 months ago Full stack trace below: {code:java} File "/databricks/spark/python/pyspark/worker.py", line 1490, in main process() File "/databricks/spark/python/pyspark/worker.py", line 1482, in process serializer.dump_stream(out_iter, outfile) File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 531, in dump_stream return ArrowStreamSerializer.dump_stream( File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 107, in dump_stream for batch in iterator: File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 525, in init_stream_yield_batches batch = self._create_batch(series) File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 511, in _create_batch arrs.append(self._create_array(s, t, arrow_cast=self._arrow_cast)) File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 284, in _create_array series = conv(series) File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1060, in return lambda pser: pser.apply( # type: ignore[return-value] File "/databricks/python/lib/python3.10/site-packages/pandas/core/series.py", line 4771, in apply return SeriesApply(self, func, convert_dtype, args, kwargs).apply() File "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 1123, in apply return self.apply_standard() File "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 1174, in apply_standard mapped = lib.map_infer( File 
"pandas/_libs/lib.pyx", line 2924, in pandas._libs.lib.map_infer File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1061, in lambda x: conv(x) if x is not None else None # type: ignore[misc] File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 889, in convert_array return [ File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 890, in _element_conv(v) if v is not None else None # type: ignore[misc] File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1010, in convert_struct return { File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1011, in name: conv(v) if conv is not None and v is not None else v File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1032, in convert_timestamp ts = pd.Timstamp(value) File "/databricks/python/lib/python3.10/site-packages/pandas/__init__.py", line 264, in __getattr__ raise AttributeError(f"module 'pandas' has no attribute '{name}'") AttributeError: module 'pandas' has no attribute 'Timstamp' {code} was: When using the pyspark.sql.types.TimestampType, if your value is a datetime.datetime object with a tzinfo, [this typo|https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L996] breaks things. 
Full stack trace below: {code:java} File "/databricks/spark/python/pyspark/worker.py", line 1490, in main process() File "/databricks/spark/python/pyspark/worker.py", line 1482, in process serializer.dump_stream(out_iter, outfile) File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 531, in dump_stream return ArrowStreamSerializer.dump_stream( File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 107, in dump_stream for batch in iterator: File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 525, in init_stream_yield_batches batch = self._create_batch(series) File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 511, in _create_batch arrs.append(self._create_array(s, t, arrow_cast=self._arrow_cast)) File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 284, in _create_array series = conv(series) File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1060, in return lambda pser: pser.apply( # type: ignore[return-value] File "/databricks/python/lib/python3.10/site-packages/pandas/core/series.py", line 4771, in apply return SeriesApply(self, func, convert_dtype, args, kwargs).apply() File "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 1123, in apply return self.apply_standard() File "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 1174, in apply_standard mapped = lib.map_infer( File "pandas/_libs/lib.pyx", line 2924, in pandas._libs.lib.map_infer File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1061, in lambda x: conv(x) if
[jira] [Assigned] (SPARK-47201) sameSemantics check input types
[ https://issues.apache.org/jira/browse/SPARK-47201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-47201: - Assignee: Ruifeng Zheng > sameSemantics check input types > --- > > Key: SPARK-47201 > URL: https://issues.apache.org/jira/browse/SPARK-47201 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47201) sameSemantics check input types
[ https://issues.apache.org/jira/browse/SPARK-47201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-47201. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45300 [https://github.com/apache/spark/pull/45300] > sameSemantics check input types > --- > > Key: SPARK-47201 > URL: https://issues.apache.org/jira/browse/SPARK-47201 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47202) AttributeError: module 'pandas' has no attribute 'Timstamp'
Arzav Jain created SPARK-47202: -- Summary: AttributeError: module 'pandas' has no attribute 'Timstamp' Key: SPARK-47202 URL: https://issues.apache.org/jira/browse/SPARK-47202 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.5.1 Reporter: Arzav Jain When using the pyspark.sql.types.TimestampType, if your value is a datetime.datetime object with a tzinfo, [this typo|https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/types.py#L996] breaks things. Full stack trace below: {code:java} File "/databricks/spark/python/pyspark/worker.py", line 1490, in main process() File "/databricks/spark/python/pyspark/worker.py", line 1482, in process serializer.dump_stream(out_iter, outfile) File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 531, in dump_stream return ArrowStreamSerializer.dump_stream( File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 107, in dump_stream for batch in iterator: File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 525, in init_stream_yield_batches batch = self._create_batch(series) File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 511, in _create_batch arrs.append(self._create_array(s, t, arrow_cast=self._arrow_cast)) File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 284, in _create_array series = conv(series) File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1060, in return lambda pser: pser.apply( # type: ignore[return-value] File "/databricks/python/lib/python3.10/site-packages/pandas/core/series.py", line 4771, in apply return SeriesApply(self, func, convert_dtype, args, kwargs).apply() File "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 1123, in apply return self.apply_standard() File "/databricks/python/lib/python3.10/site-packages/pandas/core/apply.py", line 1174, in apply_standard mapped = lib.map_infer( File "pandas/_libs/lib.pyx", line 2924, in 
pandas._libs.lib.map_infer File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1061, in lambda x: conv(x) if x is not None else None # type: ignore[misc] File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 889, in convert_array return [ File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 890, in _element_conv(v) if v is not None else None # type: ignore[misc] File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1010, in convert_struct return { File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1011, in name: conv(v) if conv is not None and v is not None else v File "/databricks/spark/python/pyspark/sql/pandas/types.py", line 1032, in convert_timestamp ts = pd.Timstamp(value) File "/databricks/python/lib/python3.10/site-packages/pandas/__init__.py", line 264, in __getattr__ raise AttributeError(f"module 'pandas' has no attribute '{name}'") AttributeError: module 'pandas' has no attribute 'Timstamp' {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
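The failure is easy to reproduce outside Spark, since pandas resolves unknown module attributes through a module-level `__getattr__` that raises `AttributeError`. A minimal sketch (plain pandas, no PySpark required):

```python
import datetime
import pandas as pd

# A tz-aware datetime, the kind of value that reaches convert_timestamp.
value = datetime.datetime(2024, 2, 27, tzinfo=datetime.timezone.utc)

# The typo'd call pd.Timstamp (missing the "e") hits pandas' module-level
# __getattr__, which raises AttributeError for unknown names.
try:
    pd.Timstamp(value)
except AttributeError as exc:
    err = str(exc)

# The intended call, pd.Timestamp, accepts tz-aware datetimes directly.
ts = pd.Timestamp(value)
print(err)
```

This also explains why the bug only surfaces for tz-aware values: only that branch of the converter reaches the misspelled call.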
[jira] [Updated] (SPARK-47201) sameSemantics check input types
[ https://issues.apache.org/jira/browse/SPARK-47201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47201: --- Labels: pull-request-available (was: ) > sameSemantics check input types > --- > > Key: SPARK-47201 > URL: https://issues.apache.org/jira/browse/SPARK-47201 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47201) sameSemantics check input types
Ruifeng Zheng created SPARK-47201: - Summary: sameSemantics check input types Key: SPARK-47201 URL: https://issues.apache.org/jira/browse/SPARK-47201 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
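The improvement amounts to validating the argument type up front. A hypothetical sketch of the idea, using a minimal stand-in class rather than PySpark's actual DataFrame:

```python
class DataFrame:
    """Minimal stand-in for pyspark.sql.DataFrame, for illustration only."""

    def __init__(self, plan):
        self._plan = plan

    def sameSemantics(self, other):
        # Check the input type eagerly with a clear error message instead
        # of failing later, deep inside the plan comparison.
        if not isinstance(other, DataFrame):
            raise TypeError(
                f"`other` should be a DataFrame, got {type(other).__name__}"
            )
        # Simplified stand-in for the real logical-plan comparison.
        return self._plan == other._plan


df1, df2 = DataFrame("scan(t1)"), DataFrame("scan(t1)")
print(df1.sameSemantics(df2))
try:
    df1.sameSemantics("not a df")  # rejected before plan comparison
except TypeError as exc:
    print(exc)
```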
[jira] [Resolved] (SPARK-47186) Improve the debuggability for docker integration test
[ https://issues.apache.org/jira/browse/SPARK-47186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-47186. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45284 [https://github.com/apache/spark/pull/45284] > Improve the debuggability for docker integration test > - > > Key: SPARK-47186 > URL: https://issues.apache.org/jira/browse/SPARK-47186 > Project: Spark > Issue Type: Test > Components: Spark Docker >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47196) Fix `core` module to succeed SBT tests
[ https://issues.apache.org/jira/browse/SPARK-47196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47196. --- Fix Version/s: 3.4.3 Resolution: Fixed Issue resolved by pull request 45295 [https://github.com/apache/spark/pull/45295] > Fix `core` module to succeed SBT tests > -- > > Key: SPARK-47196 > URL: https://issues.apache.org/jira/browse/SPARK-47196 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.2, 3.4.0, 3.4.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 3.4.3 > > > This happens at branch-3.4 only. branch-3.3/branch-3.5/master are okay. > {code:java} > $ build/sbt "core/testOnly *.DAGSchedulerSuite" > [info] DAGSchedulerSuite: > [info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** > (439 milliseconds) > [info] java.lang.IllegalStateException: Could not initialize plugin: > interface org.mockito.plugins.MockMaker (alternate: null) > ... > [info] *** 1 SUITE ABORTED *** > [info] *** 118 TESTS FAILED *** > [error] Error during tests: > [error] org.apache.spark.scheduler.DAGSchedulerSuite > [error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful > [error] Total time: 48 s, completed Feb 27, 2024, 1:26:27 PM {code} > > MAVEN > {code:java} > $ build/mvn dependency:tree -pl core | grep byte-buddy > ... > [INFO] | +- net.bytebuddy:byte-buddy:jar:1.12.10:test > [INFO] | +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test > {code} > SBT > {code:java} > $ build/sbt "core/test:dependencyTree" | grep byte-buddy > [info] | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18) > [info] | | | | +-net.bytebuddy:byte-buddy:1.12.18 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
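One generic way to address this kind of SBT-only eviction (a sketch of the technique, not necessarily the fix adopted in PR 45295) is to pin the byte-buddy artifacts in build.sbt so SBT resolves the same versions Maven does:

```scala
// build.sbt sketch (hypothetical): pin byte-buddy so SBT's
// newest-wins eviction cannot pull the test classpath away from the
// version Mockito's inline MockMaker expects.
dependencyOverrides ++= Seq(
  "net.bytebuddy" % "byte-buddy" % "1.12.10",
  "net.bytebuddy" % "byte-buddy-agent" % "1.12.10"
)
```

The symptom (`Could not initialize plugin: interface org.mockito.plugins.MockMaker`) is typical when byte-buddy and byte-buddy-agent end up at mismatched versions on the test classpath.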
[jira] [Assigned] (SPARK-47196) Fix `core` module to succeed SBT tests
[ https://issues.apache.org/jira/browse/SPARK-47196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47196: - Assignee: Dongjoon Hyun > Fix `core` module to succeed SBT tests > -- > > Key: SPARK-47196 > URL: https://issues.apache.org/jira/browse/SPARK-47196 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.2, 3.4.0, 3.4.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > This happens at branch-3.4 only. branch-3.3/branch-3.5/master are okay. > {code:java} > $ build/sbt "core/testOnly *.DAGSchedulerSuite" > [info] DAGSchedulerSuite: > [info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** > (439 milliseconds) > [info] java.lang.IllegalStateException: Could not initialize plugin: > interface org.mockito.plugins.MockMaker (alternate: null) > ... > [info] *** 1 SUITE ABORTED *** > [info] *** 118 TESTS FAILED *** > [error] Error during tests: > [error] org.apache.spark.scheduler.DAGSchedulerSuite > [error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful > [error] Total time: 48 s, completed Feb 27, 2024, 1:26:27 PM {code} > > MAVEN > {code:java} > $ build/mvn dependency:tree -pl core | grep byte-buddy > ... > [INFO] | +- net.bytebuddy:byte-buddy:jar:1.12.10:test > [INFO] | +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test > {code} > SBT > {code:java} > $ build/sbt "core/test:dependencyTree" | grep byte-buddy > [info] | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18) > [info] | | | | +-net.bytebuddy:byte-buddy:1.12.18 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47200) Handle and classify errors from ForEachBatchSink user function
[ https://issues.apache.org/jira/browse/SPARK-47200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47200: --- Labels: pull-request-available (was: ) > Handle and classify errors from ForEachBatchSink user function > -- > > Key: SPARK-47200 > URL: https://issues.apache.org/jira/browse/SPARK-47200 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: B. Micheal Okutubo >Priority: Major > Labels: pull-request-available > > Any exception can be thrown from the user provided function for > ForEachBatchSink. We want to classify this class of errors. Including errors > from Python and Scala functions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47200) Handle and classify errors from ForEachBatchSink user function
B. Micheal Okutubo created SPARK-47200: -- Summary: Handle and classify errors from ForEachBatchSink user function Key: SPARK-47200 URL: https://issues.apache.org/jira/browse/SPARK-47200 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 4.0.0 Reporter: B. Micheal Okutubo Any exception can be thrown from the user-provided function for ForEachBatchSink. We want to classify this class of errors, including errors from Python and Scala functions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
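A hypothetical sketch of the classification idea (the names are illustrative, not Spark's actual API): exceptions escaping the user-supplied foreachBatch function are wrapped in a dedicated error type, so the engine can tell user-code failures apart from internal ones.

```python
class ForeachBatchUserFuncException(Exception):
    """Classified error: raised by the user function given to foreachBatch."""


def run_user_batch_func(func, batch_df, batch_id):
    """Invoke the user function, wrapping any failure for classification."""
    try:
        return func(batch_df, batch_id)
    except Exception as exc:
        # Preserve the original error as the cause so the root failure
        # (Python or, via Py4J, Scala) stays visible in the chain.
        raise ForeachBatchUserFuncException(
            f"user foreachBatch function failed for batch {batch_id}"
        ) from exc


def bad_func(df, batch_id):
    raise ValueError("boom")

try:
    run_user_batch_func(bad_func, None, 7)
except ForeachBatchUserFuncException as exc:
    print(exc)
    print(type(exc.__cause__).__name__)
```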
[jira] [Updated] (SPARK-47199) Add prefix into TemporaryDirectory to avoid flakiness
[ https://issues.apache.org/jira/browse/SPARK-47199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47199: --- Labels: pull-request-available (was: ) > Add prefix into TemporaryDirectory to avoid flakiness > - > > Key: SPARK-47199 > URL: https://issues.apache.org/jira/browse/SPARK-47199 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > > Sometimes the tests fail because the temporary directory names are the same > (https://github.com/apache/spark/actions/runs/8066850485/job/22036007390). > {code} > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line ?, in > pyspark.sql.dataframe.DataFrame.writeStream > Failed example: > with tempfile.TemporaryDirectory() as d: > # Create a table with Rate source. > df.writeStream.toTable( > "my_table", checkpointLocation=d) > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.11/doctest.py", line 1353, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > with tempfile.TemporaryDirectory() as d: > File "/usr/lib/python3.11/tempfile.py", line 1043, in __exit__ > self.cleanup() > File "/usr/lib/python3.11/tempfile.py", line 1047, in cleanup > self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors) > File "/usr/lib/python3.11/tempfile.py", line 1029, in _rmtree > _rmtree(name, onerror=onerror) > File "/usr/lib/python3.11/shutil.py", line 738, in rmtree > onerror(os.rmdir, path, sys.exc_info()) > File "/usr/lib/python3.11/shutil.py", line 736, in rmtree > os.rmdir(path, dir_fd=dir_fd) > OSError: [Errno 39] Directory not empty: > '/__w/spark/spark/python/target/4f062b09-213f-4ac2-a10a-2d704990141b/tmp29irqweq' > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47199) Add prefix into TemporaryDirectory to avoid flakiness
Hyukjin Kwon created SPARK-47199: Summary: Add prefix into TemporaryDirectory to avoid flakiness Key: SPARK-47199 URL: https://issues.apache.org/jira/browse/SPARK-47199 Project: Spark Issue Type: Test Components: PySpark, Tests Affects Versions: 4.0.0 Reporter: Hyukjin Kwon Sometimes the tests fail because the temporary directory names are the same (https://github.com/apache/spark/actions/runs/8066850485/job/22036007390). {code} File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line ?, in pyspark.sql.dataframe.DataFrame.writeStream Failed example: with tempfile.TemporaryDirectory() as d: # Create a table with Rate source. df.writeStream.toTable( "my_table", checkpointLocation=d) Exception raised: Traceback (most recent call last): File "/usr/lib/python3.11/doctest.py", line 1353, in __run exec(compile(example.source, filename, "single", File "", line 1, in with tempfile.TemporaryDirectory() as d: File "/usr/lib/python3.11/tempfile.py", line 1043, in __exit__ self.cleanup() File "/usr/lib/python3.11/tempfile.py", line 1047, in cleanup self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors) File "/usr/lib/python3.11/tempfile.py", line 1029, in _rmtree _rmtree(name, onerror=onerror) File "/usr/lib/python3.11/shutil.py", line 738, in rmtree onerror(os.rmdir, path, sys.exc_info()) File "/usr/lib/python3.11/shutil.py", line 736, in rmtree os.rmdir(path, dir_fd=dir_fd) OSError: [Errno 39] Directory not empty: '/__w/spark/spark/python/target/4f062b09-213f-4ac2-a10a-2d704990141b/tmp29irqweq' {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
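The proposed change is mechanical: pass a test-specific prefix to `tempfile.TemporaryDirectory` so concurrently created directories get distinguishable names and failed cleanups are easy to attribute. For example:

```python
import os
import tempfile

# A descriptive prefix (the name here is illustrative) makes each test's
# temporary directory identifiable and reduces the chance of two tests
# racing on the same path during cleanup.
with tempfile.TemporaryDirectory(prefix="writeStream_toTable_") as d:
    print(os.path.basename(d))  # e.g. writeStream_toTable_29irqweq

# Outside the with-block the directory has been removed.
assert not os.path.exists(d)
```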
[jira] [Updated] (SPARK-47198) Is it possible to dynamically add backend service to ingress with Kubernetes?
[ https://issues.apache.org/jira/browse/SPARK-47198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] melin updated SPARK-47198: -- Description: spark on k8s runs multiple spark apps at the same time. proxy/[sparkappid] path, forwarding to different sparkapp ui console based on sparkappid. spark apps are dynamically added and decreased. ingress Dynamically adds spark svc. [sparkappid]_svc == spark svc name [https://matthewpalmer.net/kubernetes-app-developer/articles/kubernetes-ingress-guide-nginx-example.html] was: spark on k8s runs multiple spark apps at the same time. proxy/[sparkappid] path, forwarding to different sparkapp ui console based on sparkappid. spark apps are dynamically added and decreased. ingress Dynamically adds spark svc. sparkappid == spark svc name [https://matthewpalmer.net/kubernetes-app-developer/articles/kubernetes-ingress-guide-nginx-example.html] > Is it possible to dynamically add backend service to ingress with Kubernetes? > - > > Key: SPARK-47198 > URL: https://issues.apache.org/jira/browse/SPARK-47198 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: melin >Priority: Major > > spark on k8s runs multiple spark apps at the same time. proxy/[sparkappid] > path, forwarding to different sparkapp ui console based on sparkappid. spark > apps are dynamically added and decreased. ingress Dynamically adds spark svc. > [sparkappid]_svc == spark svc name > [https://matthewpalmer.net/kubernetes-app-developer/articles/kubernetes-ingress-guide-nginx-example.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47198) Is it possible to dynamically add backend service to ingress with Kubernetes?
melin created SPARK-47198: - Summary: Is it possible to dynamically add backend service to ingress with Kubernetes? Key: SPARK-47198 URL: https://issues.apache.org/jira/browse/SPARK-47198 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 4.0.0 Reporter: melin Spark on K8s runs multiple Spark apps at the same time. A proxy/[sparkappid] path forwards to each Spark app's UI console based on the sparkappid. Spark apps are added and removed dynamically, so the ingress needs to add Spark services dynamically. sparkappid == spark svc name [https://matthewpalmer.net/kubernetes-app-developer/articles/kubernetes-ingress-guide-nginx-example.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47114) In the spark driver pod. Failed to access the krb5 file
[ https://issues.apache.org/jira/browse/SPARK-47114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] melin resolved SPARK-47114. --- Resolution: Resolved > In the spark driver pod. Failed to access the krb5 file > --- > > Key: SPARK-47114 > URL: https://issues.apache.org/jira/browse/SPARK-47114 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.4.1 >Reporter: melin >Priority: Major > > spark runs in kubernetes and accesses an external hdfs cluster (kerberos),pod > error logs > {code:java} > Caused by: java.lang.IllegalArgumentException: KrbException: krb5.conf > loading failed{code} > This error generally occurs when the krb5 file cannot be found > [~yao] [~Qin Yao] > {code:java} > ./bin/spark-submit \ > --master k8s://https://172.18.5.44:6443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.submission.waitAppCompletion=true \ > --conf spark.kubernetes.driver.pod.name=spark-xxx \ > --conf spark.kubernetes.executor.podNamePrefix=spark-executor-xxx \ > --conf spark.kubernetes.driver.label.profile=production \ > --conf spark.kubernetes.executor.label.profile=production \ > --conf spark.kubernetes.namespace=superior \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf > spark.kubernetes.container.image=registry.cn-hangzhou.aliyuncs.com/melin1204/spark-jobserver:3.4.0 > \ > --conf > spark.kubernetes.file.upload.path=hdfs://cdh1:8020/user/superior/kubernetes/ \ > --conf spark.kubernetes.container.image.pullPolicy=Always \ > --conf spark.kubernetes.container.image.pullSecrets=docker-reg-demos \ > --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \ > --conf spark.kerberos.principal=superior/ad...@datacyber.com \ > --conf spark.kerberos.keytab=/root/superior.keytab \ > > file:///root/spark-3.4.2-bin-hadoop3/examples/jars/spark-examples_2.12-3.4.2.jar > 5{code} > 
{code:java} > (base) [root@cdh1 ~]# kubectl logs spark-xxx -n superior > Exception in thread "main" java.lang.IllegalArgumentException: Can't get > Kerberos realm > at > org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:71) > at > org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:315) > at > org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:300) > at > org.apache.hadoop.security.UserGroupInformation.isAuthenticationMethodEnabled(UserGroupInformation.java:395) > at > org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:389) > at > org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:1119) > at > org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:385) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.IllegalArgumentException: KrbException: krb5.conf > loading failed > at > java.security.jgss/javax.security.auth.kerberos.KerberosPrincipal.(Unknown > Source) > at > org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:120) > at > org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:69) > ... 
13 more > (base) [root@cdh1 ~]# kubectl describe pod spark-xxx -n superior > Name: spark-xxx > Namespace: superior > Priority: 0 > Service Account: spark > Node: cdh2/172.18.5.45 > Start Time: Wed, 21 Feb 2024 15:48:08 +0800 > Labels: profile=production > spark-app-name=spark-pi > spark-app-selector=spark-728e24e49f9040fa86b04c521463020b > spark-role=driver > spark-version=3.4.2 > Annotations: > Status: Failed > IP: 10.244.1.4 > IPs: > IP: 10.244.1.4 > Containers: > spark-kubernetes-driver: > Container ID: > containerd://cceaf13b70cc5f21a639e71cb8663989ec73e
[jira] [Commented] (SPARK-47114) In the spark driver pod. Failed to access the krb5 file
[ https://issues.apache.org/jira/browse/SPARK-47114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821469#comment-17821469 ] melin commented on SPARK-47114: --- The default JRE 17 image does not support Kerberos; switch to a JDK image. > In the spark driver pod. Failed to access the krb5 file > --- > > Key: SPARK-47114 > URL: https://issues.apache.org/jira/browse/SPARK-47114 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.4.1 >Reporter: melin >Priority: Major > > spark runs in kubernetes and accesses an external hdfs cluster (kerberos),pod > error logs > {code:java} > Caused by: java.lang.IllegalArgumentException: KrbException: krb5.conf > loading failed{code} > This error generally occurs when the krb5 file cannot be found > [~yao] [~Qin Yao] > {code:java} > ./bin/spark-submit \ > --master k8s://https://172.18.5.44:6443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.submission.waitAppCompletion=true \ > --conf spark.kubernetes.driver.pod.name=spark-xxx \ > --conf spark.kubernetes.executor.podNamePrefix=spark-executor-xxx \ > --conf spark.kubernetes.driver.label.profile=production \ > --conf spark.kubernetes.executor.label.profile=production \ > --conf spark.kubernetes.namespace=superior \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf > spark.kubernetes.container.image=registry.cn-hangzhou.aliyuncs.com/melin1204/spark-jobserver:3.4.0 > \ > --conf > spark.kubernetes.file.upload.path=hdfs://cdh1:8020/user/superior/kubernetes/ \ > --conf spark.kubernetes.container.image.pullPolicy=Always \ > --conf spark.kubernetes.container.image.pullSecrets=docker-reg-demos \ > --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \ > --conf spark.kerberos.principal=superior/ad...@datacyber.com \ > --conf spark.kerberos.keytab=/root/superior.keytab \ > >
file:///root/spark-3.4.2-bin-hadoop3/examples/jars/spark-examples_2.12-3.4.2.jar > 5{code} > {code:java} > (base) [root@cdh1 ~]# kubectl logs spark-xxx -n superior > Exception in thread "main" java.lang.IllegalArgumentException: Can't get > Kerberos realm > at > org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:71) > at > org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:315) > at > org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:300) > at > org.apache.hadoop.security.UserGroupInformation.isAuthenticationMethodEnabled(UserGroupInformation.java:395) > at > org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:389) > at > org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:1119) > at > org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:385) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.IllegalArgumentException: KrbException: krb5.conf > loading failed > at > java.security.jgss/javax.security.auth.kerberos.KerberosPrincipal.(Unknown > Source) > at > org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:120) > at > org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:69) > ... 
13 more > (base) [root@cdh1 ~]# kubectl describe pod spark-xxx -n superior > Name: spark-xxx > Namespace: superior > Priority: 0 > Service Account: spark > Node: cdh2/172.18.5.45 > Start Time: Wed, 21 Feb 2024 15:48:08 +0800 > Labels: profile=production > spark-app-name=spark-pi > spark-app-selector=spark-728e24e49f9040fa86b04c521463020b > spark-role=driver > spark-version=3.4.2 > Annotations: > Status: Failed > IP: 10.244.1.4 > IPs: > IP: 10.244.1.4 > Containers: > spark-kubernetes-driver: > Container
[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell
[ https://issues.apache.org/jira/browse/SPARK-47197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YUBI LEE updated SPARK-47197: - Description: I can't connect to kerberized HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell. I think this issue is caused by the fact that there is no way to get HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell. ([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83)] {code:java} val currentToken = UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias) currentToken == null && UserGroupInformation.isSecurityEnabled && hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty && (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) || (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) {code} There should be a way to force to get HIVE_DELEGATION_TOKEN even when using spark-sql or spark-shell. Possible way is to get HIVE_DELEGATION_TOKEN if the configuration below is set? {code:java} spark.security.credentials.hive.enabled true {code} {code:java} 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) (machine1.example.com executor 2): org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore ... Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: GSS initiate failed {code} {code:java} spark-sql> select * from temp.test_hive_catalog; ... ... 
24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) (machine1.example.com executor 2): org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84) at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34) at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125) at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56) at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51) at org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122) at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158) at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97) at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80) at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111) at org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276) at org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86) at org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456) at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342) at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342) at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181) at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181) at scala.Option.foreach(Option.scala:407) at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178) at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:247) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apach
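The gating condition quoted in the report can be modeled as a small predicate. The sketch below is illustrative only: the class, method, and parameter names are hypothetical and are not Spark's actual `HiveDelegationTokenProvider` API; the `hiveCredentialsEnabled` argument models the proposed `spark.security.credentials.hive.enabled` opt-in that would force token retrieval for client-mode spark-sql / spark-shell sessions.

```java
// Hypothetical sketch of HiveDelegationTokenProvider's gating logic; the
// method and parameter names below are illustrative, not Spark's actual API.
public class TokenGateSketch {

    // Mirrors the quoted condition, plus a proposed explicit opt-in: a Boolean
    // modeling `spark.security.credentials.hive.enabled` that, when true,
    // forces token retrieval even in client mode (spark-sql / spark-shell).
    static boolean tokensRequired(boolean hasCurrentToken,
                                  boolean securityEnabled,
                                  String metastoreUris,
                                  boolean isProxyUser,
                                  boolean clientMode,
                                  boolean hasKeytab,
                                  Boolean hiveCredentialsEnabled) {
        boolean base = !hasCurrentToken
                && securityEnabled
                && metastoreUris != null && !metastoreUris.trim().isEmpty();
        if (!base) return false;
        // Proposed override: honor an explicit opt-in regardless of deploy mode.
        if (Boolean.TRUE.equals(hiveCredentialsEnabled)) return true;
        // Current behavior: only proxy users or cluster mode without a keytab qualify,
        // which is why client-mode shells never obtain a HIVE_DELEGATION_TOKEN.
        return isProxyUser || (!clientMode && !hasKeytab);
    }

    public static void main(String[] args) {
        // Client-mode shell, no keytab, no proxy user: today's logic requests no token...
        System.out.println(tokensRequired(false, true, "thrift://ms:9083", false, true, false, null));
        // ...but the proposed opt-in would force one.
        System.out.println(tokensRequired(false, true, "thrift://ms:9083", false, true, false, true));
    }
}
```

Under this model, the kerberized metastore failure disappears because the executor-side Iceberg client can find a delegation token in the current user's credentials instead of attempting a GSSAPI handshake without one.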
[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell
[ https://issues.apache.org/jira/browse/SPARK-47197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YUBI LEE updated SPARK-47197: - Summary: Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell (was: Failed to connect HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell) > Failed to connect HiveMetastore when using iceberg with HiveCatalog on > spark-sql or spark-shell > --- > > Key: SPARK-47197 > URL: https://issues.apache.org/jira/browse/SPARK-47197 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 3.2.3, 3.5.1 >Reporter: YUBI LEE >Priority: Major > > I can't connect to kerberized HiveMetastore when using iceberg with > HiveCatalog by spark-sql or spark-shell. > I think this issue is caused by the fact that there is no way to get > HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell. > ([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83)] > > {code:java} > val currentToken = > UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias) > currentToken == null && UserGroupInformation.isSecurityEnabled && > hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty && > (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) > || > (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) > {code} > There should be a way to force to get HIVE_DELEGATION_TOKEN even when using > spark-sql or spark-shell. > Possible way is to get HIVE_DELEGATION_TOKEN if the configuration below is > set? > {code:java} > spark.security.credentials.hive.enabled true {code} > > {code:java} > 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) > (machine1.example.com executor 2): > org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive > Metastore > ... 
> Caused by: MetaException(message:Could not connect to meta store using any of > the URIs provided. Most recent failure: > org.apache.thrift.transport.TTransportException: GSS initiate failed {code} > > > {code:java} > spark-sql> select * from temp.test_hive_catalog; > ... > ... > 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) > (machine1.example.com executor 2): > org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive > Metastore > at > org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84) > at > org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34) > at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125) > at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56) > at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51) > at > org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122) > at > org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158) > at > org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97) > at > org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80) > at > org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47) > at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124) > at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111) > at > org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276) > at > org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86) > at > org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426) > at > org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456) > at > org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342) > at > 
org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342) > at > org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181) > at > org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178) > at org.apache.spark.r
[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell
[ https://issues.apache.org/jira/browse/SPARK-47197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YUBI LEE updated SPARK-47197: - Component/s: SQL > Failed to connect HiveMetastore when using iceberg with HiveCatalog by > spark-sql or spark-shell > --- > > Key: SPARK-47197 > URL: https://issues.apache.org/jira/browse/SPARK-47197 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 3.2.3, 3.5.1 >Reporter: YUBI LEE >Priority: Major > > I can't connect to kerberized HiveMetastore when using iceberg with > HiveCatalog by spark-sql or spark-shell. > I think this issue is caused by the fact that there is no way to get > HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell. > ([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83)] > > {code:java} > val currentToken = > UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias) > currentToken == null && UserGroupInformation.isSecurityEnabled && > hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty && > (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) > || > (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) > {code} > There should be a way to force to get HIVE_DELEGATION_TOKEN even when using > spark-sql or spark-shell. > Possible way is to get HIVE_DELEGATION_TOKEN if the configuration below is > set? > {code:java} > spark.security.credentials.hive.enabled true {code} > > {code:java} > 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) > (machine1.example.com executor 2): > org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive > Metastore > ... > Caused by: MetaException(message:Could not connect to meta store using any of > the URIs provided. 
Most recent failure: > org.apache.thrift.transport.TTransportException: GSS initiate failed {code} > > > {code:java} > spark-sql> select * from temp.test_hive_catalog; > ... > ... > 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) > (machine1.example.com executor 2): > org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive > Metastore > at > org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84) > at > org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34) > at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125) > at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56) > at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51) > at > org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122) > at > org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158) > at > org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97) > at > org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80) > at > org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47) > at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124) > at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111) > at > org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276) > at > org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86) > at > org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426) > at > org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456) > at > org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342) > at > 
org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342) > at > org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181) > at > org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:247) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96) >
[jira] [Commented] (SPARK-47172) Upgrade Transport block cipher mode to GCM
[ https://issues.apache.org/jira/browse/SPARK-47172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821449#comment-17821449 ] Steve Weis commented on SPARK-47172: Hi [~mridul]. I'd like to get your input on this now that there is TLS support for RPC calls. The "AES based encryption" that uses AES-CTR is not really needed anymore. It's not a one-line change to fix because Apache Commons CryptoInputStream/CryptoOutputStream don't support GCM. We can move to the JCE CipherInputStream/CipherOutputStream, but it's not a drop-in replacement. I'm wondering if we can just plan to deprecate the ad hoc transport encryption and standardize on TLS. > Upgrade Transport block cipher mode to GCM > -- > > Key: SPARK-47172 > URL: https://issues.apache.org/jira/browse/SPARK-47172 > Project: Spark > Issue Type: Improvement > Components: Security >Affects Versions: 3.4.2, 3.5.0 >Reporter: Steve Weis >Priority: Minor > > The cipher transformation currently used for encrypting RPC calls is an > unauthenticated mode (AES/CTR/NoPadding). This needs to be upgraded to an > authenticated mode (AES/GCM/NoPadding) to prevent ciphertext from being > modified in transit. > The relevant line is here: > [https://github.com/apache/spark/blob/a939a7d0fd9c6b23c879cbee05275c6fbc939e38/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L220] > GCM is relatively more computationally expensive than CTR and adds a 16-byte > block of authentication tag data to each payload. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
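For context, a JCE-based replacement along the lines discussed in the comment might look like the sketch below. This is not Spark code and the class name is invented; it only illustrates AES/GCM/NoPadding driven through the standard `CipherOutputStream`/`CipherInputStream`, including the 16-byte per-payload tag overhead the issue mentions.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class GcmStreamSketch {
    static final int TAG_BITS = 128; // the 16-byte authentication tag noted in the issue
    static final int IV_BYTES = 12;  // NIST-recommended GCM nonce length

    static byte[] encrypt(SecretKey key, byte[] iv, byte[] plaintext) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (CipherOutputStream out = new CipherOutputStream(buf, c)) {
            out.write(plaintext); // the auth tag is appended when the stream is closed
        }
        return buf.toByteArray();
    }

    static byte[] decrypt(SecretKey key, byte[] iv, byte[] ciphertext) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (CipherInputStream in = new CipherInputStream(new ByteArrayInputStream(ciphertext), c)) {
            byte[] chunk = new byte[4096];
            for (int n; (n = in.read(chunk)) != -1; ) {
                buf.write(chunk, 0, n); // tag is verified before the final bytes are released
            }
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv); // a real transport must never reuse an IV per key

        byte[] msg = "rpc frame payload".getBytes(StandardCharsets.UTF_8);
        byte[] ct = encrypt(key, iv, msg);
        System.out.println(ct.length - msg.length); // 16: GCM's per-payload tag overhead
        System.out.println(new String(decrypt(key, iv, ct), StandardCharsets.UTF_8));
    }
}
```

The non-drop-in part the comment alludes to is visible even in this toy: GCM needs per-message IV management and buffers data until the tag can be verified, whereas the Commons Crypto CTR streams are stateless pass-throughs.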
[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell
[ https://issues.apache.org/jira/browse/SPARK-47197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YUBI LEE updated SPARK-47197: - Description: I can't connect to kerberized HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell. I think this issue is caused by the fact that there is no way to get HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell. ([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83)] {code:java} val currentToken = UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias) currentToken == null && UserGroupInformation.isSecurityEnabled && hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty && (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) || (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) {code} There should be a way to force to get HIVE_DELEGATION_TOKEN even when using spark-sql or spark-shell. Possible way is to get HIVE_DELEGATION_TOKEN if the configuration below is set? {code:java} spark.security.credentials.hive.enabled true {code} {code:java} 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) (machine1.example.com executor 2): org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore ... Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: GSS initiate failed {code} {code:java} spark-sql> select * from temp.test_hive_catalog; ... ... 
24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) (machine1.example.com executor 2): org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84) at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34) at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125) at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56) at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51) at org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122) at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158) at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97) at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80) at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111) at org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276) at org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86) at org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456) at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342) at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342) at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181) at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181) at scala.Option.foreach(Option.scala:407) at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178) at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:247) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apach
[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell
[ https://issues.apache.org/jira/browse/SPARK-47197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YUBI LEE updated SPARK-47197: - Description: I can't connect to kerberized HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell. I think this issue is caused by the fact that there is no way to get HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell. ([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83)] {code:java} val currentToken = UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias) currentToken == null && UserGroupInformation.isSecurityEnabled && hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty && (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) || (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) {code} There should be a way to force to get HIVE_DELEGATION_TOKEN even when using spark-sql or spark-shell. Possible way is to get HIVE_DELEGATION_TOKEN if the configuration below is set? {code:java} spark.security.credentials.hive.enabled true {code} {code:java} spark-sql> select * from temp.test_hive_catalog; ... ... 
24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) (machine1.example.com executor 2): org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84) at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34) at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125) at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56) at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51) at org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122) at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158) at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97) at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80) at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111) at org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276) at org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86) at org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456) at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342) at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342) at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181) at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181) at scala.Option.foreach(Option.scala:407) at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178) at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:247) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.sc
[jira] [Created] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell
YUBI LEE created SPARK-47197: Summary: Failed to connect HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell Key: SPARK-47197 URL: https://issues.apache.org/jira/browse/SPARK-47197 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 3.5.1, 3.2.3 Reporter: YUBI LEE I can't connect to kerberized HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell. I think this issue is caused by the fact that there is no way to get HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell. ([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83)] {code:java} val currentToken = UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias) currentToken == null && UserGroupInformation.isSecurityEnabled && hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty && (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) || (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) {code} There should be a way to force to get HIVE_DELEGATION_TOKEN even when using spark-sql or spark-shell. Possible way is to get HIVE_DELEGATION_TOKEN if the configuration below is set? {code:java} spark.security.credentials.hive.enabled true {code} {code:java} spark-sql> select * from temp.test_hive_catalog; ... ... 
24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) (machine1.example.com executor 2): org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84) at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34) at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125) at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56) at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51) at org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122) at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158) at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97) at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80) at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111) at org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276) at org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86) at org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456) at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342) at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342) at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181) at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181) at scala.Option.foreach(Option.scala:407) at 
org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178) at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:247) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.com
[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821433#comment-17821433 ] Mich Talebzadeh commented on SPARK-24815: - Some thoughts on this, if I may. This enhancement request provides a solid foundation for improving dynamic allocation in Structured Streaming. Adding more specific details, outlining potential benefits, and addressing potential challenges can further strengthen the proposal and increase its chances of being implemented. These are my thoughts: - Pluggable Dynamic Allocation: This suggestion shows good design principles, allowing for flexibility and future improvements. We should elaborate on the benefits of a pluggable approach, such as customization and integration with external resource management tools. - Separate Algorithm for Structured Streaming: This is crucial for adapting allocation strategies to the unique nature of streaming workloads. Outlining how a separate algorithm might differ from its batch counterpart could also be useful. - Warning for Enabled Core Dynamic Allocation: This is a valuable warning to prevent accidental misuse and raise awareness among users. Also consider suggesting the warning level (e.g. info, warning, error) and potential content to provide clarity. - Briefly mention potential challenges or trade-offs associated with implementing these proposals. Suggesting relevant discussions, resources, or alternative approaches could strengthen the enhancement request. > Structured Streaming should support dynamic allocation > -- > > Key: SPARK-24815 > URL: https://issues.apache.org/jira/browse/SPARK-24815 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core, Structured Streaming >Affects Versions: 2.3.1 >Reporter: Karthik Palaniappan >Priority: Minor > Labels: pull-request-available > > For batch jobs, dynamic allocation is very useful for adding and removing > containers to match the actual workload. 
On multi-tenant clusters, it ensures > that a Spark job is taking no more resources than necessary. In cloud > environments, it enables autoscaling. > However, if you set spark.dynamicAllocation.enabled=true and run a structured > streaming job, the batch dynamic allocation algorithm kicks in. It requests > more executors if the task backlog is a certain size, and removes executors > if they idle for a certain period of time. > Quick thoughts: > 1) Dynamic allocation should be pluggable, rather than hardcoded to a > particular implementation in SparkContext.scala (this should be a separate > JIRA). > 2) We should make a structured streaming algorithm that's separate from the > batch algorithm. Eventually, continuous processing might need its own > algorithm. > 3) Spark should print a warning if you run a structured streaming job when > Core's dynamic allocation is enabled -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
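To make points (1) and (2) of the proposal concrete, here is a toy sketch of what a pluggable policy interface might look like, with a backlog-driven batch strategy next to a lag-driven streaming strategy. Every name here is invented for illustration; Spark exposes no such API today, and the heuristics are deliberately simplistic.

```java
// Illustrative only: all names and heuristics below are hypothetical.
public class AllocationSketch {

    // A pluggable policy: given current load, decide how many executors we want.
    interface ExecutorPolicy {
        int desiredExecutors(long backlogTasks, long processingTimeMs,
                             long batchIntervalMs, int currentExecutors);
    }

    // Batch-style heuristic: size the cluster to the task backlog
    // (roughly the behavior core dynamic allocation applies today).
    static final ExecutorPolicy BATCH = (backlog, procMs, intervalMs, current) ->
            (int) Math.max(1, (backlog + 3) / 4); // assume ~4 concurrent tasks per executor

    // Streaming-style heuristic: in steady state the backlog stays small,
    // so react to processing lag relative to the trigger interval instead.
    static final ExecutorPolicy STREAMING = (backlog, procMs, intervalMs, current) -> {
        if (procMs > intervalMs) return current + 1;                    // falling behind
        if (procMs * 2 < intervalMs && current > 1) return current - 1; // ample headroom
        return current;
    };

    public static void main(String[] args) {
        // A streaming job with a tiny backlog but chronic lag: the batch policy
        // would scale DOWN while the streaming policy correctly scales UP.
        System.out.println(BATCH.desiredExecutors(2, 1500, 1000, 4));
        System.out.println(STREAMING.desiredExecutors(2, 1500, 1000, 4));
    }
}
```

The `main` scenario illustrates the mismatch the issue describes: a streaming micro-batch job keeps its backlog near zero by design, so a backlog-based algorithm systematically shrinks a job that is actually falling behind.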
[jira] [Updated] (SPARK-47196) Fix `core` module to succeed SBT tests
[ https://issues.apache.org/jira/browse/SPARK-47196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47196: -- Description: This happens at branch-3.4 only. branch-3.3/branch-3.5/master are okay. {code:java} $ build/sbt "core/testOnly *.DAGSchedulerSuite" [info] DAGSchedulerSuite: [info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** (439 milliseconds) [info] java.lang.IllegalStateException: Could not initialize plugin: interface org.mockito.plugins.MockMaker (alternate: null) ... [info] *** 1 SUITE ABORTED *** [info] *** 118 TESTS FAILED *** [error] Error during tests: [error] org.apache.spark.scheduler.DAGSchedulerSuite [error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 48 s, completed Feb 27, 2024, 1:26:27 PM {code} MAVEN {code:java} $ build/mvn dependency:tree -pl core | grep byte-buddy ... [INFO] | +- net.bytebuddy:byte-buddy:jar:1.12.10:test [INFO] | +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test {code} SBT {code:java} $ build/sbt "core/test:dependencyTree" | grep byte-buddy [info] | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18) [info] | | | | +-net.bytebuddy:byte-buddy:1.12.18 {code} was: This happens at branch-3.4 only. {code:java} $ build/sbt "core/testOnly *.DAGSchedulerSuite" [info] DAGSchedulerSuite: [info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** (439 milliseconds) [info] java.lang.IllegalStateException: Could not initialize plugin: interface org.mockito.plugins.MockMaker (alternate: null) ... [info] *** 1 SUITE ABORTED *** [info] *** 118 TESTS FAILED *** [error] Error during tests: [error] org.apache.spark.scheduler.DAGSchedulerSuite [error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 48 s, completed Feb 27, 2024, 1:26:27 PM {code} MAVEN {code:java} $ build/mvn dependency:tree -pl core | grep byte-buddy ... 
[INFO] | +- net.bytebuddy:byte-buddy:jar:1.12.10:test [INFO] | +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test {code} SBT {code:java} $ build/sbt "core/test:dependencyTree" | grep byte-buddy [info] | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18) [info] | | | | +-net.bytebuddy:byte-buddy:1.12.18 {code} > Fix `core` module to succeed SBT tests > -- > > Key: SPARK-47196 > URL: https://issues.apache.org/jira/browse/SPARK-47196 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.2, 3.4.0, 3.4.1 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > This happens at branch-3.4 only. branch-3.3/branch-3.5/master are okay. > {code:java} > $ build/sbt "core/testOnly *.DAGSchedulerSuite" > [info] DAGSchedulerSuite: > [info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** > (439 milliseconds) > [info] java.lang.IllegalStateException: Could not initialize plugin: > interface org.mockito.plugins.MockMaker (alternate: null) > ... > [info] *** 1 SUITE ABORTED *** > [info] *** 118 TESTS FAILED *** > [error] Error during tests: > [error] org.apache.spark.scheduler.DAGSchedulerSuite > [error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful > [error] Total time: 48 s, completed Feb 27, 2024, 1:26:27 PM {code} > > MAVEN > {code:java} > $ build/mvn dependency:tree -pl core | grep byte-buddy > ... > [INFO] | +- net.bytebuddy:byte-buddy:jar:1.12.10:test > [INFO] | +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test > {code} > SBT > {code:java} > $ build/sbt "core/test:dependencyTree" | grep byte-buddy > [info] | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18) > [info] | | | | +-net.bytebuddy:byte-buddy:1.12.18 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
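The dependency trees above show the mismatch: Maven resolves byte-buddy 1.12.10 (the version Mockito's `MockMaker` plugin expects), while SBT evicts it to 1.12.18, which is what aborts the suite. One possible workaround, offered here only as an untested sketch of the idea (the actual fix in the branch may differ), is to pin the version in the SBT build:

```scala
// Hypothetical SBT fragment: pin byte-buddy to the version Maven resolves,
// so Mockito can initialize its MockMaker plugin under SBT as well.
dependencyOverrides ++= Seq(
  "net.bytebuddy" % "byte-buddy" % "1.12.10",
  "net.bytebuddy" % "byte-buddy-agent" % "1.12.10"
)
```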
[jira] [Updated] (SPARK-47196) Fix `core` module to succeed SBT tests
[ https://issues.apache.org/jira/browse/SPARK-47196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47196: -- Description: This happens at branch-3.4 only. {code:java} $ build/sbt "core/testOnly *.DAGSchedulerSuite" [info] DAGSchedulerSuite: [info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** (439 milliseconds) [info] java.lang.IllegalStateException: Could not initialize plugin: interface org.mockito.plugins.MockMaker (alternate: null) ... [info] *** 1 SUITE ABORTED *** [info] *** 118 TESTS FAILED *** [error] Error during tests: [error] org.apache.spark.scheduler.DAGSchedulerSuite [error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 48 s, completed Feb 27, 2024, 1:26:27 PM {code} MAVEN {code:java} $ build/mvn dependency:tree -pl core | grep byte-buddy ... [INFO] | +- net.bytebuddy:byte-buddy:jar:1.12.10:test [INFO] | +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test {code} SBT {code:java} $ build/sbt "core/test:dependencyTree" | grep byte-buddy [info] | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18) [info] | | | | +-net.bytebuddy:byte-buddy:1.12.18 {code} was: MAVEN {code} $ build/mvn dependency:tree -pl core | grep byte-buddy ... [INFO] | +- net.bytebuddy:byte-buddy:jar:1.12.10:test [INFO] | +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test {code} SBT {code} $ build/sbt "core/test:dependencyTree" | grep byte-buddy [info] | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18) [info] | | | | +-net.bytebuddy:byte-buddy:1.12.18 {code} > Fix `core` module to succeed SBT tests > -- > > Key: SPARK-47196 > URL: https://issues.apache.org/jira/browse/SPARK-47196 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.2, 3.4.0, 3.4.1 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > This happens at branch-3.4 only. 
> {code:java} > $ build/sbt "core/testOnly *.DAGSchedulerSuite" > [info] DAGSchedulerSuite: > [info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** > (439 milliseconds) > [info] java.lang.IllegalStateException: Could not initialize plugin: > interface org.mockito.plugins.MockMaker (alternate: null) > ... > [info] *** 1 SUITE ABORTED *** > [info] *** 118 TESTS FAILED *** > [error] Error during tests: > [error] org.apache.spark.scheduler.DAGSchedulerSuite > [error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful > [error] Total time: 48 s, completed Feb 27, 2024, 1:26:27 PM {code} > > MAVEN > {code:java} > $ build/mvn dependency:tree -pl core | grep byte-buddy > ... > [INFO] | +- net.bytebuddy:byte-buddy:jar:1.12.10:test > [INFO] | +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test > {code} > SBT > {code:java} > $ build/sbt "core/test:dependencyTree" | grep byte-buddy > [info] | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18) > [info] | | | | +-net.bytebuddy:byte-buddy:1.12.18 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47196) Fix `core` module to succeed SBT tests
[ https://issues.apache.org/jira/browse/SPARK-47196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47196: -- Description: MAVEN {code} $ build/mvn dependency:tree -pl core | grep byte-buddy ... [INFO] | +- net.bytebuddy:byte-buddy:jar:1.12.10:test [INFO] | +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test {code} SBT {code} $ build/sbt "core/test:dependencyTree" | grep byte-buddy [info] | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18) [info] | | | | +-net.bytebuddy:byte-buddy:1.12.18 {code} > Fix `core` module to succeed SBT tests > -- > > Key: SPARK-47196 > URL: https://issues.apache.org/jira/browse/SPARK-47196 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.2, 3.4.0, 3.4.1 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > MAVEN > {code} > $ build/mvn dependency:tree -pl core | grep byte-buddy > ... > [INFO] | +- net.bytebuddy:byte-buddy:jar:1.12.10:test > [INFO] | +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test > {code} > SBT > {code} > $ build/sbt "core/test:dependencyTree" | grep byte-buddy > [info] | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18) > [info] | | | | +-net.bytebuddy:byte-buddy:1.12.18 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47177) Cached SQL plan does not display final AQE plan in explain string
[ https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ziqi Liu updated SPARK-47177: - Description: An AQE plan is expected to display the final plan after execution. This is not true for a cached SQL plan: it will show the initial plan instead. This behavior change was introduced in [https://github.com/apache/spark/pull/40812], which tried to fix the concurrency issue with cached plans. *In short, the plan used to execute and the plan used to explain are not the same instance, thus causing the inconsistency.* I don't have a clear idea how to fix it yet: * maybe we just add a coarse-granularity lock in explain? * make innerChildren a function: clone the initial plan, and on every call check whether the original AQE plan is finalized (making the final flag atomic first, of course); if not, return the cloned initial plan; if it is finalized, clone the final plan and return that one. This still won't reflect the AQE plan in real time in a concurrent situation, but at least we have an initial version and a final version. 
A simple repro: {code:java} d1 = spark.range(1000).withColumn("key", expr("id % 100")).groupBy("key").agg({"key": "count"}) cached_d2 = d1.cache() df = cached_d2.filter("key > 10") df.collect() {code} {code:java} >>> df.explain() == Physical Plan == AdaptiveSparkPlan isFinalPlan=true +- == Final Plan == *(1) Filter (isnotnull(key#4L) AND (key#4L > 10)) +- TableCacheQueryStage 0 +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L > 10)] +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, memory, deserialized, 1 replicas) +- AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[key#4L], functions=[count(key#4L)]) +- Exchange hashpartitioning(key#4L, 200), ENSURE_REQUIREMENTS, [plan_id=24] +- HashAggregate(keys=[key#4L], functions=[partial_count(key#4L)]) +- Project [(id#2L % 100) AS key#4L] +- Range (0, 1000, step=1, splits=10) +- == Initial Plan == Filter (isnotnull(key#4L) AND (key#4L > 10)) +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L > 10)] +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, memory, deserialized, 1 replicas) +- AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[key#4L], functions=[count(key#4L)]) +- Exchange hashpartitioning(key#4L, 200), ENSURE_REQUIREMENTS, [plan_id=24] +- HashAggregate(keys=[key#4L], functions=[partial_count(key#4L)]) +- Project [(id#2L % 100) AS key#4L] +- Range (0, 1000, step=1, splits=10){code} was: AQE plan is expected to display final plan after execution. This is not true for cached SQL plan: it will show the initial plan instead. This behavior change is introduced in [https://github.com/apache/spark/pull/40812] it tried to fix the concurrency issue with cached plan. *In short, the plan used to executed and the plan used to explain is not the same instance, thus causing the inconsistency.* I don't have a clear idea how yet * maybe we just a coarse granularity lock in explain? 
* make innerChildren a function: clone the initial plan, every time checked for whether the original AQE plan is finalized (making the final flag atomic first, of course), if no return the cloned initial plan, if it's finalized, clone the final plan and return that one. But still this won't be able to reflect the AQE plan in real time, in a concurrent situation, but at least we have initial version and final version. A simple repro: {code:java} d1 = spark.range(1000).withColumn("key", expr("id % 100")).groupBy("key").agg({"key": "count"}) cached_d2 = d1.cache() df = cached_d2.withColumn("key2", expr("key % 10")).groupBy("key2").agg({"key2": "count"}) df.collect() {code} {code:java} >>> df.explain() == Physical Plan == AdaptiveSparkPlan isFinalPlan=true +- == Final Plan == *(2) HashAggregate(keys=[key2#36L], functions=[count(key2#36L)]) +- AQEShuffleRead coalesced +- ShuffleQueryStage 1 +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, [plan_id=83] +- *(1) HashAggregate(keys=[key2#36L], functions=[partial_count(key2#36L)]) +- *(1) Project [(key#27L % 10) AS key2#36L] +- TableCacheQueryStage 0 +- InMemoryTableScan [key#27L] +- InMemoryRelation [key#27L, count(key)#33L], StorageLevel(disk, memory, deserialized, 1 replicas) +- AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[key#4L], functions=[count(key#4L)])
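The second option sketched in the description (an atomic "finalized" flag plus cloning) can be modeled outside Spark. This is a hedged toy illustration only: `AdaptivePlan`, its fields, and the dict-based "plans" are hypothetical stand-ins, not Spark's `AdaptiveSparkPlanExec` API.

```python
import threading
from copy import deepcopy

class AdaptivePlan:
    """Toy model of the proposal; not Spark's AdaptiveSparkPlanExec API."""

    def __init__(self, initial_plan):
        self._initial_plan = initial_plan
        self._final_plan = None
        self._finalized = threading.Event()  # the atomic "isFinalPlan" flag

    def finalize(self, final_plan):
        self._final_plan = final_plan
        self._finalized.set()

    def plan_for_explain(self):
        # Hand out a clone so explain() never shares an instance with the
        # plan being mutated by concurrent re-optimization.
        if self._finalized.is_set():
            return deepcopy(self._final_plan)
        return deepcopy(self._initial_plan)

plan = AdaptivePlan({"op": "HashAggregate", "finalized": False})
before = plan.plan_for_explain()  # snapshot of the initial plan
plan.finalize({"op": "HashAggregate", "finalized": True})
after = plan.plan_for_explain()   # snapshot of the final plan
```

As the description notes, this still cannot show the in-flight plan in real time; it only guarantees a consistent initial or final snapshot.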
[jira] [Updated] (SPARK-47196) Fix `core` module to succeed SBT tests
[ https://issues.apache.org/jira/browse/SPARK-47196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47196: -- Affects Version/s: 3.4.1 3.4.0 > Fix `core` module to succeed SBT tests > -- > > Key: SPARK-47196 > URL: https://issues.apache.org/jira/browse/SPARK-47196 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.2, 3.4.0, 3.4.1 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43157) TreeNode tags can become corrupted and hang driver when the dataset is cached
[ https://issues.apache.org/jira/browse/SPARK-43157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43157: --- Labels: pull-request-available (was: ) > TreeNode tags can become corrupted and hang driver when the dataset is cached > - > > Key: SPARK-43157 > URL: https://issues.apache.org/jira/browse/SPARK-43157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 3.5.0 >Reporter: Rob Reeves >Assignee: Rob Reeves >Priority: Major > Labels: pull-request-available > Fix For: 3.4.1, 3.5.0 > > > If a cached dataset is used by multiple other datasets materialized in > separate threads it can corrupt the TreeNode.tags map in any of the cached > plan nodes. This will hang the driver forever. This happens because > TreeNode.tags is not thread-safe. How this happens: > # Multiple datasets are materialized at the same time in different threads > that reference the same cached dataset > # AdaptiveSparkPlanExec.onUpdatePlan will call ExplainMode.fromString > # ExplainUtils uses the TreeNode.tags map to store the operator Id for every > node in the plan. This is usually okay because the plan is cloned. When there > is an InMemoryScanExec the InMemoryRelation.cachedPlan is not cloned so > multiple threads can set the operator Id. > Making the TreeNode.tags field thread-safe does not solve this problem > because there is still a correctness issue. The threads may be overwriting > each other's operator Ids, which could be different. 
> Example stack trace of the infinite loop: > {code:scala} > scala.collection.mutable.HashTable.resize(HashTable.scala:265) > scala.collection.mutable.HashTable.addEntry0(HashTable.scala:158) > scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:170) > scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167) > scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44) > scala.collection.mutable.HashMap.put(HashMap.scala:126) > scala.collection.mutable.HashMap.update(HashMap.scala:131) > org.apache.spark.sql.catalyst.trees.TreeNode.setTagValue(TreeNode.scala:108) > org.apache.spark.sql.execution.ExplainUtils$.setOpId$1(ExplainUtils.scala:134) > … > org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:175) > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.onUpdatePlan(AdaptiveSparkPlanExec.scala:662){code} > Example to show the cachedPlan object is not cloned: > {code:java} > import org.apache.spark.sql.execution.SparkPlan > import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec > import spark.implicits._ > def findCacheOperator(plan: SparkPlan): Option[InMemoryTableScanExec] = { > if (plan.isInstanceOf[InMemoryTableScanExec]) { > Some(plan.asInstanceOf[InMemoryTableScanExec]) > } else if (plan.children.isEmpty && plan.subqueries.isEmpty) { > None > } else { > (plan.subqueries.flatMap(p => findCacheOperator(p)) ++ > plan.children.flatMap(findCacheOperator)).headOption > } > } > val df = spark.range(10).filter($"id" < 100).cache() > val df1 = df.limit(1) > val df2 = df.limit(1) > // Get the cache operator (InMemoryTableScanExec) in each plan > val plan1 = findCacheOperator(df1.queryExecution.executedPlan).get > val plan2 = findCacheOperator(df2.queryExecution.executedPlan).get > // Check if InMemoryTableScanExec references point to the same object > println(plan1.eq(plan2)) > // returns false// Check if InMemoryRelation references point to the same > object > 
println(plan1.relation.eq(plan2.relation)) > // returns false > // Check if the cached SparkPlan references point to the same object > println(plan1.relation.cachedPlan.eq(plan2.relation.cachedPlan)) > // returns true > // This shows that the cloned plan2 still has references to the original > plan1 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
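The report's last point, that making `TreeNode.tags` thread-safe fixes the hang but not the correctness issue, can be illustrated with a minimal non-Spark sketch. `TreeNode` here is a hypothetical stand-in, and the interleaving of the two explain runs is shown sequentially for determinism.

```python
import threading

class TreeNode:
    """Minimal stand-in for a plan node with a tags map (not Spark code)."""

    def __init__(self):
        self._tags = {}
        self._lock = threading.Lock()

    def set_tag(self, key, value):
        with self._lock:  # thread-safe writes fix the hang, not the race below
            self._tags[key] = value

    def get_tag(self, key):
        with self._lock:
            return self._tags.get(key)

# One cached node shared by two query plans: since InMemoryRelation.cachedPlan
# is not cloned, both explain() runs touch the same object.
shared = TreeNode()

shared.set_tag("operatorId", 1)        # thread A numbers the node for its plan
shared.set_tag("operatorId", 2)        # thread B renumbers it for a different plan
a_sees = shared.get_tag("operatorId")  # thread A reads back 2, not the 1 it wrote
```

Locking keeps the map's internal state intact, but the two threads still overwrite each other's operator IDs, so one explain string ends up with the wrong numbering.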
[jira] [Updated] (SPARK-47177) Cached SQL plan do not display final AQE plan in explain string
[ https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ziqi Liu updated SPARK-47177: - Description: AQE plan is expected to display final plan after execution. This is not true for cached SQL plan: it will show the initial plan instead. This behavior change is introduced in [https://github.com/apache/spark/pull/40812] it tried to fix the concurrency issue with cached plan. *In short, the plan used to executed and the plan used to explain is not the same instance, thus causing the inconsistency.* I don't have a clear idea how yet * maybe we just a coarse granularity lock in explain? * make innerChildren a function: clone the initial plan, every time checked for whether the original AQE plan is finalized (making the final flag atomic first, of course), if no return the cloned initial plan, if it's finalized, clone the final plan and return that one. But still this won't be able to reflect the AQE plan in real time, in a concurrent situation, but at least we have initial version and final version. 
A simple repro: {code:java} d1 = spark.range(1000).withColumn("key", expr("id % 100")).groupBy("key").agg({"key": "count"}) cached_d2 = d1.cache() df = cached_d2.withColumn("key2", expr("key % 10")).groupBy("key2").agg({"key2": "count"}) df.collect() {code} {code:java} >>> df.explain() == Physical Plan == AdaptiveSparkPlan isFinalPlan=true +- == Final Plan == *(2) HashAggregate(keys=[key2#36L], functions=[count(key2#36L)]) +- AQEShuffleRead coalesced +- ShuffleQueryStage 1 +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, [plan_id=83] +- *(1) HashAggregate(keys=[key2#36L], functions=[partial_count(key2#36L)]) +- *(1) Project [(key#27L % 10) AS key2#36L] +- TableCacheQueryStage 0 +- InMemoryTableScan [key#27L] +- InMemoryRelation [key#27L, count(key)#33L], StorageLevel(disk, memory, deserialized, 1 replicas) +- AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[key#4L], functions=[count(key#4L)]) +- Exchange hashpartitioning(key#4L, 200), ENSURE_REQUIREMENTS, [plan_id=33] +- HashAggregate(keys=[key#4L], functions=[partial_count(key#4L)]) +- Project [(id#2L % 100) AS key#4L] +- Range (0, 1000, step=1, splits=10) +- == Initial Plan == HashAggregate(keys=[key2#36L], functions=[count(key2#36L)]) +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, [plan_id=30] +- HashAggregate(keys=[key2#36L], functions=[partial_count(key2#36L)]) +- Project [(key#27L % 10) AS key2#36L] +- InMemoryTableScan [key#27L] +- InMemoryRelation [key#27L, count(key)#33L], StorageLevel(disk, memory, deserialized, 1 replicas) +- AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[key#4L], functions=[count(key#4L)]) +- Exchange hashpartitioning(key#4L, 200), ENSURE_REQUIREMENTS, [plan_id=33] +- HashAggregate(keys=[key#4L], functions=[partial_count(key#4L)]) +- Project [(id#2L % 100) AS key#4L] +- Range (0, 1000, step=1, splits=10) {code} was: AQE plan is expected to display final plan after execution. 
This is not true for cached SQL plan: it will show the initial plan instead. This behavior change is introduced in [https://github.com/apache/spark/pull/40812] it tried to fix the concurrency issue with cached plan. *In short, the plan used to executed and the plan used to explain is not the same instance, thus causing the inconsistency.* I don't have a clear idea how yet, maybe we just a coarse granularity lock in explain? A simple repro: {code:java} d1 = spark.range(1000).withColumn("key", expr("id % 100")).groupBy("key").agg({"key": "count"}) cached_d2 = d1.cache() df = cached_d2.withColumn("key2", expr("key % 10")).groupBy("key2").agg({"key2": "count"}) df.collect() {code} {code:java} >>> df.explain() == Physical Plan == AdaptiveSparkPlan isFinalPlan=true +- == Final Plan == *(2) HashAggregate(keys=[key2#36L], functions=[count(key2#36L)]) +- AQEShuffleRead coalesced +- ShuffleQueryStage 1 +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, [plan_id=83] +- *(1) HashAggregate(keys=[key2#36L], functions=[partial_count(key2#36L)]) +- *(1) Project [(key#27L % 10) AS key2#36L] +- TableCacheQueryStage 0 +- InMemoryTableScan [key#27L] +- I
[jira] [Updated] (SPARK-47177) Cached SQL plan does not display final AQE plan in explain string
[ https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ziqi Liu updated SPARK-47177: - Description: AQE plan is expected to display final plan after execution. This is not true for cached SQL plan: it will show the initial plan instead. This behavior change is introduced in [https://github.com/apache/spark/pull/40812] it tried to fix the concurrency issue with cached plan. *In short, the plan used to executed and the plan used to explain is not the same instance, thus causing the inconsistency.* I don't have a clear idea how yet, maybe we just a coarse granularity lock in explain? A simple repro: {code:java} d1 = spark.range(1000).withColumn("key", expr("id % 100")).groupBy("key").agg({"key": "count"}) cached_d2 = d1.cache() df = cached_d2.withColumn("key2", expr("key % 10")).groupBy("key2").agg({"key2": "count"}) df.collect() {code} {code:java} >>> df.explain() == Physical Plan == AdaptiveSparkPlan isFinalPlan=true +- == Final Plan == *(2) HashAggregate(keys=[key2#36L], functions=[count(key2#36L)]) +- AQEShuffleRead coalesced +- ShuffleQueryStage 1 +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, [plan_id=83] +- *(1) HashAggregate(keys=[key2#36L], functions=[partial_count(key2#36L)]) +- *(1) Project [(key#27L % 10) AS key2#36L] +- TableCacheQueryStage 0 +- InMemoryTableScan [key#27L] +- InMemoryRelation [key#27L, count(key)#33L], StorageLevel(disk, memory, deserialized, 1 replicas) +- AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[key#4L], functions=[count(key#4L)]) +- Exchange hashpartitioning(key#4L, 200), ENSURE_REQUIREMENTS, [plan_id=33] +- HashAggregate(keys=[key#4L], functions=[partial_count(key#4L)]) +- Project [(id#2L % 100) AS key#4L] +- Range (0, 1000, step=1, splits=10) +- == Initial Plan == HashAggregate(keys=[key2#36L], functions=[count(key2#36L)]) +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, [plan_id=30] +- HashAggregate(keys=[key2#36L], 
functions=[partial_count(key2#36L)]) +- Project [(key#27L % 10) AS key2#36L] +- InMemoryTableScan [key#27L] +- InMemoryRelation [key#27L, count(key)#33L], StorageLevel(disk, memory, deserialized, 1 replicas) +- AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[key#4L], functions=[count(key#4L)]) +- Exchange hashpartitioning(key#4L, 200), ENSURE_REQUIREMENTS, [plan_id=33] +- HashAggregate(keys=[key#4L], functions=[partial_count(key#4L)]) +- Project [(id#2L % 100) AS key#4L] +- Range (0, 1000, step=1, splits=10) {code} was: AQE plan is expected to display final plan after execution. This is not true for cached SQL plan: it will show the initial plan instead. This behavior change is introduced in [https://github.com/apache/spark/pull/40812] it tried to fix the concurrency issue with cached plan. I don't have a clear idea how yet, maybe we can check whether the AQE plan is finalized(make the final flag atomic first, of course), if not we can return the cloned one, otherwise it's thread-safe to return the final one, since it's immutable. 
A simple repro: {code:java} d1 = spark.range(1000).withColumn("key", expr("id % 100")).groupBy("key").agg({"key": "count"}) cached_d2 = d1.cache() df = cached_d2.withColumn("key2", expr("key % 10")).groupBy("key2").agg({"key2": "count"}) df.collect() {code} {code:java} >>> df.explain() == Physical Plan == AdaptiveSparkPlan isFinalPlan=true +- == Final Plan == *(2) HashAggregate(keys=[key2#36L], functions=[count(key2#36L)]) +- AQEShuffleRead coalesced +- ShuffleQueryStage 1 +- Exchange hashpartitioning(key2#36L, 200), ENSURE_REQUIREMENTS, [plan_id=83] +- *(1) HashAggregate(keys=[key2#36L], functions=[partial_count(key2#36L)]) +- *(1) Project [(key#27L % 10) AS key2#36L] +- TableCacheQueryStage 0 +- InMemoryTableScan [key#27L] +- InMemoryRelation [key#27L, count(key)#33L], StorageLevel(disk, memory, deserialized, 1 replicas) +- AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[key#4L], functions=[count(key#4L)]) +- Exchange hashpartitioning(key#4L, 200), ENSURE_REQUIREMENTS, [plan_id=33]
[jira] [Updated] (SPARK-47196) Fix `core` module to succeed SBT tests
[ https://issues.apache.org/jira/browse/SPARK-47196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47196: --- Labels: pull-request-available (was: ) > Fix `core` module to succeed SBT tests > -- > > Key: SPARK-47196 > URL: https://issues.apache.org/jira/browse/SPARK-47196 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.2 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47196) Fix `core` module to succeed SBT tests
Dongjoon Hyun created SPARK-47196: - Summary: Fix `core` module to succeed SBT tests Key: SPARK-47196 URL: https://issues.apache.org/jira/browse/SPARK-47196 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.2 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-47193) Converting dataframe to rdd results in data loss
[ https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821393#comment-17821393 ] Bruce Robbins edited comment on SPARK-47193 at 2/27/24 8:48 PM: Running this in Spark 3.5.0 in local mode on my laptop, I get {noformat} df count = 8 ... rdd count = 8 {noformat} What is your environment and Spark configuration? By the way, the "{{...}}" above are messages like {noformat} 24/02/27 11:34:51 WARN CSVHeaderChecker: CSV header does not conform to the schema. Header: UserId, LocationId, LocationName, CreatedDate, Status Schema: UserId, LocationId, LocationName, Status, CreatedDate Expected: Status but found: CreatedDate CSV file: file:userLocation.csv {noformat} was (Author: bersprockets): Running this in Spark 3.5.0 in local mode on my laptop, I get {noformat} df count = 8 ... rdd count = 8 {noformat} What is your environment and Spark configuration? By the way, the {{...}} above are messages like {noformat} 24/02/27 11:34:51 WARN CSVHeaderChecker: CSV header does not conform to the schema. Header: UserId, LocationId, LocationName, CreatedDate, Status Schema: UserId, LocationId, LocationName, Status, CreatedDate Expected: Status but found: CreatedDate CSV file: file:userLocation.csv {noformat} > Converting dataframe to rdd results in data loss > > > Key: SPARK-47193 > URL: https://issues.apache.org/jira/browse/SPARK-47193 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0, 3.5.1 >Reporter: Ivan Bova >Priority: Critical > Labels: correctness > Attachments: device.csv, deviceClass.csv, deviceType.csv, > language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, > userLocation.csv, userProfile.csv > > > I have 10 csv files and need to create mapping from them. After all of the > joins dataframe contains all expected rows but rdd from this dataframe > contains only half of them. 
> {code:java} > case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: > String, LastName: String, LanguageId: Option[Int]) > case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String) > case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], > UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: > Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: > Option[Int]) > case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String) > case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String) > case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: > Option[Double], Longitude: Option[Double], Radius: Option[Double], > CreatedDate: Timestamp) > case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String) > case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: > String, Status: Int, CreatedDate: Timestamp) > case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: > Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp]) > case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], > Address1: String, Address2: String, City: String, State: String, Country: > String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: > Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, > UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: > Option[Boolean], Level: Option[Int], TimeZone: Option[Int]) > val userProfile = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage] > val language = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage] > val device = 
spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage] > val deviceClass = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage] > val deviceType = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage] > val location1 = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyLocation1].schema).csv("location1.csv").as[MyLocation1]
[jira] [Updated] (SPARK-47063) CAST long to timestamp has different behavior for codegen vs interpreted
[ https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47063: --- Labels: pull-request-available (was: ) > CAST long to timestamp has different behavior for codegen vs interpreted > > > Key: SPARK-47063 > URL: https://issues.apache.org/jira/browse/SPARK-47063 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.2 >Reporter: Robert Joseph Evans >Priority: Major > Labels: pull-request-available > > It probably impacts a lot more versions of the code than this, but I verified > it on 3.4.2. This also appears to be related to > https://issues.apache.org/jira/browse/SPARK-39209 > {code:java} > scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", > "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) > ++-++ > |v |ts |unix_micros(ts) | > ++-++ > |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 | > |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808| > |0 |1970-01-01 00:00:00 |0 | > |1990 |1970-01-01 00:33:10 |199000 | > ++-++ > scala> Seq(Long.MaxValue, Long.MinValue, 0L, > 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as > ts").selectExpr("*", "unix_micros(ts)").show(false) > ++---+---+ > |v |ts |unix_micros(ts)| > ++---+---+ > |9223372036854775807 |1969-12-31 23:59:59|-100 | > |-9223372036854775808|1970-01-01 00:00:00|0 | > |0 |1970-01-01 00:00:00|0 | > |1990 |1970-01-01 00:33:10|199000 | > ++---+---+ > {code} > It looks like InMemoryTableScanExec is not doing code generation for the > expressions, but the ProjectExec after the repartition is. > If I disable code gen I get the same answer in both cases. 
> {code:java} > scala> spark.conf.set("spark.sql.codegen.wholeStage", false) > scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN") > scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", > "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) > ++-++ > |v |ts |unix_micros(ts) | > ++-++ > |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 | > |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808| > |0 |1970-01-01 00:00:00 |0 | > |1990 |1970-01-01 00:33:10 |199000 | > ++-++ > scala> Seq(Long.MaxValue, Long.MinValue, 0L, > 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as > ts").selectExpr("*", "unix_micros(ts)").show(false) > ++-++ > |v |ts |unix_micros(ts) | > ++-++ > |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 | > |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808| > |0 |1970-01-01 00:00:00 |0 | > |1990 |1970-01-01 00:33:10 |199000 | > ++-++ > {code} > [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627] > Is the code used in codegen, but > [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687] > is what is used outside of code gen. > Apparently `SECONDS.toMicros` truncates the value on an overflow, but the > codegen does not. > {code:java} > scala> Long.MaxValue > res11: Long = 9223372036854775807 > scala> java.util.concurrent.TimeUnit.SECONDS.toMicros(Long.MaxValue) > res12: Long = 9223372036854775807 > scala> Long.MaxValue
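The divergence described above comes down to overflow handling: `TimeUnit.SECONDS.toMicros` saturates at `Long.MaxValue`/`Long.MinValue`, while the generated code's raw 64-bit multiply wraps around. A small Python model of the two behaviors (the function names are illustrative, not Spark's):

```python
LONG_MAX = (1 << 63) - 1
LONG_MIN = -(1 << 63)

def to_micros_interpreted(seconds):
    """Mimic TimeUnit.SECONDS.toMicros: saturate at Long.MaxValue/MinValue."""
    return max(LONG_MIN, min(LONG_MAX, seconds * 1_000_000))

def to_micros_codegen(seconds):
    """Mimic a raw 64-bit multiply: wrap around (two's complement) on overflow."""
    return ((seconds * 1_000_000 - LONG_MIN) % (1 << 64)) + LONG_MIN
```

For `Long.MaxValue` the saturating path returns `Long.MaxValue` (hence the `+294247-...` timestamp), while the wrapping path returns `-1000000` (hence `1969-12-31 23:59:59`); for in-range inputs like `1990` both agree, matching the two tables in the report.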
[jira] [Resolved] (SPARK-43256) Assign a name to the error class _LEGACY_ERROR_TEMP_2021
[ https://issues.apache.org/jira/browse/SPARK-43256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-43256. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45198 [https://github.com/apache/spark/pull/45198] > Assign a name to the error class _LEGACY_ERROR_TEMP_2021 > > > Key: SPARK-43256 > URL: https://issues.apache.org/jira/browse/SPARK-43256 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Assignee: A G >Priority: Minor > Labels: pull-request-available, starter > Fix For: 4.0.0 > > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2021* defined in > {*}core/src/main/resources/error/error-classes.json{*}. The name should be > short but complete (look at the example in error-classes.json). > Add a test which triggers the error from user code if such test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The last function > checks valuable error fields only, and avoids dependencies from error text > message. In this way, tech editors can modify error format in > error-classes.json, and don't worry of Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using SQL query), replace > the error by an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current is not > clear. Propose a solution to users how to avoid and fix such kind of errors. > Please, look at the PR below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43256) Assign a name to the error class _LEGACY_ERROR_TEMP_2021
[ https://issues.apache.org/jira/browse/SPARK-43256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-43256: Assignee: A G > Assign a name to the error class _LEGACY_ERROR_TEMP_2021 > > > Key: SPARK-43256 > URL: https://issues.apache.org/jira/browse/SPARK-43256 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Assignee: A G >Priority: Minor > Labels: pull-request-available, starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2021* defined in > {*}core/src/main/resources/error/error-classes.json{*}. The name should be > short but complete (look at the example in error-classes.json). > Add a test which triggers the error from user code if such test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The last function > checks valuable error fields only, and avoids dependencies from error text > message. In this way, tech editors can modify error format in > error-classes.json, and don't worry of Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using SQL query), replace > the error by an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current is not > clear. Propose a solution to users how to avoid and fix such kind of errors. > Please, look at the PR below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490]
[jira] [Commented] (SPARK-47193) Converting dataframe to rdd results in data loss
[ https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821393#comment-17821393 ] Bruce Robbins commented on SPARK-47193: --- Running this in Spark 3.5.0 in local mode on my laptop, I get {noformat} df count = 8 ... rdd count = 8 {noformat} What is your environment and Spark configuration? By the way, the {{...}} above are messages like {noformat} 24/02/27 11:34:51 WARN CSVHeaderChecker: CSV header does not conform to the schema. Header: UserId, LocationId, LocationName, CreatedDate, Status Schema: UserId, LocationId, LocationName, Status, CreatedDate Expected: Status but found: CreatedDate CSV file: file:userLocation.csv {noformat} > Converting dataframe to rdd results in data loss > > > Key: SPARK-47193 > URL: https://issues.apache.org/jira/browse/SPARK-47193 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0, 3.5.1 >Reporter: Ivan Bova >Priority: Critical > Labels: correctness > Attachments: device.csv, deviceClass.csv, deviceType.csv, > language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, > userLocation.csv, userProfile.csv > > > I have 10 csv files and need to create mapping from them. After all of the > joins dataframe contains all expected rows but rdd from this dataframe > contains only half of them. 
> {code:java} > case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: > String, LastName: String, LanguageId: Option[Int]) > case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String) > case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], > UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: > Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: > Option[Int]) > case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String) > case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String) > case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: > Option[Double], Longitude: Option[Double], Radius: Option[Double], > CreatedDate: Timestamp) > case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String) > case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: > String, Status: Int, CreatedDate: Timestamp) > case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: > Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp]) > case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], > Address1: String, Address2: String, City: String, State: String, Country: > String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: > Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, > UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: > Option[Boolean], Level: Option[Int], TimeZone: Option[Int]) > val userProfile = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage] > val language = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage] > val device = 
spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage] > val deviceClass = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage] > val deviceType = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage] > val location1 = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyLocation1].schema).csv("location1.csv").as[MyLocation1] > val timeZoneLookup = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyTimeZoneLookupMessage].schema).csv("timeZoneLookup.csv").as[MyTimeZoneLookupMessage] > val userLocation = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyUserLocationMessage].schema).csv("userLocation.csv").as[MyUserLocationMessage] > val user = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyUserMessage].schema).csv("u
[jira] [Commented] (SPARK-44389) ExecutorDeadException when using decommissioning without external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-44389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821373#comment-17821373 ] Dongjoon Hyun commented on SPARK-44389: --- I guess this was fixed by reverting SPARK-43043 via SPARK-44630. Could you try Apache Spark 3.4.2? > ExecutorDeadException when using decommissioning without external shuffle > service > - > > Key: SPARK-44389 > URL: https://issues.apache.org/jira/browse/SPARK-44389 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Volodymyr Kot >Priority: Major > > Hey, we are trying to use executor decommissioning without external shuffle > service. We are trying to understand: > # How often should we expect to see ExecutorDeadException? How is > information about changes to the location of blocks propagated? > # Whether the task should be re-submitted if we hit that during > decommissioning? > > Current behavior that we observe: > # Executor 1 is decommissioned > # Driver successfully removes executor 1's block manager > [here|https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala#L44] > # A task is started on executor 2 > # We hit `ExecutorDeadException` on executor 2 when trying to fetch blocks > from executor 1 > [here|https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala#L139-L140] > # Task on executor 2 fails > # Stage fails > # Stage is re-submitted and succeeds > As far as we understand, this happens because executor 2 has stale [map > status > cache|https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1235-L1236] > Is that expected behavior? 
Shouldn't the task be retried in that case instead > of the whole stage failing and being retried? This makes Spark job execution > longer, especially if there are a lot of decommission events. > > Also found [this > comment|https://github.palantir.build/foundry/spark/blob/aad028ae02011b079e8812f7e63869323cc1ed78/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L113-L115], > which makes sense for FetchFailures without decommissioning, but with > decommissioning data could have been migrated - and we need to fetch a new > location. Maybe it makes sense to special-case this codepath to check whether > the executor was decommissioned? Since > https://issues.apache.org/jira/browse/SPARK-40979 we already store that > information.
[jira] [Comment Edited] (SPARK-44478) Executor decommission causes stage failure
[ https://issues.apache.org/jira/browse/SPARK-44478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821370#comment-17821370 ] Dongjoon Hyun edited comment on SPARK-44478 at 2/27/24 6:51 PM: Do you still see this issue with Apache Spark 3.4.2 or 3.5.1, [~dhuett]? was (Author: dongjoon): Do you still see this issue with Apache Spark 3.4.2, [~dhuett]? > Executor decommission causes stage failure > -- > > Key: SPARK-44478 > URL: https://issues.apache.org/jira/browse/SPARK-44478 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.4.0, 3.4.1 >Reporter: Dale Huettenmoser >Priority: Minor > > During spark execution, save fails due to executor decommissioning. Issue not > present in 3.3.0 > Sample error: > > {code:java} > An error occurred while calling o8948.save. > : org.apache.spark.SparkException: Job aborted due to stage failure: > Authorized committer (attemptNumber=0, stage=170, partition=233) failed; but > task commit success, data duplication may happen. 
> reason=ExecutorLostFailure(1,false,Some(Executor decommission: Executor 1 is > decommissioned.)) > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1(DAGScheduler.scala:1199) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1$adapted(DAGScheduler.scala:1199) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleStageFailed(DAGScheduler.scala:1199) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2981) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2263) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeWrite$4(FileFormatWriter.scala:307) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.writeAndCommit(FileFormatWriter.scala:271) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeWrite(FileFormatWriter.scala:304) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:190) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103) > at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512) >
[jira] [Commented] (SPARK-44478) Executor decommission causes stage failure
[ https://issues.apache.org/jira/browse/SPARK-44478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821370#comment-17821370 ] Dongjoon Hyun commented on SPARK-44478: --- Do you still see this issue with Apache Spark 3.4.2, [~dhuett]? > Executor decommission causes stage failure > -- > > Key: SPARK-44478 > URL: https://issues.apache.org/jira/browse/SPARK-44478 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.4.0, 3.4.1 >Reporter: Dale Huettenmoser >Priority: Minor > > During spark execution, save fails due to executor decommissioning. Issue not > present in 3.3.0 > Sample error: > > {code:java} > An error occurred while calling o8948.save. > : org.apache.spark.SparkException: Job aborted due to stage failure: > Authorized committer (attemptNumber=0, stage=170, partition=233) failed; but > task commit success, data duplication may happen. > reason=ExecutorLostFailure(1,false,Some(Executor decommission: Executor 1 is > decommissioned.)) > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1(DAGScheduler.scala:1199) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1$adapted(DAGScheduler.scala:1199) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleStageFailed(DAGScheduler.scala:1199) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2981) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2263) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeWrite$4(FileFormatWriter.scala:307) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.writeAndCommit(FileFormatWriter.scala:271) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeWrite(FileFormatWriter.scala:304) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:190) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103) > at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827) > at > 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104) > at > org.apache.spark.sql.catalyst.trees.Tre
[jira] [Commented] (SPARK-47194) Upgrade log4j2 to 2.23.0
[ https://issues.apache.org/jira/browse/SPARK-47194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821367#comment-17821367 ] Dongjoon Hyun commented on SPARK-47194: --- Hi, [~LuciferYang]. Thank you for adding here. SPARK-47046 is inclusive for all dependency upgrades. Feel free to add here. This JIRA is an umbrella not to miss any *notable* dependency changes. > Upgrade log4j2 to 2.23.0 > > > Key: SPARK-47194 > URL: https://issues.apache.org/jira/browse/SPARK-47194 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available >
[jira] [Comment Edited] (SPARK-47033) EXECUTE IMMEDIATE USING does not recognize session variable names
[ https://issues.apache.org/jira/browse/SPARK-47033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17820680#comment-17820680 ] A G edited comment on SPARK-47033 at 2/27/24 6:08 PM: -- I want to work on this! PR: https://github.com/apache/spark/pull/45293 was (Author: JIRAUSER304341): I want to work on this! > EXECUTE IMMEDIATE USING does not recognize session variable names > - > > Key: SPARK-47033 > URL: https://issues.apache.org/jira/browse/SPARK-47033 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Priority: Major > Labels: pull-request-available > > {noformat} > DECLARE parm = 'Hello'; > EXECUTE IMMEDIATE 'SELECT :parm' USING parm; > [ALL_PARAMETERS_MUST_BE_NAMED] Using name parameterized queries requires all > parameters to be named. Parameters missing names: "parm". SQLSTATE: 07001 > EXECUTE IMMEDIATE 'SELECT :parm' USING parm AS parm; > Hello > {noformat} > variables are like column references, they act as their own aliases and thus > should not be required to be named to associate with a named parameter with > the same name. > Note that unlike for pySpark this should be case insensitive (haven't > verified).
[jira] [Updated] (SPARK-47033) EXECUTE IMMEDIATE USING does not recognize session variable names
[ https://issues.apache.org/jira/browse/SPARK-47033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47033: --- Labels: pull-request-available (was: ) > EXECUTE IMMEDIATE USING does not recognize session variable names > - > > Key: SPARK-47033 > URL: https://issues.apache.org/jira/browse/SPARK-47033 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Priority: Major > Labels: pull-request-available > > {noformat} > DECLARE parm = 'Hello'; > EXECUTE IMMEDIATE 'SELECT :parm' USING parm; > [ALL_PARAMETERS_MUST_BE_NAMED] Using name parameterized queries requires all > parameters to be named. Parameters missing names: "parm". SQLSTATE: 07001 > EXECUTE IMMEDIATE 'SELECT :parm' USING parm AS parm; > Hello > {noformat} > variables are like column references, they act as their own aliases and thus > should not be required to be named to associate with a named parameter with > the same name. > Note that unlike for pySpark this should be case insensitive (haven't > verified).
[jira] [Updated] (SPARK-47194) Upgrade log4j2 to 2.23.0
[ https://issues.apache.org/jira/browse/SPARK-47194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47194: --- Labels: pull-request-available (was: ) > Upgrade log4j2 to 2.23.0 > > > Key: SPARK-47194 > URL: https://issues.apache.org/jira/browse/SPARK-47194 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-47194) Upgrade log4j2 to 2.23.0
Yang Jie created SPARK-47194: Summary: Upgrade log4j2 to 2.23.0 Key: SPARK-47194 URL: https://issues.apache.org/jira/browse/SPARK-47194 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 4.0.0 Reporter: Yang Jie
[jira] [Updated] (SPARK-47192) Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature)
[ https://issues.apache.org/jira/browse/SPARK-47192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47192: --- Labels: pull-request-available (was: ) > Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature) > -- > > Key: SPARK-47192 > URL: https://issues.apache.org/jira/browse/SPARK-47192 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Priority: Major > Labels: pull-request-available > > Old: > > GRANT ROLE; > _LEGACY_ERROR_TEMP_0035 > Operation not allowed: grant role. (line 1, pos 0) > > New: > error class: HIVE_OPERATION_NOT_SUPPORTED > The Hive operation is not supported. (line 1, pos 0) > > sqlstate: 0A000
[jira] [Updated] (SPARK-46834) Aggregate support for strings with collation
[ https://issues.apache.org/jira/browse/SPARK-46834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46834: --- Labels: pull-request-available (was: ) > Aggregate support for strings with collation > > > Key: SPARK-46834 > URL: https://issues.apache.org/jira/browse/SPARK-46834 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-47193) Converting dataframe to rdd results in data loss
[ https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bova updated SPARK-47193: -- Attachment: device.csv deviceClass.csv deviceType.csv language.csv location.csv location1.csv timeZoneLookup.csv user.csv userLocation.csv userProfile.csv > Converting dataframe to rdd results in data loss > > > Key: SPARK-47193 > URL: https://issues.apache.org/jira/browse/SPARK-47193 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0, 3.5.1 >Reporter: Ivan Bova >Priority: Critical > Labels: correctness > Attachments: device.csv, deviceClass.csv, deviceType.csv, > language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, > userLocation.csv, userProfile.csv > > > I have 10 csv files and need to create mapping from them. After all of the > joins dataframe contains all expected rows but rdd from this dataframe > contains only half of them. > {code:java} > case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: > String, LastName: String, LanguageId: Option[Int]) > case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String) > case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], > UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: > Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: > Option[Int]) > case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String) > case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String) > case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: > Option[Double], Longitude: Option[Double], Radius: Option[Double], > CreatedDate: Timestamp) > case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String) > case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: > String, Status: Int, CreatedDate: Timestamp) > case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: > 
Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp]) > case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], > Address1: String, Address2: String, City: String, State: String, Country: > String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: > Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, > UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: > Option[Boolean], Level: Option[Int], TimeZone: Option[Int]) > val userProfile = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage] > val language = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage] > val device = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage] > val deviceClass = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage] > val deviceType = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage] > val location1 = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyLocation1].schema).csv("location1.csv").as[MyLocation1] > val timeZoneLookup = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyTimeZoneLookupMessage].schema).csv("timeZoneLookup.csv").as[MyTimeZoneLookupMessage] > val userLocation = 
spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyUserLocationMessage].schema).csv("userLocation.csv").as[MyUserLocationMessage] > val user = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyUserMessage].schema).csv("user.csv").as[MyUserMessage] > val location = spark.read.option("header", "true").option("comment", > "#").option("nullValue", > "null").schema(Encoders.product[MyLocationMessage].schema).csv("location.csv").as[MyLocationMessage] > val result = user > .join(userProfile, user("UserId"
[jira] [Created] (SPARK-47193) Converting dataframe to rdd results in data loss
Ivan Bova created SPARK-47193: - Summary: Converting dataframe to rdd results in data loss Key: SPARK-47193 URL: https://issues.apache.org/jira/browse/SPARK-47193 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.1, 3.5.0 Reporter: Ivan Bova

I have 10 csv files and need to create a mapping from them. After all of the joins, the dataframe contains all expected rows, but the rdd created from this dataframe contains only half of them.

{code:java}
case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: String, LastName: String, LanguageId: Option[Int])
case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: Option[Int])
case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: Option[Double], Longitude: Option[Double], Radius: Option[Double], CreatedDate: Timestamp)
case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: String, Status: Int, CreatedDate: Timestamp)
case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], Address1: String, Address2: String, City: String, State: String, Country: String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: Option[Boolean], Level: Option[Int], TimeZone: Option[Int])

val userProfile = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage]
val language = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage]
val device = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage]
val deviceClass = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage]
val deviceType = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage]
val location1 = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyLocation1].schema).csv("location1.csv").as[MyLocation1]
val timeZoneLookup = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyTimeZoneLookupMessage].schema).csv("timeZoneLookup.csv").as[MyTimeZoneLookupMessage]
val userLocation = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyUserLocationMessage].schema).csv("userLocation.csv").as[MyUserLocationMessage]
val user = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyUserMessage].schema).csv("user.csv").as[MyUserMessage]
val location = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyLocationMessage].schema).csv("location.csv").as[MyLocationMessage]

val result = user
  .join(userProfile, user("UserId") === userProfile("UserId"), "inner")
  .join(language, userProfile("LanguageId") === language("LanguageId"), "left")
  .join(userLocation, user("UserId") === userLocation("UserId"), "inner")
  .join(location, userLocation("LocationId") === location("LocationId"), "inner")
  .join(device, location("LocationId") === device("LocationId"), "inner")
  .join(deviceType, device("DeviceTypeId") === deviceType("DeviceTypeId"), "inner")
  .join(deviceClass, device("DeviceClassId") === deviceClass("DeviceClassId"), "inner")
  .join(timeZoneLookup, timeZoneLookup("TimeZoneId") === location("TimeZone"), "left")
  .join(location1, location("LocationId") === location1("LocationId"), "left")
  .where(
    device("UserId1").isNull &&
    (user("Active") === lit(true) || user("ActivatedDate").isNotNull)
  )
  .
[jira] [Created] (SPARK-47192) Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature)
Serge Rielau created SPARK-47192: Summary: Convert _LEGACY_ERROR_TEMP_0035 (unsupported hive feature) Key: SPARK-47192 URL: https://issues.apache.org/jira/browse/SPARK-47192 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Serge Rielau Old: > GRANT ROLE; _LEGACY_ERROR_TEMP_0035 Operation not allowed: grant role. (line 1, pos 0) New: error class: HIVE_OPERATION_NOT_SUPPORTED The Hive operation is not supported. (line 1, pos 0) sqlstate: 0A000 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47185) Increase timeout between actions in KafkaContinuousTest
[ https://issues.apache.org/jira/browse/SPARK-47185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47185: - Assignee: Hyukjin Kwon > Increase timeout between actions in KafkaContinuousTest > --- > > Key: SPARK-47185 > URL: https://issues.apache.org/jira/browse/SPARK-47185 > Project: Spark > Issue Type: Test > Components: Structured Streaming, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > > It fails in the macOS build
[jira] [Resolved] (SPARK-47185) Increase timeout between actions in KafkaContinuousTest
[ https://issues.apache.org/jira/browse/SPARK-47185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47185. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45283 [https://github.com/apache/spark/pull/45283] > Increase timeout between actions in KafkaContinuousTest > --- > > Key: SPARK-47185 > URL: https://issues.apache.org/jira/browse/SPARK-47185 > Project: Spark > Issue Type: Test > Components: Structured Streaming, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > It fails in the macOS build
[jira] [Updated] (SPARK-47191) avoid unnecessary relation lookup when uncaching table/view
[ https://issues.apache.org/jira/browse/SPARK-47191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47191: --- Labels: pull-request-available (was: ) > avoid unnecessary relation lookup when uncaching table/view > --- > > Key: SPARK-47191 > URL: https://issues.apache.org/jira/browse/SPARK-47191 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Priority: Major > Labels: pull-request-available
[jira] [Created] (SPARK-47191) avoid unnecessary relation lookup when uncaching table/view
Wenchen Fan created SPARK-47191: --- Summary: avoid unnecessary relation lookup when uncaching table/view Key: SPARK-47191 URL: https://issues.apache.org/jira/browse/SPARK-47191 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Wenchen Fan
[jira] [Commented] (SPARK-47190) Add support for checkpointing to Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-47190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821286#comment-17821286 ] Nicholas Chammas commented on SPARK-47190: -- [~gurwls223] - Is there some design reason we do _not_ want to support checkpointing in Spark Connect? Or is it just a matter of someone taking the time to implement support? If the latter, do we do so via a new method directly on {{SparkSession}}, or shall we somehow expose a limited version of {{spark.sparkContext}} so users can call the existing {{setCheckpointDir()}} method? > Add support for checkpointing to Spark Connect > -- > > Key: SPARK-47190 > URL: https://issues.apache.org/jira/browse/SPARK-47190 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > The {{sparkContext}} that underlies a given {{SparkSession}} is not > accessible over Spark Connect. This means you cannot call > {{spark.sparkContext.setCheckpointDir(...)}}, which in turn means you cannot > checkpoint a DataFrame. > We should add support for this somehow to Spark Connect.
[jira] [Created] (SPARK-47190) Add support for checkpointing to Spark Connect
Nicholas Chammas created SPARK-47190: Summary: Add support for checkpointing to Spark Connect Key: SPARK-47190 URL: https://issues.apache.org/jira/browse/SPARK-47190 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 4.0.0 Reporter: Nicholas Chammas The {{sparkContext}} that underlies a given {{SparkSession}} is not accessible over Spark Connect. This means you cannot call {{spark.sparkContext.setCheckpointDir(...)}}, which in turn means you cannot checkpoint a DataFrame. We should add support for this somehow to Spark Connect.
[jira] [Assigned] (SPARK-47189) Tweak column error names and text
[ https://issues.apache.org/jira/browse/SPARK-47189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-47189: Assignee: Nicholas Chammas > Tweak column error names and text > - > > Key: SPARK-47189 > URL: https://issues.apache.org/jira/browse/SPARK-47189 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Labels: pull-request-available
[jira] [Resolved] (SPARK-47189) Tweak column error names and text
[ https://issues.apache.org/jira/browse/SPARK-47189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-47189. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45276 [https://github.com/apache/spark/pull/45276] > Tweak column error names and text > - > > Key: SPARK-47189 > URL: https://issues.apache.org/jira/browse/SPARK-47189 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0
[jira] [Commented] (SPARK-45527) Task fraction resource request is not expected
[ https://issues.apache.org/jira/browse/SPARK-45527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821279#comment-17821279 ] Thomas Graves commented on SPARK-45527: --- Note that this is related to SPARK-39853, which was supposed to implement stage-level scheduling with dynamic allocation disabled. That PR did not properly handle resources (gpu, fpga, etc.).

> Task fraction resource request is not expected
> --
>
> Key: SPARK-45527
> URL: https://issues.apache.org/jira/browse/SPARK-45527
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.1, 3.3.3, 3.4.1, 3.5.0
> Reporter: wuyi
> Assignee: Bobby Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> {code:java}
> test("SPARK-XXX") {
>   import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}
>   withTempDir { dir =>
>     val scriptPath = createTempScriptWithExpectedOutput(dir, "gpuDiscoveryScript",
>       """{"name": "gpu","addresses":["0"]}""")
>     val conf = new SparkConf()
>       .setAppName("test")
>       .setMaster("local-cluster[1, 12, 1024]")
>       .set("spark.executor.cores", "12")
>     conf.set(TASK_GPU_ID.amountConf, "0.08")
>     conf.set(WORKER_GPU_ID.amountConf, "1")
>     conf.set(WORKER_GPU_ID.discoveryScriptConf, scriptPath)
>     conf.set(EXECUTOR_GPU_ID.amountConf, "1")
>     sc = new SparkContext(conf)
>     val rdd = sc.range(0, 100, 1, 4)
>     var rdd1 = rdd.repartition(3)
>     val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 1.0)
>     val rp = new ResourceProfileBuilder().require(treqs).build
>     rdd1 = rdd1.withResources(rp)
>     assert(rdd1.collect().size === 100)
>   }
> }
> {code}
> In the above test, the 3 tasks generated by rdd1 are expected to be executed in sequence, as "new TaskResourceRequests().cpus(1).resource("gpu", 1.0)" should override "conf.set(TASK_GPU_ID.amountConf, "0.08")". However, those 3 tasks in fact run in parallel.
> The root cause is that ExecutorData#ExecutorResourceInfo#numParts is static. In this case, "gpu.numParts" is initialized with 12 (1/0.08) and won't change even if there's a new task resource request (e.g., resource("gpu", 1.0) in this case). Thus, those 3 tasks are able to be executed in parallel.
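The slot arithmetic behind that root cause can be sketched without Spark. The snippet below is a plain-Scala simplification of the bookkeeping described above; the object and method names are hypothetical and do not reflect Spark's actual ExecutorResourceInfo API:

```scala
// Hypothetical sketch: how a fractional task resource amount becomes a
// per-address slot count, and why a stale count lets tasks run in parallel.
object GpuSlots {
  // With taskAmount = 0.08, one GPU address is split into floor(1 / 0.08) = 12 slots.
  def slotsPerAddress(taskAmount: Double): Int = (1.0 / taskAmount).toInt

  def main(args: Array[String]): Unit = {
    // Computed once when the executor registers, from TASK_GPU_ID.amountConf.
    val initialSlots = slotsPerAddress(0.08)
    println(initialSlots) // 12

    // A later ResourceProfile asking for a full GPU (amount = 1.0) should leave
    // only one slot per address; if the count is never recomputed, 12 slots
    // remain and tasks that should serialize can run concurrently.
    val requestedSlots = slotsPerAddress(1.0)
    println(requestedSlots) // 1
  }
}
```

The bug report boils down to the scheduler continuing to use `initialSlots` (12) instead of recomputing `requestedSlots` (1) for the new profile.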
[jira] [Commented] (SPARK-47063) CAST long to timestamp has different behavior for codegen vs interpreted
[ https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821273#comment-17821273 ] Pablo Langa Blanco commented on SPARK-47063: Ok, personally I would prefer the truncation too. I'll make a PR with it and we can discuss it there.

> CAST long to timestamp has different behavior for codegen vs interpreted
> 
>
> Key: SPARK-47063
> URL: https://issues.apache.org/jira/browse/SPARK-47063
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.2
> Reporter: Robert Joseph Evans
> Priority: Major
>
> It probably impacts a lot more versions of the code than this, but I verified it on 3.4.2. This also appears to be related to https://issues.apache.org/jira/browse/SPARK-39209
> {code:java}
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
>
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-------------------+---------------+
> |v                   |ts                 |unix_micros(ts)|
> +--------------------+-------------------+---------------+
> |9223372036854775807 |1969-12-31 23:59:59|-1000000       |
> |-9223372036854775808|1970-01-01 00:00:00|0              |
> |0                   |1970-01-01 00:00:00|0              |
> |1990                |1970-01-01 00:33:10|1990000000     |
> +--------------------+-------------------+---------------+
> {code}
> It looks like InMemoryTableScanExec is not doing code generation for the expressions, but the ProjectExec after the repartition is.
> If I disable code gen I get the same answer in both cases.
> {code:java}
> scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
> scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
>
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> {code}
> https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627 is the code used in codegen, but https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687 is what is used outside of codegen.
> Apparently `SECONDS.toMicros` truncates the value on an overflow, but the codegen does not.
> {code:java}
> scala> Long.MaxValue
> res11: Long = 9223372036854775807
> scala> java.util.concurrent.TimeUnit.SECONDS.toM
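The overflow difference reported in SPARK-47063 can be reproduced without Spark at all. This small plain-Scala sketch contrasts `java.util.concurrent.TimeUnit`'s saturating seconds-to-microseconds conversion (the interpreted path) with a raw 64-bit multiplication, which wraps around the way the generated code's result suggests:

```scala
import java.util.concurrent.TimeUnit

object SecondsToMicros {
  def main(args: Array[String]): Unit = {
    val maxSeconds = Long.MaxValue

    // TimeUnit saturates on overflow: the result is clamped to Long.MaxValue
    // (or Long.MinValue for large negative inputs).
    val saturated = TimeUnit.SECONDS.toMicros(maxSeconds)
    println(saturated) // 9223372036854775807

    // A plain 64-bit multiplication by 1,000,000 wraps around instead:
    // Long.MaxValue * 1000000L overflows to -1000000.
    val wrapped = maxSeconds * 1000000L
    println(wrapped) // -1000000
  }
}
```

The wrapped value of -1000000 microseconds corresponds to 1969-12-31 23:59:59, matching the timestamp the bug report shows for the codegen path.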
[jira] [Updated] (SPARK-47189) Tweak column error names and text
[ https://issues.apache.org/jira/browse/SPARK-47189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47189: --- Labels: pull-request-available (was: ) > Tweak column error names and text > - > > Key: SPARK-47189 > URL: https://issues.apache.org/jira/browse/SPARK-47189 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available
[jira] [Created] (SPARK-47189) Tweak column error names and text
Nicholas Chammas created SPARK-47189: Summary: Tweak column error names and text Key: SPARK-47189 URL: https://issues.apache.org/jira/browse/SPARK-47189 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Nicholas Chammas
[jira] [Updated] (SPARK-47145) Provide table identifier to scan node when DS v2 strategy is applied
[ https://issues.apache.org/jira/browse/SPARK-47145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47145: --- Labels: pull-request-available (was: ) > Provide table identifier to scan node when DS v2 strategy is applied > > > Key: SPARK-47145 > URL: https://issues.apache.org/jira/browse/SPARK-47145 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Uros Stankovic >Priority: Minor > Labels: pull-request-available > > Currently, DataSourceScanExec node can accept table identifier, and that > information can be useful for later logging, debugging, etc, but > DataSourceV2Strategy does not provide that information to scan node.
[jira] [Resolved] (SPARK-47179) Improve error message from SparkThrowableSuite for better debuggability
[ https://issues.apache.org/jira/browse/SPARK-47179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-47179. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45273 [https://github.com/apache/spark/pull/45273] > Improve error message from SparkThrowableSuite for better debuggability > --- > > Key: SPARK-47179 > URL: https://issues.apache.org/jira/browse/SPARK-47179 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Current error message is not very helpful when error classes documentation is > not up-to-date so we better improve it
[jira] [Assigned] (SPARK-47179) Improve error message from SparkThrowableSuite for better debuggability
[ https://issues.apache.org/jira/browse/SPARK-47179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-47179: Assignee: Haejoon Lee > Improve error message from SparkThrowableSuite for better debuggability > --- > > Key: SPARK-47179 > URL: https://issues.apache.org/jira/browse/SPARK-47179 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > > Current error message is not very helpful when error classes documentation is > not up-to-date so we better improve it