[jira] [Created] (SPARK-46352) Support spark conf to configure log level of specific package or class
Zhongwei Zhu created SPARK-46352:

Summary: Support spark conf to configure log level of specific package or class
Key: SPARK-46352
URL: https://issues.apache.org/jira/browse/SPARK-46352
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.5.0
Reporter: Zhongwei Zhu

Use the Spark conf `spark.log.level.org.apache.spark=error` to set the logger `org.apache.spark` to `error`.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
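A conf scheme like the one proposed above could be handled by stripping a fixed prefix from matching keys. A minimal sketch, not Spark's implementation; the prefix handling and normalization are assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: extract (loggerName -> LEVEL) pairs from a flat conf map, taking
// every key that carries the "spark.log.level." prefix proposed by the issue.
class LogLevelConf {
    static final String PREFIX = "spark.log.level.";

    public static Map<String, String> parse(Map<String, String> conf) {
        Map<String, String> levels = new HashMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            // Skip keys that are only the bare prefix with no logger name.
            if (e.getKey().startsWith(PREFIX) && e.getKey().length() > PREFIX.length()) {
                levels.put(e.getKey().substring(PREFIX.length()),
                           e.getValue().toUpperCase());
            }
        }
        return levels;
    }
}
```

So `spark.log.level.org.apache.spark=error` would yield the pair (`org.apache.spark`, `ERROR`), which could then be applied to the logging backend.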
[jira] [Created] (SPARK-45622) java -target should use java.version instead of 17
Zhongwei Zhu created SPARK-45622:

Summary: java -target should use java.version instead of 17
Key: SPARK-45622
URL: https://issues.apache.org/jira/browse/SPARK-45622
Project: Spark
Issue Type: Improvement
Components: Build
Affects Versions: 3.5.0
Reporter: Zhongwei Zhu
[jira] [Updated] (SPARK-41954) Add isDecommissioned in ExecutorDeadException
[ https://issues.apache.org/jira/browse/SPARK-41954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-41954:

Description: When an executor dies, we want to know whether the death was caused by decommissioning.

> Add isDecommissioned in ExecutorDeadException
> Key: SPARK-41954
> URL: https://issues.apache.org/jira/browse/SPARK-41954
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 3.3.1
> Reporter: Zhongwei Zhu
> Priority: Major
[jira] [Updated] (SPARK-45372) Handle ClassNotFoundException when load extension
[ https://issues.apache.org/jira/browse/SPARK-45372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-45372:

Summary: Handle ClassNotFoundException when load extension (was: Handle ClassNotFound when load extension)

> Handle ClassNotFoundException when load extension
> Key: SPARK-45372
> URL: https://issues.apache.org/jira/browse/SPARK-45372
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.5.0
> Reporter: Zhongwei Zhu
> Priority: Minor
>
> When loading an extension throws ClassNotFoundException, SparkContext fails to initialize. Better behavior would be to skip that extension and log the error without failing SparkContext initialization.
[jira] [Created] (SPARK-45372) Handle ClassNotFound when load extension
Zhongwei Zhu created SPARK-45372:

Summary: Handle ClassNotFound when load extension
Key: SPARK-45372
URL: https://issues.apache.org/jira/browse/SPARK-45372
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.5.0
Reporter: Zhongwei Zhu

When loading an extension throws ClassNotFoundException, SparkContext fails to initialize. Better behavior would be to skip that extension and log the error without failing SparkContext initialization.
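The proposed behavior can be sketched as a loader that catches ClassNotFoundException per extension and continues. This is illustrative only; `loadExtensions` and its signature are hypothetical, not Spark's actual extension-loading API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed behavior: a missing extension class is logged and
// skipped instead of aborting the whole initialization.
class ExtensionLoader {
    public static List<Object> loadExtensions(List<String> classNames) {
        List<Object> loaded = new ArrayList<>();
        for (String name : classNames) {
            try {
                loaded.add(Class.forName(name).getDeclaredConstructor().newInstance());
            } catch (ClassNotFoundException e) {
                // Proposed behavior: log the error and continue, do not propagate.
                System.err.println("Cannot load extension " + name + ", skipping");
            } catch (ReflectiveOperationException e) {
                System.err.println("Cannot instantiate extension " + name + ", skipping");
            }
        }
        return loaded;
    }
}
```

A bad class name then produces an error log entry and an empty result rather than a failed SparkContext.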
[jira] [Created] (SPARK-45217) Support change log level of specific package or class
Zhongwei Zhu created SPARK-45217:

Summary: Support change log level of specific package or class
Key: SPARK-45217
URL: https://issues.apache.org/jira/browse/SPARK-45217
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.5.0
Reporter: Zhongwei Zhu

Add SparkContext.setLogLevel(loggerName: String, logLevel: String) to support changing the log level of a specific package or class.
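The proposal adds SparkContext.setLogLevel(loggerName: String, logLevel: String). Spark itself routes logging through Log4j; as a stand-in, this sketch shows the same per-logger idea with JDK java.util.logging, whose level names differ (e.g. SEVERE rather than ERROR):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch of per-logger level changes using JDK logging, illustrating the
// shape of the proposed SparkContext.setLogLevel(loggerName, logLevel).
class PerLoggerLevel {
    public static Logger setLogLevel(String loggerName, String logLevel) {
        Logger logger = Logger.getLogger(loggerName);
        logger.setLevel(Level.parse(logLevel.toUpperCase()));
        // Caller should keep this reference: JUL holds loggers weakly, so an
        // unreferenced logger (and its level) can be garbage-collected.
        return logger;
    }
}
```

For example, `setLogLevel("org.apache.spark", "severe")` silences everything below SEVERE for that package hierarchy only, leaving other loggers untouched.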
[jira] [Updated] (SPARK-45057) Deadlock caused by rdd replication level of 2
[ https://issues.apache.org/jira/browse/SPARK-45057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-45057:

Description: When 2 tasks running on only 2 executors try to compute the same rdd with a replication level of 2, deadlock will happen. A task only releases its lock after writing the block locally and replicating it to the remote executor.

||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task Thread T3)||Exe 2 (Shuffle Server Thread T4)||
|T0|write lock of rdd| | | |
|T1| | |write lock of rdd| |
|T2|replicate -> UploadBlockSync (blocked by T4)| | | |
|T3| | | |Received UploadBlock request from T1 (blocked by T3)|
|T4| | |replicate -> UploadBlockSync (blocked by T2)| |
|T5| |Received UploadBlock request from T3 (blocked by T1)| | |
|T6|Deadlock|Deadlock|Deadlock|Deadlock|

(was: the same description and table, without the note that a task only releases its lock after the local write and remote replication)

> Deadlock caused by rdd replication level of 2
> Key: SPARK-45057
> URL: https://issues.apache.org/jira/browse/SPARK-45057
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.1
> Reporter: Zhongwei Zhu
> Priority: Major
[jira] [Updated] (SPARK-45057) Deadlock caused by rdd replication level of 2
[ https://issues.apache.org/jira/browse/SPARK-45057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-45057:

Description: When 2 tasks running on only 2 executors try to compute the same rdd with a replication level of 2, deadlock will happen.

||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task Thread T3)||Exe 2 (Shuffle Server Thread T4)||
|T0|write lock of rdd| | | |
|T1| | |write lock of rdd| |
|T2|replicate -> UploadBlockSync (blocked by T4)| | | |
|T3| | | |Received UploadBlock request from T1 (blocked by T3)|
|T4| | |replicate -> UploadBlockSync (blocked by T2)| |
|T5| |Received UploadBlock request from T3 (blocked by T1)| | |
|T6|Deadlock|Deadlock|Deadlock|Deadlock|

(was: the same table with the threads labeled "Task Thread 1", "Shuffle Server Thread 2", etc. instead of T1-T4)

> Deadlock caused by rdd replication level of 2
> Key: SPARK-45057
> URL: https://issues.apache.org/jira/browse/SPARK-45057
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.1
> Reporter: Zhongwei Zhu
> Priority: Major
[jira] [Created] (SPARK-45057) Deadlock caused by rdd replication level of 2
Zhongwei Zhu created SPARK-45057:

Summary: Deadlock caused by rdd replication level of 2
Key: SPARK-45057
URL: https://issues.apache.org/jira/browse/SPARK-45057
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.4.1
Reporter: Zhongwei Zhu

When 2 tasks running on only 2 executors try to compute the same rdd with a replication level of 2, deadlock will happen.

||Time||Exe 1 (Task Thread 1)||Exe 1 (Shuffle Server Thread 2)||Exe 2 (Task Thread 3)||Exe 2 (Shuffle Server Thread 4)||
|T0|write lock of rdd| | | |
|T1| | |write lock of rdd| |
|T2|replicate -> UploadBlockSync (blocked by shuffle server thread 4)| | | |
|T3| | | |Received UploadBlock request (blocked by task thread 3)|
|T4| | |replicate -> UploadBlockSync (blocked by shuffle server thread 2)| |
|T5| |Received UploadBlock request (blocked by task thread 1)| | |
|T6|Deadlock|Deadlock|Deadlock|Deadlock|
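The blocking relationships described above form a cycle: task thread 1 waits on shuffle server thread 4 (the remote upload), which waits on task thread 3 (holder of the write lock on executor 2), which waits on shuffle server thread 2, which waits on task thread 1. A small sketch (not Spark code) models this as a wait-for graph, where a deadlock is exactly a cycle:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: each thread waits on at most one other thread; follow the chain
// from every start node and report a cycle if any node repeats.
class WaitForGraph {
    public static boolean hasCycle(Map<String, String> waitsFor) {
        for (String start : waitsFor.keySet()) {
            Set<String> seen = new HashSet<>();
            String node = start;
            while (node != null && seen.add(node)) {
                node = waitsFor.get(node);  // follow the "blocked by" edge
            }
            if (node != null) return true;  // revisited a node: cycle found
        }
        return false;
    }
}
```

With T1 = task thread 1, T2 = shuffle server thread 2, and so on, the scenario in the table is the map {T1 -> T4, T4 -> T3, T3 -> T2, T2 -> T1}, which this check flags as a deadlock.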
[jira] [Created] (SPARK-44345) Only log unknown shuffle map output as error when shuffle migration disabled
Zhongwei Zhu created SPARK-44345:

Summary: Only log unknown shuffle map output as error when shuffle migration disabled
Key: SPARK-44345
URL: https://issues.apache.org/jira/browse/SPARK-44345
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.4.1
Reporter: Zhongwei Zhu

When decommission and shuffle migration are enabled, there are many error messages like `Asked to update map output for unknown shuffle`. Shuffle cleanup and unregistration are done asynchronously by ContextCleaner, so by the time the target block manager has received a shuffle block from a decommissioned block manager and tries to update the shuffle location in the map output tracker, the shuffle may already have been unregistered. This should not be considered an error.
[jira] [Updated] (SPARK-44126) Migration shuffle to decommissioned executor should not count as block failure
[ https://issues.apache.org/jira/browse/SPARK-44126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-44126:

Description: When a shuffle block is migrated to a decommissioned executor, the exception below is thrown:

{code:java}
org.apache.spark.SparkException: Exception thrown in awaitResult:
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
	at org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:122)
	at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$5(BlockManagerDecommissioner.scala:127)
	at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$5$adapted(BlockManagerDecommissioner.scala:118)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:118)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.RuntimeException: org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: Block shuffle_2_6429_0.data cannot be saved on decommissioned executor
	at org.apache.spark.errors.SparkCoreErrors$.cannotSaveBlockOnDecommissionedExecutorError(SparkCoreErrors.scala:238)
	at org.apache.spark.storage.BlockManager.checkShouldStore(BlockManager.scala:277)
	at org.apache.spark.storage.BlockManager.putBlockDataAsStream(BlockManager.scala:741)
	at org.apache.spark.network.netty.NettyBlockRpcServer.receiveStream(NettyBlockRpcServer.scala:174)
	at org.apache.spark.network.server.AbstractAuthRpcHandler.receiveStream(AbstractAuthRpcHandler.java:78)
	at org.apache.spark.network.server.TransportRequestHandler.processStreamUpload(TransportRequestHandler.java:202)
	at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:115)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:190)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
{code}
[jira] [Created] (SPARK-44126) Migration shuffle to decommissioned executor should not count as block failure
Zhongwei Zhu created SPARK-44126:

Summary: Migration shuffle to decommissioned executor should not count as block failure
Key: SPARK-44126
URL: https://issues.apache.org/jira/browse/SPARK-44126
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu

When a shuffle block is migrated to a decommissioned executor, the exception below is thrown:

```
org.apache.spark.SparkException: Exception thrown in awaitResult:
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
	at org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:122)
	at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$5(BlockManagerDecommissioner.scala:127)
	at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$5$adapted(BlockManagerDecommissioner.scala:118)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:118)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.RuntimeException: org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: Block shuffle_2_6429_0.data cannot be saved on decommissioned executor
	at org.apache.spark.errors.SparkCoreErrors$.cannotSaveBlockOnDecommissionedExecutorError(SparkCoreErrors.scala:238)
	at org.apache.spark.storage.BlockManager.checkShouldStore(BlockManager.scala:277)
	at org.apache.spark.storage.BlockManager.putBlockDataAsStream(BlockManager.scala:741)
	at org.apache.spark.network.netty.NettyBlockRpcServer.receiveStream(NettyBlockRpcServer.scala:174)
	at org.apache.spark.network.server.AbstractAuthRpcHandler.receiveStream(AbstractAuthRpcHandler.java:78)
	at org.apache.spark.network.server.TransportRequestHandler.processStreamUpload(TransportRequestHandler.java:202)
	at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:115)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:190)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
```
[jira] [Created] (SPARK-44084) Dynamic allocation pending tasks should not include finished ones
Zhongwei Zhu created SPARK-44084:

Summary: Dynamic allocation pending tasks should not include finished ones
Key: SPARK-44084
URL: https://issues.apache.org/jira/browse/SPARK-44084
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu
[jira] [Created] (SPARK-43828) Add config to control whether close idle connection
Zhongwei Zhu created SPARK-43828:

Summary: Add config to control whether close idle connection
Key: SPARK-43828
URL: https://issues.apache.org/jira/browse/SPARK-43828
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu
[jira] [Updated] (SPARK-43398) Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout
[ https://issues.apache.org/jira/browse/SPARK-43398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-43398:

Affects Version/s: (was: 3.4.0)

> Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout
> Key: SPARK-43398
> URL: https://issues.apache.org/jira/browse/SPARK-43398
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Zhongwei Zhu
> Priority: Major
>
> When dynamic allocation is enabled, the executor timeout should be the max of idleTimeout, rddTimeout and shuffleTimeout.
[jira] [Updated] (SPARK-43398) Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout
[ https://issues.apache.org/jira/browse/SPARK-43398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-43398:

Affects Version/s: 3.0.0

> Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout
> Key: SPARK-43398
> URL: https://issues.apache.org/jira/browse/SPARK-43398
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0, 3.4.0
> Reporter: Zhongwei Zhu
> Priority: Major
>
> When dynamic allocation is enabled, the executor timeout should be the max of idleTimeout, rddTimeout and shuffleTimeout.
[jira] [Created] (SPARK-43399) Add config to control threshold of unregister map output when fetch failed
Zhongwei Zhu created SPARK-43399:

Summary: Add config to control threshold of unregister map output when fetch failed
Key: SPARK-43399
URL: https://issues.apache.org/jira/browse/SPARK-43399
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu

Spark unregisters an executor's map output when a fetch from that executor fails. This might be too aggressive when the fetch failure is temporary and recoverable. So add a config to control how many fetch failures are needed before unregistering map output, and allow disabling unregistration entirely.
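The effect of such a config can be sketched as a per-executor failure counter. The threshold semantics here (a non-positive threshold disables unregistration) and all names are assumptions for illustration, not Spark's actual config:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: only report "unregister this executor's map output" once its
// fetch-failure count reaches the configured threshold.
class FetchFailureTracker {
    private final int threshold;
    private final Map<String, Integer> failures = new HashMap<>();

    FetchFailureTracker(int threshold) {
        this.threshold = threshold;  // <= 0 means never unregister
    }

    // Returns true when the executor's map output should be unregistered.
    public boolean onFetchFailure(String execId) {
        int count = failures.merge(execId, 1, Integer::sum);
        return threshold > 0 && count >= threshold;
    }
}
```

A transient failure or two then leaves the map output registered, while a persistently failing executor still gets unregistered once the threshold is crossed.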
[jira] [Created] (SPARK-43398) Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout
Zhongwei Zhu created SPARK-43398:

Summary: Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout
Key: SPARK-43398
URL: https://issues.apache.org/jira/browse/SPARK-43398
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu

When dynamic allocation is enabled, the executor timeout should be the max of idleTimeout, rddTimeout and shuffleTimeout.
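The proposed rule can be sketched as follows (illustrative, not Spark's ExecutorMonitor code). This sketch assumes the rdd and shuffle timeouts only apply while the executor actually holds cached RDD blocks or shuffle data:

```java
// Sketch: an executor expires only after the largest applicable timeout,
// rather than after whichever timeout happens to be checked first.
class ExecutorTimeout {
    public static long timeoutMs(long idleMs, long rddMs, long shuffleMs,
                                 boolean hasRddBlocks, boolean hasShuffleData) {
        long timeout = idleMs;
        if (hasRddBlocks) timeout = Math.max(timeout, rddMs);
        if (hasShuffleData) timeout = Math.max(timeout, shuffleMs);
        return timeout;
    }
}
```

For instance, with idle/rdd/shuffle timeouts of 60s/120s/300s, an executor holding both cached blocks and shuffle data would be kept for 300s, while a fully idle one still expires after 60s.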
[jira] [Updated] (SPARK-43397) Log executor decommission duration
[ https://issues.apache.org/jira/browse/SPARK-43397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-43397:

Description: Log executor decommission duration.

> Log executor decommission duration
> Key: SPARK-43397
> URL: https://issues.apache.org/jira/browse/SPARK-43397
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Zhongwei Zhu
> Priority: Minor
[jira] [Created] (SPARK-43397) Log executor decommission duration
Zhongwei Zhu created SPARK-43397:

Summary: Log executor decommission duration
Key: SPARK-43397
URL: https://issues.apache.org/jira/browse/SPARK-43397
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu
[jira] [Created] (SPARK-43396) Add config to control max ratio of decommissioning executors
Zhongwei Zhu created SPARK-43396:

Summary: Add config to control max ratio of decommissioning executors
Key: SPARK-43396
URL: https://issues.apache.org/jira/browse/SPARK-43396
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu

Decommissioning too many executors at the same time with shuffle or rdd migration enabled could severely hurt shuffle fetch performance. The block manager decommissioner tries to migrate shuffle and rdd blocks as soon as possible, which competes for network and disk IO with shuffle fetches on the target executor.
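A cap like the one proposed could be derived from the ratio as follows. This is a sketch; the config name and its exact semantics (floor rounding, counting in-flight decommissions) are assumptions:

```java
// Sketch: given a maximum ratio of executors allowed to be decommissioning
// at once, compute how many more may start decommissioning right now.
class DecommissionBudget {
    public static int budget(int totalExecutors, int decommissioning, double maxRatio) {
        int maxAllowed = (int) Math.floor(totalExecutors * maxRatio);
        return Math.max(0, maxAllowed - decommissioning);
    }
}
```

With 10 executors, a ratio of 0.5, and 1 already decommissioning, 4 more could start; once 5 are in flight the budget drops to 0 and further decommissions wait.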
[jira] [Updated] (SPARK-43391) Idle connection should not be closed when closeIdleConnection is disabled
[ https://issues.apache.org/jira/browse/SPARK-43391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-43391:

Description: Spark will close an idle connection when there are outstanding requests but no traffic for at least {{requestTimeoutMs}}, even when closeIdleConnection is disabled. (was: Spark will close idle connection when there're pending response and closeIdleConnection is disabled.)

> Idle connection should not be closed when closeIdleConnection is disabled
> Key: SPARK-43391
> URL: https://issues.apache.org/jira/browse/SPARK-43391
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Zhongwei Zhu
> Priority: Major
[jira] [Created] (SPARK-43391) Idle connection should not be closed when closeIdleConnection is disabled
Zhongwei Zhu created SPARK-43391:

Summary: Idle connection should not be closed when closeIdleConnection is disabled
Key: SPARK-43391
URL: https://issues.apache.org/jira/browse/SPARK-43391
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu

Spark will close an idle connection when there is a pending response, even when closeIdleConnection is disabled.
[jira] [Updated] (SPARK-43224) Executor should not be removed when decommissioned in standalone
[ https://issues.apache.org/jira/browse/SPARK-43224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-43224:

Description: Currently, executor info in the standalone master is removed as soon as the executor is decommissioned. It should only be removed after the executor shuts down.

> Executor should not be removed when decommissioned in standalone
> Key: SPARK-43224
> URL: https://issues.apache.org/jira/browse/SPARK-43224
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Zhongwei Zhu
> Priority: Minor
[jira] [Created] (SPARK-43238) Support only decommission idle workers in standalone
Zhongwei Zhu created SPARK-43238:

Summary: Support only decommission idle workers in standalone
Key: SPARK-43238
URL: https://issues.apache.org/jira/browse/SPARK-43238
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu

Currently, the standalone master web UI supports killing and decommissioning workers. But when graceful decommission is enabled, waiting for running tasks and for shuffle and rdd migration can take a long time, during which decommissioned workers can't run new executors and decommissioned executors can't run new tasks. This wastes a lot of resources. If only idle workers are decommissioned, they can be shut down and removed to save cost without waiting through the long decommissioning process.
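The idle-worker selection could be as simple as the following sketch. The Worker fields here are hypothetical, not the standalone master's actual WorkerInfo:

```java
import java.util.List;

// Sketch: a worker with nothing running on it can be decommissioned and shut
// down immediately, with no task, shuffle, or rdd migration to wait on.
record Worker(String id, int executors, int drivers) {}

class IdleWorkers {
    public static List<Worker> idle(List<Worker> workers) {
        return workers.stream()
                .filter(w -> w.executors() == 0 && w.drivers() == 0)
                .toList();
    }
}
```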
[jira] [Created] (SPARK-43237) Handle null exception message in event log
Zhongwei Zhu created SPARK-43237: Summary: Handle null exception message in event log Key: SPARK-43237 URL: https://issues.apache.org/jira/browse/SPARK-43237 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0 Reporter: Zhongwei Zhu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43224) Executor should not be removed when decommissioned in standalone
Zhongwei Zhu created SPARK-43224: Summary: Executor should not be removed when decommissioned in standalone Key: SPARK-43224 URL: https://issues.apache.org/jira/browse/SPARK-43224 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0 Reporter: Zhongwei Zhu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43086) Support bin pack task scheduling on executors
Zhongwei Zhu created SPARK-43086: Summary: Support bin pack task scheduling on executors Key: SPARK-43086 URL: https://issues.apache.org/jira/browse/SPARK-43086 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.2 Reporter: Zhongwei Zhu Dynamic allocation only removes or decommissions idle executors. The default task scheduling uses round robin to assign tasks to executors. For example, suppose we have 4 tasks to run and 4 executors (each with 4 CPU cores). The default task scheduling assigns 1 task per executor. With bin packing, one executor could be assigned all 4 tasks, and dynamic allocation could then remove the other 3 executors to reduce resource waste. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
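The round-robin vs. bin-pack example in the ticket can be sketched as follows. This is a toy illustration of the two placement policies under the ticket's 4-task/4-executor assumption, not Spark's scheduler implementation.

```python
# Toy comparison of the two placement policies (not Spark's TaskScheduler):
# round robin spreads one task per executor; bin packing fills the busiest
# executor that still has a free core, leaving the others idle so dynamic
# allocation can reclaim them.

def round_robin(num_tasks, executors, cores_per_executor):
    assignment = {e: 0 for e in executors}
    for i in range(num_tasks):
        assignment[executors[i % len(executors)]] += 1
    return assignment

def bin_pack(num_tasks, executors, cores_per_executor):
    assignment = {e: 0 for e in executors}
    for _ in range(num_tasks):
        # Prefer the busiest executor that still has a free core.
        candidates = [e for e in executors if assignment[e] < cores_per_executor]
        target = max(candidates, key=lambda e: assignment[e])
        assignment[target] += 1
    return assignment

execs = ["e1", "e2", "e3", "e4"]
print(round_robin(4, execs, 4))  # {'e1': 1, 'e2': 1, 'e3': 1, 'e4': 1}
print(bin_pack(4, execs, 4))     # {'e1': 4, 'e2': 0, 'e3': 0, 'e4': 0}
```

With bin packing, executors e2-e4 end up idle and become candidates for removal by dynamic allocation, which is exactly the resource saving the ticket describes.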
[jira] [Created] (SPARK-43052) Handle stacktrace with null file name in event log
Zhongwei Zhu created SPARK-43052: Summary: Handle stacktrace with null file name in event log Key: SPARK-43052 URL: https://issues.apache.org/jira/browse/SPARK-43052 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.2 Reporter: Zhongwei Zhu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43037) Support fetch migrated shuffle with multiple reducers
Zhongwei Zhu created SPARK-43037: Summary: Support fetch migrated shuffle with multiple reducers Key: SPARK-43037 URL: https://issues.apache.org/jira/browse/SPARK-43037 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.3.2 Reporter: Zhongwei Zhu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42925) Check executor alive from driver before fetch blocks
Zhongwei Zhu created SPARK-42925: Summary: Check executor alive from driver before fetch blocks Key: SPARK-42925 URL: https://issues.apache.org/jira/browse/SPARK-42925 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.3.2 Reporter: Zhongwei Zhu This fails fast instead of waiting out spark.shuffle.io.connectionTimeout (default 2 min) when the executor is dead. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
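The fail-fast idea can be sketched like this. All names here (`ExecutorDeadError`, `fetch_blocks`, the alive-set) are hypothetical stand-ins for the driver-side liveness check, not real Spark APIs.

```python
# Sketch of the fail-fast check (hypothetical names, not Spark's API):
# before fetching blocks from a peer, consult the driver's view of live
# executors instead of letting the connection attempt run into the
# 2-minute spark.shuffle.io.connectionTimeout.

class ExecutorDeadError(Exception):
    pass

def fetch_blocks(executor_id, alive_executors, do_fetch):
    if executor_id not in alive_executors:
        # Fail immediately rather than waiting out the connection timeout.
        raise ExecutorDeadError(f"executor {executor_id} is dead")
    return do_fetch(executor_id)

alive = {"exec-1", "exec-2"}
print(fetch_blocks("exec-1", alive, lambda e: ["block-0"]))  # ['block-0']
try:
    fetch_blocks("exec-9", alive, lambda e: ["block-0"])
except ExecutorDeadError as err:
    print(err)  # executor exec-9 is dead
```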
[jira] [Updated] (SPARK-41956) Refetch shuffle blocks when executor is decommissioned
[ https://issues.apache.org/jira/browse/SPARK-41956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-41956: - Summary: Refetch shuffle blocks when executor is decommissioned (was: Shuffle output location refetch in ShuffleBlockFetcherIterator) > Refetch shuffle blocks when executor is decommissioned > -- > > Key: SPARK-41956 > URL: https://issues.apache.org/jira/browse/SPARK-41956 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Zhongwei Zhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41955) Support fetch latest map output from executor
[ https://issues.apache.org/jira/browse/SPARK-41955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-41955: - Summary: Support fetch latest map output from executor (was: Support fetch latest map output from worker) > Support fetch latest map output from executor > -- > > Key: SPARK-41955 > URL: https://issues.apache.org/jira/browse/SPARK-41955 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Zhongwei Zhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42104) Throw ExecutorDeadException in fetchBlocks when executor dead
Zhongwei Zhu created SPARK-42104: Summary: Throw ExecutorDeadException in fetchBlocks when executor dead Key: SPARK-42104 URL: https://issues.apache.org/jira/browse/SPARK-42104 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.1 Reporter: Zhongwei Zhu When fetchBlocks fails due to an IOException, ExecutorDeadException is thrown if the executor is dead. There are other cases where a dead executor causes a TimeoutException or other exceptions. {code:java} Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Waited 3 milliseconds (plus 143334 nanoseconds delay) for SettableFuture@624de392[status=PENDING] at org.sparkproject.guava.base.Throwables.propagate(Throwables.java:243) at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:293) at org.apache.spark.network.crypto.AuthClientBootstrap.doSparkAuth(AuthClientBootstrap.java:113) at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:80) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:300) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218) at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:126) at org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:154) at org.apache.spark.network.shuffle.RetryingBlockTransferor.lambda$initiateRetry$0(RetryingBlockTransferor.java:184) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41956) Shuffle output location refetch in ShuffleBlockFetcherIterator
Zhongwei Zhu created SPARK-41956: Summary: Shuffle output location refetch in ShuffleBlockFetcherIterator Key: SPARK-41956 URL: https://issues.apache.org/jira/browse/SPARK-41956 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.3.1 Reporter: Zhongwei Zhu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41955) Support fetch latest map output from worker
Zhongwei Zhu created SPARK-41955: Summary: Support fetch latest map output from worker Key: SPARK-41955 URL: https://issues.apache.org/jira/browse/SPARK-41955 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.3.1 Reporter: Zhongwei Zhu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41954) Add isDecommissioned in ExecutorDeadException
Zhongwei Zhu created SPARK-41954: Summary: Add isDecommissioned in ExecutorDeadException Key: SPARK-41954 URL: https://issues.apache.org/jira/browse/SPARK-41954 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.3.1 Reporter: Zhongwei Zhu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission
[ https://issues.apache.org/jira/browse/SPARK-41953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656365#comment-17656365 ] Zhongwei Zhu commented on SPARK-41953: -- [~dongjoon] [~mridulm80] [~Ngone51] Any comments for this? > Shuffle output location refetch during shuffle migration in decommission > > > Key: SPARK-41953 > URL: https://issues.apache.org/jira/browse/SPARK-41953 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Zhongwei Zhu >Priority: Major > > When shuffle migration is enabled during Spark decommission, shuffle data is > migrated to live executors, which then update the latest location in the > MapOutputTracker. This has some issues: > # Executors only fetch map output locations at the beginning of the reduce > stage, so any shuffle output location change in the middle of the reduce will > cause FetchFailed as reducers fetch from the old location. Even though stage retries > can solve this, they still waste a lot of resources, as all shuffle reads and > compute done before the FetchFailed partition are wasted. > # During stage retries, fewer running tasks cause more executors to be > decommissioned, and shuffle data locations keep changing. In the worst case, a > stage could need many retries, further breaking the SLA. > So I propose to support refetching map output locations during the reduce phase if > shuffle migration is enabled and the FetchFailed is caused by a decommissioned > dead executor. The detailed steps are as below: > # When `BlockTransferService` fails to fetch blocks from a decommissioned dead > executor, ExecutorDeadException (isDecommission as true) will be thrown. > # Make MapOutputTracker support fetching the latest output without an epoch provided. > # `ShuffleBlockFetcherIterator` will refetch the latest output from the > MapOutputTrackerMaster. For all the shuffle blocks on this decommissioned executor, > there should be a new location on another executor. If not, throw an exception > as currently. If yes, create new local and remote requests to fetch these > migrated shuffle blocks. The flow will be similar to the fallback fetch when a push-merged > fetch fails. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
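The three steps proposed in the issue can be sketched as a refetch-and-retry loop. This is a minimal illustration under assumed names (`ExecutorDeadDecommissioned`, `fetch_with_refetch`, the location map); it is not the real `ShuffleBlockFetcherIterator` logic.

```python
# Minimal sketch of the proposed refetch flow (hypothetical names): when a
# fetch fails because the source executor was decommissioned and died, ask
# the map output tracker for the latest location and retry from the migrated
# copy; if no migrated copy exists, surface FetchFailed as today.

class ExecutorDeadDecommissioned(Exception):
    """Source executor died after being decommissioned."""

class FetchFailed(Exception):
    pass

def fetch_with_refetch(block_id, location, latest_locations, fetch):
    try:
        return fetch(block_id, location)
    except ExecutorDeadDecommissioned:
        new_location = latest_locations.get(block_id)
        if new_location is None:
            # No migrated copy registered: fail as the current behavior does.
            raise FetchFailed(block_id)
        return fetch(block_id, new_location)

def fake_fetch(block_id, location):
    if location == "dead-exec":
        raise ExecutorDeadDecommissioned()
    return f"{block_id}@{location}"

latest = {"shuffle_0_1_2": "live-exec"}
print(fetch_with_refetch("shuffle_0_1_2", "dead-exec", latest, fake_fetch))
# shuffle_0_1_2@live-exec
```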
[jira] [Updated] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission
[ https://issues.apache.org/jira/browse/SPARK-41953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-41953: - Description: When shuffle migration is enabled during Spark decommission, shuffle data is migrated to live executors, which then update the latest location in the MapOutputTracker. This has some issues: # Executors only fetch map output locations at the beginning of the reduce stage, so any shuffle output location change in the middle of the reduce will cause FetchFailed as reducers fetch from the old location. Even though stage retries can solve this, they still waste a lot of resources, as all shuffle reads and compute done before the FetchFailed partition are wasted. # During stage retries, fewer running tasks cause more executors to be decommissioned, and shuffle data locations keep changing. In the worst case, a stage could need many retries, further breaking the SLA. So I propose to support refetching map output locations during the reduce phase if shuffle migration is enabled and the FetchFailed is caused by a decommissioned dead executor. The detailed steps are as below: # When `BlockTransferService` fails to fetch blocks from a decommissioned dead executor, ExecutorDeadException (isDecommission as true) will be thrown. # Make MapOutputTracker support fetching the latest output without an epoch provided. # `ShuffleBlockFetcherIterator` will refetch the latest output from the MapOutputTrackerMaster. For all the shuffle blocks on this decommissioned executor, there should be a new location on another executor. If not, throw an exception as currently. If yes, create new local and remote requests to fetch these migrated shuffle blocks. The flow will be similar to the fallback fetch when a push-merged fetch fails. was: When shuffle migration enabled during spark decommissionm, shuffle data will be migrated into live executors, then update latest location to MapOutputTracker. It has some issues: # Executors only do map output location fetch in the beginning of the reduce stage, so any shuffle output location change in the middle of reduce will cause FetchFailed as reducer fetch from old location. Even stage retries could solve this, this still cause lots of resource waste as all shuffle read and compute happened before FetchFailed partition will be wasted. # During stage retries, less running tasks cause more executors to be decommissioned and shuffle data location keep changing. In the worst case, stage could need lots of retries, further breaking SLA. So I propose to support refetch map output location during reduce phase if shuffle migration is enabled and FetchFailed is caused by a decommissioned dead executor. The detailed steps as > Shuffle output location refetch during shuffle migration in decommission > > > Key: SPARK-41953 > URL: https://issues.apache.org/jira/browse/SPARK-41953 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Zhongwei Zhu >Priority: Major > > When shuffle migration is enabled during Spark decommission, shuffle data is > migrated to live executors, which then update the latest location in the > MapOutputTracker. This has some issues: > # Executors only fetch map output locations at the beginning of the reduce > stage, so any shuffle output location change in the middle of the reduce will > cause FetchFailed as reducers fetch from the old location. Even though stage retries > can solve this, they still waste a lot of resources, as all shuffle reads and > compute done before the FetchFailed partition are wasted. > # During stage retries, fewer running tasks cause more executors to be > decommissioned, and shuffle data locations keep changing. In the worst case, a > stage could need many retries, further breaking the SLA. 
> So I propose to support refetching map output locations during the reduce phase if > shuffle migration is enabled and the FetchFailed is caused by a decommissioned > dead executor. The detailed steps are as below: > # When `BlockTransferService` fails to fetch blocks from a decommissioned dead > executor, ExecutorDeadException (isDecommission as true) will be thrown. > # Make MapOutputTracker support fetching the latest output without an epoch provided. > # `ShuffleBlockFetcherIterator` will refetch the latest output from the > MapOutputTrackerMaster. For all the shuffle blocks on this decommissioned executor, > there should be a new location on another executor. If not, throw an exception > as currently. If yes, create new local and remote requests to fetch these > migrated shuffle blocks. The flow will be similar to the fallback fetch when a push-merged > fetch fails. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail:
[jira] [Created] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission
Zhongwei Zhu created SPARK-41953: Summary: Shuffle output location refetch during shuffle migration in decommission Key: SPARK-41953 URL: https://issues.apache.org/jira/browse/SPARK-41953 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.1 Reporter: Zhongwei Zhu When shuffle migration is enabled during Spark decommission, shuffle data is migrated to live executors, which then update the latest location in the MapOutputTracker. This has some issues: # Executors only fetch map output locations at the beginning of the reduce stage, so any shuffle output location change in the middle of the reduce will cause FetchFailed as reducers fetch from the old location. Even though stage retries can solve this, they still waste a lot of resources, as all shuffle reads and compute done before the FetchFailed partition are wasted. # During stage retries, fewer running tasks cause more executors to be decommissioned, and shuffle data locations keep changing. In the worst case, a stage could need many retries, further breaking the SLA. So I propose to support refetching map output locations during the reduce phase if shuffle migration is enabled and the FetchFailed is caused by a decommissioned dead executor. The detailed steps as -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41766) Handle decommission request sent before executor registration
[ https://issues.apache.org/jira/browse/SPARK-41766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-41766: - Summary: Handle decommission request sent before executor registration (was: Handle decommission request for unregistered executor) > Handle decommission request sent before executor registration > - > > Key: SPARK-41766 > URL: https://issues.apache.org/jira/browse/SPARK-41766 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Zhongwei Zhu >Priority: Major > > When decommissioning an unregistered executor, the request will be ignored. > The request should be sent to the executor after registration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41766) Handle decommission request for unregistered executor
Zhongwei Zhu created SPARK-41766: Summary: Handle decommission request for unregistered executor Key: SPARK-41766 URL: https://issues.apache.org/jira/browse/SPARK-41766 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.1 Reporter: Zhongwei Zhu When decommissioning an unregistered executor, the request will be ignored. The request should be sent to the executor after registration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
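The fix described above amounts to deferring the request instead of dropping it. The sketch below uses hypothetical names (`DecommissionTracker`, its sets); it only illustrates the bookkeeping, not Spark's actual scheduler backend.

```python
# Sketch of the fix (hypothetical structure): remember decommission requests
# for executors that have not registered yet, and replay them on
# registration instead of ignoring them.

class DecommissionTracker:
    def __init__(self):
        self.registered = set()
        self.pending = set()          # decommissions seen before registration
        self.decommissioned = set()

    def decommission(self, executor_id):
        if executor_id in self.registered:
            self.decommissioned.add(executor_id)
        else:
            self.pending.add(executor_id)  # defer instead of dropping

    def register(self, executor_id):
        self.registered.add(executor_id)
        if executor_id in self.pending:
            # Replay the earlier request now that the executor is known.
            self.pending.discard(executor_id)
            self.decommissioned.add(executor_id)

t = DecommissionTracker()
t.decommission("exec-1")             # arrives before registration
t.register("exec-1")
print("exec-1" in t.decommissioned)  # True
```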
[jira] [Created] (SPARK-41341) Wait shuffle fetch to finish when decommission executor
Zhongwei Zhu created SPARK-41341: Summary: Wait shuffle fetch to finish when decommission executor Key: SPARK-41341 URL: https://issues.apache.org/jira/browse/SPARK-41341 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 3.3.1 Reporter: Zhongwei Zhu When decommissioning an executor, Spark should wait for in-flight shuffle fetches to finish; otherwise many FetchFailed errors will occur. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
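The wait can be sketched as draining an in-flight counter before shutdown. All names here (`wait_for_fetches`, the counter callback, the timeout) are assumptions for illustration, not Spark internals.

```python
# Sketch of "wait for in-flight shuffle fetches before shutting down"
# (assumed names): the decommission path polls the count of active fetch
# streams and proceeds only once it drains to zero or a deadline passes.

import time

def wait_for_fetches(active_fetches, timeout_s, poll_s=0.01):
    """active_fetches: zero-arg callable returning the in-flight count."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if active_fetches() == 0:
            return True
        time.sleep(poll_s)
    return active_fetches() == 0

counter = [2]
def drain():
    # Simulate fetches completing over time.
    counter[0] = max(0, counter[0] - 1)
    return counter[0]

print(wait_for_fetches(drain, timeout_s=1.0))  # True
```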
[jira] [Created] (SPARK-41153) Log migrated shuffle data size and migration time
Zhongwei Zhu created SPARK-41153: Summary: Log migrated shuffle data size and migration time Key: SPARK-41153 URL: https://issues.apache.org/jira/browse/SPARK-41153 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.1 Reporter: Zhongwei Zhu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40979) Keep removed executor info in decommission state
[ https://issues.apache.org/jira/browse/SPARK-40979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-40979: - Description: Executors removed due to decommission should be kept in a separate set. To avoid OOM, the set size will be limited to 1K or 10K. FetchFailed caused by a decommissioned executor can be divided into 2 categories: # When the FetchFailed reaches the DAGScheduler, the executor is still alive, or it is lost but the loss info hasn't reached TaskSchedulerImpl. This is already handled in SPARK-40979 # The FetchFailed is caused by the loss of the decommissioned executor, so the decommission info has already been removed in TaskSchedulerImpl. Keeping such info for a short period is good enough. Even if we limit the size of removed executors to 10K, that is at most about 10MB of memory usage. In a real case, it's rare to have a cluster size of over 10K, and the chance that all these executors are decommissioned and lost at the same time is small. > Keep removed executor info in decommission state > > > Key: SPARK-40979 > URL: https://issues.apache.org/jira/browse/SPARK-40979 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Zhongwei Zhu >Priority: Major > > Executors removed due to decommission should be kept in a separate set. To > avoid OOM, the set size will be limited to 1K or 10K. > FetchFailed caused by a decommissioned executor can be divided into 2 categories: > # When the FetchFailed reaches the DAGScheduler, the executor is still alive, or it is > lost but the loss info hasn't reached TaskSchedulerImpl. This is already > handled in SPARK-40979 > # The FetchFailed is caused by the loss of the decommissioned executor, so the decommission info has already > been removed in TaskSchedulerImpl. Keeping such info for a short period is > good enough. Even if we limit the size of removed executors to 10K, that is > at most about 10MB of memory usage. In a real case, it's rare to have a cluster size > of over 10K, and the chance that all these executors are decommissioned and lost at the > same time is small. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
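The bounded-set idea and the 10K-entries/~10MB estimate above can be sketched as follows. The ~1 KB-per-entry figure is the ticket's own back-of-the-envelope assumption, not a measurement, and `BoundedSet` is an illustrative structure, not Spark code.

```python
# Sketch of a size-bounded FIFO set (hypothetical name BoundedSet): keeps at
# most max_size removed-executor entries, evicting the oldest first, so the
# decommission bookkeeping cannot grow without bound and cause OOM.

from collections import OrderedDict

class BoundedSet:
    def __init__(self, max_size):
        self.max_size = max_size
        self._items = OrderedDict()

    def add(self, item):
        self._items[item] = None
        self._items.move_to_end(item)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the oldest entry

    def __contains__(self, item):
        return item in self._items

s = BoundedSet(max_size=3)
for e in ["e1", "e2", "e3", "e4"]:
    s.add(e)
print("e1" in s, "e4" in s)  # False True

# Ticket's estimate: 10K entries at roughly 1 KB each stays around 10 MB.
approx_bytes = 10_000 * 1_024
print(approx_bytes)  # 10240000
```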
[jira] [Created] (SPARK-40781) Explain exit code 137 as killed due to OOM
Zhongwei Zhu created SPARK-40781: Summary: Explain exit code 137 as killed due to OOM Key: SPARK-40781 URL: https://issues.apache.org/jira/browse/SPARK-40781 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Zhongwei Zhu Explain exit code 137 as killed due to OOM, to save users the effort of searching for the meaning of the exit code. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
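The mapping behind this is the Unix convention that an exit code above 128 means the process was killed by signal (code - 128); 137 = 128 + 9 is SIGKILL, which on Linux typically indicates the kernel OOM killer or a container memory limit. A sketch of the explanation logic (the message text is an assumption, not Spark's actual wording):

```python
# Sketch of exit-code explanation (assumed message wording): decode the
# 128+N signal convention and call out 137 (SIGKILL) as likely OOM.

def explain_exit_code(code):
    if code > 128:
        signal_num = code - 128
        if signal_num == 9:
            return "killed by SIGKILL (exit code 137), likely OOM-killed"
        return f"killed by signal {signal_num}"
    return f"exited with code {code}"

print(explain_exit_code(137))  # killed by SIGKILL (exit code 137), likely OOM-killed
print(explain_exit_code(0))    # exited with code 0
```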
[jira] [Created] (SPARK-40778) Make HeartbeatReceiver as an IsolatedRpcEndpoint
Zhongwei Zhu created SPARK-40778: Summary: Make HeartbeatReceiver as an IsolatedRpcEndpoint Key: SPARK-40778 URL: https://issues.apache.org/jira/browse/SPARK-40778 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Zhongwei Zhu All RpcEndpoints in the driver, including HeartbeatReceiver, share one thread pool. When lots of RPC messages are queued, the queueing time of a heartbeat can easily exceed the heartbeat timeout. Making HeartbeatReceiver extend IsolatedRpcEndpoint gives it a dedicated single thread to process heartbeats. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
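The isolation idea can be sketched with two thread pools. This is a generic illustration of the pattern, not Spark's RpcEnv; the pool names and handler are assumptions.

```python
# Sketch of endpoint isolation (not Spark's RpcEnv): heartbeats get a
# dedicated single-thread executor, so they are never queued behind other
# RPC messages on the shared pool and cannot miss the heartbeat timeout
# just because the shared pool is backlogged.

from concurrent.futures import ThreadPoolExecutor

shared_pool = ThreadPoolExecutor(max_workers=4)     # all other endpoints
heartbeat_pool = ThreadPoolExecutor(max_workers=1)  # HeartbeatReceiver only

def handle_heartbeat(executor_id):
    return f"heartbeat from {executor_id} processed"

# Heartbeats go to the dedicated thread regardless of shared-pool backlog.
future = heartbeat_pool.submit(handle_heartbeat, "exec-1")
print(future.result())  # heartbeat from exec-1 processed
```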
[jira] [Updated] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner
[ https://issues.apache.org/jira/browse/SPARK-40636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-40636: - Description: BlockManagerDecommissioner should log the correct number of remaining shuffles. {code:java} 4 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:15.035 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:45.069 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code} was: BlockManagerDecommissioner should log correct remained shuffles {code:java} 4 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:15.035 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:45.069 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code} > Fix wrong remained shuffles log in BlockManagerDecommissioner > - > > Key: SPARK-40636 > URL: https://issues.apache.org/jira/browse/SPARK-40636 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Priority: Minor > > BlockManagerDecommissioner should log the correct number of remaining shuffles. > {code:java} > 4 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:15.035 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:45.069 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
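One plausible reading of the bug in the quoted log is that "remained" keeps repeating the total instead of shrinking as shuffles migrate. A sketch of the corrected bookkeeping under that assumption (counter names are hypothetical, not the real BlockManagerDecommissioner fields):

```python
# Sketch of corrected log bookkeeping (assumed semantics): "remained" should
# be total minus already-migrated, so it decreases as migration progresses
# instead of repeating the total every time.

def migration_log(total, migrated, newly_added):
    remaining = total - migrated
    return (f"{newly_added} of {total} local shuffles are added. "
            f"In total, {remaining} shuffles are remained.")

print(migration_log(total=24, migrated=4, newly_added=4))
# 4 of 24 local shuffles are added. In total, 20 shuffles are remained.
```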
[jira] [Updated] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner
[ https://issues.apache.org/jira/browse/SPARK-40636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-40636: - Description: BlockManagerDecommissioner should log correct remained shuffles ``` 4 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:15.035 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:45.069 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained. ``` was: ``` 4 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:15.035 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:45.069 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained. ``` > Fix wrong remained shuffles log in BlockManagerDecommissioner > - > > Key: SPARK-40636 > URL: https://issues.apache.org/jira/browse/SPARK-40636 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Priority: Minor > > BlockManagerDecommissioner should log correct remained shuffles > > ``` > 4 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:15.035 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:45.069 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained. > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner
[ https://issues.apache.org/jira/browse/SPARK-40636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-40636: - Description: BlockManagerDecommissioner should log correct remained shuffles {code:java} 4 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:15.035 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:45.069 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code} was: BlockManagerDecommissioner should log correct remained shuffles ``` 4 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:15.035 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:45.069 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained. ``` > Fix wrong remained shuffles log in BlockManagerDecommissioner > - > > Key: SPARK-40636 > URL: https://issues.apache.org/jira/browse/SPARK-40636 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Priority: Minor > > BlockManagerDecommissioner should log correct remained shuffles > > {code:java} > 4 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:15.035 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:45.069 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner
Zhongwei Zhu created SPARK-40636: Summary: Fix wrong remained shuffles log in BlockManagerDecommissioner Key: SPARK-40636 URL: https://issues.apache.org/jira/browse/SPARK-40636 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 3.3.0 Reporter: Zhongwei Zhu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner
[ https://issues.apache.org/jira/browse/SPARK-40636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-40636: - Description: ``` 4 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:15.035 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:45.069 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained. ``` > Fix wrong remained shuffles log in BlockManagerDecommissioner > - > > Key: SPARK-40636 > URL: https://issues.apache.org/jira/browse/SPARK-40636 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Priority: Minor > > > > ``` > 4 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:15.035 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:45.069 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained. > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40481) Ignore stage fetch failure caused by decommissioned executor
Zhongwei Zhu created SPARK-40481: Summary: Ignore stage fetch failure caused by decommissioned executor Key: SPARK-40481 URL: https://issues.apache.org/jira/browse/SPARK-40481 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Zhongwei Zhu When executor decommission is enabled, there would be many stage failures caused by FetchFailed from decommissioned executors, which can fail the whole job. It would be better not to count such failures toward `spark.stage.maxConsecutiveAttempts` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
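The proposed accounting can be sketched as follows. The function and its inputs are hypothetical; only the `spark.stage.maxConsecutiveAttempts` name (default 4) comes from Spark.

```python
# Sketch of the proposed accounting (hypothetical function, not DAGScheduler
# code): only fetch failures from executors that were NOT decommissioned
# count toward spark.stage.maxConsecutiveAttempts, so decommission-driven
# retries cannot abort the whole job.

MAX_CONSECUTIVE_ATTEMPTS = 4  # default of spark.stage.maxConsecutiveAttempts

def should_abort_stage(failures):
    """failures: list of booleans, True if that FetchFailed came from a
    decommissioned executor."""
    counted = sum(1 for from_decom in failures if not from_decom)
    return counted >= MAX_CONSECUTIVE_ATTEMPTS

print(should_abort_stage([True, True, True, True, True]))  # False
print(should_abort_stage([False, False, False, False]))    # True
```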
[jira] [Created] (SPARK-40381) Support standalone worker recommission
Zhongwei Zhu created SPARK-40381: Summary: Support standalone worker recommission Key: SPARK-40381 URL: https://issues.apache.org/jira/browse/SPARK-40381 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 3.3.0 Reporter: Zhongwei Zhu Currently, Spark standalone mode only supports killing workers. We may also want to recommission some of them.
[jira] [Created] (SPARK-40269) Randomize the orders of peer in BlockManagerDecommissioner
Zhongwei Zhu created SPARK-40269: Summary: Randomize the orders of peer in BlockManagerDecommissioner Key: SPARK-40269 URL: https://issues.apache.org/jira/browse/SPARK-40269 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 3.3.0 Reporter: Zhongwei Zhu Randomize the order of peers in BlockManagerDecommissioner to avoid migrating data to the same set of nodes.
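The change itself is small; a minimal Python sketch (hypothetical names; Spark's `BlockManagerDecommissioner` is Scala and would use `scala.util.Random.shuffle`) of shuffling the peer list so that a decommissioning block manager does not always migrate its blocks to the same first few nodes:

```python
import random

def randomized_peers(peers, seed=None):
    """Return a shuffled copy of the peer list; the caller's list is left
    untouched. Each decommissioning round sees the peers in a different
    order, spreading migrated blocks across the cluster."""
    rng = random.Random(seed)
    shuffled = list(peers)  # copy before shuffling
    rng.shuffle(shuffled)
    return shuffled

peers = ["bm-1", "bm-2", "bm-3", "bm-4"]
print(randomized_peers(peers, seed=42))
```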
[jira] [Created] (SPARK-40267) Add description for ExecutorAllocationManager metrics
Zhongwei Zhu created SPARK-40267: Summary: Add description for ExecutorAllocationManager metrics Key: SPARK-40267 URL: https://issues.apache.org/jira/browse/SPARK-40267 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.3.0 Reporter: Zhongwei Zhu For some ExecutorAllocationManager metrics, it is hard to tell what they stand for from the metric name alone.
[jira] [Updated] (SPARK-40168) Handle FileNotFoundException when shuffle file deleted in decommissioner
[ https://issues.apache.org/jira/browse/SPARK-40168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-40168: - Description: When shuffle files are not found, the decommissioner handles IOException, but the real exception is as below:
{code:java}
22/08/10 18:05:34 ERROR BlockManagerDecommissioner: Error occurred during migrating migrate_shuffle_1_356
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:122)
    at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$4(BlockManagerDecommissioner.scala:120)
    at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$4$adapted(BlockManagerDecommissioner.scala:111)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:111)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Failed to send RPC RPC 5697756267528635203 to /10.240.2.65:43481: java.io.FileNotFoundException: /tmp/blockmgr-98a2a29a-5231-4fed-a82e-6bc0531ad407/15/shuffle_1_356_0.index (No such file or directory)
    at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:392)
    at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:369)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
    at io.netty.util.internal.PromiseNotificationUtil.tryFailure(PromiseNotificationUtil.java:64)
    at io.netty.channel.ChannelOutboundBuffer.safeFail(ChannelOutboundBuffer.java:723)
    at io.netty.channel.ChannelOutboundBuffer.remove0(ChannelOutboundBuffer.java:308)
    at io.netty.channel.ChannelOutboundBuffer.failFlushed(ChannelOutboundBuffer.java:660)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:735)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.handleWriteError(AbstractChannel.java:950)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:933)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:354)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:895)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1372)
    at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
    at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:742)
    at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:728)
    at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:127)
    at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:765)
    at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    ... 1 more
Caused by: java.io.FileNotFoundException:
[jira] [Created] (SPARK-40168) Handle FileNotFoundException when shuffle file deleted in decommissioner
Zhongwei Zhu created SPARK-40168: Summary: Handle FileNotFoundException when shuffle file deleted in decommissioner Key: SPARK-40168 URL: https://issues.apache.org/jira/browse/SPARK-40168 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.0 Reporter: Zhongwei Zhu When shuffle files are not found, the decommissioner handles IOException, but the real exception is as below:
```
22/08/10 18:05:34 ERROR BlockManagerDecommissioner: Error occurred during migrating migrate_shuffle_1_356
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:122)
    at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$4(BlockManagerDecommissioner.scala:120)
    at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$4$adapted(BlockManagerDecommissioner.scala:111)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:111)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Failed to send RPC RPC 5697756267528635203 to /10.240.2.65:43481: java.io.FileNotFoundException: /tmp/blockmgr-98a2a29a-5231-4fed-a82e-6bc0531ad407/15/shuffle_1_356_0.index (No such file or directory)
    at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:392)
    at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:369)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
    at io.netty.util.internal.PromiseNotificationUtil.tryFailure(PromiseNotificationUtil.java:64)
    at io.netty.channel.ChannelOutboundBuffer.safeFail(ChannelOutboundBuffer.java:723)
    at io.netty.channel.ChannelOutboundBuffer.remove0(ChannelOutboundBuffer.java:308)
    at io.netty.channel.ChannelOutboundBuffer.failFlushed(ChannelOutboundBuffer.java:660)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:735)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.handleWriteError(AbstractChannel.java:950)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:933)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:354)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:895)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1372)
    at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
    at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:742)
    at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:728)
    at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:127)
    at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:765)
    at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at
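The fix the ticket asks for amounts to inspecting the exception's cause chain rather than only the top-level IOException. A minimal, hypothetical Python sketch of that unwrapping (Spark's actual code is Scala and would walk `Throwable.getCause`):

```python
def find_cause(exc, target):
    """Walk the cause chain -- SparkException -> IOException ->
    FileNotFoundException in the trace above -- and return the first
    exception of the wanted type, or None if it is absent."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        if isinstance(exc, target):
            return exc
        seen.add(id(exc))
        exc = exc.__cause__
    return None

# Rebuild the shape of the nested failure shown in the stack trace.
inner = FileNotFoundError("shuffle_1_356_0.index (No such file or directory)")
middle = IOError("Failed to send RPC")
middle.__cause__ = inner
outer = Exception("Exception thrown in awaitResult:")
outer.__cause__ = middle

print(find_cause(outer, FileNotFoundError) is inner)  # True
```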
[jira] [Created] (SPARK-40060) Add numberDecommissioningExecutors metric
Zhongwei Zhu created SPARK-40060: Summary: Add numberDecommissioningExecutors metric Key: SPARK-40060 URL: https://issues.apache.org/jira/browse/SPARK-40060 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Zhongwei Zhu The number of decommissioning executors should be exposed as a metric.
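The proposed metric is essentially a gauge over executor state. A hypothetical Python sketch of the idea (Spark itself registers such gauges with a Dropwizard metrics registry in Scala):

```python
class DecommissionMetrics:
    """Track executor states and expose the count of executors currently
    in DECOMMISSIONING state -- the value a metrics registry would poll."""

    def __init__(self):
        self._states = {}  # executor id -> last known state

    def update(self, executor_id: str, state: str) -> None:
        self._states[executor_id] = state

    def number_decommissioning_executors(self) -> int:
        return sum(1 for s in self._states.values() if s == "DECOMMISSIONING")

m = DecommissionMetrics()
m.update("exec-1", "RUNNING")
m.update("exec-2", "DECOMMISSIONING")
print(m.number_decommissioning_executors())  # 1
```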
[jira] [Created] (SPARK-36893) upgrade mesos into 1.4.3
Zhongwei Zhu created SPARK-36893: Summary: upgrade mesos into 1.4.3 Key: SPARK-36893 URL: https://issues.apache.org/jira/browse/SPARK-36893 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.1.2 Reporter: Zhongwei Zhu Upgrade mesos to 1.4.3 to fix CVE-2018-11793
[jira] [Updated] (SPARK-36864) guava version mismatch with hadoop-aws
[ https://issues.apache.org/jira/browse/SPARK-36864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-36864: - Summary: guava version mismatch with hadoop-aws (was: guava version mismatch between hadoop-aws and spark) > guava version mismatch with hadoop-aws > -- > > Key: SPARK-36864 > URL: https://issues.apache.org/jira/browse/SPARK-36864 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.2 >Reporter: Zhongwei Zhu >Priority: Minor > > When use hadoop-aws 3.2 with spark 3.0, got below error. This is caused by > guava version mismatch as hadoop used guava 27.0-jre while spark used 14.0.1. > Exception in thread "main" java.lang.NoSuchMethodError: > com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V > at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:742) > at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:712) > at > org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:559) > at > org.apache.hadoop.fs.s3a.DefaultS3ClientFactory.createS3Client(DefaultS3ClientFactory.java:52) > at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:264) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479) > at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1853) > at > org.apache.spark.deploy.history.EventLogFileWriter.(EventLogFileWriters.scala:60) > at > org.apache.spark.deploy.history.SingleEventLogFileWriter.(EventLogFileWriters.scala:213) > at > org.apache.spark.deploy.history.EventLogFileWriter$.apply(EventLogFileWriters.scala:181) > at > 
org.apache.spark.scheduler.EventLoggingListener.(EventLoggingListener.scala:66) > at org.apache.spark.SparkContext.(SparkContext.scala:584) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2588) > at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:937) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:931) > at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:30) > at org.apache.spark.examples.SparkPi.main(SparkPi.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:944) > at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1023) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1032) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36864) guava version mismatch between hadoop-aws and spark
Zhongwei Zhu created SPARK-36864: Summary: guava version mismatch between hadoop-aws and spark Key: SPARK-36864 URL: https://issues.apache.org/jira/browse/SPARK-36864 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.1.2 Reporter: Zhongwei Zhu When using hadoop-aws 3.2 with Spark 3.0, the error below occurs. This is caused by a guava version mismatch: hadoop uses guava 27.0-jre while Spark uses 14.0.1.
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
    at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:742)
    at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:712)
    at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:559)
    at org.apache.hadoop.fs.s3a.DefaultS3ClientFactory.createS3Client(DefaultS3ClientFactory.java:52)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:264)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1853)
    at org.apache.spark.deploy.history.EventLogFileWriter.<init>(EventLogFileWriters.scala:60)
    at org.apache.spark.deploy.history.SingleEventLogFileWriter.<init>(EventLogFileWriters.scala:213)
    at org.apache.spark.deploy.history.EventLogFileWriter$.apply(EventLogFileWriters.scala:181)
    at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:66)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:584)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2588)
    at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:937)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:931)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:30)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:944)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1023)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1032)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[jira] [Created] (SPARK-36793) [K8S] Support write container stdout/stderr to file
Zhongwei Zhu created SPARK-36793: Summary: [K8S] Support write container stdout/stderr to file Key: SPARK-36793 URL: https://issues.apache.org/jira/browse/SPARK-36793 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.1.2 Reporter: Zhongwei Zhu Currently, executor and driver pods only redirect stdout/stderr. If users want a sidecar logging agent to send stdout/stderr to external log storage, the only way is to change entrypoint.sh, which might break compatibility with the community version. We should support this feature, enabled by a Spark config. The related Spark configs are:
|Key|Default|Desc|
|spark.kubernetes.logToFile.enabled|false|Whether to write executor/driver stdout/stderr as a log file|
|spark.kubernetes.logToFile.path|/var/log/spark|The path to write executor/driver stdout/stderr as a log file|
[jira] [Updated] (SPARK-32288) [UI] Add failure summary table in stage page
[ https://issues.apache.org/jira/browse/SPARK-32288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-32288: - Summary: [UI] Add failure summary table in stage page (was: [UI] Add exception summary table in stage page) > [UI] Add failure summary table in stage page > > > Key: SPARK-32288 > URL: https://issues.apache.org/jira/browse/SPARK-32288 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Zhongwei Zhu >Priority: Major > > When there are many task failures during one stage, it is hard to find failure patterns, such as by aggregating task failures by exception type and message. With such information, we can easily tell which type of failure is the root cause of the stage failure.
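The aggregation the proposed UI table would show can be sketched in a few lines (a hypothetical Python sketch; the real feature lives in Spark's Web UI code):

```python
from collections import Counter

def failure_summary(task_failures):
    """Group task failures by (exception type, message) so the dominant
    failure pattern in a stage stands out at a glance."""
    return Counter((exc_type, message) for exc_type, message in task_failures)

failures = [
    ("FetchFailedException", "Failed to connect to host-1"),
    ("FetchFailedException", "Failed to connect to host-1"),
    ("OutOfMemoryError", "Java heap space"),
]
for (exc_type, message), count in failure_summary(failures).most_common():
    print(f"{count:>3}  {exc_type}: {message}")
```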
[jira] [Updated] (SPARK-34777) [UI] StagePage input size/records not show when records greater than zero
[ https://issues.apache.org/jira/browse/SPARK-34777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-34777: - Summary: [UI] StagePage input size/records not show when records greater than zero (was: [UI] StagePage input size records not show when records greater than zero) > [UI] StagePage input size/records not show when records greater than zero > - > > Key: SPARK-34777 > URL: https://issues.apache.org/jira/browse/SPARK-34777 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.1 >Reporter: Zhongwei Zhu >Priority: Minor > Attachments: No input size records.png > > > !No input size records.png|width=547,height=212! > The `Input Size / Records` should show in the summary metrics table and task > columns when input records are greater than zero even though bytes are zero. One example is > a Spark streaming job reading from Kafka
[jira] [Updated] (SPARK-34777) [UI] StagePage input size records not show when records greater than zero
[ https://issues.apache.org/jira/browse/SPARK-34777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-34777: - Description: !No input size records.png|width=547,height=212! The `Input Size / Records` should show in summary metrics table and task columns, as input records greater than zero and bytes is zero. One example is spark streaming job read from kafka was: !No input size records.png|width=547,height=212! The `Input Size / Records` should show in summary metrics table and task columns, as input records greater than zero > [UI] StagePage input size records not show when records greater than zero > - > > Key: SPARK-34777 > URL: https://issues.apache.org/jira/browse/SPARK-34777 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.1 >Reporter: Zhongwei Zhu >Priority: Minor > Attachments: No input size records.png > > > !No input size records.png|width=547,height=212! > The `Input Size / Records` should show in summary metrics table and task > columns, as input records greater than zero and bytes is zero. One example is > spark streaming job read from kafka -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34777) [UI] StagePage input size records not show when records greater than zero
[ https://issues.apache.org/jira/browse/SPARK-34777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-34777: - Description: !No input size records.png|width=547,height=212! The `Input Size / Records` should show in summary metrics table and task columns, as input records greater than zero was: !image-2021-03-17-09-46-58-653.png|width=514,height=171! The `Input Size / Records` should show in summary metrics table and task columns, as input records greater than zero > [UI] StagePage input size records not show when records greater than zero > - > > Key: SPARK-34777 > URL: https://issues.apache.org/jira/browse/SPARK-34777 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.1 >Reporter: Zhongwei Zhu >Priority: Minor > Attachments: No input size records.png > > > !No input size records.png|width=547,height=212! > The `Input Size / Records` should show in summary metrics table and task > columns, as input records greater than zero -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34777) [UI] StagePage input size records not show when records greater than zero
[ https://issues.apache.org/jira/browse/SPARK-34777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-34777: - Attachment: No input size records.png > [UI] StagePage input size records not show when records greater than zero > - > > Key: SPARK-34777 > URL: https://issues.apache.org/jira/browse/SPARK-34777 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.1 >Reporter: Zhongwei Zhu >Priority: Minor > Attachments: No input size records.png > > > !image-2021-03-17-09-46-58-653.png|width=514,height=171! > The `Input Size / Records` should show in summary metrics table and task > columns, as input records greater than zero -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34777) [UI] StagePage input size records not show when records greater than zero
Zhongwei Zhu created SPARK-34777: Summary: [UI] StagePage input size records not show when records greater than zero Key: SPARK-34777 URL: https://issues.apache.org/jira/browse/SPARK-34777 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.1.1 Reporter: Zhongwei Zhu !image-2021-03-17-09-46-58-653.png|width=514,height=171! The `Input Size / Records` should show in the summary metrics table and task columns when input records are greater than zero
[jira] [Created] (SPARK-34232) [CORE] redact credentials not working when log slow event enabled
Zhongwei Zhu created SPARK-34232: Summary: [CORE] redact credentials not working when log slow event enabled Key: SPARK-34232 URL: https://issues.apache.org/jira/browse/SPARK-34232 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.1 Reporter: Zhongwei Zhu When the processing time of a SparkListenerEnvironmentUpdate event exceeds logSlowEventThreshold, the credentials are logged without being redacted.
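The fix the issue calls for can be sketched as follows (illustrative Python; Spark's real redaction is driven by the `spark.redaction.regex` config, and the pattern and helper names here are assumptions):

```python
import re

# Assumed stand-in for Spark's configurable redaction regex.
REDACTION_PATTERN = re.compile(r"(?i)secret|password|token|credential")

def redact(kv_pairs):
    """Replace values whose keys look sensitive before they reach the log."""
    return [(k, "*********(redacted)" if REDACTION_PATTERN.search(k) else v)
            for k, v in kv_pairs]

def log_slow_event(event_name, kv_pairs, process_ms, threshold_ms):
    # The reported bug: the slow-event path logged kv_pairs directly.
    # Redacting first keeps credentials out of the log line.
    if process_ms > threshold_ms:
        return f"Slow event {event_name}: {redact(kv_pairs)}"
    return None

line = log_slow_event(
    "SparkListenerEnvironmentUpdate",
    [("spark.ssl.keyPassword", "hunter2"), ("spark.app.name", "demo")],
    250, 100)
```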
[jira] [Comment Edited] (SPARK-19450) Replace askWithRetry with askSync.
[ https://issues.apache.org/jira/browse/SPARK-19450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257091#comment-17257091 ] Zhongwei Zhu edited comment on SPARK-19450 at 12/31/20, 7:58 PM: - For old askWithRetry method, it can use provided config `spark.rpc.numRetries` and `spark.rpc.retry.wait`. The default value for `spark.rpc.numRetries` is 3. So I suppose it will retry 3 times if rpc failed, but now askSync is used without using above 2 configs. Could we support such retry? If retry is disabled intentionally, maybe doc need to be updated. [~jinxing6...@126.com] [~srowen] was (Author: warrenzhu25): For old askWithRetry method, it can use provided config `spark.rpc.numRetries` and `spark.rpc.retry.wait`. The default value for `spark.rpc.numRetries` is 3. So I suppose it will retry 3 times if rpc failed, but now askSync is used without using above 2 configs. Does that mean no retry anymore? [~jinxing6...@126.com] [~srowen] > Replace askWithRetry with askSync. > -- > > Key: SPARK-19450 > URL: https://issues.apache.org/jira/browse/SPARK-19450 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Jin Xing >Assignee: Jin Xing >Priority: Minor > Fix For: 2.2.0 > > > *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and > https://github.com/apache/spark/pull/16690#issuecomment-276850068) and > *askWithRetry* is marked as deprecated. > As mentioned > SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): > ??askWithRetry is basically an unneeded API, and a leftover from the akka > days that doesn't make sense anymore. It's prone to cause deadlocks (exactly > because it's blocking), it imposes restrictions on the caller (e.g. > idempotency) and other things that people generally don't pay that much > attention to when using it.?? > Since *askWithRetry* is just used inside spark and not in user logic. It > might make sense to replace all of them with *askSync*. 
[jira] [Commented] (SPARK-19450) Replace askWithRetry with askSync.
[ https://issues.apache.org/jira/browse/SPARK-19450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257091#comment-17257091 ] Zhongwei Zhu commented on SPARK-19450: -- For old askWithRetry method, it can use provided config `spark.rpc.numRetries` and `spark.rpc.retry.wait`. The default value for `spark.rpc.numRetries` is 3. So I suppose it will retry 3 times if rpc failed, but now askSync is used without using above 2 configs. Does that mean no retry anymore? [~jinxing6...@126.com] [~srowen] > Replace askWithRetry with askSync. > -- > > Key: SPARK-19450 > URL: https://issues.apache.org/jira/browse/SPARK-19450 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Jin Xing >Assignee: Jin Xing >Priority: Minor > Fix For: 2.2.0 > > > *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and > https://github.com/apache/spark/pull/16690#issuecomment-276850068) and > *askWithRetry* is marked as deprecated. > As mentioned > SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): > ??askWithRetry is basically an unneeded API, and a leftover from the akka > days that doesn't make sense anymore. It's prone to cause deadlocks (exactly > because it's blocking), it imposes restrictions on the caller (e.g. > idempotency) and other things that people generally don't pay that much > attention to when using it.?? > Since *askWithRetry* is just used inside spark and not in user logic. It > might make sense to replace all of them with *askSync*. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
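The retry behavior the commenter describes could be sketched as follows (illustrative Python; askSync is a Scala RPC call and performs no retries itself, and the default values mirror `spark.rpc.numRetries` = 3 as mentioned above):

```python
import time

def ask_with_retry(ask, num_retries=3, retry_wait_s=3.0, sleep=time.sleep):
    """Sketch of the retry loop the old askWithRetry provided.

    `spark.rpc.numRetries` and `spark.rpc.retry.wait` are the configs the
    comment refers to; the exception type is a stand-in for an RPC failure.
    """
    last_error = None
    for attempt in range(num_retries):
        try:
            return ask()
        except IOError as e:
            last_error = e
            if attempt < num_retries - 1:
                sleep(retry_wait_s)
    raise last_error
```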
[jira] [Created] (SPARK-33446) [CORE] Add config spark.executor.coresOverhead
Zhongwei Zhu created SPARK-33446: Summary: [CORE] Add config spark.executor.coresOverhead Key: SPARK-33446 URL: https://issues.apache.org/jira/browse/SPARK-33446 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.1 Reporter: Zhongwei Zhu Add config spark.executor.coresOverhead to request extra cores per executor. This config is helpful in cases like the following: suppose the physical machines or VMs have a memory/CPU ratio of 3 GB per core, but the Spark job needs 6 GB per task. Requesting resources at that ratio wastes the unused cores. If we instead request extra cores that do not affect the cores per executor used for task allocation, those extra cores are not wasted.
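The arithmetic behind the proposal might look like this (illustrative sketch only; the function and the one-task-per-core assumption are mine, not from the issue):

```python
def executor_request(task_mem_gb, tasks_per_executor, machine_gb_per_core,
                     cores_overhead=0):
    """Illustrative resource arithmetic for the proposed config.

    Assumes one task slot per scheduling core. On a 3 GB/core machine,
    asking for 6 GB per task strands cores unless extra (non-scheduling)
    cores are requested alongside the memory.
    """
    memory_gb = task_mem_gb * tasks_per_executor
    scheduling_cores = tasks_per_executor            # cores used for task slots
    total_cores = scheduling_cores + cores_overhead  # cores actually requested
    # Cores the machine could provide for this memory but that go unrequested.
    stranded_cores = memory_gb / machine_gb_per_core - total_cores
    return total_cores, stranded_cores
```

With 6 GB per task on a 3 GB/core machine, one core per task leaves one core stranded per task; a coresOverhead of 1 absorbs it.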
[jira] [Updated] (SPARK-32314) [SHS] Remove old format of stacktrace in event log
[ https://issues.apache.org/jira/browse/SPARK-32314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-32314: - Description: Currently, EventLoggingListeneer write both "Stack Trace" and "Full Stack Trace" in TaskEndResaon of ExceptionFailure to event log. Both fields contains same info, and the former one is kept for backward compatibility of spark history before version 1.2.0. We can remove 1st field. This will help reduce eventlog size significantly when lots of task are failed due to ExceptionFailure. The sample json of current format as below: {noformat} { "Event": "SparkListenerTaskEnd", "Stage ID": 1237, "Stage Attempt ID": 0, "Task Type": "ShuffleMapTask", "Task End Reason": { "Reason": "ExceptionFailure", "Class Name": "java.io.IOException", "Description": "org.apache.spark.SparkException: Failed to get broadcast_1405_piece10 of broadcast_1405", "Stack Trace": [ { "Declaring Class": "org.apache.spark.util.Utils$", "Method Name": "tryOrIOException", "File Name": "Utils.scala", "Line Number": 1350 }, { "Declaring Class": "org.apache.spark.broadcast.TorrentBroadcast", "Method Name": "readBroadcastBlock", "File Name": "TorrentBroadcast.scala", "Line Number": 218 }, { "Declaring Class": "org.apache.spark.broadcast.TorrentBroadcast", "Method Name": "getValue", "File Name": "TorrentBroadcast.scala", "Line Number": 103 }, { "Declaring Class": "org.apache.spark.broadcast.Broadcast", "Method Name": "value", "File Name": "Broadcast.scala", "Line Number": 70 }, { "Declaring Class": "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9", "Method Name": "wholestagecodegen_init_0_0$", "File Name": "generated.java", "Line Number": 466 }, { "Declaring Class": "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9", "Method Name": "init", "File Name": "generated.java", "Line Number": 33 }, { "Declaring Class": 
"org.apache.spark.sql.execution.WholeStageCodegenExec", "Method Name": "$anonfun$doExecute$4", "File Name": "WholeStageCodegenExec.scala", "Line Number": 750 }, { "Declaring Class": "org.apache.spark.sql.execution.WholeStageCodegenExec", "Method Name": "$anonfun$doExecute$4$adapted", "File Name": "WholeStageCodegenExec.scala", "Line Number": 747 }, { "Declaring Class": "org.apache.spark.rdd.RDD", "Method Name": "$anonfun$mapPartitionsWithIndex$2", "File Name": "RDD.scala", "Line Number": 915 }, { "Declaring Class": "org.apache.spark.rdd.RDD", "Method Name": "$anonfun$mapPartitionsWithIndex$2$adapted", "File Name": "RDD.scala", "Line Number": 915 }, { "Declaring Class": "org.apache.spark.rdd.MapPartitionsRDD", "Method Name": "compute", "File Name": "MapPartitionsRDD.scala", "Line Number": 52 }, { "Declaring Class": "org.apache.spark.rdd.RDD", "Method Name": "computeOrReadCheckpoint", "File Name": "RDD.scala", "Line Number": 373 }, { "Declaring Class": "org.apache.spark.rdd.RDD", "Method Name": "iterator", "File Name": "RDD.scala", "Line Number": 337 }, { "Declaring Class": "org.apache.spark.rdd.MapPartitionsRDD", "Method Name": "compute", "File Name": "MapPartitionsRDD.scala", "Line Number": 52 }, { "Declaring Class": "org.apache.spark.rdd.RDD", "Method Name": "computeOrReadCheckpoint", "File Name": "RDD.scala", "Line Number": 373 }, { "Declaring Class": "org.apache.spark.rdd.RDD", "Method Name": "iterator", "File Name": "RDD.scala", "Line Number": 337 }, { "Declaring Class": "org.apache.spark.shuffle.ShuffleWriteProcessor", "Method Name": "write", "File Name": "ShuffleWriteProcessor.scala", "Line Number": 59 }, { "Declaring Class": "org.apache.spark.scheduler.ShuffleMapTask", "Method Name": "runTask", "File Name": "ShuffleMapTask.scala", "Line Number": 99 }, { "Declaring Class": "org.apache.spark.scheduler.ShuffleMapTask", "Method Name": "runTask", "File Name": "ShuffleMapTask.scala", "Line Number": 52 }, { "Declaring Class": "org.apache.spark.scheduler.Task", 
"Method Name": "run", "File Name": "Task.scala", "Line Number": 127 }, {
[jira] [Updated] (SPARK-32314) [SHS] Remove old format of stacktrace in event log
[ https://issues.apache.org/jira/browse/SPARK-32314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-32314: - Summary: [SHS] Remove old format of stacktrace in event log (was: [SHS] Add config to control whether log old format of stacktrace) > [SHS] Remove old format of stacktrace in event log > -- > > Key: SPARK-32314 > URL: https://issues.apache.org/jira/browse/SPARK-32314 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Zhongwei Zhu >Priority: Minor > > Currently, EventLoggingListener writes both "Stack Trace" and "Full Stack > Trace" in the TaskEndReason of ExceptionFailure to the event log. Both fields > contain the same info, and the former is kept for backward compatibility with > spark history before version 1.2.0. We can remove the 1st field by default > and add one config to control whether to log it. This will help > reduce eventlog size significantly when many tasks fail due to > ExceptionFailure. 
> > The sample json of current format as below: > > {noformat} > { > "Event": "SparkListenerTaskEnd", > "Stage ID": 1237, > "Stage Attempt ID": 0, > "Task Type": "ShuffleMapTask", > "Task End Reason": { > "Reason": "ExceptionFailure", > "Class Name": "java.io.IOException", > "Description": "org.apache.spark.SparkException: Failed to get > broadcast_1405_piece10 of broadcast_1405", > "Stack Trace": [ > { > "Declaring Class": "org.apache.spark.util.Utils$", > "Method Name": "tryOrIOException", > "File Name": "Utils.scala", > "Line Number": 1350 > }, > { > "Declaring Class": "org.apache.spark.broadcast.TorrentBroadcast", > "Method Name": "readBroadcastBlock", > "File Name": "TorrentBroadcast.scala", > "Line Number": 218 > }, > { > "Declaring Class": "org.apache.spark.broadcast.TorrentBroadcast", > "Method Name": "getValue", > "File Name": "TorrentBroadcast.scala", > "Line Number": 103 > }, > { > "Declaring Class": "org.apache.spark.broadcast.Broadcast", > "Method Name": "value", > "File Name": "Broadcast.scala", > "Line Number": 70 > }, > { > "Declaring Class": > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9", > "Method Name": "wholestagecodegen_init_0_0$", > "File Name": "generated.java", > "Line Number": 466 > }, > { > "Declaring Class": > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9", > "Method Name": "init", > "File Name": "generated.java", > "Line Number": 33 > }, > { > "Declaring Class": > "org.apache.spark.sql.execution.WholeStageCodegenExec", > "Method Name": "$anonfun$doExecute$4", > "File Name": "WholeStageCodegenExec.scala", > "Line Number": 750 > }, > { > "Declaring Class": > "org.apache.spark.sql.execution.WholeStageCodegenExec", > "Method Name": "$anonfun$doExecute$4$adapted", > "File Name": "WholeStageCodegenExec.scala", > "Line Number": 747 > }, > { > "Declaring Class": "org.apache.spark.rdd.RDD", > "Method Name": "$anonfun$mapPartitionsWithIndex$2", > "File 
Name": "RDD.scala", > "Line Number": 915 > }, > { > "Declaring Class": "org.apache.spark.rdd.RDD", > "Method Name": "$anonfun$mapPartitionsWithIndex$2$adapted", > "File Name": "RDD.scala", > "Line Number": 915 > }, > { > "Declaring Class": "org.apache.spark.rdd.MapPartitionsRDD", > "Method Name": "compute", > "File Name": "MapPartitionsRDD.scala", > "Line Number": 52 > }, > { > "Declaring Class": "org.apache.spark.rdd.RDD", > "Method Name": "computeOrReadCheckpoint", > "File Name": "RDD.scala", > "Line Number": 373 > }, > { > "Declaring Class": "org.apache.spark.rdd.RDD", > "Method Name": "iterator", > "File Name": "RDD.scala", > "Line Number": 337 > }, > { > "Declaring Class": "org.apache.spark.rdd.MapPartitionsRDD", > "Method Name": "compute", > "File Name": "MapPartitionsRDD.scala", > "Line Number": 52 > }, > { > "Declaring Class": "org.apache.spark.rdd.RDD", > "Method Name": "computeOrReadCheckpoint", > "File Name": "RDD.scala", > "Line Number": 373 > }, > { > "Declaring Class": "org.apache.spark.rdd.RDD", > "Method Name": "iterator", >
[jira] [Updated] (SPARK-33375) [CORE] Add config spark.yarn.pyspark.archives
[ https://issues.apache.org/jira/browse/SPARK-33375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-33375: - Summary: [CORE] Add config spark.yarn.pyspark.archives (was: [CORE] config spark.yarn.pyspark.archives) > [CORE] Add config spark.yarn.pyspark.archives > - > > Key: SPARK-33375 > URL: https://issues.apache.org/jira/browse/SPARK-33375 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 3.0.1 >Reporter: Zhongwei Zhu >Priority: Minor > > This config provides similar function as `spark.yarn.archive`, but instead > passing location of pyspark.zip and py4j.zip. It has below benefits: > 1. If I have different version of `spark.yarn.archive` from current machine, > then there is no way to provide matched pyspark version. This happens in > version upgrade. > 2. This can save effort of uploading pyspark.zip and py4j.zip every time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33375) [CORE] config spark.yarn.pyspark.archives
Zhongwei Zhu created SPARK-33375: Summary: [CORE] config spark.yarn.pyspark.archives Key: SPARK-33375 URL: https://issues.apache.org/jira/browse/SPARK-33375 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 3.0.1 Reporter: Zhongwei Zhu This config provides a similar function to `spark.yarn.archive`, but instead passes the locations of pyspark.zip and py4j.zip. It has the benefits below: 1. If the version in `spark.yarn.archive` differs from the current machine's, there is no way to provide a matching pyspark version. This happens during version upgrades. 2. It saves the effort of uploading pyspark.zip and py4j.zip every time.
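The proposed lookup order could be sketched like this (illustrative Python; the config name comes from the issue, while the fallback paths and return shape are assumptions):

```python
def resolve_pyspark_archives(conf):
    """Sketch of the proposed archive resolution.

    If spark.yarn.pyspark.archives is set, reuse those remote archives and
    skip uploading pyspark.zip / py4j.zip from the local SPARK_HOME, which
    also lets the archives match a different Spark version than the one on
    the submitting machine.
    """
    configured = conf.get("spark.yarn.pyspark.archives")
    if configured:
        return configured.split(","), False           # no upload needed
    # Illustrative local fallback paths under SPARK_HOME.
    local = ["python/lib/pyspark.zip", "python/lib/py4j.zip"]
    return local, True                                # upload local copies
```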
[jira] [Created] (SPARK-33374) [CORE] Remove unnecessary python path from spark home
Zhongwei Zhu created SPARK-33374: Summary: [CORE] Remove unnecessary python path from spark home Key: SPARK-33374 URL: https://issues.apache.org/jira/browse/SPARK-33374 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.1 Reporter: Zhongwei Zhu Currently, spark-submit will upload pyspark.zip and py4j-0.10.9-src.zip into staging folder, and both files will be added into PYTHONPATH. So it's unnecessary to add duplicate files in current spark home folder on local machine. Output of `sys.path` as below: 'D:\\data\\yarnnm\\local\\usercache\\z\\appcache\\application_1603546638930_150736\\container_e1148_1603546638930_150736_01_02\\pyspark.zip', 'D:\\data\\yarnnm\\local\\usercache\\z\\appcache\\application_1603546638930_150736\\container_e1148_1603546638930_150736_01_02\\py4j-0.10.7-src.zip', 'D:\\data\\spark.latest\\python\\lib\\pyspark.zip', 'D:\\data\\spark.latest\\python\\lib\\py4j-0.10.7-src.zip', -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33274) [SS] Fix job hang in cp mode when total cores less than total kafka partition
Zhongwei Zhu created SPARK-33274: Summary: [SS] Fix job hang in cp mode when total cores less than total kafka partition Key: SPARK-33274 URL: https://issues.apache.org/jira/browse/SPARK-33274 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.1 Reporter: Zhongwei Zhu In continuous processing mode, EpochCoordinator won't add offsets to the query until it has received ReportPartitionOffset from all partitions. Normally, each kafka topic partition is handled by one core; if the total cores are fewer than the total kafka topic partition count, the job hangs without any error message. Failing the job with an error message is the better solution.
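The fail-fast check the issue proposes might look like this (illustrative Python; the function name and message wording are assumptions):

```python
def check_continuous_capacity(total_cores, kafka_partitions):
    """Sketch of a fail-fast capacity check for continuous processing.

    Each Kafka partition pins one long-running task to a core; with fewer
    cores than partitions, some partitions never report offsets, the epoch
    can never commit, and the job hangs silently. Raising a clear error at
    startup is the proposed behavior.
    """
    if total_cores < kafka_partitions:
        raise RuntimeError(
            f"Continuous processing needs at least {kafka_partitions} cores "
            f"(one per Kafka partition), but only {total_cores} are available")
```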
[jira] [Commented] (SPARK-32863) Full outer stream-stream join
[ https://issues.apache.org/jira/browse/SPARK-32863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219872#comment-17219872 ] Zhongwei Zhu commented on SPARK-32863: -- [~chengsu] Have you already worked on this? I want to help on this PR. > Full outer stream-stream join > - > > Key: SPARK-32863 > URL: https://issues.apache.org/jira/browse/SPARK-32863 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Major > > Current stream-stream join supports inner, left outer and right outer join > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166] > ). With current design of stream-stream join (which marks whether the row is > matched or not in state store), it would be very easy to support full outer > join as well. > > Full outer stream-stream join will work as followed: > (1).for left side input row, check if there's a match on right side state > store. If there's a match, output all matched rows. Put the row in left side > state store. > (2).for right side input row, check if there's a match on left side state > store. If there's a match, output all matched rows and update left side rows > state with "matched" field to set to true. Put the right side row in right > side state store. > (3).for left side row needs to be evicted from state store, output the row if > "matched" field is false. > (4).for right side row needs to be evicted from state store, output the row > if "matched" field is false. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
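The four steps described in the issue can be sketched as a minimal single-batch join (illustrative Python; real Spark state stores, watermark-driven eviction, and row encoding are omitted, and marking right-side rows as matched in step (1) is an assumption needed for symmetry):

```python
def full_outer_stream_join(left_batch, right_batch, left_state, right_state,
                           evict_left=(), evict_right=()):
    """Minimal sketch of the matched-flag algorithm from the issue.

    State stores are dicts: key -> list of [value, matched] entries.
    """
    out = []
    # (1) left input: emit matches from right state, then store the row.
    for k, v in left_batch:
        row = [v, False]
        for r in right_state.get(k, []):
            out.append((k, v, r[0]))
            row[1] = True
            r[1] = True
        left_state.setdefault(k, []).append(row)
    # (2) right input: emit matches from left state, mark left rows matched,
    # then store the row.
    for k, v in right_batch:
        row = [v, False]
        for l in left_state.get(k, []):
            out.append((k, l[0], v))
            row[1] = True
            l[1] = True
        right_state.setdefault(k, []).append(row)
    # (3)/(4) eviction: unmatched rows are emitted with nulls on the other side.
    for k in evict_left:
        out += [(k, l[0], None) for l in left_state.pop(k, []) if not l[1]]
    for k in evict_right:
        out += [(k, None, r[0]) for r in right_state.pop(k, []) if not r[1]]
    return out
```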
[jira] [Created] (SPARK-32446) Add new executor metrics summary REST APIs and parameters
Zhongwei Zhu created SPARK-32446: Summary: Add new executor metrics summary REST APIs and parameters Key: SPARK-32446 URL: https://issues.apache.org/jira/browse/SPARK-32446 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.0.0 Reporter: Zhongwei Zhu Add percentile distribution of peak memory metrics for all executors http://:18080/api/v1/applications//
[jira] [Created] (SPARK-32349) [UI] Reduce unnecessary allexecutors call when render stage page executor summary
Zhongwei Zhu created SPARK-32349: Summary: [UI] Reduce unnecessary allexecutors call when render stage page executor summary Key: SPARK-32349 URL: https://issues.apache.org/jira/browse/SPARK-32349 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.0 Reporter: Zhongwei Zhu Currently, the stage page executor summary table data are built by merging the /allexecutors response and stage.executorSummary. The data needed from /allexecutors are only two fields: hostPort and executorLogs. By adding both fields to v1.ExecutorStageSummary, we can save one /allexecutors call and make the frontend logic simpler.
[jira] [Created] (SPARK-32314) [SHS] Add config to control whether log old format of stacktrace
Zhongwei Zhu created SPARK-32314: Summary: [SHS] Add config to control whether log old format of stacktrace Key: SPARK-32314 URL: https://issues.apache.org/jira/browse/SPARK-32314 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Zhongwei Zhu Currently, EventLoggingListeneer write both "Stack Trace" and "Full Stack Trace" in TaskEndResaon of ExceptionFailure to event log. Both fields contains same info, and the former one is kept for backward compatibility of spark history before version 1.2.0. We can remove 1st field in default setting and add one config to control whether log 1st field. This will help reduce eventlog size significantly when lots of task are failed due to ExceptionFailure. The sample json of current format as below: {noformat} { "Event": "SparkListenerTaskEnd", "Stage ID": 1237, "Stage Attempt ID": 0, "Task Type": "ShuffleMapTask", "Task End Reason": { "Reason": "ExceptionFailure", "Class Name": "java.io.IOException", "Description": "org.apache.spark.SparkException: Failed to get broadcast_1405_piece10 of broadcast_1405", "Stack Trace": [ { "Declaring Class": "org.apache.spark.util.Utils$", "Method Name": "tryOrIOException", "File Name": "Utils.scala", "Line Number": 1350 }, { "Declaring Class": "org.apache.spark.broadcast.TorrentBroadcast", "Method Name": "readBroadcastBlock", "File Name": "TorrentBroadcast.scala", "Line Number": 218 }, { "Declaring Class": "org.apache.spark.broadcast.TorrentBroadcast", "Method Name": "getValue", "File Name": "TorrentBroadcast.scala", "Line Number": 103 }, { "Declaring Class": "org.apache.spark.broadcast.Broadcast", "Method Name": "value", "File Name": "Broadcast.scala", "Line Number": 70 }, { "Declaring Class": "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9", "Method Name": "wholestagecodegen_init_0_0$", "File Name": "generated.java", "Line Number": 466 }, { "Declaring Class": 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9", "Method Name": "init", "File Name": "generated.java", "Line Number": 33 }, { "Declaring Class": "org.apache.spark.sql.execution.WholeStageCodegenExec", "Method Name": "$anonfun$doExecute$4", "File Name": "WholeStageCodegenExec.scala", "Line Number": 750 }, { "Declaring Class": "org.apache.spark.sql.execution.WholeStageCodegenExec", "Method Name": "$anonfun$doExecute$4$adapted", "File Name": "WholeStageCodegenExec.scala", "Line Number": 747 }, { "Declaring Class": "org.apache.spark.rdd.RDD", "Method Name": "$anonfun$mapPartitionsWithIndex$2", "File Name": "RDD.scala", "Line Number": 915 }, { "Declaring Class": "org.apache.spark.rdd.RDD", "Method Name": "$anonfun$mapPartitionsWithIndex$2$adapted", "File Name": "RDD.scala", "Line Number": 915 }, { "Declaring Class": "org.apache.spark.rdd.MapPartitionsRDD", "Method Name": "compute", "File Name": "MapPartitionsRDD.scala", "Line Number": 52 }, { "Declaring Class": "org.apache.spark.rdd.RDD", "Method Name": "computeOrReadCheckpoint", "File Name": "RDD.scala", "Line Number": 373 }, { "Declaring Class": "org.apache.spark.rdd.RDD", "Method Name": "iterator", "File Name": "RDD.scala", "Line Number": 337 }, { "Declaring Class": "org.apache.spark.rdd.MapPartitionsRDD", "Method Name": "compute", "File Name": "MapPartitionsRDD.scala", "Line Number": 52 }, { "Declaring Class": "org.apache.spark.rdd.RDD", "Method Name": "computeOrReadCheckpoint", "File Name": "RDD.scala", "Line Number": 373 }, { "Declaring Class": "org.apache.spark.rdd.RDD", "Method Name": "iterator", "File Name": "RDD.scala", "Line Number": 337 }, { "Declaring Class": "org.apache.spark.shuffle.ShuffleWriteProcessor", "Method Name": "write", "File Name": "ShuffleWriteProcessor.scala", "Line Number": 59 }, { "Declaring Class": "org.apache.spark.scheduler.ShuffleMapTask", "Method Name": "runTask", "File Name": "ShuffleMapTask.scala", "Line Number": 99 }, { "Declaring Class": 
"org.apache.spark.scheduler.ShuffleMapTask",
[jira] [Created] (SPARK-32288) [UI] Add exception summary table in stage page
Zhongwei Zhu created SPARK-32288: Summary: [UI] Add exception summary table in stage page Key: SPARK-32288 URL: https://issues.apache.org/jira/browse/SPARK-32288 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 3.0.0 Reporter: Zhongwei Zhu When there are many task failures during one stage, it's hard to find failure patterns, such as by aggregating task failures by exception type and message. With such information, we can easily see which type of exception is the root cause of the stage failure.
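The aggregation the proposed table would show can be sketched in a few lines (illustrative Python; the field names are assumptions):

```python
from collections import Counter

def exception_summary(failed_tasks):
    """Group task failures by (exception class, message), most frequent first.

    This is the kind of rollup the proposed stage-page table would display,
    surfacing the dominant failure pattern of a stage at a glance.
    """
    return Counter((t["class"], t["message"]) for t in failed_tasks).most_common()
```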
[jira] [Created] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API
Zhongwei Zhu created SPARK-32125: Summary: [UI] Support get taskList by status in Web UI and SHS Rest API Key: SPARK-32125 URL: https://issues.apache.org/jira/browse/SPARK-32125 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.0 Reporter: Zhongwei Zhu Support fetching taskList by status as below: /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32124) [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4
Zhongwei Zhu created SPARK-32124: Summary: [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4 Key: SPARK-32124 URL: https://issues.apache.org/jira/browse/SPARK-32124 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Zhongwei Zhu When reading an event log produced by Spark 2.4.4, parsing TaskEndReason failed due to the missing field "Map Index".
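A backward-compatible read could be sketched like this (illustrative Python; "Map Index" is the field named in the issue, while the sentinel default and other field names are assumptions):

```python
import json

def parse_fetch_failed(reason_json):
    """Sketch of a lenient parse for FetchFailed TaskEndReason JSON.

    Spark 2.4 event logs predate the "Map Index" field, so a strict lookup
    fails on old logs; falling back to a sentinel keeps them readable.
    """
    reason = json.loads(reason_json)
    # Treat a missing "Map Index" as unknown rather than raising.
    map_index = reason.get("Map Index", -1)
    return reason["Reason"], map_index
```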
[jira] [Updated] (SPARK-32044) [SS] 2.4 Kafka continuous processing print mislead initial offsets log
[ https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-32044: - Summary: [SS] 2.4 Kafka continuous processing print mislead initial offsets log (was: [SS] 2.4 Kakfa continuous processing print mislead initial offsets log ) > [SS] 2.4 Kafka continuous processing print mislead initial offsets log > --- > > Key: SPARK-32044 > URL: https://issues.apache.org/jira/browse/SPARK-32044 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.6 >Reporter: Zhongwei Zhu >Priority: Trivial > Original Estimate: 24h > Remaining Estimate: 24h > > When using structured streaming in continuous processing mode, after restart > spark job, spark job can correctly pick up offsets in checkpoint location > from last epoch. But it always print out below log: > 20/06/12 00:58:09 INFO [stream execution thread for [id = > 34e5b909-f9fe-422a-89c0-081251a68693, runId = > 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: > Initial offsets: > \{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}} > This log is misleading as spark didn't use this one as initial offsets. Also, > it results in unnecessary kafka offset fetch. This is caused by below code in > KafkaContinuousReader > {code:java} > offset = start.orElse { > val offsets = initialOffsets match { > case EarliestOffsetRangeLimit => > KafkaSourceOffset(offsetReader.fetchEarliestOffsets()) > case LatestOffsetRangeLimit => > KafkaSourceOffset(offsetReader.fetchLatestOffsets(None)) > case SpecificOffsetRangeLimit(p) => > offsetReader.fetchSpecificOffsets(p, reportDataLoss) > } > logInfo(s"Initial offsets: $offsets") > offsets > } > {code} > The code inside orElse block is always executed even when start has value. 
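The root cause quoted in the issue (the orElse block running even when start has a value) comes down to eager argument evaluation: java.util.Optional.orElse evaluates its argument at the call site, unlike orElseGet. A small Python analogy (names are illustrative):

```python
def or_else(value, fallback):
    # Like java.util.Optional.orElse: the fallback argument was already
    # evaluated by the caller, so its side effects run even when value is set.
    return value if value is not None else fallback

def or_else_get(value, fallback_fn):
    # Like java.util.Optional.orElseGet: the fallback runs only when needed.
    return value if value is not None else fallback_fn()

fetches = []
def fetch_initial_offsets():
    fetches.append("fetch")   # stands in for the Kafka fetch plus the logInfo
    return "earliest-offsets"

checkpointed = "offsets-from-checkpoint"
# Eager: the fetch (and its "Initial offsets" log line) happens anyway.
offset_eager = or_else(checkpointed, fetch_initial_offsets())
# Lazy: the fetch is skipped because a checkpointed start offset exists.
offset_lazy = or_else_get(checkpointed, fetch_initial_offsets)
```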
[jira] [Updated] (SPARK-32044) [SS] 2.4 Kakfa continuous processing print mislead initial offsets log
[ https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhongwei Zhu updated SPARK-32044:
---------------------------------
    Summary: [SS] 2.4 Kafka continuous processing prints misleading initial offsets log  (was: [SS] Kafka continuous processing prints misleading initial offsets log)

> [SS] 2.4 Kafka continuous processing prints misleading initial offsets log
> --------------------------------------------------------------------------
>
>                 Key: SPARK-32044
>                 URL: https://issues.apache.org/jira/browse/SPARK-32044
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.6
>            Reporter: Zhongwei Zhu
>            Priority: Trivial
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using Structured Streaming in continuous processing mode, a restarted Spark job correctly picks up the offsets of the last epoch from the checkpoint location. But it always prints the log below:
>
> 20/06/12 00:58:09 INFO [stream execution thread for [id = 34e5b909-f9fe-422a-89c0-081251a68693, runId = 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: Initial offsets: {"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}
>
> This log is misleading because Spark did not use these as the initial offsets. It also results in an unnecessary Kafka offset fetch. The cause is the code below in KafkaContinuousReader:
> {code:java}
> offset = start.orElse {
>   val offsets = initialOffsets match {
>     case EarliestOffsetRangeLimit =>
>       KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
>     case LatestOffsetRangeLimit =>
>       KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
>     case SpecificOffsetRangeLimit(p) =>
>       offsetReader.fetchSpecificOffsets(p, reportDataLoss)
>   }
>   logInfo(s"Initial offsets: $offsets")
>   offsets
> }
> {code}
> The code inside the orElse block is always executed, even when start has a value.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
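[Editor's note] The root-cause observation above ("the code inside the orElse block is always executed even when start has a value") matches the semantics of `java.util.Optional.orElse`, whose argument is evaluated eagerly before the call, whereas `orElseGet` takes a `Supplier` that is invoked only when the Optional is empty. A minimal sketch in plain Java (the `fetchOffsets` counter is a hypothetical stand-in for the Kafka offset fetch, not Spark code):

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

public class EagerOrElse {
    // Counts how many times the expensive "fetch" runs.
    static final AtomicInteger fetchCalls = new AtomicInteger();

    // Hypothetical stand-in for offsetReader.fetchLatestOffsets() etc.
    static String fetchOffsets() {
        fetchCalls.incrementAndGet();
        return "fetched-offsets";
    }

    public static void main(String[] args) {
        // Simulates a restart where checkpointed offsets are present.
        Optional<String> start = Optional.of("checkpoint-offsets");

        // orElse: the argument expression is evaluated BEFORE the call,
        // so fetchOffsets() runs even though start is non-empty.
        String eager = start.orElse(fetchOffsets());
        System.out.println(eager + " fetchCalls=" + fetchCalls.get()); // fetchCalls=1

        // orElseGet: the supplier is invoked only when the Optional is empty,
        // so no extra fetch (and no misleading log) happens here.
        String lazy = start.orElseGet(EagerOrElse::fetchOffsets);
        System.out.println(lazy + " fetchCalls=" + fetchCalls.get()); // still fetchCalls=1
    }
}
```

In Scala, passing a `{ ... }` block to a Java `Optional.orElse` compiles to exactly this eager evaluation, which is why the block in KafkaContinuousReader runs unconditionally; deferring the fallback behind a supplier is the usual fix for this pattern.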
[jira] [Updated] (SPARK-32044) [SS] Kafka continuous processing prints misleading initial offsets log
[ https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhongwei Zhu updated SPARK-32044:
---------------------------------
    Description: 
When using Structured Streaming in continuous processing mode, a restarted Spark job correctly picks up the offsets of the last epoch from the checkpoint location. But it always prints the log below:

20/06/12 00:58:09 INFO [stream execution thread for [id = 34e5b909-f9fe-422a-89c0-081251a68693, runId = 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: Initial offsets: {"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}

This log is misleading because Spark did not use these as the initial offsets. It also results in an unnecessary Kafka offset fetch. The cause is the code below in KafkaContinuousReader:
{code:java}
offset = start.orElse {
  val offsets = initialOffsets match {
    case EarliestOffsetRangeLimit =>
      KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
    case LatestOffsetRangeLimit =>
      KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
    case SpecificOffsetRangeLimit(p) =>
      offsetReader.fetchSpecificOffsets(p, reportDataLoss)
  }
  logInfo(s"Initial offsets: $offsets")
  offsets
}
{code}
The code inside the orElse block is always executed, even when start has a value.

  was:
When using Structured Streaming in continuous processing mode, a restarted Spark job correctly picks up the offsets of the last epoch from the checkpoint location. But it always prints the log below:

20/06/12 00:58:09 INFO [stream execution thread for [id = 34e5b909-f9fe-422a-89c0-081251a68693, runId = 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: Initial offsets: {"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}

And this log is misleading, as Spark did not use these as the initial offsets. The cause is the code below in KafkaContinuousReader:
{code:java}
offset = start.orElse {
  val offsets = initialOffsets match {
    case EarliestOffsetRangeLimit =>
      KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
    case LatestOffsetRangeLimit =>
      KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
    case SpecificOffsetRangeLimit(p) =>
      offsetReader.fetchSpecificOffsets(p, reportDataLoss)
  }
  logInfo(s"Initial offsets: $offsets")
  offsets
}
{code}
The code inside the orElse block is always executed, even when start has a value.

> [SS] Kafka continuous processing prints misleading initial offsets log
> ----------------------------------------------------------------------
>
>                 Key: SPARK-32044
>                 URL: https://issues.apache.org/jira/browse/SPARK-32044
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.6
>            Reporter: Zhongwei Zhu
>            Priority: Trivial
>             Fix For: 2.4.7
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using Structured Streaming in continuous processing mode, a restarted Spark job correctly picks up the offsets of the last epoch from the checkpoint location. But it always prints the log below:
>
> 20/06/12 00:58:09 INFO [stream execution thread for [id = 34e5b909-f9fe-422a-89c0-081251a68693, runId = 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: Initial offsets: {"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}
>
> This log is misleading because Spark did not use these as the initial offsets. It also results in an unnecessary Kafka offset fetch. The cause is the code below in KafkaContinuousReader:
> {code:java}
> offset = start.orElse {
>   val offsets = initialOffsets match {
>     case EarliestOffsetRangeLimit =>
>       KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
>     case LatestOffsetRangeLimit =>
>       KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
>     case SpecificOffsetRangeLimit(p) =>
>       offsetReader.fetchSpecificOffsets(p, reportDataLoss)
>   }
>   logInfo(s"Initial offsets: $offsets")
>   offsets
> }
> {code}
> The code inside the orElse block is always executed, even when start has a value.
[jira] [Updated] (SPARK-32044) [SS] Kafka continuous processing prints misleading initial offsets log
[ https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhongwei Zhu updated SPARK-32044:
---------------------------------
    Summary: [SS] Kafka continuous processing prints misleading initial offsets log  (was: Kafka continuous processing prints misleading initial offsets log)

> [SS] Kafka continuous processing prints misleading initial offsets log
> ----------------------------------------------------------------------
>
>                 Key: SPARK-32044
>                 URL: https://issues.apache.org/jira/browse/SPARK-32044
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.6
>            Reporter: Zhongwei Zhu
>            Priority: Trivial
>             Fix For: 2.4.7
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using Structured Streaming in continuous processing mode, a restarted Spark job correctly picks up the offsets of the last epoch from the checkpoint location. But it always prints the log below:
>
> 20/06/12 00:58:09 INFO [stream execution thread for [id = 34e5b909-f9fe-422a-89c0-081251a68693, runId = 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: Initial offsets: {"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}
>
> And this log is misleading, as Spark did not use these as the initial offsets. The cause is the code below in KafkaContinuousReader:
> {code:java}
> offset = start.orElse {
>   val offsets = initialOffsets match {
>     case EarliestOffsetRangeLimit =>
>       KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
>     case LatestOffsetRangeLimit =>
>       KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
>     case SpecificOffsetRangeLimit(p) =>
>       offsetReader.fetchSpecificOffsets(p, reportDataLoss)
>   }
>   logInfo(s"Initial offsets: $offsets")
>   offsets
> }
> {code}
> The code inside the orElse block is always executed, even when start has a value.
[jira] [Created] (SPARK-32044) Kafka continuous processing prints misleading initial offsets log
Zhongwei Zhu created SPARK-32044:
---------------------------------

             Summary: Kafka continuous processing prints misleading initial offsets log
                 Key: SPARK-32044
                 URL: https://issues.apache.org/jira/browse/SPARK-32044
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.4.6
            Reporter: Zhongwei Zhu
             Fix For: 2.4.7


When using Structured Streaming in continuous processing mode, a restarted Spark job correctly picks up the offsets of the last epoch from the checkpoint location. But it always prints the log below:

20/06/12 00:58:09 INFO [stream execution thread for [id = 34e5b909-f9fe-422a-89c0-081251a68693, runId = 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: Initial offsets: {"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}

And this log is misleading, as Spark did not use these as the initial offsets. The cause is the code below in KafkaContinuousReader:
{code:java}
offset = start.orElse {
  val offsets = initialOffsets match {
    case EarliestOffsetRangeLimit =>
      KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
    case LatestOffsetRangeLimit =>
      KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
    case SpecificOffsetRangeLimit(p) =>
      offsetReader.fetchSpecificOffsets(p, reportDataLoss)
  }
  logInfo(s"Initial offsets: $offsets")
  offsets
}
{code}
The code inside the orElse block is always executed, even when start has a value.
[jira] [Commented] (SPARK-23432) Expose executor memory metrics in the web UI for executors
[ https://issues.apache.org/jira/browse/SPARK-23432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008493#comment-17008493 ]

Zhongwei Zhu commented on SPARK-23432:
--------------------------------------

I'll work on this.

> Expose executor memory metrics in the web UI for executors
> ----------------------------------------------------------
>
>                 Key: SPARK-23432
>                 URL: https://issues.apache.org/jira/browse/SPARK-23432
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 2.2.1
>            Reporter: Edward Lu
>            Priority: Major
>
> Add the new memory metrics (jvmUsedMemory, executionMemory, storageMemory, and unifiedMemory, etc.) to the executors tab, in the summary and for each executor.
> This is a subtask for SPARK-23206. Please refer to the design doc for that ticket for more details.