[jira] [Created] (SPARK-46352) Support spark conf to configure log level of specific package or class

2023-12-10 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-46352:


 Summary: Support spark conf to configure log level of specific 
package or class
 Key: SPARK-46352
 URL: https://issues.apache.org/jira/browse/SPARK-46352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Zhongwei Zhu


Use a Spark conf such as `spark.log.level.org.apache.spark=error` to set the 
logger `org.apache.spark` to the `error` level.
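
A minimal sketch of how such a prefix could be consumed at startup, assuming the 
proposed `spark.log.level.<logger>` naming and a Log4j 2 backend; the helper 
below is illustrative, not an existing Spark API:

{code:scala}
import org.apache.logging.log4j.Level
import org.apache.logging.log4j.core.config.Configurator
import org.apache.spark.SparkConf

// Illustrative sketch only: the "spark.log.level." prefix is this proposal,
// not an existing Spark conf namespace.
object LogLevelConfSketch {
  def applyLogLevelConfs(conf: SparkConf): Unit = {
    val prefix = "spark.log.level."
    conf.getAll
      .filter { case (key, _) => key.startsWith(prefix) }
      .foreach { case (key, value) =>
        val loggerName = key.stripPrefix(prefix)   // e.g. "org.apache.spark"
        // Adjust only that logger, leaving the root level untouched (Log4j 2).
        Configurator.setLevel(loggerName, Level.toLevel(value.toUpperCase))
      }
  }
}
{code}

With something like this, passing `--conf spark.log.level.org.apache.spark=error` 
at submit time would quiet only Spark's own loggers without touching the root level.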



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45622) java -target should use java.version instead of 17

2023-10-21 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-45622:


 Summary: java -target should use java.version instead of 17
 Key: SPARK-45622
 URL: https://issues.apache.org/jira/browse/SPARK-45622
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41954) Add isDecommissioned in ExecutorDeadException

2023-10-21 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-41954:
-
Description: When an executor is dead, we want to know whether its death was 
caused by decommissioning.

> Add isDecommissioned in ExecutorDeadException
> -
>
> Key: SPARK-41954
> URL: https://issues.apache.org/jira/browse/SPARK-41954
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Zhongwei Zhu
>Priority: Major
>
> When an executor is dead, we want to know whether its death was caused by 
> decommissioning.
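
A minimal sketch of the proposed flag, assuming the exception simply gains a 
boolean field; the class and helper below are illustrative stand-ins, not the 
real ExecutorDeadException:

{code:scala}
import org.apache.spark.SparkException

object ExecutorDeadSketch {
  // Sketch of the proposal: carry a decommission flag on the exception so that
  // callers can tell "executor died because it was decommissioned" apart from
  // an unexpected loss. The extra field is the proposal, not current API.
  class ExecutorDeadExceptionSketch(
      message: String,
      val isDecommissioned: Boolean = false)
    extends SparkException(message, null)

  // Hypothetical caller-side use, e.g. to decide whether a failure is benign.
  def causedByDecommission(t: Throwable): Boolean = t match {
    case e: ExecutorDeadExceptionSketch => e.isDecommissioned
    case _ => false
  }
}
{code}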



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45372) Handle ClassNotFoundException when load extension

2023-09-28 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-45372:
-
Summary: Handle ClassNotFoundException when load extension  (was: Handle 
ClassNotFound when load extension)

> Handle ClassNotFoundException when load extension
> -
>
> Key: SPARK-45372
> URL: https://issues.apache.org/jira/browse/SPARK-45372
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> When loading an extension fails with ClassNotFoundException, SparkContext fails 
> to initialize. Better behavior would be to skip this extension and log an error 
> without failing SparkContext initialization.
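
A minimal sketch of the proposed behavior, assuming a standalone loader loop for 
illustration (the real loading happens inside SparkContext initialization):

{code:scala}
import org.slf4j.LoggerFactory

// Sketch: skip an extension whose class is missing instead of failing
// SparkContext initialization. This standalone helper is illustrative only.
object ExtensionLoaderSketch {
  private val log = LoggerFactory.getLogger(getClass)

  def loadExtensions(classNames: Seq[String]): Seq[AnyRef] = {
    classNames.flatMap { name =>
      try {
        Some(Class.forName(name).getConstructor().newInstance().asInstanceOf[AnyRef])
      } catch {
        case e: ClassNotFoundException =>
          log.error(s"Skipping extension $name because its class was not found", e)
          None
      }
    }
  }
}
{code}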



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45372) Handle ClassNotFound when load extension

2023-09-28 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-45372:


 Summary: Handle ClassNotFound when load extension
 Key: SPARK-45372
 URL: https://issues.apache.org/jira/browse/SPARK-45372
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Zhongwei Zhu


When loading an extension fails with ClassNotFoundException, SparkContext fails 
to initialize. Better behavior would be to skip this extension and log an error 
without failing SparkContext initialization.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45217) Support change log level of specific package or class

2023-09-19 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-45217:


 Summary: Support change log level of specific package or class
 Key: SPARK-45217
 URL: https://issues.apache.org/jira/browse/SPARK-45217
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Zhongwei Zhu


Add SparkContext.setLogLevel(loggerName: String, logLevel: String) to support 
changing the log level of a specific package or class.
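
A hedged sketch of what the proposed overload could do, assuming a Log4j 2 
backend; only setLogLevel(logLevel: String) exists on SparkContext today, so the 
two-argument form below is the proposal, not current API:

{code:scala}
import org.apache.logging.log4j.Level
import org.apache.logging.log4j.core.config.Configurator

object PerLoggerLevelSketch {
  // Sketch of the proposed two-argument overload: set one named logger only.
  def setLogLevel(loggerName: String, logLevel: String): Unit = {
    Configurator.setLevel(loggerName, Level.toLevel(logLevel.toUpperCase))
  }
}

// Intended usage once available on SparkContext, e.g.:
//   sc.setLogLevel("org.apache.spark.storage", "DEBUG")
//   sc.setLogLevel("org.apache.hadoop", "ERROR")
{code}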



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45057) Deadlock caused by rdd replication level of 2

2023-09-05 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-45057:
-
Description: 
 
When two tasks try to compute the same RDD with a replication level of 2 while 
running on only two executors, a deadlock will happen.

A task only releases the block's write lock after writing it on the local 
machine and replicating it to the remote executor.

||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task Thread T3)||Exe 2 (Shuffle Server Thread T4)||
|T0|write lock of rdd| | | |
|T1| | |write lock of rdd| |
|T2|replicate -> UploadBlockSync (blocked by T4)| | | |
|T3| | | |Received UploadBlock request from T1 (blocked by T4)|
|T4| | |replicate -> UploadBlockSync (blocked by T2)| |
|T5| |Received UploadBlock request from T3 (blocked by T1)| | |
|T6|Deadlock|Deadlock|Deadlock|Deadlock|

  was:
 
When two tasks try to compute the same RDD with a replication level of 2 while 
running on only two executors, a deadlock will happen.

||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task Thread T3)||Exe 2 (Shuffle Server Thread T4)||
|T0|write lock of rdd| | | |
|T1| | |write lock of rdd| |
|T2|replicate -> UploadBlockSync (blocked by T4)| | | |
|T3| | | |Received UploadBlock request from T1 (blocked by T4)|
|T4| | |replicate -> UploadBlockSync (blocked by T2)| |
|T5| |Received UploadBlock request from T3 (blocked by T1)| | |
|T6|Deadlock|Deadlock|Deadlock|Deadlock|


> Deadlock caused by rdd replication level of 2
> -
>
> Key: SPARK-45057
> URL: https://issues.apache.org/jira/browse/SPARK-45057
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Zhongwei Zhu
>Priority: Major
>
>  
> When two tasks try to compute the same RDD with a replication level of 2 while 
> running on only two executors, a deadlock will happen.
> A task only releases the block's write lock after writing it on the local 
> machine and replicating it to the remote executor.
>  
> ||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task Thread T3)||Exe 2 (Shuffle Server Thread T4)||
> |T0|write lock of rdd| | | |
> |T1| | |write lock of rdd| |
> |T2|replicate -> UploadBlockSync (blocked by T4)| | | |
> |T3| | | |Received UploadBlock request from T1 (blocked by T4)|
> |T4| | |replicate -> UploadBlockSync (blocked by T2)| |
> |T5| |Received UploadBlock request from T3 (blocked by T1)| | |
> |T6|Deadlock|Deadlock|Deadlock|Deadlock|
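
A hedged repro sketch of the setup in the table above (two executors, 
replication level 2, two tasks computing the same cached RDD); the submission 
sizing is an assumption, and whether the hang reproduces depends on the timing 
described above:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ReplicationDeadlockRepro {
  def main(args: Array[String]): Unit = {
    // Assumes a cluster with exactly two executors, e.g.
    //   spark-submit --num-executors 2 --executor-cores 1 ...
    val sc = new SparkContext(new SparkConf().setAppName("replication-2-deadlock"))

    // MEMORY_ONLY_2 replicates each computed block to one more executor.
    val rdd = sc.parallelize(1 to 1000, numSlices = 2).persist(StorageLevel.MEMORY_ONLY_2)

    // Two tasks (one per executor) each hold the write lock of their local block
    // while replicating to the other executor; per the table above, the two
    // UploadBlockSync calls can block on each other and the job hangs here.
    println(rdd.count())
    sc.stop()
  }
}
{code}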



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45057) Deadlock caused by rdd replication level of 2

2023-09-01 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-45057:
-
Description: 
 
When two tasks try to compute the same RDD with a replication level of 2 while 
running on only two executors, a deadlock will happen.

||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task Thread T3)||Exe 2 (Shuffle Server Thread T4)||
|T0|write lock of rdd| | | |
|T1| | |write lock of rdd| |
|T2|replicate -> UploadBlockSync (blocked by T4)| | | |
|T3| | | |Received UploadBlock request from T1 (blocked by T4)|
|T4| | |replicate -> UploadBlockSync (blocked by T2)| |
|T5| |Received UploadBlock request from T3 (blocked by T1)| | |
|T6|Deadlock|Deadlock|Deadlock|Deadlock|

  was:
 
When two tasks try to compute the same RDD with a replication level of 2 while 
running on only two executors, a deadlock will happen.

||Time||Exe 1 (Task Thread 1)||Exe 1 (Shuffle Server Thread 2)||Exe 2 (Task Thread 3)||Exe 2 (Shuffle Server Thread 4)||
|T0|write lock of rdd| | | |
|T1| | |write lock of rdd| |
|T2|replicate -> UploadBlockSync (blocked by shuffle server thread 4)| | | |
|T3| | | |Received UploadBlock request(blocked by task thread 3)|
|T4| | |replicate -> UploadBlockSync (blocked by shuffle server thread 2)| |
|T5| |Received UploadBlock request(blocked by task thread 1)| | |
|T6|Deadlock|Deadlock|Deadlock|Deadlock|


> Deadlock caused by rdd replication level of 2
> -
>
> Key: SPARK-45057
> URL: https://issues.apache.org/jira/browse/SPARK-45057
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Zhongwei Zhu
>Priority: Major
>
>  
> When two tasks try to compute the same RDD with a replication level of 2 while 
> running on only two executors, a deadlock will happen.
>  
> ||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task Thread T3)||Exe 2 (Shuffle Server Thread T4)||
> |T0|write lock of rdd| | | |
> |T1| | |write lock of rdd| |
> |T2|replicate -> UploadBlockSync (blocked by T4)| | | |
> |T3| | | |Received UploadBlock request from T1 (blocked by T4)|
> |T4| | |replicate -> UploadBlockSync (blocked by T2)| |
> |T5| |Received UploadBlock request from T3 (blocked by T1)| | |
> |T6|Deadlock|Deadlock|Deadlock|Deadlock|



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45057) Deadlock caused by rdd replication level of 2

2023-09-01 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-45057:


 Summary: Deadlock caused by rdd replication level of 2
 Key: SPARK-45057
 URL: https://issues.apache.org/jira/browse/SPARK-45057
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: Zhongwei Zhu


 
When two tasks try to compute the same RDD with a replication level of 2 while 
running on only two executors, a deadlock will happen.

||Time||Exe 1 (Task Thread 1)||Exe 1 (Shuffle Server Thread 2)||Exe 2 (Task Thread 3)||Exe 2 (Shuffle Server Thread 4)||
|T0|write lock of rdd| | | |
|T1| | |write lock of rdd| |
|T2|replicate -> UploadBlockSync (blocked by shuffle server thread 4)| | | |
|T3| | | |Received UploadBlock request(blocked by task thread 3)|
|T4| | |replicate -> UploadBlockSync (blocked by shuffle server thread 2)| |
|T5| |Received UploadBlock request(blocked by task thread 1)| | |
|T6|Deadlock|Deadlock|Deadlock|Deadlock|



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44345) Only log unknown shuffle map output as error when shuffle migration disabled

2023-07-08 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-44345:


 Summary: Only log unknown shuffle map output as error when shuffle 
migration disabled
 Key: SPARK-44345
 URL: https://issues.apache.org/jira/browse/SPARK-44345
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: Zhongwei Zhu


When decommission and shuffle migration are enabled, there are lots of error 
messages like `Asked to update map output for unknown shuffle`.

Shuffle cleanup and unregistration are done asynchronously by ContextCleaner. 
When the target block manager receives a shuffle block from a decommissioned 
block manager, it updates the shuffle location in the map output tracker, but by 
that time the shuffle may already have been unregistered. This should not be 
treated as an error.
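
A hypothetical sketch of the spirit of the proposal: choose the log level based 
on whether shuffle migration is enabled; the helper and its wiring are 
illustrative, not the actual MapOutputTracker code:

{code:scala}
import org.slf4j.LoggerFactory

object UnknownShuffleLogSketch {
  private val log = LoggerFactory.getLogger(getClass)

  // Proposed behavior: an unknown shuffle is expected during async cleanup plus
  // shuffle migration, so only treat it as an error when migration is disabled.
  def logUnknownShuffle(shuffleId: Int, migrationEnabled: Boolean): Unit = {
    val msg = s"Asked to update map output for unknown shuffle $shuffleId"
    if (migrationEnabled) log.debug(msg) else log.error(msg)
  }
}
{code}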



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44126) Migration shuffle to decommissioned executor should not count as block failure

2023-06-20 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-44126:
-
Description: 
When shuffle data is migrated to a decommissioned executor, the exception below 
is thrown:

{code:java}
org.apache.spark.SparkException: Exception thrown in awaitResult:     at 
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)    at 
org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:122)
    at 
org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$5(BlockManagerDecommissioner.scala:127)
    at 
org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$5$adapted(BlockManagerDecommissioner.scala:118)
    at scala.collection.immutable.List.foreach(List.scala:431)    at 
org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:118)
    at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)Caused by: 
java.lang.RuntimeException: 
org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: Block 
shuffle_2_6429_0.data cannot be saved on decommissioned executor    at 
org.apache.spark.errors.SparkCoreErrors$.cannotSaveBlockOnDecommissionedExecutorError(SparkCoreErrors.scala:238)
    at 
org.apache.spark.storage.BlockManager.checkShouldStore(BlockManager.scala:277)  
  at 
org.apache.spark.storage.BlockManager.putBlockDataAsStream(BlockManager.scala:741)
    at 
org.apache.spark.network.netty.NettyBlockRpcServer.receiveStream(NettyBlockRpcServer.scala:174)
    at 
org.apache.spark.network.server.AbstractAuthRpcHandler.receiveStream(AbstractAuthRpcHandler.java:78)
    at 
org.apache.spark.network.server.TransportRequestHandler.processStreamUpload(TransportRequestHandler.java:202)
    at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:115)
    at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
    at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
    at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at 
org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:190)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
  

[jira] [Created] (SPARK-44126) Migration shuffle to decommissioned executor should not count as block failure

2023-06-20 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-44126:


 Summary: Migration shuffle to decommissioned executor should not 
count as block failure
 Key: SPARK-44126
 URL: https://issues.apache.org/jira/browse/SPARK-44126
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu


When shuffle data is migrated to a decommissioned executor, the exception below 
is thrown:

```

org.apache.spark.SparkException: Exception thrown in awaitResult:     at 
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)    at 
org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:122)
    at 
org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$5(BlockManagerDecommissioner.scala:127)
    at 
org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$5$adapted(BlockManagerDecommissioner.scala:118)
    at scala.collection.immutable.List.foreach(List.scala:431)    at 
org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:118)
    at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)Caused by: 
java.lang.RuntimeException: 
org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: Block 
shuffle_2_6429_0.data cannot be saved on decommissioned executor    at 
org.apache.spark.errors.SparkCoreErrors$.cannotSaveBlockOnDecommissionedExecutorError(SparkCoreErrors.scala:238)
    at 
org.apache.spark.storage.BlockManager.checkShouldStore(BlockManager.scala:277)  
  at 
org.apache.spark.storage.BlockManager.putBlockDataAsStream(BlockManager.scala:741)
    at 
org.apache.spark.network.netty.NettyBlockRpcServer.receiveStream(NettyBlockRpcServer.scala:174)
    at 
org.apache.spark.network.server.AbstractAuthRpcHandler.receiveStream(AbstractAuthRpcHandler.java:78)
    at 
org.apache.spark.network.server.TransportRequestHandler.processStreamUpload(TransportRequestHandler.java:202)
    at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:115)
    at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
    at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
    at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at 
org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:190)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at 

[jira] [Created] (SPARK-44084) Dynamic allocation pending tasks should not include finished ones

2023-06-16 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-44084:


 Summary: Dynamic allocation pending tasks should not include 
finished ones
 Key: SPARK-44084
 URL: https://issues.apache.org/jira/browse/SPARK-44084
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43828) Add config to control whether close idle connection

2023-05-26 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-43828:


 Summary: Add config to control whether close idle connection
 Key: SPARK-43828
 URL: https://issues.apache.org/jira/browse/SPARK-43828
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43398) Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout

2023-05-07 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-43398:
-
Affects Version/s: (was: 3.4.0)

> Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout
> ---
>
> Key: SPARK-43398
> URL: https://issues.apache.org/jira/browse/SPARK-43398
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Zhongwei Zhu
>Priority: Major
>
> When dynamic allocation is enabled, the executor timeout should be the max of 
> idleTimeout, rddTimeout and shuffleTimeout.
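
In code form the proposed rule is just a max over the applicable timeouts; a 
minimal sketch, assuming the rdd and shuffle timeouts only apply while the 
executor still holds cached blocks or shuffle data (names are illustrative, not 
the ExecutorMonitor internals):

{code:scala}
object ExecutorTimeoutSketch {
  def executorTimeoutNs(
      idleTimeoutNs: Long,
      rddTimeoutNs: Long,
      shuffleTimeoutNs: Long,
      hasCachedBlocks: Boolean,
      hasShuffleData: Boolean): Long = {
    val applicable = Seq(
      Some(idleTimeoutNs),
      if (hasCachedBlocks) Some(rddTimeoutNs) else None,
      if (hasShuffleData) Some(shuffleTimeoutNs) else None).flatten
    applicable.max   // take the largest applicable timeout, not the first match
  }
}
{code}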



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43398) Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout

2023-05-07 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-43398:
-
Affects Version/s: 3.0.0

> Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout
> ---
>
> Key: SPARK-43398
> URL: https://issues.apache.org/jira/browse/SPARK-43398
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.4.0
>Reporter: Zhongwei Zhu
>Priority: Major
>
> When dynamic allocation is enabled, the executor timeout should be the max of 
> idleTimeout, rddTimeout and shuffleTimeout.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43399) Add config to control threshold of unregister map output when fetch failed

2023-05-07 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-43399:


 Summary: Add config to control threshold of unregister map output 
when fetch failed
 Key: SPARK-43399
 URL: https://issues.apache.org/jira/browse/SPARK-43399
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu


Spark will unregister the map outputs on an executor when a fetch from that 
executor fails. This might be too aggressive when the fetch failure is temporary 
and recoverable. So add a config to control the number of fetch failures needed 
before unregistering map outputs, and allow unregistering to be disabled.
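
A hedged sketch of what such a config entry could look like using Spark's 
internal ConfigBuilder pattern; the key name, default, version, and the 
negative-value-disables convention are assumptions, not the merged config:

{code:scala}
// Sketched as if it lived next to Spark's other config entries, since
// ConfigBuilder is internal to the org.apache.spark packages.
package org.apache.spark.internal.config

object FetchFailureConfigSketch {
  // Hypothetical entry: how many fetch failures from one executor are needed
  // before its map outputs are unregistered; a negative value disables it.
  val UNREGISTER_MAP_OUTPUT_MAX_FETCH_FAILURES =
    ConfigBuilder("spark.shuffle.unregisterMapOutput.maxFetchFailures")
      .doc("Fetch failures from an executor before its map outputs are " +
        "unregistered; a negative value disables unregistering entirely.")
      .version("3.5.0")
      .intConf
      .createWithDefault(1)
}
{code}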



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43398) Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout

2023-05-07 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-43398:


 Summary: Executor timeout should be max of idleTimeout rddTimeout 
shuffleTimeout
 Key: SPARK-43398
 URL: https://issues.apache.org/jira/browse/SPARK-43398
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu


When dynamic allocation is enabled, the executor timeout should be the max of 
idleTimeout, rddTimeout and shuffleTimeout.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43397) Log executor decommission duration

2023-05-06 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-43397:
-
Description: Log executor decommission duration.

> Log executor decommission duration
> --
>
> Key: SPARK-43397
> URL: https://issues.apache.org/jira/browse/SPARK-43397
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> Log executor decommission duration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43397) Log executor decommission duration

2023-05-06 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-43397:


 Summary: Log executor decommission duration
 Key: SPARK-43397
 URL: https://issues.apache.org/jira/browse/SPARK-43397
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43396) Add config to control max ratio of decommissioning executors

2023-05-06 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-43396:


 Summary: Add config to control max ratio of decommissioning 
executors
 Key: SPARK-43396
 URL: https://issues.apache.org/jira/browse/SPARK-43396
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu


Decommissioning too many executors at the same time with shuffle or RDD 
migration enabled could severely hurt shuffle fetch performance. The block 
manager decommissioner tries to migrate shuffle and RDD blocks as soon as 
possible, so the migration competes for network and disk IO with shuffle fetches 
on the target executors.
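
A minimal sketch of how a max-ratio cap could gate new decommissions; the config 
semantics and helper below are assumptions for illustration:

{code:scala}
object DecommissionRatioSketch {
  // Only allow another decommission if the fraction of executors already
  // decommissioning stays under a configured ceiling, e.g. 0.1 (10%).
  def canDecommission(
      decommissioningCount: Int,
      totalExecutors: Int,
      maxRatio: Double): Boolean = {
    if (totalExecutors <= 0) false
    else (decommissioningCount + 1).toDouble / totalExecutors <= maxRatio
  }
}

// Example: with 100 executors and maxRatio = 0.1, at most 10 executors would be
// decommissioning (and migrating shuffle/RDD blocks) at the same time.
{code}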



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43391) Idle connection should not be closed when closeIdleConnection is disabled

2023-05-05 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-43391:
-
Description: Spark will close an idle connection when there are outstanding 
requests but no traffic for at least {{requestTimeoutMs}}, even when 
closeIdleConnection is disabled.  (was: Spark will close an idle connection when 
there are pending responses and closeIdleConnection is disabled.)

> Idle connection should not be closed when closeIdleConnection is disabled
> -
>
> Key: SPARK-43391
> URL: https://issues.apache.org/jira/browse/SPARK-43391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Priority: Major
>
> Spark will close an idle connection when there are outstanding requests but no 
> traffic for at least {{requestTimeoutMs}}, even when closeIdleConnection is 
> disabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43391) Idle connection should not be closed when closeIdleConnection is disabled

2023-05-05 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-43391:


 Summary: Idle connection should not be closed when 
closeIdleConnection is disabled
 Key: SPARK-43391
 URL: https://issues.apache.org/jira/browse/SPARK-43391
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu


Spark will close an idle connection when there are pending responses and 
closeIdleConnection is disabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43224) Executor should not be removed when decommissioned in standalone

2023-04-22 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-43224:
-
Description: Currently, the executor info in the standalone master is removed as 
soon as the executor is decommissioned. It should only be removed after the 
executor has shut down.

> Executor should not be removed when decommissioned in standalone
> 
>
> Key: SPARK-43224
> URL: https://issues.apache.org/jira/browse/SPARK-43224
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> Currently, the executor info in the standalone master is removed as soon as 
> the executor is decommissioned. It should only be removed after the executor 
> has shut down.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43238) Support only decommission idle workers in standalone

2023-04-22 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-43238:


 Summary: Support only decommission idle workers in standalone
 Key: SPARK-43238
 URL: https://issues.apache.org/jira/browse/SPARK-43238
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu


Currently, the standalone master web UI supports killing/decommissioning 
workers. But when graceful decommission is enabled, waiting for running tasks 
and for shuffle and RDD migration can take a long time. While waiting, 
decommissioned workers can't launch new executors and decommissioned executors 
can't run new tasks, which wastes a lot of resources.

If only idle workers are decommissioned, these workers can be shut down and 
removed to save cost without waiting for the long decommissioning process.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43237) Handle null exception message in event log

2023-04-22 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-43237:


 Summary: Handle null exception message in event log
 Key: SPARK-43237
 URL: https://issues.apache.org/jira/browse/SPARK-43237
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43224) Executor should not be removed when decommissioned in standalone

2023-04-20 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-43224:


 Summary: Executor should not be removed when decommissioned in 
standalone
 Key: SPARK-43224
 URL: https://issues.apache.org/jira/browse/SPARK-43224
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43086) Support bin pack task scheduling on executors

2023-04-10 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-43086:


 Summary: Support bin pack task scheduling on executors 
 Key: SPARK-43086
 URL: https://issues.apache.org/jira/browse/SPARK-43086
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.2
Reporter: Zhongwei Zhu


Dynamic allocation only removes or decommissions idle executors. The default 
task scheduling uses round robin to assign tasks to executors.

For example, suppose we have 4 tasks to run and 4 executors (each with 4 CPU 
cores). The default task scheduling will assign 1 task per executor. With bin 
packing, one executor could be assigned all 4 tasks, and dynamic allocation 
could then remove the other 3 executors to reduce resource waste.
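
A toy sketch contrasting the two placement strategies on that example; this is 
not TaskSchedulerImpl code, it only illustrates the packing idea:

{code:scala}
object BinPackSketch {
  // Each offer is (executorId, freeCores); each task needs 1 core.
  final case class Offer(executorId: String, var freeCores: Int)

  def assign(offers: Seq[Offer], numTasks: Int, binPack: Boolean): Seq[String] = {
    (1 to numTasks).flatMap { _ =>
      val candidates = offers.filter(_.freeCores > 0)
      // Spreading (approximating the default round robin) picks the emptiest
      // executor; bin packing keeps filling the most-loaded one that still has
      // room, leaving the others idle so dynamic allocation can release them.
      val chosen =
        if (binPack) candidates.sortBy(_.freeCores).headOption
        else candidates.sortBy(-_.freeCores).headOption
      chosen.map { o => o.freeCores -= 1; o.executorId }
    }
  }

  def main(args: Array[String]): Unit = {
    println(assign(Seq("e1", "e2", "e3", "e4").map(Offer(_, 4)), 4, binPack = false))
    // Vector(e1, e2, e3, e4): one task per executor, none can be removed.
    println(assign(Seq("e1", "e2", "e3", "e4").map(Offer(_, 4)), 4, binPack = true))
    // Vector(e1, e1, e1, e1): e2, e3, e4 stay idle and can be released.
  }
}
{code}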



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43052) Handle stacktrace with null file name in event log

2023-04-06 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-43052:


 Summary: Handle stacktrace with null file name in event log
 Key: SPARK-43052
 URL: https://issues.apache.org/jira/browse/SPARK-43052
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.2
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43037) Support fetch migrated shuffle with multiple reducers

2023-04-05 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-43037:


 Summary: Support fetch migrated shuffle with multiple reducers
 Key: SPARK-43037
 URL: https://issues.apache.org/jira/browse/SPARK-43037
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.3.2
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42925) Check executor alive from driver before fetch blocks

2023-03-26 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-42925:


 Summary: Check executor alive from driver before fetch blocks
 Key: SPARK-42925
 URL: https://issues.apache.org/jira/browse/SPARK-42925
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.3.2
Reporter: Zhongwei Zhu


This can fail fast instead of waiting for spark.shuffle.io.connectionTimeout 
(default 2 min) when the executor is dead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41956) Refetch shuffle blocks when executor is decommissioned

2023-02-27 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-41956:
-
Summary: Refetch shuffle blocks when executor is decommissioned  (was: 
Shuffle output location refetch in ShuffleBlockFetcherIterator)

> Refetch shuffle blocks when executor is decommissioned
> --
>
> Key: SPARK-41956
> URL: https://issues.apache.org/jira/browse/SPARK-41956
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Zhongwei Zhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41955) Support fetch latest map output from executor

2023-02-23 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-41955:
-
Summary:  Support fetch latest map output from executor  (was:  Support 
fetch latest map output from worker)

>  Support fetch latest map output from executor
> --
>
> Key: SPARK-41955
> URL: https://issues.apache.org/jira/browse/SPARK-41955
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Zhongwei Zhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42104) Throw ExecutorDeadException in fetchBlocks when executor dead

2023-01-17 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-42104:


 Summary: Throw ExecutorDeadException in fetchBlocks when executor 
dead
 Key: SPARK-42104
 URL: https://issues.apache.org/jira/browse/SPARK-42104
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.1
Reporter: Zhongwei Zhu


When fetchBlocks fails due to an IOException and the executor is dead, an 
ExecutorDeadException will be thrown.

There are other cases where a dead executor causes a TimeoutException or other 
exceptions instead, for example:
{code:java}
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: 
Waited 3 milliseconds (plus 143334 nanoseconds delay) for 
SettableFuture@624de392[status=PENDING]
    at org.sparkproject.guava.base.Throwables.propagate(Throwables.java:243)
    at 
org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:293)
    at 
org.apache.spark.network.crypto.AuthClientBootstrap.doSparkAuth(AuthClientBootstrap.java:113)
    at 
org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:80)
    at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:300)
    at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
    at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:126)
    at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:154)
    at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.lambda$initiateRetry$0(RetryingBlockTransferor.java:184)
    at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41956) Shuffle output location refetch in ShuffleBlockFetcherIterator

2023-01-09 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-41956:


 Summary: Shuffle output location refetch in 
ShuffleBlockFetcherIterator
 Key: SPARK-41956
 URL: https://issues.apache.org/jira/browse/SPARK-41956
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.3.1
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41955) Support fetch latest map output from worker

2023-01-09 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-41955:


 Summary:  Support fetch latest map output from worker
 Key: SPARK-41955
 URL: https://issues.apache.org/jira/browse/SPARK-41955
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.3.1
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41954) Add isDecommissioned in ExecutorDeadException

2023-01-09 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-41954:


 Summary: Add isDecommissioned in ExecutorDeadException
 Key: SPARK-41954
 URL: https://issues.apache.org/jira/browse/SPARK-41954
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.3.1
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission

2023-01-09 Thread Zhongwei Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656365#comment-17656365
 ] 

Zhongwei Zhu commented on SPARK-41953:
--

[~dongjoon] [~mridulm80] [~Ngone51] Any comments for this?

> Shuffle output location refetch during shuffle migration in decommission
> 
>
> Key: SPARK-41953
> URL: https://issues.apache.org/jira/browse/SPARK-41953
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Zhongwei Zhu
>Priority: Major
>
> When shuffle migration is enabled during Spark decommission, shuffle data will 
> be migrated to live executors and the latest locations are then updated in 
> MapOutputTracker. This has some issues:
>  # Executors only fetch map output locations at the beginning of the reduce 
> stage, so any shuffle output location change in the middle of the reduce will 
> cause FetchFailed as reducers fetch from the old location. Even though stage 
> retries can recover from this, they still waste lots of resources, as all the 
> shuffle read and compute done for the partition before the FetchFailed is 
> thrown away.
>  # During stage retries, fewer running tasks cause more executors to be 
> decommissioned, and shuffle data locations keep changing. In the worst case, a 
> stage could need lots of retries, further breaking SLA.
> So I propose to support refetching map output locations during the reduce 
> phase if shuffle migration is enabled and the FetchFailed is caused by a 
> decommissioned dead executor. The detailed steps are as below:
>  # When `BlockTransferService` fails to fetch blocks from a decommissioned 
> dead executor, ExecutorDeadException (with isDecommission set to true) will be 
> thrown.
>  # Make MapOutputTracker support fetching the latest output without an epoch 
> provided.
>  # `ShuffleBlockFetcherIterator` will refetch the latest output from 
> MapOutputTrackerMaster. For all the shuffle blocks on this decommissioned 
> executor, there should be a new location on another executor. If not, throw an 
> exception as is done currently. If yes, create new local and remote requests 
> to fetch these migrated shuffle blocks. The flow will be similar to the 
> fallback fetch when a push-merged fetch fails.
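
A self-contained toy sketch of step 3 of the proposal, assuming steps 1-2 have 
already produced a lookup into the refreshed map output; none of the types or 
names below are existing Spark API:

{code:scala}
object DecommissionRefetchSketch {
  // Illustrative stand-ins, not Spark classes.
  final case class BlockLocation(executorId: String, host: String, port: Int)
  final case class ShuffleBlock(shuffleId: Int, mapId: Long, reduceId: Int)

  /** Given the blocks that lived on a decommissioned, now-dead executor and a
   *  lookup into the freshly refetched map output, return the new fetch plan
   *  grouped by migrated location, or None if any block has no new location
   *  (in which case the failure is surfaced as it is today). */
  def replanFetches(
      lostBlocks: Seq[ShuffleBlock],
      deadExecutorId: String,
      latestLocation: ShuffleBlock => Option[BlockLocation])
    : Option[Map[BlockLocation, Seq[ShuffleBlock]]] = {
    val relocated = lostBlocks.map(b => b -> latestLocation(b))
    val unrecovered = relocated.exists { case (_, loc) =>
      loc.isEmpty || loc.exists(_.executorId == deadExecutorId)
    }
    if (unrecovered) None
    else Some(relocated.collect { case (b, Some(loc)) => loc -> b }
      .groupBy(_._1)
      .map { case (loc, pairs) => loc -> pairs.map(_._2) })
  }
}
{code}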



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission

2023-01-09 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-41953:
-
Description: 
When shuffle migration is enabled during Spark decommission, shuffle data will 
be migrated to live executors and the latest locations are then updated in 
MapOutputTracker. This has some issues:
 # Executors only fetch map output locations at the beginning of the reduce 
stage, so any shuffle output location change in the middle of the reduce will 
cause FetchFailed as reducers fetch from the old location. Even though stage 
retries can recover from this, they still waste lots of resources, as all the 
shuffle read and compute done for the partition before the FetchFailed is 
thrown away.
 # During stage retries, fewer running tasks cause more executors to be 
decommissioned, and shuffle data locations keep changing. In the worst case, a 
stage could need lots of retries, further breaking SLA.

So I propose to support refetching map output locations during the reduce phase 
if shuffle migration is enabled and the FetchFailed is caused by a 
decommissioned dead executor. The detailed steps are as below:
 # When `BlockTransferService` fails to fetch blocks from a decommissioned dead 
executor, ExecutorDeadException (with isDecommission set to true) will be 
thrown.
 # Make MapOutputTracker support fetching the latest output without an epoch 
provided.
 # `ShuffleBlockFetcherIterator` will refetch the latest output from 
MapOutputTrackerMaster. For all the shuffle blocks on this decommissioned 
executor, there should be a new location on another executor. If not, throw an 
exception as is done currently. If yes, create new local and remote requests to 
fetch these migrated shuffle blocks. The flow will be similar to the fallback 
fetch when a push-merged fetch fails.

  was:
When shuffle migration is enabled during Spark decommission, shuffle data will 
be migrated to live executors and the latest locations are then updated in 
MapOutputTracker. This has some issues:
 # Executors only fetch map output locations at the beginning of the reduce 
stage, so any shuffle output location change in the middle of the reduce will 
cause FetchFailed as reducers fetch from the old location. Even though stage 
retries can recover from this, they still waste lots of resources, as all the 
shuffle read and compute done for the partition before the FetchFailed is 
thrown away.
 # During stage retries, fewer running tasks cause more executors to be 
decommissioned, and shuffle data locations keep changing. In the worst case, a 
stage could need lots of retries, further breaking SLA.

So I propose to support refetching map output locations during the reduce phase 
if shuffle migration is enabled and the FetchFailed is caused by a 
decommissioned dead executor. The detailed steps as


> Shuffle output location refetch during shuffle migration in decommission
> 
>
> Key: SPARK-41953
> URL: https://issues.apache.org/jira/browse/SPARK-41953
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Zhongwei Zhu
>Priority: Major
>
> When shuffle migration is enabled during Spark decommission, shuffle data will 
> be migrated to live executors and the latest locations are then updated in 
> MapOutputTracker. This has some issues:
>  # Executors only fetch map output locations at the beginning of the reduce 
> stage, so any shuffle output location change in the middle of the reduce will 
> cause FetchFailed as reducers fetch from the old location. Even though stage 
> retries can recover from this, they still waste lots of resources, as all the 
> shuffle read and compute done for the partition before the FetchFailed is 
> thrown away.
>  # During stage retries, fewer running tasks cause more executors to be 
> decommissioned, and shuffle data locations keep changing. In the worst case, a 
> stage could need lots of retries, further breaking SLA.
> So I propose to support refetching map output locations during the reduce 
> phase if shuffle migration is enabled and the FetchFailed is caused by a 
> decommissioned dead executor. The detailed steps are as below:
>  # When `BlockTransferService` fails to fetch blocks from a decommissioned 
> dead executor, ExecutorDeadException (with isDecommission set to true) will be 
> thrown.
>  # Make MapOutputTracker support fetching the latest output without an epoch 
> provided.
>  # `ShuffleBlockFetcherIterator` will refetch the latest output from 
> MapOutputTrackerMaster. For all the shuffle blocks on this decommissioned 
> executor, there should be a new location on another executor. If not, throw an 
> exception as is done currently. If yes, create new local and remote requests 
> to fetch these migrated shuffle blocks. The flow will be similar to the 
> fallback fetch when a push-merged fetch fails.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: 

[jira] [Created] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission

2023-01-09 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-41953:


 Summary: Shuffle output location refetch during shuffle migration 
in decommission
 Key: SPARK-41953
 URL: https://issues.apache.org/jira/browse/SPARK-41953
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.1
Reporter: Zhongwei Zhu


When shuffle migration is enabled during Spark decommission, shuffle data will 
be migrated to live executors and the latest locations are then updated in 
MapOutputTracker. This has some issues:
 # Executors only fetch map output locations at the beginning of the reduce 
stage, so any shuffle output location change in the middle of the reduce will 
cause FetchFailed as reducers fetch from the old location. Even though stage 
retries can recover from this, they still waste lots of resources, as all the 
shuffle read and compute done for the partition before the FetchFailed is 
thrown away.
 # During stage retries, fewer running tasks cause more executors to be 
decommissioned, and shuffle data locations keep changing. In the worst case, a 
stage could need lots of retries, further breaking SLA.

So I propose to support refetching map output locations during the reduce phase 
if shuffle migration is enabled and the FetchFailed is caused by a 
decommissioned dead executor. The detailed steps as



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41766) Handle decommission request sent before executor registration

2022-12-28 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-41766:
-
Summary: Handle decommission request sent before executor registration  
(was: Handle decommission request for unregistered executor)

> Handle decommission request sent before executor registration
> -
>
> Key: SPARK-41766
> URL: https://issues.apache.org/jira/browse/SPARK-41766
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Zhongwei Zhu
>Priority: Major
>
> When decommissioning an unregistered executor, the request will be ignored. 
> It should be sent to the executor after registration.
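
A minimal sketch of one way to handle it, assuming a small pending set kept on 
the driver side; the names are illustrative, not CoarseGrainedSchedulerBackend 
internals:

{code:scala}
import scala.collection.mutable

// Sketch: remember decommission requests that arrive before the executor has
// registered, and replay them once registration completes.
class PendingDecommissionSketch(sendDecommission: String => Unit) {
  private val pending = mutable.Set[String]()

  def onDecommissionRequest(executorId: String, isRegistered: Boolean): Unit = {
    if (isRegistered) sendDecommission(executorId)
    else pending += executorId            // today this request would be dropped
  }

  def onExecutorRegistered(executorId: String): Unit = {
    if (pending.remove(executorId)) sendDecommission(executorId)
  }
}
{code}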



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41766) Handle decommission request for unregistered executor

2022-12-28 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-41766:


 Summary: Handle decommission request for unregistered executor
 Key: SPARK-41766
 URL: https://issues.apache.org/jira/browse/SPARK-41766
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.1
Reporter: Zhongwei Zhu


When decommissioning an unregistered executor, the request will be ignored. It 
should be sent to the executor after registration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41341) Wait shuffle fetch to finish when decommission executor

2022-11-30 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-41341:


 Summary: Wait shuffle fetch to finish when decommission executor
 Key: SPARK-41341
 URL: https://issues.apache.org/jira/browse/SPARK-41341
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Affects Versions: 3.3.1
Reporter: Zhongwei Zhu


When decommissioning an executor, we should wait for in-flight shuffle fetches 
to finish; otherwise lots of FetchFailed errors will happen.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41153) Log migrated shuffle data size and migration time

2022-11-15 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-41153:


 Summary: Log migrated shuffle data size and migration time
 Key: SPARK-41153
 URL: https://issues.apache.org/jira/browse/SPARK-41153
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.1
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40979) Keep removed executor info in decommission state

2022-10-31 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-40979:
-
Description: 
Executors removed due to decommission should be kept in a separate set. To avoid 
OOM, the set size will be limited to 1K or 10K entries.

FetchFailed caused by a decommissioned executor can be divided into 2 categories:
 # When the FetchFailed reaches the DAGScheduler, the executor is still alive, 
or it is lost but the loss info hasn't reached TaskSchedulerImpl yet. This is 
already handled in SPARK-40979.
 # The FetchFailed is caused by the loss of the decommissioned executor, so the 
decommission info has already been removed from TaskSchedulerImpl. Keeping such 
info for a short period is good enough. Even if we limit the set of removed 
executors to 10K entries, that is at most about 10MB of memory. In real cases 
it's rare to have a cluster of more than 10K executors, and the chance that all 
of them are decommissioned and lost at the same time is small.

> Keep removed executor info in decommission state
> 
>
> Key: SPARK-40979
> URL: https://issues.apache.org/jira/browse/SPARK-40979
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Priority: Major
>
> Executors removed due to decommission should be kept in a separate set. To 
> avoid OOM, the set size will be limited to 1K or 10K entries.
> FetchFailed caused by a decommissioned executor falls into 2 categories:
>  # The FetchFailed reaches the DAGScheduler while the executor is still alive, or 
> the executor is lost but the loss info hasn't reached TaskSchedulerImpl yet. This 
> is already handled in SPARK-40979
>  # The FetchFailed is caused by the loss of the decommissioned executor, so the 
> decom info has already been removed from TaskSchedulerImpl. Keeping such info for 
> a short period is good enough. Even if we limit the set of removed executors to 
> 10K entries, that is at most about 10MB of memory. In practice it's rare to have 
> a cluster of over 10K executors, and the chance that all of them are 
> decommissioned and lost at the same time is small.
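
A minimal sketch of the bounded-set idea described above, with illustrative names (not the actual implementation): keep an insertion-ordered set of decommissioned-then-lost executor ids and evict the oldest entries beyond a cap.

{code:scala}
import scala.collection.mutable

// Illustrative sketch: a size-bounded, insertion-ordered set of executor ids
// that were lost after being decommissioned. Oldest ids are evicted first, so
// memory stays capped (e.g. at 10K entries).
class BoundedDecomExecutorSet(maxSize: Int = 10000) {
  private val ids = mutable.LinkedHashSet[String]()

  def add(executorId: String): Unit = synchronized {
    ids += executorId
    while (ids.size > maxSize) {
      ids -= ids.head  // drop the oldest entry
    }
  }

  def wasDecommissioned(executorId: String): Boolean = synchronized {
    ids.contains(executorId)
  }
}
{code}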



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40781) Explain exit code 137 as killed due to OOM

2022-10-12 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-40781:


 Summary: Explain exit code 137 as killed due to OOM
 Key: SPARK-40781
 URL: https://issues.apache.org/jira/browse/SPARK-40781
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Zhongwei Zhu


Explain exit code 137 as the process being killed due to OOM, so users don't have to 
search for the meaning of the exit code themselves.
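
A hedged sketch of the mapping (the exact wording and placement are assumptions, not the actual patch):

{code:scala}
// Illustrative sketch: translate a container exit code into a human-readable
// reason. 137 = 128 + SIGKILL(9), which usually means the process was killed
// by the OS or container runtime, most often for running out of memory.
def explainExitCode(exitCode: Int): String = exitCode match {
  case 137 =>
    "Executor killed by SIGKILL (exit code 137), likely OOM-killed; " +
      "consider increasing memory or memory overhead."
  case sig if sig > 128 => s"Executor killed by signal ${sig - 128}"
  case other => s"Executor exited with code $other"
}
{code}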



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40778) Make HeartbeatReceiver as an IsolatedRpcEndpoint

2022-10-12 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-40778:


 Summary: Make HeartbeatReceiver as an IsolatedRpcEndpoint
 Key: SPARK-40778
 URL: https://issues.apache.org/jira/browse/SPARK-40778
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Zhongwei Zhu


All RpcEndpoints in the driver, including HeartbeatReceiver, share one thread 
pool. When lots of RPC messages are queued, the time a heartbeat waits to be 
processed can easily exceed the heartbeat timeout. Making HeartbeatReceiver extend 
IsolatedRpcEndpoint gives it a dedicated single thread to process heartbeats.
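
A self-contained illustration of the idea (these are not Spark's real classes): give one endpoint its own single-threaded inbox so its messages never wait behind the shared dispatcher pool.

{code:scala}
import java.util.concurrent.{Executors, LinkedBlockingQueue}

// Illustrative sketch of a dedicated single-threaded inbox; heartbeat-style
// messages posted here never queue behind other endpoints' traffic.
class DedicatedInbox[T](handler: T => Unit) {
  private val queue = new LinkedBlockingQueue[T]()
  private val worker = Executors.newSingleThreadExecutor()

  worker.submit(new Runnable {
    override def run(): Unit = {
      try {
        while (true) {
          handler(queue.take())  // blocks until the next message arrives
        }
      } catch {
        case _: InterruptedException => // stop() interrupts the blocking take()
      }
    }
  })

  def post(message: T): Unit = queue.put(message)
  def stop(): Unit = worker.shutdownNow()
}
{code}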



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner

2022-10-02 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-40636:
-
Description: 
 BlockManagerDecommissioner should log the correct number of remaining shuffles.
{code:java}
4 of 24 local shuffles are added. In total, 24 shuffles are remained.
2022-09-30 17:42:15.035 PDT
0 of 24 local shuffles are added. In total, 24 shuffles are remained.
2022-09-30 17:42:45.069 PDT
0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code}
 

  was:
 BlockManagerDecommissioner should log correct remained shuffles

 
{code:java}
4 of 24 local shuffles are added. In total, 24 shuffles are remained.
2022-09-30 17:42:15.035 PDT
0 of 24 local shuffles are added. In total, 24 shuffles are remained.
2022-09-30 17:42:45.069 PDT
0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code}
 


> Fix wrong remained shuffles log in BlockManagerDecommissioner
> -
>
> Key: SPARK-40636
> URL: https://issues.apache.org/jira/browse/SPARK-40636
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
>  BlockManagerDecommissioner should log the correct number of remaining shuffles.
> {code:java}
> 4 of 24 local shuffles are added. In total, 24 shuffles are remained.
> 2022-09-30 17:42:15.035 PDT
> 0 of 24 local shuffles are added. In total, 24 shuffles are remained.
> 2022-09-30 17:42:45.069 PDT
> 0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code}
>  
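
One plausible reading of the fix, sketched with illustrative names only (the actual change may differ): the "remained" count should shrink as shuffles are picked up for migration instead of always printing the total.

{code:scala}
// Illustrative sketch: compute the remaining count from the total minus what
// has already been queued for migration, rather than reporting the total.
def migrationProgressLog(newlyAdded: Int, total: Int, alreadyQueued: Int): String = {
  val remained = total - alreadyQueued - newlyAdded
  s"$newlyAdded of $total local shuffles are added. " +
    s"In total, $remained shuffles are remained."
}
{code}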



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner

2022-10-02 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-40636:
-
Description: 
 BlockManagerDecommissioner should log correct remained shuffles

 

```

4 of 24 local shuffles are added. In total, 24 shuffles are remained.
2022-09-30 17:42:15.035 PDT
0 of 24 local shuffles are added. In total, 24 shuffles are remained.
2022-09-30 17:42:45.069 PDT
0 of 24 local shuffles are added. In total, 24 shuffles are remained.

```

  was:
 

 

```

4 of 24 local shuffles are added. In total, 24 shuffles are remained.
2022-09-30 17:42:15.035 PDT
0 of 24 local shuffles are added. In total, 24 shuffles are remained.
2022-09-30 17:42:45.069 PDT
0 of 24 local shuffles are added. In total, 24 shuffles are remained.

```


> Fix wrong remained shuffles log in BlockManagerDecommissioner
> -
>
> Key: SPARK-40636
> URL: https://issues.apache.org/jira/browse/SPARK-40636
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
>  BlockManagerDecommissioner should log correct remained shuffles
>  
> ```
> 4 of 24 local shuffles are added. In total, 24 shuffles are remained.
> 2022-09-30 17:42:15.035 PDT
> 0 of 24 local shuffles are added. In total, 24 shuffles are remained.
> 2022-09-30 17:42:45.069 PDT
> 0 of 24 local shuffles are added. In total, 24 shuffles are remained.
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner

2022-10-02 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-40636:
-
Description: 
 BlockManagerDecommissioner should log correct remained shuffles

 
{code:java}
4 of 24 local shuffles are added. In total, 24 shuffles are remained.
2022-09-30 17:42:15.035 PDT
0 of 24 local shuffles are added. In total, 24 shuffles are remained.
2022-09-30 17:42:45.069 PDT
0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code}
 

  was:
 BlockManagerDecommissioner should log correct remained shuffles

 

```

4 of 24 local shuffles are added. In total, 24 shuffles are remained.
2022-09-30 17:42:15.035 PDT
0 of 24 local shuffles are added. In total, 24 shuffles are remained.
2022-09-30 17:42:45.069 PDT
0 of 24 local shuffles are added. In total, 24 shuffles are remained.

```


> Fix wrong remained shuffles log in BlockManagerDecommissioner
> -
>
> Key: SPARK-40636
> URL: https://issues.apache.org/jira/browse/SPARK-40636
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
>  BlockManagerDecommissioner should log correct remained shuffles
>  
> {code:java}
> 4 of 24 local shuffles are added. In total, 24 shuffles are remained.
> 2022-09-30 17:42:15.035 PDT
> 0 of 24 local shuffles are added. In total, 24 shuffles are remained.
> 2022-09-30 17:42:45.069 PDT
> 0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner

2022-10-02 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-40636:


 Summary: Fix wrong remained shuffles log in 
BlockManagerDecommissioner
 Key: SPARK-40636
 URL: https://issues.apache.org/jira/browse/SPARK-40636
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 3.3.0
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner

2022-10-02 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-40636:
-
Description: 
 

 

```

4 of 24 local shuffles are added. In total, 24 shuffles are remained.
2022-09-30 17:42:15.035 PDT
0 of 24 local shuffles are added. In total, 24 shuffles are remained.
2022-09-30 17:42:45.069 PDT
0 of 24 local shuffles are added. In total, 24 shuffles are remained.

```

> Fix wrong remained shuffles log in BlockManagerDecommissioner
> -
>
> Key: SPARK-40636
> URL: https://issues.apache.org/jira/browse/SPARK-40636
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
>  
>  
> ```
> 4 of 24 local shuffles are added. In total, 24 shuffles are remained.
> 2022-09-30 17:42:15.035 PDT
> 0 of 24 local shuffles are added. In total, 24 shuffles are remained.
> 2022-09-30 17:42:45.069 PDT
> 0 of 24 local shuffles are added. In total, 24 shuffles are remained.
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40481) Ignore stage fetch failure caused by decommissioned executor

2022-09-18 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-40481:


 Summary: Ignore stage fetch failure caused by decommissioned 
executor
 Key: SPARK-40481
 URL: https://issues.apache.org/jira/browse/SPARK-40481
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Zhongwei Zhu


When executor decommission is enabled, there can be many stage failures caused 
by FetchFailed from decommissioned executors, which can in turn fail the whole 
job. It would be better not to count such failures against 
`spark.stage.maxConsecutiveAttempts`.
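
A hedged sketch of the counting rule, with illustrative names (not the actual DAGScheduler code): only charge a stage attempt against the limit when the FetchFailed did not come from a decommissioned executor.

{code:scala}
// Illustrative sketch: decide whether a FetchFailed-driven stage failure
// should count toward spark.stage.maxConsecutiveAttempts.
def shouldCountStageFailure(
    failedExecutorId: String,
    decommissionedExecutors: Set[String]): Boolean = {
  !decommissionedExecutors.contains(failedExecutorId)
}

// Returns the updated consecutive-failure count and whether the stage
// (and hence the job) should be aborted.
def updateStageFailure(
    consecutiveFailures: Int,
    maxConsecutiveAttempts: Int,
    countThisFailure: Boolean): (Int, Boolean) = {
  val updated = if (countThisFailure) consecutiveFailures + 1 else consecutiveFailures
  (updated, updated >= maxConsecutiveAttempts)
}
{code}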



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40381) Support standalone worker recommission

2022-09-07 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-40381:


 Summary: Support standalone worker recommission
 Key: SPARK-40381
 URL: https://issues.apache.org/jira/browse/SPARK-40381
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 3.3.0
Reporter: Zhongwei Zhu


Currently, Spark standalone only supports killing workers. We may also want to 
recommission some workers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40269) Randomize the orders of peer in BlockManagerDecommissioner

2022-08-29 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-40269:


 Summary: Randomize the orders of peer in BlockManagerDecommissioner
 Key: SPARK-40269
 URL: https://issues.apache.org/jira/browse/SPARK-40269
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Affects Versions: 3.3.0
Reporter: Zhongwei Zhu


Randomize the order of peers in BlockManagerDecommissioner to avoid migrating 
data to the same set of nodes.
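
A minimal sketch of the randomization (illustrative names only):

{code:scala}
import scala.util.Random

// Illustrative sketch: shuffle the candidate peers before each migration round
// so blocks are not always pushed to the same few block managers.
def pickMigrationTargets[P](peers: Seq[P], maxTargets: Int): Seq[P] = {
  Random.shuffle(peers).take(maxTargets)
}
{code}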



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40267) Add description for ExecutorAllocationManager metrics

2022-08-29 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-40267:


 Summary: Add description for ExecutorAllocationManager metrics
 Key: SPARK-40267
 URL: https://issues.apache.org/jira/browse/SPARK-40267
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.3.0
Reporter: Zhongwei Zhu


For some ExecutorAllocationManager metrics, it is hard to tell what they stand for 
just from the metric name.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40168) Handle FileNotFoundException when shuffle file deleted in decommissioner

2022-08-21 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-40168:
-
Description: 
When shuffle files are not found, the decommissioner handles IOException, but the 
real exception is as below:
{code:java}
22/08/10 18:05:34 ERROR BlockManagerDecommissioner: Error occurred during 
migrating migrate_shuffle_1_356
org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at 
org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:122)
    at 
org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$4(BlockManagerDecommissioner.scala:120)
    at 
org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$4$adapted(BlockManagerDecommissioner.scala:111)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at 
org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:111)
    at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Failed to send RPC RPC 5697756267528635203 to 
/10.240.2.65:43481: java.io.FileNotFoundException: 
/tmp/blockmgr-98a2a29a-5231-4fed-a82e-6bc0531ad407/15/shuffle_1_356_0.index (No 
such file or directory)
    at 
org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:392)
    at 
org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:369)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
    at 
io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
    at 
io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
    at 
io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
    at 
io.netty.util.internal.PromiseNotificationUtil.tryFailure(PromiseNotificationUtil.java:64)
    at 
io.netty.channel.ChannelOutboundBuffer.safeFail(ChannelOutboundBuffer.java:723)
    at 
io.netty.channel.ChannelOutboundBuffer.remove0(ChannelOutboundBuffer.java:308)
    at 
io.netty.channel.ChannelOutboundBuffer.failFlushed(ChannelOutboundBuffer.java:660)
    at 
io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:735)
    at 
io.netty.channel.AbstractChannel$AbstractUnsafe.handleWriteError(AbstractChannel.java:950)
    at 
io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:933)
    at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:354)
    at 
io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:895)
    at 
io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1372)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:742)
    at 
io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:728)
    at 
io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:127)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:765)
    at 
io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
    at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
    at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    ... 1 more
Caused by: java.io.FileNotFoundException: 

[jira] [Updated] (SPARK-40168) Handle FileNotFoundException when shuffle file deleted in decommissioner

2022-08-21 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-40168:
-
Description: 

When shuffle files are not found, the decommissioner handles IOException, but the 
real exception is as below:

```

22/08/10 18:05:34 ERROR BlockManagerDecommissioner: Error occurred during 
migrating migrate_shuffle_1_356
org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at 
org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:122)
    at 
org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$4(BlockManagerDecommissioner.scala:120)
    at 
org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$4$adapted(BlockManagerDecommissioner.scala:111)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at 
org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:111)
    at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Failed to send RPC RPC 5697756267528635203 to 
/10.240.2.65:43481: java.io.FileNotFoundException: 
/tmp/blockmgr-98a2a29a-5231-4fed-a82e-6bc0531ad407/15/shuffle_1_356_0.index (No 
such file or directory)
    at 
org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:392)
    at 
org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:369)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
    at 
io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
    at 
io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
    at 
io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
    at 
io.netty.util.internal.PromiseNotificationUtil.tryFailure(PromiseNotificationUtil.java:64)
    at 
io.netty.channel.ChannelOutboundBuffer.safeFail(ChannelOutboundBuffer.java:723)
    at 
io.netty.channel.ChannelOutboundBuffer.remove0(ChannelOutboundBuffer.java:308)
    at 
io.netty.channel.ChannelOutboundBuffer.failFlushed(ChannelOutboundBuffer.java:660)
    at 
io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:735)
    at 
io.netty.channel.AbstractChannel$AbstractUnsafe.handleWriteError(AbstractChannel.java:950)
    at 
io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:933)
    at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:354)
    at 
io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:895)
    at 
io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1372)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:742)
    at 
io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:728)
    at 
io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:127)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:765)
    at 
io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
    at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
    at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    ... 

[jira] [Created] (SPARK-40168) Handle FileNotFoundException when shuffle file deleted in decommissioner

2022-08-21 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-40168:


 Summary: Handle FileNotFoundException when shuffle file deleted in 
decommissioner
 Key: SPARK-40168
 URL: https://issues.apache.org/jira/browse/SPARK-40168
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Zhongwei Zhu


When shuffle files are not found, the decommissioner handles IOException, but the 
real exception is as below:

```

22/08/10 18:05:34 ERROR BlockManagerDecommissioner: Error occurred during 
migrating migrate_shuffle_1_356
org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at 
org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:122)
    at 
org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$4(BlockManagerDecommissioner.scala:120)
    at 
org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$4$adapted(BlockManagerDecommissioner.scala:111)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at 
org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:111)
    at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Failed to send RPC RPC 5697756267528635203 to 
/10.240.2.65:43481: java.io.FileNotFoundException: 
/tmp/blockmgr-98a2a29a-5231-4fed-a82e-6bc0531ad407/15/shuffle_1_356_0.index (No 
such file or directory)
    at 
org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:392)
    at 
org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:369)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
    at 
io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
    at 
io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
    at 
io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
    at 
io.netty.util.internal.PromiseNotificationUtil.tryFailure(PromiseNotificationUtil.java:64)
    at 
io.netty.channel.ChannelOutboundBuffer.safeFail(ChannelOutboundBuffer.java:723)
    at 
io.netty.channel.ChannelOutboundBuffer.remove0(ChannelOutboundBuffer.java:308)
    at 
io.netty.channel.ChannelOutboundBuffer.failFlushed(ChannelOutboundBuffer.java:660)
    at 
io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:735)
    at 
io.netty.channel.AbstractChannel$AbstractUnsafe.handleWriteError(AbstractChannel.java:950)
    at 
io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:933)
    at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:354)
    at 
io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:895)
    at 
io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1372)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:742)
    at 
io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:728)
    at 
io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:127)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:765)
    at 
io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
    at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
    at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at 

[jira] [Created] (SPARK-40060) Add numberDecommissioningExecutors metric

2022-08-12 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-40060:


 Summary: Add numberDecommissioningExecutors metric
 Key: SPARK-40060
 URL: https://issues.apache.org/jira/browse/SPARK-40060
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Zhongwei Zhu


The number of decommissioning executors should be exposed as a metric.
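
A hedged sketch using the Dropwizard metrics API that Spark's metrics system is built on; the metric name and wiring here are illustrative assumptions, not the final implementation.

{code:scala}
import java.util.concurrent.atomic.AtomicInteger
import com.codahale.metrics.{Gauge, MetricRegistry}

// Illustrative sketch: expose the current number of decommissioning executors
// as a gauge.
class DecommissionMetrics(registry: MetricRegistry) {
  private val decommissioning = new AtomicInteger(0)

  registry.register("numberDecommissioningExecutors", new Gauge[Int] {
    override def getValue: Int = decommissioning.get()
  })

  def executorDecommissionStarted(): Unit = decommissioning.incrementAndGet()
  def executorRemoved(): Unit = decommissioning.decrementAndGet()
}
{code}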



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36893) upgrade mesos into 1.4.3

2021-09-29 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-36893:


 Summary: upgrade mesos into 1.4.3
 Key: SPARK-36893
 URL: https://issues.apache.org/jira/browse/SPARK-36893
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.1.2
Reporter: Zhongwei Zhu


Upgrade mesos to 1.4.3 to fix CVE-2018-11793



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36864) guava version mismatch with hadoop-aws

2021-09-27 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-36864:
-
Summary: guava version mismatch with hadoop-aws  (was: guava version 
mismatch between hadoop-aws and spark)

> guava version mismatch with hadoop-aws
> --
>
> Key: SPARK-36864
> URL: https://issues.apache.org/jira/browse/SPARK-36864
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.2
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> When using hadoop-aws 3.2 with Spark 3.0, the error below occurs. This is caused by 
> a Guava version mismatch: Hadoop uses Guava 27.0-jre while Spark uses 14.0.1.
> Exception in thread "main" java.lang.NoSuchMethodError: 
> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
>  at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:742)
>  at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:712)
>  at 
> org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:559)
>  at 
> org.apache.hadoop.fs.s3a.DefaultS3ClientFactory.createS3Client(DefaultS3ClientFactory.java:52)
>  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:264)
>  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
>  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
>  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
>  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
>  at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1853)
>  at 
> org.apache.spark.deploy.history.EventLogFileWriter.(EventLogFileWriters.scala:60)
>  at 
> org.apache.spark.deploy.history.SingleEventLogFileWriter.(EventLogFileWriters.scala:213)
>  at 
> org.apache.spark.deploy.history.EventLogFileWriter$.apply(EventLogFileWriters.scala:181)
>  at 
> org.apache.spark.scheduler.EventLoggingListener.(EventLoggingListener.scala:66)
>  at org.apache.spark.SparkContext.(SparkContext.scala:584)
>  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2588)
>  at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:937)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:931)
>  at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:30)
>  at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>  at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:944)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>  at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>  at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>  at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1023)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1032)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36864) guava version mismatch between hadoop-aws and spark

2021-09-27 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-36864:


 Summary: guava version mismatch between hadoop-aws and spark
 Key: SPARK-36864
 URL: https://issues.apache.org/jira/browse/SPARK-36864
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.1.2
Reporter: Zhongwei Zhu


When using hadoop-aws 3.2 with Spark 3.0, the error below occurs. This is caused by a 
Guava version mismatch: Hadoop uses Guava 27.0-jre while Spark uses 14.0.1.

Exception in thread "main" java.lang.NoSuchMethodError: 
com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
 at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:742)
 at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:712)
 at 
org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:559)
 at 
org.apache.hadoop.fs.s3a.DefaultS3ClientFactory.createS3Client(DefaultS3ClientFactory.java:52)
 at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:264)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
 at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1853)
 at 
org.apache.spark.deploy.history.EventLogFileWriter.(EventLogFileWriters.scala:60)
 at 
org.apache.spark.deploy.history.SingleEventLogFileWriter.(EventLogFileWriters.scala:213)
 at 
org.apache.spark.deploy.history.EventLogFileWriter$.apply(EventLogFileWriters.scala:181)
 at 
org.apache.spark.scheduler.EventLoggingListener.(EventLoggingListener.scala:66)
 at org.apache.spark.SparkContext.(SparkContext.scala:584)
 at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2588)
 at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:937)
 at scala.Option.getOrElse(Option.scala:189)
 at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:931)
 at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:30)
 at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
 at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:944)
 at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
 at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
 at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
 at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1023)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1032)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36793) [K8S] Support write container stdout/stderr to file

2021-09-17 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-36793:


 Summary: [K8S] Support write container stdout/stderr to file 
 Key: SPARK-36793
 URL: https://issues.apache.org/jira/browse/SPARK-36793
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.1.2
Reporter: Zhongwei Zhu


Currently, executor and driver pods only redirect stdout/stderr. If users want 
to use a sidecar logging agent to ship stdout/stderr to external log storage, the 
only way is to change entrypoint.sh, which might break compatibility with the 
community version.

We should support this feature, and it could be enabled by Spark config. The related 
Spark configs are listed below (a usage sketch follows the table):
|Key|Default|Desc|
|spark.kubernetes.logToFile.enabled|false|Whether to write executor/driver 
stdout/stderr to a log file|
|spark.kubernetes.logToFile.path|/var/log/spark|The path to write 
executor/driver stdout/stderr as a log file|
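
A hedged usage sketch for the proposed configs above (these keys do not exist yet; they are the proposal of this issue):

{code:scala}
import org.apache.spark.SparkConf

// Illustrative only: setting the proposed (not yet existing) configs.
val conf = new SparkConf()
  .set("spark.kubernetes.logToFile.enabled", "true")        // proposed flag
  .set("spark.kubernetes.logToFile.path", "/var/log/spark") // proposed log path
{code}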



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32288) [UI] Add failure summary table in stage page

2021-04-20 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-32288:
-
Summary: [UI] Add failure summary table in stage page  (was: [UI] Add 
exception summary table in stage page)

> [UI] Add failure summary table in stage page
> 
>
> Key: SPARK-32288
> URL: https://issues.apache.org/jira/browse/SPARK-32288
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Zhongwei Zhu
>Priority: Major
>
> When there are many task failures during one stage, it's hard to find a failure 
> pattern, such as by aggregating task failures by exception type and message. With 
> such information, we can easily tell which type of failure is the root cause of 
> the stage failure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34777) [UI] StagePage input size/records not show when records greater than zero

2021-04-20 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-34777:
-
Summary: [UI] StagePage input size/records not show when records greater 
than zero  (was: [UI] StagePage input size records not show when records 
greater than zero)

> [UI] StagePage input size/records not show when records greater than zero
> -
>
> Key: SPARK-34777
> URL: https://issues.apache.org/jira/browse/SPARK-34777
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.1
>Reporter: Zhongwei Zhu
>Priority: Minor
> Attachments: No input size records.png
>
>
> !No input size records.png|width=547,height=212!
> The `Input Size / Records` column should show in the summary metrics table and the 
> task columns when input records are greater than zero even if input bytes are zero. 
> One example is a Spark Streaming job reading from Kafka.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34777) [UI] StagePage input size records not show when records greater than zero

2021-03-17 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-34777:
-
Description: 
!No input size records.png|width=547,height=212!

The `Input Size / Records` should show in summary metrics table and task 
columns, as input records greater than zero and bytes is zero. One example is 
spark streaming job read from kafka

  was:
!No input size records.png|width=547,height=212!

The `Input Size / Records` should show in summary metrics table and task 
columns, as input records greater than zero


> [UI] StagePage input size records not show when records greater than zero
> -
>
> Key: SPARK-34777
> URL: https://issues.apache.org/jira/browse/SPARK-34777
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.1
>Reporter: Zhongwei Zhu
>Priority: Minor
> Attachments: No input size records.png
>
>
> !No input size records.png|width=547,height=212!
> The `Input Size / Records` should show in summary metrics table and task 
> columns, as input records greater than zero and bytes is zero. One example is 
> spark streaming job read from kafka



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34777) [UI] StagePage input size records not show when records greater than zero

2021-03-17 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-34777:
-
Description: 
!No input size records.png|width=547,height=212!

The `Input Size / Records` should show in summary metrics table and task 
columns, as input records greater than zero

  was:
!image-2021-03-17-09-46-58-653.png|width=514,height=171!

The `Input Size / Records` should show in summary metrics table and task 
columns, as input records greater than zero


> [UI] StagePage input size records not show when records greater than zero
> -
>
> Key: SPARK-34777
> URL: https://issues.apache.org/jira/browse/SPARK-34777
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.1
>Reporter: Zhongwei Zhu
>Priority: Minor
> Attachments: No input size records.png
>
>
> !No input size records.png|width=547,height=212!
> The `Input Size / Records` should show in summary metrics table and task 
> columns, as input records greater than zero



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34777) [UI] StagePage input size records not show when records greater than zero

2021-03-17 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-34777:
-
Attachment: No input size records.png

> [UI] StagePage input size records not show when records greater than zero
> -
>
> Key: SPARK-34777
> URL: https://issues.apache.org/jira/browse/SPARK-34777
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.1
>Reporter: Zhongwei Zhu
>Priority: Minor
> Attachments: No input size records.png
>
>
> !image-2021-03-17-09-46-58-653.png|width=514,height=171!
> The `Input Size / Records` should show in summary metrics table and task 
> columns, as input records greater than zero



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34777) [UI] StagePage input size records not show when records greater than zero

2021-03-17 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-34777:


 Summary: [UI] StagePage input size records not show when records 
greater than zero
 Key: SPARK-34777
 URL: https://issues.apache.org/jira/browse/SPARK-34777
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.1.1
Reporter: Zhongwei Zhu


!image-2021-03-17-09-46-58-653.png|width=514,height=171!

The `Input Size / Records` should show in summary metrics table and task 
columns, as input records greater than zero



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34232) [CORE] redact credentials not working when log slow event enabled

2021-01-25 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-34232:


 Summary: [CORE] redact credentials not working when log slow event 
enabled
 Key: SPARK-34232
 URL: https://issues.apache.org/jira/browse/SPARK-34232
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: Zhongwei Zhu


When the processing time of a SparkListenerEnvironmentUpdate event exceeds 
logSlowEventThreshold, the credentials in the event are logged without being 
redacted.
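
A hedged sketch of the intended behavior (the regex and helper are illustrative, not the actual redaction code): redact sensitive values before the slow-event log line is emitted.

{code:scala}
// Illustrative sketch: mask values whose keys look sensitive before logging.
val redactionPattern = "(?i)secret|password|token|credential".r

def redactKv(kvs: Seq[(String, String)]): Seq[(String, String)] = kvs.map {
  case (k, _) if redactionPattern.findFirstIn(k).isDefined => (k, "*********(redacted)")
  case other => other
}
{code}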



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19450) Replace askWithRetry with askSync.

2020-12-31 Thread Zhongwei Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257091#comment-17257091
 ] 

Zhongwei Zhu edited comment on SPARK-19450 at 12/31/20, 7:58 PM:
-

The old askWithRetry method could use the provided configs `spark.rpc.numRetries` 
and `spark.rpc.retry.wait`. The default value for `spark.rpc.numRetries` is 3, so I 
assume it would retry 3 times if an RPC failed, but askSync is now used without 
honoring those 2 configs. Could we support such retries? If retry was disabled 
intentionally, maybe the doc needs to be updated. [~jinxing6...@126.com] 
[~srowen]


was (Author: warrenzhu25):
For old askWithRetry method, it can use provided config `spark.rpc.numRetries` 
and `spark.rpc.retry.wait`. The default value for  `spark.rpc.numRetries` is 3. 
So I suppose it will retry 3 times if rpc failed, but now askSync is used 
without using above 2 configs. Does that mean no retry anymore? 
[~jinxing6...@126.com] [~srowen]

> Replace askWithRetry with askSync.
> --
>
> Key: SPARK-19450
> URL: https://issues.apache.org/jira/browse/SPARK-19450
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jin Xing
>Assignee: Jin Xing
>Priority: Minor
> Fix For: 2.2.0
>
>
> *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and 
> https://github.com/apache/spark/pull/16690#issuecomment-276850068) and 
> *askWithRetry* is marked as deprecated. 
> As mentioned 
> SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218):
> ??askWithRetry is basically an unneeded API, and a leftover from the akka 
> days that doesn't make sense anymore. It's prone to cause deadlocks (exactly 
> because it's blocking), it imposes restrictions on the caller (e.g. 
> idempotency) and other things that people generally don't pay that much 
> attention to when using it.??
> Since *askWithRetry* is just used inside spark and not in user logic. It 
> might make sense to replace all of them with *askSync*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19450) Replace askWithRetry with askSync.

2020-12-31 Thread Zhongwei Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257091#comment-17257091
 ] 

Zhongwei Zhu commented on SPARK-19450:
--

For old askWithRetry method, it can use provided config `spark.rpc.numRetries` 
and `spark.rpc.retry.wait`. The default value for  `spark.rpc.numRetries` is 3. 
So I suppose it will retry 3 times if rpc failed, but now askSync is used 
without using above 2 configs. Does that mean no retry anymore? 
[~jinxing6...@126.com] [~srowen]

> Replace askWithRetry with askSync.
> --
>
> Key: SPARK-19450
> URL: https://issues.apache.org/jira/browse/SPARK-19450
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jin Xing
>Assignee: Jin Xing
>Priority: Minor
> Fix For: 2.2.0
>
>
> *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and 
> https://github.com/apache/spark/pull/16690#issuecomment-276850068) and 
> *askWithRetry* is marked as deprecated. 
> As mentioned 
> SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218):
> ??askWithRetry is basically an unneeded API, and a leftover from the akka 
> days that doesn't make sense anymore. It's prone to cause deadlocks (exactly 
> because it's blocking), it imposes restrictions on the caller (e.g. 
> idempotency) and other things that people generally don't pay that much 
> attention to when using it.??
> Since *askWithRetry* is just used inside spark and not in user logic. It 
> might make sense to replace all of them with *askSync*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33446) [CORE] Add config spark.executor.coresOverhead

2020-11-13 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-33446:


 Summary: [CORE] Add config spark.executor.coresOverhead
 Key: SPARK-33446
 URL: https://issues.apache.org/jira/browse/SPARK-33446
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: Zhongwei Zhu


Add config spark.executor.coresOverhead to request extra cores per executor. 
This config would be helpful in cases like the following:

Suppose that for the physical machines or VMs the memory/CPU ratio is 3GB per core, 
but when we run a Spark job we want 6GB per task. If we request resources the usual 
way, resources are wasted, because the extra cores needed to obtain enough memory 
also become task slots. If we can request extra cores that do not affect the 
per-executor cores used for task allocation, the extra cores are not wasted. 
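
A hedged worked example of the arithmetic behind the proposal (the config name is the proposal of this issue; the numbers are the ones from the description):

{code:scala}
// Illustrative only: cluster grants 3 GB of memory per requested core and
// each task needs 6 GB.
val memoryPerCoreGb = 3
val taskCores = 2        // spark.executor.cores: cores used for task slots
val overheadCores = 2    // proposed spark.executor.coresOverhead: extra cores, no task slots
val executorMemoryGb = memoryPerCoreGb * (taskCores + overheadCores) // 12 GB requested
val memoryPerTaskGb  = executorMemoryGb / taskCores                  // 6 GB per concurrent task
{code}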



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32314) [SHS] Remove old format of stacktrace in event log

2020-11-12 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-32314:
-
Description: 
Currently, EventLoggingListener writes both "Stack Trace" and "Full Stack 
Trace" in the TaskEndReason of ExceptionFailure to the event log. Both fields contain 
the same info, and the former one is kept for backward compatibility with Spark 
history before version 1.2.0. We can remove the 1st field. This will help reduce 
event log size significantly when lots of tasks fail due to 
ExceptionFailure.

The sample json of the current format is as below:

{noformat}
{
  "Event": "SparkListenerTaskEnd",
  "Stage ID": 1237,
  "Stage Attempt ID": 0,
  "Task Type": "ShuffleMapTask",
  "Task End Reason": {
"Reason": "ExceptionFailure",
"Class Name": "java.io.IOException",
"Description": "org.apache.spark.SparkException: Failed to get 
broadcast_1405_piece10 of broadcast_1405",
"Stack Trace": [
  {
"Declaring Class": "org.apache.spark.util.Utils$",
"Method Name": "tryOrIOException",
"File Name": "Utils.scala",
"Line Number": 1350
  },
  {
"Declaring Class": "org.apache.spark.broadcast.TorrentBroadcast",
"Method Name": "readBroadcastBlock",
"File Name": "TorrentBroadcast.scala",
"Line Number": 218
  },
  {
"Declaring Class": "org.apache.spark.broadcast.TorrentBroadcast",
"Method Name": "getValue",
"File Name": "TorrentBroadcast.scala",
"Line Number": 103
  },
  {
"Declaring Class": "org.apache.spark.broadcast.Broadcast",
"Method Name": "value",
"File Name": "Broadcast.scala",
"Line Number": 70
  },
  {
"Declaring Class": 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9",
"Method Name": "wholestagecodegen_init_0_0$",
"File Name": "generated.java",
"Line Number": 466
  },
  {
"Declaring Class": 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9",
"Method Name": "init",
"File Name": "generated.java",
"Line Number": 33
  },
  {
"Declaring Class": 
"org.apache.spark.sql.execution.WholeStageCodegenExec",
"Method Name": "$anonfun$doExecute$4",
"File Name": "WholeStageCodegenExec.scala",
"Line Number": 750
  },
  {
"Declaring Class": 
"org.apache.spark.sql.execution.WholeStageCodegenExec",
"Method Name": "$anonfun$doExecute$4$adapted",
"File Name": "WholeStageCodegenExec.scala",
"Line Number": 747
  },
  {
"Declaring Class": "org.apache.spark.rdd.RDD",
"Method Name": "$anonfun$mapPartitionsWithIndex$2",
"File Name": "RDD.scala",
"Line Number": 915
  },
  {
"Declaring Class": "org.apache.spark.rdd.RDD",
"Method Name": "$anonfun$mapPartitionsWithIndex$2$adapted",
"File Name": "RDD.scala",
"Line Number": 915
  },
  {
"Declaring Class": "org.apache.spark.rdd.MapPartitionsRDD",
"Method Name": "compute",
"File Name": "MapPartitionsRDD.scala",
"Line Number": 52
  },
  {
"Declaring Class": "org.apache.spark.rdd.RDD",
"Method Name": "computeOrReadCheckpoint",
"File Name": "RDD.scala",
"Line Number": 373
  },
  {
"Declaring Class": "org.apache.spark.rdd.RDD",
"Method Name": "iterator",
"File Name": "RDD.scala",
"Line Number": 337
  },
  {
"Declaring Class": "org.apache.spark.rdd.MapPartitionsRDD",
"Method Name": "compute",
"File Name": "MapPartitionsRDD.scala",
"Line Number": 52
  },
  {
"Declaring Class": "org.apache.spark.rdd.RDD",
"Method Name": "computeOrReadCheckpoint",
"File Name": "RDD.scala",
"Line Number": 373
  },
  {
"Declaring Class": "org.apache.spark.rdd.RDD",
"Method Name": "iterator",
"File Name": "RDD.scala",
"Line Number": 337
  },
  {
"Declaring Class": "org.apache.spark.shuffle.ShuffleWriteProcessor",
"Method Name": "write",
"File Name": "ShuffleWriteProcessor.scala",
"Line Number": 59
  },
  {
"Declaring Class": "org.apache.spark.scheduler.ShuffleMapTask",
"Method Name": "runTask",
"File Name": "ShuffleMapTask.scala",
"Line Number": 99
  },
  {
"Declaring Class": "org.apache.spark.scheduler.ShuffleMapTask",
"Method Name": "runTask",
"File Name": "ShuffleMapTask.scala",
"Line Number": 52
  },
  {
"Declaring Class": "org.apache.spark.scheduler.Task",
"Method Name": "run",
"File Name": "Task.scala",
"Line Number": 127
  },
  {

[jira] [Updated] (SPARK-32314) [SHS] Remove old format of stacktrace in event log

2020-11-12 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-32314:
-
Summary: [SHS] Remove old format of stacktrace in event log  (was: [SHS] 
Add config to control whether log old format of stacktrace)

> [SHS] Remove old format of stacktrace in event log
> --
>
> Key: SPARK-32314
> URL: https://issues.apache.org/jira/browse/SPARK-32314
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> Currently, EventLoggingListener writes both "Stack Trace" and "Full Stack 
> Trace" in the TaskEndReason of ExceptionFailure to the event log. Both fields 
> contain the same info, and the former one is kept for backward compatibility with 
> Spark history before version 1.2.0. We can remove the 1st field in the default 
> setting and add one config to control whether to log the 1st field. This will help 
> reduce event log size significantly when lots of tasks fail due to 
> ExceptionFailure.
>  
> The sample json of the current format is as below:
>  
> {noformat}
> {
>   "Event": "SparkListenerTaskEnd",
>   "Stage ID": 1237,
>   "Stage Attempt ID": 0,
>   "Task Type": "ShuffleMapTask",
>   "Task End Reason": {
> "Reason": "ExceptionFailure",
> "Class Name": "java.io.IOException",
> "Description": "org.apache.spark.SparkException: Failed to get 
> broadcast_1405_piece10 of broadcast_1405",
> "Stack Trace": [
>   {
> "Declaring Class": "org.apache.spark.util.Utils$",
> "Method Name": "tryOrIOException",
> "File Name": "Utils.scala",
> "Line Number": 1350
>   },
>   {
> "Declaring Class": "org.apache.spark.broadcast.TorrentBroadcast",
> "Method Name": "readBroadcastBlock",
> "File Name": "TorrentBroadcast.scala",
> "Line Number": 218
>   },
>   {
> "Declaring Class": "org.apache.spark.broadcast.TorrentBroadcast",
> "Method Name": "getValue",
> "File Name": "TorrentBroadcast.scala",
> "Line Number": 103
>   },
>   {
> "Declaring Class": "org.apache.spark.broadcast.Broadcast",
> "Method Name": "value",
> "File Name": "Broadcast.scala",
> "Line Number": 70
>   },
>   {
> "Declaring Class": 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9",
> "Method Name": "wholestagecodegen_init_0_0$",
> "File Name": "generated.java",
> "Line Number": 466
>   },
>   {
> "Declaring Class": 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9",
> "Method Name": "init",
> "File Name": "generated.java",
> "Line Number": 33
>   },
>   {
> "Declaring Class": 
> "org.apache.spark.sql.execution.WholeStageCodegenExec",
> "Method Name": "$anonfun$doExecute$4",
> "File Name": "WholeStageCodegenExec.scala",
> "Line Number": 750
>   },
>   {
> "Declaring Class": 
> "org.apache.spark.sql.execution.WholeStageCodegenExec",
> "Method Name": "$anonfun$doExecute$4$adapted",
> "File Name": "WholeStageCodegenExec.scala",
> "Line Number": 747
>   },
>   {
> "Declaring Class": "org.apache.spark.rdd.RDD",
> "Method Name": "$anonfun$mapPartitionsWithIndex$2",
> "File Name": "RDD.scala",
> "Line Number": 915
>   },
>   {
> "Declaring Class": "org.apache.spark.rdd.RDD",
> "Method Name": "$anonfun$mapPartitionsWithIndex$2$adapted",
> "File Name": "RDD.scala",
> "Line Number": 915
>   },
>   {
> "Declaring Class": "org.apache.spark.rdd.MapPartitionsRDD",
> "Method Name": "compute",
> "File Name": "MapPartitionsRDD.scala",
> "Line Number": 52
>   },
>   {
> "Declaring Class": "org.apache.spark.rdd.RDD",
> "Method Name": "computeOrReadCheckpoint",
> "File Name": "RDD.scala",
> "Line Number": 373
>   },
>   {
> "Declaring Class": "org.apache.spark.rdd.RDD",
> "Method Name": "iterator",
> "File Name": "RDD.scala",
> "Line Number": 337
>   },
>   {
> "Declaring Class": "org.apache.spark.rdd.MapPartitionsRDD",
> "Method Name": "compute",
> "File Name": "MapPartitionsRDD.scala",
> "Line Number": 52
>   },
>   {
> "Declaring Class": "org.apache.spark.rdd.RDD",
> "Method Name": "computeOrReadCheckpoint",
> "File Name": "RDD.scala",
> "Line Number": 373
>   },
>   {
> "Declaring Class": "org.apache.spark.rdd.RDD",
> "Method Name": "iterator",
> 

[jira] [Updated] (SPARK-33375) [CORE] Add config spark.yarn.pyspark.archives

2020-11-06 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-33375:
-
Summary: [CORE] Add config spark.yarn.pyspark.archives  (was: [CORE] config 
spark.yarn.pyspark.archives)

> [CORE] Add config spark.yarn.pyspark.archives
> -
>
> Key: SPARK-33375
> URL: https://issues.apache.org/jira/browse/SPARK-33375
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 3.0.1
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> This config provides a similar function to `spark.yarn.archive`, but instead 
> passes the location of pyspark.zip and py4j.zip. It has the following benefits:
> 1. If the version in `spark.yarn.archive` differs from the one on the current 
> machine, there is no way to provide a matching pyspark version. This happens 
> during version upgrades.
> 2. It saves the effort of uploading pyspark.zip and py4j.zip on every submission.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33375) [CORE] config spark.yarn.pyspark.archives

2020-11-06 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-33375:


 Summary: [CORE] config spark.yarn.pyspark.archives
 Key: SPARK-33375
 URL: https://issues.apache.org/jira/browse/SPARK-33375
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 3.0.1
Reporter: Zhongwei Zhu


This config provides a similar function to `spark.yarn.archive`, but instead 
passes the location of pyspark.zip and py4j.zip. It has the following benefits 
(a usage sketch follows the list):

1. If the version in `spark.yarn.archive` differs from the one on the current 
machine, there is no way to provide a matching pyspark version. This happens 
during version upgrades.
2. It saves the effort of uploading pyspark.zip and py4j.zip on every submission.
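A minimal usage sketch, assuming the proposed config is named 
`spark.yarn.pyspark.archives` and accepts a comma-separated list of archive 
locations; the paths below are illustrative, and this config is not an existing 
Spark API:

{code:java}
// Hypothetical usage sketch for the proposed config; name and value format are
// assumptions taken from this ticket.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Jar archive location that spark.yarn.archive already supports today:
  .set("spark.yarn.archive", "hdfs:///apps/spark/spark-libs.zip")
  // Proposed: point PySpark at archives matching the cluster-side Spark version,
  // instead of re-uploading pyspark.zip/py4j.zip from the local SPARK_HOME.
  .set("spark.yarn.pyspark.archives",
    "hdfs:///apps/spark/python/pyspark.zip,hdfs:///apps/spark/python/py4j-0.10.9-src.zip")
{code}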



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33374) [CORE] Remove unnecessary python path from spark home

2020-11-06 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-33374:


 Summary: [CORE] Remove unnecessary python path from spark home
 Key: SPARK-33374
 URL: https://issues.apache.org/jira/browse/SPARK-33374
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: Zhongwei Zhu


Currently, spark-submit uploads pyspark.zip and py4j-0.10.9-src.zip into the 
staging folder, and both files are added to PYTHONPATH. So it's unnecessary to 
also add the duplicate files from the current Spark home folder on the local 
machine.
Output of `sys.path` is as below:

'D:\\data\\yarnnm\\local\\usercache\\z\\appcache\\application_1603546638930_150736\\container_e1148_1603546638930_150736_01_02\\pyspark.zip',
'D:\\data\\yarnnm\\local\\usercache\\z\\appcache\\application_1603546638930_150736\\container_e1148_1603546638930_150736_01_02\\py4j-0.10.7-src.zip',
'D:\\data\\spark.latest\\python\\lib\\pyspark.zip',
'D:\\data\\spark.latest\\python\\lib\\py4j-0.10.7-src.zip',





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33274) [SS] Fix job hang in cp mode when total cores less than total kafka partition

2020-10-28 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-33274:


 Summary: [SS] Fix job hang in cp mode when total cores less than 
total kafka partition 
 Key: SPARK-33274
 URL: https://issues.apache.org/jira/browse/SPARK-33274
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.1
Reporter: Zhongwei Zhu


In continuous processing mode, EpochCoordinator won't add offsets to the query 
until it has received ReportPartitionOffset from all partitions. Normally, each 
Kafka topic partition is handled by one core, so if the total number of cores is 
smaller than the total number of Kafka topic partitions, the job hangs without 
any error message.

Failing the job with a clear error message is the better solution. A sketch of 
the hang scenario follows.
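
A minimal sketch of the hang scenario, assuming a 4-core local master and a 
16-partition topic (broker, topic name, and paths are illustrative only); this 
shows why continuous mode needs at least one core per Kafka partition, it is not 
the proposed fix:

{code:java}
// Illustrative only: continuous processing launches one long-running task per
// Kafka partition, so with 4 cores and 16 partitions the query never makes progress.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("continuous-processing-hang-demo")
  .master("local[4]")                                   // only 4 cores available
  .getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topic_with_16_partitions")      // 16 partitions > 4 cores
  .load()

df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/cp-hang-demo")
  .trigger(Trigger.Continuous("1 second"))              // continuous processing mode
  .start()
{code}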



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32863) Full outer stream-stream join

2020-10-23 Thread Zhongwei Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219872#comment-17219872
 ] 

Zhongwei Zhu commented on SPARK-32863:
--

[~chengsu] Have you already started working on this? I'd like to help with this PR.

> Full outer stream-stream join
> -
>
> Key: SPARK-32863
> URL: https://issues.apache.org/jira/browse/SPARK-32863
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Major
>
> Current stream-stream join supports inner, left outer and right outer join 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166]
>  ). With the current design of stream-stream join (which marks whether the row 
> is matched or not in the state store), it would be very easy to support full 
> outer join as well.
>  
> Full outer stream-stream join will work as follows (a query sketch follows the 
> list):
> (1) For a left side input row, check if there's a match in the right side 
> state store. If there's a match, output all matched rows. Put the row in the 
> left side state store.
> (2) For a right side input row, check if there's a match in the left side 
> state store. If there's a match, output all matched rows and set the "matched" 
> field of the matching left side rows' state to true. Put the right side row in 
> the right side state store.
> (3) For a left side row that needs to be evicted from the state store, output 
> the row if its "matched" field is false.
> (4) For a right side row that needs to be evicted from the state store, output 
> the row if its "matched" field is false.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32446) Add new executor metrics summary REST APIs and parameters

2020-07-26 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-32446:


 Summary: Add new executor metrics summary REST APIs and parameters
 Key: SPARK-32446
 URL: https://issues.apache.org/jira/browse/SPARK-32446
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Zhongwei Zhu


Add percentile distribution of peak memory metrics for all executors

http://:18080/api/v1/applications//

[jira] [Created] (SPARK-32349) [UI] Reduce unnecessary allexecutors call when render stage page executor summary

2020-07-17 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-32349:


 Summary: [UI] Reduce unnecessary allexecutors call when render 
stage page executor summary
 Key: SPARK-32349
 URL: https://issues.apache.org/jira/browse/SPARK-32349
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Zhongwei Zhu


Currently, the stage page executor summary table is built by merging the 
/allexecutors response with stage.executorSummary. The only data needed from 
/allexecutors are two fields: hostPort and executorLogs.

By adding both fields to v1.ExecutorStageSummary, we can save one /allexecutors 
call and make the frontend logic simpler (a rough sketch follows).
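
A rough, trimmed-down illustration of the proposed shape; the field names and 
types are assumptions based on this description, not the actual v1 API:

{code:java}
// Hypothetical sketch, not the real class: the stage page could read the
// executor address and log links straight from the per-stage executor summary.
case class ExecutorStageSummarySketch(
    taskTime: Long,
    failedTasks: Int,
    succeededTasks: Int,
    hostPort: String,                    // new: e.g. "10.0.0.5:34567"
    executorLogs: Map[String, String])   // new: log name -> URL
{code}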



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32314) [SHS] Add config to control whether log old format of stacktrace

2020-07-14 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-32314:


 Summary: [SHS] Add config to control whether log old format of 
stacktrace
 Key: SPARK-32314
 URL: https://issues.apache.org/jira/browse/SPARK-32314
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Zhongwei Zhu


Currently, EventLoggingListener writes both "Stack Trace" and "Full Stack Trace" 
in the TaskEndReason of ExceptionFailure to the event log. Both fields contain 
the same info, and the former is kept only for backward compatibility with Spark 
history before version 1.2.0. We can remove the 1st field in the default setting 
and add one config to control whether to log it. This will help reduce event log 
size significantly when lots of tasks fail due to ExceptionFailure.

 

The sample json of current format as below:

 
{noformat}
{
  "Event": "SparkListenerTaskEnd",
  "Stage ID": 1237,
  "Stage Attempt ID": 0,
  "Task Type": "ShuffleMapTask",
  "Task End Reason": {
"Reason": "ExceptionFailure",
"Class Name": "java.io.IOException",
"Description": "org.apache.spark.SparkException: Failed to get 
broadcast_1405_piece10 of broadcast_1405",
"Stack Trace": [
  {
"Declaring Class": "org.apache.spark.util.Utils$",
"Method Name": "tryOrIOException",
"File Name": "Utils.scala",
"Line Number": 1350
  },
  {
"Declaring Class": "org.apache.spark.broadcast.TorrentBroadcast",
"Method Name": "readBroadcastBlock",
"File Name": "TorrentBroadcast.scala",
"Line Number": 218
  },
  {
"Declaring Class": "org.apache.spark.broadcast.TorrentBroadcast",
"Method Name": "getValue",
"File Name": "TorrentBroadcast.scala",
"Line Number": 103
  },
  {
"Declaring Class": "org.apache.spark.broadcast.Broadcast",
"Method Name": "value",
"File Name": "Broadcast.scala",
"Line Number": 70
  },
  {
"Declaring Class": 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9",
"Method Name": "wholestagecodegen_init_0_0$",
"File Name": "generated.java",
"Line Number": 466
  },
  {
"Declaring Class": 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9",
"Method Name": "init",
"File Name": "generated.java",
"Line Number": 33
  },
  {
"Declaring Class": 
"org.apache.spark.sql.execution.WholeStageCodegenExec",
"Method Name": "$anonfun$doExecute$4",
"File Name": "WholeStageCodegenExec.scala",
"Line Number": 750
  },
  {
"Declaring Class": 
"org.apache.spark.sql.execution.WholeStageCodegenExec",
"Method Name": "$anonfun$doExecute$4$adapted",
"File Name": "WholeStageCodegenExec.scala",
"Line Number": 747
  },
  {
"Declaring Class": "org.apache.spark.rdd.RDD",
"Method Name": "$anonfun$mapPartitionsWithIndex$2",
"File Name": "RDD.scala",
"Line Number": 915
  },
  {
"Declaring Class": "org.apache.spark.rdd.RDD",
"Method Name": "$anonfun$mapPartitionsWithIndex$2$adapted",
"File Name": "RDD.scala",
"Line Number": 915
  },
  {
"Declaring Class": "org.apache.spark.rdd.MapPartitionsRDD",
"Method Name": "compute",
"File Name": "MapPartitionsRDD.scala",
"Line Number": 52
  },
  {
"Declaring Class": "org.apache.spark.rdd.RDD",
"Method Name": "computeOrReadCheckpoint",
"File Name": "RDD.scala",
"Line Number": 373
  },
  {
"Declaring Class": "org.apache.spark.rdd.RDD",
"Method Name": "iterator",
"File Name": "RDD.scala",
"Line Number": 337
  },
  {
"Declaring Class": "org.apache.spark.rdd.MapPartitionsRDD",
"Method Name": "compute",
"File Name": "MapPartitionsRDD.scala",
"Line Number": 52
  },
  {
"Declaring Class": "org.apache.spark.rdd.RDD",
"Method Name": "computeOrReadCheckpoint",
"File Name": "RDD.scala",
"Line Number": 373
  },
  {
"Declaring Class": "org.apache.spark.rdd.RDD",
"Method Name": "iterator",
"File Name": "RDD.scala",
"Line Number": 337
  },
  {
"Declaring Class": "org.apache.spark.shuffle.ShuffleWriteProcessor",
"Method Name": "write",
"File Name": "ShuffleWriteProcessor.scala",
"Line Number": 59
  },
  {
"Declaring Class": "org.apache.spark.scheduler.ShuffleMapTask",
"Method Name": "runTask",
"File Name": "ShuffleMapTask.scala",
"Line Number": 99
  },
  {
"Declaring Class": "org.apache.spark.scheduler.ShuffleMapTask",

[jira] [Created] (SPARK-32288) [UI] Add exception summary table in stage page

2020-07-12 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-32288:


 Summary: [UI] Add exception summary table in stage page
 Key: SPARK-32288
 URL: https://issues.apache.org/jira/browse/SPARK-32288
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Zhongwei Zhu


When there are many task failures during one stage, it's hard to find the 
failure pattern, for example by aggregating task failures by exception type and 
message. If we had such information, we could easily tell which type of 
exception is the root cause of the stage failure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API

2020-06-28 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-32125:


 Summary: [UI] Support get taskList by status in Web UI and SHS 
Rest API
 Key: SPARK-32125
 URL: https://issues.apache.org/jira/browse/SPARK-32125
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Zhongwei Zhu


Support fetching taskList by status as below:

/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32124) [SHS] Failed to parse FetchFailed TaskEndReason from event log produce by Spark 2.4

2020-06-28 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-32124:


 Summary: [SHS] Failed to parse FetchFailed TaskEndReason from 
event log produce by Spark 2.4 
 Key: SPARK-32124
 URL: https://issues.apache.org/jira/browse/SPARK-32124
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Zhongwei Zhu


When reading an event log produced by Spark 2.4.4, parsing the FetchFailed 
TaskEndReason failed due to the missing field "Map Index".
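
A minimal sketch of the tolerant parsing this calls for, using json4s (which 
Spark's JsonProtocol is built on); the sample fragment and the chosen default 
are illustrative, not the actual fix:

{code:java}
// Sketch only: treat "Map Index" as optional so event logs written by Spark 2.4
// (which never emitted the field) can still be replayed.
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

implicit val formats: Formats = DefaultFormats

// A 2.4-style FetchFailed fragment: no "Map Index" field.
val json = parse("""{"Shuffle ID": 3, "Map ID": 7, "Reduce ID": 1}""")

// extractOpt tolerates the missing field; -1 here is just a placeholder default.
val mapIndex: Int = (json \ "Map Index").extractOpt[Int].getOrElse(-1)
{code}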

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32044) [SS] 2.4 Kafka continuous processing print mislead initial offsets log

2020-06-22 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-32044:
-
Summary: [SS] 2.4 Kafka continuous processing print mislead initial offsets 
log   (was: [SS] 2.4 Kakfa continuous processing print mislead initial offsets 
log )

> [SS] 2.4 Kafka continuous processing print mislead initial offsets log 
> ---
>
> Key: SPARK-32044
> URL: https://issues.apache.org/jira/browse/SPARK-32044
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Zhongwei Zhu
>Priority: Trivial
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using structured streaming in continuous processing mode, after the spark 
> job is restarted it correctly picks up the offsets of the last epoch from the 
> checkpoint location, but it always prints out the log below:
> 20/06/12 00:58:09 INFO [stream execution thread for [id = 
> 34e5b909-f9fe-422a-89c0-081251a68693, runId = 
> 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: 
> Initial offsets: 
> \{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}
> This log is misleading, as spark didn't actually use it as the initial 
> offsets. Also, it results in an unnecessary kafka offset fetch. This is caused 
> by the code below in KafkaContinuousReader:
> {code:java}
> offset = start.orElse {
>   val offsets = initialOffsets match {
> case EarliestOffsetRangeLimit =>
>   KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
> case LatestOffsetRangeLimit =>
>   KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
> case SpecificOffsetRangeLimit(p) =>
>   offsetReader.fetchSpecificOffsets(p, reportDataLoss)
>   }
>   logInfo(s"Initial offsets: $offsets")
>   offsets
> }
> {code}
>  The code inside orElse block is always executed even when start has value.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32044) [SS] 2.4 Kakfa continuous processing print mislead initial offsets log

2020-06-22 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-32044:
-
Summary: [SS] 2.4 Kakfa continuous processing print mislead initial offsets 
log   (was: [SS] Kakfa continuous processing print mislead initial offsets log )

> [SS] 2.4 Kakfa continuous processing print mislead initial offsets log 
> ---
>
> Key: SPARK-32044
> URL: https://issues.apache.org/jira/browse/SPARK-32044
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Zhongwei Zhu
>Priority: Trivial
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using structured streaming in continuous processing mode, after the spark 
> job is restarted it correctly picks up the offsets of the last epoch from the 
> checkpoint location, but it always prints out the log below:
> 20/06/12 00:58:09 INFO [stream execution thread for [id = 
> 34e5b909-f9fe-422a-89c0-081251a68693, runId = 
> 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: 
> Initial offsets: 
> \{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}
> This log is misleading, as spark didn't actually use it as the initial 
> offsets. Also, it results in an unnecessary kafka offset fetch. This is caused 
> by the code below in KafkaContinuousReader:
> {code:java}
> offset = start.orElse {
>   val offsets = initialOffsets match {
> case EarliestOffsetRangeLimit =>
>   KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
> case LatestOffsetRangeLimit =>
>   KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
> case SpecificOffsetRangeLimit(p) =>
>   offsetReader.fetchSpecificOffsets(p, reportDataLoss)
>   }
>   logInfo(s"Initial offsets: $offsets")
>   offsets
> }
> {code}
>  The code inside orElse block is always executed even when start has value.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32044) [SS] Kakfa continuous processing print mislead initial offsets log

2020-06-21 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-32044:
-
Description: 
When using structured streaming in continuous processing mode, after the spark 
job is restarted it correctly picks up the offsets of the last epoch from the 
checkpoint location, but it always prints out the log below:

20/06/12 00:58:09 INFO [stream execution thread for [id = 
34e5b909-f9fe-422a-89c0-081251a68693, runId = 
0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: Initial 
offsets: 
\{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}

This log is misleading, as spark didn't actually use it as the initial offsets. 
Also, it results in an unnecessary kafka offset fetch. This is caused by the 
code below in KafkaContinuousReader:
{code:java}
offset = start.orElse {
  val offsets = initialOffsets match {
case EarliestOffsetRangeLimit =>
  KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
case LatestOffsetRangeLimit =>
  KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
case SpecificOffsetRangeLimit(p) =>
  offsetReader.fetchSpecificOffsets(p, reportDataLoss)
  }
  logInfo(s"Initial offsets: $offsets")
  offsets
}
{code}
 The code inside orElse block is always executed even when start has value.

 

  was:
When using structured streaming in continuous processing mode, after restart 
spark job, spark job can correctly pick up offsets in checkpoint location from 
last epoch. But it always print out below log:

20/06/12 00:58:09 INFO [stream execution thread for [id = 
34e5b909-f9fe-422a-89c0-081251a68693, runId = 
0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: Initial 
offsets: 
\{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}

 And this log is misleading, as spark didn't use this one as initial offsets. 
This is caused by below code in KafkaContinuousReader
{code:java}
offset = start.orElse {
  val offsets = initialOffsets match {
case EarliestOffsetRangeLimit =>
  KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
case LatestOffsetRangeLimit =>
  KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
case SpecificOffsetRangeLimit(p) =>
  offsetReader.fetchSpecificOffsets(p, reportDataLoss)
  }
  logInfo(s"Initial offsets: $offsets")
  offsets
}
{code}
 The code inside orElse block is always executed even when start has value.

 


> [SS] Kakfa continuous processing print mislead initial offsets log 
> ---
>
> Key: SPARK-32044
> URL: https://issues.apache.org/jira/browse/SPARK-32044
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Zhongwei Zhu
>Priority: Trivial
> Fix For: 2.4.7
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using structured streaming in continuous processing mode, after the spark 
> job is restarted it correctly picks up the offsets of the last epoch from the 
> checkpoint location, but it always prints out the log below:
> 20/06/12 00:58:09 INFO [stream execution thread for [id = 
> 34e5b909-f9fe-422a-89c0-081251a68693, runId = 
> 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: 
> Initial offsets: 
> \{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}
> This log is misleading, as spark didn't actually use it as the initial 
> offsets. Also, it results in an unnecessary kafka offset fetch. This is caused 
> by the code below in KafkaContinuousReader:
> {code:java}
> offset = start.orElse {
>   val offsets = initialOffsets match {
> case EarliestOffsetRangeLimit =>
>   KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
> case LatestOffsetRangeLimit =>
>   KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
> case SpecificOffsetRangeLimit(p) =>
>   offsetReader.fetchSpecificOffsets(p, reportDataLoss)
>   }
>   logInfo(s"Initial offsets: $offsets")
>   offsets
> }
> {code}
>  The code inside orElse block is always executed even when start has value.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Updated] (SPARK-32044) [SS] Kakfa continuous processing print mislead initial offsets log

2020-06-21 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-32044:
-
Summary: [SS] Kakfa continuous processing print mislead initial offsets log 
  (was: Kakfa continuous processing print mislead initial offsets log )

> [SS] Kakfa continuous processing print mislead initial offsets log 
> ---
>
> Key: SPARK-32044
> URL: https://issues.apache.org/jira/browse/SPARK-32044
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Zhongwei Zhu
>Priority: Trivial
> Fix For: 2.4.7
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using structured streaming in continuous processing mode, after the spark 
> job is restarted it correctly picks up the offsets of the last epoch from the 
> checkpoint location, but it always prints out the log below:
> 20/06/12 00:58:09 INFO [stream execution thread for [id = 
> 34e5b909-f9fe-422a-89c0-081251a68693, runId = 
> 0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: 
> Initial offsets: 
> \{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}
>  And this log is misleading, as spark didn't actually use it as the initial 
> offsets. This is caused by the code below in KafkaContinuousReader:
> {code:java}
> offset = start.orElse {
>   val offsets = initialOffsets match {
> case EarliestOffsetRangeLimit =>
>   KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
> case LatestOffsetRangeLimit =>
>   KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
> case SpecificOffsetRangeLimit(p) =>
>   offsetReader.fetchSpecificOffsets(p, reportDataLoss)
>   }
>   logInfo(s"Initial offsets: $offsets")
>   offsets
> }
> {code}
>  The code inside orElse block is always executed even when start has value.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32044) Kakfa continuous processing print mislead initial offsets log

2020-06-21 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-32044:


 Summary: Kakfa continuous processing print mislead initial offsets 
log 
 Key: SPARK-32044
 URL: https://issues.apache.org/jira/browse/SPARK-32044
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.6
Reporter: Zhongwei Zhu
 Fix For: 2.4.7


When using structured streaming in continuous processing mode, after the spark 
job is restarted it correctly picks up the offsets of the last epoch from the 
checkpoint location, but it always prints out the log below:

20/06/12 00:58:09 INFO [stream execution thread for [id = 
34e5b909-f9fe-422a-89c0-081251a68693, runId = 
0246e19d-aaa1-4a5c-9091-bab1a0578a0a]] kafka010.KafkaContinuousReader: Initial 
offsets: 
\{"kafka_topic":{"8":51618236,"11":51610655,"2":51622889,"5":51637171,"14":51637346,"13":51627784,"4":51606960,"7":51632475,"1":51636129,"10":51632212,"9":51634107,"3":51611013,"12":51626567,"15":51640774,"6":51637823,"0":51629106}}

And this log is misleading, as spark didn't actually use it as the initial 
offsets. This is caused by the code below in KafkaContinuousReader:
{code:java}
offset = start.orElse {
  val offsets = initialOffsets match {
case EarliestOffsetRangeLimit =>
  KafkaSourceOffset(offsetReader.fetchEarliestOffsets())
case LatestOffsetRangeLimit =>
  KafkaSourceOffset(offsetReader.fetchLatestOffsets(None))
case SpecificOffsetRangeLimit(p) =>
  offsetReader.fetchSpecificOffsets(p, reportDataLoss)
  }
  logInfo(s"Initial offsets: $offsets")
  offsets
}
{code}
 The code inside orElse block is always executed even when start has value.
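
A minimal standalone sketch of the underlying issue (not the actual patch), 
assuming `start` is a java.util.Optional as in the 2.4 data source v2 reader 
API: Optional.orElse(x) always evaluates its argument, whereas 
Optional.orElseGet takes a supplier that only runs when the Optional is empty.

{code:java}
// Illustrative only; the names below are made up to mirror the snippet above.
import java.util.Optional

object OrElseDemo extends App {
  def fetchInitialOffsets(): String = {
    println("Initial offsets: ...")   // the misleading log plus an extra fetch
    "earliest"
  }

  val start: Optional[String] = Optional.of("checkpointed-offsets")

  // The argument of orElse is evaluated eagerly, so the fetch and log still happen:
  val eager = start.orElse(fetchInitialOffsets())

  // orElseGet is lazy: the supplier only runs when start is empty (Scala 2.12+ SAM syntax).
  val onlyIfEmpty = start.orElseGet(() => fetchInitialOffsets())
}
{code}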

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23432) Expose executor memory metrics in the web UI for executors

2020-01-05 Thread Zhongwei Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008493#comment-17008493
 ] 

Zhongwei Zhu commented on SPARK-23432:
--

I'll work on this. 

> Expose executor memory metrics in the web UI for executors
> --
>
> Key: SPARK-23432
> URL: https://issues.apache.org/jira/browse/SPARK-23432
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edward Lu
>Priority: Major
>
> Add the new memory metrics (jvmUsedMemory, executionMemory, storageMemory, 
> and unifiedMemory, etc.) to the executors tab, in the summary and for each 
> executor.
> This is a subtask for SPARK-23206. Please refer to the design doc for that 
> ticket for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org