[jira] [Updated] (SPARK-47759) Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate time string

2024-04-09 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Description: 
h2. Symptom

We've observed that our Spark apps occasionally get stuck with an unexpected 
stack trace when reading/parsing a legitimate time string. Note that we 
manually killed the stuck app instances and the retry went through on the same 
cluster (without requiring any app code change).

*[Stack Trace 1]* The stack trace doesn't make sense, since *120s* is a 
legitimate time string. The app runs on emr-7.0.0 with the Spark 3.5.0 
runtime.
{code:java}
Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time must be specified as seconds (s), milliseconds (ms), microseconds (us), minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
Failed to parse time string: 120s
at org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
at org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:33)
at org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533)
at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
at org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at ...
{code}
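As a point of reference, the two JavaUtils entry points at the top of the trace are public static helpers, and parsing the same string through them in a plain single-threaded program is expected to succeed. A minimal, illustrative sketch (assuming the spark-network-common classes are on the classpath):
{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.spark.network.util.JavaUtils;

public class TimeStringParseCheck {
  public static void main(String[] args) {
    // "120s" matches the documented format (a number followed by a unit
    // suffix), so these single-threaded parses are expected to succeed.
    long seconds = JavaUtils.timeStringAsSec("120s");                     // 120
    long millis = JavaUtils.timeStringAs("120s", TimeUnit.MILLISECONDS);  // 120000
    System.out.println(seconds + " s = " + millis + " ms");
  }
}
{code}
The *120s* in the trace likely comes from an RPC timeout setting ({{spark.rpc.askTimeout}}, which falls back to {{spark.network.timeout}} and defaults to 120s), i.e. a stock configuration value rather than user input, which is consistent with the parse failure being intermittent rather than caused by a malformed string.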

[jira] [Commented] (SPARK-47759) Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate time string

2024-04-08 Thread Bo Xiong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835136#comment-17835136
 ] 

Bo Xiong commented on SPARK-47759:
--

I've submitted [a fix|https://github.com/apache/spark/pull/45942].  Please help 
get it merged to the master branch.

Once that's merged, I'll submit other pull requests to patch v3.5.0 and above.  
Thanks!

> Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate 
> time string
> -
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, pull-request-available, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0, 3.5.1, 3.5.2
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>

[jira] [Updated] (SPARK-47759) Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate time string

2024-04-08 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Fix Version/s: 3.5.2, 3.5.1
Affects Version/s: 3.5.1 (was: 4.0.0)

[jira] [Updated] (SPARK-47759) Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate time string

2024-04-08 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Summary: Apps being stuck after JavaUtils.timeStringAs fails to parse a 
legitimate time string  (was: Apps being stuck with an unexpected stack trace 
when reading/parsing a legitimate time string)

[jira] [Updated] (SPARK-47759) Apps being stuck with an unexpected stack trace when reading/parsing a legitimate time string

2024-04-08 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Summary: Apps being stuck with an unexpected stack trace when 
reading/parsing a legitimate time string  (was: Apps being stuck with an 
unexpected stack trace when reading/parsing a time string)

> Apps being stuck with an unexpected stack trace when reading/parsing a 
> legitimate time string
> -
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, pull-request-available, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> It's observed that our Spark apps occasionally got stuck with an unexpected 
> stack trace when reading/parsing a legitimate time string. Note that we 
> manually killed the stuck app instances and the rety goes thru on the same 
> cluster (without requiring any app code change).
>  
> *[Stack Trace 1]* The stack trace doesn't make sense since *120s* is a 
> legitimate time string, where the app runs on emr-7.0.0 with Spark 3.5.0 
> runtime.
> {code:java}
> Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
> must be specified as seconds (s), milliseconds (ms), microseconds (us), 
> minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
> Failed to parse time string: 120s
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
> at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
> at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
> at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
> at org.apache.spark.rpc.RpcEndpointRef.(RpcEndpointRef.scala:33)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.(NettyRpcEnv.scala:533)
> at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> 

[jira] [Updated] (SPARK-47759) Apps being stuck with an unexpected stack trace when reading/parsing a time string

2024-04-08 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Description: 
h2. Symptom

It's observed that our Spark apps occasionally got stuck with an unexpected 
stack trace when reading/parsing a time string. Note that we manually killed 
the stuck app instances and the rety goes thru on the same cluster (without any 
code change).

 

*[Stack Trace 1]* The stack trace doesn't make sense since *120s* is a 
legitimate time string, where the app runs on emr-7.0.0 with Spark 3.5.0 
runtime.
{code:java}
Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
must be specified as seconds (s), milliseconds (ms), microseconds (us), minutes 
(m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
Failed to parse time string: 120s
at org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
at 
org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
at org.apache.spark.rpc.RpcEndpointRef.(RpcEndpointRef.scala:33)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.(NettyRpcEnv.scala:533)
at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
at 
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at 

[jira] [Updated] (SPARK-47759) Apps being stuck with an unexpected stack trace when reading/parsing a time string

2024-04-08 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Description: 
h2. Symptom

It's observed that our Spark apps occasionally got stuck with an unexpected 
stack trace when reading/parsing a time string. Note that we manually killed 
the stuck app instances and the rety goes thru on the same cluster (without 
requiring any app code change).

 

*[Stack Trace 1]* The stack trace doesn't make sense since *120s* is a 
legitimate time string, where the app runs on emr-7.0.0 with Spark 3.5.0 
runtime.
{code:java}
Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
must be specified as seconds (s), milliseconds (ms), microseconds (us), minutes 
(m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
Failed to parse time string: 120s
at org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
at 
org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
at org.apache.spark.rpc.RpcEndpointRef.(RpcEndpointRef.scala:33)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.(NettyRpcEnv.scala:533)
at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
at 
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at 
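
For reference, the format named in the exception is easy to sanity-check. Below is a minimal, hypothetical parser written in the spirit of JavaUtils.timeStringAs; the regex and suffix table are illustrative assumptions rather than Spark's exact code, but they are enough to show that 120s is a well-formed time string, so the input alone does not explain the NumberFormatException above.

{code:java}
// Hedged sketch only: a simplified time-string parser loosely modeled on
// JavaUtils.timeStringAs. The pattern and suffix map are assumptions for
// illustration, not Spark's implementation.
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeStringDemo {
    private static final Pattern TIME = Pattern.compile("(-?[0-9]+)([a-z]+)?");
    private static final Map<String, TimeUnit> SUFFIXES = Map.of(
        "us", TimeUnit.MICROSECONDS,
        "ms", TimeUnit.MILLISECONDS,
        "s", TimeUnit.SECONDS,
        "m", TimeUnit.MINUTES,
        "min", TimeUnit.MINUTES,
        "h", TimeUnit.HOURS,
        "d", TimeUnit.DAYS);

    // Parse strings such as "120s", "100ms", or "2h" into whole seconds.
    static long parseAsSeconds(String str) {
        Matcher m = TIME.matcher(str.trim().toLowerCase());
        if (!m.matches()) {
            throw new NumberFormatException("Failed to parse time string: " + str);
        }
        long value = Long.parseLong(m.group(1));
        String suffix = m.group(2);
        TimeUnit unit = (suffix == null) ? TimeUnit.SECONDS : SUFFIXES.get(suffix);
        if (unit == null) {
            throw new NumberFormatException("Invalid suffix in time string: " + str);
        }
        return TimeUnit.SECONDS.convert(value, unit);
    }

    public static void main(String[] args) {
        System.out.println(parseAsSeconds("120s"));  // prints 120
    }
}
{code}

Since the string itself is clearly parseable, the later messages in this thread focus on a thread-safety problem visible in the executor thread dump rather than on the parsing logic.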

[jira] [Updated] (SPARK-47759) App being stuck with an unexpected stack trace when reading/parsing a time string

2024-04-08 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Summary: App being stuck with an unexpected stack trace when 
reading/parsing a time string  (was: A Spark app being stuck with an unexpected 
stack trace when reading/parsing a time string)

> App being stuck with an unexpected stack trace when reading/parsing a time 
> string
> -
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, pull-request-available, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> It's observed that our Spark apps occasionally got stuck with an unexpected 
> stack trace when reading/parsing a time string.
>  
> *[Stack Trace 1]* The stack trace doesn't make sense since *120s* is a 
> legitimate time string, where the app runs on emr-7.0.0 with Spark 3.5.0 
> runtime.
> {code:java}
> Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
> must be specified as seconds (s), milliseconds (ms), microseconds (us), 
> minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
> Failed to parse time string: 120s
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
> at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
> at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
> at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
> at org.apache.spark.rpc.RpcEndpointRef.(RpcEndpointRef.scala:33)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.(NettyRpcEnv.scala:533)
> at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> 

[jira] [Updated] (SPARK-47759) Apps being stuck with an unexpected stack trace when reading/parsing a time string

2024-04-08 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Summary: Apps being stuck with an unexpected stack trace when 
reading/parsing a time string  (was: App being stuck with an unexpected stack 
trace when reading/parsing a time string)

> Apps being stuck with an unexpected stack trace when reading/parsing a time 
> string
> --
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, pull-request-available, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> It's observed that our Spark apps occasionally got stuck with an unexpected 
> stack trace when reading/parsing a time string.
>  
> *[Stack Trace 1]* The stack trace doesn't make sense since *120s* is a 
> legitimate time string, where the app runs on emr-7.0.0 with Spark 3.5.0 
> runtime.
> {code:java}
> Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
> must be specified as seconds (s), milliseconds (ms), microseconds (us), 
> minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
> Failed to parse time string: 120s
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
> at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
> at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
> at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
> at org.apache.spark.rpc.RpcEndpointRef.(RpcEndpointRef.scala:33)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.(NettyRpcEnv.scala:533)
> at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> 

[jira] [Updated] (SPARK-47759) A Spark app being stuck with an unexpected stack trace when reading/parsing a time string

2024-04-08 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Description: 
h2. Symptom

It's observed that our Spark apps occasionally got stuck with an unexpected 
stack trace when reading/parsing a time string.

 

*[Stack Trace 1]* The stack trace doesn't make sense since *120s* is a 
legitimate time string, where the app runs on emr-7.0.0 with Spark 3.5.0 
runtime.
{code:java}
Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
must be specified as seconds (s), milliseconds (ms), microseconds (us), minutes 
(m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
Failed to parse time string: 120s
at org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
at 
org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
at org.apache.spark.rpc.RpcEndpointRef.(RpcEndpointRef.scala:33)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.(NettyRpcEnv.scala:533)
at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
at 
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at 

[jira] [Updated] (SPARK-47759) A Spark app being stuck with an unexpected stack trace when reading/parsing a time string

2024-04-07 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Description: 
h2. Symptom

It's observed that our Spark apps occasionally got stuck with an unexpected 
stack trace when reading/parsing a time string.

 

[Stack Trace 1] On emr-7.0.0 with Spark 3.5.0 runtime, the stack trace doesn't 
make sense since *120s* is a legitimate time string.
{code:java}
Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
must be specified as seconds (s), milliseconds (ms), microseconds (us), minutes 
(m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
Failed to parse time string: 120s
at org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
at 
org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
at org.apache.spark.rpc.RpcEndpointRef.(RpcEndpointRef.scala:33)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.(NettyRpcEnv.scala:533)
at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
at 
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at 

[jira] [Updated] (SPARK-47759) A Spark app being stuck with an unexpected stack trace when reading/parsing a time string

2024-04-07 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Summary: A Spark app being stuck with an unexpected stack trace when 
reading/parsing a time string  (was: A Spark app being stuck with an unexpected 
stack trace around parsing a time string)

> A Spark app being stuck with an unexpected stack trace when reading/parsing a 
> time string
> -
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, pull-request-available, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> It's observed that our Spark apps occasionally got stuck with a stack trace 
> indicating that a legitimate time string cannot be parsed.
>  
> [Stack Trace 1] On emr-7.0.0 with Spark 3.5.0 runtime, the stack trace 
> doesn't make sense since *120s* is a legitimate time string.
> {code:java}
> Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
> must be specified as seconds (s), milliseconds (ms), microseconds (us), 
> minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
> Failed to parse time string: 120s
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
> at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
> at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
> at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
> at org.apache.spark.rpc.RpcEndpointRef.(RpcEndpointRef.scala:33)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.(NettyRpcEnv.scala:533)
> at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> 

[jira] [Updated] (SPARK-47759) A Spark app being stuck with an unexpected stack trace around parsing a time string

2024-04-07 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Description: 
h2. Symptom

It's observed that our Spark apps occasionally got stuck with a stack trace 
indicating that a legitimate time string cannot be parsed.

 

[Stack Trace 1] On emr-7.0.0 with Spark 3.5.0 runtime, the stack trace doesn't 
make sense since *120s* is a legitimate time string.
{code:java}
Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
must be specified as seconds (s), milliseconds (ms), microseconds (us), minutes 
(m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
Failed to parse time string: 120s
at org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
at 
org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
at org.apache.spark.rpc.RpcEndpointRef.(RpcEndpointRef.scala:33)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.(NettyRpcEnv.scala:533)
at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
at 
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at 

[jira] [Updated] (SPARK-47759) A Spark app being stuck with a stack trace indicating that a legitimate time string cannot be parsed

2024-04-07 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Description: 
h2. Symptom

It's observed that our Spark apps occasionally got stuck with a stack trace 
indicating that a legitimate time string cannot be parsed.

 

[Stack Trace 1] On emr-7.0.0 with Spark 3.5.0 runtime
{code:java}
Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
must be specified as seconds (s), milliseconds (ms), microseconds (us), minutes 
(m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
Failed to parse time string: 120s
at org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
at 
org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
at org.apache.spark.rpc.RpcEndpointRef.(RpcEndpointRef.scala:33)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.(NettyRpcEnv.scala:533)
at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
at 
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at 

[jira] [Updated] (SPARK-47759) A Spark app being stuck with an unexpected stack trace around parsing a time string

2024-04-07 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Summary: A Spark app being stuck with an unexpected stack trace around 
parsing a time string  (was: A Spark app being stuck with a stack trace 
indicating that a legitimate time string cannot be parsed)

> A Spark app being stuck with an unexpected stack trace around parsing a time 
> string
> ---
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, pull-request-available, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> It's observed that our Spark apps occasionally got stuck with a stack trace 
> indicating that a legitimate time string cannot be parsed.
>  
> [Stack Trace 1] On emr-7.0.0 with Spark 3.5.0 runtime
> {code:java}
> Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
> must be specified as seconds (s), milliseconds (ms), microseconds (us), 
> minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
> Failed to parse time string: 120s
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
> at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
> at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
> at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
> at org.apache.spark.rpc.RpcEndpointRef.(RpcEndpointRef.scala:33)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.(NettyRpcEnv.scala:533)
> at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> 

[jira] [Updated] (SPARK-47759) A Spark app being stuck with a stack trace indicating that a legitimate time string cannot be parsed

2024-04-07 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Labels: hang pull-request-available stuck threadsafe  (was: hang 
infinite-loop pull-request-available race-condition stuck threadsafe)

> A Spark app being stuck with a stack trace indicating that a legitimate time 
> string cannot be parsed
> 
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, pull-request-available, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> It's observed that our Spark apps occasionally got stuck with a stack trace 
> indicating that a legitimate time string cannot be parsed.
>  
> using emr-7.0.0 with Spark 3.5.0 runtime
>  
> occasionally got stuck with the stack trace at the bottom. It doesn't make 
> sense – [the 
> code|https://code.amazon.com/packages/Aws157Spark/blobs/5412fb0590e55d635e9e31887ec5c72d10011899/--/common/utils/src/main/java/org/apache/spark/network/util/JavaUtils.java#L258,L238]
>  should be able to parse {{120s}} based on the predefined regex.
> h2. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
>  
> {quote}$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver
> {quote} * Another task that's sent to the executor but didn't get launched 
> since the single-threaded dispatcher was stuck (presumably in an "infinite 
> loop" as explained later).
>  
>  
> {quote}$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched
> {quote} * Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
>  
> {quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at 
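
The RUNNABLE frames above sit entirely inside scala.collection.mutable.HashMap, which is not safe for concurrent use; if the same map instance is mutated from the dispatcher thread and another thread at the same time, its internal table can be corrupted in a way that makes a later put or lookup spin forever, which would keep the single-threaded dispatcher stuck exactly as observed. As a minimal, hypothetical Java sketch (not Spark's actual code or its fix), the hazardous pattern and the conventional thread-safe alternative look like this:

{code:java}
// Hedged illustration of the hazard only; class and variable names are made up.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class UnsafeMapDemo {
    public static void main(String[] args) throws InterruptedException {
        // true  -> ConcurrentHashMap: safe under concurrent puts.
        // false -> plain HashMap: not thread-safe; concurrent puts are undefined
        //          behavior and have been known to corrupt the internal table so
        //          that a later put or lookup never returns, the same class of
        //          failure suggested by the stack trace above.
        final boolean useThreadSafeMap = true;
        final Map<Long, String> taskInfo = useThreadSafeMap
            ? new ConcurrentHashMap<Long, String>()
            : new HashMap<Long, String>();

        Runnable writer = () -> {
            for (long taskId = 0; taskId < 100_000; taskId++) {
                taskInfo.put(taskId, "resources-for-task-" + taskId);
            }
        };

        Thread dispatcher = new Thread(writer, "dispatcher-Executor");
        Thread other = new Thread(writer, "some-other-thread");
        dispatcher.start();
        other.start();
        dispatcher.join();
        other.join();
        System.out.println("entries: " + taskInfo.size());
    }
}
{code}

Either switching to a concurrent map or confining all mutations to the dispatcher thread would remove the hazard; which approach the actual fix takes is not spelled out in the messages quoted here.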

[jira] [Updated] (SPARK-47759) A Spark app being stuck with a stack trace indicating that a legitimate time string cannot be parsed

2024-04-07 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Summary: A Spark app being stuck with a stack trace indicating that a 
legitimate time string cannot be parsed  (was: A Spark app being stuck with a 
stacktrace indicating that a legitimate time string cannot be parsed)

> A Spark app being stuck with a stack trace indicating that a legitimate time 
> string cannot be parsed
> 
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, pull-request-available, 
> race-condition, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> It's observed that our Spark apps occasionally got stuck with a stack trace 
> indicating that a legitimate time string cannot be parsed.
>  
> using emr-7.0.0 with Spark 3.5.0 runtime
>  
> occasionally got stuck with the stack trace at the bottom. It doesn't make 
> sense – [the 
> code|https://code.amazon.com/packages/Aws157Spark/blobs/5412fb0590e55d635e9e31887ec5c72d10011899/--/common/utils/src/main/java/org/apache/spark/network/util/JavaUtils.java#L258,L238]
>  should be able to parse {{120s}} based on the predefined regex.
> h2. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
>  
> {quote}$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote} * Another task that's sent to the executor but didn't get launched 
> since the single-threaded dispatcher was stuck (presumably in an "infinite 
> loop" as explained later).
>  
>  
> {quote}$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote} * Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
>  
> {quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at 

[jira] [Updated] (SPARK-47759) A Spark app being stuck with a stacktrace indicating that a legitimate time string cannot be parsed

2024-04-07 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Description: 
h2. Symptom

It's observed that our Spark apps, running on emr-7.0.0 with the Spark 3.5.0 
runtime, occasionally get stuck with a stack trace indicating that a legitimate 
time string cannot be parsed.

The stack trace at the bottom doesn't make sense: [the 
code|https://code.amazon.com/packages/Aws157Spark/blobs/5412fb0590e55d635e9e31887ec5c72d10011899/--/common/utils/src/main/java/org/apache/spark/network/util/JavaUtils.java#L258,L238]
 should be able to parse {{120s}} based on the predefined regex.
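
Below is a minimal, self-contained sketch that approximates the kind of regex-based parsing {{JavaUtils.timeStringAs}} performs. The pattern and the unit table here are assumptions for illustration only (not the actual Spark source), but they show that a string like {{120s}} is trivially parseable, which suggests the reported NumberFormatException is a symptom of something else rather than of a genuinely malformed input.
{code:scala}
// Illustrative approximation only; the real logic lives in
// org.apache.spark.network.util.JavaUtils.timeStringAs.
import java.util.concurrent.TimeUnit
import java.util.regex.Pattern

object TimeStringSketch {
  // Assumed shape of the pattern: an integer followed by an optional unit suffix.
  private val pattern = Pattern.compile("(-?[0-9]+)([a-z]+)?")

  def timeStringAsSec(str: String): Long = {
    val m = pattern.matcher(str.toLowerCase.trim)
    if (!m.matches()) {
      throw new NumberFormatException(s"Failed to parse time string: $str")
    }
    val value = m.group(1).toLong
    val unit = Option(m.group(2)).getOrElse("s") match {
      case "s"         => TimeUnit.SECONDS
      case "ms"        => TimeUnit.MILLISECONDS
      case "us"        => TimeUnit.MICROSECONDS
      case "m" | "min" => TimeUnit.MINUTES
      case "h"         => TimeUnit.HOURS
      case "d"         => TimeUnit.DAYS
      case other       => throw new NumberFormatException(s"Invalid suffix: $other")
    }
    TimeUnit.SECONDS.convert(value, unit)
  }

  def main(args: Array[String]): Unit = {
    println(timeStringAsSec("120s"))  // prints 120
  }
}
{code}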
h2. Observations

Below is what's observed from relevant container logs and thread dump.
 * A regular task that's sent to the executor, which also reported back to the 
driver upon the task completion.

 
{quote}$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()
23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)

$zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923

$zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
4495 bytes result sent to driver}}
{quote} * Another task that's sent to the executor but didn't get launched 
since the single-threaded dispatcher was stuck (presumably in an "infinite 
loop" as explained later).

 

 
{quote}$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()

$zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924

$zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
>> note that the above command has no matching result, indicating that task 
>> 153.0 in stage 23.0 (TID 924) was never launched}}
{quote} * Thread dump shows that the dispatcher-Executor thread has the 
following stack trace.

 
{quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 tid=0x98e37800 
nid=0x1aff runnable [0x73bba000]
java.lang.Thread.State: RUNNABLE
at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.put(HashMap.scala:126)
at scala.collection.mutable.HashMap.update(HashMap.scala:131)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
Source)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)}}
{quote}
h2. Relevant code paths

Within an executor process, there's a [dispatcher 
thread|https://github.com/apache/spark/blob/1fdd46f173f7bc90e0523eb0a2d5e8e27e990102/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L170]
 dedicated to CoarseGrainedExecutorBackend (a single RPC endpoint) that 
launches tasks scheduled by the driver. Each task is run on a TaskRunner thread 
backed by a 
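
To illustrate the failure mode that the thread dump above points to, here is a minimal, self-contained sketch (stand-in names only, not Spark code) of two threads mutating an unsynchronized scala.collection.mutable.HashMap. Concurrent structural modification can corrupt the internal hash table, after which a later put or lookup may spin forever inside findEntry0, matching the RUNNABLE stack trace captured above.
{code:scala}
// Illustrative sketch only. The map below stands in for a shared, unsynchronized
// map written to by both the single dispatcher thread and TaskRunner threads.
// The corruption is probabilistic and may require many iterations to trigger.
import scala.collection.mutable

object UnsafeSharedMapSketch {
  val taskResources = mutable.HashMap.empty[Long, String]  // stand-in name

  def main(args: Array[String]): Unit = {
    val writer  = new Thread(() => for (i <- 0L until 1000000L) taskResources(i) = "resources")
    val remover = new Thread(() => for (i <- 0L until 1000000L) taskResources.remove(i))
    writer.start(); remover.start()
    writer.join(); remover.join()
    // If the internal table got corrupted, this line may never be reached,
    // or the reported size may be inconsistent.
    println(s"size = ${taskResources.size}")
  }
}
{code}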

[jira] [Updated] (SPARK-47759) A Spark app being stuck with a stacktrace indicating that a legitimate time string cannot be parsed

2024-04-07 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Description: 
h2. Symptom

It's observed that our Spark apps, running on emr-7.0.0 with the Spark 3.5.0 
runtime, occasionally got stuck with the stack trace at the bottom. It doesn't 
make sense -- [the 
code|https://code.amazon.com/packages/Aws157Spark/blobs/5412fb0590e55d635e9e31887ec5c72d10011899/--/common/utils/src/main/java/org/apache/spark/network/util/JavaUtils.java#L258,L238]
 should be able to parse {{120s}} based on the predefined regex.
h2. Observations

Below is what's observed from relevant container logs and thread dump.
 * A regular task that's sent to the executor, which also reported back to the 
driver upon the task completion.

 
{quote}$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()
23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)

$zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923

$zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
4495 bytes result sent to driver}}
{quote}
* Another task that's sent to the executor but didn't get launched since the 
single-threaded dispatcher was stuck (presumably in an "infinite loop" as 
explained later).

 

 
{quote}$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()

$zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924

$zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
>> note that the above command has no matching result, indicating that task 
>> 153.0 in stage 23.0 (TID 924) was never launched}}
{quote}
* Thread dump shows that the dispatcher-Executor thread has the following stack 
trace.

 
{quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 tid=0x98e37800 
nid=0x1aff runnable [0x73bba000]
java.lang.Thread.State: RUNNABLE
at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.put(HashMap.scala:126)
at scala.collection.mutable.HashMap.update(HashMap.scala:131)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
Source)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)}}
{quote}
h2. Relevant code paths

Within an executor process, there's a [dispatcher 

[jira] [Created] (SPARK-47759) A Spark app being stuck with a stacktrace indicating that a legitimate time string cannot be parsed

2024-04-07 Thread Bo Xiong (Jira)
Bo Xiong created SPARK-47759:


 Summary: A Spark app being stuck with a stacktrace indicating that 
a legitimate time string cannot be parsed
 Key: SPARK-47759
 URL: https://issues.apache.org/jira/browse/SPARK-47759
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.1, 3.5.0, 4.0.0
Reporter: Bo Xiong
Assignee: Bo Xiong
 Fix For: 3.4.2, 4.0.0, 3.5.1, 3.3.4


h2. Symptom

Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
last step of writing a data frame to S3 by calling {{df.write}}. Looking at the 
Spark UI, we saw that an executor process hung for over an hour. After we 
manually killed the executor process, the app succeeded. Note that the same EMR 
cluster with two worker nodes was able to run the same app without any issue 
before and after the incident.
h2. Observations

Below is what's observed from relevant container logs and thread dump.
 * A regular task that's sent to the executor, which also reported back to the 
driver upon the task completion.

{quote}$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()
23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)

$zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923

$zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
4495 bytes result sent to driver}}
{quote} * Another task that's sent to the executor but didn't get launched 
since the single-threaded dispatcher was stuck (presumably in an "infinite 
loop" as explained later).

{quote}$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()

$zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924

$zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
>> note that the above command has no matching result, indicating that task 
>> 153.0 in stage 23.0 (TID 924) was never launched}}
{quote}* Thread dump shows that the dispatcher-Executor thread has the 
following stack trace.

{quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 tid=0x98e37800 
nid=0x1aff runnable [0x73bba000]
java.lang.Thread.State: RUNNABLE
at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.put(HashMap.scala:126)
at scala.collection.mutable.HashMap.update(HashMap.scala:131)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
Source)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)}}
{quote}
h2. Relevant code paths

Within an executor process, there's a [dispatcher 

[jira] [Updated] (SPARK-47759) A Spark app being stuck with a stacktrace indicating that a legitimate time string cannot be parsed

2024-04-07 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-47759:
-
Fix Version/s: 3.5.0
   (was: 3.4.2)
   (was: 3.5.1)
   (was: 3.3.4)
Affects Version/s: (was: 3.3.1)

> A Spark app being stuck with a stacktrace indicating that a legitimate time 
> string cannot be parsed
> ---
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, pull-request-available, 
> race-condition, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung over 1 hour. After we 
> manually killed the executor process, the app succeeded. Note that the same 
> EMR cluster with two worker nodes was able to run the same app without any 
> issue before and after the incident.
> h2. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
> {quote}$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote} * Another task that's sent to the executor but didn't get launched 
> since the single-threaded dispatcher was stuck (presumably in an "infinite 
> loop" as explained later).
> {quote}$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote}* Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
> {quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> 

[jira] [Updated] (SPARK-45283) Make StatusTrackerSuite less fragile

2023-10-01 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45283:
-
Description: 
It's discovered from [Github 
Actions|https://github.com/xiongbo-sjtu/spark/actions/runs/6270601155/job/17028788767]
 that StatusTrackerSuite can run into random failures, as shown by the 
following stack trace (highlighted in red).  The proposed fix is to update the 
unit test to remove the nondeterministic behavior.
{quote}[info] StatusTrackerSuite:
[info] - basic status API usage (99 milliseconds)
[info] - getJobIdsForGroup() (56 milliseconds)
[info] - getJobIdsForGroup() with takeAsync() (48 milliseconds)
[info] - getJobIdsForGroup() with takeAsync() across multiple partitions (58 
milliseconds)
[info] - getJobIdsForTag() *** FAILED *** (10 seconds, 77 milliseconds)
{color:#FF0000}[info] The code passed to eventually never returned normally. 
Attempted 651 times over 10.00505994401 seconds. Last failure message: 
Set(3, 2, 1) was not equal to Set(1, 2). (StatusTrackerSuite.scala:148){color}
[info] org.scalatest.exceptions.TestFailedDueToTimeoutException:
[info] at 
org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219)
[info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
[info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:348)
[info] at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:347)
[info] at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457)
[info] at 
org.apache.spark.StatusTrackerSuite.$anonfun$new$21(StatusTrackerSuite.scala:148)
[info] at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
[info] at 
org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
[info] at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
[info] at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
[info] at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
[info] at 
org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:227)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info] at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:69)
[info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info] at 
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:69)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info] at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info] at scala.collection.immutable.List.foreach(List.scala:333)
[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info] at org.scalatest.Suite.run(Suite.scala:1114)
[info] at org.scalatest.Suite.run$(Suite.scala:1096)
[info] at 
org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info] at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:69)
[info] at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info] at 

[jira] [Updated] (SPARK-45283) Make StatusTrackerSuite less fragile

2023-10-01 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45283:
-
Description: 
It's discovered from [Github 
Actions|https://github.com/xiongbo-sjtu/spark/actions/runs/6270601155/job/17028788767]
 that StatusTrackerSuite can run into random failures, as shown in the 
following stack trace (highlighted in red).  The proposed fix is to update the 
unit test to remove the nondeterministic behavior.
{quote}[info] StatusTrackerSuite:
[info] - basic status API usage (99 milliseconds)
[info] - getJobIdsForGroup() (56 milliseconds)
[info] - getJobIdsForGroup() with takeAsync() (48 milliseconds)
[info] - getJobIdsForGroup() with takeAsync() across multiple partitions (58 
milliseconds)
[info] - getJobIdsForTag() *** FAILED *** (10 seconds, 77 milliseconds)
{color:#FF0000}[info] The code passed to eventually never returned normally. 
Attempted 651 times over 10.00505994401 seconds. Last failure message: 
Set(3, 2, 1) was not equal to Set(1, 2). (StatusTrackerSuite.scala:148){color}
[info] org.scalatest.exceptions.TestFailedDueToTimeoutException:
[info] at 
org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219)
[info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
[info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:348)
[info] at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:347)
[info] at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457)
[info] at 
org.apache.spark.StatusTrackerSuite.$anonfun$new$21(StatusTrackerSuite.scala:148)
[info] at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
[info] at 
org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
[info] at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
[info] at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
[info] at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
[info] at 
org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:227)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info] at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:69)
[info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info] at 
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:69)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info] at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info] at scala.collection.immutable.List.foreach(List.scala:333)
[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info] at org.scalatest.Suite.run(Suite.scala:1114)
[info] at org.scalatest.Suite.run$(Suite.scala:1096)
[info] at 
org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info] at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:69)
[info] at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info] at 

[jira] [Created] (SPARK-45283) Make StatusTrackerSuite less fragile

2023-09-22 Thread Bo Xiong (Jira)
Bo Xiong created SPARK-45283:


 Summary: Make StatusTrackerSuite less fragile
 Key: SPARK-45283
 URL: https://issues.apache.org/jira/browse/SPARK-45283
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 3.5.0, 4.0.0
Reporter: Bo Xiong


It's discovered from [Github 
Actions|https://github.com/xiongbo-sjtu/spark/actions/runs/6270601155/job/17028788767]
 that StatusTrackerSuite can run into random failures because 
FutureAction.jobIds is not a sorted sequence (by design), as shown in the 
following stack trace (highlighted in red).  The proposed fix is to update the 
unit test to remove the nondeterministic behavior.
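
As a hedged sketch of one way to keep such an assertion deterministic (not necessarily the actual patch), job ids can be compared as a set or as sorted sequences instead of in arrival order; a failing run is quoted after the sketch.
{code:scala}
// Minimal, self-contained illustration; not the SPARK-45283 patch itself.
// Because FutureAction.jobIds is not a sorted sequence by design, a test
// assertion should not depend on the order in which job ids are reported.
object OrderInsensitiveAssertionSketch {
  def main(args: Array[String]): Unit = {
    val reportedJobIds = Seq(3, 1, 2)   // order is not guaranteed
    val expectedJobIds = Seq(1, 2, 3)

    // Robust: compare as a set or as sorted sequences.
    assert(reportedJobIds.toSet == expectedJobIds.toSet)
    assert(reportedJobIds.sorted == expectedJobIds.sorted)

    // Fragile: depends on an ordering that is not guaranteed.
    // assert(reportedJobIds == expectedJobIds)
  }
}
{code}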
{quote}[info] StatusTrackerSuite:
[info] - basic status API usage (99 milliseconds)
[info] - getJobIdsForGroup() (56 milliseconds)
[info] - getJobIdsForGroup() with takeAsync() (48 milliseconds)
[info] - getJobIdsForGroup() with takeAsync() across multiple partitions (58 
milliseconds)
[info] - getJobIdsForTag() *** FAILED *** (10 seconds, 77 milliseconds)
{color:#FF0000}[info] The code passed to eventually never returned normally. 
Attempted 651 times over 10.00505994401 seconds. Last failure message: 
Set(3, 2, 1) was not equal to Set(1, 2). (StatusTrackerSuite.scala:148){color}
[info] org.scalatest.exceptions.TestFailedDueToTimeoutException:
[info] at 
org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219)
[info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
[info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:348)
[info] at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:347)
[info] at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457)
[info] at 
org.apache.spark.StatusTrackerSuite.$anonfun$new$21(StatusTrackerSuite.scala:148)
[info] at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
[info] at 
org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
[info] at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
[info] at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
[info] at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
[info] at 
org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:227)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info] at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:69)
[info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info] at 
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:69)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info] at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info] at scala.collection.immutable.List.foreach(List.scala:333)
[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info] at org.scalatest.Suite.run(Suite.scala:1114)
[info] at org.scalatest.Suite.run$(Suite.scala:1096)
[info] at 
org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info] at 

[jira] [Comment Edited] (SPARK-45227) Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an executor process randomly gets stuck

2023-09-22 Thread Bo Xiong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767843#comment-17767843
 ] 

Bo Xiong edited comment on SPARK-45227 at 9/22/23 6:12 AM:
---

I've submitted [a fix|https://github.com/apache/spark/pull/43021].  Please help 
get it merged.

If possible, please also help patch v3.3.1 and above.  Thanks!


was (Author: JIRAUSER302302):
I've submitted a fix.  Please help get it merged.

If possible, please also help patch v3.3.1 and above.  Thanks!

> Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an 
> executor process randomly gets stuck
> 
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, pull-request-available, 
> race-condition, stuck, threadsafe
> Attachments: hashtable1.png, hashtable2.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung over 1 hour. After we 
> manually killed the executor process, the app succeeded. Note that the same 
> EMR cluster with two worker nodes was able to run the same app without any 
> issue before and after the incident.
> h2. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
> {quote}$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote} * Another task that's sent to the executor but didn't get launched 
> since the single-threaded dispatcher was stuck (presumably in an "infinite 
> loop" as explained later).
> {quote}$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote}* Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
> {quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> 

[jira] [Comment Edited] (SPARK-45227) Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an executor process randomly gets stuck

2023-09-22 Thread Bo Xiong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767843#comment-17767843
 ] 

Bo Xiong edited comment on SPARK-45227 at 9/22/23 6:11 AM:
---

I've submitted a fix.  Please help get it merged.

If possible, please also help patch v3.3.1 and above.  Thanks!


was (Author: JIRAUSER302302):
I've submitted a fix.  Please help get it merged.

If possible, please also help patch v3.3.1 and above.

 

Thanks,

Bo

> Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an 
> executor process randomly gets stuck
> 
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, pull-request-available, 
> race-condition, stuck, threadsafe
> Attachments: hashtable1.png, hashtable2.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung over 1 hour. After we 
> manually killed the executor process, the app succeeded. Note that the same 
> EMR cluster with two worker nodes was able to run the same app without any 
> issue before and after the incident.
> h2. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
> {quote}$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote} * Another task that's sent to the executor but didn't get launched 
> since the single-threaded dispatcher was stuck (presumably in an "infinite 
> loop" as explained later).
> {quote}$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote}* Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
> {quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> 

[jira] [Commented] (SPARK-45227) Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an executor process randomly gets stuck

2023-09-22 Thread Bo Xiong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767843#comment-17767843
 ] 

Bo Xiong commented on SPARK-45227:
--

I've submitted a fix.  Please help get it merged.

If possible, please also help patch v3.3.1 and above.

 

Thanks,

Bo

> Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an 
> executor process randomly gets stuck
> 
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, pull-request-available, 
> race-condition, stuck, threadsafe
> Attachments: hashtable1.png, hashtable2.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung over 1 hour. After we 
> manually killed the executor process, the app succeeded. Note that the same 
> EMR cluster with two worker nodes was able to run the same app without any 
> issue before and after the incident.
> h2. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
> {quote}$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote} * Another task that's sent to the executor but didn't get launched 
> since the single-threaded dispatcher was stuck (presumably in an "infinite 
> loop" as explained later).
> {quote}$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote}* Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
> {quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> 

[jira] [Updated] (SPARK-45227) Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an executor process randomly gets stuck

2023-09-22 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45227:
-
Summary: Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend 
where an executor process randomly gets stuck  (was: Fix an issue where an 
executor process randomly gets stuck, by making 
CoarseGrainedExecutorBackend.taskResources thread-safe)
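
For context, a minimal sketch of what making {{taskResources}} thread-safe can look like is shown below. It assumes a map backed by java.util.concurrent.ConcurrentHashMap and simplifies the value type to a String; the merged patch may differ in its details.
{code:scala}
// Hedged sketch of the general remedy, not necessarily the merged change:
// back the map shared between the dispatcher thread and TaskRunner threads
// with a ConcurrentHashMap, whose individual operations are thread-safe.
import java.util.concurrent.ConcurrentHashMap

final class TaskResourcesSketch {
  // Stand-in for CoarseGrainedExecutorBackend.taskResources; the value type is
  // simplified for illustration.
  private val taskResources = new ConcurrentHashMap[Long, String]()

  def onLaunchTask(taskId: Long, resources: String): Unit =
    taskResources.put(taskId, resources)   // called from the dispatcher thread

  def onTaskFinished(taskId: Long): Unit =
    taskResources.remove(taskId)           // called from a TaskRunner thread

  def resourcesFor(taskId: Long): Option[String] =
    Option(taskResources.get(taskId))
}
{code}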

> Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an 
> executor process randomly gets stuck
> 
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, pull-request-available, 
> race-condition, stuck, threadsafe
> Attachments: hashtable1.png, hashtable2.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung over 1 hour. After we 
> manually killed the executor process, the app succeeded. Note that the same 
> EMR cluster with two worker nodes was able to run the same app without any 
> issue before and after the incident.
> h2. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
> {quote}$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote} * Another task that's sent to the executor but didn't get launched 
> since the single-threaded dispatcher was stuck (presumably in an "infinite 
> loop" as explained later).
> {quote}$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote}* Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
> {quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at 

[jira] [Updated] (SPARK-45227) Fix an issue where an executor process randomly gets stuck, by making CoarseGrainedExecutorBackend.taskResources thread-safe

2023-09-19 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45227:
-
Description: 
h2. Symptom

Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking at 
Spark UI, we saw that an executor process hung for over 1 hour. After we manually 
killed the executor process, the app succeeded. Note that the same EMR cluster 
with two worker nodes was able to run the same app without any issue before and 
after the incident.
h2. Observations

Below is what's observed from relevant container logs and thread dump.
 * A regular task that's sent to the executor, which also reported back to the 
driver upon the task completion.

{quote}{{$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()
23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)

$zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923

$zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
4495 bytes result sent to driver}}
{quote} * Another task that's sent to the executor but didn't get launched 
since the single-threaded dispatcher was stuck (presumably in an "infinite 
loop" as explained later).

{quote}{{$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()

$zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924

$zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
>> note that the above command has no matching result, indicating that task 
>> 153.0 in stage 23.0 (TID 924) was never launched}}
{quote}* Thread dump shows that the dispatcher-Executor thread has the 
following stack trace.

{quote}"dispatcher-Executor" #40 daemon prio=5 os_prio=0 tid=0x98e37800 
nid=0x1aff runnable [0x73bba000]
java.lang.Thread.State: RUNNABLE
at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.put(HashMap.scala:126)
at scala.collection.mutable.HashMap.update(HashMap.scala:131)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
Source)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)}}
{quote}
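The stack trace above shows the dispatcher thread spinning inside {{scala.collection.mutable.HashMap.put}} while the same map is also mutated from TaskRunner threads. Below is a minimal, hedged sketch (not the actual Spark code; the {{taskResources}} name and values here are illustrative stand-ins) of how two threads writing to one non-thread-safe {{mutable.HashMap}} can corrupt its internal table. The outcome is timing-dependent: runs may lose entries, throw, or leave {{put()}} walking a corrupted bucket chain without ever returning, which matches the reported hang.
{code:scala}
import scala.collection.mutable

object UnsafeSharedMapSketch {
  // Illustrative stand-in for a non-thread-safe map shared between a
  // dispatcher-like thread and a task-runner-like thread.
  private val taskResources = mutable.HashMap.empty[Long, String]

  def main(args: Array[String]): Unit = {
    // "Dispatcher": records resources as tasks are launched.
    val dispatcher = new Thread(() => {
      var taskId = 0L
      while (taskId < 1000000L) { taskResources.put(taskId, s"resources-$taskId"); taskId += 1 }
    }, "dispatcher-like")

    // "TaskRunner": removes entries as tasks complete.
    val runner = new Thread(() => {
      var taskId = 0L
      while (taskId < 1000000L) { taskResources.remove(taskId); taskId += 1 }
    }, "task-runner-like")

    dispatcher.start(); runner.start()
    dispatcher.join(); runner.join()

    // With no synchronization, concurrent resizing/relinking of the hash table
    // is undefined for this collection: the final size varies from run to run,
    // and a writer can end up traversing a corrupted bucket chain indefinitely.
    println(s"final size = ${taskResources.size}")
  }
}
{code}
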
h2. Relevant code paths

Within an executor process, there's a [dispatcher 
thread|https://github.com/apache/spark/blob/1fdd46f173f7bc90e0523eb0a2d5e8e27e990102/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L170]
 dedicated to CoarseGrainedExecutorBackend (a single RPC endpoint) that 
launches tasks scheduled by the driver. Each task is run on a TaskRunner thread 
backed by a thread pool created for the executor. The TaskRunner thread and the 
dispatcher thread are different. However, they 

[jira] [Updated] (SPARK-45227) Fix an issue where an executor process randomly gets stuck, by making CoarseGrainedExecutorBackend.taskResources thread-safe

2023-09-19 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45227:
-
   Fix Version/s: (was: 4.0.0)
  (was: 3.5.1)
Target Version/s:   (was: 3.3.1)

> Fix an issue where an executor process randomly gets stuck, by making 
> CoarseGrainedExecutorBackend.taskResources thread-safe
> 
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, race-condition, stuck, threadsafe
> Attachments: hashtable1.png, hashtable2.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung for over 1 hour. After we 
> manually killed the executor process, the app succeeded. Note that the same 
> EMR cluster with two worker nodes was able to run the same app without any 
> issue before and after the incident.
> h2. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
> {quote}{{$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote}
> * Another task that's sent to the executor but didn't get launched since the 
> single-threaded dispatcher was stuck (presumably in an "infinite loop" as 
> explained later).
> {quote}{{$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote} * Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
> {quote}{{"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> 

[jira] [Updated] (SPARK-45227) Fix an issue where an executor process randomly gets stuck, by making CoarseGrainedExecutorBackend.taskResources thread-safe

2023-09-19 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45227:
-
Summary: Fix an issue where an executor process randomly gets stuck, by 
making CoarseGrainedExecutorBackend.taskResources thread-safe  (was: Fix an 
issue where an executor process randomly gets stuck by making 
CoarseGrainedExecutorBackend.taskResources thread-safe)

> Fix an issue where an executor process randomly gets stuck, by making 
> CoarseGrainedExecutorBackend.taskResources thread-safe
> 
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, race-condition, stuck, threadsafe
> Fix For: 4.0.0, 3.5.1
>
> Attachments: hashtable1.png, hashtable2.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung for over 1 hour. After we 
> manually killed the executor process, the app succeeded. Note that the same 
> EMR cluster with two worker nodes was able to run the same app without any 
> issue before and after the incident.
> h2. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
> {quote}{{$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote}
> * Another task that's sent to the executor but didn't get launched since the 
> single-threaded dispatcher was stuck (presumably in an "infinite loop" as 
> explained later).
> {quote}{{$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote} * Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
> {quote}{{"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at 

[jira] [Updated] (SPARK-45227) Fix an issue where an executor process randomly gets stuck by making CoarseGrainedExecutorBackend.taskResources thread-safe

2023-09-19 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45227:
-
Description: 
h2. Symptom

Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking at 
Spark UI, we saw that an executor process hung for over 1 hour. After we manually 
killed the executor process, the app succeeded. Note that the same EMR cluster 
with two worker nodes was able to run the same app without any issue before and 
after the incident.
h2. Observations

Below is what's observed from relevant container logs and thread dump.
 * A regular task that's sent to the executor, which also reported back to the 
driver upon the task completion.

{quote}{{$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()
23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)

$zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923

$zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
4495 bytes result sent to driver}}
{quote}
* Another task that's sent to the executor but didn't get launched since the 
single-threaded dispatcher was stuck (presumably in an "infinite loop" as 
explained later).
{quote}{{$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()

$zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924

$zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
>> note that the above command has no matching result, indicating that task 
>> 153.0 in stage 23.0 (TID 924) was never launched}}
{quote} * Thread dump shows that the dispatcher-Executor thread has the 
following stack trace.

{quote}{{"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
java.lang.Thread.State: RUNNABLE
at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.put(HashMap.scala:126)
at scala.collection.mutable.HashMap.update(HashMap.scala:131)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
Source)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)}}
{quote}
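A dump like the one above can also be spot-checked programmatically. The sketch below is only a hedged diagnostic aid (not part of Spark): it flags any RUNNABLE thread whose top frames sit inside {{scala.collection.mutable}} hash-table code, which is the signature shown by the stuck dispatcher-Executor thread.
{code:scala}
object StuckHashMapProbe {
  def main(args: Array[String]): Unit = {
    // Thread.getAllStackTraces returns java.util.Map[Thread, Array[StackTraceElement]].
    Thread.getAllStackTraces.forEach { (thread, frames) =>
      val inMutableHash = frames.take(5).exists { frame =>
        frame.getClassName.startsWith("scala.collection.mutable.Hash")
      }
      if (thread.getState == Thread.State.RUNNABLE && inMutableHash) {
        val top = frames.headOption.map(_.toString).getOrElse("<no frames>")
        println(s"suspicious thread '${thread.getName}' is RUNNABLE at $top")
      }
    }
  }
}
{code}
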
h2. Relevant code paths

Within an executor process, there's a [dispatcher 
thread|https://github.com/apache/spark/blob/1fdd46f173f7bc90e0523eb0a2d5e8e27e990102/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L170]
 dedicated to CoarseGrainedExecutorBackend (a single RPC endpoint) that 
launches tasks scheduled by the driver. Each task is run on a TaskRunner thread 
backed by a thread pool created for the executor. The TaskRunner thread and the 
dispatcher thread are different. 
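
As a hedged illustration of the direction the issue title points to (making {{taskResources}} thread-safe), a map that is written by the dispatcher thread on task launch and by TaskRunner threads on task completion could be backed by a concurrent map. This is a sketch only; the class and value types below are illustrative, and the actual Spark change may differ in detail.
{code:scala}
import java.util.concurrent.ConcurrentHashMap

// Illustrative sketch: a thread-safe holder for per-task resources that is
// written by the dispatcher thread (on task launch) and by TaskRunner threads
// (on task completion). Not the actual Spark source.
class TaskResourceRegistry {
  private val taskResources = new ConcurrentHashMap[Long, Map[String, String]]()

  // Dispatcher thread: remember the resources assigned to a task.
  def register(taskId: Long, resources: Map[String, String]): Unit =
    taskResources.put(taskId, resources)

  // TaskRunner thread: release and return the resources when the task finishes.
  def release(taskId: Long): Option[Map[String, String]] =
    Option(taskResources.remove(taskId))
}
{code}
With a concurrent (or otherwise synchronized) map, the dispatcher's put and the TaskRunner's remove can interleave without corrupting the underlying table, so the dispatcher loop cannot get wedged inside the map itself.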

[jira] [Updated] (SPARK-45227) Fix an issue where an executor process randomly gets stuck by making CoarseGrainedExecutorBackend.taskResources thread-safe

2023-09-19 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45227:
-
Description: 
h3. Symptom

Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking at 
Spark UI, we saw that an executor process hung for over 1 hour. After we manually 
killed the executor process, the app succeeded.

Note that the same EMR cluster with two worker nodes was able to run the same 
app without any issue before and after the incident.
h3. Observations

Below is what's observed from relevant container logs and thread dump.
 * A regular task that's sent to the executor, which also reported back to the 
driver upon the task completion.

 
{quote}{{$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()
23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)

$zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923

$zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
4495 bytes result sent to driver}}
{quote} * Another task that's sent to the executor but didn't get launched 
since the single-threaded dispatcher was stuck (presumably in an "infinite 
loop" as explained later).

 

 
{quote}{{$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()

$zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924

$zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
>> note that the above command has no matching result, indicating that task 
>> 153.0 in stage 23.0 (TID 924) was never launched}}
{quote}
* Thread dump shows that the dispatcher-Executor thread has the following stack 
trace.

 

 
{quote}{{"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
java.lang.Thread.State: RUNNABLE
at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.put(HashMap.scala:126)
at scala.collection.mutable.HashMap.update(HashMap.scala:131)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
Source)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)}}
{quote} * Relevant code paths

 
{quote}Within an executor process, there's a [dispatcher 
thread|https://github.com/apache/spark/blob/1fdd46f173f7bc90e0523eb0a2d5e8e27e990102/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L170]
 dedicated to CoarseGrainedExecutorBackend (a single RPC endpoint) that 
launches tasks scheduled by the driver. Each task is run on a TaskRunner thread 
backed by a thread pool created for the executor. The TaskRunner thread and the 
dispatcher thread 

[jira] [Updated] (SPARK-45227) Fix an issue where an executor process randomly gets stuck by making CoarseGrainedExecutorBackend.taskResources thread-safe

2023-09-19 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45227:
-
Attachment: hashtable1.png

> Fix an issue where an executor process randomly gets stuck by making 
> CoarseGrainedExecutorBackend.taskResources thread-safe
> ---
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, race-condition, stuck, threadsafe
> Fix For: 4.0.0, 3.5.1
>
> Attachments: hashtable1.png, hashtable2.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h3. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung for over 1 hour. After we 
> manually killed the executor process, the app succeeded.
> Note that the same EMR cluster with two worker nodes was able to run the same 
> app without any issue before and after the incident.
> h3. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
>  
> {quote}{{$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote}
> * Another task that's sent to the executor but didn't get launched since the 
> single-threaded dispatcher was stuck (presumably in an "infinite loop" as 
> explained later).
>  
>  
> {quote}{{$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote} * Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
>  
> {quote}{{"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
> 

[jira] [Updated] (SPARK-45227) Fix an issue where an executor process randomly gets stuck by making CoarseGrainedExecutorBackend.taskResources thread-safe

2023-09-19 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45227:
-
Attachment: hashtable2.png

> Fix an issue where an executor process randomly gets stuck by making 
> CoarseGrainedExecutorBackend.taskResources thread-safe
> ---
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, race-condition, stuck, threadsafe
> Fix For: 4.0.0, 3.5.1
>
> Attachments: hashtable1.png, hashtable2.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h3. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung for over 1 hour. After we 
> manually killed the executor process, the app succeeded.
> Note that the same EMR cluster with two worker nodes was able to run the same 
> app without any issue before and after the incident.
> h3. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
>  
> {quote}{{$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote}
> * Another task that's sent to the executor but didn't get launched since the 
> single-threaded dispatcher was stuck (presumably in an "infinite loop" as 
> explained later).
>  
>  
> {quote}{{$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote} * Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
>  
> {quote}{{"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
> 

[jira] [Updated] (SPARK-45227) Fix an issue where an executor process randomly gets stuck by making CoarseGrainedExecutorBackend.taskResources thread-safe

2023-09-19 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45227:
-
Attachment: (was: Screenshot 2023-09-19 at 7.55.37 PM.png)

> Fix an issue where an executor process randomly gets stuck by making 
> CoarseGrainedExecutorBackend.taskResources thread-safe
> ---
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, race-condition, stuck, threadsafe
> Fix For: 4.0.0, 3.5.1
>
> Attachments: hashtable1.png, hashtable2.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h3. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung for over 1 hour. After we 
> manually killed the executor process, the app succeeded.
> Note that the same EMR cluster with two worker nodes was able to run the same 
> app without any issue before and after the incident.
> h3. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
>  
> {quote}{{$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote}
> * Another task that's sent to the executor but didn't get launched since the 
> single-threaded dispatcher was stuck (presumably in an "infinite loop" as 
> explained later).
>  
>  
> {quote}{{$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote} * Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
>  
> {quote}{{"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> 

[jira] [Updated] (SPARK-45227) Fix an issue where an executor process randomly gets stuck by making CoarseGrainedExecutorBackend.taskResources thread-safe

2023-09-19 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45227:
-
Attachment: (was: Screenshot 2023-09-19 at 7.55.31 PM.png)

> Fix an issue where an executor process randomly gets stuck by making 
> CoarseGrainedExecutorBackend.taskResources thread-safe
> ---
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, race-condition, stuck, threadsafe
> Fix For: 4.0.0, 3.5.1
>
> Attachments: hashtable1.png, hashtable2.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h3. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung for over 1 hour. After we 
> manually killed the executor process, the app succeeded.
> Note that the same EMR cluster with two worker nodes was able to run the same 
> app without any issue before and after the incident.
> h3. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
>  
> {quote}{{$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote}
> * Another task that's sent to the executor but didn't get launched since the 
> single-threaded dispatcher was stuck (presumably in an "infinite loop" as 
> explained later).
>  
>  
> {quote}{{$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote} * Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
>  
> {quote}{{"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> 

[jira] [Updated] (SPARK-45227) Fix an issue where an executor process randomly gets stuck by making CoarseGrainedExecutorBackend.taskResources thread-safe

2023-09-19 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45227:
-
Description: 
h3. Symptom

Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking at 
Spark UI, we saw that an executor process hung for over 1 hour. After we manually 
killed the executor process, the app succeeded.

Note that the same EMR cluster with two worker nodes was able to run the same 
app without any issue before and after the incident.
h3. Observations

Below is what's observed from relevant container logs and thread dump.
 * A regular task that's sent to the executor, which also reported back to the 
driver upon the task completion.

 
{quote}{{$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()
23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)

$zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923

$zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
4495 bytes result sent to driver}}
{quote}
* Another task that's sent to the executor but didn't get launched since the 
single-threaded dispatcher was stuck (presumably in an "infinite loop" as 
explained later).

 

 
{quote}{{$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()

$zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924

$zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
>> note that the above command has no matching result, indicating that task 
>> 153.0 in stage 23.0 (TID 924) was never launched}}
{quote} * Thread dump shows that the dispatcher-Executor thread has the 
following stack trace.

 
{quote}{{"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
java.lang.Thread.State: RUNNABLE
at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.put(HashMap.scala:126)
at scala.collection.mutable.HashMap.update(HashMap.scala:131)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
Source)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)}}
{quote}
* Relevant code paths

 
{quote}Within an executor process, there's a [dispatcher 
thread|https://github.com/apache/spark/blob/1fdd46f173f7bc90e0523eb0a2d5e8e27e990102/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L170]
 dedicated to CoarseGrainedExecutorBackend (a single RPC endpoint) that 
launches tasks scheduled by the driver. Each task is run on a TaskRunner thread 
backed by a thread pool created for the executor. The TaskRunner thread and the 
dispatcher thread are 

[jira] [Updated] (SPARK-45227) Fix an issue where an executor process randomly gets stuck by making CoarseGrainedExecutorBackend.taskResources thread-safe

2023-09-19 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45227:
-
Attachment: Screenshot 2023-09-19 at 7.55.37 PM.png

> Fix an issue where an executor process randomly gets stuck by making 
> CoarseGrainedExecutorBackend.taskResources thread-safe
> ---
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, race-condition, stuck, threadsafe
> Fix For: 4.0.0, 3.5.1
>
> Attachments: Screenshot 2023-09-19 at 7.55.31 PM.png, Screenshot 
> 2023-09-19 at 7.55.37 PM.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h3. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung for over 1 hour. After we 
> manually killed the executor process, the app succeeded.
> Note that the same EMR cluster with two worker nodes was able to run the same 
> app without any issue before and after the incident.
> h3. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
>  
> {quote}{{$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote}
> * Another task that's sent to the executor but didn't get launched since the 
> single-threaded dispatcher was stuck (presumably in an "infinite loop" as 
> explained later).
>  
>  
> {quote}{{$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote} * Thread dump shows that the dispatcher-Executor thread has the 
> following stack trace.
>  
> {quote}{{"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> 

[jira] [Updated] (SPARK-45227) Fix an issue where an executor process randomly gets stuck by making CoarseGrainedExecutorBackend.taskResources thread-safe

2023-09-19 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45227:
-
Attachment: Screenshot 2023-09-19 at 7.55.31 PM.png

> Fix an issue where an executor process randomly gets stuck by making 
> CoarseGrainedExecutorBackend.taskResources thread-safe
> ---
>
> Key: SPARK-45227
> URL: https://issues.apache.org/jira/browse/SPARK-45227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Priority: Critical
>  Labels: hang, infinite-loop, race-condition, stuck, threadsafe
> Fix For: 4.0.0, 3.5.1
>
> Attachments: Screenshot 2023-09-19 at 7.55.31 PM.png, Screenshot 
> 2023-09-19 at 7.55.37 PM.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h3. Symptom
> Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
> last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking 
> at Spark UI, we saw that an executor process hung for over 1 hour. After we 
> manually killed the executor process, the app succeeded.
> Note that the same EMR cluster with two worker nodes was able to run the same 
> app without any issue before and after the incident.
> h3. Observations
> Below is what's observed from relevant container logs and thread dump.
>  * A regular task that's sent to the executor, which also reported back to 
> the driver upon the task completion.
>  
> {quote}{{$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
> 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
> 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)
> $zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923
> $zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
> 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
> 4495 bytes result sent to driver}}
> {quote}
> * Another task that's sent to the executor but didn't get launched since the 
> single-threaded dispatcher was stuck (presumably in an "infinite loop" as 
> explained later).
>  
>  
> {quote}{{$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
> 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
> 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 
> 4432 bytes) taskResourceAssignments Map()
> $zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
> 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924
> $zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
> >> note that the above command has no matching result, indicating that task 
> >> 153.0 in stage 23.0 (TID 924) was never launched}}
> {quote} * The thread dump shows that the dispatcher-Executor thread has the 
> following stack trace (a minimal sketch of how concurrent, unsynchronized 
> HashMap updates can produce this kind of spin appears after the quoted stack 
> trace).
>  
> {quote}{{"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
> tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
> java.lang.Thread.State: RUNNABLE
> at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
> at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
> at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
> at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
> at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
> at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
> at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
> at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
> at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
> at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.put(HashMap.scala:126)
> at scala.collection.mutable.HashMap.update(HashMap.scala:131)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
> at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at 
> org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
> Source)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> 
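
The stack trace above is consistent with an unsynchronized {{scala.collection.mutable.HashMap}} whose internal bucket chains were corrupted by concurrent writes: once a chain contains a cycle, {{findEntry0}}/{{put}} can traverse it forever. The toy program below (not Spark code; names and counts are illustrative) hammers one shared, unsynchronized map from two threads. Because this is a race, it may take several runs to show lost updates or a writer that never returns.

{code:scala}
// Minimal reproduction sketch (not Spark code): two threads mutating an
// unsynchronized scala.collection.mutable.HashMap. A lost update during a
// concurrent insert/resize can corrupt the bucket chains, after which a
// later put() may spin inside the same frames seen in the stack trace above.
import scala.collection.mutable

object UnsafeHashMapDemo {
  def main(args: Array[String]): Unit = {
    val m = mutable.HashMap.empty[Long, String]   // shared, unsynchronized

    def writer(offset: Long): Thread = new Thread(() => {
      var i = 0L
      while (i < 1000000L) {                      // hammer the map with puts
        m.put(offset + i, "v")
        i += 1
      }
    })

    val t1 = writer(0L)
    val t2 = writer(1L << 32)
    t1.start(); t2.start()
    t1.join(30000); t2.join(30000)                // give up waiting after 30s

    if (t1.isAlive || t2.isAlive)
      println("a writer appears stuck inside HashMap internals")
    else
      println(s"finished; size=${m.size} (often != 2000000 due to lost updates)")
  }
}
{code}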

[jira] [Updated] (SPARK-45227) Fix an issue where an executor process randomly gets stuck by making CoarseGrainedExecutorBackend.taskResources thread-safe

2023-09-19 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45227:
-
Description: 
h3. Symptom

Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking at 
the Spark UI, we saw that an executor process hung for over an hour. After we 
manually killed the executor process, the app succeeded.

Note that the same EMR cluster with two worker nodes was able to run the same 
app without any issue before and after the incident.
h3. Observations

Below is what we observed from the relevant container logs and a thread dump.
 * A regular task was sent to the executor, ran, and reported back to the 
driver upon completion.

{quote}{{$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()
23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)

$zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923

$zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
4495 bytes result sent to driver}}
{quote} * Another task was sent to the executor but never got launched because 
the single-threaded dispatcher was stuck (presumably in an "infinite loop", as 
explained later).
{quote}{{$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()

$zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924

$zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
>> note that the above command has no matching result, indicating that task 
>> 153.0 in stage 23.0 (TID 924) was never launched}}
{quote}
 * The thread dump shows that the dispatcher-Executor thread has the following 
stack trace.
{quote}{{"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
java.lang.Thread.State: RUNNABLE
at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.put(HashMap.scala:126)
at scala.collection.mutable.HashMap.update(HashMap.scala:131)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
Source)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)}}
{quote}
 * Relevant code paths
{quote}Within an executor process, there's a [dispatcher 
thread|https://github.com/apache/spark/blob/1fdd46f173f7bc90e0523eb0a2d5e8e27e990102/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L170]
 dedicated to CoarseGrainedExecutorBackend (a single RPC endpoint) that 
launches tasks scheduled by the driver. Each task is run on a TaskRunner thread 
backed by a thread pool created for the executor. The TaskRunner thread and the 
dispatcher thread are different. 
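
The issue title proposes making {{CoarseGrainedExecutorBackend.taskResources}} thread-safe. Below is a minimal sketch of one way to do that, backing the per-task state with {{java.util.concurrent.ConcurrentHashMap}} so the dispatcher thread and TaskRunner threads can update it without external locking. The class and method names here are illustrative placeholders; the actual Spark patch may differ in detail.

{code:scala}
// Hedged sketch of a thread-safe task-resources map. All names below
// (ExecutorBackendSketch, TaskResources, onLaunchTask, onTaskFinished)
// are illustrative, not the exact Spark code.
import java.util.concurrent.ConcurrentHashMap

final case class TaskResources(addresses: Map[String, Seq[String]])

final class ExecutorBackendSketch {
  // Safe for concurrent put/get/remove without external locking.
  private val taskResources = new ConcurrentHashMap[Long, TaskResources]()

  // Called on the single dispatcher thread when a LaunchTask message arrives.
  def onLaunchTask(taskId: Long, resources: TaskResources): Unit =
    taskResources.put(taskId, resources)

  // Called on a TaskRunner thread once the task finishes or fails.
  def onTaskFinished(taskId: Long): Unit =
    taskResources.remove(taskId)

  def resourcesFor(taskId: Long): Option[TaskResources] =
    Option(taskResources.get(taskId))
}
{code}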

[jira] [Updated] (SPARK-45227) Fix an issue where an executor process randomly gets stuck by making CoarseGrainedExecutorBackend.taskResources thread-safe

2023-09-19 Thread Bo Xiong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-45227:
-
Fix Version/s: 4.0.0
  Description: 
h3. Symptom

Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking at 
the Spark UI, we saw that an executor process hung for over an hour. After we 
manually killed the executor process, the app succeeded.

Note that the same EMR cluster with two worker nodes was able to run the same 
app without any issue before and after the incident.
h3. Observations

Below is what we observed from the relevant container logs and a thread dump.
 * A regular task was sent to the executor, ran, and reported back to the 
driver upon completion.

{quote}{{$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()
23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)

$zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923

$zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
4495 bytes result sent to driver}}
{quote}
* Another task was sent to the executor but never got launched because the 
single-threaded dispatcher was stuck (presumably in an "infinite loop", as 
explained later); a toy sketch of why one wedged handler starves every later 
message appears after this message.
{quote}{{$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()

$zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924

$zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
>> note that the above command has no matching result, indicating that task 
>> 153.0 in stage 23.0 (TID 924) was never launched}}
{quote}
* The thread dump shows that the dispatcher-Executor thread has the following 
stack trace.
{quote}{{"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
java.lang.Thread.State: RUNNABLE
at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.put(HashMap.scala:126)
at scala.collection.mutable.HashMap.update(HashMap.scala:131)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
Source)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)}}
{quote}
* Relevant code paths
{quote}Within an executor process, there's a [dispatcher 
thread|https://github.com/apache/spark/blob/1fdd46f173f7bc90e0523eb0a2d5e8e27e990102/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L170]
 dedicated to CoarseGrainedExecutorBackend (a single RPC endpoint) that 
launches tasks scheduled by the driver. Each task is run on a TaskRunner thread 
backed by a thread pool created for the executor. The TaskRunner thread and the 
dispatcher 
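
To make the "single-threaded dispatcher" observation concrete, here is a toy message loop (deliberately not Spark's Inbox/MessageLoop; all names are illustrative) showing why one handler that never returns starves every later message. This matches the executor log above, where "Got assigned task 924" appears but "Running task 153.0 ... (TID 924)" never does.

{code:scala}
// Toy single-threaded dispatcher: one thread drains an inbox and handles
// messages in order. If one handler spins forever (here, a stand-in for a
// put() into a corrupted HashMap), no later message is ever processed.
import java.util.concurrent.LinkedBlockingQueue

object SingleThreadedDispatcherSketch {
  final case class LaunchTask(taskId: Long)

  private val inbox = new LinkedBlockingQueue[LaunchTask]()

  def main(args: Array[String]): Unit = {
    val dispatcher = new Thread(() => {
      while (true) {
        val msg = inbox.take()
        println(s"Got assigned task ${msg.taskId}")
        while (true) {}   // stand-in for the never-returning HashMap.put
        // "Running task ..." would be logged here, but is never reached
      }
    })
    dispatcher.setDaemon(true)
    dispatcher.start()

    inbox.put(LaunchTask(924L))   // handler wedges while recording this task
    inbox.put(LaunchTask(925L))   // illustrative follow-up task, never handled
    Thread.sleep(2000)
    println("the second LaunchTask was posted but never handled")
  }
}
{code}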

[jira] [Created] (SPARK-45227) Fix an issue where an executor process randomly gets stuck by making CoarseGrainedExecutorBackend.taskResources thread-safe

2023-09-19 Thread Bo Xiong (Jira)
Bo Xiong created SPARK-45227:


 Summary: Fix an issue where an executor process randomly gets 
stuck by making CoarseGrainedExecutorBackend.taskResources thread-safe
 Key: SPARK-45227
 URL: https://issues.apache.org/jira/browse/SPARK-45227
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.0, 3.3.1, 4.0.0
Reporter: Bo Xiong
 Fix For: 3.5.1


h3. Symptom

Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very 
last step of writing a data frame to S3 by calling {{{}df.write{}}}. Looking at 
the Spark UI, we saw that an executor process hung for over an hour. After we 
manually killed the executor process, the app succeeded.

Note that the same EMR cluster with two worker nodes was able to run the same 
app without any issue before and after the incident.
h3. Observations

Below is what we observed from the relevant container logs and a thread dump (a 
small helper sketch for this kind of log check appears after this message).
 * A regular task was sent to the executor, ran, and reported back to the 
driver upon completion.

{quote}{{$zgrep 'task 150' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 
923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()
23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 
923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)

$zgrep 'task 923' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923

$zgrep 'task 150' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 
4495 bytes result sent to driver}}
{quote} * Another task was sent to the executor but never got launched because 
the single-threaded dispatcher was stuck (presumably in an "infinite loop", as 
explained later).

{quote}{{$zgrep 'task 153' container_1694029806204_12865_01_01/stderr.gz
23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 
924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 4432 
bytes) taskResourceAssignments Map()

$zgrep ' 924' container_1694029806204_12865_01_04/stderr.gz
23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924

$zgrep 'task 153' container_1694029806204_12865_01_04/stderr.gz
>> note that the above command has no matching result, indicating that task 
>> 153.0 in stage 23.0 (TID 924) was never launched}}
{quote} * The thread dump shows that the dispatcher-Executor thread has the 
following stack trace.

{quote}{{"dispatcher-Executor" #40 daemon prio=5 os_prio=0 
tid=0x98e37800 nid=0x1aff runnable [0x73bba000]
   java.lang.Thread.State: RUNNABLE
at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
at 
scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
at 
scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.put(HashMap.scala:126)
at scala.collection.mutable.HashMap.update(HashMap.scala:131)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at 
org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown 
Source)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)}}
{quote} *
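
For the kind of log correlation done manually with zgrep above, here is a small helper sketch (not part of Spark) that scans an unzipped executor stderr log and reports TIDs that were "Got assigned" but never logged "Running task", which is the signature of a launch request stuck behind a wedged dispatcher. The log-line formats are assumed from the snippets quoted in this issue.

{code:scala}
// Hedged helper sketch: report task IDs that an executor acknowledged
// ("Got assigned task N") but never started ("Running task ... (TID N)").
// Usage (after compiling): StuckTaskFinder <unzipped executor stderr log>
import scala.io.Source

object StuckTaskFinder {
  private val Assigned = """.*Got assigned task (\d+)""".r
  private val Running  = """.*Running task \S+ in stage \S+ \(TID (\d+)\)""".r

  def main(args: Array[String]): Unit = {
    val src = Source.fromFile(args(0))
    val (assigned, running) =
      try {
        val lines = src.getLines().toVector
        (lines.collect { case Assigned(id) => id.toLong }.toSet,
         lines.collect { case Running(id)  => id.toLong }.toSet)
      } finally src.close()

    val stuck = assigned -- running
    if (stuck.isEmpty) println("every assigned task was launched")
    else println(s"assigned but never launched: ${stuck.toList.sorted.mkString(", ")}")
  }
}
{code}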