[ https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835587#comment-17835587 ]

Mridul Muralidharan edited comment on SPARK-47759 at 4/10/24 4:08 AM:
----------------------------------------------------------------------

In order to validate, I would suggest two things.
Wrap the input str (or lower) in quotes in the exception message. For example, 
"120s\u00A0" will look like 120s in the exception message because \u00A0 is a 
Unicode non-breaking space.
The other would be to include the NumberFormatException 'e' as the cause of the 
exception being thrown.

Once you are able to get a stack trace with these two changes in place, it 
should help us debug this better.
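As a minimal sketch of why quoting matters (the regex here is an assumption approximating Spark's actual TIME_PATTERN in JavaUtils, not the real pattern), a trailing U+00A0 survives String.trim() and defeats the match, while printing identically to "120s" in an unquoted message:

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class TimeStringDemo {
    // Assumption: an approximation of Spark's TIME_PATTERN, not the exact
    // pattern used in org.apache.spark.network.util.JavaUtils.
    private static final Pattern TIME_PATTERN = Pattern.compile("(-?[0-9]+)([a-z]+)?");

    public static boolean matches(String str) {
        // String.trim() only strips chars <= U+0020, so U+00A0 survives it.
        String lower = str.toLowerCase(Locale.ROOT).trim();
        return TIME_PATTERN.matcher(lower).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("120s"));        // true
        System.out.println(matches("120s\u00A0"));  // false: stray non-breaking space
        // Unquoted, both inputs render the same in a log line; quoting the raw
        // input makes the trailing character visible.
        System.out.println("Failed to parse time string: \"120s\u00A0\"");
    }
}
```

Quoting the raw string in the message is what would let us spot such an invisible character directly in the reported stack traces.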



> Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate 
> time string
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-47759
>                 URL: https://issues.apache.org/jira/browse/SPARK-47759
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.0, 3.5.1
>            Reporter: Bo Xiong
>            Assignee: Bo Xiong
>            Priority: Critical
>              Labels: hang, pull-request-available, stuck, threadsafe
>             Fix For: 3.5.0, 4.0.0, 3.5.1, 3.5.2
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> It's observed that our Spark apps occasionally get stuck with an unexpected 
> stack trace when reading/parsing a legitimate time string. Note that we 
> manually killed the stuck app instances, and the retry went through on the 
> same cluster (without requiring any app code change).
>  
> *[Stack Trace 1]* The stack trace doesn't make sense since *120s* is a 
> legitimate time string, where the app runs on emr-7.0.0 with Spark 3.5.0 
> runtime.
> {code:java}
> Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
> must be specified as seconds (s), milliseconds (ms), microseconds (us), 
> minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
> Failed to parse time string: 120s
>     at 
> org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
>     at 
> org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
>     at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
>     at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
>     at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
>     at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:33)
>     at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533)
>     at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
>     at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
>     at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
>     at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
>     at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
>     at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
>     at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
>     at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at 
> org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>     at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
>     at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
>     at 
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Thread.java:840) {code}
>  
> *[Stack Trace 2]* The stack trace is slightly different, but it's also about 
> reading/parsing a time string on the line "timeout = \{ 
> Utils.timeStringAsSeconds(finalProp._2).seconds }" in RpcTimeout.scala, 
> where the app runs on emr-7.0.0 with the Spark 3.5.0 runtime.
> {code:java}
> 24/04/02 14:20:14 ERROR TransportRequestHandler: Error while invoking 
> RpcHandler#receive() for one-way message.
> java.lang.NullPointerException: Cannot invoke 
> "scala.concurrent.duration.Duration$.apply(long, 
> java.util.concurrent.TimeUnit)" because 
> "scala.concurrent.duration.Duration$.MODULE$" is null
>     at 
> scala.concurrent.duration.package$DurationLong$.durationIn$extension(package.scala:60)
>  ~[scala-library-2.12.17.jar:?]
>     at 
> scala.concurrent.duration.package$DurationLong.durationIn(package.scala:60) 
> ~[scala-library-2.12.17.jar:?]
>     at 
> scala.concurrent.duration.DurationConversions.seconds(DurationConversions.scala:37)
>  ~[scala-library-2.12.17.jar:?]
>     at 
> scala.concurrent.duration.DurationConversions.seconds$(DurationConversions.scala:37)
>  ~[scala-library-2.12.17.jar:?]
>     at 
> scala.concurrent.duration.package$DurationLong.seconds(package.scala:59) 
> ~[scala-library-2.12.17.jar:?]
>     at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131) 
> ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
>     at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41) 
> ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
>     at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:33) 
> ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
>     at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533) 
> ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
>     at 
> org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640) 
> ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
>     at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
>  ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
>     at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:689) 
> ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
>     at 
> org.apache.spark.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:66)
>  ~[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
>     at 
> org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:274)
>  [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
>     at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
>  [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
>     at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
>  [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
>     at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
>  [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
>     at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
>  [netty-handler-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>  [netty-codec-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
>  [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
>  [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) 
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
>  [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650) 
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) 
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
>  [netty-common-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
> [netty-common-4.1.96.Final.jar:4.1.96.Final]
>     at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>  [netty-common-4.1.96.Final.jar:4.1.96.Final]
>     at java.lang.Thread.run(Thread.java:840) [?:?] {code}
>  
> *[Stack Trace 3]* The stack trace doesn't make sense since *3s* is a 
> legitimate time string, where the app runs on emr-6.9.0 with Spark 3.3.0 
> runtime. Given that RpcUtils$.retryWaitMs has been refactored away in [this 
> commit|https://github.com/apache/spark/commit/e7b8be37226142cd7fcfac82576c46f8f88f74b3],
>  this is out of scope for the PR.
> {code:java}
> 24/03/01 21:04:24 ERROR TransportRequestHandler: Error while invoking 
> RpcHandler#receive() on RPC id 7358995660930182634
> java.lang.NumberFormatException: Time must be specified as seconds (s), 
> milliseconds (ms), microseconds (us), minutes (m or min), hour (h), or day 
> (d). E.g. 50s, 100ms, or 250us.
> Failed to parse time string: 3s
>     at 
> org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:252) 
> ~[spark-network-common_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
>     at 
> org.apache.spark.internal.config.ConfigHelpers$.timeFromString(ConfigBuilder.scala:56)
>  ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
>     at 
> org.apache.spark.internal.config.ConfigBuilder.$anonfun$timeConf$1(ConfigBuilder.scala:256)
>  ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
>     at 
> org.apache.spark.internal.config.ConfigBuilder.$anonfun$timeConf$1$adapted(ConfigBuilder.scala:256)
>  ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
>     at 
> org.apache.spark.internal.config.ConfigEntryWithDefaultString.readFrom(ConfigEntry.scala:206)
>  ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
>     at org.apache.spark.SparkConf.get(SparkConf.scala:261) 
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
>     at org.apache.spark.util.RpcUtils$.retryWaitMs(RpcUtils.scala:46) 
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
>     at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:34) 
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
>     at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533) 
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
>     at 
> org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640) 
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
>     at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
>  ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
>     at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682) 
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
>     at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
>  ~[spark-network-common_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
>     ...
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) 
> ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
>     at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
>  ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
>     at 
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
> ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
>     at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>  ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
>     at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_402] {code}
> h2. What's going on?
> "Failed to parse time string" indicates that JavaUtils.timeStringAs throws 
> the following exception even when the input is a legitimate time string like 
> *120s* or *3s*.
> {code:java}
> Matcher m = TIME_PATTERN.matcher(lower);
> if (!m.matches()) {
>   throw new NumberFormatException("Failed to parse time string: " + str);
> } {code}
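A sketch of how the two suggestions from the comment above (quote the raw input; chain the original exception as the cause) could be applied around that check. The pattern and the surrounding try/catch are assumptions approximating JavaUtils.timeStringAs, and the unit handling is simplified to bare seconds:

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeParseSketch {
    // Assumed approximation of Spark's TIME_PATTERN.
    private static final Pattern TIME_PATTERN = Pattern.compile("(-?[0-9]+)([a-z]+)?");

    public static long timeStringAsSeconds(String str) {
        String lower = str.toLowerCase(Locale.ROOT).trim();
        try {
            Matcher m = TIME_PATTERN.matcher(lower);
            if (!m.matches()) {
                // Suggestion 1: quote the raw input so invisible characters
                // such as U+00A0 become visible in the message.
                throw new NumberFormatException("Failed to parse time string: \"" + str + "\"");
            }
            // Simplified for this sketch: treat the numeric part as seconds.
            return Long.parseLong(m.group(1));
        } catch (NumberFormatException e) {
            String timeError = "Time must be specified as seconds (s), milliseconds (ms), " +
                "microseconds (us), minutes (m or min), hour (h), or day (d).";
            NumberFormatException wrapped =
                new NumberFormatException(timeError + "\n" + e.getMessage());
            // Suggestion 2: keep the original exception as the cause so the
            // full stack trace survives the rethrow.
            wrapped.initCause(e);
            throw wrapped;
        }
    }
}
```

With both changes in place, a stack trace for "120s\u00A0" would show the quoted raw input and the chained cause, instead of an apparently valid 120s.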
> h2. Proposed mitigation
>  * Where a caller can fall back to a default value in a given time unit 
> (e.g., seconds, ms), we should swallow runtime exceptions triggered by the 
> regex matching and unit conversion, especially on critical paths like 
> RpcUtils$.askRpcTimeout in the above stack traces.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
