[
https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bo Xiong updated SPARK-47759:
-----------------------------
Description:
h2. Symptom
It's observed that our Spark apps occasionally get stuck with an unexpected
stack trace when reading/parsing a legitimate time string. Note that after we
manually killed the stuck app instances, the retries went through on the same
cluster (without requiring any app code change).
*[Stack Trace 1]* The stack trace doesn't make sense because *120s* is a
legitimate time string. The app runs on emr-7.0.0 with the Spark 3.5.0
runtime.
{code:java}
Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time
must be specified as seconds (s), milliseconds (ms), microseconds (us), minutes
(m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
Failed to parse time string: 120s
at org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
at
org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:33)
at
org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533)
at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
at
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
at
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
at
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at
org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:840) {code}
*[Stack Trace 2]* The stack trace is slightly different, but it also involves
reading/parsing a time string, on the line "timeout = \{
Utils.timeStringAsSeconds(finalProp._2).seconds }" in RpcTimeout.scala. The
app runs on emr-7.0.0 with the Spark 3.5.0 runtime.
{code:java}
24/04/02 14:20:14 ERROR TransportRequestHandler: Error while invoking
RpcHandler#receive() for one-way message.
java.lang.NullPointerException: Cannot invoke
"scala.concurrent.duration.Duration$.apply(long,
java.util.concurrent.TimeUnit)" because
"scala.concurrent.duration.Duration$.MODULE$" is null
at
scala.concurrent.duration.package$DurationLong$.durationIn$extension(package.scala:60)
~[scala-library-2.12.17.jar:?]
at
scala.concurrent.duration.package$DurationLong.durationIn(package.scala:60)
~[scala-library-2.12.17.jar:?]
at
scala.concurrent.duration.DurationConversions.seconds(DurationConversions.scala:37)
~[scala-library-2.12.17.jar:?]
at
scala.concurrent.duration.DurationConversions.seconds$(DurationConversions.scala:37)
~[scala-library-2.12.17.jar:?]
at scala.concurrent.duration.package$DurationLong.seconds(package.scala:59)
~[scala-library-2.12.17.jar:?]
at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:33)
~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533)
~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:689)
~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:66)
~[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:274)
[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
[netty-handler-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
[netty-codec-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
[netty-common-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
[netty-common-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[netty-common-4.1.96.Final.jar:4.1.96.Final]
at java.lang.Thread.run(Thread.java:840) [?:?] {code}
*[Stack Trace 3]* The stack trace doesn't make sense because *3s* is a
legitimate time string. The app runs on emr-6.9.0 with the Spark 3.3.0
runtime. Given that RpcUtils$.retryWaitMs was refactored away in [this
commit|https://github.com/apache/spark/commit/e7b8be37226142cd7fcfac82576c46f8f88f74b3],
this trace is out of scope for the PR.
{code:java}
24/03/01 21:04:24 ERROR TransportRequestHandler: Error while invoking
RpcHandler#receive() on RPC id 7358995660930182634
java.lang.NumberFormatException: Time must be specified as seconds (s),
milliseconds (ms), microseconds (us), minutes (m or min), hour (h), or day (d).
E.g. 50s, 100ms, or 250us.
Failed to parse time string: 3s
at org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:252)
~[spark-network-common_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.internal.config.ConfigHelpers$.timeFromString(ConfigBuilder.scala:56)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.internal.config.ConfigBuilder.$anonfun$timeConf$1(ConfigBuilder.scala:256)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.internal.config.ConfigBuilder.$anonfun$timeConf$1$adapted(ConfigBuilder.scala:256)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.internal.config.ConfigEntryWithDefaultString.readFrom(ConfigEntry.scala:206)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.SparkConf.get(SparkConf.scala:261)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.util.RpcUtils$.retryWaitMs(RpcUtils.scala:46)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:34)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
~[spark-network-common_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
...
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
at
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
~[netty-common-4.1.74.Final.jar:4.1.74.Final]
at
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
~[netty-common-4.1.74.Final.jar:4.1.74.Final]
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
~[netty-common-4.1.74.Final.jar:4.1.74.Final]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_402] {code}
h2. Proposed mitigation
* When a caller already knows which default value and time unit (e.g., seconds,
milliseconds) should be used, there is no need to go through the regex parsing
and unit conversion, especially on a critical path like
RpcUtils$.askRpcTimeout in the above stack traces.
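As a hypothetical illustration of this mitigation (the helper name and signature are ours, not Spark's API), a caller that knows its numeric default can return it directly when the property is unset, so the hot path never enters the regex-based time-string parser:

```java
import java.util.Map;

public class TimeoutConf {
    /**
     * Hypothetical helper (not actual Spark API) illustrating the mitigation:
     * when the key is unset, return the known numeric default directly, so a
     * hot path such as constructing an RpcTimeout never enters the
     * regex-based time-string parser or any unit conversion.
     */
    static long timeoutSeconds(Map<String, String> conf, String key, long defaultSec) {
        String raw = conf.get(key);
        if (raw == null) {
            return defaultSec;  // fast path: no regex, no parsing
        }
        // Only an explicit user override pays the parsing cost. A real
        // implementation would delegate to JavaUtils.timeStringAsSec(raw);
        // this sketch accepts plain seconds with an optional "s" suffix.
        String trimmed = raw.trim().toLowerCase();
        if (trimmed.endsWith("s")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1);
        }
        return Long.parseLong(trimmed);
    }
}
```

Only apps that explicitly override the setting would ever touch the string-parsing code under this scheme.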
was:
h2. Symptom
It's observed that our Spark apps occasionally get stuck with an unexpected
stack trace when reading/parsing a legitimate time string. Note that after we
manually killed the stuck app instances, the retries went through on the same
cluster (without requiring any app code change).
*[Stack Trace 1]* The stack trace doesn't make sense because *120s* is a
legitimate time string. The app runs on emr-7.0.0 with the Spark 3.5.0
runtime.
{code:java}
Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time
must be specified as seconds (s), milliseconds (ms), microseconds (us), minutes
(m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
Failed to parse time string: 120s
at org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
at
org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:33)
at
org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533)
at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
at
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
at
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
at
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at
org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:840) {code}
*[Stack Trace 2]* The stack trace doesn't make sense because *3s* is a
legitimate time string. The app runs on emr-6.9.0 with the Spark 3.3.0 runtime.
{code:java}
24/03/01 21:04:24 ERROR TransportRequestHandler: Error while invoking
RpcHandler#receive() on RPC id 7358995660930182634
java.lang.NumberFormatException: Time must be specified as seconds (s),
milliseconds (ms), microseconds (us), minutes (m or min), hour (h), or day (d).
E.g. 50s, 100ms, or 250us.
Failed to parse time string: 3s
at org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:252)
~[spark-network-common_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.internal.config.ConfigHelpers$.timeFromString(ConfigBuilder.scala:56)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.internal.config.ConfigBuilder.$anonfun$timeConf$1(ConfigBuilder.scala:256)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.internal.config.ConfigBuilder.$anonfun$timeConf$1$adapted(ConfigBuilder.scala:256)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.internal.config.ConfigEntryWithDefaultString.readFrom(ConfigEntry.scala:206)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.SparkConf.get(SparkConf.scala:261)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.util.RpcUtils$.retryWaitMs(RpcUtils.scala:46)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:34)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
at
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
~[spark-network-common_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
...
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
at
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
~[netty-common-4.1.74.Final.jar:4.1.74.Final]
at
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
~[netty-common-4.1.74.Final.jar:4.1.74.Final]
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
~[netty-common-4.1.74.Final.jar:4.1.74.Final]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_402] {code}
*[Stack Trace 3]* The stack trace is slightly different, but it also involves
reading/parsing a time string, on the line "timeout = \{
Utils.timeStringAsSeconds(finalProp._2).seconds }" in RpcTimeout.scala. The
app runs on emr-7.0.0 with the Spark 3.5.0 runtime.
{code:java}
24/04/02 14:20:14 ERROR TransportRequestHandler: Error while invoking
RpcHandler#receive() for one-way message.
java.lang.NullPointerException: Cannot invoke
"scala.concurrent.duration.Duration$.apply(long,
java.util.concurrent.TimeUnit)" because
"scala.concurrent.duration.Duration$.MODULE$" is null
at
scala.concurrent.duration.package$DurationLong$.durationIn$extension(package.scala:60)
~[scala-library-2.12.17.jar:?]
at
scala.concurrent.duration.package$DurationLong.durationIn(package.scala:60)
~[scala-library-2.12.17.jar:?]
at
scala.concurrent.duration.DurationConversions.seconds(DurationConversions.scala:37)
~[scala-library-2.12.17.jar:?]
at
scala.concurrent.duration.DurationConversions.seconds$(DurationConversions.scala:37)
~[scala-library-2.12.17.jar:?]
at scala.concurrent.duration.package$DurationLong.seconds(package.scala:59)
~[scala-library-2.12.17.jar:?]
at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:33)
~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533)
~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:689)
~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:66)
~[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:274)
[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
[netty-handler-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
[netty-codec-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
[netty-transport-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
[netty-common-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
[netty-common-4.1.96.Final.jar:4.1.96.Final]
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[netty-common-4.1.96.Final.jar:4.1.96.Final]
at java.lang.Thread.run(Thread.java:840) [?:?] {code}
h2. What's going on?
Based on the above observations, our hypothesis is that there is a
thread-safety issue in JavaUtils.java, where a couple of methods call
Pattern.compile on the fly. Note that java.util.regex.Pattern objects are
thread-safe, but the {{Pattern.compile}} method might not be.
h2. Proposed fix
* Refactor JavaUtils.java to use {{static final}} Pattern objects, which are
immutable and safe for use by multiple concurrent threads (see
[reference|https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/regex/Pattern.html]).
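The proposed refactoring could look like the following sketch. This is a simplified illustration only, not the actual JavaUtils code: the class name, regex, and suffix handling here are invented for the example, and real Spark code covers more cases.

```java
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeStringParser {
    // Compiled once at class-load time. Pattern instances are immutable
    // and safe for use by multiple concurrent threads, so there is no
    // need to call Pattern.compile on every invocation.
    private static final Pattern TIME_STRING =
        Pattern.compile("(-?[0-9]+)([a-z]+)?");

    public static long timeStringAsSec(String str) {
        Matcher m = TIME_STRING.matcher(str.toLowerCase().trim());
        if (!m.matches()) {
            throw new NumberFormatException("Failed to parse time string: " + str);
        }
        long val = Long.parseLong(m.group(1));
        String suffix = m.group(2);
        TimeUnit unit;
        if (suffix == null || suffix.equals("s")) {
            unit = TimeUnit.SECONDS;
        } else if (suffix.equals("ms")) {
            unit = TimeUnit.MILLISECONDS;
        } else if (suffix.equals("us")) {
            unit = TimeUnit.MICROSECONDS;
        } else if (suffix.equals("m") || suffix.equals("min")) {
            unit = TimeUnit.MINUTES;
        } else if (suffix.equals("h")) {
            unit = TimeUnit.HOURS;
        } else if (suffix.equals("d")) {
            unit = TimeUnit.DAYS;
        } else {
            throw new NumberFormatException("Invalid suffix in: " + str);
        }
        return TimeUnit.SECONDS.convert(val, unit);
    }
}
```

With the pattern hoisted into a {{static final}} field, concurrent RPC handler threads share one immutable Pattern instead of racing through {{Pattern.compile}}.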
> Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate
> time string
> -------------------------------------------------------------------------------------
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.0, 3.5.1
> Reporter: Bo Xiong
> Assignee: Bo Xiong
> Priority: Critical
> Labels: hang, pull-request-available, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0, 3.5.1, 3.5.2
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> h2. Symptom
> We have observed that our Spark apps occasionally get stuck with an
> unexpected stack trace while reading/parsing a legitimate time string. After
> we manually killed the stuck app instances, the retry went through on the
> same cluster (without requiring any app code change).
>
> *[Stack Trace 1]* The stack trace doesn't make sense, since *120s* is a
> legitimate time string. The app runs on emr-7.0.0 with the Spark 3.5.0
> runtime.
> {code:java}
> Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time
> must be specified as seconds (s), milliseconds (ms), microseconds (us),
> minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
> Failed to parse time string: 120s
> at
> org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
> at
> org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
> at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
> at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
> at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
> at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:33)
> at
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533)
> at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
> at
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
> at
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
> at
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
> at
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
> at
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
> at
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
> at
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at
> org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
> at
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
> at
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> at
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:840) {code}
>
> *[Stack Trace 2]* The stack trace is slightly different, but it also
> involves reading/parsing a time string on the line "timeout = \{
> Utils.timeStringAsSeconds(finalProp._2).seconds }" in RpcTimeout.scala. The
> app runs on emr-7.0.0 with the Spark 3.5.0 runtime.
> {code:java}
> 24/04/02 14:20:14 ERROR TransportRequestHandler: Error while invoking
> RpcHandler#receive() for one-way message.
> java.lang.NullPointerException: Cannot invoke
> "scala.concurrent.duration.Duration$.apply(long,
> java.util.concurrent.TimeUnit)" because
> "scala.concurrent.duration.Duration$.MODULE$" is null
> at
> scala.concurrent.duration.package$DurationLong$.durationIn$extension(package.scala:60)
> ~[scala-library-2.12.17.jar:?]
> at
> scala.concurrent.duration.package$DurationLong.durationIn(package.scala:60)
> ~[scala-library-2.12.17.jar:?]
> at
> scala.concurrent.duration.DurationConversions.seconds(DurationConversions.scala:37)
> ~[scala-library-2.12.17.jar:?]
> at
> scala.concurrent.duration.DurationConversions.seconds$(DurationConversions.scala:37)
> ~[scala-library-2.12.17.jar:?]
> at
> scala.concurrent.duration.package$DurationLong.seconds(package.scala:59)
> ~[scala-library-2.12.17.jar:?]
> at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
> ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
> at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
> ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
> at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:33)
> ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
> at
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533)
> ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
> at
> org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
> ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
> at
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
> ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
> at
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:689)
> ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
> at
> org.apache.spark.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:66)
> ~[spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
> at
> org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:274)
> [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
> at
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
> [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
> at
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
> [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
> at
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
> [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
> at
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
> [netty-handler-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> [netty-codec-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
> [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:192)
> [spark-network-common_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
> [netty-transport-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
> [netty-common-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> [netty-common-4.1.96.Final.jar:4.1.96.Final]
> at
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> [netty-common-4.1.96.Final.jar:4.1.96.Final]
> at java.lang.Thread.run(Thread.java:840) [?:?] {code}
>
> *[Stack Trace 3]* The stack trace doesn't make sense, since *3s* is a
> legitimate time string. The app runs on emr-6.9.0 with the Spark 3.3.0
> runtime. Given that RpcUtils$.retryWaitMs has been refactored away in [this
> commit|https://github.com/apache/spark/commit/e7b8be37226142cd7fcfac82576c46f8f88f74b3],
> this trace is out of scope for the PR.
> {code:java}
> 24/03/01 21:04:24 ERROR TransportRequestHandler: Error while invoking
> RpcHandler#receive() on RPC id 7358995660930182634
> java.lang.NumberFormatException: Time must be specified as seconds (s),
> milliseconds (ms), microseconds (us), minutes (m or min), hour (h), or day
> (d). E.g. 50s, 100ms, or 250us.
> Failed to parse time string: 3s
> at
> org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:252)
> ~[spark-network-common_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
> at
> org.apache.spark.internal.config.ConfigHelpers$.timeFromString(ConfigBuilder.scala:56)
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
> at
> org.apache.spark.internal.config.ConfigBuilder.$anonfun$timeConf$1(ConfigBuilder.scala:256)
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
> at
> org.apache.spark.internal.config.ConfigBuilder.$anonfun$timeConf$1$adapted(ConfigBuilder.scala:256)
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
> at
> org.apache.spark.internal.config.ConfigEntryWithDefaultString.readFrom(ConfigEntry.scala:206)
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
> at org.apache.spark.SparkConf.get(SparkConf.scala:261)
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
> at org.apache.spark.util.RpcUtils$.retryWaitMs(RpcUtils.scala:46)
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
> at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:34)
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
> at
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533)
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
> at
> org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
> at
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
> at
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
> ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
> at
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
> ~[spark-network-common_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
> ...
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
> ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
> ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
> at
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
> at
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
> at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_402] {code}
> h2. Proposed mitigation
> * When a caller already knows which default value and time unit (e.g.,
> seconds or milliseconds) to use, there's no need to go through the regex
> logic and unit conversion, especially on a critical path like
> RpcUtils$.askRpcTimeout in the above stack traces.
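A minimal sketch of that fast path, assuming a hypothetical helper (the class name, method signature, and parsing logic below are invented for illustration and are not Spark's actual API):

```java
import java.util.concurrent.TimeUnit;

public class FastTimeConf {
    // Hypothetical fast path: when the caller already knows the target
    // unit, a bare number is parsed directly, with no regex involved.
    // Only suffixed strings fall through to suffix handling.
    public static long timeStringAs(String str, TimeUnit unit) {
        String s = str.trim().toLowerCase();
        try {
            // Fast path: plain number, interpreted in the caller's unit.
            return Long.parseLong(s);
        } catch (NumberFormatException ignored) {
            // Slow path: the string carries a unit suffix, e.g. "120s".
        }
        if (s.endsWith("ms")) {
            return unit.convert(Long.parseLong(s.substring(0, s.length() - 2)),
                                TimeUnit.MILLISECONDS);
        }
        if (s.endsWith("s")) {
            return unit.convert(Long.parseLong(s.substring(0, s.length() - 1)),
                                TimeUnit.SECONDS);
        }
        throw new NumberFormatException("Failed to parse time string: " + str);
    }
}
```

For a hot call site such as an RPC timeout lookup, the common case ("120s" or "120" with a known unit of seconds) avoids any shared regex machinery entirely.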
--
This message was sent by Atlassian Jira
(v8.20.10#820010)