[ https://issues.apache.org/jira/browse/SPARK-43175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Iain Cardnell updated SPARK-43175:
----------------------------------
    Description:
decom.sh can cause an UnsupportedOperationException, which then causes the Executor to die with a SparkUncaughtException without completing the decommission properly.

*Problem:*
SignalUtils.scala line 124:

{code:java}
if (escalate) {
  prevHandler.handle(sig)
}{code}

*Logs:*

{noformat}
failed - error: command '/opt/decom.sh' exited with 137:
+ echo 'Asked to decommission'
+ date
+ tee -a
++ ps -o pid -C java
++ awk '{ sub(/^[ \t]+/, """"); print }'
++ tail -n 1
+ WORKER_PID=17
+ echo 'Using worker pid 17'
+ kill -s SIGPWR 17
+ echo 'Waiting for worker pid to exit'
+ timeout 60 tail --pid=17 -f /dev/null
, message: ""Asked to decommission\nMon Apr 17 23:44:35 UTC 2023\nUsing worker pid 17\nWaiting for worker pid to exit\n+ echo 'Asked to decommission'\n+ date\n+ tee -a\n++ ps -o pid -C java\n++ awk '{ sub(/^[ \\t]+/, \""\""); print }'\n++ tail -n 1\n+ WORKER_PID=17\n+ echo 'Using worker pid 17'\n+ kill -s SIGPWR 17\n+ echo 'Waiting for worker pid to exit'\n+ timeout 60 tail --pid=17 -f /dev/null\n""",2023-04-17T23:44:39Z,
"java.lang.UnsupportedOperationException: invoking native signal handle not supported
 at java.base/jdk.internal.misc.Signal$NativeHandler.handle(Unknown Source)
 at jdk.unsupported/sun.misc.Signal$SunMiscHandler.handle(Unknown Source)
 at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:124)
 at jdk.unsupported/sun.misc.Signal$InternalMiscHandler.handle(Unknown Source)
 at java.base/jdk.internal.misc.Signal$1.run(Unknown Source)
 at java.base/java.lang.Thread.run(Unknown Source)",2023-04-17T23:44:35.407488217Z
"2023-04-17 23:44:35 [SIGPWR handler] ERROR org.apache.spark.util.SparkUncaughtExceptionHandler - Uncaught exception in thread Thread[SIGPWR handler,9,system] - {}",2023-04-17T23:44:35.407457859Z
" ... 1 more",2023-04-17T23:44:35.405548994Z
" at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)",2023-04-17T23:44:35.405542621Z
" at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)",2023-04-17T23:44:35.405536674Z
" at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)",2023-04-17T23:44:35.405516396Z
" at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)",2023-04-17T23:44:35.405416352Z
" at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)",2023-04-17T23:44:35.405410491Z
" ...
 at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)",2023-04-17T23:44:35.405262304Z
" at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)",2023-04-17T23:44:35.405256591Z
" at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:209)",2023-04-17T23:44:35.405250814Z
{noformat}

In this case prevHandler is the NativeHandler (see [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/19fb8f93c59dfd791f62d41f332db9e306bc1422/src/java.base/share/classes/jdk/internal/misc/Signal.java#L280]), and it throws the exception.

*Possible Solutions:*
 * Check whether prevHandler is an instance of NativeHandler and do not call it in that case.
 * Wrap the invocation of the handler in a try/catch and log a warning/error on exceptions.
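
The second possible solution above (try/catch around the handler invocation) might look roughly like the following. This is only a sketch and not the actual Spark change: `PrevHandler` and `GuardedEscalation` are hypothetical names, and `PrevHandler` is a stand-in for `sun.misc.SignalHandler` (taking the signal name as a string) so that the example does not depend on JDK-internal classes.

```java
import java.util.logging.Logger;

/** Hypothetical stand-in for sun.misc.SignalHandler; not a Spark or JDK type. */
interface PrevHandler {
    void handle(String signalName);
}

/** Hypothetical helper showing a guarded call to the previously installed handler. */
final class GuardedEscalation {
    private static final Logger LOG = Logger.getLogger(GuardedEscalation.class.getName());

    /**
     * Calls the previous handler only when escalation is requested.
     * Returns true if the previous handler ran to completion, and false if it
     * was skipped or refused the call with an UnsupportedOperationException.
     */
    static boolean escalate(boolean escalate, PrevHandler prevHandler, String signalName) {
        if (!escalate || prevHandler == null) {
            return false;
        }
        try {
            prevHandler.handle(signalName);
            return true;
        } catch (UnsupportedOperationException e) {
            // A native handler cannot be invoked from Java code: log a warning
            // instead of letting the exception escape the signal-handling thread.
            LOG.warning("Previous handler for SIG" + signalName
                    + " could not be invoked: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulates a handler that behaves like the NativeHandler in the stack trace.
        PrevHandler nativeLike = sig -> {
            throw new UnsupportedOperationException("invoking native signal handle not supported");
        };
        System.out.println(escalate(true, nativeLike, "PWR"));
    }
}
```

With a guard like this the exception would no longer reach the uncaught-exception handler, so the thread handling SIGPWR would not bring down the executor. Whether skipping a native handler silently is acceptable is a separate design question.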
> decom.sh can cause an UnsupportedOperationException
> ---------------------------------------------------
>
>                 Key: SPARK-43175
>                 URL: https://issues.apache.org/jira/browse/SPARK-43175
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.3.0
>            Reporter: Iain Cardnell
>            Priority: Major