[ https://issues.apache.org/jira/browse/SPARK-32975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494396#comment-17494396 ]
Abhijeet Singh edited comment on SPARK-32975 at 2/18/22, 6:52 AM:
------------------------------------------------------------------

Though the issue points to the driver and the fix is a driver-side config, I was getting the same error because a sidecar was being injected into the executor pod, and the sidecar container took longer to initialize than the executor container. The executor got a connection-refused exception because it tried to communicate before the sidecar was ready. I resolved it by adding a sleep/wait in entrypoint.sh for the executor, but it would be neat to have a `spark.kubernetes.allocation.executor.readinessWait` config that allows setting the wait time.

(Author: singh-abhijeet)

> Add config for driver readiness timeout before executors start
> --------------------------------------------------------------
>
>                 Key: SPARK-32975
>                 URL: https://issues.apache.org/jira/browse/SPARK-32975
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.4, 3.0.2, 3.1.2, 3.2.0
>            Reporter: Shenson Joseph
>            Assignee: Chris Wu
>            Priority: Major
>             Fix For: 3.2.0, 3.1.3
>
> We are using the v1beta2-1.1.2-2.4.5 version of the operator with spark-2.4.4.
> Spark executors keep getting killed with exit code 1, and we see the
> following exception in the executor that goes to the error state. Once this
> error happens, the driver doesn't restart the executor.
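The workaround described in the comment above can be sketched as a small polling helper in the executor's entrypoint.sh. Everything here is illustrative, not anything Spark or the operator defines: `SIDECAR_HEALTH_URL`, the default port 15021, and `SIDECAR_WAIT_SECS` are hypothetical names and values for this example.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: poll an injected sidecar's health endpoint before
# the executor starts, so the executor doesn't race a slow-starting sidecar.
# SIDECAR_HEALTH_URL, port 15021, and SIDECAR_WAIT_SECS are hypothetical.
wait_for_sidecar() {
  local url="${SIDECAR_HEALTH_URL:-http://127.0.0.1:15021/healthz/ready}"
  local timeout="${SIDECAR_WAIT_SECS:-30}"
  local deadline=$(( $(date +%s) + timeout ))
  # Loop until the health check succeeds or the deadline passes.
  until curl -sf "$url" > /dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "sidecar not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 1
  done
}

# In the real entrypoint this would run just before launching the executor:
#   wait_for_sidecar || true   # or fail hard, depending on policy
#   exec ... org.apache.spark.executor.CoarseGrainedExecutorBackend ...
```

Whether to proceed or fail when the wait times out is a policy choice; failing hard lets Kubernetes restart the pod rather than letting the executor die with the opaque connection error below.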
>
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
>         at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
>         at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
>         at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>         at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
>         at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
>         at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>         ... 4 more
> Caused by: java.io.IOException: Failed to connect to act-pipeline-app-1600187491917-driver-svc.default.svc:7078
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
>         at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
>         at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
>         at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.UnknownHostException: act-pipeline-app-1600187491917-driver-svc.default.svc
>         at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1193)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1127)
>         at java.net.InetAddress.getByName(InetAddress.java:1077)
>         at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
>         at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
>         at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
>         at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
>         at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
>         at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
>         at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
>         at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
>         at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
>         at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
>         at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
>         at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
>         at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
>         at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
>         at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
>         at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
>         at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
>         at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>         at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
>         at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
>         ... 1 more
> CodeCache: size=245760Kb used=4762Kb max_used=4763Kb free=240997Kb
>  bounds [0x00007f49f5000000, 0x00007f49f54b0000, 0x00007f4a04000000]
>  total_blobs=1764 nmethods=1356 adapters=324
> compilation: enabled
>
>
> *Additional information:*
> *The status of spark application shows it is RUNNING:*
> kubectl describe sparkapplications.sparkoperator.k8s.io act-pipeline-app
> ...
> ...
> Status:
>   Application State:
>     State: RUNNING
>   Driver Info:
>     Pod Name: act-pipeline-app-driver
>     Web UI Address: 10.233.57.201:40550
>     Web UI Port: 40550
>     Web UI Service Name: act-pipeline-app-ui-svc
>   Execution Attempts: 1
>   Executor State:
>     act-pipeline-app-1600097064694-exec-1: RUNNING
>   Last Submission Attempt Time: 2020-09-14T15:24:26Z
>   Spark Application Id: spark-942bb2e500c54f92ac357b818c712558
>   Submission Attempts: 1
>   Submission ID: 4ecdb6ca-d237-4524-b05e-c42cfcc73dc7
>   Termination Time: <nil>
> Events: <none>
>
> *The executor pod is reporting that it is Terminated:*
> kubectl describe pod -l sparkoperator.k8s.io/app-name=act-pipeline-app,spark-role=executor
> ...
> ...
> Containers:
>   executor:
>     Container ID: docker://9aa5b585e8fb7390b87a4771f3ed1402cae41f0fe55905d0172ed6e90dde34e6
>     ...
>     Ports: 7079/TCP, 8090/TCP
>     Host Ports: 0/TCP, 0/TCP
>     Args:
>       executor
>     State: Terminated
>       Reason: Error
>       Exit Code: 1
>       Started: Mon, 14 Sep 2020 11:25:35 -0400
>       Finished: Mon, 14 Sep 2020 11:25:39 -0400
>     Ready: False
>     Restart Count: 0
> ...
> Conditions:
>   Type              Status
>   Initialized       True
>   Ready             False
>   ContainersReady   False
>   PodScheduled      True
> ...
> QoS Class: Burstable
> Node-Selectors: <none>
> Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
>              node.kubernetes.io/unreachable:NoExecute for 300s
> Events: <none>
>
> In the early stage of the driver's life, the failed executor is not detected
> (it is assumed to be running) and therefore it will not be restarted.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
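The fix recorded above (Fix For: 3.1.3, 3.2.0) addresses the driver side of this race: executor pods are not created until the driver pod reports ready, with a configurable timeout. As I read the Spark 3.2 "Running on Kubernetes" docs, the setting is `spark.kubernetes.allocation.driver.readinessTimeout` (default 1s); verify the name against the docs for your Spark version. A sketch of raising it at submit time, with `<k8s-apiserver>` and the application arguments left as placeholders:

```shell
# Hedged sketch: give the driver pod more time to become ready before
# executors are allocated. Config name per the Spark 3.2 K8s docs as I
# understand them; <k8s-apiserver> and the app args are placeholders.
spark-submit \
  --master "k8s://https://<k8s-apiserver>:6443" \
  --deploy-mode cluster \
  --conf spark.kubernetes.allocation.driver.readinessTimeout=30s \
  ... # remaining application configuration unchanged
```

Note this is the driver-side timeout the issue title asks for; the executor-side `spark.kubernetes.allocation.executor.readinessWait` mentioned in the comment is only the commenter's proposal, not an existing config.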