[ https://issues.apache.org/jira/browse/SPARK-25128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868138#comment-16868138 ]
Suman Somasundar commented on SPARK-25128:
------------------------------------------

I have the same issue. When multiple jobs are submitted, the driver pods start and then the executor pods start, but the executors fail because they cannot resolve the driver service. The driver is stuck in the running state with the warning:

"Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources"

Error in the executor pod:

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
	... 4 more
Caused by: java.io.IOException: Failed to connect to t-f5d67725474036458526157f70bc999c-driver-svc.spark-namespace.svc:7078
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
	at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: t-f5d67725474036458526157f70bc999c-driver-svc.spark-namespace.svc
	at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
	at java.net.InetAddress.getAllByName(InetAddress.java:1192)
	at java.net.InetAddress.getAllByName(InetAddress.java:1126)
	at java.net.InetAddress.getByName(InetAddress.java:1076)
	at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
	at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
	at java.security.AccessController.doPrivileged(Native Method)
	at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
	at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
	at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
	at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
	at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
	at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
	at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
	at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
	at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
	at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
	at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:485)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
	at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:103)
	at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:982)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:516)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:427)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:486)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:474)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
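The innermost cause is a plain java.lang.UnknownHostException on the driver's headless service name, i.e. cluster DNS has no record for t-f5d67725474036458526157f70bc999c-driver-svc.spark-namespace.svc at the moment the executor tries to register. Since Netty's DefaultNameResolver (visible in the trace) resolves through java.net.InetAddress, the lookup can be checked from any JVM pod in the same namespace. A minimal sketch, assuming only the service name from the trace above (the object name is mine, not part of Spark):

{code:scala}
import java.net.{InetAddress, UnknownHostException}

// Hypothetical standalone checker, not Spark code.
object ResolveDriverSvc {
  def main(args: Array[String]): Unit = {
    // Host copied from the UnknownHostException in the trace above.
    val host = "t-f5d67725474036458526157f70bc999c-driver-svc.spark-namespace.svc"
    try {
      // Netty's DefaultNameResolver.doResolve ultimately goes through
      // InetAddress, so this follows the executor's resolution path.
      InetAddress.getAllByName(host).foreach(a => println(s"resolved: ${a.getHostAddress}"))
    } catch {
      case e: UnknownHostException =>
        // Same failure the executor hits: no DNS record for the service yet.
        println(s"unresolved: ${e.getMessage}")
    }
  }
}
{code}

If the name starts resolving a few seconds after the driver pod is Running, the executors are simply racing DNS propagation of the freshly created driver service.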
> multiple simultaneous job submissions against k8s backend cause driver pods
> to hang
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-25128
>                 URL: https://issues.apache.org/jira/browse/SPARK-25128
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Erik Erlandson
>            Priority: Minor
>              Labels: kubernetes
>
> User is reporting that multiple "simultaneous" (or rapidly successive) job
> submissions against the k8s back-end are causing driver pods to hang in the
> "Waiting: PodInitializing" state. They filed an associated question at
> [stackoverflow|https://stackoverflow.com/questions/51843212/spark-driver-pod-stuck-in-waiting-podinitializing-state-in-kubernetes].
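For anyone trying to trigger the race described above, the submission pattern can be scripted: fire several cluster-mode submissions without waiting for the earlier ones to finish. A rough sketch, not a confirmed reproducer; the master URL, container image, and example jar path are placeholders to adapt to your cluster:

{code:scala}
import scala.sys.process._

// Hypothetical driver script; adapt the placeholder values before running.
object ParallelSubmit {
  def main(args: Array[String]): Unit = {
    val submit = Seq(
      "spark-submit",
      "--master", "k8s://https://kubernetes.default.svc",        // placeholder API server URL
      "--deploy-mode", "cluster",
      "--class", "org.apache.spark.examples.SparkPi",
      "--conf", "spark.kubernetes.container.image=spark:2.3.0",  // placeholder image
      "local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar" // placeholder jar path
    )
    // Launch five submissions back to back without waiting in between,
    // then block on each one and report its exit code.
    val procs = (1 to 5).map(_ => Process(submit).run())
    procs.zipWithIndex.foreach { case (p, i) => println(s"submission $i exited ${p.exitValue()}") }
  }
}
{code}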