Hmm, I’m not asking about using k8s to control Spark as a job manager or 
scheduler like YARN. We use the built-in standalone Spark job manager, with 
spark://spark-api:7077 as the master, not k8s.

Our setup uses k8s to manage a cluster consisting of our app, some databases, 
and Spark (one master, one driver, several executors). The problem is that 
some kind of callback from Spark is using the pod ID as the address and 
failing to connect because of that. We have tried deployMode “client” and 
“cluster” but get the same error.

The full trace is below, but the important bit is:

    Failed to connect to harness-64d97d6d6-6n7nh:46337

This came from deployMode = “client”, and the port is the driver port, which 
should be on the launching pod. For some reason Spark is using the pod ID 
instead of a resolvable address. Doesn’t the driver run in the launching app’s 
process? The launching app runs in the pod harness-64d97d6d6-6n7nh, but it has 
the k8s DNS address harness-api. I can see the correct address for the 
launching pod with "kubectl get services".


The error is:

Spark Executor Command: "/usr/lib/jvm/java-1.8-openjdk/bin/java" "-cp" 
"/spark/conf/:/spark/jars/*:/etc/hadoop/" "-Xmx1024M" 
"-Dspark.driver.port=46337" 
"org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
"spark://CoarseGrainedScheduler@harness-64d97d6d6-6n7nh:46337" "--executor-id" 
"138" "--hostname" "10.31.31.174" "--cores" "8" "--app-id" 
"app-20190213210105-0000" "--worker-url" "spark://Worker@10.31.31.174:37609"
========================================

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:63)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:63)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        ... 4 more
Caused by: java.io.IOException: Failed to connect to harness-64d97d6d6-6n7nh:46337
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: harness-64d97d6d6-6n7nh
        at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
        at java.net.InetAddress.getAllByName(InetAddress.java:1193)
        at java.net.InetAddress.getAllByName(InetAddress.java:1127)
        at java.net.InetAddress.getByName(InetAddress.java:1077)
        at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
        at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
        at java.security.AccessController.doPrivileged(Native Method)
        at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
        at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
        at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
        at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
        at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
        at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
        at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
        at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
        at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
        at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
        at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
        at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
        at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
        at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        ... 1 more



From: Erik Erlandson <eerla...@redhat.com>
Date: February 13, 2019 at 4:57:30 AM
To: Pat Ferrel <p...@actionml.com>
Subject:  Re: Spark with Kubernetes connecting to pod id, not address  

Hi Pat,

I'd suggest visiting the big data Slack channel; it's a more Spark-oriented 
forum than kube-dev:
https://kubernetes.slack.com/messages/C0ELB338T/

Tentatively, I think you may want to submit in client mode (unless you are 
initiating your application from outside the kube cluster). When in client 
mode, you need to set up a headless service for the application driver pod that 
the executors can use to talk back to the driver.
https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode
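
Roughly something like the following, shown here with the fabric8 Kubernetes 
client just as a sketch (the service name, selector label, and port are 
placeholders; the port has to match a fixed spark.driver.port, and a plain 
YAML manifest works just as well):

    import io.fabric8.kubernetes.api.model.ServiceBuilder
    import io.fabric8.kubernetes.client.DefaultKubernetesClient

    // Headless service (clusterIP: None) selecting the pod that runs the driver,
    // giving executors a stable DNS name to call back on instead of the pod name.
    val client = new DefaultKubernetesClient()
    val headless = new ServiceBuilder()
      .withNewMetadata().withName("harness-driver").endMetadata()      // placeholder name
      .withNewSpec()
        .withClusterIP("None")
        .addToSelector("app", "harness")                               // placeholder label
        .addNewPort().withName("driver-rpc").withPort(7078).endPort()  // match spark.driver.port
      .endSpec()
      .build()
    client.services().inNamespace("default").createOrReplace(headless)
    client.close()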

Cheers,
Erik


On Wed, Feb 13, 2019 at 1:55 AM Pat Ferrel <p...@actionml.com> wrote:
We have a k8s deployment of several services, including Apache Spark. All 
services seem to be operational. Our application connects to the Spark master 
to submit a job using the cluster's k8s DNS service; the master is named 
spark-api, so we use master=spark://spark-api:7077 and 
spark.submit.deployMode=cluster. We submit the job through the API, not the 
spark-submit script.
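
Roughly, the submission from our app looks like this (a simplified sketch with 
placeholder names, not our exact code):

    import org.apache.spark.{SparkConf, SparkContext}

    // Simplified sketch of the programmatic submission; note that nothing here
    // sets spark.driver.host or any other address for the launching pod.
    val conf = new SparkConf()
      .setMaster("spark://spark-api:7077")
      .setAppName("harness-job")                   // placeholder app name
      .set("spark.submit.deployMode", "cluster")
    val sc = new SparkContext(conf)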

This runs the "driver" and all "executors" on the cluster, and that part seems 
to work, but some Spark process makes a callback to the launching code in our 
app. For some reason it is trying to connect to harness-64d97d6d6-4r4d8, which 
is the pod ID, not the k8s cluster IP or DNS name.

How could this pod ID be getting into the system? Spark somehow seems to think 
it is the address of the service that called it. Needless to say, any 
connection to the k8s pod ID fails, and so does the job.

Any idea how Spark could think the pod ID is an IP address or DNS name? 

BTW, if we run a small sample job with `master=local`, all is well, but the 
same job executed with the above config tries to connect to the spurious pod ID.