Hello, I have built a Livy Docker container and am trying to connect it to a YARN cluster running outside the Docker host. When I submit a Spark job via Livy inside the container (yarn-cluster mode) from the Zeppelin UI, I can see the job is submitted (verified from the RM UI and the YARN container logs), but the job eventually fails with the following errors (shown in the Zeppelin UI):
%livy.spark sc.version

org.apache.zeppelin.livy.LivyException: Session 3 is finished, appId: application_1508274923092_0022, log: [
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624),
    at java.lang.Thread.run(Thread.java:748),
    Shell output: main : command provided 1,
    main : user is user1,
    main : requested yarn user is user1,
    Container exited with a non-zero exit code 15,
    Failing this attempt. Failing the application.]
        at org.apache.zeppelin.livy.BaseLivyInterpreter.createSession(BaseLivyInterpreter.java:230)
        at org.apache.zeppelin.livy.BaseLivyInterpreter.initLivySession(BaseLivyInterpreter.java:119)
        at org.apache.zeppelin.livy.BaseLivyInterpreter.open(BaseLivyInterpreter.java:101)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:491)
        at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
        at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

From the YARN application logs, I found the following related to the submitted job:

2017-10-19 12:44:59,423 INFO [main] yarn.ApplicationMaster: Waiting for spark context initialization...
2017-10-19 12:44:59,476 INFO [Driver] driver.RSCDriver: Connecting to: 57539c730d5f:34680
2017-10-19 12:44:59,476 INFO [Driver] driver.RSCDriver: Starting RPC server...
2017-10-19 12:44:59,688 WARN [Driver] rsc.RSCConf: Your hostname, node1.lab (*valid hostname*), resolves to a loopback address, but we couldn't find any external IP address!
2017-10-19 12:44:59,688 WARN [Driver] rsc.RSCConf: Set livy.rsc.rpc.server.address if you need to bind to another address.
2017-10-19 12:44:59,704 ERROR [Driver] yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: java.nio.channels.UnresolvedAddressException
java.util.concurrent.ExecutionException: java.nio.channels.UnresolvedAddressException
        at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:41)
        at com.cloudera.livy.rsc.driver.RSCDriver.initializeServer(RSCDriver.java:197)
        at com.cloudera.livy.rsc.driver.RSCDriver.run(RSCDriver.java:326)
        at com.cloudera.livy.rsc.driver.RSCDriverBootstrapper.main(RSCDriverBootstrapper.java:86)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
Caused by: java.nio.channels.UnresolvedAddressException
        at sun.nio.ch.Net.checkAddress(Net.java:101)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
        at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:242)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:205)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1226)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:507)
        at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:970)
        at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:215)
        at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:166)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at java.lang.Thread.run(Thread.java:748)

From the above, it looks like there is a problem resolving the driver's callback address or connecting back to the container, yet DNS resolution appears to work both from the cluster nodes and from inside the Docker container. Any pointers on what else I can check, or what else could be causing this?

Note: if the same Spark job is submitted via spark-submit from the same Docker container in yarn-cluster mode, it completes without any issue. So the problem seems isolated to Livy, or to some configuration I may be missing. Let me know if any other information would help in debugging this.

- Sarjeet Singh
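P.S. In case it helps frame the question: the RSCConf warning above names livy.rsc.rpc.server.address, so one thing I was planning to try is binding the RPC server to an address the YARN nodes can actually resolve, rather than the container ID hostname (57539c730d5f) the driver is currently told to connect to. A rough sketch of what I mean is below; the hostname livy-host.example.com, the conf path, and the published port range are assumptions for my setup, not a verified fix:

```shell
# Sketch only, under the assumptions stated above -- not a verified fix.

# Inside the Livy container: bind the RSC RPC server to a routable address
# (livy.rsc.rpc.server.address is the key named in the WARN message).
# The conf file location is an assumption for this image.
echo "livy.rsc.rpc.server.address = livy-host.example.com" >> /opt/livy/conf/livy-client.conf

# On the Docker host: give the container a hostname the cluster DNS can
# resolve, and publish the Livy REST port plus a port range for the RPC
# callback (the range here is illustrative).
docker run --hostname livy-host.example.com \
  -p 8998:8998 \
  -p 10000-10010:10000-10010 \
  my-livy-image
```

The alternative I am also considering is running the container with --net=host, which would sidestep the resolution problem entirely since the container would share the host's network namespace. Does either approach sound like the right direction?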