Dear Hadoop community,

I am testing my YARN cluster with Zeppelin notebooks, using the Spark engine to submit Python code. The whole point of this exercise is to analyze how YARN behaves in different situations.

For instance, I am trying to run code that needs a relatively large amount of memory compared to the memory allocated to the Spark driver. The job fails, but I can't easily see why, so I am wondering whether I have other problems as well (e.g. an error in the cluster configuration).

This is the submitted code (the only thing it does is allocate a large amount of memory so I can see how YARN behaves):

a = "bigword"
aList = []
for i in range(1000):
    aList.append(i**i*a)
#print aList
for word in aList:
    print word

The application fails, and this is the error I can see in the logs:

Exception 1:

WARN [2019-10-17 09:28:58,567] ({Reporter} Logging.scala[logWarning]:66) - Container marked as failed: container_e15_1570749574365_0027_01_000003 on host: r940-1-1-mlx.mlx. Exit status: 1. Diagnostics:
[2019-10-17 09:28:57.458]Exception from container-launch.
Container id: container_e15_1570749574365_0027_01_000003
Exit code: 1
Exception message: Launch container failed
Shell output: main : command provided 1
main : run as user is mansop
main : requested yarn user is mansop
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /d1/hadoop/yarn/local/nmPrivate/application_1570749574365_0027/container_e15_1570749574365_0027_01_000003/container_e15_1570749574365_0027_01_000003.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...

[2019-10-17 09:28:57.459]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.spark.executor.CoarseGrainedExecutorBackend

[2019-10-17 09:28:57.462]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.spark.executor.CoarseGrainedExecutorBackend

WARN [2019-10-17 09:28:58,571] ({dispatcher-event-loop-19} Logging.scala[logWarning]:66) - Requesting driver to remove executor 2 for reason Container marked as failed: container_e15_1570749574365_0027_01_000003 on host: r940-1-1-mlx.mlx. Exit status: 1. Diagnostics:
[2019-10-17 09:28:57.458]Exception from container-launch.
Container id: container_e15_1570749574365_0027_01_000003
Exit code: 1
Exception message: Launch container failed
Shell output: main : command provided 1
main : run as user is mansop
main : requested yarn user is mansop
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /d1/hadoop/yarn/local/nmPrivate/application_1570749574365_0027/container_e15_1570749574365_0027_01_000003/container_e15_1570749574365_0027_01_000003.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...

[2019-10-17 09:28:57.459]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.spark.executor.CoarseGrainedExecutorBackend

[2019-10-17 09:28:57.462]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.spark.executor.CoarseGrainedExecutorBackend

Exception 2:

DEBUG [2019-10-17 09:29:20,583] ({pool-5-thread-1} RemoteInterpreterServer.java[interpret]:426) - st:
a = "bigword"
aList = []
for i in range(1000):
    aList.append(i**i*a)
#print aList
for word in aList:
    print word
DEBUG [2019-10-17 09:29:20,584] ({pool-5-thread-1} RemoteInterpreterEventClient.java[sendEvent]:413) - Send Event: RemoteInterpreterEvent(type:OUTPUT_UPDATE_ALL, data:{"messages":[],"noteId":"2ENM9X82N","paragraphId":"20190926-163159_1153559848"})
DEBUG [2019-10-17 09:29:20,584] ({pool-5-thread-2} RemoteInterpreterEventClient.java[pollEvent]:366) - Send event OUTPUT_UPDATE_ALL
INFO [2019-10-17 09:29:20,584] ({pool-6-thread-4} SchedulerFactory.java[jobStarted]:109) - Job 20190926-163159_1153559848 started by scheduler interpreter_1768035734
DEBUG [2019-10-17 09:29:20,585] ({pool-6-thread-4} RemoteInterpreterServer.java[jobRun]:632) - Script after hooks:
a = "bigword"
aList = []
for i in range(1000):
    aList.append(i**i*a)
#print aList
for word in aList:
    print word
__zeppelin__._displayhook()
DEBUG [2019-10-17 09:29:20,585] ({pool-6-thread-4} RemoteInterpreterEventClient.java[sendEvent]:413) - Send Event: RemoteInterpreterEvent(type:META_INFOS, data:{"message":"Spark UI enabled","url":"http://r640-1-10-mlx.mlx:42304"})
DEBUG [2019-10-17 09:29:20,586] ({pool-5-thread-2} RemoteInterpreterEventClient.java[pollEvent]:366) - Send event META_INFOS
DEBUG [2019-10-17 09:29:20,684] ({pool-5-thread-3} Interpreter.java[getProperty]:222) - key: zeppelin.spark.concurrentSQL, value: false
DEBUG [2019-10-17 09:29:20,684] ({pool-5-thread-3} Interpreter.java[getProperty]:222) - key: zeppelin.spark.concurrentSQL, value: false
ERROR [2019-10-17 09:29:25,301] ({SIGTERM handler} SignalUtils.scala[apply$mcZ$sp]:43) - RECEIVED SIGNAL TERM
INFO [2019-10-17 09:29:25,308] ({shutdown-hook-0} Logging.scala[logInfo]:54)
- Invoking stop() from shutdown hook
INFO [2019-10-17 09:29:25,327] ({shutdown-hook-0} AbstractConnector.java[doStop]:318) - Stopped Spark@21f71c23{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
INFO [2019-10-17 09:29:25,331] ({shutdown-hook-0} Logging.scala[logInfo]:54) - Stopped Spark web UI at http://r640-1-10-mlx.mlx:42304
INFO [2019-10-17 09:29:25,387] ({dispatcher-event-loop-44} Logging.scala[logInfo]:54) - Driver requested a total number of 0 executor(s).
INFO [2019-10-17 09:29:25,389] ({shutdown-hook-0} Logging.scala[logInfo]:54) - Shutting down all executors
INFO [2019-10-17 09:29:25,392] ({dispatcher-event-loop-51} Logging.scala[logInfo]:54) - Asking each executor to shut down
INFO [2019-10-17 09:29:25,398] ({shutdown-hook-0} Logging.scala[logInfo]:54) - Stopping SchedulerExtensionServices (serviceOption=None, services=List(), started=false)
ERROR [2019-10-17 09:29:25,407] ({Reporter} Logging.scala[logError]:91) - Exception from Reporter thread.
org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1570749574365_0027_000001 doesn't exist in ApplicationMasterService cache.
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:404)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy21.allocate(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:320)
at org.apache.spark.deploy.yarn.YarnAllocator.allocateResources(YarnAllocator.scala:268)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:556)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException): Application attempt appattempt_1570749574365_0027_000001 doesn't exist in ApplicationMasterService cache.
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:404)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)
at org.apache.hadoop.ipc.Client.call(Client.java:1443)
at org.apache.hadoop.ipc.Client.call(Client.java:1353)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy20.allocate(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
... 13 more
INFO [2019-10-17 09:29:25,411] ({dispatcher-event-loop-55} Logging.scala[logInfo]:54) - MapOutputTrackerMasterEndpoint stopped!
INFO [2019-10-17 09:29:25,412] ({Reporter} Logging.scala[logInfo]:54) - Final app status: FAILED, exitCode: 12, (reason: Application attempt appattempt_1570749574365_0027_000001 doesn't exist in ApplicationMasterService cache.
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:404)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
)
INFO [2019-10-17 09:29:25,439] ({shutdown-hook-0} Logging.scala[logInfo]:54) - MemoryStore cleared
INFO [2019-10-17 09:29:25,440] ({shutdown-hook-0} Logging.scala[logInfo]:54) - BlockManager stopped
INFO [2019-10-17 09:29:25,441] ({shutdown-hook-0} Logging.scala[logInfo]:54) - BlockManagerMaster stopped
INFO [2019-10-17 09:29:25,445] ({dispatcher-event-loop-63} Logging.scala[logInfo]:54) - OutputCommitCoordinator stopped!
INFO [2019-10-17 09:29:25,459] ({shutdown-hook-0} Logging.scala[logInfo]:54) - Successfully stopped SparkContext
INFO [2019-10-17 09:29:25,461] ({shutdown-hook-0} Logging.scala[logInfo]:54) - Shutdown hook called
INFO [2019-10-17 09:29:25,463] ({shutdown-hook-0} Logging.scala[logInfo]:54) - Deleting directory /d0/hadoop/yarn/local/usercache/mansop/appcache/application_1570749574365_0027/spark-8328975e-fd10-4676-9b51-c800253c1016/pyspark-55a3ee94-1620-423b-8564-a0f2a7b1da76
INFO [2019-10-17 09:29:25,463] ({shutdown-hook-0} Logging.scala[logInfo]:54) - Deleting directory /d0/hadoop/yarn/local/usercache/mansop/appcache/application_1570749574365_0027/spark-8328975e-fd10-4676-9b51-c800253c1016
INFO [2019-10-17 09:29:25,463] ({shutdown-hook-0} Logging.scala[logInfo]:54) - Deleting directory /d1/hadoop/yarn/local/usercache/mansop/appcache/application_1570749574365_0027/spark-f0f652c9-2a8a-4235-818d-9f9165890c52
End of LogType:stdout

Exception 3:

INFO [2019-10-17 09:29:28,077] ({main} Logging.scala[logInfo]:54) - Registered signal handler for TERM
INFO [2019-10-17 09:29:28,079] ({main} Logging.scala[logInfo]:54) - Registered signal handler for HUP
INFO [2019-10-17 09:29:28,080] ({main} Logging.scala[logInfo]:54) - Registered signal handler for INT
INFO [2019-10-17 09:29:28,193] ({main} Logging.scala[logInfo]:54) - Changing view acls to: mansop
INFO [2019-10-17 09:29:28,193] ({main} Logging.scala[logInfo]:54) - Changing modify acls to: mansop
INFO [2019-10-17 09:29:28,194] ({main} Logging.scala[logInfo]:54) - Changing view acls groups to:
INFO [2019-10-17 09:29:28,194] ({main} Logging.scala[logInfo]:54) - Changing modify acls groups to:
INFO [2019-10-17 09:29:28,195] ({main} Logging.scala[logInfo]:54) - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mansop); groups with view permissions: Set(); users with modify permissions: Set(mansop); groups with modify permissions: Set()
INFO [2019-10-17 09:29:28,493] ({main} Logging.scala[logInfo]:54) - Preparing Local resources
INFO [2019-10-17 09:29:29,195] ({main} Logging.scala[logInfo]:54) - ApplicationAttemptId: appattempt_1570749574365_0027_000002
INFO [2019-10-17 09:29:29,219] ({main} Logging.scala[logInfo]:54) - Starting the user application in a separate Thread
INFO [2019-10-17 09:29:29,241] ({main} Logging.scala[logInfo]:54) - Waiting for spark context initialization...
INFO [2019-10-17 09:29:29,243] ({Driver} RemoteInterpreterServer.java[main]:260) - URL:jar:file:/d1/hadoop/yarn/local/usercache/mansop/appcache/application_1570749574365_0027/container_e15_1570749574365_0027_02_000001/__app__.jar!/org/apache/zeppelin/interpreter/remote/RemoteInterpreterServer.class
INFO [2019-10-17 09:29:29,311] ({Driver} RemoteInterpreterServer.java[<init>]:161) - Launching ThriftServer at 10.0.1.12:41438
INFO [2019-10-17 09:29:29,319] ({Driver} RemoteInterpreterServer.java[<init>]:165) - Starting remote interpreter server on port 41438
INFO [2019-10-17 09:29:29,322] ({Thread-8} RemoteInterpreterServer.java[run]:202) - Starting remote interpreter server on port 41438
INFO [2019-10-17 09:29:30,340] ({Thread-9} RemoteInterpreterUtils.java[registerInterpreter]:165) - callbackHost: 10.0.11.18, callbackPort: 42835, callbackInfo: CallbackInfo(host:10.0.1.12, port:41438)
ERROR [2019-10-17 09:31:09,266] ({main} Logging.scala[logError]:91) - Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:498)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:814)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:839)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
INFO [2019-10-17 09:31:09,273] ({main} Logging.scala[logInfo]:54) - Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds])
INFO [2019-10-17 09:31:09,284] ({shutdown-hook-0} Logging.scala[logInfo]:54) - Deleting staging directory hdfs://gl-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1570749574365_0027
INFO [2019-10-17 09:31:09,299] ({shutdown-hook-0} Logging.scala[logInfo]:54) - Shutdown hook called
End of LogType:stdout

Could someone please help me understand what these errors mean?
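P.S. For context on how aggressive this test is: at iteration i the loop builds a string of len("bigword") * i**i characters, so memory should run out within the first dozen or so iterations, long before range(1000) completes. A quick standalone sketch of the growth (plain Python 3 run outside Spark; the size arithmetic mirrors the submitted loop, the printing is mine):

```python
# Sketch of how fast the submitted test's allocations grow.
# At iteration i the job appends i**i copies of "bigword",
# i.e. a string of roughly 7 * i**i bytes.

a = "bigword"  # 7 characters

for i in range(2, 12):
    # size of the string that would be built at iteration i
    size_bytes = len(a) * i ** i
    print(f"i={i:2d}: ~{size_bytes:,} bytes")
```

Already at i=10 a single string needs about 70 GB (7 * 10**10 bytes), and at i=11 roughly 2 TB, so the allocation blows past any realistic driver or executor heap almost immediately.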
Manuel Sopena Ballesteros
Big Data Engineer | Kinghorn Centre for Clinical Genomics, Garvan Institute
e: manuel...@garvan.org.au