i have a simple spark app that i run with spark-submit on yarn. it runs fine and shows up with finalStatus=SUCCEEDED in the resource manager logs.
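for reference, i submit it roughly like below. the master mode, class name, jar and paths are placeholders for illustration, not my real ones:

    # hypothetical invocation; class, jar and paths are made up for illustration
    spark-submit \
      --master yarn-cluster \
      --class com.example.MyApp \
      my-app-assembly.jar \
      hdfs:///path/to/input hdfs:///path/to/output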
however in the nodemanager logs i see this:

    2015-01-31 18:30:48,195 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18613 for container-id container_1422728945460_0001_01_000001: 146.7 MB of 1 GB physical memory used; 1.1 GB of 2.1 GB virtual memory used
    2015-01-31 18:30:51,166 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1422728945460_0001_01_000002 is : 1
    2015-01-31 18:30:51,168 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1422728945460_0001_01_000002 and exit code: 1
    ExitCodeException exitCode=1:
            at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
            at org.apache.hadoop.util.Shell.run(Shell.java:455)
            at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
            at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
            at java.util.concurrent.FutureTask.run(FutureTask.java:262)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:724)

not much extra info besides a status code here. i can find the logs for container_1422728945460_0001_01_000002, but they seem perfectly fine. this is the tail of them:

    15/01/31 18:30:49 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@ip-10-176-1-83.ec2.internal:41307/user/MapOutputTracker#-840495197]
    15/01/31 18:30:49 INFO spark.MapOutputTrackerWorker: Got the output locations
    15/01/31 18:30:49 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
    15/01/31 18:30:49 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 9 ms
    15/01/31 18:30:49 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
    15/01/31 18:30:49 INFO util.Hadoop18TapUtil: setting up task: 'attempt_201501311830_0001_m_000000_1' - hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4/_temporary/_attempt_201501311830_0001_m_000000_1
    15/01/31 18:30:49 INFO io.TapOutputCollector: creating path: part-00000
    15/01/31 18:30:49 INFO io.TapOutputCollector: closing tap collector for: out4/part-00000
    15/01/31 18:30:49 INFO util.Hadoop18TapUtil: committing task: 'attempt_201501311830_0001_m_000000_1' - hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4/_temporary/_attempt_201501311830_0001_m_000000_1
    15/01/31 18:30:50 INFO util.Hadoop18TapUtil: saved output of task 'attempt_201501311830_0001_m_000000_1' to hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4
    15/01/31 18:30:50 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 800 bytes result sent to driver
    15/01/31 18:30:50 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@ip-10-176-1-83.ec2.internal:36521] -> [akka.tcp://sparkDriver@ip-10-176-1-83.ec2.internal:41307] disassociated! Shutting down.

so i am not sure why i am seeing that exit status.
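(for what it is worth, i pulled those container logs with roughly the command below; the application id is taken from the container id above, and this assumes log aggregation is enabled, otherwise the files sit under the nodemanager's local log dirs.)

    # fetch the aggregated logs for the whole application (assumes yarn.log-aggregation-enable=true)
    yarn logs -applicationId application_1422728945460_0001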
i have ignored it so far, except that it seems to have a weird side effect: after this job runs, no other jobs launch on that node anymore, as if it got blacklisted or something like that. and since this is a single-node pseudo-distributed cluster that i use for testing, nothing runs anymore at all. a restart of yarn fixes it. i do not see the same problem with, say, SparkPi, so it seems to be something in my app, but i do not know where to look. any ideas? thanks!

koert