Can you look inside the RM log to see if you can find some clue there? Can you pastebin the part of the RM log from around the time your job ran?
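If it helps, something like this should pull out the relevant pieces (a rough sketch, assuming a stock Apache Hadoop layout; the log dir and file name pattern are guesses, so adjust them to your install, and the application id is just the one reconstructed from your container ids):

    # report the Hadoop/YARN version
    hadoop version

    # grab the RM log lines around your run; the RM log normally sits in the
    # YARN log dir, e.g. $HADOOP_HOME/logs or /var/log/hadoop-yarn
    grep -C 50 'application_1422728945460_0001' \
        $HADOOP_HOME/logs/yarn-*-resourcemanager-*.log

    # aggregated container logs for the same application (only available once
    # the app has finished and yarn.log-aggregation-enable is true)
    yarn logs -applicationId application_1422728945460_0001

If log aggregation is off, the per-container stdout/stderr stay under the NodeManager's local log dirs instead.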
What Hadoop version are you using? Thanks

On Sat, Jan 31, 2015 at 11:24 AM, Koert Kuipers <ko...@tresata.com> wrote:

> I have a simple Spark app that I run with spark-submit on YARN. It runs
> fine and shows up with finalStatus=SUCCEEDED in the ResourceManager logs.
>
> However, in the NodeManager logs I see this:
>
> 2015-01-31 18:30:48,195 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18613 for container-id container_1422728945460_0001_01_000001: 146.7 MB of 1 GB physical memory used; 1.1 GB of 2.1 GB virtual memory used
> 2015-01-31 18:30:51,166 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1422728945460_0001_01_000002 is : 1
> 2015-01-31 18:30:51,168 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1422728945460_0001_01_000002 and exit code: 1
> ExitCodeException exitCode=1:
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>     at org.apache.hadoop.util.Shell.run(Shell.java:455)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:724)
>
> Not much extra info besides a status code here.
> I can find the logs for container_1422728945460_0001_01_000002, but they
> seem perfectly fine. This is the tail of them:
>
> 15/01/31 18:30:49 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@ip-10-176-1-83.ec2.internal:41307/user/MapOutputTracker#-840495197]
> 15/01/31 18:30:49 INFO spark.MapOutputTrackerWorker: Got the output locations
> 15/01/31 18:30:49 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
> 15/01/31 18:30:49 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 9 ms
> 15/01/31 18:30:49 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
> 15/01/31 18:30:49 INFO util.Hadoop18TapUtil: setting up task: 'attempt_201501311830_0001_m_000000_1' - hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4/_temporary/_attempt_201501311830_0001_m_000000_1
> 15/01/31 18:30:49 INFO io.TapOutputCollector: creating path: part-00000
> 15/01/31 18:30:49 INFO io.TapOutputCollector: closing tap collector for: out4/part-00000
> 15/01/31 18:30:49 INFO util.Hadoop18TapUtil: committing task: 'attempt_201501311830_0001_m_000000_1' - hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4/_temporary/_attempt_201501311830_0001_m_000000_1
> 15/01/31 18:30:50 INFO util.Hadoop18TapUtil: saved output of task 'attempt_201501311830_0001_m_000000_1' to hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4
> 15/01/31 18:30:50 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 800 bytes result sent to driver
> 15/01/31 18:30:50 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@ip-10-176-1-83.ec2.internal:36521] -> [akka.tcp://sparkDriver@ip-10-176-1-83.ec2.internal:41307] disassociated! Shutting down.
>
> So I am not sure why I am seeing that exit status. I ignored it so far,
> except that it seems to have a weird side effect: after this job runs, no
> other jobs launch anymore on the same node, as if it got blacklisted or
> something like that. And since this is a single-node pseudo-distributed
> cluster I use for testing, nothing runs anymore. A restart of YARN fixes
> that.
>
> I do not see the same problem with, say, SparkPi, so it seems to be
> something in my app, but I do not know where to look.
> Any ideas? Thanks!
> Koert