It is CDH 5.3 with the Spark that ships with it. I went through the RM logs line by line and found the exit code in there:
container_1422728945460_0001_01_000029 Container Transitioned from NEW to RESERVED
2015-01-31 18:30:49,633 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Reserved container container_1422728945460_0001_01_000029 on node host: ip-10-176-1-83.ec2.internal:40695 #containers=2 available=1024 used=3072 for application org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt@7c96c19c
2015-01-31 18:30:50,638 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Releasing reservation that cannot be satisfied for application appattempt_1422728945460_0001_000001 on node host: ip-10-176-1-83.ec2.internal:40695 #containers=2 available=1024 used=3072
2015-01-31 18:30:50,638 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: Application application_1422728945460_0001 unreserved on node host: ip-10-176-1-83.ec2.internal:40695 #containers=2 available=1024 used=3072, currently has 0 at priority 1; currentReservation <memory:0, vCores:0>
2015-01-31 18:30:50,638 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: Making reservation: node=ip-10-176-1-83.ec2.internal app_id=application_1422728945460_0001
2015-01-31 18:30:50,639 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1422728945460_0001_01_000030 Container Transitioned from NEW to RESERVED
2015-01-31 18:30:50,639 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Reserved container container_1422728945460_0001_01_000030 on node host: ip-10-176-1-83.ec2.internal:40695 #containers=2 available=1024 used=3072 for application org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt@7c96c19c
2015-01-31 18:30:50,724 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1422728945460_0001_000001 with final state: FINISHING, and exit status: -1000
2015-01-31 18:30:50,729 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1422728945460_0001_000001 State change from RUNNING to FINAL_SAVING

It seems to have something to do with reservations that cannot be satisfied. I am running both my job and SparkPi without any specific resource demands, and SparkPi runs fine, so I am not sure why that is. But at least now I know where to dig further. Thanks! (A sketch of the kind of explicit resource demands I have in mind is at the bottom of this mail, below the quoted thread.)

On Sat, Jan 31, 2015 at 3:28 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Can you look inside RM log to see if you can find some clue there ?
>
> You can pastebin part of the RM log around the time your job ran ?
>
> What hadoop version are you using ?
>
> Thanks
>
> On Sat, Jan 31, 2015 at 11:24 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> i have a simple spark app that i run with spark-submit on yarn. it runs fine and shows up with finalStatus=SUCCEEDED in the resource manager logs.
>>
>> however in the nodemanager logs i see this:
>>
>> 2015-01-31 18:30:48,195 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18613 for container-id container_1422728945460_0001_01_000001: 146.7 MB of 1 GB physical memory used; 1.1 GB of 2.1 GB virtual memory used
>> 2015-01-31 18:30:51,166 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1422728945460_0001_01_000002 is : 1
>> 2015-01-31 18:30:51,168 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1422728945460_0001_01_000002 and exit code: 1
>> ExitCodeException exitCode=1:
>>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>>         at org.apache.hadoop.util.Shell.run(Shell.java:455)
>>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
>>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
>>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>         at java.lang.Thread.run(Thread.java:724)
>>
>> not much extra info besides a status code here.
>> i can find the logs for container_1422728945460_0001_01_000002, but they seem perfectly fine. this is the tail of them:
>>
>> 15/01/31 18:30:49 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@ip-10-176-1-83.ec2.internal:41307/user/MapOutputTracker#-840495197]
>> 15/01/31 18:30:49 INFO spark.MapOutputTrackerWorker: Got the output locations
>> 15/01/31 18:30:49 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
>> 15/01/31 18:30:49 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 9 ms
>> 15/01/31 18:30:49 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
>> 15/01/31 18:30:49 INFO util.Hadoop18TapUtil: setting up task: 'attempt_201501311830_0001_m_000000_1' - hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4/_temporary/_attempt_201501311830_0001_m_000000_1
>> 15/01/31 18:30:49 INFO io.TapOutputCollector: creating path: part-00000
>> 15/01/31 18:30:49 INFO io.TapOutputCollector: closing tap collector for: out4/part-00000
>> 15/01/31 18:30:49 INFO util.Hadoop18TapUtil: committing task: 'attempt_201501311830_0001_m_000000_1' - hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4/_temporary/_attempt_201501311830_0001_m_000000_1
>> 15/01/31 18:30:50 INFO util.Hadoop18TapUtil: saved output of task 'attempt_201501311830_0001_m_000000_1' to hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4
>> 15/01/31 18:30:50 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 800 bytes result sent to driver
>> 15/01/31 18:30:50 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@ip-10-176-1-83.ec2.internal:36521] -> [akka.tcp://sparkDriver@ip-10-176-1-83.ec2.internal:41307] disassociated! Shutting down.
>>
>> so i am not sure why i am seeing that exit status. i ignored it so far, except that it seems to have a weird side effect: after this job runs no other jobs launch anymore on the same node, as if it got blacklisted or something like that. and since this is a single node pseudo-distributed cluster i use for testing nothing runs anymore. a restart of yarn fixes that.
>>
>> i do not see the same problem with say SparkPi, so it seems to be something in my app, but i do not know where to look.
>> any ideas? thanks! koert
>>
>
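P.S. Since the RM log points at a reservation the single node cannot satisfy, the next thing I plan to try is giving the job explicit, small resource demands instead of relying on the defaults. This is only a sketch with assumed values (the 512m sizing, single executor, and app/object names are made up for illustration, not something I have verified fixes the failure):

import org.apache.spark.{SparkConf, SparkContext}

object SmallFootprintApp {
  def main(args: Array[String]): Unit = {
    // keep each container request well under what the single test node has free,
    // so the FairScheduler never has to park a reservation it cannot satisfy
    val conf = new SparkConf()
      .setAppName("small-footprint-test")     // hypothetical app name
      .set("spark.executor.memory", "512m")   // assumed value, smaller than the 1g default
      .set("spark.executor.instances", "1")   // one executor on the pseudo-distributed node
      .set("spark.executor.cores", "1")
    val sc = new SparkContext(conf)
    try {
      // trivial job just to exercise container allocation on yarn
      val n = sc.parallelize(1 to 1000, 2).map(_ * 2).count()
      println(s"count = $n")
    } finally {
      sc.stop()
    }
  }
}

The same could be done on the spark-submit command line with --executor-memory and --num-executors; the point is only to make the container requests explicit and small enough that the scheduler does not have to reserve beyond what the node can offer.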