i have a simple spark app that i run with spark-submit on yarn. it runs
fine and shows up with finalStatus=SUCCEEDED in the resource manager logs.

however in the nodemanager logs i see this:

2015-01-31 18:30:48,195 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18613 for container-id container_1422728945460_0001_01_000001: 146.7 MB of 1 GB physical memory used; 1.1 GB of 2.1 GB virtual memory used
2015-01-31 18:30:51,166 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1422728945460_0001_01_000002 is : 1
2015-01-31 18:30:51,168 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1422728945460_0001_01_000002 and exit code: 1
ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)

not much extra info besides a status code here.
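
for anyone who wants to line up exit codes with container ids across a longer nodemanager log, here is a rough python sketch. it assumes the DefaultContainerExecutor message format shown above; the function name and the sample text are only illustrative, not part of yarn or spark.

```python
import re

# Matches the DefaultContainerExecutor warning quoted above, e.g.
# "Exit code from container container_..._000002 is : 1".
# \s+ also matches newlines, so hard-wrapped log text still matches.
EXIT_RE = re.compile(
    r"Exit\s+code\s+from\s+container\s+(container_\w+)\s+is\s*:\s*(-?\d+)"
)

def container_exit_codes(log_text):
    """Return {container_id: exit_code} for every exit-code warning found."""
    # Undo backslash-newline wrapping (as added by some mail archives)
    # so that container ids split across lines are rejoined.
    unwrapped = log_text.replace("\\\n", "")
    return {cid: int(code) for cid, code in EXIT_RE.findall(unwrapped)}

# Sample text copied from the nodemanager log in this message.
sample = (
    "2015-01-31 18:30:51,166 WARN\n"
    "org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit\n"
    "code from container container_1422728945460_0001_01_000002 is : 1\n"
)
print(container_exit_codes(sample))
```

this only extracts what the log already says; it does not explain why the container exited with that code.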
i can find the logs for container_1422728945460_0001_01_000002, but they
seem perfectly fine. this is the tail of them:

15/01/31 18:30:49 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@ip-10-176-1-83.ec2.internal:41307/user/MapOutputTracker#-840495197]
15/01/31 18:30:49 INFO spark.MapOutputTrackerWorker: Got the output locations
15/01/31 18:30:49 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/01/31 18:30:49 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 9 ms
15/01/31 18:30:49 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
15/01/31 18:30:49 INFO util.Hadoop18TapUtil: setting up task: 'attempt_201501311830_0001_m_000000_1' - hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4/_temporary/_attempt_201501311830_0001_m_000000_1
15/01/31 18:30:49 INFO io.TapOutputCollector: creating path: part-00000
15/01/31 18:30:49 INFO io.TapOutputCollector: closing tap collector for: out4/part-00000
15/01/31 18:30:49 INFO util.Hadoop18TapUtil: committing task: 'attempt_201501311830_0001_m_000000_1' - hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4/_temporary/_attempt_201501311830_0001_m_000000_1
15/01/31 18:30:50 INFO util.Hadoop18TapUtil: saved output of task 'attempt_201501311830_0001_m_000000_1' to hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4
15/01/31 18:30:50 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 800 bytes result sent to driver
15/01/31 18:30:50 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@ip-10-176-1-83.ec2.internal:36521] -> [akka.tcp://sparkDriver@ip-10-176-1-83.ec2.internal:41307] disassociated! Shutting down.

so i am not sure why i am seeing that exit status. i have ignored it so far,
except that it seems to have a weird side effect: after this job runs, no
other jobs launch on the same node anymore, as if the node got blacklisted
or something like that. since this is a single-node pseudo-distributed
cluster that i use for testing, nothing runs anymore until i restart yarn.

i do not see the same problem with, say, SparkPi, so it seems to be something
in my app, but i do not know where to look.
any ideas? thanks!
koert
