Can you look inside the RM log to see if you can find some clue there?

Can you pastebin the part of the RM log from around the time your job ran?
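
For example, assuming log aggregation is enabled and a default log location
(both are guesses about your setup), something like this should pull the
relevant lines; the application id below is taken from the container id in
your NodeManager log:

  # aggregated container logs for the finished application
  yarn logs -applicationId application_1422728945460_0001

  # or grep the ResourceManager log directly around the time the job ran
  grep container_1422728945460_0001 $HADOOP_HOME/logs/*resourcemanager*.log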

What Hadoop version are you using?
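
For example, running:

  hadoop version

will print the exact release and build info.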

Thanks

On Sat, Jan 31, 2015 at 11:24 AM, Koert Kuipers <ko...@tresata.com> wrote:

> I have a simple Spark app that I run with spark-submit on YARN. It runs
> fine and shows up with finalStatus=SUCCEEDED in the resource manager logs.
>
> However, in the NodeManager logs I see this:
>
> 2015-01-31 18:30:48,195 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 18613 for container-id
> container_1422728945460_0001_01_000001: 146.7 MB of 1 GB physical memory used;
> 1.1 GB of 2.1 GB virtual memory used
> 2015-01-31 18:30:51,166 WARN
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit
> code from container container_1422728945460_0001_01_000002 is : 1
> 2015-01-31 18:30:51,168 WARN
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
> Exception from container-launch with container ID:
> container_1422728945460_0001_01_000002 and exit code: 1
> ExitCodeException exitCode=1:
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>         at org.apache.hadoop.util.Shell.run(Shell.java:455)
>         at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>         at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
>         at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
>         at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:724)
>
> Not much extra info besides a status code here.
> I can find the logs for container_1422728945460_0001_01_000002, but they
> seem perfectly fine. This is the tail of them:
>
> 15/01/31 18:30:49 INFO spark.MapOutputTrackerWorker: Doing the fetch;
> tracker actor = Actor[akka.tcp://sparkDriver@ip-10-176-1-83.ec2.internal
> :41307/user/MapOutputTracker#-840495197]
> 15/01/31 18:30:49 INFO spark.MapOutputTrackerWorker: Got the output
> locations
> 15/01/31 18:30:49 INFO storage.ShuffleBlockFetcherIterator: Getting 1
> non-empty blocks out of 1 blocks
> 15/01/31 18:30:49 INFO storage.ShuffleBlockFetcherIterator: Started 0
> remote fetches in 9 ms
> 15/01/31 18:30:49 INFO Configuration.deprecation: mapred.work.output.dir
> is deprecated. Instead, use mapreduce.task.output.dir
> 15/01/31 18:30:49 INFO util.Hadoop18TapUtil: setting up task:
> 'attempt_201501311830_0001_m_000000_1' -
> hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4/_temporary/_attempt_201501311830_0001_m_000000_1
> 15/01/31 18:30:49 INFO io.TapOutputCollector: creating path: part-00000
> 15/01/31 18:30:49 INFO io.TapOutputCollector: closing tap collector for:
> out4/part-00000
> 15/01/31 18:30:49 INFO util.Hadoop18TapUtil: committing task:
> 'attempt_201501311830_0001_m_000000_1' -
> hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4/_temporary/_attempt_201501311830_0001_m_000000_1
> 15/01/31 18:30:50 INFO util.Hadoop18TapUtil: saved output of task
> 'attempt_201501311830_0001_m_000000_1' to
> hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4
> 15/01/31 18:30:50 INFO executor.Executor: Finished task 0.0 in stage 1.0
> (TID 1). 800 bytes result sent to driver
> 15/01/31 18:30:50 ERROR executor.CoarseGrainedExecutorBackend: Driver
> Disassociated [akka.tcp://sparkExecutor@ip-10-176-1-83.ec2.internal:36521]
> -> [akka.tcp://sparkDriver@ip-10-176-1-83.ec2.internal:41307] disassociated! Shutting down.
>
> So I am not sure why I am seeing that exit status. I ignored it so far,
> except that it seems to have a weird side effect: after this job runs, no
> other jobs launch anymore on the same node, as if it got blacklisted or
> something like that. And since this is a single-node pseudo-distributed
> cluster that I use for testing, nothing runs anymore. A restart of YARN
> fixes that.
>
> I do not see the same problem with, say, SparkPi, so it seems to be
> something in my app, but I do not know where to look.
> Any ideas? Thanks! Koert
>
