Hi Vinod,

Here is the Diagnostics message from the RM Web UI page:

Application application_1424919411720_0878 failed 10 times due to Error launching appattempt_1424919411720_0878_000010. Got exception: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.hadoop.security.Credentials.readTokenStorageStream(Credentials.java:209)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.setupTokens(AMLauncher.java:226)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.createAMContainerLaunchContext(AMLauncher.java:198)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:108)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:254)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
. Failing the application.
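For what it is worth, the top frames of the trace show DataInputStream.readFully running out of bytes inside Credentials.readTokenStorageStream, which would be consistent with the token buffer being empty or truncated when the AM launch context is built. That is only my reading of the trace, not a confirmed cause. The standalone sketch below reproduces just the JDK behaviour and does not use any Hadoop classes; the class name, the helper headerReadHitsEof, and the 4-byte header length are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class EmptyTokenBufferDemo {

    // Hypothetical helper (not a Hadoop API): tries to read a fixed-size
    // 4-byte header with readFully and reports whether the stream ended early.
    static boolean headerReadHitsEof(byte[] buffer) throws IOException {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buffer))) {
            byte[] header = new byte[4]; // assumed header length, for illustration
            in.readFully(header);        // throws EOFException if fewer than 4 bytes remain
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // Empty buffer: readFully cannot fill the header.
        System.out.println(headerReadHitsEof(new byte[0]));            // prints "true"
        // Buffer holding at least 4 bytes: the header read succeeds.
        System.out.println(headerReadHitsEof(new byte[] {1, 2, 3, 4})); // prints "false"
    }
}
```

If that reading is right, the thing to check would be what tokens the submitting client puts into the application submission, rather than the retry interval.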

The log link only shows the following message and does not produce any stdout or stderr files:

Logs not available for container_1424919411720_0878_08_000001_14. Aggregation may not be complete, Check back later or try the nodemanager at hadoopdn01:8041

Here is the screenshot:
https://dl.dropboxusercontent.com/u/33705885/2015-03-02_163138.png

Thank you.

On Sat, Feb 28, 2015 at 2:56 AM, Vinod Kumar Vavilapalli <vino...@hortonworks.com> wrote:
> That's an old JIRA. The right solution is not an AM-retry interval but
> launching the AM somewhere.
>
> Why is your AM failing in the first place? If it is due to full-disk, the
> situation should be better with YARN-1781 - can you use the configuration
> (yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage)
> added at YARN-1781?
>
> +Vinod
>
> On Feb 27, 2015, at 7:31 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> Looks like this is related:
> https://issues.apache.org/jira/browse/YARN-964
>
> On Fri, Feb 27, 2015 at 4:29 AM, Nur Kholis Majid
> <nur.kholis.ma...@gmail.com> wrote:
>>
>> Hi All,
>>
>> I have many jobs failed because AM trying to rerun job in very short
>> interval (only in 6 second). How can I add the interval to bigger
>> value?
>>
>> https://dl.dropboxusercontent.com/u/33705885/2015-02-27_145104.png
>>
>> Thank you.