Ed Kohlwey created YARN-1114:
--------------------------------

             Summary: Resource Manager Failure Due to Unreachable DNS
                 Key: YARN-1114
                 URL: https://issues.apache.org/jira/browse/YARN-1114
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.1.0-beta
         Environment: Centos 6.3, Hortonworks vendor distro based on Hadoop 2.1
            Reporter: Ed Kohlwey

We encountered an issue last night where DNS was briefly unresolvable on our cluster. Our resource manager appears to have crashed because the hostname of a node manager could not be resolved. This is definitely not the right behavior, since anyone can crash the resource manager by advertising a node manager with an unresolvable hostname. It also makes the RM fragile in the face of transient network issues.

Here is the stack trace:

{noformat}
2013-08-28 05:06:24,703 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
java.lang.IllegalArgumentException: java.net.UnknownHostException: <hostname removed>
        at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
        at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:243)
        at org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:195)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.createContainer(AppSchedulable.java:160)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.assignContainer(AppSchedulable.java:237)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.assignContainer(AppSchedulable.java:338)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.assignContainer(AppSchedulable.java:364)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:160)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:149)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:907)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:980)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:110)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:413)
        at java.lang.Thread.run(Thread.java:724)
Caused by: java.net.UnknownHostException: <hostname removed>
        ... 14 more
{noformat}

The following is our version information (from the Hortonworks distro):

{noformat}
Hadoop 2.1.0.2.0.4.0-38
Subversion g...@github.com:hortonworks/hadoop.git -r 1c6feea9d537846789eb3337dc5b1a8911cfd60a
Compiled by jenkins on 2013-07-08T10:29Z
From source with checksum d1403d7842ef98c85d5f3d1332fa4
This command was run using /usr/lib/hadoop/hadoop-common-2.1.0.2.0.4.0-38.jar
{noformat}
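
To illustrate the kind of behavior I would expect instead, here is a minimal, self-contained sketch in plain Java (no Hadoop classes). Everything in it is an assumption made for illustration: `buildTokenServiceStrict` is a hypothetical stand-in that only mimics the exception shape seen in the stack trace (an `IllegalArgumentException` wrapping an `UnknownHostException`), not the actual `SecurityUtil.buildTokenService` code, and `tryBuildTokenService` is a hypothetical helper showing how a caller might skip a node for one update rather than let the exception escape. The hostname and port are made up.

```java
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.Optional;

class TokenServiceSketch {

    // Hypothetical stand-in: fails on an unresolved address with the same
    // exception shape as in the reported stack trace.
    static String buildTokenServiceStrict(InetSocketAddress addr) {
        if (addr.isUnresolved()) {
            throw new IllegalArgumentException(
                    new UnknownHostException(addr.getHostName()));
        }
        return addr.getAddress().getHostAddress() + ":" + addr.getPort();
    }

    // Hypothetical tolerant wrapper: a DNS failure yields an empty result so
    // the caller can skip this node for now; other errors still propagate.
    static Optional<String> tryBuildTokenService(InetSocketAddress addr) {
        try {
            return Optional.of(buildTokenServiceStrict(addr));
        } catch (IllegalArgumentException e) {
            if (e.getCause() instanceof UnknownHostException) {
                return Optional.empty();
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        // createUnresolved performs no DNS lookup, so this simulates a node
        // whose hostname cannot be resolved without depending on real DNS.
        InetSocketAddress node =
                InetSocketAddress.createUnresolved("nm-host.invalid", 45454);

        Optional<String> service = tryBuildTokenService(node);
        if (service.isPresent()) {
            System.out.println("token service: " + service.get());
        } else {
            System.out.println("skipping node, hostname not resolvable: "
                    + node.getHostName());
        }
    }
}
```

The point of the sketch is only that a resolution failure for one node can be treated as a per-node, retryable condition. Where such handling would best live in the RM (the scheduler, the token secret manager, or the event dispatcher) is a separate question.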