Which Hadoop version are you running? This should have been fixed recently.

Jian
On Thu, Mar 13, 2014 at 8:33 PM, Hitesh Shah <hit...@apache.org> wrote:
> Hi John
>
> Would you mind filing a jira with more details. The RM going down just
> because a host was not resolvable or DNS timed out is something that should
> be addressed.
>
> thanks
> -- Hitesh
>
> On Mar 13, 2014, at 2:29 PM, John Lilley wrote:
>
> > Never mind... we figured out its DNS entry was going missing.
> > john
> >
> > From: John Lilley [mailto:john.lil...@redpoint.net]
> > Sent: Thursday, March 13, 2014 2:52 PM
> > To: user@hadoop.apache.org
> > Subject: ResourceManager shutting down
> >
> > We have this erratic behavior where every so often the RM will shutdown
> > with an UnknownHostException. The odd thing is, the host it complains
> > about have been in use for days at that point without problem. Any ideas?
> > Thanks,
> > John
> >
> > 2014-03-13 14:38:14,746 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(578)) - application_1394204725813_0220 State change from ACCEPTED to RUNNING
> > 2014-03-13 14:38:15,794 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(449)) - Error in handling event type NODE_UPDATE to the scheduler
> > java.lang.IllegalArgumentException: java.net.UnknownHostException: skitzo.office.datalever.com
> >     at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
> >     at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:247)
> >     at org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:195)
> >     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.createContainerToken(LeafQueue.java:1297)
> >     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1345)
> >     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1211)
> >     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1170)
> >     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:871)
> >     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:645)
> >     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559)
> >     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:690)
> >     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:734)
> >     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:86)
> >     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> >     at java.lang.Thread.run(Thread.java:662)
> > Caused by: java.net.UnknownHostException: skitzo.office.datalever.com
> >     ... 15 more
> > 2014-03-13 14:38:15,794 INFO resourcemanager.ResourceManager (ResourceManager.java:run(453)) - Exiting, bbye..
> > 2014-03-13 14:38:15,911 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped selectchannelconnec...@metallica.office.datalever.com:8088
> > 2014-03-13 14:38:16,013 ERROR delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:run(557)) - InterruptedExcpetion recieved for ExpiredTokenRemover thread java.lang.InterruptedException: sleep interrupted
> > 2014-03-13 14:38:16,013 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(200)) - Stopping ResourceManager metrics system...
> > 2014-03-13 14:38:16,014 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(206)) - ResourceManager metrics system stopped.
> > 2014-03-13 14:38:16,014 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(572)) - ResourceManager metrics system shutdown complete.
> > 2014-03-13 14:38:16,015 WARN amlauncher.ApplicationMasterLauncher (ApplicationMasterLauncher.java:run(98)) - org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher$LauncherThread interrupted. Returning.
> > 2014-03-13 14:38:16,015 INFO ipc.Server (Server.java:stop(2442)) - Stopping server on 8141
> > 2014-03-13 14:38:16,017 INFO ipc.Server (Server.java:stop(2442)) - Stopping server on 8050
> > ... and so on, it shuts down
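
Since the root cause here turned out to be a DNS entry that intermittently went missing, a small resolution probe run from the RM host may help catch the gap when it happens. What follows is only a minimal sketch and not part of Hadoop: `DnsProbe`, `resolve`, and `resolves` are hypothetical names. It relies solely on `java.net.InetAddress.getByName`, which throws `UnknownHostException` when a name cannot be resolved, the same exception that appears as the cause in the trace above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Optional;

// Hypothetical diagnostic helper (not a Hadoop class): checks whether a
// host name currently resolves from the machine this runs on.
final class DnsProbe {
    private DnsProbe() {}

    // Returns the resolved address, or an empty Optional if resolution fails.
    static Optional<InetAddress> resolve(String host) {
        try {
            return Optional.of(InetAddress.getByName(host));
        } catch (UnknownHostException e) {
            // The failure is reported to the caller instead of propagating.
            return Optional.empty();
        }
    }

    // Convenience wrapper: true if the name resolves right now.
    static boolean resolves(String host) {
        return resolve(host).isPresent();
    }

    // Prints one line per host name given on the command line.
    public static void main(String[] args) {
        for (String host : args) {
            String result = resolve(host)
                    .map(InetAddress::getHostAddress)
                    .orElse("UNRESOLVED");
            System.out.println(host + " -> " + result);
        }
    }
}
```

Running it periodically (for example from cron) with the NodeManager host names as arguments and logging the output would show whether a name such as the one in the trace drops out of DNS. Note that the JVM caches successful lookups for a while, so a fresh process per probe gives a more honest picture than a long-running loop.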