[jira] [Created] (YARN-8984) OutstandingSchedRequests in AMRMClient could not be removed when AllocationTags is null or empty
Yang Wang created YARN-8984:
---
Summary: OutstandingSchedRequests in AMRMClient could not be removed when AllocationTags is null or empty
Key: YARN-8984
URL: https://issues.apache.org/jira/browse/YARN-8984
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

In AMRMClient, entries in outstandingSchedRequests should be removed or decremented when a container is allocated. However, this does not happen when the allocation tags are null or empty.
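A minimal, self-contained sketch of the suspected failure mode, assuming the client indexes outstanding SchedulingRequests by their allocation tags (the class and method names below are illustrative, not the real AMRMClientImpl internals):

{code:java}
import java.util.*;

// Illustrative model only; not the actual AMRMClientImpl code.
class OutstandingSchedRequests {
  private final Map<Set<String>, Integer> outstanding = new HashMap<>();

  void add(Set<String> tags, int count) {
    outstanding.merge(tags, count, Integer::sum);
  }

  // Called when a container is allocated against a SchedulingRequest.
  void onAllocated(Set<String> allocationTags) {
    if (allocationTags == null || allocationTags.isEmpty()) {
      return; // bug shape: untagged requests are never removed/decremented
    }
    outstanding.computeIfPresent(allocationTags,
        (tags, n) -> n > 1 ? n - 1 : null); // drop entry when count hits zero
  }
}
{code}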
[jira] [Created] (YARN-8331) Race condition in NM container launched after done
Yang Wang created YARN-8331:
---
Summary: Race condition in NM container launched after done
Key: YARN-8331
URL: https://issues.apache.org/jira/browse/YARN-8331
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

While a container is launching in ContainerLaunch#launchContainer (state SCHEDULED), a kill event is sent to the container, driving its state SCHEDULED -> KILLING -> DONE. ContainerLaunch then sends the CONTAINER_LAUNCHED event anyway and starts the container processes. These orphaned container processes are never cleaned up.

{code:java}
2018-05-21 13:11:56,114 INFO [Thread-11] nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(94)) - USER=nobody OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_0_ CONTAINERID=container_0__01_00
2018-05-21 13:11:56,114 INFO [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application application_0_ transitioned from NEW to INITING
2018-05-21 13:11:56,114 INFO [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:transition(446)) - Adding container_0__01_00 to application application_0_
2018-05-21 13:11:56,118 INFO [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application application_0_ transitioned from INITING to RUNNING
2018-05-21 13:11:56,119 INFO [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container container_0__01_00 transitioned from NEW to SCHEDULED
2018-05-21 13:11:56,119 INFO [NM ContainerManager dispatcher] containermanager.AuxServices (AuxServices.java:handle(220)) - Got event CONTAINER_INIT for appId application_0_
2018-05-21 13:11:56,119 INFO [NM ContainerManager dispatcher] scheduler.ContainerScheduler (ContainerScheduler.java:startContainer(504)) - Starting container [container_0__01_00]
2018-05-21 13:11:56,226 INFO [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container container_0__01_00 transitioned from SCHEDULED to KILLING
2018-05-21 13:11:56,227 INFO [NM ContainerManager dispatcher] containermanager.TestContainerManager (BaseContainerManagerTest.java:delete(287)) - Psuedo delete: user - nobody, type - FILE
2018-05-21 13:11:56,227 INFO [NM ContainerManager dispatcher] nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(94)) - USER=nobody OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_0_ CONTAINERID=container_0__01_00
2018-05-21 13:11:56,238 INFO [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container container_0__01_00 transitioned from KILLING to DONE
2018-05-21 13:11:56,238 INFO [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:transition(489)) - Removing container_0__01_00 from application application_0_
2018-05-21 13:11:56,239 INFO [NM ContainerManager dispatcher] monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:onStopMonitoringContainer(932)) - Stopping resource-monitoring for container_0__01_00
2018-05-21 13:11:56,239 INFO [NM ContainerManager dispatcher] containermanager.AuxServices (AuxServices.java:handle(220)) - Got event CONTAINER_STOP for appId application_0_
2018-05-21 13:11:56,274 WARN [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2106)) - Can't handle this event at current state: Current: [DONE], eventType: [CONTAINER_LAUNCHED], container: [container_0__01_00]
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CONTAINER_LAUNCHED at DONE
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:2104)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:104)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1525)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1518)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at
{code}
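A self-contained sketch of the kind of guard that would close this race: the launch path re-checks, under the same lock the kill path takes, whether the container already reached a terminal state before spawning processes. The class below is illustrative, not the actual ContainerLaunch/ContainerImpl code.

{code:java}
// Illustrative sketch only; not the actual NM implementation.
class LaunchGuard {
  enum State { SCHEDULED, LAUNCHED, KILLING, DONE }

  private State state = State.SCHEDULED;

  // Launch path: only spawn processes if the kill path has not already won.
  synchronized boolean tryMarkLaunched() {
    if (state != State.SCHEDULED) {
      return false;          // kill won the race; caller must not spawn
    }
    state = State.LAUNCHED;  // CONTAINER_LAUNCHED published atomically with the check
    return true;
  }

  // Kill path: SCHEDULED/LAUNCHED -> ... -> DONE.
  synchronized void markKilled() {
    state = State.DONE;
  }
}
{code}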
[jira] [Created] (YARN-8153) Guaranteed containers always stay in SCHEDULED on NM after restart
Yang Wang created YARN-8153:
---
Summary: Guaranteed containers always stay in SCHEDULED on NM after restart
Key: YARN-8153
URL: https://issues.apache.org/jira/browse/YARN-8153
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

When NM recovery is enabled, some containers stay in SCHEDULED forever after an NM restart because the scheduler sees insufficient resources. The root cause is that utilizationTracker.addContainerResources is called twice during restart, so the recovered containers' resources are double-counted.
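A self-contained sketch of the double-counting effect; the class below is a stand-in for the NM's utilization tracker, not the real one:

{code:java}
// Illustrative stand-in for the NM utilization tracker; not the real class.
class UtilizationTrackerModel {
  private long allocatedMB = 0;
  private final long capacityMB;

  UtilizationTrackerModel(long capacityMB) { this.capacityMB = capacityMB; }

  void addContainerResources(long mb) { allocatedMB += mb; }

  boolean hasRoomFor(long mb) { return allocatedMB + mb <= capacityMB; }

  public static void main(String[] args) {
    UtilizationTrackerModel tracker = new UtilizationTrackerModel(4096);
    tracker.addContainerResources(3072); // recovery path counts the container
    tracker.addContainerResources(3072); // scheduling path counts it again
    // A 1 GB guaranteed container now never starts and stays in SCHEDULED:
    System.out.println(tracker.hasRoomFor(1024)); // false
  }
}
{code}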
[jira] [Created] (YARN-7661) NodeManager metrics return wrong value after node resource update
Yang Wang created YARN-7661:
---
Summary: NodeManager metrics return wrong value after node resource update
Key: YARN-7661
URL: https://issues.apache.org/jira/browse/YARN-7661
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

{code:title=NodeManagerMetrics.java}
public void addResource(Resource res) {
  availableMB = availableMB + res.getMemorySize();
  availableGB.incr((int) Math.floor(availableMB / 1024d));
  availableVCores.incr(res.getVirtualCores());
}
{code}

When the node resource is updated through the RM-NM heartbeat, the NM metrics get a wrong value. The root cause is that the new memory has already been accumulated into availableMB, so the availableGB gauge must not be incremented by the whole availableMB total again; every update after the first overcounts.
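One possible fix sketch (not necessarily the committed patch), assuming the gauge can be set to an absolute value, which Hadoop's MutableGaugeInt supports via set(int):

{code:title=NodeManagerMetrics.java (fix sketch)}
public void addResource(Resource res) {
  availableMB = availableMB + res.getMemorySize();
  // Set the gauge to the new absolute total instead of incrementing by it,
  // so repeated node resource updates do not compound.
  availableGB.set((int) Math.floor(availableMB / 1024d));
  availableVCores.incr(res.getVirtualCores());
}
{code}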
[jira] [Created] (YARN-7660) NodeManager metrics return wrong value after node resource update
Yang Wang created YARN-7660:
---
Summary: NodeManager metrics return wrong value after node resource update
Key: YARN-7660
URL: https://issues.apache.org/jira/browse/YARN-7660
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

{code:title=NodeManagerMetrics.java}
public void addResource(Resource res) {
  availableMB = availableMB + res.getMemorySize();
  availableGB.incr((int) Math.floor(availableMB / 1024d));
  availableVCores.incr(res.getVirtualCores());
}
{code}

When the node resource is updated through the RM-NM heartbeat, the NM metrics get a wrong value. The root cause is that the new memory has already been accumulated into availableMB, so the availableGB gauge must not be incremented by the whole availableMB total again; every update after the first overcounts.
[jira] [Created] (YARN-7659) NodeManager metrics return wrong value after resource update
Yang Wang created YARN-7659:
---
Summary: NodeManager metrics return wrong value after resource update
Key: YARN-7659
URL: https://issues.apache.org/jira/browse/YARN-7659
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

{code:title=NodeManagerMetrics.java}
public void addResource(Resource res) {
  availableMB = availableMB + res.getMemorySize();
  availableGB.incr((int) Math.floor(availableMB / 1024d));
  availableVCores.incr(res.getVirtualCores());
}
{code}

When the node resource is updated through the RM-NM heartbeat, the NM metrics get a wrong value. The root cause is that the new memory has already been accumulated into availableMB, so the availableGB gauge must not be incremented by the whole availableMB total again; every update after the first overcounts.
[jira] [Created] (YARN-7647) NM prints inappropriate error log when node-labels is enabled
Yang Wang created YARN-7647:
---
Summary: NM prints inappropriate error log when node-labels is enabled
Key: YARN-7647
URL: https://issues.apache.org/jira/browse/YARN-7647
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

{code:title=NodeStatusUpdaterImpl.java}
...
if (response.getAreNodeLabelsAcceptedByRM() && LOG.isDebugEnabled()) {
  LOG.debug("Node Labels {" + StringUtils.join(",", previousNodeLabels)
      + "} were Accepted by RM ");
} else {
  // case where updated labels from NodeLabelsProvider is sent to RM and
  // RM rejected the labels
  LOG.error(
      "NM node labels {" + StringUtils.join(",", previousNodeLabels)
          + "} were not accepted by RM and message from RM : "
          + response.getDiagnosticsMessage());
}
...
{code}

When LOG.isDebugEnabled() is false, the NM always takes the else branch and prints the error log even though the RM accepted the labels. This is an obvious error and is very misleading.
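A straightforward fix sketch: decouple the debug-level check from the acceptance check so the error branch only runs when the RM actually rejected the labels (sketch only, not necessarily the committed patch):

{code:title=NodeStatusUpdaterImpl.java (fix sketch)}
if (response.getAreNodeLabelsAcceptedByRM()) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Node Labels {" + StringUtils.join(",", previousNodeLabels)
        + "} were Accepted by RM ");
  }
} else {
  // only reached when the RM actually rejected the updated labels
  LOG.error("NM node labels {" + StringUtils.join(",", previousNodeLabels)
      + "} were not accepted by RM and message from RM : "
      + response.getDiagnosticsMessage());
}
{code}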
[jira] [Created] (YARN-6966) NodeManager metrics may return wrong negative values after restart
Yang Wang created YARN-6966:
---
Summary: NodeManager metrics may return wrong negative values after restart
Key: YARN-6966
URL: https://issues.apache.org/jira/browse/YARN-6966
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

Just as in YARN-6212. However, I think it is not a duplicate of YARN-3933. The primary cause of the negative values is that the metrics are not recovered properly when the NM restarts. AllocatedContainers, ContainersLaunched, AllocatedGB, AvailableGB, AllocatedVCores and AvailableVCores also need to be recovered on NM restart. This should be done in ContainerManagerImpl#recoverContainer.

The scenario can be reproduced by the following steps:
# Make sure YarnConfiguration.NM_RECOVERY_ENABLED=true and YarnConfiguration.NM_RECOVERY_SUPERVISED=true in the NM
# Submit an application and keep it running
# Restart the NM
# Stop the application
# Now you get the negative values from
{code}
/jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics
{code}

{code}
{
  name: "Hadoop:service=NodeManager,name=NodeManagerMetrics",
  modelerType: "NodeManagerMetrics",
  tag.Context: "yarn",
  tag.Hostname: "hadoop.com",
  ContainersLaunched: 0,
  ContainersCompleted: 0,
  ContainersFailed: 2,
  ContainersKilled: 0,
  ContainersIniting: 0,
  ContainersRunning: 0,
  AllocatedGB: 0,
  AllocatedContainers: -2,
  AvailableGB: 160,
  AllocatedVCores: -11,
  AvailableVCores: 3611,
  ContainerLaunchDurationNumOps: 2,
  ContainerLaunchDurationAvgTime: 6,
  BadLocalDirs: 0,
  BadLogDirs: 0,
  GoodLocalDirsDiskUtilizationPerc: 2,
  GoodLogDirsDiskUtilizationPerc: 2
}
{code}
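A hedged sketch of the recovery hook the report suggests. NodeManagerMetrics#allocateContainer is a real method that adjusts the Allocated*/Available* gauges; the helper below and its placement inside ContainerManagerImpl#recoverContainer are assumptions, not the committed patch:

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.nodemanager.metrics.NodeManagerMetrics;

// Sketch only: re-apply each recovered container's resource to the metrics
// during recovery so that the eventual release nets out to zero instead of
// driving AllocatedContainers/AllocatedVCores negative.
class MetricsRecoverySketch {
  static void recoverContainerMetrics(NodeManagerMetrics metrics,
      Resource recoveredResource) {
    metrics.allocateContainer(recoveredResource);
  }
}
{code}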
[jira] [Created] (YARN-6951) Fix debug log when Resource handler chain is enabled
Yang Wang created YARN-6951:
---
Summary: Fix debug log when Resource handler chain is enabled
Key: YARN-6951
URL: https://issues.apache.org/jira/browse/YARN-6951
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

{code:title=LinuxContainerExecutor.java}
...
if (LOG.isDebugEnabled()) {
  LOG.debug("Resource handler chain enabled = "
      + (resourceHandlerChain == null));
}
...
{code}

I think it is just a typo. When resourceHandlerChain is not null, the log should print "Resource handler chain enabled = true", so the condition should be (resourceHandlerChain != null).
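The one-character fix sketch, flipping == to !=:

{code:title=LinuxContainerExecutor.java (fix sketch)}
if (LOG.isDebugEnabled()) {
  // report "enabled = true" exactly when a handler chain is present
  LOG.debug("Resource handler chain enabled = "
      + (resourceHandlerChain != null));
}
{code}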
[jira] [Created] (YARN-6630) Container worker dir could not be recovered when NM restarts
Yang Wang created YARN-6630:
---
Summary: Container worker dir could not be recovered when NM restarts
Key: YARN-6630
URL: https://issues.apache.org/jira/browse/YARN-6630
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

When ContainerRetryPolicy is NEVER_RETRY, the container worker dir is not saved in the NM state store. After the NM restarts, container.workDir is null, which may cause other exceptions.

{code:title=ContainerLaunch.java}
private void recordContainerWorkDir(ContainerId containerId,
    String workDir) throws IOException {
  container.setWorkDir(workDir);
  if (container.isRetryContextSet()) {
    context.getNMStateStore().storeContainerWorkDir(containerId, workDir);
  }
}
{code}

{code:title=ContainerImpl.java}
static class ResourceLocalizedWhileRunningTransition
    extends ContainerTransition {
  ...
  String linkFile = new Path(container.workDir, link).toString();
  ...
}
{code}
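A minimal fix sketch, assuming it is acceptable to persist the work dir unconditionally (whether the committed patch did exactly this is not confirmed here):

{code:title=ContainerLaunch.java (fix sketch)}
private void recordContainerWorkDir(ContainerId containerId,
    String workDir) throws IOException {
  container.setWorkDir(workDir);
  // store unconditionally so workDir can be recovered after an NM restart,
  // even when ContainerRetryPolicy is NEVER_RETRY
  context.getNMStateStore().storeContainerWorkDir(containerId, workDir);
}
{code}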
[jira] [Created] (YARN-6589) Recover all resources when NM restarts
Yang Wang created YARN-6589:
---
Summary: Recover all resources when NM restarts
Key: YARN-6589
URL: https://issues.apache.org/jira/browse/YARN-6589
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

When the NM restarts, containers are recovered. However, only the memory and vcores in the capability are recovered; all resource types need to be recovered.

{code:title=ContainerImpl.java}
// resource capability had been updated before NM was down
this.resource = Resource.newInstance(recoveredCapability.getMemorySize(),
    recoveredCapability.getVirtualCores());
{code}

It should be like this:

{code:title=ContainerImpl.java}
// resource capability had been updated before NM was down
// need to recover all resources, not only memory and vcores
this.resource = Resources.clone(recoveredCapability);
{code}
[jira] [Created] (YARN-6578) Return container resource utilization from NM ContainerStatus call
Yang Wang created YARN-6578:
---
Summary: Return container resource utilization from NM ContainerStatus call
Key: YARN-6578
URL: https://issues.apache.org/jira/browse/YARN-6578
Project: Hadoop YARN
Issue Type: New Feature
Reporter: Yang Wang

When the ApplicationMaster wants to change (increase/decrease) the resources of an allocated container, resource utilization is an important reference indicator for the decision. So when the AM calls NMClient.getContainerStatus, the resource utilization should be returned as well.
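A hypothetical usage sketch of what the feature would enable. NMClient#getContainerStatus and the ResourceUtilization record exist in YARN today, but the getUtilization() accessor on ContainerStatus below is an assumption of this proposal, not current API:

{code:java}
import java.io.IOException;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.api.records.ResourceUtilization;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

class UtilizationProbeSketch {
  // Hypothetical AM-side decision logic if ContainerStatus carried utilization.
  static boolean shouldDecrease(NMClient nmClient, ContainerId containerId,
      NodeId nodeId, int lowWatermarkMB) throws YarnException, IOException {
    ContainerStatus status = nmClient.getContainerStatus(containerId, nodeId);
    ResourceUtilization util = status.getUtilization(); // hypothetical accessor
    return util.getPhysicalMemory() < lowWatermarkMB;   // physical memory in MB
  }
}
{code}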