[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566683#comment-16566683 ]
Szilard Nemeth commented on YARN-6966:
--------------------------------------

Hi [~haibochen]!

I found out what is causing the test to fail. There is a missing backport on branch-2: YARN-7542.
In {{RecoveredContainerLaunch}}, in {{call}}, the {{ContainerEventType}} being sent is {{PAUSED}} instead of {{CONTAINER_LAUNCHED}}. This ultimately puts the container into the PAUSED state instead of RUNNING. The running container metric is incremented only when the container becomes RUNNING, so it is never incremented here.
Could you please backport YARN-7542 to branch-2?

Thanks!

> NodeManager metrics may return wrong negative values when NM restart
> --------------------------------------------------------------------
>
>                 Key: YARN-6966
>                 URL: https://issues.apache.org/jira/browse/YARN-6966
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Yang Wang
>            Assignee: Szilard Nemeth
>            Priority: Major
>             Fix For: 3.2.0, 3.0.4, 3.1.2
>
>         Attachments: YARN-6966-branch-2.001.patch,
> YARN-6966-branch-2.002.patch, YARN-6966-branch-2.002.patch,
> YARN-6966-branch-2.002.patch, YARN-6966-branch-3.0.0.001.patch,
> YARN-6966-branch-3.0.001.patch, YARN-6966.001.patch, YARN-6966.002.patch,
> YARN-6966.003.patch, YARN-6966.004.patch, YARN-6966.005.patch,
> YARN-6966.005.patch, YARN-6966.006.patch
>
>
> Just as YARN-6212. However, I think it is not a duplicate of YARN-3933.
> The primary cause of the negative values is that the metrics are not recovered properly
> when the NM restarts.
> AllocatedContainers, ContainersLaunched, AllocatedGB, AvailableGB, AllocatedVCores and AvailableVCores
> in the metrics also need to be recovered when the NM restarts.
> This should be done in ContainerManagerImpl#recoverContainer.
> The scenario can be reproduced with the following steps:
> # Make sure YarnConfiguration.NM_RECOVERY_ENABLED=true and YarnConfiguration.NM_RECOVERY_SUPERVISED=true in the NM
> # Submit an application and keep it running
> # Restart the NM
> # Stop the application
> # Now you get the negative values
> {code}
> /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics
> {code}
> {code}
> {
>   name: "Hadoop:service=NodeManager,name=NodeManagerMetrics",
>   modelerType: "NodeManagerMetrics",
>   tag.Context: "yarn",
>   tag.Hostname: "hadoop1111.com",
>   ContainersLaunched: 0,
>   ContainersCompleted: 0,
>   ContainersFailed: 2,
>   ContainersKilled: 0,
>   ContainersIniting: 0,
>   ContainersRunning: 0,
>   AllocatedGB: 0,
>   AllocatedContainers: -2,
>   AvailableGB: 160,
>   AllocatedVCores: -11,
>   AvailableVCores: 3611,
>   ContainerLaunchDurationNumOps: 2,
>   ContainerLaunchDurationAvgTime: 6,
>   BadLocalDirs: 0,
>   BadLogDirs: 0,
>   GoodLocalDirsDiskUtilizationPerc: 2,
>   GoodLogDirsDiskUtilizationPerc: 2
> }
> {code}
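
For illustration only, the failure mode described in the issue can be sketched with a toy model. This is not NodeManager code: the class and method names below are hypothetical, and the per-container vcore values (5 and 6) are assumptions picked so that the result lines up with the AllocatedContainers and AllocatedVCores values quoted above. The idea is that the in-memory counters start from zero after a restart, so if recovered containers are not counted again, their later release drives the counters below zero.

```java
// Hypothetical, simplified model of the counters. None of these names are Hadoop APIs.
final class NmMetricsSketch {

    /** In-memory counters; a freshly created instance starts at zero, like a restarted process. */
    static final class Metrics {
        int allocatedContainers;
        int allocatedVCores;

        void allocateContainer(int vcores) {
            allocatedContainers++;
            allocatedVCores += vcores;
        }

        void releaseContainer(int vcores) {
            allocatedContainers--;
            allocatedVCores -= vcores;
        }
    }

    /**
     * Simulates: containers are running, the NM restarts, then the application stops.
     * If recoverMetrics is false, recovered containers are not counted again after the restart.
     */
    static Metrics restartThenStop(int[] containerVCores, boolean recoverMetrics) {
        // Before the restart the containers are allocated and counted normally.
        Metrics beforeRestart = new Metrics();
        for (int v : containerVCores) {
            beforeRestart.allocateContainer(v);
        }

        // The restart discards the old counters; the new process begins at zero.
        Metrics afterRestart = new Metrics();
        if (recoverMetrics) {
            // Recovery step: count each recovered container again.
            for (int v : containerVCores) {
                afterRestart.allocateContainer(v);
            }
        }

        // Stopping the application releases every container.
        for (int v : containerVCores) {
            afterRestart.releaseContainer(v);
        }
        return afterRestart;
    }

    public static void main(String[] args) {
        int[] vcores = {5, 6}; // assumed values, for illustration only

        Metrics withoutRecovery = restartThenStop(vcores, false);
        System.out.println("without recovery: containers=" + withoutRecovery.allocatedContainers
                + ", vcores=" + withoutRecovery.allocatedVCores);

        Metrics withRecovery = restartThenStop(vcores, true);
        System.out.println("with recovery: containers=" + withRecovery.allocatedContainers
                + ", vcores=" + withRecovery.allocatedVCores);
    }
}
```

In this model, counting the recovered containers again during recovery brings both counters back to zero once the application stops, which is the kind of adjustment the issue proposes for ContainerManagerImpl#recoverContainer.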