[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546439#comment-16546439 ]
Szilard Nemeth commented on YARN-6966: -------------------------------------- Hi [~haibochen] and [~rkanter]! Thanks for taking time for the review, see my latest patch that fixes the issues. [~haibochen]: Yes, you were right, it wasn't necessary to store the container token as it can be created from the startRequest. Moved the container recovery logic to the recoverActiveContainer() method. [~rkanter]: Your first comment no longer applies as I'm not saving the container token separately, see my answer to Haibo above. For the second comment, it is a very good idea to have the negative values in the testcase. I was struggling to reproduce it with a test and ultimetely I gave up as it's not that straightforward. I hope we can live with this and you think patch 005 is fine even without this kind of testcase. > NodeManager metrics may return wrong negative values when NM restart > -------------------------------------------------------------------- > > Key: YARN-6966 > URL: https://issues.apache.org/jira/browse/YARN-6966 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Yang Wang > Assignee: Szilard Nemeth > Priority: Major > Attachments: YARN-6966.001.patch, YARN-6966.002.patch, > YARN-6966.003.patch, YARN-6966.004.patch, YARN-6966.005.patch > > > Just as YARN-6212. However, I think it is not a duplicate of YARN-3933. > The primary cause of negative values is that metrics do not recover properly > when NM restart. > AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores > in metrics also need to recover when NM restart. > This should be done in ContainerManagerImpl#recoverContainer. > The scenario could be reproduction by the following steps: > # Make sure > YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true > in NM > # Submit an application and keep running > # Restart NM > # Stop the application > # Now you get the negative values > {code} > /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics > {code} > {code} > { > name: "Hadoop:service=NodeManager,name=NodeManagerMetrics", > modelerType: "NodeManagerMetrics", > tag.Context: "yarn", > tag.Hostname: "hadoop1111.com", > ContainersLaunched: 0, > ContainersCompleted: 0, > ContainersFailed: 2, > ContainersKilled: 0, > ContainersIniting: 0, > ContainersRunning: 0, > AllocatedGB: 0, > AllocatedContainers: -2, > AvailableGB: 160, > AllocatedVCores: -11, > AvailableVCores: 3611, > ContainerLaunchDurationNumOps: 2, > ContainerLaunchDurationAvgTime: 6, > BadLocalDirs: 0, > BadLogDirs: 0, > GoodLocalDirsDiskUtilizationPerc: 2, > GoodLogDirsDiskUtilizationPerc: 2 > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org