[ https://issues.apache.org/jira/browse/YARN-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Badger updated YARN-4756:
------------------------------
    Attachment: YARN-4756.003.patch

[~kasha], I wasn't clear in my original text. The patches in [YARN-4686] do not break any extra tests. However, while exploring the fixes for those failures, I came across an unnecessary wait in the NodeStatusUpdater thread, NodeStatusUpdaterImpl:850. When a reboot happens, the isStopped variable is set to true, but the thread waits until the next heartbeat. The next heartbeat won't come and so it will wait for a heartbeat timeout. So instead of wasting this time unnecessarily, I added a notify to wake the thread up and let it know to continue in the loop, where it would find that isStopped is set to true.

Adding in this optimization uncovered a race condition in the TestNodeManagerResync test. The test doesn't wait for the NM to completely reboot before it checks for its updated capabilities. The only reason that it worked before is because the unnecessary wait in the NodeStatusUpdater acted as a sleep that masked the race condition.

I'm uploading a patch that removes the unnecessary wait in the NodeStatusUpdater thread and also fixes the race condition in TestNodeManagerResync that it uncovers.

> Unnecessary wait in Node Status Updater during reboot
> -----------------------------------------------------
>
>                 Key: YARN-4756
>                 URL: https://issues.apache.org/jira/browse/YARN-4756
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>         Attachments: YARN-4756.001.patch, YARN-4756.002.patch,
> YARN-4756.003.patch
>
>
> The startStatusUpdater thread waits for the isStopped variable to be set to
> true, but it is waiting for the next heartbeat. During a reboot, the next
> heartbeat will not come and so the thread waits for a timeout. Instead, we
> should notify the thread to continue so that it can check the isStopped
> variable and exit without having to wait for a timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
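
For readers following the thread, the wait/notify change described in the comment above can be illustrated with a small standalone sketch. This is not the attached patch and not the actual NodeStatusUpdaterImpl code: apart from isStopped, every name here (HeartbeatLoopSketch, heartbeatMonitor, measureShutdownMillis, the interval values) is a hypothetical stand-in. It only shows the general pattern, in which a loop sleeps on a monitor between heartbeats and the stop path notifies that monitor so the loop can re-check isStopped right away.

```java
// Sketch only: all names except isStopped are hypothetical, not taken from Hadoop.
class HeartbeatLoopSketch {
    private final Object heartbeatMonitor = new Object();
    private final long heartbeatIntervalMs; // must be > 0; wait(0) would block forever
    private volatile boolean isStopped = false;
    private Thread updater;

    HeartbeatLoopSketch(long heartbeatIntervalMs) {
        this.heartbeatIntervalMs = heartbeatIntervalMs;
    }

    void start() {
        updater = new Thread(() -> {
            while (!isStopped) {
                // A real updater would send a heartbeat here.
                synchronized (heartbeatMonitor) {
                    if (isStopped) {
                        break; // stop was requested before we started waiting
                    }
                    try {
                        // Sleep until the next heartbeat is due, or until notified.
                        heartbeatMonitor.wait(heartbeatIntervalMs);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        }, "status-updater-sketch");
        updater.start();
    }

    // Sets isStopped; optionally wakes the updater so it re-checks the flag at once.
    void stop(boolean notifyUpdater) {
        synchronized (heartbeatMonitor) {
            isStopped = true;
            if (notifyUpdater) {
                heartbeatMonitor.notifyAll();
            }
        }
    }

    // Starts a loop, stops it, and returns how long the updater took to exit.
    static long measureShutdownMillis(long intervalMs, boolean notifyUpdater) {
        HeartbeatLoopSketch sketch = new HeartbeatLoopSketch(intervalMs);
        sketch.start();
        try {
            Thread.sleep(100); // give the updater time to reach wait()
            long begin = System.nanoTime();
            sketch.stop(notifyUpdater);
            sketch.updater.join();
            return (System.nanoTime() - begin) / 1_000_000L;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("with notify:    " + measureShutdownMillis(2_000L, true) + " ms");
        System.out.println("without notify: " + measureShutdownMillis(2_000L, false) + " ms");
    }
}
```

In this sketch, stopping without the notify typically leaves the thread parked for the rest of the interval, which is the kind of incidental delay that can hide a race in a test that checks state too early.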