[ 
https://issues.apache.org/jira/browse/YARN-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-4756:
------------------------------
    Attachment: YARN-4756.003.patch

[~kasha], I wasn't clear in my original text. The patches in [YARN-4686] do not 
break any additional tests. However, while exploring fixes for those failures, I 
came across an unnecessary wait in the NodeStatusUpdater thread, at 
NodeStatusUpdaterImpl:850. When a reboot happens, the isStopped variable is set 
to true, but the thread keeps waiting for the next heartbeat. That heartbeat 
never comes, so the thread waits for the heartbeat timeout. Rather than waste 
this time, I added a notify that wakes the thread up so it continues in the 
loop, where it finds that isStopped is set to true. 
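
The wake-up idea can be illustrated with a minimal sketch. This is not the 
code from the patch: the class HeartbeatLoopSketch, the heartbeatMonitor 
object, and the method names are hypothetical stand-ins, and only the 
isStopped flag and the wait/notify pattern come from the description above. 

```java
/** Hypothetical stand-in for a heartbeat thread; not NodeStatusUpdaterImpl. */
class HeartbeatLoopSketch {
    private final Object heartbeatMonitor = new Object();
    private final long heartbeatIntervalMs;
    private volatile boolean isStopped = false;
    private Thread worker;

    HeartbeatLoopSketch(long heartbeatIntervalMs) {
        this.heartbeatIntervalMs = heartbeatIntervalMs;
    }

    void start() {
        worker = new Thread(() -> {
            while (!isStopped) {
                // ... a real updater would send its heartbeat here ...
                synchronized (heartbeatMonitor) {
                    if (isStopped) {
                        break; // stop() ran before we reached the wait
                    }
                    try {
                        // Sleeps until the interval elapses or stop() notifies.
                        heartbeatMonitor.wait(heartbeatIntervalMs);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        }, "heartbeat-sketch");
        worker.start();
    }

    void stop() {
        synchronized (heartbeatMonitor) {
            isStopped = true;
            // Without this notify, the worker would sleep out the interval.
            heartbeatMonitor.notifyAll();
        }
    }

    void join(long timeoutMs) throws InterruptedException {
        worker.join(timeoutMs);
    }

    boolean isAlive() {
        return worker.isAlive();
    }
}
```

Setting the flag and notifying under the same monitor that the worker waits 
on means the worker either sees isStopped before it waits or is woken from 
the wait, so it never sleeps through a stop request. 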

Adding this optimization uncovered a race condition in the 
TestNodeManagerResync test. The test doesn't wait for the NM to finish 
rebooting before it checks for its updated capabilities. It only worked before 
because the unnecessary wait in the NodeStatusUpdater acted as a sleep that 
masked the race condition. 
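
One common way to close this kind of race in a test is to poll for the 
expected state with a timeout instead of relying on incidental delays. The 
sketch below only illustrates that approach and is not the change made in the 
patch; WaitForSketch and its waitFor method are hypothetical names. 

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical test helper: poll until a condition holds or a timeout expires. */
class WaitForSketch {
    static void waitFor(BooleanSupplier condition, long pollMs, long timeoutMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(
                    "condition not met within " + timeoutMs + " ms");
            }
            Thread.sleep(pollMs);
        }
    }
}
```

A test would call waitFor with a condition such as "the NM has finished 
rebooting" before asserting on the updated capabilities, so the assertion no 
longer depends on how long the updater thread happens to block. 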

I'm uploading a patch that removes the unnecessary wait in the 
NodeStatusUpdater thread and also fixes the race condition in 
TestNodeManagerResync that it uncovers. 

> Unnecessary wait in Node Status Updater during reboot
> -----------------------------------------------------
>
>                 Key: YARN-4756
>                 URL: https://issues.apache.org/jira/browse/YARN-4756
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>         Attachments: YARN-4756.001.patch, YARN-4756.002.patch, 
> YARN-4756.003.patch
>
>
> The startStatusUpdater thread loops until the isStopped variable is set to 
> true, but it waits for the next heartbeat before rechecking it. During a 
> reboot, the next heartbeat will not come, so the thread waits for a timeout. 
> Instead, we should notify the thread to continue so that it can check the 
> isStopped variable and exit without having to wait for a timeout. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
