[ 
https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969820#comment-14969820
 ] 

Jason Lowe commented on YARN-4041:
----------------------------------

The problem with checking the renewer event queue directly is that the queue 
can be empty while processing is still in progress.  Threads may still be 
executing the last events, having just pulled them off the queue and left it 
empty.  Therefore the test is still racy.  A simpler approach would be to just 
keep checking whether the tokens are equal.  If they aren't, sleep briefly and 
try again, giving up once some time limit is reached.

By the way, we should not sleep an entire second between checks.  All those 
seconds of waiting add up across every test that does this, making the overall 
test run take significantly longer.  We should sleep for only 10ms or so.  
That's still plenty of time for modern processors to get work done while we 
wait, and we still won't be spinning non-stop on the CPU.
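
To make the suggestion concrete, here is a rough sketch of the kind of polling 
helper I have in mind.  The class and method names (PollUntil, waitFor) are 
made up for illustration and are not part of the patch or of any existing 
Hadoop utility; the condition passed in would be the token-equality check.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration only; not taken from the YARN-4041 patch.
final class PollUntil {
    private PollUntil() {}

    /**
     * Re-evaluates the condition until it holds or timeoutMs has elapsed,
     * sleeping intervalMs between checks so the CPU is not spun non-stop.
     * Returns true if the condition was observed to hold, false on timeout
     * or if the waiting thread was interrupted.
     */
    static boolean waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs) {
        final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: condition never became true in time
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return true;
    }
}
```

A test would then assert on something like 
PollUntil.waitFor(() -> expectedTokens.equals(currentTokens()), 10, 10000), 
where expectedTokens and currentTokens() stand in for whatever the test 
already compares.  The test passes as soon as the tokens match, and only a 
genuinely stuck renewal pays the full timeout.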


> Slow delegation token renewal can severely prolong RM recovery
> --------------------------------------------------------------
>
>                 Key: YARN-4041
>                 URL: https://issues.apache.org/jira/browse/YARN-4041
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Sunil G
>         Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, 
> 0003-YARN-4041.patch, 0004-YARN-4041.patch
>
>
> When the RM does a work-preserving restart, it synchronously tries to renew 
> delegation tokens for every active application.  If a token server happens to 
> be down or running slowly, and many of the active apps were using tokens 
> from that server, it can have a huge impact on the time it takes the RM 
> to process the restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
