[ 
https://issues.apache.org/jira/browse/YARN-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054197#comment-16054197
 ] 

Jason Lowe commented on YARN-6680:
----------------------------------

There definitely is a bug in the code with respect to locking in ResourceUsage, 
both before and after this proposed change.  Besides the issues Daryn pointed 
out earlier, there's this problem:

- Thread 1 calls getUsed on some label.  Whether we lock or not, we can return 
the Resource object that is being used for bookkeeping.  Once we return from 
the get, the caller has access to the bookkeeping object with no locks held.
- Thread 2 calls decUsed on the same label.  It proceeds to mutate the _same 
Resource object_ with the write lock held.  The lock doesn't help for this 
scenario, since Thread 1 already has the object being mutated and is not 
calling any ResourceUsage code at the time.
- Thread 1 can now see an inconsistent view of the Resource, where the memory 
field has been decremented but the vcore field has yet to be decremented.  In 
other words, a Resource usage that never actually occurred in practice.
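
To make the sequence above concrete, here is a minimal sketch.  The class and 
method names are hypothetical stand-ins, not the real YARN Resource or 
ResourceUsage code; the only assumption is that the getter hands back the same 
mutable object the map stores.  getUsedCopy shows one possible way to avoid 
the leak by copying both fields while the read lock is still held.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for illustration; NOT the real YARN ResourceUsage.
final class LeakyUsageSketch {

  /** Mutable two-field resource, similar in spirit to a YARN Resource. */
  static final class MutableResource {
    long memory;
    int vcores;

    MutableResource(long memory, int vcores) {
      this.memory = memory;
      this.vcores = vcores;
    }
  }

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, MutableResource> usedByLabel = new HashMap<>();

  void setUsed(String label, long memory, int vcores) {
    lock.writeLock().lock();
    try {
      usedByLabel.put(label, new MutableResource(memory, vcores));
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Leaky getter: the bookkeeping object itself escapes the lock. */
  MutableResource getUsed(String label) {
    lock.readLock().lock();
    try {
      return usedByLabel.get(label);
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Safer getter: both fields are copied while the read lock is held. */
  MutableResource getUsedCopy(String label) {
    lock.readLock().lock();
    try {
      MutableResource r = usedByLabel.get(label);
      return r == null ? null : new MutableResource(r.memory, r.vcores);
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Mutates the stored object in place, one field at a time. */
  void decUsed(String label, long memory, int vcores) {
    lock.writeLock().lock();
    try {
      MutableResource r = usedByLabel.get(label);
      if (r != null) {
        r.memory -= memory; // a thread holding r can read the object here...
        r.vcores -= vcores; // ...before the second field has been updated
      }
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

A caller that kept the result of getUsed sees every later decUsed, possibly 
halfway through, while a caller that used getUsedCopy keeps a consistent 
pair of values at the cost of an allocation per read.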

This locking bug has been there for quite some time.  Daryn is simply 
optimizing what the code already does today.  I'm guessing the inconsistency 
isn't much of an issue in practice because the granular scheduler and queue 
locks are already held during scheduling.  That leaves the UI, which I believe 
can read these values without holding those same locks, to occasionally show 
inconsistent values.

I'm +1 for the patch.  It significantly speeds up what is a very common case 
for us, and I suspect running without node labels is fairly common among other 
users as well.  Eventually we should try to make this as lockless as possible, 
using a ConcurrentHashMap whose values are atomic snapshot objects wherever 
several fields need to be updated at once.  But that's a more significant 
effort for another JIRA.  This is a small change that offers a nice speedup 
for a common scenario in the interim.
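
As a rough illustration of the snapshot idea, and not a proposal for the 
actual patch, the sketch below keeps an immutable memory/vcores pair per label 
in a ConcurrentHashMap.  All names here are hypothetical.  Reads take no lock 
and always see a pair that was published as a whole; updates go through 
ConcurrentHashMap.merge, which is atomic per key (internally it may still 
briefly lock a bin, so writers are not strictly lock-free).

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; names do not correspond to existing YARN classes.
final class SnapshotUsageSketch {

  /** Immutable snapshot: both fields are replaced together, never edited. */
  static final class Usage {
    final long memory;
    final int vcores;

    Usage(long memory, int vcores) {
      this.memory = memory;
      this.vcores = vcores;
    }
  }

  private static final Usage ZERO = new Usage(0L, 0);

  private final ConcurrentHashMap<String, Usage> usedByLabel =
      new ConcurrentHashMap<>();

  /** Lock-free read; the returned snapshot can never change afterwards. */
  Usage getUsed(String label) {
    return usedByLabel.getOrDefault(label, ZERO);
  }

  /** Atomically swaps in a new snapshot with both fields adjusted. */
  void incUsed(String label, long memory, int vcores) {
    usedByLabel.merge(label, new Usage(memory, vcores),
        (old, delta) -> new Usage(old.memory + delta.memory,
                                  old.vcores + delta.vcores));
  }

  void decUsed(String label, long memory, int vcores) {
    incUsed(label, -memory, -vcores);
  }
}
```

The trade-off is one small allocation per update in exchange for readers that 
never block and never see memory and vcores from different points in time.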


> Avoid locking overhead for NO_LABEL lookups
> -------------------------------------------
>
>                 Key: YARN-6680
>                 URL: https://issues.apache.org/jira/browse/YARN-6680
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>         Attachments: YARN-6680.patch
>
>
> Labels are managed via a hash that is protected with a read lock.  The lock 
> acquire and release are each just as expensive as the hash lookup itself - 
> resulting in a 3X slowdown.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

