[ https://issues.apache.org/jira/browse/YARN-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054197#comment-16054197 ]
Jason Lowe commented on YARN-6680:
----------------------------------

There is definitely a locking bug in ResourceUsage, both before and after this proposed change. Besides the issues Daryn pointed out earlier, there's this problem:

- Thread 1 calls getUsed on some label. Whether we lock or not, we can return the Resource object that is being used for bookkeeping. Once we return from the get, the caller has access to the bookkeeping object with no locks held.
- Thread 2 calls decUsed on the same label. It proceeds to mutate the _same Resource object_ with the write lock held. The lock doesn't help in this scenario, since Thread 1 already has the object being mutated and is not calling any ResourceUsage code at the time.
- Thread 1 can now see an inconsistent view of the Resource, where the memory field has been decremented but the vcore field has not yet been decremented. In other words, it sees a Resource usage that never actually occurred in practice.

This locking bug has been there for quite some time. Daryn is simply optimizing what the code already does today. I'm guessing the inconsistency isn't much of an issue in practice because of the granular scheduler and queue locks already held during scheduling. That leaves the UI to show occasional inconsistent values, since I believe it can grab these values without holding those same granular locks.

I'm +1 for the patch. It significantly speeds up what is a very common case for us, and I suspect running with no node labels is fairly common among other users as well.

Eventually we should try to make this lockless as much as possible, using a ConcurrentHashMap where the map stores atomic snapshot objects of state wherever we need to update many fields at once. But that's a more significant effort for another JIRA. This is a small change that offers a nice speedup for a common scenario in the interim.

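To make the snapshot idea above concrete, here is a minimal Java sketch. It is not the ResourceUsage implementation and not the attached patch: the class SnapshotUsageSketch, the Usage value type, and the use of the empty string as the no-label key are all hypothetical names chosen for illustration. It assumes only that usage consists of a memory value and a vcore count that must change together.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: per-label usage stored as immutable snapshots.
class SnapshotUsageSketch {

    // Immutable value object (illustrative stand-in for a Resource).
    // Both fields are fixed at construction, so a reader holding a
    // reference can never see memory updated without vcores.
    static final class Usage {
        final long memory;
        final int vcores;

        Usage(long memory, int vcores) {
            this.memory = memory;
            this.vcores = vcores;
        }

        Usage plus(long memoryDelta, int vcoresDelta) {
            return new Usage(memory + memoryDelta, vcores + vcoresDelta);
        }
    }

    static final Usage ZERO = new Usage(0, 0);

    private final ConcurrentHashMap<String, Usage> usedByLabel =
        new ConcurrentHashMap<>();

    // Lock-free read: returns a snapshot that no writer will ever mutate.
    Usage getUsed(String label) {
        return usedByLabel.getOrDefault(label, ZERO);
    }

    // merge() atomically replaces the old snapshot with a new one.
    void incUsed(String label, long memory, int vcores) {
        usedByLabel.merge(label, new Usage(memory, vcores),
            (current, delta) -> current.plus(delta.memory, delta.vcores));
    }

    void decUsed(String label, long memory, int vcores) {
        incUsed(label, -memory, -vcores);
    }

    public static void main(String[] args) throws InterruptedException {
        SnapshotUsageSketch usage = new SnapshotUsageSketch();
        // Writer keeps the invariant memory == 1024 * vcores.
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 100_000; i++) {
                usage.incUsed("", 1024, 1);
                usage.decUsed("", 1024, 1);
            }
        });
        writer.start();
        int torn = 0;
        while (writer.isAlive()) {
            Usage seen = usage.getUsed("");
            if (seen.memory != 1024L * seen.vcores) {
                torn++; // would indicate a half-applied update
            }
        }
        writer.join();
        System.out.println("torn reads observed: " + torn);
    }
}
```

The design choice is that writers never modify an object a reader may already hold. Each update publishes a fresh snapshot, so the three-step interleaving described above cannot produce a half-decremented value. The cost is one small allocation per update.
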
> Avoid locking overhead for NO_LABEL lookups
> -------------------------------------------
>
>                 Key: YARN-6680
>                 URL: https://issues.apache.org/jira/browse/YARN-6680
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>         Attachments: YARN-6680.patch
>
>
> Labels are managed via a hash that is protected with a read lock. The lock
> acquire and release are each just as expensive as the hash lookup itself -
> resulting in a 3X slowdown.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)