[ https://issues.apache.org/jira/browse/HADOOP-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758034#comment-13758034 ]
Jing Zhao commented on HADOOP-9932:
-----------------------------------

Looks like neither LightWeightCache nor LightWeightGSet is thread-safe (according to their javadoc). So maybe we'd better do the synchronization in RetryCache.java? Specifically, I think we need to add synchronized to RetryCache#addCacheEntry and RetryCache#addCacheEntryWithPayload.

> Name node crashes due to improper synchronization in RetryCache
> ---------------------------------------------------------------
>
>                 Key: HADOOP-9932
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9932
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Priority: Blocker
>         Attachments: HADOOP-9932.patch
>
>
> In LightWeightCache#evictExpiredEntries(), the precondition check can fail.
> [~patwhitey2007] ran a HA failover test and it occurred while the SBN was
> catching up with edits during a transition to active. This caused NN to
> terminate.
> Here is my theory: If an RPC handler calls waitForCompletion() and it happens
> to remove the head of the queue in get(), it will race with
> evictExpiredEntries() from put().
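
The suggestion in the comment (serialize every access to the non-thread-safe structures inside the class that owns them) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the HADOOP-9932 patch and not Hadoop's RetryCache: SketchRetryCache, its fields, its signatures and its method bodies are hypothetical stand-ins. Only the method names addCacheEntry and evictExpiredEntries are borrowed from the discussion. The explicit state check plays the role of the precondition that can fail when a removal races with an eviction.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not Hadoop's RetryCache. */
class SketchRetryCache {
    // Neither structure is thread-safe on its own, and the two must stay in sync.
    private final Map<Long, Long> expiryByCallId = new HashMap<>();
    private final ArrayDeque<Long> expiryQueue = new ArrayDeque<>();
    private final long ttl;

    SketchRetryCache(long ttl) {
        this.ttl = ttl;
    }

    /** Evicts expired entries from the head of the queue. Caller must hold the lock. */
    private void evictExpiredEntries(long now) {
        while (!expiryQueue.isEmpty()) {
            Long head = expiryQueue.peekFirst();
            Long expiry = expiryByCallId.get(head);
            // Stand-in for the precondition: every queued id must still be in the map.
            if (expiry == null) {
                throw new IllegalStateException("queue head missing from map");
            }
            if (expiry > now) {
                break; // head has not expired, so nothing behind it has either
            }
            expiryQueue.pollFirst();
            expiryByCallId.remove(head);
        }
    }

    /** Adds or refreshes an entry; synchronized so eviction cannot race with removal. */
    public synchronized void addCacheEntry(long callId, long now) {
        evictExpiredEntries(now);
        Long id = Long.valueOf(callId);
        if (expiryByCallId.put(id, now + ttl) != null) {
            expiryQueue.remove(id); // drop the stale queue position before re-queuing
        }
        expiryQueue.addLast(id);
    }

    /** Removes an entry (possibly the queue head) under the same lock as eviction. */
    public synchronized boolean remove(long callId) {
        Long id = Long.valueOf(callId);
        if (expiryByCallId.remove(id) == null) {
            return false;
        }
        expiryQueue.remove(id);
        return true;
    }

    public synchronized boolean contains(long callId) {
        return expiryByCallId.containsKey(Long.valueOf(callId));
    }

    public synchronized int size() {
        return expiryByCallId.size();
    }
}
```

Because every public method takes the same monitor, a removal of the queue head can no longer interleave with an eviction pass in this sketch. Whether method-level synchronized is the right granularity for the real RetryCache is a separate question for the patch review.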