[ https://issues.apache.org/jira/browse/HDFS-5651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13853758#comment-13853758 ]
Colin Patrick McCabe commented on HDFS-5651:
--------------------------------------------

So, there were some synchronization issues here that needed to be cleaned up. The biggest one was stopping and starting the CRM thread. Previously, this was prone to deadlock: if some other thread (or the thread stopping the CRM) was holding the FSN write lock, and the CRM thread itself needed to get that lock, we'd block forever.

I tried to get around that by interrupting the CRM thread (so that {{Condition#await}} would throw {{InterruptedException}}), but it turns out that {{Condition#await}} does not actually "have" to wake up in response to an interrupt (although it "may"). The JDK 6 javadoc documents this explicitly, and it seems that the Linux HotSpot implementation may be one of those implementations where condition variables cannot be interrupted.

The solution here is to *not* join the CRM thread when transitioning to the standby state, but simply to set {{shutdown = true}} in the CRM thread, and have the CRM thread check that variable after grabbing the {{FSNamesystem}} lock. So we may have an old CRM thread hanging around for a while, but it will never mutate {{CacheManager}} state, since {{CRM#shutdown = true}}.

Along the way, I discovered that our strategy of doing {{writeUnlock}} in some places in the CRM was not working very well. The problem is that the FSN write lock is a reentrant lock, so a thread that calls {{ReentrantLock#unlock}} may still hold the lock: you may need to unlock multiple times to really release it. In general, having random "unlock some of the caller's locks" sections sprinkled throughout the code seems like a recipe for problems, since the caller may not be expecting it. I think it's better to ask the top-level caller in {{FSNamesystem}} to handle these locks. So I moved the {{waitForRescanIfNeeded}} calls in {{FSNamesystem}} to a point before the FSN lock is even taken in those functions.
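The shutdown-flag pattern described above can be sketched in plain Java. This is an illustrative stand-in, not the actual {{CacheReplicationMonitor}} code: the class and field names are invented, and a bare {{ReentrantLock}} stands in for the FSN write lock. The key point is that the monitor thread checks the flag only after taking the lock, so once {{shutdown}} is observed it can never mutate shared state again.

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative stand-in for the CRM rescan thread; names are hypothetical.
class RescanThread extends Thread {
    private final ReentrantLock fsnLock; // stands in for the FSN write lock
    volatile boolean shutdown = false;   // set by whoever stops the monitor
    volatile long rescans = 0;

    RescanThread(ReentrantLock fsnLock) {
        this.fsnLock = fsnLock;
    }

    @Override
    public void run() {
        while (true) {
            fsnLock.lock();
            try {
                // Check the flag only AFTER grabbing the lock: once shutdown
                // is seen, we return without touching shared state.
                if (shutdown) {
                    return;
                }
                rescans++; // the real thread would rescan CacheManager state here
            } finally {
                fsnLock.unlock();
            }
        }
    }
}
```

Stopping the monitor is then just {{shutdown = true}} with no join, so the stopping thread never blocks waiting on a thread that is itself waiting for the lock.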
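The reentrancy gotcha is easy to demonstrate with {{java.util.concurrent.locks.ReentrantLock}} directly: one {{unlock}} only decrements the hold count, and the lock is not actually released until the count reaches zero.

```java
import java.util.concurrent.locks.ReentrantLock;

public class HoldCountDemo {
    public static void main(String[] args) {
        ReentrantLock lock = new ReentrantLock();
        lock.lock();   // outer caller takes the lock
        lock.lock();   // reentrant acquisition: hold count is now 2
        lock.unlock(); // releases ONE hold; the lock is still held!
        System.out.println(lock.isHeldByCurrentThread()); // true
        System.out.println(lock.getHoldCount());          // 1
        lock.unlock(); // only now is the lock actually released
        System.out.println(lock.isHeldByCurrentThread()); // false
    }
}
```

This is why a helper that calls {{unlock}} on the caller's behalf can silently leave the lock held, and why it's safer to keep lock/unlock pairing in the top-level caller.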
Minor: we don't need to do anything with {{CacheManager}} in {{FSNamesystem#stopCommonServices}}, since we do it in {{FSNamesystem#stopActiveServices}}. I also fixed a few cases where we had more lock blocks than needed.

> remove dfs.namenode.caching.enabled
> -----------------------------------
>
>                 Key: HDFS-5651
>                 URL: https://issues.apache.org/jira/browse/HDFS-5651
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: namenode
>    Affects Versions: 3.0.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-5651.001.patch, HDFS-5651.002.patch, HDFS-5651.003.patch, HDFS-5651.004.patch, HDFS-5651.006.patch
>
>
> We can remove dfs.namenode.caching.enabled and simply always enable caching, similar to how we do with snapshots and other features. The main overhead is the size of the cachedBlocks GSet. However, we can simply make the size of this GSet configurable, and people who don't want caching can set it to a very small value.

--
This message was sent by Atlassian JIRA
(v6.1.4#6159)