[
https://issues.apache.org/jira/browse/HADOOP-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cutting updated HADOOP-1269:
---------------------------------
Resolution: Fixed
Fix Version/s: 0.14.0
Status: Resolved (was: Patch Available)
I just committed this. Thanks, Dhruba!
> DFS Scalability: namenode throughput impacted because of global FSNamesystem
> lock
> ---------------------------------------------------------------------------------
>
> Key: HADOOP-1269
> URL: https://issues.apache.org/jira/browse/HADOOP-1269
> Project: Hadoop
> Issue Type: Bug
> Components: dfs
> Reporter: dhruba borthakur
> Assignee: dhruba borthakur
> Fix For: 0.14.0
>
> Attachments: chooseTargetLock2.patch, serverThreads1.html,
> serverThreads40.html
>
>
> I have been running a 2000 node cluster and measuring namenode performance.
> There are quite a few "Calls dropped" messages in the namenode log. The
> namenode machine has 4 CPUs and each CPU is about 30% busy. Profiling the
> namenode shows that the methods that consume the most CPU are addStoredBlock()
> and getAdditionalBlock(). The first method is invoked when a datanode
> confirms the presence of a newly created block. The second method is invoked
> when a DFSClient requests a new block for a file.
> I am attaching two files that were generated by the profiler.
> serverThreads40.html captures the scenario when the namenode had 40 server
> handler threads. serverThreads1.html is with 1 server handler thread (with a
> max_queue_size of 4000).
> With 40 handler threads, the total elapsed time taken by
> FSNamesystem.getAdditionalBlock() is 1957 seconds, whereas the method
> that it invokes (chooseTarget) takes only about 97 seconds.
> FSNamesystem.getAdditionalBlock() is blocked on the global FSNamesystem lock
> for the remaining 1860 seconds.
> My proposal is to implement a finer-grained locking model in the namenode. The
> FSNamesystem has a few important data structures, e.g. blocksMap,
> datanodeMap, leases, neededReplication, pendingCreates, heartbeats, etc. Many
> of these data structures already have their own lock. My proposal is to have
> a lock for each one of these data structures. Each individual lock will
> protect the integrity of the contents of its own data structure.
> The global FSNamesystem lock is still needed to maintain consistency across
> different data structures.
> If we implement the above proposal, both addStoredBlock() and
> getAdditionalBlock() do not need to hold the global FSNamesystem lock.
> startFile() and closeFile() still need to acquire the global FSNamesystem
> lock because they need to ensure consistency across multiple data structures.
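The locking model proposed in the description above can be illustrated with a minimal Java sketch. This is not the committed patch and not the real FSNamesystem code: the class name, the simplified map types, and the replicaCount() helper are all hypothetical, chosen only to show per-structure locks alongside a global lock that is taken solely by operations spanning several structures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of per-structure locking; not actual Hadoop code. */
final class FineGrainedNamesystemSketch {
    // Global lock: taken only by operations that must keep several
    // structures consistent with each other (startFile, closeFile).
    private final Object globalLock = new Object();

    // Each structure is guarded by its own monitor (the map object itself).
    // Simplified stand-ins for the real pendingCreates and blocksMap.
    private final Map<String, List<Long>> pendingCreates = new HashMap<>(); // file -> block ids
    private final Map<Long, List<String>> blocksMap = new HashMap<>();      // block id -> datanodes
    private long nextBlockId = 1; // guarded by the pendingCreates monitor

    /** Crosses structures in the real system, so it takes the global lock. */
    boolean startFile(String path) {
        synchronized (globalLock) {
            synchronized (pendingCreates) {
                if (pendingCreates.containsKey(path)) {
                    return false;
                }
                pendingCreates.put(path, new ArrayList<>());
                return true;
            }
        }
    }

    /** Touches only pendingCreates, so the global lock is not needed. */
    long getAdditionalBlock(String path) {
        synchronized (pendingCreates) {
            List<Long> blocks = pendingCreates.get(path);
            if (blocks == null) {
                throw new IllegalStateException("file not being created: " + path);
            }
            long id = nextBlockId++;
            blocks.add(id);
            return id;
        }
    }

    /** Touches only blocksMap, so the global lock is not needed. */
    void addStoredBlock(long blockId, String datanode) {
        synchronized (blocksMap) {
            blocksMap.computeIfAbsent(blockId, k -> new ArrayList<>()).add(datanode);
        }
    }

    /** Needs a consistent view of both structures: global lock first, then
     *  the per-structure locks in a fixed order (pendingCreates, blocksMap). */
    boolean closeFile(String path) {
        synchronized (globalLock) {
            synchronized (pendingCreates) {
                List<Long> blocks = pendingCreates.get(path);
                if (blocks == null) {
                    return false;
                }
                synchronized (blocksMap) {
                    for (long id : blocks) {
                        if (!blocksMap.containsKey(id)) {
                            return false; // a block has not been confirmed yet
                        }
                    }
                }
                pendingCreates.remove(path);
                return true;
            }
        }
    }

    /** Hypothetical helper for inspecting state. */
    int replicaCount(long blockId) {
        synchronized (blocksMap) {
            List<String> nodes = blocksMap.get(blockId);
            return nodes == null ? 0 : nodes.size();
        }
    }
}
```

In this sketch only closeFile() nests the per-structure locks, and it always takes them in the same order, so the single-lock paths in getAdditionalBlock() and addStoredBlock() cannot deadlock against it.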
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.