[ http://issues.apache.org/jira/browse/HADOOP-814?page=all ]
Doug Cutting updated HADOOP-814:
--------------------------------

           Status: Resolved  (was: Patch Available)
    Fix Version/s: 0.10.0
       Resolution: Fixed

I just committed this. Thanks, Dhruba!

> Increase dfs scalability by optimizing locking on namenode.
> -----------------------------------------------------------
>
>                 Key: HADOOP-814
>                 URL: http://issues.apache.org/jira/browse/HADOOP-814
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>            Reporter: dhruba borthakur
>         Assigned To: dhruba borthakur
>             Fix For: 0.10.0
>
>         Attachments: heartbeatlock3.patch
>
>
> The current dfs namenode encounters locking bottlenecks when the number of
> datanodes is large. The namenode uses a single global lock to protect access
> to data structures. One key area is heartbeat processing. The lower the cost
> of processing a heartbeat, the more nodes HDFS can support. A simple change
> to the current locking model can increase the scalability.
> Here are the details:
>
> Case 1: Currently we have three locks: the global lock (on FSNamesystem), the
> heartbeat lock and the datanodeMap lock. The following function is called
> when a heartbeat is received by the Namenode:
>
> public synchronized FSNamesystem.gotHeartbeat() {   ........ (A)
>   synchronized (heartbeat) {                        ........ (B)
>     synchronized (datanodeMap) {                    ........ (C)
>       ...
>     }
>   }
> }
>
> In the above piece of code, statement (A) acquires the
> global-FSNamesystem-lock. This synchronization can be safely removed (remove
> updateStats too). This means that a heartbeat from the datanode can be
> processed without holding the FSNamesystem-global-lock.
>
> Case 2: A thread called the heartbeatCheck thread periodically traverses all
> known Datanodes to determine if any of them has timed out. It is of the
> following form:
>
> void FSNamesystem.heartbeatCheck() {
>   synchronized (this) {                             ........ (D)
>     synchronized (heartbeats) {                     ........ (E)
>     }
>   }
> }
>
> This thread acquires the global-FSNamesystem lock in statement (D). The
> synchronization in statement (D) can be removed. Instead, the loop can check
> to see if any nodes are dead. Only if a dead node is found does it acquire
> the FSNamesystem-global-lock.
>
> It is possible that fixing the above two cases will cause HDFS to scale to
> a higher number of nodes.
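
A minimal Java sketch of the locking change proposed in the issue description, for illustration only. It is not the code in heartbeatlock3.patch and not the real FSNamesystem. The class name HeartbeatSketch, the EXPIRE_MS timeout, the String node names and the globalLockAcquisitions counter are all assumptions made up for this example. It shows a heartbeat being recorded without the global lock (Case 1), and a check loop that scans under the heartbeats lock alone and takes the global lock only once a dead node has been found (Case 2).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for FSNamesystem; names and fields are illustrative only. */
class HeartbeatSketch {
    static final long EXPIRE_MS = 10_000; // assumed timeout, not taken from Hadoop

    // Node name -> time of last heartbeat; the map also serves as the "heartbeats" lock.
    private final Map<String, Long> heartbeats = new HashMap<>();
    // Counts global-lock acquisitions so the effect of the change is observable.
    int globalLockAcquisitions = 0;
    final List<String> removed = new ArrayList<>();

    /** Case 1: the method is no longer synchronized on the namesystem itself. */
    void gotHeartbeat(String node, long now) {
        synchronized (heartbeats) {
            heartbeats.put(node, now);
        }
    }

    /** Case 2: scan under the heartbeats lock; take the global lock only for a dead node. */
    void heartbeatCheck(long now) {
        boolean allAlive = false;
        while (!allAlive) {
            String dead = null;
            synchronized (heartbeats) {
                for (Map.Entry<String, Long> e : heartbeats.entrySet()) {
                    if (now - e.getValue() > EXPIRE_MS) {
                        dead = e.getKey();
                        break;
                    }
                }
            }
            if (dead == null) {
                allAlive = true; // nothing expired; the global lock was never touched
                continue;
            }
            // Same lock order as the original code: global lock first, then heartbeats.
            synchronized (this) {
                globalLockAcquisitions++;
                synchronized (heartbeats) {
                    // Re-check: a heartbeat may have arrived after the scan released its lock.
                    Long last = heartbeats.get(dead);
                    if (last != null && now - last > EXPIRE_MS) {
                        heartbeats.remove(dead);
                        removed.add(dead);
                    }
                }
            }
        }
    }
}
```

In this sketch a check that finds every node alive costs one pass under the heartbeats lock and never contends with other users of the global lock. The re-check inside the global lock is needed because the node's state can change between the scan and the removal.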