[ http://issues.apache.org/jira/browse/HADOOP-814?page=all ]
dhruba borthakur updated HADOOP-814:
------------------------------------
Attachment: heartbeatlock2.patch
A new patch that incorporates Konstantin's review comments.
The updateStats() method is removed because we do not want to acquire the
global lock, just to compute statistics, while processing heartbeats.
The tradeoff is that the namenode now computes the global stats from the
per-node stats whenever it processes a user request to retrieve statistics. In
the current code, every heartbeat request acquires the global lock to update
the global statistics counters.
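
To illustrate the tradeoff, here is a minimal sketch of the idea, not the code in the patch. The class HeartbeatStatsSketch, the NodeStats fields, and the register() and getStats() methods are hypothetical names chosen for the example. The heartbeat path touches only the per-node record under the heartbeats lock, and the totals are summed only when someone asks for them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; names and fields are assumptions, not FSNamesystem code. */
class HeartbeatStatsSketch {
    /** Per-datanode counters, refreshed by that node's heartbeats. */
    static final class NodeStats {
        long capacity;
        long remaining;
        long lastUpdate;
    }

    // Guarded by its own monitor rather than by a namesystem-wide lock.
    private final List<NodeStats> heartbeats = new ArrayList<NodeStats>();

    /** Adds a new datanode record to the heartbeat list. */
    NodeStats register() {
        NodeStats node = new NodeStats();
        synchronized (heartbeats) {
            heartbeats.add(node);
        }
        return node;
    }

    /** Heartbeat path: updates one per-node record; no global counters are touched. */
    void gotHeartbeat(NodeStats node, long capacity, long remaining) {
        synchronized (heartbeats) {
            node.capacity = capacity;
            node.remaining = remaining;
            node.lastUpdate = System.currentTimeMillis();
        }
    }

    /** Stats path: sums per-node values on demand; returns {totalCapacity, totalRemaining}. */
    long[] getStats() {
        long totalCapacity = 0;
        long totalRemaining = 0;
        synchronized (heartbeats) {
            for (NodeStats n : heartbeats) {
                totalCapacity += n.capacity;
                totalRemaining += n.remaining;
            }
        }
        return new long[] { totalCapacity, totalRemaining };
    }
}
```

The cost moves from every heartbeat to the comparatively rare statistics request, which now takes time proportional to the number of datanodes.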
> Increase dfs scalability by optimizing locking on namenode.
> -----------------------------------------------------------
>
> Key: HADOOP-814
> URL: http://issues.apache.org/jira/browse/HADOOP-814
> Project: Hadoop
> Issue Type: Bug
> Components: dfs
> Reporter: dhruba borthakur
> Assigned To: dhruba borthakur
> Attachments: heartbeatlock2.patch
>
>
> The current dfs namenode encounters locking bottlenecks when the number of
> datanodes is large. The namenode uses a single global lock to protect access
> to its data structures. One key area is heartbeat processing: the lower the
> cost of processing a heartbeat, the more nodes HDFS can support. A simple
> change to the current locking model can increase scalability.
> Here are the details:
> Case 1: Currently we have three locks: the global lock (on FSNamesystem), the
> heartbeat lock and the datanodeMap lock. The following function is called
> when the namenode receives a heartbeat:
> public synchronized FSNamesystem.gotHeartbeat() { ........ (A)
>   synchronized (heartbeats) { ........ (B)
>     synchronized (datanodeMap) { ......... (C)
>       ...
>     }
>   }
> }
> In the above piece of code, statement (A) acquires the global FSNamesystem
> lock. This synchronization can be safely removed (along with updateStats).
> A heartbeat from a datanode can then be processed without holding the
> global FSNamesystem lock.
> Case 2: A thread called the heartbeatCheck thread periodically traverses all
> known datanodes to determine whether any of them has timed out. It is of the
> following form:
> void FSNamesystem.heartbeatCheck() {
>   synchronized (this) { ........... (D)
>     synchronized (heartbeats) { ............. (E)
>     }
>   }
> }
> This thread acquires the global FSNamesystem lock in statement (D). The
> synchronization in statement (D) can be removed. Instead, the loop can first
> check whether any nodes are dead and acquire the global FSNamesystem lock
> only if a dead node is found.
> Fixing the above two cases may allow HDFS to scale to a higher number of
> nodes.
>
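
The Case 2 proposal in the quoted description could be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: the class HeartbeatCheckSketch, the Node type, the expireIntervalMs parameter, and the liveCount() helper are all hypothetical. The scan holds only the heartbeats lock. The global lock is taken only after a dead node has been seen, and the dead check is repeated once both locks are held, since a node may have sent a heartbeat in between.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch; names are assumptions, not FSNamesystem code. */
class HeartbeatCheckSketch {
    static final class Node {
        final String name;
        long lastUpdate; // time of the most recent heartbeat, in milliseconds

        Node(String name, long lastUpdate) {
            this.name = name;
            this.lastUpdate = lastUpdate;
        }
    }

    private final List<Node> heartbeats = new ArrayList<Node>();
    private final long expireIntervalMs; // assumed timeout parameter

    HeartbeatCheckSketch(long expireIntervalMs) {
        this.expireIntervalMs = expireIntervalMs;
    }

    void add(Node n) {
        synchronized (heartbeats) {
            heartbeats.add(n);
        }
    }

    int liveCount() {
        synchronized (heartbeats) {
            return heartbeats.size();
        }
    }

    private boolean isDead(Node n, long now) {
        return now - n.lastUpdate > expireIntervalMs;
    }

    /** Removes timed-out nodes and returns how many were removed. */
    int heartbeatCheck(long now) {
        // Cheap pass: only the heartbeats lock is held while scanning.
        boolean foundDead = false;
        synchronized (heartbeats) {
            for (Node n : heartbeats) {
                if (isDead(n, now)) {
                    foundDead = true;
                    break;
                }
            }
        }
        if (!foundDead) {
            return 0; // common case: the global lock is never taken
        }
        // Rare pass: global lock first, then heartbeats, matching the
        // lock order shown in the description; re-check before removing.
        int removed = 0;
        synchronized (this) {
            synchronized (heartbeats) {
                for (Iterator<Node> it = heartbeats.iterator(); it.hasNext();) {
                    if (isDead(it.next(), now)) {
                        it.remove();
                        removed++;
                    }
                }
            }
        }
        return removed;
    }
}
```

Releasing the heartbeats lock before taking the global lock keeps the acquisition order the same as in the existing code, which avoids introducing a lock-order inversion.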