[ https://issues.apache.org/jira/browse/HADOOP-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Haryadi Gunawi updated HADOOP-923:
----------------------------------

    Description:

The datanode sends a heartbeat to the namenode every 3 seconds. The namenode processes the heartbeat and sends a list of blocks-to-be-replicated and blocks-to-be-deleted as part of the heartbeat response.

When a couple of datanodes fail, heartbeat processing on the namenode becomes heavyweight. It acquires the global FSNamesystem lock, traverses the neededReplication structure, generates a list of blocks to be replicated, and responds to the heartbeat message. Determining the list of blocks-to-be-replicated takes plenty of CPU and, because it holds the global FSNamesystem lock, blocks the processing of other heartbeats.

Scalability would improve a lot if heartbeat processing did not require the FSNamesystem lock. In fact, a "heartbeat" lock already exists for this purpose.

I propose that the Heartbeat message be separate from the "retrieve blocks-to-replicate and blocks-to-delete" messages. The datanode can continue to heartbeat once every 3 seconds, while it can afford to "retrieve blocks-to-replicate" at a much coarser interval. Heartbeat processing on the namenode will be fast because it does not require the global FSNamesystem lock. Moreover, a datanode failure will not aggravate the heartbeat processing time on the namenode.

> DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-923
>                 URL: https://issues.apache.org/jira/browse/HADOOP-923
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.10.1
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>             Fix For: 0.12.0
>
>         Attachments: pendingTransferThread2.patch
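
The proposed split can be illustrated with a minimal sketch. It is not the attached patch and does not use the real Hadoop classes: the class name, method names, and data structures below are hypothetical, and the "namesystem" lock is only a stand-in for the global FSNamesystem lock. It shows a cheap heartbeat path guarded by its own narrow lock, and a separate, more expensive "retrieve blocks-to-replicate" path that takes the global lock and can be called at a coarser interval.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; all names here are hypothetical, not Hadoop's API. */
class HeartbeatSketch {
    // Narrow lock that guards only datanode liveness state.
    private final Object heartbeatLock = new Object();
    // Stand-in for the global FSNamesystem lock.
    private final Object namesystemLock = new Object();

    private final Map<String, Long> lastHeartbeatMillis = new HashMap<>();
    private final List<String> neededReplication = new ArrayList<>();

    /** Cheap path, sent every 3 seconds; never takes the global lock. */
    public void heartbeat(String datanodeId, long nowMillis) {
        synchronized (heartbeatLock) {
            lastHeartbeatMillis.put(datanodeId, nowMillis);
        }
    }

    /** Liveness check; also needs only the heartbeat lock. */
    public boolean isAlive(String datanodeId, long nowMillis, long expiryMillis) {
        synchronized (heartbeatLock) {
            Long last = lastHeartbeatMillis.get(datanodeId);
            return last != null && nowMillis - last <= expiryMillis;
        }
    }

    /** Queues a block for replication under the global lock. */
    public void addNeededReplication(String blockId) {
        synchronized (namesystemLock) {
            neededReplication.add(blockId);
        }
    }

    /** Expensive path, requested at a coarser interval; takes the global lock. */
    public List<String> retrieveBlocksToReplicate(int maxBlocks) {
        synchronized (namesystemLock) {
            int n = Math.min(maxBlocks, neededReplication.size());
            List<String> work = new ArrayList<>(neededReplication.subList(0, n));
            neededReplication.subList(0, n).clear(); // hand each block out once
            return work;
        }
    }

    public static void main(String[] args) {
        HeartbeatSketch namenode = new HeartbeatSketch();
        namenode.addNeededReplication("blk_1");
        namenode.heartbeat("dn1", System.currentTimeMillis());
        System.out.println(namenode.retrieveBlocksToReplicate(10));
    }
}
```

Because the two paths share no lock in this sketch, a slow replication scan cannot delay heartbeat bookkeeping, which is the effect the proposal is after.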