[ https://issues.apache.org/jira/browse/HDFS-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932632#action_12932632 ]
dhruba borthakur commented on HDFS-1476:
----------------------------------------
Thinking more about this one, we can exit safemode faster if we can compute
misReplicatedBlocks even before we have one replica of all blocks.
Step 1: the namenode waits to ensure that there is at least one replica of all
known blocks.
Step 2: Then it invokes processMisReplicatedBlocks to update neededReplication.
When the cluster restarts, the namenode starts in Step 1 and starts processing
a storm of block reports from all datanodes. But a few datanodes are somewhat
slow and the block report from the straggler datanodes delays the transition
from Step 1 to Step 2. The CPU usage on the NN decreases exponentially as Step
1 progresses and becomes almost negligible when Step 1 is about to end.
This jira could change the code so that processMisReplicatedBlocks is invoked
before Step 1 finishes completely. This would let the NN exit safemode earlier.
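
A rough sketch of the idea, in Java: start the mis-replication scan once an
assumed fraction of the known blocks has at least one reported replica, instead
of waiting for every block. The class name, the earlyFraction parameter and the
Runnable callback are hypothetical; the callback only stands in for
processMisReplicatedBlocks, and none of this is the actual namenode code.

```java
/** Hypothetical sketch; not the real FSNamesystem safe mode logic. */
class EarlyMisReplicationTrigger {
    private final long totalBlocks;
    private final long earlyTarget;            // blocks needed before the scan may start
    private final Runnable misReplicationScan; // stand-in for processMisReplicatedBlocks
    private long safeBlocks = 0;               // blocks with at least one reported replica
    private boolean scanStarted = false;

    EarlyMisReplicationTrigger(long totalBlocks, double earlyFraction,
                               Runnable misReplicationScan) {
        if (earlyFraction <= 0.0 || earlyFraction > 1.0) {
            throw new IllegalArgumentException("earlyFraction must be in (0, 1]");
        }
        this.totalBlocks = totalBlocks;
        this.earlyTarget = (long) Math.ceil(earlyFraction * totalBlocks);
        this.misReplicationScan = misReplicationScan;
    }

    /** Call when a block report supplies the first replica of a block. */
    synchronized void blockBecameSafe() {
        safeBlocks++;
        if (!scanStarted && safeBlocks >= earlyTarget) {
            scanStarted = true;
            // A real namenode would hand this to a background thread
            // rather than run it inside the block report path.
            misReplicationScan.run();
        }
    }

    /** Step 1 is complete once every known block has at least one replica. */
    synchronized boolean stepOneComplete() {
        return safeBlocks >= totalBlocks;
    }

    synchronized boolean scanStarted() {
        return scanStarted;
    }
}
```

With this shape the scan runs at most once, while the straggler datanodes are
still reporting. Blocks whose replica counts change after the early scan would
still need to be re-checked, which the sketch does not attempt.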
> listCorruptFileBlocks should be functional while the name node is still in
> safe mode
> ------------------------------------------------------------------------------------
>
> Key: HDFS-1476
> URL: https://issues.apache.org/jira/browse/HDFS-1476
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Patrick Kling
>
> This would allow us to detect whether missing blocks can be fixed using Raid
> and if that is the case exit safe mode earlier.
> One way to make listCorruptFileBlocks available before the name node has
> exited from safe mode would be to perform a scan of the blocks map on each
> call to listCorruptFileBlocks to determine if there are any blocks with no
> replicas. This scan could be parallelized by dividing the space of block IDs
> into multiple intervals that can be scanned independently.
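
The interval scan proposed in the description could look roughly like the Java
sketch below. It is only an illustration under stated assumptions: the blocks
map is modelled as a NavigableMap from block ID to replica count, it is not
modified while the scan runs, and the class and method names are hypothetical.
The real blocks map and its locking are different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of a parallel scan for blocks with no replicas. */
class ZeroReplicaScan {

    /** Returns the IDs of blocks with zero replicas, in ascending order. */
    static List<Long> findBlocksWithNoReplicas(
            NavigableMap<Long, Integer> replicaCounts, int intervals) {
        if (intervals < 1) {
            throw new IllegalArgumentException("intervals must be >= 1");
        }
        List<Long> missing = new ArrayList<>();
        if (replicaCounts.isEmpty()) {
            return missing;
        }
        long lo = replicaCounts.firstKey();
        long hi = replicaCounts.lastKey();
        // Simplification: fail loudly if the ID range does not fit in a long.
        long span = Math.subtractExact(hi, lo);
        long step = Math.max(1L, span / intervals);

        ExecutorService pool = Executors.newFixedThreadPool(intervals);
        try {
            List<Future<List<Long>>> parts = new ArrayList<>();
            long from = lo;
            for (int i = 0; i < intervals; i++) {
                // The last interval always extends to the highest block ID.
                long to = (i == intervals - 1) ? hi : Math.min(hi, from + step - 1);
                final long start = from;
                final long end = to;
                Callable<List<Long>> task = () -> scanInterval(replicaCounts, start, end);
                parts.add(pool.submit(task));
                if (to == hi) {
                    break;
                }
                from = to + 1;
            }
            // Intervals were submitted in ascending order, so the result is sorted.
            for (Future<List<Long>> part : parts) {
                missing.addAll(part.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("scan interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("scan failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return missing;
    }

    /** Scans one closed interval [from, to] of block IDs. */
    private static List<Long> scanInterval(
            NavigableMap<Long, Integer> counts, long from, long to) {
        List<Long> found = new ArrayList<>();
        for (Map.Entry<Long, Integer> e : counts.subMap(from, true, to, true).entrySet()) {
            if (e.getValue() == 0) {
                found.add(e.getKey());
            }
        }
        return found;
    }
}
```

Each interval is read independently, so the tasks share no mutable state. Equal
width intervals can be unevenly loaded if block IDs cluster, so a real
implementation might prefer to split by entry count.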