[ https://issues.apache.org/jira/browse/HDFS-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Patrick Kling updated HDFS-1476:
--------------------------------
Attachment: HDFS-1476.patch
This patch introduces a new configuration property,
dfs.namenode.replqueue.threshold-pct, which sets the fraction of blocks for
which block reports must be received before the NameNode starts initializing
the needed replication queues. Once this threshold is reached, the queues are
initialized while the NameNode is still in safe mode. After the queues are
initialized, subsequent block reports are handled by updating the queues
incrementally.
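A minimal Java sketch of the threshold logic described above, under stated
assumptions: the class ReplQueueSketch and its methods are hypothetical and do
not correspond to actual NameNode code; a block counts as reported once at
least one of its replicas has been reported; threshold-pct is treated as a
fraction (e.g. 0.5), as the description says; and the needed replication queue
is reduced to a set of under-replicated block IDs. It shows the single full
traversal when the threshold is crossed, followed by per-block incremental
updates.

```java
import java.util.*;

/**
 * Hypothetical model of replication-queue initialization during safe mode.
 * Block IDs map to the number of replicas reported so far.
 */
final class ReplQueueSketch {
    private final Map<Long, Integer> replicas = new HashMap<>();
    private final Set<Long> underReplicated = new TreeSet<>();
    private final int targetReplication;
    private final double thresholdPct;   // assumed fraction, e.g. 0.5 for half the blocks
    private long reportedBlocks = 0;     // blocks with at least one reported replica
    private boolean queuesInitialized = false;

    ReplQueueSketch(Collection<Long> blockIds, int targetReplication, double thresholdPct) {
        for (Long id : blockIds) {
            replicas.put(id, 0);
        }
        this.targetReplication = targetReplication;
        this.thresholdPct = thresholdPct;
    }

    /** Processes one replica of blockId reported in a DataNode block report. */
    void reportReplica(long blockId) {
        Integer count = replicas.get(blockId);
        if (count == null) {
            return; // block not in the namespace; ignored in this sketch
        }
        int updated = count + 1;
        replicas.put(blockId, updated);
        if (count == 0) {
            reportedBlocks++;
        }
        if (queuesInitialized) {
            // Incremental update: only this block's queue membership can change.
            updateQueue(blockId, updated);
        } else if (reportedBlocks >= thresholdPct * replicas.size()) {
            initializeQueues();
        }
    }

    /** One full traversal of the blocks map, performed while still in safe mode. */
    private void initializeQueues() {
        for (Map.Entry<Long, Integer> e : replicas.entrySet()) {
            updateQueue(e.getKey(), e.getValue());
        }
        queuesInitialized = true;
    }

    private void updateQueue(long blockId, int count) {
        if (count < targetReplication) {
            underReplicated.add(blockId);
        } else {
            underReplicated.remove(blockId);
        }
    }

    boolean queuesInitialized() { return queuesInitialized; }

    Set<Long> underReplicated() { return Collections.unmodifiableSet(underReplicated); }
}
```

Before the threshold is crossed, reports only update counts; the expensive
traversal happens once, at the threshold, so by the time the last reports
arrive the queues are already up to date.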
The benefit of this is twofold:
- It allows us to compute the replication queues while we are waiting for the
last few block reports (when the NameNode is mostly idle). Once these block
reports have been received, we can then immediately leave safe mode without
having to wait for the computation of the needed replication queues (which
requires a full traversal of the blocks map).
- With Raid, it may not be necessary to stay in safe mode until all blocks have
been reported. With this change, we could monitor whether all of the missing
blocks can be recreated using parity information and, if so, leave safe mode
early. For this monitoring to work, we need access to the needed replication
queues while the NameNode is still in safe mode.
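The Raid check in the second point could look roughly like the following
hypothetical sketch. It assumes each Raid stripe is a list of block IDs of
which parityCount are parity blocks, and that the code is MDS (XOR with one
parity block, or Reed-Solomon), so a stripe can be rebuilt when at most
parityCount of its blocks are missing. The set of missing blocks stands in for
what the needed replication queues would supply during safe mode; Stripe and
RaidSafeModeCheck are illustrative names, not Hadoop APIs.

```java
import java.util.*;

/** Hypothetical Raid stripe: data and parity block IDs, parityCount of which are parity. */
final class Stripe {
    final List<Long> blocks;
    final int parityCount;

    Stripe(List<Long> blocks, int parityCount) {
        this.blocks = blocks;
        this.parityCount = parityCount;
    }
}

final class RaidSafeModeCheck {
    /** Returns true if every missing block can be rebuilt from parity in its stripe. */
    static boolean allMissingRecoverable(Collection<Stripe> stripes, Set<Long> missing) {
        Set<Long> covered = new HashSet<>();
        for (Stripe s : stripes) {
            int lost = 0;
            for (long id : s.blocks) {
                if (missing.contains(id)) lost++;
            }
            // An MDS code with parityCount parity blocks tolerates that many losses.
            if (lost > s.parityCount) return false;
            covered.addAll(s.blocks);
        }
        // A missing block outside every stripe has no parity to rebuild it from.
        return covered.containsAll(missing);
    }
}
```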
The review board entry for this patch can be found at
https://reviews.apache.org/r/105/ .
> listCorruptFileBlocks should be functional while the name node is still in
> safe mode
> ------------------------------------------------------------------------------------
>
> Key: HDFS-1476
> URL: https://issues.apache.org/jira/browse/HDFS-1476
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Patrick Kling
> Attachments: HDFS-1476.patch
>
>
> This would allow us to detect whether missing blocks can be fixed using Raid
> and, if so, exit safe mode earlier.
> One way to make listCorruptFileBlocks available before the name node has
> exited from safe mode would be to perform a scan of the blocks map on each
> call to listCorruptFileBlocks to determine whether there are any blocks with
> no replicas. This scan could be parallelized by dividing the space of block
> IDs into multiple intervals that can be scanned independently.
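The scan proposed in the issue description could be sketched as below. This is
a hypothetical illustration, not HDFS code: the blocks map is modeled as a
NavigableMap from block ID to live replica count, the ID range is split into
contiguous intervals scanned on a thread pool, and block IDs are assumed small
enough that max - min does not overflow. The map must not be modified while
the scan runs.

```java
import java.util.*;
import java.util.concurrent.*;

final class CorruptBlockScan {
    /** Returns IDs of blocks with zero replicas, scanning `intervals` ID slices in parallel. */
    static List<Long> blocksWithNoReplicas(NavigableMap<Long, Integer> blocksMap, int intervals) {
        if (intervals < 1) throw new IllegalArgumentException("intervals must be >= 1");
        List<Long> result = new ArrayList<>();
        if (blocksMap.isEmpty()) return result;
        long min = blocksMap.firstKey();
        long max = blocksMap.lastKey();
        // Width chosen so that `intervals` slices always cover [min, max].
        long width = (max - min) / intervals + 1;
        ExecutorService pool = Executors.newFixedThreadPool(intervals);
        try {
            List<Future<List<Long>>> futures = new ArrayList<>();
            for (int i = 0; i < intervals; i++) {
                long from = min + i * width;
                if (from > max) break;
                long to = Math.min(max, from + width - 1); // inclusive upper bound
                futures.add(pool.submit(() -> {
                    List<Long> found = new ArrayList<>();
                    // subMap is a read-only view here; slices do not overlap.
                    for (Map.Entry<Long, Integer> e : blocksMap.subMap(from, true, to, true).entrySet()) {
                        if (e.getValue() == 0) found.add(e.getKey());
                    }
                    return found;
                }));
            }
            // Collect in interval order, so the result is sorted by block ID.
            for (Future<List<Long>> f : futures) {
                try {
                    result.addAll(f.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
        return result;
    }
}
```

Because each interval is a disjoint view of a sorted map, the workers need no
coordination, and concatenating their results in interval order yields the
block IDs in ascending order.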