[ 
https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-6425:
--------------------------

    Attachment: HDFS-6425-Test-Case.pdf
                HDFS-6425.patch

Here is the initial patch.

1. Have HeartbeatManager periodically compute the number of stale storages.
2. Have BlockManager's ReplicationMonitor rescan postponedMisreplicatedBlocks 
only if the number of stale storages drops below the defined threshold.
3. Reset postponedMisreplicatedBlocks and postponedMisreplicatedBlocksCount 
upon failover. This fixes the SBN metrics so that the new SBN reports a 
metric value of zero.
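
The three steps above could be sketched roughly as follows. This is only an 
illustration and not the code in the attached patch: the class 
PostponedBlockTracker, its methods, and the stillPostponed callback are all 
hypothetical names, and the postponed count is derived from the set size 
rather than kept in a separate counter.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.LongPredicate;

/** Hypothetical stand-in for the postponed-block bookkeeping; not HDFS code. */
class PostponedBlockTracker {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<Long> postponedBlocks = new LinkedHashSet<>();
    private final int staleStorageThreshold;
    // Written by the heartbeat side, read by the monitor side.
    private volatile int staleStorageCount;

    PostponedBlockTracker(int staleStorageThreshold) {
        this.staleStorageThreshold = staleStorageThreshold;
    }

    /** Step 1: the heartbeat side publishes the stale storage count periodically. */
    void updateStaleStorageCount(int count) {
        staleStorageCount = count;
    }

    void postpone(long blockId) {
        lock.writeLock().lock();
        try {
            postponedBlocks.add(blockId);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /**
     * Step 2: rescan only when the stale storage count is below the threshold.
     * When the rescan is skipped, the write lock is never taken.
     * Returns true if a rescan actually ran.
     */
    boolean maybeRescan(LongPredicate stillPostponed) {
        if (staleStorageCount >= staleStorageThreshold) {
            return false;
        }
        lock.writeLock().lock();
        try {
            // Drop every block the callback no longer considers postponed.
            postponedBlocks.removeIf(id -> !stillPostponed.test(id));
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Step 3: clear the state on failover so the metric starts from zero. */
    void resetOnFailover() {
        lock.writeLock().lock();
        try {
            postponedBlocks.clear();
        } finally {
            lock.writeLock().unlock();
        }
    }

    int getPostponedCount() {
        lock.readLock().lock();
        try {
            return postponedBlocks.size();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

The point of the check in maybeRescan is that a skipped rescan never touches 
the write lock, so in this sketch a large postponed set does not compete with 
block report processing while many storages are still stale.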

> Large postponedMisreplicatedBlocks has impact on blockReport latency
> --------------------------------------------------------------------
>
>                 Key: HDFS-6425
>                 URL: https://issues.apache.org/jira/browse/HDFS-6425
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-6425-Test-Case.pdf, HDFS-6425.patch
>
>
> Sometimes we have a large number of over-replicated blocks when the NN fails 
> over. When the new active NN takes over, over-replicated blocks are put into 
> postponedMisreplicatedBlocks until none of the DNs for that block are stale 
> anymore.
> We have a case where the NNs flip-flop. Before postponedMisreplicatedBlocks 
> became empty, the NN failed over again and again, so 
> postponedMisreplicatedBlocks just kept growing until the cluster became 
> stable.
> In addition, a large postponedMisreplicatedBlocks can make 
> rescanPostponedMisreplicatedBlocks slow. rescanPostponedMisreplicatedBlocks 
> takes the write lock, so it can slow down block report processing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
