[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246808#comment-14246808 ]
Kihwal Lee commented on HDFS-6425:
----------------------------------

Did you have a chance to analyze the cause of the large number of over-replicated blocks? It might be due to a race between completeFile and incremental block reports. If a file is closed with just min_replicas and the replication monitor runs before the rest of the incremental block reports are received, replication will be scheduled, and this will lead to over-replication.

> Large postponedMisreplicatedBlocks has impact on blockReport latency
> --------------------------------------------------------------------
>
>                 Key: HDFS-6425
>                 URL: https://issues.apache.org/jira/browse/HDFS-6425
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-6425-2.patch, HDFS-6425-Test-Case.pdf,
> HDFS-6425.patch
>
>
> Sometimes we have a large number of over-replicated blocks when the NN fails
> over. When the new active NN takes over, over-replicated blocks are put into
> postponedMisreplicatedBlocks until none of the DNs for that block are stale.
> We have a case where the NNs flip-flop. Before postponedMisreplicatedBlocks
> becomes empty, the NN fails over again and again, so
> postponedMisreplicatedBlocks keeps growing until the cluster is stable.
> In addition, a large postponedMisreplicatedBlocks can make
> rescanPostponedMisreplicatedBlocks slow. rescanPostponedMisreplicatedBlocks
> takes the write lock, so it can slow down block report processing.
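
The race suggested in the comment above (completeFile versus incremental block reports) can be illustrated with a toy model. This is only a sketch, not HDFS code: the class, its methods, the replication factor of 3, the minimum of 1 replica, and the DataNode names are all hypothetical, and it assumes the replication monitor always picks fresh DataNodes as targets.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model only: none of these names are HDFS classes or methods.
class OverReplicationRaceSketch {
    static final int TARGET_REPLICATION = 3; // assumed replication factor
    static final int MIN_REPLICAS = 1;       // assumed minimum needed to close the file

    // Replicas the NameNode has heard about so far.
    private final Set<String> reported = new HashSet<>();
    // Replicas that actually exist; the write pipeline already wrote all three.
    private final Set<String> onDisk = new HashSet<>(Arrays.asList("dn1", "dn2", "dn3"));
    private int nextSpareNode = 4;

    // A DataNode's incremental block report reaches the NameNode.
    void receiveIncrementalReport(String dataNode) {
        reported.add(dataNode);
    }

    // The file may be closed once the minimum number of replicas is reported.
    boolean canCompleteFile() {
        return reported.size() >= MIN_REPLICAS;
    }

    // One replication monitor pass: it sees only reported replicas and
    // schedules copies to fresh nodes (an assumption of this sketch).
    void runReplicationMonitor() {
        int missing = TARGET_REPLICATION - reported.size();
        for (int i = 0; i < missing; i++) {
            String target = "dn" + nextSpareNode++;
            onDisk.add(target);
            reported.add(target);
        }
    }

    // Returns how many replicas exist after everything has settled.
    static int finalReplicaCount(boolean monitorRunsBeforeReports) {
        OverReplicationRaceSketch block = new OverReplicationRaceSketch();
        block.receiveIncrementalReport("dn1"); // first report arrives
        if (!block.canCompleteFile()) {
            throw new IllegalStateException("file cannot be closed yet");
        }
        if (monitorRunsBeforeReports) {
            block.runReplicationMonitor();     // the race: monitor sees 1 of 3
        }
        block.receiveIncrementalReport("dn2"); // remaining reports arrive
        block.receiveIncrementalReport("dn3");
        if (!monitorRunsBeforeReports) {
            block.runReplicationMonitor();     // no race: nothing is missing
        }
        return block.onDisk.size();
    }

    public static void main(String[] args) {
        System.out.println("monitor after reports:  " + finalReplicaCount(false));
        System.out.println("monitor before reports: " + finalReplicaCount(true));
    }
}
```

In this model the block ends with 3 replicas when the monitor runs after all reports, and with 5 when it runs in the window right after the file is closed. The two extra replicas are the over-replication the comment describes.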