[ https://issues.apache.org/jira/browse/HDFS-15634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215288#comment-17215288 ]

Stephen O'Donnell commented on HDFS-15634:
------------------------------------------

{quote}
Proposal: Invalidate these blocks once they are replicated and there are enough 
live replicas in the cluster.
{quote}

Looking at the PR, you are adding these blocks via addToInvalidates(...), 
which will actually remove the replicas from the DNs.
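
To be clear about what that means in practice, here is a minimal, 
self-contained sketch of the effect; the class and method names are 
illustrative stand-ins, not the real BlockManager API:

{code:java}
// Illustrative sketch only: the real logic lives in BlockManager and
// DatanodeAdminManager; these names are made up for clarity.
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class InvalidateAfterReplicationSketch {
    private final Map<String, Integer> liveReplicaCount = new HashMap<>();
    // Blocks queued here are later physically deleted from the DataNode,
    // which is what makes the PR's approach destructive on the old node.
    private final Queue<String> invalidateQueue = new ArrayDeque<>();

    void onReplicaAdded(String blockId) {
        liveReplicaCount.merge(blockId, 1, Integer::sum);
    }

    // Mirrors the effect of calling addToInvalidates(...) once a block on a
    // decommissioned node has enough live replicas elsewhere.
    void maybeInvalidate(String blockId, int replicationFactor) {
        if (liveReplicaCount.getOrDefault(blockId, 0) >= replicationFactor) {
            invalidateQueue.add(blockId);
        }
    }
}
{code}

Once a block enters the invalidate queue, the replica on the decommissioned 
node is gone for good, which is what the concerns below are about.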

I am not sure this is a good idea, for a few reasons:

1. Right now, a decommissioned DN is untouched by the process. If something 
goes wrong with decommission (which we have seen happen), we can just 
recommission the node again and know all the blocks are still safely present.

2. I seem to recall there are some edge cases where a decommissioned but 
still-online replica can be read.

3. On some clusters, nodes are decommissioned for maintenance such as OS 
upgrades (yes, they should use maintenance mode, but some don't) and then 
recommissioned. In these cases, when the DN rejoins, the blocks will become 
over-replicated and the NN will then remove excess replicas randomly. This is 
arguably better than adding back an empty node, which may require running the 
balancer to move data onto it. If we remove the blocks from the DN while it is 
decommissioning, then on recommission we can only ever add back an empty node. 

{quote}
A recent shutdown of decommissioned datanodes to finish the flow caused a 
Namenode latency spike, since the namenode needs to remove all of the blocks 
from its memory and this step requires holding the write lock. If we had 
gradually invalidated these blocks, the deletion would be much easier and 
faster.
{quote}

What version were you running when you saw this problem?

Approximately how many blocks were on the DNs that were stopped after 
decommission completed?

How many decommissioned hosts were stopped when this happened?

I am wondering if there would be a better way to handle this, possibly 
yielding the write lock while removing the blocks periodically, as this same 
problem would exist for a node going dead unexpectedly, not just during 
decommission.
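
To make the lock-yielding idea concrete, here is a minimal sketch of removing 
a node's blocks in fixed-size batches, releasing the write lock between 
batches so other operations can make progress; the lock and helper method are 
illustrative stand-ins, not the actual FSNamesystem code:

{code:java}
// Illustrative sketch of periodic write-lock yielding during bulk block
// removal; not the real FSNamesystem/BlockManager implementation.
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class BatchedBlockRemover {
    private static final int BATCH_SIZE = 1000;
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void removeBlocks(List<String> blockIds) {
        Iterator<String> it = blockIds.iterator();
        while (it.hasNext()) {
            lock.writeLock().lock();
            try {
                // Remove up to BATCH_SIZE blocks under one lock acquisition.
                for (int i = 0; i < BATCH_SIZE && it.hasNext(); i++) {
                    removeBlockFromMemory(it.next());
                }
            } finally {
                // Yield the write lock so other operations can interleave.
                lock.writeLock().unlock();
            }
        }
    }

    private void removeBlockFromMemory(String blockId) {
        // Placeholder for the actual blocks-map removal work.
    }
}
{code}

Since the removal path is the same for a node that dies unexpectedly, batching 
there would address both cases at once.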

> Invalidate block on decommissioning DataNode after replication
> --------------------------------------------------------------
>
>                 Key: HDFS-15634
>                 URL: https://issues.apache.org/jira/browse/HDFS-15634
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>            Reporter: Fengnan Li
>            Assignee: Fengnan Li
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Right now when a DataNode starts decommission, the Namenode will mark it as 
> decommissioning, its blocks will be replicated over to different DataNodes, 
> and then the node will be marked as decommissioned. These blocks are not 
> touched since they are not counted as live replicas.
> Proposal: Invalidate these blocks once they are replicated and there are 
> enough live replicas in the cluster.
> Reason: A recent shutdown of decommissioned datanodes to finish the flow 
> caused a Namenode latency spike, since the namenode needs to remove all of 
> the blocks from its memory and this step requires holding the write lock. If 
> we had gradually invalidated these blocks, the deletion would be much easier 
> and faster.


