[ https://issues.apache.org/jira/browse/HDFS-15634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215676#comment-17215676 ]

Stephen O'Donnell commented on HDFS-15634:
------------------------------------------

The issue you have seen is a fairly extreme example - I suspect not many 
clusters will have 200 nodes decommissioned at the same time. The empty-node 
problem is a valid concern on smaller clusters, and in the 
decommission-recommission case the empty node may never catch up to the other 
nodes, as the default block placement policy picks nodes randomly.

The long lock hold is a problem even when a node (or perhaps a whole rack) 
goes dead unexpectedly. I think it would be better to try to fix that 
generally, rather than doing something special for decommission. I am also not 
really comfortable changing the current "non-destructive" decommission flow to 
something that removes the blocks from the DN.

If you look at HeartBeatManager.heartbeatCheck(...), it seems to handle only 
one DN as dead on each check interval. The check interval is either 5 minutes 
by default, or 30 seconds if "dfs.namenode.avoid.write.stale.datanode" is true.
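
For reference, this is roughly how the effective check interval is derived - a 
simplified sketch only, with the key names taken from hdfs-default.xml and the 
logic approximated from the HeartbeatManager constructor:

{code}
    // Simplified sketch of how HeartbeatManager picks the dead-node check
    // interval (the fallback values here are the usual defaults).
    long recheckIntervalMs = conf.getLong(
        "dfs.namenode.heartbeat.recheck-interval", 5 * 60 * 1000); // 5 minutes
    boolean avoidStaleForWrite = conf.getBoolean(
        "dfs.namenode.avoid.write.stale.datanode", false);
    long staleIntervalMs = conf.getLong(
        "dfs.namenode.stale.datanode.interval", 30 * 1000);        // 30 seconds

    long effectiveCheckIntervalMs = avoidStaleForWrite
        ? Math.min(recheckIntervalMs, staleIntervalMs)  // typically ~30 seconds
        : recheckIntervalMs;                            // typically ~5 minutes
{code}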

Then it ultimately calls BlockManager.removeBlocksAssociatedTo(...) which does 
the expensive work of removing the blocks. In that method, I wonder if we could 
drop and re-take the write lock periodically so this does not hold the lock for 
too long?


{code}
  /** Remove the blocks associated to the given datanode. */
  void removeBlocksAssociatedTo(final DatanodeDescriptor node) {
    providedStorageMap.removeDatanode(node);
    for (DatanodeStorageInfo storage : node.getStorageInfos()) {
      final Iterator<BlockInfo> it = storage.getBlockIterator();
      //add the BlockInfos to a new collection as the
      //returned iterator is not modifiable.
      Collection<BlockInfo> toRemove = new ArrayList<>();
      while (it.hasNext()) {
        toRemove.add(it.next());
      }

      // Could we drop and re-take the write lock in this loop every 1000 blocks?
      for (BlockInfo b : toRemove) {
        removeStoredBlock(b, node);
      }
    }
    // Remove all pending DN messages referencing this DN.
    pendingDNMessages.removeAllMessagesForDatanode(node);

    node.resetBlocks();
    invalidateBlocks.remove(node);
  }
{code}
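
Something like the rough sketch below - entirely untested, and it assumes the 
BlockManager's namesystem reference can safely drop and re-take the write lock 
at this point, with 1000 as a purely illustrative batch size. Since toRemove 
is built up front, the iteration itself does not depend on the storage's block 
iterator across the unlock, but other invariants would need checking:

{code}
  /** Sketch only: remove the blocks associated to the given datanode,
      dropping and re-taking the write lock between batches so other
      waiting threads can make progress. */
  void removeBlocksAssociatedTo(final DatanodeDescriptor node) {
    providedStorageMap.removeDatanode(node);
    final int batchSize = 1000; // illustrative value only
    for (DatanodeStorageInfo storage : node.getStorageInfos()) {
      final Iterator<BlockInfo> it = storage.getBlockIterator();
      // add the BlockInfos to a new collection as the
      // returned iterator is not modifiable.
      Collection<BlockInfo> toRemove = new ArrayList<>();
      while (it.hasNext()) {
        toRemove.add(it.next());
      }

      int removed = 0;
      for (BlockInfo b : toRemove) {
        removeStoredBlock(b, node);
        if (++removed % batchSize == 0) {
          // Briefly release the namesystem write lock so other
          // operations get a chance to run, then re-acquire it.
          namesystem.writeUnlock();
          namesystem.writeLock();
        }
      }
    }
    // Remove all pending DN messages referencing this DN.
    pendingDNMessages.removeAllMessagesForDatanode(node);

    node.resetBlocks();
    invalidateBlocks.remove(node);
  }
{code}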

We sometimes see nodes with 5M, 10M or even more blocks, so this change would 
help in the general case too.

I am not sure whether there would be any negative consequences to dropping and 
re-taking the write lock in this scenario, though.

> Invalidate block on decommissioning DataNode after replication
> --------------------------------------------------------------
>
>                 Key: HDFS-15634
>                 URL: https://issues.apache.org/jira/browse/HDFS-15634
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>            Reporter: Fengnan Li
>            Assignee: Fengnan Li
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: write lock.png
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Right now when a DataNode starts decommissioning, the NameNode will mark it 
> as decommissioning and its blocks will be replicated over to different 
> DataNodes; it is then marked as decommissioned. These blocks are not touched 
> since they are not counted as live replicas.
> Proposal: Invalidate these blocks once they are replicated and there are 
> enough live replicas in the cluster.
> Reason: A recent shutdown of decommissioned DataNodes to finish the flow 
> caused a NameNode latency spike, since the NameNode needs to remove all of 
> the blocks from its memory and this step requires holding the write lock. If 
> we had gradually invalidated these blocks, the deletion would be much easier 
> and faster.


