[ 
https://issues.apache.org/jira/browse/HDFS-14854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969166#comment-16969166
 ] 

Stephen O'Donnell commented on HDFS-14854:
------------------------------------------

I have addressed the earlier conflict and the error in the log message that 
Wei-Chiu raised in his last comment. To test the patch in more detail, I then 
did the following:

1) Enabled the new BackOff Monitor (see the config sketch just below).
2) Created a 7-node cluster where the datanodes have simulated storage, then 
executed the tests detailed below:

 * Decommission 2 nodes which have no overlapping blocks (a command-level 
sketch follows the test list)
    -> Confirmed both nodes made progress at roughly the same rate.
    -> Confirmed both nodes completed decommission in the logs and the web UI.
    -> Stopped the nodes and used fsck to confirm no missing blocks after 10 
minutes.
    -> Recommissioned the nodes and confirmed everything remained healthy.
    -> Confirmed the recommissioned nodes held over-replicated blocks and that 
the namenode removed them.

 * Decommission 2 nodes, then cancel decommission on one.
    -> Confirmed the cancelled node stopped decommissioning and the other node 
continued.
    -> Confirmed the remaining node completed decommission.
    -> Stopped that node and confirmed no missing blocks.
    -> Recommissioned it and confirmed the over-replicated blocks were removed.

 * Put two nodes into maintenance with the minimum maintenance replicas set to 
2, using a different expiry time for each node (a host-file sketch appears 
further below).
    -> Confirmed some blocks needed to be replicated before the nodes entered 
maintenance.
    -> Observed one node ending maintenance automatically at its expiry time.
    -> Observed the other node ending maintenance at its expiry time.
    !! -> I found a bug here: the blocks on the node were being scanned again 
as the node left maintenance, which is not necessary. The cause is that the 
call to dnAdmin.stopMaintenance adds the node to the cancelled list, but I was 
also adding it to the toRemove list.

 * Put one node into maintenance on a healthy cluster.
    -> Confirmed the node entered maintenance on the first monitor tick, as no 
blocks needed to be replicated.
    -> Stopped the node and observed no missing blocks.
    -> Started the node again and observed no over-replicated blocks.

 * Put two nodes into maintenance with no end time, then cancel maintenance on 
one.
    -> Confirmed maintenance was cancelled on that node while the other node 
remained in maintenance.
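
For anyone wanting to reproduce the decommission scenarios above, they were 
driven with the standard excludes-file mechanism. A rough sketch, assuming 
dfs.hosts.exclude already points at /etc/hadoop/conf/dfs.exclude and using 
placeholder host names:

    # start decommission on two datanodes
    echo dn1.example.com >> /etc/hadoop/conf/dfs.exclude
    echo dn2.example.com >> /etc/hadoop/conf/dfs.exclude
    hdfs dfsadmin -refreshNodes
    hdfs dfsadmin -report            # progress is also visible in the web UI

    # cancel decommission on one node: remove it from the excludes and refresh
    sed -i '/dn2.example.com/d' /etc/hadoop/conf/dfs.exclude
    hdfs dfsadmin -refreshNodes

    # after stopping a decommissioned node, check for missing blocks
    hdfs fsck / | grep -iE 'missing|corrupt'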

As part of doing this, I uncovered the one bug mentioned above, and also noted 
a few log messages that were too verbose. With that in mind I uploaded patch 
013. As this patch is so large, I will also attach a diff of the changes 
between 012 and 013 to make it easier to review.
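
For completeness, the maintenance scenarios used the combined (JSON) host file 
format. This is a hedged sketch with field names taken from the upstream 
datanode admin guide rather than from this patch; it assumes 
dfs.namenode.hosts.provider.classname is set to 
org.apache.hadoop.hdfs.server.blockmanagement.CombinedHostFileManager, 
dfs.hosts points at the JSON file, and dfs.namenode.maintenance.replication.min 
is 2 for the two-replica test. Expiry times are epoch milliseconds, and every 
host that is allowed to connect must be listed:

    [
      {"hostName": "dn1.example.com", "adminState": "IN_MAINTENANCE",
       "maintenanceExpireTimeInMS": 1573500000000},
      {"hostName": "dn2.example.com", "adminState": "IN_MAINTENANCE",
       "maintenanceExpireTimeInMS": 1573560000000},
      {"hostName": "dn3.example.com"}
    ]

After editing the file, hdfs dfsadmin -refreshNodes picks up the new admin 
states; removing the adminState and expiry fields and refreshing again cancels 
maintenance.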

This feature seems to be working on a small cluster and the code is in pretty 
good shape, so I think it is ready to commit if [~elgoiri] and [~weichiu] are 
happy with the latest revision.

> Create improved decommission monitor implementation
> ---------------------------------------------------
>
>                 Key: HDFS-14854
>                 URL: https://issues.apache.org/jira/browse/HDFS-14854
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.3.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>         Attachments: Decommission_Monitor_V2_001.pdf, HDFS-14854.001.patch, 
> HDFS-14854.002.patch, HDFS-14854.003.patch, HDFS-14854.004.patch, 
> HDFS-14854.005.patch, HDFS-14854.006.patch, HDFS-14854.007.patch, 
> HDFS-14854.008.patch, HDFS-14854.009.patch, HDFS-14854.010.patch, 
> HDFS-14854.011.patch, HDFS-14854.012.patch, HDFS-14854.013.patch
>
>
> In HDFS-13157, we discovered a series of problems with the current 
> decommission monitor implementation, such as:
>  * Blocks are replicated sequentially disk by disk and node by node, and 
> hence the load is not spread well across the cluster
>  * Adding a node for decommission can cause the namenode write lock to be 
> held for a long time.
>  * Decommissioning nodes floods the replication queue, and under-replicated 
> blocks from a future node or disk failure may wait for a long time before they 
> are replicated.
>  * Blocks pending replication are checked many times under a write lock 
> before they are sufficiently replicated, wasting resources.
> In this Jira I propose to create a new implementation of the decommission 
> monitor that resolves these issues. As it will be difficult to prove one 
> implementation is better than another, the new implementation can be enabled 
> or disabled, giving the option of using either the existing implementation or 
> the new one.
> I will attach a pdf with some more details on the design and then a version 1 
> patch shortly.


