[ https://issues.apache.org/jira/browse/HDFS-14854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952156#comment-16952156 ]

Stephen O'Donnell commented on HDFS-14854:
------------------------------------------

[~belugabehr] Thanks for the comments. I have addressed some of them in the 009 
patch, and a few I have not addressed. A summary is below:

{quote}
Best practice is to grab the lock outside of the try statement.
{quote}

Maybe there is something I don't understand, and I see the link does state the 
best practice you mentioned, but I don't see how:
{code:java}
lock();
try {
  ...
} finally {
  unlock();
}{code}
is better than:
{code:java}
try {
  lock();
  ...
} finally {
  unlock();
}{code}
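
For concreteness, here are both idioms as runnable code using 
java.util.concurrent.locks.ReentrantLock (illustrative only, not patch code). 
As far as I can see, the only difference would matter if lock() itself could 
throw, which it should not in practice:
{code:java}
import java.util.concurrent.locks.ReentrantLock;

class LockIdioms {
  private final ReentrantLock lock = new ReentrantLock();
  private int counter;

  // Idiom 1: acquire before try; unlock() only runs once the lock is
  // definitely held.
  void incrementA() {
    lock.lock();
    try {
      counter++;
    } finally {
      lock.unlock();
    }
  }

  // Idiom 2: acquire inside try; if lock() were ever to throw, the
  // finally would call unlock() without holding the lock. Since
  // ReentrantLock.lock() does not throw, the two behave the same here.
  void incrementB() {
    try {
      lock.lock();
      counter++;
    } finally {
      lock.unlock();
    }
  }
}{code}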

{quote}
The cancelledNodes data structure is a List but it should be a Queue
{quote}

True, I have changed it to an ArrayDeque, as was done for pendingNodes.
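
For anyone following along, using ArrayDeque through the Queue interface looks 
like this (illustrative names, not patch code):
{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

class CancelledNodesExample {
  public static void main(String[] args) {
    // ArrayDeque is array-backed, so it offers O(1) add/poll without
    // the per-element node allocation of LinkedList.
    Queue<String> cancelledNodes = new ArrayDeque<>();
    cancelledNodes.add("dn-1");          // enqueue at the tail
    String next = cancelledNodes.poll(); // dequeue from the head, or null
    System.out.println(next);            // dn-1
  }
}{code}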

{quote}
Using 'null' values is very out of vogue. Better to put a new HashMap here. 
Allows for simplification of the code by assuming that values will never be 
'null'.
{quote}

I use the fact that the value is null to decide whether the datanode needs an 
initial scan or not. The flow is:

1. Take a node from pendingNodes and add it to outOfServiceNodeBlocks with a 
null value.

2. Later, in the check() method, for each null entry in outOfServiceNodeBlocks, 
scan the node and add a HashMap of the blocks needing to be processed.
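
A minimal sketch of that pattern, with simplified stand-in types (String for 
the datanode type, Set<Long> for the per-node block structure):
{code:java}
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class NullSentinelSketch {
  private final Queue<String> pendingNodes = new ArrayDeque<>();
  // A null value means "node accepted for tracking but not yet scanned".
  private final Map<String, Set<Long>> outOfServiceNodeBlocks =
      new HashMap<>();

  // Step 1: move nodes over cheaply, deferring the expensive scan.
  void processPendingNodes() {
    String dn;
    while ((dn = pendingNodes.poll()) != null) {
      outOfServiceNodeBlocks.put(dn, null);
    }
  }

  // Step 2: on the next check(), null entries get their initial scan.
  void check() {
    for (Map.Entry<String, Set<Long>> e
        : outOfServiceNodeBlocks.entrySet()) {
      if (e.getValue() == null) {
        e.setValue(scanNode(e.getKey()));
      }
      // ... then process the remaining blocks for this node ...
    }
  }

  private Set<Long> scanNode(String dn) {
    return new HashSet<>(); // placeholder for the real storage scan
  }
}{code}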

{quote}
The method scanDatanodeStorage uses the namesystem.readLock(); in a pretty 
verbose and complicated way.
{quote}

We need to hold the read lock when calling dn.getStorageInfos() in the outer 
loop, but we want to drop the lock after processing each storage and then take 
it again. I could refactor this to call dn.getStorageInfos() in its own block, eg:
{code:java}
DatanodeStorageInfo[] storages;
namesystem.readLock();
try {
  storages = dn.getStorageInfos();
} finally {
  namesystem.readUnlock();
}{code}
And then simplify the locking code that follows, but I feel this would not be 
much better than what is there now.
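
For reference, the drop-and-retake pattern I am describing looks roughly like 
this (a paraphrase of the intent, not the exact patch code):
{code:java}
namesystem.readLock();
try {
  // getStorageInfos() must be called under the read lock.
  DatanodeStorageInfo[] storages = dn.getStorageInfos();
  for (DatanodeStorageInfo storage : storages) {
    // ... scan the blocks on this storage under the lock ...
    // Drop and immediately re-take the lock between storages so a
    // long-running scan does not starve writers.
    namesystem.readUnlock();
    namesystem.readLock();
  }
} finally {
  namesystem.readUnlock();
}{code}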


{quote}
This method is accessed by the local running Thread. However, pendingNodes does 
not appear to be a thread-safe Collection. Perhaps the collection cannot be 
modified because of the external locking of the writeLock but there is no 
requirement to have the lock stated in the startTrackingNode method javadoc.
{quote}

For nodes to be added to pendingNodes, that is always done under the namenode 
writeLock, which is taken in FSNamesystem. The only place pendingNodes is 
processed is in BackOffMonitor, via the monitor thread, which ensures it holds 
the write lock when processing pendingNodes here:
{code:java}
@Override
public void run() {
  LOG.debug("DatanodeAdminMonitorV2 is running.");
  if (!namesystem.isRunning()) {
    LOG.info("Namesystem is not running, skipping " +
        "decommissioning/maintenance checks.");
    return;
  }
  // Reset the checked count at beginning of each iteration
  numBlocksChecked = 0;
  // Check decommission or maintenance progress.
  try {
    try {
      /**
       * Other threads can modify the pendingNode list and the cancelled
       * node list, so we must process them under the NN write lock to
       * prevent any concurrent modifications.
       */
      namesystem.writeLock();
      // Always process the cancelled list before the pending list, as
      // it is possible for a node to be cancelled, and then quickly added
      // back again. If we process these the other way around, the added
      // node will be removed from tracking by the pending cancel.
      processCancelledNodes();
      processPendingNodes();
    } finally {
      namesystem.writeUnlock();
    } {code}
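
The producer side relies on the same lock; a sketch of what startTrackingNode 
amounts to (the body here is illustrative, not the exact patch code):
{code:java}
// Callers in FSNamesystem already hold the write lock, so a plain
// (non-thread-safe) collection is safe here.
public void startTrackingNode(DatanodeDescriptor dn) {
  assert namesystem.hasWriteLock();
  pendingNodes.add(dn);
}{code}
I can add a note to the startTrackingNode javadoc stating that the write lock 
must be held.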

{quote}
Nit: this is not very java-y...
{quote}

I think we could argue this one either way. My version looks cleaner to me, but 
I admit that modifying the passed-in structure can be slightly confusing.


{quote}
Please remove this method. It can be replaced with map.computeIfAbsent(key, k 
-> new LinkedList<V>()).add(v);
{quote}

Thanks for pointing this out. I did not know about that method. I have changed 
this to use computeIfAbsent.
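
For anyone else who did not know it, the replacement is a one-liner 
(illustrative key/value types):
{code:java}
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

class ComputeIfAbsentExample {
  public static void main(String[] args) {
    Map<String, List<Long>> blocksPerNode = new HashMap<>();
    // Creates the list on first access and appends to it thereafter;
    // no explicit containsKey()/null check is needed.
    blocksPerNode.computeIfAbsent("dn-1", k -> new LinkedList<>()).add(1024L);
    blocksPerNode.computeIfAbsent("dn-1", k -> new LinkedList<>()).add(2048L);
    System.out.println(blocksPerNode); // {dn-1=[1024, 2048]}
  }
}{code}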

{quote}
This code knows the pendingCount value and the pendingRepLimit... do not grab 
the write lock if the function is going to immediately return anyway.
{quote}

The problem is that we need the lock to check 
blockManager.getLowRedundancyBlocksCount(). If the pendingCount is not 
reducing, I would really like to log the replication queue size, as an 
overloaded replication queue may be the reason the pendingCount is not 
reducing. In an earlier version I did have the check outside the lock, but then 
I wanted to add the rep queue size to the log and moved it back in. This code 
runs only once per 30 seconds and the lock would only be held for a tiny amount 
of time, so I think the additional detail in the log is worth the lock price.
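
Roughly, the shape of that check is (a sketch of the intent, not the exact 
patch code):
{code:java}
namesystem.writeLock();
try {
  if (pendingCount >= pendingRepLimit) {
    // The queue size is the reason the lock is needed here: it lets us
    // log whether an overloaded replication queue is the cause of the
    // pendingCount not reducing.
    LOG.info("Pending replication limit {} reached with {} blocks; " +
        "low redundancy queue size is {}", pendingRepLimit, pendingCount,
        blockManager.getLowRedundancyBlocksCount());
    return;
  }
  // ... otherwise schedule more blocks for replication ...
} finally {
  namesystem.writeUnlock();
}{code}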


{quote}
I think it should return 'true' if the block is orphaned, no? It should skip 
them in the same way that an 'unknown' block is.
{quote}

That code is taken from the original decommissioning code, and I have not 
changed it. I believe it returns false so that the block keeps being tracked in 
the decommission structures until the NN cleans it out. At that point it would 
become an unknown block and be deleted by the "unknown block" code.

> Create improved decommission monitor implementation
> ---------------------------------------------------
>
>                 Key: HDFS-14854
>                 URL: https://issues.apache.org/jira/browse/HDFS-14854
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.3.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>         Attachments: Decommission_Monitor_V2_001.pdf, HDFS-14854.001.patch, 
> HDFS-14854.002.patch, HDFS-14854.003.patch, HDFS-14854.004.patch, 
> HDFS-14854.005.patch, HDFS-14854.006.patch, HDFS-14854.007.patch, 
> HDFS-14854.008.patch
>
>
> In HDFS-13157, we discovered a series of problems with the current 
> decommission monitor implementation, such as:
>  * Blocks are replicated sequentially disk by disk and node by node, and 
> hence the load is not spread well across the cluster
>  * Adding a node for decommission can cause the namenode write lock to be 
> held for a long time.
>  * Decommissioning nodes floods the replication queue, and under-replicated 
> blocks from a future node or disk failure may wait for a long time before they 
> are replicated.
>  * Blocks pending replication are checked many times under a write lock 
> before they are sufficiently replicated, wasting resources
> In this Jira I propose to create a new implementation of the decommission 
> monitor that resolves these issues. As it will be difficult to prove one 
> implementation is better than another, the new implementation can be enabled 
> or disabled, giving the choice of the existing implementation or the new one.
> I will attach a pdf with some more details on the design and then a version 1 
> patch shortly.


