[ https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937203#comment-16937203 ]

Kihwal Lee commented on HDFS-14852:
-----------------------------------

I don't think it can be reproduced simply by deleting missing blocks. I just 
tested on a 10-node cluster: after stopping multiple datanodes, missing blocks 
appeared, but deleting them did not cause phantom missing blocks. It is easy for 
the NN to figure out the correct queue for missing blocks, so the bug must be 
less obvious than that. There must be additional conditions that make it happen.

It is true that the remove method doesn't do what you said, because the design 
assumptions of the method were:
 * A block will be in at most one queue.
 * The guessed priority level is correct most of the time. (The code was 
written to optimize for the common case.)

If one or more of these assumptions no longer hold, the wording of the comment 
becomes misleading. It was meant to remove ONCE from ANY level. To remove from 
all levels, potentially multiple times, you would have to scan all levels every 
time, regardless of the level specified in the input (see the sketch below). 
That might be fine in terms of correctness, but not in terms of performance. 
The performance impact might not look like much, but it does actually matter.
{quote}So i improve the implement of LowRedundancyBlocks.remove and will it 
works as expected.
{quote}
+Your patch actually doesn't do this.+ If the block is found in the specified 
priority level, it returns after removing it once; it won't try to remove it 
from the other queues.
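
For reference, a remove that truly cleared every level would look roughly like 
the sketch below (illustrative only; it reuses {{priorityQueues}}, {{LEVEL}} and 
{{decrementBlockStat}} from the real class, and {{removeFromAllLevels}} is a 
made-up name). It scans every level on every call and keeps going after a hit, 
which is exactly the performance cost described above:
{code:java}
// Sketch, not a proposed patch: remove the block from ALL levels,
// potentially multiple times, instead of returning after the first hit.
synchronized boolean removeFromAllLevels(BlockInfo block, int oldExpectedReplicas) {
  boolean removed = false;
  for (int i = 0; i < LEVEL; i++) {
    if (priorityQueues.get(i).remove(block)) {
      decrementBlockStat(block, i, oldExpectedReplicas);
      removed = true; // keep scanning: the block may sit in more than one queue
    }
  }
  return removed;
}
{code}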

Let's look at what could possibly cause this (based on 2.8, where this issue is 
known to happen).

The only place that can cause double addition is when 
processPendingReplications() adds a block back to the queue. If a single piece 
of replication work is created and then another node dies, replication will be 
scheduled without the block being removed from the queue. If the replication 
times out, the block will be unconditionally added to a queue by 
processPendingReplications(). I do not see how, but if the priority level is 
somehow determined differently from the one already in a queue, the block will 
end up in two queues, as in the hypothetical timeline below. So there is a 
possibility that this is involved.
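
To illustrate (hypothetical, simplified code; the real {{add()}} takes replica 
counts and derives the level internally), the sequence would have to look 
something like this:
{code:java}
// Hypothetical timeline showing how one block could land in two queues.
int levelA = getPriority(block, /*live*/ 1, /*decommissioned*/ 0, /*expected*/ 3);
neededReplications.add(block, levelA);  // queued at levelA; replication work is
                                        // scheduled WITHOUT removing the block
// ... another replica is lost while the work sits in PendingReplicationBlocks,
// the work times out, and processPendingReplications() re-adds unconditionally:
int levelB = getPriority(block, /*live*/ 0, /*decommissioned*/ 0, /*expected*/ 3);
neededReplications.add(block, levelB);  // if levelB != levelA, the block is now
                                        // in two queues
{code}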

None of the other places where a block is added to a queue can cause double 
insertion:
 - checkReplication() - only called when closing a file. Before closing, blocks 
are not added to a replication queue.
 - updateNeededReplications() - first removes, then adds.
 - processBlocksInternal() - contains() is called against the queue (it checks 
all levels) and the block is skipped if found.

Can a block be added to a queue after deletion? If so, no amount of effort 
spent changing {{remove()}} will solve this problem. First of all, deletion is 
a two-phase process. In the first phase, inodes and blocks are collected and 
each of them is marked deleted; this is done with the write lock held. In the 
second phase, the write lock is held while processing 1000 blocks at a time, 
and the blocks are actually removed from the BlocksMap and the queues. Between 
the two phases, block.isDeleted() returns true, but getStoredBlock() still 
returns the block, since it is still in the map.
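
As a rough sketch (names approximate; {{partition}} is a hypothetical helper, 
not the exact 2.8 code):
{code:java}
// Phase 1 (write lock held): collect inodes and blocks, mark each block
// deleted. After this, block.isDeleted() == true for every collected block.
// Phase 2 (write lock re-acquired per batch): actually remove the blocks.
for (List<BlockInfo> batch : partition(collectedBlocks, 1000)) {
  namesystem.writeLock();
  try {
    for (BlockInfo b : batch) {
      blockManager.removeBlock(b); // drops b from the BlocksMap and the queues
    }
  } finally {
    namesystem.writeUnlock();
  }
}
// Between the phases: b.isDeleted() is true, yet getStoredBlock(b) != null,
// because b has not been removed from the BlocksMap yet.
{code}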

*Callers of add()*
 - checkReplication() - called from closeFile() and commitBlockSynchronization(); 
the write lock is held and a deletion check is performed.
 - processPendingReplications() - the write lock is held and it checks whether 
the block exists in the map, but it does not check {{block.isDeleted()}}.
 - updateNeededReplications() - removes and adds with the write lock held. See 
below for what callers of this method do.
 - DatanodeAdminManager#processBlocksInternal() - the write lock is held and 
deletion is checked.

*Callers of updateNeededReplications()*
 - markBlockAsCorrupt() - deletion check done with the write lock held.
 - addStoredBlock() - both deletion check and map entry check with the write 
lock held.
 - setReplication() - done with the write lock held; the INode (and thus the 
blocks) must exist to get here.
 - removeStoredBlock() - deletion check and map entry check with the write lock 
held.

{{processPendingReplications()}} is the only place where a block that is past 
phase 1 of deletion but not yet through phase 2 can be added back to a 
replication queue. Nothing else will add deleted blocks back to a queue.

I suggest two changes (see the sketch after this list).
 - Add a proper deletion check ({{block.isDeleted()}}) to 
{{processPendingReplications()}}.
 - For {{LowRedundancyBlocks}} or {{UnderReplicatedBlocks}}, add a new remove 
method to be used by deletion. This method will 1) unconditionally try to 
remove the block from the {{QUEUE_WITH_CORRUPT_BLOCKS}} level, and 2) regardless 
of the result of 1, try removing ONCE from the specified priority level. This 
will take care of most cases, and if anything is left over, we know it will be 
removed later when replication is scheduled. {{BlockManager#removeBlock()}} can 
call this.
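
In sketch form (hypothetical code; {{removeForDeletion}} is a made-up name, and 
the constants are the ones from the existing class):
{code:java}
// Change 1: in processPendingReplications(), skip blocks deleted after the
// replication work was scheduled.
BlockInfo bi = blockManager.getStoredBlock(timedOutItems[i]);
if (bi == null || bi.isDeleted()) {
  continue; // never re-queue a block that was deleted in the meantime
}

// Change 2: a remove method used only by the deletion path.
synchronized boolean removeForDeletion(BlockInfo block, int priLevel,
    int oldExpectedReplicas) {
  boolean removed = false;
  // 1) Unconditionally try the corrupt queue, where missing blocks live.
  if (priorityQueues.get(QUEUE_WITH_CORRUPT_BLOCKS).remove(block)) {
    decrementBlockStat(block, QUEUE_WITH_CORRUPT_BLOCKS, oldExpectedReplicas);
    removed = true;
  }
  // 2) Regardless of 1), try ONCE at the caller-supplied priority level.
  if (priLevel != QUEUE_WITH_CORRUPT_BLOCKS && priLevel >= 0 && priLevel < LEVEL
      && priorityQueues.get(priLevel).remove(block)) {
    decrementBlockStat(block, priLevel, oldExpectedReplicas);
    removed = true;
  }
  return removed;
}
{code}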

These changes will make sure deleted blocks are not added back to the 
replication queue, and that deletion clears missing blocks even if the priority 
level is not properly specified, all without adding much of a performance 
penalty.

> Remove of LowRedundancyBlocks do NOT remove the block from all queues
> ---------------------------------------------------------------------
>
>                 Key: HDFS-14852
>                 URL: https://issues.apache.org/jira/browse/HDFS-14852
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec
>    Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
>            Reporter: Fei Hui
>            Assignee: Fei Hui
>            Priority: Major
>         Attachments: CorruptBlocksMismatch.png, HDFS-14852.001.patch, 
> HDFS-14852.002.patch
>
>
> LowRedundancyBlocks.java
> {code:java}
> // Some comments here
>     if(priLevel >= 0 && priLevel < LEVEL
>         && priorityQueues.get(priLevel).remove(block)) {
>       NameNode.blockStateChangeLog.debug(
>           "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
>               + " from priority queue {}",
>           block, priLevel);
>       decrementBlockStat(block, priLevel, oldExpectedReplicas);
>       return true;
>     } else {
>       // Try to remove the block from all queues if the block was
>       // not found in the queue for the given priority level.
>       for (int i = 0; i < LEVEL; i++) {
>         if (i != priLevel && priorityQueues.get(i).remove(block)) {
>           NameNode.blockStateChangeLog.debug(
>               "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block" +
>                   " {} from priority queue {}", block, i);
>           decrementBlockStat(block, i, oldExpectedReplicas);
>           return true;
>         }
>       }
>     }
>     return false;
>   }
> {code}
> The source code is above; the comment in question reads as follows:
> {quote}
>       // Try to remove the block from all queues if the block was
>       // not found in the queue for the given priority level.
> {quote}
> The function "remove" does NOT remove the block from all queues.
> The add function in LowRedundancyBlocks.java is used in several places, so 
> one block may end up in two or more queues.
> We found that the corrupt block count does not match the corrupt files on 
> the NN web UI. Maybe it is related to this.
> Uploading an initial patch.


