[ 
https://issues.apache.org/jira/browse/HDFS-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217113#comment-14217113
 ] 

Yongjun Zhang commented on HDFS-6833:
-------------------------------------

Hi Nicholas and Chris,

Thanks for your good comments!

Let me try to describe my understanding of the current flow, the problem of the 
current flow, the solution Shinichi worked out, the flaw you guys pointed out, 
and a proposed change here.

The current flow:

# When a block is to be deleted, it's removed from the in-memory record 
(volumeMap), and the removal from disk is scheduled to happen asynchronously 
later, carried out by FsDatasetAsyncDiskService.
# When DirectoryScanner is running ({{DirectoryScanner#scan}}), it checks the 
in-memory blocks (retrieved via {{dataset.getFinalizedBlocks(bpid)}}) against 
the blocks on disk (retrieved via {{DirectoryScanner#getDiskReport()}}). If it 
sees a block that is on disk but not in memory, or in memory but not on disk, 
it records the difference in {{DirectoryScanner#diffs}}.
# After {{DirectoryScanner#scan}} is done, the {{DirectoryScanner#diffs}} is 
processed by calling {{dataset.checkAndUpdate}} from 
{{DirectoryScanner#reconcile}}. If a block is on disk but not in memory, the 
block is re-added to memory and reported back to the NN later. 
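The three steps above can be modeled with a minimal sketch (block IDs stand in 
for replicas; {{memoryBlocks}}/{{diskBlocks}} and the class name are 
illustrative only, not the real HDFS classes):

```java
import java.util.*;

// Minimal model of the scan/reconcile flow. The real classes
// (DirectoryScanner, FsDatasetImpl) are far richer; long block IDs
// stand in for replica objects here.
public class ScanReconcileSketch {
    // Blocks the DataNode believes it has (the volumeMap analogue).
    static Set<Long> memoryBlocks = new HashSet<>(Arrays.asList(1L, 2L));
    // Blocks actually present on disk (the getDiskReport() analogue).
    static Set<Long> diskBlocks = new HashSet<>(Arrays.asList(1L, 2L, 3L));

    public static void main(String[] args) {
        // scan(): record blocks on disk but missing from memory (the "diffs").
        List<Long> missingInMemory = new ArrayList<>();
        for (long b : diskBlocks) {
            if (!memoryBlocks.contains(b)) missingInMemory.add(b);
        }
        // reconcile()/checkAndUpdate(): re-add to memory -- this is where a
        // block pending async deletion (block 3 here) is wrongly resurrected.
        for (long b : missingInMemory) {
            memoryBlocks.add(b);
        }
        System.out.println(memoryBlocks.contains(3L)); // prints "true"
    }
}
```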

The problem of the current flow:

* Though DirectoryScanner only runs every 6 hours by default, the asynchronous 
block removal from disk described above can be delayed long enough that 
DirectoryScanner sees a difference between the in-memory record and the disk: 
a block scheduled for deletion is on disk but not in memory. The scanner then 
re-adds it, and the later block report sent to the NN presents this block as a 
good block. This is the problem.

The solution that Shinichi worked out:

# {{final ReplicaMap deletingBlock;}} is introduced in FsDatasetImpl to 
remember the blocks to be deleted. These blocks are recorded in this structure 
by {{FsDatasetImpl#invalidateBlock}} when called by 
{{BPOfferService#processCommandFromActive}}. That is, right after a block is 
removed from the in-memory volumeMap due to an invalidate request from the NN, 
it's recorded in {{FsDatasetImpl#deletingBlock}}.
# When DirectoryScanner is running ({{DirectoryScanner#scan}}) and sees a 
block that is on disk but not in memory, it does not jump to the conclusion 
that the block needs to be recorded in {{DirectoryScanner#diffs}}; instead, it 
checks against {{FsDatasetImpl#deletingBlock}} first.
# At the end of each DirectoryScanner run, the blocks examined in step 2 are 
removed from {{FsDatasetImpl#deletingBlock}} by calling 
{{dataset.removeDeletedBlocks(bpid, deletingBlockIds);}}
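The scan-time check in step 2 can be sketched as follows (a simplified 
stand-in for {{FsDatasetImpl#deletingBlock}} and the scan loop; the method 
name {{isDiff}} is hypothetical):

```java
import java.util.*;

// Sketch of the patch's idea: before treating an on-disk-only block as a
// difference, consult the to-be-deleted record first.
public class DeletingBlockCheckSketch {
    // volumeMap analogue and the deletingBlock analogue, keyed by block ID.
    static Set<Long> memoryBlocks = new HashSet<>(Arrays.asList(1L));
    static Set<Long> deletingBlocks = new HashSet<>(Arrays.asList(3L)); // invalidated by NN

    // Returns true if the scanner should record this on-disk block as a diff.
    static boolean isDiff(long blockId) {
        if (memoryBlocks.contains(blockId)) return false;   // known good replica
        if (deletingBlocks.contains(blockId)) return false; // pending deletion: ignore
        return true;                                        // genuinely missing in memory
    }

    public static void main(String[] args) {
        System.out.println(isDiff(3L)); // scheduled for deletion -> false
        System.out.println(isDiff(4L)); // truly unknown block -> true
    }
}
```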

The flaw that Chris pointed out:

* In step 3 of the solution, removing from {{FsDatasetImpl#deletingBlock}} 
should happen after the disk block is removed.

This is a good catch! Though the chance that the blocks don't get removed from 
disk is slim, it is still possible.

Nicholas suggested creating a map <block, ReplicaFileDeleteTask> and managing 
it in FsDatasetAsyncDiskService. Chris suggested moving step 3 closer to the 
disk block removal. Both are nice ideas.

A proposed change to address the comments:

Since the volumeMap is managed in FsDatasetImpl, I think it's not too bad to 
keep {{FsDatasetImpl#deletingBlock}} where it is now (sitting together with 
the volumeMap). 

The key is that we want to be sure that we remove entries from it only after 
disk removal. There are two approaches:

# Let FsDatasetAsyncDiskService accumulate the list of blocks whose 
{{ReplicaFileDeleteTask}} is FINISHED until it reaches a certain size, then 
call the FsDataset API to remove them from {{FsDatasetImpl#deletingBlock}}.
# Let FsDatasetAsyncDiskService call the FsDataset API to remove a block each 
time a {{ReplicaFileDeleteTask}} finishes.

Since these operations require synchronization, approach 1 may be better 
performance-wise. 

To go with approach 1, we need to add a data structure (say, a list) in 
{{FsDatasetAsyncDiskService}} to remember deleted blocks, and update the list 
in {{ReplicaFileDeleteTask#deleteFiles}} and 
{{ReplicaFileDeleteTask#moveFiles}} in a synchronized way:

* add the deleted block to the list
* when the size of the list reaches a certain threshold, call an API similar 
to {{dataset.removeDeletedBlocks(bpid, deletingBlockIds);}} in the patch to 
remove the list entries from {{FsDatasetImpl#deletingBlock}}, and empty the 
list
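A minimal sketch of this batching, under the assumption that 
{{recordDeleted}} is invoked only after the files are actually gone from disk 
({{MAX_BATCH}}, {{recordDeleted}} and the flush call are hypothetical names; 
the real patch would call something like 
{{dataset.removeDeletedBlocks(bpid, deletingBlockIds)}}):

```java
import java.util.*;

// Sketch of approach 1: accumulate IDs of blocks whose delete task has
// finished, and flush them to the dataset in batches.
public class BatchedRemovalSketch {
    static final int MAX_BATCH = 3; // flush threshold (illustrative)
    static final List<Long> finishedIds = new ArrayList<>();
    // Stands in for FsDatasetImpl#deletingBlock.
    static final Set<Long> deletingBlocks =
        new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L));

    // Called at the end of ReplicaFileDeleteTask#deleteFiles / #moveFiles,
    // i.e. only after the block files are actually removed from disk.
    static synchronized void recordDeleted(long blockId) {
        finishedIds.add(blockId);
        if (finishedIds.size() >= MAX_BATCH) {
            deletingBlocks.removeAll(finishedIds); // the removeDeletedBlocks() call
            finishedIds.clear();
        }
    }

    public static void main(String[] args) {
        recordDeleted(1L);
        recordDeleted(2L);
        System.out.println(deletingBlocks.size()); // not flushed yet: prints 4
        recordDeleted(3L); // reaches MAX_BATCH, flushes the batch
        System.out.println(deletingBlocks.size()); // prints 1
    }
}
```

Batching keeps the number of synchronized calls into FsDatasetImpl low, which 
is the performance argument for approach 1 above.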

Do you guys think this proposal would address your comments? Thanks a lot.

I will leave it to Shinichi to address the good questions about the possible 
configuration issue.

BTW, the same thought did occur to me earlier, but my bad for letting it slip 
away due to a later incorrect thought. See 
https://issues.apache.org/jira/browse/HDFS-6833?focusedCommentId=14099884&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14099884.
 I will certainly be more careful in the future.


> DirectoryScanner should not register a deleting block with memory of DataNode
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-6833
>                 URL: https://issues.apache.org/jira/browse/HDFS-6833
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0, 2.5.0, 2.5.1
>            Reporter: Shinichi Yamashita
>            Assignee: Shinichi Yamashita
>            Priority: Critical
>         Attachments: HDFS-6833-6-2.patch, HDFS-6833-6-3.patch, 
> HDFS-6833-6.patch, HDFS-6833-7-2.patch, HDFS-6833-7.patch, HDFS-6833.8.patch, 
> HDFS-6833.9.patch, HDFS-6833.patch, HDFS-6833.patch, HDFS-6833.patch, 
> HDFS-6833.patch, HDFS-6833.patch
>
>
> When a block is deleted in DataNode, the following messages are usually 
> output.
> {code}
> 2014-08-07 17:53:11,606 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Scheduling blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
>  for deletion
> 2014-08-07 17:53:11,617 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Deleted BP-1887080305-172.28.0.101-1407398838872 blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
> {code}
> However, DirectoryScanner may be executed when DataNode deletes the block in 
> the current implementation. And the following messages are output.
> {code}
> 2014-08-07 17:53:30,519 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Scheduling blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
>  for deletion
> 2014-08-07 17:53:31,426 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1887080305-172.28.0.101-1407398838872 Total blocks: 1, missing metadata 
> files:0, missing block files:0, missing blocks in memory:1, mismatched 
> blocks:0
> 2014-08-07 17:53:31,426 WARN 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Added 
> missing block to memory FinalizedReplica, blk_1073741825_1001, FINALIZED
>   getNumBytes()     = 21230663
>   getBytesOnDisk()  = 21230663
>   getVisibleLength()= 21230663
>   getVolume()       = /hadoop/data1/dfs/data/current
>   getBlockFile()    = 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
>   unlinked          =false
> 2014-08-07 17:53:31,531 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Deleted BP-1887080305-172.28.0.101-1407398838872 blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
> {code}
> Deleting block information is registered in DataNode's memory.
> And when DataNode sends a block report, NameNode receives wrong block 
> information.
> For example, when we execute recommission or change the number of 
> replication, NameNode may delete the right block as "ExcessReplicate" by this 
> problem.
> And "Under-Replicated Blocks" and "Missing Blocks" occur.
> When DataNode run DirectoryScanner, DataNode should not register a deleting 
> block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
