[ 
https://issues.apache.org/jira/browse/HDFS-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425805#comment-17425805
 ] 

Bryan Beaudreault commented on HDFS-16261:
------------------------------------------

I'm looking at this now. I don't have much experience in this area, but I'm 
looking at two high-level possibilities: handling this in the NameNode or 
handling it in the DataNode.
h2. Handling in the NameNode

When a DataNode receives a block, it notifies the NameNode via 
notifyNamenodeReceivedBlock. This sends a RECEIVED_BLOCK report to the NameNode 
along with a "delHint", which tells the NameNode to invalidate that block on 
the old host.
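
For illustration, here's a rough sketch of that notification path. The class and 
method shapes below are simplified, hypothetical stand-ins, not the real Hadoop 
code:

{code:java}
// Hypothetical sketch of a received-block notification carrying a delHint.
// Class and field names are simplified and are not the real HDFS ones.
import java.util.ArrayList;
import java.util.List;

class ReceivedBlockNotification {
    final long blockId;
    final String delHint; // UUID of the DataNode whose replica can now be invalidated (may be empty)

    ReceivedBlockNotification(long blockId, String delHint) {
        this.blockId = blockId;
        this.delHint = delHint;
    }
}

class DataNodeSide {
    private final List<ReceivedBlockNotification> pendingReports = new ArrayList<>();

    // Called once a block has been fully received (e.g. after a REPLACE_BLOCK transfer).
    void notifyNamenodeReceivedBlock(long blockId, String delHintNodeUuid) {
        synchronized (pendingReports) {
            pendingReports.add(new ReceivedBlockNotification(blockId, delHintNodeUuid));
        }
        // The real DataNode batches these into incremental block reports that are
        // sent to the NameNode by the block pool service actor.
    }
}
{code}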

Tracing that delHint through a bunch of layers on the NameNode side, you 
eventually land in BlockManager.processExtraRedundancyBlock, which in turn 
calls processChosenExcessRedundancy. processChosenExcessRedundancy adds the 
block to an excessRedundancyMap and to a nodeToBlocks map in InvalidateBlocks.

There is a RedundancyChore which periodically checks InvalidateBlocks, pulling 
a configurable number of blocks and adding them to the DatanodeDescriptor's 
invalidateBlocks map. One quick option might be to configure the chore interval 
via dfs.namenode.redundancy.interval.seconds, though that's not exactly what we 
want, which is a per-block grace period.
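
To make that limitation concrete, here's a rough sketch of an interval-driven 
chore; the names are made up (not the real RedundancyMonitor/InvalidateBlocks 
code), but it shows why the interval alone isn't a per-block grace period:

{code:java}
// Hypothetical sketch of an interval-driven chore that drains pending
// invalidations into per-node queues. Names are illustrative only.
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class RedundancyChoreSketch {
    private final Map<String, Queue<Long>> pendingByNode = new HashMap<>();       // InvalidateBlocks-ish
    private final Map<String, Queue<Long>> nodeInvalidateQueue = new HashMap<>(); // DatanodeDescriptor-ish
    private final int blocksPerRun = 1000; // the "configurable number of blocks"

    void start(long intervalSeconds /* dfs.namenode.redundancy.interval.seconds */) {
        ScheduledExecutorService chore = Executors.newSingleThreadScheduledExecutor();
        chore.scheduleAtFixedRate(this::runOnce, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    synchronized void runOnce() {
        for (Map.Entry<String, Queue<Long>> e : pendingByNode.entrySet()) {
            Queue<Long> target = nodeInvalidateQueue.computeIfAbsent(e.getKey(), k -> new ArrayDeque<>());
            for (int i = 0; i < blocksPerRun && !e.getValue().isEmpty(); i++) {
                // A block queued just before this run is drained immediately,
                // so the interval alone gives no per-block grace period.
                target.add(e.getValue().poll());
            }
        }
    }
}
{code}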

The next time a DataNode sends a heartbeat, the NameNode processes various 
state for that DataNode and sends back a series of commands. Here the NameNode 
pulls a configurable number of blocks from the DatanodeDescriptor's 
invalidateBlocks and sends them to the DataNode as part of a DNA_INVALIDATE 
command.
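
Roughly, that heartbeat side looks like the following (again a simplified 
sketch with made-up names, not the real DatanodeManager/BlockCommand code):

{code:java}
// Hypothetical sketch of building a DNA_INVALIDATE-style command from a
// bounded slice of a node's invalidate queue during heartbeat processing.
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class HeartbeatSketch {
    static final int INVALIDATE_LIMIT = 1000; // the "configurable number of blocks" per heartbeat

    static List<Long> buildInvalidateCommand(Queue<Long> nodeInvalidateQueue) {
        List<Long> toInvalidate = new ArrayList<>();
        while (toInvalidate.size() < INVALIDATE_LIMIT && !nodeInvalidateQueue.isEmpty()) {
            toInvalidate.add(nodeInvalidateQueue.poll());
        }
        // The real NameNode wraps this list in a command with action DNA_INVALIDATE
        // and returns it in the heartbeat response.
        return toInvalidate;
    }
}
{code}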

If we were to handle this in the NameNode, we could potentially hook in a 
couple of places:
 * When adding to nodeToBlocks, we could include a timestamp. The 
RedundancyChore would then only add blocks to the Descriptor's invalidateBlocks 
map once they are older than a threshold (see the sketch after this list).
 * When adding to the Descriptor's invalidateBlocks, we could add a timestamp. 
When processing heartbeats, we would only send blocks via DNA_INVALIDATE that 
have been in invalidateBlocks for more than a threshold.
 * As mentioned above, we could try tuning 
dfs.namenode.redundancy.interval.seconds, though that isn't perfect because a 
block could be added right before the chore runs and thus get invalidated 
immediately.
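
Here's a sketch of the timestamp idea from the first two bullets. The class 
names and the config key are hypothetical, not existing HDFS code:

{code:java}
// Hypothetical sketch: entries carry their enqueue time and are only released
// for DNA_INVALIDATE once a grace period has passed.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class GracePeriodInvalidateQueue {
    static class Entry {
        final long blockId;
        final long enqueuedAtMs;
        Entry(long blockId, long enqueuedAtMs) {
            this.blockId = blockId;
            this.enqueuedAtMs = enqueuedAtMs;
        }
    }

    private final Deque<Entry> queue = new ArrayDeque<>();
    private final long gracePeriodMs; // e.g. a new dfs.namenode.invalidate.grace.period.ms (hypothetical key)

    GracePeriodInvalidateQueue(long gracePeriodMs) {
        this.gracePeriodMs = gracePeriodMs;
    }

    synchronized void add(long blockId) {
        queue.addLast(new Entry(blockId, System.currentTimeMillis()));
    }

    // Called by the chore (option 1) or by heartbeat processing (option 2): only
    // blocks that have waited out the grace period are handed to DNA_INVALIDATE.
    synchronized List<Long> pollExpired(int limit) {
        List<Long> ready = new ArrayList<>();
        long now = System.currentTimeMillis();
        while (ready.size() < limit && !queue.isEmpty()
            && now - queue.peekFirst().enqueuedAtMs >= gracePeriodMs) {
            ready.add(queue.pollFirst().blockId);
        }
        return ready;
    }
}
{code}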

h2. Handling in the DataNode

When a DataNode gets a request for a block, it looks it up in its FsDatasetImpl 
volumeMap. If the block does not exist there, a ReplicaNotFoundException is 
thrown.

The DataNode receives the list of blocks to invalidate from the DNA_INVALIDATE 
command, which is processed by BPOfferService. This is handed off to 
FsDatasetImpl.invalidate, which validates the request and immediately removes 
the block from volumeMap. At this point the data still exists on disk, but 
requests for the block will throw a ReplicaNotFoundException per the above.
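
A minimal sketch of that lookup/invalidate interaction, with simplified 
stand-ins for FsDatasetImpl's volumeMap and for ReplicaNotFoundException 
(not the real classes):

{code:java}
// Hypothetical sketch of the lookup/invalidate interaction described above.
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for the real ReplicaNotFoundException.
class ReplicaNotFoundException extends IOException {
    ReplicaNotFoundException(String msg) { super(msg); }
}

class VolumeMapSketch {
    private final Map<Long, String> volumeMap = new ConcurrentHashMap<>(); // blockId -> on-disk path

    String getReplicaPath(long blockId) throws ReplicaNotFoundException {
        String path = volumeMap.get(blockId);
        if (path == null) {
            // This is what a reader hits once invalidate() has run, even though
            // the bytes may still be sitting on disk awaiting async deletion.
            throw new ReplicaNotFoundException("Replica not found for block " + blockId);
        }
        return path;
    }

    void invalidate(long blockId) {
        String path = volumeMap.remove(blockId); // block becomes unreadable immediately
        // ... hand `path` off to the async disk service for the actual file deletion
    }
}
{code}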

Once the block is removed from volumeMap, the deletion of the data is handled 
by the FsDatasetAsyncDiskService. The processing is async, but it is 
immediately handed off to a ThreadPoolExecutor, which should execute fairly 
quickly.
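
Sketched out, the current handoff looks roughly like this (illustrative names, 
not the real FsDatasetAsyncDiskService):

{code:java}
// Hypothetical sketch of the current behaviour: file deletion is handed
// straight to a thread pool and usually runs almost immediately.
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncDiskServiceSketch {
    private final ExecutorService deletionPool = Executors.newFixedThreadPool(4);

    void deleteAsync(File blockFile, File metaFile) {
        deletionPool.execute(() -> {
            // Runs as soon as a pool thread is free, so today the on-disk
            // grace window is effectively just queueing delay.
            blockFile.delete();
            metaFile.delete();
        });
    }
}
{code}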

A couple of options:
 * Defer the call to FsDatasetImpl.invalidate at the highest level. This could 
be passed off to a thread pool to be executed after a delay. In this case, the 
block would remain in the volumeMap until the task is executed.
 * Execute invalidate immediately, but defer the data deletion. We're already 
using a thread pool here, so it might be easier to execute after a delay. It's 
worth noting that there are other actions taken around the volumeMap removal; 
we'd need to verify whether those need to be synchronized with removal from 
volumeMap. In this case we'd need to either:
 ** Relocate the volumeMap.remove call to within the FsDatasetAsyncDiskService. 
This seems like somewhat of a leaky abstraction.
 ** Add a pendingDeletion map and add to it when removing from volumeMap. The 
FsDatasetAsyncDiskService would remove from pendingDeletion once the deletion 
completes. We'd need to update our block fetch code to check volumeMap _or_ 
pendingDeletion (see the sketch after this list). This separation might give us 
opportunities in the future, such as including a flag in the response that 
instructs the DFSClient "this block may go away soon".
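
Here's a rough sketch of that last pendingDeletion idea combined with a delayed 
deletion. All names, including the config key, are hypothetical:

{code:java}
// Hypothetical sketch: the replica leaves volumeMap but stays readable via
// pendingDeletion until a delayed deletion task runs.
import java.io.File;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class DeferredInvalidationSketch {
    private final Map<Long, String> volumeMap = new ConcurrentHashMap<>();       // blockId -> path
    private final Map<Long, String> pendingDeletion = new ConcurrentHashMap<>(); // invalidated, not yet deleted
    private final ScheduledExecutorService deleter = Executors.newSingleThreadScheduledExecutor();
    private final long gracePeriodMs; // e.g. a new dfs.datanode.invalidate.grace.period.ms (hypothetical key)

    DeferredInvalidationSketch(long gracePeriodMs) {
        this.gracePeriodMs = gracePeriodMs;
    }

    // Lookup checks volumeMap _or_ pendingDeletion, so readers keep working
    // during the grace period (and could be told the block may go away soon).
    String getReplicaPath(long blockId) {
        String path = volumeMap.get(blockId);
        return path != null ? path : pendingDeletion.get(blockId);
    }

    void invalidate(long blockId) {
        String path = volumeMap.remove(blockId);
        if (path == null) {
            return;
        }
        pendingDeletion.put(blockId, path);
        deleter.schedule(() -> {
            String p = pendingDeletion.remove(blockId);
            if (p != null) {
                new File(p).delete(); // meta file deletion omitted for brevity
            }
        }, gracePeriodMs, TimeUnit.MILLISECONDS);
    }
}
{code}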

 
----
I'm doing more investigation, and specifically want to look into what would 
happen if the handling service died before invalidating blocks. I assume this 
is already handled since the process is quite async already, but it will be 
good to know. I also want to think a bit more about the pros and cons of each 
option above, and do some experimenting with the easiest option of tuning the 
redundancy chore. I'll report back when I have more information, and I'm also 
open to other opinions or suggestions.

> Configurable grace period around deletion of invalidated blocks
> ---------------------------------------------------------------
>
>                 Key: HDFS-16261
>                 URL: https://issues.apache.org/jira/browse/HDFS-16261
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>
> When a block is moved with REPLACE_BLOCK, the new location is recorded in the 
> NameNode and the NameNode instructs the old host to invalidate the block 
> using DNA_INVALIDATE. As it stands today, this invalidation is async but 
> tends to happen relatively quickly.
> I'm working on a feature for HBase which enables efficient healing of 
> locality through Balancer-style low level block moves (HBASE-26250). One 
> issue is that HBase tends to keep open long running DFSInputStreams and 
> moving blocks from under them causes lots of warns in the RegionServer and 
> increases long tail latencies due to the necessary retries in the DFSClient.
> One way I'd like to fix this is to provide a configurable grace period on 
> async invalidations. This would give the DFSClient enough time to refresh 
> block locations before hitting any errors.


