[
https://issues.apache.org/jira/browse/HDFS-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425805#comment-17425805
]
Bryan Beaudreault commented on HDFS-16261:
------------------------------------------
I'm looking at this now. I don't have much experience in this area, but at a
high level I'm looking into two possibilities: handling this on the NameNode or
handling it on the DataNode.
h2. Handling in the NameNode
When a DataNode receives a block, it notifies the NameNode via
notifyNamenodeReceivedBlock. This sends a RECEIVED_BLOCK report to the NameNode
along with a "delHint", which tells the NameNode to invalidate that block on
the old host.
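To make the delHint flow concrete, here is a minimal toy model (the class and
field names are mine for illustration, not the real HDFS types): a block-received
report carries an optional delHint, and the NameNode queues the hinted node's
replica for invalidation.

```java
import java.util.*;

// Toy model of the delHint mechanism. Names here are illustrative only;
// the real path goes through notifyNamenodeReceivedBlock and the
// NameNode's block report processing.
class DelHintModel {
    // NameNode-side bookkeeping: node name -> block IDs queued for invalidation.
    static final Map<String, Set<Long>> invalidateQueues = new HashMap<>();

    // Analogue of processing a RECEIVED_BLOCK report that carries a delHint:
    // the replica on the hinted (old) node becomes a candidate for deletion.
    static void blockReceived(long blockId, String newNode, String delHint) {
        if (delHint != null) {
            invalidateQueues
                .computeIfAbsent(delHint, n -> new HashSet<>())
                .add(blockId);
        }
    }
}
```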
Tracing that delHint through a number of layers on the NameNode side, you
eventually land in BlockManager.processExtraRedundancyBlock, which eventually
lands in processChosenExcessRedundancy.
processChosenExcessRedundancy adds the block to an excessRedundancyMap and to a
nodeToBlocks map in InvalidateBlocks.
There is a RedundancyChore which periodically checks InvalidateBlocks, pulling
a configurable number of blocks and adding them to the DatanodeDescriptor's
invalidateBlocks map. One quick option might be to tune the chore via
dfs.namenode.redundancy.interval.seconds, though that's a global interval, not
the per-block grace period we actually want.
Next time a DataNode sends a heartbeat, the NameNode processes various state
for that datanode and sends back a series of commands. Here the NameNode pulls
a configurable number of blocks from the DatanodeDescriptor's invalidateBlocks
and sends them to the DataNode as part of a DNA_INVALIDATE command.
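A sketch of that heartbeat-time step, again with illustrative names rather than
the real HDFS classes: pull at most a configured number of queued blocks for a
node, and the drained batch stands in for the block list a DNA_INVALIDATE
command would carry.

```java
import java.util.*;

// Toy model of batching invalidations at heartbeat time. The class and
// method names are hypothetical; only the batching behavior mirrors what
// the comment above describes.
class HeartbeatInvalidation {
    // node name -> block IDs awaiting invalidation on that node
    static final Map<String, Deque<Long>> pendingInvalidations = new HashMap<>();

    // Drain up to `limit` block IDs for this node; the returned list stands
    // in for the payload of a single DNA_INVALIDATE command. Anything left
    // over waits for a later heartbeat.
    static List<Long> buildInvalidateCommand(String node, int limit) {
        Deque<Long> queue =
            pendingInvalidations.getOrDefault(node, new ArrayDeque<>());
        List<Long> batch = new ArrayList<>();
        while (batch.size() < limit && !queue.isEmpty()) {
            batch.add(queue.poll());
        }
        return batch;
    }
}
```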
If we were to handle this in the NameNode, we could potentially hook in a
couple places:
* When adding to nodeToBlocks, we could include a timestamp. The
RedundancyChore would then only add blocks to the Descriptor's invalidateBlocks
map once they are older than a threshold.
* When adding to the Descriptor's invalidateBlocks, we could record a
timestamp. When processing heartbeats, we would only send blocks via
DNA_INVALIDATE that have been in invalidateBlocks for more than a threshold.
* As mentioned above, we could try tuning
dfs.namenode.redundancy.interval.seconds, though that isn't perfect because a
block could be added right before the chore runs and thus get invalidated
immediately.
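The second option above can be sketched roughly as follows. Everything here is
hypothetical (the class, the map, and the config constant are stand-ins, not
existing HDFS code): record an enqueue timestamp per block, and at heartbeat
time only release blocks that have aged past the grace period.

```java
import java.util.*;

// Sketch of a timestamp-gated invalidateBlocks map. GRACE_MS stands in for
// a new, not-yet-existing configuration key for the grace period.
class GracePeriodGate {
    static final long GRACE_MS = 5_000;

    // block ID -> time it entered the node's invalidateBlocks map
    static final Map<Long, Long> enqueuedAt = new LinkedHashMap<>();

    static void queueForInvalidation(long blockId, long nowMs) {
        enqueuedAt.putIfAbsent(blockId, nowMs);
    }

    // Only blocks older than the grace period are released for the
    // DNA_INVALIDATE command; younger blocks stay queued and get picked up
    // by a later heartbeat.
    static List<Long> releaseEligible(long nowMs) {
        List<Long> eligible = new ArrayList<>();
        for (Iterator<Map.Entry<Long, Long>> it =
                 enqueuedAt.entrySet().iterator(); it.hasNext();) {
            Map.Entry<Long, Long> e = it.next();
            if (nowMs - e.getValue() >= GRACE_MS) {
                eligible.add(e.getKey());
                it.remove();
            }
        }
        return eligible;
    }
}
```

Note this avoids the RedundancyChore-tuning pitfall: a block enqueued just
before a heartbeat simply isn't eligible yet, so the grace period is honored
per block.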
h2. Handling in the DataNode
When a DataNode gets a request for a block, it looks it up in its
FsDatasetImpl volumeMap. If the block does not exist, a
ReplicaNotFoundException is thrown.
The DataNode receives the list of blocks to invalidate from the DNA_INVALIDATE
command, which is processed by BPOfferService. This is immediately handed off
to FsDatasetImpl.invalidate, which validates the request and immediately
removes the block from volumeMap. At this point, the data still exists on disk
but requests for the block would throw a ReplicaNotFoundException per above.
Once removed from volumeMap, the deletion of data is handled by the
FsDatasetAsyncDiskService. The processing is asynchronous, but tasks are
immediately handed off to a ThreadPoolExecutor and should execute fairly
quickly.
A couple options:
* Defer the call to FsDatasetImpl.invalidate, at the highest level. This could
be passed off to a thread pool to be executed after a delay. In this case, the
block would remain in the volumeMap until the task is executed.
* Execute invalidate immediately, but defer the data deletion. We're already
using a thread pool here, so it might be easy to schedule execution after a
delay. It's worth noting that there are other actions taken around the
volumeMap removal; we'd need to verify whether those must stay synchronized
with the removal from volumeMap. In this case we'd need to either:
** Relocate the volumeMap.remove call to within the FsDatasetAsyncDiskService.
This seems like somewhat of a leaky abstraction.
** Add a pendingDeletion map and add to that when removing from volumeMap. The
FsDatasetAsyncDiskService would remove from pendingDeletion once completed.
We'd need to update our block fetch code to check volumeMap _or_
pendingDeletion. This separation might give us opportunities in the future,
such as including a flag in the response that instructs the DFSClient "this
block may go away soon".
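The pendingDeletion idea can be sketched as follows. All names are hypothetical
(pendingDeletion does not exist today): invalidation moves the replica from
volumeMap into pendingDeletion immediately, the read path consults both maps,
and the async deleter clears pendingDeletion only once the on-disk data is
actually gone.

```java
import java.util.*;

// Toy model of the proposed pendingDeletion map. Real code would live in
// FsDatasetImpl and FsDatasetAsyncDiskService; these maps and methods are
// stand-ins to show the intended state transitions.
class PendingDeletionModel {
    static final Map<Long, String> volumeMap = new HashMap<>();       // blockId -> volume
    static final Map<Long, String> pendingDeletion = new HashMap<>(); // blockId -> volume

    // Invalidation: the replica leaves volumeMap at once, but stays readable
    // via pendingDeletion until the on-disk delete completes.
    static void invalidate(long blockId) {
        String vol = volumeMap.remove(blockId);
        if (vol != null) {
            pendingDeletion.put(blockId, vol);
        }
    }

    // Read path: a replica in either map is still servable. A response built
    // from pendingDeletion could carry a "this block may go away soon" flag.
    static boolean canServe(long blockId) {
        return volumeMap.containsKey(blockId)
            || pendingDeletion.containsKey(blockId);
    }

    // Called by the async deletion task after the file is removed from disk.
    static void deletionCompleted(long blockId) {
        pendingDeletion.remove(blockId);
    }
}
```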
----
I'm doing more investigation and specifically want to look into what would
happen if the handling service died before invalidating blocks. I'm assuming
this is already handled, since the process is highly asynchronous as it stands,
but it will be good to know. I also want to think more about the pros and cons
of each option above, and do some experimenting with the easiest option of
tuning the redundancy chore. I'll report back when I have more information; I'm
also open to other opinions or suggestions.
> Configurable grace period around deletion of invalidated blocks
> ---------------------------------------------------------------
>
> Key: HDFS-16261
> URL: https://issues.apache.org/jira/browse/HDFS-16261
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Bryan Beaudreault
> Assignee: Bryan Beaudreault
> Priority: Major
>
> When a block is moved with REPLACE_BLOCK, the new location is recorded in the
> NameNode and the NameNode instructs the old host to invalidate the block
> using DNA_INVALIDATE. As it stands today, this invalidation is async but
> tends to happen relatively quickly.
> I'm working on a feature for HBase which enables efficient healing of
> locality through Balancer-style low level block moves (HBASE-26250). One
> issue is that HBase tends to keep open long running DFSInputStreams and
> moving blocks from under them causes lots of warns in the RegionServer and
> increases long tail latencies due to the necessary retries in the DFSClient.
> One way I'd like to fix this is to provide a configurable grace period on
> async invalidations. This would give the DFSClient enough time to refresh
> block locations before hitting any errors.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)