[ 
https://issues.apache.org/jira/browse/HDFS-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426391#comment-17426391
 ] 

Bryan Beaudreault commented on HDFS-16261:
------------------------------------------

Despite how well tuning "dfs.namenode.redundancy.interval.seconds" has worked, 
I don't think that's a good long-term option, because the RedundancyMonitor also 
handles some processing of reconstruction and misplaced blocks, and I don't want 
to interfere with those processes.

For now I've decided to go the route of encoding an insertion timestamp in the 
NameNode's InvalidateBlocks nodeToBlocks map. This felt like the easiest 
approach, since it's just a minor change to the existing system for handing out 
block invalidation commands.
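
To illustrate the idea, here's a minimal sketch as a hypothetical standalone 
class rather than the real InvalidateBlocks internals (the class name, field 
names, and grace-period parameter below are made up): each queued block carries 
the timestamp it was added, and blocks are only handed back out to the DataNode 
once a configurable grace period has elapsed.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: pair each pending-invalidation block with the time it was
// queued, and only hand blocks out once the grace period has elapsed.
public class PendingInvalidations {
  // outer key: datanode UUID, inner map: blockId -> insertion timestamp
  private final Map<String, Map<Long, Long>> nodeToBlocks = new HashMap<>();
  private final long gracePeriodMs; // hypothetical new config value

  public PendingInvalidations(long gracePeriodMs) {
    this.gracePeriodMs = gracePeriodMs;
  }

  // Record a block for later invalidation on the given datanode.
  public synchronized void add(String datanodeUuid, long blockId) {
    nodeToBlocks.computeIfAbsent(datanodeUuid, k -> new HashMap<>())
        .put(blockId, System.currentTimeMillis());
  }

  // Return (and remove) only the blocks whose grace period has expired.
  public synchronized List<Long> pollInvalidatable(String datanodeUuid) {
    List<Long> ready = new ArrayList<>();
    Map<Long, Long> blocks = nodeToBlocks.get(datanodeUuid);
    if (blocks == null) {
      return ready;
    }
    long now = System.currentTimeMillis();
    blocks.entrySet().removeIf(e -> {
      if (now - e.getValue() >= gracePeriodMs) {
        ready.add(e.getKey());
        return true;
      }
      return false;
    });
    return ready;
  }
}
{code}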

I'll be testing this out in a test cluster shortly.

In the meantime, I've spent some time looking into how the NameNode handles 
crash recovery when blocks might have been awaiting deletion. The only handling 
is that on NameNode startup it will find all over-replicated blocks and try to 
reduce the replication to what is expected. This means invalidating blocks 
again, but not necessarily the ones we had originally chosen.

This definitely seems like a downside of this approach. It basically means that 
we may mess up locality again after the NameNode restarts, since it may very 
well decide to keep the replica we originally invalidated. We'd need to re-run 
the process of moving blocks back to where we want them, which could be handled 
automatically but may temporarily degrade latencies a bit. Another negative 
aspect of this approach, which I realized during the investigation, is that if 
a client calls DFSClient.getLocatedBlocks while a block is pending deletion, 
the result will include the to-be-deleted replica until it's been fully purged.

I think implementing this in the DataNode instead would avoid both of those 
downsides. On the flip side, if a DataNode restarted while a block was pending 
deletion, when it started back up again the block would no longer be available. 
This seems like a totally reasonable failure mode.
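
For comparison, here's a rough sketch of the DataNode-side idea, again as a 
hypothetical standalone class rather than actual HDFS code (the class name and 
grace-period parameter are made up): when an invalidation command arrives, the 
replica deletion is scheduled after the grace period instead of being queued 
immediately.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: delay acting on a DNA_INVALIDATE-style command by a
// configurable grace period before actually deleting the replica.
public class DelayedInvalidator {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final long gracePeriodMs; // hypothetical DataNode-side config value

  public DelayedInvalidator(long gracePeriodMs) {
    this.gracePeriodMs = gracePeriodMs;
  }

  // deleteReplica stands in for whatever the DataNode normally does to
  // remove the block files; here it's just a Runnable for illustration.
  public void scheduleDeletion(Runnable deleteReplica) {
    scheduler.schedule(deleteReplica, gracePeriodMs, TimeUnit.MILLISECONDS);
  }

  public void shutdown() {
    scheduler.shutdown();
  }
}
{code}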

For now I'm going to do some testing of the NameNode side to see how it works 
in practice, but will also look into what a DataNode side implementation would 
look like.

> Configurable grace period around deletion of invalidated blocks
> ---------------------------------------------------------------
>
>                 Key: HDFS-16261
>                 URL: https://issues.apache.org/jira/browse/HDFS-16261
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>
> When a block is moved with REPLACE_BLOCK, the new location is recorded in the 
> NameNode and the NameNode instructs the old host to invalidate the block 
> using DNA_INVALIDATE. As it stands today, this invalidation is async but 
> tends to happen relatively quickly.
> I'm working on a feature for HBase which enables efficient healing of 
> locality through Balancer-style low-level block moves (HBASE-26250). One 
> issue is that HBase tends to keep long-running DFSInputStreams open, and 
> moving blocks out from under them causes lots of warnings in the 
> RegionServer and increases long-tail latencies due to the necessary retries 
> in the DFSClient.
> One way I'd like to fix this is to provide a configurable grace period on 
> async invalidations. This would give the DFSClient enough time to refresh 
> block locations before hitting any errors.


