[ https://issues.apache.org/jira/browse/HDFS-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426391#comment-17426391 ]
Bryan Beaudreault commented on HDFS-16261:
------------------------------------------

Despite how well tuning "dfs.namenode.redundancy.interval.seconds" has worked, I don't think that's a good long-term option, because the RedundancyMonitor also handles some processing of reconstruction and misplaced blocks, and I don't want to mess with those processes.

For now I've decided to go the route of encoding an insertion timestamp in the NameNode's InvalidateBlocks nodeToBlocks map. This felt like the easiest approach, since it's just a minor change to the existing system for handing out block invalidation commands (a rough sketch of the idea follows the quoted issue description below). I'll be testing this out in a test cluster shortly.

In the meantime, I've spent some time looking into how the NameNode handles crash recovery when blocks might have been awaiting deletion. The only handling is that on NameNode startup it will find all over-replicated blocks and try to reduce the replication to what is expected. This means invalidating blocks again, but not necessarily the ones we had originally chosen. That definitely seems like a downside of this approach: it means we may mess up locality again after the NameNode restarts, since it may well decide to keep the replica we originally invalidated. We'd need to re-run the process of moving blocks back to how we want them, which could be handled automatically but may temporarily degrade latencies a bit.

Another negative aspect of this approach, which I realized during the investigation, is that if a client calls DFSClient.getLocatedBlocks while a block is pending deletion, the result will include the to-be-deleted replica until it's been fully purged.

I think implementing this in the DataNode instead would avoid both of those downsides. On the flip side, if a DataNode restarted while a block was pending deletion, the block would no longer be available when it started back up. That seems like a totally reasonable failure mode.

For now I'm going to do some testing of the NameNode side to see how it works in practice, but I'll also look into what a DataNode-side implementation would look like.

> Configurable grace period around deletion of invalidated blocks
> ---------------------------------------------------------------
>
>                 Key: HDFS-16261
>                 URL: https://issues.apache.org/jira/browse/HDFS-16261
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>
> When a block is moved with REPLACE_BLOCK, the new location is recorded in
> the NameNode and the NameNode instructs the old host to invalidate the
> block using DNA_INVALIDATE. As it stands today, this invalidation is async
> but tends to happen relatively quickly.
>
> I'm working on a feature for HBase which enables efficient healing of
> locality through Balancer-style low-level block moves (HBASE-26250). One
> issue is that HBase tends to keep long-running DFSInputStreams open, and
> moving blocks out from under them causes lots of WARN logs in the
> RegionServer and increases long-tail latencies due to the necessary retries
> in the DFSClient.
>
> One way I'd like to fix this is to provide a configurable grace period on
> async invalidations. This would give the DFSClient enough time to refresh
> block locations before hitting any errors.
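To make the timestamp idea concrete, here's a minimal, self-contained sketch. To be clear, this is not the actual InvalidateBlocks class or its API; the class, field, and method names here are all hypothetical. It just shows the core mechanic: each queued invalidation carries its insertion time, and the NameNode would only hand out DNA_INVALIDATE for entries whose grace period has elapsed.

{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Illustrative sketch only -- not the real InvalidateBlocks class.
 * Each queued invalidation carries an insertion timestamp, so the
 * caller can hand out deletion work only for blocks whose grace
 * period has elapsed.
 */
public class GracePeriodInvalidateQueue {

  /** A block id paired with the wall-clock time it was queued. */
  private static final class PendingInvalidation {
    final long blockId;
    final long insertTimeMs;

    PendingInvalidation(long blockId, long insertTimeMs) {
      this.blockId = blockId;
      this.insertTimeMs = insertTimeMs;
    }
  }

  /** Hypothetical knob; the real config name would be decided in the patch. */
  private final long gracePeriodMs;

  // Entries are appended in insertion order, so timestamps are
  // monotonically non-decreasing from head to tail.
  private final Deque<PendingInvalidation> pending = new ArrayDeque<>();

  public GracePeriodInvalidateQueue(long gracePeriodMs) {
    this.gracePeriodMs = gracePeriodMs;
  }

  public synchronized void add(long blockId) {
    pending.addLast(new PendingInvalidation(blockId, System.currentTimeMillis()));
  }

  /**
   * Returns the block ids whose grace period has elapsed as of nowMs.
   * Because the queue is time-ordered, we can stop at the first entry
   * still inside its grace period.
   */
  public synchronized List<Long> pollExpired(long nowMs) {
    List<Long> ready = new ArrayList<>();
    while (!pending.isEmpty()
        && nowMs - pending.peekFirst().insertTimeMs >= gracePeriodMs) {
      ready.add(pending.pollFirst().blockId);
    }
    return ready;
  }
}
{code}

In the real patch this would presumably live inside the per-DataNode entries of InvalidateBlocks' nodeToBlocks map rather than a standalone class, with the monitor simply skipping the not-yet-expired tail when it builds each DataNode's invalidation work list.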