[ 
https://issues.apache.org/jira/browse/HDFS-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457637#comment-17457637
 ] 

Bryan Beaudreault commented on HDFS-16261:
------------------------------------------

[~hexiaoqiao] thank you very much for the feedback! I am happy to try a 
different approach if it makes sense.

I saw two problems with the DataNode side:
 # It's much more operationally complicated to change configurations on 
DataNodes, since there may be hundreds or thousands of them. Restarting 
DataNodes is painful for low-latency clients (like HBase).
 # It is hard to integrate a deferral process into the DataNode-side code.

I like the idea of hooking BlockSender, but unfortunately that does not work 
for my use case. I am not only trying to handle in-progress streams; I'm also 
trying to avoid ReplicaNotFoundExceptions for new requests, which cause long 
tail latency spikes for us. This is meant to pair with HDFS-16262, which will 
allow a DFSInputStream to refresh its block locations before the grace period 
expires and thus avoid hitting any ReplicaNotFoundExceptions. Avoiding 
ReplicaNotFoundExceptions is an important goal of this issue.
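To make the timing argument concrete, here is a minimal sketch. It is not the HDFS-16262 implementation. It assumes a hypothetical client that refreshes its cached block locations on a fixed interval and a NameNode that records the new location at the moment of the move. It also treats the old replica as deleted exactly when the grace period ends; because invalidation is async in practice, real deletion can only come later. All function names are hypothetical.

```python
def next_refresh_at_or_after(move_time: int, last_refresh: int, refresh_interval: int) -> int:
    """Earliest scheduled location refresh at or after move_time (hypothetical fixed schedule)."""
    if last_refresh >= move_time:
        return last_refresh
    # Integer ceiling division: number of refresh intervals needed to reach move_time.
    steps = -(-(move_time - last_refresh) // refresh_interval)
    return last_refresh + steps * refresh_interval


def can_hit_missing_replica(move_time: int, grace_period: int,
                            last_refresh: int, refresh_interval: int) -> bool:
    """True if the client may still read from the old location after it is deleted.

    Assumes the old replica disappears exactly at move_time + grace_period and that a
    refresh at any time t >= move_time already returns the new location.
    """
    deletion_time = move_time + grace_period
    return next_refresh_at_or_after(move_time, last_refresh, refresh_interval) > deletion_time


if __name__ == "__main__":
    # Block moved at t=100s; client last refreshed at t=90s and refreshes every 30s.
    print(can_hit_missing_replica(100, grace_period=60, last_refresh=90, refresh_interval=30))  # False
    print(can_hit_missing_replica(100, grace_period=10, last_refresh=90, refresh_interval=30))  # True
```

Under these assumptions the worst case is a refresh just before the move. A refresh interval no longer than the grace period is therefore enough to keep this particular race from producing ReplicaNotFoundExceptions.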

I could get around problem 1 above by having the NameNode send a grace period 
along with DNA_INVALIDATE. That way the configuration still lives on the 
NameNode, but the DataNode is responsible for enforcing it. 
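Here is a rough sketch of that DataNode-side variant. It assumes each DNA_INVALIDATE command carries a per-command grace period chosen by the NameNode. No such field exists today, and the class and method names are hypothetical. Replicas stay on disk and readable until their deadline. A grace period of 0 reproduces the current, immediate behavior.

```python
import heapq
from typing import Iterable, List, Set, Tuple


class DeferredInvalidator:
    """Hypothetical DataNode-side queue of invalidations that honor a grace period."""

    def __init__(self, replicas: Iterable[str]):
        self._replicas: Set[str] = set(replicas)    # block IDs stored on this DataNode
        self._queue: List[Tuple[int, str]] = []     # min-heap of (deadline, block_id)

    def on_invalidate(self, block_id: str, grace_period: int, now: int) -> None:
        """Record an invalidate command; deletion is deferred until now + grace_period."""
        heapq.heappush(self._queue, (now + grace_period, block_id))

    def tick(self, now: int) -> List[str]:
        """Delete every replica whose deadline has passed; return the deleted block IDs."""
        deleted = []
        while self._queue and self._queue[0][0] <= now:
            _, block_id = heapq.heappop(self._queue)
            if block_id in self._replicas:          # ignore duplicate commands
                self._replicas.remove(block_id)
                deleted.append(block_id)
        return deleted

    def has_replica(self, block_id: str) -> bool:
        """Reads for a pending block keep succeeding until tick() removes it."""
        return block_id in self._replicas
```

The sketch uses a heap because grace periods may differ from one command to the next, so deadlines do not necessarily arrive in order.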

Before I investigate that approach, can you help me better understand your 
concern with the NameNode side? I'm not sure what additional costs it would 
add: the number of PendingDeletion blocks should be very small compared to the 
total block capacity served by the NameNode. Note that this grace period applies 
only to _replaced_ blocks, not deleted blocks.
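For comparison, here is a minimal sketch of the NameNode-side bookkeeping. It assumes a single configured grace period and defers only replaced blocks. The class and method names are hypothetical and are not BlockManager's API. The grace period is constant and time only moves forward, so deadlines are appended in order. A FIFO queue is therefore enough, and the extra state is one entry per replaced block still inside its grace period.

```python
from collections import deque
from typing import Deque, List, Tuple


class PendingReplacedInvalidations:
    """Hypothetical NameNode-side holding area for invalidations of replaced blocks."""

    def __init__(self, grace_period: int):
        self.grace_period = grace_period
        # (deadline, block_id, datanode); deadlines are non-decreasing because the
        # grace period is constant and `now` is assumed to be monotonic.
        self._pending: Deque[Tuple[int, str, str]] = deque()

    def invalidate(self, block_id: str, datanode: str, now: int,
                   replaced: bool) -> List[Tuple[str, str]]:
        """Return invalidations to send right away; replaced blocks are held back."""
        if not replaced:
            return [(block_id, datanode)]       # deleted blocks: no grace period
        self._pending.append((now + self.grace_period, block_id, datanode))
        return []

    def due(self, now: int) -> List[Tuple[str, str]]:
        """Pop every held invalidation whose grace period has expired."""
        out = []
        while self._pending and self._pending[0][0] <= now:
            _, block_id, datanode = self._pending.popleft()
            out.append((block_id, datanode))
        return out

    def __len__(self) -> int:
        return len(self._pending)
```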

Thank you again; I look forward to your input.

> Configurable grace period around invalidation of replaced blocks
> ----------------------------------------------------------------
>
>                 Key: HDFS-16261
>                 URL: https://issues.apache.org/jira/browse/HDFS-16261
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When a block is moved with REPLACE_BLOCK, the new location is recorded in the 
> NameNode and the NameNode instructs the old host to invalidate the block 
> using DNA_INVALIDATE. As it stands today, this invalidation is async but 
> tends to happen relatively quickly.
> I'm working on a feature for HBase which enables efficient healing of 
> locality through Balancer-style low level block moves (HBASE-26250). One 
> issue is that HBase tends to keep long-running DFSInputStreams open, and 
> moving blocks out from under them causes many warnings in the RegionServer and 
> increases long-tail latencies due to the necessary retries in the DFSClient.
> One way I'd like to fix this is to provide a configurable grace period on 
> async invalidations. This would give the DFSClient enough time to refresh 
> block locations before hitting any errors.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
