[ 
https://issues.apache.org/jira/browse/HDFS-17553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zinan Zhuang updated HDFS-17553:
--------------------------------
    Description: 
HDFS-15865 introduced an interrupt in the DataStreamer class to abort the 
waitForAckedSeqno call once the timeout is exceeded, which throws an 
InterruptedIOException. This method is used by 
[DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], 
one of whose callers is 
[DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870], 
which closes a file.
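
To illustrate the failure mode described above, here is a minimal, 
self-contained sketch (not the actual Hadoop code; the class and method bodies 
are simplified stand-ins) of how an interrupt taken while waiting for pipeline 
acks surfaces to the caller as an InterruptedIOException:

{code:java}
import java.io.IOException;
import java.io.InterruptedIOException;

// Simplified illustration only: stands in for DataStreamer#waitForAckedSeqno
// and the DFSOutputStream#flushInternal / #closeImpl call path.
public class AckWaitSketch {

  // Models the blocking wait for pipeline acks; when the waiting thread is
  // interrupted (e.g. after the ack timeout), the InterruptedException is
  // converted into an InterruptedIOException for the caller.
  static void waitForAckedSeqno(long seqno) throws IOException {
    try {
      synchronized (AckWaitSketch.class) {
        AckWaitSketch.class.wait(100); // placeholder for the real ack wait loop
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      throw new InterruptedIOException(
          "Interrupted while waiting for data to be acknowledged");
    }
  }

  // Stands in for closeImpl -> flushInternal: the exception propagates
  // straight to whoever called close() on the stream.
  public static void main(String[] args) throws IOException {
    Thread.currentThread().interrupt(); // simulate the timeout-driven interrupt
    waitForAckedSeqno(42);              // throws InterruptedIOException
  }
}
{code}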

What we saw was that InterruptedIOExceptions thrown from the flushInternal call 
while closing a file were not handled by DFSClient and were propagated to the 
caller. There is a known issue, HDFS-4504, where a file that fails to close on 
the HDFS side does not trigger block recovery, so the lease is leaked until the 
DFSClient is recycled. In our HBase setups, DFSClients are long-lived in 
regionservers, which means these files stay open until the corresponding 
regionservers are restarted.

This issue was observed during datanode decommission, which got stuck on open 
files caused by the leakage above. Since an HDFS file should be closed as 
smoothly as possible, retrying flushInternal during closeImpl would help reduce 
such leaks. The number of retries can be driven by a DFSClient config, [for 
example|https://github.com/apache/hadoop/blob/63633812a417a3d548be3bdcecebd2ae893d03e0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1660].
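
Purely as a sketch of the proposed change (the config value and helper shapes 
below are hypothetical, not existing Hadoop code), the retry loop around 
flushInternal inside closeImpl could look roughly like this:

{code:java}
import java.io.IOException;
import java.io.InterruptedIOException;

// Hypothetical sketch: retry flushInternal() a configurable number of times
// inside closeImpl() before rethrowing, so a transient interrupted ack wait
// does not leave the file (and its lease) stuck open.
public class CloseRetrySketch {

  // Stand-in for a DFSClient-side setting; the value and its name are an
  // assumption for illustration only.
  static final int CLOSE_FLUSH_RETRIES = 3;

  interface Flusher {
    void flushInternal() throws IOException;
  }

  // Retry only on InterruptedIOException, the failure mode seen when the ack
  // wait is interrupted; give up and rethrow once the retries are exhausted.
  static void closeWithRetries(Flusher out) throws IOException {
    for (int attempt = 0; ; attempt++) {
      try {
        out.flushInternal();
        return;
      } catch (InterruptedIOException e) {
        if (attempt >= CLOSE_FLUSH_RETRIES) {
          throw e;
        }
        Thread.interrupted(); // clear the interrupt flag before retrying
      }
    }
  }

  // Tiny usage example: the first two flush attempts fail, the third succeeds.
  public static void main(String[] args) throws IOException {
    int[] calls = {0};
    closeWithRetries(() -> {
      if (++calls[0] < 3) {
        throw new InterruptedIOException("ack wait interrupted");
      }
      System.out.println("flushInternal succeeded on attempt " + calls[0]);
    });
  }
}
{code}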
 

  was:
[HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an 
interrupt in the DataStreamer class to abort the 
waitForAckedSeqno call once the timeout is exceeded, which throws an 
InterruptedIOException. This method is used by 
[DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], 
one of whose callers is 
[DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870], 
which closes a file.

What we saw was that InterruptedIOExceptions thrown from the flushInternal call 
while closing a file were not handled by DFSClient and were propagated to the 
caller. There is a known issue, HDFS-4504, where a file that fails to close on 
the HDFS side does not trigger block recovery, so the lease is leaked until the 
DFSClient is recycled. In our HBase setups, DFSClients are long-lived in 
regionservers, which means these files stay open until the corresponding 
regionservers are restarted.

This issue was observed during datanode decommission, which got stuck on open 
files caused by the leakage above. Since an HDFS file should be closed as 
smoothly as possible, a retry of flushInternal during closeImpl would help 
reduce such leaks.
 

        Summary: DFSOutputStream.java#closeImpl should support configurable 
retries upon flushInternal failures  (was: DFSOutputStream.java#closeImpl 
should have a retry upon flushInternal failures)

> DFSOutputStream.java#closeImpl should support configurable retries upon 
> flushInternal failures
> --------------------------------------------------------------------------------------
>
>                 Key: HDFS-17553
>                 URL: https://issues.apache.org/jira/browse/HDFS-17553
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: dfsclient
>    Affects Versions: 3.3.1, 3.4.0
>            Reporter: Zinan Zhuang
>            Priority: Major
>
> HDFS-15865 introduced an interrupt in the DataStreamer class to abort the 
> waitForAckedSeqno call once the timeout is exceeded, which throws an 
> InterruptedIOException. This method is used by 
> [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], 
> one of whose callers is 
> [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870], 
> which closes a file.
> What we saw was that InterruptedIOExceptions thrown from the flushInternal 
> call while closing a file were not handled by DFSClient and were propagated 
> to the caller. There is a known issue, HDFS-4504, where a file that fails to 
> close on the HDFS side does not trigger block recovery, so the lease is 
> leaked until the DFSClient is recycled. In our HBase setups, DFSClients are 
> long-lived in regionservers, which means these files stay open until the 
> corresponding regionservers are restarted.
> This issue was observed during datanode decommission, which got stuck on open 
> files caused by the leakage above. Since an HDFS file should be closed as 
> smoothly as possible, retrying flushInternal during closeImpl would help 
> reduce such leaks. The number of retries can be driven by a DFSClient config, 
> [for 
> example|https://github.com/apache/hadoop/blob/63633812a417a3d548be3bdcecebd2ae893d03e0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1660].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
