[jira] [Resolved] (HDFS-17554) OIV: Print the storage policy name in OIV delimited output
[ https://issues.apache.org/jira/browse/HDFS-17554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hualong Zhang resolved HDFS-17554.
----------------------------------
    Resolution: Not A Problem

> OIV: Print the storage policy name in OIV delimited output
> -----------------------------------------------------------
>
>                 Key: HDFS-17554
>                 URL: https://issues.apache.org/jira/browse/HDFS-17554
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: tools
>    Affects Versions: 3.5.0
>            Reporter: Hualong Zhang
>            Assignee: Hualong Zhang
>            Priority: Major
>
> Refers to adding the storage policy name to the OIV output instead of the erasure coding policy.
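As a sketch of what the proposal describes (the exact column layout is an assumption here, not taken from a patch), the OfflineImageViewer's Delimited processor is invoked as usual, and the storage policy name (e.g. HOT, COLD, ALL_SSD) would appear where the erasure coding policy column is today:

    # dump an fsimage to delimited text with the OfflineImageViewer
    hdfs oiv -p Delimited -delimiter "," -i fsimage_0000000000000000042 -o fsimage.csv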
[jira] [Resolved] (HDFS-17528) FsImageValidation: set txid when saving a new image
[ https://issues.apache.org/jira/browse/HDFS-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz-wo Sze resolved HDFS-17528.
-------------------------------
    Fix Version/s: 3.5.0
     Hadoop Flags: Reviewed
       Resolution: Fixed

The pull request is now merged.

> FsImageValidation: set txid when saving a new image
> ---------------------------------------------------
>
>                 Key: HDFS-17528
>                 URL: https://issues.apache.org/jira/browse/HDFS-17528
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>            Reporter: Tsz-wo Sze
>            Assignee: Tsz-wo Sze
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
> - When the fsimage is specified as a file and the FsImageValidation tool saves a new image (for removing inaccessible inodes), the txid is not set. The resulting image will then have 0 as its txid.
> - When the fsimage is specified as a directory, the txid is set. However, the tool will get an NPE since the NameNode metrics are uninitialized (although the metrics are not used by FsImageValidation).
[jira] [Commented] (HDFS-17546) Implementing Timeout for HostFileReader when FS hangs
[ https://issues.apache.org/jira/browse/HDFS-17546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856023#comment-17856023 ]

ASF GitHub Bot commented on HDFS-17546:
---------------------------------------

ctrezzo commented on PR #6891:
URL: https://github.com/apache/hadoop/pull/6891#issuecomment-2176655680

   @NyteKnight can you take a look at the new SpotBugs warning? It looks like it is complaining about an unchecked conversion.

> Implementing Timeout for HostFileReader when FS hangs
> -----------------------------------------------------
>
>                 Key: HDFS-17546
>                 URL: https://issues.apache.org/jira/browse/HDFS-17546
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Simbarashe Dzinamarira
>            Assignee: Heagan
>            Priority: Minor
>              Labels: pull-request-available
>
> Certain Hadoop deployments keep the dfs.hosts file on NAS/NFS, potentially behind symlinks. If the FS hangs for any reason, the refreshNodes call would hang indefinitely in the HostsFileReader until the FS returns.
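A minimal, self-contained sketch of the timeout technique (illustrative only; the class and method names below are not from the actual patch): perform the read on a helper thread and bound the wait, so a hung NFS mount cannot block the caller (e.g. refreshNodes) indefinitely.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class TimedHostsFileRead {
      // Single daemon thread so a permanently hung read cannot pile up threads.
      private static final ExecutorService READER =
          Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hosts-file-reader");
            t.setDaemon(true);
            return t;
          });

      // Read the hosts file, giving up after timeoutMs instead of hanging.
      public static List<String> readHostsFile(String path, long timeoutMs)
          throws IOException {
        Future<List<String>> future =
            READER.submit(() -> Files.readAllLines(Paths.get(path)));
        try {
          return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
          future.cancel(true); // best effort; an NFS-hung thread may ignore the interrupt
          throw new IOException("Timed out reading hosts file " + path, e);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted while reading hosts file " + path, e);
        } catch (ExecutionException e) {
          throw new IOException("Failed to read hosts file " + path, e.getCause());
        }
      }
    }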
[jira] [Commented] (HDFS-17528) FsImageValidation: set txid when saving a new image
[ https://issues.apache.org/jira/browse/HDFS-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855989#comment-17855989 ]

Tsz-wo Sze commented on HDFS-17528:
-----------------------------------

[~smajeti], there were still some build issues. I have just pushed an empty commit to trigger a new build.

> FsImageValidation: set txid when saving a new image
> ---------------------------------------------------
>
>                 Key: HDFS-17528
>                 URL: https://issues.apache.org/jira/browse/HDFS-17528
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>            Reporter: Tsz-wo Sze
>            Assignee: Tsz-wo Sze
>            Priority: Major
>              Labels: pull-request-available
>
> - When the fsimage is specified as a file and the FsImageValidation tool saves a new image (for removing inaccessible inodes), the txid is not set. The resulting image will then have 0 as its txid.
> - When the fsimage is specified as a directory, the txid is set. However, the tool will get an NPE since the NameNode metrics are uninitialized (although the metrics are not used by FsImageValidation).
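For reference, the empty-commit trick used here is plain git and works with any CI that triggers on new commits:

    # create a commit with no changes, then push to re-trigger the build
    git commit --allow-empty -m "Empty commit to trigger a new build"
    git push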
[jira] [Created] (HDFS-17554) OIV: Print the storage policy name in OIV delimited output
Hualong Zhang created HDFS-17554:
------------------------------------

             Summary: OIV: Print the storage policy name in OIV delimited output
                 Key: HDFS-17554
                 URL: https://issues.apache.org/jira/browse/HDFS-17554
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: tools
    Affects Versions: 3.5.0
            Reporter: Hualong Zhang
            Assignee: Hualong Zhang

Refers to adding the storage policy name to the OIV output instead of the erasure coding policy.
[jira] [Updated] (HDFS-17553) DFSOutputStream.java#closeImpl should have configurable retries upon flushInternal failures
[ https://issues.apache.org/jira/browse/HDFS-17553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zinan Zhuang updated HDFS-17553:
--------------------------------
    Summary: DFSOutputStream.java#closeImpl should have configurable retries upon flushInternal failures  (was: DFSOutputStream.java#closeImpl should configurable retries upon flushInternal failures)

> DFSOutputStream.java#closeImpl should have configurable retries upon flushInternal failures
> --------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17553
>                 URL: https://issues.apache.org/jira/browse/HDFS-17553
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: dfsclient
>    Affects Versions: 3.3.1, 3.4.0
>            Reporter: Zinan Zhuang
>            Priority: Major
>
> HDFS-15865 introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded, which throws an InterruptedIOException. This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.
> What we saw was that we were getting InterruptedIOExceptions during the flushInternal call when we were closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue HDFS-4504 that when a file failed to close on the HDFS side, block recovery was not called and the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in regionservers, which means these files remain undead until the corresponding regionservers get restarted.
> This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly as possible, retries of flushInternal during closeImpl operations would be beneficial to reduce such leakages. The number of retries can be based on the DFSClient config. [For example|https://github.com/apache/hadoop/blob/63633812a417a3d548be3bdcecebd2ae893d03e0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1660]
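A minimal sketch of the retry shape being proposed (the Flusher interface and the retry count below are stand-ins for illustration; the real change would live in DFSOutputStream#closeImpl with the count taken from a DFSClient config key):

    import java.io.IOException;
    import java.io.InterruptedIOException;

    public final class CloseRetrySketch {

      // Stand-in for the flushInternal() call made inside closeImpl().
      interface Flusher {
        void flushInternal() throws IOException;
      }

      // Retry flushInternal() up to maxRetries extra times before surfacing
      // the InterruptedIOException, so a transient ack timeout does not leave
      // the file open with its lease leaked.
      static void closeWithRetries(Flusher stream, int maxRetries)
          throws IOException {
        for (int attempt = 0; ; attempt++) {
          try {
            stream.flushInternal();
            return; // all packets acked; safe to complete the file
          } catch (InterruptedIOException e) {
            if (attempt >= maxRetries) {
              throw e; // out of retries; caller sees the original failure
            }
            // fall through and retry the flush
          }
        }
      }
    }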
[jira] [Updated] (HDFS-17553) DFSOutputStream.java#closeImpl should configurable retries upon flushInternal failures
[ https://issues.apache.org/jira/browse/HDFS-17553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zinan Zhuang updated HDFS-17553:
--------------------------------
    Description: 
HDFS-15865 introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded, which throws an InterruptedIOException. This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.

What we saw was that we were getting InterruptedIOExceptions during the flushInternal call when we were closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue HDFS-4504 that when a file failed to close on the HDFS side, block recovery was not called and the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in regionservers, which means these files remain undead until the corresponding regionservers get restarted.

This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly as possible, retries of flushInternal during closeImpl operations would be beneficial to reduce such leakages. The number of retries can be based on the DFSClient config. [For example|https://github.com/apache/hadoop/blob/63633812a417a3d548be3bdcecebd2ae893d03e0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1660]

  was:
[HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded, which throws an InterruptedIOException. This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.

What we saw was that we were getting InterruptedIOExceptions during the flushInternal call when we were closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue HDFS-4504 that when a file failed to close on the HDFS side, block recovery was not called and the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in regionservers, which means these files remain undead until the corresponding regionservers get restarted.

This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly as possible, a retry of flushInternal during closeImpl operations would be beneficial to reduce such leakages.

    Summary: DFSOutputStream.java#closeImpl should configurable retries upon flushInternal failures  (was: DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures)

> DFSOutputStream.java#closeImpl should configurable retries upon flushInternal failures
> ---------------------------------------------------------------------------------------
>
>                 Key: HDFS-17553
>                 URL: https://issues.apache.org/jira/browse/HDFS-17553
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: dfsclient
>    Affects Versions: 3.3.1, 3.4.0
>            Reporter: Zinan Zhuang
>            Priority: Major
>
> HDFS-15865 introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded, which throws an InterruptedIOException. This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.
> What we saw was that we were getting InterruptedIOExceptions during the flushInternal call when we were closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue HDFS-4504 that when a file failed to close on the HDFS side, block recovery was not called and the lea
[jira] [Updated] (HDFS-17553) DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures
[ https://issues.apache.org/jira/browse/HDFS-17553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zinan Zhuang updated HDFS-17553:
--------------------------------
    Description: 
[HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded, which throws an InterruptedIOException. This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.

What we saw was that we were getting InterruptedIOExceptions during the flushInternal call when we were closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue HDFS-4504 that when a file failed to close on the HDFS side, block recovery was not called and the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in regionservers, which means these files remain undead until the corresponding regionservers get restarted.

This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly as possible, a retry of flushInternal during closeImpl operations would be beneficial to reduce such leakages.

  was:
HDFS-15865 introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded, which throws an InterruptedIOException. This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.

What we saw was that we were getting InterruptedIOExceptions during the flushInternal call when we were closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue HDFS-4504 that when a file failed to close on the HDFS side, block recovery was not called and the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in regionservers, which means these files remain undead until the corresponding regionservers get restarted.

This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly as possible, a retry of flushInternal during closeImpl operations would be beneficial to reduce such leakages.

> DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures
> -------------------------------------------------------------------------------
>
>                 Key: HDFS-17553
>                 URL: https://issues.apache.org/jira/browse/HDFS-17553
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: dfsclient
>    Affects Versions: 3.3.1, 3.4.0
>            Reporter: Zinan Zhuang
>            Priority: Major
>
> [HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded, which throws an InterruptedIOException.
> This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.
> What we saw was that we were getting InterruptedIOExceptions during the flushInternal call when we were closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue HDFS-4504 that when a file failed to close on the HDFS side, block recovery was not called and the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in regionservers, which means these files remain undead until the corresponding regionservers get restarted.
> This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly as possible, a retry of flushI
[jira] [Updated] (HDFS-17553) DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures
[ https://issues.apache.org/jira/browse/HDFS-17553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zinan Zhuang updated HDFS-17553:
--------------------------------
    Description: 
HDFS-15865 introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded, which throws an InterruptedIOException. This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.

What we saw was that we were getting InterruptedIOExceptions during the flushInternal call when we were closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue HDFS-4504 that when a file failed to close on the HDFS side, block recovery was not called and the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in regionservers, which means these files remain undead until the corresponding regionservers get restarted.

This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly as possible, a retry of flushInternal during closeImpl operations would be beneficial to reduce such leakages.

  was:
[HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded, which throws an InterruptedIOException. This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.

What we saw was that we were getting InterruptedIOExceptions during the flushInternal call when we were closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue [HDFS-4504|https://issues.apache.org/jira/browse/HDFS-4504] that when a file failed to close on the HDFS side, the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in regionservers, which means these files remain undead until the corresponding regionservers get restarted.

This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly as possible, a retry of flushInternal during closeImpl operations would be beneficial to reduce such leakages.

> DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures
> -------------------------------------------------------------------------------
>
>                 Key: HDFS-17553
>                 URL: https://issues.apache.org/jira/browse/HDFS-17553
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: dfsclient
>    Affects Versions: 3.3.1, 3.4.0
>            Reporter: Zinan Zhuang
>            Priority: Major
>
> HDFS-15865 introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded, which throws an InterruptedIOException.
> This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.
> What we saw was that we were getting InterruptedIOExceptions during the flushInternal call when we are closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue HDFS-4504 that when a file failed to close on the HDFS side, block recovery was not called and the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in regionservers, which means these files remain undead until the corresponding regionservers get restarted.
> This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly as possible, a retry of flushInternal during closeImpl operation
[jira] [Updated] (HDFS-17553) DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures
[ https://issues.apache.org/jira/browse/HDFS-17553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zinan Zhuang updated HDFS-17553:
--------------------------------
    Description: 
[HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded, which throws an InterruptedIOException. This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.

What we saw was that we were getting InterruptedIOExceptions during the flushInternal call when we were closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue [HDFS-4504|https://issues.apache.org/jira/browse/HDFS-4504] that when a file failed to close on the HDFS side, the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in regionservers, which means these files remain undead until the corresponding regionservers get restarted.

This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly as possible, a retry of flushInternal during closeImpl operations would be beneficial to reduce such leakages.

  was:
[HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded, which throws an InterruptedIOException. This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.

What we saw was that we were getting InterruptedIOExceptions during the flushInternal call when we are closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue [HDFS-4504|https://issues.apache.org/jira/browse/HDFS-4504] that when a file failed to close on the HDFS side, the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in each regionserver, which means these files remain undead until the regionserver gets restarted.

This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly as possible, a retry of flushInternal during closeImpl operations would be beneficial to reduce such leakages.

> DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures
> -------------------------------------------------------------------------------
>
>                 Key: HDFS-17553
>                 URL: https://issues.apache.org/jira/browse/HDFS-17553
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: dfsclient
>    Affects Versions: 3.3.1, 3.4.0
>            Reporter: Zinan Zhuang
>            Priority: Major
>
> [HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded, which throws an InterruptedIOException. This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.
> What we saw was that we were getting InterruptedIOExceptions during the flushInternal call when we were closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue [HDFS-4504|https://issues.apache.org/jira/browse/HDFS-4504] that when a file failed to close on the HDFS side, the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in regionservers, which means these files remain undead until the corresponding regionservers get restarted.
> This issue was observed during datanode decommission because it was stuck on open files
[jira] [Updated] (HDFS-17553) DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures
[ https://issues.apache.org/jira/browse/HDFS-17553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zinan Zhuang updated HDFS-17553:
--------------------------------
    Description: 
[HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded, which throws an InterruptedIOException. This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.

What we saw was that we were getting InterruptedIOExceptions during the flushInternal call when we are closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue [HDFS-4504|https://issues.apache.org/jira/browse/HDFS-4504] that when a file failed to close on the HDFS side, the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in each regionserver, which means these files remain undead until the regionserver gets restarted.

This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly as possible, a retry of flushInternal during closeImpl operations would be beneficial to reduce such leakages.

  was:
[HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded. This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.

What we saw was that we were getting more interrupts during the flushInternal call when we are closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue [HDFS-4504|https://issues.apache.org/jira/browse/HDFS-4504] that when a file failed to close on the HDFS side, the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in each regionserver, which means these files remain undead until the regionserver gets restarted.

This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly as possible, a retry of flushInternal during closeImpl operations would be beneficial to reduce such leakages.

> DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures
> -------------------------------------------------------------------------------
>
>                 Key: HDFS-17553
>                 URL: https://issues.apache.org/jira/browse/HDFS-17553
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: dfsclient
>    Affects Versions: 3.3.1, 3.4.0
>            Reporter: Zinan Zhuang
>            Priority: Major
>
> [HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded, which throws an InterruptedIOException.
> This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.
> What we saw was that we were getting InterruptedIOExceptions during the flushInternal call when we are closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue [HDFS-4504|https://issues.apache.org/jira/browse/HDFS-4504] that when a file failed to close on the HDFS side, the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in each regionserver, which means these files remain undead until the regionserver gets restarted.
> This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly a
[jira] [Created] (HDFS-17553) DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures
Zinan Zhuang created HDFS-17553:
-----------------------------------

             Summary: DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures
                 Key: HDFS-17553
                 URL: https://issues.apache.org/jira/browse/HDFS-17553
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: dfsclient
    Affects Versions: 3.4.0, 3.3.1
            Reporter: Zinan Zhuang

[HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded. This method is being used in [DFSOutputStream.java#flushInternal|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773], one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file.

What we saw was that we were getting more interrupts during the flushInternal call when we are closing out a file, which was unhandled by the DFSClient and got thrown to the caller. There's a known issue [HDFS-4504|https://issues.apache.org/jira/browse/HDFS-4504] that when a file failed to close on the HDFS side, the lease got leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in each regionserver, which means these files remain undead until the regionserver gets restarted.

This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly as possible, a retry of flushInternal during closeImpl operations would be beneficial to reduce such leakages.
[jira] [Resolved] (HDFS-17439) Improve NNThroughputBenchmark to allow non super user to use the tool
[ https://issues.apache.org/jira/browse/HDFS-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephen O'Donnell resolved HDFS-17439.
--------------------------------------
    Fix Version/s: 3.5.0
       Resolution: Fixed

> Improve NNThroughputBenchmark to allow non super user to use the tool
> ----------------------------------------------------------------------
>
>                 Key: HDFS-17439
>                 URL: https://issues.apache.org/jira/browse/HDFS-17439
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: benchmarks, namenode
>            Reporter: Fateh Singh
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
> The NNThroughputBenchmark can only be run as the hdfs user or another user with super user privileges, since entering/exiting safemode is a privileged operation. However, when running as a super user, ACL checks are skipped, which renders the tool useless for testing namenode performance together with authorization frameworks such as Apache Ranger.
> An optional argument such as -nonSuperUser can be used to skip privileged operations such as entering/exiting safemode. This optional argument makes the tool useful for incorporating authorization frameworks into performance estimation flows.
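A usage sketch with the proposed flag (-nonSuperUser is the option proposed in this issue, not a long-standing documented argument; the rest is the standard invocation):

    # run the create benchmark as a non-privileged user, skipping safemode calls
    hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark \
        -op create -threads 16 -files 10000 -nonSuperUser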
[jira] [Commented] (HDFS-17546) Implementing Timeout for HostFileReader when FS hangs
[ https://issues.apache.org/jira/browse/HDFS-17546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855879#comment-17855879 ]

ASF GitHub Bot commented on HDFS-17546:
---------------------------------------

hadoop-yetus commented on PR #6891:
URL: https://github.com/apache/hadoop/pull/6891#issuecomment-2175850768

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------|:-------:|
| +0 :ok: | reexec | 0m 42s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| | | | | _ branch-3.3 Compile Tests _ |
| +0 :ok: | mvndep | 13m 34s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 39m 44s | | branch-3.3 passed |
| +1 :green_heart: | compile | 3m 56s | | branch-3.3 passed |
| +1 :green_heart: | checkstyle | 1m 4s | | branch-3.3 passed |
| +1 :green_heart: | mvnsite | 2m 24s | | branch-3.3 passed |
| +1 :green_heart: | javadoc | 2m 9s | | branch-3.3 passed |
| -1 :x: | spotbugs | 2m 40s | [/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6891/2/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html) | hadoop-hdfs-project/hadoop-hdfs-client in branch-3.3 has 2 extant spotbugs warnings. |
| +1 :green_heart: | shadedclient | 41m 42s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 32s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 2m 8s | | the patch passed |
| +1 :green_heart: | compile | 3m 49s | | the patch passed |
| -1 :x: | javac | 3m 49s | [/results-compile-javac-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6891/2/artifact/out/results-compile-javac-hadoop-hdfs-project.txt) | hadoop-hdfs-project generated 1 new + 608 unchanged - 1 fixed = 609 total (was 609) |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 57s | | the patch passed |
| +1 :green_heart: | mvnsite | 2m 8s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 56s | | the patch passed |
| +1 :green_heart: | spotbugs | 6m 0s | | the patch passed |
| +1 :green_heart: | shadedclient | 41m 54s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 :green_heart: | unit | 2m 16s | | hadoop-hdfs-client in the patch passed. |
| -1 :x: | unit | 242m 21s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6891/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 51s | | The patch does not generate ASF License warnings. |
| | | | 413m 58s | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.protocol.TestBlockListAsLongs |
| | hadoop.hdfs.server.namenode.TestNameNodeMXBean |
| | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy |
| | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6891/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6891 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
| uname | Linux 6bdbb87477ba 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | branch-3.3 / 6ecc20ecef50726ebd71b0ade7b059bdd2b7f613 |
| Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6891/2/testReport/ |
| Max. process+thread count | 1986 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project
[jira] [Commented] (HDFS-17545) [ARR] router async rpc client.
[ https://issues.apache.org/jira/browse/HDFS-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855822#comment-17855822 ]

ASF GitHub Bot commented on HDFS-17545:
---------------------------------------

KeeProMise commented on code in PR #6871:
URL: https://github.com/apache/hadoop/pull/6871#discussion_r1643993096

##
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterRpcServer.java:
##

@@ -428,6 +456,25 @@ public RouterRpcServer(Configuration conf, Router router,
     initRouterFedRename();
   }

+  private void initAsyncThreadPool() {
+    int asyncHandlerCount = conf.getInt(DFS_ROUTER_RPC_ASYNC_HANDLER_COUNT,
+        DFS_ROUTER_RPC_ASYNC_HANDLER_COUNT_DEFAULT);
+    int asyncResponderCount = conf.getInt(DFS_ROUTER_RPC_ASYNC_RESPONDER_COUNT,
+        DFS_ROUTER_RPC_ASYNC_RESPONDER_COUNT_DEFAULT);
+    synchronized (RouterRpcServer.class) {
+      if (asyncRouterHandler == null) {
+        LOG.info("init router async handler count: {}", asyncHandlerCount);
+        asyncRouterHandler = Executors.newFixedThreadPool(
+            asyncHandlerCount, new AsyncThreadFactory("router async handler "));
+      }
+      if (asyncRouterResponder == null) {
+        LOG.info("init router async responder count: {}", asyncResponderCount);
+        asyncRouterResponder = Executors.newFixedThreadPool(
+            asyncHandlerCount, new AsyncThreadFactory("router async responder "));

Review Comment:
   @hfutatzhanghb Yes, I will change this to asyncResponderCount later. Thank you for your careful review.

> [ARR] router async rpc client.
> ------------------------------
>
>                 Key: HDFS-17545
>                 URL: https://issues.apache.org/jira/browse/HDFS-17545
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Jian Zhang
>            Assignee: Jian Zhang
>            Priority: Major
>              Labels: pull-request-available
>
> *Describe*
> 1. Mainly uses AsyncUtil to implement {*}RouterAsyncRpcClient{*}; this class extends RouterRpcClient, enabling the {*}invokeAll{*}, {*}invokeMethod{*}, {*}invokeSequential{*}, {*}invokeConcurrent{*}, and *invokeSingle* methods to support asynchrony.
> 2. Uses two thread pools, *asyncRouterHandler* and {*}asyncRouterResponder{*}, to handle asynchronous requests and responses, respectively.
> 3. Adds {*}DFS_ROUTER_RPC_ENABLE_ASYNC{*}, {*}DFS_ROUTER_RPC_ASYNC_HANDLER_COUNT{*}, and *DFS_ROUTER_RPC_ASYNC_RESPONDER_COUNT* to configure whether to use the async router, as well as the number of asyncRouterHandlers and asyncRouterResponders.
> 4. Uses *ThreadLocalContext* to maintain thread-local variables, ensuring that they can be correctly passed between the handler, asyncRouterHandler, and asyncRouterResponder.
>
> *Test*
> Currently, I let the handler wait synchronously for the response of the async responder to test the function of RouterAsyncRpcClient.
> Note: For discussions on *AsyncUtil* and the client {*}protocolPB{*}, please refer to HDFS-17543 and HDFS-17544.
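Spelled out, the corrected line this exchange converges on simply sizes the responder pool with asyncResponderCount (the reviewer-proposed fix, not yet merged code):

    // use the responder count, not the handler count, for the responder pool
    asyncRouterResponder = Executors.newFixedThreadPool(
        asyncResponderCount, new AsyncThreadFactory("router async responder "));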
[jira] [Commented] (HDFS-17545) [ARR] router async rpc client.
[ https://issues.apache.org/jira/browse/HDFS-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855817#comment-17855817 ]

ASF GitHub Bot commented on HDFS-17545:
---------------------------------------

hfutatzhanghb commented on code in PR #6871:
URL: https://github.com/apache/hadoop/pull/6871#discussion_r1643984134

##
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterRpcServer.java:
##

@@ -428,6 +456,25 @@ public RouterRpcServer(Configuration conf, Router router,
     initRouterFedRename();
   }

+  private void initAsyncThreadPool() {
+    int asyncHandlerCount = conf.getInt(DFS_ROUTER_RPC_ASYNC_HANDLER_COUNT,
+        DFS_ROUTER_RPC_ASYNC_HANDLER_COUNT_DEFAULT);
+    int asyncResponderCount = conf.getInt(DFS_ROUTER_RPC_ASYNC_RESPONDER_COUNT,
+        DFS_ROUTER_RPC_ASYNC_RESPONDER_COUNT_DEFAULT);
+    synchronized (RouterRpcServer.class) {
+      if (asyncRouterHandler == null) {
+        LOG.info("init router async handler count: {}", asyncHandlerCount);
+        asyncRouterHandler = Executors.newFixedThreadPool(
+            asyncHandlerCount, new AsyncThreadFactory("router async handler "));
+      }
+      if (asyncRouterResponder == null) {
+        LOG.info("init router async responder count: {}", asyncResponderCount);
+        asyncRouterResponder = Executors.newFixedThreadPool(
+            asyncHandlerCount, new AsyncThreadFactory("router async responder "));

Review Comment:
   @KeeProMise Hi, sir. This should be asyncResponderCount, right?

> [ARR] router async rpc client.
> ------------------------------
>
>                 Key: HDFS-17545
>                 URL: https://issues.apache.org/jira/browse/HDFS-17545
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Jian Zhang
>            Assignee: Jian Zhang
>            Priority: Major
>              Labels: pull-request-available
>
> *Describe*
> 1. Mainly uses AsyncUtil to implement {*}RouterAsyncRpcClient{*}; this class extends RouterRpcClient, enabling the {*}invokeAll{*}, {*}invokeMethod{*}, {*}invokeSequential{*}, {*}invokeConcurrent{*}, and *invokeSingle* methods to support asynchrony.
> 2. Uses two thread pools, *asyncRouterHandler* and {*}asyncRouterResponder{*}, to handle asynchronous requests and responses, respectively.
> 3. Adds {*}DFS_ROUTER_RPC_ENABLE_ASYNC{*}, {*}DFS_ROUTER_RPC_ASYNC_HANDLER_COUNT{*}, and *DFS_ROUTER_RPC_ASYNC_RESPONDER_COUNT* to configure whether to use the async router, as well as the number of asyncRouterHandlers and asyncRouterResponders.
> 4. Uses *ThreadLocalContext* to maintain thread-local variables, ensuring that they can be correctly passed between the handler, asyncRouterHandler, and asyncRouterResponder.
>
> *Test*
> Currently, I let the handler wait synchronously for the response of the async responder to test the function of RouterAsyncRpcClient.
> Note: For discussions on *AsyncUtil* and the client {*}protocolPB{*}, please refer to HDFS-17543 and HDFS-17544.