[jira] [Reopened] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reopened HDFS-14498:


> LeaseManager can loop forever on the file for which create has failed 
> --
>
> Key: HDFS-14498
> URL: https://issues.apache.org/jira/browse/HDFS-14498
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.9.0
>Reporter: Sergey Shelukhin
>Assignee: Stephen O'Donnell
>Priority: Major
> Fix For: 3.2.2, 2.10.1, 3.3.1, 3.4.0, 3.1.5
>
> Attachments: HDFS-14498.001.patch, HDFS-14498.002.patch
>
>
> The logs from the file creation are long gone due to the infinite lease logging;
> however, the create presumably failed... the client that was trying to write this
> file is definitely long dead.
> The version includes HDFS-4882.
> We get this log pattern repeating infinitely:
> {noformat}
> 2019-05-16 14:00:16,893 INFO 
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder: 
> DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1] has expired hard 
> limit
> 2019-05-16 14:00:16,893 INFO 
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  
> Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1], src=
> 2019-05-16 14:00:16,893 WARN 
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] 
> org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: 
> Failed to release lease for file . Committed blocks are waiting to be 
> minimally replicated. Try again later.
> 2019-05-16 14:00:16,893 WARN 
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Cannot release the path 
>  in the lease [Lease.  Holder: DFSClient_NONMAPREDUCE_-20898906_61, 
> pending creates: 1]. It will be retried.
> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* 
> NameSystem.internalReleaseLease: Failed to release lease for file . 
> Committed blocks are waiting to be minimally replicated. Try again later.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:3357)
>   at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:573)
>   at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:509)
>   at java.lang.Thread.run(Thread.java:745)
> $  grep -c "Recovering.*DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 
> 1" hdfs_nn*
> hdfs_nn.log:1068035
> hdfs_nn.log.2019-05-16-14:1516179
> hdfs_nn.log.2019-05-16-15:1538350
> {noformat}
> Aside from an actual bug fix, it might make sense to make the LeaseManager 
> log less, in case there are more bugs like this...
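A minimal, self-contained sketch of the retry shape described above (illustrative only; the class and method names below are hypothetical, not the real HDFS code): when internalReleaseLease keeps failing with "try again later" and the lease is never removed, the monitor revisits the same lease on every cycle and emits the same log lines each time.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch only; names are hypothetical, not the actual LeaseManager code.
public class LeaseRetryLoopSketch {

  // Stand-in for FSNamesystem.internalReleaseLease: for a failed create whose
  // committed block can never reach minimal replication, release never succeeds.
  static boolean tryRelease(String path) {
    return false;
  }

  public static void main(String[] args) throws InterruptedException {
    Queue<String> expiredHardLimitLeases = new ArrayDeque<>();
    expiredHardLimitLeases.add("/tmp/file-from-failed-create");

    // Bounded to 5 passes here; in the report above the monitor loops indefinitely.
    for (int pass = 0; pass < 5; pass++) {
      String path = expiredHardLimitLeases.peek();
      if (!tryRelease(path)) {
        // Matches the repeating WARN pattern: the lease is kept and retried forever.
        System.out.println("Cannot release the path " + path + ". It will be retried.");
      } else {
        expiredHardLimitLeases.poll();
      }
      Thread.sleep(10);
    }
  }
}
{code}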



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14498:
---
Attachment: HDFS-14498-branch-2.10.001.patch
Status: Patch Available  (was: Reopened)

Thanks [~Jim_Brennan] for the report. I have reverted the commit for branch-2.10, 
submitted [^HDFS-14498-branch-2.10.001.patch], and will try to trigger Jenkins.
Thanks [~Jim_Brennan].




[jira] [Comment Edited] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-13 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156816#comment-17156816
 ] 

Xiaoqiao He edited comment on HDFS-14498 at 7/13/20, 4:11 PM:
--

Thanks [~Jim_Brennan] for the report. I have reverted the commit for branch-2.10, 
submitted [^HDFS-14498-branch-2.10.001.patch], and will try to trigger Jenkins.
Thanks [~Jim_Brennan]. Please take another review.


was (Author: hexiaoqiao):
Thanks [~Jim_Brennan] for the report. I have reverted the commit for branch-2.10, 
submitted [^HDFS-14498-branch-2.10.001.patch], and will try to trigger Jenkins.
Thanks [~Jim_Brennan].




[jira] [Updated] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14498:
---
Attachment: HDFS-14498-branch-3.2.001.patch
HDFS-14498-branch-3.1.001.patch
HDFS-14498-branch-2.10.002.patch




[jira] [Commented] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-13 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157155#comment-17157155
 ] 

Xiaoqiao He commented on HDFS-14498:


Thanks [~epayne] and [~ebadger] for the suggestions on the commit process. As far as 
I recall, for this commit I only built and tested trunk locally and missed building 
the other branches after cherry-picking. Sorry for breaking the build. I have uploaded 
new patches for branch-3.2/branch-3.1/branch-2.10. Do you mind taking another 
review? Thanks. cc [~sodonnell].




[jira] [Commented] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-14 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157305#comment-17157305
 ] 

Xiaoqiao He commented on HDFS-14498:


Hi [~sodonnell],
The patch for branch-3.1/3.2 only removes the final null parameter from the create 
RPC invocation.
The patch for branch-2.10 renames `penultimateBlockMinStorage` to 
`penultimateBlockMinReplication`, replaces a lambda with an anonymous class (see the 
sketch below), fixes the unit test around the create RPC parameter, and fixes 
checkstyle for TestLeaseRecovery#testLeaseManagerRecoversEmptyCommittedLastBlock.
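For illustration, a minimal sketch (a hypothetical example, not code from the patch) of the lambda-to-anonymous-class rewrite that a branch-2.10 backport needs, since that branch still targets Java 7:

{code:java}
public class LambdaBackportSketch {
  public static void main(String[] args) {
    // trunk / Java 8+ style as it might appear in the newer branches:
    Runnable trunkStyle = () -> System.out.println("check lease");

    // branch-2.10 / Java 7 style: the same behavior written as an anonymous class.
    Runnable branch210Style = new Runnable() {
      @Override
      public void run() {
        System.out.println("check lease");
      }
    };

    trunkStyle.run();
    branch210Style.run();
  }
}
{code}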




[jira] [Updated] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14498:
---
Attachment: HDFS-14498-branch-3.1.001.patch




[jira] [Updated] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14498:
---
Attachment: HDFS-14498-branch-2.10.002.patch




[jira] [Commented] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-14 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157486#comment-17157486
 ] 

Xiaoqiao He commented on HDFS-14498:


Submitted [^HDFS-14498-branch-3.1.001.patch] and 
[^HDFS-14498-branch-2.10.002.patch] again, without any changes, to try to trigger 
Jenkins.




[jira] [Updated] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14498:
---
Attachment: HDFS-14498-branch-3.1.001.patch




[jira] [Commented] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-14 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157870#comment-17157870
 ] 

Xiaoqiao He commented on HDFS-14498:


Re-submitted [^HDFS-14498-branch-3.1.001.patch] without any changes to try to 
trigger Jenkins. Not sure why it does not trigger automatically.




[jira] [Commented] (HDFS-15469) Dynamically configure the size of PacketReceiver#MAX_PACKET_SIZE

2020-07-14 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157880#comment-17157880
 ] 

Xiaoqiao He commented on HDFS-15469:


Thanks [~jianghuazhu] for involving me here. Sorry, I do not have any benchmarks 
for packet sizes over 16M, so I have no further suggestions.
Offering some benchmark results would make it easier to push this change forward.

> Dynamically configure the size of PacketReceiver#MAX_PACKET_SIZE
> 
>
> Key: HDFS-15469
> URL: https://issues.apache.org/jira/browse/HDFS-15469
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.0.3
>Reporter: jianghua zhu
>Assignee: jianghua zhu
>Priority: Major
> Attachments: HDFS-15469.001.patch
>
>
> Currently the value of PacketReceiver#MAX_PACKET_SIZE is fixed at 16M. This 
> value should be configurable to allow better performance tuning in different 
> environments. For example, when the network is poor or the machine and hard 
> disk quality is low, setting this value below 16M (for example, 8M) would be 
> more conducive to the stability of the cluster.
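A minimal sketch of the idea, under the assumption that the limit would simply be read from the Hadoop Configuration with the current 16M value as the default; the configuration key name below is hypothetical, not one defined by the patch:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class PacketSizeConfigSketch {
  // Hypothetical key name, for illustration only.
  static final String MAX_PACKET_SIZE_KEY = "dfs.datanode.max.packet.size";
  // Current hard-coded ceiling in PacketReceiver: 16M.
  static final int MAX_PACKET_SIZE_DEFAULT = 16 * 1024 * 1024;

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Falls back to the existing 16M limit when the key is not set,
    // so the default behavior is unchanged.
    int maxPacketSize = conf.getInt(MAX_PACKET_SIZE_KEY, MAX_PACKET_SIZE_DEFAULT);
    System.out.println("Effective max packet size: " + maxPacketSize + " bytes");
  }
}
{code}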






[jira] [Commented] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-15 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158144#comment-17158144
 ] 

Xiaoqiao He commented on HDFS-14498:


I checked the failed unit tests for branch-2.10/branch-3.1 and they pass locally, so 
I think they are not related to the changes.
For branch-2.10, a checkstyle issue is reported; it is OK with me to keep the same 
code segment across the different branches.
{code:java}
./hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:3158:
  boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,:3: 
Method length is 151 lines (max allowed is 150).
{code}
[~sodonnell] [~ebadger] [~Jim_Brennan], do you mind taking another review?




[jira] [Comment Edited] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-15 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158144#comment-17158144
 ] 

Xiaoqiao He edited comment on HDFS-14498 at 7/15/20, 1:10 PM:
--

I checked the failed unit tests for branch-2.10/branch-3.1 and they pass locally, so 
I think they are not related to the changes.
For branch-2.10, a checkstyle issue is reported; it is OK with me to keep the same 
code segment across the different branches.
{code:java}
./hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:3158:
  boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,:3: 
Method length is 151 lines (max allowed is 150).
{code}
[~sodonnell], [~ebadger], [~Jim_Brennan], do you mind taking another 
review?


was (Author: hexiaoqiao):
I checked the failed unit tests for branch-2.10/branch-3.1 and they pass locally, so 
I think they are not related to the changes.
For branch-2.10, a checkstyle issue is reported; it is OK with me to keep the same 
code segment across the different branches.
{code:java}
./hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:3158:
  boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,:3: 
Method length is 151 lines (max allowed is 150).
{code}
[~sodonnell] [~ebadger] [~Jim_Brennan] do you mind taking another review?




[jira] [Updated] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14498:
---
Attachment: HDFS-14498-branch-2.10.003.patch




[jira] [Commented] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-15 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158875#comment-17158875
 ] 

Xiaoqiao He commented on HDFS-14498:


Thanks [~Jim_Brennan], [~sodonnell].
{quote}there are multiple patches with the same names. I don't know if that 
breaks anything, but it is a little confusing.{quote}
I am not sure why Jenkins was not triggered automatically, so I submitted the 
exact same patch to try to trigger it again. Please refer to the most recent 
attachment when several share the same name.
{quote}It is better to do that, I think, than refactor the change to make the 
method shorter, as then the patch will be different from the other 
branches.{quote}
That makes sense to me. I submitted v003 for branch-2.10 and am now pending what 
Jenkins says.
Thanks again.




[jira] [Commented] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-16 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159331#comment-17159331
 ] 

Xiaoqiao He commented on HDFS-14498:


+1. Committed to branch-3.2/branch-3.1/branch-2.10.
Thanks [~sodonnell] and all reviewers.

> LeaseManager can loop forever on the file for which create has failed 
> --
>
> Key: HDFS-14498
> URL: https://issues.apache.org/jira/browse/HDFS-14498
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.9.0
>Reporter: Sergey Shelukhin
>Assignee: Stephen O'Donnell
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
> Attachments: HDFS-14498-branch-2.10.001.patch, 
> HDFS-14498-branch-2.10.002.patch, HDFS-14498-branch-2.10.002.patch, 
> HDFS-14498-branch-2.10.003.patch, HDFS-14498-branch-2.10.004.patch, 
> HDFS-14498-branch-3.1.001.patch, HDFS-14498-branch-3.1.001.patch, 
> HDFS-14498-branch-3.1.001.patch, HDFS-14498-branch-3.2.001.patch, 
> HDFS-14498.001.patch, HDFS-14498.002.patch
>
>
> The logs from file creation are long gone due to infinite lease logging, 
> however it presumably failed... the client who was trying to write this file 
> is definitely long dead.
> The version includes HDFS-4882.
> We get this log pattern repeating infinitely:
> {noformat}
> 2019-05-16 14:00:16,893 INFO 
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder: 
> DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1] has expired hard 
> limit
> 2019-05-16 14:00:16,893 INFO 
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  
> Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1], src=
> 2019-05-16 14:00:16,893 WARN 
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] 
> org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: 
> Failed to release lease for file . Committed blocks are waiting to be 
> minimally replicated. Try again later.
> 2019-05-16 14:00:16,893 WARN 
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Cannot release the path 
>  in the lease [Lease.  Holder: DFSClient_NONMAPREDUCE_-20898906_61, 
> pending creates: 1]. It will be retried.
> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* 
> NameSystem.internalReleaseLease: Failed to release lease for file . 
> Committed blocks are waiting to be minimally replicated. Try again later.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:3357)
>   at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:573)
>   at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:509)
>   at java.lang.Thread.run(Thread.java:745)
> $  grep -c "Recovering.*DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 
> 1" hdfs_nn*
> hdfs_nn.log:1068035
> hdfs_nn.log.2019-05-16-14:1516179
> hdfs_nn.log.2019-05-16-15:1538350
> {noformat}
> Aside from an actual bug fix, it might make sense to make LeaseManager not 
> log so much, in case if there are more bugs like this...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-07-16 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14498:
---
Fix Version/s: 3.1.5
   2.10.1
   3.2.2
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> LeaseManager can loop forever on the file for which create has failed 
> --
>
> Key: HDFS-14498
> URL: https://issues.apache.org/jira/browse/HDFS-14498
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.9.0
>Reporter: Sergey Shelukhin
>Assignee: Stephen O'Donnell
>Priority: Major
> Fix For: 3.2.2, 2.10.1, 3.3.1, 3.4.0, 3.1.5
>
> Attachments: HDFS-14498-branch-2.10.001.patch, 
> HDFS-14498-branch-2.10.002.patch, HDFS-14498-branch-2.10.002.patch, 
> HDFS-14498-branch-2.10.003.patch, HDFS-14498-branch-2.10.004.patch, 
> HDFS-14498-branch-3.1.001.patch, HDFS-14498-branch-3.1.001.patch, 
> HDFS-14498-branch-3.1.001.patch, HDFS-14498-branch-3.2.001.patch, 
> HDFS-14498.001.patch, HDFS-14498.002.patch
>
>
> The logs from file creation are long gone due to infinite lease logging, 
> however it presumably failed... the client who was trying to write this file 
> is definitely long dead.
> The version includes HDFS-4882.
> We get this log pattern repeating infinitely:
> {noformat}
> 2019-05-16 14:00:16,893 INFO 
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder: 
> DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1] has expired hard 
> limit
> 2019-05-16 14:00:16,893 INFO 
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  
> Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1], src=
> 2019-05-16 14:00:16,893 WARN 
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] 
> org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: 
> Failed to release lease for file . Committed blocks are waiting to be 
> minimally replicated. Try again later.
> 2019-05-16 14:00:16,893 WARN 
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Cannot release the path 
>  in the lease [Lease.  Holder: DFSClient_NONMAPREDUCE_-20898906_61, 
> pending creates: 1]. It will be retried.
> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* 
> NameSystem.internalReleaseLease: Failed to release lease for file . 
> Committed blocks are waiting to be minimally replicated. Try again later.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:3357)
>   at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:573)
>   at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:509)
>   at java.lang.Thread.run(Thread.java:745)
> $  grep -c "Recovering.*DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 
> 1" hdfs_nn*
> hdfs_nn.log:1068035
> hdfs_nn.log.2019-05-16-14:1516179
> hdfs_nn.log.2019-05-16-15:1538350
> {noformat}
> Aside from an actual bug fix, it might make sense to make LeaseManager not 
> log so much, in case if there are more bugs like this...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2020-08-20 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181643#comment-17181643
 ] 

Xiaoqiao He commented on HDFS-14997:


Thanks [~Captainhzy] for your comments. This improvement tries to process 
high-cost commands such as `DNA_INVALIDATE` asynchronously, so that they do not 
block the core flow of BPServiceActor.
As for your concern, it should not be an issue IMO, because 
`updateActorStatesFromHeartbeat` is a very light operation and does not need to 
be async. On the other hand, we need to update the NameNode state in real time to 
avoid executing commands from the Standby. Hope that answers your questions. Thanks.
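
To make the pattern concrete for readers of this thread, here is a minimal, 
self-contained sketch of processing commands off the heartbeat thread. It is 
illustrative only; the class and method names are not the committed HDFS-14997 code.
{code:java}
// Illustrative sketch only: queue commands received with a heartbeat and run
// them on a separate worker thread, so a slow command (e.g. invalidation work
// that waits on a dataset lock) cannot delay the next heartbeat.
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncCommandProcessor {
  private final BlockingQueue<Runnable> commandQueue = new LinkedBlockingQueue<>();
  private final Thread worker;

  public AsyncCommandProcessor() {
    worker = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          // May take a long time; the heartbeat loop is not waiting on it.
          commandQueue.take().run();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "command-processor");
    worker.setDaemon(true);
    worker.start();
  }

  /** Called from the heartbeat loop: enqueue and return immediately. */
  public void submit(List<Runnable> commands) {
    commandQueue.addAll(commands);
  }
}
{code}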

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15448) When starting a DataNode, call BlockPoolManager#startAll() twice.

2020-08-21 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181660#comment-17181660
 ] 

Xiaoqiao He commented on HDFS-15448:


Hi [~jianghuazhu], please check whether the failed unit tests are related to 
this change. I tried to trigger Yetus manually, but it does not seem to work well 
now; not sure if I missed something.

> When starting a DataNode, call BlockPoolManager#startAll() twice.
> -
>
> Key: HDFS-15448
> URL: https://issues.apache.org/jira/browse/HDFS-15448
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.1.1
>Reporter: jianghua zhu
>Assignee: jianghua zhu
>Priority: Major
> Attachments: HDFS-15448.001.patch, HDFS-15448.002.patch, 
> HDFS-15448.003.patch, method_invoke_path.jpg
>
>
> When starting a DataNode, call BlockPoolManager#startAll() twice.
> The first call:
> BlockPoolManager#doRefreshNamenodes()
> private void doRefreshNamenodes(
>  Map<String, Map<String, InetSocketAddress>> addrMap,
>  Map<String, Map<String, InetSocketAddress>> lifelineAddrMap)
>  throws IOException {
>  ...
> startAll();
> ...
> }
> The second call:
> DataNode#runDatanodeDaemon()
> public void runDatanodeDaemon() throws IOException {
> blockPoolManager.startAll();
> ...
> }
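
As a side note for readers, one way to picture the redundancy described above is 
a guard that turns a second startAll() into a no-op. This is only an illustration; 
the actual HDFS-15448 change may simply drop one of the two call sites instead.
{code:java}
// Illustration only, not the committed HDFS-15448 change: make startup
// idempotent so calling startAll() twice starts the block pool services once.
import java.util.concurrent.atomic.AtomicBoolean;

class BlockPoolStarter {
  private final AtomicBoolean started = new AtomicBoolean(false);

  void startAll() {
    // compareAndSet is false on every call after the first, so the
    // per-namespace service threads are only started one time.
    if (!started.compareAndSet(false, true)) {
      return;
    }
    // ... start the BPOfferService-style threads here ...
  }
}
{code}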



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues

2020-08-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182337#comment-17182337
 ] 

Xiaoqiao He commented on HDFS-14852:


Thanks [~ferhui] for your work here. For v006, it does not seem necessary to check 
`QUEUE_WITH_CORRUPT_BLOCKS` first: since `QUEUE_WITH_CORRUPT_BLOCKS` is also 
less than `LowRedundancyBlocks.LEVEL`, it is already covered by the original code 
segment's logic.
{code:java}
+  if (priorityQueues.get(QUEUE_WITH_CORRUPT_BLOCKS).remove(block)) {
+decrementBlockStat(block, QUEUE_WITH_CORRUPT_BLOCKS,
+oldExpectedReplicas);
+  }
{code}
I am interested in how to reproduce this case. After a quick check of 
LowRedundancyBlocks#add and LowRedundancyBlocks#update, it seems one block 
reference should exist in only a single queue. Is there any corner case that breaks 
this constraint?
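
For readers following the discussion, a simplified, self-contained sketch of the 
"remove from every queue" behaviour being debated. It drops the decrementBlockStat 
bookkeeping for brevity and is not the actual LowRedundancyBlocks code.
{code:java}
// Simplified illustration: scan every priority level and remove the block from
// each queue it appears in, instead of returning after the first hit.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PriorityQueuesSketch<B> {
  static final int LEVEL = 5;                          // number of priority levels
  private final List<Set<B>> priorityQueues = new ArrayList<>();

  PriorityQueuesSketch() {
    for (int i = 0; i < LEVEL; i++) {
      priorityQueues.add(new HashSet<>());
    }
  }

  boolean removeFromAllQueues(B block) {
    boolean removed = false;
    for (int level = 0; level < LEVEL; level++) {
      if (priorityQueues.get(level).remove(block)) {
        // Keep scanning: the block may sit in more than one queue.
        removed = true;
      }
    }
    return removed;
  }
}
{code}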

> Remove of LowRedundancyBlocks do NOT remove the block from all queues
> -
>
> Key: HDFS-14852
> URL: https://issues.apache.org/jira/browse/HDFS-14852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: CorruptBlocksMismatch.png, HDFS-14852.001.patch, 
> HDFS-14852.002.patch, HDFS-14852.003.patch, HDFS-14852.004.patch, 
> HDFS-14852.005.patch, HDFS-14852.006.patch, screenshot-1.png
>
>
> LowRedundancyBlocks.java
> {code:java}
> // Some comments here
> if(priLevel >= 0 && priLevel < LEVEL
> && priorityQueues.get(priLevel).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
>   + " from priority queue {}",
>   block, priLevel);
>   decrementBlockStat(block, priLevel, oldExpectedReplicas);
>   return true;
> } else {
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
>   for (int i = 0; i < LEVEL; i++) {
> if (i != priLevel && priorityQueues.get(i).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block" +
>   " {} from priority queue {}", block, i);
>   decrementBlockStat(block, i, oldExpectedReplicas);
>   return true;
> }
>   }
> }
> return false;
>   }
> {code}
> Source code is above, the comments as follow
> {quote}
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
> {quote}
> The function "remove" does NOT remove the block from all queues.
> Function add from LowRedundancyBlocks.java is used on some places and maybe 
> one block in two or more queues.
> We found that corrupt blocks mismatch corrupt files on NN web UI. Maybe it is 
> related to this.
> Upload initial patch



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15448) When starting a DataNode, call BlockPoolManager#startAll() twice.

2020-08-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182340#comment-17182340
 ] 

Xiaoqiao He commented on HDFS-15448:


Thanks [~jianghuazhu] for your work here. The failed unit tests seem unrelated 
to the changes; please have another check if you have time.
v003 LGTM. +1 from my side. Will commit to trunk in two days if there is 
no objection. cc [~linyiqun][~elgoiri].

> When starting a DataNode, call BlockPoolManager#startAll() twice.
> -
>
> Key: HDFS-15448
> URL: https://issues.apache.org/jira/browse/HDFS-15448
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.1.1
>Reporter: jianghua zhu
>Assignee: jianghua zhu
>Priority: Major
> Attachments: HDFS-15448.001.patch, HDFS-15448.002.patch, 
> HDFS-15448.003.patch, method_invoke_path.jpg
>
>
> When starting a DataNode, call BlockPoolManager#startAll() twice.
> The first call:
> BlockPoolManager#doRefreshNamenodes()
> private void doRefreshNamenodes(
>  Map<String, Map<String, InetSocketAddress>> addrMap,
>  Map<String, Map<String, InetSocketAddress>> lifelineAddrMap)
>  throws IOException {
>  ...
> startAll();
> ...
> }
> The second call:
> DataNode#runDatanodeDaemon()
> public void runDatanodeDaemon() throws IOException {
> blockPoolManager.startAll();
> ...
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog

2020-08-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182374#comment-17182374
 ] 

Xiaoqiao He commented on HDFS-15175:


After a deep dig, I believe this case is related to async edit logging.
Consider the following scenario:
a. Client A tries to close file foo, which has two blocks, blk_1 + blk_2. The 
NameNode has released the write lock and file foo's lease but has not yet written 
out the async edit log entry; at the same time, client A is waiting for the 
`complete` RPC to return.
b. Client B truncates file foo, and this RPC request completes smoothly, 
including writing out its edit log entry.
c. The `close` entry from client A is then also written to the edit log, after 
the `truncate` entry.
d. The Standby NameNode hits this exception when replaying the edit entries 
`create` - `addblock` - `truncate` - `close`, because it is missing one block 
reference that was deleted by the truncate.
cc [~daryn],[~kihwal] it seems this corner case is introduced by the async edit 
logging feature. Do you mind having another check?
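
To illustrate the shared-instance problem in the scenario above, here is a hedged 
sketch of one defensive direction: snapshot the block fields at the moment the op 
is built, so a later truncate cannot change what gets journaled. This is not the 
actual FSEditLog/CloseOp code.
{code:java}
// Illustration only: copy the mutable block state into an immutable value when
// the CloseOp is created, so a truncate that runs before the async edit logger
// flushes the op cannot alter the logged block size.
final class MutableBlock {            // stand-in for the live in-memory block
  long blockId;
  long numBytes;
  long genStamp;
}

final class LoggedBlock {             // immutable snapshot stored in the edit-log op
  final long blockId;
  final long numBytes;                // size captured at close time
  final long genStamp;

  LoggedBlock(MutableBlock b) {
    this.blockId = b.blockId;
    this.numBytes = b.numBytes;
    this.genStamp = b.genStamp;
  }
}
{code}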

> Multiple CloseOp shared block instance causes the standby namenode to crash 
> when rolling editlog
> 
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Wan Chang
>Priority: Critical
>  Labels: NameNode
> Attachments: HDFS-15175-trunk.1.patch
>
>
>  
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
> tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
> [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
> atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
> permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, 
> clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
> txid=32625024993]
>  java.io.IOException: File is not under construction: ..
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:360)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
>  at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>  
> {panel:title=Editlog}
> 
>  OP_REASSIGN_LEASE
>  
>  32625021150
>  DFSClient_NONMAPREDUCE_-969060727_197760
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625023743
>  0
>  0
>  ..
>  3
>  1581816135883
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> ..
> 
>  OP_TRUNCATE
>  
>  32625024049
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  ..
>  185818644
>  1581816136336
>  
>  5568434562
>  185818648
>  4495417845
>  
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625024993
>  0
>  0
>  ..
>  3
>  1581816138774
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> {panel}
>  
>  
> The block size should be 185818648 in the first CloseOp. When truncate is 
> used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp is 
> synchronized to the JournalNode in the same batch. The block used by CloseOp 
> twice is the same instance, which causes the first CloseOp has wrong block 
> size. When SNN rolling Editlog, TruncateOp does not make the file to the 
> UnderConstruction state. Then, when the second CloseOp is executed, the file 
> is not in the UnderConstruction state, and SNN crashes.




[jira] [Updated] (HDFS-15448) Remove duplicate BlockPoolManager starting when run DataNode

2020-08-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15448:
---
Summary: Remove duplicate BlockPoolManager starting when run DataNode  
(was: When starting a DataNode, call BlockPoolManager#startAll() twice.)

> Remove duplicate BlockPoolManager starting when run DataNode
> 
>
> Key: HDFS-15448
> URL: https://issues.apache.org/jira/browse/HDFS-15448
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.1.1
>Reporter: jianghua zhu
>Assignee: jianghua zhu
>Priority: Major
> Attachments: HDFS-15448.001.patch, HDFS-15448.002.patch, 
> HDFS-15448.003.patch, method_invoke_path.jpg
>
>
> When starting a DataNode, call BlockPoolManager#startAll() twice.
> The first call:
> BlockPoolManager#doRefreshNamenodes()
> private void doRefreshNamenodes(
>  Map<String, Map<String, InetSocketAddress>> addrMap,
>  Map<String, Map<String, InetSocketAddress>> lifelineAddrMap)
>  throws IOException {
>  ...
> startAll();
> ...
> }
> The second call:
> DataNode#runDatanodeDaemon()
> public void runDatanodeDaemon() throws IOException {
> blockPoolManager.startAll();
> ...
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15448) Remove duplicate BlockPoolManager starting when run DataNode

2020-08-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15448:
---
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed to trunk.
Thanks [~jianghuazhu] for your report and contribution.
Thanks [~linyiqun],[~elgoiri] and [~hemanthboyina] for your comments and 
reviews.

> Remove duplicate BlockPoolManager starting when run DataNode
> 
>
> Key: HDFS-15448
> URL: https://issues.apache.org/jira/browse/HDFS-15448
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.1.1
>Reporter: jianghua zhu
>Assignee: jianghua zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-15448.001.patch, HDFS-15448.002.patch, 
> HDFS-15448.003.patch, method_invoke_path.jpg
>
>
> When starting a DataNode, call BlockPoolManager#startAll() twice.
> The first call:
> BlockPoolManager#doRefreshNamenodes()
> private void doRefreshNamenodes(
>  Map<String, Map<String, InetSocketAddress>> addrMap,
>  Map<String, Map<String, InetSocketAddress>> lifelineAddrMap)
>  throws IOException {
>  ...
> startAll();
> ...
> }
> The second call:
> DataNode#runDatanodeDaemon()
> public void runDatanodeDaemon() throws IOException {
> blockPoolManager.startAll();
> ...
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2020-08-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184920#comment-17184920
 ] 

Xiaoqiao He commented on HDFS-14997:


Thanks [~Captainhzy]. In my experience, processing commands asynchronously 
resolves most cases where the NameNode marks a DataNode as lost because no 
heartbeat arrived for a long time.
In general, we split the commands and process them one by one asynchronously, 
rather than processing all of them in the main flow, so I believe it is a 
significant improvement.
About the lock contention, I agree that processing could still be blocked, 
especially by one very heavy command (maybe in some corner case). Any ideas to 
improve it? More discussion is welcome. Thanks again.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-08-29 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186931#comment-17186931
 ] 

Xiaoqiao He commented on HDFS-14694:


Thanks [~zhangchen] and [~leosun08] for your continued work here. The idea 
seems good to me.
{quote}But seems there are couple of changes just for testing and to trigger 
exception, May be we should try to avoid as much as possible. {quote}
+1 for that. I would just suggest replacing `setExceptionInClose` with some other 
approach, such as a mock or fault injector, for the unit test. Thanks.
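
As a hedged sketch of the "mock or fault injector" suggestion, plain Mockito can 
inject the close failure without a production-code setter. The stream type and 
assertions here are illustrative and not the HDFS-14694 test code.
{code:java}
// Illustration only: make close() throw via a Mockito spy instead of adding a
// test-only setter such as setExceptionInClose to the production class.
import static org.mockito.Mockito.doThrow;
import static org.mockito.Mockito.spy;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.junit.Assert;
import org.junit.Test;

public class TestInjectedCloseFailure {
  @Test
  public void testCloseThrowsInjectedException() throws Exception {
    OutputStream failing = spy(new ByteArrayOutputStream());
    doThrow(new IOException("injected close failure")).when(failing).close();

    try {
      failing.close();
      Assert.fail("close() should have thrown the injected exception");
    } catch (IOException expected) {
      // In the real test this is where the recoverLease behaviour would be verified.
    }
  }
}
{code}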

> Call recoverLease on DFSOutputStream close exception
> 
>
> Key: HDFS-14694
> URL: https://issues.apache.org/jira/browse/HDFS-14694
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Chen Zhang
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, 
> HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, 
> HDFS-14694.006.patch
>
>
> HDFS uses file-lease to manage opened files, when a file is not closed 
> normally, NN will recover lease automatically after hard limit exceeded. But 
> for a long running service(e.g. HBase), the hdfs-client will never die and NN 
> don't have any chances to recover the file.
> Usually client program needs to handle exceptions by themself to avoid this 
> condition(e.g. HBase automatically call recover lease for files that not 
> closed normally), but in our experience, most services (in our company) don't 
> process this condition properly, which will cause lots of files in abnormal 
> status or even data loss.
> This Jira propose to add a feature that call recoverLease operation 
> automatically when DFSOutputSteam close encounters exception. It should be 
> disabled by default, but when somebody builds a long-running service based on 
> HDFS, they can enable this option.
> We've add this feature to our internal Hadoop distribution for more than 3 
> years, it's quite useful according our experience.
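
To make the proposed behaviour concrete, here is a minimal client-side sketch of 
"call recoverLease when close fails". The real change would wire this inside the 
output stream behind a config flag; DistributedFileSystem#recoverLease is an 
existing public API, everything else in the sketch is illustrative.
{code:java}
// Standalone illustration of the proposed behaviour: if closing the stream
// fails, ask the NameNode to start lease recovery so the file does not stay
// under construction until the hard lease limit expires.
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public final class CloseWithLeaseRecovery {
  private CloseWithLeaseRecovery() {
  }

  public static void closeSafely(DistributedFileSystem dfs,
      FSDataOutputStream out, Path file) throws IOException {
    try {
      out.close();
    } catch (IOException closeFailure) {
      try {
        // Best effort: recoverLease returns true once the file is closed.
        dfs.recoverLease(file);
      } catch (IOException ignored) {
        // Recovery attempt failed; surface the original close failure below.
      }
      throw closeFailure;
    }
  }
}
{code}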



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2020-08-30 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17187451#comment-17187451
 ] 

Xiaoqiao He commented on HDFS-14703:


cc [~shv].
{quote}I want to do some work on this issue ,could you which  version does the 
patch base on?thanks{quote}
Thanks for involving me here. As far as I know, only sub-task HDFS-14731 has been 
merged to trunk; the others are not committed yet.

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, NameNode 
> Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15550) Remove unused imports from TestFileTruncate.java

2020-08-30 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17187453#comment-17187453
 ] 

Xiaoqiao He commented on HDFS-15550:


[^HDFS-15550.001.patch] LGTM. +1 from my side. Will commit to trunk.

> Remove unused imports from TestFileTruncate.java
> 
>
> Key: HDFS-15550
> URL: https://issues.apache.org/jira/browse/HDFS-15550
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ravuri Sushma sree
>Assignee: Ravuri Sushma sree
>Priority: Minor
> Attachments: HDFS-15550.001.patch
>
>
> {{import org.apache.hadoop.fs.BlockLocation and import org.junit.Assert 
> remain unused in }}{{TestFileTruncate.java}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15550) Remove unused imports from TestFileTruncate.java

2020-08-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15550:
---
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed to trunk. Thanks [~Sushma_28] for the report and fix.
Thanks [~brahmareddy] for reviews.

> Remove unused imports from TestFileTruncate.java
> 
>
> Key: HDFS-15550
> URL: https://issues.apache.org/jira/browse/HDFS-15550
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ravuri Sushma sree
>Assignee: Ravuri Sushma sree
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: HDFS-15550.001.patch
>
>
> {{import org.apache.hadoop.fs.BlockLocation and import org.junit.Assert 
> remain unused in }}{{TestFileTruncate.java}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15550) Remove unused imports from TestFileTruncate.java

2020-08-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15550:
---
Issue Type: Improvement  (was: Bug)

> Remove unused imports from TestFileTruncate.java
> 
>
> Key: HDFS-15550
> URL: https://issues.apache.org/jira/browse/HDFS-15550
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ravuri Sushma sree
>Assignee: Ravuri Sushma sree
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: HDFS-15550.001.patch
>
>
> {{import org.apache.hadoop.fs.BlockLocation and import org.junit.Assert 
> remain unused in }}{{TestFileTruncate.java}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Moved] (HDFS-15553) Improve NameNode RPC throughput with ReadWriteRpcCallQueue

2020-09-01 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He moved HADOOP-17237 to HDFS-15553:
-

Component/s: (was: rpc-server)
 namenode
Key: HDFS-15553  (was: HADOOP-17237)
Project: Hadoop HDFS  (was: Hadoop Common)

> Improve NameNode RPC throughput with ReadWriteRpcCallQueue 
> ---
>
> Key: HDFS-15553
> URL: https://issues.apache.org/jira/browse/HDFS-15553
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Wang, Xinglong
>Priority: Major
>
> *Current*
>  In our production cluster, a typical traffic model is read to write raito is 
> 10:1 and sometimes the ratios goes to 30:1.
>  NameNode is using ReEntrantReadWriteLock under the hood of FSNamesystemLock. 
> Read lock is shared lock while write lock is exclusive lock.
> Read RPC and Write RPC comes randomly to namenode. This makes read and write 
> mixed up. And then only a small fraction of read can really share their read 
> lock.
> Currently we have default callqueue and faircallqueue. And we can 
> refreshCallQueue on the fly. This opens room to design new call queue.
> *Idea*
>  If we reorder the rpc call in callqueue to group read rpc together and write 
> rpc together, we will have sort of control to let a batch of read rpc come to 
> handlers together and possibly share the same read lock. Thus we can reduce 
> Fragments of read locks.
>  This will only improve the chance to share the read lock among the batch of 
> read rpc due to there are some namenode internal write lock is out of call 
> queue.
> Under ReEntrantReadWriteLock, there is a queue to manage threads asking for 
> locks. We can give an example.
>  R: stands for read rpc
>  W: stands for write rpc
>  e.g
>  RWRWRWRWRWRWRWRW
>  In this case, we need 16 lock timeslice.
> optimized
>  RRRRRRRRWWWWWWWW
>  In this case, we only need 9 lock timeslice.
> *Correctness*
>  Since the execution order of any 2 concurrent or queued rpc in namenode is 
> not guaranteed. We can reorder the rpc in callqueue into read group and write 
> group. And then dequeue from these 2 queues by a designed strategy. let's say 
> dequeue 100 read and then dequeue 5 write rpc and then dequeue read again and 
> then write again.
>  Since FairCallQueue also does rpc call reorder in callqueue, for this part I 
> think they share the same logic to guarantee rpc result correctness.
> *Performance*
>  In test environment, we can see a 15% - 20% NameNode RPC throughput 
> improvement comparing with default callqueue. 
>  Test traffic is 30 read:3 write :1 list using NNLoadGeneratorMR
> This performance is not a surprise. Due to some write rpc is not managed in 
> callqueue. We can't do reorder to them by reording calls in callqueue. 
>  But still we can do a fully read write reorder if we redesign 
> ReEntrantReadWriteLock to achieve this. This will be further step after this.
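
A toy sketch of the dequeue strategy described above, i.e. drain a batch of reads 
and then a batch of writes. It ignores the real CallQueueManager/RpcScheduler 
plumbing, and all names below are illustrative.
{code:java}
// Toy illustration: keep read and write calls in separate queues and serve them
// in configurable batches, so consecutive handler threads are more likely to
// share the FSNamesystem read lock. Not the real Hadoop call queue API.
import java.util.concurrent.LinkedBlockingQueue;

public class ReadWriteGroupingQueue<T> {
  private final LinkedBlockingQueue<T> readCalls = new LinkedBlockingQueue<>();
  private final LinkedBlockingQueue<T> writeCalls = new LinkedBlockingQueue<>();
  private final int readBatch;     // e.g. 100 reads ...
  private final int writeBatch;    // ... then 5 writes, as in the description
  private boolean servingReads = true;
  private int servedInBatch = 0;

  public ReadWriteGroupingQueue(int readBatch, int writeBatch) {
    this.readBatch = readBatch;
    this.writeBatch = writeBatch;
  }

  public void put(T call, boolean isRead) throws InterruptedException {
    (isRead ? readCalls : writeCalls).put(call);
  }

  /** Serve a batch from one group, then switch; fall back to the other group when empty. */
  public synchronized T take() throws InterruptedException {
    while (true) {
      if (servedInBatch >= (servingReads ? readBatch : writeBatch)) {
        servingReads = !servingReads;                            // batch exhausted: switch groups
        servedInBatch = 0;
      }
      T call = (servingReads ? readCalls : writeCalls).poll();
      if (call == null) {
        call = (servingReads ? writeCalls : readCalls).poll();   // avoid starving the other group
      }
      if (call != null) {
        servedInBatch++;
        return call;
      }
      Thread.sleep(1);                                           // both queues empty: retry shortly
    }
  }
}
{code}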



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15553) Improve NameNode RPC throughput with ReadWriteRpcCallQueue

2020-09-01 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188406#comment-17188406
 ] 

Xiaoqiao He commented on HDFS-15553:


Based on the proposal, this seems likely to improve NameNode performance. Moving 
it to the HDFS sub-project.

> Improve NameNode RPC throughput with ReadWriteRpcCallQueue 
> ---
>
> Key: HDFS-15553
> URL: https://issues.apache.org/jira/browse/HDFS-15553
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Wang, Xinglong
>Priority: Major
>
> *Current*
>  In our production cluster, a typical traffic model is read to write raito is 
> 10:1 and sometimes the ratios goes to 30:1.
>  NameNode is using ReEntrantReadWriteLock under the hood of FSNamesystemLock. 
> Read lock is shared lock while write lock is exclusive lock.
> Read RPC and Write RPC comes randomly to namenode. This makes read and write 
> mixed up. And then only a small fraction of read can really share their read 
> lock.
> Currently we have default callqueue and faircallqueue. And we can 
> refreshCallQueue on the fly. This opens room to design new call queue.
> *Idea*
>  If we reorder the rpc call in callqueue to group read rpc together and write 
> rpc together, we will have sort of control to let a batch of read rpc come to 
> handlers together and possibly share the same read lock. Thus we can reduce 
> Fragments of read locks.
>  This will only improve the chance to share the read lock among the batch of 
> read rpc due to there are some namenode internal write lock is out of call 
> queue.
> Under ReEntrantReadWriteLock, there is a queue to manage threads asking for 
> locks. We can give an example.
>  R: stands for read rpc
>  W: stands for write rpc
>  e.g
>  RWRWRWRWRWRWRWRW
>  In this case, we need 16 lock timeslice.
> optimized
>  RRRRRRRRWWWWWWWW
>  In this case, we only need 9 lock timeslice.
> *Correctness*
>  Since the execution order of any 2 concurrent or queued rpc in namenode is 
> not guaranteed. We can reorder the rpc in callqueue into read group and write 
> group. And then dequeue from these 2 queues by a designed strategy. let's say 
> dequeue 100 read and then dequeue 5 write rpc and then dequeue read again and 
> then write again.
>  Since FairCallQueue also does rpc call reorder in callqueue, for this part I 
> think they share the same logic to guarantee rpc result correctness.
> *Performance*
>  In test environment, we can see a 15% - 20% NameNode RPC throughput 
> improvement comparing with default callqueue. 
>  Test traffic is 30 read:3 write :1 list using NNLoadGeneratorMR
> This performance is not a surprise. Due to some write rpc is not managed in 
> callqueue. We can't do reorder to them by reording calls in callqueue. 
>  But still we can do a fully read write reorder if we redesign 
> ReEntrantReadWriteLock to achieve this. This will be further step after this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15553) Improve NameNode RPC throughput with ReadWriteRpcCallQueue

2020-09-01 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188422#comment-17188422
 ] 

Xiaoqiao He commented on HDFS-15553:


Thanks [~suxingfate] for your interesting proposal. My initial feeling is that 
it is useful in most cases, but some particular scenarios may not be handled 
properly.
Such as,
a) Client A sends a Write request at T1 which creates file foo,
b) while Client B sends a Read request at T2 which gets the file status of file foo.
Considering T1 < T2, if reads are grouped ahead of writes, Client B's read could 
be processed before Client A's create and might not see file foo.

> Improve NameNode RPC throughput with ReadWriteRpcCallQueue 
> ---
>
> Key: HDFS-15553
> URL: https://issues.apache.org/jira/browse/HDFS-15553
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Wang, Xinglong
>Priority: Major
>
> *Current*
>  In our production cluster, a typical traffic model is read to write raito is 
> 10:1 and sometimes the ratios goes to 30:1.
>  NameNode is using ReEntrantReadWriteLock under the hood of FSNamesystemLock. 
> Read lock is shared lock while write lock is exclusive lock.
> Read RPC and Write RPC comes randomly to namenode. This makes read and write 
> mixed up. And then only a small fraction of read can really share their read 
> lock.
> Currently we have default callqueue and faircallqueue. And we can 
> refreshCallQueue on the fly. This opens room to design new call queue.
> *Idea*
>  If we reorder the rpc call in callqueue to group read rpc together and write 
> rpc together, we will have sort of control to let a batch of read rpc come to 
> handlers together and possibly share the same read lock. Thus we can reduce 
> Fragments of read locks.
>  This will only improve the chance to share the read lock among the batch of 
> read rpc due to there are some namenode internal write lock is out of call 
> queue.
> Under ReEntrantReadWriteLock, there is a queue to manage threads asking for 
> locks. We can give an example.
>  R: stands for read rpc
>  W: stands for write rpc
>  e.g
>  RWRWRWRWRWRWRWRW
>  In this case, we need 16 lock timeslice.
> optimized
>  RRRRRRRRWWWWWWWW
>  In this case, we only need 9 lock timeslice.
> *Correctness*
>  Since the execution order of any 2 concurrent or queued rpc in namenode is 
> not guaranteed. We can reorder the rpc in callqueue into read group and write 
> group. And then dequeue from these 2 queues by a designed strategy. let's say 
> dequeue 100 read and then dequeue 5 write rpc and then dequeue read again and 
> then write again.
>  Since FairCallQueue also does rpc call reorder in callqueue, for this part I 
> think they share the same logic to guarantee rpc result correctness.
> *Performance*
>  In test environment, we can see a 15% - 20% NameNode RPC throughput 
> improvement comparing with default callqueue. 
>  Test traffic is 30 read:3 write :1 list using NNLoadGeneratorMR
> This performance is not a surprise. Due to some write rpc is not managed in 
> callqueue. We can't do reorder to them by reording calls in callqueue. 
>  But still we can do a fully read write reorder if we redesign 
> ReEntrantReadWriteLock to achieve this. This will be further step after this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15553) Improve NameNode RPC throughput with ReadWriteRpcCallQueue

2020-09-01 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188476#comment-17188476
 ] 

Xiaoqiao He commented on HDFS-15553:


Thanks [~suxingfate] for your quick response. It makes sense to me. Would you 
like to attach a design doc for your proposal (it would be better if it includes a 
POC and benchmark results)? I believe that will make it easier for other reviewers 
to discuss it deeply and push it forward. Thanks.

> Improve NameNode RPC throughput with ReadWriteRpcCallQueue 
> ---
>
> Key: HDFS-15553
> URL: https://issues.apache.org/jira/browse/HDFS-15553
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Wang, Xinglong
>Priority: Major
>
> *Current*
>  In our production cluster, a typical traffic model is read to write raito is 
> 10:1 and sometimes the ratios goes to 30:1.
>  NameNode is using ReEntrantReadWriteLock under the hood of FSNamesystemLock. 
> Read lock is shared lock while write lock is exclusive lock.
> Read RPC and Write RPC comes randomly to namenode. This makes read and write 
> mixed up. And then only a small fraction of read can really share their read 
> lock.
> Currently we have default callqueue and faircallqueue. And we can 
> refreshCallQueue on the fly. This opens room to design new call queue.
> *Idea*
>  If we reorder the rpc call in callqueue to group read rpc together and write 
> rpc together, we will have sort of control to let a batch of read rpc come to 
> handlers together and possibly share the same read lock. Thus we can reduce 
> Fragments of read locks.
>  This will only improve the chance to share the read lock among the batch of 
> read rpc due to there are some namenode internal write lock is out of call 
> queue.
> Under ReEntrantReadWriteLock, there is a queue to manage threads asking for 
> locks. We can give an example.
>  R: stands for read rpc
>  W: stands for write rpc
>  e.g
>  RWRWRWRWRWRWRWRW
>  In this case, we need 16 lock timeslice.
> optimized
>  RRRRRRRRWWWWWWWW
>  In this case, we only need 9 lock timeslice.
> *Correctness*
>  Since the execution order of any 2 concurrent or queued rpc in namenode is 
> not guaranteed. We can reorder the rpc in callqueue into read group and write 
> group. And then dequeue from these 2 queues by a designed strategy. let's say 
> dequeue 100 read and then dequeue 5 write rpc and then dequeue read again and 
> then write again.
>  Since FairCallQueue also does rpc call reorder in callqueue, for this part I 
> think they share the same logic to guarantee rpc result correctness.
> *Performance*
>  In test environment, we can see a 15% - 20% NameNode RPC throughput 
> improvement comparing with default callqueue. 
>  Test traffic is 30 read:3 write :1 list using NNLoadGeneratorMR
> This performance is not a surprise. Due to some write rpc is not managed in 
> callqueue. We can't do reorder to them by reording calls in callqueue. 
>  But still we can do a fully read write reorder if we redesign 
> ReEntrantReadWriteLock to achieve this. This will be further step after this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-09-01 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189013#comment-17189013
 ] 

Xiaoqiao He commented on HDFS-14694:


Thanks [~leosun08], [^HDFS-14694.008.patch] seems better. Some nits:
A. Please fix checkstyle and check the failed unit tests.
B. The new unit tests pass even without the patch; please take another check.

> Call recoverLease on DFSOutputStream close exception
> 
>
> Key: HDFS-14694
> URL: https://issues.apache.org/jira/browse/HDFS-14694
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Chen Zhang
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, 
> HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, 
> HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch
>
>
> HDFS uses file-lease to manage opened files, when a file is not closed 
> normally, NN will recover lease automatically after hard limit exceeded. But 
> for a long running service(e.g. HBase), the hdfs-client will never die and NN 
> don't have any chances to recover the file.
> Usually client program needs to handle exceptions by themself to avoid this 
> condition(e.g. HBase automatically call recover lease for files that not 
> closed normally), but in our experience, most services (in our company) don't 
> process this condition properly, which will cause lots of files in abnormal 
> status or even data loss.
> This Jira propose to add a feature that call recoverLease operation 
> automatically when DFSOutputSteam close encounters exception. It should be 
> disabled by default, but when somebody builds a long-running service based on 
> HDFS, they can enable this option.
> We've add this feature to our internal Hadoop distribution for more than 3 
> years, it's quite useful according our experience.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190102#comment-17190102
 ] 

Xiaoqiao He commented on HDFS-15556:


[~haiyang Hu] Great catch here. v001 looks fair to me; it would be better to add a 
new unit test to cover it.
I am interested in why {{storage}} is null here. Is there anywhere that does not 
synchronize on {{storageMap}} where it should?
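
For illustration, the defensive direction being discussed would look roughly like 
the snippet below, which mirrors the loop from the description with a null guard 
added. v001 may differ; LOG and getXferAddr() are assumed from the surrounding 
DatanodeDescriptor context.
{code:java}
// Hedged sketch, not the committed patch: skip a lifeline storage report whose
// storage is unknown instead of dereferencing a null DatanodeStorageInfo.
for (StorageReport report : reports) {
  DatanodeStorageInfo storage;
  synchronized (storageMap) {
    storage = storageMap.get(report.getStorage().getStorageID());
  }
  if (storage == null) {
    // Unknown storage in a lifeline message: log and move on.
    LOG.warn("Unknown storage {} reported by {}, skipping the report",
        report.getStorage().getStorageID(), getXferAddr());
    continue;
  }
  if (checkFailedStorages) {
    failedStorageInfos.remove(storage);
  }
  storage.receivedHeartbeat(report);
  // ... remaining per-report accounting unchanged ...
}
{code}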

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png
>
>
> In our cluster, the NameNode appears NPE when processing lifeline messages 
> sent by the DataNode, which will cause an maxLoad exception calculated by NN.
> because DataNode is identified as busy and unable to allocate available nodes 
> in choose  DataNode, program loop execution results in high CPU and reduces 
> the processing performance of the cluster.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190102#comment-17190102
 ] 

Xiaoqiao He edited comment on HDFS-15556 at 9/3/20, 12:30 PM:
--

[~haiyang Hu] Thanks for the report. Great catch here. v001 looks fair to me; it 
would be better to add a new unit test to cover it.
I am interested in why {{storage}} is null here. Is there anywhere that does not 
synchronize on {{storageMap}} where it should?


was (Author: hexiaoqiao):
[~haiyang Hu] Great catch here. v001 is fair for me, it will be better if add 
new unit test to cover.
I am interested that why {{storage}} is null here. Anywhere not synchronized 
{{storageMap}} where should do that?

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png
>
>
> In our cluster, the NameNode appears NPE when processing lifeline messages 
> sent by the DataNode, which will cause an maxLoad exception calculated by NN.
> because DataNode is identified as busy and unable to allocate available nodes 
> in choose  DataNode, program loop execution results in high CPU and reduces 
> the processing performance of the cluster.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-09-03 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190162#comment-17190162
 ] 

Xiaoqiao He commented on HDFS-14694:


Thanks [~leosun08] for your continued patches.
I will give my +1 on [^HDFS-14694.010.patch] once the unused print 
`System.out.println("sls close:" + closed);` is removed. Thanks again.

> Call recoverLease on DFSOutputStream close exception
> 
>
> Key: HDFS-14694
> URL: https://issues.apache.org/jira/browse/HDFS-14694
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Chen Zhang
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, 
> HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, 
> HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, 
> HDFS-14694.009.patch, HDFS-14694.010.patch
>
>
> HDFS uses file leases to manage opened files. When a file is not closed 
> normally, the NN recovers the lease automatically after the hard limit is 
> exceeded. But for a long-running service (e.g. HBase), the hdfs-client never 
> dies and the NN never gets a chance to recover the file.
> Usually the client program needs to handle exceptions itself to avoid this 
> condition (e.g. HBase automatically calls recoverLease for files that were not 
> closed normally), but in our experience most services (in our company) don't 
> handle this condition properly, which leaves lots of files in an abnormal 
> state or even causes data loss.
> This Jira proposes adding a feature that calls the recoverLease operation 
> automatically when a DFSOutputStream close encounters an exception. It should 
> be disabled by default, but whoever builds a long-running service on top of 
> HDFS can enable this option.
> We've had this feature in our internal Hadoop distribution for more than 3 
> years, and it's quite useful in our experience.
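
As background for the proposal, a hedged sketch of what a long-running client can already do manually today (the pattern the description attributes to HBase), assuming a DistributedFileSystem handle and the path being written; this is not the patch itself, which wires the behaviour into DFSOutputStream behind a disabled-by-default option.

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

class CloseWithLeaseRecovery {
  /** Close the stream; if close fails, ask the NameNode to start lease recovery. */
  static void closeSafely(DistributedFileSystem fs, Path path,
      FSDataOutputStream out) throws IOException {
    try {
      out.close();
    } catch (IOException e) {
      // Without this, the file can stay open until the hard lease limit expires.
      boolean fullyRecovered = fs.recoverLease(path);
      if (!fullyRecovered) {
        // Recovery was triggered but has not completed yet; callers usually poll or retry.
      }
      throw e;
    }
  }
}
{code}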



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15551) Tiny Improve for DeadNode detector

2020-09-03 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-15551:
--

Assignee: dark_num

> Tiny Improve for DeadNode detector
> --
>
> Key: HDFS-15551
> URL: https://issues.apache.org/jira/browse/HDFS-15551
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: 3.3.0
>Reporter: dark_num
>Assignee: dark_num
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> # add or improve some logs for adding local & global deadnodes
>  # improve logic 
>  # fix typo



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15551) Tiny Improve for DeadNode detector

2020-09-03 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-15551:
--

Assignee: imbajin  (was: dark_num)

> Tiny Improve for DeadNode detector
> --
>
> Key: HDFS-15551
> URL: https://issues.apache.org/jira/browse/HDFS-15551
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: 3.3.0
>Reporter: dark_num
>Assignee: imbajin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> # add or improve some logs for adding local & global deadnodes
>  # improve logic 
>  # fix typo



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15551) Tiny Improve for DeadNode detector

2020-09-03 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190526#comment-17190526
 ] 

Xiaoqiao He commented on HDFS-15551:


Thanks [~imbajin] for involving me here.
Added [~imbajin] to the contributor list and assigned this JIRA to him. 
[~leosun08] would you like to take another review?

> Tiny Improve for DeadNode detector
> --
>
> Key: HDFS-15551
> URL: https://issues.apache.org/jira/browse/HDFS-15551
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: 3.3.0
>Reporter: dark_num
>Assignee: imbajin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> # add or improve some logs for adding local & global deadnodes
>  # improve logic 
>  # fix typo



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-09-08 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192656#comment-17192656
 ] 

Xiaoqiao He commented on HDFS-14694:


+1. Thanks [~leosun08] and [~ayushtkn]. I tried running all the failed unit 
tests locally with [^HDFS-14694.014.patch]. All of them passed, so they seem 
unrelated to this change.
Will commit to trunk.

> Call recoverLease on DFSOutputStream close exception
> 
>
> Key: HDFS-14694
> URL: https://issues.apache.org/jira/browse/HDFS-14694
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Chen Zhang
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, 
> HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, 
> HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, 
> HDFS-14694.009.patch, HDFS-14694.010.patch, HDFS-14694.011.patch, 
> HDFS-14694.012.patch, HDFS-14694.013.patch, HDFS-14694.014.patch
>
>
> HDFS uses file leases to manage opened files. When a file is not closed 
> normally, the NN recovers the lease automatically after the hard limit is 
> exceeded. But for a long-running service (e.g. HBase), the hdfs-client never 
> dies and the NN never gets a chance to recover the file.
> Usually the client program needs to handle exceptions itself to avoid this 
> condition (e.g. HBase automatically calls recoverLease for files that were not 
> closed normally), but in our experience most services (in our company) don't 
> handle this condition properly, which leaves lots of files in an abnormal 
> state or even causes data loss.
> This Jira proposes adding a feature that calls the recoverLease operation 
> automatically when a DFSOutputStream close encounters an exception. It should 
> be disabled by default, but whoever builds a long-running service on top of 
> HDFS can enable this option.
> We've had this feature in our internal Hadoop distribution for more than 3 
> years, and it's quite useful in our experience.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-09-09 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14694:
---
Hadoop Flags: Reviewed
  Resolution: Fixed
  Status: Resolved  (was: Patch Available)

Committed to trunk.
Thanks [~leosun08] and [~zhangchen] for your report and contributions!
Thanks [~ayushtkn],[~weichiu] and [~xkrogen] for your reviews!

> Call recoverLease on DFSOutputStream close exception
> 
>
> Key: HDFS-14694
> URL: https://issues.apache.org/jira/browse/HDFS-14694
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Chen Zhang
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, 
> HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, 
> HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, 
> HDFS-14694.009.patch, HDFS-14694.010.patch, HDFS-14694.011.patch, 
> HDFS-14694.012.patch, HDFS-14694.013.patch, HDFS-14694.014.patch
>
>
> HDFS uses file leases to manage opened files. When a file is not closed 
> normally, the NN recovers the lease automatically after the hard limit is 
> exceeded. But for a long-running service (e.g. HBase), the hdfs-client never 
> dies and the NN never gets a chance to recover the file.
> Usually the client program needs to handle exceptions itself to avoid this 
> condition (e.g. HBase automatically calls recoverLease for files that were not 
> closed normally), but in our experience most services (in our company) don't 
> handle this condition properly, which leaves lots of files in an abnormal 
> state or even causes data loss.
> This Jira proposes adding a feature that calls the recoverLease operation 
> automatically when a DFSOutputStream close encounters an exception. It should 
> be disabled by default, but whoever builds a long-running service on top of 
> HDFS can enable this option.
> We've had this feature in our internal Hadoop distribution for more than 3 
> years, and it's quite useful in our experience.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-09-09 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14694:
---
Fix Version/s: 3.4.0

> Call recoverLease on DFSOutputStream close exception
> 
>
> Key: HDFS-14694
> URL: https://issues.apache.org/jira/browse/HDFS-14694
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Chen Zhang
>Assignee: Lisheng Sun
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, 
> HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, 
> HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, 
> HDFS-14694.009.patch, HDFS-14694.010.patch, HDFS-14694.011.patch, 
> HDFS-14694.012.patch, HDFS-14694.013.patch, HDFS-14694.014.patch
>
>
> HDFS uses file leases to manage opened files. When a file is not closed 
> normally, the NN recovers the lease automatically after the hard limit is 
> exceeded. But for a long-running service (e.g. HBase), the hdfs-client never 
> dies and the NN never gets a chance to recover the file.
> Usually the client program needs to handle exceptions itself to avoid this 
> condition (e.g. HBase automatically calls recoverLease for files that were not 
> closed normally), but in our experience most services (in our company) don't 
> handle this condition properly, which leaves lots of files in an abnormal 
> state or even causes data loss.
> This Jira proposes adding a feature that calls the recoverLease operation 
> automatically when a DFSOutputStream close encounters an exception. It should 
> be disabled by default, but whoever builds a long-running service on top of 
> HDFS can enable this option.
> We've had this feature in our internal Hadoop distribution for more than 3 
> years, and it's quite useful in our experience.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15516) Add info for create flags in NameNode audit logs

2020-09-09 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192891#comment-17192891
 ] 

Xiaoqiao He commented on HDFS-15516:


[~jianghuazhu] Thanks for involving me here. Actually, I do not think it is a 
good idea to add another new field for create, because it could break some log 
collectors or parsers. I prefer adding a parameter to the cmd field of 
`create`, just as `rename` does. I would like to hear some other suggestions. 
Thanks.

> Add info for create flags in NameNode audit logs
> 
>
> Key: HDFS-15516
> URL: https://issues.apache.org/jira/browse/HDFS-15516
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Shashikant Banerjee
>Assignee: jianghua zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-15516.001.patch, HDFS-15516.002.patch, 
> HDFS-15516.003.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently, if a file create happens with flags like overwrite, the audit log 
> doesn't seem to contain any info regarding those flags. It would be useful to 
> add info about the create options to the audit logs, similar to Rename ops. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-09 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-15556.

Resolution: Duplicate

Closing this issue and linking it to HDFS-14042.
[~haiyang Hu] Please feel free to reopen if you meet other issues that 
HDFS-14042 cannot resolve. Thanks.

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages 
> sent by a DataNode, which causes the NN to calculate an abnormal maxLoad.
> Because the DataNode is then identified as busy and no available nodes can be 
> allocated when choosing DataNodes, the program loops repeatedly, resulting in 
> high CPU usage and reduced processing performance of the cluster.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15565) Remove the invalid code in the Balancer#doBalance() method.

2020-09-09 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15565:
---
Resolution: Not A Problem
Status: Resolved  (was: Patch Available)

[~jianghuazhu] Thanks for your report and contributions.
BTW, please submit the patch either here or on Github, whichever you prefer, 
but keep it in only one place (Github is the better practice IMO); otherwise 
reviewers'/watchers' comments will be dispersed. Thanks.
For this issue, I think this standard output of the Balancer is useful: it 
prints the header line for the balancer progress, as [~sunchao] commented on 
Github.
Please feel free to reopen if this does not answer your questions.
Thanks again.

> Remove the invalid code in the Balancer#doBalance() method.
> ---
>
> Key: HDFS-15565
> URL: https://issues.apache.org/jira/browse/HDFS-15565
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Reporter: jianghua zhu
>Assignee: jianghua zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.0
>
> Attachments: HDFS-15565.001.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the Balancer#doBalance() method, an invalid line of code is added, as 
> follows:
> {code:java}
> static private int doBalance(Collection<URI> namenodes,
>     Collection<String> nsIds, final BalancerParameters p, Configuration conf)
>     throws IOException, InterruptedException {
>   ...
>   System.out.println("Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved");
>   ...
> }
> {code}
> I think it was originally used for testing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15564) Add Test annotation for TestPersistBlocks#testRestartDfsWithSync

2020-09-09 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193346#comment-17193346
 ] 

Xiaoqiao He commented on HDFS-15564:


+1 on [^HDFS-15564.001.patch].

> Add Test annotation for TestPersistBlocks#testRestartDfsWithSync
> 
>
> Key: HDFS-15564
> URL: https://issues.apache.org/jira/browse/HDFS-15564
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: hdfs
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15564.001.patch
>
>
> Add Test annotation for TestPersistBlocks#testRestartDfsWithSync,  otherwise 
> it’s dead code
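
A minimal JUnit 4 illustration of why the annotation matters (hypothetical class, not the actual TestPersistBlocks body):

{code:java}
import org.junit.Test;

public class AnnotationExample {
  // Silently skipped by the JUnit 4 runner: no @Test annotation, i.e. dead code.
  public void notRun() {
  }

  // Adding @Test (as this JIRA does for testRestartDfsWithSync) makes the runner execute it.
  @Test
  public void isRun() {
  }
}
{code}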



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15559) Complement initialize member variables in TestHdfsConfigFields#initializeMemberVariables

2020-09-09 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193366#comment-17193366
 ] 

Xiaoqiao He commented on HDFS-15559:


Hi [~leosun08], it seems the new config was missed from hdfs-default.xml in 
HDFS-14694? If so, we should submit an addendum patch at HDFS-14694 directly. 
Thanks.

> Complement initialize member variables in 
> TestHdfsConfigFields#initializeMemberVariables
> 
>
> Key: HDFS-15559
> URL: https://issues.apache.org/jira/browse/HDFS-15559
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-15559.001.patch
>
>
> There are some missing constant interfaces in 
> TestHdfsConfigFields#initializeMemberVariables
> {code:java}
> @Override
> public void initializeMemberVariables() {
>   xmlFilename = new String("hdfs-default.xml");
>   configurationClasses = new Class[] { HdfsClientConfigKeys.class,
>   HdfsClientConfigKeys.Failover.class,
>   HdfsClientConfigKeys.StripedRead.class, DFSConfigKeys.class,
>   HdfsClientConfigKeys.BlockWrite.class,
>   HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.class };
> }{code}
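
For illustration, the kind of "complement" the summary describes is adding any missing constant-holder class to the array so that the hdfs-default.xml/keys cross-check covers it. The extra entry below shows the shape of such a change; it is an example, not necessarily the class the actual patch adds.

{code:java}
@Override
public void initializeMemberVariables() {
  xmlFilename = "hdfs-default.xml";
  configurationClasses = new Class[] {
      HdfsClientConfigKeys.class,
      HdfsClientConfigKeys.Failover.class,
      HdfsClientConfigKeys.StripedRead.class,
      DFSConfigKeys.class,
      HdfsClientConfigKeys.BlockWrite.class,
      HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.class,
      HdfsClientConfigKeys.Retry.class   // example of a previously missing constant interface
  };
}
{code}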



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-15551) Tiny Improve for DeadNode detector

2020-09-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-15551.

Hadoop Flags: Reviewed
  Resolution: Fixed

Committed to trunk.
Thanks [~imbajin] for your report and contribution!
Thanks [~leosun08] for your reviews!

> Tiny Improve for DeadNode detector
> --
>
> Key: HDFS-15551
> URL: https://issues.apache.org/jira/browse/HDFS-15551
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: 3.3.0
>Reporter: dark_num
>Assignee: imbajin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> # add or improve some logs for adding local & global deadnodes
>  # improve logic 
>  # fix typo



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15559) Complement initialize member variables in TestHdfsConfigFields#initializeMemberVariables

2020-09-11 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194071#comment-17194071
 ] 

Xiaoqiao He commented on HDFS-15559:


Thanks [~leosun08], it makes sense to me. +1 for [^HDFS-15559.001.patch]. 
Considering there are many failed unit tests, I triggered Yetus manually. 
Let's wait and see what it says. 
https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/142/

> Complement initialize member variables in 
> TestHdfsConfigFields#initializeMemberVariables
> 
>
> Key: HDFS-15559
> URL: https://issues.apache.org/jira/browse/HDFS-15559
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-15559.001.patch
>
>
> There are some missing constant interfaces in 
> TestHdfsConfigFields#initializeMemberVariables
> {code:java}
> @Override
> public void initializeMemberVariables() {
>   xmlFilename = new String("hdfs-default.xml");
>   configurationClasses = new Class[] { HdfsClientConfigKeys.class,
>   HdfsClientConfigKeys.Failover.class,
>   HdfsClientConfigKeys.StripedRead.class, DFSConfigKeys.class,
>   HdfsClientConfigKeys.BlockWrite.class,
>   HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.class };
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15559) Complement initialize member variables in TestHdfsConfigFields#initializeMemberVariables

2020-09-11 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194072#comment-17194072
 ] 

Xiaoqiao He commented on HDFS-15559:


[~leosun08] Please rebase the code; it seems the patch does not apply to trunk.

> Complement initialize member variables in 
> TestHdfsConfigFields#initializeMemberVariables
> 
>
> Key: HDFS-15559
> URL: https://issues.apache.org/jira/browse/HDFS-15559
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-15559.001.patch
>
>
> There are some missing constant interfaces in 
> TestHdfsConfigFields#initializeMemberVariables
> {code:java}
> @Override
> public void initializeMemberVariables() {
>   xmlFilename = new String("hdfs-default.xml");
>   configurationClasses = new Class[] { HdfsClientConfigKeys.class,
>   HdfsClientConfigKeys.Failover.class,
>   HdfsClientConfigKeys.StripedRead.class, DFSConfigKeys.class,
>   HdfsClientConfigKeys.BlockWrite.class,
>   HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.class };
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15560) The getMaxNodesPerRack May Cause "Failed to place enough replicas"

2020-09-11 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194073#comment-17194073
 ] 

Xiaoqiao He commented on HDFS-15560:


Thanks [~wzx513] for your report. Any ideas on how to improve it?

> The getMaxNodesPerRack May Cause "Failed to place enough replicas"
> --
>
> Key: HDFS-15560
> URL: https://issues.apache.org/jira/browse/HDFS-15560
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: wangzhixiang
>Assignee: wangzhixiang
>Priority: Major
>
> In our HDFS cluster, the number of nodes in each rack is extremely uneven.
> E.g. rack1=[1 node], rack2=[1 node], rack3=[3 nodes], rack4=[5 nodes], 
> rack5=[4 nodes], rack6=[4 nodes].
> When the getMaxNodesPerRack method is invoked, we get maxNodesPerRack = 4 from 
> maxNodesPerRack = (totalNumOfReplicas-1)/numOfRacks + 2, with 
> totalNumOfReplicas = 18 and numOfRacks = 6.
> The replication of some files in our cluster is set to 50, so 18 replicas are 
> allocated and all nodes are needed. However, rack4 can only contribute 4 nodes 
> because maxNodesPerRack = 4. As a result only 17 (1+1+3+4+4+4) replicas can be 
> chosen, and the warn log "Failed to place enough replicas, still in need of 1 
> to reach 18" is thrown.  
> Besides, ReplicationMonitor will add the file as ReplicationWork to retry, and 
> it keeps failing in a loop. 
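
A quick standalone check of the arithmetic in the report (plain Java, values taken from the description above):

{code:java}
public class MaxNodesPerRackCheck {
  public static void main(String[] args) {
    int[] nodesPerRack = {1, 1, 3, 5, 4, 4};          // rack1..rack6
    int totalNumOfReplicas = 18;
    int numOfRacks = nodesPerRack.length;
    int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 2;  // (18-1)/6 + 2 = 4
    int placeable = 0;
    for (int n : nodesPerRack) {
      placeable += Math.min(n, maxNodesPerRack);      // rack4 is capped at 4 of its 5 nodes
    }
    System.out.println(maxNodesPerRack + " " + placeable);  // prints "4 17", one short of 18
  }
}
{code}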



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14084) Need for more stats in DFSClient

2020-09-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14084:
---
Target Version/s: 3.2.3  (was: 3.2.2)

Updated the target version to 3.2.3 in preparation for the 3.2.2 release. 
Please let me know if this is a blocker for you. Thanks.

> Need for more stats in DFSClient
> 
>
> Key: HDFS-14084
> URL: https://issues.apache.org/jira/browse/HDFS-14084
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: Pranay Singh
>Assignee: Erik Krogen
>Priority: Minor
> Attachments: HDFS-14084.001.patch, HDFS-14084.002.patch, 
> HDFS-14084.003.patch, HDFS-14084.004.patch, HDFS-14084.005.patch, 
> HDFS-14084.006.patch, HDFS-14084.007.patch, HDFS-14084.008.patch, 
> HDFS-14084.009.patch, HDFS-14084.010.patch, HDFS-14084.011.patch, 
> HDFS-14084.012.patch, HDFS-14084.013.patch, HDFS-14084.014.patch, 
> HDFS-14084.015.patch, HDFS-14084.016.patch, HDFS-14084.017.patch, 
> HDFS-14084.018.patch
>
>
> The usage of HDFS has changed: from being a map-reduce filesystem it is now 
> becoming more of a general-purpose filesystem. In most cases the issues are 
> with the Namenode, so we have metrics to know the workload or stress on the 
> Namenode.
> However, there is a need to collect more statistics for the different 
> operations/RPCs in DFSClient, to know which RPC operations are taking longer 
> or how frequent each operation is. These statistics can be exposed to the 
> users of the DFS Client, who can periodically log them or do some sort of flow 
> control if responses are slow. This will also help to isolate HDFS issues in a 
> mixed environment where, on one node, we have say Spark, HBase and Impala 
> running together. We can check the throughput of different operations across 
> clients and isolate problems caused by a noisy neighbor, network congestion, 
> or a shared JVM.
> We have dealt with several problems from the field for which there is no 
> conclusive evidence as to what caused the problem. If we had metrics or stats 
> in DFSClient we would be better equipped to solve such complex problems.
> List of jiras for reference:
> -
>  HADOOP-15538 HADOOP-15530 ( client side deadlock)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14571) Command line to force volume failures

2020-09-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14571:
---
Target Version/s: 3.2.3  (was: 3.2.2)

Updated the target version to 3.2.3 in preparation for the 3.2.2 release. 
Please let me know if this is a blocker for you. Thanks.

> Command line to force volume failures
> -
>
> Key: HDFS-14571
> URL: https://issues.apache.org/jira/browse/HDFS-14571
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, hdfs
> Environment: Linux
>Reporter: Scott A. Wehner
>Priority: Major
>  Labels: disks, volumes
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Datanodes with failed hard drives report to the namenode that they have a 
> failed volume. In line with enabling slow datanode detection, when we have a 
> failing drive that has not yet failed outright, or has uncorrectable sectors, 
> I want to be able to run a command to force-fail a datanode volume based on 
> its storageID or target storage location (a.k.a. mount point).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15289) Allow viewfs mounts with HDFS/HCFS scheme and centralized mount table

2020-09-12 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194692#comment-17194692
 ] 

Xiaoqiao He commented on HDFS-15289:


[~umamaheswararao] Thanks for your great work here. We are preparing the 3.2.2 
release. This task's target versions include 3.2.2, and there are 5 sub-tasks 
not resolved yet; I just wonder if we could postpone them to 3.2.3? 
Pending your response. Thanks again.

> Allow viewfs mounts with HDFS/HCFS scheme and centralized mount table
> -
>
> Key: HDFS-15289
> URL: https://issues.apache.org/jira/browse/HDFS-15289
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: fs
>Affects Versions: 3.2.0
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
>Priority: Major
> Attachments: ViewFSOverloadScheme - V1.0.pdf, ViewFSOverloadScheme.png
>
>
> ViewFS provides flexibility to mount different filesystem types with mount 
> points configuration table. This approach is solving the scalability 
> problems, but users need to reconfigure the filesystem to ViewFS and to its 
> scheme.  This will be problematic in the case of paths persisted in meta 
> stores, ex: Hive. In systems like Hive, it will store uris in meta store. So, 
> changing the file system scheme will create a burden to upgrade/recreate meta 
> stores. In our experience many users are not ready to change that.  
> Router based federation is another implementation to provide coordinated 
> mount points for HDFS federation clusters. Even though this provides 
> flexibility to handle mount points easily, this will not allow 
> other(non-HDFS) file systems to mount. So, this does not solve the purpose 
> when users want to mount external(non-HDFS) filesystems.
> So, the problem here is: Even though many users want to adapt to the scalable 
> fs options available, technical challenges of changing schemes (ex: in meta 
> stores) in deployments are obstructing them. 
> So, we propose to allow hdfs scheme in ViewFS like client side mount system 
> and provision user to create mount links without changing URI paths. 
> I will upload detailed design doc shortly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14349) Edit log may be rolled more frequently than necessary with multiple Standby nodes

2020-09-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14349:
---
Target Version/s: 3.2.3  (was: 3.2.2)

Updated the target version to 3.2.3 in preparation for the 3.2.2 release. 
Please let me know if this is a blocker for you. Thanks.

> Edit log may be rolled more frequently than necessary with multiple Standby 
> nodes
> -
>
> Key: HDFS-14349
> URL: https://issues.apache.org/jira/browse/HDFS-14349
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, hdfs, qjm
>Reporter: Erik Krogen
>Assignee: Ekanth Sethuramalingam
>Priority: Major
>  Labels: multi-sbnn
>
> When HDFS-14317 was fixed, we tackled the problem that in a cluster with 
> in-progress edit log tailing enabled, a Standby NameNode may _never_ roll the 
> edit logs, which can eventually cause data loss.
> Unfortunately, in the process, it was made so that if there are multiple 
> Standby NameNodes, they will all roll the edit logs at their specified 
> frequency, so the edit log will be rolled X times more frequently than they 
> should be (where X is the number of Standby NNs). This is not as bad as the 
> original bug since rolling frequently does not affect correctness or data 
> availability, but may degrade performance by creating more edit log segments 
> than necessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15559) Complement initialize member variables in TestHdfsConfigFields#initializeMemberVariables

2020-09-12 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194774#comment-17194774
 ] 

Xiaoqiao He commented on HDFS-15559:


Thanks [~leosun08], [^HDFS-15559.002.patch] LGTM. Considering there are 7 
failed unit tests which seem unrelated to this change, I triggered CI again. 
Let's wait for another build result.

> Complement initialize member variables in 
> TestHdfsConfigFields#initializeMemberVariables
> 
>
> Key: HDFS-15559
> URL: https://issues.apache.org/jira/browse/HDFS-15559
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-15559.001.patch, HDFS-15559.002.patch
>
>
> There are some missing constant interfaces in 
> TestHdfsConfigFields#initializeMemberVariables
> {code:java}
> @Override
> public void initializeMemberVariables() {
>   xmlFilename = new String("hdfs-default.xml");
>   configurationClasses = new Class[] { HdfsClientConfigKeys.class,
>   HdfsClientConfigKeys.Failover.class,
>   HdfsClientConfigKeys.StripedRead.class, DFSConfigKeys.class,
>   HdfsClientConfigKeys.BlockWrite.class,
>   HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.class };
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15559) Complement initialize member variables in TestHdfsConfigFields#initializeMemberVariables

2020-09-14 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195487#comment-17195487
 ] 

Xiaoqiao He commented on HDFS-15559:


[~leosun08] I tried running the failed unit tests locally, and most of them 
passed. They seem unrelated. Please take another look. 

> Complement initialize member variables in 
> TestHdfsConfigFields#initializeMemberVariables
> 
>
> Key: HDFS-15559
> URL: https://issues.apache.org/jira/browse/HDFS-15559
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-15559.001.patch, HDFS-15559.002.patch
>
>
> There are some missing constant interfaces in 
> TestHdfsConfigFields#initializeMemberVariables
> {code:java}
> @Override
> public void initializeMemberVariables() {
>   xmlFilename = new String("hdfs-default.xml");
>   configurationClasses = new Class[] { HdfsClientConfigKeys.class,
>   HdfsClientConfigKeys.Failover.class,
>   HdfsClientConfigKeys.StripedRead.class, DFSConfigKeys.class,
>   HdfsClientConfigKeys.BlockWrite.class,
>   HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.class };
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15559) Complement initialize member variables in TestHdfsConfigFields#initializeMemberVariables

2020-09-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15559:
---
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed to trunk. Thanks [~leosun08] for your report and contributions!

> Complement initialize member variables in 
> TestHdfsConfigFields#initializeMemberVariables
> 
>
> Key: HDFS-15559
> URL: https://issues.apache.org/jira/browse/HDFS-15559
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: HDFS-15559.001.patch, HDFS-15559.002.patch
>
>
> There are some missing constant interfaces in 
> TestHdfsConfigFields#initializeMemberVariables
> {code:java}
> @Override
> public void initializeMemberVariables() {
>   xmlFilename = new String("hdfs-default.xml");
>   configurationClasses = new Class[] { HdfsClientConfigKeys.class,
>   HdfsClientConfigKeys.Failover.class,
>   HdfsClientConfigKeys.StripedRead.class, DFSConfigKeys.class,
>   HdfsClientConfigKeys.BlockWrite.class,
>   HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.class };
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15548) Allow configuring DISK/ARCHIVE storage types on same device mount

2020-09-16 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197337#comment-17197337
 ] 

Xiaoqiao He commented on HDFS-15548:


Thanks [~LeonG] for involving me here. I would like to take a look in the next couple of days.

> Allow configuring DISK/ARCHIVE storage types on same device mount
> -
>
> Key: HDFS-15548
> URL: https://issues.apache.org/jira/browse/HDFS-15548
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Reporter: Leon Gao
>Assignee: Leon Gao
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We can allow configuring DISK/ARCHIVE storage types on the same device mount 
> on two separate directories.
> Users should be able to configure the capacity for each. Also, the datanode 
> usage report should report stats correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15589) Huge PostponedMisreplicatedBlocks can't decrease immediately when start namenode after datanode

2020-09-21 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199870#comment-17199870
 ] 

Xiaoqiao He commented on HDFS-15589:


Thanks [~zhengchenyu] for your report. Just wondering, is there any impact on 
the NameNode when PMB (abbr. for `PostponedMisreplicatedBlocks`) stays at a 
large number for a long time? The largest PMB count I have seen in practice is 
near 100M, and I did not meet any performance issue on my internal branch. What 
issues do you meet? Thanks.

> Huge PostponedMisreplicatedBlocks can't decrease immediately when start 
> namenode after datanode
> ---
>
> Key: HDFS-15589
> URL: https://issues.apache.org/jira/browse/HDFS-15589
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
> Environment: CentOS 7
>Reporter: zhengchenyu
>Priority: Major
>
> In our test cluster, I restarted my namenode. Then I found many 
> PostponedMisreplicatedBlocks which did not decrease immediately. 
> I searched the logs and found entries like the following. 
> {code:java}
> 2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=c6a9934f-afd4-4437-b976-fed55173ce57, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> 2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=aee144f1-2082-4bca-a92b-f3c154a71c65, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> 2020-09-21 17:02:37,029 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=d152fa5b-1089-4bfc-b9c4-e3a7d98c7a7b, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> 2020-09-21 17:02:37,156 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=5cffc1fe-ace9-4af8-adfc-6002a7f5565d, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> 2020-09-21 17:02:37,161 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=9980d8e1-b0d9-4657-b97d-c803f82c1459, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> 2020-09-21 17:02:37,197 DEBUG BlockStateChange: *BLOCK* NameNode.blockReport: 
> from DatanodeRegistration(xx.xx.xx.xx:9866, 
> datanodeUuid=77ff3f5e-37f0-405f-a16c-166311546cae, infoPort=9864, 
> infoSecurePort=0, ipcPort=9867, 
> storageInfo=lv=-57;cid=CID-9f6d0a32-e51c-459a-9f65-6e7b5791ee25;nsid=1016509846;c=1592578350834),
>  reports.length=12
> {code}
> Note: the test cluster only has 6 datanodes.
> You will see the block report was called before "Marking all datanodes as 
> stale", which is logged by startActiveServices. But 
> DatanodeStorageInfo.blockContentsStale is only set to false in a block report, 
> and then startActiveServices marks all datanodes as stale. So the datanodes 
> stay stale until the next block report, and PostponedMisreplicatedBlocks stays 
> at a huge number.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock

2020-09-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200157#comment-17200157
 ] 

Xiaoqiao He commented on HDFS-15382:


[~weichiu],[~sodonnell],[~linyiqun] do you have time to review this solution? 
It works well in our internal cluster. I believe this is a useful feature if we 
can push it forward.
Any suggestions and comments are welcome. We will prepare a new patch based on 
trunk if we come to an agreement on this solution.

> Split FsDatasetImpl from blockpool lock to blockpool volume lock 
> -
>
> Key: HDFS-15382
> URL: https://issues.apache.org/jira/browse/HDFS-15382
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Aiphago
>Assignee: Aiphago
>Priority: Major
> Fix For: 3.2.1
>
> Attachments: HDFS-15382-sample.patch, image-2020-06-02-1.png, 
> image-2020-06-03-1.png
>
>
> In HDFS-15180 we split the lock to block pool granularity. But when one 
> volume is under heavy load, it blocks other requests in the same block pool 
> but on different volumes. So we split the lock into two levels to avoid this 
> and to improve datanode performance.
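
For illustration, a minimal sketch of the two-level locking idea, assuming a simple lock-per-(block pool, volume) map; this is not the FsDatasetImpl patch itself.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class PoolVolumeLockManager {
  // One read/write lock per (block pool, volume) pair, created on demand.
  private final Map<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

  ReadWriteLock lockFor(String bpid, String volumeId) {
    return locks.computeIfAbsent(bpid + "/" + volumeId,
        k -> new ReentrantReadWriteLock(true));
  }
}
{code}

A write on volume volA of block pool bp1 would then take lockFor("bp1", "volA").writeLock() and no longer contend with operations on other volumes of the same pool, which is the decoupling the comment above argues for.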



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15594) Lazy calculate live datanodes in safe mode tip

2020-09-23 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201259#comment-17201259
 ] 

Xiaoqiao He commented on HDFS-15594:


Thanks [~NickyYe] and [~elgoiri], this is useful work IMO. Just one concern: 
will it confuse end users that, while the block threshold has not yet been met, 
the live datanode count is not calculated during the startup phase?
Anyway, I have always thought the block threshold is fair enough for the 
general NameNode restart case, so +1 for this improvement from my side.

> Lazy calculate live datanodes in safe mode tip
> --
>
> Key: HDFS-15594
> URL: https://issues.apache.org/jira/browse/HDFS-15594
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Ye Ni
>Assignee: Ye Ni
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The safe mode tip is printed every 20 seconds.
> This change defers calculating live datanodes until the reported block 
> threshold is met.
>  Old 
> {code:java}
> STATE* Safe mode ON. The reported blocks 111054015 needs additional 27902753 
> blocks to reach the threshold 0.9990 of total blocks 139095856. The number of 
> live datanodes 2531 has reached the minimum number 1. Safe mode will be 
> turned off automatically once the thresholds have been reached.{code}
> New 
> {code:java}
> STATE* Safe mode ON. 
> The reported blocks 134851250 needs additional 3218494 blocks to reach the 
> threshold 0.9990 of total blocks 138207947.
> The number of live datanodes is not calculated since reported blocks hasn't 
> reached the threshold. Safe mode will be turned off automatically once the 
> thresholds have been reached.{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2020-09-27 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203015#comment-17203015
 ] 

Xiaoqiao He commented on HDFS-14997:


[~Captainhzy][~sodonnell] Thanks for your comments. 
BPOfferService#mReadWriteLock is indeed another contention point. In my 
practice, processing commands asynchronously could mitigate most datanode 
issues, but not completely. Sharing any datanode traces would help to dig into 
this case further.
{quote}I have an idea. It can put the `updateActorStatesFromHeartbeat` function 
to `CommandProcessingThread`. In this case, it will not block heartbeat due to 
`writeLock`.{quote}
Would you like to file another JIRA and try to submit a patch? Thanks.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions in the #BPServiceActor main processing flow: 
> reporting (#sendHeartbeat, #blockReport, #cacheReport) and #processCommand. If 
> processCommand takes a long time it blocks the report flow, and processCommand 
> can take a long time (over 1000s in the worst case I have met) when the IO 
> load of the DataNode is very high. Since some IO operations are under 
> #datasetLock, it has to wait a long time to acquire #datasetLock when 
> processing some commands (such as #DNA_INVALIDATE). In such cases, the 
> #heartbeat is not sent to the NameNode in time, which triggers other disasters.
> I propose to make #processCommand asynchronous so that it does not block 
> #BPServiceActor from sending heartbeats back to the NameNode under high IO 
> load.
> Notes:
> 1. Lifeline could be one effective solution; however, some old branches do 
> not support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should 
> solve that in another JIRA.
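
A generic sketch of the asynchronous pattern the proposal describes (the class below is illustrative, not the actual BPServiceActor change): the heartbeat loop enqueues commands and returns immediately, while a dedicated worker thread executes them, so a command stuck on a slow lock cannot delay the next heartbeat.

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class CommandProcessingSketch {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final Thread worker = new Thread(this::drain, "CommandProcessingThread");

  void start() {
    worker.setDaemon(true);
    worker.start();
  }

  // Called from the heartbeat loop; never blocks on command execution.
  void enqueue(Runnable command) {
    queue.offer(command);
  }

  // Dedicated thread drains and executes commands one by one.
  private void drain() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        queue.take().run();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}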



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14553) Make queue size of BlockReportProcessingThread configurable

2020-09-27 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203018#comment-17203018
 ] 

Xiaoqiao He commented on HDFS-14553:


[~xyao] Thanks for your comments.
{quote}does the new configurable block report queue help mitigate the "Block 
report queue is full" issue?{quote}
I think the 'Block report queue is full' mentioned here is about NameNode 
restart and block report floods, right? If so, I think changing this config 
could be helpful. I want to note that HDFS-7923 could be another, better 
solution.
In my practice, I set this queue size to 4096. The queue's load is still very 
high when the NameNode restarts, but I have not seen any issues when processing 
IBRs. Thanks.

> Make queue size of BlockReportProcessingThread configurable
> ---
>
> Key: HDFS-14553
> URL: https://issues.apache.org/jira/browse/HDFS-14553
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-14553.001.patch, HDFS-14553.branch-3.2.patch
>
>
> The ArrayBlockingQueue size of BlockReportProcessingThread is currently a 
> static 1024; I propose to make this queue size configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15603) RBF: Fix getLocationsForPath twice in create operation

2020-09-28 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203199#comment-17203199
 ] 

Xiaoqiao He commented on HDFS-15603:


[~wangzhaohui] Great catch here. Would you like to add a unit test to verify 
this change?

> RBF: Fix getLocationsForPath twice in create operation
> --
>
> Key: HDFS-15603
> URL: https://issues.apache.org/jira/browse/HDFS-15603
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: wangzhaohui
>Assignee: wangzhaohui
>Priority: Major
> Attachments: HDFS-15603-001.patch
>
>
> getLocationsForPath is called in create(), but calling getLocationsForPath 
> again in getCreateLocation() is not necessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock

2020-09-29 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15382:
---
Fix Version/s: (was: 3.2.1)

> Split FsDatasetImpl from blockpool lock to blockpool volume lock 
> -
>
> Key: HDFS-15382
> URL: https://issues.apache.org/jira/browse/HDFS-15382
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Aiphago
>Assignee: Aiphago
>Priority: Major
> Attachments: HDFS-15382-sample.patch, image-2020-06-02-1.png, 
> image-2020-06-03-1.png
>
>
> In HDFS-15180 we split the lock to block pool granularity. But when one 
> volume is under heavy load, it blocks other requests in the same block pool 
> but on different volumes. So we split the lock into two levels to avoid this 
> and to improve datanode performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock

2020-09-29 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204434#comment-17204434
 ] 

Xiaoqiao He commented on HDFS-15382:


Thanks [~sodonnell],[~Jiang Xin] for your comments.
HDFS-15150 and HDFS-15160 are very interesting improvements for the DataNode, and I 
think the results are also impressive. But that solution does not solve the coupling 
issue between different BlockPools and different Volumes when the Federation 
feature is enabled. In particular, when the load on one BlockPool/Volume is very high, 
read/write operations on the other BlockPools/Volumes will be blocked, since some IO 
operations that can hold the lock for a long time, such as 
#updateReplicaUnderRecovery, still run under it. In our internal branch, this issue is very critical. 
Please see 
https://drive.google.com/file/d/1eaE8vSEhIli0H3j2eDiPJNYuKAC0MFgu/view?usp=sharing
 for the details if you are interested.
IMO, the key of this improvement is decoupling BlockPools and Volumes to 
improve performance further; combined with HDFS-15150 and HDFS-15160, it will get an even 
better result.
About the demo patch, if we reach agreement, we will split it into subtasks to push this 
feature forward. cc [~Aiphag0]
Thanks [~sodonnell] and [~LiJinglun] again. More discussion and 
suggestions are welcome.
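To make the two-level idea above concrete, here is a minimal sketch of a lock table keyed first by block pool and then by volume; the class and method names are illustrative assumptions, not the code in the demo patch.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustration of a two-level lock: one lock per (blockPoolId, volumeId) pair, so heavy
// IO on one volume does not block other volumes, even inside the same block pool.
public class TwoLevelLockTable {
  private final Map<String, Map<String, ReentrantReadWriteLock>> locks =
      new ConcurrentHashMap<>();

  public ReentrantReadWriteLock get(String blockPoolId, String volumeId) {
    return locks
        .computeIfAbsent(blockPoolId, bp -> new ConcurrentHashMap<>())
        .computeIfAbsent(volumeId, v -> new ReentrantReadWriteLock());
  }
}
{code}

An operation such as #updateReplicaUnderRecovery would then only take the lock for its own block pool and volume, so the rest of the dataset stays unblocked.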

> Split FsDatasetImpl from blockpool lock to blockpool volume lock 
> -
>
> Key: HDFS-15382
> URL: https://issues.apache.org/jira/browse/HDFS-15382
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Aiphago
>Assignee: Aiphago
>Priority: Major
> Attachments: HDFS-15382-sample.patch, image-2020-06-02-1.png, 
> image-2020-06-03-1.png
>
>
> In HDFS-15180 we split the lock to blockpool granularity. But when one volume is under 
> heavy load, it will block other requests that are in the same blockpool but on a different 
> volume. So we split the lock into two levels to avoid this and to improve 
> datanode performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14553) Make queue size of BlockReportProcessingThread configurable

2020-09-29 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204447#comment-17204447
 ] 

Xiaoqiao He commented on HDFS-14553:


Hi [~xyao], IIRC we use the HDFS-5153 feature and set the split threshold to `100` by 
default, so block reports are split. Our prod cluster scale is far beyond 10K, and I have not met a 
very serious issue with the report queue. Does the processing time change when the report queue is 
full?
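For context, a minimal sketch of setting such a split threshold programmatically; the key dfs.blockreport.split.threshold is believed to be the relevant setting, but treat the name and value here as an assumption to verify against your release.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch only: lower the split threshold so a DataNode holding more than 100 blocks
// sends one block report per storage instead of a single combined report.
public class BlockReportSplitExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setLong("dfs.blockreport.split.threshold", 100L);
    System.out.println("split threshold = "
        + conf.getLong("dfs.blockreport.split.threshold", 1_000_000L));
  }
}
{code}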

> Make queue size of BlockReportProcessingThread configurable
> ---
>
> Key: HDFS-14553
> URL: https://issues.apache.org/jira/browse/HDFS-14553
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-14553.001.patch, HDFS-14553.branch-3.2.patch
>
>
> The ArrayBlockingQueue size of BlockReportProcessingThread is currently a static 1024; 
> I propose to make this queue size configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock

2020-09-30 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204434#comment-17204434
 ] 

Xiaoqiao He edited comment on HDFS-15382 at 9/30/20, 8:40 AM:
--

Thanks [~sodonnell],[~LiJinglun], [~Jiang Xin] for your comments.
HDFS-15150 and HDFS-15160 are very interesting improvements for the DataNode, and I 
think the results are also impressive. But that solution does not solve the coupling 
issue between different BlockPools and different Volumes when the Federation 
feature is enabled. In particular, when the load on one BlockPool/Volume is very high, 
read/write operations on the other BlockPools/Volumes will be blocked, since some IO 
operations that can hold the lock for a long time, such as 
#updateReplicaUnderRecovery, still run under it. In our internal branch, this issue is very critical. 
Please see 
https://drive.google.com/file/d/1eaE8vSEhIli0H3j2eDiPJNYuKAC0MFgu/view?usp=sharing
 for the details if you are interested.
IMO, the key of this improvement is decoupling BlockPools and Volumes to 
improve performance further; combined with HDFS-15150 and HDFS-15160, it will get an even 
better result.
About the demo patch, if we reach agreement, we will split it into subtasks to push this 
feature forward. cc [~Aiphag0]
Thanks [~sodonnell] and [~LiJinglun] again. More discussion and 
suggestions are welcome.


was (Author: hexiaoqiao):
Thanks [~sodonnell],[~Jiang Xin] for your comments.
HDFS-15150 and HDFS-15160 are very interesting improvements for the DataNode, and I 
think the results are also impressive. But that solution does not solve the coupling 
issue between different BlockPools and different Volumes when the Federation 
feature is enabled. In particular, when the load on one BlockPool/Volume is very high, 
read/write operations on the other BlockPools/Volumes will be blocked, since some IO 
operations that can hold the lock for a long time, such as 
#updateReplicaUnderRecovery, still run under it. In our internal branch, this issue is very critical. 
Please see 
https://drive.google.com/file/d/1eaE8vSEhIli0H3j2eDiPJNYuKAC0MFgu/view?usp=sharing
 for the details if you are interested.
IMO, the key of this improvement is decoupling BlockPools and Volumes to 
improve performance further; combined with HDFS-15150 and HDFS-15160, it will get an even 
better result.
About the demo patch, if we reach agreement, we will split it into subtasks to push this 
feature forward. cc [~Aiphag0]
Thanks [~sodonnell] and [~LiJinglun] again. More discussion and 
suggestions are welcome.

> Split FsDatasetImpl from blockpool lock to blockpool volume lock 
> -
>
> Key: HDFS-15382
> URL: https://issues.apache.org/jira/browse/HDFS-15382
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Aiphago
>Assignee: Aiphago
>Priority: Major
> Attachments: HDFS-15382-sample.patch, image-2020-06-02-1.png, 
> image-2020-06-03-1.png
>
>
> In HDFS-15180 we split the lock to blockpool granularity. But when one volume is under 
> heavy load, it will block other requests that are in the same blockpool but on a different 
> volume. So we split the lock into two levels to avoid this and to improve 
> datanode performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode

2020-10-19 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15641:
---
Target Version/s: 3.2.2, 3.2.3

> DataNode could meet deadlock if invoke refreshNameNode
> --
>
> Key: HDFS-15641
> URL: https://issues.apache.org/jira/browse/HDFS-15641
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Hongbing Wang
>Assignee: Hongbing Wang
>Priority: Critical
> Attachments: HDFS-15641.000.test.patch, deadlock.png, jstack.log
>
>
> The DataNode could meet a deadlock when invoking `hdfs dfsadmin -refreshNamenodes 
> hostname:50020` to register a new namespace in a federation environment.
> The jstack is shown in jstack.log.
>  The specific process is shown in Figure deadlock.png.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode

2020-10-19 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15641:
---
Target Version/s: 3.2.2, 3.3.1, 3.4.0, 3.2.3  (was: 3.2.2, 3.2.3)

> DataNode could meet deadlock if invoke refreshNameNode
> --
>
> Key: HDFS-15641
> URL: https://issues.apache.org/jira/browse/HDFS-15641
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Hongbing Wang
>Assignee: Hongbing Wang
>Priority: Critical
> Attachments: HDFS-15641.000.test.patch, deadlock.png, jstack.log
>
>
> The DataNode could meet a deadlock when invoking `hdfs dfsadmin -refreshNamenodes 
> hostname:50020` to register a new namespace in a federation environment.
> The jstack is shown in jstack.log.
>  The specific process is shown in Figure deadlock.png.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode

2020-10-19 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216703#comment-17216703
 ] 

Xiaoqiao He commented on HDFS-15641:


Thanks [~wanghongbing] for your report. Great catch here! Marking this issue as 
targeting 3.2.2 and 3.2.3.
I just wonder whether this issue also exists on trunk. Pending the fix patch. Thanks 
[~wanghongbing] again.

> DataNode could meet deadlock if invoke refreshNameNode
> --
>
> Key: HDFS-15641
> URL: https://issues.apache.org/jira/browse/HDFS-15641
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Hongbing Wang
>Assignee: Hongbing Wang
>Priority: Critical
> Attachments: HDFS-15641.000.test.patch, deadlock.png, jstack.log
>
>
> The DataNode could meet a deadlock when invoking `hdfs dfsadmin -refreshNamenodes 
> hostname:50020` to register a new namespace in a federation environment.
> The jstack is shown in jstack.log.
>  The specific process is shown in Figure deadlock.png.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15640) RBF: Add fast distcp threshold to FedBalance.

2020-10-19 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216705#comment-17216705
 ] 

Xiaoqiao He commented on HDFS-15640:


Thanks [~LiJinglun] for involving me here. [~linyiqun] would be a better fit to take 
this review. If Yiqun doesn't have time, I will take it this week. Thanks 
[~LiJinglun].

> RBF: Add fast distcp threshold to FedBalance.
> -
>
> Key: HDFS-15640
> URL: https://issues.apache.org/jira/browse/HDFS-15640
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15640.001.patch
>
>
> Currently the DistCpProcedure must submit distcp jobs round by round until 
> there is no diff before it can go to the final distcp stage. This condition is very 
> strict. If a distcp job can finish within an acceptable period then we don't need 
> to wait until there is no diff. For example, if 3 consecutive distcp jobs all finish 
> within 10 minutes then we can predict that the final distcp will also finish 
> within 10 minutes, so we can start the final distcp directly.
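As a rough illustration of the proposed threshold (the class and names are hypothetical, not the FedBalance code), the decision could be tracked like this:

{code:java}
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: track the durations of recent distcp rounds and jump to the
// final distcp once N consecutive rounds finished under a configured threshold.
public class FastDistcpThreshold {
  private final int window;               // e.g. 3 consecutive jobs
  private final Duration threshold;       // e.g. 10 minutes
  private final Deque<Duration> recent = new ArrayDeque<>();

  public FastDistcpThreshold(int window, Duration threshold) {
    this.window = window;
    this.threshold = threshold;
  }

  public void record(Duration jobDuration) {
    recent.addLast(jobDuration);
    if (recent.size() > window) {
      recent.removeFirst();
    }
  }

  public boolean readyForFinalDistcp() {
    return recent.size() == window
        && recent.stream().allMatch(d -> d.compareTo(threshold) <= 0);
  }
}
{code}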



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode

2020-10-20 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15641:
---
Status: Patch Available  (was: Open)

> DataNode could meet deadlock if invoke refreshNameNode
> --
>
> Key: HDFS-15641
> URL: https://issues.apache.org/jira/browse/HDFS-15641
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Hongbing Wang
>Assignee: Hongbing Wang
>Priority: Critical
> Attachments: HDFS-15641.001.patch, deadlock.png, jstack.log
>
>
> The DataNode could meet a deadlock when invoking `hdfs dfsadmin -refreshNamenodes 
> hostname:50020` to register a new namespace in a federation environment.
> The jstack is shown in jstack.log.
>  The specific process is shown in Figure deadlock.png.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11546) Federation Router RPC server

2020-10-20 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-11546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217606#comment-17217606
 ] 

Xiaoqiao He commented on HDFS-11546:


[~elgoiri] Hi elgoiri, while tracing the RouterRpcClient source code, I found that this 
task offers three ways to invoke a proxy call: 
invokeSingle/invokeSequential/invokeConcurrent. invokeSingle 
and invokeConcurrent are easy to understand and make sense. But I am 
confused about what {{invokeSequential}} is used for and why it was introduced here; 
IMO any sequential invoke could be replaced by invokeSingle, invokeConcurrent, or a 
combination of them. Sorry, I have not dug into the original design 
document for this invoke method. Thanks.
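For readers following along, a generic sketch of what a sequential invocation usually means (try each candidate location in order and stop at the first acceptable result); this is an assumption about the general pattern, not the actual RouterRpcClient implementation.

{code:java}
import java.io.IOException;
import java.util.List;
import java.util.function.Function;

// Generic illustration of "invoke sequentially": try each location in order and
// return the first successful result, instead of fanning out to all of them.
public class SequentialInvokeExample {
  public static <L, R> R invokeSequential(List<L> locations,
                                          Function<L, R> call) throws IOException {
    IOException lastError = null;
    for (L location : locations) {
      try {
        R result = call.apply(location);
        if (result != null) {
          return result;           // first acceptable answer wins
        }
      } catch (RuntimeException e) {
        lastError = new IOException("call failed for " + location, e);
      }
    }
    if (lastError != null) {
      throw lastError;
    }
    return null;                   // no location produced a result
  }
}
{code}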

> Federation Router RPC server
> 
>
> Key: HDFS-11546
> URL: https://issues.apache.org/jira/browse/HDFS-11546
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Affects Versions: HDFS-10467
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 2.9.0, 3.0.0
>
> Attachments: HDFS-11546-HDFS-10467-000.patch, 
> HDFS-11546-HDFS-10467-001.patch, HDFS-11546-HDFS-10467-002.patch, 
> HDFS-11546-HDFS-10467-003.patch, HDFS-11546-HDFS-10467-004.patch, 
> HDFS-11546-HDFS-10467-005.patch, HDFS-11546-HDFS-10467-007.patch, 
> HDFS-11546-HDFS-10467-008.patch, HDFS-11546-HDFS-10467-009.patch, 
> HDFS-11546-HDFS-10467-010.patch, HDFS-11546-HDFS-10467-011.patch
>
>
> RPC server side of the Federation Router implements ClientProtocol.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode

2020-10-20 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218097#comment-17218097
 ] 

Xiaoqiao He commented on HDFS-15641:


Thanks [~wanghongbing] for your work here. Regarding the patch, I do not think we 
can avoid the deadlock just by adjusting the start order of the `bpThread` and 
`lifelineSender` threads. Thanks.

> DataNode could meet deadlock if invoke refreshNameNode
> --
>
> Key: HDFS-15641
> URL: https://issues.apache.org/jira/browse/HDFS-15641
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Hongbing Wang
>Assignee: Hongbing Wang
>Priority: Critical
> Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, 
> deadlock.png, jstack.log
>
>
> The DataNode could meet a deadlock when invoking `hdfs dfsadmin -refreshNamenodes 
> hostname:50020` to register a new namespace in a federation environment.
> The jstack is shown in jstack.log.
>  The specific process is shown in Figure deadlock.png.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode

2020-10-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220744#comment-17220744
 ] 

Xiaoqiao He commented on HDFS-15641:


Thanks [~wanghongbing],[~ferhui] for your work here, and sorry for the late 
response.
{quote}UT passed without your fix. Could you please take a look?{quote}
+1. I just checked the added unit test and found that it passes without the other changes.
I suggest adding a comment, or it will confuse other readers. We should 
create bpThread after lifelineSender starts, if possible. 
{quote}cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3).build();{quote}
For the added unit test, is one datanode enough?
We can submit an addendum patch to improve it if possible, since v003 has already been committed.
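A minimal sketch of a single-datanode setup for such a test; only the MiniDFSCluster usage is shown, and the surrounding test class and assertions are assumptions.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

// Sketch: one datanode is usually enough for a check that only exercises the
// BPOfferService / refreshNamenodes path on a single DataNode.
public class SingleDatanodeClusterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .numDataNodes(1)
        .build();
    try {
      cluster.waitActive();
      System.out.println("datanodes: " + cluster.getDataNodes().size());
    } finally {
      cluster.shutdown();
    }
  }
}
{code}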

> DataNode could meet deadlock if invoke refreshNameNode
> --
>
> Key: HDFS-15641
> URL: https://issues.apache.org/jira/browse/HDFS-15641
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Hongbing Wang
>Assignee: Hongbing Wang
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, 
> HDFS-15641.003.patch, deadlock.png, deadlock_fixed.png, jstack.log
>
>
> The DataNode could meet a deadlock when invoking `hdfs dfsadmin -refreshNamenodes 
> hostname:50020` to register a new namespace in a federation environment.
> The jstack is shown in jstack.log.
>  The specific process is shown in Figure deadlock.png.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit

2020-10-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220749#comment-17220749
 ] 

Xiaoqiao He commented on HDFS-15651:


Thanks [~linyiqun] for your report. Catching the error and looping forever cannot 
resolve this issue in my opinion, because the DataNode would still be serving but without the 
correct blockToken key.
In my internal version, we make the datanode process exit when it meets an error in 
CommandProcessingThread#run, but that was missed in HDFS-14997. I believe [~Aiphag0] 
could fix it if there is no objection.

> Client could not obtain block when DN CommandProcessingThread exit
> --
>
> Key: HDFS-15651
> URL: https://issues.apache.org/jira/browse/HDFS-15651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Priority: Major
>
> In our cluster, we applied the HDFS-14997 improvement.
>  We found one case where the CommandProcessingThread exits due to an OOM error. 
> The OOM error was caused by one abnormal application running on this DN 
> node.
> {noformat}
> 2020-10-18 10:27:12,604 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor 
> encountered fatal exception and exit.
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
> {noformat}
> Here the main point is that a crashed CommandProcessingThread has a very 
> bad impact: none of the NN response commands will be processed on the DN side.
> We enabled the block token to access the data, but here the DN command 
> DNA_ACCESSKEYUPDATE is not processed in time by the DN. And then we see lots of 
> Sasl errors due to key expiration in the DN log:
> {noformat}
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, 
> userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the 
> required block key (keyID=xxx) doesn't exist.]
> {noformat}
>  
> For the impact on the client side, our users receive lots of 'could not obtain 
> block' errors with BlockMissingException.
> The CommandProcessingThread is a critical thread; it should always be running.
> {code:java}
>   /**
>* CommandProcessingThread that process commands asynchronously.
>*/
>   class CommandProcessingThread extends Thread {
> private final BPServiceActor actor;
> private final BlockingQueue queue;
> ...
> @Override
> public void run() {
>   try {
> processQueue();
>   } catch (Throwable t) {
> LOG.error("{} encountered fatal exception and exit.", getName(), t);  
>  <=== should not exit this thread
>   }
> }
> {code}
> Once an unexpected error happens, better handling would be to:
>  * catch the exception, deal with the error appropriately, and let 
> processQueue continue to run
>  or
>  * exit the DN process to let the admin investigate



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15644) Failed volumes can cause DNs to stop block reporting

2020-10-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220765#comment-17220765
 ] 

Xiaoqiao He commented on HDFS-15644:


Hi [~weichiu],[~ahussein], I just found that the fix versions include 3.2.2, but the 
change was actually cherry-picked to branch-3.2. Should we also cherry-pick it to branch-3.2.2 
(which is the pending release branch)? Thanks

> Failed volumes can cause DNs to stop block reporting
> 
>
> Key: HDFS-15644
> URL: https://issues.apache.org/jira/browse/HDFS-15644
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: block placement, datanode
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>  Labels: refactor
> Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5
>
> Attachments: HDFS-15644-branch-2.10.002.patch, HDFS-15644.001.patch, 
> HDFS-15644.002.patch
>
>
> [~daryn] found a corner case where removing failed volumes can cause an NPE in 
> [FsDataSetImpl.getBlockReports()|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L1939].
> +Scenario:+
>  * Inside {{Datanode#HandleVolumeFailures()}}, removing a failed volume is a 
> 2-step process.
>  ** First it is removed from the volumes list
>  ** Later in time, the replicas are scrubbed from the volume map
>  * A concurrent thread generating blockReports may access the replicaMap and 
> reference a non-existing VolumeID.
> He made a fix for that and we have been using it on our clusters since 
> Hadoop-2.7.
> By analyzing the code, the bug is still applicable to Trunk.
>  * The path Datanode#removeVolumes() is safe because the two step process in 
> {{FsDataImpl.removeVolumes()}} 
> [FsDatasetImpl.java#L577|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L577]
>  is protected by {{datasetWriteLock}} .
>  * The path Datanode#handleVolumeFailures() is not safe because the failed 
> volume is removed from the list without acquiring 
> {{datasetWriteLock}}.[FsVolumList#239|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeList.java#L239]
> The race condition can cause the caller of getBlockReports() to throw NPE if 
> the RUR is referring to a volume that has already been removed 
> [FsDatasetImpl.java#L1976|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L1976].
> {code:java}
> case RUR:
>   ReplicaInfo orig = b.getOriginalReplica();
>   builders.get(volStorageID).add(orig);
>   break;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit

2020-10-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221107#comment-17221107
 ] 

Xiaoqiao He commented on HDFS-15651:


{quote}The blocktoken key will be updated for every keyUpdateInterval 
(dfs.block.access.key.update.interval). Once we recover the 
CommandProcessingThread, DN will get the new key from NN in the next 
keyUpdateInterval (by default is 10 hours).{quote}
That is true, but in the worst case, if the issue (including lack of memory or 
some other error) does not recover for more than 30 hours (3 * 10 hours) by 
default, the above exception will still be hit, so I think letting the DataNode process 
exit may be the safe way. Or we could use a time threshold or a counter to decide 
whether the DataNode should exit when it meets an error in CommandProcessingThread. Thanks.
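To make the counter idea concrete, here is a hypothetical sketch of a bounded-failure policy around a command-processing loop; the class, names, and termination behavior are illustrative assumptions, not the actual BPServiceActor code.

{code:java}
// Hypothetical sketch of the "counter" idea: tolerate a limited number of failures
// in the processing loop, then terminate so an operator can investigate.
public class BoundedFailurePolicy {
  private final int maxFailures;
  private int failures;

  public BoundedFailurePolicy(int maxFailures) {
    this.maxFailures = maxFailures;
  }

  public void runLoop(Runnable processQueueOnce) {
    while (true) {
      try {
        processQueueOnce.run();
        failures = 0;                       // reset after a clean pass
      } catch (Throwable t) {
        failures++;
        System.err.println("command processing failed (" + failures + "): " + t);
        if (failures >= maxFailures) {
          // In a DataNode this would be an explicit process exit so the admin notices.
          throw new RuntimeException("too many consecutive failures", t);
        }
      }
    }
  }
}
{code}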

> Client could not obtain block when DN CommandProcessingThread exit
> --
>
> Key: HDFS-15651
> URL: https://issues.apache.org/jira/browse/HDFS-15651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Priority: Major
>
> In our cluster, we applied the HDFS-14997 improvement.
>  We found one case where the CommandProcessingThread exits due to an OOM error. 
> The OOM error was caused by one abnormal application running on this DN 
> node.
> {noformat}
> 2020-10-18 10:27:12,604 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor 
> encountered fatal exception and exit.
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
> {noformat}
> Here the main point is that a crashed CommandProcessingThread has a very 
> bad impact: none of the NN response commands will be processed on the DN side.
> We enabled the block token to access the data, but here the DN command 
> DNA_ACCESSKEYUPDATE is not processed in time by the DN. And then we see lots of 
> Sasl errors due to key expiration in the DN log:
> {noformat}
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, 
> userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the 
> required block key (keyID=xxx) doesn't exist.]
> {noformat}
>  
> For the impact on the client side, our users receive lots of 'could not obtain 
> block' errors with BlockMissingException.
> The CommandProcessingThread is a critical thread; it should always be running.
> {code:java}
>   /**
>* CommandProcessingThread that process commands asynchronously.
>*/
>   class CommandProcessingThread extends Thread {
> private final BPServiceActor actor;
> private final BlockingQueue queue;
> ...
> @Override
> public void run() {
>   try {
> processQueue();
>   } catch (Throwable t) {
> LOG.error("{} encountered fatal exception and exit.", getName(), t);  
>  <=== should not exit this thread
>   }
> }
> {code}
> Once an unexpected error happens, better handling would be to:
>  * catch the exception, deal with the error appropriately, and let 
> processQueue continue to run
>  or
>  * exit the DN process to let the admin investigate



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HDFS-15654) TestBPOfferService#testMissBlocksWhenReregister is flaky

2020-10-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221146#comment-17221146
 ] 

Xiaoqiao He commented on HDFS-15654:


Thanks [~ahussein] for digging in and for the detailed description. Would you mind submitting a 
patch? Thanks.

> TestBPOfferService#testMissBlocksWhenReregister is flaky
> 
>
> Key: HDFS-15654
> URL: https://issues.apache.org/jira/browse/HDFS-15654
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Reporter: Ahmed Hussein
>Priority: Major
>
> {{TestBPOfferService.testMissBlocksWhenReregister}}  is flaky. It fails 
> randomly when the 
> following expression is not true:
> {code:java}
>   assertTrue(fullBlockReportCount == totalTestBlocks ||
>   incrBlockReportCount == totalTestBlocks);
> {code}
> There is a race condition here that relies once more on "time" to synchronize 
> between concurrent threads. The code below is causing the 
> non-deterministic execution.
> On a slow server, {{addNewBlockThread}} may not be done by the time the main 
> thread reaches the assertion call.
> {code:java}
>   // Verify FBR/IBR count is equal to generate number.
>   assertTrue(fullBlockReportCount == totalTestBlocks ||
>   incrBlockReportCount == totalTestBlocks);
> } finally {
>   addNewBlockThread.join();
>   bpos.stop();
>   bpos.join();
> {code}
> Therefore, the correct implementation should wait for the thread to finish
> {code:java}
>  // the thread finished execution.
>  addNewBlockThread.join();
>   // Verify FBR/IBR count is equal to generate number.
>   assertTrue(fullBlockReportCount == totalTestBlocks ||
>   incrBlockReportCount == totalTestBlocks);
> } finally {
>   bpos.stop();
>   bpos.join();
> {code}
> {{DataNodeFaultInjector}} needs to have a longer wait_time too. 1 second is 
> not enough to satisfy the condition.
> {code:java}
>   DataNodeFaultInjector.set(new DataNodeFaultInjector() {
> public void blockUtilSendFullBlockReport() {
>   try {
> GenericTestUtils.waitFor(() -> {
>   if(count.get() > 2000) {
> return true;
>   }
>   return false;
> }, 100, 1); // increase that waiting time to 10 seconds.
>   } catch (Exception e) {
> e.printStackTrace();
>   }
> }
>   });
> {code}
> {code:bash}
> Stacktrace
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestBPOfferService.testMissBlocksWhenReregister(TestBPOfferService.java:350)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:3

[jira] [Updated] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit

2020-10-27 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15651:
---
Assignee: Aiphago
  Status: Patch Available  (was: Open)

Try to trigger Yetus manually.

> Client could not obtain block when DN CommandProcessingThread exit
> --
>
> Key: HDFS-15651
> URL: https://issues.apache.org/jira/browse/HDFS-15651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Aiphago
>Priority: Major
> Attachments: HDFS-15651.patch
>
>
> In our cluster, we applied the HDFS-14997 improvement.
>  We found one case where the CommandProcessingThread exits due to an OOM error. 
> The OOM error was caused by one abnormal application running on this DN 
> node.
> {noformat}
> 2020-10-18 10:27:12,604 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor 
> encountered fatal exception and exit.
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
> {noformat}
> Here the main point is that a crashed CommandProcessingThread has a very 
> bad impact: none of the NN response commands will be processed on the DN side.
> We enabled the block token to access the data, but here the DN command 
> DNA_ACCESSKEYUPDATE is not processed in time by the DN. And then we see lots of 
> Sasl errors due to key expiration in the DN log:
> {noformat}
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, 
> userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the 
> required block key (keyID=xxx) doesn't exist.]
> {noformat}
>  
> For the impact on the client side, our users receive lots of 'could not obtain 
> block' errors with BlockMissingException.
> The CommandProcessingThread is a critical thread; it should always be running.
> {code:java}
>   /**
>* CommandProcessingThread that process commands asynchronously.
>*/
>   class CommandProcessingThread extends Thread {
> private final BPServiceActor actor;
> private final BlockingQueue queue;
> ...
> @Override
> public void run() {
>   try {
> processQueue();
>   } catch (Throwable t) {
> LOG.error("{} encountered fatal exception and exit.", getName(), t);  
>  <=== should not exit this thread
>   }
> }
> {code}
> Once an unexpected error happens, better handling would be to:
>  * catch the exception, deal with the error appropriately, and let 
> processQueue continue to run
>  or
>  * exit the DN process to let the admin investigate



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode

2020-10-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-15641:
--

Assignee: Xiaoqiao He  (was: Hongbing Wang)

> DataNode could meet deadlock if invoke refreshNameNode
> --
>
> Key: HDFS-15641
> URL: https://issues.apache.org/jira/browse/HDFS-15641
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Hongbing Wang
>Assignee: Xiaoqiao He
>Priority: Critical
> Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
> Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, 
> HDFS-15641.003.patch, deadlock.png, deadlock_fixed.png, jstack.log
>
>
> The DataNode could meet a deadlock when invoking `hdfs dfsadmin -refreshNamenodes 
> hostname:50020` to register a new namespace in a federation environment.
> The jstack is shown in jstack.log.
>  The specific process is shown in Figure deadlock.png.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode

2020-10-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-15641:
--

Assignee: Hongbing Wang  (was: Xiaoqiao He)

> DataNode could meet deadlock if invoke refreshNameNode
> --
>
> Key: HDFS-15641
> URL: https://issues.apache.org/jira/browse/HDFS-15641
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Hongbing Wang
>Assignee: Hongbing Wang
>Priority: Critical
> Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
> Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, 
> HDFS-15641.003.patch, deadlock.png, deadlock_fixed.png, jstack.log
>
>
> The DataNode could meet a deadlock when invoking `hdfs dfsadmin -refreshNamenodes 
> hostname:50020` to register a new namespace in a federation environment.
> The jstack is shown in jstack.log.
>  The specific process is shown in Figure deadlock.png.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15627) Audit log deletes before collecting blocks

2020-10-30 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223509#comment-17223509
 ] 

Xiaoqiao He commented on HDFS-15627:


Committed this to branch-3.2.2 and verified that it is clean.

> Audit log deletes before collecting blocks
> --
>
> Key: HDFS-15627
> URL: https://issues.apache.org/jira/browse/HDFS-15627
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: logging, namenode
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
> Attachments: HDFS-15627.001.patch
>
>
> Deletes currently collect blocks in the write lock, write the edit, 
> incrementally delete the blocks, and finally +audit log+. The order should be: collect blocks, 
> edit log, +audit log+, incremental delete. Once the edit is durable it is 
> consistent to audit log the delete. There is no sense in deferring the audit 
> into the indeterminate future.
> The problem occurred when the server hung due to large deletes, but it wasn't 
> easy to identify the problem. It should have been easily identified as 
> the first delete logged after the hang.
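A schematic sketch of the proposed ordering; the method names are placeholders standing in for the corresponding NameNode steps, not the actual code.

{code:java}
// Schematic only: the method bodies are stubs and the names are placeholders.
public class DeleteOrderingSketch {
  public void deletePath(String src) {
    Object collectedBlocks = collectBlocks(src); // 1. collect blocks (under the write lock)
    logAndSyncEdit(src);                         // 2. write the edit and make it durable
    auditLogDelete(src);                         // 3. audit log right after the edit is durable
    incrementalDelete(collectedBlocks);          // 4. incrementally delete the blocks afterwards
  }

  private Object collectBlocks(String src) { return new Object(); }
  private void logAndSyncEdit(String src) { }
  private void auditLogDelete(String src) { }
  private void incrementalDelete(Object blocks) { }
}
{code}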



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15627) Audit log deletes before collecting blocks

2020-10-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15627:
---
Fix Version/s: 3.2.2

> Audit log deletes before collecting blocks
> --
>
> Key: HDFS-15627
> URL: https://issues.apache.org/jira/browse/HDFS-15627
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: logging, namenode
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
> Attachments: HDFS-15627.001.patch
>
>
> Deletes currently collect blocks in the write lock, write the edit, 
> incrementally delete the blocks, and finally +audit log+. The order should be: collect blocks, 
> edit log, +audit log+, incremental delete. Once the edit is durable it is 
> consistent to audit log the delete. There is no sense in deferring the audit 
> into the indeterminate future.
> The problem occurred when the server hung due to large deletes, but it wasn't 
> easy to identify the problem. It should have been easily identified as 
> the first delete logged after the hang.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15618) Improve datanode shutdown latency

2020-10-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15618:
---
Fix Version/s: 3.2.2

> Improve datanode shutdown latency
> -
>
> Key: HDFS-15618
> URL: https://issues.apache.org/jira/browse/HDFS-15618
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
> Attachments: HDFS-15618-branch-3.3.004.patch, HDFS-15618.001.patch, 
> HDFS-15618.002.patch, HDFS-15618.003.patch, HDFS-15618.004.patch
>
>
> The shutdown of the Datanode has a very long latency. The block scanner waits for 5 
> minutes to join on each VolumeScanner thread.
> Since the scanners are daemon threads and do not alter the block content, it 
> is safe to skip waiting for them on shutdown of the Datanode.
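A small, generic illustration of the underlying idea (plain thread code, not the actual VolumeScanner shutdown path): a daemon worker can be joined with a short timeout instead of a long blocking join.

{code:java}
// Generic illustration: a long join on a daemon thread delays shutdown for no benefit;
// a short (or zero) timeout lets the process exit promptly.
public class DaemonJoinExample {
  public static void main(String[] args) throws InterruptedException {
    Thread scanner = new Thread(() -> {
      while (true) {
        try {
          Thread.sleep(1_000);      // stands in for background scanning work
        } catch (InterruptedException e) {
          return;                   // exit promptly when interrupted
        }
      }
    });
    scanner.setDaemon(true);
    scanner.start();

    scanner.interrupt();
    scanner.join(1_000);            // wait briefly instead of minutes
    System.out.println("scanner alive after short join: " + scanner.isAlive());
  }
}
{code}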



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15618) Improve datanode shutdown latency

2020-10-30 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223513#comment-17223513
 ] 

Xiaoqiao He commented on HDFS-15618:


Cherry-picked to branch-3.2.2 and verified locally. Thanks [~ahussein] and 
[~kihwal].

> Improve datanode shutdown latency
> -
>
> Key: HDFS-15618
> URL: https://issues.apache.org/jira/browse/HDFS-15618
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
> Attachments: HDFS-15618-branch-3.3.004.patch, HDFS-15618.001.patch, 
> HDFS-15618.002.patch, HDFS-15618.003.patch, HDFS-15618.004.patch
>
>
> The shutdown of the Datanode has a very long latency. The block scanner waits for 5 
> minutes to join on each VolumeScanner thread.
> Since the scanners are daemon threads and do not alter the block content, it 
> is safe to skip waiting for them on shutdown of the Datanode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15639) [JDK 11] Fix Javadoc errors in hadoop-hdfs-client

2020-10-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15639:
---
Fix Version/s: 3.2.2

> [JDK 11] Fix Javadoc errors in hadoop-hdfs-client
> -
>
> Key: HDFS-15639
> URL: https://issues.apache.org/jira/browse/HDFS-15639
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This is caused by HDFS-15567.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15639) [JDK 11] Fix Javadoc errors in hadoop-hdfs-client

2020-10-30 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223512#comment-17223512
 ] 

Xiaoqiao He commented on HDFS-15639:


Cherry-picked to branch-3.2.2 and verified locally. Thanks [~aajisaka] and 
[~tasanuma].

> [JDK 11] Fix Javadoc errors in hadoop-hdfs-client
> -
>
> Key: HDFS-15639
> URL: https://issues.apache.org/jira/browse/HDFS-15639
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This is caused by HDFS-15567.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15622) Deleted blocks linger in the replications queue

2020-10-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15622:
---
Fix Version/s: 3.2.2

Cherry-picked to branch-3.2.2 and verified locally. Thanks [~ahussein].

> Deleted blocks linger in the replications queue
> ---
>
> Key: HDFS-15622
> URL: https://issues.apache.org/jira/browse/HDFS-15622
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
> Attachments: HDFS-15622.001.patch, HDFS-15622.002.patch
>
>
> We had an incident where, after resolving a missing-blocks incident by 
> restarting two dead nodes, there were still 8 blocks missing, but the list was 
> empty. Metasave showed the 8 blocks were "orphaned", meaning the files were 
> already deleted. It is unclear why they were left in the replication queue.
> * The containing node was flaky and started and stopped multiple times.
> * Block allocation didn't work well due to cluster-level storage 
> space exhaustion.
> * The NN was in safe mode.
> Triggering a full block report from the node didn't have any effect. It will 
> clear up if a failover happens, as the replication queue will be reinitialized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15459) TestBlockTokenWithDFSStriped fails intermittently

2020-10-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15459:
---
Fix Version/s: 3.2.2

Cherry-picked to branch-3.2.2 and verified locally. Thanks [~ahussein].

> TestBlockTokenWithDFSStriped fails intermittently
> -
>
> Key: HDFS-15459
> URL: https://issues.apache.org/jira/browse/HDFS-15459
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>  Labels: test
> Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
> Attachments: HDFS-15459.001.patch, 
> TestBlockTokenWithDFSStriped.testRead.log
>
>
> {{TestBlockTokenWithDFSStriped}} fails intermittently on trunk with an NPE. My 
> intuition is that this failure is caused by another unit test timing out.
> {code:bash}
> [ERROR] Tests run: 4, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 94.448 s <<< FAILURE! - in 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped
> [ERROR] 
> testRead(org.apache.hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped)
>   Time elapsed: 9.455 s  <<< ERROR!
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFS.isBlockTokenExpired(TestBlockTokenWithDFS.java:633)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped.isBlockTokenExpired(TestBlockTokenWithDFSStriped.java:139)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFS.doTestRead(TestBlockTokenWithDFS.java:508)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped.testRead(TestBlockTokenWithDFSStriped.java:92)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15644) Failed volumes can cause DNs to stop block reporting

2020-10-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15644:
---
Fix Version/s: 3.2.2

Cherry-picked to branch-3.2.2 and verified locally. Thanks [~ahussein].

> Failed volumes can cause DNs to stop block reporting
> 
>
> Key: HDFS-15644
> URL: https://issues.apache.org/jira/browse/HDFS-15644
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: block placement, datanode
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>  Labels: refactor
> Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: HDFS-15644-branch-2.10.002.patch, HDFS-15644.001.patch, 
> HDFS-15644.002.patch
>
>
> [~daryn] found a corner case where removing failed volumes can cause an NPE in 
> [FsDataSetImpl.getBlockReports()|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L1939].
> +Scenario:+
>  * Inside {{Datanode#HandleVolumeFailures()}}, removing a failed volume is a 
> 2-step process.
>  ** First it is removed from the volumes list
>  ** Later in time, the replicas are scrubbed from the volume map
>  * A concurrent thread generating blockReports may access the replicaMap and 
> reference a non-existing VolumeID.
> He made a fix for that and we have been using it on our clusters since 
> Hadoop-2.7.
> By analyzing the code, the bug is still applicable to Trunk.
>  * The path Datanode#removeVolumes() is safe because the two step process in 
> {{FsDataImpl.removeVolumes()}} 
> [FsDatasetImpl.java#L577|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L577]
>  is protected by {{datasetWriteLock}} .
>  * The path Datanode#handleVolumeFailures() is not safe because the failed 
> volume is removed from the list without acquiring 
> {{datasetWriteLock}}.[FsVolumList#239|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeList.java#L239]
> The race condition can cause the caller of getBlockReports() to throw NPE if 
> the RUR is referring to a volume that has already been removed 
> [FsDatasetImpl.java#L1976|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L1976].
> {code:java}
> case RUR:
>   ReplicaInfo orig = b.getOriginalReplica();
>   builders.get(volStorageID).add(orig);
>   break;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15461) TestDFSClientRetries#testGetFileChecksum fails intermittently

2020-10-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15461:
---
Fix Version/s: 3.2.2

Cherry-picked to branch-3.2.2 and verified locally. Thanks [~ahussein].

> TestDFSClientRetries#testGetFileChecksum fails intermittently
> -
>
> Key: HDFS-15461
> URL: https://issues.apache.org/jira/browse/HDFS-15461
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: dfsclient, test
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.2.2, 3.3.1, 3.4.0, 3.2.3
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> {{TestDFSClientRetries.testGetFileChecksum}} fails intermittently on hadoop 
> trunk
> {code:bash}
> [INFO] Running org.apache.hadoop.hdfs.TestGetFileChecksum
> [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 10.491 s <<< FAILURE! - in org.apache.hadoop.hdfs.TestGetFileChecksum
> [ERROR] testGetFileChecksum(org.apache.hadoop.hdfs.TestGetFileChecksum)  Time 
> elapsed: 4.248 s  <<< ERROR!
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[127.0.0.1:52468,DS-e35b6720-8ac2-4e5e-98df-306985da6924,DISK],
>  
> DatanodeInfoWithStorage[127.0.0.1:52472,DS-91ec34d5-3f0a-494e-aed6-b01fa0131d8a,DISK]],
>  
> original=[DatanodeInfoWithStorage[127.0.0.1:52472,DS-91ec34d5-3f0a-494e-aed6-b01fa0131d8a,DISK],
>  
> DatanodeInfoWithStorage[127.0.0.1:52468,DS-e35b6720-8ac2-4e5e-98df-306985da6924,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
>   at 
> org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1304)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1372)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
>   at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:719)
> [INFO]
> [INFO] Results:
> [INFO]
> [ERROR] Errors:
> [ERROR]   TestGetFileChecksum.testGetFileChecksum » IO Failed to replace a 
> bad datanode ...
> [INFO]
> [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0
> [INFO]
> [ERROR] There are test failures.
> {code}
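As a side note on the knob named in the exception, a test or client writing through a very 
small pipeline can relax the replacement policy so that a shrinking pipeline does not abort 
the write. This is only a sketch of how that property is set (not necessarily how this test 
was ultimately fixed); to the best of my knowledge the accepted policy values are DEFAULT, 
ALWAYS and NEVER.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class RelaxPipelineRecoverySketch {
  public static Configuration newClientConf() {
    Configuration conf = new Configuration();
    // Property named in the exception above; NEVER skips adding a replacement
    // datanode during pipeline recovery, which tiny MiniDFSCluster tests often prefer.
    conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
    // Companion switch; setting it to false disables the replacement feature entirely.
    conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
    return conf;
  }

  public static void main(String[] args) {
    System.out.println(newClientConf()
        .get("dfs.client.block.write.replace-datanode-on-failure.policy"));
  }
}
{code}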



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15641) DataNode could meet deadlock if invoke refreshNameNode

2020-10-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15641:
---
Fix Version/s: 3.2.2

Cherry-picked to branch-3.2.2 and verified locally. Thanks [~wanghongbing] and 
[~ferhui].

> DataNode could meet deadlock if invoke refreshNameNode
> --
>
> Key: HDFS-15641
> URL: https://issues.apache.org/jira/browse/HDFS-15641
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Hongbing Wang
>Assignee: Hongbing Wang
>Priority: Critical
> Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
> Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, 
> HDFS-15641.003.patch, deadlock.png, deadlock_fixed.png, jstack.log
>
>
> DataNode could hit a deadlock when `hdfs dfsadmin -refreshNamenodes 
> hostname:50020` is invoked to register a new namespace in a federation 
> environment.
> The jstack output is attached as jstack.log.
>  The specific sequence is shown in deadlock.png.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-9776) TestHAAppend#testMultipleAppendsDuringCatchupTailing is flaky

2020-10-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-9776:
--
Fix Version/s: 3.2.2

Cherry-picked to branch-3.2.2 and verified locally. Thanks [~ahussein].

> TestHAAppend#testMultipleAppendsDuringCatchupTailing is flaky
> -
>
> Key: HDFS-9776
> URL: https://issues.apache.org/jira/browse/HDFS-9776
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Vinayakumar B
>Assignee: Ahmed Hussein
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
> Attachments: TestHAAppend.testMultipleAppendsDuringCatchupTailing.log
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Initial analysis of the recent test failure in 
> {{TestHAAppend#testMultipleAppendsDuringCatchupTailing}}
> ([here|https://builds.apache.org/job/PreCommit-HDFS-Build/14420/testReport/org.apache.hadoop.hdfs.server.namenode.ha/TestHAAppend/testMultipleAppendsDuringCatchupTailing/])
>  found that, if the Active NameNode goes down immediately after the truncate 
> operation but before the BlockRecovery command is sent to the datanode, 
> then the block will never be truncated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit

2020-11-02 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225151#comment-17225151
 ] 

Xiaoqiao He commented on HDFS-15651:


Thanks [~Aiphag0] for the update, v002 LGTM. [~linyiqun], do you have bandwidth 
for another review?

> Client could not obtain block when DN CommandProcessingThread exit
> --
>
> Key: HDFS-15651
> URL: https://issues.apache.org/jira/browse/HDFS-15651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Aiphago
>Priority: Major
> Attachments: HDFS-15651.001.patch, HDFS-15651.002.patch, 
> HDFS-15651.patch
>
>
> In our cluster, we applied the HDFS-14997 improvement.
>  We found one case where CommandProcessingThread exits due to an OOM error. 
> The OOM was caused by an abnormal application running on the same DN node.
> {noformat}
> 2020-10-18 10:27:12,604 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor 
> encountered fatal exception and exit.
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
> {noformat}
> The main point is that a crashed CommandProcessingThread has a very bad 
> impact: none of the commands returned by the NN will be processed on the DN side.
> We enabled block tokens for data access, but the DN command 
> DNA_ACCESSKEYUPDATE was not processed in time by the DN. We then see lots of 
> SASL errors due to key expiration in the DN log:
> {noformat}
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, 
> userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the 
> required block key (keyID=xxx) doesn't exist.]
> {noformat}
>  
> On the client side, our users receive lots of 'could not obtain block' errors 
> with BlockMissingException.
> CommandProcessingThread is a critical thread; it should always be running.
> {code:java}
>   /**
>* CommandProcessingThread that process commands asynchronously.
>*/
>   class CommandProcessingThread extends Thread {
> private final BPServiceActor actor;
> private final BlockingQueue queue;
> ...
> @Override
> public void run() {
>   try {
> processQueue();
>   } catch (Throwable t) {
> LOG.error("{} encountered fatal exception and exit.", getName(), t);  
>  <=== should not exit this thread
>   }
> }
> {code}
> Once an unexpected error happens, better handling would be to either (see 
> the sketch below):
>  * catch the exception, deal with the error appropriately, and let 
> processQueue continue to run, or
>  * exit the DN process so that an admin can investigate.
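A minimal, self-contained sketch of the first option (hypothetical names; the real loop 
lives in BPServiceActor$CommandProcessingThread): failures are caught per command, so one 
bad command cannot kill the whole processing thread.

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative command processor that survives per-command failures instead of exiting.
public class ResilientCommandProcessor implements Runnable {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private volatile boolean running = true;

  public void submit(Runnable command) {
    queue.offer(command);
  }

  public void shutdown() {
    running = false;
  }

  @Override
  public void run() {
    while (running) {
      try {
        Runnable command = queue.take();
        command.run();
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();   // preserve interrupt status and stop
        return;
      } catch (Throwable t) {
        // Log and keep going; one failing command must not end the thread.
        System.err.println("Command failed, continuing: " + t);
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ResilientCommandProcessor processor = new ResilientCommandProcessor();
    Thread worker = new Thread(processor, "CommandProcessingThread");
    worker.start();
    processor.submit(() -> { throw new RuntimeException("boom"); }); // survives this
    processor.submit(() -> System.out.println("still processing"));
    Thread.sleep(200);
    processor.shutdown();
    worker.interrupt();
    worker.join();
  }
}
{code}

Note the OOM case above: if the JVM truly cannot create new threads, catching the Throwable 
only buys time, which is why the second option, failing the DataNode fast so an admin can 
investigate, is listed as an acceptable alternative.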



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


