[jira] [Commented] (HDFS-17523) Add fine-grained locks metrics in DataSetLockManager
[ https://issues.apache.org/jira/browse/HDFS-17523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846659#comment-17846659 ] lei w commented on HDFS-17523: -- [~hexiaoqiao] Thank you for your comment; a PR will be submitted later. > Add fine-grained locks metrics in DataSetLockManager > - > > Key: HDFS-17523 > URL: https://issues.apache.org/jira/browse/HDFS-17523 > Project: Hadoop HDFS > Issue Type: Improvement > Reporter: lei w > Priority: Major > > Currently we use fine-grained locks to manage FsDataSetImpl, but we have not > added any lock-related metrics. In some cases we need lock-holding > information to understand how long a given operation holds a lock. With this > information we can also identify and optimize long-held lock operations as > early as possible. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
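The lock-hold metrics the issue asks for can be sketched as a thin wrapper around a lock that records how long each critical section is held. This is a minimal standalone sketch; the class name and counters below are hypothetical and are not the actual DataSetLockManager API:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical instrumented lock: accumulates total held time and tracks
 * the longest single hold, so slow lock-holding operations can be spotted.
 */
public class InstrumentedLock {
    private final ReentrantLock lock = new ReentrantLock();
    private final LongAdder totalHeldNanos = new LongAdder();
    private final AtomicLong maxHeldNanos = new AtomicLong();
    private long acquiredAtNanos; // valid only while the lock is held

    public void lock() {
        lock.lock();
        acquiredAtNanos = System.nanoTime();
    }

    public void unlock() {
        long held = System.nanoTime() - acquiredAtNanos;
        totalHeldNanos.add(held);
        maxHeldNanos.accumulateAndGet(held, Math::max);
        lock.unlock();
    }

    public long totalHeldMillis() {
        return TimeUnit.NANOSECONDS.toMillis(totalHeldNanos.sum());
    }

    public long maxHeldMillis() {
        return TimeUnit.NANOSECONDS.toMillis(maxHeldNanos.get());
    }
}
```

A caller would use the usual `lock(); try { ... } finally { unlock(); }` pattern, and the two getters could feed a metrics sink.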
[jira] [Created] (HDFS-17523) Add fine-grained locks metrics in DataSetLockManager
lei w created HDFS-17523: Summary: Add fine-grained locks metrics in DataSetLockManager Key: HDFS-17523 URL: https://issues.apache.org/jira/browse/HDFS-17523 Project: Hadoop HDFS Issue Type: Improvement Reporter: lei w Currently we use fine-grained locks to manage FsDataSetImpl, but we have not added any lock-related metrics. In some cases we need lock-holding information to understand how long a given operation holds a lock. With this information we can also identify and optimize long-held lock operations as early as possible.
[jira] [Updated] (HDFS-17519) Reduce lease checks when the last block is in the complete state and the penultimate block is in the committed state
[ https://issues.apache.org/jira/browse/HDFS-17519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-17519: - Summary: Reduce lease checks when the last block is in the complete state and the penultimate block is in the committed state (was: Reduce lease checks when the last block is in the complete state and the penultimate block is in the committed state in ones file.) > Reduce lease checks when the last block is in the complete state and the > penultimate block is in the committed state > > > Key: HDFS-17519 > URL: https://issues.apache.org/jira/browse/HDFS-17519 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: lei w > Priority: Major > > When the last block of a file is in the complete state and the penultimate > block is in the committed state, the LeaseMonitor will continuously check > this file's lease. This usually happens because a DataNode encountered some > anomaly and has not reported to the NameNode for a long time. So can we have > the LeaseMonitor renew the lease and thereby reduce the number of checks on > it?
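The proposal above can be sketched in isolation: when the monitor sees the COMPLETE/COMMITTED case, it renews the lease instead of re-checking it on every pass, pushing the next check out by one soft-limit period. Names and the soft-limit value here are hypothetical, not the NameNode's actual LeaseManager API:

```java
/**
 * Hypothetical simplified lease: renewal resets the clock, so a lease that
 * is stuck in the COMPLETE/COMMITTED state stops showing up as expired on
 * every LeaseMonitor pass.
 */
public class LeaseSketch {
    public static final long SOFT_LIMIT_MS = 60_000; // assumed period

    private long lastRenewalMs;

    public LeaseSketch(long nowMs) {
        this.lastRenewalMs = nowMs;
    }

    public boolean isExpired(long nowMs) {
        return nowMs - lastRenewalMs > SOFT_LIMIT_MS;
    }

    public void renew(long nowMs) {
        this.lastRenewalMs = nowMs;
    }

    /**
     * One monitor pass: for the COMPLETE/COMMITTED case, renew instead of
     * checking. Returns true only if the lease still needs processing.
     */
    public boolean checkOrRenew(long nowMs, boolean penultimateCommitted) {
        if (penultimateCommitted) {
            renew(nowMs); // skip further checks until the next soft limit
            return false;
        }
        return isExpired(nowMs);
    }
}
```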
[jira] [Commented] (HDFS-17519) Reduce lease checks when the last block is in the complete state and the penultimate block is in the committed state in ones file.
[ https://issues.apache.org/jira/browse/HDFS-17519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845401#comment-17845401 ] lei w commented on HDFS-17519: -- [~ayushsaxena] Can you give me some suggestions? > Reduce lease checks when the last block is in the complete state and the > penultimate block is in the committed state in ones file. > -- > > Key: HDFS-17519 > URL: https://issues.apache.org/jira/browse/HDFS-17519 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: lei w > Priority: Major > > When the last block of a file is in the complete state and the penultimate > block is in the committed state, the LeaseMonitor will continuously check > this file's lease. This usually happens because a DataNode encountered some > anomaly and has not reported to the NameNode for a long time. So can we have > the LeaseMonitor renew the lease and thereby reduce the number of checks on > it?
[jira] [Updated] (HDFS-17519) Reduce lease checks when the last block is in the complete state and the penultimate block is in the committed state in ones file.
[ https://issues.apache.org/jira/browse/HDFS-17519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-17519: - Summary: Reduce lease checks when the last block is in the complete state and the penultimate block is in the committed state in ones file. (was: LeaseMonitor renew the lease when the last block is in the complete state and the penultimate block is in the committed state in ones file.) > Reduce lease checks when the last block is in the complete state and the > penultimate block is in the committed state in ones file. > -- > > Key: HDFS-17519 > URL: https://issues.apache.org/jira/browse/HDFS-17519 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: lei w > Priority: Major > > When the last block of a file is in the complete state and the penultimate > block is in the committed state, the LeaseMonitor will continuously check > this file's lease. This usually happens because a DataNode encountered some > anomaly and has not reported to the NameNode for a long time. So can we have > the LeaseMonitor renew the lease and thereby reduce the number of checks on > it?
[jira] [Created] (HDFS-17519) LeaseMonitor renew the lease when the last block is in the complete state and the penultimate block is in the committed state in ones file.
lei w created HDFS-17519: Summary: LeaseMonitor renew the lease when the last block is in the complete state and the penultimate block is in the committed state in ones file. Key: HDFS-17519 URL: https://issues.apache.org/jira/browse/HDFS-17519 Project: Hadoop HDFS Issue Type: Bug Reporter: lei w When the last block of a file is in the complete state and the penultimate block is in the committed state, the LeaseMonitor will continuously check this file's lease. This usually happens because a DataNode encountered some anomaly and has not reported to the NameNode for a long time. So can we have the LeaseMonitor renew the lease and thereby reduce the number of checks on it?
[jira] [Created] (HDFS-17518) In the lease monitor, if a file is closed, we should sync the editslog
lei w created HDFS-17518: Summary: In the lease monitor, if a file is closed, we should sync the editslog Key: HDFS-17518 URL: https://issues.apache.org/jira/browse/HDFS-17518 Project: Hadoop HDFS Issue Type: Bug Reporter: lei w In the lease monitor, if a file is closed, the checkLease method will return true, and the edit log will then not be synced. In my opinion, we should sync the edit log to avoid leaving the standby NameNode without the updated state for a long time.
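The proposed behavior can be sketched with a toy edit log; the class and method names below are hypothetical stand-ins, not the FSNamesystem API. The point is that even when the file is already closed and checkLease short-circuits to true, a sync is still forced so the close edit reaches the shared edit log for the standby NameNode:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy edit log: edits sit in a pending buffer until logSync() is called. */
public class EditLogSketch {
    private final List<String> pending = new ArrayList<>();
    private final List<String> synced = new ArrayList<>();

    public void logCloseFile(String path) {
        pending.add("close " + path);
    }

    public void logSync() {
        synced.addAll(pending);
        pending.clear();
    }

    /** Lease-monitor path: returns true if the file is already closed. */
    public boolean checkLease(boolean fileClosed) {
        if (fileClosed) {
            logSync(); // proposed fix: do not skip the sync on this path
            return true;
        }
        return false;
    }

    public int syncedCount() {
        return synced.size();
    }
}
```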
[jira] [Updated] (HDFS-17408) Reduce the number of quota calculations in FSDirRenameOp
[ https://issues.apache.org/jira/browse/HDFS-17408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-17408: - Description: During the execution of the rename operation, we first calculate the quota for the source INode using verifyQuotaForRename, and at the same time we calculate the quota for the target INode. Subsequently, in RenameOperation#removeSrc, RenameOperation#removeSrc4OldRename, and RenameOperation#addSourceToDestination, the quota for the source directory is calculated again. In exceptional cases, RenameOperation#restoreDst and RenameOperation#restoreSource will also perform quota calculations for the source and target directories. Many of these quota calculations are redundant and unnecessary, so we should optimize them away. > Reduce the number of quota calculations in FSDirRenameOp > > > Key: HDFS-17408 > URL: https://issues.apache.org/jira/browse/HDFS-17408 > Project: Hadoop HDFS > Issue Type: Improvement > Reporter: lei w > Assignee: lei w > Priority: Major > Labels: pull-request-available > > During the execution of the rename operation, we first calculate the quota > for the source INode using verifyQuotaForRename, and at the same time we > calculate the quota for the target INode. Subsequently, in > RenameOperation#removeSrc, RenameOperation#removeSrc4OldRename, and > RenameOperation#addSourceToDestination, the quota for the source directory is > calculated again. In exceptional cases, RenameOperation#restoreDst and > RenameOperation#restoreSource will also perform quota calculations for the > source and target directories. Many of these quota calculations are > redundant and unnecessary, so we should optimize them away.
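The optimization idea reduces to memoization: compute the subtree quota usage once, at verifyQuotaForRename time, and let the later rename steps reuse the cached counts instead of re-walking the tree. This is a deliberately simplified sketch with hypothetical names, not FSDirRenameOp itself:

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Memoized quota usage for one rename: walkSubtree() stands in for the
 * expensive traversal, and the counter makes the saving observable.
 */
public class QuotaCache {
    public static final AtomicInteger WALKS = new AtomicInteger();

    private Long cached; // namespace+storage counts collapsed to one number

    private long walkSubtree() {
        WALKS.incrementAndGet(); // each call models one full tree walk
        return 42;               // pretend usage
    }

    public long quotaUsage() {
        if (cached == null) {
            cached = walkSubtree(); // computed once, up front
        }
        return cached; // removeSrc/restoreDst-style callers reuse this
    }
}
```

With this shape, the removeSrc, addSourceToDestination, and restore paths all hit the cache, so the subtree is traversed exactly once per rename instead of several times.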
[jira] [Created] (HDFS-17422) Enhance the stability of the unit test TestDFSAdmin
lei w created HDFS-17422: Summary: Enhance the stability of the unit test TestDFSAdmin Key: HDFS-17422 URL: https://issues.apache.org/jira/browse/HDFS-17422 Project: Hadoop HDFS Issue Type: Bug Reporter: lei w It has been observed that TestDFSAdmin frequently fails, as in [PR-6620|https://github.com/apache/hadoop/pull/6620]. The failure occurs when the test method testDecommissionDataNodesReconfig asserts on the first line of the standard output, and the content being checked does not appear on a single line. I believe we should change the method of testing. The standard output content, as printed in [PR-6620|https://github.com/apache/hadoop/pull/6620], is as follows: {panel:title=TestInformation} 2024-03-11 02:36:19,442 [main] INFO tools.TestDFSAdmin (TestDFSAdmin.java:testDecommissionDataNodesReconfig(1356)) - outsForFinishReconf first element is Reconfiguring status for node [127.0.0.1:41361]: started at Mon Mar 11 02:36:18 UTC 2024 and finished at Mon Mar 11 02:36:18 UTC 2024., all element is [Reconfiguring status for node [127.0.0.1:41361]: started at Mon Mar 11 02:36:18 UTC 2024 and finished at Mon Mar 11 02:36:18 UTC 2024., SUCCESS: Changed property dfs.datanode.data.transfer.bandwidthPerSec,From: "0", To: "1000", Reconfiguring status for node [127.0.0.1:33073]: started at Mon Mar 11 02:36:18 UTC 2024 and finished at Mon Mar 11 02:36:18 UTC 2024., SUCCESS: Changed property dfs.datanode.data.transfer.bandwidthPerSec, From: "0", To: "1000", Retrieval of reconfiguration status successful on 2 nodes, failed on 0 nodes.], node1Addr is 127.0.0.1:41361 , node2Addr is 127.0.0.1:33073. {panel}
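One way to make such an assertion robust, sketched here outside the actual test class: instead of asserting on the first line of stdout (which breaks when the message wraps onto several lines), join all captured lines and assert that the whole output contains the expected fragments. The helper below is hypothetical, not TestDFSAdmin code:

```java
import java.util.List;

/**
 * Line-wrap-tolerant output check: collapses the captured lines into one
 * string and verifies every expected fragment appears somewhere in it.
 */
public class OutputAssert {
    public static boolean containsAll(List<String> lines, String... fragments) {
        String joined = String.join(" ", lines); // undo arbitrary wrapping
        for (String f : fragments) {
            if (!joined.contains(f)) {
                return false;
            }
        }
        return true;
    }
}
```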
[jira] [Created] (HDFS-17408) Reduce the number of quota calculations in FSDirRenameOp
lei w created HDFS-17408: Summary: Reduce the number of quota calculations in FSDirRenameOp Key: HDFS-17408 URL: https://issues.apache.org/jira/browse/HDFS-17408 Project: Hadoop HDFS Issue Type: Improvement Reporter: lei w
[jira] [Updated] (HDFS-17391) Adjust the checkpoint io buffer size to the chunk size
[ https://issues.apache.org/jira/browse/HDFS-17391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-17391: - Description: Adjust the checkpoint io buffer size to the chunk size to reduce checkpoint time. Before change: 2022-07-11 07:10:50,900 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Uploaded image with txid 374700896827 to namenode at http://:50070 in 1729.465 seconds After change: 2022-07-12 08:15:55,068 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Uploaded image with txid 375717629244 to namenode at http://:50070 in 858.668 seconds was:Adjust the checkpoint io buffer size to the chunk size to reduce checkpoint time > Adjust the checkpoint io buffer size to the chunk size > -- > > Key: HDFS-17391 > URL: https://issues.apache.org/jira/browse/HDFS-17391 > Project: Hadoop HDFS > Issue Type: Improvement > Reporter: lei w > Priority: Major > > Adjust the checkpoint io buffer size to the chunk size to reduce checkpoint > time. > Before change: > 2022-07-11 07:10:50,900 INFO > org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Uploaded image with > txid 374700896827 to namenode at http://:50070 in 1729.465 seconds > After change: > 2022-07-12 08:15:55,068 INFO > org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Uploaded image with > txid 375717629244 to namenode at http://:50070 in 858.668 seconds
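The mechanism behind the reported speedup is a plain copy loop: a buffer matching the transfer chunk size means fewer, larger read/write calls than a small default buffer. A generic sketch (sizes and names are illustrative, not the TransferFsImage implementation):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Stream copy whose buffer size is chosen by the caller. */
public class ChunkedCopy {
    public static long copy(InputStream in, OutputStream out, int bufSize)
            throws IOException {
        byte[] buf = new byte[bufSize]; // e.g. the checkpoint chunk size
        long total = 0;
        int n;
        while ((n = in.read(buf)) > 0) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }

    /** Convenience for in-memory data; ByteArray streams cannot fail. */
    public static long copyBytes(byte[] data, int bufSize) {
        try {
            ByteArrayInputStream in = new ByteArrayInputStream(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            return copy(in, out, bufSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a multi-GB fsimage, raising `bufSize` from a few KB to the chunk size cuts the number of syscalls per upload by orders of magnitude, which is consistent with the before/after timings above.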
[jira] [Created] (HDFS-17391) Adjust the checkpoint io buffer size to the chunk size
lei w created HDFS-17391: Summary: Adjust the checkpoint io buffer size to the chunk size Key: HDFS-17391 URL: https://issues.apache.org/jira/browse/HDFS-17391 Project: Hadoop HDFS Issue Type: Improvement Reporter: lei w Adjust the checkpoint io buffer size to the chunk size to reduce checkpoint time
[jira] [Updated] (HDFS-17383) Datanode current block token should come from active NameNode in HA mode
[ https://issues.apache.org/jira/browse/HDFS-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-17383: - Summary: Datanode current block token should come from active NameNode in HA mode (was: Use the block token from active NameNode to transfer block in DataNode) > Datanode current block token should come from active NameNode in HA mode > > > Key: HDFS-17383 > URL: https://issues.apache.org/jira/browse/HDFS-17383 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: lei w > Priority: Major > Attachments: reproduce.diff > > > We found that a block transfer failed during a NameNode upgrade. The specific > error reported was that block token verification failed. The reason is that > during the DataNode transfer-block process, the source DataNode uses a block > token it generated itself, whose keyid comes from the ANN or SBN. However, > because the newly upgraded NN has just been started, the keyid held by the > source DataNode may not yet be held by the target DataNode, so the write > fails. See the attachment for how to reproduce this situation.
[jira] [Updated] (HDFS-17383) Use the block token from active NameNode to transfer block in DataNode
[ https://issues.apache.org/jira/browse/HDFS-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-17383: - Description: We found that a block transfer failed during a NameNode upgrade. The specific error reported was that block token verification failed. The reason is that during the DataNode transfer-block process, the source DataNode uses a block token it generated itself, whose keyid comes from the ANN or SBN. However, because the newly upgraded NN has just been started, the keyid held by the source DataNode may not yet be held by the target DataNode, so the write fails. See the attachment for how to reproduce this situation. (was: We found that transfer block failed during the namenode upgrade. The specific error reported was that the block token verification failed. The reason is that during the datanode transfer block process, the source datanode uses its own generated block token, and the keyid comes from ANN or SBN. However, because the newly upgraded NN has just been started, the keyid owned by the source datanode may not be owned by the target datanode, so the write fails. Here's how to reproduce this situation.) > Use the block token from active NameNode to transfer block in DataNode > -- > > Key: HDFS-17383 > URL: https://issues.apache.org/jira/browse/HDFS-17383 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: lei w > Priority: Major > Attachments: reproduce.diff > > > We found that a block transfer failed during a NameNode upgrade. The specific > error reported was that block token verification failed. The reason is that > during the DataNode transfer-block process, the source DataNode uses a block > token it generated itself, whose keyid comes from the ANN or SBN. However, > because the newly upgraded NN has just been started, the keyid held by the > source DataNode may not yet be held by the target DataNode, so the write > fails.
> See the attachment for how to reproduce this situation.
[jira] [Updated] (HDFS-17383) Use the block token from active NameNode to transfer block in DataNode
[ https://issues.apache.org/jira/browse/HDFS-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-17383: - Attachment: reproduce.diff > Use the block token from active NameNode to transfer block in DataNode > -- > > Key: HDFS-17383 > URL: https://issues.apache.org/jira/browse/HDFS-17383 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: lei w > Priority: Major > Attachments: reproduce.diff > > > We found that a block transfer failed during a NameNode upgrade. The specific > error reported was that block token verification failed. The reason is that > during the DataNode transfer-block process, the source DataNode uses a block > token it generated itself, whose keyid comes from the ANN or SBN. However, > because the newly upgraded NN has just been started, the keyid held by the > source DataNode may not yet be held by the target DataNode, so the write > fails. Here's how to reproduce this situation.
[jira] [Updated] (HDFS-17383) Use the block token from active NameNode to transfer block in DataNode
[ https://issues.apache.org/jira/browse/HDFS-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-17383: - Description: We found that a block transfer failed during a NameNode upgrade. The specific error reported was that block token verification failed. The reason is that during the DataNode transfer-block process, the source DataNode uses a block token it generated itself, whose keyid comes from the ANN or SBN. However, because the newly upgraded NN has just been started, the keyid held by the source DataNode may not yet be held by the target DataNode, so the write fails. Here's how to reproduce this situation. > Use the block token from active NameNode to transfer block in DataNode > -- > > Key: HDFS-17383 > URL: https://issues.apache.org/jira/browse/HDFS-17383 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: lei w > Priority: Major > > We found that a block transfer failed during a NameNode upgrade. The specific > error reported was that block token verification failed. The reason is that > during the DataNode transfer-block process, the source DataNode uses a block > token it generated itself, whose keyid comes from the ANN or SBN. However, > because the newly upgraded NN has just been started, the keyid held by the > source DataNode may not yet be held by the target DataNode, so the write > fails. Here's how to reproduce this situation.
[jira] [Created] (HDFS-17383) Use the block token from active NameNode to transfer block in DataNode
lei w created HDFS-17383: Summary: Use the block token from active NameNode to transfer block in DataNode Key: HDFS-17383 URL: https://issues.apache.org/jira/browse/HDFS-17383 Project: Hadoop HDFS Issue Type: Bug Reporter: lei w
[jira] [Updated] (HDFS-17354) Delay invoke clearStaleNamespacesInRouterStateIdContext during router start up
[ https://issues.apache.org/jira/browse/HDFS-17354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-17354: - Description: We should start the thread that clears expired namespaces in the RouterRpcServer RUNNING phase, because the StateStoreService is initialized during the initialization phase. Currently, the router throws an IOException at startup. {panel:title=Exception} 2024-01-09 16:27:06,939 WARN org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer: Could not fetch current list of namespaces. java.io.IOException: State Store does not have an interface for MembershipStore at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getStoreInterface(MembershipNamenodeResolver.java:121) at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getMembershipStore(MembershipNamenodeResolver.java:102) at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getNamespaces(MembershipNamenodeResolver.java:388) at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.clearStaleNamespacesInRouterStateIdContext(RouterRpcServer.java:434) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {panel} was: We should start the thread that clears expired namespaces in the RouterRpcServer initialization phase, because the StateStoreService is initialized during the initialization phase. Currently, the router throws an IOException at startup.
{panel:title=Exception} 2024-01-09 16:27:06,939 WARN org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer: Could not fetch current list of namespaces. java.io.IOException: State Store does not have an interface for MembershipStore at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getStoreInterface(MembershipNamenodeResolver.java:121) at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getMembershipStore(MembershipNamenodeResolver.java:102) at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getNamespaces(MembershipNamenodeResolver.java:388) at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.clearStaleNamespacesInRouterStateIdContext(RouterRpcServer.java:434) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {panel} > Delay invoke clearStaleNamespacesInRouterStateIdContext during router start > up > --- > > Key: HDFS-17354 > URL: https://issues.apache.org/jira/browse/HDFS-17354 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: lei w > Priority: Major > > We should start the thread that clears expired namespaces in the > RouterRpcServer RUNNING phase, because the StateStoreService is initialized > during the initialization phase. Currently, the router throws an IOException > at startup. > {panel:title=Exception} > 2024-01-09 16:27:06,939 WARN > org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer: Could not > fetch current list of namespaces.
> java.io.IOException: State Store does not have an interface for > MembershipStore > at > org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getStoreInterface(MembershipNamenodeResolver.java:121) > at > org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getMembershipStore(MembershipNamenodeResolver.java:102) > at > org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getNamespaces(MembershipNamenodeResolver.java:388) > at >
[jira] [Updated] (HDFS-17354) Delay invoke clearStaleNamespacesInRouterStateIdContext during router start up
[ https://issues.apache.org/jira/browse/HDFS-17354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-17354: - Description: We should start the thread that clears expired namespaces in the RouterRpcServer initialization phase, because the StateStoreService is initialized during the initialization phase. Currently, the router throws an IOException at startup. {panel:title=Exception} 2024-01-09 16:27:06,939 WARN org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer: Could not fetch current list of namespaces. java.io.IOException: State Store does not have an interface for MembershipStore at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getStoreInterface(MembershipNamenodeResolver.java:121) at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getMembershipStore(MembershipNamenodeResolver.java:102) at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getNamespaces(MembershipNamenodeResolver.java:388) at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.clearStaleNamespacesInRouterStateIdContext(RouterRpcServer.java:434) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {panel} was: We should start the thread that clears expired namespaces in the RouterRpcServer initialization phase, because the StateStoreService is initialized during the initialization phase. Currently, the router throws an IOException at startup.
{panel:title=My title} 2024-01-09 16:27:06,939 WARN org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer: Could not fetch current list of namespaces. java.io.IOException: State Store does not have an interface for MembershipStore at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getStoreInterface(MembershipNamenodeResolver.java:121) at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getMembershipStore(MembershipNamenodeResolver.java:102) at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getNamespaces(MembershipNamenodeResolver.java:388) at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.clearStaleNamespacesInRouterStateIdContext(RouterRpcServer.java:434) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {panel} > Delay invoke clearStaleNamespacesInRouterStateIdContext during router start > up > --- > > Key: HDFS-17354 > URL: https://issues.apache.org/jira/browse/HDFS-17354 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: lei w > Priority: Major > > We should start the thread that clears expired namespaces in the > RouterRpcServer initialization phase, because the StateStoreService is > initialized during the initialization phase. Currently, the router throws an > IOException at startup. > {panel:title=Exception} > 2024-01-09 16:27:06,939 WARN > org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer: Could not > fetch current list of namespaces.
> java.io.IOException: State Store does not have an interface for > MembershipStore > at > org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getStoreInterface(MembershipNamenodeResolver.java:121) > at > org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getMembershipStore(MembershipNamenodeResolver.java:102) > at > org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getNamespaces(MembershipNamenodeResolver.java:388) > at >
[jira] [Created] (HDFS-17354) Delay invoke clearStaleNamespacesInRouterStateIdContext during router start up
lei w created HDFS-17354: Summary: Delay invoke clearStaleNamespacesInRouterStateIdContext during router start up Key: HDFS-17354 URL: https://issues.apache.org/jira/browse/HDFS-17354 Project: Hadoop HDFS Issue Type: Bug Reporter: lei w We should start the thread that clears expired namespaces in the RouterRpcServer initialization phase, because StateStoreService is only initialized in that phase. Currently, the router throws an IOException during startup.
{panel:title=My title}
2024-01-09 16:27:06,939 WARN org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer: Could not fetch current list of namespaces.
java.io.IOException: State Store does not have an interface for MembershipStore
 at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getStoreInterface(MembershipNamenodeResolver.java:121)
 at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getMembershipStore(MembershipNamenodeResolver.java:102)
 at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.getNamespaces(MembershipNamenodeResolver.java:388)
 at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.clearStaleNamespacesInRouterStateIdContext(RouterRpcServer.java:434)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
{panel}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
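The startup-ordering problem in HDFS-17354 can be reduced to a readiness guard on the scheduled task. The sketch below is illustrative only (StaleNamespaceCleaner and its method names are hypothetical, not the RouterRpcServer API): the periodic cleanup becomes a quiet no-op until the state store has finished initializing, instead of throwing IOException.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: gate the periodic stale-namespace cleanup on a
// readiness flag that is set once StateStoreService initialization is done.
public class StaleNamespaceCleaner {
  private final AtomicBoolean stateStoreReady = new AtomicBoolean(false);
  private int cleanupRuns = 0;

  // Called once the state store has finished initializing.
  public void markStateStoreReady() {
    stateStoreReady.set(true);
  }

  // Body of the scheduled task: skip quietly instead of failing while
  // the MembershipStore interface is still unavailable.
  public void clearStaleNamespaces() {
    if (!stateStoreReady.get()) {
      return; // too early; try again on the next tick
    }
    cleanupRuns++; // real code would fetch namespaces and clear stale ids
  }

  public int getCleanupRuns() {
    return cleanupRuns;
  }

  public static void main(String[] args) {
    StaleNamespaceCleaner c = new StaleNamespaceCleaner();
    c.clearStaleNamespaces();  // before init: no-op, no exception
    c.markStateStoreReady();
    c.clearStaleNamespaces();  // after init: does real work
    System.out.println("cleanup runs: " + c.getCleanupRuns());
  }
}
```

The same effect could be achieved by scheduling the task only after initialization completes; the guard variant simply tolerates either ordering.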
[jira] [Created] (HDFS-17339) BPServiceActor should skip cacheReport when one blockPool does not have CacheBlock on this DataNode
lei w created HDFS-17339: Summary: BPServiceActor should skip cacheReport when one blockPool does not have CacheBlock on this DataNode Key: HDFS-17339 URL: https://issues.apache.org/jira/browse/HDFS-17339 Project: Hadoop HDFS Issue Type: Improvement Reporter: lei w Currently, a DataNode sends a cacheReport to every NameNode whenever its cache capacity is non-zero. But not every block pool actually has cached blocks on a given DataNode, so BPServiceActor should skip the cacheReport for a block pool that has no cached blocks on this DataNode. This would spare the NameNode unnecessary lock contention.
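A minimal sketch of the per-block-pool check proposed above; CacheReportPolicy and its methods are hypothetical names, not the actual BPServiceActor API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a DataNode-side lookup the actor could consult
// before issuing the cacheReport RPC for a given block pool.
public class CacheReportPolicy {
  // blockPoolId -> number of cached blocks on this DataNode
  private final Map<String, Integer> cachedBlocksPerPool = new HashMap<>();

  public void setCachedBlocks(String bpId, int count) {
    cachedBlocksPerPool.put(bpId, count);
  }

  // Send a cache report only if this block pool caches blocks here.
  public boolean shouldSendCacheReport(String bpId) {
    return cachedBlocksPerPool.getOrDefault(bpId, 0) > 0;
  }

  public static void main(String[] args) {
    CacheReportPolicy p = new CacheReportPolicy();
    p.setCachedBlocks("BP-1", 3);
    p.setCachedBlocks("BP-2", 0);
    System.out.println("BP-1 reports: " + p.shouldSendCacheReport("BP-1")); // true
    System.out.println("BP-2 reports: " + p.shouldSendCacheReport("BP-2")); // false
  }
}
```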
[jira] [Updated] (HDFS-17331) Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in federationhealth.html
[ https://issues.apache.org/jira/browse/HDFS-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-17331: - Attachment: Before fix.png > Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in > federationhealth.html > --- > > Key: HDFS-17331 > URL: https://issues.apache.org/jira/browse/HDFS-17331 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: lei w >Priority: Major > Labels: pull-request-available > Attachments: After fix.png, Before fix.png > > > Blocks are always -1 and DataNode`s version are always UNKNOWN in > federationhealth.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17331) Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in federationhealth.html
[ https://issues.apache.org/jira/browse/HDFS-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-17331: - Attachment: After fix.png > Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in > federationhealth.html > --- > > Key: HDFS-17331 > URL: https://issues.apache.org/jira/browse/HDFS-17331 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: lei w >Priority: Major > Labels: pull-request-available > Attachments: After fix.png, Before fix.png > > > Blocks are always -1 and DataNode`s version are always UNKNOWN in > federationhealth.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17331) Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in federationhealth.html
[ https://issues.apache.org/jira/browse/HDFS-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-17331: - Attachment: (was: 截屏2024-01-08 下午8.17.07.png) > Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in > federationhealth.html > --- > > Key: HDFS-17331 > URL: https://issues.apache.org/jira/browse/HDFS-17331 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: lei w >Priority: Major > Labels: pull-request-available > > Blocks are always -1 and DataNode`s version are always UNKNOWN in > federationhealth.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-17331) Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in federationhealth.html
lei w created HDFS-17331: Summary: Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in federationhealth.html Key: HDFS-17331 URL: https://issues.apache.org/jira/browse/HDFS-17331 Project: Hadoop HDFS Issue Type: Improvement Reporter: lei w Attachments: 截屏2024-01-08 下午8.17.07.png Blocks are always -1 and DataNode`s version are always UNKNOWN in federationhealth.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17331) Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in federationhealth.html
[ https://issues.apache.org/jira/browse/HDFS-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-17331: - Attachment: 截屏2024-01-08 下午8.17.07.png > Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in > federationhealth.html > --- > > Key: HDFS-17331 > URL: https://issues.apache.org/jira/browse/HDFS-17331 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: lei w >Priority: Major > Attachments: 截屏2024-01-08 下午8.17.07.png > > > Blocks are always -1 and DataNode`s version are always UNKNOWN in > federationhealth.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795580#comment-17795580 ] lei w commented on HDFS-16083: -- [~tomscut] Is this reasonable? > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Labels: pull-request-available > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.1.patch, > HDFS-16083.005.patch, activeRollEdits.png > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid Observer NameNode trigger active > namenode log roll ? We 'dfs.ha.log-roll.period' configured is 300( 5 > minutes) and active NameNode receives rollEditLog RPC as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
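The HDFS-16083 proposal amounts to gating the roll trigger on HA state: the Standby's rollEditLog request already covers the cluster, so an Observer's duplicate request can be suppressed. A minimal sketch under that assumption (LogRollTriggerPolicy is a hypothetical name, not the EditLogTailer API):

```java
// Hypothetical sketch: decide per HA state whether this NameNode's tailer
// should ask the Active to roll its edit log.
public class LogRollTriggerPolicy {
  enum HAState { ACTIVE, STANDBY, OBSERVER }

  static boolean shouldTriggerLogRoll(HAState state) {
    // The Observer's roll request duplicates the Standby's, so forbid it.
    return state == HAState.STANDBY;
  }

  public static void main(String[] args) {
    System.out.println("standby triggers roll: "
        + shouldTriggerLogRoll(HAState.STANDBY));   // true
    System.out.println("observer triggers roll: "
        + shouldTriggerLogRoll(HAState.OBSERVER));  // false
  }
}
```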
[jira] [Resolved] (HDFS-16102) Remove redundant iteration in BlockManager#removeBlocksAssociatedTo(...) to save time
[ https://issues.apache.org/jira/browse/HDFS-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w resolved HDFS-16102. -- Resolution: Invalid > Remove redundant iteration in BlockManager#removeBlocksAssociatedTo(...) to > save time > -- > > Key: HDFS-16102 > URL: https://issues.apache.org/jira/browse/HDFS-16102 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16102.001.patch > > > The current logic in removeBlocksAssociatedTo(...) is as follows: > {code:java} > void removeBlocksAssociatedTo(final DatanodeDescriptor node) { > providedStorageMap.removeDatanode(node); > for (DatanodeStorageInfo storage : node.getStorageInfos()) { > final Iterator it = storage.getBlockIterator(); > //add the BlockInfos to a new collection as the > //returned iterator is not modifiable. > Collection toRemove = new ArrayList<>(); > while (it.hasNext()) { > toRemove.add(it.next()); // First iteration : to put blocks to > another collection > } > for (BlockInfo b : toRemove) { > removeStoredBlock(b, node); // Another iteration : to remove blocks > } > } > // .. > } > {code} > In fact , we can use the first iteration to achieve this logic , so should > we remove the redundant iteration to save time and memory? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-16020) DatanodeReportType should add LIVE_NOT_DECOMMISSIONING type
[ https://issues.apache.org/jira/browse/HDFS-16020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w resolved HDFS-16020. -- Resolution: Invalid > DatanodeReportType should add LIVE_NOT_DECOMMISSIONING type > --- > > Key: HDFS-16020 > URL: https://issues.apache.org/jira/browse/HDFS-16020 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer mover, namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16020.001.patch > > > Balancer builds cluster nodes by > getDatanodeStorageReport(DatanodeReportType.LIVE) method。If the user does not > specify the exclude node list, the balancer may migrate data to the DataNode > in the decommission state. Should we filter out nodes in the decommission > state by a new DatanodeReportType(LIVE_NOT_DECOMMISSIONING) regardless of > whether the user specifies the exclude node list ? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
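The suggested LIVE_NOT_DECOMMISSIONING report type boils down to filtering the live report by admin state (the ticket was ultimately resolved as Invalid, so this is a sketch of the idea only; Node and AdminState are simplified stand-ins for the real DatanodeInfo types):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keep only live nodes that are not decommissioning
// or decommissioned, so the balancer never picks them as targets.
public class LiveNotDecommissioningFilter {
  enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

  static class Node {
    final String name;
    final AdminState state;
    Node(String name, AdminState state) { this.name = name; this.state = state; }
  }

  static List<Node> liveNotDecommissioning(List<Node> live) {
    List<Node> out = new ArrayList<>();
    for (Node n : live) {
      if (n.state == AdminState.NORMAL) {
        out.add(n); // exclude any node in a decommission state
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Node> live = new ArrayList<>();
    live.add(new Node("dn1", AdminState.NORMAL));
    live.add(new Node("dn2", AdminState.DECOMMISSION_INPROGRESS));
    System.out.println("balancing targets: " + liveNotDecommissioning(live).size());
  }
}
```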
[jira] [Updated] (HDFS-16196) Namesystem#completeFile method will log incorrect path information when router to access
[ https://issues.apache.org/jira/browse/HDFS-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16196: - Resolution: Fixed Status: Resolved (was: Patch Available) > Namesystem#completeFile method will log incorrect path information when > router to access > > > Key: HDFS-16196 > URL: https://issues.apache.org/jira/browse/HDFS-16196 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16196.001.patch > > > Router not send entire path information to namenode because > ClientProtocol#complete method`s parameter with fileId. Then NameNode will > log incorrect path information. This is very confusing, should we let the > router pass the path information or modify the log path on namenode? > completeFile log as fllow: > StateChange: DIR* completeFile: / is closed by DFSClient_NONMAPREDUC_* -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-17275) We should determine whether the block has been deleted in the block report
lei w created HDFS-17275: Summary: We should determine whether the block has been deleted in the block report Key: HDFS-17275 URL: https://issues.apache.org/jira/browse/HDFS-17275 Project: Hadoop HDFS Issue Type: Improvement Reporter: lei w Currently, blocks are deleted by the asynchronous MarkedDeleteBlockScrubber thread. During block report processing, we may perform useless block-related computations for blocks that have been marked for deletion but not yet added to invalidateBlocks.
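A minimal sketch of the proposed check; BlockReportFilter and its methods are hypothetical names, not the BlockManager API.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: skip per-block computation in the report path for
// blocks that the asynchronous deletion thread has already marked.
public class BlockReportFilter {
  private final Set<Long> markedForDeletion = new HashSet<>();

  public void markDeleted(long blockId) {
    markedForDeletion.add(blockId);
  }

  // Consulted before the per-block work during block report processing.
  public boolean shouldProcess(long blockId) {
    return !markedForDeletion.contains(blockId);
  }

  public static void main(String[] args) {
    BlockReportFilter f = new BlockReportFilter();
    f.markDeleted(1001L);
    System.out.println("block 1001 processed: " + f.shouldProcess(1001L)); // false
    System.out.println("block 1002 processed: " + f.shouldProcess(1002L)); // true
  }
}
```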
[jira] [Assigned] (HDFS-17270) Fix ZKDelegationTokenSecretManagerImpl use closed zookeeper Client to get token in some case
[ https://issues.apache.org/jira/browse/HDFS-17270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w reassigned HDFS-17270: Assignee: lei w > Fix ZKDelegationTokenSecretManagerImpl use closed zookeeper Client to get > token in some case > --- > > Key: HDFS-17270 > URL: https://issues.apache.org/jira/browse/HDFS-17270 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: lei w >Assignee: lei w >Priority: Major > Attachments: CuratorFrameworkException > > > Now, we use CuratorFramework to simplifies using ZooKeeper in > ZKDelegationTokenSecretManagerImpl and we always hold the same > zookeeperClient after initialization ZKDelegationTokenSecretManagerImpl. But > in some cases like network problem , CuratorFramework may close current > zookeeperClient and create new one. In this case , we will use a zkclient > which has been closed to get token. We encountered this situation in our > cluster,exception information in attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-17270) Fix ZKDelegationTokenSecretManagerImpl use closed zookeeper Client to get token in some case
lei w created HDFS-17270: Summary: Fix ZKDelegationTokenSecretManagerImpl use closed zookeeper Client to get token in some case Key: HDFS-17270 URL: https://issues.apache.org/jira/browse/HDFS-17270 Project: Hadoop HDFS Issue Type: Bug Reporter: lei w Attachments: CuratorFrameworkException Currently, we use CuratorFramework to simplify ZooKeeper usage in ZKDelegationTokenSecretManagerImpl, and we hold on to the same ZooKeeper client from the time ZKDelegationTokenSecretManagerImpl is initialized. But in some cases, such as a network problem, CuratorFramework may close the current client and create a new one. In that case we end up using an already-closed zkclient to get tokens. We encountered this situation in our cluster; the exception is in the attachment.
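One way to avoid holding a client that Curator has already closed is to resolve the current client at call time instead of caching one reference at initialization. The sketch below illustrates only that idea; ZkClient, TokenStore, and the Supplier wiring are hypothetical stand-ins, and no real Curator API is used.

```java
import java.util.function.Supplier;

// Hypothetical sketch: the token store asks for the framework's current
// (possibly recreated) client on every call, so a client that was closed
// and replaced behind its back is never used.
public class TokenStore {
  // Stand-in for a ZooKeeper client that may be closed and recreated.
  static class ZkClient {
    boolean closed = false;
    String read(String path) {
      if (closed) {
        throw new IllegalStateException("zookeeper client is closed");
      }
      return "token@" + path;
    }
  }

  private final Supplier<ZkClient> currentClient;

  TokenStore(Supplier<ZkClient> currentClient) {
    this.currentClient = currentClient;
  }

  String getToken(String path) {
    // Fetch the live client at call time rather than holding a reference
    // captured at initialization.
    return currentClient.get().read(path);
  }

  public static void main(String[] args) {
    ZkClient[] holder = { new ZkClient() };
    TokenStore store = new TokenStore(() -> holder[0]);
    System.out.println(store.getToken("/dt/1"));
    holder[0].closed = true;     // framework closes the old client...
    holder[0] = new ZkClient();  // ...and creates a new one
    System.out.println(store.getToken("/dt/1")); // still succeeds
  }
}
```

In real Curator-based code the equivalent would be reacting to connection-state changes rather than polling, but the invariant is the same: never cache a client instance past the framework's lifecycle events.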
[jira] [Commented] (HDFS-16038) DataNode Unrecognized Observer Node when cluster add an observer node
[ https://issues.apache.org/jira/browse/HDFS-16038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529764#comment-17529764 ] lei w commented on HDFS-16038: -- Is it possible to add a switch on the namenode node, if namenode state is observe and switch on. we can replace the Observer state with the Standby state then respon to datanode.[~tomscut] > DataNode Unrecognized Observer Node when cluster add an observer node > - > > Key: HDFS-16038 > URL: https://issues.apache.org/jira/browse/HDFS-16038 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: lei w >Priority: Critical > Labels: Observer > > When an Observer node is added to the cluster, the DataNode will not be able > to recognize the HAServiceState.observer, This is because we did not upgrade > the DataNode. Generally, it will take a long time for a big cluster to > upgrade the DataNode . So should we add a switch to replace the Observer > state with the Standby state when DataNode can not recognize the > HAServiceState.observer state? > The following are some error messages of DataNode: > {code:java} > 11:14:31,812 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > IOException in offerService > com.google.protobuf.InvalidProtocolBufferException: Message missing required > fields: haStatus.state > at > com.google.protobuf.UninitializedMessageException.asInvalidProtocolBufferException(UninitializedMessageException.java:81) > at > com.google.protobuf.AbstractParser.checkMessageInitialized(AbstractParser.java:71) > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16428) Source path setted storagePolicy will cause wrong typeConsumed in rename operation
[ https://issues.apache.org/jira/browse/HDFS-16428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477588#comment-17477588 ] lei w commented on HDFS-16428: -- [~LiJinglun] [~ayushtkn] [~xiaoyuyao] Looking forward to your comments! > Source path setted storagePolicy will cause wrong typeConsumed in rename > operation > --- > > Key: HDFS-16428 > URL: https://issues.apache.org/jira/browse/HDFS-16428 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs, namenode >Reporter: lei w >Priority: Major > Attachments: example.txt > > > When compute quota in rename operation , we use storage policy of the target > directory to compute src quota usage. This will cause wrong value of > typeConsumed when source path was setted storage policy. I provided a unit > test to present this situation. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16428) Source path setted storagePolicy will cause wrong typeConsumed in rename operation
[ https://issues.apache.org/jira/browse/HDFS-16428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16428: - Summary: Source path setted storagePolicy will cause wrong typeConsumed in rename operation (was: When source path setted storagePolicy in rename operation will cause wrong typeConsumed ) > Source path setted storagePolicy will cause wrong typeConsumed in rename > operation > --- > > Key: HDFS-16428 > URL: https://issues.apache.org/jira/browse/HDFS-16428 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs, namenode >Reporter: lei w >Priority: Major > Attachments: example.txt > > > When compute quota in rename operation , we use storage policy of the target > directory to compute src quota usage. This will cause wrong value of > typeConsumed when source path was setted storage policy. I provided a unit > test to present this situation. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16428) When source path setted storagePolicy in rename operation will cause wrong typeConsumed
[ https://issues.apache.org/jira/browse/HDFS-16428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16428: - Attachment: example.txt > When source path setted storagePolicy in rename operation will cause wrong > typeConsumed > > > Key: HDFS-16428 > URL: https://issues.apache.org/jira/browse/HDFS-16428 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs, namenode >Reporter: lei w >Priority: Major > Attachments: example.txt > > > When compute quota in rename operation , we use storage policy of the target > directory to compute src quota usage. This will cause wrong value of > typeConsumed when source path was setted storage policy. I provided a unit > test to present this situation. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16428) When source path setted storagePolicy in rename operation will cause wrong typeConsumed
lei w created HDFS-16428: Summary: When source path setted storagePolicy in rename operation will cause wrong typeConsumed Key: HDFS-16428 URL: https://issues.apache.org/jira/browse/HDFS-16428 Project: Hadoop HDFS Issue Type: Bug Components: hdfs, namenode Reporter: lei w When computing quota during a rename operation, we use the storage policy of the target directory to compute the source's quota usage. This produces a wrong typeConsumed value when the source path has a storage policy set. I have provided a unit test that demonstrates this situation.
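The quota error can be shown with a few lines of arithmetic. The sketch below is a deliberate simplification (ssdConsumed and the replica counts are illustrative, not the FSDirectory quota code): per-storage-type consumption of the source must be computed with the policy in effect on the source path, not the rename target's policy.

```java
// Hypothetical sketch: typeConsumed for SSD under two different policies.
public class RenameQuotaSketch {
  // bytes consumed on SSD = file size * number of SSD replicas the policy requests
  static long ssdConsumed(long fileBytes, int ssdReplicas) {
    return fileBytes * ssdReplicas;
  }

  public static void main(String[] args) {
    long fileBytes = 1024L;
    int srcSsdReplicas = 3; // source dir policy, e.g. ALL_SSD: 3 SSD replicas
    int dstSsdReplicas = 0; // target dir policy, e.g. HOT: 0 SSD replicas
    // Using the target's policy under-counts the source's SSD usage entirely:
    long withTargetPolicy = ssdConsumed(fileBytes, dstSsdReplicas); // 0 (wrong)
    long withSourcePolicy = ssdConsumed(fileBytes, srcSsdReplicas); // 3072 (right)
    System.out.println("target policy counts " + withTargetPolicy
        + " SSD bytes; source policy counts " + withSourcePolicy);
  }
}
```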
[jira] [Assigned] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w reassigned HDFS-15068: Assignee: Aiphago (was: lei w) > DataNode could meet deadlock if invoke refreshVolumes when register > --- > > Key: HDFS-15068 > URL: https://issues.apache.org/jira/browse/HDFS-15068 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Xiaoqiao He >Assignee: Aiphago >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-15068.001.patch, HDFS-15068.002.patch, > HDFS-15068.003.patch, HDFS-15068.004.patch, HDFS-15068.005.patch > > > DataNode could meet deadlock when invoke `dfsadmin -reconfig datanode ip:host > start` to trigger #refreshVolumes. > 1. DataNod#refreshVolumes hold datanode instance ownable {{synchronizer}} > when enter this method first, then try to hold BPOfferService {{readlock}} > when `bpos.getNamespaceInfo()` in following code segment. > {code:java} > for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) { > nsInfos.add(bpos.getNamespaceInfo()); > } > {code} > 2. BPOfferService#registrationSucceeded (which is invoked by #register when > DataNode start or #reregister when processCommandFromActor) hold > BPOfferService {{writelock}} first, then try to hold datanode instance > ownable {{synchronizer}} in following method. > {code:java} > synchronized void bpRegistrationSucceeded(DatanodeRegistration > bpRegistration, > String blockPoolId) throws IOException { > id = bpRegistration; > if(!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) { > throw new IOException("Inconsistent Datanode IDs. Name-node returned " > + bpRegistration.getDatanodeUuid() > + ". 
Expecting " + storage.getDatanodeUuid()); > } > > registerBlockPoolWithSecretManager(bpRegistration, blockPoolId); > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w reassigned HDFS-15068: Assignee: lei w (was: Aiphago) > DataNode could meet deadlock if invoke refreshVolumes when register > --- > > Key: HDFS-15068 > URL: https://issues.apache.org/jira/browse/HDFS-15068 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Xiaoqiao He >Assignee: lei w >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-15068.001.patch, HDFS-15068.002.patch, > HDFS-15068.003.patch, HDFS-15068.004.patch, HDFS-15068.005.patch > > > DataNode could meet deadlock when invoke `dfsadmin -reconfig datanode ip:host > start` to trigger #refreshVolumes. > 1. DataNod#refreshVolumes hold datanode instance ownable {{synchronizer}} > when enter this method first, then try to hold BPOfferService {{readlock}} > when `bpos.getNamespaceInfo()` in following code segment. > {code:java} > for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) { > nsInfos.add(bpos.getNamespaceInfo()); > } > {code} > 2. BPOfferService#registrationSucceeded (which is invoked by #register when > DataNode start or #reregister when processCommandFromActor) hold > BPOfferService {{writelock}} first, then try to hold datanode instance > ownable {{synchronizer}} in following method. > {code:java} > synchronized void bpRegistrationSucceeded(DatanodeRegistration > bpRegistration, > String blockPoolId) throws IOException { > id = bpRegistration; > if(!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) { > throw new IOException("Inconsistent Datanode IDs. Name-node returned " > + bpRegistration.getDatanodeUuid() > + ". 
Expecting " + storage.getDatanodeUuid()); > } > > registerBlockPoolWithSecretManager(bpRegistration, blockPoolId); > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
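One standard way out of the lock-order inversion described above is to make every path acquire the two locks in the same global order. The sketch below reduces the scenario to two plain locks; it illustrates the ordering discipline only and is not the actual HDFS-15068 patch.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: refreshVolumes takes (datanode lock -> bpos lock)
// while registration takes (bpos lock -> datanode lock), which can deadlock.
// The fix pattern: both paths take the locks in one agreed order.
public class LockOrdering {
  static final ReentrantLock datanodeLock = new ReentrantLock();
  static final ReentrantLock bposLock = new ReentrantLock();

  // Path 1: always datanodeLock first, then bposLock.
  static void refreshVolumes() {
    datanodeLock.lock();
    try {
      bposLock.lock();
      try {
        // read namespace info from each BPOfferService
      } finally {
        bposLock.unlock();
      }
    } finally {
      datanodeLock.unlock();
    }
  }

  // Path 2: same order as path 1, so the two can never deadlock.
  static void bpRegistrationSucceeded() {
    datanodeLock.lock();
    try {
      bposLock.lock();
      try {
        // register block pool with the secret manager
      } finally {
        bposLock.unlock();
      }
    } finally {
      datanodeLock.unlock();
    }
  }

  public static void main(String[] args) {
    refreshVolumes();
    bpRegistrationSucceeded();
    System.out.println("consistent lock order: no deadlock possible");
  }
}
```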
[jira] [Commented] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415995#comment-17415995 ] lei w commented on HDFS-14768: -- hi [~gjhkael] Please provide patch for branch 3.1 and 3.2, We need it. Thanks! > EC : Busy DN replica should be consider in live replica check. > -- > > Key: HDFS-14768 > URL: https://issues.apache.org/jira/browse/HDFS-14768 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding, hdfs, namenode >Affects Versions: 3.0.2 >Reporter: guojh >Assignee: guojh >Priority: Major > Labels: patch > Fix For: 3.3.0 > > Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, > HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, > HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, > HDFS-14768.006.patch, HDFS-14768.007.patch, HDFS-14768.008.patch, > HDFS-14768.009.patch, HDFS-14768.010.patch, HDFS-14768.011.patch, > HDFS-14768.jpg, guojh_UT_after_deomission.txt, > guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, > zhaoyiming_UT_beofre_deomission.txt > > > Policy is RS-6-3-1024K, version is hadoop 3.0.2; > We suppose a file's block Index is [0,1,2,3,4,5,6,7,8], And decommission > index[3,4], increase the index 6 datanode's > pendingReplicationWithoutTargets that make it large than > replicationStreamsHardLimit(we set 14). Then, After the method > chooseSourceDatanodes of BlockMananger, the liveBlockIndices is > [0,1,2,3,4,5,7,8], Block Counter is, Live:7, Decommission:2. > In method scheduleReconstruction of BlockManager, the additionalReplRequired > is 9 - 7 = 2. After Namenode choose two target Datanode, will assign a > erasureCode task to target datanode. > When datanode get the task will build targetIndices from liveBlockIndices > and target length. the code is blow. 
> {code:java} > // code placeholder > targetIndices = new short[targets.length]; > private void initTargetIndices() { > BitSet bitset = reconstructor.getLiveBitSet(); > int m = 0; hasValidTargets = false; > for (int i = 0; i < dataBlkNum + parityBlkNum; i++) { > if (!bitset.get) { > if (reconstructor.getBlockLen > 0) { > if (m < targets.length) { > targetIndices[m++] = (short)i; > hasValidTargets = true; > } > } > } > } > {code} > targetIndices[0]=6, and targetIndices[1] is aways 0 from initial value. > The StripedReader is aways create reader from first 6 index block, and is > [0,1,2,3,4,5] > Use the index [0,1,2,3,4,5] to build target index[6,0] will trigger the isal > bug. the block index6's data is corruption(all data is zero). > I write a unit test can stabilize repreduce. > {code:java} > // code placeholder > private int replicationStreamsHardLimit = > DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT; > numDNs = dataBlocks + parityBlocks + 10; > @Test(timeout = 24) > public void testFileDecommission() throws Exception { > LOG.info("Starting test testFileDecommission"); > final Path ecFile = new Path(ecDir, "testFileDecommission"); > int writeBytes = cellSize * dataBlocks; > writeStripedFile(dfs, ecFile, writeBytes); > Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks()); > FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes); > final INodeFile fileNode = cluster.getNamesystem().getFSDirectory() > .getINode4Write(ecFile.toString()).asFile(); > LocatedBlocks locatedBlocks = > StripedFileTestUtil.getLocatedBlocks(ecFile, dfs); > LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0) > .get(0); > DatanodeInfo[] dnLocs = lb.getLocations(); > LocatedStripedBlock lastBlock = > (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock(); > DatanodeInfo[] storageInfos = lastBlock.getLocations(); > // > DatanodeDescriptor datanodeDescriptor = > cluster.getNameNode().getNamesystem() > > 
.getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid()); > BlockInfo firstBlock = fileNode.getBlocks()[0]; > DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock); > // the first heartbeat will consume 3 replica tasks > for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) { > BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new > Block(i), > new DatanodeStorageInfo[]{dStorageInfos[0]}); > } > assertEquals(dataBlocks + parityBlocks, dnLocs.length); > int[] decommNodeIndex = {3, 4}; > final List decommisionNodes = new ArrayList(); > // add the node which will be decommissioning > decommisionNodes.add(dnLocs[decommNodeIndex[0]]); > decommisionNodes.add(dnLocs[decommNodeIndex[1]]); >
[jira] [Updated] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-15240: - Description: # When read some lzo files we found some blocks were broken. I read back all internal blocks(b0-b8) of the block group(RS-6-3-1024k) from DN directly, and choose 6(b0-b5) blocks to decode other 3(b6', b7', b8') blocks. And find the longest common sequenece(LCS) between b6'(decoded) and b6(read from DN)(b7'/b7 and b8'/b8). After selecting 6 blocks of the block group in combinations one time and iterating through all cases, I find one case that the length of LCS is the block length - 64KB, 64KB is just the length of ByteBuffer used by StripedBlockReader. So the corrupt reconstruction block is made by a dirty buffer. The following log snippet(only show 2 of 28 cases) is my check program output. In my case, I known the 3th block is corrupt, so need other 5 blocks to decode another 3 blocks, then find the 1th block's LCS substring is block length - 64kb. It means (0,1,2,4,5,6)th blocks were used to reconstruct 3th block, and the dirty buffer was used before read the 1th block. Must be noted that StripedBlockReader read from the offset 0 of the 1th block after used the dirty buffer. EDITED for readability. 
{code:java} decode from block[0, 2, 3, 4, 5, 7] to generate block[1', 6', 8'] Check the first 131072 bytes between block[1] and block[1'], the longest common substring length is 4 Check the first 131072 bytes between block[6] and block[6'], the longest common substring length is 4 Check the first 131072 bytes between block[8] and block[8'], the longest common substring length is 4 decode from block[0, 2, 3, 4, 5, 6] to generate block[1', 7', 8'] Check the first 131072 bytes between block[1] and block[1'], the longest common substring length is 65536 CHECK AGAIN: all 27262976 bytes between block[1] and block[1'], the longest common substring length is 27197440 # this one Check the first 131072 bytes between block[7] and block[7'], the longest common substring length is 4 Check the first 131072 bytes between block[8] and block[8'], the longest common substring length is 4{code} Now I know the dirty buffer causes reconstruction block error, but how does the dirty buffer come about? After digging into the code and DN log, I found this following DN log is the root reason. {code:java} [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/:52586 remote=/:50010]. 18 millis timeout left. 
[WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped block: BP-714356632--1519726836856:blk_-YY_3472979393 java.lang.NullPointerException at org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) {code} Reading from DN may timeout(hold by a future(F)) and output the INFO log, but the futures that contains the future(F) is cleared, {code:java} return new StripingChunkReadResult(futures.remove(future), StripingChunkReadResult.CANCELLED); {code} futures.remove(future) cause NPE. So the EC reconstruction is failed. In the finally phase, the code snippet in *getStripedReader().close()* {code:java} reconstructor.freeBuffer(reader.getReadBuffer()); reader.freeReadBuffer(); reader.closeBlockReader(); {code} free buffer firstly, but the StripedBlockReader still holds the buffer and write it, that pollute the buffer of BufferPool. was: When read some lzo files we found some blocks were broken. I read back all internal blocks(b0-b8) of the block group(RS-6-3-1024k) from DN directly, and choose 6(b0-b5) blocks to decode other 3(b6', b7', b8') blocks. 
Then I found the longest common substring (LCS) between b6' (decoded) and b6 (read from the DN), and likewise for b7'/b7 and b8'/b8. After iterating through all combinations of 6 blocks chosen from the block group, I found one case where the length of the LCS is the block length - 64KB; 64KB is just the
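The NPE in this report comes from a generic Java pitfall: Map.remove(key) returns null once the map has been cleared, and dereferencing (or unboxing) that null throws. A minimal, self-contained sketch of the failure mode, using hypothetical stand-in classes rather than the actual HDFS types:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

public class FuturesRemoveNpe {
    // Stand-in for StripingChunkReadResult: unboxing a null Integer index throws NPE.
    static final class ChunkResult {
        final int index;
        ChunkResult(Integer index) {
            this.index = index; // auto-unboxing throws NPE when index == null
        }
    }

    // The buggy pattern: futures.remove(future) may return null when the map
    // was already cleared after a read timeout.
    static boolean throwsNpe() {
        Map<Future<Void>, Integer> futures = new HashMap<>();
        Future<Void> f = new CompletableFuture<>();
        futures.put(f, 0);
        futures.clear(); // simulates cleanup racing with the timed-out read
        try {
            new ChunkResult(futures.remove(f));
            return false;
        } catch (NullPointerException e) {
            return true; // the same failure seen in getNextCompletedStripedRead
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE thrown: " + throwsNpe());
    }
}
```

The defensive fix in this shape of code is to null-check the value returned by remove() before constructing the result.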
[jira] [Commented] (HDFS-16196) Namesystem#completeFile method will log incorrect path information when router to access
[ https://issues.apache.org/jira/browse/HDFS-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410605#comment-17410605 ] lei w commented on HDFS-16196: -- Hi [~hexiaoqiao]. I looked at other requests on the NameNode; they do not have the same problem, because the router sends the full path on requests that carry no fileId. fsync, addBlock, complete and abandonBlock do carry a fileId, and the path logged by addBlock and abandonBlock is resolved from the fileId first. Only the complete method uses the path sent by the router. So I think we don't need to make the router send the full path for complete; we can resolve the path from the fileId first, like addBlock does. > Namesystem#completeFile method will log incorrect path information when > router to access > > > Key: HDFS-16196 > URL: https://issues.apache.org/jira/browse/HDFS-16196 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16196.001.patch > > > Router not send entire path information to namenode because > ClientProtocol#complete method`s parameter with fileId. Then NameNode will > log incorrect path information. This is very confusing, should we let the > router pass the path information or modify the log path on namenode? > completeFile log as fllow: > StateChange: DIR* completeFile: / is closed by DFSClient_NONMAPREDUC_* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16196) Namesystem#completeFile method will log incorrect path information when router to access
[ https://issues.apache.org/jira/browse/HDFS-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w reassigned HDFS-16196: Assignee: lei w > Namesystem#completeFile method will log incorrect path information when > router to access > > > Key: HDFS-16196 > URL: https://issues.apache.org/jira/browse/HDFS-16196 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16196.001.patch > > > Router not send entire path information to namenode because > ClientProtocol#complete method`s parameter with fileId. Then NameNode will > log incorrect path information. This is very confusing, should we let the > router pass the path information or modify the log path on namenode? > completeFile log as fllow: > StateChange: DIR* completeFile: / is closed by DFSClient_NONMAPREDUC_* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16196) Namesystem#completeFile method will log incorrect path information when router to access
[ https://issues.apache.org/jira/browse/HDFS-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16196: - Attachment: HDFS-16196.001.patch Status: Patch Available (was: Open) > Namesystem#completeFile method will log incorrect path information when > router to access > > > Key: HDFS-16196 > URL: https://issues.apache.org/jira/browse/HDFS-16196 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16196.001.patch > > > Router not send entire path information to namenode because > ClientProtocol#complete method`s parameter with fileId. Then NameNode will > log incorrect path information. This is very confusing, should we let the > router pass the path information or modify the log path on namenode? > completeFile log as fllow: > StateChange: DIR* completeFile: / is closed by DFSClient_NONMAPREDUC_* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16196) Namesystem#completeFile method will log incorrect path information when router to access
[ https://issues.apache.org/jira/browse/HDFS-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16196: - Attachment: HDFS-16196.001.patch > Namesystem#completeFile method will log incorrect path information when > router to access > > > Key: HDFS-16196 > URL: https://issues.apache.org/jira/browse/HDFS-16196 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: lei w >Priority: Minor > Attachments: HDFS-16196.001.patch > > > Router not send entire path information to namenode because > ClientProtocol#complete method`s parameter with fileId. Then NameNode will > log incorrect path information. This is very confusing, should we let the > router pass the path information or modify the log path on namenode? > completeFile log as fllow: > StateChange: DIR* completeFile: / is closed by DFSClient_NONMAPREDUC_* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16196) Namesystem#completeFile method will log incorrect path information when router to access
[ https://issues.apache.org/jira/browse/HDFS-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16196: - Attachment: (was: HDFS-16196.001.patch) > Namesystem#completeFile method will log incorrect path information when > router to access > > > Key: HDFS-16196 > URL: https://issues.apache.org/jira/browse/HDFS-16196 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16196.001.patch > > > Router not send entire path information to namenode because > ClientProtocol#complete method`s parameter with fileId. Then NameNode will > log incorrect path information. This is very confusing, should we let the > router pass the path information or modify the log path on namenode? > completeFile log as fllow: > StateChange: DIR* completeFile: / is closed by DFSClient_NONMAPREDUC_* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16196) Namesystem#completeFile method will log incorrect path information when router to access
[ https://issues.apache.org/jira/browse/HDFS-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410501#comment-17410501 ] lei w commented on HDFS-16196: -- OK, thanks for the comment, [~hexiaoqiao]. > Namesystem#completeFile method will log incorrect path information when > router to access > > > Key: HDFS-16196 > URL: https://issues.apache.org/jira/browse/HDFS-16196 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: lei w >Priority: Minor > > Router not send entire path information to namenode because > ClientProtocol#complete method`s parameter with fileId. Then NameNode will > log incorrect path information. This is very confusing, should we let the > router pass the path information or modify the log path on namenode? > completeFile log as fllow: > StateChange: DIR* completeFile: / is closed by DFSClient_NONMAPREDUC_* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-16196) Namesystem#completeFile method will log incorrect path information when router to access
[ https://issues.apache.org/jira/browse/HDFS-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410334#comment-17410334 ] lei w edited comment on HDFS-16196 at 9/6/21, 4:17 AM: --- Thanks [~hexiaoqiao] comment. I checked the logic of trunk was same like this. In RouterClientProtocol#complete logic is as follow: RouterClientProtocol#complete(... ...) --> RouterRpcClient#invokeSingle(... ...) --> RouterRpcClient#invokeMethod(..) --> RPC to NN. And in RouterClientProtocol#complete(... ...) we will generate RemoteMethod as follows: {code:java} RemoteMethod method = new RemoteMethod("complete", new Class[] {String.class, String.class, ExtendedBlock.class, long.class}, // params[0]params[1]params[2] params[3] new RemoteParam(), clientName,last,fileId); {code} RemoteMethod uses an array of object type named params to store parameters , and params[0] will store class RemoteParam instance. Then in RouterRpcClient#invokeSingle(... ...) we will generate path by RemoteLocation instance as follows : {code:java} List nns = getNamenodesForNameservice(nsId); // create RemoteLocation with src path "/" , destpath "/" RemoteLocationContext loc = new RemoteLocation(nsId, "/", "/"); Class proto = method.getProtocol(); Method m = method.getMethod(); //Because RemoteMethod params[0] store class RemoteParam instance. we will use RemoteLocation dest "/" as method path. please look router.RemoteMethod#getParams(... ...) Object[] params = method.getParams(loc); return invokeMethod(ugi, nns, proto, m, params); {code} As mentioned above, NameNodeRpcServer will receive "/" as complete method path information and will log "/" in audit log. This is my simple analysis. was (Author: lei w): Thanks [~hexiaoqiao] comment. I checked the logic of trunk was same like this. In RouterClientProtocol#complete logic is as follow: RouterClientProtocol#complete(... ...) --> RouterRpcClient#invokeSingle(... ...) --> RouterRpcClient#invokeMethod(..) --> RPC to NN. 
And in RouterClientProtocol#complete(... ...) we will generate RemoteMethod as follows: {code:java} RemoteMethod method = new RemoteMethod("complete", new Class[] {String.class, String.class, ExtendedBlock.class, long.class}, // params[0]params[1]params[2] params[3] new RemoteParam(), clientName,last,fileId); {code} RemoteMethod uses an array of object type named params to store parameters , and params[0] will store class RemoteParam instance. Then in RouterRpcClient#invokeSingle(... ...) we will generate path by RemoteLocation instance as follows : {code:java} List nns = getNamenodesForNameservice(nsId); // create RemoteLocation with src path "/" , destpath "/" RemoteLocationContext loc = new RemoteLocation(nsId, "/", "/"); Class proto = method.getProtocol(); Method m = method.getMethod(); //Because RemoteMethod params[0] store class RemoteParam instance. we will use RemoteLocation dest "/" as method path. please look router.RemoteMethod#getParams(... ...) Object[] params = method.getParams(loc); return invokeMethod(ugi, nns, proto, m, params); {code} As mentioned above, NameNodeRpcServer will receive "/" as complete method path information and will log "/" in audit log. This is my simple analysis. If you have any questions, thank you for your correction. > Namesystem#completeFile method will log incorrect path information when > router to access > > > Key: HDFS-16196 > URL: https://issues.apache.org/jira/browse/HDFS-16196 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: lei w >Priority: Minor > > Router not send entire path information to namenode because > ClientProtocol#complete method`s parameter with fileId. Then NameNode will > log incorrect path information. This is very confusing, should we let the > router pass the path information or modify the log path on namenode? 
> completeFile log as fllow: > StateChange: DIR* completeFile: / is closed by DFSClient_NONMAPREDUC_* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-16196) Namesystem#completeFile method will log incorrect path information when router to access
[ https://issues.apache.org/jira/browse/HDFS-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410334#comment-17410334 ] lei w edited comment on HDFS-16196 at 9/6/21, 3:28 AM: --- Thanks [~hexiaoqiao] comment. I checked the logic of trunk was same like this. In RouterClientProtocol#complete logic is as follow: RouterClientProtocol#complete(... ...) --> RouterRpcClient#invokeSingle(... ...) --> RouterRpcClient#invokeMethod(..) --> RPC to NN. And in RouterClientProtocol#complete(... ...) we will generate RemoteMethod as follows: {code:java} RemoteMethod method = new RemoteMethod("complete", new Class[] {String.class, String.class, ExtendedBlock.class, long.class}, // params[0]params[1]params[2] params[3] new RemoteParam(), clientName,last,fileId); {code} RemoteMethod uses an array of object type named params to store parameters , and params[0] will store class RemoteParam instance. Then in RouterRpcClient#invokeSingle(... ...) we will generate path by RemoteLocation instance as follows : {code:java} List nns = getNamenodesForNameservice(nsId); // create RemoteLocation with src path "/" , destpath "/" RemoteLocationContext loc = new RemoteLocation(nsId, "/", "/"); Class proto = method.getProtocol(); Method m = method.getMethod(); //Because RemoteMethod params[0] store class RemoteParam instance. we will use RemoteLocation dest "/" as method path. please look router.RemoteMethod#getParams(... ...) Object[] params = method.getParams(loc); return invokeMethod(ugi, nns, proto, m, params); {code} As mentioned above, NameNodeRpcServer will receive "/" as complete method path information and will log "/" in audit log. This is my simple analysis. If you have any questions, thank you for your correction. was (Author: lei w): Thanks [~hexiaoqiao] comment. I checked the logic of trunk was same like this. In RouterClientProtocol#complete logic is as follow: RouterClientProtocol#complete(... ...) --> RouterRpcClient#invokeSingle(... ...) 
--> RouterRpcClient#invokeMethod(..) --> RPC to NN. And in RouterClientProtocol#complete(... ...) we will generate RemoteMethod as follows: {code:java} RemoteMethod method = new RemoteMethod("complete", new Class[] {String.class, String.class, ExtendedBlock.class, long.class}, // params[0]params[1]params[2] params[3] new RemoteParam(), clientName,last,fileId); {code} RemoteMethod uses an array of object type named params to store parameters , and params[0] will store class RemoteParam instance. Then in RouterRpcClient#invokeSingle(... ...) we will generate path by RemoteLocation instance as follows : {code:java} List nns = getNamenodesForNameservice(nsId); // create RemoteLocation with src path "/" , destpath "/" RemoteLocationContext loc = new RemoteLocation(nsId, "/", "/"); Class proto = method.getProtocol(); Method m = method.getMethod(); //Because RemoteMethod params[0] store class RemoteParam instance. we will use RemoteLocation dest "/" as method path. please look router.RemoteMethod#getParams(... ...) Object[] params = method.getParams(loc); return invokeMethod(ugi, nns, proto, m, params); {code} As mentioned above, NameNodeRpcServer will receive "/" as complete method path information and will log "/" in audit log. This is my simple analysis. If you have any questions, thank you for your correction. > Namesystem#completeFile method will log incorrect path information when > router to access > > > Key: HDFS-16196 > URL: https://issues.apache.org/jira/browse/HDFS-16196 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: lei w >Priority: Minor > > Router not send entire path information to namenode because > ClientProtocol#complete method`s parameter with fileId. Then NameNode will > log incorrect path information. This is very confusing, should we let the > router pass the path information or modify the log path on namenode? 
> completeFile log as fllow: > StateChange: DIR* completeFile: / is closed by DFSClient_NONMAPREDUC_* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16196) Namesystem#completeFile method will log incorrect path information when router to access
[ https://issues.apache.org/jira/browse/HDFS-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410334#comment-17410334 ] lei w commented on HDFS-16196: -- Thanks for the comment, [~hexiaoqiao]. I checked and the trunk logic is the same. The RouterClientProtocol#complete call flow is: RouterClientProtocol#complete(... ...) --> RouterRpcClient#invokeSingle(... ...) --> RouterRpcClient#invokeMethod(..) --> RPC to NN. In RouterClientProtocol#complete(... ...) the RemoteMethod is generated as follows: {code:java} RemoteMethod method = new RemoteMethod("complete", new Class[] {String.class, String.class, ExtendedBlock.class, long.class}, // params[0]params[1]params[2] params[3] new RemoteParam(), clientName,last,fileId); {code} RemoteMethod uses an Object array named params to store the parameters, and params[0] holds a RemoteParam instance. Then RouterRpcClient#invokeSingle(... ...) generates the path from a RemoteLocation instance as follows: {code:java} List nns = getNamenodesForNameservice(nsId); // create RemoteLocation with src path "/" , destpath "/" RemoteLocationContext loc = new RemoteLocation(nsId, "/", "/"); Class proto = method.getProtocol(); Method m = method.getMethod(); //Because RemoteMethod params[0] store class RemoteParam instance. we will use RemoteLocation dest "/" as method path. please look router.RemoteMethod#getParams(... ...) Object[] params = method.getParams(loc); return invokeMethod(ugi, nns, proto, m, params); {code} As shown above, NameNodeRpcServer receives "/" as the path for the complete method and logs "/" in the audit log. This is my simple analysis; if you have any questions, thank you for your correction. 
> Namesystem#completeFile method will log incorrect path information when > router to access > > > Key: HDFS-16196 > URL: https://issues.apache.org/jira/browse/HDFS-16196 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: lei w >Priority: Minor > > Router not send entire path information to namenode because > ClientProtocol#complete method`s parameter with fileId. Then NameNode will > log incorrect path information. This is very confusing, should we let the > router pass the path information or modify the log path on namenode? > completeFile log as fllow: > StateChange: DIR* completeFile: / is closed by DFSClient_NONMAPREDUC_* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16196) Namesystem#completeFile method will log incorrect path information when router to access
[ https://issues.apache.org/jira/browse/HDFS-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409393#comment-17409393 ] lei w commented on HDFS-16196: -- Thanks [~ayushtkn] for the comment. I may not have made it clear. The issue is that when the complete method is invoked through the router, the NameNode logs "/" as the file path rather than the file's real path, which is not conducive to troubleshooting. When the request goes through the router, the NameNode logs: 2021-09-03 16:01:26,838 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: / is closed by DFSClient_attempt_*** When the client accesses the NameNode directly, it logs: 2021-09-03 16:01:26,803 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /home/* /* / is closed by DFSClient_attempt_*** So I think we can use the fileId (a parameter of the complete method) to get the file's real path, and log that instead. > Namesystem#completeFile method will log incorrect path information when > router to access > > > Key: HDFS-16196 > URL: https://issues.apache.org/jira/browse/HDFS-16196 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: lei w >Priority: Minor > > Router not send entire path information to namenode because > ClientProtocol#complete method`s parameter with fileId. Then NameNode will > log incorrect path information. This is very confusing, should we let the > router pass the path information or modify the log path on namenode? > completeFile log as fllow: > StateChange: DIR* completeFile: / is closed by DFSClient_NONMAPREDUC_* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
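The proposed fix — prefer a path resolved from the fileId over the caller-supplied path when logging — can be sketched with a simplified model. All class and method names here are hypothetical stand-ins, not the real FSNamesystem API:

```java
import java.util.HashMap;
import java.util.Map;

public class CompleteFileLogging {
    // Stand-in for the NameNode's inode-id -> path mapping.
    private final Map<Long, String> inodePaths = new HashMap<>();

    void addFile(long fileId, String path) {
        inodePaths.put(fileId, path);
    }

    // Mirrors the reported behavior: log whatever path the caller sent
    // (the router sends "/" because complete() carries a fileId).
    String auditPathFromCaller(String srcFromCaller, long fileId) {
        return srcFromCaller;
    }

    // Proposed behavior: resolve the real path from the fileId first,
    // the way addBlock does, and fall back to the caller's path.
    String auditPathFromFileId(String srcFromCaller, long fileId) {
        String resolved = inodePaths.get(fileId);
        return resolved != null ? resolved : srcFromCaller;
    }

    public static void main(String[] args) {
        CompleteFileLogging fs = new CompleteFileLogging();
        fs.addFile(16417L, "/home/user/data/part-00000");
        System.out.println(fs.auditPathFromCaller("/", 16417L));  // logs "/"
        System.out.println(fs.auditPathFromFileId("/", 16417L));  // logs the real path
    }
}
```

The fileId 16417 and the path are illustrative only; the point is that the audit log becomes useful for troubleshooting without changing what the router sends.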
[jira] [Created] (HDFS-16196) Namesystem#completeFile method will log incorrect path information when router to access
lei w created HDFS-16196: Summary: Namesystem#completeFile method will log incorrect path information when router to access Key: HDFS-16196 URL: https://issues.apache.org/jira/browse/HDFS-16196 Project: Hadoop HDFS Issue Type: Bug Reporter: lei w The router does not send the entire path to the NameNode, because ClientProtocol#complete carries a fileId as a parameter. The NameNode then logs incorrect path information. This is very confusing; should we let the router pass the path information, or modify the logged path on the NameNode? The completeFile log looks as follows: StateChange: DIR* completeFile: / is closed by DFSClient_NONMAPREDUC_* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-16126) VolumePair should override hashcode() method
[ https://issues.apache.org/jira/browse/HDFS-16126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w resolved HDFS-16126. -- Resolution: Invalid > VolumePair should override hashcode() method > --- > > Key: HDFS-16126 > URL: https://issues.apache.org/jira/browse/HDFS-16126 > Project: Hadoop HDFS > Issue Type: Bug > Components: diskbalancer >Reporter: lei w >Priority: Minor > > Now we use a map to check one plan with more than one line of same > VolumePair in createWorkPlan(final VolumePair volumePair, Step step) , code > is as flow: > {code:java} > private void createWorkPlan(final VolumePair volumePair, Step step) > throws DiskBalancerException { > // ... > // In case we have a plan with more than > // one line of same VolumePair > // we compress that into one work order. > if (workMap.containsKey(volumePair)) {// To check use map > bytesToMove += workMap.get(volumePair).getBytesToCopy(); > } >// ... > } > {code} > I found the object volumePair is always a new object and without hashcode() > method, So use a map to check is invalid. Should we add hashcode() in > VolumePair ? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16126) VolumePair should override hashcode() method
lei w created HDFS-16126: Summary: VolumePair should override hashcode() method Key: HDFS-16126 URL: https://issues.apache.org/jira/browse/HDFS-16126 Project: Hadoop HDFS Issue Type: Bug Components: diskbalancer Reporter: lei w We use a map to detect a plan with more than one line for the same VolumePair in createWorkPlan(final VolumePair volumePair, Step step); the code is as follows: {code:java} private void createWorkPlan(final VolumePair volumePair, Step step) throws DiskBalancerException { // ... // In case we have a plan with more than // one line of same VolumePair // we compress that into one work order. if (workMap.containsKey(volumePair)) {// To check use map bytesToMove += workMap.get(volumePair).getBytesToCopy(); } // ... } {code} I found that volumePair is always a newly constructed object and VolumePair does not override hashCode(), so using the map for this check is invalid. Should we add hashCode() to VolumePair? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
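The concern here is the standard HashMap-key contract: a key class must override both equals() and hashCode(), or two logically identical pairs never match and containsKey() is always false for a freshly constructed pair. A minimal illustration with a hypothetical VolumePair (not the actual DiskBalancer class, which was later found to satisfy the contract — the issue was resolved Invalid):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class VolumePairKeyDemo {
    static final class VolumePair {
        final String source;
        final String dest;
        VolumePair(String source, String dest) {
            this.source = source;
            this.dest = dest;
        }
        // Both overrides are required for HashMap lookups to work on
        // newly constructed but logically equal keys.
        @Override public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof VolumePair)) return false;
            VolumePair p = (VolumePair) o;
            return source.equals(p.source) && dest.equals(p.dest);
        }
        @Override public int hashCode() {
            return Objects.hash(source, dest);
        }
    }

    public static void main(String[] args) {
        Map<VolumePair, Long> workMap = new HashMap<>();
        workMap.put(new VolumePair("disk1", "disk2"), 1024L);
        // Found, because equals() and hashCode() are overridden together.
        System.out.println(workMap.containsKey(new VolumePair("disk1", "disk2")));
    }
}
```

With the default Object identity semantics, the same containsKey call would return false, which is exactly the behavior the report worried about.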
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Attachment: HDFS-16083.005.patch Status: Patch Available (was: Open) > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.1.patch, > HDFS-16083.005.patch, activeRollEdits.png > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid Observer NameNode trigger active > namenode log roll ? We 'dfs.ha.log-roll.period' configured is 300( 5 > minutes) and active NameNode receives rollEditLog RPC as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Status: Open (was: Patch Available) > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.1.patch, > activeRollEdits.png > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid Observer NameNode trigger active > namenode log roll ? We 'dfs.ha.log-roll.period' configured is 300( 5 > minutes) and active NameNode receives rollEditLog RPC as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Attachment: (was: HDFS-16083.005.patch) > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.1.patch, > activeRollEdits.png > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid Observer NameNode trigger active > namenode log roll ? We 'dfs.ha.log-roll.period' configured is 300( 5 > minutes) and active NameNode receives rollEditLog RPC as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Attachment: HDFS-16083.005.patch > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.1.patch, > HDFS-16083.005.patch, activeRollEdits.png > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid Observer NameNode trigger active > namenode log roll ? We 'dfs.ha.log-roll.period' configured is 300( 5 > minutes) and active NameNode receives rollEditLog RPC as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Attachment: (was: HDFS-16083.005.patch) > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.1.patch, > activeRollEdits.png > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid Observer NameNode trigger active > namenode log roll ? We 'dfs.ha.log-roll.period' configured is 300( 5 > minutes) and active NameNode receives rollEditLog RPC as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
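The idea in HDFS-16083 — an Observer tailing edits should not send a rollEditLog that duplicates the Standby's — can be sketched as a simple gate on the node's HA state. This is a hypothetical illustration, not the real EditLogTailer code:

```java
public class LogRollTrigger {
    enum HAState { ACTIVE, STANDBY, OBSERVER }

    // Only a Standby whose roll period has elapsed should trigger the
    // Active's rollEditLog; an Observer's request would be redundant.
    static boolean shouldTriggerActiveLogRoll(HAState state,
                                              long millisSinceRoll,
                                              long rollPeriodMillis) {
        if (state == HAState.OBSERVER) {
            return false; // forbid the Observer from triggering a roll
        }
        return state == HAState.STANDBY && millisSinceRoll >= rollPeriodMillis;
    }

    public static void main(String[] args) {
        long period = 300_000L; // dfs.ha.log-roll.period = 300 s, as in the report
        System.out.println(shouldTriggerActiveLogRoll(HAState.STANDBY, 300_000L, period));  // true
        System.out.println(shouldTriggerActiveLogRoll(HAState.OBSERVER, 300_000L, period)); // false
    }
}
```

With this gate the Active receives at most one rollEditLog per period regardless of how many Observers are tailing edits.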
[jira] [Updated] (HDFS-16097) Datanode receives ipc requests will throw NPE when datanode quickly restart
[ https://issues.apache.org/jira/browse/HDFS-16097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16097: - Description: A DataNode that receives IPC requests during a quick restart will throw an NPE. This is because when the DN restarts, the BlockPool is registered with the blockPoolManager before the fsdataset is initialized. If an IPC request arrives while the BlockPool is registered but the fsdataset is not yet initialized, the DataNode throws an NPE, because the handler calls methods provided by the fsdataset. The server-side stack trace is as follows: {code:java} java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:3468) at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55) at org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3105) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:916) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:862) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) {code} The client-side stack trace is as follows: {code:java} WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to recover block (block=BP-###:blk_###, datanode=DatanodeInfoWithStorage[,null,null]) org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:3468) at 
org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55) at org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3105) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:916) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:862) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2873) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511) at org.apache.hadoop.ipc.Client.call(Client.java:1457) at org.apache.hadoop.ipc.Client.call(Client.java:1367) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) at com.sun.proxy.$Proxy26.initReplicaRecovery(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.initReplicaRecovery(InterDatanodeProtocolTranslatorPB.java:83) at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.callInitReplicaRecovery(BlockRecoveryWorker.java:571) at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.access$400(BlockRecoveryWorker.java:57) at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:142) at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:610) at java.lang.Thread.run(Thread.java:748) {code} was: Datanode receives ipc requests will throw NPE when datanode quickly restart. 
This is because when DN is reStarted, BlockPool is first registered with blockPoolManager and then fsdataset is initialized. When BlockPool is registered to blockPoolManager without initializing fsdataset, DataNode receives an IPC request will throw NPE, because it will call related methods provided by fsdataset. The stack exception is as follows: {code:java} java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:3468) at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55) at org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3105) at
[jira] [Commented] (HDFS-16097) Datanode receives ipc requests will throw NPE when datanode quickly restart
[ https://issues.apache.org/jira/browse/HDFS-16097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373287#comment-17373287 ] lei w commented on HDFS-16097: -- Thanks [~hexiaoqiao] for your comment. I have not investigated every consequence for a client that hits this code path. But judging from the code logic, if the DataNode is performing block recovery, the block recovery task will fail, and if the client calls the getReplicaVisibleLength() method of ClientDatanodeProtocol, the client should exit directly. > Datanode receives ipc requests will throw NPE when datanode quickly restart > > > Key: HDFS-16097 > URL: https://issues.apache.org/jira/browse/HDFS-16097 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode > Environment: >Reporter: lei w >Assignee: lei w >Priority: Major > Attachments: HDFS-16097.001.patch > > > Datanode receives ipc requests will throw NPE when datanode quickly restart. > This is because when DN is reStarted, BlockPool is first registered with > blockPoolManager and then fsdataset is initialized. When BlockPool is > registered to blockPoolManager without initializing fsdataset, DataNode > receives an IPC request will throw NPE, because it will call related methods > provided by fsdataset.
The stack exception is as follows: > {code:java} > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:3468) > at > org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55) > at > org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3105) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:916) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:862) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
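The race described in HDFS-16097 (IPC request dispatched before fsdataset is initialized) can be illustrated with a tiny stand-in class. This is a hypothetical sketch, not the actual DataNode code; the class and method names are illustrative. It shows the shape of a guard that turns the NPE into a clear startup error; a real fix might instead throw a retriable IOException so the caller (e.g. BlockRecoveryWorker) retries once startup completes.

```java
public class DataNodeStartupGuard {
    // Stands in for fsdataset (FsDatasetSpi in the real code);
    // null until the DataNode finishes starting up.
    private volatile Object fsdataset;

    public void finishStartup() {
        fsdataset = new Object();
    }

    // Stand-in for an IPC entry point such as initReplicaRecovery():
    // read the field once, and fail with a descriptive error instead of
    // dereferencing a null fsdataset.
    public String initReplicaRecovery() {
        Object ds = fsdataset;
        if (ds == null) {
            throw new IllegalStateException(
                "DataNode still starting up; fsdataset not initialized");
        }
        return "recovery initiated";
    }

    public static void main(String[] args) {
        DataNodeStartupGuard dn = new DataNodeStartupGuard();
        try {
            dn.initReplicaRecovery();
        } catch (IllegalStateException e) {
            System.out.println("early request rejected cleanly");
        }
        dn.finishStartup();
        System.out.println(dn.initReplicaRecovery());
    }
}
```

The key design point is that the guard sits at the IPC entry point, so requests that arrive in the window between BlockPool registration and fsdataset initialization fail fast with a meaningful message rather than an NPE deep in the handler.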
[jira] [Commented] (HDFS-16101) Remove unused variable and IOException in ProvidedStorageMap
[ https://issues.apache.org/jira/browse/HDFS-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372310#comment-17372310 ] lei w commented on HDFS-16101: -- Thanks [~ayushtkn] for your reply. > Remove unused variable and IOException in ProvidedStorageMap > --- > > Key: HDFS-16101 > URL: https://issues.apache.org/jira/browse/HDFS-16101 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16101.001.patch > > > Remove unused variable and IOException in ProvidedStorageMap
[jira] [Commented] (HDFS-16102) Remove redundant iteration in BlockManager#removeBlocksAssociatedTo(...) to save time
[ https://issues.apache.org/jira/browse/HDFS-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372307#comment-17372307 ] lei w commented on HDFS-16102: -- Thanks [~hexiaoqiao] for your reply. I will update it. > Remove redundant iteration in BlockManager#removeBlocksAssociatedTo(...) to > save time > -- > > Key: HDFS-16102 > URL: https://issues.apache.org/jira/browse/HDFS-16102 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16102.001.patch > > > The current logic in removeBlocksAssociatedTo(...) is as follows: > {code:java} > void removeBlocksAssociatedTo(final DatanodeDescriptor node) { > providedStorageMap.removeDatanode(node); > for (DatanodeStorageInfo storage : node.getStorageInfos()) { > final Iterator<BlockInfo> it = storage.getBlockIterator(); > //add the BlockInfos to a new collection as the > //returned iterator is not modifiable. > Collection<BlockInfo> toRemove = new ArrayList<>(); > while (it.hasNext()) { > toRemove.add(it.next()); // First iteration: copy blocks to > another collection > } > for (BlockInfo b : toRemove) { > removeStoredBlock(b, node); // Another iteration: remove the blocks > } > } > // .. > } > {code} > In fact, we could do the removal during the first iteration, so should > we remove the redundant iteration to save time and memory?
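The single-pass refactor proposed in HDFS-16102 can be sketched with a toy model. This is not the real BlockManager code: the block and node types are plain strings, and `blockToNodes` is an illustrative stand-in for the block map. The sketch is safe only because `removeStoredBlock` here mutates a different structure than the list being iterated; whether the same holds in the real code depends on whether removeStoredBlock(...) modifies the structure backing storage.getBlockIterator(), which the original "iterator is not modifiable" comment suggests is the first thing to verify before dropping the copy.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SinglePassRemoval {
    // block id -> set of datanode ids that hold a replica
    // (toy stand-in for the BlockManager's block map)
    static final Map<String, Set<String>> blockToNodes = new HashMap<>();

    // Toy stand-in for removeStoredBlock(BlockInfo, DatanodeDescriptor):
    // detach the node from the block's replica set.
    static void removeStoredBlock(String block, String node) {
        Set<String> nodes = blockToNodes.get(block);
        if (nodes != null) {
            nodes.remove(node);
        }
    }

    // Single pass over the storage's blocks: each block is handled as the
    // iteration yields it, so no intermediate toRemove collection is built.
    static void removeBlocksAssociatedTo(String node, List<String> storageBlocks) {
        for (String block : storageBlocks) {
            removeStoredBlock(block, node);
        }
    }
}
```

Compared with the two-pass original, this drops one full traversal and the temporary ArrayList, which is the time and memory saving the issue asks about.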
[jira] [Commented] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372039#comment-17372039 ] lei w commented on HDFS-16083: -- Thanks [~LiJinglun] for the reply. I took your suggestion and made some changes in v05. Please review again. > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.patch, > activeRollEdits.png > > > When the Observer NameNode is enabled in the cluster, the Active NameNode > receives rollEditLog RPC requests from both the Standby NameNode and the Observer > NameNode within a short window. The Observer NameNode's rollEditLog request is > redundant, so should we forbid the Observer NameNode from triggering the active > NameNode's log roll? Our 'dfs.ha.log-roll.period' is configured as 300 (5 > minutes), and the active NameNode receives rollEditLog RPCs as shown in > activeRollEdits.png
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Attachment: HDFS-16083.005.patch > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.patch, > activeRollEdits.png > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid Observer NameNode trigger active > namenode log roll ? We 'dfs.ha.log-roll.period' configured is 300( 5 > minutes) and active NameNode receives rollEditLog RPC as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16101) Remove unused variable and IOException in ProvidedStorageMap
[ https://issues.apache.org/jira/browse/HDFS-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371998#comment-17371998 ] lei w commented on HDFS-16101: -- [~ayushsaxena] Could you give me some advice? > Remove unused variable and IOException in ProvidedStorageMap > --- > > Key: HDFS-16101 > URL: https://issues.apache.org/jira/browse/HDFS-16101 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Priority: Minor > Attachments: HDFS-16101.001.patch > > > Remove unused variable and IOException in ProvidedStorageMap
[jira] [Updated] (HDFS-16102) Remove redundant iteration in BlockManager#removeBlocksAssociatedTo(...) to save time
[ https://issues.apache.org/jira/browse/HDFS-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16102: - Attachment: HDFS-16102.001.patch > Remove redundant iteration in BlockManager#removeBlocksAssociatedTo(...) to > save time > -- > > Key: HDFS-16102 > URL: https://issues.apache.org/jira/browse/HDFS-16102 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16102.001.patch > > > The current logic in removeBlocksAssociatedTo(...) is as follows: > {code:java} > void removeBlocksAssociatedTo(final DatanodeDescriptor node) { > providedStorageMap.removeDatanode(node); > for (DatanodeStorageInfo storage : node.getStorageInfos()) { > final Iterator it = storage.getBlockIterator(); > //add the BlockInfos to a new collection as the > //returned iterator is not modifiable. > Collection toRemove = new ArrayList<>(); > while (it.hasNext()) { > toRemove.add(it.next()); // First iteration : to put blocks to > another collection > } > for (BlockInfo b : toRemove) { > removeStoredBlock(b, node); // Another iteration : to remove blocks > } > } > // .. > } > {code} > In fact , we can use the first iteration to achieve this logic , so should > we remove the redundant iteration to save time and memory? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16102) Remove redundant iteration in BlockManager#removeBlocksAssociatedTo(...) to save time
[ https://issues.apache.org/jira/browse/HDFS-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16102: - Description: The current logic in removeBlocksAssociatedTo(...) is as follows: {code:java} void removeBlocksAssociatedTo(final DatanodeDescriptor node) { providedStorageMap.removeDatanode(node); for (DatanodeStorageInfo storage : node.getStorageInfos()) { final Iterator it = storage.getBlockIterator(); //add the BlockInfos to a new collection as the //returned iterator is not modifiable. Collection toRemove = new ArrayList<>(); while (it.hasNext()) { toRemove.add(it.next()); // First iteration : to put blocks to another collection } for (BlockInfo b : toRemove) { removeStoredBlock(b, node); // Another iteration : to remove blocks } } // .. } {code} In fact , we can use the first iteration to achieve this logic , so should we remove the redundant iteration to save time and memory? was: The current logic in removeBlocksAssociatedTo(...) is as follows: {code:java} void removeBlocksAssociatedTo(final DatanodeDescriptor node) { providedStorageMap.removeDatanode(node); for (DatanodeStorageInfo storage : node.getStorageInfos()) { final Iterator it = storage.getBlockIterator(); //add the BlockInfos to a new collection as the //returned iterator is not modifiable. Collection toRemove = new ArrayList<>(); while (it.hasNext()) { toRemove.add(it.next()); // First iteration : to put blocks to another collection } for (BlockInfo b : toRemove) { removeStoredBlock(b, node); // Another iteration : to remove blocks } } // .. } {code} In fact , we can use the first iteration to achieve this logic , so should we remove the redundant iteration to save time? > Remove redundant iteration in BlockManager#removeBlocksAssociatedTo(...) 
to > save time > -- > > Key: HDFS-16102 > URL: https://issues.apache.org/jira/browse/HDFS-16102 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16102.001.patch > > > The current logic in removeBlocksAssociatedTo(...) is as follows: > {code:java} > void removeBlocksAssociatedTo(final DatanodeDescriptor node) { > providedStorageMap.removeDatanode(node); > for (DatanodeStorageInfo storage : node.getStorageInfos()) { > final Iterator it = storage.getBlockIterator(); > //add the BlockInfos to a new collection as the > //returned iterator is not modifiable. > Collection toRemove = new ArrayList<>(); > while (it.hasNext()) { > toRemove.add(it.next()); // First iteration : to put blocks to > another collection > } > for (BlockInfo b : toRemove) { > removeStoredBlock(b, node); // Another iteration : to remove blocks > } > } > // .. > } > {code} > In fact , we can use the first iteration to achieve this logic , so should > we remove the redundant iteration to save time and memory? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16102) Remove redundant iteration in BlockManager#removeBlocksAssociatedTo(...) to save time
lei w created HDFS-16102: Summary: Remove redundant iteration in BlockManager#removeBlocksAssociatedTo(...) to save time Key: HDFS-16102 URL: https://issues.apache.org/jira/browse/HDFS-16102 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: lei w Assignee: lei w The current logic in removeBlocksAssociatedTo(...) is as follows: {code:java} void removeBlocksAssociatedTo(final DatanodeDescriptor node) { providedStorageMap.removeDatanode(node); for (DatanodeStorageInfo storage : node.getStorageInfos()) { final Iterator<BlockInfo> it = storage.getBlockIterator(); //add the BlockInfos to a new collection as the //returned iterator is not modifiable. Collection<BlockInfo> toRemove = new ArrayList<>(); while (it.hasNext()) { toRemove.add(it.next()); // First iteration: copy blocks to another collection } for (BlockInfo b : toRemove) { removeStoredBlock(b, node); // Another iteration: remove the blocks } } // .. } {code} In fact, we could do the removal during the first iteration, so should we remove the redundant iteration to save time?
[jira] [Updated] (HDFS-16101) Remove unused variable and IOException in ProvidedStorageMap
[ https://issues.apache.org/jira/browse/HDFS-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16101: - Attachment: HDFS-16101.001.patch > Remove unused variable and IOException in ProvidedStorageMap > --- > > Key: HDFS-16101 > URL: https://issues.apache.org/jira/browse/HDFS-16101 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Priority: Minor > Attachments: HDFS-16101.001.patch > > > Remove unused variable and IOException in ProvidedStorageMap
[jira] [Created] (HDFS-16101) Remove unused variable and IOException in ProvidedStorageMap
lei w created HDFS-16101: Summary: Remove unused variable and IOException in ProvidedStorageMap Key: HDFS-16101 URL: https://issues.apache.org/jira/browse/HDFS-16101 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: lei w Remove unused variable and IOException in ProvidedStorageMap
[jira] [Commented] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371403#comment-17371403 ] lei w commented on HDFS-16083: -- Add test in HDFS-16083.003.patch > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, activeRollEdits.png > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid Observer NameNode trigger active > namenode log roll ? We 'dfs.ha.log-roll.period' configured is 300( 5 > minutes) and active NameNode receives rollEditLog RPC as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Attachment: HDFS-16083.003.patch > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, activeRollEdits.png > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid Observer NameNode trigger active > namenode log roll ? We 'dfs.ha.log-roll.period' configured is 300( 5 > minutes) and active NameNode receives rollEditLog RPC as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16097) Datanode receives ipc requests will throw NPE when datanode quickly restart
[ https://issues.apache.org/jira/browse/HDFS-16097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16097: - Attachment: HDFS-16097.001.patch > Datanode receives ipc requests will throw NPE when datanode quickly restart > > > Key: HDFS-16097 > URL: https://issues.apache.org/jira/browse/HDFS-16097 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode > Environment: >Reporter: lei w >Priority: Major > Attachments: HDFS-16097.001.patch > > > Datanode receives ipc requests will throw NPE when datanode quickly restart. > This is because when DN is reStarted, BlockPool is first registered with > blockPoolManager and then fsdataset is initialized. When BlockPool is > registered to blockPoolManager without initializing fsdataset, DataNode > receives an IPC request will throw NPE, because it will call related methods > provided by fsdataset. The stack exception is as follows: > {code:java} > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:3468) > at > org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55) > at > org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3105) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:916) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:862) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: 
hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16097) Datanode receives ipc requests will throw NPE when datanode quickly restart
lei w created HDFS-16097: Summary: Datanode receives ipc requests will throw NPE when datanode quickly restart Key: HDFS-16097 URL: https://issues.apache.org/jira/browse/HDFS-16097 Project: Hadoop HDFS Issue Type: Bug Components: datanode Environment: Reporter: lei w Datanode receives ipc requests will throw NPE when datanode quickly restart. This is because when DN is reStarted, BlockPool is first registered with blockPoolManager and then fsdataset is initialized. When BlockPool is registered to blockPoolManager without initializing fsdataset, DataNode receives an IPC request will throw NPE, because it will call related methods provided by fsdataset. The stack exception is as follows: {code:java} java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:3468) at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55) at org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3105) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:916) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:862) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Description: When the Observer NameNode is turned on in the cluster, the Active NameNode will receive rollEditLog RPC requests from the Standby NameNode and Observer NameNode in a short time. Observer NameNode's rollEditLog request is a repetitive operation, so should we forbid Observer NameNode trigger active namenode log roll ? We 'dfs.ha.log-roll.period' configured is 300( 5 minutes) and active NameNode receives rollEditLog RPC as shown in activeRollEdits.png (was: When the Observer NameNode is turned on in the cluster, the Active NameNode will receive rollEditLog RPC requests from the Standby NameNode and Observer NameNode in a short time. Observer NameNode's rollEditLog request is a repetitive operation, so should we forbid Observer NameNode trigger active namenode log roll ? We 'dfs.ha.log-roll.period' configured is 300( 5 minutes)) > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > activeRollEdits.png > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid Observer NameNode trigger active > namenode log roll ? 
We 'dfs.ha.log-roll.period' configured is 300( 5 > minutes) and active NameNode receives rollEditLog RPC as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Description: When the Observer NameNode is turned on in the cluster, the Active NameNode will receive rollEditLog RPC requests from the Standby NameNode and Observer NameNode in a short time. Observer NameNode's rollEditLog request is a repetitive operation, so should we forbid Observer NameNode trigger active namenode log roll ? We 'dfs.ha.log-roll.period' configured is 300( 5 minutes) (was: When the Observer NameNode is turned on in the cluster, the Active NameNode will receive rollEditLog RPC requests from the Standby NameNode and Observer NameNode in a short time. Observer NameNode's rollEditLog request is a repetitive operation, so should we forbid Observer NameNode trigger active namenode log roll ? We 'dfs.ha.log-roll.period' configuration ) > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > activeRollEdits.png > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid Observer NameNode trigger active > namenode log roll ? We 'dfs.ha.log-roll.period' configured is 300( 5 minutes) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Description: When the Observer NameNode is turned on in the cluster, the Active NameNode will receive rollEditLog RPC requests from the Standby NameNode and Observer NameNode in a short time. Observer NameNode's rollEditLog request is a repetitive operation, so should we forbid Observer NameNode trigger active namenode log roll ? We 'dfs.ha.log-roll.period' configuration was: When the Observer NameNode is turned on in the cluster, the Active NameNode will receive rollEditLog RPC requests from the Standby NameNode and Observer NameNode in a short time. Observer NameNode's rollEditLog request is a repetitive operation, so should we forbid Observer NameNode trigger active namenode log roll ? We Forbid Observer NameNode trigger active namenode log roll > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > activeRollEdits.png > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid Observer NameNode trigger active > namenode log roll ? We 'dfs.ha.log-roll.period' configuration -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Description: When the Observer NameNode is turned on in the cluster, the Active NameNode will receive rollEditLog RPC requests from the Standby NameNode and Observer NameNode in a short time. Observer NameNode's rollEditLog request is a repetitive operation, so should we forbid Observer NameNode trigger active namenode log roll ? We was: When the Observer NameNode is turned on in the cluster, the Active NameNode will receive rollEditLog RPC requests from the Standby NameNode and Observer NameNode in a short time. Observer NameNode's rollEditLog request is a repetitive operation, so should we proh > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > activeRollEdits.png > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid Observer NameNode trigger active > namenode log roll ? We -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Description: When the Observer NameNode is turned on in the cluster, the Active NameNode will receive rollEditLog RPC requests from the Standby NameNode and Observer NameNode in a short time. Observer NameNode's rollEditLog request is a repetitive operation, so should we forbid Observer NameNode trigger active namenode log roll ? We Forbid Observer NameNode trigger active namenode log roll was: When the Observer NameNode is turned on in the cluster, the Active NameNode will receive rollEditLog RPC requests from the Standby NameNode and Observer NameNode in a short time. Observer NameNode's rollEditLog request is a repetitive operation, so should we forbid Observer NameNode trigger active namenode log roll ? We > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > activeRollEdits.png > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid Observer NameNode trigger active > namenode log roll ? We Forbid Observer NameNode trigger active namenode log > roll -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Attachment: activeRollEdits.png > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > activeRollEdits.png > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we proh -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Description: When the Observer NameNode is turned on in the cluster, the Active NameNode will receive rollEditLog RPC requests from the Standby NameNode and Observer NameNode in a short time. Observer NameNode's rollEditLog request is a repetitive operation, so should we proh was:When the Observer NameNode is turned on in the cluster, the Active NameNode will receive rollEditLog RPC requests from the Standby NameNode and Observer NameNode in a short time. Observer NameNode's rollEditLog request is a repetitive operation, so should we prohibit Observer NameNode from triggering rollEditLog? > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we proh -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16081) List a large directory, the client waits for a long time
[ https://issues.apache.org/jira/browse/HDFS-16081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368075#comment-17368075 ] lei w commented on HDFS-16081: -- OK, thanks for the reply, [~ste...@apache.org]. > List a large directory, the client waits for a long time > > > Key: HDFS-16081 > URL: https://issues.apache.org/jira/browse/HDFS-16081 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: lei w >Priority: Minor > > When we list a large directory, we need to wait a lot of time. This is > because the NameNode only returns the number of files corresponding to > dfs.ls.limit each time, and then the client iteratively obtains the remaining > files. But in many scenarios, we only need to know part of the files in the > current directory, and then process this part of the files. After processing, > we go to get the remaining files. So can we add a limit on the number of files > and return it to the client after obtaining the specified number of files, or > have the NameNode return files based on lock hold time instead of just relying on a > configuration?
[jira] [Commented] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368017#comment-17368017 ] lei w commented on HDFS-16083: -- Thanks for the reply, [~LiJinglun]. I will add more information and unit tests later. > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. The Observer NameNode's rollEditLog request is a > repetitive operation, so should we prohibit the Observer NameNode from triggering > rollEditLog?
[jira] [Commented] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367906#comment-17367906 ] lei w commented on HDFS-16083: -- [~ayushsaxena] , [~LiJinglun] anyone have any suggestions? > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we prohibit Observer NameNode from triggering > rollEditLog? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16081) List a large directory, the client waits for a long time
[ https://issues.apache.org/jira/browse/HDFS-16081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367889#comment-17367889 ] lei w commented on HDFS-16081: -- Thanks for the comment, [~ste...@apache.org]. We use the original DFSClient API and could not find listStatusIncremental(); did you mean listStatusInternal()? The method listStatusInternal() only returns after collecting all the files, so it waits a long time when there are too many files in a directory. We want to add a limit on the number of files and return to the client after obtaining the specified number of files. > List a large directory, the client waits for a long time > > > Key: HDFS-16081 > URL: https://issues.apache.org/jira/browse/HDFS-16081 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: lei w >Priority: Minor > > When we list a large directory, we need to wait a lot of time. This is > because the NameNode only returns the number of files corresponding to > dfs.ls.limit each time, and then the client iteratively obtains the remaining > files. But in many scenarios, we only need to know part of the files in the > current directory, and then process this part of the files. After processing, > we go to get the remaining files. So can we add a limit on the number of files > and return it to the client after obtaining the specified number of files, or > have the NameNode return files based on lock hold time instead of just relying on a > configuration?
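The proposal in HDFS-16081 can be sketched as follows. This is a self-contained simulation, not real Hadoop client code: the class and method names (`PartialListingDemo`, `getListingBatch`, `listPartial`) are hypothetical, and the NameNode batch RPC is stubbed out. The point it illustrates is that a client-side cap lets the listing return after a couple of batch RPCs instead of draining a million-entry directory.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the HDFS-16081 proposal: the client asks for at most `wanted`
// entries and returns early, instead of iterating until the whole
// directory has been fetched. All names here are illustrative.
public class PartialListingDemo {
    static final int LS_LIMIT = 1000; // mirrors the dfs.ls.limit default

    // Simulated NameNode call: returns one batch of names starting at `offset`.
    static List<String> getListingBatch(int dirSize, int offset) {
        List<String> batch = new ArrayList<>();
        for (int i = offset; i < Math.min(offset + LS_LIMIT, dirSize); i++) {
            batch.add("file-" + i);
        }
        return batch;
    }

    // Proposed behaviour: stop issuing batch RPCs once `wanted` entries
    // have been collected, rather than draining the full directory.
    static List<String> listPartial(int dirSize, int wanted) {
        List<String> files = new ArrayList<>();
        int offset = 0;
        while (files.size() < wanted && offset < dirSize) {
            List<String> batch = getListingBatch(dirSize, offset);
            if (batch.isEmpty()) {
                break;
            }
            files.addAll(batch);
            offset += batch.size();
        }
        return files.size() > wanted ? files.subList(0, wanted) : files;
    }

    public static void main(String[] args) {
        // A directory with one million files, but we only need 1500 of them:
        // two batch RPCs instead of a thousand.
        List<String> first = listPartial(1_000_000, 1500);
        System.out.println(first.size()); // prints 1500
    }
}
```

A lock-hold-time cutoff, the other alternative floated in the issue, would replace the `files.size() < wanted` check with an elapsed-time check on the NameNode side.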
[jira] [Updated] (HDFS-16081) List a large directory, the client waits for a long time
[ https://issues.apache.org/jira/browse/HDFS-16081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16081: - Description: When we list a large directory, we need to wait a lot of time. This is because the NameNode only returns the number of files corresponding to dfs.ls.limit each time, and then the client iteratively obtains the remaining files. But in many scenarios, we only need to know part of the files in the current directory, and then process this part of the file. After processing, go to get the remaining files. So can we add a limit on the number of files and return it to the client after obtaining the specified number of files or NameNode returnes files based on lock hold time instead of just relying on a configuration. (was: When we list a large directory, we need to wait a lot of time. This is because the NameNode only returns the number of files corresponding to dfs.ls.limit each time, and then the client iteratively obtains the remaining files. But in many scenarios, we only need to know part of the files in the current directory, and then process this part of the file. After processing, go to get the remaining files. So can we add a limit on the number of ls files and return it to the client after obtaining the specified number of files or NameNode returnes files based on lock hold time instead of just relying on a configuration. ) > List a large directory, the client waits for a long time > > > Key: HDFS-16081 > URL: https://issues.apache.org/jira/browse/HDFS-16081 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: lei w >Priority: Minor > > When we list a large directory, we need to wait a lot of time. This is > because the NameNode only returns the number of files corresponding to > dfs.ls.limit each time, and then the client iteratively obtains the remaining > files. 
But in many scenarios, we only need to know part of the files in the > current directory, and then process this part of the file. After processing, > go to get the remaining files. So can we add a limit on the number of files > and return it to the client after obtaining the specified number of files or > NameNode returnes files based on lock hold time instead of just relying on a > configuration. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367858#comment-17367858 ] lei w commented on HDFS-16083: -- Thanks for the comment, [~tomscut]. Changed and added a new patch: HDFS-16083.002.patch. > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. The Observer NameNode's rollEditLog request is a > repetitive operation, so should we prohibit the Observer NameNode from triggering > rollEditLog?
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Attachment: HDFS-16083.002.patch > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we prohibit Observer NameNode from triggering > rollEditLog? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16083: - Attachment: HDFS-16083.001.patch > Forbid Observer NameNode trigger active namenode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Reporter: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch > > > When the Observer NameNode is turned on in the cluster, the Active NameNode > will receive rollEditLog RPC requests from the Standby NameNode and Observer > NameNode in a short time. Observer NameNode's rollEditLog request is a > repetitive operation, so should we prohibit Observer NameNode from triggering > rollEditLog? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll
lei w created HDFS-16083: Summary: Forbid Observer NameNode trigger active namenode log roll Key: HDFS-16083 URL: https://issues.apache.org/jira/browse/HDFS-16083 Project: Hadoop HDFS Issue Type: Improvement Components: namanode Reporter: lei w When the Observer NameNode is turned on in the cluster, the Active NameNode will receive rollEditLog RPC requests from the Standby NameNode and Observer NameNode in a short time. Observer NameNode's rollEditLog request is a repetitive operation, so should we prohibit Observer NameNode from triggering rollEditLog? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
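The idea behind HDFS-16083 can be sketched as below. This is a minimal simulation, not the actual patch: the enum and handler are hypothetical stand-ins for the real NameNode types. It shows the intended effect, that within one roll period only the Standby's rollEditLog request actually rolls the active NameNode's edit log, while the Observer's redundant request is ignored.

```java
// Sketch of the HDFS-16083 idea: ignore rollEditLog requests coming from
// an Observer NameNode, since the Standby already triggers the roll on
// its dfs.ha.log-roll.period. All names here are illustrative.
public class RollEditLogDemo {
    enum NameNodeRole { ACTIVE, STANDBY, OBSERVER }

    static int rollCount = 0;

    // Simulated handler on the active NameNode; returns whether a roll happened.
    static boolean rollEditLog(NameNodeRole caller) {
        if (caller == NameNodeRole.OBSERVER) {
            return false; // duplicate trigger: skip, the Standby's request suffices
        }
        rollCount++;
        return true;
    }

    public static void main(String[] args) {
        // Within one roll period, both the Standby and the Observer would
        // otherwise trigger a roll; with the check above only one lands.
        rollEditLog(NameNodeRole.STANDBY);
        rollEditLog(NameNodeRole.OBSERVER);
        System.out.println(rollCount); // prints 1
    }
}
```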
[jira] [Updated] (HDFS-16081) List a large directory, the client waits for a long time
[ https://issues.apache.org/jira/browse/HDFS-16081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16081: - Description: When we list a large directory, we need to wait a lot of time. This is because the NameNode only returns the number of files corresponding to dfs.ls.limit each time, and then the client iteratively obtains the remaining files. But in many scenarios, we only need to know part of the files in the current directory, and then process this part of the file. After processing, go to get the remaining files. So can we add a limit on the number of ls files and return it to the client after obtaining the specified number of files or NameNode returnes files based on lock hold time instead of just relying on a configuration. (was: When we list a large directory, we need to wait a lot of time. This is because the NameNode only returns the number of files corresponding to dfs.ls.limit each time, and then the client iteratively obtains the remaining files. But in many scenarios, we only need to know part of the files in the current directory, and then process this part of the file. After processing, go to get the remaining files. So can we add a limit on the number of ls files and return it to the client after obtaining the specified number of files ?) > List a large directory, the client waits for a long time > > > Key: HDFS-16081 > URL: https://issues.apache.org/jira/browse/HDFS-16081 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: lei w >Priority: Minor > > When we list a large directory, we need to wait a lot of time. This is > because the NameNode only returns the number of files corresponding to > dfs.ls.limit each time, and then the client iteratively obtains the remaining > files. But in many scenarios, we only need to know part of the files in the > current directory, and then process this part of the file. After processing, > go to get the remaining files. 
So can we add a limit on the number of ls > files and return it to the client after obtaining the specified number of > files or NameNode returnes files based on lock hold time instead of just > relying on a configuration. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-16081) List a large directory, the client waits for a long time
[ https://issues.apache.org/jira/browse/HDFS-16081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17366440#comment-17366440 ] lei w edited comment on HDFS-16081 at 6/21/21, 12:46 PM: - Our cluster dfs.ls.limit is not configured(default is 1000). In many cases, there will be nearly one million files in a directory, so every list operation has to wait for several minutes. was (Author: lei w): Our cluster dfs.ls.limit is not configured. In many cases, there will be nearly one million files in a directory, so every list operation has to wait for several minutes. > List a large directory, the client waits for a long time > > > Key: HDFS-16081 > URL: https://issues.apache.org/jira/browse/HDFS-16081 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: lei w >Priority: Minor > > When we list a large directory, we need to wait a lot of time. This is > because the NameNode only returns the number of files corresponding to > dfs.ls.limit each time, and then the client iteratively obtains the remaining > files. But in many scenarios, we only need to know part of the files in the > current directory, and then process this part of the file. After processing, > go to get the remaining files. So can we add a limit on the number of ls > files and return it to the client after obtaining the specified number of > files ? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16081) List a large directory, the client waits for a long time
[ https://issues.apache.org/jira/browse/HDFS-16081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17366440#comment-17366440 ] lei w commented on HDFS-16081: -- Our cluster dfs.ls.limit is not configured. In many cases, there will be nearly one million files in a directory, so every list operation has to wait for several minutes. > List a large directory, the client waits for a long time > > > Key: HDFS-16081 > URL: https://issues.apache.org/jira/browse/HDFS-16081 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: lei w >Priority: Minor > > When we list a large directory, we need to wait a lot of time. This is > because the NameNode only returns the number of files corresponding to > dfs.ls.limit each time, and then the client iteratively obtains the remaining > files. But in many scenarios, we only need to know part of the files in the > current directory, and then process this part of the file. After processing, > go to get the remaining files. So can we add a limit on the number of ls > files and return it to the client after obtaining the specified number of > files ? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16081) List a large directory, the client waits for a long time
lei w created HDFS-16081: Summary: List a large directory, the client waits for a long time Key: HDFS-16081 URL: https://issues.apache.org/jira/browse/HDFS-16081 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs-client Reporter: lei w When we list a large directory, we need to wait a lot of time. This is because the NameNode only returns the number of files corresponding to dfs.ls.limit each time, and then the client iteratively obtains the remaining files. But in many scenarios, we only need to know part of the files in the current directory, and then process this part of the file. After processing, go to get the remaining files. So can we add a limit on the number of ls files and return it to the client after obtaining the specified number of files ? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16073) Remove redundant RPC requests for getFileLinkInfo in ClientNamenodeProtocolTranslatorPB
[ https://issues.apache.org/jira/browse/HDFS-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364098#comment-17364098 ] lei w commented on HDFS-16073: -- [~ayushtkn] thanks for your comment. Added HDFS-16073.001.patch to adjust it into a single line. > Remove redundant RPC requests for getFileLinkInfo in > ClientNamenodeProtocolTranslatorPB > --- > > Key: HDFS-16073 > URL: https://issues.apache.org/jira/browse/HDFS-16073 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: lei w >Priority: Minor > Attachments: HDFS-16073.001.patch, HDFS-16073.patch > > > Remove redundant RPC requests for getFileLinkInfo in > ClientNamenodeProtocolTranslatorPB. The original logic is as follows: > {code:java} > @Override > public HdfsFileStatus getFileLinkInfo(String src) throws IOException { > GetFileLinkInfoRequestProto req = GetFileLinkInfoRequestProto.newBuilder() > .setSrc(src).build(); > try { > GetFileLinkInfoResponseProto result = rpcProxy.getFileLinkInfo(null, > req);// First getFileLinkInfo RPC request > return result.hasFs() ? > PBHelperClient.convert(rpcProxy.getFileLinkInfo(null, req).getFs()) > :// Repeated getFileLinkInfo RPC request > null; > } catch (ServiceException e) { > throw ProtobufHelper.getRemoteException(e); > } > } > {code}
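The bug quoted above, calling `rpcProxy.getFileLinkInfo` a second time inside the ternary instead of reusing the already-fetched `result`, can be demonstrated and fixed with a small self-contained simulation. The response class and proxy stub below are hypothetical stand-ins so the example runs without Hadoop on the classpath; the essential pattern is the same as the proposed fix.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for GetFileLinkInfoResponseProto, so the
// fix pattern from HDFS-16073 can run without Hadoop on the classpath.
class GetFileLinkInfoResponse {
    private final String fs; // null when the response carries no file status
    GetFileLinkInfoResponse(String fs) { this.fs = fs; }
    boolean hasFs() { return fs != null; }
    String getFs() { return fs; }
}

public class FileLinkInfoDemo {
    static final AtomicInteger rpcCalls = new AtomicInteger();

    // Simulated rpcProxy.getFileLinkInfo: every invocation counts as one RPC.
    static GetFileLinkInfoResponse rpcGetFileLinkInfo(String src) {
        rpcCalls.incrementAndGet();
        return new GetFileLinkInfoResponse("status:" + src);
    }

    // Fixed pattern: issue the RPC once and reuse the cached response,
    // instead of calling the proxy a second time inside the ternary.
    static String getFileLinkInfo(String src) {
        GetFileLinkInfoResponse result = rpcGetFileLinkInfo(src); // single RPC
        return result.hasFs() ? result.getFs() : null;
    }

    public static void main(String[] args) {
        String status = getFileLinkInfo("/link");
        System.out.println(status);          // prints status:/link
        System.out.println(rpcCalls.get());  // prints 1, not 2
    }
}
```

In the real patch the reused value would flow through `PBHelperClient.convert(result.getFs())`; the conversion step is omitted here since only the call-count behaviour is being illustrated.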
[jira] [Updated] (HDFS-16073) Remove redundant RPC requests for getFileLinkInfo in ClientNamenodeProtocolTranslatorPB
[ https://issues.apache.org/jira/browse/HDFS-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16073: - Attachment: HDFS-16073.001.patch > Remove redundant RPC requests for getFileLinkInfo in > ClientNamenodeProtocolTranslatorPB > --- > > Key: HDFS-16073 > URL: https://issues.apache.org/jira/browse/HDFS-16073 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: lei w >Priority: Minor > Attachments: HDFS-16073.001.patch, HDFS-16073.patch > > > Remove redundant RPC requests for getFileLinkInfo in > ClientNamenodeProtocolTranslatorPB. The original logic is as follows: > {code:java} > @Override > public HdfsFileStatus getFileLinkInfo(String src) throws IOException { > GetFileLinkInfoRequestProto req = GetFileLinkInfoRequestProto.newBuilder() > .setSrc(src).build(); > try { > GetFileLinkInfoResponseProto result = rpcProxy.getFileLinkInfo(null, > req);// First getFileLinkInfo RPC request > return result.hasFs() ? > PBHelperClient.convert(rpcProxy.getFileLinkInfo(null, req).getFs()) > :// Repeated getFileLinkInfo RPC request > null; > } catch (ServiceException e) { > throw ProtobufHelper.getRemoteException(e); > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16073) Remove redundant RPC requests for getFileLinkInfo in ClientNamenodeProtocolTranslatorPB
[ https://issues.apache.org/jira/browse/HDFS-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364049#comment-17364049 ] lei w commented on HDFS-16073: -- [~hexiaoqiao], [~ayushsaxena], [~zhuqi] anyone have any suggestions? > Remove redundant RPC requests for getFileLinkInfo in > ClientNamenodeProtocolTranslatorPB > --- > > Key: HDFS-16073 > URL: https://issues.apache.org/jira/browse/HDFS-16073 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: lei w >Priority: Minor > Attachments: HDFS-16073.patch > > > Remove redundant RPC requests for getFileLinkInfo in > ClientNamenodeProtocolTranslatorPB. The original logic is as follows: > {code:java} > @Override > public HdfsFileStatus getFileLinkInfo(String src) throws IOException { > GetFileLinkInfoRequestProto req = GetFileLinkInfoRequestProto.newBuilder() > .setSrc(src).build(); > try { > GetFileLinkInfoResponseProto result = rpcProxy.getFileLinkInfo(null, > req);// First getFileLinkInfo RPC request > return result.hasFs() ? > PBHelperClient.convert(rpcProxy.getFileLinkInfo(null, req).getFs()) > :// Repeated getFileLinkInfo RPC request > null; > } catch (ServiceException e) { > throw ProtobufHelper.getRemoteException(e); > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16073) Remove redundant RPC requests for getFileLinkInfo in ClientNamenodeProtocolTranslatorPB
[ https://issues.apache.org/jira/browse/HDFS-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lei w updated HDFS-16073: - Attachment: HDFS-16073.patch > Remove redundant RPC requests for getFileLinkInfo in > ClientNamenodeProtocolTranslatorPB > --- > > Key: HDFS-16073 > URL: https://issues.apache.org/jira/browse/HDFS-16073 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: lei w >Priority: Minor > Attachments: HDFS-16073.patch > > > Remove redundant RPC requests for getFileLinkInfo in > ClientNamenodeProtocolTranslatorPB. The original logic is as follows: > {code:java} > @Override > public HdfsFileStatus getFileLinkInfo(String src) throws IOException { > GetFileLinkInfoRequestProto req = GetFileLinkInfoRequestProto.newBuilder() > .setSrc(src).build(); > try { > GetFileLinkInfoResponseProto result = rpcProxy.getFileLinkInfo(null, > req);// First getFileLinkInfo RPC request > return result.hasFs() ? > PBHelperClient.convert(rpcProxy.getFileLinkInfo(null, req).getFs()) > :// Repeated getFileLinkInfo RPC request > null; > } catch (ServiceException e) { > throw ProtobufHelper.getRemoteException(e); > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org