[jira] [Created] (HDFS-16557) BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream
tomscut created HDFS-16557: -- Summary: BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream Key: HDFS-16557 URL: https://issues.apache.org/jira/browse/HDFS-16557 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut The lastTxId of an in-progress EditLogInputStream isn't necessarily HdfsServerConstants.INVALID_TXID, so we can determine its status directly via EditLogInputStream#isInProgress. For example, during bootstrapStandby, an in-progress EditLogInputStream is misjudged, resulting in a gap-check failure, which causes bootstrapStandby to fail. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
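The condition change described above can be sketched with simplified stand-ins (these are illustrative classes, not the actual Hadoop types):

```java
// Illustrative stand-ins for the Hadoop classes discussed above.
class GapCheckSketch {
    // HdfsServerConstants.INVALID_TXID in Hadoop; value here is illustrative.
    static final long INVALID_TXID = -12345L;

    static class EditLogInputStream {
        private final long lastTxId;
        private final boolean inProgress;
        EditLogInputStream(long lastTxId, boolean inProgress) {
            this.lastTxId = lastTxId;
            this.inProgress = inProgress;
        }
        long getLastTxId() { return lastTxId; }
        boolean isInProgress() { return inProgress; }
    }

    // Before: an in-progress stream whose lastTxId is already a real txid is
    // wrongly treated as finalized, so the gap check is applied to it.
    static boolean treatedAsInProgressBefore(EditLogInputStream s) {
        return s.getLastTxId() == INVALID_TXID;
    }

    // After: ask the stream directly, as the issue proposes.
    static boolean treatedAsInProgressAfter(EditLogInputStream s) {
        return s.isInProgress();
    }
}
```

An in-progress stream with a real lastTxId is misclassified by the first check but not by the second, which is the misjudgment described above.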
[jira] [Created] (HDFS-16552) Fix NPE for BlockManager
tomscut created HDFS-16552: -- Summary: Fix NPE for BlockManager Key: HDFS-16552 URL: https://issues.apache.org/jira/browse/HDFS-16552 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut There is an NPE in BlockManager when running TestBlockManager#testSkipReconstructionWithManyBusyNodes2, because NameNodeMetrics is not initialized in this unit test. For the related CI run, see [this|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt]. {code:java} [ERROR] Tests run: 34, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 30.088 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager [ERROR] testSkipReconstructionWithManyBusyNodes2(org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager) Time elapsed: 2.783 s <<< ERROR! java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.scheduleReconstruction(BlockManager.java:2171) at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSkipReconstructionWithManyBusyNodes2(TestBlockManager.java:947) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
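A minimal sketch of the null-guard idea, with simplified stand-in types (the real code path is BlockManager#scheduleReconstruction; names here are illustrative, not the actual fix):

```java
// Illustrative stand-ins: in the unit test NameNodeMetrics is never
// initialized, so the metrics reference is null when reconstruction is
// scheduled and the unguarded call throws a NullPointerException.
class MetricsGuardSketch {
    static class NameNodeMetrics {
        long pending;
        void incrPendingReconstruction() { pending++; }
    }

    // Null during the unit test; set during real NameNode startup.
    static NameNodeMetrics metrics;

    // Returns true if the metric was actually incremented.
    static boolean recordScheduledReconstruction() {
        // Before the guard: metrics.incrPendingReconstruction() -> NPE here.
        if (metrics == null) {
            return false;
        }
        metrics.incrPendingReconstruction();
        return true;
    }
}
```

With the guard in place, the call is a no-op when metrics are absent instead of failing the test.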
[jira] [Created] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
tomscut created HDFS-16550: -- Summary: [SBN read] Improper cache-size for journal node may cause cluster crash Key: HDFS-16550 URL: https://issues.apache.org/jira/browse/HDFS-16550 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut Attachments: image-2022-04-21-09-54-29-751.png, image-2022-04-21-09-54-57-111.png When we introduced SBN Read, we encountered a problem while upgrading the JournalNodes. Cluster Info: *Active: nn0* *Standby: nn1* 1. Rolling restart of the journal nodes. {color:#FF}(related config: dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color} 2. The cluster runs for a while. 3. The {color:#FF}active namenode (nn0){color} shuts down because of "Timed out waiting 12ms for a quorum of nodes to respond". 4. nn1 is transitioned to the Active state. 5. The {color:#FF}new active namenode (nn1){color} also shuts down because of "Timed out waiting 12ms for a quorum of nodes to respond". 6. {color:#FF}The cluster crashed{color}. Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
      DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
    Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
        "maximum JVM memory is only %d bytes. It is recommended that you " +
        "decrease the cache size or increase the heap size.",
        capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
      "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
}
{code}
Currently, *dfs.journalnode.edit-cache-size.bytes* can be set to a larger size than the memory requested by the process. If {*}dfs.journalnode.edit-cache-size.bytes > 0.9 * Runtime.getRuntime().maxMemory(){*}, only warn logs are printed during JournalNode startup. This can easily be overlooked by users. However, after the cluster has been running for some time, it is likely to crash. !image-2022-04-21-09-54-57-111.png|width=1227,height=57! IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * Runtime.getRuntime().maxMemory(){*}, we should throw an exception and {color:#FF}fail fast{color}, giving users a clear hint to update the related configuration. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
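The proposed fast-fail could look roughly like this; the threshold constant and exception type are assumptions for illustration, not the actual patch:

```java
// Sketch: reject an oversized edit cache at startup instead of only warning.
class CacheCapacityCheckSketch {
    // Illustrative threshold, mirroring the 0.9 factor in the existing warn.
    static final double MAX_FRACTION_OF_HEAP = 0.9;

    static void checkCapacity(long capacityBytes, long maxMemoryBytes) {
        if (capacityBytes > MAX_FRACTION_OF_HEAP * maxMemoryBytes) {
            // Fail fast so a misconfiguration is caught at JournalNode startup,
            // not discovered later as a cluster-wide outage.
            throw new IllegalArgumentException(String.format(
                "Cache capacity (%d bytes) exceeds %.0f%% of maximum JVM memory "
                + "(%d bytes); decrease dfs.journalnode.edit-cache-size.bytes "
                + "or increase the JournalNode heap.",
                capacityBytes, MAX_FRACTION_OF_HEAP * 100, maxMemoryBytes));
        }
    }
}
```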
[jira] [Created] (HDFS-16548) Failed unit test testRenameMoreThanOnceAcrossSnapDirs_2
tomscut created HDFS-16548: -- Summary: Failed unit test testRenameMoreThanOnceAcrossSnapDirs_2 Key: HDFS-16548 URL: https://issues.apache.org/jira/browse/HDFS-16548 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut {code:java} [ERROR] Tests run: 44, Failures: 6, Errors: 0, Skipped: 0, Time elapsed: 143.701 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots [ERROR] testRenameMoreThanOnceAcrossSnapDirs_2(org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots) Time elapsed: 6.606 s <<< FAILURE! java.lang.AssertionError: expected:<3> but was:<1> at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:633) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots.testRenameMoreThanOnceAcrossSnapDirs_2(TestRenameWithSnapshots.java:985) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-16547) [SBN read] Namenode in safe mode should not be transfer to observer state
tomscut created HDFS-16547: -- Summary: [SBN read] Namenode in safe mode should not be transfer to observer state Key: HDFS-16547 URL: https://issues.apache.org/jira/browse/HDFS-16547 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut Currently, when a Namenode is in safe mode (during startup, or after entering safe mode manually), we can transfer this Namenode to Observer state by command. This Observer node may then receive many requests and throw a SafemodeException, which causes unnecessary failover on the client. So a Namenode in safe mode should not be transferred to observer state. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
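A minimal sketch of the proposed check, with illustrative names (the real change would live in the NameNode's HA state-transition path):

```java
// Sketch: refuse transitionToObserver while the NameNode is in safe mode.
class ObserverTransitionSketch {
    static void checkTransitionToObserver(boolean inSafeMode) {
        if (inSafeMode) {
            // Refusing here avoids an observer that answers reads with
            // SafemodeException and triggers client-side failover.
            throw new IllegalStateException(
                "NameNode is in safe mode; refusing transition to observer");
        }
        // ...otherwise proceed with the normal state transition...
    }
}
```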
[jira] [Created] (HDFS-16527) Add global timeout rule for TestRouterDistCpProcedure
tomscut created HDFS-16527: -- Summary: Add global timeout rule for TestRouterDistCpProcedure Key: HDFS-16527 URL: https://issues.apache.org/jira/browse/HDFS-16527 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut As [Ayush Saxena|https://github.com/ayushtkn] mentioned [here|https://github.com/apache/hadoop/pull/4009#pullrequestreview-925554297], TestRouterDistCpProcedure failed many times because of timeouts. I will add a global timeout rule for it, which makes it easy to set the timeout. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
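JUnit 4's `org.junit.rules.Timeout` rule (e.g. `@Rule public Timeout globalTimeout = Timeout.seconds(180);`) applies a single timeout to every test in a class. The plain-JDK sketch below shows the mechanism such a rule provides; it is illustrative, not the actual rule implementation:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: run a test body on a worker thread and fail it if it does not
// finish within the allotted time, as a class-wide timeout rule would.
class GlobalTimeoutSketch {
    static boolean runWithTimeout(Runnable testBody, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            executor.submit(testBody).get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;                 // test body finished in time
        } catch (TimeoutException e) {
            return false;                // reported as a test failure
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdownNow();      // interrupt a hung test body
        }
    }
}
```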
[jira] [Created] (HDFS-16513) [SBN read] Observer Namenode does not trigger the edits rolling of active Namenode
tomscut created HDFS-16513: -- Summary: [SBN read] Observer Namenode does not trigger the edits rolling of active Namenode Key: HDFS-16513 URL: https://issues.apache.org/jira/browse/HDFS-16513 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut To avoid frequent edits rolling, we should prevent the OBN from triggering the edits rolling of the active Namenode. It is sufficient to retain only the triggering by the SNN and the auto rolling of the ANN. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-16507) Purged edit logs which is in process
tomscut created HDFS-16507: -- Summary: Purged edit logs which is in process Key: HDFS-16507 URL: https://issues.apache.org/jira/browse/HDFS-16507 Project: Hadoop HDFS Issue Type: Bug Environment: {code:java} // code placeholder {code} Reporter: tomscut We introduced the Standby read functionality in branch-3.1.0, but found a FATAL exception. It looks like edit logs that are still in progress are being purged. Based on the analysis, I suspect that the edit log to be purged was not finalized normally. I'm posting some key logs for your reference: 1. ANN. Create editlog, {color:#FF}edits_inprogress_00024207987{color}. {code:java} 2022-03-15 17:24:52,558 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987 2022-03-15 17:24:52,609 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987 2022-03-15 17:24:52,610 INFO namenode.FSEditLog (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987 2022-03-15 17:24:52,624 INFO namenode.FSEditLog (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 {code} 2. SNN. Checkpoint. {color:#FF}25892513 + 1 - 100 = 24892514{color} {color:#FF}dfs.namenode.num.extra.edits.retained=100{color} {code:java} 2022-03-15 17:28:02,640 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there have been 1189661 txns since the last checkpoint, which exceeds the configured threshold 2 2022-03-15 17:28:02,648 INFO namenode.FSImage (FSEditLogLoader.java:loadFSEdits(188)) - Edits file ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 seconds 2022-03-15 17:28:02,649 INFO namenode.FSImage (FSImage.java:saveNamespace(1121)) - Save namespace ... 
2022-03-15 17:28:02,650 INFO namenode.FSImageFormatProtobuf (FSImageFormatProtobuf.java:save(718)) - Saving image file /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no compression 2022-03-15 17:28:03,180 INFO namenode.FSImageFormatProtobuf (FSImageFormatProtobuf.java:save(722)) - Image file /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 17885002 bytes saved in 0 seconds . 2022-03-15 17:28:03,183 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 images with txid >= 25892513 2022-03-15 17:28:03,183 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(233)) - Purging old image FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305, cpktTxId=00024794305) 2022-03-15 17:28:03,188 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 24892514 2022-03-15 17:28:03,282 INFO namenode.TransferFsImage (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 bytes. 2022-03-15 17:28:03,536 INFO namenode.TransferFsImage (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 27082606 to namenode at http://sg-test-ambari-nn1.bigdata.bigo.inner:50070 in 0.343 seconds 2022-03-15 17:28:03,640 INFO namenode.TransferFsImage (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 bytes. 
2022-03-15 17:28:03,684 INFO namenode.TransferFsImage (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 27082606 to namenode at http://sg-test-ambari-dn1.bigdata.bigo.inner:50070 in 0.148 seconds 2022-03-15 17:28:03,748 INFO namenode.TransferFsImage (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 bytes. 2022-03-15 17:28:03,798 INFO namenode.TransferFsImage (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 27082606 to namenode at http://sg-test-ambari-dn2.bigdata.bigo.inner:50070 in 0.113 seconds 2022-03-15 17:28:03,798 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:doWork(482)) - Checkpoint finished successfully. {code} 3. ANN. Purge edit logs. {color:#FF}25892513 + 1 - 100 = 24892514{color} {color:#FF}dfs.namenode.num.extra.edits.retained=100{color} {code:java} 2022-03-15 17:28:03,515 INFO namenode.NNStorageRetentionManager
[jira] [Created] (HDFS-16506) Unit tests failed because of OutOfMemoryError
tomscut created HDFS-16506: -- Summary: Unit tests failed because of OutOfMemoryError Key: HDFS-16506 URL: https://issues.apache.org/jira/browse/HDFS-16506 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Unit tests failed because of OutOfMemoryError. An example: [OutOfMemoryError|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4009/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt]. {code:java} [ERROR] Tests run: 32, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 95.727 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.blockmanagement.TestBlockInfoStriped [ERROR] testGetBlockInfo[4: ErasureCodingPolicy=[Name=RS-10-4-1024k, Schema=[ECSchema=[Codec=rs, numDataUnits=10, numParityUnits=4]], CellSize=1048576, Id=5]](org.apache.hadoop.hdfs.server.blockmanagement.TestBlockInfoStriped) Time elapsed: 15.831 s <<< ERROR! java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:717) at io.netty.util.concurrent.ThreadPerTaskExecutor.execute(ThreadPerTaskExecutor.java:32) at io.netty.util.internal.ThreadExecutorMap$1.execute(ThreadExecutorMap.java:57) at io.netty.util.concurrent.SingleThreadEventExecutor.doStartThread(SingleThreadEventExecutor.java:975) at io.netty.util.concurrent.SingleThreadEventExecutor.ensureThreadStarted(SingleThreadEventExecutor.java:958) at io.netty.util.concurrent.SingleThreadEventExecutor.shutdownGracefully(SingleThreadEventExecutor.java:660) at io.netty.util.concurrent.MultithreadEventExecutorGroup.shutdownGracefully(MultithreadEventExecutorGroup.java:163) at io.netty.util.concurrent.AbstractEventExecutorGroup.shutdownGracefully(AbstractEventExecutorGroup.java:70) at org.apache.hadoop.hdfs.server.datanode.web.DatanodeHttpServer.close(DatanodeHttpServer.java:346) at 
org.apache.hadoop.hdfs.server.datanode.DataNode.shutdown(DataNode.java:2348) at org.apache.hadoop.hdfs.MiniDFSCluster.shutdownDataNode(MiniDFSCluster.java:2166) at org.apache.hadoop.hdfs.MiniDFSCluster.shutdownDataNodes(MiniDFSCluster.java:2156) at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:2135) at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:2109) at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:2102) at org.apache.hadoop.hdfs.MiniDFSCluster.close(MiniDFSCluster.java:3479) at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockInfoStriped.testGetBlockInfo(TestBlockInfoStriped.java:257) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-16505) Setting safemode should not be interrupted by abnormal nodes
tomscut created HDFS-16505: -- Summary: Setting safemode should not be interrupted by abnormal nodes Key: HDFS-16505 URL: https://issues.apache.org/jira/browse/HDFS-16505 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut Attachments: image-2022-03-15-09-29-36-538.png, image-2022-03-15-09-29-44-430.png Setting safemode should not be interrupted by abnormal nodes. For example, we have four namenodes configured in the following order: NS1 -> active NS2 -> standby NS3 -> observer NS4 -> observer. When the {color:#FF}NS1 {color}process exits, setting the safemode state of {color:#FF}NS2{color}, {color:#FF}NS3{color}, and {color:#FF}NS4 {color}fails. Similarly, when the {color:#FF}NS2{color} process exits, only the safemode state of {color:#FF}NS1{color} can be set successfully. When the {color:#FF}NS1{color} process exits: Before the change: !image-2022-03-15-09-29-36-538.png|width=1145,height=97! After the change: !image-2022-03-15-09-29-44-430.png|width=1104,height=119! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-16503) Should verify whether the path name is valid in the WebHDFS
tomscut created HDFS-16503: -- Summary: Should verify whether the path name is valid in the WebHDFS Key: HDFS-16503 URL: https://issues.apache.org/jira/browse/HDFS-16503 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut Attachments: image-2022-03-14-09-35-49-860.png When creating a file using WebHDFS, there are two main steps: 1. Obtain the location of the Datanode to be written. 2. Put the file to this location. Currently *NameNodeRpcServer* verifies that the pathName is valid, but *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods* do not. So if we use an invalid path, the first step returns success, but the second step throws an {*}InvalidPathException{*}. We should also do the validation in WebHDFS, consistent with NameNodeRpcServer. !image-2022-03-14-09-35-49-860.png|width=548,height=164! The WebHDFS operations affected in the same way are: CREATE, APPEND, OPEN, GETFILECHECKSUM. So we can add a DFSUtil.isValidName check to the redirect-URI construction in *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods.* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
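A simplified stand-in for the proposed validation (DFSUtil.isValidName's real rules are more complete; this sketch only captures the idea of rejecting a bad path before the redirect):

```java
// Sketch: validate the path before handing out a datanode redirect, so an
// invalid path fails at step 1 instead of step 2. Rules below are a
// simplified approximation of DFSUtil.isValidName, not the real checks.
class PathValidationSketch {
    static boolean isValidName(String src) {
        if (src == null || !src.startsWith("/") || src.contains("//")) {
            return false;
        }
        for (String component : src.split("/")) {
            // Reject relative components and names containing ':'.
            if (component.equals(".") || component.equals("..")
                || component.contains(":")) {
                return false;
            }
        }
        return true;
    }

    // Fail fast in the redirect step instead of returning a location.
    static void validateBeforeRedirect(String path) {
        if (!isValidName(path)) {
            throw new IllegalArgumentException("Invalid path name " + path);
        }
    }
}
```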
[jira] [Created] (HDFS-16499) [SPS]: Should not start indefinitely while another SPS process is running
tomscut created HDFS-16499: -- Summary: [SPS]: Should not start indefinitely while another SPS process is running Key: HDFS-16499 URL: https://issues.apache.org/jira/browse/HDFS-16499 Project: Hadoop HDFS Issue Type: Sub-task Reporter: tomscut Assignee: tomscut Normally, we can only start one SPS process at a time. Currently, when one process is already running, starting another one makes it retry indefinitely. I think, in this case, the second process should exit immediately. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-16498) Fix NPE for checkBlockReportLease
tomscut created HDFS-16498: -- Summary: Fix NPE for checkBlockReportLease Key: HDFS-16498 URL: https://issues.apache.org/jira/browse/HDFS-16498 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut During a restart of the Namenode, a Datanode that has not yet registered may trigger an FBR (full block report), which causes an NPE. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
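A minimal sketch of the guard, with simplified stand-in types (the actual fix rejects the report from the unregistered node rather than dereferencing a null descriptor; the return-value handling here is illustrative):

```java
// Sketch: a full block report from a DataNode that has not (re)registered
// yields a null descriptor; reject the lease check cleanly instead of NPE-ing.
class BlockReportLeaseSketch {
    static class DatanodeDescriptor {
        final long leaseId;
        DatanodeDescriptor(long leaseId) { this.leaseId = leaseId; }
    }

    static boolean checkBlockReportLease(DatanodeDescriptor node, long leaseId) {
        if (node == null) {
            // Without this guard, node.leaseId throws a NullPointerException.
            // Returning false lets the DataNode retry after registering.
            return false;
        }
        return node.leaseId == leaseId;
    }
}
```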
[jira] [Created] (HDFS-16488) [SPS]: Expose metrics to JMX for external SPS
tomscut created HDFS-16488: -- Summary: [SPS]: Expose metrics to JMX for external SPS Key: HDFS-16488 URL: https://issues.apache.org/jira/browse/HDFS-16488 Project: Hadoop HDFS Issue Type: Sub-task Reporter: tomscut Assignee: tomscut Currently, external SPS has no monitoring metrics. We do not know how many blocks are waiting to be processed, how many are waiting to be retried, and how many have been migrated. We can expose these metrics via JMX for easy collection and display by monitoring systems. For example, in our cluster we exposed these metrics to JMX, collected them with JMX-Exporter into Prometheus, and finally displayed them with Grafana. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
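As a sketch of the approach, hypothetical external-SPS counters can be registered with the platform MBean server so a scraper such as jmx_exporter can read them; the bean name and attribute names below are illustrative, not the ones the patch adds:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Sketch: expose hypothetical external-SPS counters over JMX using the
// standard-MBean naming convention (interface = class name + "MBean").
class ExternalSpsMetricsSketch {
    public interface ExternalSpsMetricsMBean {
        long getBlocksPendingMove();
        long getBlocksMoved();
    }

    public static class ExternalSpsMetrics implements ExternalSpsMetricsMBean {
        private volatile long pending;
        private volatile long moved;
        public void blockQueued() { pending++; }
        public void blockMoved()  { pending--; moved++; }
        public long getBlocksPendingMove() { return pending; }
        public long getBlocksMoved()       { return moved; }
    }

    static ExternalSpsMetrics register() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ExternalSpsMetrics metrics = new ExternalSpsMetrics();
            // Monitoring tools scrape attributes of registered beans.
            server.registerMBean(metrics,
                new ObjectName("Hadoop:service=SPS,name=ExternalSpsMetrics"));
            return metrics;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The counters are not atomic across threads; a real implementation would use AtomicLong or Hadoop's metrics2 framework.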
[jira] [Created] (HDFS-16477) [SPS]: Add metric PendingSPSPaths for getting the number of paths to be processed by SPS
tomscut created HDFS-16477: -- Summary: [SPS]: Add metric PendingSPSPaths for getting the number of paths to be processed by SPS Key: HDFS-16477 URL: https://issues.apache.org/jira/browse/HDFS-16477 Project: Hadoop HDFS Issue Type: Sub-task Reporter: tomscut Assignee: tomscut Currently, we have no idea how many paths are waiting to be processed when using the SPS feature. We should add a metric, PendingSPSPaths, to the NameNode to expose the number of paths still to be processed by SPS. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-16460) [SPS]: Handle failure retries for moving tasks
tomscut created HDFS-16460: -- Summary: [SPS]: Handle failure retries for moving tasks Key: HDFS-16460 URL: https://issues.apache.org/jira/browse/HDFS-16460 Project: Hadoop HDFS Issue Type: Sub-task Reporter: tomscut Assignee: tomscut Handle failure retries for moving tasks. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-16458) [SPS]: Fix bug for unit test of reconfiguring SPS mode
tomscut created HDFS-16458: -- Summary: [SPS]: Fix bug for unit test of reconfiguring SPS mode Key: HDFS-16458 URL: https://issues.apache.org/jira/browse/HDFS-16458 Project: Hadoop HDFS Issue Type: Sub-task Reporter: tomscut Assignee: tomscut TestNameNodeReconfigure#verifySPSEnabled compared isSPSRunning with itself in its assertEquals. In addition, after an *internal SPS* has been removed, the *spsService daemon* will not start within StoragePolicySatisfyManager, so I think the relevant code can be removed to simplify the test. IMO, after reconfiguring the SPS mode, we just need to confirm whether the mode is correct and whether spsManager is null. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-16446) Consider ioutils of disk when choosing volume
tomscut created HDFS-16446: -- Summary: Consider ioutils of disk when choosing volume Key: HDFS-16446 URL: https://issues.apache.org/jira/browse/HDFS-16446 Project: Hadoop HDFS Issue Type: New Feature Reporter: tomscut Assignee: tomscut Attachments: image-2022-02-05-09-50-12-241.png Consider the I/O utilization (ioutil) of each disk when choosing a volume. The principle is as follows: !image-2022-02-05-09-50-12-241.png|width=309,height=159! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-16444) Show start time of JournalNode on Web
tomscut created HDFS-16444: -- Summary: Show start time of JournalNode on Web Key: HDFS-16444 URL: https://issues.apache.org/jira/browse/HDFS-16444 Project: Hadoop HDFS Issue Type: New Feature Reporter: tomscut Assignee: tomscut Attachments: image-2022-01-29-08-09-42-544.png, image-2022-01-29-08-09-53-734.png Show start time of JournalNode on Web. Before: !image-2022-01-29-08-09-42-544.png|width=379,height=98! After: !image-2022-01-29-08-09-53-734.png|width=378,height=118! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-16438) Avoid holding read locks for a long time when scanDatanodeStorage
tomscut created HDFS-16438: -- Summary: Avoid holding read locks for a long time when scanDatanodeStorage Key: HDFS-16438 URL: https://issues.apache.org/jira/browse/HDFS-16438 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut Attachments: image-2022-01-25-23-18-30-275.png When decommissioning with {*}DatanodeAdminBackoffMonitor{*}, there is a heavy operation: {*}scanDatanodeStorage{*}. If the number of blocks on a storage is large (more than 5 million) and GC performance is poor, it may hold the *read lock* for a long time; we should optimize this. !image-2022-01-25-23-18-30-275.png|width=764,height=193! {code:java} 2021-12-22 07:49:01,279 INFO namenode.FSNamesystem (FSNamesystemLock.java:readUnlock(220)) - FSNamesystem scanDatanodeStorage read lock held for 5491 ms via java.lang.Thread.getStackTrace(Thread.java:1552) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.readUnlock(FSNamesystemLock.java:222) org.apache.hadoop.hdfs.server.namenode.FSNamesystem.readUnlock(FSNamesystem.java:1641) org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor.scanDatanodeStorage(DatanodeAdminBackoffMonitor.java:646) org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor.checkForCompletedNodes(DatanodeAdminBackoffMonitor.java:417) org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor.check(DatanodeAdminBackoffMonitor.java:300) org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor.run(DatanodeAdminBackoffMonitor.java:201) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) java.lang.Thread.run(Thread.java:745) Number of suppressed read-lock reports: 0 Longest read-lock held interval: 5491 {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
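A typical mitigation for long read-lock holds of this kind is to process the storage in chunks, releasing and re-acquiring the read lock between chunks so that pending writers can make progress. A minimal sketch, with hypothetical names (this is not the actual DatanodeAdminBackoffMonitor code):

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch of chunked scanning: periodically yield the read lock during a long scan. */
class ChunkedScan {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Scans items, yielding the read lock every chunkSize items so writers can run. */
    int scan(List<String> items, int chunkSize) {
        int processed = 0;
        lock.readLock().lock();
        try {
            for (String item : items) {
                // ... examine one block/storage entry under the read lock ...
                processed++;
                if (processed % chunkSize == 0) {
                    // Release briefly so a pending write lock can be acquired.
                    lock.readLock().unlock();
                    lock.readLock().lock();
                }
            }
        } finally {
            lock.readLock().unlock();
        }
        return processed;
    }
}
```

The scan result must tolerate state changing between chunks, which is usually acceptable for monitoring-style scans like this one.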
[jira] [Created] (HDFS-16435) Remove no need TODO comment for ObserverReadProxyProvider
tomscut created HDFS-16435: -- Summary: Remove no need TODO comment for ObserverReadProxyProvider Key: HDFS-16435 URL: https://issues.apache.org/jira/browse/HDFS-16435 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Based on the discussion in [HDFS-13923|https://issues.apache.org/jira/browse/HDFS-13923], we don't think we need to add a configuration to turn observer reads on/off. So I suggest removing the `TODO comment` that is no longer needed.
[jira] [Created] (HDFS-16434) Add operation name to read/write lock for remaining operations
tomscut created HDFS-16434: -- Summary: Add operation name to read/write lock for remaining operations Key: HDFS-16434 URL: https://issues.apache.org/jira/browse/HDFS-16434 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut In [HDFS-10872|https://issues.apache.org/jira/browse/HDFS-10872], we added the operation name to the read and write locks. However, many operations are still not covered. When analyzing operations that hold locks for a long time, we can currently only identify the specific methods through the stack trace. I suggest covering the remaining operations to facilitate later performance optimization.
[jira] [Created] (HDFS-16427) Add debug log for BlockManager#chooseExcessRedundancyStriped
tomscut created HDFS-16427: -- Summary: Add debug log for BlockManager#chooseExcessRedundancyStriped Key: HDFS-16427 URL: https://issues.apache.org/jira/browse/HDFS-16427 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut While troubleshooting [HDFS-16420|https://issues.apache.org/jira/browse/HDFS-16420], we added some debug logs, which were necessary in any case.
[jira] [Created] (HDFS-16413) Reconfig dfs usage parameters for datanode
tomscut created HDFS-16413: -- Summary: Reconfig dfs usage parameters for datanode Key: HDFS-16413 URL: https://issues.apache.org/jira/browse/HDFS-16413 Project: Hadoop HDFS Issue Type: New Feature Reporter: tomscut Assignee: tomscut Reconfig dfs usage parameters for datanode.
[jira] [Created] (HDFS-16404) Fix typo for CachingGetSpaceUsed
tomscut created HDFS-16404: -- Summary: Fix typo for CachingGetSpaceUsed Key: HDFS-16404 URL: https://issues.apache.org/jira/browse/HDFS-16404 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut Fix typo for CachingGetSpaceUsed.
[jira] [Created] (HDFS-16402) HeartbeatManager may cause incorrect stats
tomscut created HDFS-16402: -- Summary: HeartbeatManager may cause incorrect stats Key: HDFS-16402 URL: https://issues.apache.org/jira/browse/HDFS-16402 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut After reconfiguring {*}dfs.datanode.data.dir{*}, we found that the stats on the Namenode web UI became negative and there were many NPEs in the namenode logs. That problem was solved by [HDFS-14042|https://issues.apache.org/jira/browse/HDFS-14042]. However, if HeartbeatManager#updateHeartbeat or HeartbeatManager#updateLifeline throws another exception, the stats can still become inconsistent. We should ensure that stats.subtract() and stats.add() are performed transactionally.
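The transactional pairing the report asks for can be sketched with a try/finally: subtract the node's contribution, run the update, and add the contribution back even if the update throws. This is a simplified model with a single counter, not the actual DatanodeStats API:

```java
/** Sketch: keep the subtract/add pair consistent even if the update throws. */
class StatsUpdate {
    interface Update { void run() throws Exception; }

    private long capacityTotal; // stands in for the aggregate cluster stats

    StatsUpdate(long initial) { this.capacityTotal = initial; }

    long getCapacityTotal() { return capacityTotal; }

    /** Subtracts the node's stats, runs the update, and always adds them back. */
    void updateNode(long nodeCapacity, Update update) throws Exception {
        capacityTotal -= nodeCapacity;     // stats.subtract(node)
        try {
            update.run();                  // may throw, e.g. inside updateHeartbeat
        } finally {
            capacityTotal += nodeCapacity; // stats.add(node) runs on every path
        }
    }
}
```

With this shape, an exception in the update propagates to the caller, but the aggregate stats are never left with only half of the subtract/add pair applied.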
[jira] [Created] (HDFS-16400) Reconfig DataXceiver parameters for datanode
tomscut created HDFS-16400: -- Summary: Reconfig DataXceiver parameters for datanode Key: HDFS-16400 URL: https://issues.apache.org/jira/browse/HDFS-16400 Project: Hadoop HDFS Issue Type: New Feature Reporter: tomscut Assignee: tomscut
[jira] [Created] (HDFS-16399) Reconfig cache report parameters for datanode
tomscut created HDFS-16399: -- Summary: Reconfig cache report parameters for datanode Key: HDFS-16399 URL: https://issues.apache.org/jira/browse/HDFS-16399 Project: Hadoop HDFS Issue Type: New Feature Reporter: tomscut Assignee: tomscut
[jira] [Created] (HDFS-16398) Reconfig block report parameters for datanode
tomscut created HDFS-16398: -- Summary: Reconfig block report parameters for datanode Key: HDFS-16398 URL: https://issues.apache.org/jira/browse/HDFS-16398 Project: Hadoop HDFS Issue Type: New Feature Reporter: tomscut Assignee: tomscut
[jira] [Created] (HDFS-16397) Reconfig slow disk parameters for datanode
tomscut created HDFS-16397: -- Summary: Reconfig slow disk parameters for datanode Key: HDFS-16397 URL: https://issues.apache.org/jira/browse/HDFS-16397 Project: Hadoop HDFS Issue Type: New Feature Reporter: tomscut Assignee: tomscut In large clusters, a rolling restart of datanodes takes a long time. We can make the slow-peer and slow-disk parameters in the datanode reconfigurable to facilitate cluster operation and maintenance.
[jira] [Created] (HDFS-16396) Reconfig slow peer parameters for datanode
tomscut created HDFS-16396: -- Summary: Reconfig slow peer parameters for datanode Key: HDFS-16396 URL: https://issues.apache.org/jira/browse/HDFS-16396 Project: Hadoop HDFS Issue Type: New Feature Reporter: tomscut Assignee: tomscut In large clusters, a rolling restart of datanodes takes a long time. We can make the slow-peer and slow-disk parameters in the datanode reconfigurable to facilitate cluster operation and maintenance.
[jira] [Created] (HDFS-16379) Reset fullBlockReportLeaseId after any exceptions
tomscut created HDFS-16379: -- Summary: Reset fullBlockReportLeaseId after any exceptions Key: HDFS-16379 URL: https://issues.apache.org/jira/browse/HDFS-16379 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut Recently we encountered FBR-related problems in the production environment, which were solved by introducing HDFS-12914 and HDFS-14314. But a situation like this may still occur: 1. The DN gets a *fullBlockReportLeaseId* via heartbeat. 2. The DN triggers a block report, but some exception occurs (this may be rare, but it can happen), and the DN then retries multiple times {*}without resetting the lease ID{*}, because the lease ID is currently reset only on success. 3. After a while the exception clears, but the lease ID has expired. *Since the NN does not throw an exception after the lease expires, the DN considers the block report successful.* So the block report was not actually processed this time and has to wait for the next one. Therefore, {*}should we consider resetting the fullBlockReportLeaseId in a finally block{*}? The advantage is that lease expiration can be avoided. The downside is that each heartbeat will request a new fullBlockReportLeaseId while the exception persists, but I think this cost is negligible.
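The proposal above can be sketched as follows; the class, field, and method names are illustrative, not the actual DN block-report code. The lease ID is cleared in a finally block, so a failed report does not leave a stale lease around for the retries, and the next heartbeat requests a fresh one:

```java
/** Sketch: reset the FBR lease ID on every exit path, not only on success. */
class BlockReportClient {
    interface Reporter { void report(long leaseId) throws Exception; }

    private long fullBlockReportLeaseId;

    BlockReportClient(long leaseId) { this.fullBlockReportLeaseId = leaseId; }

    long getLeaseId() { return fullBlockReportLeaseId; }

    /** Sends the report; returns false on failure. The lease is always reset. */
    boolean tryBlockReport(Reporter reporter) {
        try {
            reporter.report(fullBlockReportLeaseId);
            return true;
        } catch (Exception e) {
            return false;
        } finally {
            // Reset unconditionally; the next heartbeat will request a fresh lease
            // instead of retrying with one that may have expired.
            fullBlockReportLeaseId = 0;
        }
    }
}
```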
[jira] [Created] (HDFS-16378) Add datanode address to BlockReportLeaseManager logs
tomscut created HDFS-16378: -- Summary: Add datanode address to BlockReportLeaseManager logs Key: HDFS-16378 URL: https://issues.apache.org/jira/browse/HDFS-16378 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut Attachments: image-2021-12-11-09-58-59-494.png We should add the datanode address to BlockReportLeaseManager logs, because the datanode UUID alone is not convenient for tracking. !image-2021-12-11-09-58-59-494.png|width=643,height=152!
[jira] [Created] (HDFS-16377) Should CheckNotNull before access FsDatasetSpi
tomscut created HDFS-16377: -- Summary: Should CheckNotNull before access FsDatasetSpi Key: HDFS-16377 URL: https://issues.apache.org/jira/browse/HDFS-16377 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut Attachments: image-2021-12-10-19-19-22-957.png, image-2021-12-10-19-20-58-022.png When starting a DN, we found an NPE in the starting DN's log, as follows: !image-2021-12-10-19-19-22-957.png|width=909,height=126! The logs of the upstream DN are as follows: !image-2021-12-10-19-20-58-022.png|width=905,height=239! This is mainly because *FsDatasetSpi* had not been initialized at the time of access. I noticed that checkNotNull is already done in two methods ({*}DataNode#getBlockLocalPathInfo{*} and {*}DataNode#getVolumeInfo{*}). We should add it in the other places (interfaces that clients and other DNs can access directly) so that a descriptive message is attached when the exception is thrown. That way, the client and the upstream DN know that FsDatasetSpi has not been initialized, rather than being unaware of the specific cause of the NPE.
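The fail-fast check described here can be sketched with java.util.Objects.requireNonNull; Hadoop itself uses Guava's Preconditions.checkNotNull, and the class below is a simplified stand-in, not the real DataNode:

```java
import java.util.Objects;

/** Sketch: fail fast with a descriptive message instead of a bare NPE
 *  when the dataset has not been initialized yet. */
class DataNodeSketch {
    private Object data; // stands in for FsDatasetSpi; null until initialization

    /** Guards every entry point that touches the dataset. */
    Object getDataset() {
        // Equivalent in spirit to Preconditions.checkNotNull(data, msg).
        return Objects.requireNonNull(data,
            "Storage not yet initialized: FsDatasetSpi is null");
    }

    void initialize() { this.data = new Object(); }
}
```

The caller still gets an exception during the startup window, but the message now names the cause instead of surfacing as an anonymous NPE on the remote side.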
[jira] [Created] (HDFS-16376) Expose metrics of NodeNotChosenReason to JMX
tomscut created HDFS-16376: -- Summary: Expose metrics of NodeNotChosenReason to JMX Key: HDFS-16376 URL: https://issues.apache.org/jira/browse/HDFS-16376 Project: Hadoop HDFS Issue Type: New Feature Reporter: tomscut Assignee: tomscut Attachments: image-2021-12-09-23-48-42-865.png In our cluster, we can see logs for nodes that are not chosen, but it is hard to derive the percentage of each reason from the logs. It would be best to add relevant metrics so the entire cluster can be monitored. !image-2021-12-09-23-48-42-865.png|width=517,height=187!
[jira] [Created] (HDFS-16375) The FBR lease ID should be exposed to the log
tomscut created HDFS-16375: -- Summary: The FBR lease ID should be exposed to the log Key: HDFS-16375 URL: https://issues.apache.org/jira/browse/HDFS-16375 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut Our Hadoop version is 3.1.0. We encountered HDFS-12914 and HDFS-14314 in the production environment. When locating the problem, the *fullBrLeaseId* was not exposed in the log, which caused some difficulty. We should expose it in the log.
[jira] [Created] (HDFS-16371) Exclude slow disks when choosing volume
tomscut created HDFS-16371: -- Summary: Exclude slow disks when choosing volume Key: HDFS-16371 URL: https://issues.apache.org/jira/browse/HDFS-16371 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut Currently, the datanode can detect slow disks. When choosing a volume, we can exclude these slow disks according to some rules. This prevents slow disks from affecting the throughput of the whole datanode.
[jira] [Created] (HDFS-16370) Fix assert message for BlockInfo
tomscut created HDFS-16370: -- Summary: Fix assert message for BlockInfo Key: HDFS-16370 URL: https://issues.apache.org/jira/browse/HDFS-16370 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut In both BlockInfo#getPrevious and BlockInfo#getNext, the assert message is wrong. This may cause misunderstanding and needs to be fixed.
[jira] [Created] (HDFS-16361) Fix log format for QueryCommand
tomscut created HDFS-16361: -- Summary: Fix log format for QueryCommand Key: HDFS-16361 URL: https://issues.apache.org/jira/browse/HDFS-16361 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut Fix log format for QueryCommand.
[jira] [Created] (HDFS-16359) RBF: RouterRpcServer#invokeAtAvailableNs does not take effect when retrying
tomscut created HDFS-16359: -- Summary: RBF: RouterRpcServer#invokeAtAvailableNs does not take effect when retrying Key: HDFS-16359 URL: https://issues.apache.org/jira/browse/HDFS-16359 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut RouterRpcServer#invokeAtAvailableNs does not take effect when retrying. The original code of RouterRpcServer#getNameSpaceInfo looks like this:
{code:java}
private Set<FederationNamespaceInfo> getNameSpaceInfo(String nsId) {
  Set<FederationNamespaceInfo> namespaceInfos = new HashSet<>();
  for (FederationNamespaceInfo ns : namespaceInfos) {
    if (!nsId.equals(ns.getNameserviceId())) {
      namespaceInfos.add(ns);
    }
  }
  return namespaceInfos;
}
{code}
Note that the loop iterates over the freshly created (and therefore empty) namespaceInfos set instead of the set of available namespaces, so the method always returns an empty set and the retry never has another nameservice to try.
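A corrected sketch of the intended filtering, using plain strings for namespace IDs instead of FederationNamespaceInfo (the real fix would iterate the namespaces obtained from the Router's resolver):

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of the intended logic: iterate over the *known* namespaces,
 *  not over the freshly created result set. */
class NsFilter {
    /** Returns every namespace ID except the one that just failed. */
    static Set<String> getOtherNamespaces(Set<String> allNamespaces, String failedNsId) {
        Set<String> result = new HashSet<>();
        for (String ns : allNamespaces) { // iterate the known namespaces
            if (!failedNsId.equals(ns)) {
                result.add(ns);           // collect only into the result set
            }
        }
        return result;
    }
}
```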
[jira] [Created] (HDFS-16344) Improve DirectoryScanner.Stats#toString
tomscut created HDFS-16344: -- Summary: Improve DirectoryScanner.Stats#toString Key: HDFS-16344 URL: https://issues.apache.org/jira/browse/HDFS-16344 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut Improve DirectoryScanner.Stats#toString.
[jira] [Created] (HDFS-16339) Show the threshold when mover threads quota is exceeded
tomscut created HDFS-16339: -- Summary: Show the threshold when mover threads quota is exceeded Key: HDFS-16339 URL: https://issues.apache.org/jira/browse/HDFS-16339 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Attachments: image-2021-11-20-12-32-55-167.png Show the threshold when mover threads quota is exceeded in DataXceiver#replaceBlock and DataXceiver#copyBlock.
[jira] [Created] (HDFS-16337) Show start time of Datanode on Web
tomscut created HDFS-16337: -- Summary: Show start time of Datanode on Web Key: HDFS-16337 URL: https://issues.apache.org/jira/browse/HDFS-16337 Project: Hadoop HDFS Issue Type: New Feature Reporter: tomscut Assignee: tomscut Attachments: image-2021-11-19-08-55-58-343.png Show _start time_ of Datanode on Web. !image-2021-11-19-08-55-58-343.png|width=540,height=155!
[jira] [Created] (HDFS-16335) Fix HDFSCommands.md
tomscut created HDFS-16335: -- Summary: Fix HDFSCommands.md Key: HDFS-16335 URL: https://issues.apache.org/jira/browse/HDFS-16335 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Fix HDFSCommands.md.
[jira] [Created] (HDFS-16331) Make dfs.blockreport.intervalMsec reconfigurable
tomscut created HDFS-16331: -- Summary: Make dfs.blockreport.intervalMsec reconfigurable Key: HDFS-16331 URL: https://issues.apache.org/jira/browse/HDFS-16331 Project: Hadoop HDFS Issue Type: New Feature Reporter: tomscut Assignee: tomscut Attachments: image-2021-11-18-09-33-24-236.png, image-2021-11-18-09-35-35-400.png We have a cold-data cluster that stores data with an EC policy. There are 24 fast disks on each node and each disk is 7 TB. Recently, many nodes have exceeded 10 million blocks, and the FBR interval is 6h by default. Frequent FBRs put great pressure on the NN. !image-2021-11-18-09-35-35-400.png|width=491,height=337! !image-2021-11-18-09-33-24-236.png|width=912,height=256! We want to increase the FBR interval, but that currently requires a rolling restart of the DNs, which is a very heavy operation. In this scenario, it is necessary to make _dfs.blockreport.intervalMsec_ reconfigurable.
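Runtime reconfiguration of a property like this usually boils down to keeping the value in a field that can be swapped atomically and validated on update, with the actual plumbing going through the DN's reconfiguration mechanism (ReconfigurableBase and `dfsadmin -reconfig`). A minimal sketch with hypothetical names:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: hold the block-report interval in an atomic field so it can be
 *  changed at runtime without restarting the DN. */
class ReconfigurableInterval {
    private final AtomicLong blockReportIntervalMs;

    ReconfigurableInterval(long initialMs) {
        this.blockReportIntervalMs = new AtomicLong(initialMs);
    }

    /** Read by the block-report scheduling loop on each iteration. */
    long get() { return blockReportIntervalMs.get(); }

    /** Called from the reconfiguration path; validates before applying. */
    void reconfigure(long newMs) {
        if (newMs <= 0) {
            throw new IllegalArgumentException("interval must be positive: " + newMs);
        }
        blockReportIntervalMs.set(newMs);
    }
}
```

The key design point is that the scheduling loop re-reads the value each cycle rather than caching it at startup, so a reconfiguration takes effect on the next iteration.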
[jira] [Created] (HDFS-16329) Fix log format for BlockManager
tomscut created HDFS-16329: -- Summary: Fix log format for BlockManager Key: HDFS-16329 URL: https://issues.apache.org/jira/browse/HDFS-16329 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut Fix log format for BlockManager.
[jira] [Created] (HDFS-16327) Change dfs.namenode.max.slowpeer.collect.nodes to a proportional value
tomscut created HDFS-16327: -- Summary: Change dfs.namenode.max.slowpeer.collect.nodes to a proportional value Key: HDFS-16327 URL: https://issues.apache.org/jira/browse/HDFS-16327 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Currently, dfs.namenode.max.slowpeer.collect.nodes is a fixed value, but it often needs to be changed as the cluster size changes. We can change it to a proportional value and make it reconfigurable. See [HDFS-15879|https://issues.apache.org/jira/browse/HDFS-15879]. dfs.datanode.max.disks.to.report can be changed similarly.
[jira] [Created] (HDFS-16326) Simplify the code for DiskBalancer
tomscut created HDFS-16326: -- Summary: Simplify the code for DiskBalancer Key: HDFS-16326 URL: https://issues.apache.org/jira/browse/HDFS-16326 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Simplify the code for DiskBalancer.
[jira] [Created] (HDFS-16319) Add metrics doc for ReadLockLongHoldCount and WriteLockLongHoldCount
tomscut created HDFS-16319: -- Summary: Add metrics doc for ReadLockLongHoldCount and WriteLockLongHoldCount Key: HDFS-16319 URL: https://issues.apache.org/jira/browse/HDFS-16319 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Add metrics doc for ReadLockLongHoldCount and WriteLockLongHoldCount. See [HDFS-15808|https://issues.apache.org/jira/browse/HDFS-15808].
[jira] [Created] (HDFS-16315) Add metrics related to Transfer and NativeCopy to DataNode
tomscut created HDFS-16315: -- Summary: Add metrics related to Transfer and NativeCopy to DataNode Key: HDFS-16315 URL: https://issues.apache.org/jira/browse/HDFS-16315 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut Attachments: image-2021-11-11-08-26-33-074.png Datanodes already have Read, Write, Sync and Flush metrics. We should add NativeCopy and Transfer as well. Here is a partial look after the change: !image-2021-11-11-08-26-33-074.png|width=205,height=235!
[jira] [Created] (HDFS-16312) Fix typo for DataNodeVolumeMetrics and ProfilingFileIoEvents
tomscut created HDFS-16312: -- Summary: Fix typo for DataNodeVolumeMetrics and ProfilingFileIoEvents Key: HDFS-16312 URL: https://issues.apache.org/jira/browse/HDFS-16312 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Fix typo for DataNodeVolumeMetrics and ProfilingFileIoEvents.
[jira] [Created] (HDFS-16311) Metric metadataOperationRate calculation error in DataNodeVolumeMetrics
tomscut created HDFS-16311: -- Summary: Metric metadataOperationRate calculation error in DataNodeVolumeMetrics Key: HDFS-16311 URL: https://issues.apache.org/jira/browse/HDFS-16311 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut Attachments: image-2021-11-09-20-22-26-828.png The metric metadataOperationRate is calculated incorrectly in DataNodeVolumeMetrics#addFileIoError, causing MetadataOperationRateAvgTime to be very large in some cases. !image-2021-11-09-20-22-26-828.png|width=450,height=205!
[jira] [Created] (HDFS-16310) RBF: Add client port to CallerContext for Router
tomscut created HDFS-16310: -- Summary: RBF: Add client port to CallerContext for Router Key: HDFS-16310 URL: https://issues.apache.org/jira/browse/HDFS-16310 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut In [HDFS-16266|https://issues.apache.org/jira/browse/HDFS-16266] we proposed adding the client port to the CallerContext of the Router.
[jira] [Created] (HDFS-16299) Fix bug for TestDataNodeVolumeMetrics#verifyDataNodeVolumeMetrics
tomscut created HDFS-16299: -- Summary: Fix bug for TestDataNodeVolumeMetrics#verifyDataNodeVolumeMetrics Key: HDFS-16299 URL: https://issues.apache.org/jira/browse/HDFS-16299 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut Fix bug for TestDataNodeVolumeMetrics#verifyDataNodeVolumeMetrics.
[jira] [Created] (HDFS-16298) Improve error msg for BlockMissingException
tomscut created HDFS-16298: -- Summary: Improve error msg for BlockMissingException Key: HDFS-16298 URL: https://issues.apache.org/jira/browse/HDFS-16298 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut When the client fails to obtain a block, a BlockMissingException is thrown. To make such issues easier to analyze, we can add the relevant block location information to the error message here.
[jira] [Created] (HDFS-16281) Fix flaky unit tests failed due to timeout
tomscut created HDFS-16281: -- Summary: Fix flaky unit tests failed due to timeout Key: HDFS-16281 URL: https://issues.apache.org/jira/browse/HDFS-16281 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut I found that the unit test *_TestViewFileSystemOverloadSchemeWithHdfsScheme_* failed several times due to timeouts. Can we change the timeout for some methods from _*3s*_ to *_30s_* to be consistent with the other methods?
{code:java}
[ERROR] Tests run: 19, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 65.39 s <<< FAILURE! - in org.apache.hadoop.fs.viewfs.TestViewFSOverloadSchemeWithMountTableConfigInHDFS
[ERROR] testNflyRepair(org.apache.hadoop.fs.viewfs.TestViewFSOverloadSchemeWithMountTableConfigInHDFS) Time elapsed: 4.132 s <<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 3000 milliseconds
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:59)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1577)
at org.apache.hadoop.ipc.Client.call(Client.java:1535)
at org.apache.hadoop.ipc.Client.call(Client.java:1432)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
at com.sun.proxy.$Proxy26.setTimes(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setTimes(ClientNamenodeProtocolTranslatorPB.java:1059)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
at com.sun.proxy.$Proxy27.setTimes(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.setTimes(DFSClient.java:2658)
at org.apache.hadoop.hdfs.DistributedFileSystem$37.doCall(DistributedFileSystem.java:1978)
at org.apache.hadoop.hdfs.DistributedFileSystem$37.doCall(DistributedFileSystem.java:1975)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.setTimes(DistributedFileSystem.java:1988)
at org.apache.hadoop.fs.FilterFileSystem.setTimes(FilterFileSystem.java:542)
at org.apache.hadoop.fs.viewfs.ChRootedFileSystem.setTimes(ChRootedFileSystem.java:328)
at org.apache.hadoop.fs.viewfs.NflyFSystem$NflyOutputStream.commit(NflyFSystem.java:439)
at org.apache.hadoop.fs.viewfs.NflyFSystem$NflyOutputStream.close(NflyFSystem.java:395)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:77)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
at org.apache.hadoop.fs.viewfs.TestViewFileSystemOverloadSchemeWithHdfsScheme.writeString(TestViewFileSystemOverloadSchemeWithHdfsScheme.java:685)
at org.apache.hadoop.fs.viewfs.TestViewFileSystemOverloadSchemeWithHdfsScheme.testNflyRepair(TestViewFileSystemOverloadSchemeWithHdfsScheme.java:622)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
{code}
[jira] [Created] (HDFS-16280) Fix typo for ShortCircuitReplica#isStale
tomscut created HDFS-16280: -- Summary: Fix typo for ShortCircuitReplica#isStale Key: HDFS-16280 URL: https://issues.apache.org/jira/browse/HDFS-16280 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Fix typo for ShortCircuitReplica#isStale.
[jira] [Created] (HDFS-16279) Print detail datanode info when process block report
tomscut created HDFS-16279: -- Summary: Print detail datanode info when process block report Key: HDFS-16279 URL: https://issues.apache.org/jira/browse/HDFS-16279 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut Attachments: image-2021-10-19-20-37-55-850.png Print detailed datanode info when processing a block report. !image-2021-10-19-20-37-55-850.png|width=547,height=98!
[jira] [Created] (HDFS-16274) Improve log for FSNamesystem#startFileInt
tomscut created HDFS-16274: -- Summary: Improve log for FSNamesystem#startFileInt Key: HDFS-16274 URL: https://issues.apache.org/jira/browse/HDFS-16274 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut Attachments: image-2021-10-14-23-52-53-100.png, image-2021-10-14-23-55-04-133.png When the blocksize of a file is smaller than dfs.namenode.fs-limits.min-block-size, an IOE is thrown. In the current exception message, it is easy to confuse the value of the blocksize with the value of dfs.namenode.fs-limits.min-block-size. Before the change: !image-2021-10-14-23-55-04-133.png|width=678,height=111! After the change: !image-2021-10-14-23-52-53-100.png|width=710,height=63!
[jira] [Created] (HDFS-16266) Add remote port information to HDFS audit log
tomscut created HDFS-16266: -- Summary: Add remote port information to HDFS audit log Key: HDFS-16266 URL: https://issues.apache.org/jira/browse/HDFS-16266 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut In our production environment, we occasionally encounter a problem where a user submits an abnormal computation task, causing a sudden flood of requests that drives the queueTime and processingTime of the Namenode very high and creates a large backlog of tasks. We usually locate and kill specific Spark, Flink, or MapReduce tasks based on metrics and audit logs. Currently, IP and UGI are recorded in audit logs, but there is no port information, so it is sometimes difficult to locate specific processes. Therefore, I propose that we add the port information to the audit log, so that we can easily track the upstream process. Some projects already include port information in their audit logs, such as HBase and Alluxio. I think it is also necessary to add port information to HDFS audit logs.
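A minimal sketch of the proposed change. The helper name and the exact log format here are illustrative, not the real FSNamesystem audit logger; the point is the "ip=/host:port" form replacing the port-less "ip=/host":

```java
// Illustrative sketch only: buildAuditLine and its format are hypothetical
// stand-ins, not the actual HDFS audit-log implementation.
public class AuditLogSketch {
    // Builds an audit line in the familiar key=value format; appending the
    // remote port after the IP ("ip=/host:port") is the proposed change,
    // so individual client processes can be identified.
    static String buildAuditLine(String ugi, String ip, int port,
                                 String cmd, String src) {
        return String.format("ugi=%s\tip=/%s:%d\tcmd=%s\tsrc=%s",
                ugi, ip, port, cmd, src);
    }

    public static void main(String[] args) {
        System.out.println(buildAuditLine("alice", "192.168.1.10", 39432,
                "open", "/data/part-0"));
    }
}
```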
[jira] [Created] (HDFS-16232) Fix java doc for BlockReaderRemote#newBlockReader
tomscut created HDFS-16232: -- Summary: Fix java doc for BlockReaderRemote#newBlockReader Key: HDFS-16232 URL: https://issues.apache.org/jira/browse/HDFS-16232 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut Fix java doc for BlockReaderRemote#newBlockReader.
[jira] [Created] (HDFS-16225) Fix typo for FederationTestUtils
tomscut created HDFS-16225: -- Summary: Fix typo for FederationTestUtils Key: HDFS-16225 URL: https://issues.apache.org/jira/browse/HDFS-16225 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut Fix typo for FederationTestUtils.
[jira] [Created] (HDFS-16209) Set dfs.namenode.caching.enabled to false as default
tomscut created HDFS-16209: -- Summary: Set dfs.namenode.caching.enabled to false as default Key: HDFS-16209 URL: https://issues.apache.org/jira/browse/HDFS-16209 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 3.1.0 Reporter: tomscut Assignee: tomscut Namenode config: dfs.namenode.write-lock-reporting-threshold-ms=50ms dfs.namenode.caching.enabled=true (default) In fact, the caching feature is not used in our cluster, but this switch is turned on by default (dfs.namenode.caching.enabled=true), incurring some additional write lock overhead. We counted the number of write lock warnings in a log file and found that rescan cache warnings account for about 32%, which greatly affects the performance of the Namenode. We should set 'dfs.namenode.caching.enabled' to false by default and turn it on when we want to use it.
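A sketch of what the proposed default could look like in hdfs-default.xml. The property name comes from this issue; the description text is illustrative, not the wording any patch actually uses:

```xml
<property>
  <name>dfs.namenode.caching.enabled</name>
  <value>false</value>
  <description>
    Disabled by default to avoid the write-lock overhead of cache rescans;
    set to true only on clusters that actually use centralized caching.
  </description>
</property>
```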
[jira] [Created] (HDFS-16203) Discover datanodes with unbalanced block pool usage by the standard deviation
tomscut created HDFS-16203: -- Summary: Discover datanodes with unbalanced block pool usage by the standard deviation Key: HDFS-16203 URL: https://issues.apache.org/jira/browse/HDFS-16203 Project: Hadoop HDFS Issue Type: New Feature Environment: !image-2021-09-01-19-16-27-172.png|width=581,height=216! Reporter: tomscut Assignee: tomscut Attachments: image-2021-09-01-19-16-27-172.png Discover datanodes with unbalanced volume usage by the standard deviation. In some scenarios, datanode disk usage may become unbalanced: 1. Repair the damaged disk and make it online again. 2. Add disks to some Datanodes. 3. Some disks are damaged, resulting in slow data writing. 4. Use some custom volume choosing policies. In the case of unbalanced disk usage, a sudden increase in datanode write traffic may result in busy disk I/O with low volume usage, resulting in decreased throughput across datanodes. We need to find these nodes in time to run diskBalancer or take other action. Based on the volume usage of each datanode, we can calculate the standard deviation of the volume usage. The more unbalanced the volumes, the higher the standard deviation. We can display the result on the namenode web UI and then sort it directly to find the nodes whose volume usage is unbalanced. This interface is only used to obtain metrics and does not adversely affect namenode performance.
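The metric described above can be sketched as follows. This is a minimal illustration of the statistic, not the actual HDFS implementation; a high value flags a datanode whose volumes are unevenly filled:

```java
// Minimal sketch (not the HDFS code): population standard deviation of a
// datanode's per-volume usage ratios (each in [0, 1]).
public class VolumeUsageStdDev {
    static double stdDev(double[] usages) {
        double mean = 0;
        for (double u : usages) {
            mean += u;
        }
        mean /= usages.length;
        double var = 0;
        for (double u : usages) {
            var += (u - mean) * (u - mean);
        }
        // Population variance; the more unbalanced the volumes,
        // the higher the resulting standard deviation.
        return Math.sqrt(var / usages.length);
    }

    public static void main(String[] args) {
        // Perfectly balanced volumes -> 0.0; a 20%/80% split -> 0.3.
        System.out.println(stdDev(new double[]{0.5, 0.5, 0.5}));
        System.out.println(stdDev(new double[]{0.2, 0.8}));
    }
}
```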
[jira] [Resolved] (HDFS-16158) Discover datanodes with unbalanced volume usage by the standard deviation
[ https://issues.apache.org/jira/browse/HDFS-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut resolved HDFS-16158. Resolution: Abandoned > Discover datanodes with unbalanced volume usage by the standard deviation > -- > > Key: HDFS-16158 > URL: https://issues.apache.org/jira/browse/HDFS-16158 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Attachments: image-2021-08-11-10-14-58-430.png > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Discover datanodes with unbalanced volume usage by the standard deviation > In some scenarios, we may cause unbalanced datanode disk usage: > 1. Repair the damaged disk and make it online again. > 2. Add disks to some Datanodes. > 3. Some disks are damaged, resulting in slow data writing. > 4. Use some custom volume choosing policies. > In the case of unbalanced disk usage, a sudden increase in datanode write > traffic may result in busy disk I/O with low volume usage, resulting in > decreased throughput across datanodes. > In this case, we need to find these nodes in time to do diskBalance, or other > processing. Based on the volume usage of each datanode, we can calculate the > standard deviation of the volume usage. The more unbalanced the volume, the > higher the standard deviation. > To prevent the namenode from being too busy, we can calculate the standard > variance on the datanode side, transmit it to the namenode through heartbeat, > and display the result on the Web of namenode. We can then sort directly to > find the nodes on the Web where the volumes usages are unbalanced.
[jira] [Created] (HDFS-16194) Add a public method DatanodeID#getDisplayName
tomscut created HDFS-16194: -- Summary: Add a public method DatanodeID#getDisplayName Key: HDFS-16194 URL: https://issues.apache.org/jira/browse/HDFS-16194 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Add a public method DatanodeID#getDisplayName to simplify the code.
[jira] [Created] (HDFS-16179) Update log level for BlockManager#chooseExcessRedundancyStriped to avoid excessive logging
tomscut created HDFS-16179: -- Summary: Update log level for BlockManager#chooseExcessRedundancyStriped to avoid excessive logging Key: HDFS-16179 URL: https://issues.apache.org/jira/browse/HDFS-16179 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 3.1.0 Reporter: tomscut Assignee: tomscut Attachments: log-count.jpg, logs.jpg {code:java} private void chooseExcessRedundancyStriped(BlockCollection bc, final Collection<DatanodeStorageInfo> nonExcess, BlockInfo storedBlock, DatanodeDescriptor delNodeHint) { ... // cardinality of found indicates the expected number of internal blocks final int numOfTarget = found.cardinality(); final BlockStoragePolicy storagePolicy = storagePolicySuite.getPolicy( bc.getStoragePolicyID()); final List<StorageType> excessTypes = storagePolicy.chooseExcess( (short) numOfTarget, DatanodeStorageInfo.toStorageTypes(nonExcess)); if (excessTypes.isEmpty()) { LOG.warn("excess types chosen for block {} among storages {} is empty", storedBlock, nonExcess); return; } ... } {code} IMO, this code merely detects excess storage types, so lowering the log level to DEBUG here has no adverse effect. We have a cluster that uses the EC policy to store data. The current log level here is WARN, and in about 50 minutes, 286,093 such logs were printed, which can drown out other important logs. !logs.jpg|width=1167,height=62! !log-count.jpg|width=760,height=30!
[jira] [Created] (HDFS-16177) Bug fix for Util#receiveFile
tomscut created HDFS-16177: -- Summary: Bug fix for Util#receiveFile Key: HDFS-16177 URL: https://issues.apache.org/jira/browse/HDFS-16177 Project: Hadoop HDFS Issue Type: Task Affects Versions: 3.1.0 Reporter: tomscut Assignee: tomscut The time to write the file was miscalculated in Util#receiveFile. !download-fsimage.jpg|width=578,height=134!
[jira] [Created] (HDFS-16160) Improve the parameter annotation in DatanodeProtocol#sendHeartbeat
tomscut created HDFS-16160: -- Summary: Improve the parameter annotation in DatanodeProtocol#sendHeartbeat Key: HDFS-16160 URL: https://issues.apache.org/jira/browse/HDFS-16160 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut Improve the parameter annotation in DatanodeProtocol#sendHeartbeat.
[jira] [Created] (HDFS-16158) Discover datanodes with unbalanced volume usage by the standard deviation
tomscut created HDFS-16158: -- Summary: Discover datanodes with unbalanced volume usage by the standard deviation Key: HDFS-16158 URL: https://issues.apache.org/jira/browse/HDFS-16158 Project: Hadoop HDFS Issue Type: New Feature Reporter: tomscut Assignee: tomscut Discover datanodes with unbalanced volume usage by the standard deviation. In some scenarios, datanode disk usage may become unbalanced: 1. Repair the damaged disk and make it online again. 2. Add disks to some Datanodes. 3. Some disks are damaged, resulting in slow data writing. 4. Use some custom volume choosing policies. In the case of unbalanced disk usage, a sudden increase in datanode write traffic may result in busy disk I/O with low volume usage, resulting in decreased throughput across datanodes. In this case, we need to find these nodes in time to run diskBalancer or take other action. Based on the volume usage of each datanode, we can calculate the standard deviation of the volume usage. The more unbalanced the volumes, the higher the standard deviation. To prevent the namenode from being too busy, we can calculate the standard deviation on the datanode side, transmit it to the namenode through the heartbeat, and display the result on the namenode web UI. We can then sort directly on the web UI to find the nodes whose volume usage is unbalanced.
[jira] [Created] (HDFS-16131) Show storage type for failed volumes on namenode web
tomscut created HDFS-16131: -- Summary: Show storage type for failed volumes on namenode web Key: HDFS-16131 URL: https://issues.apache.org/jira/browse/HDFS-16131 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut To make it easy to query the storage type for failed volumes, we can display them on the namenode web.
[jira] [Created] (HDFS-16122) Fix DistCpContext#toString()
tomscut created HDFS-16122: -- Summary: Fix DistCpContext#toString() Key: HDFS-16122 URL: https://issues.apache.org/jira/browse/HDFS-16122 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Attachments: distcp.jpg !distcp.jpg|width=880,height=71!
[jira] [Created] (HDFS-16112) Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor
tomscut created HDFS-16112: -- Summary: Fix flaky unit test TestDecommissioningStatusWithBackoffMonitor Key: HDFS-16112 URL: https://issues.apache.org/jira/browse/HDFS-16112 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut The unit tests TestDecommissioningStatusWithBackoffMonitor#testDecommissionStatus and TestDecommissioningStatus#testDecommissionStatus have recently seemed a little flaky; we should fix them.
[jira] [Created] (HDFS-16110) Remove unused method reportChecksumFailure in DFSClient
tomscut created HDFS-16110: -- Summary: Remove unused method reportChecksumFailure in DFSClient Key: HDFS-16110 URL: https://issues.apache.org/jira/browse/HDFS-16110 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Remove the unused method reportChecksumFailure and fix some code style issues in DFSClient along the way.
[jira] [Created] (HDFS-16109) Fix some flaky unit tests since they often time out
tomscut created HDFS-16109: -- Summary: Fix some flaky unit tests since they often time out Key: HDFS-16109 URL: https://issues.apache.org/jira/browse/HDFS-16109 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Increase the timeout for TestBootstrapStandby, TestFsVolumeList and TestDecommissionWithBackoffMonitor since they often time out. TestBootstrapStandby: {code:java} [ERROR] Tests run: 8, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 159.474 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby [ERROR] testRateThrottling(org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby) Time elapsed: 31.262 s <<< ERROR! org.junit.runners.model.TestTimedOutException: test timed out after 3 milliseconds at java.io.RandomAccessFile.writeBytes(Native Method) at java.io.RandomAccessFile.write(RandomAccessFile.java:512) at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.tryLock(Storage.java:947) at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:910) at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:699) at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:642) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:387) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:243) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1224) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:795) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:673) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:760) at
org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:1014) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:989) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1763) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2261) at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2231) at org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby.testRateThrottling(TestBootstrapStandby.java:297) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) {code} TestFsVolumeList: {code:java} [ERROR] Tests run: 12, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 190.294 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList[ERROR] Tests run: 12, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 190.294 s <<< FAILURE! 
- in org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList[ERROR] testAddRplicaProcessorForAddingReplicaInMap(org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList) Time elapsed: 60.028 s <<< ERROR!org.junit.runners.model.TestTimedOutException: test timed out after 6 milliseconds at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429) at java.util.concurrent.FutureTask.get(FutureTask.java:191) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList.testAddRplicaProcessorForAddingReplicaInMap(TestFsVolumeList.java:395) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at
[jira] [Created] (HDFS-16106) Fix flaky unit test TestDFSShell
tomscut created HDFS-16106: -- Summary: Fix flaky unit test TestDFSShell Key: HDFS-16106 URL: https://issues.apache.org/jira/browse/HDFS-16106 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut This unit test occasionally fails. The value set for dfs.namenode.accesstime.precision is too low; as a result, during execution of the method the access time can be updated many times, eventually causing the assertion to fail. IMO, dfs.namenode.accesstime.precision should be greater than or equal to the timeout (120s) of TestDFSShell#testCopyCommandsWithPreserveOption(), or be set directly to 0 to disable this feature. {code:java} [ERROR] Tests run: 52, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 106.778 s <<< FAILURE! - in org.apache.hadoop.hdfs.TestDFSShell [ERROR] testCopyCommandsWithPreserveOption(org.apache.hadoop.hdfs.TestDFSShell) Time elapsed: 2.353 s <<< FAILURE!
java.lang.AssertionError: expected:<1625095098319> but was:<1625095099374> at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:633) at org.apache.hadoop.hdfs.TestDFSShell.testCopyCommandsWithPreserveOption(TestDFSShell.java:2282) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) [ERROR] testCopyCommandsWithPreserveOption(org.apache.hadoop.hdfs.TestDFSShell) Time elapsed: 2.467 s <<< FAILURE! 
java.lang.AssertionError: expected:<1625095192527> but was:<1625095193950> at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:633) at org.apache.hadoop.hdfs.TestDFSShell.testCopyCommandsWithPreserveOption(TestDFSShell.java:2323) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) [ERROR] testCopyCommandsWithPreserveOption(org.apache.hadoop.hdfs.TestDFSShell) Time elapsed: 2.173 s <<< FAILURE! 
java.lang.AssertionError: expected:<1625095196756> but was:<1625095197975> at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:633) at org.apache.hadoop.hdfs.TestDFSShell.testCopyCommandsWithPreserveOption(TestDFSShell.java:2303) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at
[jira] [Created] (HDFS-16104) Remove unused parameter and fix java doc for DiskBalancerCLI
tomscut created HDFS-16104: -- Summary: Remove unused parameter and fix java doc for DiskBalancerCLI Key: HDFS-16104 URL: https://issues.apache.org/jira/browse/HDFS-16104 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Remove unused parameter and fix java doc for DiskBalancerCLI.
[jira] [Created] (HDFS-16089) Add metric EcReconstructionValidateTimeMillis for StripedBlockReconstructor
tomscut created HDFS-16089: -- Summary: Add metric EcReconstructionValidateTimeMillis for StripedBlockReconstructor Key: HDFS-16089 URL: https://issues.apache.org/jira/browse/HDFS-16089 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Add metric EcReconstructionValidateTimeMillis for StripedBlockReconstructor, so that we can count the time elapsed in striped block reconstruction.
[jira] [Created] (HDFS-16088) Standby NameNode process getLiveDatanodeStorageReport request to reduce Active load
tomscut created HDFS-16088: -- Summary: Standby NameNode process getLiveDatanodeStorageReport request to reduce Active load Key: HDFS-16088 URL: https://issues.apache.org/jira/browse/HDFS-16088 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut As with [HDFS-13183|https://issues.apache.org/jira/browse/HDFS-13183], NameNodeConnector#getLiveDatanodeStorageReport() can also send its request to the SNN to reduce the ANN load. There are two points that need to be mentioned: 1. NameNodeConnector#getLiveDatanodeStorageReport() is OperationCategory.UNCHECKED in FSNamesystem, so we can access the SNN directly. 2. We can share the same UT (testBalancerRequestSBNWithHA) with NameNodeConnector#getBlocks().
[jira] [Created] (HDFS-16086) Add volume information to datanode log for tracing
tomscut created HDFS-16086: -- Summary: Add volume information to datanode log for tracing Key: HDFS-16086 URL: https://issues.apache.org/jira/browse/HDFS-16086 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut To keep track of which volume a block is written to, we can add the volume information to the datanode log.
[jira] [Created] (HDFS-16085) Move the getPermissionChecker out of the read lock
tomscut created HDFS-16085: -- Summary: Move the getPermissionChecker out of the read lock Key: HDFS-16085 URL: https://issues.apache.org/jira/browse/HDFS-16085 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut Move the getPermissionChecker out of the read lock in NamenodeFsck#getBlockLocations() since the operation does not need to be locked.
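The locking pattern being proposed can be sketched as follows. The class and method names here are illustrative stand-ins, not the real NamenodeFsck/FSNamesystem API: construct the permission checker before taking the read lock, since its creation needs no shared state.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the proposed change, not the actual HDFS code.
public class LockScopeSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Stand-in for getPermissionChecker(): constructing the checker
    // touches no shared namespace state, so it needs no lock.
    Object getPermissionChecker() {
        return new Object();
    }

    Object getBlockLocations() {
        Object pc = getPermissionChecker(); // moved out of the read lock
        lock.readLock().lock();
        try {
            // ... only the reads that truly need the lock happen here ...
            return pc;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Shrinking the locked region this way reduces read-lock hold time without changing behavior, since the moved call has no ordering dependency on the locked work.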
[jira] [Created] (HDFS-16079) Improve the block state change log
tomscut created HDFS-16079: -- Summary: Improve the block state change log Key: HDFS-16079 URL: https://issues.apache.org/jira/browse/HDFS-16079 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Improve the block state change log. Add readOnlyReplicas and replicasOnStaleNodes.
[jira] [Created] (HDFS-16078) Remove unused parameters for DatanodeManager.handleLifeline()
tomscut created HDFS-16078: -- Summary: Remove unused parameters for DatanodeManager.handleLifeline() Key: HDFS-16078 URL: https://issues.apache.org/jira/browse/HDFS-16078 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Remove unused parameters (blockPoolId, maxTransfers) for DatanodeManager.handleLifeline().
[jira] [Created] (HDFS-16076) Avoid using slow DataNodes for reading by sorting locations
tomscut created HDFS-16076: -- Summary: Avoid using slow DataNodes for reading by sorting locations Key: HDFS-16076 URL: https://issues.apache.org/jira/browse/HDFS-16076 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut After sorting, the expected location list will be: live -> slow -> stale -> staleAndSlow -> entering_maintenance -> decommissioned. This reduces the probability that slow nodes are used for reading.
[jira] [Created] (HDFS-16057) Make sure the order for location in ENTERING_MAINTENANCE state
tomscut created HDFS-16057: -- Summary: Make sure the order for location in ENTERING_MAINTENANCE state Key: HDFS-16057 URL: https://issues.apache.org/jira/browse/HDFS-16057 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut We use a comparator to sort locations in getBlockLocations(), and the expected result is: live -> stale -> entering_maintenance -> decommissioned. But networktopology.sortByDistance() will disrupt the order. We should also filter out nodes in state AdminStates.ENTERING_MAINTENANCE before calling networktopology.sortByDistance(). org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager#sortLocatedBlock() {code:java} DatanodeInfoWithStorage[] di = lb.getLocations(); // Move decommissioned/stale datanodes to the bottom Arrays.sort(di, comparator); // Sort nodes by network distance only for located blocks int lastActiveIndex = di.length - 1; while (lastActiveIndex > 0 && isInactive(di[lastActiveIndex])) { --lastActiveIndex; } int activeLen = lastActiveIndex + 1; if (nonDatanodeReader) { networktopology.sortByDistanceUsingNetworkLocation(client, lb.getLocations(), activeLen, createSecondaryNodeSorter()); } else { networktopology.sortByDistance(client, lb.getLocations(), activeLen, createSecondaryNodeSorter()); } {code}
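The intended ordering can be sketched with a rank-based comparator. This is an illustrative toy, not DatanodeManager itself: each state gets a rank, the array is sorted so live nodes come first and decommissioned nodes last, and the distance sort would then only be applied to the leading "active" prefix.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hedged sketch of the intended location ordering; class, enum, and
// method names are hypothetical, not the HDFS implementation.
public class LocationOrderSketch {
    // Ordinal order encodes the desired priority:
    // live -> stale -> entering_maintenance -> decommissioned.
    enum State { LIVE, STALE, ENTERING_MAINTENANCE, DECOMMISSIONED }

    static void sortLocations(State[] locations) {
        Arrays.sort(locations, Comparator.comparingInt(State::ordinal));
    }

    public static void main(String[] args) {
        State[] locs = {State.DECOMMISSIONED, State.LIVE,
                State.ENTERING_MAINTENANCE, State.STALE};
        sortLocations(locs);
        // Prints [LIVE, STALE, ENTERING_MAINTENANCE, DECOMMISSIONED]
        System.out.println(Arrays.toString(locs));
    }
}
```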
[jira] [Created] (HDFS-16048) Print network topology on the router web
tomscut created HDFS-16048: -- Summary: Print network topology on the router web Key: HDFS-16048 URL: https://issues.apache.org/jira/browse/HDFS-16048 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut In order to query the network topology information conveniently, we can print it on the router web. It's related to [HDFS-15970|https://issues.apache.org/jira/browse/HDFS-15970].
[jira] [Created] (HDFS-15991) Add location into datanode info for NameNodeMXBean
tomscut created HDFS-15991: -- Summary: Add location into datanode info for NameNodeMXBean Key: HDFS-15991 URL: https://issues.apache.org/jira/browse/HDFS-15991 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Add location into datanode info for NameNodeMXBean. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-15975) Use LongAdder instead of AtomicLong
tomscut created HDFS-15975: -- Summary: Use LongAdder instead of AtomicLong Key: HDFS-15975 URL: https://issues.apache.org/jira/browse/HDFS-15975 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut When counting some metrics, we can use LongAdder instead of AtomicLong to improve performance under contention. The long value returned by LongAdder is not an atomic snapshot, but I think we can tolerate that for metrics. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
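The trade-off above can be shown with a minimal counter sketch (MetricCounter is a hypothetical name, not an HDFS class). LongAdder spreads contended updates across internal cells, so concurrent increments scale better than AtomicLong's single CAS target; the cost is that sum() may miss in-flight updates rather than return an atomic snapshot.

```java
import java.util.concurrent.atomic.LongAdder;

public class MetricCounter {
    private final LongAdder bytesWritten = new LongAdder();

    public void record(long n) {
        // add() contends far less than AtomicLong.addAndGet() under load.
        bytesWritten.add(n);
    }

    public long value() {
        // sum() is not an atomic snapshot, unlike AtomicLong.get();
        // acceptable for monitoring counters.
        return bytesWritten.sum();
    }

    public static void main(String[] args) {
        MetricCounter c = new MetricCounter();
        c.record(5);
        c.record(7);
        System.out.println(c.value()); // prints 12
    }
}
```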
[jira] [Created] (HDFS-15970) Print network topology on web
tomscut created HDFS-15970: -- Summary: Print network topology on web Key: HDFS-15970 URL: https://issues.apache.org/jira/browse/HDFS-15970 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Attachments: hdfs-topology.jpg, hdfs-web.jpg In order to query the network topology information conveniently, we can print it on the web. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-15951) Remove unused parameters in NameNodeProxiesClient
tomscut created HDFS-15951: -- Summary: Remove unused parameters in NameNodeProxiesClient Key: HDFS-15951 URL: https://issues.apache.org/jira/browse/HDFS-15951 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Remove unused parameters in org.apache.hadoop.hdfs.NameNodeProxiesClient. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-15946) Fix java doc in FSPermissionChecker
tomscut created HDFS-15946: -- Summary: Fix java doc in FSPermissionChecker Key: HDFS-15946 URL: https://issues.apache.org/jira/browse/HDFS-15946 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Fix java doc for org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker#hasAclPermission. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-15938) Fix java doc in FSEditLog
tomscut created HDFS-15938: -- Summary: Fix java doc in FSEditLog Key: HDFS-15938 URL: https://issues.apache.org/jira/browse/HDFS-15938 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Fix java doc in org.apache.hadoop.hdfs.server.namenode.FSEditLog#logAddCacheDirectiveInfo. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-15906) Close FSImage and FSNamesystem after formatting is complete
tomscut created HDFS-15906: -- Summary: Close FSImage and FSNamesystem after formatting is complete Key: HDFS-15906 URL: https://issues.apache.org/jira/browse/HDFS-15906 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Close FSImage and FSNamesystem after formatting is complete in org.apache.hadoop.hdfs.server.namenode.NameNode#format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-15892) Add metric for editPendingQ in FSEditLogAsync
tomscut created HDFS-15892: -- Summary: Add metric for editPendingQ in FSEditLogAsync Key: HDFS-15892 URL: https://issues.apache.org/jira/browse/HDFS-15892 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut To monitor editPendingQ in FSEditLogAsync, we add a metric and print a log message when the queue is full. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-15884) RBF: Remove unused method getCreateLocation in RouterRpcServer
tomscut created HDFS-15884: -- Summary: RBF: Remove unused method getCreateLocation in RouterRpcServer Key: HDFS-15884 URL: https://issues.apache.org/jira/browse/HDFS-15884 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Remove unused method org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer#getCreateLocation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-15883) Add a metric BlockReportQueueFullCount
tomscut created HDFS-15883: -- Summary: Add a metric BlockReportQueueFullCount Key: HDFS-15883 URL: https://issues.apache.org/jira/browse/HDFS-15883 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Add a metric that reflects the number of times the block report queue is full. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-15879) Exclude slow nodes when choose targets for blocks
tomscut created HDFS-15879: -- Summary: Exclude slow nodes when choose targets for blocks Key: HDFS-15879 URL: https://issues.apache.org/jira/browse/HDFS-15879 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut We already monitor slow nodes, related to https://issues.apache.org/jira/browse/HDFS-11194. We can use a thread to periodically collect these slow nodes into a set, then use the set to filter out slow nodes when choosing targets for blocks. This feature can be configured to be turned on when needed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
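The collect-then-filter idea above can be sketched as follows. This is a hypothetical illustration, not the actual BlockPlacementPolicy code: SlowNodeFilter, refresh(), and filterCandidates() are assumed names, and datanodes are represented by plain string ids. A background thread periodically replaces the slow-node set, and target selection consults it read-only.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SlowNodeFilter {
    // volatile so readers always see a complete, immutable set;
    // the refresher swaps in a new set instead of mutating in place.
    private volatile Set<String> slowNodes = Collections.emptySet();

    // Called periodically by the collector thread with the latest
    // slow-node reports (e.g. from SlowPeerTracker).
    public void refresh(Set<String> latest) {
        slowNodes = new HashSet<>(latest);
    }

    // Drop slow nodes from the candidate list before choosing targets.
    public List<String> filterCandidates(List<String> candidates) {
        Set<String> snapshot = slowNodes;
        return candidates.stream()
                .filter(dn -> !snapshot.contains(dn))
                .collect(Collectors.toList());
    }
}
```

Guarding the feature behind a configuration flag, as the issue proposes, would simply mean returning the candidate list unchanged when the flag is off.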
[jira] [Created] (HDFS-15873) Add namenode address in logs for block report
tomscut created HDFS-15873: -- Summary: Add namenode address in logs for block report Key: HDFS-15873 URL: https://issues.apache.org/jira/browse/HDFS-15873 Project: Hadoop HDFS Issue Type: Wish Components: datanode, hdfs Reporter: tomscut Assignee: tomscut Add the namenode address to the logs for block reports. This makes it easier to track whether a block report was sent to the ANN or the SNN. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-15870) Remove unused configuration dfs.namenode.stripe.min
tomscut created HDFS-15870: -- Summary: Remove unused configuration dfs.namenode.stripe.min Key: HDFS-15870 URL: https://issues.apache.org/jira/browse/HDFS-15870 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Remove unused configuration dfs.namenode.stripe.min. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-15854) Make some parameters configurable for SlowDiskTracker and SlowPeerTracker
tomscut created HDFS-15854: -- Summary: Make some parameters configurable for SlowDiskTracker and SlowPeerTracker Key: HDFS-15854 URL: https://issues.apache.org/jira/browse/HDFS-15854 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Make some parameters configurable for SlowDiskTracker and SlowPeerTracker. Related to https://issues.apache.org/jira/browse/HDFS-15814. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org