[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking gap for inprogress EditLogInputStream
[ https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16557: --- Description: The lastTxId of an inprogress EditLogInputStream isn't necessarily HdfsServerConstants.INVALID_TXID. We can determine its status directly by EditLogInputStream#isInProgress. For example, set {color:#ff}{{dfs.ha.tail-edits.in-progress=true}}{color}, then run bootstrapStandby; the in-progress EditLogInputStream is misjudged, resulting in a gap check failure, which causes bootstrapStandby to fail. hdfs namenode -bootstrapStandby !image-2022-04-22-17-17-32-487.png|width=766,height=161! !image-2022-04-22-17-17-14-577.png|width=598,height=187! was: The lastTxId of an inprogress EditLogInputStream isn't necessarily HdfsServerConstants.INVALID_TXID. We can determine its status directly by EditLogInputStream#isInProgress. For example, set {color:#FF}{{dfs.ha.tail-edits.in-progress=true}}{color}, then run bootstrapStandby; the in-progress EditLogInputStream is misjudged, resulting in a gap check failure, which causes bootstrapStandby to fail. !image-2022-04-22-17-17-32-487.png|width=766,height=161! !image-2022-04-22-17-17-14-577.png|width=598,height=187! > BootstrapStandby failed because of checking gap for inprogress > EditLogInputStream > - > > Key: HDFS-16557 > URL: https://issues.apache.org/jira/browse/HDFS-16557 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: tomscut > Assignee: tomscut > Priority: Major > Labels: pull-request-available > Attachments: image-2022-04-22-17-17-14-577.png, > image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, > image-2022-04-22-17-17-32-487.png > > Time Spent: 10m > Remaining Estimate: 0h > > The lastTxId of an inprogress EditLogInputStream isn't necessarily > HdfsServerConstants.INVALID_TXID. We can determine its status directly by > EditLogInputStream#isInProgress. > For example, set > {color:#ff}{{dfs.ha.tail-edits.in-progress=true}}{color}, then run > bootstrapStandby; the in-progress EditLogInputStream is misjudged, > resulting in a gap check failure, which causes bootstrapStandby to fail. > hdfs namenode -bootstrapStandby > !image-2022-04-22-17-17-32-487.png|width=766,height=161! > !image-2022-04-22-17-17-14-577.png|width=598,height=187! -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
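The misjudgment described above can be modeled in a few lines. The sketch below is hypothetical, simplified code (the Stream class, hasGapBuggy, hasGapFixed, and the sentinel value are illustrative, not the actual Hadoop classes): with dfs.ha.tail-edits.in-progress=true, an in-progress segment can carry a real lastTxId, so a check keyed on lastTxId == INVALID_TXID wrongly subjects it to the gap check, while asking the stream itself via isInProgress() does not.

```java
// Hypothetical model of the gap check misjudgment -- not the Hadoop source.
class GapCheckSketch {
    static final long INVALID_TXID = -12345; // illustrative sentinel value

    static class Stream {
        final long firstTxId;
        final long lastTxId;
        final boolean inProgress;
        Stream(long firstTxId, long lastTxId, boolean inProgress) {
            this.firstTxId = firstTxId;
            this.lastTxId = lastTxId;
            this.inProgress = inProgress;
        }
        boolean isInProgress() { return inProgress; }
    }

    // Buggy variant: treats any stream with a valid lastTxId as finalized,
    // so an in-progress stream whose lastTxId is known gets gap-checked.
    static boolean hasGapBuggy(Stream s, long expectedNextTxId) {
        if (s.lastTxId == INVALID_TXID) {
            return false; // assumed in-progress, exempt from the gap check
        }
        return s.firstTxId != expectedNextTxId;
    }

    // Fixed variant: determine the status directly from the stream.
    static boolean hasGapFixed(Stream s, long expectedNextTxId) {
        if (s.isInProgress()) {
            return false; // in-progress streams are exempt from the gap check
        }
        return s.firstTxId != expectedNextTxId;
    }
}
```

With an in-progress stream whose lastTxId is already known, the buggy variant reports a spurious gap and the fixed variant does not.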
[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking gap for inprogress EditLogInputStream
[ https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16557: --- Description: The lastTxId of an inprogress EditLogInputStream isn't necessarily HdfsServerConstants.INVALID_TXID. We can determine its status directly by EditLogInputStream#isInProgress. For example, set {color:#FF}{{dfs.ha.tail-edits.in-progress=true}}{color}, then run bootstrapStandby; the in-progress EditLogInputStream is misjudged, resulting in a gap check failure, which causes bootstrapStandby to fail. !image-2022-04-22-17-17-32-487.png|width=766,height=161! !image-2022-04-22-17-17-14-577.png|width=598,height=187! was: The lastTxId of an inprogress EditLogInputStream isn't necessarily HdfsServerConstants.INVALID_TXID. We can determine its status directly by EditLogInputStream#isInProgress. For example, when running bootstrapStandby, the in-progress EditLogInputStream is misjudged, resulting in a gap check failure, which causes bootstrapStandby to fail. !image-2022-04-22-17-17-32-487.png|width=766,height=161! !image-2022-04-22-17-17-14-577.png|width=598,height=187! > BootstrapStandby failed because of checking gap for inprogress > EditLogInputStream > - > > Key: HDFS-16557 > URL: https://issues.apache.org/jira/browse/HDFS-16557 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: tomscut > Assignee: tomscut > Priority: Major > Labels: pull-request-available > Attachments: image-2022-04-22-17-17-14-577.png, > image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, > image-2022-04-22-17-17-32-487.png > > Time Spent: 10m > Remaining Estimate: 0h > > The lastTxId of an inprogress EditLogInputStream isn't necessarily > HdfsServerConstants.INVALID_TXID. We can determine its status directly by > EditLogInputStream#isInProgress. > For example, set > {color:#FF}{{dfs.ha.tail-edits.in-progress=true}}{color}, then run > bootstrapStandby; the in-progress EditLogInputStream is misjudged, > resulting in a gap check failure, which causes bootstrapStandby to fail. > !image-2022-04-22-17-17-32-487.png|width=766,height=161! > !image-2022-04-22-17-17-14-577.png|width=598,height=187!
[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking gap for inprogress EditLogInputStream
[ https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16557: --- Summary: BootstrapStandby failed because of checking gap for inprogress EditLogInputStream (was: BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream) > BootstrapStandby failed because of checking gap for inprogress > EditLogInputStream > - > > Key: HDFS-16557 > URL: https://issues.apache.org/jira/browse/HDFS-16557 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: tomscut > Assignee: tomscut > Priority: Major > Labels: pull-request-available > Attachments: image-2022-04-22-17-17-14-577.png, > image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, > image-2022-04-22-17-17-32-487.png > > Time Spent: 10m > Remaining Estimate: 0h > > The lastTxId of an inprogress EditLogInputStream isn't necessarily > HdfsServerConstants.INVALID_TXID. We can determine its status directly by > EditLogInputStream#isInProgress. > For example, when running bootstrapStandby, the in-progress EditLogInputStream is > misjudged, resulting in a gap check failure, which causes bootstrapStandby to > fail. > !image-2022-04-22-17-17-32-487.png|width=766,height=161! > !image-2022-04-22-17-17-14-577.png|width=598,height=187!
[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream
[ https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16557: --- Attachment: image-2022-04-22-17-17-32-487.png > BootstrapStandby failed because of checking Gap for inprogress > EditLogInputStream > - > > Key: HDFS-16557 > URL: https://issues.apache.org/jira/browse/HDFS-16557 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: tomscut > Assignee: tomscut > Priority: Major > Attachments: image-2022-04-22-17-17-14-577.png, > image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, > image-2022-04-22-17-17-32-487.png > > > The lastTxId of an inprogress EditLogInputStream isn't necessarily > HdfsServerConstants.INVALID_TXID. We can determine its status directly by > EditLogInputStream#isInProgress. > For example, when running bootstrapStandby, the in-progress EditLogInputStream is > misjudged, resulting in a gap check failure, which causes bootstrapStandby to > fail. > !image-2022-04-22-17-17-23-113.png! > !image-2022-04-22-17-17-14-577.png!
[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream
[ https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16557: --- Description: The lastTxId of an inprogress EditLogInputStream isn't necessarily HdfsServerConstants.INVALID_TXID. We can determine its status directly by EditLogInputStream#isInProgress. For example, when running bootstrapStandby, the in-progress EditLogInputStream is misjudged, resulting in a gap check failure, which causes bootstrapStandby to fail. !image-2022-04-22-17-17-23-113.png! !image-2022-04-22-17-17-14-577.png! was: The lastTxId of an inprogress EditLogInputStream isn't necessarily HdfsServerConstants.INVALID_TXID. We can determine its status directly by EditLogInputStream#isInProgress. For example, when running bootstrapStandby, the in-progress EditLogInputStream is misjudged, resulting in a gap check failure, which causes bootstrapStandby to fail. > BootstrapStandby failed because of checking Gap for inprogress > EditLogInputStream > - > > Key: HDFS-16557 > URL: https://issues.apache.org/jira/browse/HDFS-16557 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: tomscut > Assignee: tomscut > Priority: Major > Attachments: image-2022-04-22-17-17-14-577.png, > image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, > image-2022-04-22-17-17-32-487.png > > > The lastTxId of an inprogress EditLogInputStream isn't necessarily > HdfsServerConstants.INVALID_TXID. We can determine its status directly by > EditLogInputStream#isInProgress. > For example, when running bootstrapStandby, the in-progress EditLogInputStream is > misjudged, resulting in a gap check failure, which causes bootstrapStandby to > fail. > !image-2022-04-22-17-17-23-113.png! > !image-2022-04-22-17-17-14-577.png!
[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream
[ https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16557: --- Attachment: image-2022-04-22-17-17-14-618.png > BootstrapStandby failed because of checking Gap for inprogress > EditLogInputStream > - > > Key: HDFS-16557 > URL: https://issues.apache.org/jira/browse/HDFS-16557 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: tomscut > Assignee: tomscut > Priority: Major > Attachments: image-2022-04-22-17-17-14-577.png, > image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, > image-2022-04-22-17-17-32-487.png > > > The lastTxId of an inprogress EditLogInputStream isn't necessarily > HdfsServerConstants.INVALID_TXID. We can determine its status directly by > EditLogInputStream#isInProgress. > For example, when running bootstrapStandby, the in-progress EditLogInputStream is > misjudged, resulting in a gap check failure, which causes bootstrapStandby to > fail.
[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream
[ https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16557: --- Attachment: image-2022-04-22-17-17-23-113.png > BootstrapStandby failed because of checking Gap for inprogress > EditLogInputStream > - > > Key: HDFS-16557 > URL: https://issues.apache.org/jira/browse/HDFS-16557 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: tomscut > Assignee: tomscut > Priority: Major > Attachments: image-2022-04-22-17-17-14-577.png, > image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, > image-2022-04-22-17-17-32-487.png > > > The lastTxId of an inprogress EditLogInputStream isn't necessarily > HdfsServerConstants.INVALID_TXID. We can determine its status directly by > EditLogInputStream#isInProgress. > For example, when running bootstrapStandby, the in-progress EditLogInputStream is > misjudged, resulting in a gap check failure, which causes bootstrapStandby to > fail. > !image-2022-04-22-17-17-23-113.png! > !image-2022-04-22-17-17-14-577.png!
[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream
[ https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16557: --- Description: The lastTxId of an inprogress EditLogInputStream isn't necessarily HdfsServerConstants.INVALID_TXID. We can determine its status directly by EditLogInputStream#isInProgress. For example, when running bootstrapStandby, the in-progress EditLogInputStream is misjudged, resulting in a gap check failure, which causes bootstrapStandby to fail. !image-2022-04-22-17-17-32-487.png|width=766,height=161! !image-2022-04-22-17-17-14-577.png|width=598,height=187! was: The lastTxId of an inprogress EditLogInputStream isn't necessarily HdfsServerConstants.INVALID_TXID. We can determine its status directly by EditLogInputStream#isInProgress. For example, when running bootstrapStandby, the in-progress EditLogInputStream is misjudged, resulting in a gap check failure, which causes bootstrapStandby to fail. !image-2022-04-22-17-17-23-113.png! !image-2022-04-22-17-17-14-577.png! > BootstrapStandby failed because of checking Gap for inprogress > EditLogInputStream > - > > Key: HDFS-16557 > URL: https://issues.apache.org/jira/browse/HDFS-16557 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: tomscut > Assignee: tomscut > Priority: Major > Attachments: image-2022-04-22-17-17-14-577.png, > image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, > image-2022-04-22-17-17-32-487.png > > > The lastTxId of an inprogress EditLogInputStream isn't necessarily > HdfsServerConstants.INVALID_TXID. We can determine its status directly by > EditLogInputStream#isInProgress. > For example, when running bootstrapStandby, the in-progress EditLogInputStream is > misjudged, resulting in a gap check failure, which causes bootstrapStandby to > fail. > !image-2022-04-22-17-17-32-487.png|width=766,height=161! > !image-2022-04-22-17-17-14-577.png|width=598,height=187!
[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream
[ https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16557: --- Attachment: image-2022-04-22-17-17-14-577.png > BootstrapStandby failed because of checking Gap for inprogress > EditLogInputStream > - > > Key: HDFS-16557 > URL: https://issues.apache.org/jira/browse/HDFS-16557 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: tomscut > Assignee: tomscut > Priority: Major > Attachments: image-2022-04-22-17-17-14-577.png, > image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, > image-2022-04-22-17-17-32-487.png > > > The lastTxId of an inprogress EditLogInputStream isn't necessarily > HdfsServerConstants.INVALID_TXID. We can determine its status directly by > EditLogInputStream#isInProgress. > For example, when running bootstrapStandby, the in-progress EditLogInputStream is > misjudged, resulting in a gap check failure, which causes bootstrapStandby to > fail.
[jira] [Created] (HDFS-16557) BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream
tomscut created HDFS-16557: -- Summary: BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream Key: HDFS-16557 URL: https://issues.apache.org/jira/browse/HDFS-16557 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut The lastTxId of an inprogress EditLogInputStream isn't necessarily HdfsServerConstants.INVALID_TXID. We can determine its status directly by EditLogInputStream#isInProgress. For example, when running bootstrapStandby, the in-progress EditLogInputStream is misjudged, resulting in a gap check failure, which causes bootstrapStandby to fail.
[jira] [Updated] (HDFS-16552) Fix NPE for TestBlockManager
[ https://issues.apache.org/jira/browse/HDFS-16552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16552: --- Summary: Fix NPE for TestBlockManager (was: Fix NPE for BlockManager#scheduleReconstruction) > Fix NPE for TestBlockManager > > > Key: HDFS-16552 > URL: https://issues.apache.org/jira/browse/HDFS-16552 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: tomscut > Assignee: tomscut > Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > There is an NPE in BlockManager when running > TestBlockManager#testSkipReconstructionWithManyBusyNodes2, because > NameNodeMetrics is not initialized in this unit test. > > For the related CI log, see > [this|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt]. > {code:java} > [ERROR] Tests run: 34, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 30.088 s <<< FAILURE! - in > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager > [ERROR] > testSkipReconstructionWithManyBusyNodes2(org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager) > Time elapsed: 2.783 s <<< ERROR! 
> java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.scheduleReconstruction(BlockManager.java:2171) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSkipReconstructionWithManyBusyNodes2(TestBlockManager.java:947) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > at > org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) > at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) > at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > at org.junit.runners.ParentRunner.run(ParentRunner.java:413) > at > 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > {code} > >
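The failure mode described above (a metrics object that the unit test never initializes, dereferenced unconditionally) can be modeled with a small, self-contained sketch. All names below are illustrative, not the Hadoop source; the guarded variant shows the kind of fix a patch might use, and initializing the metrics in the test setup would work equally well.

```java
// Hypothetical model of the NPE: a static metrics handle, like NameNodeMetrics,
// that is null in the unit-test scenario.
class MetricsNpeSketch {
    static class Metrics {
        long pending = 0;
        void incrPending() { pending++; }
    }

    static Metrics metrics; // never initialized by the test, so it stays null

    // Mirrors the failing path: dereferences the metrics unconditionally.
    static void scheduleReconstructionUnsafe() {
        metrics.incrPending(); // NPE when metrics was never initialized
    }

    // Null-guarded variant: skips the metrics update when not initialized.
    static void scheduleReconstructionGuarded() {
        if (metrics != null) {
            metrics.incrPending();
        }
    }
}
```

The unsafe path throws NullPointerException exactly as in the CI log, while the guarded path completes normally.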
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526151#comment-17526151 ] tomscut commented on HDFS-16550: I have submitted a simple PR following the fast-fail approach. [~sunchao] [~xkrogen] Please help take a look; thank you very much. > [SBN read] Improper cache-size for journal node may cause cluster crash > --- > > Key: HDFS-16550 > URL: https://issues.apache.org/jira/browse/HDFS-16550 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: tomscut > Assignee: tomscut > Priority: Major > Labels: pull-request-available > Attachments: image-2022-04-21-09-54-29-751.png, > image-2022-04-21-09-54-57-111.png, image-2022-04-21-12-32-56-170.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > When we introduced {*}SBN Read{*}, we encountered a situation while upgrading > the JournalNodes. > Cluster Info: > *Active: nn0* > *Standby: nn1* > 1. Rolling restart of the journal nodes. {color:#ff}(related config: > dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color} > 2. The cluster runs for a while; edits cache usage keeps increasing until memory > is used up. > 3. The {color:#ff}active namenode (nn0){color} shut down because of "{_}Timed > out waiting 12ms for a quorum of nodes to respond{_}". > 4. nn1 was transitioned to the Active state. > 5. The {color:#ff}new active namenode (nn1){color} also shut down because of > "{_}Timed out waiting 12ms for a quorum of nodes to respond{_}". > 6. {color:#ff}The cluster crashed{color}. > > Related code: > {code:java}
> JournaledEditsCache(Configuration conf) {
>   capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
>       DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
>   if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
>     Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
>         "maximum JVM memory is only %d bytes. It is recommended that you " +
>         "decrease the cache size or increase the heap size.",
>         capacity, Runtime.getRuntime().maxMemory()));
>   }
>   Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
>       "of bytes: " + capacity);
>   ReadWriteLock lock = new ReentrantReadWriteLock(true);
>   readLock = new AutoCloseableLock(lock.readLock());
>   writeLock = new AutoCloseableLock(lock.writeLock());
>   initialize(INVALID_TXN_ID);
> } {code}
> Currently, *dfs.journalnode.edit-cache-size.bytes* can be set to a larger size > than the memory available to the process. If > {*}dfs.journalnode.edit-cache-size.bytes > 0.9 * > Runtime.getRuntime().maxMemory(){*}, only a warning is logged during > journalnode startup. This can easily be overlooked by users. However, after the > cluster has run for some time, it is likely to crash. > > NN log: > !image-2022-04-21-09-54-57-111.png|width=1012,height=47! > !image-2022-04-21-12-32-56-170.png|width=809,height=218! > IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * > Runtime.getRuntime().maxMemory(){*}, we should throw an exception and > {color:#ff}fast fail{color}, giving users a clear hint to update the related > configuration. Or, if the cache size exceeds 50% (or some other > threshold) of maxMemory, force the cache size to 25% of maxMemory.
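The proposed fast-fail could look roughly like the sketch below. The method name validateCapacity and the reuse of the existing 0.9 threshold are assumptions for illustration, not the actual patch: instead of merely warning when the configured cache capacity approaches the JVM heap, construction throws, so the JournalNode refuses to start with a dangerous configuration.

```java
// Hypothetical sketch of the fast-fail proposal; not the actual Hadoop patch.
class CacheCapacityCheck {
    // Mirrors the 0.9 threshold already used for the warning in JournaledEditsCache.
    static final double THRESHOLD = 0.9;

    // Returns the capacity if it is safe, otherwise fails fast with a clear message
    // instead of printing a warning that is easy to overlook.
    static long validateCapacity(long capacityBytes, long maxMemoryBytes) {
        if (capacityBytes > THRESHOLD * maxMemoryBytes) {
            throw new IllegalArgumentException(String.format(
                "Cache capacity is set at %d bytes but maximum JVM memory is only "
                + "%d bytes; decrease dfs.journalnode.edit-cache-size.bytes or "
                + "increase the heap size.", capacityBytes, maxMemoryBytes));
        }
        return capacityBytes;
    }
}
```

In the reported scenario (cache size 1G with -Xmx1G), this check would abort JournalNode startup immediately rather than letting the cluster run until memory is exhausted.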
[jira] [Updated] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16550: --- Description: When we introduced {*}SBN Read{*}, we encountered a situation while upgrading the JournalNodes. Cluster Info: *Active: nn0* *Standby: nn1* 1. Rolling restart of the journal nodes. {color:#ff}(related config: dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color} 2. The cluster runs for a while; edits cache usage keeps increasing until memory is used up. 3. The {color:#ff}active namenode (nn0){color} shut down because of "{_}Timed out waiting 12ms for a quorum of nodes to respond{_}". 4. nn1 was transitioned to the Active state. 5. The {color:#ff}new active namenode (nn1){color} also shut down because of "{_}Timed out waiting 12ms for a quorum of nodes to respond{_}". 6. {color:#ff}The cluster crashed{color}. Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
      DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
    Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
        "maximum JVM memory is only %d bytes. It is recommended that you " +
        "decrease the cache size or increase the heap size.",
        capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
      "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *dfs.journalnode.edit-cache-size.bytes* can be set to a larger size than the memory available to the process. If {*}dfs.journalnode.edit-cache-size.bytes > 0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during journalnode startup. This can easily be overlooked by users. However, after the cluster has run for some time, it is likely to crash. NN log: !image-2022-04-21-09-54-57-111.png|width=1012,height=47! !image-2022-04-21-12-32-56-170.png|width=809,height=218! IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * Runtime.getRuntime().maxMemory(){*}, we should throw an exception and {color:#ff}fast fail{color}, giving users a clear hint to update the related configuration. Or, if the cache size exceeds 50% (or some other threshold) of maxMemory, force the cache size to 25% of maxMemory. was: When we introduced {*}SBN Read{*}, we encountered a situation while upgrading the JournalNodes. Cluster Info: *Active: nn0* *Standby: nn1* 1. Rolling restart of the journal nodes. {color:#ff}(related config: dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color} 2. The cluster runs for a while; edits cache usage keeps increasing until memory is used up. 3. The {color:#ff}active namenode (nn0){color} shut down because of "{_}Timed out waiting 12ms for a quorum of nodes to respond{_}". 4. nn1 was transitioned to the Active state. 5. The {color:#ff}new active namenode (nn1){color} also shut down because of "{_}Timed out waiting 12ms for a quorum of nodes to respond{_}". 6. {color:#ff}The cluster crashed{color}. Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
      DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
    Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
        "maximum JVM memory is only %d bytes. It is recommended that you " +
        "decrease the cache size or increase the heap size.",
        capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
      "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *dfs.journalnode.edit-cache-size.bytes* can be set to a larger size than the memory available to the process. If {*}dfs.journalnode.edit-cache-size.bytes > 0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during journalnode startup. This can easily be overlooked by users. However, after the cluster has run for some time, it is likely to crash. NN log: !image-2022-04-21-09-54-57-111.png|width=1012,height=47! !image-2022-04-21-12-32-56-170.png|width=809,height=218! IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * Runtime.getRuntime().maxMemory(){*}, we should throw an exception and {color:#ff}fast fail{color}, giving users a clear hint to update the related configuration.
[jira] [Updated] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16550: --- Description: When we introduced {*}SBN Read{*}, we encountered a situation during upgrade the JournalNodes. Cluster Info: *Active: nn0* *Standby: nn1* 1. Rolling restart journal node. {color:#ff}(related config: fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx=1G){color} 2. The cluster runs for a while, edits cache usage is increasing and memory is used up. 3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed out waiting 12ms for a quorum of nodes to respond”{_}. 4. Transfer nn1 to Active state. 5. {color:#ff}New Active namenode(nn1){color} also shutdown because of “{_}Timed out waiting 12ms for a quorum of nodes to respond” too{_}. 6. {color:#ff}The cluster crashed{color}. Related code: {code:java} JournaledEditsCache(Configuration conf) { capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY, DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT); if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) { Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " + "maximum JVM memory is only %d bytes. It is recommended that you " + "decrease the cache size or increase the heap size.", capacity, Runtime.getRuntime().maxMemory())); } Journal.LOG.info("Enabling the journaled edits cache with a capacity " + "of bytes: " + capacity); ReadWriteLock lock = new ReentrantReadWriteLock(true); readLock = new AutoCloseableLock(lock.readLock()); writeLock = new AutoCloseableLock(lock.writeLock()); initialize(INVALID_TXN_ID); } {code} Currently, *fs.journalNode.edit-cache-size-bytes* can be set to a larger size than the memory requested by the process. If {*}fs.journalNode.edit-cache-sie.bytes > 0.9 * Runtime.getruntime().maxMemory(){*}, only warn logs are printed during journalnode startup. This can easily be overlooked by users. 
However, as the cluster runs to a certain period of time, it is likely to cause the cluster to crash. NN log: !image-2022-04-21-09-54-57-111.png|width=1012,height=47! !image-2022-04-21-12-32-56-170.png|width=809,height=218! IMO, when {*}fs.journalNode.edit-cache-size-bytes > threshold * Runtime.getruntime ().maxMemory(){*}, we should throw an Exception and {color:#ff}fast fail{color}. Giving a clear hint for users to update related configurations. was: When we introduced {*}SBN Read{*}, we encountered a situation during upgrade the JournalNodes. Cluster Info: *Active: nn0* *Standby: nn1* 1. Rolling restart journal node. {color:#ff}(related config: fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx=1G){color} 2. The cluster runs for a while. 3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed out waiting 12ms for a quorum of nodes to respond”{_}. 4. Transfer nn1 to Active state. 5. {color:#ff}New Active namenode(nn1){color} also shutdown because of “{_}Timed out waiting 12ms for a quorum of nodes to respond” too{_}. 6. {color:#ff}The cluster crashed{color}. Related code: {code:java} JournaledEditsCache(Configuration conf) { capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY, DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT); if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) { Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " + "maximum JVM memory is only %d bytes. 
It is recommended that you " + "decrease the cache size or increase the heap size.", capacity, Runtime.getRuntime().maxMemory())); } Journal.LOG.info("Enabling the journaled edits cache with a capacity " + "of bytes: " + capacity); ReadWriteLock lock = new ReentrantReadWriteLock(true); readLock = new AutoCloseableLock(lock.readLock()); writeLock = new AutoCloseableLock(lock.writeLock()); initialize(INVALID_TXN_ID); } {code} Currently, *fs.journalNode.edit-cache-size-bytes* can be set to a larger size than the memory requested by the process. If {*}fs.journalNode.edit-cache-sie.bytes > 0.9 * Runtime.getruntime().maxMemory(){*}, only warn logs are printed during journalnode startup. This can easily be overlooked by users. However, as the cluster runs to a certain period of time, it is likely to cause the cluster to crash. NN log: !image-2022-04-21-09-54-57-111.png|width=1012,height=47! !image-2022-04-21-12-32-56-170.png|width=809,height=218! IMO, when {*}fs.journalNode.edit-cache-size-bytes > threshold * Runtime.getruntime ().maxMemory(){*}, we should throw an Exception and {color:#ff}fast fail{color}. Giving a clear hint for users to update related configurations. > [SBN read] Improper cache-size for journal node may cause cluster crash >
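The fail-fast behavior proposed above can be sketched in isolation. This is an illustrative sketch, not the actual HDFS patch: the class and method names below are hypothetical, and the 0.9 threshold simply mirrors the existing warn condition in JournaledEditsCache.

```java
// Hypothetical sketch of the proposed fast-fail check; the class name,
// method name, and 0.9 threshold are assumptions mirroring the warn
// condition quoted above, not the real HDFS code.
public class CacheCapacityCheck {
    static final double THRESHOLD = 0.9;

    // Throw instead of only warning, so a JournalNode configured with an
    // edit cache larger than ~90% of the heap refuses to start.
    static void validate(long capacityBytes, long maxMemoryBytes) {
        if (capacityBytes > THRESHOLD * maxMemoryBytes) {
            throw new IllegalArgumentException(String.format(
                "Cache capacity is set at %d bytes but maximum JVM memory is only %d bytes. "
                + "Decrease the cache size or increase the heap size.",
                capacityBytes, maxMemoryBytes));
        }
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024L * 1024L;
        // The reported configuration (1G cache with -Xmx1G) now fails fast:
        try {
            validate(oneGiB, oneGiB);
        } catch (IllegalArgumentException e) {
            System.out.println("fast fail: " + e.getMessage());
        }
        // A modestly sized cache is still accepted:
        validate(100L * 1024L * 1024L, oneGiB);
        System.out.println("ok");
    }
}
```

With a check like this, a misconfiguration surfaces at JournalNode startup rather than hours later as a quorum timeout on the NameNodes.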
[jira] [Updated] (HDFS-16552) Fix NPE for BlockManager#scheduleReconstruction
[ https://issues.apache.org/jira/browse/HDFS-16552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16552: --- Summary: Fix NPE for BlockManager#scheduleReconstruction (was: Fix NPE for BlockManager) > Fix NPE for BlockManager#scheduleReconstruction > --- > > Key: HDFS-16552 > URL: https://issues.apache.org/jira/browse/HDFS-16552 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: tomscut >Assignee: tomscut >Priority: Major > > There is an NPE in BlockManager when running > TestBlockManager#testSkipReconstructionWithManyBusyNodes2, because > NameNodeMetrics is not initialized in this unit test. > > For the related CI log, see > [this|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt]. > {code:java} > [ERROR] Tests run: 34, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 30.088 s <<< FAILURE! - in > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager > [ERROR] > testSkipReconstructionWithManyBusyNodes2(org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager) > Time elapsed: 2.783 s <<< ERROR! 
> java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.scheduleReconstruction(BlockManager.java:2171) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSkipReconstructionWithManyBusyNodes2(TestBlockManager.java:947) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > at > org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) > at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) > at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > at org.junit.runners.ParentRunner.run(ParentRunner.java:413) > at > 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > {code} > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
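The failure pattern above, a metrics object left uninitialized by the test harness, can be illustrated with a minimal sketch. All names below are hypothetical stand-ins for BlockManager's NameNodeMetrics usage; the actual fix may instead initialize NameNodeMetrics in the test setup.

```java
// Minimal illustration of the NPE: a metrics field stays null because the
// unit test never initializes it. Names here are hypothetical, not the
// actual HDFS code.
public class MetricsNullGuard {
    interface Metrics {
        void incrReconstructionScheduled();
    }

    static Metrics metrics; // stays null when a test skips metrics setup

    // Unguarded call: throws NullPointerException in such a test.
    static void scheduleReconstructionUnsafe() {
        metrics.incrReconstructionScheduled();
    }

    // Guarded call: safe whether or not metrics was initialized.
    static void scheduleReconstructionSafe() {
        if (metrics != null) {
            metrics.incrReconstructionScheduled();
        }
    }

    public static void main(String[] args) {
        try {
            scheduleReconstructionUnsafe();
        } catch (NullPointerException e) {
            System.out.println("NPE, as in the failing test");
        }
        scheduleReconstructionSafe(); // no exception
        System.out.println("ok");
    }
}
```

Either approach, a null guard in production code or proper metrics initialization in the test, removes the failure; which one the patch chose is not shown in this thread.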
[jira] [Created] (HDFS-16552) Fix NPE for BlockManager
tomscut created HDFS-16552: -- Summary: Fix NPE for BlockManager Key: HDFS-16552 URL: https://issues.apache.org/jira/browse/HDFS-16552 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut There is an NPE in BlockManager when running TestBlockManager#testSkipReconstructionWithManyBusyNodes2, because NameNodeMetrics is not initialized in this unit test. For the related CI log, see [this|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt]. {code:java} [ERROR] Tests run: 34, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 30.088 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager [ERROR] testSkipReconstructionWithManyBusyNodes2(org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager) Time elapsed: 2.783 s <<< ERROR! java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.scheduleReconstruction(BlockManager.java:2171) at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSkipReconstructionWithManyBusyNodes2(TestBlockManager.java:947) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) {code}
[jira] [Updated] (HDFS-16552) Fix NPE for BlockManager
[ https://issues.apache.org/jira/browse/HDFS-16552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16552: --- Description: There is an NPE in BlockManager when running TestBlockManager#testSkipReconstructionWithManyBusyNodes2, because NameNodeMetrics is not initialized in this unit test. For the related CI log, see [this|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt]. {code:java} [ERROR] Tests run: 34, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 30.088 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager [ERROR] testSkipReconstructionWithManyBusyNodes2(org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager) Time elapsed: 2.783 s <<< ERROR! java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.scheduleReconstruction(BlockManager.java:2171) at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSkipReconstructionWithManyBusyNodes2(TestBlockManager.java:947) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) {code} was: There is a NPE in BlockManager when run TestBlockManager#testSkipReconstructionWithManyBusyNodes2. Because NameNodeMetrics is not initialized in this unit test. Related ci link, see [this|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt]. {code:java} [ERROR] Tests run: 34, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 30.088 s <<< FAILURE! 
- in org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager [ERROR] testSkipReconstructionWithManyBusyNodes2(org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager) Time elapsed: 2.783 s <<< ERROR! java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.scheduleReconstruction(BlockManager.java:2171) at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSkipReconstructionWithManyBusyNodes2(TestBlockManager.java:947) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at
[jira] [Updated] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16550: --- Description: When we introduced {*}SBN Read{*}, we encountered a situation during upgrade the JournalNodes. Cluster Info: *Active: nn0* *Standby: nn1* 1. Rolling restart journal node. {color:#ff}(related config: fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx=1G){color} 2. The cluster runs for a while. 3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed out waiting 12ms for a quorum of nodes to respond”{_}. 4. Transfer nn1 to Active state. 5. {color:#ff}New Active namenode(nn1){color} also shutdown because of “{_}Timed out waiting 12ms for a quorum of nodes to respond” too{_}. 6. {color:#ff}The cluster crashed{color}. Related code: {code:java} JournaledEditsCache(Configuration conf) { capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY, DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT); if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) { Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " + "maximum JVM memory is only %d bytes. It is recommended that you " + "decrease the cache size or increase the heap size.", capacity, Runtime.getRuntime().maxMemory())); } Journal.LOG.info("Enabling the journaled edits cache with a capacity " + "of bytes: " + capacity); ReadWriteLock lock = new ReentrantReadWriteLock(true); readLock = new AutoCloseableLock(lock.readLock()); writeLock = new AutoCloseableLock(lock.writeLock()); initialize(INVALID_TXN_ID); } {code} Currently, *fs.journalNode.edit-cache-size-bytes* can be set to a larger size than the memory requested by the process. If {*}fs.journalNode.edit-cache-sie.bytes > 0.9 * Runtime.getruntime().maxMemory(){*}, only warn logs are printed during journalnode startup. This can easily be overlooked by users. 
However, as the cluster runs to a certain period of time, it is likely to cause the cluster to crash. NN log: !image-2022-04-21-09-54-57-111.png|width=1012,height=47! !image-2022-04-21-12-32-56-170.png|width=809,height=218! IMO, when {*}fs.journalNode.edit-cache-size-bytes > threshold * Runtime.getruntime ().maxMemory(){*}, we should throw an Exception and {color:#ff}fast fail{color}. Giving a clear hint for users to update related configurations. was: When we introduced {*}SBN Read{*}, we encountered a situation during upgrade the JournalNodes. Cluster Info: *Active: nn0* *Standby: nn1* 1. Rolling restart journal node. {color:#ff}(related config: fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx=1G){color} 2. The cluster runs for a while. 3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed out waiting 12ms for a quorum of nodes to respond”{_}. 4. Transfer nn1 to Active state. 5. {color:#ff}New Active namenode(nn1){color} also shutdown because of Timed out waiting 12ms for a quorum of nodes to respond. 6. {color:#ff}The cluster crashed{color}. Related code: {code:java} JournaledEditsCache(Configuration conf) { capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY, DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT); if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) { Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " + "maximum JVM memory is only %d bytes. It is recommended that you " + "decrease the cache size or increase the heap size.", capacity, Runtime.getRuntime().maxMemory())); } Journal.LOG.info("Enabling the journaled edits cache with a capacity " + "of bytes: " + capacity); ReadWriteLock lock = new ReentrantReadWriteLock(true); readLock = new AutoCloseableLock(lock.readLock()); writeLock = new AutoCloseableLock(lock.writeLock()); initialize(INVALID_TXN_ID); } {code} Currently, *fs.journalNode.edit-cache-size-bytes* can be set to a larger size than the memory requested by the process. 
If {*}fs.journalNode.edit-cache-sie.bytes > 0.9 * Runtime.getruntime().maxMemory(){*}, only warn logs are printed during journalnode startup. This can easily be overlooked by users. However, as the cluster runs to a certain period of time, it is likely to cause the cluster to crash. NN log: !image-2022-04-21-09-54-57-111.png|width=1012,height=47! !image-2022-04-21-12-32-56-170.png|width=809,height=218! IMO, when {*}fs.journalNode.edit-cache-size-bytes > threshold * Runtime.getruntime ().maxMemory(){*}, we should throw an Exception and {color:#ff}fast fail{color}. Giving a clear hint for users to update related configurations. > [SBN read] Improper cache-size for journal node may cause cluster crash > --- > > Key: HDFS-16550
[jira] [Updated] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16550: --- Attachment: image-2022-04-21-12-32-56-170.png Description: When we introduced {*}SBN Read{*}, we encountered a situation during upgrade the JournalNodes. Cluster Info: *Active: nn0* *Standby: nn1* 1. Rolling restart journal node. {color:#ff}(related config: fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx=1G){color} 2. The cluster runs for a while. 3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed out waiting 12ms for a quorum of nodes to respond”{_}. 4. Transfer nn1 to Active state. 5. {color:#ff}New Active namenode(nn1){color} also shutdown because of Timed out waiting 12ms for a quorum of nodes to respond. 6. {color:#ff}The cluster crashed{color}. Related code: {code:java} JournaledEditsCache(Configuration conf) { capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY, DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT); if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) { Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " + "maximum JVM memory is only %d bytes. It is recommended that you " + "decrease the cache size or increase the heap size.", capacity, Runtime.getRuntime().maxMemory())); } Journal.LOG.info("Enabling the journaled edits cache with a capacity " + "of bytes: " + capacity); ReadWriteLock lock = new ReentrantReadWriteLock(true); readLock = new AutoCloseableLock(lock.readLock()); writeLock = new AutoCloseableLock(lock.writeLock()); initialize(INVALID_TXN_ID); } {code} Currently, *fs.journalNode.edit-cache-size-bytes* can be set to a larger size than the memory requested by the process. If {*}fs.journalNode.edit-cache-sie.bytes > 0.9 * Runtime.getruntime().maxMemory(){*}, only warn logs are printed during journalnode startup. This can easily be overlooked by users. 
However, as the cluster runs to a certain period of time, it is likely to cause the cluster to crash. !image-2022-04-21-09-54-57-111.png|width=1012,height=47! !image-2022-04-21-12-32-56-170.png|width=809,height=218! IMO, when {*}fs.journalNode.edit-cache-size-bytes > threshold * Runtime.getruntime ().maxMemory(){*}, we should throw an Exception and {color:#ff}fast fail{color}. Giving a clear hint for users to update related configurations. was: When we introduced {*}SBN Read{*}, we encountered a situation during upgrade the JournalNodes. Cluster Info: *Active: nn0* *Standby: nn1* 1. Rolling restart journal node. {color:#ff}(related config: fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx=1G){color} 2. The cluster runs for a while. 3. {color:#ff}Active namenode(nn0){color} shutdown because of Timed out waiting 12ms for a quorum of nodes to respond. 4. Transfer nn1 to Active state. 5. {color:#ff}New Active namenode(nn1){color} also shutdown because of Timed out waiting 12ms for a quorum of nodes to respond. 6. {color:#ff}The cluster crashed{color}. Related code: {code:java} JournaledEditsCache(Configuration conf) { capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY, DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT); if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) { Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " + "maximum JVM memory is only %d bytes. It is recommended that you " + "decrease the cache size or increase the heap size.", capacity, Runtime.getRuntime().maxMemory())); } Journal.LOG.info("Enabling the journaled edits cache with a capacity " + "of bytes: " + capacity); ReadWriteLock lock = new ReentrantReadWriteLock(true); readLock = new AutoCloseableLock(lock.readLock()); writeLock = new AutoCloseableLock(lock.writeLock()); initialize(INVALID_TXN_ID); } {code} Currently, *fs.journalNode.edit-cache-size-bytes* can be set to a larger size than the memory requested by the process. 
If {*}fs.journalNode.edit-cache-sie.bytes > 0.9 * Runtime.getruntime().maxMemory(){*}, only warn logs are printed during journalnode startup. This can easily be overlooked by users. However, as the cluster runs to a certain period of time, it is likely to cause the cluster to crash. !image-2022-04-21-09-54-57-111.png|width=1227,height=57! IMO, when {*}fs.journalNode.edit-cache-size-bytes > threshold * Runtime.getruntime ().maxMemory(){*}, we should throw an Exception and {color:#ff}fast fail{color}. Giving a clear hint for users to update related configurations. > [SBN read] Improper cache-size for journal node may cause cluster crash > --- > > Key: HDFS-16550 > URL:
[jira] [Updated] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16550: --- Description: When we introduced {*}SBN Read{*}, we encountered a situation during upgrade the JournalNodes. Cluster Info: *Active: nn0* *Standby: nn1* 1. Rolling restart journal node. {color:#ff}(related config: fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx=1G){color} 2. The cluster runs for a while. 3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed out waiting 12ms for a quorum of nodes to respond”{_}. 4. Transfer nn1 to Active state. 5. {color:#ff}New Active namenode(nn1){color} also shutdown because of Timed out waiting 12ms for a quorum of nodes to respond. 6. {color:#ff}The cluster crashed{color}. Related code: {code:java} JournaledEditsCache(Configuration conf) { capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY, DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT); if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) { Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " + "maximum JVM memory is only %d bytes. It is recommended that you " + "decrease the cache size or increase the heap size.", capacity, Runtime.getRuntime().maxMemory())); } Journal.LOG.info("Enabling the journaled edits cache with a capacity " + "of bytes: " + capacity); ReadWriteLock lock = new ReentrantReadWriteLock(true); readLock = new AutoCloseableLock(lock.readLock()); writeLock = new AutoCloseableLock(lock.writeLock()); initialize(INVALID_TXN_ID); } {code} Currently, *fs.journalNode.edit-cache-size-bytes* can be set to a larger size than the memory requested by the process. If {*}fs.journalNode.edit-cache-sie.bytes > 0.9 * Runtime.getruntime().maxMemory(){*}, only warn logs are printed during journalnode startup. This can easily be overlooked by users. However, as the cluster runs to a certain period of time, it is likely to cause the cluster to crash. 
NN log: !image-2022-04-21-09-54-57-111.png|width=1012,height=47! !image-2022-04-21-12-32-56-170.png|width=809,height=218! IMO, when {*}fs.journalNode.edit-cache-size-bytes > threshold * Runtime.getruntime ().maxMemory(){*}, we should throw an Exception and {color:#ff}fast fail{color}. Giving a clear hint for users to update related configurations. was: When we introduced {*}SBN Read{*}, we encountered a situation during upgrade the JournalNodes. Cluster Info: *Active: nn0* *Standby: nn1* 1. Rolling restart journal node. {color:#ff}(related config: fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx=1G){color} 2. The cluster runs for a while. 3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed out waiting 12ms for a quorum of nodes to respond”{_}. 4. Transfer nn1 to Active state. 5. {color:#ff}New Active namenode(nn1){color} also shutdown because of Timed out waiting 12ms for a quorum of nodes to respond. 6. {color:#ff}The cluster crashed{color}. Related code: {code:java} JournaledEditsCache(Configuration conf) { capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY, DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT); if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) { Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " + "maximum JVM memory is only %d bytes. It is recommended that you " + "decrease the cache size or increase the heap size.", capacity, Runtime.getRuntime().maxMemory())); } Journal.LOG.info("Enabling the journaled edits cache with a capacity " + "of bytes: " + capacity); ReadWriteLock lock = new ReentrantReadWriteLock(true); readLock = new AutoCloseableLock(lock.readLock()); writeLock = new AutoCloseableLock(lock.writeLock()); initialize(INVALID_TXN_ID); } {code} Currently, *fs.journalNode.edit-cache-size-bytes* can be set to a larger size than the memory requested by the process. 
If {*}fs.journalNode.edit-cache-sie.bytes > 0.9 * Runtime.getruntime().maxMemory(){*}, only warn logs are printed during journalnode startup. This can easily be overlooked by users. However, as the cluster runs to a certain period of time, it is likely to cause the cluster to crash. !image-2022-04-21-09-54-57-111.png|width=1012,height=47! !image-2022-04-21-12-32-56-170.png|width=809,height=218! IMO, when {*}fs.journalNode.edit-cache-size-bytes > threshold * Runtime.getruntime ().maxMemory(){*}, we should throw an Exception and {color:#ff}fast fail{color}. Giving a clear hint for users to update related configurations. > [SBN read] Improper cache-size for journal node may cause cluster crash > --- > > Key: HDFS-16550 > URL:
[jira] [Updated] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16550: --- Description: When we introduced {*}SBN Read{*}, we encountered a situation during upgrade the JournalNodes. Cluster Info: *Active: nn0* *Standby: nn1* 1. Rolling restart journal node. {color:#ff}(related config: fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx=1G){color} 2. The cluster runs for a while. 3. {color:#ff}Active namenode(nn0){color} shutdown because of Timed out waiting 12ms for a quorum of nodes to respond. 4. Transfer nn1 to Active state. 5. {color:#ff}New Active namenode(nn1){color} also shutdown because of Timed out waiting 12ms for a quorum of nodes to respond. 6. {color:#ff}The cluster crashed{color}. Related code: {code:java} JournaledEditsCache(Configuration conf) { capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY, DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT); if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) { Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " + "maximum JVM memory is only %d bytes. It is recommended that you " + "decrease the cache size or increase the heap size.", capacity, Runtime.getRuntime().maxMemory())); } Journal.LOG.info("Enabling the journaled edits cache with a capacity " + "of bytes: " + capacity); ReadWriteLock lock = new ReentrantReadWriteLock(true); readLock = new AutoCloseableLock(lock.readLock()); writeLock = new AutoCloseableLock(lock.writeLock()); initialize(INVALID_TXN_ID); } {code} Currently, *fs.journalNode.edit-cache-size-bytes* can be set to a larger size than the memory requested by the process. If {*}fs.journalNode.edit-cache-sie.bytes > 0.9 * Runtime.getruntime().maxMemory(){*}, only warn logs are printed during journalnode startup. This can easily be overlooked by users. However, as the cluster runs to a certain period of time, it is likely to cause the cluster to crash. 
!image-2022-04-21-09-54-57-111.png|width=1227,height=57! IMO, when {*}fs.journalNode.edit-cache-size-bytes > threshold * Runtime.getruntime ().maxMemory(){*}, we should throw an Exception and {color:#ff}fast fail{color}. Giving a clear hint for users to update related configurations. was: When we introduced SBN Read, we encountered a situation when upgrading the JournalNodes. Cluster Info: *Active: nn0* *Standby: nn1* 1. Rolling restart journal node. {color:#FF}(related config: fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx=1G){color} 2. The cluster runs for a while. 3. {color:#FF}Active namenode(nn0){color} shutdown because of Timed out waiting 12ms for a quorum of nodes to respond. 4. Transfer nn1 to Active state. 5. {color:#FF}New Active namenode(nn1){color} also shutdown because of Timed out waiting 12ms for a quorum of nodes to respond. 6. {color:#FF}The cluster crashed{color}. Related code: {code:java} JournaledEditsCache(Configuration conf) { capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY, DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT); if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) { Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " + "maximum JVM memory is only %d bytes. It is recommended that you " + "decrease the cache size or increase the heap size.", capacity, Runtime.getRuntime().maxMemory())); } Journal.LOG.info("Enabling the journaled edits cache with a capacity " + "of bytes: " + capacity); ReadWriteLock lock = new ReentrantReadWriteLock(true); readLock = new AutoCloseableLock(lock.readLock()); writeLock = new AutoCloseableLock(lock.writeLock()); initialize(INVALID_TXN_ID); } {code} Currently, *fs.journalNode.edit-cache-size-bytes* can be set to a larger size than the memory requested by the process. If {*}fs.journalNode.edit-cache-sie.bytes > 0.9 * Runtime.getruntime().maxMemory(){*}, only warn logs are printed during journalnode startup. This can easily be overlooked by users. 
However, as the cluster runs to a certain period of time, it is likely to cause the cluster to crash. !image-2022-04-21-09-54-57-111.png|width=1227,height=57! IMO, when {*}fs.journalNode.edit-cache-size-bytes > threshold * Runtime.getruntime ().maxMemory(){*}, we should throw an Exception and {color:#FF}fast fail{color}. Giving a clear hint for users to update related configurations. > [SBN read] Improper cache-size for journal node may cause cluster crash > --- > > Key: HDFS-16550 > URL: https://issues.apache.org/jira/browse/HDFS-16550 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: tomscut >
[jira] [Created] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
tomscut created HDFS-16550: -- Summary: [SBN read] Improper cache-size for journal node may cause cluster crash Key: HDFS-16550 URL: https://issues.apache.org/jira/browse/HDFS-16550 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut Attachments: image-2022-04-21-09-54-29-751.png, image-2022-04-21-09-54-57-111.png When we introduced SBN Read, we encountered a situation when upgrading the JournalNodes. Cluster Info: *Active: nn0* *Standby: nn1*
1. Rolling restart of the journal nodes. {color:#FF}(related config: dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}
2. The cluster runs for a while.
3. The {color:#FF}active namenode (nn0){color} shut down because of "Timed out waiting 12ms for a quorum of nodes to respond".
4. nn1 is transitioned to the active state.
5. The {color:#FF}new active namenode (nn1){color} also shut down because of "Timed out waiting 12ms for a quorum of nodes to respond".
6. {color:#FF}The cluster crashed{color}.
Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
      DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
    Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
        "maximum JVM memory is only %d bytes. It is recommended that you " +
        "decrease the cache size or increase the heap size.",
        capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
      "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
}
{code}
Currently, *dfs.journalnode.edit-cache-size.bytes* can be set to a larger size than the memory available to the process.
If {*}dfs.journalnode.edit-cache-size.bytes > 0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during journalnode startup. This can easily be overlooked by users. However, after the cluster has been running for some time, this is likely to cause the cluster to crash. !image-2022-04-21-09-54-57-111.png|width=1227,height=57! IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * Runtime.getRuntime().maxMemory(){*}, we should throw an exception and {color:#FF}fail fast{color}, giving users a clear hint to update the related configuration. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
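The fast-fail behavior proposed above could look roughly like the following standalone sketch. The 0.9 threshold, the exception type, and the class and method names are illustrative assumptions for discussion, not the actual HDFS patch:

```java
// Hypothetical sketch: instead of only logging a warning, refuse to start
// the JournalNode when the configured edit cache capacity exceeds a fixed
// fraction of the JVM's maximum heap.
public class EditCacheCapacityCheck {

    // Assumed fraction of max heap; mirrors the 0.9 used in the warn branch.
    static final double THRESHOLD = 0.9;

    static void validateCapacity(long capacityBytes, long maxMemoryBytes) {
        if (capacityBytes > THRESHOLD * maxMemoryBytes) {
            // Fail fast with a clear hint instead of a warning that is
            // easily overlooked until the cluster crashes later.
            throw new IllegalArgumentException(String.format(
                "Cache capacity is set at %d bytes but maximum JVM memory is "
                + "only %d bytes. Decrease the cache size or increase the heap.",
                capacityBytes, maxMemoryBytes));
        }
    }

    public static void main(String[] args) {
        // A 512 MB cache against a 1 GB heap passes the check.
        validateCapacity(512L << 20, 1L << 30);
        boolean failed = false;
        try {
            // A 1 GB cache against a 1 GB heap exceeds 0.9 * maxMemory.
            validateCapacity(1L << 30, 1L << 30);
        } catch (IllegalArgumentException e) {
            failed = true;
        }
        System.out.println(failed ? "fast-fail" : "no-fail");
    }
}
```

With this guard, the misconfiguration in the scenario above would surface at JournalNode startup rather than as quorum timeouts later.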
[jira] [Updated] (HDFS-16548) Failed unit test testRenameMoreThanOnceAcrossSnapDirs_2
[ https://issues.apache.org/jira/browse/HDFS-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16548: --- Description: It seems to be related to HDFS-16531. {code:java} [ERROR] Tests run: 44, Failures: 6, Errors: 0, Skipped: 0, Time elapsed: 143.701 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots [ERROR] testRenameMoreThanOnceAcrossSnapDirs_2(org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots) Time elapsed: 6.606 s <<< FAILURE! java.lang.AssertionError: expected:<3> but was:<1> at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:633) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots.testRenameMoreThanOnceAcrossSnapDirs_2(TestRenameWithSnapshots.java:985) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) {code} was: It seems to be related to this HDFS-16531. {code:java} [ERROR] Tests run: 44, Failures: 6, Errors: 0, Skipped: 0, Time elapsed: 143.701 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots [ERROR] testRenameMoreThanOnceAcrossSnapDirs_2(org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots) Time elapsed: 6.606 s <<< FAILURE! 
java.lang.AssertionError: expected:<3> but was:<1> at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:633) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots.testRenameMoreThanOnceAcrossSnapDirs_2(TestRenameWithSnapshots.java:985) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at
[jira] [Updated] (HDFS-16548) Failed unit test testRenameMoreThanOnceAcrossSnapDirs_2
[ https://issues.apache.org/jira/browse/HDFS-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16548: --- Description: It seems to be related to this HDFS-16531. {code:java} [ERROR] Tests run: 44, Failures: 6, Errors: 0, Skipped: 0, Time elapsed: 143.701 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots [ERROR] testRenameMoreThanOnceAcrossSnapDirs_2(org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots) Time elapsed: 6.606 s <<< FAILURE! java.lang.AssertionError: expected:<3> but was:<1> at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:633) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots.testRenameMoreThanOnceAcrossSnapDirs_2(TestRenameWithSnapshots.java:985) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) {code} was: {code:java} [ERROR] Tests run: 44, Failures: 6, Errors: 0, Skipped: 0, Time elapsed: 143.701 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots [ERROR] testRenameMoreThanOnceAcrossSnapDirs_2(org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots) Time elapsed: 6.606 s <<< FAILURE! 
java.lang.AssertionError: expected:<3> but was:<1> at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:633) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots.testRenameMoreThanOnceAcrossSnapDirs_2(TestRenameWithSnapshots.java:985) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at
[jira] [Created] (HDFS-16548) Failed unit test testRenameMoreThanOnceAcrossSnapDirs_2
tomscut created HDFS-16548: -- Summary: Failed unit test testRenameMoreThanOnceAcrossSnapDirs_2 Key: HDFS-16548 URL: https://issues.apache.org/jira/browse/HDFS-16548 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut {code:java} [ERROR] Tests run: 44, Failures: 6, Errors: 0, Skipped: 0, Time elapsed: 143.701 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots [ERROR] testRenameMoreThanOnceAcrossSnapDirs_2(org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots) Time elapsed: 6.606 s <<< FAILURE! java.lang.AssertionError: expected:<3> but was:<1> at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:633) at org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots.testRenameMoreThanOnceAcrossSnapDirs_2(TestRenameWithSnapshots.java:985) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16547) [SBN read] Namenode in safe mode should not be transfered to observer state
[ https://issues.apache.org/jira/browse/HDFS-16547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16547: --- Summary: [SBN read] Namenode in safe mode should not be transfered to observer state (was: [SBN read] Namenode in safe mode should not be transfer to observer state) > [SBN read] Namenode in safe mode should not be transfered to observer state > --- > > Key: HDFS-16547 > URL: https://issues.apache.org/jira/browse/HDFS-16547 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Currently, when a Namenode is in safe mode (while starting up, or after entering > safe mode manually), it can be transitioned to Observer by command. This > Observer node may then receive many requests and throw a SafeModeException, > which causes unnecessary failover on the client. > So a Namenode in safe mode should not be transferred to the observer state.
[jira] [Created] (HDFS-16547) [SBN read] Namenode in safe mode should not be transfer to observer state
tomscut created HDFS-16547: -- Summary: [SBN read] Namenode in safe mode should not be transfer to observer state Key: HDFS-16547 URL: https://issues.apache.org/jira/browse/HDFS-16547 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut Currently, when a Namenode is in safe mode (while starting up, or after entering safe mode manually), it can be transitioned to Observer by command. This Observer node may then receive many requests and throw a SafeModeException, which causes unnecessary failover on the client. So a Namenode in safe mode should not be transferred to the observer state. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
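The guard proposed in this issue might look like the following minimal sketch. The class and method names are hypothetical; the real change would live in the NameNode's state-transition path, not in a standalone class:

```java
// Illustrative sketch of the proposed guard: reject an admin-initiated
// transition to Observer while the NameNode is still in safe mode, so the
// node never serves reads that would throw SafeModeException and trigger
// client failover.
public class ObserverTransitionGuard {

    static String transitionToObserver(boolean inSafeMode) {
        if (inSafeMode) {
            // Fail the admin command up front instead of entering a state
            // where every read request fails.
            return "rejected: NameNode is in safe mode";
        }
        return "observer";
    }

    public static void main(String[] args) {
        System.out.println(transitionToObserver(true));
        System.out.println(transitionToObserver(false));
    }
}
```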
[jira] [Comment Edited] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521995#comment-17521995 ] tomscut edited comment on HDFS-16507 at 4/14/22 1:10 AM: - Thanks [~xkrogen] for your comments. If the situation arises that `minTxIdToKeep > curSegmentTxId`, `Preconditions.checkArgument` will fail and throw an IllegalArgumentException. This will cause ImageServlet#doPut to fail, and then cause the SNN checkpoint to fail; the SNN may retry several times until the ANN rolls the edit log itself. But the ANN avoids purging the in-progress edit log, so it will not crash. We can see the stack as follows. Please point out if my description is incorrect. Thank you. The stack of purgeLogsOlderThan: {code:java} java.lang.Thread.getStackTrace(Thread.java:1552) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185) org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623) org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388) org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620) org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512) org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177) org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249) org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617) org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516) java.security.AccessController.doPrivileged(Native Method) javax.security.auth.Subject.doAs(Subject.java:422) org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515) 
javax.servlet.http.HttpServlet.service(HttpServlet.java:710) javax.servlet.http.HttpServlet.service(HttpServlet.java:790) org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604) org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) org.eclipse.jetty.server.Server.handle(Server.java:539) org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333) org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) java.lang.Thread.run(Thread.java:745) {code} was (Author: tomscut): Thanks [~xkrogen] for your comments. If the situation arises that `minTxIdToKeep > curSegmentTxId`, `Preconditions.checkArgument` will fail and throw an IllegalArgumentException. This will cause ImageServlet#doPut to fail, and then cause the SNN
[jira] [Commented] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521995#comment-17521995 ] tomscut commented on HDFS-16507: Thanks [~xkrogen] for your comments. If the situation arises that `minTxIdToKeep > curSegmentTxId`, `Preconditions.checkArgument` will fail and throw an IllegalArgumentException. This will cause ImageServlet#doPut to fail, and then cause the SNN checkpoint to fail; the SNN may retry several times until the ANN rolls the edit log itself. But the ANN avoids purging the in-progress edit log, so it will not crash. We can see the stack as follows. Please point out if my description is incorrect. Thank you. The stack of purgeLogsOlderThan: {code:java} java.lang.Thread.getStackTrace(Thread.java:1552) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185) org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623) org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388) org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620) org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512) org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177) org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249) org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617) org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516) java.security.AccessController.doPrivileged(Native Method) javax.security.auth.Subject.doAs(Subject.java:422) org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515) 
javax.servlet.http.HttpServlet.service(HttpServlet.java:710) javax.servlet.http.HttpServlet.service(HttpServlet.java:790) org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604) org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) org.eclipse.jetty.server.Server.handle(Server.java:539) org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333) org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) java.lang.Thread.run(Thread.java:745) {code} > [SBN read] Avoid purging edit log which is in progress > -- > > Key: HDFS-16507 > URL: https://issues.apache.org/jira/browse/HDFS-16507 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0 >
[jira] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507 ] tomscut deleted comment on HDFS-16507: was (Author: tomscut): [~xkrogen] Your comment makes a lot of sense to me. IMO, there are two ways to approach this problem:
1. Throw an IllegalArgumentException, wait for the edit log segment to be finalized normally, and then FSEditLog#purgeLogsOlderThan will run automatically. However, if the SNN is down for a long time, the edit logs may take up more disk space.
2. Update `minTxIdToKeep` here, like the PR I submitted at the beginning.
{code:java}
// Reset purgeLogsFrom to avoid purging edit log which is in progress.
if (isSegmentOpen()) {
  minTxIdToKeep = minTxIdToKeep > curSegmentTxId ? curSegmentTxId : minTxIdToKeep;
}
{code}
What do you think of this? cc [~sunchao] [~vjasani] . > [SBN read] Avoid purging edit log which is in progress > -- > > Key: HDFS-16507 > URL: https://issues.apache.org/jira/browse/HDFS-16507 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: tomscut >Assignee: tomscut >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0, 3.2.4, 3.3.4 > > Time Spent: 4h 50m > Remaining Estimate: 0h > > We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL > exception. It looks like it's purging an edit log which is in progress. > According to the analysis, I suspect that the in-progress edit log to > be purged (after the SNN checkpoint) was not finalized (see HDFS-14317) before the ANN > rolled the edit log itself. 
> The stack: > {code:java} > java.lang.Thread.getStackTrace(Thread.java:1552) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) > > org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185) > > org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623) > > org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388) > > org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620) > > org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512) > org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177) > > org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249) > > org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617) > > org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516) > java.security.AccessController.doPrivileged(Native Method) > javax.security.auth.Subject.doAs(Subject.java:422) > > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > > org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515) > javax.servlet.http.HttpServlet.service(HttpServlet.java:710) > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604) > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) > > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > org.eclipse.jetty.server.Server.handle(Server.java:539) > org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333) > > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > >
[jira] [Comment Edited] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521409#comment-17521409 ] tomscut edited comment on HDFS-16507 at 4/13/22 2:06 AM: - [~xkrogen] Your comment makes a lot of sense to me. IMO, there are two ways to approach this problem:
1. Throw an IllegalArgumentException, wait for the edit log segment to be finalized normally, and then FSEditLog#purgeLogsOlderThan will run automatically. However, if the SNN is down for a long time, the edit logs may take up more disk space.
2. Update `minTxIdToKeep` here, like the PR I submitted at the beginning.
{code:java}
// Reset purgeLogsFrom to avoid purging edit log which is in progress.
if (isSegmentOpen()) {
  minTxIdToKeep = minTxIdToKeep > curSegmentTxId ? curSegmentTxId : minTxIdToKeep;
}
{code}
What do you think of this? cc [~sunchao] [~vjasani] . was (Author: tomscut): [~xkrogen] Your comment makes a lot of sense to me. IMO, there are two ways to approach this problem:
1. Throw an IllegalArgumentException, wait for the edit log segment to be finalized normally, and then FSEditLog#purgeLogsOlderThan will run automatically. However, if the SNN is down for a long time, the edit logs may take up more disk space.
2. Update `minTxIdToKeep` here, like the PR I submitted at the beginning.
{code:java}
// Reset purgeLogsFrom to avoid purging edit log which is in progress.
if (isSegmentOpen()) {
  minTxIdToKeep = minTxIdToKeep > curSegmentTxId ? curSegmentTxId : minTxIdToKeep;
}
{code}
What do you think of this? cc [~sunchao] [~vjasani] . > [SBN read] Avoid purging edit log which is in progress > -- > > Key: HDFS-16507 > URL: https://issues.apache.org/jira/browse/HDFS-16507 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: tomscut >Assignee: tomscut >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0, 3.2.4, 3.3.4 > > Time Spent: 4h 50m > Remaining Estimate: 0h > > We introduced [Standby Read] feature in branch-3.1.0, but found a FATAL
It looks like it is purging an edit log which is in progress. > According to the analysis, I suspect that the in-progress edit log to > be purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before the ANN > rolls its own edits. > The stack: > {code:java} > java.lang.Thread.getStackTrace(Thread.java:1552) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) > > org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185) > > org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623) > > org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388) > > org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620) > > org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512) > org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177) > > org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249) > > org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617) > > org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516) > java.security.AccessController.doPrivileged(Native Method) > javax.security.auth.Subject.doAs(Subject.java:422) > > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > > org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515) > javax.servlet.http.HttpServlet.service(HttpServlet.java:710) > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604) > > 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) >
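The clamp proposed in option 2 above can be sketched as a standalone helper. The class and method names below are hypothetical for illustration; this is not the actual Hadoop FSEditLog code, only the same min-clamp logic in isolation:

```java
// Sketch of the proposed guard: never purge past the start of an open
// (in-progress) segment. Hypothetical helper, not the Hadoop patch itself.
public class PurgeGuard {

    // Returns the adjusted minTxIdToKeep: while a segment is open,
    // purging must stop before curSegmentTxId, so the smaller of the
    // two values wins. Equivalent to the ternary in the PR.
    static long adjustMinTxIdToKeep(boolean segmentOpen,
                                    long minTxIdToKeep,
                                    long curSegmentTxId) {
        if (segmentOpen) {
            return Math.min(minTxIdToKeep, curSegmentTxId);
        }
        return minTxIdToKeep;
    }
}
```

With this clamp, a checkpoint that would otherwise request purging past the open segment's first transaction is silently bounded, so the `Preconditions` check in FileJournalManager can no longer fire for the in-progress segment.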
[jira] [Commented] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521409#comment-17521409 ] tomscut commented on HDFS-16507: [~xkrogen] Your comment makes a lot of sense to me. IMO, there are two ways to approach this problem: 1. Throw an IllegalArgumentException, wait for the edit log segment to be finalized normally, and then run FSEditLog#purgeLogsOlderThan automatically. However, if the SNN is down for a long time, the edit logs may take up more disk space. 2. Update `minTxIdToKeep` here, as in the PR I submitted initially. {code:java} // Reset purgeLogsFrom to avoid purging an edit log which is in progress. if (isSegmentOpen()) { minTxIdToKeep = minTxIdToKeep > curSegmentTxId ? curSegmentTxId : minTxIdToKeep; } {code} What do you think? cc [~sunchao] [~vjasani] . > [SBN read] Avoid purging edit log which is in progress > -- > > Key: HDFS-16507 > URL: https://issues.apache.org/jira/browse/HDFS-16507 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: tomscut >Assignee: tomscut >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0, 3.2.4, 3.3.4 > > Time Spent: 4h 50m > Remaining Estimate: 0h > > We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL > exception. It looks like it is purging an edit log which is in progress. > According to the analysis, I suspect that the in-progress edit log to > be purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before the ANN > rolls its own edits. 
[jira] [Comment Edited] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521403#comment-17521403 ] tomscut edited comment on HDFS-16507 at 4/13/22 1:41 AM: - Hi [~xkrogen], thanks for your comments. The process is as follows: after a checkpoint, the SNN sends the fsimage to the ANN. When the ANN receives the fsimage, it triggers FSEditLog#purgeLogsOlderThan in ImageServlet#doPut. Do you mean that, if the situation arises where `minTxIdToKeep > curSegmentTxId`, the ANN will crash because of a `Preconditions.checkArgument` failure? > [SBN read] Avoid purging edit log which is in progress > -- > > Key: HDFS-16507 > URL: https://issues.apache.org/jira/browse/HDFS-16507 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: tomscut >Assignee: tomscut >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0, 3.2.4, 3.3.4 > > Time Spent: 4h 50m > Remaining Estimate: 0h > > We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL > exception. It looks like it is purging an edit log which is in progress. > According to the analysis, I suspect that the in-progress edit log to > be purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before the ANN > rolls its own edits. 
[jira] [Commented] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521403#comment-17521403 ] tomscut commented on HDFS-16507: Hi [~xkrogen], thanks for your comments. The process is as follows: after a checkpoint, the SNN sends the fsimage to the ANN. When the ANN receives the fsimage, it triggers FSEditLog#purgeLogsOlderThan in ImageServlet#doPut. Do you mean that, if the situation arises where `minTxIdToKeep > curSegmentTxId`, the ANN will crash because of a `Preconditions.checkArgument` failure? > [SBN read] Avoid purging edit log which is in progress > -- > > Key: HDFS-16507 > URL: https://issues.apache.org/jira/browse/HDFS-16507 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: tomscut >Assignee: tomscut >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0, 3.2.4, 3.3.4 > > Time Spent: 4h 50m > Remaining Estimate: 0h > > We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL > exception. It looks like it is purging an edit log which is in progress. > According to the analysis, I suspect that the in-progress edit log to > be purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before the ANN > rolls its own edits. 
[jira] [Created] (HDFS-16527) Add global timeout rule for TestRouterDistCpProcedure
tomscut created HDFS-16527: -- Summary: Add global timeout rule for TestRouterDistCpProcedure Key: HDFS-16527 URL: https://issues.apache.org/jira/browse/HDFS-16527 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut As [Ayush Saxena|https://github.com/ayushtkn] mentioned [here|https://github.com/apache/hadoop/pull/4009#pullrequestreview-925554297], TestRouterDistCpProcedure has failed many times because of timeouts. I will add a global timeout rule for it, which makes it easy to set the timeout in one place. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
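A class-level timeout in JUnit 4 is typically declared with the `org.junit.rules.Timeout` rule. The sketch below is illustrative of what such a rule could look like, not the actual patch; the class name, timeout value, and test body are assumptions, and the snippet requires JUnit 4 on the classpath:

```java
// Illustrative sketch of a JUnit 4 global timeout rule, as could be
// applied to TestRouterDistCpProcedure. Values here are hypothetical.
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.Timeout;

public class TestRouterDistCpProcedureTimeoutExample {

    // A @Rule Timeout applies to every test method in the class, so the
    // limit can be adjusted in a single place instead of per-@Test.
    @Rule
    public Timeout globalTimeout = Timeout.seconds(180);

    @Test
    public void testProcedureCompletesWithinTimeout() {
        // test body elided
    }
}
```

Compared with per-method `@Test(timeout = ...)` annotations, a class rule keeps the limit consistent and easy to tune when a flaky suite needs more headroom.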
[jira] [Updated] (HDFS-16513) [SBN read] Observer Namenode should not trigger the edits rolling of active Namenode
[ https://issues.apache.org/jira/browse/HDFS-16513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16513: --- Summary: [SBN read] Observer Namenode should not trigger the edits rolling of active Namenode (was: [SBN read] Observer Namenode does not trigger the edits rolling of active Namenode) > [SBN read] Observer Namenode should not trigger the edits rolling of active > Namenode > > > Key: HDFS-16513 > URL: https://issues.apache.org/jira/browse/HDFS-16513 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > To avoid frequent edits rolling, we should prevent the OBN from triggering the > edits rolling of the active Namenode. > It is sufficient to retain only the triggering by the SNN and the automatic rolling by > the ANN.
[jira] [Assigned] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut reassigned HDFS-16507: -- Assignee: tomscut > [SBN read] Avoid purging edit log which is in progress > -- > > Key: HDFS-16507 > URL: https://issues.apache.org/jira/browse/HDFS-16507 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: tomscut >Assignee: tomscut >Priority: Critical > Labels: pull-request-available > Time Spent: 3h 10m > Remaining Estimate: 0h > > We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL > exception. It looks like it is purging an edit log which is in progress. > According to the analysis, I suspect that the in-progress edit log to > be purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before the ANN > rolls its own edits. 
[jira] [Updated] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16507: --- Description: We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL exception. It looks like it is purging an edit log which is in progress. According to the analysis, I suspect that the in-progress edit log to be purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before the ANN rolls its own edits. The stack: {code:java} java.lang.Thread.getStackTrace(Thread.java:1552) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185) org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623) org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388) org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620) org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512) org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177) org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249) org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617) org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516) java.security.AccessController.doPrivileged(Native Method) javax.security.auth.Subject.doAs(Subject.java:422) org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515) javax.servlet.http.HttpServlet.service(HttpServlet.java:710) javax.servlet.http.HttpServlet.service(HttpServlet.java:790) org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604) org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) org.eclipse.jetty.server.Server.handle(Server.java:539) org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333) org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) java.lang.Thread.run(Thread.java:745) {code} Here are some key logs for reference: 1. The ANN creates the edit log {{edits_InProgress_00024207987}}. {code:java} 2022-03-15 17:24:52,558 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987 2022-03-15 17:24:52,609 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987 2022-03-15 17:24:52,610 INFO namenode.FSEditLog (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987 2022-03-15 17:24:52,624 INFO namenode.FSEditLog
[jira] [Updated] (HDFS-16446) Consider ioutils of disk when choosing volume
[ https://issues.apache.org/jira/browse/HDFS-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16446: --- Description: Consider the ioutil of each disk when choosing a volume, to avoid busy disks. Document: [https://docs.google.com/document/d/1Ko1J7shz8hVLnNACT6PKVQ_leIHf_YaIFA2s3yJMZHQ/edit?usp=sharing] Principle is as follows: !https://user-images.githubusercontent.com/55134131/159827737-f4ca4d66-c2f2-4bef-901b-6d2bc7bdda9a.png|width=440,height=192! was: Consider ioutils of disk when choosing volume. Principle is as follows: !image-2022-02-05-09-50-12-241.png|width=309,height=159! > Consider ioutils of disk when choosing volume > - > > Key: HDFS-16446 > URL: https://issues.apache.org/jira/browse/HDFS-16446 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Attachments: image-2022-02-05-09-50-12-241.png > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Consider the ioutil of each disk when choosing a volume, to avoid busy disks. > Document: > [https://docs.google.com/document/d/1Ko1J7shz8hVLnNACT6PKVQ_leIHf_YaIFA2s3yJMZHQ/edit?usp=sharing] > Principle is as follows: > !https://user-images.githubusercontent.com/55134131/159827737-f4ca4d66-c2f2-4bef-901b-6d2bc7bdda9a.png|width=440,height=192! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
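The principle described above can be sketched as a small, hypothetical volume-choosing policy that skips busy disks. This is only a sketch of the idea, not the HDFS-16446 patch: the Volume class, the busy threshold, and the method names are illustrative stand-ins for the DataNode's real VolumeChoosingPolicy machinery.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of an ioutil-aware volume choosing policy.
// Volume and the busy threshold are illustrative; real ioutil values
// would come from disk statistics sampling (e.g. /proc/diskstats).
public class IoUtilAwareChooser {
    public static class Volume {
        final String path;
        final double ioUtil; // 0.0 .. 1.0, fraction of time the disk was busy
        public Volume(String path, double ioUtil) {
            this.path = path;
            this.ioUtil = ioUtil;
        }
    }

    /**
     * Choose the least busy volume, skipping volumes whose ioutil is at or
     * above the busy threshold; if all are busy, fall back to all candidates.
     */
    public static Volume choose(List<Volume> candidates, double busyThreshold) {
        List<Volume> notBusy = new ArrayList<>();
        for (Volume v : candidates) {
            if (v.ioUtil < busyThreshold) {
                notBusy.add(v);
            }
        }
        List<Volume> pool = notBusy.isEmpty() ? candidates : notBusy;
        return pool.stream()
            .min(Comparator.comparingDouble((Volume v) -> v.ioUtil))
            .orElse(null);
    }
}
```

The fallback to all candidates matters: a policy that refuses every busy disk would stall writes entirely, whereas degrading to "least busy" keeps the pipeline moving.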
[jira] [Commented] (HDFS-13671) Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet
[ https://issues.apache.org/jira/browse/HDFS-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511575#comment-17511575 ] tomscut commented on HDFS-13671: Hi [~max2049], we are still using CMS on a cluster without EC data; some parameter tuning should be able to solve this problem. How long is your FBR period? If it is 6 hours (the default) and the cluster is large, it may have an impact on GC. We set it to 3 days. On a cluster that uses this feature with EC data, we use G1GC. The main parameters (OpenJDK 1.8) are as follows: {code:java} -server -Xmx200g -Xms200g -XX:MaxDirectMemorySize=2g -XX:MaxMetaspaceSize=2g -XX:MetaspaceSize=1g -XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:InitiatingHeapOccupancyPercent=75 -XX:G1NewSizePercent=0 -XX:G1MaxNewSizePercent=3 -XX:SurvivorRatio=2 -XX:+DisableExplicitGC -XX:MaxTenuringThreshold=15 -XX:-UseBiasedLocking -XX:ParallelGCThreads=40 -XX:ConcGCThreads=20 -XX:MaxJavaStackTraceDepth=100 -XX:MaxGCPauseMillis=200 -verbose:gc -XX:+UnlockDiagnosticVMOptions -XX:+PrintGCDetails -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCCause -XX:+PrintGCDateStamps -XX:+PrintReferenceGC -XX:+PrintHeapAtGC -XX:+PrintAdaptiveSizePolicy -XX:+G1PrintHeapRegions -XX:+PrintTenuringDistribution -Xloggc:/data1/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` {code} > Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet > -- > > Key: HDFS-13671 > URL: https://issues.apache.org/jira/browse/HDFS-13671 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0, 3.0.3 >Reporter: Yiqun Lin >Assignee: Haibin Huang >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.2.3, 3.3.2 > > Attachments: HDFS-13671-001.patch, image-2021-06-10-19-28-18-373.png, > image-2021-06-10-19-28-58-359.png, image-2021-06-18-15-46-46-052.png, > image-2021-06-18-15-47-04-037.png > > Time Spent: 7h 40m > Remaining Estimate: 0h > > NameNode hung when deleting large files/blocks. 
The stack info: > {code} > "IPC Server handler 4 on 8020" #87 daemon prio=5 os_prio=0 > tid=0x7fb505b27800 nid=0x94c3 runnable [0x7fa861361000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.hdfs.util.FoldedTreeSet.compare(FoldedTreeSet.java:474) > at > org.apache.hadoop.hdfs.util.FoldedTreeSet.removeAndGet(FoldedTreeSet.java:849) > at > org.apache.hadoop.hdfs.util.FoldedTreeSet.remove(FoldedTreeSet.java:911) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.removeBlock(DatanodeStorageInfo.java:252) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:194) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:108) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlockFromMap(BlockManager.java:3813) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlock(BlockManager.java:3617) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.removeBlocks(FSNamesystem.java:4270) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:4244) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:4180) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:4164) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:871) > at > org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.delete(AuthorizationProviderProxyClientProtocol.java:311) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:625) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) > {code} > In the current 
deletion logic in NameNode, there are mainly two steps: > * Collect INodes and all blocks to be deleted, then delete INodes. > * Remove blocks chunk by chunk in a loop. > Actually the first step should be the more expensive operation and take > more time. However, we always see the NN hang during the remove-block > operation. > Looking into this: we introduced a new structure {{FoldedTreeSet}} to get > better performance when dealing with FBRs/IBRs. But compared with the earlier > implementation of the remove-block logic, {{FoldedTreeSet}} seems slower, > since it takes additional time to balance the tree
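The "remove blocks chunk by chunk in a loop" step described above can be sketched roughly as follows. This is a simplified illustration, not the NameNode's actual code: BLOCK_DELETION_INCREMENT and the lock bookkeeping are placeholders for FSNamesystem's internals, where each chunk is removed under the namesystem write lock so other RPCs can run between chunks.

```java
import java.util.List;

// Rough sketch of chunk-by-chunk block removal. The point of chunking is
// to bound how long the (placeholder) write lock is held per iteration.
public class ChunkedDeleter {
    static final int BLOCK_DELETION_INCREMENT = 1000;

    /** Returns the number of lock acquisitions needed to delete all blocks. */
    public static int deleteInChunks(List<Long> blockIds) {
        int lockRounds = 0;
        int from = 0;
        while (from < blockIds.size()) {
            int to = Math.min(from + BLOCK_DELETION_INCREMENT, blockIds.size());
            // Placeholder for: acquire write lock, remove
            // blockIds.subList(from, to) from the blocks map, release lock.
            from = to;
            lockRounds++;
        }
        return lockRounds;
    }
}
```

With chunking, per-block removal cost dominates the total time, which is why an O(log n) tree removal with rebalancing (as in FoldedTreeSet) shows up so clearly against an O(1) hash-based removal.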
[jira] [Created] (HDFS-16513) [SBN read] Observer Namenode does not trigger the edits rolling of active Namenode
tomscut created HDFS-16513: -- Summary: [SBN read] Observer Namenode does not trigger the edits rolling of active Namenode Key: HDFS-16513 URL: https://issues.apache.org/jira/browse/HDFS-16513 Project: Hadoop HDFS Issue Type: Improvement Reporter: tomscut Assignee: tomscut To avoid frequent edits rolling, we should prevent the OBN from triggering the edits rolling of the active Namenode. It is sufficient to retain only the triggering by the SNN and the auto rolling of the ANN.
[jira] [Updated] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16507: --- Summary: [SBN read] Avoid purging edit log which is in progress (was: Avoid purging edit log which is in progress) > [SBN read] Avoid purging edit log which is in progress > -- > > Key: HDFS-16507 > URL: https://issues.apache.org/jira/browse/HDFS-16507 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: tomscut >Priority: Critical > Labels: pull-request-available > Time Spent: 2h > Remaining Estimate: 0h > > We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL > exception. It looks like it is purging edit logs which are still in progress. > According to the analysis, I suspect that the in-progress edit log to > be purged (after the SNN checkpoint) was not finalized (see HDFS-14317) before the ANN > rolled its own edits. > The stack: > {code:java} > java.lang.Thread.getStackTrace(Thread.java:1552) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) > > org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185) > > org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623) > > org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388) > > org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620) > > org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512) > org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177) > > org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249) > > org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617) > > org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516) > java.security.AccessController.doPrivileged(Native Method) > 
javax.security.auth.Subject.doAs(Subject.java:422) > > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > > org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515) > javax.servlet.http.HttpServlet.service(HttpServlet.java:710) > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604) > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) > > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > org.eclipse.jetty.server.Server.handle(Server.java:539) > org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333) > > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > > 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) > > org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > >
[jira] [Commented] (HDFS-8277) Safemode enter fails when Standby NameNode is down
[ https://issues.apache.org/jira/browse/HDFS-8277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508550#comment-17508550 ] tomscut commented on HDFS-8277: --- We have the same problem; see HDFS-16505. Because DFSAdmin processes the NameNodes sequentially, if we use Standby Read, the configured order is: nn1 -> OBN1 nn2 -> OBN2 nn3 -> OBN3 nn4 -> ANN nn5 -> SNN Suppose all NameNodes are healthy when we *enter* safe mode, so all five nodes enter safe mode. Then OBN1 goes down; when we execute *leave*, we cannot exit safe mode normally. We have to run this command against each NameNode one by one: hdfs dfsadmin -fs hdfs://<host>:<port> -safemode leave . > Safemode enter fails when Standby NameNode is down > -- > > Key: HDFS-8277 > URL: https://issues.apache.org/jira/browse/HDFS-8277 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha, namenode >Affects Versions: 2.6.0 > Environment: HDP 2.2.0 >Reporter: Hari Sekhon >Assignee: Jianfei Jiang >Priority: Major > Attachments: HDFS-8277-safemode-edits.patch, HDFS-8277.patch, > HDFS-8277_1.patch, HDFS-8277_2.patch, HDFS-8277_3.patch, HDFS-8277_4.patch, > HDFS-8277_5.patch > > > HDFS fails to enter safemode when the Standby NameNode is down (eg. due to > AMBARI-10536). > {code}hdfs dfsadmin -safemode enter > safemode: Call From nn2/x.x.x.x to nn1:8020 failed on connection exception: > java.net.ConnectException: Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused{code} > This appears to be a bug in that it's not trying both NameNodes like the > standard hdfs client code does, and is instead stopping after getting a > connection refused from nn1 which is down. I verified normal hadoop fs writes > and reads via cli did work at this time, using nn2. I happened to run this > command as the hdfs user on nn2 which was the surviving Active NameNode. > After I re-bootstrapped the Standby NN to fix it the command worked as > expected again. 
[jira] [Commented] (HDFS-16508) When the nn1 fails at very beginning, admin command that waits exist safe mode fails
[ https://issues.apache.org/jira/browse/HDFS-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508488#comment-17508488 ] tomscut commented on HDFS-16508: Hi [~willtoshare], please see HDFS-15509, HDFS-8277 and HDFS-16505. It seems to be the same kind of problem. > When the nn1 fails at very beginning, admin command that waits exist safe > mode fails > > > Key: HDFS-16508 > URL: https://issues.apache.org/jira/browse/HDFS-16508 > Project: Hadoop HDFS > Issue Type: Bug > Components: tools >Affects Versions: 3.3.1 >Reporter: May >Priority: Major > > The HA is enabled, and we have two NameNodes: nn1 and nn2. > When starting the cluster, the nn1 fails at the very beginning, and nn2 > transfers to active state. The cluster can provide services normally. > However, when we tried to get safe mode or wait exit safe mode, our dfsadmin > command fails due to an IOException: cannot connect to nn1. > The *root cause* seems to be located here: > {code:java} > //DFSAdmin.class > public void setSafeMode(String[] argv, int idx) throws IOException { > … > if (isHaEnabled) { > String nsId = dfsUri.getHost(); > List<ProxyAndInfo<ClientProtocol>> proxies = > HAUtil.getProxiesForAllNameNodesInNameservice( > dfsConf, nsId, ClientProtocol.class); > for (ProxyAndInfo<ClientProtocol> proxy : proxies) { > ClientProtocol haNn = proxy.getProxy(); > //The code always queries from the first nn, i.e., nn1, and returns > with IOException when nn1 fails. > boolean inSafeMode = haNn.setSafeMode(action, false); > if (waitExitSafe) { > inSafeMode = waitExitSafeMode(haNn, inSafeMode); > } > System.out.println("Safe mode is " + (inSafeMode ? "ON" : "OFF") > + " in " + proxy.getAddress()); > } > } > … > } > {code} > Actually, I'm curious: do we need to get/wait on every namenode here when HA > is enabled?
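The gist of a possible fix for the root cause quoted above is to try every NameNode instead of aborting on the first unreachable one. This is only a hedged sketch: NamenodeProxy below is a hypothetical stand-in for the real ClientProtocol proxy, not the HDFS API, and the error handling is illustrative.

```java
import java.io.IOException;
import java.util.List;

// Illustrative sketch: iterate over all NameNode proxies and tolerate
// per-NameNode failures, rather than letting one IOException abort the
// whole dfsadmin-style command. Not the actual DFSAdmin code.
public class SafeModeAdmin {
    public interface NamenodeProxy {
        boolean setSafeMode(String action) throws IOException;
        String getAddress();
    }

    /** Returns the number of NameNodes successfully contacted. */
    public static int setSafeModeOnAll(List<NamenodeProxy> proxies, String action) {
        int reached = 0;
        for (NamenodeProxy proxy : proxies) {
            try {
                boolean inSafeMode = proxy.setSafeMode(action);
                System.out.println("Safe mode is " + (inSafeMode ? "ON" : "OFF")
                    + " in " + proxy.getAddress());
                reached++;
            } catch (IOException e) {
                // Log and continue instead of aborting on the first dead NN.
                System.err.println("Skipping unreachable NameNode "
                    + proxy.getAddress() + ": " + e.getMessage());
            }
        }
        return reached;
    }
}
```

A caller could still fail the command when `reached == 0`, so a fully dead nameservice is reported; the point is that one dead Observer should not block safemode administration on the surviving nodes.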
[jira] [Commented] (HDFS-16507) Avoid purging edit log which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508484#comment-17508484 ] tomscut commented on HDFS-16507: Hi [~xkrogen] [~ekanth] [~chaosun]. HDFS-14317 does a good job of avoiding this problem. However, if the SNN's roll-edits operation is accidentally disabled by configuration, and the ANN's automatic roll period is very long, then an edit log which is in progress may still be purged. I think we should reset *minTxIdToKeep* to strictly ensure that the in-progress edit log is not purged, and wait for the ANN to roll automatically to finalize the edit log. Please help review whether this is reasonable. Thanks a lot. > Avoid purging edit log which is in progress > --- > > Key: HDFS-16507 > URL: https://issues.apache.org/jira/browse/HDFS-16507 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: tomscut >Priority: Critical > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL > exception. It looks like it is purging edit logs which are still in progress. > According to the analysis, I suspect that the in-progress edit log to > be purged (after the SNN checkpoint) was not finalized (see HDFS-14317) before the ANN > rolled its own edits. 
> The stack: > {code:java} > java.lang.Thread.getStackTrace(Thread.java:1552) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) > > org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185) > > org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623) > > org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388) > > org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620) > > org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512) > org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177) > > org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249) > > org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617) > > org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516) > java.security.AccessController.doPrivileged(Native Method) > javax.security.auth.Subject.doAs(Subject.java:422) > > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > > org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515) > javax.servlet.http.HttpServlet.service(HttpServlet.java:710) > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604) > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) > > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > org.eclipse.jetty.server.Server.handle(Server.java:539) > org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333) > > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) > >
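The guard discussed in this issue can be sketched as follows, combined with the HDFS-16557 observation that an in-progress stream should be detected via EditLogInputStream#isInProgress rather than by comparing lastTxId against HdfsServerConstants.INVALID_TXID. Segment below is a simplified stand-in for EditLogInputStream; this is an illustration of the idea, not the actual patch.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for EditLogInputStream, showing the selection guard:
// test isInProgress() directly instead of inferring the state from lastTxId,
// since an in-progress stream's lastTxId is not necessarily INVALID_TXID.
public class PurgeSelector {
    public static class Segment {
        final long firstTxId;
        final long lastTxId;
        final boolean inProgress;
        public Segment(long firstTxId, long lastTxId, boolean inProgress) {
            this.firstTxId = firstTxId;
            this.lastTxId = lastTxId;
            this.inProgress = inProgress;
        }
        public boolean isInProgress() { return inProgress; }
    }

    /** Select only finalized segments that end before purgeLogsFrom. */
    public static List<Segment> selectPurgeable(List<Segment> segments,
                                                long purgeLogsFrom) {
        List<Segment> purgeable = new ArrayList<>();
        for (Segment s : segments) {
            if (!s.isInProgress() && s.lastTxId < purgeLogsFrom) {
                purgeable.add(s);
            }
        }
        return purgeable;
    }
}
```

Keeping the in-progress segment out of the purge set regardless of its txid range is exactly the safety property the comment above asks for: the segment only becomes purgeable after the ANN rolls and finalizes it.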
[jira] [Updated] (HDFS-16507) Avoid purging edit log which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16507: --- Description: We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL exception. It looks like it is purging edit logs which are still in progress. According to the analysis, I suspect that the in-progress edit log to be purged (after the SNN checkpoint) was not finalized (see HDFS-14317) before the ANN rolled its own edits. The stack: {code:java} java.lang.Thread.getStackTrace(Thread.java:1552) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185) org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623) org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388) org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620) org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512) org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177) org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249) org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617) org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516) java.security.AccessController.doPrivileged(Native Method) javax.security.auth.Subject.doAs(Subject.java:422) org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515) javax.servlet.http.HttpServlet.service(HttpServlet.java:710) javax.servlet.http.HttpServlet.service(HttpServlet.java:790) org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604) org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) org.eclipse.jetty.server.Server.handle(Server.java:539) org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333) org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) java.lang.Thread.run(Thread.java:745) {code} I post some key logs for your reference: 1. ANN. Create editlog, {color:#ff}edits_InProgress_00024207987{color}. {code:java} 2022-03-15 17:24:52,558 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987 2022-03-15 17:24:52,609 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987 2022-03-15 17:24:52,610 INFO namenode.FSEditLog (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987 2022-03-15 17:24:52,624 INFO namenode.FSEditLog
[jira] [Updated] (HDFS-16507) Avoid purging edit log which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16507: --- Description: We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL exception. It looks like it is purging edit logs which are still in progress. According to the analysis, I suspect that the in-progress edit log to be purged was not finalized (see ) before the ANN rolled its own edits. I post some key logs for your reference: 1. ANN. Create editlog, {color:#ff}edits_InProgress_00024207987{color}. {code:java} 2022-03-15 17:24:52,558 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987 2022-03-15 17:24:52,609 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987 2022-03-15 17:24:52,610 INFO namenode.FSEditLog (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987 2022-03-15 17:24:52,624 INFO namenode.FSEditLog (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 {code} 2. SNN. Checkpoint. The oldest image file is: fsimage_000{color:#de350b}25892513{color}. 
{color:#ff}25892513 + 1 - 1000000 = 24892514{color} {color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color} {color:#172b4d}Code: {color}NNStorageRetentionManager#purgeOldStorage {code:java} void purgeOldStorage(NameNodeFile nnf) throws IOException { FSImageTransactionalStorageInspector inspector = new FSImageTransactionalStorageInspector(EnumSet.of(nnf)); storage.inspectStorageDirs(inspector); long minImageTxId = getImageTxIdToRetain(inspector); purgeCheckpointsOlderThan(inspector, minImageTxId); {code} {color:#910091}...{color} {code:java} long minimumRequiredTxId = minImageTxId + 1; long purgeLogsFrom = Math.max(0, minimumRequiredTxId - numExtraEditsToRetain); ArrayList<EditLogInputStream> editLogs = new ArrayList<EditLogInputStream>(); purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false, false); Collections.sort(editLogs, new Comparator<EditLogInputStream>() { @Override public int compare(EditLogInputStream a, EditLogInputStream b) { return ComparisonChain.start() .compare(a.getFirstTxId(), b.getFirstTxId()) .compare(a.getLastTxId(), b.getLastTxId()) .result(); } }); // Remove from consideration any edit logs that are in fact required. while (editLogs.size() > 0 && editLogs.get(editLogs.size() - 1).getFirstTxId() >= minimumRequiredTxId) { editLogs.remove(editLogs.size() - 1); } // Next, adjust the number of transactions to retain if doing so would mean // keeping too many segments around. while (editLogs.size() > maxExtraEditsSegmentsToRetain) { purgeLogsFrom = editLogs.get(0).getLastTxId() + 1; editLogs.remove(0); } ... 
purgeableLogs.purgeLogsOlderThan(purgeLogsFrom); }{code} {code:java} 2022-03-15 17:28:02,640 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there have been 1189661 txns since the last checkpoint, which exceeds the configured threshold 2 2022-03-15 17:28:02,648 INFO namenode.FSImage (FSEditLogLoader.java:loadFSEdits(188)) - Edits file ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 seconds 2022-03-15 17:28:02,649 INFO namenode.FSImage (FSImage.java:saveNamespace(1121)) - Save namespace ... 2022-03-15 17:28:02,650 INFO namenode.FSImageFormatProtobuf (FSImageFormatProtobuf.java:save(718)) - Saving image file /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no compression 2022-03-15 17:28:03,180 INFO namenode.FSImageFormatProtobuf (FSImageFormatProtobuf.java:save(722)) - Image file /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 17885002 bytes saved in 0 seconds . 2022-03-15 17:28:03,183 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 images with txid >= 25892513 2022-03-15 17:28:03,183 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(233)) - Purging old image FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305, cpktTxId=00024794305) 2022-03-15 17:28:03,188 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 24892514 2022-03-15 17:28:03,282 INFO namenode.TransferFsImage (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 bytes. 
2022-03-15 17:28:03,536 INFO namenode.TransferFsImage (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 27082606 to
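As a sanity check of the retention arithmetic in the logs above: with dfs.namenode.num.extra.edits.retained at its HDFS default of 1,000,000, the retained-image txid 25892513 reported by "Going to retain 2 images with txid >= 25892513" yields exactly the "purgeLogsFrom: 24892514" that NNStorageRetentionManager logs. The helper below just replays that txid math from the code path quoted earlier.

```java
// Replays NNStorageRetentionManager's retention arithmetic:
// minimumRequiredTxId = minImageTxId + 1, and purgeLogsFrom is that
// minus dfs.namenode.num.extra.edits.retained, floored at 0.
public class RetentionMath {
    public static long purgeLogsFrom(long minImageTxId, long numExtraEditsToRetain) {
        long minimumRequiredTxId = minImageTxId + 1;
        return Math.max(0, minimumRequiredTxId - numExtraEditsToRetain);
    }
}
```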
[jira] [Updated] (HDFS-16507) Avoid purging edit log which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16507: --- Summary: Avoid purging edit log which is in progress (was: Purged edit logs which is in progress) > Avoid purging edit log which is in progress > --- > > Key: HDFS-16507 > URL: https://issues.apache.org/jira/browse/HDFS-16507 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: tomscut >Priority: Critical > > We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL > exception. It looks like it is purging edit logs that are still in progress. > According to the analysis, I suspect that the in-progress edit log being > purged was not finalized before the ANN rolled its own edit log. > Here are some key logs for reference: > 1. ANN. Create editlog, > {color:#ff0000}edits_InProgress_00024207987{color}. > {code:java} > 2022-03-15 17:24:52,558 INFO namenode.FSEditLog > (FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987 > 2022-03-15 17:24:52,609 INFO namenode.FSEditLog > (FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987 > 2022-03-15 17:24:52,610 INFO namenode.FSEditLog > (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987 > 2022-03-15 17:24:52,624 INFO namenode.FSEditLog > (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 > {code} > 2. SNN. Checkpoint. > The oldest image file is: fsimage_000{color:#de350b}25892513{color}. 
> {color:#ff0000}25892513 + 1 - 100 = 24892514{color} > {color:#ff0000}dfs.namenode.num.extra.edits.retained=100{color} > {color:#172b4d}Code: {color}NNStorageRetentionManager#purgeOldStorage > {code:java} > void purgeOldStorage(NameNodeFile nnf) throws IOException { > FSImageTransactionalStorageInspector inspector = > new FSImageTransactionalStorageInspector(EnumSet.of(nnf)); > storage.inspectStorageDirs(inspector); > long minImageTxId = getImageTxIdToRetain(inspector); > purgeCheckpointsOlderThan(inspector, minImageTxId); > > {code} > {color:#910091}...{color} > {code:java} > long minimumRequiredTxId = minImageTxId + 1; > long purgeLogsFrom = Math.max(0, minimumRequiredTxId - > numExtraEditsToRetain); > > ArrayList<EditLogInputStream> editLogs = new > ArrayList<EditLogInputStream>(); > purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false, false); > Collections.sort(editLogs, new Comparator<EditLogInputStream>() { > @Override > public int compare(EditLogInputStream a, EditLogInputStream b) { > return ComparisonChain.start() > .compare(a.getFirstTxId(), b.getFirstTxId()) > .compare(a.getLastTxId(), b.getLastTxId()) > .result(); > } > }); > // Remove from consideration any edit logs that are in fact required. > while (editLogs.size() > 0 && > editLogs.get(editLogs.size() - 1).getFirstTxId() >= > minimumRequiredTxId) { > editLogs.remove(editLogs.size() - 1); > } > > // Next, adjust the number of transactions to retain if doing so would mean > // keeping too many segments around. > while (editLogs.size() > maxExtraEditsSegmentsToRetain) { > purgeLogsFrom = editLogs.get(0).getLastTxId() + 1; > editLogs.remove(0); > } > ... 
> purgeableLogs.purgeLogsOlderThan(purgeLogsFrom); > }{code} > > {code:java} > 2022-03-15 17:28:02,640 INFO ha.StandbyCheckpointer > (StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there > have been 1189661 txns since the last checkpoint, which exceeds the > configured threshold 2 > 2022-03-15 17:28:02,648 INFO namenode.FSImage > (FSEditLogLoader.java:loadFSEdits(188)) - Edits file > ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], > ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 > seconds > 2022-03-15 17:28:02,649 INFO namenode.FSImage > (FSImage.java:saveNamespace(1121)) - Save namespace ... > 2022-03-15 17:28:02,650 INFO namenode.FSImageFormatProtobuf > (FSImageFormatProtobuf.java:save(718)) - Saving image file > /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no > compression > 2022-03-15 17:28:03,180 INFO namenode.FSImageFormatProtobuf > (FSImageFormatProtobuf.java:save(722)) - Image file > /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size > 17885002 bytes saved in 0 seconds . > 2022-03-15 17:28:03,183 INFO namenode.NNStorageRetentionManager > (NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain > 2 images with txid >= 25892513 > 2022-03-15 17:28:03,183 INFO namenode.NNStorageRetentionManager > (NNStorageRetentionManager.java:purgeImage(233)) - Purging old image > FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305,
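For reference, the purge threshold reported in the checkpoint logs above ("purgeLogsFrom: 24892514") follows directly from minImageTxId + 1 - dfs.namenode.num.extra.edits.retained. A minimal standalone sketch of that arithmetic; the class and method names here are hypothetical (this is not Hadoop source), and only the configured value 100 and the txids come from the report:

```java
// Illustrative sketch of the NNStorageRetentionManager purge-threshold
// arithmetic quoted above; names are made up for this example.
public class PurgeThresholdSketch {

    // dfs.namenode.num.extra.edits.retained, as configured in the report
    static final long NUM_EXTRA_EDITS_TO_RETAIN = 100L;

    /** Oldest retained image txid -> first txid from which edit logs are kept. */
    static long purgeLogsFrom(long minImageTxId) {
        // First txid not covered by the retained image.
        long minimumRequiredTxId = minImageTxId + 1;
        return Math.max(0, minimumRequiredTxId - NUM_EXTRA_EDITS_TO_RETAIN);
    }

    public static void main(String[] args) {
        // 25892513 + 1 - 100 = 24892514, matching "purgeLogsFrom: 24892514"
        System.out.println(purgeLogsFrom(25892513L));
    }
}
```

Any segment whose txids fall entirely below this threshold is handed to purgeLogsOlderThan, which is how an in-progress segment starting at 24207987 can be purged while still being written.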
[jira] [Commented] (HDFS-16505) Setting safemode should not be interrupted by abnormal nodes
[ https://issues.apache.org/jira/browse/HDFS-16505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507960#comment-17507960 ] tomscut commented on HDFS-16505: Thanks [~ayushtkn] for your comments and for reminding me. > Setting safemode should not be interrupted by abnormal nodes > > > Key: HDFS-16505 > URL: https://issues.apache.org/jira/browse/HDFS-16505 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Attachments: image-2022-03-15-09-29-36-538.png, > image-2022-03-15-09-29-44-430.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > Setting safemode should not be interrupted by abnormal nodes. > For example, we have four namenodes configured in the following order: > NS1 -> active > NS2 -> standby > NS3 -> observer > NS4 -> observer. > When the {color:#ff0000}NS1{color} process exits, setting the safemode state of > {color:#ff0000}NS2{color}, {color:#ff0000}NS3{color}, and > {color:#ff0000}NS4{color} fails. Similarly, when the > {color:#ff0000}NS2{color} process exits, only the safemode state of > {color:#ff0000}NS1{color} can be set successfully. > > When the {color:#ff0000}NS1{color} process exits: > Before the change: > !image-2022-03-15-09-29-36-538.png|width=1145,height=97! > After the change: > !image-2022-03-15-09-29-44-430.png|width=1104,height=119! > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
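The behavior HDFS-16505 asks for is that fanning a safemode request out to all configured namenodes should record per-node failures and continue, rather than aborting on the first unreachable node. A hedged sketch of that pattern in generic Java (this is not the actual DFSAdmin code; the targets and the failure simulation are stand-ins):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class SafemodeFanoutSketch {

    /** Applies op to every target, collecting failures instead of aborting. */
    static <T> List<T> applyToAll(List<T> targets, Function<T, Boolean> op) {
        List<T> failed = new ArrayList<>();
        for (T target : targets) {
            try {
                if (!op.apply(target)) {
                    failed.add(target);
                }
            } catch (RuntimeException e) {
                // e.g. connection refused: record the node and keep going
                failed.add(target);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> namenodes = List.of("NS1", "NS2", "NS3", "NS4");
        // Simulate NS1's process having exited: its call throws.
        List<String> failed = applyToAll(namenodes, nn -> {
            if (nn.equals("NS1")) {
                throw new RuntimeException("connection refused");
            }
            return true;
        });
        System.out.println(failed); // prints [NS1]; NS2-NS4 were still attempted
    }
}
```

With this shape, one dead namenode no longer prevents the safemode state from being set on the remaining ones, matching the before/after screenshots referenced in the issue.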
[jira] [Commented] (HDFS-16507) Purged edit logs which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507458#comment-17507458 ] tomscut commented on HDFS-16507: Seems to be related to this issue [HDFS-14317|https://issues.apache.org/jira/browse/HDFS-14317]. > Purged edit logs which is in progress > - > > Key: HDFS-16507 > URL: https://issues.apache.org/jira/browse/HDFS-16507 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: tomscut >Priority: Critical
[jira] [Updated] (HDFS-16507) Purged edit logs which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16507: --- Description: We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL exception. It looks like it is purging edit logs that are still in progress. According to the analysis, I suspect that the in-progress edit log being purged was not finalized before the ANN rolled its own edit log.
[jira] [Commented] (HDFS-16507) Purged edit logs which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507307#comment-17507307 ] tomscut commented on HDFS-16507: Hi [~weichiu] [~chaosun] [~xkrogen] [~Symious] , could you please take a look? I wonder if we missed something. Thank you very much. > Purged edit logs which is in progress > - > > Key: HDFS-16507 > URL: https://issues.apache.org/jira/browse/HDFS-16507 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: tomscut >Priority: Critical
[jira] [Updated] (HDFS-16507) Purged edit logs which is in process
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16507: --- Description: We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL exception. It looks like it is purging edit logs that are still in progress. According to the analysis, I suspect that the in-progress edit log being purged was not finalized before the ANN rolled its own edit log.
[jira] [Updated] (HDFS-16507) Purged edit logs which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16507: --- Summary: Purged edit logs which is in progress (was: Purged edit logs which is in process) > Purged edit logs which is in progress > - > > Key: HDFS-16507 > URL: https://issues.apache.org/jira/browse/HDFS-16507 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: tomscut >Priority: Critical
[jira] [Updated] (HDFS-16507) Purged edit logs which is in process
[jira] [Updated] (HDFS-16507) Purged edit logs which is in process
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

tomscut updated HDFS-16507:
---
Description:

We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL exception: edit logs that are still in progress are being purged. Based on my analysis, I suspect that the edit log being purged was not finalized before the ANN rolled its own edit log. I post some key logs for your reference:

1. ANN. Create edit log {{edits_inprogress_00024207987}}.
{code:java}
2022-03-15 17:24:52,558 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
2022-03-15 17:24:52,609 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
2022-03-15 17:24:52,610 INFO namenode.FSEditLog (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
2022-03-15 17:24:52,624 INFO namenode.FSEditLog (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987
{code}

2. SNN. Checkpoint. The oldest retained image file is {{fsimage_00025892513}}.
{{25892513 + 1 - 100 = 24892514}}
{{dfs.namenode.num.extra.edits.retained=100}}

Code: NNStorageRetentionManager#purgeOldStorage (the two LOG.info calls near the end are debug lines we added locally):
{code:java}
void purgeOldStorage(NameNodeFile nnf) throws IOException {
  FSImageTransactionalStorageInspector inspector =
      new FSImageTransactionalStorageInspector(EnumSet.of(nnf));
  storage.inspectStorageDirs(inspector);

  long minImageTxId = getImageTxIdToRetain(inspector);
  purgeCheckpointsOlderThan(inspector, minImageTxId);

  if (nnf == NameNodeFile.IMAGE_ROLLBACK) {
    // do not purge edits for IMAGE_ROLLBACK.
    return;
  }

  long minimumRequiredTxId = minImageTxId + 1;
  long purgeLogsFrom = Math.max(0, minimumRequiredTxId - numExtraEditsToRetain);

  ArrayList<EditLogInputStream> editLogs = new ArrayList<EditLogInputStream>();
  purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false, false);
  Collections.sort(editLogs, new Comparator<EditLogInputStream>() {
    @Override
    public int compare(EditLogInputStream a, EditLogInputStream b) {
      return ComparisonChain.start()
          .compare(a.getFirstTxId(), b.getFirstTxId())
          .compare(a.getLastTxId(), b.getLastTxId())
          .result();
    }
  });

  // Remove from consideration any edit logs that are in fact required.
  while (editLogs.size() > 0 &&
      editLogs.get(editLogs.size() - 1).getFirstTxId() >= minimumRequiredTxId) {
    editLogs.remove(editLogs.size() - 1);
  }

  // Next, adjust the number of transactions to retain if doing so would mean
  // keeping too many segments around.
  while (editLogs.size() > maxExtraEditsSegmentsToRetain) {
    purgeLogsFrom = editLogs.get(0).getLastTxId() + 1;
    editLogs.remove(0);
  }

  // Finally, ensure that we're not trying to purge any transactions that we
  // actually need.
  if (purgeLogsFrom > minimumRequiredTxId) {
    throw new AssertionError("Should not purge more edits than required to "
        + "restore: " + purgeLogsFrom + " should be <= " + minimumRequiredTxId);
  }

  LOG.info("purgeLogsFrom: " + purgeLogsFrom);
  for (EditLogInputStream editLog : editLogs) {
    if (editLog.isInProgress()) {
      LOG.info("editLog isInProgress, start txid:" + editLog.getFirstTxId()
          + " , last txid:" + editLog.getLastTxId());
    }
  }

  purgeableLogs.purgeLogsOlderThan(purgeLogsFrom);
}
{code}

{code:java}
2022-03-15 17:28:02,640 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there have been 1189661 txns since the last checkpoint, which exceeds the configured threshold 2
2022-03-15 17:28:02,648 INFO namenode.FSImage (FSEditLogLoader.java:loadFSEdits(188)) - Edits file ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 seconds
2022-03-15 17:28:02,649 INFO namenode.FSImage (FSImage.java:saveNamespace(1121)) - Save namespace ...
2022-03-15 17:28:02,650 INFO namenode.FSImageFormatProtobuf (FSImageFormatProtobuf.java:save(718)) - Saving image file /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no compression
2022-03-15 17:28:03,180 INFO namenode.FSImageFormatProtobuf (FSImageFormatProtobuf.java:save(722)) - Image file /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 17885002 bytes saved in 0 seconds .
2022-03-15 17:28:03,183 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 images with txid >= 25892513
2022-03-15 17:28:03,183 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(233)) - Purging old image FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305, cpktTxId=00024794305)
2022-03-15 17:28:03,188 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 24892514
2022-03-15 17:28:03,282 INFO namenode.TransferFsImage (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 bytes.
2022-03-15 17:28:03,536 INFO namenode.TransferFsImage (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 27082606 to namenode at http://sg-test-ambari-nn1.bigdata.bigo.inner:50070 in 0.343 seconds
2022-03-15 17:28:03,640 INFO namenode.TransferFsImage (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 bytes.
2022-03-15 17:28:03,684 INFO namenode.TransferFsImage (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 27082606 to namenode at http://sg-test-ambari-dn1.bigdata.bigo.inner:50070 in 0.148 seconds
2022-03-15 17:28:03,748 INFO namenode.TransferFsImage (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 bytes.
2022-03-15 17:28:03,798 INFO namenode.TransferFsImage (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 27082606 to namenode at http://sg-test-ambari-dn2.bigdata.bigo.inner:50070 in 0.113 seconds
2022-03-15 17:28:03,798 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:doWork(482)) - Checkpoint finished successfully.
{code}

3. ANN. Purge edit logs.
{{25892513 + 1 - 100 = 24892514}}
{{dfs.namenode.num.extra.edits.retained=100}}
{code:java}
2022-03-15 17:28:03,515 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 images with txid >= 25892513
{code}
{code:java}
2022-03-15 17:28:03,523 INFO namenode.NNStorageRetentionManager
{code}
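As a quick illustration of the retention arithmetic in purgeOldStorage above, here is a minimal standalone sketch. It is not the Hadoop class itself; the class name and the txid values are made up for the example.

```java
// Minimal sketch of the purge-point calculation in
// NNStorageRetentionManager#purgeOldStorage (illustrative, not Hadoop source).
public class PurgePointSketch {

    // Transactions strictly below the returned txid are eligible for purging.
    static long purgeLogsFrom(long minImageTxId, long numExtraEditsToRetain) {
        // The oldest retained image needs every transaction after its own txid.
        long minimumRequiredTxId = minImageTxId + 1;
        // Keep numExtraEditsToRetain transactions before that, clamped at 0.
        return Math.max(0, minimumRequiredTxId - numExtraEditsToRetain);
    }

    public static void main(String[] args) {
        // Oldest retained image at txid 1000, keep 100 extra edits:
        // everything below txid 901 may be purged.
        System.out.println(purgeLogsFrom(1000L, 100L)); // prints 901

        // Early in a namespace's life, the clamp keeps the purge point at 0.
        System.out.println(purgeLogsFrom(50L, 100L));   // prints 0
    }
}
```

Note the calculation only looks at txid ranges; nothing at this point in the method distinguishes a finalized segment from an in-progress one.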
[jira] [Updated] (HDFS-16507) Purged edit logs which is in process
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16507: --- Environment: (was: {code:java} // code placeholder {code}) > Purged edit logs which is in process > > > Key: HDFS-16507 > URL: https://issues.apache.org/jira/browse/HDFS-16507 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: tomscut >Priority: Critical > > We introduced Standby read functionality in branch-3.1.0, but found a FATAL > exception. It looks like it's purging edit logs which is in process. > According to the analysis, I suspect that the Editlog to be purged does not > finalize normally . > I post some key logs for your reference: > 1. ANN. Create editlog, > {color:#FF}edits_InProgresS_00024207987{color}. > > {code:java} > 2022-03-15 17:24:52,558 INFO namenode.FSEditLog > (FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987 > 2022-03-15 17:24:52,609 INFO namenode.FSEditLog > (FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987 > 2022-03-15 17:24:52,610 INFO namenode.FSEditLog > (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987 > 2022-03-15 17:24:52,624 INFO namenode.FSEditLog > (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 > {code} > 2. SNN. Checkpoint. 
> > {color:#FF}25892513 + 1 - 100 = 24892514{color} > {color:#FF}dfs.namenode.num.extra.edits.retained=100{color} > > {code:java} > 2022-03-15 17:28:02,640 INFO ha.StandbyCheckpointer > (StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there > have been 1189661 txns since the last checkpoint, which exceeds the > configured threshold 2 > 2022-03-15 17:28:02,648 INFO namenode.FSImage > (FSEditLogLoader.java:loadFSEdits(188)) - Edits file > ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], > ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 > seconds > 2022-03-15 17:28:02,649 INFO namenode.FSImage > (FSImage.java:saveNamespace(1121)) - Save namespace ... > 2022-03-15 17:28:02,650 INFO namenode.FSImageFormatProtobuf > (FSImageFormatProtobuf.java:save(718)) - Saving image file > /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no > compression > 2022-03-15 17:28:03,180 INFO namenode.FSImageFormatProtobuf > (FSImageFormatProtobuf.java:save(722)) - Image file > /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size > 17885002 bytes saved in 0 seconds . > 2022-03-15 17:28:03,183 INFO namenode.NNStorageRetentionManager > (NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain > 2 images with txid >= 25892513 > 2022-03-15 17:28:03,183 INFO namenode.NNStorageRetentionManager > (NNStorageRetentionManager.java:purgeImage(233)) - Purging old image > FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305, > cpktTxId=00024794305) > 2022-03-15 17:28:03,188 INFO namenode.NNStorageRetentionManager > (NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: > 24892514 > 2022-03-15 17:28:03,282 INFO namenode.TransferFsImage > (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: > /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: > 17885002. Sent total: 17885002 bytes. 
Size of last segment intended to send: > -1 bytes. > 2022-03-15 17:28:03,536 INFO namenode.TransferFsImage > (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid > 27082606 to namenode at http://sg-test-ambari-nn1.bigdata.bigo.inner:50070 in > 0.343 seconds > 2022-03-15 17:28:03,640 INFO namenode.TransferFsImage > (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: > /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: > 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: > -1 bytes. > 2022-03-15 17:28:03,684 INFO namenode.TransferFsImage > (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid > 27082606 to namenode at http://sg-test-ambari-dn1.bigdata.bigo.inner:50070 in > 0.148 seconds > 2022-03-15 17:28:03,748 INFO namenode.TransferFsImage > (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: > /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: > 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: > -1 bytes. > 2022-03-15 17:28:03,798 INFO namenode.TransferFsImage > (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid > 27082606 to namenode at http://sg-test-ambari-dn2.bigdata.bigo.inner:50070 in > 0.113 seconds > 2022-03-15 17:28:03,798 INFO ha.StandbyCheckpointer >
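The retention arithmetic the reporter highlights above can be sketched as follows. This is a minimal, illustrative stand-in (the real logic lives in NNStorageRetentionManager; the class and method names here are hypothetical), showing both how the oldest retained edit txid is derived from the checkpoint image and dfs.namenode.num.extra.edits.retained, and the guard this issue argues for: a segment that is still in progress should never be purged.

```java
// Illustrative sketch only: simplified stand-in for the retention logic
// described in this issue. Names are hypothetical, not the real HDFS API.
public class EditLogRetentionSketch {

    // Oldest edit txid to keep, derived from the oldest retained image txid
    // minus the configured number of extra edits to retain.
    static long minTxIdToKeep(long minImageTxId, long numExtraEditsRetained) {
        return minImageTxId + 1 - numExtraEditsRetained;
    }

    // Guard suggested by this issue: an in-progress segment is never purged,
    // regardless of how old its starting txid is.
    static boolean shouldPurge(long segmentStartTxId, boolean inProgress,
                               long minTxIdToKeep) {
        return !inProgress && segmentStartTxId < minTxIdToKeep;
    }

    public static void main(String[] args) {
        long keep = minTxIdToKeep(25892513L, 100L);
        System.out.println("minTxIdToKeep = " + keep);
        // The in-progress segment from step 1 above must survive the purge.
        System.out.println("purge in-progress segment? "
            + shouldPurge(24207987L, true, keep));
    }
}
```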
[jira] [Created] (HDFS-16507) Purged edit logs which is in process
tomscut created HDFS-16507: -- Summary: Purged edit logs which is in process Key: HDFS-16507 URL: https://issues.apache.org/jira/browse/HDFS-16507 Project: Hadoop HDFS Issue Type: Bug Environment: {code:java} // code placeholder {code} Reporter: tomscut We introduced Standby read functionality in branch-3.1.0, but found a FATAL exception. It looks like it is purging an edit log that is still in progress. Based on the analysis, I suspect that the edit log being purged was not finalized normally. I have posted some key logs for reference: 1. ANN. Create editlog, {color:#FF}edits_InProgresS_00024207987{color}. {code:java} 2022-03-15 17:24:52,558 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987 2022-03-15 17:24:52,609 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987 2022-03-15 17:24:52,610 INFO namenode.FSEditLog (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987 2022-03-15 17:24:52,624 INFO namenode.FSEditLog (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 {code} 2. SNN. Checkpoint. {color:#FF}25892513 + 1 - 100 = 24892514{color} {color:#FF}dfs.namenode.num.extra.edits.retained=100{color} {code:java} 2022-03-15 17:28:02,640 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there have been 1189661 txns since the last checkpoint, which exceeds the configured threshold 2 2022-03-15 17:28:02,648 INFO namenode.FSImage (FSEditLogLoader.java:loadFSEdits(188)) - Edits file ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 seconds 2022-03-15 17:28:02,649 INFO namenode.FSImage (FSImage.java:saveNamespace(1121)) - Save namespace ... 
2022-03-15 17:28:02,650 INFO namenode.FSImageFormatProtobuf (FSImageFormatProtobuf.java:save(718)) - Saving image file /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no compression 2022-03-15 17:28:03,180 INFO namenode.FSImageFormatProtobuf (FSImageFormatProtobuf.java:save(722)) - Image file /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 17885002 bytes saved in 0 seconds . 2022-03-15 17:28:03,183 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 images with txid >= 25892513 2022-03-15 17:28:03,183 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(233)) - Purging old image FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305, cpktTxId=00024794305) 2022-03-15 17:28:03,188 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 24892514 2022-03-15 17:28:03,282 INFO namenode.TransferFsImage (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 bytes. 2022-03-15 17:28:03,536 INFO namenode.TransferFsImage (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 27082606 to namenode at http://sg-test-ambari-nn1.bigdata.bigo.inner:50070 in 0.343 seconds 2022-03-15 17:28:03,640 INFO namenode.TransferFsImage (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 bytes. 
2022-03-15 17:28:03,684 INFO namenode.TransferFsImage (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 27082606 to namenode at http://sg-test-ambari-dn1.bigdata.bigo.inner:50070 in 0.148 seconds 2022-03-15 17:28:03,748 INFO namenode.TransferFsImage (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 bytes. 2022-03-15 17:28:03,798 INFO namenode.TransferFsImage (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 27082606 to namenode at http://sg-test-ambari-dn2.bigdata.bigo.inner:50070 in 0.113 seconds 2022-03-15 17:28:03,798 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:doWork(482)) - Checkpoint finished successfully. {code} 3. ANN. Purge edit logs. {color:#FF}25892513 + 1 - 100 = 24892514{color} {color:#FF}dfs.namenode.num.extra.edits.retained=100{color} {code:java} 2022-03-15 17:28:03,515 INFO namenode.NNStorageRetentionManager
[jira] [Created] (HDFS-16506) Unit tests failed because of OutOfMemoryError
tomscut created HDFS-16506: -- Summary: Unit tests failed because of OutOfMemoryError Key: HDFS-16506 URL: https://issues.apache.org/jira/browse/HDFS-16506 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Unit tests failed because of OutOfMemoryError. An example: [[OutOfMemoryError|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4009/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt].|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4009/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt] {code:java} [ERROR] Tests run: 32, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 95.727 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.blockmanagement.TestBlockInfoStriped [ERROR] testGetBlockInfo[4: ErasureCodingPolicy=[Name=RS-10-4-1024k, Schema=[ECSchema=[Codec=rs, numDataUnits=10, numParityUnits=4]], CellSize=1048576, Id=5]](org.apache.hadoop.hdfs.server.blockmanagement.TestBlockInfoStriped) Time elapsed: 15.831 s <<< ERROR! java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:717) at io.netty.util.concurrent.ThreadPerTaskExecutor.execute(ThreadPerTaskExecutor.java:32) at io.netty.util.internal.ThreadExecutorMap$1.execute(ThreadExecutorMap.java:57) at io.netty.util.concurrent.SingleThreadEventExecutor.doStartThread(SingleThreadEventExecutor.java:975) at io.netty.util.concurrent.SingleThreadEventExecutor.ensureThreadStarted(SingleThreadEventExecutor.java:958) at io.netty.util.concurrent.SingleThreadEventExecutor.shutdownGracefully(SingleThreadEventExecutor.java:660) at io.netty.util.concurrent.MultithreadEventExecutorGroup.shutdownGracefully(MultithreadEventExecutorGroup.java:163) at io.netty.util.concurrent.AbstractEventExecutorGroup.shutdownGracefully(AbstractEventExecutorGroup.java:70) at org.apache.hadoop.hdfs.server.datanode.web.DatanodeHttpServer.close(DatanodeHttpServer.java:346) at 
org.apache.hadoop.hdfs.server.datanode.DataNode.shutdown(DataNode.java:2348) at org.apache.hadoop.hdfs.MiniDFSCluster.shutdownDataNode(MiniDFSCluster.java:2166) at org.apache.hadoop.hdfs.MiniDFSCluster.shutdownDataNodes(MiniDFSCluster.java:2156) at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:2135) at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:2109) at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:2102) at org.apache.hadoop.hdfs.MiniDFSCluster.close(MiniDFSCluster.java:3479) at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockInfoStriped.testGetBlockInfo(TestBlockInfoStriped.java:257) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16505) Setting safemode should not be interrupted by abnormal nodes
tomscut created HDFS-16505: -- Summary: Setting safemode should not be interrupted by abnormal nodes Key: HDFS-16505 URL: https://issues.apache.org/jira/browse/HDFS-16505 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut Attachments: image-2022-03-15-09-29-36-538.png, image-2022-03-15-09-29-44-430.png Setting safemode should not be interrupted by abnormal nodes. For example, we have four namenodes configured in the following order: NS1 -> active, NS2 -> standby, NS3 -> observer, NS4 -> observer. When the {color:#FF}NS1{color} process exits, setting the safemode state of {color:#FF}NS2{color}, {color:#FF}NS3{color}, and {color:#FF}NS4{color} fails. Similarly, when the {color:#FF}NS2{color} process exits, only the safemode state of {color:#FF}NS1{color} can be set successfully. When the {color:#FF}NS1{color} process exits: Before the change: !image-2022-03-15-09-29-36-538.png|width=1145,height=97! After the change: !image-2022-03-15-09-29-44-430.png|width=1104,height=119!
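The behavior proposed above can be sketched as a per-namenode loop that records each failure and continues, instead of aborting on the first unreachable node. This is purely illustrative (the real change would be in the dfsadmin safemode path; the names below are hypothetical):

```java
// Illustrative sketch, not the actual DFSAdmin implementation: apply the
// safemode operation to every configured namenode, recording per-node
// success instead of letting one abnormal node interrupt the rest.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

public class SafemodeAllNodesSketch {

    // setSafemode is a stand-in for the RPC that sets safemode on one node.
    static Map<String, Boolean> setSafemodeOnAll(Iterable<String> namenodes,
                                                 Consumer<String> setSafemode) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (String nn : namenodes) {
            try {
                setSafemode.accept(nn);
                results.put(nn, true);
            } catch (RuntimeException e) {
                // Abnormal node: record the failure and keep going.
                results.put(nn, false);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Simulate NS1's process having exited: NS2..NS4 still get set.
        Map<String, Boolean> r = setSafemodeOnAll(
            java.util.Arrays.asList("NS1", "NS2", "NS3", "NS4"),
            nn -> { if (nn.equals("NS1")) throw new RuntimeException("refused"); });
        System.out.println(r);
    }
}
```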
[jira] [Updated] (HDFS-16503) Should verify whether the path name is valid in the WebHDFS
[ https://issues.apache.org/jira/browse/HDFS-16503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16503: --- Description: When creating a file using WebHDFS, there are two main steps: 1. Obtain the location of the Datanode to be written. 2. Put the file to this location. Currently *NameNodeRpcServer* verifies that pathName is valid, but *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods* do not. So if we use an invalid path(such as duplicated slash), the first step returns success, but the second step throws an {*}InvalidPathException{*}. IMO, we should also do the validation in WebHdfs, which is consistent with the NameNodeRpcServer. !image-2022-03-14-09-35-49-860.png|width=548,height=164! The same webHDFS operations are: CREATE, APPEND, OPEN, GETFILECHECKSUM. So we can add DFSUtil.isValidName to redirectURI for *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods.* was: When creating a file using WebHDFS, there are two main steps: 1. Obtain the location of the Datanode to be written. 2. Put the file to this location. Currently *NameNodeRpcServer* verifies that pathName is valid, but *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods* do not. So if we use an invalid path, the first step returns success, but the second step throws an {*}InvalidPathException{*}. IMO, we should also do the validation in WebHdfs, which is consistent with the NameNodeRpcServer. !image-2022-03-14-09-35-49-860.png|width=548,height=164! The same webHDFS operations are: CREATE, APPEND, OPEN, GETFILECHECKSUM. 
So we can add DFSUtil.isValidName to redirectURI for *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods.* > Should verify whether the path name is valid in the WebHDFS > --- > > Key: HDFS-16503 > URL: https://issues.apache.org/jira/browse/HDFS-16503 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Attachments: image-2022-03-14-09-35-49-860.png > > Time Spent: 10m > Remaining Estimate: 0h > > When creating a file using WebHDFS, there are two main steps: > 1. Obtain the location of the Datanode to be written. > 2. Put the file to this location. > Currently *NameNodeRpcServer* verifies that pathName is valid, but > *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods* do not. > So if we use an invalid path(such as duplicated slash), the first step > returns success, but the second step throws an {*}InvalidPathException{*}. > IMO, we should also do the validation in WebHdfs, which is consistent with > the NameNodeRpcServer. > !image-2022-03-14-09-35-49-860.png|width=548,height=164! > The same webHDFS operations are: CREATE, APPEND, OPEN, GETFILECHECKSUM. So we > can add DFSUtil.isValidName to redirectURI for *NamenodeWebHdfsMethods* and > *RouterWebHdfsMethods.* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
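The validation proposed above can be sketched as a simple component check performed before generating the redirect URI. The method below is a simplified, illustrative stand-in for DFSUtil.isValidName, not the full HDFS implementation; it shows how a duplicated slash is rejected up front instead of surfacing as an InvalidPathException on the second step:

```java
// Illustrative sketch (modeled loosely on DFSUtil.isValidName; simplified):
// reject paths with empty, ".", or ".." components before redirecting.
public class WebHdfsPathCheckSketch {

    static boolean isValidName(String src) {
        if (src == null || !src.startsWith("/")) {
            return false;  // must be an absolute path
        }
        String[] components = src.split("/");
        // index 0 is the empty string before the leading slash
        for (int i = 1; i < components.length; i++) {
            String c = components[i];
            // a duplicated slash shows up as an empty component
            if (c.isEmpty() || c.equals(".") || c.equals("..")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidName("/user/hadoop/file"));
        System.out.println(isValidName("/user//hadoop/file"));  // duplicated slash
    }
}
```

With a check like this in the redirect path, CREATE, APPEND, OPEN, and GETFILECHECKSUM would all fail fast on the first step, consistent with NameNodeRpcServer.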
[jira] [Updated] (HDFS-16503) Should verify whether the path name is valid in the WebHDFS
[ https://issues.apache.org/jira/browse/HDFS-16503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16503: --- Description: When creating a file using WebHDFS, there are two main steps: 1. Obtain the location of the Datanode to be written. 2. Put the file to this location. Currently *NameNodeRpcServer* verifies that pathName is valid, but *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods* do not. So if we use an invalid path, the first step returns success, but the second step throws an {*}InvalidPathException{*}. IMO, we should also do the validation in WebHdfs, which is consistent with the NameNodeRpcServer. !image-2022-03-14-09-35-49-860.png|width=548,height=164! The same webHDFS operations are: CREATE, APPEND, OPEN, GETFILECHECKSUM. So we can add DFSUtil.isValidName to redirectURI for *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods.* was: When creating a file using WebHDFS, there are two main steps: 1. Obtain the location of the Datanode to be written. 2. Put the file to this location. Currently *NameNodeRpcServer* verifies that pathName is valid, but *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods* do not. So if we use an invalid path, the first step returns success, but the second step throws an {*}InvalidPathException{*}. We should also do the validation in WebHdfs, which is consistent with the NameNodeRpcServer. !image-2022-03-14-09-35-49-860.png|width=548,height=164! The same webHDFS operations are: CREATE, APPEND, OPEN, GETFILECHECKSUM. So we can add DFSUtil.isValidName to redirectURI for *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods.* > Should verify whether the path name is valid in the WebHDFS > --- > > Key: HDFS-16503 > URL: https://issues.apache.org/jira/browse/HDFS-16503 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: tomscut >Assignee: tomscut >Priority: Major > Attachments: image-2022-03-14-09-35-49-860.png > > > When creating a file using WebHDFS, there are two main steps: > 1. 
Obtain the location of the Datanode to be written. > 2. Put the file to this location. > Currently *NameNodeRpcServer* verifies that pathName is valid, but > *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods* do not. > So if we use an invalid path, the first step returns success, but the second > step throws an {*}InvalidPathException{*}. IMO, we should also do the > validation in WebHdfs, which is consistent with the NameNodeRpcServer. > !image-2022-03-14-09-35-49-860.png|width=548,height=164! > The same webHDFS operations are: CREATE, APPEND, OPEN, GETFILECHECKSUM. So we > can add DFSUtil.isValidName to redirectURI for *NamenodeWebHdfsMethods* and > *RouterWebHdfsMethods.* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16503) Should verify whether the path name is valid in the WebHDFS
tomscut created HDFS-16503: -- Summary: Should verify whether the path name is valid in the WebHDFS Key: HDFS-16503 URL: https://issues.apache.org/jira/browse/HDFS-16503 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut Attachments: image-2022-03-14-09-35-49-860.png When creating a file using WebHDFS, there are two main steps: 1. Obtain the location of the Datanode to be written. 2. Put the file to this location. Currently *NameNodeRpcServer* verifies that pathName is valid, but *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods* do not. So if we use an invalid path, the first step returns success, but the second step throws an {*}InvalidPathException{*}. We should also do the validation in WebHdfs, which is consistent with the NameNodeRpcServer. !image-2022-03-14-09-35-49-860.png|width=548,height=164! The same webHDFS operations are: CREATE, APPEND, OPEN, GETFILECHECKSUM. So we can add DFSUtil.isValidName to redirectURI for *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods.* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] (HDFS-14271) [SBN read] StandbyException is logged if Observer is the first NameNode
[ https://issues.apache.org/jira/browse/HDFS-14271 ] tomscut deleted comment on HDFS-14271: was (Author: tomscut): Perhaps the client needs to make a cache, such as a map, to record the state of each Namenode. Send the request to the corresponding state of the Namenode every time. When ObserverRetryOnActiveException or StandbyException occur, updating cache state of the corresponding Namenode. > [SBN read] StandbyException is logged if Observer is the first NameNode > --- > > Key: HDFS-14271 > URL: https://issues.apache.org/jira/browse/HDFS-14271 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.0 >Reporter: Wei-Chiu Chuang >Assignee: Shen Yinjie >Priority: Minor > Labels: multi-sbnn > Attachments: HDFS-14271_1.patch, image-2022-03-11-14-54-49-806.png > > > If I transition the first NameNode into Observer state, and then I create a > file from command line, it prints the following StandbyException log message, > as if the command failed. But it actually completed successfully: > {noformat} > [root@weichiu-sbsr-1 ~]# hdfs dfs -touchz /tmp/abf > 19/02/12 16:35:17 INFO retry.RetryInvocationHandler: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category WRITE is not supported in state observer. 
Visit > https://s.apache.org/sbnn-error > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1987) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1424) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:762) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:458) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:918) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:853) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2782) > , while invoking $Proxy4.create over > [weichiu-sbsr-1.gce.cloudera.com/172.31.121.145:8020,weichiu-sbsr-2.gce.cloudera.com/172.31.121.140:8020]. > Trying to failover immediately. > {noformat} > This is unlike the case when the first NameNode is the Standby, where this > StandbyException is suppressed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14271) [SBN read] StandbyException is logged if Observer is the first NameNode
[ https://issues.apache.org/jira/browse/HDFS-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504755#comment-17504755 ] tomscut commented on HDFS-14271: Perhaps the client should keep a cache, such as a map, recording the state of each Namenode, and send each request to the Namenode whose cached state matches the operation. When an ObserverRetryOnActiveException or StandbyException occurs, update the cached state of the corresponding Namenode. > [SBN read] StandbyException is logged if Observer is the first NameNode > --- > > Key: HDFS-14271 > URL: https://issues.apache.org/jira/browse/HDFS-14271 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.0 >Reporter: Wei-Chiu Chuang >Assignee: Shen Yinjie >Priority: Minor > Labels: multi-sbnn > Attachments: HDFS-14271_1.patch, image-2022-03-11-14-54-49-806.png > > > If I transition the first NameNode into Observer state, and then I create a > file from command line, it prints the following StandbyException log message, > as if the command failed. But it actually completed successfully: > {noformat} > [root@weichiu-sbsr-1 ~]# hdfs dfs -touchz /tmp/abf > 19/02/12 16:35:17 INFO retry.RetryInvocationHandler: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category WRITE is not supported in state observer. 
Visit > https://s.apache.org/sbnn-error > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1987) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1424) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:762) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:458) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:918) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:853) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2782) > , while invoking $Proxy4.create over > [weichiu-sbsr-1.gce.cloudera.com/172.31.121.145:8020,weichiu-sbsr-2.gce.cloudera.com/172.31.121.140:8020]. > Trying to failover immediately. > {noformat} > This is unlike the case when the first NameNode is the Standby, where this > StandbyException is suppressed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
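The commenter's caching idea can be sketched as a small client-side map of last-observed HA states, consulted before choosing a proxy and updated whenever a StandbyException (or ObserverRetryOnActiveException) reveals a node's real state. This is purely illustrative; it is not the actual ObserverReadProxyProvider API:

```java
// Illustrative sketch of a client-side NameNode state cache (hypothetical
// names; not the real Hadoop failover-proxy API).
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NameNodeStateCacheSketch {

    enum HAState { ACTIVE, STANDBY, OBSERVER, UNKNOWN }

    private final Map<String, HAState> states = new ConcurrentHashMap<>();

    // Consulted before picking a target: nodes cached as the wrong state for
    // the operation can be skipped, avoiding the noisy StandbyException.
    HAState get(String namenode) {
        return states.getOrDefault(namenode, HAState.UNKNOWN);
    }

    // Called when an exception from an RPC reveals the node's actual state.
    void update(String namenode, HAState observed) {
        states.put(namenode, observed);
    }

    public static void main(String[] args) {
        NameNodeStateCacheSketch cache = new NameNodeStateCacheSketch();
        cache.update("nn1", HAState.OBSERVER);  // learned from StandbyException
        System.out.println("nn1 = " + cache.get("nn1"));
    }
}
```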
[jira] [Commented] (HDFS-14271) [SBN read] StandbyException is logged if Observer is the first NameNode
[ https://issues.apache.org/jira/browse/HDFS-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504747#comment-17504747 ] tomscut commented on HDFS-14271: How about lowering the log level as a solution, even though that only reduces the log output? In addition, *ObserverRetryOnActiveException* still needs to be handled. !image-2022-03-11-14-54-49-806.png|width=687,height=150! > [SBN read] StandbyException is logged if Observer is the first NameNode > --- > > Key: HDFS-14271 > URL: https://issues.apache.org/jira/browse/HDFS-14271 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.0 >Reporter: Wei-Chiu Chuang >Assignee: Shen Yinjie >Priority: Minor > Labels: multi-sbnn > Attachments: HDFS-14271_1.patch, image-2022-03-11-14-54-49-806.png > > > If I transition the first NameNode into Observer state, and then I create a > file from command line, it prints the following StandbyException log message, > as if the command failed. But it actually completed successfully: > {noformat} > [root@weichiu-sbsr-1 ~]# hdfs dfs -touchz /tmp/abf > 19/02/12 16:35:17 INFO retry.RetryInvocationHandler: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category WRITE is not supported in state observer. 
Visit > https://s.apache.org/sbnn-error > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1987) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1424) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:762) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:458) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:918) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:853) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2782) > , while invoking $Proxy4.create over > [weichiu-sbsr-1.gce.cloudera.com/172.31.121.145:8020,weichiu-sbsr-2.gce.cloudera.com/172.31.121.140:8020]. > Trying to failover immediately. > {noformat} > This is unlike the case when the first NameNode is the Standby, where this > StandbyException is suppressed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14271) [SBN read] StandbyException is logged if Observer is the first NameNode
[ https://issues.apache.org/jira/browse/HDFS-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-14271: --- Attachment: image-2022-03-11-14-54-49-806.png > [SBN read] StandbyException is logged if Observer is the first NameNode > --- > > Key: HDFS-14271 > URL: https://issues.apache.org/jira/browse/HDFS-14271 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.0 >Reporter: Wei-Chiu Chuang >Assignee: Shen Yinjie >Priority: Minor > Labels: multi-sbnn > Attachments: HDFS-14271_1.patch, image-2022-03-11-14-54-49-806.png > > > If I transition the first NameNode into Observer state, and then I create a > file from command line, it prints the following StandbyException log message, > as if the command failed. But it actually completed successfully: > {noformat} > [root@weichiu-sbsr-1 ~]# hdfs dfs -touchz /tmp/abf > 19/02/12 16:35:17 INFO retry.RetryInvocationHandler: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category WRITE is not supported in state observer. 
Visit > https://s.apache.org/sbnn-error > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1987) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1424) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:762) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:458) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:918) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:853) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2782) > , while invoking $Proxy4.create over > [weichiu-sbsr-1.gce.cloudera.com/172.31.121.145:8020,weichiu-sbsr-2.gce.cloudera.com/172.31.121.140:8020]. > Trying to failover immediately. > {noformat} > This is unlike the case when the first NameNode is the Standby, where this > StandbyException is suppressed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16498) Fix NPE for checkBlockReportLease
[ https://issues.apache.org/jira/browse/HDFS-16498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503977#comment-17503977 ] tomscut commented on HDFS-16498: [~jianghuazhu] I agree with you; it would be more appropriate to change this to the WARN level. I will update it. Thanks. > Fix NPE for checkBlockReportLease > - > > Key: HDFS-16498 > URL: https://issues.apache.org/jira/browse/HDFS-16498 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Attachments: image-2022-03-09-20-35-22-028.png, screenshot-1.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > During the restart of the NameNode, a DataNode that has not yet registered may trigger an FBR, which causes an NPE. > !image-2022-03-09-20-35-22-028.png|width=871,height=158!
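The bug described above boils down to a missing null check: the block-report handler looks up the reporting DataNode and dereferences the result without confirming the node is registered. A minimal, self-contained sketch of the guard, assuming illustrative names (`LeaseRegistry`, `Node`, etc. are invented for this example and are not the actual Hadoop classes):

```java
import java.util.HashMap;
import java.util.Map;

public class LeaseCheckSketch {
    static class Node { final String id; Node(String id) { this.id = id; } }

    static class LeaseRegistry {
        private final Map<String, Node> registered = new HashMap<>();
        void register(Node n) { registered.put(n.id, n); }

        // Returns true if the full block report (FBR) may proceed.
        // Guarding against an unregistered (null) node avoids the NPE.
        boolean checkBlockReportLease(String nodeId) {
            Node node = registered.get(nodeId);
            if (node == null) {
                // Without this check, the code dereferenced node and threw NPE.
                return false; // reject the report; DataNode must register first
            }
            return true;
        }
    }

    public static void main(String[] args) {
        LeaseRegistry r = new LeaseRegistry();
        // FBR from an unregistered DataNode is rejected instead of throwing:
        System.out.println(r.checkBlockReportLease("dn-1")); // false
        r.register(new Node("dn-1"));
        System.out.println(r.checkBlockReportLease("dn-1")); // true
    }
}
```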
[jira] [Created] (HDFS-16499) [SPS]: Should not start indefinitely while another SPS process is running
tomscut created HDFS-16499: -- Summary: [SPS]: Should not start indefinitely while another SPS process is running Key: HDFS-16499 URL: https://issues.apache.org/jira/browse/HDFS-16499 Project: Hadoop HDFS Issue Type: Sub-task Reporter: tomscut Assignee: tomscut Normally, only one SPS process can run at a time. Currently, if another process is started while one is already running, the new process retries indefinitely. I think that, in this case, it should exit immediately instead.
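The proposed behavior change can be sketched as a try-once start instead of a retry loop. The `AtomicBoolean` "running" slot below is purely illustrative; the real SPS uses different coordination between processes:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpsStartSketch {
    // Illustrative stand-in for "another SPS process is already running".
    static final AtomicBoolean RUNNING = new AtomicBoolean(false);

    // Returns true if this process became the active SPS,
    // false if it should exit immediately.
    static boolean tryStart() {
        if (!RUNNING.compareAndSet(false, true)) {
            // Old behavior: loop here and retry indefinitely.
            // Proposed behavior: report the conflict and exit immediately.
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(tryStart()); // true: first instance starts
        System.out.println(tryStart()); // false: second instance exits
    }
}
```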
[jira] [Updated] (HDFS-16498) Fix NPE for checkBlockReportLease
[ https://issues.apache.org/jira/browse/HDFS-16498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16498: --- Description: During the restart of Namenode, a Datanode is not registered, but this Datanode triggers FBR, which causes NPE. !image-2022-03-09-20-35-22-028.png|width=871,height=158! was:During the restart of Namenode, a Datanode is not registered, but this Datanode triggers FBR, which causes NPE. > Fix NPE for checkBlockReportLease > - > > Key: HDFS-16498 > URL: https://issues.apache.org/jira/browse/HDFS-16498 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: tomscut >Assignee: tomscut >Priority: Major > Attachments: image-2022-03-09-20-35-22-028.png > > > During the restart of Namenode, a Datanode is not registered, but this > Datanode triggers FBR, which causes NPE. > !image-2022-03-09-20-35-22-028.png|width=871,height=158! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16498) Fix NPE for checkBlockReportLease
[ https://issues.apache.org/jira/browse/HDFS-16498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16498: --- Attachment: image-2022-03-09-20-35-22-028.png > Fix NPE for checkBlockReportLease > - > > Key: HDFS-16498 > URL: https://issues.apache.org/jira/browse/HDFS-16498 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: tomscut >Assignee: tomscut >Priority: Major > Attachments: image-2022-03-09-20-35-22-028.png > > > During the restart of Namenode, a Datanode is not registered, but this > Datanode triggers FBR, which causes NPE. > !image-2022-03-09-20-35-22-028.png|width=871,height=158! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16498) Fix NPE for checkBlockReportLease
tomscut created HDFS-16498: -- Summary: Fix NPE for checkBlockReportLease Key: HDFS-16498 URL: https://issues.apache.org/jira/browse/HDFS-16498 Project: Hadoop HDFS Issue Type: Bug Reporter: tomscut Assignee: tomscut During the restart of the NameNode, a DataNode that has not yet registered may trigger an FBR, which causes an NPE.
[jira] [Updated] (HDFS-16488) [SPS]: Expose metrics to JMX for external SPS
[ https://issues.apache.org/jira/browse/HDFS-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16488: --- Attachment: image-2022-02-26-22-15-25-543.png > [SPS]: Expose metrics to JMX for external SPS > - > > Key: HDFS-16488 > URL: https://issues.apache.org/jira/browse/HDFS-16488 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Attachments: image-2022-02-26-22-15-25-543.png > > Time Spent: 10m > Remaining Estimate: 0h > > Currently, external SPS has no monitoring metrics. We do not know how many > blocks are waiting to be processed, how many blocks are waiting to be > retried, and how many blocks have been migrated. > We can expose these metrics in JMX for easy collection and display by > monitoring systems. > !image-2022-02-26-22-15-09-432.png|width=593,height=160! > For example, in our cluster, we exposed these metrics to JMX, collected by > JMX-Exporter and combined with Prometheus, and finally display by Grafana. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16488) [SPS]: Expose metrics to JMX for external SPS
[ https://issues.apache.org/jira/browse/HDFS-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16488: --- Description: Currently, external SPS has no monitoring metrics. We do not know how many blocks are waiting to be processed, how many blocks are waiting to be retried, and how many blocks have been migrated. We can expose these metrics in JMX for easy collection and display by monitoring systems. !image-2022-02-26-22-15-09-432.png|width=593,height=160! For example, in our cluster, we exposed these metrics to JMX, collected by JMX-Exporter and combined with Prometheus, and finally display by Grafana. was: Currently, external SPS has no monitoring metrics. We do not know how many blocks are waiting to be processed, how many blocks are waiting to be retried, and how many blocks have been migrated. We can expose these metrics in JMX for easy collection and display by monitoring systems. For example, in our cluster, we exposed these metrics to JMX, collected by JMX-Exporter and combined with Prometheus, and finally display by Grafana. > [SPS]: Expose metrics to JMX for external SPS > - > > Key: HDFS-16488 > URL: https://issues.apache.org/jira/browse/HDFS-16488 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Attachments: image-2022-02-26-22-15-25-543.png > > Time Spent: 10m > Remaining Estimate: 0h > > Currently, external SPS has no monitoring metrics. We do not know how many > blocks are waiting to be processed, how many blocks are waiting to be > retried, and how many blocks have been migrated. > We can expose these metrics in JMX for easy collection and display by > monitoring systems. > !image-2022-02-26-22-15-09-432.png|width=593,height=160! > For example, in our cluster, we exposed these metrics to JMX, collected by > JMX-Exporter and combined with Prometheus, and finally display by Grafana. 
[jira] [Updated] (HDFS-16488) [SPS]: Expose metrics to JMX for external SPS
[ https://issues.apache.org/jira/browse/HDFS-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16488: --- Description: Currently, external SPS has no monitoring metrics. We do not know how many blocks are waiting to be processed, how many blocks are waiting to be retried, and how many blocks have been migrated. We can expose these metrics in JMX for easy collection and display by monitoring systems. !image-2022-02-26-22-15-25-543.png|width=631,height=170! For example, in our cluster, we exposed these metrics to JMX, collected by JMX-Exporter and combined with Prometheus, and finally display by Grafana. was: Currently, external SPS has no monitoring metrics. We do not know how many blocks are waiting to be processed, how many blocks are waiting to be retried, and how many blocks have been migrated. We can expose these metrics in JMX for easy collection and display by monitoring systems. !image-2022-02-26-22-15-09-432.png|width=593,height=160! For example, in our cluster, we exposed these metrics to JMX, collected by JMX-Exporter and combined with Prometheus, and finally display by Grafana. > [SPS]: Expose metrics to JMX for external SPS > - > > Key: HDFS-16488 > URL: https://issues.apache.org/jira/browse/HDFS-16488 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Attachments: image-2022-02-26-22-15-25-543.png > > Time Spent: 10m > Remaining Estimate: 0h > > Currently, external SPS has no monitoring metrics. We do not know how many > blocks are waiting to be processed, how many blocks are waiting to be > retried, and how many blocks have been migrated. > We can expose these metrics in JMX for easy collection and display by > monitoring systems. > !image-2022-02-26-22-15-25-543.png|width=631,height=170! 
> For example, in our cluster, we exposed these metrics to JMX, collected by > JMX-Exporter and combined with Prometheus, and finally display by Grafana. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16488) [SPS]: Expose metrics to JMX for external SPS
[ https://issues.apache.org/jira/browse/HDFS-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16488: --- Attachment: (was: image-2022-02-26-22-15-09-432.png) > [SPS]: Expose metrics to JMX for external SPS > - > > Key: HDFS-16488 > URL: https://issues.apache.org/jira/browse/HDFS-16488 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Attachments: image-2022-02-26-22-15-25-543.png > > Time Spent: 10m > Remaining Estimate: 0h > > Currently, external SPS has no monitoring metrics. We do not know how many > blocks are waiting to be processed, how many blocks are waiting to be > retried, and how many blocks have been migrated. > We can expose these metrics in JMX for easy collection and display by > monitoring systems. > !image-2022-02-26-22-15-25-543.png|width=631,height=170! > For example, in our cluster, we exposed these metrics to JMX, collected by > JMX-Exporter and combined with Prometheus, and finally display by Grafana. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16488) [SPS]: Expose metrics to JMX for external SPS
[ https://issues.apache.org/jira/browse/HDFS-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16488: --- Attachment: image-2022-02-26-22-15-09-432.png > [SPS]: Expose metrics to JMX for external SPS > - > > Key: HDFS-16488 > URL: https://issues.apache.org/jira/browse/HDFS-16488 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Attachments: image-2022-02-26-22-15-25-543.png > > Time Spent: 10m > Remaining Estimate: 0h > > Currently, external SPS has no monitoring metrics. We do not know how many > blocks are waiting to be processed, how many blocks are waiting to be > retried, and how many blocks have been migrated. > We can expose these metrics in JMX for easy collection and display by > monitoring systems. > !image-2022-02-26-22-15-09-432.png|width=593,height=160! > For example, in our cluster, we exposed these metrics to JMX, collected by > JMX-Exporter and combined with Prometheus, and finally display by Grafana. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16488) [SPS]: Expose metrics to JMX for external SPS
tomscut created HDFS-16488: -- Summary: [SPS]: Expose metrics to JMX for external SPS Key: HDFS-16488 URL: https://issues.apache.org/jira/browse/HDFS-16488 Project: Hadoop HDFS Issue Type: Sub-task Reporter: tomscut Assignee: tomscut Currently, the external SPS has no monitoring metrics. We do not know how many blocks are waiting to be processed, how many are waiting to be retried, or how many have been migrated. We can expose these metrics via JMX for easy collection and display by monitoring systems. For example, in our cluster we exposed these metrics to JMX, collected them with JMX-Exporter, combined them with Prometheus, and finally displayed them in Grafana.
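The pipeline described above (expose counters over JMX, scrape with JMX-Exporter) can be sketched with the JDK's built-in javax.management API alone. The MBean name and attributes below (`PendingBlockCount`, `MovedBlockCount`) are invented for illustration and are not the real external-SPS metric names:

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class SpsMetricsSketch {
    // Standard MBean convention: interface name = class name + "MBean".
    public interface SpsMetricsMBean {
        long getPendingBlockCount();
        long getMovedBlockCount();
    }

    public static class SpsMetrics implements SpsMetricsMBean {
        final AtomicLong pending = new AtomicLong();
        final AtomicLong moved = new AtomicLong();
        public long getPendingBlockCount() { return pending.get(); }
        public long getMovedBlockCount() { return moved.get(); }
    }

    public static void main(String[] args) throws Exception {
        SpsMetrics metrics = new SpsMetrics();
        metrics.pending.set(42);
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("SpsSketch:name=ExternalSPS");
        server.registerMBean(metrics, name);
        // A collector such as JMX-Exporter would now read the attribute:
        System.out.println(server.getAttribute(name, "PendingBlockCount")); // 42
    }
}
```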
[jira] [Commented] (HDFS-16397) Reconfig slow disk parameters for datanode
[ https://issues.apache.org/jira/browse/HDFS-16397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498392#comment-17498392 ] tomscut commented on HDFS-16397: Thanks [~tasanuma] . > Reconfig slow disk parameters for datanode > -- > > Key: HDFS-16397 > URL: https://issues.apache.org/jira/browse/HDFS-16397 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.3 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > In large clusters, rolling restart datanodes takes long time. We can make > slow peers parameters and slow disks parameters in datanode reconfigurable to > facilitate cluster operation and maintenance. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16371) Exclude slow disks when choosing volume
[ https://issues.apache.org/jira/browse/HDFS-16371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497914#comment-17497914 ] tomscut commented on HDFS-16371: Hi [~tasanuma], I cherry-picked this PR to branch-3.3. Please take a look at [#4031|https://github.com/apache/hadoop/pull/4031] > Exclude slow disks when choosing volume > --- > > Key: HDFS-16371 > URL: https://issues.apache.org/jira/browse/HDFS-16371 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 3h 40m > Remaining Estimate: 0h > > Currently, the datanode can detect slow disks. See HDFS-11461. > And after HDFS-16311, the slow disk information we collected is more accurate. > So we can exclude these slow disks according to some rules when choosing > volume. This prevents slow disks from affecting the throughput of > the whole datanode.
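The idea quoted above can be sketched as a filter step in front of the volume-choosing policy: drop volumes currently marked slow, and fall back to the full list if filtering would leave nothing. The types and the first-fit pick below are illustrative only; the real policy is round-robin or available-space based:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ChooseVolumeSketch {
    static String chooseVolume(List<String> volumes, Set<String> slowDisks) {
        List<String> candidates = new ArrayList<>();
        for (String v : volumes) {
            if (!slowDisks.contains(v)) candidates.add(v);
        }
        // If every disk is slow, excluding them all would block writes,
        // so fall back to the full list.
        if (candidates.isEmpty()) candidates = volumes;
        return candidates.get(0); // stand-in for the real choosing policy
    }

    public static void main(String[] args) {
        List<String> vols = List.of("/data1", "/data2", "/data3");
        System.out.println(chooseVolume(vols, Set.of("/data1"))); // /data2
    }
}
```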
[jira] [Commented] (HDFS-15854) Make some parameters configurable for SlowDiskTracker and SlowPeerTracker
[ https://issues.apache.org/jira/browse/HDFS-15854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497853#comment-17497853 ] tomscut commented on HDFS-15854: Thank you [~tasanuma] . > Make some parameters configurable for SlowDiskTracker and SlowPeerTracker > - > > Key: HDFS-15854 > URL: https://issues.apache.org/jira/browse/HDFS-15854 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.3 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Make some parameters configurable for SlowDiskTracker and SlowPeerTracker. > Related to https://issues.apache.org/jira/browse/HDFS-15814. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-16460) [SPS]: Handle failure retries for moving tasks
[ https://issues.apache.org/jira/browse/HDFS-16460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17496056#comment-17496056 ] tomscut edited comment on HDFS-16460 at 2/22/22, 12:14 PM: --- The related PR is [#4001.|https://github.com/apache/hadoop/pull/4001] was (Author: tomscut): The related PR is [#4001|https://github.com/apache/hadoop/pull/4001] > [SPS]: Handle failure retries for moving tasks > -- > > Key: HDFS-16460 > URL: https://issues.apache.org/jira/browse/HDFS-16460 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: tomscut >Assignee: tomscut >Priority: Major > > Handle failure retries for moving tasks. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16460) [SPS]: Handle failure retries for moving tasks
[ https://issues.apache.org/jira/browse/HDFS-16460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17496056#comment-17496056 ] tomscut commented on HDFS-16460: The related PR is [#4001|https://github.com/apache/hadoop/pull/4001] > [SPS]: Handle failure retries for moving tasks > -- > > Key: HDFS-16460 > URL: https://issues.apache.org/jira/browse/HDFS-16460 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: tomscut >Assignee: tomscut >Priority: Major > > Handle failure retries for moving tasks. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16477) [SPS]: Add metric PendingSPSPaths for getting the number of paths to be processed by SPS
tomscut created HDFS-16477: -- Summary: [SPS]: Add metric PendingSPSPaths for getting the number of paths to be processed by SPS Key: HDFS-16477 URL: https://issues.apache.org/jira/browse/HDFS-16477 Project: Hadoop HDFS Issue Type: Sub-task Reporter: tomscut Assignee: tomscut Currently, we have no idea how many paths are waiting to be processed when using the SPS feature. We should add a metric, PendingSPSPaths, to expose the number of paths waiting to be processed by SPS in the NameNode.
[jira] [Created] (HDFS-16460) [SPS]: Handle failure retries for moving tasks
tomscut created HDFS-16460: -- Summary: [SPS]: Handle failure retries for moving tasks Key: HDFS-16460 URL: https://issues.apache.org/jira/browse/HDFS-16460 Project: Hadoop HDFS Issue Type: Sub-task Reporter: tomscut Assignee: tomscut Handle failure retries for moving tasks. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16458) [SPS]: Fix bug for unit test of reconfiguring SPS mode
[ https://issues.apache.org/jira/browse/HDFS-16458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494472#comment-17494472 ] tomscut commented on HDFS-16458: Hi [~rakeshr], [~umamaheswararao], PTAL. Thanks. > [SPS]: Fix bug for unit test of reconfiguring SPS mode > -- > > Key: HDFS-16458 > URL: https://issues.apache.org/jira/browse/HDFS-16458 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: tomscut >Assignee: tomscut >Priority: Major > > TestNameNodeReconfigure#verifySPSEnabled was compared with > itself({*}isSPSRunning{*}) at assertEquals. > In addition, after an *internal SPS* has been removed, *spsService daemon* > will not start within StoragePolicySatisfyManager. I think the relevant code > can be removed to simplify the code. > IMO, after reconfig SPS mode, we just need to confirm whether the mode is > correct and whether spsManager is NULL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16458) [SPS]: Fix bug for unit test of reconfiguring SPS mode
[ https://issues.apache.org/jira/browse/HDFS-16458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-16458: --- Description: TestNameNodeReconfigure#verifySPSEnabled was compared with itself({*}isSPSRunning{*}) at assertEquals. In addition, after an *internal SPS* has been removed, *spsService daemon* will not start within StoragePolicySatisfyManager. I think the relevant code can be removed to simplify the code. IMO, after reconfig SPS mode, we just need to confirm whether the mode is correct and whether spsManager is NULL. was: TestNameNodeReconfigure#verifySPSEnabled was compared with itself(isSPSRunning) at assertEquals. In addition, after an *internal SPS* has been removed, *spsService daemon* will not start within StoragePolicySatisfyManager. I think the relevant code can be removed to simplify the code. IMO, after reconfig SPS mode, we just need to confirm whether the mode is correct and whether spsManager is NULL. > [SPS]: Fix bug for unit test of reconfiguring SPS mode > -- > > Key: HDFS-16458 > URL: https://issues.apache.org/jira/browse/HDFS-16458 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: tomscut >Assignee: tomscut >Priority: Major > > TestNameNodeReconfigure#verifySPSEnabled was compared with > itself({*}isSPSRunning{*}) at assertEquals. > In addition, after an *internal SPS* has been removed, *spsService daemon* > will not start within StoragePolicySatisfyManager. I think the relevant code > can be removed to simplify the code. > IMO, after reconfig SPS mode, we just need to confirm whether the mode is > correct and whether spsManager is NULL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16458) [SPS]: Fix bug for unit test of reconfiguring SPS mode
tomscut created HDFS-16458: -- Summary: [SPS]: Fix bug for unit test of reconfiguring SPS mode Key: HDFS-16458 URL: https://issues.apache.org/jira/browse/HDFS-16458 Project: Hadoop HDFS Issue Type: Sub-task Reporter: tomscut Assignee: tomscut In TestNameNodeReconfigure#verifySPSEnabled, the assertEquals compares isSPSRunning with itself, so the assertion always passes. In addition, since the *internal SPS* has been removed, the *spsService daemon* will not start within StoragePolicySatisfyManager, so the relevant code can be removed to simplify things. IMO, after reconfiguring the SPS mode, we just need to confirm whether the mode is correct and whether spsManager is null.
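The test bug described above is a classic self-comparison: an assertion that compares a value with itself is trivially true and verifies nothing. A tiny sketch of the pattern and its fix (names are illustrative, not the real TestNameNodeReconfigure code):

```java
public class SelfCompareSketch {
    static boolean isSPSRunning = false; // pretend current state: SPS stopped

    static boolean isSPSRunning() { return isSPSRunning; }

    public static void main(String[] args) {
        boolean expected = true; // what the test *meant* to verify

        // Buggy assertion pattern: compares the value with itself,
        // so it passes regardless of the actual state.
        boolean buggy = (isSPSRunning() == isSPSRunning());
        System.out.println(buggy); // true even though SPS is not running

        // Fixed pattern: compare the observed state to the expected value.
        boolean fixed = (isSPSRunning() == expected);
        System.out.println(fixed); // false: the regression is now caught
    }
}
```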
[jira] [Commented] (HDFS-12228) [SPS]: Add storage policy satisfier related metrics
[ https://issues.apache.org/jira/browse/HDFS-12228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494344#comment-17494344 ] tomscut commented on HDFS-12228: Hi [~ajithshetty] [~rakeshr] , how is the current progress of this? Is it still going on? > [SPS]: Add storage policy satisfier related metrics > --- > > Key: HDFS-12228 > URL: https://issues.apache.org/jira/browse/HDFS-12228 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode, namenode >Reporter: Rakesh Radhakrishnan >Assignee: Ajith S >Priority: Major > > This jira to discuss and implement metrics needed for SPS feature. > Below are few metrics: > # count of {{inprogress}} block movements > # count of {{successful}} block movements > # count of {{failed}} block movements > Need to analyse and add more. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.
[ https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-15118: --- Labels: (was: Read SBN) > [SBN Read] Slow clients when Observer reads are enabled but there are no > Observers on the cluster. > -- > > Key: HDFS-15118 > URL: https://issues.apache.org/jira/browse/HDFS-15118 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1 > > Attachments: HDFS-15118.001.patch, HDFS-15118.002.patch > > > We see substantial degradation in performance of HDFS clients, when Observer > reads are enabled via {{ObserverReadProxyProvider}}, but there are no > ObserverNodes on the cluster. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.
[ https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tomscut updated HDFS-15118: --- Labels: Read SBN (was: ) > [SBN Read] Slow clients when Observer reads are enabled but there are no > Observers on the cluster. > -- > > Key: HDFS-15118 > URL: https://issues.apache.org/jira/browse/HDFS-15118 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Labels: Read, SBN > Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1 > > Attachments: HDFS-15118.001.patch, HDFS-15118.002.patch > > > We see substantial degradation in performance of HDFS clients, when Observer > reads are enabled via {{ObserverReadProxyProvider}}, but there are no > ObserverNodes on the cluster. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16446) Consider ioutils of disk when choosing volume
tomscut created HDFS-16446: -- Summary: Consider ioutils of disk when choosing volume Key: HDFS-16446 URL: https://issues.apache.org/jira/browse/HDFS-16446 Project: Hadoop HDFS Issue Type: New Feature Reporter: tomscut Assignee: tomscut Attachments: image-2022-02-05-09-50-12-241.png Consider ioutils of disk when choosing volume. Principle is as follows: !image-2022-02-05-09-50-12-241.png|width=309,height=159! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13671) Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet
[ https://issues.apache.org/jira/browse/HDFS-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17484045#comment-17484045 ] tomscut commented on HDFS-13671: We introduced this patch into branch-3.1.0, and it is stable for replicated data. But for EC data, GC performance was poor without tuning the GC parameters. We changed the GC from CMS to G1 and strictly limited G1MaxNewSizePercent and MaxGCPauseMillis; after that, GC performance improved and was acceptable. However, a mixed GC could still take 10 seconds or more, even though mixed GC was only triggered every 7 days or so. If anyone else uses this patch on EC data, we look forward to comparing notes. Thanks. BTW, if there is a need to submit a related PR to branch-3.1, I am happy to do that. > Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet > -- > > Key: HDFS-13671 > URL: https://issues.apache.org/jira/browse/HDFS-13671 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0, 3.0.3 >Reporter: Yiqun Lin >Assignee: Haibin Huang >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.2.3, 3.3.2 > > Attachments: HDFS-13671-001.patch, image-2021-06-10-19-28-18-373.png, > image-2021-06-10-19-28-58-359.png, image-2021-06-18-15-46-46-052.png, > image-2021-06-18-15-47-04-037.png > > Time Spent: 7h 40m > Remaining Estimate: 0h > > NameNode hung when deleting large files/blocks.
The stack info: > {code} > "IPC Server handler 4 on 8020" #87 daemon prio=5 os_prio=0 > tid=0x7fb505b27800 nid=0x94c3 runnable [0x7fa861361000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.hdfs.util.FoldedTreeSet.compare(FoldedTreeSet.java:474) > at > org.apache.hadoop.hdfs.util.FoldedTreeSet.removeAndGet(FoldedTreeSet.java:849) > at > org.apache.hadoop.hdfs.util.FoldedTreeSet.remove(FoldedTreeSet.java:911) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.removeBlock(DatanodeStorageInfo.java:252) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:194) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:108) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlockFromMap(BlockManager.java:3813) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlock(BlockManager.java:3617) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.removeBlocks(FSNamesystem.java:4270) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:4244) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:4180) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:4164) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:871) > at > org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.delete(AuthorizationProviderProxyClientProtocol.java:311) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:625) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) > {code} > In the current 
deletion logic in NameNode, there are mainly two steps: > * Collect the INodes and all blocks to be deleted, then delete the INodes. > * Remove the blocks chunk by chunk in a loop. > Actually, the first step should be the more expensive operation and should take more time. However, we now regularly see the NN hang during the remove-block operation. > Looking into this: we introduced a new structure, {{FoldedTreeSet}}, to get better performance when handling FBRs/IBRs. But compared with the earlier implementation of the remove-block logic, {{FoldedTreeSet}} seems slower, since it takes additional time to rebalance tree nodes. When there are many blocks to be removed/deleted, this looks bad. > For the get-type operations in {{DatanodeStorageInfo}}, we only provide {{getBlockIterator}} to return a block iterator, and no other get operation for a specified block. Do we still need to use {{FoldedTreeSet}} in {{DatanodeStorageInfo}}? As we know, {{FoldedTreeSet}} benefits Get, not Update. Maybe we can revert to the earlier implementation.
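The second step above, removing blocks chunk by chunk so one huge delete does not hold the namesystem lock in a single long critical section, can be sketched as follows. The chunk size and the commented-out lock are illustrative, not the actual NameNode code:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ChunkedDeleteSketch {
    static final int CHUNK = 1000; // blocks removed per lock acquisition

    static int removeBlocks(Set<Integer> blocksMap, List<Integer> toDelete) {
        int removed = 0;
        for (int start = 0; start < toDelete.size(); start += CHUNK) {
            int end = Math.min(start + CHUNK, toDelete.size());
            // lock.writeLock().lock();   // reacquired per chunk in real code
            for (Integer b : toDelete.subList(start, end)) {
                if (blocksMap.remove(b)) removed++;
            }
            // lock.writeLock().unlock(); // lets other RPCs run between chunks
        }
        return removed;
    }

    public static void main(String[] args) {
        Set<Integer> map = new HashSet<>();
        List<Integer> del = new ArrayList<>();
        for (int i = 0; i < 2500; i++) { map.add(i); del.add(i); }
        System.out.println(removeBlocks(map, del)); // 2500
        System.out.println(map.size()); // 0
    }
}
```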