[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking gap for inprogress EditLogInputStream

2022-04-22 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16557:
---
Description: 
The lastTxId of an in-progress EditLogInputStream isn't necessarily 
HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
EditLogInputStream#isInProgress.

For example, set {color:#ff}{{dfs.ha.tail-edits.in-progress=true}}{color} and 
then run bootstrapStandby. The in-progress EditLogInputStream is misjudged, 
resulting in a gap-check failure, which causes bootstrapStandby to fail.

hdfs namenode -bootstrapStandby

!image-2022-04-22-17-17-32-487.png|width=766,height=161!

!image-2022-04-22-17-17-14-577.png|width=598,height=187!
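
A minimal sketch of the idea, with simplified, assumed variable names rather 
than the exact BootstrapStandby source: recognize the open segment via 
EditLogInputStream#isInProgress instead of comparing its lastTxId against 
HdfsServerConstants.INVALID_TXID.
{code:java}
// Hedged sketch (hypothetical names; not the committed patch):
// an in-progress stream may report a real lastTxId instead of INVALID_TXID,
// so detect it via isInProgress() before running the gap check.
long expectedTxId = firstTxIdInLogs; // assumed starting txid
for (EditLogInputStream stream : streams) {
  if (stream.isInProgress()) {
    continue; // skip the open segment; don't infer its state from lastTxId
  }
  if (stream.getFirstTxId() != expectedTxId) {
    throw new IOException("Gap in transaction log: expected txid "
        + expectedTxId + " but found " + stream.getFirstTxId());
  }
  expectedTxId = stream.getLastTxId() + 1;
}
{code}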

  was:
The lastTxId of an in-progress EditLogInputStream isn't necessarily 
HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
EditLogInputStream#isInProgress.

For example, set {color:#FF}{{dfs.ha.tail-edits.in-progress=true}}{color} and 
then run bootstrapStandby. The in-progress EditLogInputStream is misjudged, 
resulting in a gap-check failure, which causes bootstrapStandby to fail.

!image-2022-04-22-17-17-32-487.png|width=766,height=161!

!image-2022-04-22-17-17-14-577.png|width=598,height=187!


> BootstrapStandby failed because of checking gap for inprogress 
> EditLogInputStream
> -
>
> Key: HDFS-16557
> URL: https://issues.apache.org/jira/browse/HDFS-16557
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-04-22-17-17-14-577.png, 
> image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, 
> image-2022-04-22-17-17-32-487.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The lastTxId of an in-progress EditLogInputStream isn't necessarily 
> HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
> EditLogInputStream#isInProgress.
> For example, set 
> {color:#ff}{{dfs.ha.tail-edits.in-progress=true}}{color} and then run 
> bootstrapStandby. The in-progress EditLogInputStream is misjudged, 
> resulting in a gap-check failure, which causes bootstrapStandby to fail.
> hdfs namenode -bootstrapStandby
> !image-2022-04-22-17-17-32-487.png|width=766,height=161!
> !image-2022-04-22-17-17-14-577.png|width=598,height=187!






[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking gap for inprogress EditLogInputStream

2022-04-22 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16557:
---
Description: 
The lastTxId of an in-progress EditLogInputStream isn't necessarily 
HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
EditLogInputStream#isInProgress.

For example, set {color:#FF}{{dfs.ha.tail-edits.in-progress=true}}{color} and 
then run bootstrapStandby. The in-progress EditLogInputStream is misjudged, 
resulting in a gap-check failure, which causes bootstrapStandby to fail.

!image-2022-04-22-17-17-32-487.png|width=766,height=161!

!image-2022-04-22-17-17-14-577.png|width=598,height=187!

  was:
The lastTxId of an in-progress EditLogInputStream isn't necessarily 
HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
EditLogInputStream#isInProgress.

For example, during bootstrapStandby the in-progress EditLogInputStream is 
misjudged, resulting in a gap-check failure, which causes bootstrapStandby 
to fail.

!image-2022-04-22-17-17-32-487.png|width=766,height=161!

!image-2022-04-22-17-17-14-577.png|width=598,height=187!


> BootstrapStandby failed because of checking gap for inprogress 
> EditLogInputStream
> -
>
> Key: HDFS-16557
> URL: https://issues.apache.org/jira/browse/HDFS-16557
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-04-22-17-17-14-577.png, 
> image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, 
> image-2022-04-22-17-17-32-487.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The lastTxId of an in-progress EditLogInputStream isn't necessarily 
> HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
> EditLogInputStream#isInProgress.
> For example, set 
> {color:#FF}{{dfs.ha.tail-edits.in-progress=true}}{color} and then run 
> bootstrapStandby. The in-progress EditLogInputStream is misjudged, 
> resulting in a gap-check failure, which causes bootstrapStandby to fail.
> !image-2022-04-22-17-17-32-487.png|width=766,height=161!
> !image-2022-04-22-17-17-14-577.png|width=598,height=187!






[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking gap for inprogress EditLogInputStream

2022-04-22 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16557:
---
Summary: BootstrapStandby failed because of checking gap for inprogress 
EditLogInputStream  (was: BootstrapStandby failed because of checking Gap for 
inprogress EditLogInputStream)

> BootstrapStandby failed because of checking gap for inprogress 
> EditLogInputStream
> -
>
> Key: HDFS-16557
> URL: https://issues.apache.org/jira/browse/HDFS-16557
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-04-22-17-17-14-577.png, 
> image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, 
> image-2022-04-22-17-17-32-487.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The lastTxId of an in-progress EditLogInputStream isn't necessarily 
> HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
> EditLogInputStream#isInProgress.
> For example, during bootstrapStandby the in-progress EditLogInputStream is 
> misjudged, resulting in a gap-check failure, which causes bootstrapStandby 
> to fail.
> !image-2022-04-22-17-17-32-487.png|width=766,height=161!
> !image-2022-04-22-17-17-14-577.png|width=598,height=187!






[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream

2022-04-22 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16557:
---
Attachment: image-2022-04-22-17-17-32-487.png

> BootstrapStandby failed because of checking Gap for inprogress 
> EditLogInputStream
> -
>
> Key: HDFS-16557
> URL: https://issues.apache.org/jira/browse/HDFS-16557
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
> Attachments: image-2022-04-22-17-17-14-577.png, 
> image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, 
> image-2022-04-22-17-17-32-487.png
>
>
> The lastTxId of an in-progress EditLogInputStream isn't necessarily 
> HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
> EditLogInputStream#isInProgress.
> For example, during bootstrapStandby the in-progress EditLogInputStream is 
> misjudged, resulting in a gap-check failure, which causes bootstrapStandby 
> to fail.
> !image-2022-04-22-17-17-23-113.png!
> !image-2022-04-22-17-17-14-577.png!






[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream

2022-04-22 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16557:
---
Description: 
The lastTxId of an in-progress EditLogInputStream isn't necessarily 
HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
EditLogInputStream#isInProgress.

For example, during bootstrapStandby the in-progress EditLogInputStream is 
misjudged, resulting in a gap-check failure, which causes bootstrapStandby 
to fail.

!image-2022-04-22-17-17-23-113.png!

!image-2022-04-22-17-17-14-577.png!

  was:
The lastTxId of an in-progress EditLogInputStream isn't necessarily 
HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
EditLogInputStream#isInProgress.

For example, during bootstrapStandby the in-progress EditLogInputStream is 
misjudged, resulting in a gap-check failure, which causes bootstrapStandby 
to fail.


> BootstrapStandby failed because of checking Gap for inprogress 
> EditLogInputStream
> -
>
> Key: HDFS-16557
> URL: https://issues.apache.org/jira/browse/HDFS-16557
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
> Attachments: image-2022-04-22-17-17-14-577.png, 
> image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, 
> image-2022-04-22-17-17-32-487.png
>
>
> The lastTxId of an in-progress EditLogInputStream isn't necessarily 
> HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
> EditLogInputStream#isInProgress.
> For example, during bootstrapStandby the in-progress EditLogInputStream is 
> misjudged, resulting in a gap-check failure, which causes bootstrapStandby 
> to fail.
> !image-2022-04-22-17-17-23-113.png!
> !image-2022-04-22-17-17-14-577.png!






[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream

2022-04-22 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16557:
---
Attachment: image-2022-04-22-17-17-14-618.png

> BootstrapStandby failed because of checking Gap for inprogress 
> EditLogInputStream
> -
>
> Key: HDFS-16557
> URL: https://issues.apache.org/jira/browse/HDFS-16557
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
> Attachments: image-2022-04-22-17-17-14-577.png, 
> image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, 
> image-2022-04-22-17-17-32-487.png
>
>
> The lastTxId of an in-progress EditLogInputStream isn't necessarily 
> HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
> EditLogInputStream#isInProgress.
> For example, during bootstrapStandby the in-progress EditLogInputStream is 
> misjudged, resulting in a gap-check failure, which causes bootstrapStandby 
> to fail.






[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream

2022-04-22 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16557:
---
Attachment: image-2022-04-22-17-17-23-113.png

> BootstrapStandby failed because of checking Gap for inprogress 
> EditLogInputStream
> -
>
> Key: HDFS-16557
> URL: https://issues.apache.org/jira/browse/HDFS-16557
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
> Attachments: image-2022-04-22-17-17-14-577.png, 
> image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, 
> image-2022-04-22-17-17-32-487.png
>
>
> The lastTxId of an in-progress EditLogInputStream isn't necessarily 
> HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
> EditLogInputStream#isInProgress.
> For example, during bootstrapStandby the in-progress EditLogInputStream is 
> misjudged, resulting in a gap-check failure, which causes bootstrapStandby 
> to fail.
> !image-2022-04-22-17-17-23-113.png!
> !image-2022-04-22-17-17-14-577.png!






[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream

2022-04-22 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16557:
---
Description: 
The lastTxId of an in-progress EditLogInputStream isn't necessarily 
HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
EditLogInputStream#isInProgress.

For example, during bootstrapStandby the in-progress EditLogInputStream is 
misjudged, resulting in a gap-check failure, which causes bootstrapStandby 
to fail.

!image-2022-04-22-17-17-32-487.png|width=766,height=161!

!image-2022-04-22-17-17-14-577.png|width=598,height=187!

  was:
The lastTxId of an in-progress EditLogInputStream isn't necessarily 
HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
EditLogInputStream#isInProgress.

For example, during bootstrapStandby the in-progress EditLogInputStream is 
misjudged, resulting in a gap-check failure, which causes bootstrapStandby 
to fail.

!image-2022-04-22-17-17-23-113.png!

!image-2022-04-22-17-17-14-577.png!


> BootstrapStandby failed because of checking Gap for inprogress 
> EditLogInputStream
> -
>
> Key: HDFS-16557
> URL: https://issues.apache.org/jira/browse/HDFS-16557
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
> Attachments: image-2022-04-22-17-17-14-577.png, 
> image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, 
> image-2022-04-22-17-17-32-487.png
>
>
> The lastTxId of an in-progress EditLogInputStream isn't necessarily 
> HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
> EditLogInputStream#isInProgress.
> For example, during bootstrapStandby the in-progress EditLogInputStream is 
> misjudged, resulting in a gap-check failure, which causes bootstrapStandby 
> to fail.
> !image-2022-04-22-17-17-32-487.png|width=766,height=161!
> !image-2022-04-22-17-17-14-577.png|width=598,height=187!






[jira] [Updated] (HDFS-16557) BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream

2022-04-22 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16557:
---
Attachment: image-2022-04-22-17-17-14-577.png

> BootstrapStandby failed because of checking Gap for inprogress 
> EditLogInputStream
> -
>
> Key: HDFS-16557
> URL: https://issues.apache.org/jira/browse/HDFS-16557
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
> Attachments: image-2022-04-22-17-17-14-577.png, 
> image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, 
> image-2022-04-22-17-17-32-487.png
>
>
> The lastTxId of an in-progress EditLogInputStream isn't necessarily 
> HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
> EditLogInputStream#isInProgress.
> For example, during bootstrapStandby the in-progress EditLogInputStream is 
> misjudged, resulting in a gap-check failure, which causes bootstrapStandby 
> to fail.






[jira] [Created] (HDFS-16557) BootstrapStandby failed because of checking Gap for inprogress EditLogInputStream

2022-04-22 Thread tomscut (Jira)
tomscut created HDFS-16557:
--

 Summary: BootstrapStandby failed because of checking Gap for 
inprogress EditLogInputStream
 Key: HDFS-16557
 URL: https://issues.apache.org/jira/browse/HDFS-16557
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: tomscut
Assignee: tomscut


The lastTxId of an in-progress EditLogInputStream isn't necessarily 
HdfsServerConstants.INVALID_TXID. We can determine its status directly via 
EditLogInputStream#isInProgress.

For example, during bootstrapStandby the in-progress EditLogInputStream is 
misjudged, resulting in a gap-check failure, which causes bootstrapStandby 
to fail.






[jira] [Updated] (HDFS-16552) Fix NPE for TestBlockManager

2022-04-21 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16552:
---
Summary: Fix NPE for TestBlockManager  (was: Fix NPE for 
BlockManager#scheduleReconstruction)

> Fix NPE for TestBlockManager
> 
>
> Key: HDFS-16552
> URL: https://issues.apache.org/jira/browse/HDFS-16552
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> There is an NPE in BlockManager when running 
> TestBlockManager#testSkipReconstructionWithManyBusyNodes2, because 
> NameNodeMetrics is not initialized in this unit test.
>  
> For the related CI log, see 
> [this|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt].
> {code:java}
> [ERROR] Tests run: 34, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 30.088 s <<< FAILURE! - in 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager
> [ERROR] 
> testSkipReconstructionWithManyBusyNodes2(org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager)
>   Time elapsed: 2.783 s  <<< ERROR!
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.scheduleReconstruction(BlockManager.java:2171)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSkipReconstructionWithManyBusyNodes2(TestBlockManager.java:947)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>   at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
>   at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) 
> {code}
>  
>  
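
A minimal sketch of one way to avoid this NPE, assuming the test keeps 
calling BlockManager#scheduleReconstruction directly (the setup method name 
is hypothetical; NameNode.initMetrics is the existing static initializer):
{code:java}
// Hedged sketch, not necessarily the committed patch: create the
// NameNodeMetrics instance before the test body runs, so BlockManager
// can record metrics without dereferencing null.
@Before
public void initNameNodeMetrics() {
  NameNode.initMetrics(new HdfsConfiguration(),
      HdfsServerConstants.NamenodeRole.NAMENODE);
}
{code}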






[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash

2022-04-21 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526151#comment-17526151
 ] 

tomscut commented on HDFS-16550:


I have submitted a simple PR that follows the fast-fail approach. [~sunchao] 
[~xkrogen] Please take a look, thank you very much.

> [SBN read] Improper cache-size for journal node may cause cluster crash
> ---
>
> Key: HDFS-16550
> URL: https://issues.apache.org/jira/browse/HDFS-16550
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-04-21-09-54-29-751.png, 
> image-2022-04-21-09-54-57-111.png, image-2022-04-21-12-32-56-170.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When we introduced {*}SBN Read{*}, we encountered a problem while upgrading 
> the JournalNodes.
> Cluster Info: 
> *Active: nn0*
> *Standby: nn1*
> 1. Rolling restart the JournalNodes. {color:#ff}(related config: 
> dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}
> 2. The cluster runs for a while; edits cache usage keeps increasing until 
> memory is used up.
> 3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed 
> out waiting 12ms for a quorum of nodes to respond”{_}.
> 4. Transfer nn1 to Active state.
> 5. {color:#ff}New Active namenode(nn1){color} also shutdown because of 
> “{_}Timed out waiting 12ms for a quorum of nodes to respond” too{_}.
> 6. {color:#ff}The cluster crashed{color}.
>  
> Related code:
> {code:java}
> JournaledEditsCache(Configuration conf) {
>   capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
>   DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
>   if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
> Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
> "maximum JVM memory is only %d bytes. It is recommended that you " +
> "decrease the cache size or increase the heap size.",
> capacity, Runtime.getRuntime().maxMemory()));
>   }
>   Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
>   "of bytes: " + capacity);
>   ReadWriteLock lock = new ReentrantReadWriteLock(true);
>   readLock = new AutoCloseableLock(lock.readLock());
>   writeLock = new AutoCloseableLock(lock.writeLock());
>   initialize(INVALID_TXN_ID);
> } {code}
> Currently, *dfs.journalnode.edit-cache-size.bytes* can be set larger than 
> the memory requested by the process. If 
> {*}dfs.journalnode.edit-cache-size.bytes > 0.9 * 
> Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
> JournalNode startup. This is easily overlooked by users, but once the 
> cluster has been running for a while, it is likely to crash.
>  
> NN log:
> !image-2022-04-21-09-54-57-111.png|width=1012,height=47!
> !image-2022-04-21-12-32-56-170.png|width=809,height=218!
> IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * 
> Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
> {color:#ff}fast fail{color}, giving users a clear hint to update the 
> related configuration. Or, if the cache size exceeds 50% (or some other 
> threshold) of maxMemory, force it down to 25% of maxMemory.






[jira] [Updated] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash

2022-04-21 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16550:
---
Description: 
When we introduced {*}SBN Read{*}, we encountered a problem while upgrading 
the JournalNodes.

Cluster Info: 
*Active: nn0*
*Standby: nn1*

1. Rolling restart the JournalNodes. {color:#ff}(related config: 
dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}

2. The cluster runs for a while; edits cache usage keeps increasing until 
memory is used up.

3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed out 
waiting 12ms for a quorum of nodes to respond”{_}.

4. Transfer nn1 to Active state.

5. {color:#ff}New Active namenode(nn1){color} also shutdown because of 
“{_}Timed out waiting 12ms for a quorum of nodes to respond” too{_}.

6. {color:#ff}The cluster crashed{color}.

 

Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
  DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
"maximum JVM memory is only %d bytes. It is recommended that you " +
"decrease the cache size or increase the heap size.",
capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
  "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *dfs.journalnode.edit-cache-size.bytes* can be set larger than the 
memory requested by the process. If {*}dfs.journalnode.edit-cache-size.bytes > 
0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
JournalNode startup. This is easily overlooked by users, but once the cluster 
has been running for a while, it is likely to crash.

 

NN log:

!image-2022-04-21-09-54-57-111.png|width=1012,height=47!

!image-2022-04-21-12-32-56-170.png|width=809,height=218!

IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * 
Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
{color:#ff}fast fail{color}, giving users a clear hint to update the related 
configuration. Or, if the cache size exceeds 50% (or some other threshold) of 
maxMemory, force it down to 25% of maxMemory.
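
A minimal sketch of the fast-fail variant, assuming the check stays in the 
JournaledEditsCache constructor shown above (the threshold and message 
wording are placeholders, not a committed design):
{code:java}
// Hedged sketch: refuse to start instead of only warning when the
// configured cache capacity cannot reasonably fit in the JVM heap.
if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
  throw new IllegalArgumentException(String.format(
      "Cache capacity is set at %d bytes but maximum JVM memory is only %d"
      + " bytes. Decrease dfs.journalnode.edit-cache-size.bytes or increase"
      + " the JournalNode heap size.",
      capacity, Runtime.getRuntime().maxMemory()));
}
{code}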

  was:
When we introduced {*}SBN Read{*}, we encountered a problem while upgrading 
the JournalNodes.

Cluster Info: 
*Active: nn0*
*Standby: nn1*

1. Rolling restart the JournalNodes. {color:#ff}(related config: 
dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}

2. The cluster runs for a while; edits cache usage keeps increasing until 
memory is used up.

3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed out 
waiting 12ms for a quorum of nodes to respond”{_}.

4. Transfer nn1 to Active state.

5. {color:#ff}New Active namenode(nn1){color} also shutdown because of 
“{_}Timed out waiting 12ms for a quorum of nodes to respond” too{_}.

6. {color:#ff}The cluster crashed{color}.

 

Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
  DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
"maximum JVM memory is only %d bytes. It is recommended that you " +
"decrease the cache size or increase the heap size.",
capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
  "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *dfs.journalnode.edit-cache-size.bytes* can be set larger than the 
memory requested by the process. If {*}dfs.journalnode.edit-cache-size.bytes > 
0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
JournalNode startup. This is easily overlooked by users, but once the cluster 
has been running for a while, it is likely to crash.

 

NN log:

!image-2022-04-21-09-54-57-111.png|width=1012,height=47!

!image-2022-04-21-12-32-56-170.png|width=809,height=218!

 

IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * 
Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
{color:#ff}fast fail{color}, giving users a clear hint to update the related 
configuration.

[jira] [Updated] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash

2022-04-21 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16550:
---
Description: 
When we introduced {*}SBN Read{*}, we encountered a problem while upgrading 
the JournalNodes.

Cluster Info: 
*Active: nn0*
*Standby: nn1*

1. Rolling restart the JournalNodes. {color:#ff}(related config: 
dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}

2. The cluster runs for a while; edits cache usage keeps increasing until 
memory is used up.

3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed out 
waiting 12ms for a quorum of nodes to respond”{_}.

4. Transfer nn1 to Active state.

5. {color:#ff}New Active namenode(nn1){color} also shutdown because of 
“{_}Timed out waiting 12ms for a quorum of nodes to respond” too{_}.

6. {color:#ff}The cluster crashed{color}.

 

Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
  DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
"maximum JVM memory is only %d bytes. It is recommended that you " +
"decrease the cache size or increase the heap size.",
capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
  "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *dfs.journalnode.edit-cache-size.bytes* can be set larger than the 
memory requested by the process. If {*}dfs.journalnode.edit-cache-size.bytes > 
0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
JournalNode startup. This is easily overlooked by users, but once the cluster 
has been running for a while, it is likely to crash.

 

NN log:

!image-2022-04-21-09-54-57-111.png|width=1012,height=47!

!image-2022-04-21-12-32-56-170.png|width=809,height=218!

 

IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * 
Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
{color:#ff}fast fail{color}, giving users a clear hint to update the related 
configuration.

  was:
When we introduced {*}SBN Read{*}, we encountered a problem while upgrading 
the JournalNodes.

Cluster Info: 
*Active: nn0*
*Standby: nn1*

1. Rolling restart the JournalNodes. {color:#ff}(related config: 
dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}

2. The cluster runs for a while.

3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed out 
waiting 12ms for a quorum of nodes to respond”{_}.

4. Transfer nn1 to Active state.

5. {color:#ff}New Active namenode(nn1){color} also shutdown because of 
“{_}Timed out waiting 12ms for a quorum of nodes to respond” too{_}.

6. {color:#ff}The cluster crashed{color}.

 

Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
  DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
"maximum JVM memory is only %d bytes. It is recommended that you " +
"decrease the cache size or increase the heap size.",
capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
  "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *dfs.journalnode.edit-cache-size.bytes* can be set larger than the 
memory requested by the process. If {*}dfs.journalnode.edit-cache-size.bytes > 
0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
JournalNode startup. This is easily overlooked by users, but once the cluster 
has been running for a while, it is likely to crash.

 

NN log:

!image-2022-04-21-09-54-57-111.png|width=1012,height=47!

!image-2022-04-21-12-32-56-170.png|width=809,height=218!

 

IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * 
Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
{color:#ff}fast fail{color}, giving users a clear hint to update the related 
configuration.


> [SBN read] Improper cache-size for journal node may cause cluster crash
> 

[jira] [Updated] (HDFS-16552) Fix NPE for BlockManager#scheduleReconstruction

2022-04-21 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16552:
---
Summary: Fix NPE for BlockManager#scheduleReconstruction  (was: Fix NPE for 
BlockManager)

> Fix NPE for BlockManager#scheduleReconstruction
> ---
>
> Key: HDFS-16552
> URL: https://issues.apache.org/jira/browse/HDFS-16552
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>
> There is an NPE in BlockManager when running 
> TestBlockManager#testSkipReconstructionWithManyBusyNodes2, because 
> NameNodeMetrics is not initialized in this unit test.
>  
> For the related CI log, see 
> [this|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt].
> {code:java}
> [ERROR] Tests run: 34, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 30.088 s <<< FAILURE! - in 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager
> [ERROR] 
> testSkipReconstructionWithManyBusyNodes2(org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager)
>   Time elapsed: 2.783 s  <<< ERROR!
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.scheduleReconstruction(BlockManager.java:2171)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSkipReconstructionWithManyBusyNodes2(TestBlockManager.java:947)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>   at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
>   at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) 
> {code}
>  
>  






[jira] [Created] (HDFS-16552) Fix NPE for BlockManager

2022-04-21 Thread tomscut (Jira)
tomscut created HDFS-16552:
--

 Summary: Fix NPE for BlockManager
 Key: HDFS-16552
 URL: https://issues.apache.org/jira/browse/HDFS-16552
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: tomscut
Assignee: tomscut


There is an NPE in BlockManager when running 
TestBlockManager#testSkipReconstructionWithManyBusyNodes2, because 
NameNodeMetrics is not initialized in this unit test.

For the related CI log, see 
[this|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt].

 
{code:java}
[ERROR] Tests run: 34, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 30.088 
s <<< FAILURE! - in 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager
[ERROR] 
testSkipReconstructionWithManyBusyNodes2(org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager)
  Time elapsed: 2.783 s  <<< ERROR!
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.scheduleReconstruction(BlockManager.java:2171)
at 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSkipReconstructionWithManyBusyNodes2(TestBlockManager.java:947)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) {code}
 

 






[jira] [Updated] (HDFS-16552) Fix NPE for BlockManager

2022-04-21 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16552:
---
Description: 
There is an NPE in BlockManager when running 
TestBlockManager#testSkipReconstructionWithManyBusyNodes2, because 
NameNodeMetrics is not initialized in this unit test.

For the related CI log, see 
[this|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt].
{code:java}
[ERROR] Tests run: 34, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 30.088 
s <<< FAILURE! - in 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager
[ERROR] 
testSkipReconstructionWithManyBusyNodes2(org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager)
  Time elapsed: 2.783 s  <<< ERROR!
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.scheduleReconstruction(BlockManager.java:2171)
at 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSkipReconstructionWithManyBusyNodes2(TestBlockManager.java:947)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) {code}
 

 

  was:
There is an NPE in BlockManager when running 
TestBlockManager#testSkipReconstructionWithManyBusyNodes2, because 
NameNodeMetrics is not initialized in this unit test.

For the related CI log, see 
[this|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt].

 
{code:java}
[ERROR] Tests run: 34, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 30.088 
s <<< FAILURE! - in 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager
[ERROR] 
testSkipReconstructionWithManyBusyNodes2(org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager)
  Time elapsed: 2.783 s  <<< ERROR!
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.scheduleReconstruction(BlockManager.java:2171)
at 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSkipReconstructionWithManyBusyNodes2(TestBlockManager.java:947)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 

[jira] [Updated] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash

2022-04-21 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16550:
---
Description: 
When we introduced {*}SBN Read{*}, we encountered a problem while upgrading 
the JournalNodes.

Cluster Info: 
*Active: nn0*
*Standby: nn1*

1. Rolling restart the JournalNodes. {color:#ff}(related config: 
dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}

2. The cluster runs for a while.

3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed out 
waiting 12ms for a quorum of nodes to respond”{_}.

4. Transfer nn1 to Active state.

5. {color:#ff}New Active namenode(nn1){color} also shutdown because of 
“{_}Timed out waiting 12ms for a quorum of nodes to respond” too{_}.

6. {color:#ff}The cluster crashed{color}.

 

Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
  DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
"maximum JVM memory is only %d bytes. It is recommended that you " +
"decrease the cache size or increase the heap size.",
capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
  "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *dfs.journalnode.edit-cache-size.bytes* can be set larger than the 
memory requested by the process. If {*}dfs.journalnode.edit-cache-size.bytes > 
0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
JournalNode startup. This is easily overlooked by users, but once the cluster 
has been running for a while, it is likely to crash.

 

NN log:

!image-2022-04-21-09-54-57-111.png|width=1012,height=47!

!image-2022-04-21-12-32-56-170.png|width=809,height=218!

 

IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * 
Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
{color:#ff}fast fail{color}, giving users a clear hint to update the related 
configuration.

  was:
When we introduced {*}SBN Read{*}, we encountered a problem while upgrading 
the JournalNodes.

Cluster Info: 
*Active: nn0*
*Standby: nn1*

1. Rolling restart the JournalNodes. {color:#ff}(related config: 
dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}

2. The cluster runs for a while.

3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed out 
waiting 12ms for a quorum of nodes to respond”{_}.

4. Transfer nn1 to Active state.

5. {color:#ff}New Active namenode(nn1){color} also shutdown because of 
Timed out waiting 12ms for a quorum of nodes to respond.

6. {color:#ff}The cluster crashed{color}.

 

Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
  DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
"maximum JVM memory is only %d bytes. It is recommended that you " +
"decrease the cache size or increase the heap size.",
capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
  "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *dfs.journalnode.edit-cache-size.bytes* can be set larger than the 
memory requested by the process. If {*}dfs.journalnode.edit-cache-size.bytes > 
0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
JournalNode startup. This is easily overlooked by users, but once the cluster 
has been running for a while, it is likely to crash.

 

NN log:

!image-2022-04-21-09-54-57-111.png|width=1012,height=47!

!image-2022-04-21-12-32-56-170.png|width=809,height=218!

 

IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * 
Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
{color:#ff}fast fail{color}, giving users a clear hint to update the related 
configuration.


> [SBN read] Improper cache-size for journal node may cause cluster crash
> ---
>
> Key: HDFS-16550

[jira] [Updated] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash

2022-04-20 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16550:
---
 Attachment: image-2022-04-21-12-32-56-170.png
Description: 
When we introduced {*}SBN Read{*}, we encountered a problem while upgrading 
the JournalNodes.

Cluster Info: 
*Active: nn0*
*Standby: nn1*

1. Rolling restart the JournalNodes. {color:#ff}(related config: 
dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}

2. The cluster runs for a while.

3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed out 
waiting 12ms for a quorum of nodes to respond”{_}.

4. Transfer nn1 to Active state.

5. {color:#ff}New Active namenode(nn1){color} also shutdown because of 
Timed out waiting 12ms for a quorum of nodes to respond.

6. {color:#ff}The cluster crashed{color}.

 

Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
  DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
"maximum JVM memory is only %d bytes. It is recommended that you " +
"decrease the cache size or increase the heap size.",
capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
  "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *dfs.journalnode.edit-cache-size.bytes* can be set larger than the 
memory requested by the process. If {*}dfs.journalnode.edit-cache-size.bytes > 
0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
JournalNode startup. This is easily overlooked by users, but once the cluster 
has been running for a while, it is likely to crash.

!image-2022-04-21-09-54-57-111.png|width=1012,height=47!

!image-2022-04-21-12-32-56-170.png|width=809,height=218!

 

IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * 
Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
{color:#ff}fast fail{color}, giving users a clear hint to update the related 
configuration.

  was:
When we introduced {*}SBN Read{*}, we encountered a problem while upgrading 
the JournalNodes.

Cluster Info: 
*Active: nn0*
*Standby: nn1*

1. Rolling restart the JournalNodes. {color:#ff}(related config: 
dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}

2. The cluster runs for a while.

3. {color:#ff}Active namenode(nn0){color} shutdown because of Timed out 
waiting 12ms for a quorum of nodes to respond.

4. Transfer nn1 to Active state.

5. {color:#ff}New Active namenode(nn1){color} also shutdown because of 
Timed out waiting 12ms for a quorum of nodes to respond.

6. {color:#ff}The cluster crashed{color}.

 

Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
      DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
    Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
        "maximum JVM memory is only %d bytes. It is recommended that you " +
        "decrease the cache size or increase the heap size.",
        capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
      "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *fs.journalnode.edit-cache-size.bytes* can be set larger than the 
memory available to the process. If {*}fs.journalnode.edit-cache-size.bytes > 
0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
JournalNode startup, which is easy for users to overlook. However, once the 
cluster has been running for a while, this misconfiguration is likely to crash 
the cluster.

!image-2022-04-21-09-54-57-111.png|width=1227,height=57!

IMO, when {*}fs.journalnode.edit-cache-size.bytes > threshold * 
Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
{color:#ff}fail fast{color}, giving users a clear hint to update the related 
configuration.


> [SBN read] Improper cache-size for journal node may cause cluster crash
> ---
>
> Key: HDFS-16550
> URL: 

[jira] [Updated] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash

2022-04-20 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16550:
---
Description: 
When we introduced {*}SBN Read{*}, we encountered a problem while upgrading 
the JournalNodes.

Cluster Info: 
*Active: nn0*
*Standby: nn1*

1. Rolling restart of the JournalNodes. {color:#ff}(related config: 
fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}

2. The cluster runs for a while.

3. The {color:#ff}active namenode (nn0){color} shut down because of “Timed out 
waiting 12ms for a quorum of nodes to respond”.

4. Transfer nn1 to the Active state.

5. The {color:#ff}new active namenode (nn1){color} also shut down because of 
“Timed out waiting 12ms for a quorum of nodes to respond”.

6. {color:#ff}The cluster crashed{color}.

 

Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
      DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
    Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
        "maximum JVM memory is only %d bytes. It is recommended that you " +
        "decrease the cache size or increase the heap size.",
        capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
      "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *fs.journalnode.edit-cache-size.bytes* can be set larger than the 
memory available to the process. If {*}fs.journalnode.edit-cache-size.bytes > 
0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
JournalNode startup, which is easy for users to overlook. However, once the 
cluster has been running for a while, this misconfiguration is likely to crash 
the cluster.

 

NN log:

!image-2022-04-21-09-54-57-111.png|width=1012,height=47!

!image-2022-04-21-12-32-56-170.png|width=809,height=218!

 

IMO, when {*}fs.journalnode.edit-cache-size.bytes > threshold * 
Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
{color:#ff}fail fast{color}, giving users a clear hint to update the related 
configuration.

  was:
When we introduced {*}SBN Read{*}, we encountered a problem while upgrading 
the JournalNodes.

Cluster Info: 
*Active: nn0*
*Standby: nn1*

1. Rolling restart of the JournalNodes. {color:#ff}(related config: 
fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}

2. The cluster runs for a while.

3. The {color:#ff}active namenode (nn0){color} shut down because of “Timed out 
waiting 12ms for a quorum of nodes to respond”.

4. Transfer nn1 to the Active state.

5. The {color:#ff}new active namenode (nn1){color} also shut down because of 
“Timed out waiting 12ms for a quorum of nodes to respond”.

6. {color:#ff}The cluster crashed{color}.

 

Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
      DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
    Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
        "maximum JVM memory is only %d bytes. It is recommended that you " +
        "decrease the cache size or increase the heap size.",
        capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
      "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *fs.journalnode.edit-cache-size.bytes* can be set larger than the 
memory available to the process. If {*}fs.journalnode.edit-cache-size.bytes > 
0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
JournalNode startup, which is easy for users to overlook. However, once the 
cluster has been running for a while, this misconfiguration is likely to crash 
the cluster.

!image-2022-04-21-09-54-57-111.png|width=1012,height=47!

!image-2022-04-21-12-32-56-170.png|width=809,height=218!

 

IMO, when {*}fs.journalnode.edit-cache-size.bytes > threshold * 
Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
{color:#ff}fail fast{color}, giving users a clear hint to update the related 
configuration.


> [SBN read] Improper cache-size for journal node may cause cluster crash
> ---
>
> Key: HDFS-16550
> URL: 

[jira] [Updated] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash

2022-04-20 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16550:
---
Description: 
When we introduced {*}SBN Read{*}, we encountered a problem while upgrading 
the JournalNodes.

Cluster Info: 
*Active: nn0*
*Standby: nn1*

1. Rolling restart of the JournalNodes. {color:#ff}(related config: 
fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}

2. The cluster runs for a while.

3. The {color:#ff}active namenode (nn0){color} shut down because of “Timed out 
waiting 12ms for a quorum of nodes to respond”.

4. Transfer nn1 to the Active state.

5. The {color:#ff}new active namenode (nn1){color} also shut down because of 
“Timed out waiting 12ms for a quorum of nodes to respond”.

6. {color:#ff}The cluster crashed{color}.

 

Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
      DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
    Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
        "maximum JVM memory is only %d bytes. It is recommended that you " +
        "decrease the cache size or increase the heap size.",
        capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
      "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *fs.journalnode.edit-cache-size.bytes* can be set larger than the 
memory available to the process. If {*}fs.journalnode.edit-cache-size.bytes > 
0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
JournalNode startup, which is easy for users to overlook. However, once the 
cluster has been running for a while, this misconfiguration is likely to crash 
the cluster.

!image-2022-04-21-09-54-57-111.png|width=1227,height=57!

IMO, when {*}fs.journalnode.edit-cache-size.bytes > threshold * 
Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
{color:#ff}fail fast{color}, giving users a clear hint to update the related 
configuration.

  was:
When we introduced SBN Read, we encountered a problem when upgrading the 
JournalNodes.

Cluster Info: 
*Active: nn0*
*Standby: nn1*

1. Rolling restart of the JournalNodes. {color:#FF}(related config: 
fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}

2. The cluster runs for a while.

3. The {color:#FF}active namenode (nn0){color} shut down because of “Timed out 
waiting 12ms for a quorum of nodes to respond”.

4. Transfer nn1 to the Active state.

5. The {color:#FF}new active namenode (nn1){color} also shut down because of 
“Timed out waiting 12ms for a quorum of nodes to respond”.

6. {color:#FF}The cluster crashed{color}.

 

Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
      DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
    Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
        "maximum JVM memory is only %d bytes. It is recommended that you " +
        "decrease the cache size or increase the heap size.",
        capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
      "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *fs.journalnode.edit-cache-size.bytes* can be set larger than the 
memory available to the process. If {*}fs.journalnode.edit-cache-size.bytes > 
0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
JournalNode startup, which is easy for users to overlook. However, once the 
cluster has been running for a while, this misconfiguration is likely to crash 
the cluster.

!image-2022-04-21-09-54-57-111.png|width=1227,height=57!

IMO, when {*}fs.journalnode.edit-cache-size.bytes > threshold * 
Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
{color:#FF}fail fast{color}, giving users a clear hint to update the related 
configuration.


> [SBN read] Improper cache-size for journal node may cause cluster crash
> ---
>
> Key: HDFS-16550
> URL: https://issues.apache.org/jira/browse/HDFS-16550
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>

[jira] [Created] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash

2022-04-20 Thread tomscut (Jira)
tomscut created HDFS-16550:
--

 Summary: [SBN read] Improper cache-size for journal node may cause 
cluster crash
 Key: HDFS-16550
 URL: https://issues.apache.org/jira/browse/HDFS-16550
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: tomscut
Assignee: tomscut
 Attachments: image-2022-04-21-09-54-29-751.png, 
image-2022-04-21-09-54-57-111.png

When we introduced SBN Read, we encountered a problem when upgrading the 
JournalNodes.

Cluster Info: 
*Active: nn0*
*Standby: nn1*

1. Rolling restart of the JournalNodes. {color:#FF}(related config: 
fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}

2. The cluster runs for a while.

3. The {color:#FF}active namenode (nn0){color} shut down because of “Timed out 
waiting 12ms for a quorum of nodes to respond”.

4. Transfer nn1 to the Active state.

5. The {color:#FF}new active namenode (nn1){color} also shut down because of 
“Timed out waiting 12ms for a quorum of nodes to respond”.

6. {color:#FF}The cluster crashed{color}.

 

Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
      DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
    Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
        "maximum JVM memory is only %d bytes. It is recommended that you " +
        "decrease the cache size or increase the heap size.",
        capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
      "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *fs.journalnode.edit-cache-size.bytes* can be set larger than the 
memory available to the process. If {*}fs.journalnode.edit-cache-size.bytes > 
0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
JournalNode startup, which is easy for users to overlook. However, once the 
cluster has been running for a while, this misconfiguration is likely to crash 
the cluster.

!image-2022-04-21-09-54-57-111.png|width=1227,height=57!

IMO, when {*}fs.journalnode.edit-cache-size.bytes > threshold * 
Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
{color:#FF}fail fast{color}, giving users a clear hint to update the related 
configuration.






[jira] [Updated] (HDFS-16548) Failed unit test testRenameMoreThanOnceAcrossSnapDirs_2

2022-04-20 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16548:
---
Description: 
It seems to be related to HDFS-16531.
{code:java}
[ERROR] Tests run: 44, Failures: 6, Errors: 0, Skipped: 0, Time elapsed: 
143.701 s <<< FAILURE! - in 
org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots
[ERROR] 
testRenameMoreThanOnceAcrossSnapDirs_2(org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots)
  Time elapsed: 6.606 s  <<< FAILURE!
java.lang.AssertionError: expected:<3> but was:<1>
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at 
org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots.testRenameMoreThanOnceAcrossSnapDirs_2(TestRenameWithSnapshots.java:985)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
 {code}

  was:
It seems to be related to this HDFS-16531.
{code:java}
[ERROR] Tests run: 44, Failures: 6, Errors: 0, Skipped: 0, Time elapsed: 
143.701 s <<< FAILURE! - in 
org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots
[ERROR] 
testRenameMoreThanOnceAcrossSnapDirs_2(org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots)
  Time elapsed: 6.606 s  <<< FAILURE!
java.lang.AssertionError: expected:<3> but was:<1>
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at 
org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots.testRenameMoreThanOnceAcrossSnapDirs_2(TestRenameWithSnapshots.java:985)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 

[jira] [Updated] (HDFS-16548) Failed unit test testRenameMoreThanOnceAcrossSnapDirs_2

2022-04-20 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16548:
---
Description: 
It seems to be related to this HDFS-16531.
{code:java}
[ERROR] Tests run: 44, Failures: 6, Errors: 0, Skipped: 0, Time elapsed: 
143.701 s <<< FAILURE! - in 
org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots
[ERROR] 
testRenameMoreThanOnceAcrossSnapDirs_2(org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots)
  Time elapsed: 6.606 s  <<< FAILURE!
java.lang.AssertionError: expected:<3> but was:<1>
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at 
org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots.testRenameMoreThanOnceAcrossSnapDirs_2(TestRenameWithSnapshots.java:985)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
 {code}

  was:
 
{code:java}
[ERROR] Tests run: 44, Failures: 6, Errors: 0, Skipped: 0, Time elapsed: 
143.701 s <<< FAILURE! - in 
org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots
[ERROR] 
testRenameMoreThanOnceAcrossSnapDirs_2(org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots)
  Time elapsed: 6.606 s  <<< FAILURE!
java.lang.AssertionError: expected:<3> but was:<1>
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at 
org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots.testRenameMoreThanOnceAcrossSnapDirs_2(TestRenameWithSnapshots.java:985)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 

[jira] [Created] (HDFS-16548) Failed unit test testRenameMoreThanOnceAcrossSnapDirs_2

2022-04-20 Thread tomscut (Jira)
tomscut created HDFS-16548:
--

 Summary: Failed unit test testRenameMoreThanOnceAcrossSnapDirs_2
 Key: HDFS-16548
 URL: https://issues.apache.org/jira/browse/HDFS-16548
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: tomscut


 
{code:java}
[ERROR] Tests run: 44, Failures: 6, Errors: 0, Skipped: 0, Time elapsed: 
143.701 s <<< FAILURE! - in 
org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots
[ERROR] 
testRenameMoreThanOnceAcrossSnapDirs_2(org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots)
  Time elapsed: 6.606 s  <<< FAILURE!
java.lang.AssertionError: expected:<3> but was:<1>
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at 
org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots.testRenameMoreThanOnceAcrossSnapDirs_2(TestRenameWithSnapshots.java:985)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
 {code}






[jira] [Updated] (HDFS-16547) [SBN read] Namenode in safe mode should not be transfered to observer state

2022-04-19 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16547:
---
Summary: [SBN read] Namenode in safe mode should not be transfered to 
observer state  (was: [SBN read] Namenode in safe mode should not be transfer 
to observer state)

> [SBN read] Namenode in safe mode should not be transfered to observer state
> ---
>
> Key: HDFS-16547
> URL: https://issues.apache.org/jira/browse/HDFS-16547
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, when a Namenode is in safe mode (during startup, or after entering 
> safe mode manually), we can transfer this Namenode to the Observer state by 
> command. This Observer node may then receive many requests and throw a 
> SafemodeException, which causes unnecessary failover on the client.
> So a Namenode in safe mode should not be transferred to the observer state.






[jira] [Created] (HDFS-16547) [SBN read] Namenode in safe mode should not be transfer to observer state

2022-04-19 Thread tomscut (Jira)
tomscut created HDFS-16547:
--

 Summary: [SBN read] Namenode in safe mode should not be transfer 
to observer state
 Key: HDFS-16547
 URL: https://issues.apache.org/jira/browse/HDFS-16547
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: tomscut
Assignee: tomscut


Currently, when a Namenode is in safe mode (during startup, or after entering 
safe mode manually), we can transfer this Namenode to the Observer state by 
command. This Observer node may then receive many requests and throw a 
SafemodeException, which causes unnecessary failover on the client.

So a Namenode in safe mode should not be transferred to the observer state.
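A minimal sketch of the proposed guard, assuming the check is placed at the 
start of the NameNode's transition-to-observer path (the method placement and 
exception wording are illustrative):
{code:java}
// Hypothetical guard, assumed to run inside NameNode before the state change;
// "namesystem" stands for the NameNode's FSNamesystem instance, and
// ServiceFailedException is org.apache.hadoop.ha.ServiceFailedException.
void transitionToObserver() throws IOException {
  if (namesystem.isInSafeMode()) {
    // Refuse the transition so clients never hit SafemodeException on an
    // Observer and fail over needlessly.
    throw new ServiceFailedException(
        "NameNode is in safe mode; cannot transition to observer state.");
  }
  // ... existing transition logic ...
} {code}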






[jira] [Comment Edited] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress

2022-04-13 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521995#comment-17521995
 ] 

tomscut edited comment on HDFS-16507 at 4/14/22 1:10 AM:
-

Thanks [~xkrogen] for your comments.

If the situation arises that `minTxIdToKeep > curSegmentTxId`, 
`Preconditions.checkArgument` will fail and throw an IllegalArgumentException.

This will cause ImageServlet#doPut to fail, which in turn causes the SNN 
checkpoint to fail; the SNN may retry several times until the ANN rolls the 
edit log itself. But the ANN avoids purging the in-progress edit log, so it 
will not crash. We can see the stack below.

Please point out if my description is incorrect. Thank you.

The stack of purgeLogsOlderThan:
{code:java}
java.lang.Thread.getStackTrace(Thread.java:1552)
    org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
    
org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
    
org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
    
org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
    
org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
    
org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
    
org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
    
org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
    
org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
    java.security.AccessController.doPrivileged(Native Method)
    javax.security.auth.Subject.doAs(Subject.java:422)
    
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    
org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
    
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
    
org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
    
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
    org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
    org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
    
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
    
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
    
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    org.eclipse.jetty.server.Server.handle(Server.java:539)
    org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
    org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
    org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
    
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    java.lang.Thread.run(Thread.java:745) {code}


was (Author: tomscut):
Thanks [~xkrogen] for your comments.

if the situation arises that `minTxIdToKeep > curSegmentTxId`, 
`Preconditions.checkArgument` will fail and throw an IllegalArgumentException.

This will cause ImageServlet#doPut to fail, and then cause the SNN 

[jira] [Commented] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress

2022-04-13 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521995#comment-17521995
 ] 

tomscut commented on HDFS-16507:


Thanks [~xkrogen] for your comments.

If the situation arises that `minTxIdToKeep > curSegmentTxId`, 
`Preconditions.checkArgument` will fail and throw an IllegalArgumentException.

This will cause ImageServlet#doPut to fail, which in turn causes the SNN 
checkpoint to fail; the SNN may retry several times until the ANN rolls the 
edit log itself. But the ANN avoids purging the in-progress edit log, so it 
will not crash. We can see the stack below.

Please point out if my description is incorrect. Thank you.

The stack of purgeLogsOlderThan:
{code:java}
java.lang.Thread.getStackTrace(Thread.java:1552)
    org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
    
org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
    
org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
    
org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
    
org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
    
org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
    
org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
    
org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
    
org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
    java.security.AccessController.doPrivileged(Native Method)
    javax.security.auth.Subject.doAs(Subject.java:422)
    
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    
org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
    
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
    
org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
    
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
    org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
    org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
    
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
    
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
    
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    org.eclipse.jetty.server.Server.handle(Server.java:539)
    org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
    org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
    org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
    
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    java.lang.Thread.run(Thread.java:745) {code}

> [SBN read] Avoid purging edit log which is in progress
> --
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>

[jira] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress

2022-04-12 Thread tomscut (Jira)


[ https://issues.apache.org/jira/browse/HDFS-16507 ]


tomscut deleted comment on HDFS-16507:


was (Author: tomscut):
[~xkrogen] Your comment makes a lot of sense to me.

IMO, there are two ways to approach this problem:
1. Throw an IllegalArgumentException, wait for the edit log segment to be 
finalized normally, and then let FSEditLog#purgeLogsOlderThan run 
automatically. However, if the SNN is down for a long time, the edit logs may 
take up more disk space.

2. Update `minTxIdToKeep` here, like the PR I submitted at the beginning.
{code:java}
// Reset purgeLogsFrom to avoid purging edit log which is in progress.
if (isSegmentOpen()) {
  minTxIdToKeep = minTxIdToKeep > curSegmentTxId ? curSegmentTxId : minTxIdToKeep;
} {code}
What do you think of this? cc [~sunchao] [~vjasani].

> [SBN read] Avoid purging edit log which is in progress
> --
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Assignee: tomscut
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like an edit log which is in progress is being purged.
> According to the analysis, I suspect that the in-progress edit log to be 
> purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before 
> the ANN rolls the edit log itself. 
> The stack:
> {code:java}
> java.lang.Thread.getStackTrace(Thread.java:1552)
>     org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
>     
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
>     
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
>     
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
>     java.security.AccessController.doPrivileged(Native Method)
>     javax.security.auth.Subject.doAs(Subject.java:422)
>     
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>     
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>     
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>     
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     org.eclipse.jetty.server.Server.handle(Server.java:539)
>     org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>     
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>     
> 

[jira] [Comment Edited] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress

2022-04-12 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521409#comment-17521409
 ] 

tomscut edited comment on HDFS-16507 at 4/13/22 2:06 AM:
-

[~xkrogen] Your comment makes a lot of sense to me.

IMO, there are two ways to approach this problem:
1. Throw an IllegalArgumentException, wait for the edit log segment to be 
finalized normally, and then let FSEditLog#purgeLogsOlderThan run 
automatically. However, if the SNN is down for a long time, the edit logs may 
take up more disk space.

2. Update `minTxIdToKeep` here, like the PR I submitted at the beginning.
{code:java}
// Reset purgeLogsFrom to avoid purging edit log which is in progress.
if (isSegmentOpen()) {
  minTxIdToKeep = minTxIdToKeep > curSegmentTxId ? curSegmentTxId : minTxIdToKeep;
} {code}
What do you think of this? cc [~sunchao] [~vjasani].


was (Author: tomscut):
[~xkrogen] Your comment makes a lot of sense to me.

IMO, there are two ways to approach this problem:
1. Throw an IllegalArgumentException, wait for the edit log segment to be 
finalized normally, and then let FSEditLog#purgeLogsOlderThan run 
automatically. However, if the SNN is down for a long time, the edit logs may 
take up more disk space.

2. Update `minTxIdToKeep` here, like the PR I submitted at the beginning.
{code:java}
// Reset purgeLogsFrom to avoid purging edit log which is in progress.
if (isSegmentOpen()) {
  minTxIdToKeep = minTxIdToKeep > curSegmentTxId ? curSegmentTxId : minTxIdToKeep;
} {code}
What do you think of this? cc [~sunchao] [~vjasani].

> [SBN read] Avoid purging edit log which is in progress
> --
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Assignee: tomscut
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like an edit log which is in progress is being purged.
> According to the analysis, I suspect that the in-progress edit log to be 
> purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before 
> the ANN rolls the edit log itself. 
> The stack:
> {code:java}
> java.lang.Thread.getStackTrace(Thread.java:1552)
>     org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
>     
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
>     
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
>     
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
>     java.security.AccessController.doPrivileged(Native Method)
>     javax.security.auth.Subject.doAs(Subject.java:422)
>     
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>     
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     

[jira] [Commented] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress

2022-04-12 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521409#comment-17521409
 ] 

tomscut commented on HDFS-16507:


[~xkrogen] Your comment makes a lot of sense to me.

IMO, there are two ways to approach this problem:
1. Throw an IllegalArgumentException, wait for the edit log segment to be 
finalized normally, and then let FSEditLog#purgeLogsOlderThan run 
automatically. However, if the SNN is down for a long time, the edit logs may 
take up more disk space.

2. Update `minTxIdToKeep` here, like the PR I submitted at the beginning.
{code:java}
// Reset purgeLogsFrom to avoid purging edit log which is in progress.
if (isSegmentOpen()) {
  minTxIdToKeep = minTxIdToKeep > curSegmentTxId ? curSegmentTxId : minTxIdToKeep;
} {code}
What do you think of this? cc [~sunchao] [~vjasani].
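For context, a simplified sketch of how the two options relate inside 
FSEditLog#purgeLogsOlderThan (the precondition is paraphrased from this 
discussion, not copied from the source):
{code:java}
public synchronized void purgeLogsOlderThan(final long minTxIdToKeep) {
  // Option 1 keeps the strict check: purging fails with an
  // IllegalArgumentException until the in-progress segment is finalized.
  Preconditions.checkArgument(minTxIdToKeep <= curSegmentTxId,
      "cannot purge logs older than txid " + minTxIdToKeep +
      " when current segment starts at " + curSegmentTxId);

  // Option 2 would instead clamp minTxIdToKeep to curSegmentTxId while a
  // segment is open (the snippet above), so the purge proceeds but never
  // touches the in-progress segment.

  // ... perform the actual purge ...
} {code}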

> [SBN read] Avoid purging edit log which is in progress
> --
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Assignee: tomscut
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like an edit log which is in progress is being purged.
> According to the analysis, I suspect that the in-progress edit log to be 
> purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before 
> the ANN rolls the edit log itself. 
> The stack:
> {code:java}
> java.lang.Thread.getStackTrace(Thread.java:1552)
>     org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
>     
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
>     
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
>     
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
>     java.security.AccessController.doPrivileged(Native Method)
>     javax.security.auth.Subject.doAs(Subject.java:422)
>     
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>     
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>     
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>     
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     org.eclipse.jetty.server.Server.handle(Server.java:539)
>     org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>     
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>     
> 

[jira] [Comment Edited] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress

2022-04-12 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521403#comment-17521403
 ] 

tomscut edited comment on HDFS-16507 at 4/13/22 1:41 AM:
-

Hi [~xkrogen], thanks for your comments.

The process is as follows:

After the checkpoint, the SNN uploads the fsimage to the ANN. When the ANN 
receives the fsimage, it triggers FSEditLog#purgeLogsOlderThan in 
ImageServlet#doPut.

Do you mean that, if the situation arises that `minTxIdToKeep > curSegmentTxId`, 
the ANN will crash because the `Preconditions.checkArgument` fails?


was (Author: tomscut):
Hi [~xkrogen], thanks for your comments.

The process is as follows:

After the checkpoint, the SNN uploads the fsimage to the ANN. When the ANN 
receives the fsimage, it triggers FSEditLog#purgeLogsOlderThan in 
ImageServlet#doPut.

Do you mean that, if the situation arises that `minTxIdToKeep > curSegmentTxId`, 
the ANN will crash because the `Preconditions.checkArgument` fails?

> [SBN read] Avoid purging edit log which is in progress
> --
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Assignee: tomscut
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like an edit log which is in progress is being purged.
> According to the analysis, I suspect that the in-progress edit log to be 
> purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before 
> the ANN rolls the edit log itself. 
> The stack:
> {code:java}
> java.lang.Thread.getStackTrace(Thread.java:1552)
>     org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
>     
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
>     
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
>     
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
>     java.security.AccessController.doPrivileged(Native Method)
>     javax.security.auth.Subject.doAs(Subject.java:422)
>     
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>     
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>     
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>     
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     org.eclipse.jetty.server.Server.handle(Server.java:539)
>     org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>     
> 

[jira] [Comment Edited] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress

2022-04-12 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521403#comment-17521403
 ] 

tomscut edited comment on HDFS-16507 at 4/13/22 1:41 AM:
-

Hi [~xkrogen] , thanks for your comments.

The process is as follows:

After checkpoint, SNN passes the fsimage to ANN. When ANN receives the fsimage, it 
triggers FSEditLog#purgeLogsOlderThan in ImageServlet#doPut.

Do you mean that if the situation `minTxIdToKeep > curSegmentTxId` arises, 
ANN will crash because of a `Preconditions.checkArgument` failure?


was (Author: tomscut):
Hi [~xkrogen] , thanks for your comments.

The process is as follows:

After checkpoint, SNN passes the fsimage to ANN. When ANN receives the fsimage, it 
triggers FSEditLog#purgeLogsOlderThan in ImageServlet#doPut.

Do you mean that if the situation `minTxIdToKeep > curSegmentTxId` arises, 
ANN will crash because of a `Preconditions.checkArgument` failure?

> [SBN read] Avoid purging edit log which is in progress
> --
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Assignee: tomscut
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like it is purging edit logs which are still in progress.
> According to the analysis, I suspect that the in-progress edit log to be 
> purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before ANN 
> rolls its own edits. 
> The stack:
> {code:java}
> java.lang.Thread.getStackTrace(Thread.java:1552)
>     org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
>     
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
>     
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
>     
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
>     java.security.AccessController.doPrivileged(Native Method)
>     javax.security.auth.Subject.doAs(Subject.java:422)
>     
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>     
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>     
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>     
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     org.eclipse.jetty.server.Server.handle(Server.java:539)
>     org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>     

[jira] [Commented] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress

2022-04-12 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521403#comment-17521403
 ] 

tomscut commented on HDFS-16507:


Hi [~xkrogen] , thanks for your comments.

The process is as follows:

After checkpoint, SNN passes the fsimage to ANN. When ANN receives the fsimage, it 
fires FSEditLog#purgeLogsOlderThan in ImageServlet#doPut.

Do you mean that if the situation `minTxIdToKeep > curSegmentTxId` arises, 
ANN will crash because of a `Preconditions.checkArgument` failure?

> [SBN read] Avoid purging edit log which is in progress
> --
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Assignee: tomscut
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like it is purging edit logs which are still in progress.
> According to the analysis, I suspect that the in-progress edit log to be 
> purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before ANN 
> rolls its own edits. 
> The stack:
> {code:java}
> java.lang.Thread.getStackTrace(Thread.java:1552)
>     org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
>     
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
>     
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
>     
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
>     java.security.AccessController.doPrivileged(Native Method)
>     javax.security.auth.Subject.doAs(Subject.java:422)
>     
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>     
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>     
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>     
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     org.eclipse.jetty.server.Server.handle(Server.java:539)
>     org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>     
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>     
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>     org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>     
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>     
> 

[jira] [Created] (HDFS-16527) Add global timeout rule for TestRouterDistCpProcedure

2022-03-31 Thread tomscut (Jira)
tomscut created HDFS-16527:
--

 Summary: Add global timeout rule for TestRouterDistCpProcedure
 Key: HDFS-16527
 URL: https://issues.apache.org/jira/browse/HDFS-16527
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: tomscut
Assignee: tomscut


As [Ayush Saxena|https://github.com/ayushtkn] mentioned 
[here|https://github.com/apache/hadoop/pull/4009#pullrequestreview-925554297], 
TestRouterDistCpProcedure has failed many times because of timeouts. I will add a 
global timeout rule for it, which makes the timeout easy to set in one place.
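
A global timeout in JUnit 4 style might look like the sketch below (the 180-second value is illustrative, not the one in the patch):
{code:java}
import org.junit.Rule;
import org.junit.rules.Timeout;

public class TestRouterDistCpProcedure {
  // One rule covers every test method in the class, so the timeout
  // can be tuned in a single place instead of per @Test annotation.
  @Rule
  public Timeout globalTimeout = Timeout.seconds(180);
}
{code}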



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16513) [SBN read] Observer Namenode should not trigger the edits rolling of active Namenode

2022-03-27 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16513:
---
Summary: [SBN read] Observer Namenode should not trigger the edits rolling 
of active Namenode  (was: [SBN read] Observer Namenode does not trigger the 
edits rolling of active Namenode)

> [SBN read] Observer Namenode should not trigger the edits rolling of active 
> Namenode
> 
>
> Key: HDFS-16513
> URL: https://issues.apache.org/jira/browse/HDFS-16513
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> To avoid frequent edits rolling, we should prevent the OBN from triggering the 
> edits rolling of the active Namenode. 
> It is sufficient to retain only the triggering by SNN and the auto rolling of 
> ANN. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress

2022-03-24 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut reassigned HDFS-16507:
--

Assignee: tomscut

> [SBN read] Avoid purging edit log which is in progress
> --
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Assignee: tomscut
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like it is purging edit logs which are still in progress.
> According to the analysis, I suspect that the in-progress edit log to be 
> purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before ANN 
> rolls its own edits. 
> The stack:
> {code:java}
> java.lang.Thread.getStackTrace(Thread.java:1552)
>     org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
>     
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
>     
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
>     
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
>     java.security.AccessController.doPrivileged(Native Method)
>     javax.security.auth.Subject.doAs(Subject.java:422)
>     
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>     
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>     
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>     
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     org.eclipse.jetty.server.Server.handle(Server.java:539)
>     org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>     
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>     
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>     org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>     
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>     
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>     
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>     
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>     
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>     
> 

[jira] [Updated] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress

2022-03-24 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Description: 
We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
exception. It looks like it is purging edit logs which are still in progress.

According to the analysis, I suspect that the in-progress edit log to be 
purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before ANN 
rolls its own edits. 

The stack:
{code:java}
java.lang.Thread.getStackTrace(Thread.java:1552)
    org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
    
org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
    
org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
    
org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
    
org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
    
org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
    
org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
    
org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
    
org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
    java.security.AccessController.doPrivileged(Native Method)
    javax.security.auth.Subject.doAs(Subject.java:422)
    
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    
org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
    
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
    
org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
    
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
    org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
    org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
    
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
    
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
    
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    org.eclipse.jetty.server.Server.handle(Server.java:539)
    org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
    org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
    org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
    
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    java.lang.Thread.run(Thread.java:745) {code}
 

I post some key logs for your reference:

1. ANN. Create edit log 
{color:#ff0000}edits_InProgress_00024207987{color}.
{code:java}
2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 

[jira] [Updated] (HDFS-16446) Consider ioutils of disk when choosing volume

2022-03-24 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16446:
---
Description: 
Consider the ioutils of disks when choosing a volume, to avoid busy disks.

Document: 
[https://docs.google.com/document/d/1Ko1J7shz8hVLnNACT6PKVQ_leIHf_YaIFA2s3yJMZHQ/edit?usp=sharing]

The principle is as follows:

!https://user-images.githubusercontent.com/55134131/159827737-f4ca4d66-c2f2-4bef-901b-6d2bc7bdda9a.png|width=440,height=192!
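
A minimal sketch of the idea (illustrative names only, not the proposed patch): among the candidate volumes, prefer the one whose recent disk busy ratio (e.g. iostat %util) is lowest.
{code:java}
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class LeastBusyVolumeChooser {
  // The ioUtil map is assumed to be fed by a periodic disk-stats sampler
  // (e.g. parsed from iostat %util); volumes is the candidate set.
  public static <V> V pickLeastBusy(List<V> volumes, Map<V, Double> ioUtil) {
    return Collections.min(volumes,
        Comparator.comparingDouble(v -> ioUtil.getOrDefault(v, 0.0)));
  }
}
{code}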

  was:
Consider the ioutils of disks when choosing a volume.

The principle is as follows:

!image-2022-02-05-09-50-12-241.png|width=309,height=159!


> Consider ioutils of disk when choosing volume
> -
>
> Key: HDFS-16446
> URL: https://issues.apache.org/jira/browse/HDFS-16446
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-02-05-09-50-12-241.png
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Consider the ioutils of disks when choosing a volume, to avoid busy disks.
> Document: 
> [https://docs.google.com/document/d/1Ko1J7shz8hVLnNACT6PKVQ_leIHf_YaIFA2s3yJMZHQ/edit?usp=sharing]
> The principle is as follows:
> !https://user-images.githubusercontent.com/55134131/159827737-f4ca4d66-c2f2-4bef-901b-6d2bc7bdda9a.png|width=440,height=192!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16446) Consider ioutils of disk when choosing volume

2022-03-24 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16446:
---
Description: 
Consider the ioutils of disks when choosing a volume, to avoid busy disks.

Document: 
[https://docs.google.com/document/d/1Ko1J7shz8hVLnNACT6PKVQ_leIHf_YaIFA2s3yJMZHQ/edit?usp=sharing]

The principle is as follows:

!https://user-images.githubusercontent.com/55134131/159827737-f4ca4d66-c2f2-4bef-901b-6d2bc7bdda9a.png|width=440,height=192!

  was:
Consider the ioutils of disks when choosing a volume, to avoid busy disks.

Document: 
[https://docs.google.com/document/d/1Ko1J7shz8hVLnNACT6PKVQ_leIHf_YaIFA2s3yJMZHQ/edit?usp=sharing]

The principle is as follows:

!https://user-images.githubusercontent.com/55134131/159827737-f4ca4d66-c2f2-4bef-901b-6d2bc7bdda9a.png|width=440,height=192!


> Consider ioutils of disk when choosing volume
> -
>
> Key: HDFS-16446
> URL: https://issues.apache.org/jira/browse/HDFS-16446
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-02-05-09-50-12-241.png
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Consider the ioutils of disks when choosing a volume, to avoid busy disks.
> Document: 
> [https://docs.google.com/document/d/1Ko1J7shz8hVLnNACT6PKVQ_leIHf_YaIFA2s3yJMZHQ/edit?usp=sharing]
> The principle is as follows:
> !https://user-images.githubusercontent.com/55134131/159827737-f4ca4d66-c2f2-4bef-901b-6d2bc7bdda9a.png|width=440,height=192!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13671) Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet

2022-03-23 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511575#comment-17511575
 ] 

tomscut commented on HDFS-13671:


Hi [~max2049] , we are still using CMS on a cluster without EC data; some 
parameter adjustments should be able to solve this problem.

And how long is your FBR period? If it is 6 hours (the default) and the cluster 
size is large, it may have an impact on GC. We set this to 3 days.

We use G1GC on a cluster with this feature that uses EC data. The main 
parameters (OpenJDK 1.8) are as follows:
{code:java}
-server -Xmx200g -Xms200g 
-XX:MaxDirectMemorySize=2g 
-XX:MaxMetaspaceSize=2g 
-XX:MetaspaceSize=1g 
-XX:+UseG1GC -XX:+UnlockExperimentalVMOptions 
-XX:InitiatingHeapOccupancyPercent=75 
-XX:G1NewSizePercent=0 -XX:G1MaxNewSizePercent=3 
-XX:SurvivorRatio=2 -XX:+DisableExplicitGC -XX:MaxTenuringThreshold=15 
-XX:-UseBiasedLocking -XX:ParallelGCThreads=40 -XX:ConcGCThreads=20 
-XX:MaxJavaStackTraceDepth=100 -XX:MaxGCPauseMillis=200 
-verbose:gc -XX:+UnlockDiagnosticVMOptions -XX:+PrintGCDetails 
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCCause -XX:+PrintGCDateStamps 
-XX:+PrintReferenceGC -XX:+PrintHeapAtGC -XX:+PrintAdaptiveSizePolicy 
-XX:+G1PrintHeapRegions -XX:+PrintTenuringDistribution 
-Xloggc:/data1/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` {code}
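
For reference, the FBR period mentioned above is controlled by dfs.blockreport.intervalMsec. Setting it to 3 days programmatically would look like this sketch (the config key is the standard HDFS one; the surrounding class is illustrative):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class FbrIntervalExample {
  public static Configuration threeDayFbrConf() {
    Configuration conf = new HdfsConfiguration();
    // Default is 6 hours (21600000 ms); raised to 3 days on large clusters.
    conf.setLong(DFSConfigKeys.DFS_BLOCKREPORT_INTERVAL_MSEC_KEY,
        3L * 24 * 60 * 60 * 1000);
    return conf;
  }
}
{code}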
 

> Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet
> --
>
> Key: HDFS-13671
> URL: https://issues.apache.org/jira/browse/HDFS-13671
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.0.3
>Reporter: Yiqun Lin
>Assignee: Haibin Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.3.2
>
> Attachments: HDFS-13671-001.patch, image-2021-06-10-19-28-18-373.png, 
> image-2021-06-10-19-28-58-359.png, image-2021-06-18-15-46-46-052.png, 
> image-2021-06-18-15-47-04-037.png
>
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> NameNode hung when deleting large files/blocks. The stack info:
> {code}
> "IPC Server handler 4 on 8020" #87 daemon prio=5 os_prio=0 
> tid=0x7fb505b27800 nid=0x94c3 runnable [0x7fa861361000]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.hadoop.hdfs.util.FoldedTreeSet.compare(FoldedTreeSet.java:474)
>   at 
> org.apache.hadoop.hdfs.util.FoldedTreeSet.removeAndGet(FoldedTreeSet.java:849)
>   at 
> org.apache.hadoop.hdfs.util.FoldedTreeSet.remove(FoldedTreeSet.java:911)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.removeBlock(DatanodeStorageInfo.java:252)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:194)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:108)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlockFromMap(BlockManager.java:3813)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlock(BlockManager.java:3617)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.removeBlocks(FSNamesystem.java:4270)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:4244)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:4180)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:4164)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:871)
>   at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.delete(AuthorizationProviderProxyClientProtocol.java:311)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:625)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> {code}
> In the current deletion logic in NameNode, there are mainly two steps:
> * Collect INodes and all blocks to be deleted, then delete INodes.
> * Remove blocks chunk by chunk in a loop.
> Actually the first step should be the more expensive operation and take 
> more time. However, we now always see NN hang during the remove-block 
> operation. 
> Looking into this, we introduced a new structure {{FoldedTreeSet}} for 
> better performance in dealing with FBRs/IBRs. But compared with the early 
> implementation of the remove-block logic, {{FoldedTreeSet}} seems slower 
> since it takes additional time to balance the tree 

[jira] [Created] (HDFS-16513) [SBN read] Observer Namenode does not trigger the edits rolling of active Namenode

2022-03-20 Thread tomscut (Jira)
tomscut created HDFS-16513:
--

 Summary: [SBN read] Observer Namenode does not trigger the edits 
rolling of active Namenode
 Key: HDFS-16513
 URL: https://issues.apache.org/jira/browse/HDFS-16513
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: tomscut
Assignee: tomscut


To avoid frequent edits rolling, we should prevent the OBN from triggering the 
edits rolling of the active Namenode. 

It is sufficient to retain only the triggering by SNN and the auto rolling of 
ANN. 
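
A minimal sketch of the idea (illustrative names, not the exact patch): in the edit log tailer loop, only a true Standby asks the active NameNode to roll its edits, while an Observer just keeps tailing.
{code:java}
// Hedged sketch: guard the roll trigger on the local HA state.
if (tooLongSinceLastLoad()
    && getServiceState() == HAServiceState.STANDBY) {
  triggerActiveLogRollAsync();  // skipped on an Observer NameNode
}
{code}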



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress

2022-03-20 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Summary: [SBN read] Avoid purging edit log which is in progress  (was: 
Avoid purging edit log which is in progress)

> [SBN read] Avoid purging edit log which is in progress
> --
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like it is purging edit logs which are still in progress.
> According to the analysis, I suspect that the in-progress edit log to be 
> purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before ANN 
> rolls its own edits. 
> The stack:
> {code:java}
> java.lang.Thread.getStackTrace(Thread.java:1552)
>     org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
>     
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
>     
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
>     
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
>     java.security.AccessController.doPrivileged(Native Method)
>     javax.security.auth.Subject.doAs(Subject.java:422)
>     
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>     
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>     
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>     
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     org.eclipse.jetty.server.Server.handle(Server.java:539)
>     org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>     
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>     
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>     org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>     
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>     
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>     
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>     
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>     
> 

[jira] [Commented] (HDFS-8277) Safemode enter fails when Standby NameNode is down

2022-03-17 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-8277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508550#comment-17508550
 ] 

tomscut commented on HDFS-8277:
---

We have the same problem, see HDFS-16505.

Because DFSAdmin processes the namenodes sequentially, if we use Standby Read 
the configuration order is:
nn1 -> OBN1
nn2 -> OBN2
nn3 -> OBN3
nn4 -> ANN
nn5 -> SNN

So let's say that when we *enter* Safemode all the namenodes are normal, so all 
five nodes enter Safemode.

Then OBN1 goes down; when we execute *leave* Safemode, we can't exit Safemode 
normally.

We have to run this command against each node one by one: hdfs dfsadmin -fs 
hdfs://: -safemode leave.
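
The root of the behavior is DFSAdmin's sequential proxy loop (quoted in HDFS-16508 below). A more tolerant loop might look like this sketch (illustrative, not the shipped code; `proxies` and `action` as in DFSAdmin#setSafeMode):
{code:java}
for (ProxyAndInfo<ClientProtocol> proxy : proxies) {
  try {
    boolean inSafeMode = proxy.getProxy().setSafeMode(action, false);
    System.out.println("Safe mode is " + (inSafeMode ? "ON" : "OFF")
        + " in " + proxy.getAddress());
  } catch (IOException e) {
    // Skip an unreachable NameNode instead of failing the whole command.
    System.err.println("Skipping " + proxy.getAddress() + ": " + e);
  }
}
{code}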

> Safemode enter fails when Standby NameNode is down
> --
>
> Key: HDFS-8277
> URL: https://issues.apache.org/jira/browse/HDFS-8277
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Affects Versions: 2.6.0
> Environment: HDP 2.2.0
>Reporter: Hari Sekhon
>Assignee: Jianfei Jiang
>Priority: Major
> Attachments: HDFS-8277-safemode-edits.patch, HDFS-8277.patch, 
> HDFS-8277_1.patch, HDFS-8277_2.patch, HDFS-8277_3.patch, HDFS-8277_4.patch, 
> HDFS-8277_5.patch
>
>
> HDFS fails to enter safemode when the Standby NameNode is down (eg. due to 
> AMBARI-10536).
> {code}hdfs dfsadmin -safemode enter
> safemode: Call From nn2/x.x.x.x to nn1:8020 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused{code}
> This appears to be a bug in that it's not trying both NameNodes like the 
> standard hdfs client code does, and is instead stopping after getting a 
> connection refused from nn1 which is down. I verified normal hadoop fs writes 
> and reads via cli did work at this time, using nn2. I happened to run this 
> command as the hdfs user on nn2 which was the surviving Active NameNode.
> After I re-bootstrapped the Standby NN to fix it the command worked as 
> expected again.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16508) When the nn1 fails at very beginning, admin command that waits exist safe mode fails

2022-03-17 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508488#comment-17508488
 ] 

tomscut commented on HDFS-16508:


Hi [~willtoshare] , please see HDFS-15509, HDFS-8277 and HDFS-16505. It seems 
to be the same kind of problem. 

> When the nn1 fails at very beginning, admin command that waits exist safe 
> mode fails
> 
>
> Key: HDFS-16508
> URL: https://issues.apache.org/jira/browse/HDFS-16508
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 3.3.1
>Reporter: May
>Priority: Major
>
> The HA is enabled, and we have two NameNodes: nn1 and nn2.
> When starting the cluster, nn1 fails at the very beginning, and nn2 
> transitions to the active state. The cluster can provide services normally.
> However, when we try to get the safe mode status or wait to exit safe mode, 
> our dfsadmin command fails due to an IOException: cannot connect to nn1.
> The *root cause* seems locate in here:
> {code:java}
> //DFSAdmin.class
> public void setSafeMode(String[] argv, int idx) throws IOException {
> …
> if (isHaEnabled) {
>       String nsId = dfsUri.getHost();
>       List<ProxyAndInfo<ClientProtocol>> proxies =
>           HAUtil.getProxiesForAllNameNodesInNameservice(
>           dfsConf, nsId, ClientProtocol.class);
>       for (ProxyAndInfo<ClientProtocol> proxy : proxies) {
>         ClientProtocol haNn = proxy.getProxy();
> //The code always queries from the first nn, i.e., nn1, and returns 
> with IOException when nn1 fails.
>         boolean inSafeMode = haNn.setSafeMode(action, false);
>         if (waitExitSafe) {
>           inSafeMode = waitExitSafeMode(haNn, inSafeMode);
>         }
>         System.out.println("Safe mode is " + (inSafeMode ? "ON" : "OFF")
>             + " in " + proxy.getAddress());
>       }
>     } 
> …
> }
> {code}
> Actually, I'm curious: do we need to get/wait on every namenode here when HA 
> is enabled?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16507) Avoid purging edit log which is in progress

2022-03-17 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508484#comment-17508484
 ] 

tomscut commented on HDFS-16507:


Hi [~xkrogen] [~ekanth] [~chaosun] .

This issue HDFS-14317 does a good job of avoiding this problem.

However, if SNN's roll-edits operation is accidentally disabled by configuration, 
and ANN's automatic roll period is very long, then an edit log which is in 
progress may also be purged.

I think we should reset *minTxIdToKeep* to strictly ensure that the in-progress 
edit log is not purged, and wait for ANN's automatic roll to finalize the 
edit log.

Please help review whether this is reasonable. Thanks a lot.
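
A minimal sketch of that reset (illustrative, assuming the field names used in this thread):
{code:java}
// Never purge past the start of the segment that is still in progress;
// ANN's own automatic roll will finalize it later.
long safeMinTxIdToKeep = Math.min(minTxIdToKeep, curSegmentTxId);
journalSet.purgeLogsOlderThan(safeMinTxIdToKeep);
{code}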

> Avoid purging edit log which is in progress
> ---
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like it is purging edit logs which are still in progress.
> According to the analysis, I suspect that the in-progress edit log to be 
> purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before ANN 
> rolls its own edits. 
> The stack:
> {code:java}
> java.lang.Thread.getStackTrace(Thread.java:1552)
>     org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
>     
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
>     
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
>     
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
>     java.security.AccessController.doPrivileged(Native Method)
>     javax.security.auth.Subject.doAs(Subject.java:422)
>     
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>     
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>     
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>     
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     org.eclipse.jetty.server.Server.handle(Server.java:539)
>     org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>     
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>     
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>     org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>     
> 

[jira] [Updated] (HDFS-16507) Avoid purging edit log which is in progress

2022-03-17 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Description: 
We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
exception. It looks like it is purging edit logs which are still in progress.

According to the analysis, I suspect that the in-progress edit log to be 
purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before ANN 
rolls its own edits. 

The stack:
{code:java}
java.lang.Thread.getStackTrace(Thread.java:1552)
    org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
    
org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
    
org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
    
org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
    
org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
    
org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
    
org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
    
org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
    
org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
    java.security.AccessController.doPrivileged(Native Method)
    javax.security.auth.Subject.doAs(Subject.java:422)
    
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    
org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
    
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
    
org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
    
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
    org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
    org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
    
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
    
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
    
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    org.eclipse.jetty.server.Server.handle(Server.java:539)
    org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
    org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
    org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
    
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    java.lang.Thread.run(Thread.java:745) {code}
 

I post some key logs for your reference:

1. ANN. Create edit log 
{color:#ff0000}edits_InProgress_00024207987{color}.
{code:java}
2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 

[jira] [Updated] (HDFS-16507) Avoid purging edit log which is in progress

2022-03-17 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Description: 
We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
exception. It looks like it is purging edit logs which are still in progress.

According to the analysis, I suspect that the in-progress edit log to be 
purged is not finalized (see HDFS-14317) before ANN rolls its own edits. 

I post some key logs for your reference:

1. ANN. Create edit log 
{color:#ff0000}edits_InProgress_00024207987{color}.
{code:java}
2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
{code}
2. SNN. Checkpoint.

The oldest image file is: fsimage_000{color:#de350b}25892513{color}.

{color:#ff0000}25892513 + 1 - 1000000 = 24892514{color}
{color:#ff0000}dfs.namenode.num.extra.edits.retained=1000000{color}

{color:#172b4d}Code: {color}NNStorageRetentionManager#purgeOldStorage
{code:java}
void purgeOldStorage(NameNodeFile nnf) throws IOException {
  FSImageTransactionalStorageInspector inspector =
  new FSImageTransactionalStorageInspector(EnumSet.of(nnf));
  storage.inspectStorageDirs(inspector);

  long minImageTxId = getImageTxIdToRetain(inspector);
  purgeCheckpointsOlderThan(inspector, minImageTxId);
  
{code}
{color:#910091}...{color}
{code:java}
  long minimumRequiredTxId = minImageTxId + 1;
  long purgeLogsFrom = Math.max(0, minimumRequiredTxId - numExtraEditsToRetain);
  
  ArrayList<EditLogInputStream> editLogs = new ArrayList<EditLogInputStream>();
  purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false, false);
  Collections.sort(editLogs, new Comparator<EditLogInputStream>() {
@Override
public int compare(EditLogInputStream a, EditLogInputStream b) {
  return ComparisonChain.start()
  .compare(a.getFirstTxId(), b.getFirstTxId())
  .compare(a.getLastTxId(), b.getLastTxId())
  .result();
}
  });

  // Remove from consideration any edit logs that are in fact required.
  while (editLogs.size() > 0 &&
  editLogs.get(editLogs.size() - 1).getFirstTxId() >= minimumRequiredTxId) {
editLogs.remove(editLogs.size() - 1);
  }
  
  // Next, adjust the number of transactions to retain if doing so would mean
  // keeping too many segments around.
  while (editLogs.size() > maxExtraEditsSegmentsToRetain) {
purgeLogsFrom = editLogs.get(0).getLastTxId() + 1;
editLogs.remove(0);
  }
  ...
  purgeableLogs.purgeLogsOlderThan(purgeLogsFrom);
}{code}
 
{code:java}
2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
have been 1189661 txns since the last checkpoint, which exceeds the configured 
threshold 2
2022-03-15 17:28:02,648 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
seconds
2022-03-15 17:28:02,649 INFO  namenode.FSImage 
(FSImage.java:saveNamespace(1121)) - Save namespace ...
2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(718)) - Saving image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
compression
2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(722)) - Image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
17885002 bytes saved in 0 seconds .
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 
images with txid >= 25892513
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305,
 cpktTxId=00024794305)
2022-03-15 17:28:03,188 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 24892514
2022-03-15 17:28:03,282 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,536 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to 

[jira] [Updated] (HDFS-16507) Avoid purging edit log which is in progress

2022-03-17 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Summary: Avoid purging edit log which is in progress  (was: Purged edit 
logs which is in progress)

> Avoid purging edit log which is in progress
> ---
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Priority: Critical
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like an edit log that is still in progress is being purged.
> Based on the analysis, I suspect that the in-progress edit log to be purged 
> was not finalized before the ANN rolled its own edit log.
> Here are some key logs for reference:
> 1. ANN. Create editlog, 
> {color:#ff}edits_InProgress_00024207987{color}.
> {code:java}
> 2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
> 2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
> 2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
> 2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
> {code}
> 2. SNN. Checkpoint.
> The oldest image file is: fsimage_000{color:#de350b}25892513{color}.
> {color:#ff}25892513 + 1 - 1000000 = 24892514{color}
> {color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color}
> {color:#172b4d}Code: {color}NNStorageRetentionManager#purgeOldStorage
> {code:java}
> void purgeOldStorage(NameNodeFile nnf) throws IOException {
>   FSImageTransactionalStorageInspector inspector =
>   new FSImageTransactionalStorageInspector(EnumSet.of(nnf));
>   storage.inspectStorageDirs(inspector);
>   long minImageTxId = getImageTxIdToRetain(inspector);
>   purgeCheckpointsOlderThan(inspector, minImageTxId);
>   
> {code}
> {color:#910091}...{color}
> {code:java}
>   long minimumRequiredTxId = minImageTxId + 1;
>   long purgeLogsFrom = Math.max(0, minimumRequiredTxId - 
> numExtraEditsToRetain);
>   
>   ArrayList<EditLogInputStream> editLogs = new ArrayList<EditLogInputStream>();
>   purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false, false);
>   Collections.sort(editLogs, new Comparator<EditLogInputStream>() {
> @Override
> public int compare(EditLogInputStream a, EditLogInputStream b) {
>   return ComparisonChain.start()
>   .compare(a.getFirstTxId(), b.getFirstTxId())
>   .compare(a.getLastTxId(), b.getLastTxId())
>   .result();
> }
>   });
>   // Remove from consideration any edit logs that are in fact required.
>   while (editLogs.size() > 0 &&
>   editLogs.get(editLogs.size() - 1).getFirstTxId() >= 
> minimumRequiredTxId) {
> editLogs.remove(editLogs.size() - 1);
>   }
>   
>   // Next, adjust the number of transactions to retain if doing so would mean
>   // keeping too many segments around.
>   while (editLogs.size() > maxExtraEditsSegmentsToRetain) {
> purgeLogsFrom = editLogs.get(0).getLastTxId() + 1;
> editLogs.remove(0);
>   }
>   ...
>   purgeableLogs.purgeLogsOlderThan(purgeLogsFrom);
> }{code}
>  
> {code:java}
> 2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
> (StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
> have been 1189661 txns since the last checkpoint, which exceeds the 
> configured threshold 2
> 2022-03-15 17:28:02,648 INFO  namenode.FSImage 
> (FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
> ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
> ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
> seconds
> 2022-03-15 17:28:02,649 INFO  namenode.FSImage 
> (FSImage.java:saveNamespace(1121)) - Save namespace ...
> 2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
> (FSImageFormatProtobuf.java:save(718)) - Saving image file 
> /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
> compression
> 2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
> (FSImageFormatProtobuf.java:save(722)) - Image file 
> /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
> 17885002 bytes saved in 0 seconds .
> 2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 
> 2 images with txid >= 25892513
> 2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
> FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305,

[jira] [Commented] (HDFS-16505) Setting safemode should not be interrupted by abnormal nodes

2022-03-16 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507960#comment-17507960
 ] 

tomscut commented on HDFS-16505:


Thanks [~ayushtkn] for your comments and reminding me. 

> Setting safemode should not be interrupted by abnormal nodes
> 
>
> Key: HDFS-16505
> URL: https://issues.apache.org/jira/browse/HDFS-16505
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-03-15-09-29-36-538.png, 
> image-2022-03-15-09-29-44-430.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Setting safemode should not be interrupted by abnormal nodes. 
> For example, we have four namenodes configured in the following order:
> NS1 -> active
> NS2 -> standby
> NS3 -> observer
> NS4 -> observer.
> When the {color:#FF}NS1 {color}process exits, setting the safemode state 
> on {color:#FF}NS2{color}, {color:#FF}NS3{color}, and 
> {color:#FF}NS4 {color}fails. Similarly, when the 
> {color:#FF}NS2{color} process exits, only the safemode state of 
> {color:#FF}NS1{color} can be set successfully.
>  
> When the {color:#FF}NS1{color} process exits:
> Before the change:
> !image-2022-03-15-09-29-36-538.png|width=1145,height=97!
> After the change:
> !image-2022-03-15-09-29-44-430.png|width=1104,height=119!
>  
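As an aside, a minimal sketch of the behavior the change aims for, assuming hypothetical helper names (the real change lives in the dfsadmin safemode handling): iterate over every configured NameNode, catch failures per node, and continue instead of aborting on the first unreachable one.
{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hdfs.protocol.ClientProtocol;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

// Hedged sketch only: 'proxies' stands in for the per-NameNode RPC proxies
// that dfsadmin resolves for a nameservice. Catching the exception per node
// means one dead NameNode (e.g. NS1) no longer blocks setting safemode on
// NS2/NS3/NS4.
public class SafeModeSketch {
  static void enterSafeModeOnAll(List<ClientProtocol> proxies) {
    for (ClientProtocol proxy : proxies) {
      try {
        proxy.setSafeMode(SafeModeAction.SAFEMODE_ENTER, false);
      } catch (IOException e) {
        System.err.println("Safemode call failed on one NameNode, continuing: " + e);
      }
    }
  }
}
{code}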



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16507) Purged edit logs which is in progress

2022-03-16 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507458#comment-17507458
 ] 

tomscut commented on HDFS-16507:


Seems to be related to this issue 
[HDFS-14317|https://issues.apache.org/jira/browse/HDFS-14317].

> Purged edit logs which is in progress
> -
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Priority: Critical
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like an edit log that is still in progress is being purged.
> Based on the analysis, I suspect that the in-progress edit log to be purged 
> was not finalized before the ANN rolled its own edit log.
> Here are some key logs for reference:
> 1. ANN. Create editlog, 
> {color:#ff}edits_InProgress_00024207987{color}.
> {code:java}
> 2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
> 2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
> 2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
> 2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
> {code}
> 2. SNN. Checkpoint.
> The oldest image file is: fsimage_000{color:#de350b}25892513{color}.
> {color:#ff}25892513 + 1 - 1000000 = 24892514{color}
> {color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color}
> {color:#172b4d}Code: {color}NNStorageRetentionManager#purgeOldStorage
> {code:java}
> void purgeOldStorage(NameNodeFile nnf) throws IOException {
>   FSImageTransactionalStorageInspector inspector =
>   new FSImageTransactionalStorageInspector(EnumSet.of(nnf));
>   storage.inspectStorageDirs(inspector);
>   long minImageTxId = getImageTxIdToRetain(inspector);
>   purgeCheckpointsOlderThan(inspector, minImageTxId);
>   
> {code}
> {color:#910091}...{color}
> {code:java}
>   long minimumRequiredTxId = minImageTxId + 1;
>   long purgeLogsFrom = Math.max(0, minimumRequiredTxId - 
> numExtraEditsToRetain);
>   
>   ArrayList<EditLogInputStream> editLogs = new ArrayList<EditLogInputStream>();
>   purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false, false);
>   Collections.sort(editLogs, new Comparator<EditLogInputStream>() {
> @Override
> public int compare(EditLogInputStream a, EditLogInputStream b) {
>   return ComparisonChain.start()
>   .compare(a.getFirstTxId(), b.getFirstTxId())
>   .compare(a.getLastTxId(), b.getLastTxId())
>   .result();
> }
>   });
>   // Remove from consideration any edit logs that are in fact required.
>   while (editLogs.size() > 0 &&
>   editLogs.get(editLogs.size() - 1).getFirstTxId() >= 
> minimumRequiredTxId) {
> editLogs.remove(editLogs.size() - 1);
>   }
>   
>   // Next, adjust the number of transactions to retain if doing so would mean
>   // keeping too many segments around.
>   while (editLogs.size() > maxExtraEditsSegmentsToRetain) {
> purgeLogsFrom = editLogs.get(0).getLastTxId() + 1;
> editLogs.remove(0);
>   }
>   ...
>   purgeableLogs.purgeLogsOlderThan(purgeLogsFrom);
> }{code}
>  
> {code:java}
> 2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
> (StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
> have been 1189661 txns since the last checkpoint, which exceeds the 
> configured threshold 2
> 2022-03-15 17:28:02,648 INFO  namenode.FSImage 
> (FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
> ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
> ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
> seconds
> 2022-03-15 17:28:02,649 INFO  namenode.FSImage 
> (FSImage.java:saveNamespace(1121)) - Save namespace ...
> 2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
> (FSImageFormatProtobuf.java:save(718)) - Saving image file 
> /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
> compression
> 2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
> (FSImageFormatProtobuf.java:save(722)) - Image file 
> /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
> 17885002 bytes saved in 0 seconds .
> 2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 
> 2 images with txid >= 25892513
> 2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
> 

[jira] [Updated] (HDFS-16507) Purged edit logs which is in progress

2022-03-15 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Description: 
We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
exception. It looks like an edit log that is still in progress is being purged.

Based on the analysis, I suspect that the in-progress edit log to be purged 
was not finalized before the ANN rolled its own edit log.

Here are some key logs for reference:

1. ANN. Create editlog, 
{color:#ff}edits_InProgress_00024207987{color}.
{code:java}
2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
{code}
2. SNN. Checkpoint.

The oldest image file is: fsimage_000{color:#de350b}25892513{color}.

{color:#ff}25892513 + 1 - 1000000 = 24892514{color}
{color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color}

{color:#172b4d}Code: {color}NNStorageRetentionManager#purgeOldStorage
{code:java}
void purgeOldStorage(NameNodeFile nnf) throws IOException {
  FSImageTransactionalStorageInspector inspector =
  new FSImageTransactionalStorageInspector(EnumSet.of(nnf));
  storage.inspectStorageDirs(inspector);

  long minImageTxId = getImageTxIdToRetain(inspector);
  purgeCheckpointsOlderThan(inspector, minImageTxId);
  
{code}
{color:#910091}...{color}
{code:java}
  long minimumRequiredTxId = minImageTxId + 1;
  long purgeLogsFrom = Math.max(0, minimumRequiredTxId - numExtraEditsToRetain);
  
  ArrayList<EditLogInputStream> editLogs = new ArrayList<EditLogInputStream>();
  purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false, false);
  Collections.sort(editLogs, new Comparator<EditLogInputStream>() {
@Override
public int compare(EditLogInputStream a, EditLogInputStream b) {
  return ComparisonChain.start()
  .compare(a.getFirstTxId(), b.getFirstTxId())
  .compare(a.getLastTxId(), b.getLastTxId())
  .result();
}
  });

  // Remove from consideration any edit logs that are in fact required.
  while (editLogs.size() > 0 &&
  editLogs.get(editLogs.size() - 1).getFirstTxId() >= minimumRequiredTxId) {
editLogs.remove(editLogs.size() - 1);
  }
  
  // Next, adjust the number of transactions to retain if doing so would mean
  // keeping too many segments around.
  while (editLogs.size() > maxExtraEditsSegmentsToRetain) {
purgeLogsFrom = editLogs.get(0).getLastTxId() + 1;
editLogs.remove(0);
  }
  ...
  purgeableLogs.purgeLogsOlderThan(purgeLogsFrom);
}{code}
 
{code:java}
2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
have been 1189661 txns since the last checkpoint, which exceeds the configured 
threshold 2
2022-03-15 17:28:02,648 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
seconds
2022-03-15 17:28:02,649 INFO  namenode.FSImage 
(FSImage.java:saveNamespace(1121)) - Save namespace ...
2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(718)) - Saving image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
compression
2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(722)) - Image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
17885002 bytes saved in 0 seconds .
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 
images with txid >= 25892513
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305,
 cpktTxId=00024794305)
2022-03-15 17:28:03,188 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 24892514
2022-03-15 17:28:03,282 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,536 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at 

[jira] [Commented] (HDFS-16507) Purged edit logs which is in progress

2022-03-15 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507307#comment-17507307
 ] 

tomscut commented on HDFS-16507:


Hi [~weichiu] [~chaosun] [~xkrogen] [~Symious], could you please take a look? 
I wonder if we missed something. Thank you very much.

> Purged edit logs which is in progress
> -
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Priority: Critical
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like an edit log that is still in progress is being purged.
> Based on the analysis, I suspect that the in-progress edit log to be purged 
> was not finalized before the ANN rolled its own edit log.
> Here are some key logs for reference:
> 1. ANN. Create editlog, 
> {color:#ff}edits_InProgress_00024207987{color}.
> {code:java}
> 2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
> 2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
> 2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
> 2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
> {code}
> 2. SNN. Checkpoint.
> The oldest image file is: fsimage_000{color:#de350b}25892513{color}.
> {color:#ff}25892513 + 1 - 1000000 = 24892514{color}
> {color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color}
> {color:#172b4d}Code: {color}NNStorageRetentionManager#purgeOldStorage
> {code:java}
> void purgeOldStorage(NameNodeFile nnf) throws IOException {
>   FSImageTransactionalStorageInspector inspector =
>   new FSImageTransactionalStorageInspector(EnumSet.of(nnf));
>   storage.inspectStorageDirs(inspector);
>   long minImageTxId = getImageTxIdToRetain(inspector);
>   purgeCheckpointsOlderThan(inspector, minImageTxId);
>   
> {code}
> {color:#910091}...{color}
> {code:java}
>   long minimumRequiredTxId = minImageTxId + 1;
>   long purgeLogsFrom = Math.max(0, minimumRequiredTxId - 
> numExtraEditsToRetain);
>   
>   ArrayList<EditLogInputStream> editLogs = new ArrayList<EditLogInputStream>();
>   purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false, false);
>   Collections.sort(editLogs, new Comparator<EditLogInputStream>() {
> @Override
> public int compare(EditLogInputStream a, EditLogInputStream b) {
>   return ComparisonChain.start()
>   .compare(a.getFirstTxId(), b.getFirstTxId())
>   .compare(a.getLastTxId(), b.getLastTxId())
>   .result();
> }
>   });
>   // Remove from consideration any edit logs that are in fact required.
>   while (editLogs.size() > 0 &&
>   editLogs.get(editLogs.size() - 1).getFirstTxId() >= 
> minimumRequiredTxId) {
> editLogs.remove(editLogs.size() - 1);
>   }
>   
>   // Next, adjust the number of transactions to retain if doing so would mean
>   // keeping too many segments around.
>   while (editLogs.size() > maxExtraEditsSegmentsToRetain) {
> purgeLogsFrom = editLogs.get(0).getLastTxId() + 1;
> editLogs.remove(0);
>   }
>   ...
>   purgeableLogs.purgeLogsOlderThan(purgeLogsFrom);
> }{code}
>  
> {code:java}
> 2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
> (StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
> have been 1189661 txns since the last checkpoint, which exceeds the 
> configured threshold 2
> 2022-03-15 17:28:02,648 INFO  namenode.FSImage 
> (FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
> ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
> ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
> seconds
> 2022-03-15 17:28:02,649 INFO  namenode.FSImage 
> (FSImage.java:saveNamespace(1121)) - Save namespace ...
> 2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
> (FSImageFormatProtobuf.java:save(718)) - Saving image file 
> /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
> compression
> 2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
> (FSImageFormatProtobuf.java:save(722)) - Image file 
> /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
> 17885002 bytes saved in 0 seconds .
> 2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 
> 2 images with txid >= 25892513
> 2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
> 

[jira] [Updated] (HDFS-16507) Purged edit logs which is in process

2022-03-15 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Description: 
We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
exception. It looks like an edit log that is still in progress is being purged.

Based on the analysis, I suspect that the in-progress edit log to be purged 
was not finalized before the ANN rolled its own edit log.

Here are some key logs for reference:

1. ANN. Create editlog, 
{color:#ff}edits_InProgress_00024207987{color}.
{code:java}
2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
{code}
2. SNN. Checkpoint.

The oldest image file is: fsimage_000{color:#de350b}25892513{color}.

{color:#ff}25892513 + 1 - 1000000 = 24892514{color}
{color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color}

{color:#172b4d}Code: {color}NNStorageRetentionManager#purgeOldStorage
{code:java}
void purgeOldStorage(NameNodeFile nnf) throws IOException {
  FSImageTransactionalStorageInspector inspector =
  new FSImageTransactionalStorageInspector(EnumSet.of(nnf));
  storage.inspectStorageDirs(inspector);

  long minImageTxId = getImageTxIdToRetain(inspector);
  purgeCheckpointsOlderThan(inspector, minImageTxId);
  
{code}
{color:#910091}...{color}
{code:java}
  long minimumRequiredTxId = minImageTxId + 1;
  long purgeLogsFrom = Math.max(0, minimumRequiredTxId - numExtraEditsToRetain);
  
  ArrayList<EditLogInputStream> editLogs = new ArrayList<EditLogInputStream>();
  purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false, false);
  Collections.sort(editLogs, new Comparator<EditLogInputStream>() {
@Override
public int compare(EditLogInputStream a, EditLogInputStream b) {
  return ComparisonChain.start()
  .compare(a.getFirstTxId(), b.getFirstTxId())
  .compare(a.getLastTxId(), b.getLastTxId())
  .result();
}
  });

  // Remove from consideration any edit logs that are in fact required.
  while (editLogs.size() > 0 &&
  editLogs.get(editLogs.size() - 1).getFirstTxId() >= minimumRequiredTxId) {
editLogs.remove(editLogs.size() - 1);
  }
  
  // Next, adjust the number of transactions to retain if doing so would mean
  // keeping too many segments around.
  while (editLogs.size() > maxExtraEditsSegmentsToRetain) {
purgeLogsFrom = editLogs.get(0).getLastTxId() + 1;
editLogs.remove(0);
  }
  ...
  purgeableLogs.purgeLogsOlderThan(purgeLogsFrom);
}{code}
 
{code:java}
2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
have been 1189661 txns since the last checkpoint, which exceeds the configured 
threshold 2
2022-03-15 17:28:02,648 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
seconds
2022-03-15 17:28:02,649 INFO  namenode.FSImage 
(FSImage.java:saveNamespace(1121)) - Save namespace ...
2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(718)) - Saving image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
compression
2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(722)) - Image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
17885002 bytes saved in 0 seconds .
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 
images with txid >= 25892513
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305,
 cpktTxId=00024794305)
2022-03-15 17:28:03,188 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 24892514
2022-03-15 17:28:03,282 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,536 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at 

[jira] [Updated] (HDFS-16507) Purged edit logs which is in progress

2022-03-15 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Summary: Purged edit logs which is in progress  (was: Purged edit logs 
which is in process)

> Purged edit logs which is in progress
> -
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Priority: Critical
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like an edit log that is still in progress is being purged.
> Based on the analysis, I suspect that the in-progress edit log to be purged 
> was not finalized before the ANN rolled its own edit log.
> Here are some key logs for reference:
> 1. ANN. Create editlog, 
> {color:#ff}edits_InProgress_00024207987{color}.
> {code:java}
> 2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
> 2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
> 2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
> 2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
> {code}
> 2. SNN. Checkpoint.
> The oldest image file is: fsimage_000{color:#de350b}25892513{color}.
> {color:#ff}25892513 + 1 - 1000000 = 24892514{color}
> {color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color}
> {color:#172b4d}Code: {color}NNStorageRetentionManager#purgeOldStorage
> {code:java}
> void purgeOldStorage(NameNodeFile nnf) throws IOException {
>   FSImageTransactionalStorageInspector inspector =
>   new FSImageTransactionalStorageInspector(EnumSet.of(nnf));
>   storage.inspectStorageDirs(inspector);
>   long minImageTxId = getImageTxIdToRetain(inspector);
>   purgeCheckpointsOlderThan(inspector, minImageTxId);
>   
> {code}
> {color:#910091}...{color}
> {code:java}
>   long minimumRequiredTxId = minImageTxId + 1;
>   long purgeLogsFrom = Math.max(0, minimumRequiredTxId - 
> numExtraEditsToRetain);
>   
>   ArrayList<EditLogInputStream> editLogs = new ArrayList<EditLogInputStream>();
>   purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false, false);
>   Collections.sort(editLogs, new Comparator<EditLogInputStream>() {
> @Override
> public int compare(EditLogInputStream a, EditLogInputStream b) {
>   return ComparisonChain.start()
>   .compare(a.getFirstTxId(), b.getFirstTxId())
>   .compare(a.getLastTxId(), b.getLastTxId())
>   .result();
> }
>   });
>   // Remove from consideration any edit logs that are in fact required.
>   while (editLogs.size() > 0 &&
>   editLogs.get(editLogs.size() - 1).getFirstTxId() >= 
> minimumRequiredTxId) {
> editLogs.remove(editLogs.size() - 1);
>   }
>   
>   // Next, adjust the number of transactions to retain if doing so would mean
>   // keeping too many segments around.
>   while (editLogs.size() > maxExtraEditsSegmentsToRetain) {
> purgeLogsFrom = editLogs.get(0).getLastTxId() + 1;
> editLogs.remove(0);
>   }
>   ...
>   purgeableLogs.purgeLogsOlderThan(purgeLogsFrom);
> }{code}
>  
> {code:java}
> 2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
> (StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
> have been 1189661 txns since the last checkpoint, which exceeds the 
> configured threshold 2
> 2022-03-15 17:28:02,648 INFO  namenode.FSImage 
> (FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
> ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
> ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
> seconds
> 2022-03-15 17:28:02,649 INFO  namenode.FSImage 
> (FSImage.java:saveNamespace(1121)) - Save namespace ...
> 2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
> (FSImageFormatProtobuf.java:save(718)) - Saving image file 
> /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
> compression
> 2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
> (FSImageFormatProtobuf.java:save(722)) - Image file 
> /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
> 17885002 bytes saved in 0 seconds .
> 2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 
> 2 images with txid >= 25892513
> 2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
> FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305,
>  

[jira] [Updated] (HDFS-16507) Purged edit logs which is in process

2022-03-15 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Description: 
We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
exception. It looks like an edit log that is still in progress is being purged.

Based on the analysis, I suspect that the edit log to be purged was not 
finalized before the ANN rolled its own edit log.

Here are some key logs for reference:

1. ANN. Create editlog, 
{color:#ff}edits_InProgress_00024207987{color}.
{code:java}
2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
{code}
2. SNN. Checkpoint.

The oldest image file is: fsimage_000{color:#de350b}25892513{color}.

{color:#ff}25892513 + 1 - 1000000 = 24892514{color}
{color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color}

{color:#172b4d}Code: {color}NNStorageRetentionManager#purgeOldStorage
{code:java}
void purgeOldStorage(NameNodeFile nnf) throws IOException {
  FSImageTransactionalStorageInspector inspector =
  new FSImageTransactionalStorageInspector(EnumSet.of(nnf));
  storage.inspectStorageDirs(inspector);

  long minImageTxId = getImageTxIdToRetain(inspector);
  purgeCheckpointsOlderThan(inspector, minImageTxId);
  
{code}
{color:#910091}...{color}
{code:java}
  long minimumRequiredTxId = minImageTxId + 1;
  long purgeLogsFrom = Math.max(0, minimumRequiredTxId - numExtraEditsToRetain);
  
  ArrayList<EditLogInputStream> editLogs = new ArrayList<EditLogInputStream>();
  purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false, false);
  Collections.sort(editLogs, new Comparator<EditLogInputStream>() {
@Override
public int compare(EditLogInputStream a, EditLogInputStream b) {
  return ComparisonChain.start()
  .compare(a.getFirstTxId(), b.getFirstTxId())
  .compare(a.getLastTxId(), b.getLastTxId())
  .result();
}
  });

  // Remove from consideration any edit logs that are in fact required.
  while (editLogs.size() > 0 &&
  editLogs.get(editLogs.size() - 1).getFirstTxId() >= minimumRequiredTxId) {
editLogs.remove(editLogs.size() - 1);
  }
  
  // Next, adjust the number of transactions to retain if doing so would mean
  // keeping too many segments around.
  while (editLogs.size() > maxExtraEditsSegmentsToRetain) {
purgeLogsFrom = editLogs.get(0).getLastTxId() + 1;
editLogs.remove(0);
  }
  ...
  purgeableLogs.purgeLogsOlderThan(purgeLogsFrom);
}{code}
 
{code:java}
2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
have been 1189661 txns since the last checkpoint, which exceeds the configured 
threshold 2
2022-03-15 17:28:02,648 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
seconds
2022-03-15 17:28:02,649 INFO  namenode.FSImage 
(FSImage.java:saveNamespace(1121)) - Save namespace ...
2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(718)) - Saving image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
compression
2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(722)) - Image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
17885002 bytes saved in 0 seconds .
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 
images with txid >= 25892513
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305,
 cpktTxId=00024794305)
2022-03-15 17:28:03,188 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 24892514
2022-03-15 17:28:03,282 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,536 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at 

[jira] [Updated] (HDFS-16507) Purged edit logs which is in process

2022-03-15 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Description: 
We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
exception. It looks like an edit log that is still in progress is being purged.

Based on the analysis, I suspect that the edit log to be purged was not 
finalized.

Here are some key logs for reference:

1. ANN. Create editlog, 
{color:#ff}edits_InProgress_00024207987{color}.
{code:java}
2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
{code}
2. SNN. Checkpoint.

The oldest image file is: fsimage_000{color:#de350b}25892513{color}.

{color:#ff}25892513 + 1 - 1000000 = 24892514{color}
{color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color}

{color:#172b4d}Code: {color}NNStorageRetentionManager#purgeOldStorage
{code:java}
void purgeOldStorage(NameNodeFile nnf) throws IOException {
  FSImageTransactionalStorageInspector inspector =
  new FSImageTransactionalStorageInspector(EnumSet.of(nnf));
  storage.inspectStorageDirs(inspector);

  long minImageTxId = getImageTxIdToRetain(inspector);
  purgeCheckpointsOlderThan(inspector, minImageTxId);
  
{code}
{color:#910091}...{color}
{code:java}
  long minimumRequiredTxId = minImageTxId + 1;
  long purgeLogsFrom = Math.max(0, minimumRequiredTxId - numExtraEditsToRetain);
  
  ArrayList<EditLogInputStream> editLogs = new ArrayList<EditLogInputStream>();
  purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false, false);
  Collections.sort(editLogs, new Comparator<EditLogInputStream>() {
@Override
public int compare(EditLogInputStream a, EditLogInputStream b) {
  return ComparisonChain.start()
  .compare(a.getFirstTxId(), b.getFirstTxId())
  .compare(a.getLastTxId(), b.getLastTxId())
  .result();
}
  });

  // Remove from consideration any edit logs that are in fact required.
  while (editLogs.size() > 0 &&
  editLogs.get(editLogs.size() - 1).getFirstTxId() >= minimumRequiredTxId) {
editLogs.remove(editLogs.size() - 1);
  }
  
  // Next, adjust the number of transactions to retain if doing so would mean
  // keeping too many segments around.
  while (editLogs.size() > maxExtraEditsSegmentsToRetain) {
purgeLogsFrom = editLogs.get(0).getLastTxId() + 1;
editLogs.remove(0);
  }
  ...
  purgeableLogs.purgeLogsOlderThan(purgeLogsFrom);
}{code}
 
{code:java}
2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
have been 1189661 txns since the last checkpoint, which exceeds the configured 
threshold 2
2022-03-15 17:28:02,648 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
seconds
2022-03-15 17:28:02,649 INFO  namenode.FSImage 
(FSImage.java:saveNamespace(1121)) - Save namespace ...
2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(718)) - Saving image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
compression
2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(722)) - Image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
17885002 bytes saved in 0 seconds .
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 
images with txid >= 25892513
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305,
 cpktTxId=00024794305)
2022-03-15 17:28:03,188 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 24892514
2022-03-15 17:28:03,282 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,536 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-nn1.bigdata.bigo.inner:50070 in 

[jira] [Updated] (HDFS-16507) Purged edit logs which is in process

2022-03-15 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Description: 
We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
exception. It looks like an edit log that is still in progress is being purged.

Based on the analysis, I suspect that the edit log to be purged was not 
finalized.

Here are some key logs for reference:

1. ANN. Create editlog, 
{color:#ff}edits_InProgress_00024207987{color}.
{code:java}
2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
{code}
2. SNN. Checkpoint.

The oldest image file is: fsimage_000{color:#de350b}25892513{color}.

{color:#ff}25892513 + 1 - 1000000 = 24892514{color}
{color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color}

{color:#172b4d}Code: {color}NNStorageRetentionManager#purgeOldStorage

 
{code:java}
void purgeOldStorage(NameNodeFile nnf) throws IOException {
  FSImageTransactionalStorageInspector inspector =
  new FSImageTransactionalStorageInspector(EnumSet.of(nnf));
  storage.inspectStorageDirs(inspector);

  long minImageTxId = getImageTxIdToRetain(inspector);
  purgeCheckpointsOlderThan(inspector, minImageTxId);
  
  if (nnf == NameNodeFile.IMAGE_ROLLBACK) {
// do not purge edits for IMAGE_ROLLBACK.
return;
  }

  long minimumRequiredTxId = minImageTxId + 1;
  long purgeLogsFrom = Math.max(0, minimumRequiredTxId - numExtraEditsToRetain);
  
  ArrayList<EditLogInputStream> editLogs = new ArrayList<EditLogInputStream>();
  purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false, false);
  Collections.sort(editLogs, new Comparator<EditLogInputStream>() {
@Override
public int compare(EditLogInputStream a, EditLogInputStream b) {
  return ComparisonChain.start()
  .compare(a.getFirstTxId(), b.getFirstTxId())
  .compare(a.getLastTxId(), b.getLastTxId())
  .result();
}
  });

  // Remove from consideration any edit logs that are in fact required.
  while (editLogs.size() > 0 &&
  editLogs.get(editLogs.size() - 1).getFirstTxId() >= minimumRequiredTxId) {
editLogs.remove(editLogs.size() - 1);
  }
  
  // Next, adjust the number of transactions to retain if doing so would mean
  // keeping too many segments around.
  while (editLogs.size() > maxExtraEditsSegmentsToRetain) {
purgeLogsFrom = editLogs.get(0).getLastTxId() + 1;
editLogs.remove(0);
  }
  
  // Finally, ensure that we're not trying to purge any transactions that we
  // actually need.
  if (purgeLogsFrom > minimumRequiredTxId) {
throw new AssertionError("Should not purge more edits than required to "
+ "restore: " + purgeLogsFrom + " should be <= "
+ minimumRequiredTxId);
  }

  LOG.info("purgeLogsFrom: " + purgeLogsFrom);
  for (EditLogInputStream editLog : editLogs) {
if (editLog.isInProgress()) {
  LOG.info("editLog isInProgress, start txid:" + editLog.getFirstTxId() +
  " , last txid:" + editLog.getLastTxId());
}
  }
  
  purgeableLogs.purgeLogsOlderThan(purgeLogsFrom);
}{code}
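The isInProgress() logging added above makes the suspected failure mode visible. Numerically, the stale in-progress segment falls below the purge threshold; a minimal sketch using the txids from this report (class and variable names are illustrative):
{code:java}
public class InProgressPurgeCheck {
  public static void main(String[] args) {
    long purgeLogsFrom = 24892514L;        // minImageTxId + 1 - numExtraEditsToRetain
    long inProgressFirstTxId = 24207987L;  // from "Starting log segment at 24207987"
    // The never-finalized segment starts below the purge threshold, so
    // purgeLogsOlderThan(purgeLogsFrom) treats it as purgeable even though
    // it is still in progress.
    System.out.println(inProgressFirstTxId < purgeLogsFrom);  // true
  }
}
{code}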
 

 
{code:java}
2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
have been 1189661 txns since the last checkpoint, which exceeds the configured 
threshold 2
2022-03-15 17:28:02,648 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
seconds
2022-03-15 17:28:02,649 INFO  namenode.FSImage 
(FSImage.java:saveNamespace(1121)) - Save namespace ...
2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(718)) - Saving image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
compression
2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(722)) - Image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
17885002 bytes saved in 0 seconds .
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 
images with txid >= 25892513
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305,
 

[jira] [Updated] (HDFS-16507) Purged edit logs which is in process

2022-03-15 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Description: 
We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
exception. It looks like an edit log that is still in progress is being purged.

Based on the analysis, I suspect that the edit log to be purged was not 
finalized.

Here are some key logs for reference:

1. ANN. Create editlog, 
{color:#ff}edits_InProgress_00024207987{color}.
{code:java}
2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
{code}
2. SNN. Checkpoint.

The oldest image file is: fsimage_000{color:#de350b}25892513{color}.

{color:#ff}25892513 + 1 - 1000000 = 24892514{color}
{color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color}
{code:java}
2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
have been 1189661 txns since the last checkpoint, which exceeds the configured 
threshold 2
2022-03-15 17:28:02,648 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
seconds
2022-03-15 17:28:02,649 INFO  namenode.FSImage 
(FSImage.java:saveNamespace(1121)) - Save namespace ...
2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(718)) - Saving image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
compression
2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(722)) - Image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
17885002 bytes saved in 0 seconds .
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 
images with txid >= 25892513
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305,
 cpktTxId=00024794305)
2022-03-15 17:28:03,188 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 24892514
2022-03-15 17:28:03,282 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,536 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-nn1.bigdata.bigo.inner:50070 in 
0.343 seconds
2022-03-15 17:28:03,640 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,684 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-dn1.bigdata.bigo.inner:50070 in 
0.148 seconds
2022-03-15 17:28:03,748 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,798 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-dn2.bigdata.bigo.inner:50070 in 
0.113 seconds
2022-03-15 17:28:03,798 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(482)) - Checkpoint finished successfully.
 {code}
3. ANN. Purge edit logs.

The oldest image file is: fsimage_000{color:#de350b}25892513{color}.

{color:#ff}25892513 + 1 - 1000000 = 24892514{color}
{color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color}
{code:java}
2022-03-15 17:28:03,515 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 
images 

[jira] [Updated] (HDFS-16507) Purged edit logs which is in process

2022-03-15 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Description: 
We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
exception. It looks like an edit log that is still in progress is being purged.

Based on the analysis, I suspect that the edit log to be purged was not 
finalized.

Here are some key logs for reference:

1. ANN. Create editlog, 
{color:#ff}edits_InProgress_00024207987{color}.
{code:java}
2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
{code}
2. SNN. Checkpoint.

{color:#ff}25892513 + 1 - 1000000 = 24892514{color}
{color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color}
{code:java}
2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
have been 1189661 txns since the last checkpoint, which exceeds the configured 
threshold 2
2022-03-15 17:28:02,648 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
seconds
2022-03-15 17:28:02,649 INFO  namenode.FSImage 
(FSImage.java:saveNamespace(1121)) - Save namespace ...
2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(718)) - Saving image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
compression
2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(722)) - Image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
17885002 bytes saved in 0 seconds .
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 
images with txid >= 25892513
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305,
 cpktTxId=00024794305)
2022-03-15 17:28:03,188 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 24892514
2022-03-15 17:28:03,282 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,536 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-nn1.bigdata.bigo.inner:50070 in 
0.343 seconds
2022-03-15 17:28:03,640 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,684 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-dn1.bigdata.bigo.inner:50070 in 
0.148 seconds
2022-03-15 17:28:03,748 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,798 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-dn2.bigdata.bigo.inner:50070 in 
0.113 seconds
2022-03-15 17:28:03,798 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(482)) - Checkpoint finished successfully.
 {code}
3. ANN. Purge edit logs.

 

{color:#ff}25892513 + 1 - 1000000 = 24892514{color}
{color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color}
{code:java}
2022-03-15 17:28:03,515 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 
images with txid >= 25892513 {code}
{code:java}
2022-03-15 17:28:03,523 INFO  namenode.NNStorageRetentionManager 

[jira] [Updated] (HDFS-16507) Purged edit logs which is in process

2022-03-15 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Description: 
We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
exception. It looks like an edit log that is still in progress is being purged.

Based on the analysis, I suspect that the edit log to be purged was not 
finalized normally.

Here are some key logs for reference:

1. ANN. Create editlog, 
{color:#ff}edits_InProgress_00024207987{color}.
{code:java}
2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
{code}
2. SNN. Checkpoint.

{color:#ff}25892513 + 1 - 1000000 = 24892514{color}
{color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color}
{code:java}
2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
have been 1189661 txns since the last checkpoint, which exceeds the configured 
threshold 2
2022-03-15 17:28:02,648 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
seconds
2022-03-15 17:28:02,649 INFO  namenode.FSImage 
(FSImage.java:saveNamespace(1121)) - Save namespace ...
2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(718)) - Saving image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
compression
2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(722)) - Image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
17885002 bytes saved in 0 seconds .
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 
images with txid >= 25892513
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305,
 cpktTxId=00024794305)
2022-03-15 17:28:03,188 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 24892514
2022-03-15 17:28:03,282 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,536 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-nn1.bigdata.bigo.inner:50070 in 
0.343 seconds
2022-03-15 17:28:03,640 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,684 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-dn1.bigdata.bigo.inner:50070 in 
0.148 seconds
2022-03-15 17:28:03,748 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,798 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-dn2.bigdata.bigo.inner:50070 in 
0.113 seconds
2022-03-15 17:28:03,798 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(482)) - Checkpoint finished successfully.
 {code}
3. ANN. Purge edit logs.

 

{color:#ff}25892513 + 1 - 1000000 = 24892514{color}
{color:#ff}dfs.namenode.num.extra.edits.retained=1000000{color}
{code:java}
2022-03-15 17:28:03,515 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 
images with txid >= 25892513 {code}
{code:java}
2022-03-15 17:28:03,523 INFO  namenode.NNStorageRetentionManager 

[jira] [Updated] (HDFS-16507) Purged edit logs which is in process

2022-03-15 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Affects Version/s: 3.1.0

> Purged edit logs which is in process
> 
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Priority: Critical
>
> We introduced Standby read functionality in branch-3.1.0, but found a FATAL 
> exception. It looks like the NameNode is purging an edit log that is still 
> in progress.
> Based on the analysis, I suspect that the edit log to be purged was not 
> finalized normally.
> I am posting some key logs for reference:
> 1. ANN. Creates edit log 
> {color:#FF}edits_inprogress_00024207987{color}.
>  
> {code:java}
> 2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
> 2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
> 2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
> 2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
> {code}
> 2. SNN. Checkpoint.
>  
> {color:#FF}25892513 + 1 - 1000000 = 24892514{color}
> {color:#FF}dfs.namenode.num.extra.edits.retained=1000000{color}
>  
> {code:java}
> 2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
> (StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
> have been 1189661 txns since the last checkpoint, which exceeds the 
> configured threshold 2
> 2022-03-15 17:28:02,648 INFO  namenode.FSImage 
> (FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
> ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
> ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
> seconds
> 2022-03-15 17:28:02,649 INFO  namenode.FSImage 
> (FSImage.java:saveNamespace(1121)) - Save namespace ...
> 2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
> (FSImageFormatProtobuf.java:save(718)) - Saving image file 
> /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
> compression
> 2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
> (FSImageFormatProtobuf.java:save(722)) - Image file 
> /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
> 17885002 bytes saved in 0 seconds .
> 2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 
> 2 images with txid >= 25892513
> 2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
> FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305,
>  cpktTxId=00024794305)
> 2022-03-15 17:28:03,188 INFO  namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 
> 24892514
> 2022-03-15 17:28:03,282 INFO  namenode.TransferFsImage 
> (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
> /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
> 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: 
> -1 bytes.
> 2022-03-15 17:28:03,536 INFO  namenode.TransferFsImage 
> (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
> 27082606 to namenode at http://sg-test-ambari-nn1.bigdata.bigo.inner:50070 in 
> 0.343 seconds
> 2022-03-15 17:28:03,640 INFO  namenode.TransferFsImage 
> (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
> /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
> 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: 
> -1 bytes.
> 2022-03-15 17:28:03,684 INFO  namenode.TransferFsImage 
> (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
> 27082606 to namenode at http://sg-test-ambari-dn1.bigdata.bigo.inner:50070 in 
> 0.148 seconds
> 2022-03-15 17:28:03,748 INFO  namenode.TransferFsImage 
> (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
> /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
> 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: 
> -1 bytes.
> 2022-03-15 17:28:03,798 INFO  namenode.TransferFsImage 
> (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
> 27082606 to namenode at http://sg-test-ambari-dn2.bigdata.bigo.inner:50070 in 
> 0.113 seconds
> 2022-03-15 17:28:03,798 INFO  ha.StandbyCheckpointer 
> 

[jira] [Updated] (HDFS-16507) Purged edit logs which is in process

2022-03-15 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16507:
---
Environment: (was: {code:java}
// code placeholder
{code})

> Purged edit logs which is in process
> 
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Priority: Critical
>
> We introduced Standby read functionality in branch-3.1.0, but found a FATAL 
> exception. It looks like the NameNode is purging an edit log that is still 
> in progress.
> Based on the analysis, I suspect that the edit log to be purged was not 
> finalized normally.
> I am posting some key logs for reference:
> 1. ANN. Creates edit log 
> {color:#FF}edits_inprogress_00024207987{color}.
>  
> {code:java}
> 2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
> 2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
> 2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
> 2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
> (FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
> {code}
> 2. SNN. Checkpoint.
>  
> {color:#FF}25892513 + 1 - 1000000 = 24892514{color}
> {color:#FF}dfs.namenode.num.extra.edits.retained=1000000{color}
>  
> {code:java}
> 2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
> (StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
> have been 1189661 txns since the last checkpoint, which exceeds the 
> configured threshold 2
> 2022-03-15 17:28:02,648 INFO  namenode.FSImage 
> (FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
> ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
> ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
> seconds
> 2022-03-15 17:28:02,649 INFO  namenode.FSImage 
> (FSImage.java:saveNamespace(1121)) - Save namespace ...
> 2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
> (FSImageFormatProtobuf.java:save(718)) - Saving image file 
> /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
> compression
> 2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
> (FSImageFormatProtobuf.java:save(722)) - Image file 
> /data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
> 17885002 bytes saved in 0 seconds .
> 2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 
> 2 images with txid >= 25892513
> 2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
> FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305,
>  cpktTxId=00024794305)
> 2022-03-15 17:28:03,188 INFO  namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 
> 24892514
> 2022-03-15 17:28:03,282 INFO  namenode.TransferFsImage 
> (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
> /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
> 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: 
> -1 bytes.
> 2022-03-15 17:28:03,536 INFO  namenode.TransferFsImage 
> (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
> 27082606 to namenode at http://sg-test-ambari-nn1.bigdata.bigo.inner:50070 in 
> 0.343 seconds
> 2022-03-15 17:28:03,640 INFO  namenode.TransferFsImage 
> (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
> /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
> 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: 
> -1 bytes.
> 2022-03-15 17:28:03,684 INFO  namenode.TransferFsImage 
> (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
> 27082606 to namenode at http://sg-test-ambari-dn1.bigdata.bigo.inner:50070 in 
> 0.148 seconds
> 2022-03-15 17:28:03,748 INFO  namenode.TransferFsImage 
> (TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
> /data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
> 17885002. Sent total: 17885002 bytes. Size of last segment intended to send: 
> -1 bytes.
> 2022-03-15 17:28:03,798 INFO  namenode.TransferFsImage 
> (TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
> 27082606 to namenode at http://sg-test-ambari-dn2.bigdata.bigo.inner:50070 in 
> 0.113 seconds
> 2022-03-15 17:28:03,798 INFO  ha.StandbyCheckpointer 
> 

[jira] [Created] (HDFS-16507) Purged edit logs which is in process

2022-03-15 Thread tomscut (Jira)
tomscut created HDFS-16507:
--

 Summary: Purged edit logs which is in process
 Key: HDFS-16507
 URL: https://issues.apache.org/jira/browse/HDFS-16507
 Project: Hadoop HDFS
  Issue Type: Bug
 Environment: {code:java}
// code placeholder
{code}
Reporter: tomscut


We introduced Standby read functionality in branch-3.1.0, but found a FATAL 
exception. It looks like the NameNode is purging an edit log that is still in 
progress.

Based on the analysis, I suspect that the edit log to be purged was not 
finalized normally.

I am posting some key logs for reference:

1. ANN. Creates edit log 
{color:#FF}edits_inprogress_00024207987{color}.

 
{code:java}
2022-03-15 17:24:52,558 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1394)) - Starting log segment at 24207987
2022-03-15 17:24:52,609 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegment(1423)) - Ending log segment at 24207987
2022-03-15 17:24:52,610 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1432)) - logEdit at 24207987
2022-03-15 17:24:52,624 INFO  namenode.FSEditLog 
(FSEditLog.java:startLogSegmentAndWriteHeaderTxn(1434)) - logSync at 24207987 
{code}
2. SNN. Checkpoint.

 

{color:#FF}25892513 + 1 - 1000000 = 24892514{color}
{color:#FF}dfs.namenode.num.extra.edits.retained=1000000{color}

 
{code:java}
2022-03-15 17:28:02,640 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(443)) - Triggering checkpoint because there 
have been 1189661 txns since the last checkpoint, which exceeds the configured 
threshold 2
2022-03-15 17:28:02,648 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(188)) - Edits file 
ByteStringEditLog[27082175, 27082606], ByteStringEditLog[27082175, 27082606], 
ByteStringEditLog[27082175, 27082606] of size 60008 edits # 432 loaded in 0 
seconds
2022-03-15 17:28:02,649 INFO  namenode.FSImage 
(FSImage.java:saveNamespace(1121)) - Save namespace ...
2022-03-15 17:28:02,650 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(718)) - Saving image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 using no 
compression
2022-03-15 17:28:03,180 INFO  namenode.FSImageFormatProtobuf 
(FSImageFormatProtobuf.java:save(722)) - Image file 
/data/hadoop/hdfs/namenode/current/fsimage.ckpt_00027082606 of size 
17885002 bytes saved in 0 seconds .
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:getImageTxIdToRetain(211)) - Going to retain 2 
images with txid >= 25892513
2022-03-15 17:28:03,183 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeImage(233)) - Purging old image 
FSImageFile(file=/data/hadoop/hdfs/namenode/current/fsimage_00024794305,
 cpktTxId=00024794305)
2022-03-15 17:28:03,188 INFO  namenode.NNStorageRetentionManager 
(NNStorageRetentionManager.java:purgeOldStorage(169)) - purgeLogsFrom: 24892514
2022-03-15 17:28:03,282 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,536 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-nn1.bigdata.bigo.inner:50070 in 
0.343 seconds
2022-03-15 17:28:03,640 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,684 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-dn1.bigdata.bigo.inner:50070 in 
0.148 seconds
2022-03-15 17:28:03,748 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(396)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_00027082606, fileSize: 
17885002. Sent total: 17885002 bytes. Size of last segment intended to send: -1 
bytes.
2022-03-15 17:28:03,798 INFO  namenode.TransferFsImage 
(TransferFsImage.java:uploadImageFromStorage(240)) - Uploaded image with txid 
27082606 to namenode at http://sg-test-ambari-dn2.bigdata.bigo.inner:50070 in 
0.113 seconds
2022-03-15 17:28:03,798 INFO  ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(482)) - Checkpoint finished successfully.
 {code}
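To make the arithmetic above explicit, here is the derivation of the purge 
boundary, with the values taken from these logs (the retained-edits count is 
the {{dfs.namenode.num.extra.edits.retained}} setting):
{code:java}
// Worked example of the retention math, using the values from the logs above.
long minImageTxId = 25892513L;        // "retain 2 images with txid >= 25892513"
long extraEditsRetained = 1000000L;   // dfs.namenode.num.extra.edits.retained
long purgeLogsFrom = minImageTxId + 1 - extraEditsRetained;
// purgeLogsFrom == 24892514, matching "purgeLogsFrom: 24892514" above.
{code}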
3. ANN. Purge edit logs.

 

{color:#FF}25892513 + 1 - 1000000 = 24892514{color}
{color:#FF}dfs.namenode.num.extra.edits.retained=1000000{color}
{code:java}
2022-03-15 17:28:03,515 INFO  namenode.NNStorageRetentionManager 

[jira] [Created] (HDFS-16506) Unit tests failed because of OutOfMemoryError

2022-03-14 Thread tomscut (Jira)
tomscut created HDFS-16506:
--

 Summary: Unit tests failed because of OutOfMemoryError
 Key: HDFS-16506
 URL: https://issues.apache.org/jira/browse/HDFS-16506
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: tomscut


Unit tests failed because of OutOfMemoryError.

An example: 
[OutOfMemoryError|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4009/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt]
{code:java}
[ERROR] Tests run: 32, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 95.727 
s <<< FAILURE! - in 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockInfoStriped
[ERROR] testGetBlockInfo[4: ErasureCodingPolicy=[Name=RS-10-4-1024k, 
Schema=[ECSchema=[Codec=rs, numDataUnits=10, numParityUnits=4]], 
CellSize=1048576, 
Id=5]](org.apache.hadoop.hdfs.server.blockmanagement.TestBlockInfoStriped)  
Time elapsed: 15.831 s  <<< ERROR!
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at 
io.netty.util.concurrent.ThreadPerTaskExecutor.execute(ThreadPerTaskExecutor.java:32)
at 
io.netty.util.internal.ThreadExecutorMap$1.execute(ThreadExecutorMap.java:57)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.doStartThread(SingleThreadEventExecutor.java:975)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.ensureThreadStarted(SingleThreadEventExecutor.java:958)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.shutdownGracefully(SingleThreadEventExecutor.java:660)
at 
io.netty.util.concurrent.MultithreadEventExecutorGroup.shutdownGracefully(MultithreadEventExecutorGroup.java:163)
at 
io.netty.util.concurrent.AbstractEventExecutorGroup.shutdownGracefully(AbstractEventExecutorGroup.java:70)
at 
org.apache.hadoop.hdfs.server.datanode.web.DatanodeHttpServer.close(DatanodeHttpServer.java:346)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.shutdown(DataNode.java:2348)
at 
org.apache.hadoop.hdfs.MiniDFSCluster.shutdownDataNode(MiniDFSCluster.java:2166)
at 
org.apache.hadoop.hdfs.MiniDFSCluster.shutdownDataNodes(MiniDFSCluster.java:2156)
at 
org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:2135)
at 
org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:2109)
at 
org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:2102)
at org.apache.hadoop.hdfs.MiniDFSCluster.close(MiniDFSCluster.java:3479)
at 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockInfoStriped.testGetBlockInfo(TestBlockInfoStriped.java:257)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748) {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16505) Setting safemode should not be interrupted by abnormal nodes

2022-03-14 Thread tomscut (Jira)
tomscut created HDFS-16505:
--

 Summary: Setting safemode should not be interrupted by abnormal 
nodes
 Key: HDFS-16505
 URL: https://issues.apache.org/jira/browse/HDFS-16505
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: tomscut
Assignee: tomscut
 Attachments: image-2022-03-15-09-29-36-538.png, 
image-2022-03-15-09-29-44-430.png

Setting safemode should not be interrupted by abnormal nodes. 

For example, we have four namenodes configured in the following order:
NS1 -> active
NS2 -> standby
NS3 -> observer
NS4 -> observer.

When the {color:#FF}NS1{color} process exits, setting the safemode state of 
{color:#FF}NS2{color}, {color:#FF}NS3{color}, and {color:#FF}NS4{color} fails, 
because the command aborts as soon as it reaches the dead node. Similarly, 
when the {color:#FF}NS2{color} process exits, only the safemode state of 
{color:#FF}NS1{color} can be set successfully.

 

When the {color:#FF}NS1{color} process exits:

Before the change:

!image-2022-03-15-09-29-36-538.png|width=1145,height=97!

After the change:

!image-2022-03-15-09-29-44-430.png|width=1104,height=119!
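A minimal sketch of the intended behavior (illustration only, not the actual 
patch; it assumes a list of ClientProtocol proxies, one per configured 
NameNode): catch the failure per NameNode and continue, instead of aborting 
the whole command at the first dead node.
{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hdfs.protocol.ClientProtocol;
import org.apache.hadoop.hdfs.protocol.HdfsConstants;

class SafeModeHelper {
  // Sketch only: enter safemode on every NameNode independently, so one
  // unreachable NameNode (e.g. NS1 above) does not abort the others.
  static void enterSafeModeOnAll(List<ClientProtocol> proxies) {
    for (ClientProtocol proxy : proxies) {
      try {
        proxy.setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_ENTER, false);
      } catch (IOException e) {
        // Log the failure and continue with the remaining NameNodes.
        System.err.println("Failed to set safemode on one NameNode: " + e);
      }
    }
  }
}
{code}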

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16503) Should verify whether the path name is valid in the WebHDFS

2022-03-13 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16503:
---
Description: 
When creating a file using WebHDFS, there are two main steps:
1. Obtain the location of the DataNode to write to.
2. PUT the file to that location.

Currently *NameNodeRpcServer* verifies that the pathName is valid, but 
*NamenodeWebHdfsMethods* and *RouterWebHdfsMethods* do not.

So if we use an invalid path (such as one with a duplicated slash), the first 
step returns success, but the second step throws an 
{*}InvalidPathException{*}. IMO, we should also do the validation in WebHDFS, 
to be consistent with the NameNodeRpcServer.

!image-2022-03-14-09-35-49-860.png|width=548,height=164!

The WebHDFS operations that behave the same way are CREATE, APPEND, OPEN, and 
GETFILECHECKSUM. So we can add a DFSUtil.isValidName check to redirectURI in 
*NamenodeWebHdfsMethods* and *RouterWebHdfsMethods*.

  was:
When creating a file using WebHDFS, there are two main steps:
1. Obtain the location of the DataNode to write to.
2. PUT the file to that location.

Currently *NameNodeRpcServer* verifies that the pathName is valid, but 
*NamenodeWebHdfsMethods* and *RouterWebHdfsMethods* do not.

So if we use an invalid path, the first step returns success, but the second 
step throws an {*}InvalidPathException{*}. IMO, we should also do the 
validation in WebHDFS, to be consistent with the NameNodeRpcServer.

!image-2022-03-14-09-35-49-860.png|width=548,height=164!

The WebHDFS operations that behave the same way are CREATE, APPEND, OPEN, and 
GETFILECHECKSUM. So we can add a DFSUtil.isValidName check to redirectURI in 
*NamenodeWebHdfsMethods* and *RouterWebHdfsMethods*.


> Should verify whether the path name is valid in the WebHDFS
> ---
>
> Key: HDFS-16503
> URL: https://issues.apache.org/jira/browse/HDFS-16503
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-03-14-09-35-49-860.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When creating a file using WebHDFS, there are two main steps:
> 1. Obtain the location of the DataNode to write to.
> 2. PUT the file to that location.
> Currently *NameNodeRpcServer* verifies that the pathName is valid, but 
> *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods* do not.
> So if we use an invalid path (such as one with a duplicated slash), the 
> first step returns success, but the second step throws an 
> {*}InvalidPathException{*}. IMO, we should also do the validation in 
> WebHDFS, to be consistent with the NameNodeRpcServer.
> !image-2022-03-14-09-35-49-860.png|width=548,height=164!
> The WebHDFS operations that behave the same way are CREATE, APPEND, OPEN, 
> and GETFILECHECKSUM. So we can add a DFSUtil.isValidName check to 
> redirectURI in *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods*.
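As a rough illustration of the proposed check (the helper class below is 
hypothetical; DFSUtil.isValidName and InvalidPathException are the real 
Hadoop types involved):
{code:java}
import org.apache.hadoop.fs.InvalidPathException;
import org.apache.hadoop.hdfs.DFSUtil;

// Hypothetical helper: in the real change the check would sit in
// NamenodeWebHdfsMethods/RouterWebHdfsMethods before the redirect URI is
// built, mirroring what NameNodeRpcServer already does.
public final class WebHdfsPathCheck {
  static void checkPath(String path) {
    if (!DFSUtil.isValidName(path)) {
      throw new InvalidPathException(path);
    }
  }

  public static void main(String[] args) {
    checkPath("/user/foo/bar.txt");   // valid, passes
    checkPath("/user//foo/bar.txt");  // duplicated slash -> InvalidPathException
  }
}
{code}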



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16503) Should verify whether the path name is valid in the WebHDFS

2022-03-13 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16503:
---
Description: 
When creating a file using WebHDFS, there are two main steps:
1. Obtain the location of the DataNode to write to.
2. PUT the file to that location.

Currently *NameNodeRpcServer* verifies that the pathName is valid, but 
*NamenodeWebHdfsMethods* and *RouterWebHdfsMethods* do not.

So if we use an invalid path, the first step returns success, but the second 
step throws an {*}InvalidPathException{*}. IMO, we should also do the 
validation in WebHDFS, to be consistent with the NameNodeRpcServer.

!image-2022-03-14-09-35-49-860.png|width=548,height=164!

The WebHDFS operations that behave the same way are CREATE, APPEND, OPEN, and 
GETFILECHECKSUM. So we can add a DFSUtil.isValidName check to redirectURI in 
*NamenodeWebHdfsMethods* and *RouterWebHdfsMethods*.

  was:
When creating a file using WebHDFS, there are two main steps:
1. Obtain the location of the DataNode to write to.
2. PUT the file to that location.

Currently *NameNodeRpcServer* verifies that the pathName is valid, but 
*NamenodeWebHdfsMethods* and *RouterWebHdfsMethods* do not.

So if we use an invalid path, the first step returns success, but the second 
step throws an {*}InvalidPathException{*}. We should also do the validation 
in WebHDFS, to be consistent with the NameNodeRpcServer.

!image-2022-03-14-09-35-49-860.png|width=548,height=164!

The WebHDFS operations that behave the same way are CREATE, APPEND, OPEN, and 
GETFILECHECKSUM. So we can add a DFSUtil.isValidName check to redirectURI in 
*NamenodeWebHdfsMethods* and *RouterWebHdfsMethods*.


> Should verify whether the path name is valid in the WebHDFS
> ---
>
> Key: HDFS-16503
> URL: https://issues.apache.org/jira/browse/HDFS-16503
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
> Attachments: image-2022-03-14-09-35-49-860.png
>
>
> When creating a file using WebHDFS, there are two main steps:
> 1. Obtain the location of the DataNode to write to.
> 2. PUT the file to that location.
> Currently *NameNodeRpcServer* verifies that the pathName is valid, but 
> *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods* do not.
> So if we use an invalid path, the first step returns success, but the 
> second step throws an {*}InvalidPathException{*}. IMO, we should also do 
> the validation in WebHDFS, to be consistent with the NameNodeRpcServer.
> !image-2022-03-14-09-35-49-860.png|width=548,height=164!
> The WebHDFS operations that behave the same way are CREATE, APPEND, OPEN, 
> and GETFILECHECKSUM. So we can add a DFSUtil.isValidName check to 
> redirectURI in *NamenodeWebHdfsMethods* and *RouterWebHdfsMethods*.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16503) Should verify whether the path name is valid in the WebHDFS

2022-03-13 Thread tomscut (Jira)
tomscut created HDFS-16503:
--

 Summary: Should verify whether the path name is valid in the 
WebHDFS
 Key: HDFS-16503
 URL: https://issues.apache.org/jira/browse/HDFS-16503
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: tomscut
Assignee: tomscut
 Attachments: image-2022-03-14-09-35-49-860.png

When creating a file using WebHDFS, there are two main steps:
1. Obtain the location of the DataNode to write to.
2. PUT the file to that location.

Currently *NameNodeRpcServer* verifies that the pathName is valid, but 
*NamenodeWebHdfsMethods* and *RouterWebHdfsMethods* do not.

So if we use an invalid path, the first step returns success, but the second 
step throws an {*}InvalidPathException{*}. We should also do the validation 
in WebHDFS, to be consistent with the NameNodeRpcServer.

!image-2022-03-14-09-35-49-860.png|width=548,height=164!

The WebHDFS operations that behave the same way are CREATE, APPEND, OPEN, and 
GETFILECHECKSUM. So we can add a DFSUtil.isValidName check to redirectURI in 
*NamenodeWebHdfsMethods* and *RouterWebHdfsMethods*.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] (HDFS-14271) [SBN read] StandbyException is logged if Observer is the first NameNode

2022-03-10 Thread tomscut (Jira)


[ https://issues.apache.org/jira/browse/HDFS-14271 ]


tomscut deleted comment on HDFS-14271:


was (Author: tomscut):
Perhaps the client needs to keep a cache, such as a map, recording the state 
of each NameNode.

Each request would then be sent to a NameNode whose state matches the 
operation.

When an ObserverRetryOnActiveException or StandbyException occurs, the cached 
state of the corresponding NameNode would be updated.

> [SBN read] StandbyException is logged if Observer is the first NameNode
> ---
>
> Key: HDFS-14271
> URL: https://issues.apache.org/jira/browse/HDFS-14271
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.0
>Reporter: Wei-Chiu Chuang
>Assignee: Shen Yinjie
>Priority: Minor
>  Labels: multi-sbnn
> Attachments: HDFS-14271_1.patch, image-2022-03-11-14-54-49-806.png
>
>
> If I transition the first NameNode into Observer state, and then I create a 
> file from command line, it prints the following StandbyException log message, 
> as if the command failed. But it actually completed successfully:
> {noformat}
> [root@weichiu-sbsr-1 ~]# hdfs dfs -touchz /tmp/abf
> 19/02/12 16:35:17 INFO retry.RetryInvocationHandler: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category WRITE is not supported in state observer. Visit 
> https://s.apache.org/sbnn-error
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1987)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1424)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:762)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:458)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:918)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:853)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2782)
> , while invoking $Proxy4.create over 
> [weichiu-sbsr-1.gce.cloudera.com/172.31.121.145:8020,weichiu-sbsr-2.gce.cloudera.com/172.31.121.140:8020].
>  Trying to failover immediately.
> {noformat}
> This is unlike the case when the first NameNode is the Standby, where this 
> StandbyException is suppressed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14271) [SBN read] StandbyException is logged if Observer is the first NameNode

2022-03-10 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504755#comment-17504755
 ] 

tomscut commented on HDFS-14271:


Perhaps the client needs to keep a cache, such as a map, recording the state 
of each NameNode.

Each request would then be sent to a NameNode whose state matches the 
operation.

When an ObserverRetryOnActiveException or StandbyException occurs, the cached 
state of the corresponding NameNode would be updated.
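A rough sketch of that caching idea (all names here are hypothetical; this is 
not existing Hadoop client code):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical client-side cache of each NameNode's last observed HA state.
enum HAState { ACTIVE, STANDBY, OBSERVER, UNKNOWN }

class NameNodeStateCache {
  private final Map<String, HAState> states = new ConcurrentHashMap<>();

  // Look up the cached state for a NameNode address, defaulting to UNKNOWN.
  HAState get(String nnAddress) {
    return states.getOrDefault(nnAddress, HAState.UNKNOWN);
  }

  // Called when a StandbyException or ObserverRetryOnActiveException shows
  // that the cached state of this NameNode is stale.
  void update(String nnAddress, HAState observed) {
    states.put(nnAddress, observed);
  }
}
{code}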

> [SBN read] StandbyException is logged if Observer is the first NameNode
> ---
>
> Key: HDFS-14271
> URL: https://issues.apache.org/jira/browse/HDFS-14271
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.0
>Reporter: Wei-Chiu Chuang
>Assignee: Shen Yinjie
>Priority: Minor
>  Labels: multi-sbnn
> Attachments: HDFS-14271_1.patch, image-2022-03-11-14-54-49-806.png
>
>
> If I transition the first NameNode into Observer state, and then I create a 
> file from command line, it prints the following StandbyException log message, 
> as if the command failed. But it actually completed successfully:
> {noformat}
> [root@weichiu-sbsr-1 ~]# hdfs dfs -touchz /tmp/abf
> 19/02/12 16:35:17 INFO retry.RetryInvocationHandler: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category WRITE is not supported in state observer. Visit 
> https://s.apache.org/sbnn-error
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1987)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1424)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:762)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:458)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:918)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:853)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2782)
> , while invoking $Proxy4.create over 
> [weichiu-sbsr-1.gce.cloudera.com/172.31.121.145:8020,weichiu-sbsr-2.gce.cloudera.com/172.31.121.140:8020].
>  Trying to failover immediately.
> {noformat}
> This is unlike the case when the first NameNode is the Standby, where this 
> StandbyException is suppressed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14271) [SBN read] StandbyException is logged if Observer is the first NameNode

2022-03-10 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504747#comment-17504747
 ] 

tomscut commented on HDFS-14271:


How about reducing the log level as a solution, even if it only cuts down the 
log output? In addition, *ObserverRetryOnActiveException* still needs to be 
dealt with.

!image-2022-03-11-14-54-49-806.png|width=687,height=150!

> [SBN read] StandbyException is logged if Observer is the first NameNode
> ---
>
> Key: HDFS-14271
> URL: https://issues.apache.org/jira/browse/HDFS-14271
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.0
>Reporter: Wei-Chiu Chuang
>Assignee: Shen Yinjie
>Priority: Minor
>  Labels: multi-sbnn
> Attachments: HDFS-14271_1.patch, image-2022-03-11-14-54-49-806.png
>
>
> If I transition the first NameNode into Observer state, and then I create a 
> file from command line, it prints the following StandbyException log message, 
> as if the command failed. But it actually completed successfully:
> {noformat}
> [root@weichiu-sbsr-1 ~]# hdfs dfs -touchz /tmp/abf
> 19/02/12 16:35:17 INFO retry.RetryInvocationHandler: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category WRITE is not supported in state observer. Visit 
> https://s.apache.org/sbnn-error
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1987)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1424)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:762)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:458)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:918)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:853)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2782)
> , while invoking $Proxy4.create over 
> [weichiu-sbsr-1.gce.cloudera.com/172.31.121.145:8020,weichiu-sbsr-2.gce.cloudera.com/172.31.121.140:8020].
>  Trying to failover immediately.
> {noformat}
> This is unlike the case when the first NameNode is the Standby, where this 
> StandbyException is suppressed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14271) [SBN read] StandbyException is logged if Observer is the first NameNode

2022-03-10 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-14271:
---
Attachment: image-2022-03-11-14-54-49-806.png

> [SBN read] StandbyException is logged if Observer is the first NameNode
> ---
>
> Key: HDFS-14271
> URL: https://issues.apache.org/jira/browse/HDFS-14271
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.0
>Reporter: Wei-Chiu Chuang
>Assignee: Shen Yinjie
>Priority: Minor
>  Labels: multi-sbnn
> Attachments: HDFS-14271_1.patch, image-2022-03-11-14-54-49-806.png
>
>
> If I transition the first NameNode into Observer state, and then I create a 
> file from command line, it prints the following StandbyException log message, 
> as if the command failed. But it actually completed successfully:
> {noformat}
> [root@weichiu-sbsr-1 ~]# hdfs dfs -touchz /tmp/abf
> 19/02/12 16:35:17 INFO retry.RetryInvocationHandler: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category WRITE is not supported in state observer. Visit 
> https://s.apache.org/sbnn-error
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1987)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1424)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:762)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:458)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:918)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:853)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2782)
> , while invoking $Proxy4.create over 
> [weichiu-sbsr-1.gce.cloudera.com/172.31.121.145:8020,weichiu-sbsr-2.gce.cloudera.com/172.31.121.140:8020].
>  Trying to failover immediately.
> {noformat}
> This is unlike the case when the first NameNode is the Standby, where this 
> StandbyException is suppressed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16498) Fix NPE for checkBlockReportLease

2022-03-09 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503977#comment-17503977
 ] 

tomscut commented on HDFS-16498:


[~jianghuazhu] I agree with you; it would be more appropriate to use the WARN 
level here. I will update it. Thanks.

> Fix NPE for checkBlockReportLease
> -
>
> Key: HDFS-16498
> URL: https://issues.apache.org/jira/browse/HDFS-16498
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-03-09-20-35-22-028.png, screenshot-1.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> During a NameNode restart, a DataNode that has not yet registered can 
> trigger a full block report (FBR), which causes an NPE.
> !image-2022-03-09-20-35-22-028.png|width=871,height=158!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16499) [SPS]: Should not start indefinitely while another SPS process is running

2022-03-09 Thread tomscut (Jira)
tomscut created HDFS-16499:
--

 Summary: [SPS]: Should not start indefinitely while another SPS 
process is running
 Key: HDFS-16499
 URL: https://issues.apache.org/jira/browse/HDFS-16499
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: tomscut
Assignee: tomscut


Normally, only one SPS process can run at a time. Currently, when one process 
is already running and another is started, the second one retries 
indefinitely. I think, in this case, it should exit immediately.
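A self-contained illustration of exiting immediately instead of retrying (the 
file lock below is only a stand-in for demonstration; the real external SPS 
coordinates through the NameNode):
{code:java}
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

// Demo only: if another instance already holds the lock, exit at once
// rather than retrying indefinitely.
public class SingleSpsInstanceDemo {
  public static void main(String[] args) throws Exception {
    RandomAccessFile lockFile = new RandomAccessFile("/tmp/sps.lock", "rw");
    FileLock lock = lockFile.getChannel().tryLock();
    if (lock == null) {
      System.err.println("Another SPS process is already running; exiting.");
      System.exit(1);
    }
    // ... start the SPS service while holding the lock ...
  }
}
{code}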



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16498) Fix NPE for checkBlockReportLease

2022-03-09 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16498:
---
Description: 
During a NameNode restart, a DataNode that has not yet registered can trigger 
a full block report (FBR), which causes an NPE.

!image-2022-03-09-20-35-22-028.png|width=871,height=158!

  was:During a NameNode restart, a DataNode that has not yet registered can 
trigger a full block report (FBR), which causes an NPE.


> Fix NPE for checkBlockReportLease
> -
>
> Key: HDFS-16498
> URL: https://issues.apache.org/jira/browse/HDFS-16498
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
> Attachments: image-2022-03-09-20-35-22-028.png
>
>
> During a NameNode restart, a DataNode that has not yet registered can 
> trigger a full block report (FBR), which causes an NPE.
> !image-2022-03-09-20-35-22-028.png|width=871,height=158!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16498) Fix NPE for checkBlockReportLease

2022-03-09 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16498:
---
Attachment: image-2022-03-09-20-35-22-028.png

> Fix NPE for checkBlockReportLease
> -
>
> Key: HDFS-16498
> URL: https://issues.apache.org/jira/browse/HDFS-16498
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
> Attachments: image-2022-03-09-20-35-22-028.png
>
>
> During a NameNode restart, a DataNode that has not yet registered can 
> trigger a full block report (FBR), which causes an NPE.
> !image-2022-03-09-20-35-22-028.png|width=871,height=158!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16498) Fix NPE for checkBlockReportLease

2022-03-09 Thread tomscut (Jira)
tomscut created HDFS-16498:
--

 Summary: Fix NPE for checkBlockReportLease
 Key: HDFS-16498
 URL: https://issues.apache.org/jira/browse/HDFS-16498
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: tomscut
Assignee: tomscut


During a NameNode restart, a DataNode that has not yet registered can trigger 
a full block report (FBR), which causes an NPE.
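The guard presumably needs roughly this shape (a sketch inferred from the 
description, not the committed patch; the surrounding fields and the exact 
method signature are assumptions):
{code:java}
// Sketch: during a NameNode restart, an FBR from a DataNode that has not
// re-registered makes getDatanode() return null, so bail out instead of
// dereferencing the null descriptor (the source of the NPE).
boolean checkBlockReportLease(BlockReportContext context, DatanodeID nodeID)
    throws UnregisteredNodeException {
  if (context == null) {
    return true;
  }
  DatanodeDescriptor node = datanodeManager.getDatanode(nodeID);
  if (node == null) {
    throw new UnregisteredNodeException(nodeID, null);
  }
  return blockReportLeaseManager.checkLease(node,
      Time.monotonicNow(), context.getLeaseId());
}
{code}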



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16488) [SPS]: Expose metrics to JMX for external SPS

2022-02-26 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16488:
---
Attachment: image-2022-02-26-22-15-25-543.png

> [SPS]: Expose metrics to JMX for external SPS
> -
>
> Key: HDFS-16488
> URL: https://issues.apache.org/jira/browse/HDFS-16488
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-02-26-22-15-25-543.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, external SPS has no monitoring metrics. We do not know how many 
> blocks are waiting to be processed, how many blocks are waiting to be 
> retried, and how many blocks have been migrated.
> We can expose these metrics in JMX for easy collection and display by 
> monitoring systems.
> !image-2022-02-26-22-15-09-432.png|width=593,height=160!
> For example, in our cluster, we exposed these metrics to JMX, collected 
> them with JMX-Exporter, combined them with Prometheus, and finally 
> displayed them in Grafana.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16488) [SPS]: Expose metrics to JMX for external SPS

2022-02-26 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16488:
---
Description: 
Currently, external SPS has no monitoring metrics. We do not know how many 
blocks are waiting to be processed, how many blocks are waiting to be retried, 
and how many blocks have been migrated.

We can expose these metrics in JMX for easy collection and display by 
monitoring systems.

!image-2022-02-26-22-15-09-432.png|width=593,height=160!

For example, in our cluster, we exposed these metrics to JMX, collected them 
with JMX-Exporter, combined them with Prometheus, and finally displayed them 
in Grafana.

  was:
Currently, external SPS has no monitoring metrics. We do not know how many 
blocks are waiting to be processed, how many blocks are waiting to be retried, 
and how many blocks have been migrated.

We can expose these metrics in JMX for easy collection and display by 
monitoring systems.

For example, in our cluster, we exposed these metrics to JMX, collected them 
with JMX-Exporter, combined them with Prometheus, and finally displayed them 
in Grafana.


> [SPS]: Expose metrics to JMX for external SPS
> -
>
> Key: HDFS-16488
> URL: https://issues.apache.org/jira/browse/HDFS-16488
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-02-26-22-15-25-543.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, external SPS has no monitoring metrics. We do not know how many 
> blocks are waiting to be processed, how many blocks are waiting to be 
> retried, and how many blocks have been migrated.
> We can expose these metrics in JMX for easy collection and display by 
> monitoring systems.
> !image-2022-02-26-22-15-09-432.png|width=593,height=160!
> For example, in our cluster, we exposed these metrics to JMX, collected 
> them with JMX-Exporter, combined them with Prometheus, and finally 
> displayed them in Grafana.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16488) [SPS]: Expose metrics to JMX for external SPS

2022-02-26 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16488:
---
Description: 
Currently, external SPS has no monitoring metrics. We do not know how many 
blocks are waiting to be processed, how many blocks are waiting to be retried, 
and how many blocks have been migrated.

We can expose these metrics in JMX for easy collection and display by 
monitoring systems.

!image-2022-02-26-22-15-25-543.png|width=631,height=170!

For example, in our cluster, we exposed these metrics to JMX, collected them 
with JMX-Exporter, combined them with Prometheus, and finally displayed them 
in Grafana.

  was:
Currently, external SPS has no monitoring metrics. We do not know how many 
blocks are waiting to be processed, how many blocks are waiting to be retried, 
and how many blocks have been migrated.

We can expose these metrics in JMX for easy collection and display by 
monitoring systems.

!image-2022-02-26-22-15-09-432.png|width=593,height=160!

For example, in our cluster, we exposed these metrics to JMX, collected them 
with JMX-Exporter, combined them with Prometheus, and finally displayed them 
in Grafana.


> [SPS]: Expose metrics to JMX for external SPS
> -
>
> Key: HDFS-16488
> URL: https://issues.apache.org/jira/browse/HDFS-16488
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-02-26-22-15-25-543.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, external SPS has no monitoring metrics. We do not know how many 
> blocks are waiting to be processed, how many blocks are waiting to be 
> retried, and how many blocks have been migrated.
> We can expose these metrics in JMX for easy collection and display by 
> monitoring systems.
> !image-2022-02-26-22-15-25-543.png|width=631,height=170!
> For example, in our cluster, we exposed these metrics to JMX, collected 
> them with JMX-Exporter, combined them with Prometheus, and finally 
> displayed them in Grafana.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16488) [SPS]: Expose metrics to JMX for external SPS

2022-02-26 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16488:
---
Attachment: (was: image-2022-02-26-22-15-09-432.png)

> [SPS]: Expose metrics to JMX for external SPS
> -
>
> Key: HDFS-16488
> URL: https://issues.apache.org/jira/browse/HDFS-16488
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-02-26-22-15-25-543.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, external SPS has no monitoring metrics. We do not know how many 
> blocks are waiting to be processed, how many blocks are waiting to be 
> retried, and how many blocks have been migrated.
> We can expose these metrics in JMX for easy collection and display by 
> monitoring systems.
> !image-2022-02-26-22-15-25-543.png|width=631,height=170!
> For example, in our cluster, we exposed these metrics to JMX, collected 
> them with JMX-Exporter, combined them with Prometheus, and finally 
> displayed them in Grafana.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16488) [SPS]: Expose metrics to JMX for external SPS

2022-02-26 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16488:
---
Attachment: image-2022-02-26-22-15-09-432.png

> [SPS]: Expose metrics to JMX for external SPS
> -
>
> Key: HDFS-16488
> URL: https://issues.apache.org/jira/browse/HDFS-16488
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-02-26-22-15-25-543.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, external SPS has no monitoring metrics. We do not know how many 
> blocks are waiting to be processed, how many blocks are waiting to be 
> retried, and how many blocks have been migrated.
> We can expose these metrics in JMX for easy collection and display by 
> monitoring systems.
> !image-2022-02-26-22-15-09-432.png|width=593,height=160!
> For example, in our cluster, we exposed these metrics to JMX, collected them 
> with JMX-Exporter and Prometheus, and finally displayed them in Grafana.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16488) [SPS]: Expose metrics to JMX for external SPS

2022-02-26 Thread tomscut (Jira)
tomscut created HDFS-16488:
--

 Summary: [SPS]: Expose metrics to JMX for external SPS
 Key: HDFS-16488
 URL: https://issues.apache.org/jira/browse/HDFS-16488
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: tomscut
Assignee: tomscut


Currently, external SPS has no monitoring metrics. We do not know how many 
blocks are waiting to be processed, how many blocks are waiting to be retried, 
and how many blocks have been migrated.

We can expose these metrics in JMX for easy collection and display by 
monitoring systems.

For example, in our cluster, we exposed these metrics to JMX, collected them 
with JMX-Exporter and Prometheus, and finally displayed them in Grafana.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16397) Reconfig slow disk parameters for datanode

2022-02-25 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498392#comment-17498392
 ] 

tomscut commented on HDFS-16397:


Thanks, [~tasanuma].

> Reconfig slow disk parameters for datanode
> --
>
> Key: HDFS-16397
> URL: https://issues.apache.org/jira/browse/HDFS-16397
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.3
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> In large clusters, a rolling restart of datanodes takes a long time. We can 
> make the slow-peer and slow-disk parameters of the datanode reconfigurable to 
> facilitate cluster operation and maintenance.
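
As a rough illustration of the mechanism involved, below is a minimal sketch 
built on Hadoop's ReconfigurableBase hook, which is what dfsadmin -reconfig 
drives. The class, property key, and default value are assumptions for this 
example (the real keys live in DFSConfigKeys), and the hook signatures are as 
found in recent Hadoop 3.x releases, so details may differ from the patch.

{code}
import java.util.Arrays;
import java.util.Collection;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.ReconfigurableBase;
import org.apache.hadoop.conf.ReconfigurationException;

// Illustrative component whose slow-disk threshold can change at runtime.
public class SlowDiskConfigDemo extends ReconfigurableBase {
  static final String SLOW_DISK_THRESHOLD_KEY = "demo.slow.disk.threshold.ms";
  private volatile long slowDiskThresholdMs = 300;

  public SlowDiskConfigDemo(Configuration conf) {
    super(conf);
  }

  @Override
  public Collection<String> getReconfigurableProperties() {
    return Arrays.asList(SLOW_DISK_THRESHOLD_KEY);
  }

  @Override
  protected Configuration getNewConf() {
    return new Configuration();
  }

  @Override
  protected String reconfigurePropertyImpl(String property, String newVal)
      throws ReconfigurationException {
    if (SLOW_DISK_THRESHOLD_KEY.equals(property)) {
      long v = (newVal == null) ? 300L : Long.parseLong(newVal);
      slowDiskThresholdMs = v;  // takes effect without restarting the process
      return String.valueOf(v);
    }
    throw new ReconfigurationException(property, newVal,
        getConf().get(property));
  }
}
{code}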



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16371) Exclude slow disks when choosing volume

2022-02-24 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497914#comment-17497914
 ] 

tomscut commented on HDFS-16371:


Hi [~tasanuma], I cherry-picked this PR to branch-3.3. Please take a look at 
[#4031|https://github.com/apache/hadoop/pull/4031]

> Exclude slow disks when choosing volume
> ---
>
> Key: HDFS-16371
> URL: https://issues.apache.org/jira/browse/HDFS-16371
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Currently, the datanode can detect slow disks. See HDFS-11461.
> And after HDFS-16311, the slow disk information we collect is more accurate.
> So we can exclude these slow disks according to some rules when choosing a 
> volume. This prevents a few slow disks from affecting the throughput of the 
> whole datanode.
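
To make the exclusion rule concrete, here is a minimal, self-contained sketch 
of the idea; the generic types and the fallback behaviour are assumptions for 
this example, not the actual patch.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Filter out volumes on detected slow disks before the normal
// volume-choosing policy runs over the remaining candidates.
final class SlowDiskFilter {
  static <V> List<V> excludeSlow(List<V> candidates, Set<V> onSlowDisks) {
    List<V> healthy = new ArrayList<>();
    for (V v : candidates) {
      if (!onSlowDisks.contains(v)) {
        healthy.add(v);
      }
    }
    // If every volume looks slow, fall back to the full list so that
    // writes never fail solely because of the slow-disk tracker.
    return healthy.isEmpty() ? candidates : healthy;
  }
}
{code}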



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15854) Make some parameters configurable for SlowDiskTracker and SlowPeerTracker

2022-02-24 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497853#comment-17497853
 ] 

tomscut commented on HDFS-15854:


Thank you, [~tasanuma].

> Make some parameters configurable for SlowDiskTracker and SlowPeerTracker
> -
>
> Key: HDFS-15854
> URL: https://issues.apache.org/jira/browse/HDFS-15854
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.3
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Make some parameters configurable for SlowDiskTracker and SlowPeerTracker. 
> Related to https://issues.apache.org/jira/browse/HDFS-15814.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16460) [SPS]: Handle failure retries for moving tasks

2022-02-22 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17496056#comment-17496056
 ] 

tomscut edited comment on HDFS-16460 at 2/22/22, 12:14 PM:
---

The related PR is [#4001.|https://github.com/apache/hadoop/pull/4001]


was (Author: tomscut):
The related PR is [#4001|https://github.com/apache/hadoop/pull/4001]

> [SPS]: Handle failure retries for moving tasks
> --
>
> Key: HDFS-16460
> URL: https://issues.apache.org/jira/browse/HDFS-16460
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>
> Handle failure retries for moving tasks. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16460) [SPS]: Handle failure retries for moving tasks

2022-02-22 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17496056#comment-17496056
 ] 

tomscut commented on HDFS-16460:


The related PR is [#4001|https://github.com/apache/hadoop/pull/4001]

> [SPS]: Handle failure retries for moving tasks
> --
>
> Key: HDFS-16460
> URL: https://issues.apache.org/jira/browse/HDFS-16460
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>
> Handle failure retries for moving tasks. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16477) [SPS]: Add metric PendingSPSPaths for getting the number of paths to be processed by SPS

2022-02-22 Thread tomscut (Jira)
tomscut created HDFS-16477:
--

 Summary: [SPS]: Add metric PendingSPSPaths for getting the number 
of paths to be processed by SPS
 Key: HDFS-16477
 URL: https://issues.apache.org/jira/browse/HDFS-16477
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: tomscut
Assignee: tomscut


Currently, we have no way to know how many paths are waiting to be processed 
when using the SPS feature. We should add a metric, PendingSPSPaths, to the 
NameNode to report the number of paths waiting to be processed by SPS.
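
A minimal sketch of one way to expose such a gauge through metrics2 follows; 
the class name, the backing queue, and the registration prefix are assumptions 
for this example, not the actual patch.

{code}
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;

// Illustrative metrics2 source whose gauge reports the SPS path-queue length.
@Metrics(name = "SPSPathTrackerDemo", about = "Pending SPS path metrics",
    context = "dfs")
public class SPSPathTrackerDemo {
  private final ConcurrentLinkedQueue<Long> pathIds =
      new ConcurrentLinkedQueue<>();

  // A method-level @Metric turns this getter into a gauge named
  // PendingSPSPaths, visible over JMX once the source is registered.
  @Metric("Number of paths waiting to be processed by SPS")
  public int getPendingSPSPaths() {
    return pathIds.size();
  }

  public void addPath(long inodeId) {
    pathIds.add(inodeId);
  }

  public static void main(String[] args) {
    DefaultMetricsSystem.initialize("NameNode");
    SPSPathTrackerDemo src = DefaultMetricsSystem.instance().register(
        "SPSPathTrackerDemo", "demo source", new SPSPathTrackerDemo());
    src.addPath(16386L);  // the gauge now reads 1
  }
}
{code}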



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16460) [SPS]: Handle failure retries for moving tasks

2022-02-19 Thread tomscut (Jira)
tomscut created HDFS-16460:
--

 Summary: [SPS]: Handle failure retries for moving tasks
 Key: HDFS-16460
 URL: https://issues.apache.org/jira/browse/HDFS-16460
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: tomscut
Assignee: tomscut


Handle failure retries for moving tasks. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16458) [SPS]: Fix bug for unit test of reconfiguring SPS mode

2022-02-18 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494472#comment-17494472
 ] 

tomscut commented on HDFS-16458:


Hi [~rakeshr], [~umamaheswararao], PTAL. Thanks.

> [SPS]: Fix bug for unit test of reconfiguring SPS mode
> --
>
> Key: HDFS-16458
> URL: https://issues.apache.org/jira/browse/HDFS-16458
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>
> In TestNameNodeReconfigure#verifySPSEnabled, the assertEquals compares 
> {*}isSPSRunning{*} with itself, so the check can never fail.
> In addition, after an *internal SPS* has been removed, the *spsService 
> daemon* will not start within StoragePolicySatisfyManager. I think the 
> relevant code can be removed to simplify things.
> IMO, after reconfiguring the SPS mode, we just need to confirm whether the 
> mode is correct and whether spsManager is NULL.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16458) [SPS]: Fix bug for unit test of reconfiguring SPS mode

2022-02-18 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-16458:
---
Description: 
In TestNameNodeReconfigure#verifySPSEnabled, the assertEquals compares 
{*}isSPSRunning{*} with itself, so the check can never fail.

In addition, after an *internal SPS* has been removed, the *spsService daemon* 
will not start within StoragePolicySatisfyManager. I think the relevant code 
can be removed to simplify things.

IMO, after reconfiguring the SPS mode, we just need to confirm whether the 
mode is correct and whether spsManager is NULL.

  was:
In TestNameNodeReconfigure#verifySPSEnabled, the assertEquals compares 
isSPSRunning with itself, so the check can never fail.

In addition, after an *internal SPS* has been removed, the *spsService daemon* 
will not start within StoragePolicySatisfyManager. I think the relevant code 
can be removed to simplify things.

IMO, after reconfiguring the SPS mode, we just need to confirm whether the 
mode is correct and whether spsManager is NULL.


> [SPS]: Fix bug for unit test of reconfiguring SPS mode
> --
>
> Key: HDFS-16458
> URL: https://issues.apache.org/jira/browse/HDFS-16458
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>
> In TestNameNodeReconfigure#verifySPSEnabled, the assertEquals compares 
> {*}isSPSRunning{*} with itself, so the check can never fail.
> In addition, after an *internal SPS* has been removed, the *spsService 
> daemon* will not start within StoragePolicySatisfyManager. I think the 
> relevant code can be removed to simplify things.
> IMO, after reconfiguring the SPS mode, we just need to confirm whether the 
> mode is correct and whether spsManager is NULL.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16458) [SPS]: Fix bug for unit test of reconfiguring SPS mode

2022-02-18 Thread tomscut (Jira)
tomscut created HDFS-16458:
--

 Summary: [SPS]: Fix bug for unit test of reconfiguring SPS mode
 Key: HDFS-16458
 URL: https://issues.apache.org/jira/browse/HDFS-16458
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: tomscut
Assignee: tomscut


In TestNameNodeReconfigure#verifySPSEnabled, the assertEquals compares 
isSPSRunning with itself, so the check can never fail.

In addition, after an *internal SPS* has been removed, the *spsService daemon* 
will not start within StoragePolicySatisfyManager. I think the relevant code 
can be removed to simplify things.

IMO, after reconfiguring the SPS mode, we just need to confirm whether the 
mode is correct and whether spsManager is NULL.
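
To make the bug pattern concrete, here is a distilled, self-contained sketch; 
the identifiers are illustrative and do not reproduce the actual test code.

{code}
import static org.junit.Assert.assertEquals;

import org.junit.Test;

// Distilled illustration of a vacuous assertion versus a meaningful one.
public class VacuousAssertDemo {
  enum Mode { NONE, EXTERNAL }

  private final Mode reconfiguredMode = Mode.EXTERNAL;  // stands in for SPS state

  @Test
  public void brokenCheck() {
    boolean isSPSRunning = (reconfiguredMode == Mode.EXTERNAL);
    // Comparing a value with itself can never fail, so this verifies nothing:
    assertEquals(isSPSRunning, isSPSRunning);
  }

  @Test
  public void meaningfulCheck() {
    // Compare the observed state against an explicit expectation instead.
    assertEquals(Mode.EXTERNAL, reconfiguredMode);
  }
}
{code}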



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12228) [SPS]: Add storage policy satisfier related metrics

2022-02-17 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494344#comment-17494344
 ] 

tomscut commented on HDFS-12228:


Hi [~ajithshetty] [~rakeshr], what is the current status of this? Is it still 
in progress?

> [SPS]: Add storage policy satisfier related metrics
> ---
>
> Key: HDFS-12228
> URL: https://issues.apache.org/jira/browse/HDFS-12228
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode, namenode
>Reporter: Rakesh Radhakrishnan
>Assignee: Ajith S
>Priority: Major
>
> This jira is to discuss and implement the metrics needed for the SPS feature.
> Below are a few metrics:
> # count of {{inprogress}} block movements
> # count of {{successful}} block movements
> # count of {{failed}} block movements
> We need to analyse and add more.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.

2022-02-16 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-15118:
---
Labels:   (was: Read SBN)

> [SBN Read] Slow clients when Observer reads are enabled but there are no 
> Observers on the cluster.
> --
>
> Key: HDFS-15118
> URL: https://issues.apache.org/jira/browse/HDFS-15118
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-15118.001.patch, HDFS-15118.002.patch
>
>
> We see substantial degradation in the performance of HDFS clients when 
> Observer reads are enabled via {{ObserverReadProxyProvider}} but there are no 
> ObserverNodes on the cluster.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.

2022-02-16 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-15118:
---
Labels: Read SBN  (was: )

> [SBN Read] Slow clients when Observer reads are enabled but there are no 
> Observers on the cluster.
> --
>
> Key: HDFS-15118
> URL: https://issues.apache.org/jira/browse/HDFS-15118
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
>  Labels: Read, SBN
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-15118.001.patch, HDFS-15118.002.patch
>
>
> We see substantial degradation in the performance of HDFS clients when 
> Observer reads are enabled via {{ObserverReadProxyProvider}} but there are no 
> ObserverNodes on the cluster.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16446) Consider ioutils of disk when choosing volume

2022-02-04 Thread tomscut (Jira)
tomscut created HDFS-16446:
--

 Summary: Consider ioutils of disk when choosing volume
 Key: HDFS-16446
 URL: https://issues.apache.org/jira/browse/HDFS-16446
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: tomscut
Assignee: tomscut
 Attachments: image-2022-02-05-09-50-12-241.png

Consider the I/O utilization (ioutil) of each disk when choosing a volume.

The principle is as follows:

!image-2022-02-05-09-50-12-241.png|width=309,height=159!
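
In the spirit of that principle, here is a minimal sketch of an 
ioutil-weighted choice; the Vol type and the source of the utilization numbers 
are assumptions for this example, not the actual proposal.

{code}
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

final class IoUtilAwareChooser {
  static final class Vol {
    final String path;
    final double ioUtil;  // recent utilization: 0.0 (idle) .. 1.0 (saturated)
    Vol(String path, double ioUtil) {
      this.path = path;
      this.ioUtil = ioUtil;
    }
  }

  // Weighted random choice: a volume's chance is proportional to its idle
  // capacity (1 - ioUtil), so busy disks receive fewer new replicas.
  static Vol choose(List<Vol> volumes) {
    double total = 0;
    for (Vol v : volumes) {
      total += (1.0 - v.ioUtil);
    }
    if (total <= 0) {  // every disk saturated: fall back to a uniform choice
      return volumes.get(ThreadLocalRandom.current().nextInt(volumes.size()));
    }
    double r = ThreadLocalRandom.current().nextDouble(total);
    for (Vol v : volumes) {
      r -= (1.0 - v.ioUtil);
      if (r <= 0) {
        return v;
      }
    }
    return volumes.get(volumes.size() - 1);  // guard against rounding drift
  }
}
{code}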



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13671) Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet

2022-01-28 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17484045#comment-17484045
 ] 

tomscut commented on HDFS-13671:


We introduced this patch to branch-3.1.0, and it is stable for replicated 
data. For EC data, however, GC performance is poor unless the GC parameters 
are adjusted. We changed the GC from CMS to G1 and strictly limited 
G1MaxNewSizePercent and MaxGCPauseMillis, after which GC performance improved 
to an acceptable level. Even so, a mixed GC could take 10 seconds or more, 
although one was only triggered every 7 days or so.
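
For context, the shape of the tuning looked roughly like the following; the 
values shown are placeholders to be tuned per cluster and heap size, and 
G1MaxNewSizePercent is an experimental JVM flag that requires the unlock 
option.

{code}
# hadoop-env.sh -- illustrative values only, tune per cluster.
export HADOOP_NAMENODE_OPTS="-XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -XX:+UnlockExperimentalVMOptions \
  -XX:G1MaxNewSizePercent=20 ${HADOOP_NAMENODE_OPTS}"
{code}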

If anyone else uses this patch with EC data, we look forward to comparing 
notes with you. Thanks.

BTW, if there is a need to submit a related PR to branch-3.1, I am happy to do 
that.

> Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet
> --
>
> Key: HDFS-13671
> URL: https://issues.apache.org/jira/browse/HDFS-13671
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.0.3
>Reporter: Yiqun Lin
>Assignee: Haibin Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.3.2
>
> Attachments: HDFS-13671-001.patch, image-2021-06-10-19-28-18-373.png, 
> image-2021-06-10-19-28-58-359.png, image-2021-06-18-15-46-46-052.png, 
> image-2021-06-18-15-47-04-037.png
>
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> NameNode hung when deleting large files/blocks. The stack info:
> {code}
> "IPC Server handler 4 on 8020" #87 daemon prio=5 os_prio=0 
> tid=0x7fb505b27800 nid=0x94c3 runnable [0x7fa861361000]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.hadoop.hdfs.util.FoldedTreeSet.compare(FoldedTreeSet.java:474)
>   at 
> org.apache.hadoop.hdfs.util.FoldedTreeSet.removeAndGet(FoldedTreeSet.java:849)
>   at 
> org.apache.hadoop.hdfs.util.FoldedTreeSet.remove(FoldedTreeSet.java:911)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.removeBlock(DatanodeStorageInfo.java:252)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:194)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:108)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlockFromMap(BlockManager.java:3813)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlock(BlockManager.java:3617)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.removeBlocks(FSNamesystem.java:4270)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:4244)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:4180)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:4164)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:871)
>   at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.delete(AuthorizationProviderProxyClientProtocol.java:311)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:625)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> {code}
> In the current deletion logic in the NameNode, there are mainly two steps:
> * Collect the INodes and all blocks to be deleted, then delete the INodes.
> * Remove the blocks chunk by chunk in a loop.
> Actually, the first step should be the more expensive operation and should 
> take more time. However, we now always see the NN hang during the 
> remove-block operation. 
> Looking into this, we introduced a new structure, {{FoldedTreeSet}}, to get 
> better performance when dealing with FBRs/IBRs. But compared with the 
> earlier implementation of the remove-block logic, {{FoldedTreeSet}} seems 
> slower, since it takes additional time to rebalance tree nodes. When there 
> are many blocks to be removed/deleted, it looks bad.
> For the get-type operations in {{DatanodeStorageInfo}}, we only provide 
> {{getBlockIterator}} to return a block iterator, and there is no other get 
> operation for a specified block. Do we still need to use {{FoldedTreeSet}} 
> in {{DatanodeStorageInfo}}? As we know, {{FoldedTreeSet}} benefits gets, not 
> updates. Maybe we can revert this to the earlier implementation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
