[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17650956#comment-17650956 ]

ASF GitHub Bot commented on HDFS-16689:
---

xkrogen merged PR #4744:
URL: https://github.com/apache/hadoop/pull/4744

> Standby NameNode crashes when transitioning to Active with in-progress tailer
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-16689
>                 URL: https://issues.apache.org/jira/browse/HDFS-16689
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The Standby NameNode crashes when transitioning to Active with an in-progress
> tailer, with an error like the one below:
> {code:java}
> Caused by: java.lang.IllegalStateException: Cannot start writing at txid X
> when there is a stream available for read: ByteStringEditLog[X, Y],
> ByteStringEditLog[X, 0]
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
>         ... 36 more
> {code}
> Tracing revealed a critical bug in *EditlogTailer#catchupDuringFailover()*
> when *DFS_HA_TAILEDITS_INPROGRESS_KEY* is true: *catchupDuringFailover()*
> tries to replay all missed edits from the JournalNodes with
> *onlyDurableTxns=true*, and it may be unable to replay any edits at all when
> some JournalNodes are abnormal.
> To reproduce, suppose:
> - There are 2 NameNodes, NN0 (Active) and NN1 (Standby), and 3 JournalNodes,
> JN0, JN1 and JN2.
> - NN0 tries to sync 3 edits starting at txid 3 to the JNs, but only
> successfully syncs them to JN1 and JN2; JN0 is abnormal (GC pause, bad
> network, or a restart).
> - NN1's lastAppliedTxId is 2, and at that moment we fail over from NN0 to NN1.
> - NN1 gets only two responses, from JN0 and JN1, when selecting input
> streams with *fromTxnId=3* and *onlyDurableTxns=true*; the reported txn
> counts are 0 and 3 respectively. JN2 is abnormal (GC pause, bad network, or
> a restart).
> - NN1 cannot replay any edits from *fromTxnId=3* because *maxAllowedTxns*
> is 0.
> So the Standby NameNode should call *catchupDuringFailover()* with
> *onlyDurableTxns=false*, so that it can replay all missed edits from the
> JournalNodes.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
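[Editor's note] The quorum arithmetic behind the reproduction above can be sketched as follows. This is an illustrative model only, not the actual Hadoop code: the method name `maxDurableTxns` and the treatment of non-responding JournalNodes as contributing 0 transactions are assumptions, loosely mirroring the majority logic used when selecting input streams with `onlyDurableTxns=true`.

```java
import java.util.Arrays;

public class DurableTxnSketch {
    /**
     * Hypothetical model: with onlyDurableTxns=true, only transactions
     * persisted on a majority of JournalNodes are considered readable.
     * Given the highest txn count reported by each responding JN (absent
     * JNs treated as 0), the durable count is the majority-th largest
     * response.
     */
    static int maxDurableTxns(int totalJns, int[] responses) {
        int majority = totalJns / 2 + 1;
        // Pad with zeros so non-responding JNs count as having no txns.
        int[] counts = Arrays.copyOf(responses, totalJns);
        Arrays.sort(counts); // ascending
        // Pick the majority-th largest element.
        return counts[totalJns - majority];
    }

    public static void main(String[] args) {
        // Scenario from the report: 3 JNs; JN0 answers with 0 txns, JN1
        // with 3 txns, JN2 does not respond. Majority = 2, so only 0 txns
        // are "durable" -> NN1 replays nothing and openForWrite() fails.
        System.out.println(maxDurableTxns(3, new int[]{0, 3})); // 0
        // Had JN2 also answered with 3 txns, all 3 would be durable.
        System.out.println(maxDurableTxns(3, new int[]{0, 3, 3})); // 3
    }
}
```

This is why the fix relaxes the catch-up path to `onlyDurableTxns=false`: during failover the new Active may only reach a quorum that underreports the durable count, even though the edits are safely persisted on a majority of JNs.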
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17650957#comment-17650957 ]

ASF GitHub Bot commented on HDFS-16689:
---

xkrogen commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1361784195

Merged, thanks @ZanderXu !
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17650955#comment-17650955 ]

ASF GitHub Bot commented on HDFS-16689:
---

xkrogen commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1361781237

The `TestLeaseRecovery2` failure is tracked in HDFS-16853. I re-triggered Jenkins to get a more recent build (since the past one is 3 weeks old) and it reports the same findings ([PR-4744 #20](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/20/)). Merging to trunk.
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17650104#comment-17650104 ]

ASF GitHub Bot commented on HDFS-16689:
---

hadoop-yetus commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1360928620

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 40s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 4 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 38m 41s | | trunk passed |
| +1 :green_heart: | compile | 1m 26s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | compile | 1m 20s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 1m 6s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 32s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 9s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 1m 28s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 3m 30s | | trunk passed |
| +1 :green_heart: | shadedclient | 22m 47s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 16s | | the patch passed |
| +1 :green_heart: | compile | 1m 17s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javac | 1m 17s | | the patch passed |
| +1 :green_heart: | compile | 1m 12s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 1m 12s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 52s | | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 260 unchanged - 1 fixed = 260 total (was 261) |
| +1 :green_heart: | mvnsite | 1m 23s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 49s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 1m 26s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 3m 13s | | the patch passed |
| +1 :green_heart: | shadedclient | 22m 21s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 302m 52s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/20/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 50s | | The patch does not generate ASF License warnings. |
| | | 408m 42s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/20/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4744 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux c23c319f91b7 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 0d25fbee414a5cf318a3b9b9c831f5ae0aaf7d18 |
| Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/20/testReport/ |
| Max. process+thread count | 3334 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642599#comment-17642599 ]

ASF GitHub Bot commented on HDFS-16689:
---

hadoop-yetus commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1335499241

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 36s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 4 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 38m 29s | | trunk passed |
| +1 :green_heart: | compile | 1m 45s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | compile | 1m 38s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 1m 27s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 49s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 24s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 1m 47s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 3m 38s | | trunk passed |
| +1 :green_heart: | shadedclient | 29m 11s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 24s | | the patch passed |
| +1 :green_heart: | compile | 1m 25s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javac | 1m 25s | | the patch passed |
| +1 :green_heart: | compile | 1m 22s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 1m 22s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 2s | | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 260 unchanged - 1 fixed = 260 total (was 261) |
| +1 :green_heart: | mvnsite | 1m 30s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 59s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 1m 35s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 3m 25s | | the patch passed |
| +1 :green_heart: | shadedclient | 22m 46s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 294m 53s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/19/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 10s | | The patch does not generate ASF License warnings. |
| | | 411m 40s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/19/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4744 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 0eb4f162b199 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 0d25fbee414a5cf318a3b9b9c831f5ae0aaf7d18 |
| Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/19/testReport/ |
| Max. process+thread count | 3172 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642291#comment-17642291 ]

ASF GitHub Bot commented on HDFS-16689:
---

hadoop-yetus commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1334790364

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 41s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 4 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 39m 47s | | trunk passed |
| +1 :green_heart: | compile | 1m 36s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | compile | 1m 27s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 1m 12s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 35s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 22s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 1m 51s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 3m 43s | | trunk passed |
| +1 :green_heart: | shadedclient | 24m 20s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 25s | | the patch passed |
| +1 :green_heart: | compile | 1m 31s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javac | 1m 31s | | the patch passed |
| +1 :green_heart: | compile | 1m 28s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 1m 28s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 1m 7s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/18/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 260 unchanged - 1 fixed = 261 total (was 261) |
| +1 :green_heart: | mvnsite | 1m 31s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 3s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 1m 35s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 3m 38s | | the patch passed |
| +1 :green_heart: | shadedclient | 22m 52s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 195m 33s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/18/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +0 :ok: | asflicense | 1m 3s | | ASF License check generated no output? |
| | | 308m 4s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.namenode.TestNNThroughputBenchmark |
| | hadoop.hdfs.TestLeaseRecovery2 |
| | hadoop.hdfs.server.namenode.TestQuotaWithStripedBlocksWithRandomECPolicy |
| | hadoop.hdfs.server.namenode.TestListOpenFiles |
| | hadoop.hdfs.server.namenode.TestAuditLogs |
| | hadoop.hdfs.server.namenode.TestReencryption |
| | hadoop.hdfs.server.namenode.TestNameNodeRecovery |
| | hadoop.hdfs.server.namenode.TestFSImage |
| | hadoop.hdfs.server.namenode.TestBlockPlacementPolicyRackFaultTolerant |
| | hadoop.hdfs.server.namenode.TestAddStripedBlockInFBR |
| | hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics |
| | hadoop.hdfs.server.namenode.TestNamenodeStorageDirectives |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/18/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4744 |
| Optional Tests | dupname asflicense compile |
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642156#comment-17642156 ]

ASF GitHub Bot commented on HDFS-16689:
---

hadoop-yetus commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1334405016

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 43s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 3 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 40m 19s | | trunk passed |
| +1 :green_heart: | compile | 1m 47s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | compile | 1m 35s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 1m 26s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 44s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 18s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 1m 46s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 3m 57s | | trunk passed |
| +1 :green_heart: | shadedclient | 23m 48s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 26s | | the patch passed |
| +1 :green_heart: | compile | 1m 28s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javac | 1m 28s | | the patch passed |
| +1 :green_heart: | compile | 1m 19s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 1m 19s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 0s | | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 242 unchanged - 1 fixed = 242 total (was 243) |
| +1 :green_heart: | mvnsite | 1m 26s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 55s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 1m 37s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 3m 38s | | the patch passed |
| +1 :green_heart: | shadedclient | 24m 51s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 286m 2s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/17/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 0s | | The patch does not generate ASF License warnings. |
| | | 400m 43s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 |
| | hadoop.hdfs.qjournal.TestNNWithQJM |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/17/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4744 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 9c5a638b22ed 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 9cc99df017ce220e466eccc8fd808dc7265cf65a |
| Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/17/testReport/ |
| Max. process+thread count | 3330 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs |
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641927#comment-17641927 ]

ASF GitHub Bot commented on HDFS-16689:
---

ZanderXu commented on code in PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#discussion_r1037125839

##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java:
##########

```java
@@ -1666,6 +1670,9 @@ synchronized void recoverUnclosedStreams() {
     } catch (IOException ex) {
       // All journals have failed, it is handled in logSync.
       // TODO: are we sure this is OK?
+      if (terminateOnFailure) {
+        throw ex;
```

Review Comment:
@xkrogen Yeah, after thinking about it more deeply, this is the better solution. Thanks so much, sir; I learned a lot from you. I have updated the code, please review it again. Thanks.
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17640156#comment-17640156 ] ASF GitHub Bot commented on HDFS-16689: --- xkrogen commented on code in PR #4744: URL: https://github.com/apache/hadoop/pull/4744#discussion_r1033837986 ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java: ## @@ -1666,6 +1670,9 @@ synchronized void recoverUnclosedStreams() { } catch (IOException ex) { // All journals have failed, it is handled in logSync. // TODO: are we sure this is OK? + if (terminateOnFailure) { +throw ex; Review Comment: In my [last comment about this](https://github.com/apache/hadoop/pull/4744#discussion_r1026931795) I was thinking that we should have code more similar to `FSEditLog#startLogSegment()` where we actually call `terminate()` as opposed to just throwing the exception. I was thinking that we would always at least throw the exception, but only _terminate_ when `terminateOnFailure` is true, like: ```java } catch (IOException ex) { if (terminateOnFailure) { final String msg = "Unable to recover log segments: too few journals successfully recovered."; LOG.error(msg, ex); synchronized (journalSetLock) { IOUtils.cleanupWithLogger(LOG, journalSet); } terminate(1, msg); } else { throw ex; } } ``` What do you think? I am open to other options but this is what I had in mind.
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17638562#comment-17638562 ] ASF GitHub Bot commented on HDFS-16689: --- hadoop-yetus commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1327146163 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 34s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 4 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 38m 57s | | trunk passed | | +1 :green_heart: | compile | 1m 37s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 1m 28s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 15s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 40s | | trunk passed | | +1 :green_heart: | javadoc | 1m 13s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 40s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 41s | | trunk passed | | +1 :green_heart: | shadedclient | 22m 58s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 20s | | the patch passed | | +1 :green_heart: | compile | 1m 24s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 1m 24s | | the patch passed | | +1 :green_heart: | compile | 1m 16s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 1m 16s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 59s | | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 284 unchanged - 1 fixed = 284 total (was 285) | | +1 :green_heart: | mvnsite | 1m 25s | | the patch passed | | +1 :green_heart: | javadoc | 0m 54s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 33s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 16s | | the patch passed | | +1 :green_heart: | shadedclient | 22m 43s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 296m 48s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/16/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 1m 3s | | The patch does not generate ASF License warnings. 
| | | | 405m 52s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/16/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/4744 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux c04bb4af926a 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / c5743d6e90f97eb798465f89f53f29372b657a3d | | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/16/testReport/ | | Max. process+thread count | 3328 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | |
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637612#comment-17637612 ] ASF GitHub Bot commented on HDFS-16689: --- hadoop-yetus commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1324674967 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 34s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 4 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 39m 4s | | trunk passed | | +1 :green_heart: | compile | 1m 37s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 1m 28s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 27s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 46s | | trunk passed | | +1 :green_heart: | javadoc | 1m 15s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 41s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 37s | | trunk passed | | +1 :green_heart: | shadedclient | 23m 19s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 24s | | the patch passed | | +1 :green_heart: | compile | 1m 25s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 1m 25s | | the patch passed | | +1 :green_heart: | compile | 1m 17s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 1m 17s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 1m 2s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/15/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 284 unchanged - 1 fixed = 285 total (was 285) | | +1 :green_heart: | mvnsite | 1m 25s | | the patch passed | | +1 :green_heart: | javadoc | 0m 53s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 30s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 19s | | the patch passed | | +1 :green_heart: | shadedclient | 22m 34s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 304m 26s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/15/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 1m 12s | | The patch does not generate ASF License warnings. 
| | | | 414m 21s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/15/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/4744 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux d2291b1e10fb 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 861206207bed3b1633531d7c3b570e95fa9e1c3a | | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Test Results |
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636770#comment-17636770 ] ASF GitHub Bot commented on HDFS-16689: --- xkrogen commented on code in PR #4744: URL: https://github.com/apache/hadoop/pull/4744#discussion_r1028281649 ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java: ## @@ -174,6 +175,11 @@ protected FSImage(Configuration conf, archivalManager = new NNStorageRetentionManager(conf, storage, editLog); FSImageFormatProtobuf.initParallelLoad(conf); } + + @VisibleForTesting + void setEditLog(FSEditLog editLog) { +this.editLog = editLog; + } Review Comment: Can we use `DFSTestUtil.setEditLogForTesting()` for this instead of adding a new method?
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636063#comment-17636063 ] ASF GitHub Bot commented on HDFS-16689: --- xkrogen commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1320596912 Sorry for the delay in getting back to you on this @ZanderXu . I think the current diff looks really good, just not sure about the exception handling.
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636057#comment-17636057 ] ASF GitHub Bot commented on HDFS-16689: --- xkrogen commented on code in PR #4744: URL: https://github.com/apache/hadoop/pull/4744#discussion_r1026900094 ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/EditLogTailer.java: ## @@ -283,7 +283,7 @@ public void stop() throws IOException { } @VisibleForTesting - FSEditLog getEditLog() { + public FSEditLog getEditLog() { Review Comment: why `public`? can we leave it package-private? it seems like it's not used anywhere AFAICT ## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/TestNNWithQJM.java: ## @@ -197,10 +197,9 @@ public void testMismatchedNNIsRejected() throws Exception { .manageNameDfsDirs(false).format(false).checkExitOnShutdown(false) .build(); fail("New NN with different namespace should have been rejected"); -} catch (ExitException ee) { +} catch (IOException ie) { GenericTestUtils.assertExceptionContains( - "Unable to start log segment 1: too few journals", ee); - assertTrue("Didn't terminate properly ", ExitUtil.terminateCalled()); + "recoverUnfinalizedSegments failed for too many journals", ie); Review Comment: I see you made this change by catching it in `FSNamesystem#loadFSImage()` but now I'm having second thoughts about it. The old behavior comes from this block in `FSEditLog#startLogSegment()`: https://github.com/apache/hadoop/blob/63db1a85e376c2266afdc62b9590e40acc98429c/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java#L1437-L1447 Here we catch the `IOException` _within FSEditLog_ itself and add some useful contextual information, plus we clean up the `journalSet`. Seems useful. 
Given that `recoverUnclosedStreams()` is called from a variety of places, it seems that we can't just copy this approach by adding the `terminate(1)` logic within `FSEditLog#recoverUnclosedStreams()`
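The terminate-versus-rethrow policy debated in the review comments above can be sketched in isolation. This is a hedged, self-contained illustration under my own names (`RecoveryPolicySketch`, `ProcessExiter`), not the actual `FSEditLog` implementation: the startup path wants a hard `terminate()`, while the failover/tailer path wants the `IOException` propagated to the caller.

```java
import java.io.IOException;

/**
 * Sketch of the control flow under discussion: on recovery failure,
 * either terminate the process (startup path) or rethrow to the caller
 * (failover path). ProcessExiter stands in for Hadoop's ExitUtil-style
 * terminate; recover() is a placeholder, not real FSEditLog code.
 */
public class RecoveryPolicySketch {
    interface ProcessExiter { void terminate(int status, String msg); }

    static void recoverUnclosedStreams(boolean terminateOnFailure,
                                       ProcessExiter exiter) throws IOException {
        try {
            recover(); // placeholder for journalSet.recoverUnfinalizedSegments()
        } catch (IOException ex) {
            if (terminateOnFailure) {
                String msg = "Unable to recover log segments: "
                    + "too few journals successfully recovered.";
                // Real code would also log and clean up the journal set here.
                exiter.terminate(1, msg);
            } else {
                throw ex; // let the caller (e.g. the edit log tailer) decide
            }
        }
    }

    static void recover() throws IOException {
        throw new IOException("quorum not reached"); // simulate failure
    }
}
```

A caller that owns process lifecycle (NameNode startup) passes `terminateOnFailure=true`; a caller that can retry or surface the error (the standby's tailer) passes `false`.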
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17625560#comment-17625560 ] ASF GitHub Bot commented on HDFS-16689: --- ashutoshcipher commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1294777181 Thanks @ZanderXu for involving me. I will look into it.
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17625515#comment-17625515 ] ASF GitHub Bot commented on HDFS-16689: --- ZanderXu commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1294684900 @xkrogen Sir, can you help do a final review of it? @ashutoshcipher @tomscut @ayushtkn @Hexiaoqiao Sir, can you help double-review it when you are available?
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17613351#comment-17613351 ] ASF GitHub Bot commented on HDFS-16689: --- hadoop-yetus commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1269507559 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 40s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 4 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 40m 0s | | trunk passed | | +1 :green_heart: | compile | 1m 42s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | compile | 1m 35s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | checkstyle | 1m 14s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 40s | | trunk passed | | +1 :green_heart: | javadoc | 1m 17s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 42s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 3m 56s | | trunk passed | | +1 :green_heart: | shadedclient | 24m 28s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 22s | | the patch passed | | +1 :green_heart: | compile | 1m 26s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javac | 1m 26s | | the patch passed | | +1 :green_heart: | compile | 1m 22s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | javac | 1m 22s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 2s | | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 284 unchanged - 1 fixed = 284 total (was 285) | | +1 :green_heart: | mvnsite | 1m 29s | | the patch passed | | +1 :green_heart: | javadoc | 0m 59s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 35s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 3m 46s | | the patch passed | | +1 :green_heart: | shadedclient | 24m 45s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 245m 21s | | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 57s | | The patch does not generate ASF License warnings. 
| | | | 359m 34s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/14/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/4744 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 942c597e4795 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 8cd18a15b892f6b949d24edfab335255a28f5eee | | Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/14/testReport/ | | Max. process+thread count | 3468 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/14/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605688#comment-17605688 ] ASF GitHub Bot commented on HDFS-16689: --- hadoop-yetus commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1249033813 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 41s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 4 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 38m 5s | | trunk passed | | +1 :green_heart: | compile | 1m 36s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | compile | 1m 35s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | checkstyle | 1m 15s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 40s | | trunk passed | | +1 :green_heart: | javadoc | 1m 24s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 37s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 3m 38s | | trunk passed | | +1 :green_heart: | shadedclient | 22m 47s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | -1 :x: | mvninstall | 1m 7s | [/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/12/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. 
| | -1 :x: | compile | 1m 12s | [/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/12/artifact/out/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04.txt) | hadoop-hdfs in the patch failed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04. | | -1 :x: | javac | 1m 12s | [/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/12/artifact/out/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04.txt) | hadoop-hdfs in the patch failed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04. | | -1 :x: | compile | 1m 8s | [/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/12/artifact/out/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07.txt) | hadoop-hdfs in the patch failed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07. | | -1 :x: | javac | 1m 8s | [/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/12/artifact/out/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07.txt) | hadoop-hdfs in the patch failed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07. | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. 
| | +1 :green_heart: | checkstyle | 0m 59s | | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 284 unchanged - 1 fixed = 284 total (was 285) | | -1 :x: | mvnsite | 1m 10s | [/patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/12/artifact/out/patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. | | +1 :green_heart: | javadoc | 0m 55s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 32s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | -1 :x: | spotbugs | 1m 15s | [/patch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/12/artifact/out/patch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt) |
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601221#comment-17601221 ] ASF GitHub Bot commented on HDFS-16689: --- hadoop-yetus commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1239162136 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 48s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 4 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 41m 58s | | trunk passed | | +1 :green_heart: | compile | 1m 36s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | compile | 1m 27s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | checkstyle | 1m 17s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 41s | | trunk passed | | +1 :green_heart: | javadoc | 1m 18s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 42s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 3m 43s | | trunk passed | | +1 :green_heart: | shadedclient | 26m 23s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 23s | | the patch passed | | +1 :green_heart: | compile | 1m 28s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javac | 1m 28s | | the patch passed | | +1 :green_heart: | compile | 1m 20s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | javac | 1m 20s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 1s | | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 284 unchanged - 1 fixed = 284 total (was 285) | | +1 :green_heart: | mvnsite | 1m 25s | | the patch passed | | +1 :green_heart: | javadoc | 0m 59s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 33s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 3m 31s | | the patch passed | | +1 :green_heart: | shadedclient | 26m 31s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 344m 50s | | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 56s | | The patch does not generate ASF License warnings. 
| | | | 464m 27s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/11/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/4744 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux db61cc147d3f 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 43c73b2081d6a34ca8b14eb86eb7e342b67e2c0a | | Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/11/testReport/ | | Max. process+thread count | 2242 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/11/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600875#comment-17600875 ] ASF GitHub Bot commented on HDFS-16689: --- hadoop-yetus commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1238381684 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 1m 41s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 4 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 39m 0s | | trunk passed | | +1 :green_heart: | compile | 1m 38s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | compile | 1m 34s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | checkstyle | 1m 18s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 47s | | trunk passed | | +1 :green_heart: | javadoc | 1m 21s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 37s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 3m 47s | | trunk passed | | +1 :green_heart: | shadedclient | 23m 39s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 27s | | the patch passed | | +1 :green_heart: | compile | 1m 29s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javac | 1m 29s | | the patch passed | | +1 :green_heart: | compile | 1m 20s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | javac | 1m 20s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 1m 3s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/9/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 285 unchanged - 1 fixed = 286 total (was 286) | | +1 :green_heart: | mvnsite | 1m 28s | | the patch passed | | +1 :green_heart: | javadoc | 0m 58s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 33s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 3m 33s | | the patch passed | | +1 :green_heart: | shadedclient | 22m 29s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 265m 32s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/9/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 1m 15s | | The patch does not generate ASF License warnings. 
| | | | 377m 14s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestReconstructStripedFile | | | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/9/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/4744 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux b1d08172cce1 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 58002d0e78874a22190a43904eefae665aa1d983 | | Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600872#comment-17600872 ] ASF GitHub Bot commented on HDFS-16689: --- hadoop-yetus commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1238374829 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 50s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 4 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 38m 59s | | trunk passed | | +1 :green_heart: | compile | 1m 47s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | compile | 1m 34s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | checkstyle | 1m 23s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 43s | | trunk passed | | +1 :green_heart: | javadoc | 1m 21s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 37s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 3m 43s | | trunk passed | | +1 :green_heart: | shadedclient | 23m 1s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 27s | | the patch passed | | +1 :green_heart: | compile | 1m 27s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javac | 1m 27s | | the patch passed | | +1 :green_heart: | compile | 1m 23s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | javac | 1m 23s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 1m 1s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/10/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 285 unchanged - 1 fixed = 286 total (was 286) | | +1 :green_heart: | mvnsite | 1m 27s | | the patch passed | | +1 :green_heart: | javadoc | 0m 59s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 33s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 3m 30s | | the patch passed | | +1 :green_heart: | shadedclient | 22m 22s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 262m 4s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/10/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 1m 19s | | The patch does not generate ASF License warnings. 
| | | | 372m 35s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestReconstructStripedFile | | | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/10/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/4744 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 4c4ce04c083f 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 58002d0e78874a22190a43904eefae665aa1d983 | | Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600710#comment-17600710 ] ASF GitHub Bot commented on HDFS-16689: --- ZanderXu commented on code in PR #4744: URL: https://github.com/apache/hadoop/pull/4744#discussion_r963522811 ## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestHAWithInProgressTail.java: ## @@ -0,0 +1,142 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.hadoop.hdfs.server.namenode; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.permission.FsPermission; +import org.apache.hadoop.hdfs.DFSConfigKeys; +import org.apache.hadoop.hdfs.HAUtil; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hdfs.qjournal.MiniJournalCluster; +import org.apache.hadoop.hdfs.qjournal.MiniQJMHACluster; +import org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager; +import org.apache.hadoop.hdfs.qjournal.client.SpyQJournalUtil; +import org.apache.hadoop.hdfs.server.blockmanagement.BlockManagerFaultInjector; +import org.junit.After; +import org.junit.Before; +import org.junit.Test; + +import java.io.IOException; + +import static org.apache.hadoop.hdfs.server.namenode.NameNodeAdapter.getFileInfo; +import static org.junit.Assert.assertNotNull; + +public class TestHAWithInProgressTail { + private MiniQJMHACluster qjmhaCluster; + private MiniDFSCluster cluster; + private MiniJournalCluster jnCluster; + private NameNode nn0; + private NameNode nn1; + + @Before + public void startUp() throws IOException { +Configuration conf = new Configuration(); +conf.setBoolean(DFSConfigKeys.DFS_HA_TAILEDITS_INPROGRESS_KEY, true); +conf.setInt(DFSConfigKeys.DFS_QJOURNAL_SELECT_INPUT_STREAMS_TIMEOUT_KEY, 500); +HAUtil.setAllowStandbyReads(conf, true); +qjmhaCluster = new MiniQJMHACluster.Builder(conf).build(); +cluster = qjmhaCluster.getDfsCluster(); +jnCluster = qjmhaCluster.getJournalCluster(); + +// Get NameNode from cluster to future manual control +nn0 = cluster.getNameNode(0); +nn1 = cluster.getNameNode(1); + } + + @After + public void tearDown() throws IOException { +if (qjmhaCluster != null) { + qjmhaCluster.shutdown(); +} + } + + + /** + * Test that Standby Node tails multiple segments while catching up + * during the transition to Active. 
+ */
+  @Test
+  public void testFailoverWithAbnormalJN() throws Exception {
+    cluster.transitionToActive(0);
+    cluster.waitActive(0);
+
+    BlockManagerFaultInjector.instance = new BlockManagerFaultInjector() {
+      @Override
+      public void mockJNStreams() throws IOException {
+        spyOnJASjournal();
+      }
+    };

Review Comment: @xkrogen Sir, I have updated this patch, please help me review it again. Thanks
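The test above hangs its instrumentation on a static fault-injector hook (`BlockManagerFaultInjector.instance`): production code calls a no-op method at a fixed point, and the test swaps in a subclass that does extra work there. A minimal dependency-free sketch of that pattern — class and method names here are illustrative stand-ins, not the actual Hadoop ones:

```java
import java.io.IOException;

// Test hook in the style of Hadoop's fault injectors: a no-op by default,
// replaceable by tests. (Names are illustrative, not the Hadoop ones.)
class FaultInjector {
    static volatile FaultInjector instance = new FaultInjector();

    void beforeJournalInit() throws IOException {
        // default: do nothing
    }
}

class JournalSide {
    static String initJournals() throws IOException {
        // The production path invokes the hook at the injection point.
        FaultInjector.instance.beforeJournalInit();
        return "journals-initialized";
    }
}

public class FaultInjectorDemo {
    public static void main(String[] args) throws IOException {
        final boolean[] hookRan = {false};
        // A test installs an override, e.g. to wrap journal streams in spies.
        FaultInjector.instance = new FaultInjector() {
            @Override
            void beforeJournalInit() {
                hookRan[0] = true;
            }
        };
        System.out.println(JournalSide.initJournals() + " hookRan=" + hookRan[0]);
        // prints: journals-initialized hookRan=true
    }
}
```

The drawback the reviewer pushes back on below is visible even in this sketch: the hook is global mutable state that the production class must know about, rather than something scoped to the object under test.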
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599196#comment-17599196 ] ASF GitHub Bot commented on HDFS-16689: --- xkrogen commented on code in PR #4744: URL: https://github.com/apache/hadoop/pull/4744#discussion_r961186697 ## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestHAWithInProgressTail.java: ## @@ -0,0 +1,142 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.hadoop.hdfs.server.namenode; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.permission.FsPermission; +import org.apache.hadoop.hdfs.DFSConfigKeys; +import org.apache.hadoop.hdfs.HAUtil; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hdfs.qjournal.MiniJournalCluster; +import org.apache.hadoop.hdfs.qjournal.MiniQJMHACluster; +import org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager; +import org.apache.hadoop.hdfs.qjournal.client.SpyQJournalUtil; +import org.apache.hadoop.hdfs.server.blockmanagement.BlockManagerFaultInjector; +import org.junit.After; +import org.junit.Before; +import org.junit.Test; + +import java.io.IOException; + +import static org.apache.hadoop.hdfs.server.namenode.NameNodeAdapter.getFileInfo; +import static org.junit.Assert.assertNotNull; + +public class TestHAWithInProgressTail { + private MiniQJMHACluster qjmhaCluster; + private MiniDFSCluster cluster; + private MiniJournalCluster jnCluster; + private NameNode nn0; + private NameNode nn1; + + @Before + public void startUp() throws IOException { +Configuration conf = new Configuration(); +conf.setBoolean(DFSConfigKeys.DFS_HA_TAILEDITS_INPROGRESS_KEY, true); +conf.setInt(DFSConfigKeys.DFS_QJOURNAL_SELECT_INPUT_STREAMS_TIMEOUT_KEY, 500); +HAUtil.setAllowStandbyReads(conf, true); +qjmhaCluster = new MiniQJMHACluster.Builder(conf).build(); +cluster = qjmhaCluster.getDfsCluster(); +jnCluster = qjmhaCluster.getJournalCluster(); + +// Get NameNode from cluster to future manual control +nn0 = cluster.getNameNode(0); +nn1 = cluster.getNameNode(1); + } + + @After + public void tearDown() throws IOException { +if (qjmhaCluster != null) { + qjmhaCluster.shutdown(); +} + } + + + /** + * Test that Standby Node tails multiple segments while catching up + * during the transition to Active. 
+ */
+  @Test
+  public void testFailoverWithAbnormalJN() throws Exception {
+    cluster.transitionToActive(0);
+    cluster.waitActive(0);
+
+    BlockManagerFaultInjector.instance = new BlockManagerFaultInjector() {
+      @Override
+      public void mockJNStreams() throws IOException {
+        spyOnJASjournal();
+      }
+    };

Review Comment: I see. Can we instead use `NameNodeAdapter#spyOnEditLog()` to create a spying edit log which overrides the `initJournalsForWrite()` call to perform the spy initialization?
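The suggestion above replaces the global fault injector with a spy on the edit log itself: intercept `initJournalsForWrite()` so the journal spies get installed at exactly the moment writing begins, with no hook left in production code. A dependency-free sketch of the override idea — the actual patch would use a Mockito spy obtained via `NameNodeAdapter#spyOnEditLog()`; the `EditLog` class below is a simplified stand-in, not the real `FSEditLog`:

```java
// Simplified stand-in for the edit log: the method the spy intercepts,
// and the production path that calls it.
class EditLog {
    void initJournalsForWrite() {
        // production behavior: open the journal managers for writing
    }

    String openForWrite() {
        initJournalsForWrite();
        return "open-for-write";
    }
}

public class SpyEditLogDemo {
    public static void main(String[] args) {
        StringBuilder trace = new StringBuilder();
        // Test-side "spy": a subclass hooks the initialization call to install
        // instrumentation (e.g. spies on the JournalNode streams), then
        // delegates to the real behavior.
        EditLog spy = new EditLog() {
            @Override
            void initJournalsForWrite() {
                trace.append("spy-installed;");
                super.initJournalsForWrite();
            }
        };
        trace.append(spy.openForWrite());
        System.out.println(trace); // prints: spy-installed;open-for-write
    }
}
```

Compared with the static-hook version, the interception here is scoped to the one spied instance, which is why the reviewer prefers it.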
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17586111#comment-17586111 ]

ASF GitHub Bot commented on HDFS-16689:
---

hadoop-yetus commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1229255104

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 41s |  | Docker mode activated. |
|  | _ Prechecks _ |  |  |  |
| +1 :green_heart: | dupname | 0m 0s |  | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s |  | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s |  | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s |  | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s |  | The patch appears to include 4 new or modified test files. |
|  | _ trunk Compile Tests _ |  |  |  |
| +1 :green_heart: | mvninstall | 38m 54s |  | trunk passed |
| +1 :green_heart: | compile | 1m 43s |  | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 1m 29s |  | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 23s |  | trunk passed |
| +1 :green_heart: | mvnsite | 1m 45s |  | trunk passed |
| +1 :green_heart: | javadoc | 1m 16s |  | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 39s |  | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 38s |  | trunk passed |
| +1 :green_heart: | shadedclient | 22m 52s |  | branch has no errors when building and testing our client artifacts. |
|  | _ Patch Compile Tests _ |  |  |  |
| +1 :green_heart: | mvninstall | 1m 23s |  | the patch passed |
| +1 :green_heart: | compile | 1m 33s |  | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 1m 33s |  | the patch passed |
| +1 :green_heart: | compile | 1m 24s |  | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 24s |  | the patch passed |
| +1 :green_heart: | blanks | 0m 0s |  | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 1m 2s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/8/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 261 unchanged - 1 fixed = 263 total (was 262) |
| +1 :green_heart: | mvnsite | 1m 27s |  | the patch passed |
| +1 :green_heart: | javadoc | 0m 53s |  | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 31s |  | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 22s |  | the patch passed |
| +1 :green_heart: | shadedclient | 23m 5s |  | patch has no errors when building and testing our client artifacts. |
|  | _ Other Tests _ |  |  |  |
| +1 :green_heart: | unit | 256m 25s |  | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 59s |  | The patch does not generate ASF License warnings. |
|  |  | 366m 30s |  |  |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/8/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4744 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux ebae8950abdc 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 823727664b94974f96c4e4bc12ffdff3462dc56f |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/8/testReport/ |
| Max. process+thread count | 2898 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output |  |
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17585769#comment-17585769 ]

ASF GitHub Bot commented on HDFS-16689:
---

ZanderXu commented on code in PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#discussion_r956583964

## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestHAWithInProgressTail.java: ##

@@ -0,0 +1,142 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hdfs.server.namenode;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.hdfs.DFSConfigKeys;
+import org.apache.hadoop.hdfs.HAUtil;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hdfs.qjournal.MiniJournalCluster;
+import org.apache.hadoop.hdfs.qjournal.MiniQJMHACluster;
+import org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager;
+import org.apache.hadoop.hdfs.qjournal.client.SpyQJournalUtil;
+import org.apache.hadoop.hdfs.server.blockmanagement.BlockManagerFaultInjector;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.IOException;
+
+import static org.apache.hadoop.hdfs.server.namenode.NameNodeAdapter.getFileInfo;
+import static org.junit.Assert.assertNotNull;
+
+public class TestHAWithInProgressTail {
+  private MiniQJMHACluster qjmhaCluster;
+  private MiniDFSCluster cluster;
+  private MiniJournalCluster jnCluster;
+  private NameNode nn0;
+  private NameNode nn1;
+
+  @Before
+  public void startUp() throws IOException {
+    Configuration conf = new Configuration();
+    conf.setBoolean(DFSConfigKeys.DFS_HA_TAILEDITS_INPROGRESS_KEY, true);
+    conf.setInt(DFSConfigKeys.DFS_QJOURNAL_SELECT_INPUT_STREAMS_TIMEOUT_KEY, 500);
+    HAUtil.setAllowStandbyReads(conf, true);
+    qjmhaCluster = new MiniQJMHACluster.Builder(conf).build();
+    cluster = qjmhaCluster.getDfsCluster();
+    jnCluster = qjmhaCluster.getJournalCluster();
+
+    // Get the NameNodes from the cluster for manual control later
+    nn0 = cluster.getNameNode(0);
+    nn1 = cluster.getNameNode(1);
+  }
+
+  @After
+  public void tearDown() throws IOException {
+    if (qjmhaCluster != null) {
+      qjmhaCluster.shutdown();
+    }
+  }
+
+  /**
+   * Test that the Standby NameNode tails multiple segments while catching up
+   * during the transition to Active.
+   */
+  @Test
+  public void testFailoverWithAbnormalJN() throws Exception {
+    cluster.transitionToActive(0);
+    cluster.waitActive(0);
+
+    BlockManagerFaultInjector.instance = new BlockManagerFaultInjector() {
+      @Override
+      public void mockJNStreams() throws IOException {
+        spyOnJASjournal();
+      }
+    };

Review Comment:
   Sir, because `editLog.initJournalsForWrite()` creates a new `journalSet`, it invalidates the previously mocked JNs. So we must mock the JNs after `initJournalsForWrite()`.
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584987#comment-17584987 ]

ASF GitHub Bot commented on HDFS-16689:
---

xkrogen commented on code in PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#discussion_r955294947

## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/TestNNWithQJM.java: ##

@@ -197,10 +197,9 @@ public void testMismatchedNNIsRejected() throws Exception {
           .manageNameDfsDirs(false).format(false).checkExitOnShutdown(false)
           .build();
       fail("New NN with different namespace should have been rejected");
-    } catch (ExitException ee) {
+    } catch (IOException ie) {
       GenericTestUtils.assertExceptionContains(
-          "Unable to start log segment 1: too few journals", ee);
-      assertTrue("Didn't terminate properly ", ExitUtil.terminateCalled());
+          "recoverUnfinalizedSegments failed for too many journals", ie);

Review Comment:
   I wonder if we should modify the caller to catch the `IOException` and rethrow as `ExitException` to match previous behavior?

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java: ##

@@ -1657,16 +1657,11 @@ synchronized void logEdit(final int length, final byte[] data) {
   /**
    * Run recovery on all journals to recover any unclosed segments
    */
-  synchronized void recoverUnclosedStreams() {
+  synchronized void recoverUnclosedStreams() throws IOException {
     Preconditions.checkState(
         state == State.BETWEEN_LOG_SEGMENTS,
         "May not recover segments - wrong state: %s", state);
-    try {
-      journalSet.recoverUnfinalizedSegments();
-    } catch (IOException ex) {
-      // All journals have failed, it is handled in logSync.
-      // TODO: are we sure this is OK?
-    }
+    journalSet.recoverUnfinalizedSegments();

Review Comment:
   This looks right to me as we've been discussing, but I would appreciate another pair of eyes on it to see if I'm missing anything. @omalley can you take a look? (see discussion above on why we're making this change)

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/EditLogTailer.java: ##

@@ -299,33 +299,28 @@ public void catchupDuringFailover() throws IOException {
     // Important to do tailing as the login user, in case the shared
     // edits storage is implemented by a JournalManager that depends
     // on security credentials to access the logs (eg QuorumJournalManager).
-    SecurityUtil.doAsLoginUser(new PrivilegedExceptionAction<Void>() {
-      @Override
-      public Void run() throws Exception {
-        long editsTailed = 0;
-        // Fully tail the journal to the end
-        do {
-          long startTime = timer.monotonicNow();
-          try {
-            NameNode.getNameNodeMetrics().addEditLogTailInterval(
-                startTime - lastLoadTimeMs);
-            // It is already under the name system lock and the checkpointer
-            // thread is already stopped. No need to acquire any other lock.
-            editsTailed = doTailEdits();
-          } catch (InterruptedException e) {
-            throw new IOException(e);
-          } finally {
-            NameNode.getNameNodeMetrics().addEditLogTailTime(
-                timer.monotonicNow() - startTime);
-          }
-        } while (editsTailed > 0);
-        return null;
+    SecurityUtil.doAsLoginUser((PrivilegedExceptionAction<Void>) () -> {
+      long startTime = timer.monotonicNow();
+      try {
+        NameNode.getNameNodeMetrics().addEditLogTailInterval((startTime - lastLoadTimeMs));

Review Comment:
   why did you remove the do-while loop?

## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/SpyQJournalUtil.java: ##

@@ -0,0 +1,107 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hdfs.qjournal.client;
+
+import org.apache.hadoop.conf.Configuration;
+import
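The caller-side rethrow that xkrogen suggests in the first review comment above can be sketched as a standalone snippet. This is only an illustration of the pattern, assuming a recovery call that throws `IOException` on quorum loss; the class and method names here (`RecoveryExitSketch`, `recoverOrExit`) are illustrative stand-ins, not Hadoop's actual API.

```java
import java.io.IOException;

/**
 * Sketch of wrapping a checked IOException from segment recovery in an
 * unchecked "exit" exception, so the caller terminates instead of
 * continuing with a broken edit log (matching the pre-patch behavior of
 * terminating on recovery failure).
 */
public class RecoveryExitSketch {
  static class ExitException extends RuntimeException {
    ExitException(String msg, Throwable cause) { super(msg, cause); }
  }

  // Stand-in for journalSet.recoverUnfinalizedSegments() failing on quorum loss.
  static void recoverUnfinalizedSegments() throws IOException {
    throw new IOException("recoverUnfinalizedSegments failed for too many journals");
  }

  static void recoverOrExit() {
    try {
      recoverUnfinalizedSegments();
    } catch (IOException ioe) {
      // Surface the recovery failure as a terminating condition
      // instead of swallowing it.
      throw new ExitException("Unable to recover unclosed segments", ioe);
    }
  }

  public static void main(String[] args) {
    try {
      recoverOrExit();
    } catch (ExitException ee) {
      // A real NameNode would terminate here; we just report the cause.
      System.out.println("would terminate: " + ee.getCause().getMessage());
    }
  }
}
```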
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584738#comment-17584738 ]

ASF GitHub Bot commented on HDFS-16689:
---

hadoop-yetus commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1227031102

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 40s |  | Docker mode activated. |
|  | _ Prechecks _ |  |  |  |
| +1 :green_heart: | dupname | 0m 0s |  | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s |  | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s |  | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s |  | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s |  | The patch appears to include 2 new or modified test files. |
|  | _ trunk Compile Tests _ |  |  |  |
| +1 :green_heart: | mvninstall | 39m 19s |  | trunk passed |
| +1 :green_heart: | compile | 1m 33s |  | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 1m 29s |  | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 15s |  | trunk passed |
| +1 :green_heart: | mvnsite | 1m 38s |  | trunk passed |
| +1 :green_heart: | javadoc | 1m 24s |  | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 45s |  | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 35s |  | trunk passed |
| +1 :green_heart: | shadedclient | 22m 44s |  | branch has no errors when building and testing our client artifacts. |
|  | _ Patch Compile Tests _ |  |  |  |
| +1 :green_heart: | mvninstall | 1m 21s |  | the patch passed |
| +1 :green_heart: | compile | 1m 25s |  | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 1m 25s |  | the patch passed |
| +1 :green_heart: | compile | 1m 17s |  | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 17s |  | the patch passed |
| +1 :green_heart: | blanks | 0m 0s |  | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 3s |  | the patch passed |
| +1 :green_heart: | mvnsite | 1m 29s |  | the patch passed |
| +1 :green_heart: | javadoc | 0m 57s |  | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 28s |  | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 27s |  | the patch passed |
| +1 :green_heart: | shadedclient | 23m 34s |  | patch has no errors when building and testing our client artifacts. |
|  | _ Other Tests _ |  |  |  |
| -1 :x: | unit | 252m 8s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 4s |  | The patch does not generate ASF License warnings. |
|  |  | 362m 37s |  |  |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.qjournal.TestNNWithQJM |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/5/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4744 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 4188a87c263a 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 7d8b49790968f54bbc943fb3697624a9007d2126 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/5/testReport/ |
| Max. process+thread count | 2937 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output |  |
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584546#comment-17584546 ]

ASF GitHub Bot commented on HDFS-16689:
---

ZanderXu commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1226679229

```
if (curSegment != null) {
  LOG.warn("Client is requesting a new log segment " + txid +
      " though we are already writing " + curSegment + ". " +
      "Aborting the current segment in order to begin the new one." +
      " ; journal id: " + journalId);
  // The writer may have lost a connection to us and is now
  // re-connecting after the connection came back.
  // We should abort our own old segment.
  abortCurSegment();
}
```

`abortCurSegment()` only aborts the current segment; it does not finalize the current in-progress segment, so it may leave two in-progress segment files on disk.

> So are we agreed that the best way forward is to modify recoverUnclosedStreams() to throw exception on failure, then we can use inProgressOk = false to solve this problem as you originally proposed?

Yes, I totally agree with this and I will modify the patch along these lines.

> Standby NameNode crashes when transitioning to Active with in-progress tailer
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-16689
>                 URL: https://issues.apache.org/jira/browse/HDFS-16689
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>          Priority: Critical
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The Standby NameNode crashes when transitioning to Active with an in-progress tailer, with an error message like the one below:
> {code:java}
> Caused by: java.lang.IllegalStateException: Cannot start writing at txid X
> when there is a stream available for read: ByteStringEditLog[X, Y],
> ByteStringEditLog[X, 0]
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
>         ... 36 more
> {code}
> Tracing revealed a critical bug in *EditlogTailer#catchupDuringFailover()* when *DFS_HA_TAILEDITS_INPROGRESS_KEY* is true: *catchupDuringFailover()* tries to replay all missed edits from the JournalNodes with *onlyDurableTxns=true*, so it may be unable to replay any edits when some JournalNodes are abnormal.
> To reproduce, suppose:
> - There are 2 NameNodes, NN0 and NN1, in Active and Standby state respectively, and 3 JournalNodes, JN0, JN1 and JN2.
> - NN0 tries to sync 3 edits to the JNs starting at txid 3, but only successfully syncs them to JN1 and JN2; JN0 is abnormal (e.g. GC, bad network, or restarted).
> - NN1's lastAppliedTxId is 2, and at this moment we try to fail over from NN0 to NN1.
> - NN1 gets only two responses, from JN0 and JN1, when selecting input streams with *fromTxnId=3* and *onlyDurableTxns=true*; the txid counts in the responses are 0 and 3 respectively. JN2 is abnormal (e.g. GC, bad network, or restarted).
> - NN1 cannot replay any edits with *fromTxnId=3* from the JournalNodes because *maxAllowedTxns* is 0.
> So the Standby NameNode should run *catchupDuringFailover()* with *onlyDurableTxns=false*, so that it can replay all missed edits from the JournalNodes.
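The quorum arithmetic in the reproduction steps above can be sketched in a few lines. This is an illustration of the "durable transactions" idea, not the actual `QuorumJournalManager` code: a transaction only counts as durable once a majority of JournalNodes report it, so with reported counts of 0 and 3 from two of three JNs, the highest count known to be on a majority is 0. The class and method names (`DurableTxnSketch`, `maxDurableTxns`) are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Sketch of why onlyDurableTxns=true can return 0 replayable edits. */
public class DurableTxnSketch {
  // Given the txn counts reported by responding JNs and the total number
  // of JNs, return the number of txns durable on a majority.
  static long maxDurableTxns(List<Long> reportedCounts, int totalJns) {
    int majority = totalJns / 2 + 1;
    if (reportedCounts.size() < majority) {
      return 0; // no quorum of responses at all
    }
    List<Long> sorted = new ArrayList<>(reportedCounts);
    sorted.sort(Collections.reverseOrder());
    // The majority-th highest count is on at least `majority` JNs.
    return sorted.get(majority - 1);
  }

  public static void main(String[] args) {
    // JN0 reports 0 txns, JN1 reports 3 txns, JN2 never responds.
    System.out.println(maxDurableTxns(Arrays.asList(0L, 3L), 3)); // prints 0
    // If JN2 had also reported 3 txns, all 3 would be durable.
    System.out.println(maxDurableTxns(Arrays.asList(0L, 3L, 3L), 3)); // prints 3
  }
}
```

With `onlyDurableTxns=false`, the standby would instead be allowed to read up to the highest count any responding JN reports (3 here), which is the behavior the fix relies on during failover.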
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584511#comment-17584511 ]

ASF GitHub Bot commented on HDFS-16689:
---

xkrogen commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1226624906

> I'm sorry, I just find this comment, but didn't find related code to finalize the previous inProgress segment. Can you share the related code? Thanks.

I'm referring to this: https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java#L574-L583

But a little more digging made me realize that I don't think what I described will actually happen, since in `FSEditLog#openForWrite()`, before calling `startLogSegment()`, we first check that there are no active streams: https://github.com/apache/hadoop/blob/63db1a85e376c2266afdc62b9590e40acc98429c/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java#L338-L347

So it will actually throw an exception, rather than finalizing the old segment as I said previously. But this is _after_ `catchupDuringFailover()`, so to make your original proposal (disable in-progress edits) work properly, we still need to modify `recoverUnclosedStreams()` to throw an error when it fails instead of just swallowing the exception.

I briefly looked at the other usages of `recoverUnclosedStreams()` and I don't really see any reason why we would want to swallow the exception... The TODO comment there is also from 2012, 10 years old now :)

So are we agreed that the best way forward is to modify `recoverUnclosedStreams()` to throw an exception on failure, then we can use `inProgressOk = false` to solve this problem as you originally proposed?
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583796#comment-17583796 ] ASF GitHub Bot commented on HDFS-16689: --- abhishekkarigar commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1224559583
@ZanderXu hi, one more question: I am setting up an HA NameNode on Kubernetes. On the standby NameNode, I triggered bootstrapStandby (`$ hdfs namenode -bootstrapStandby`) and I get the following error:
{"name":"org.apache.hadoop.hdfs.server.namenode.NameNode","time":1661279164420,"date":"2022-08-23 18:26:04,420","level":"INFO","thread":"main","message":"registered UNIX signal handlers for [TERM, HUP, INT]"}
{"name":"org.apache.hadoop.hdfs.server.namenode.NameNode","time":1661279164532,"date":"2022-08-23 18:26:04,532","level":"INFO","thread":"main","message":"createNameNode [-bootstrapStandby]"}
{"name":"org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby","time":1661279164846,"date":"2022-08-23 18:26:04,846","level":"INFO","thread":"main","message":"Found nn: apache-hadoop-namenode-0.apache-hadoop-namenode.nom-backend.svc.cluster.local, ipc: hdfs:8020"}
{"name":"org.apache.hadoop.hdfs.server.namenode.NameNode","time":1661279164847,"date":"2022-08-23 18:26:04,847","level":"ERROR","thread":"main","message":"Failed to start namenode.","exceptionclass":"java.io.IOException","stack":["java.io.IOException: java.lang.NullPointerException","\tat org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run(BootstrapStandby.java:549)","\tat org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1741)","\tat org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1834)","Caused by: java.lang.NullPointerException","\tat org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.parseConfAndFindOtherNN(BootstrapStandby.java:435)","\tat org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run(BootstrapStandby.java:114)","\tat org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)","\tat org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:95)","\tat org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run(BootstrapStandby.java:544)","\t... 2 more"]}
{"name":"org.apache.hadoop.util.ExitUtil","time":1661279164850,"date":"2022-08-23 18:26:04,850","level":"INFO","thread":"main","message":"Exiting with status 1: java.io.IOException: java.lang.NullPointerException"}
{"name":"org.apache.hadoop.hdfs.server.namenode.NameNode","time":1661279164852,"date":"2022-08-23 18:26:04,852","level":"INFO","thread":"shutdown-hook-0","message":"SHUTDOWN_MSG: \n/\nSHUTDOWN_MSG: Shutting down NameNode at apache-hadoop-namenode-1.apache-hadoop-namenode.nom-backend.svc.cluster.local/10.129.2.45\n
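Regarding the bootstrapStandby NPE above: `BootstrapStandby.parseConfAndFindOtherNN` typically fails this way when it cannot resolve a remote NameNode from the HA configuration (the `Found nn: ..., ipc: hdfs:8020` log line suggests the RPC address may not be expanding to a real host). As a hedged illustration — the nameservice `mycluster`, the NN ids, and the hostnames below are placeholders, not values from this thread — these are the hdfs-site.xml keys that lookup depends on:

```xml
<!-- Hypothetical hdfs-site.xml fragment; all names and hosts are placeholders. -->
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <!-- Both NN ids must be listed so the standby can find the other NN. -->
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn0,nn1</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn0</name>
    <value>namenode-0.example.svc:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode-1.example.svc:8020</value>
  </property>
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn0:8485;jn1:8485;jn2:8485/mycluster</value>
  </property>
</configuration>
```

If `dfs.ha.namenodes.<nameservice>` does not list a second NN id with a resolvable `dfs.namenode.rpc-address.<nameservice>.<nnid>`, bootstrapStandby has no NameNode to bootstrap from.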
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583004#comment-17583004 ] ASF GitHub Bot commented on HDFS-16689: --- ZanderXu commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1222443598
@abhishekkarigar Thanks for your attention to this issue. @xkrogen and I will solve this problem as soon as possible. @xkrogen Sir, please review the latest patch. Thanks
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582584#comment-17582584 ] ASF GitHub Bot commented on HDFS-16689: --- abhishekkarigar commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1221578477
When will this be merged to trunk?
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582225#comment-17582225 ] ASF GitHub Bot commented on HDFS-16689: --- hadoop-yetus commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1221316624

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 37s | | Docker mode activated. |
| | _ Prechecks _ | | | |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 7 new or modified test files. |
| | _ trunk Compile Tests _ | | | |
| +1 :green_heart: | mvninstall | 40m 8s | | trunk passed |
| +1 :green_heart: | compile | 1m 40s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 1m 38s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 26s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 46s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 23s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 48s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 43s | | trunk passed |
| +1 :green_heart: | shadedclient | 23m 31s | | branch has no errors when building and testing our client artifacts. |
| | _ Patch Compile Tests _ | | | |
| +1 :green_heart: | mvninstall | 1m 27s | | the patch passed |
| +1 :green_heart: | compile | 1m 28s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 1m 28s | | the patch passed |
| +1 :green_heart: | compile | 1m 22s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 22s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 2s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 30s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 4s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 30s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 24s | | the patch passed |
| +1 :green_heart: | shadedclient | 23m 11s | | patch has no errors when building and testing our client artifacts. |
| | _ Other Tests _ | | | |
| +1 :green_heart: | unit | 249m 15s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 13s | | The patch does not generate ASF License warnings. |
| | | 361m 59s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/4/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4744 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 74f55473e3da 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 342e712086a910debbabc7ee799f999bd1557caa |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/4/testReport/ |
| Max. process+thread count | 3311 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/4/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582125#comment-17582125 ] ASF GitHub Bot commented on HDFS-16689: --- ZanderXu commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1221235271
@xkrogen Thanks.
> I think you might have also understood what I was saying in my last comment
Yes, I got it.
> I am thinking we can add a new parameter to LogsPurgeable#selectInputStreams() like preferBulkReads.
This is a good idea; I will fix this patch like this.
> In recoverUnclosedStreams, if the finalization fails, it will just ignore it and assume that it will be handled later (by Journal#startLogSegment(), which will automatically close an old stream when you try to open a new one).
Sorry, I didn't notice this. But I think it's crazy.
> assume that it will be handled later (by Journal#startLogSegment(), which will automatically close an old stream when you try to open a new one).
I'm sorry, I just found this comment, but I didn't find the related code that finalizes the previous in-progress segment. Can you share the related code? Thanks.
> If there is something preventing the new active from communicating with the JNs, or something preventing the JNs from finalizing the old segment, then the NN will eventually fail to become active regardless.
Yes, I agree. The Standby should crash or fail to become active if it cannot finalize the old segment. About this case, how about fixing it in a new PR?
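The `preferBulkReads` idea floated in this exchange can be sketched as a simple selection rule. Everything below — the names, the enum, the decision logic — is an assumption about the proposed parameter, not the actual LogsPurgeable or QuorumJournalManager API:

```java
public class PreferBulkReadsSketch {
    enum ReadPath { RPC_TAILING, STREAMING }

    // Hypothetical: a caller such as catchupDuringFailover() could demand
    // the streaming (bulk) read path, which can also read in-progress
    // segments and is not capped by the durable-txn quorum of the RPC path.
    static ReadPath choosePath(boolean inProgressTailingEnabled,
                               boolean preferBulkReads) {
        if (!inProgressTailingEnabled || preferBulkReads) {
            return ReadPath.STREAMING;
        }
        return ReadPath.RPC_TAILING; // normal standby tailing fast path
    }

    public static void main(String[] args) {
        System.out.println(choosePath(true, false)); // RPC_TAILING
        System.out.println(choosePath(true, true));  // STREAMING
    }
}
```

The point of the parameter, as discussed above, is that failover catch-up wants the streaming mechanism specifically, rather than disabling in-progress reads wholesale.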
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582031#comment-17582031 ] ASF GitHub Bot commented on HDFS-16689: --- xkrogen commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1221081194
Thinking more on this, maybe it does make sense to just modify `startActiveServices` to retry `recoverUnclosedStreams` until it actually succeeds. If there is something preventing the new active from communicating with the JNs, or something preventing the JNs from finalizing the old segment, then the NN will eventually fail to become active regardless.
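The retry idea in this thread — keep attempting segment recovery during the active transition instead of swallowing the failure — can be sketched as follows. The interface and names are illustrative, not the real FSEditLog API:

```java
import java.io.IOException;

public class RecoverRetrySketch {
    @FunctionalInterface
    interface Recovery { void run() throws IOException; }

    // Retry until recovery succeeds; the NN would fail to become active
    // anyway if the JNs stay unreachable, so blocking here is acceptable.
    static int recoverWithRetry(Recovery recovery, long sleepMs)
            throws InterruptedException {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                recovery.run();
                return attempts; // report how many tries it took
            } catch (IOException e) {
                Thread.sleep(sleepMs); // back off before the next attempt
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final int[] failuresLeft = {2}; // simulate two transient JN failures
        int attempts = recoverWithRetry(() -> {
            if (failuresLeft[0]-- > 0) {
                throw new IOException("quorum not reached");
            }
        }, 0L);
        System.out.println(attempts); // 3
    }
}
```

In the real code, retrying forever versus aborting the transition is exactly the trade-off debated above; this sketch only shows the retry-until-success variant.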
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581984#comment-17581984 ] ASF GitHub Bot commented on HDFS-16689: --- xkrogen commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1220981198
> If the active crashed, during standby starting active services, the standby will recover unclosed streams via `recoverUnclosedStreams`. So before `catchupDuringFailover`, the last segment should always closed.
I don't think this will always be true. In `recoverUnclosedStreams`, if the finalization fails, it will just ignore it and assume that it will be handled later (by `Journal#startLogSegment()`, which will automatically close an old stream when you try to open a new one). https://github.com/apache/hadoop/blob/63db1a85e376c2266afdc62b9590e40acc98429c/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java#L1660-L1670
So if we want to disable in-progress edits during `catchupDuringFailover`, then we have to modify this to retry/abort when the old segments cannot be closed. But I don't think this is the right approach. I think you might have also understood what I was saying in my last comment here:
> Perhaps we should add a way for callers of QuorumJournalManager to indicate that they want to use the streaming mechanism, as opposed to RPC. Disabling in-progress edits achieves this, but is too strong (note that the streaming mechanism can also load in-progress edits).
Looking at the current diff, I don't see a reason to add `LogsPurgeable#enableInProgressTailing()`. We already have the `inProgressOk` parameter of `selectInputStreams()`, so why do we need a different mechanism to adjust whether we use in-progress edits? What I was trying to say is that what we really want to do, both in this Jira and in HDFS-14806, is _disable the RPC mechanism and use the streaming mechanism_.
In HDFS-14806, to do this, we turned off in-progress edits, which was okay. But in this Jira, it's not okay.
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581947#comment-17581947 ] ASF GitHub Bot commented on HDFS-16689: --- hadoop-yetus commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1220871253

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 40s | | Docker mode activated. |
| | _ Prechecks _ | | | |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 3 new or modified test files. |
| | _ trunk Compile Tests _ | | | |
| +1 :green_heart: | mvninstall | 38m 22s | | trunk passed |
| +1 :green_heart: | compile | 1m 44s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 1m 37s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 26s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 47s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 25s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 47s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 39s | | trunk passed |
| +1 :green_heart: | shadedclient | 23m 14s | | branch has no errors when building and testing our client artifacts. |
| | _ Patch Compile Tests _ | | | |
| +1 :green_heart: | mvninstall | 1m 22s | | the patch passed |
| +1 :green_heart: | compile | 1m 27s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 1m 27s | | the patch passed |
| +1 :green_heart: | compile | 1m 21s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 21s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 3s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 26s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 58s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 27s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 24s | | the patch passed |
| +1 :green_heart: | shadedclient | 22m 39s | | patch has no errors when building and testing our client artifacts. |
| | _ Other Tests _ | | | |
| +1 :green_heart: | unit | 240m 21s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 13s | | The patch does not generate ASF License warnings. |
| | | 350m 27s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/3/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4744 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux bd3e3c5509b8 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / a5ade31e6f1dc9674d288de910363b1b35f4520c |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/3/testReport/ |
| Max. process+thread count | 3206 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/3/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581772#comment-17581772 ] ASF GitHub Bot commented on HDFS-16689: --- ZanderXu commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1220527843

@xkrogen I have updated this patch so that in-progress tailing can be enabled or disabled in `QuorumJournalManager`; please help me review it. If you have any better ideas, I'd be happy to modify this patch accordingly.

> Standby NameNode crashes when transitioning to Active with in-progress tailer
> Key: HDFS-16689
> URL: https://issues.apache.org/jira/browse/HDFS-16689
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0
> Time Spent: 50m
> Remaining Estimate: 0h
>
> The Standby NameNode crashes when transitioning to Active with an in-progress tailer. The error message looks like the one below:
> {code:java}
> Caused by: java.lang.IllegalStateException: Cannot start writing at txid X
> when there is a stream available for read: ByteStringEditLog[X, Y], ByteStringEditLog[X, 0]
> at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
> at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
> ... 36 more
> {code}
> After tracing, I found a critical bug in *EditLogTailer#catchupDuringFailover()* when *DFS_HA_TAILEDITS_INPROGRESS_KEY* is true: *catchupDuringFailover()* tries to replay all missed edits from the JournalNodes with *onlyDurableTxns=true*, so it may be unable to replay any edits when some JournalNodes are abnormal.
> Reproduce method, suppose:
> - There are 2 NameNodes, NN0 and NN1, in Active and Standby state respectively, and 3 JournalNodes, JN0, JN1 and JN2.
> - NN0 tries to sync 3 edits to the JNs starting at txid 3, but only successfully syncs them to JN1 and JN2. JN0 is abnormal (e.g. GC, bad network, or a restart).
> - NN1's lastAppliedTxId is 2, and at this moment we try to fail over from NN0 to NN1.
> - NN1 gets only two responses, from JN0 and JN1, when selecting input streams with *fromTxnId=3* and *onlyDurableTxns=true*; the reported txn counts are 0 and 3 respectively. JN2 is abnormal (e.g. GC, bad network, or a restart).
> - NN1 cannot replay any edits from *fromTxnId=3* because *maxAllowedTxns* is 0.
> So I think the Standby NameNode should run *catchupDuringFailover()* with *onlyDurableTxns=false*, so that it can replay all missed edits from the JournalNodes.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
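The quorum arithmetic in the reproduce scenario above can be sketched as follows. This is a simplified model of the *onlyDurableTxns=true* rule, not the actual `QuorumJournalManager` code; the class and method names are invented for illustration. The idea: a transaction counts as durable only if a majority of *all* JournalNodes have acknowledged it, so the standby can trust at most the majority-th highest reported count.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified sketch of the durable-transaction quorum rule (hypothetical
// helper, not the real QJM code): with onlyDurableTxns=true, a txid counts
// only if a majority of ALL JournalNodes -- not just the responders --
// have acknowledged it.
public class DurableTxnSketch {
  static long maxDurableTxns(List<Long> responseCounts, int totalJournalNodes) {
    int majority = totalJournalNodes / 2 + 1;
    // Fewer than a majority of responses: nothing is provably durable.
    if (responseCounts.size() < majority) {
      return 0;
    }
    List<Long> sorted = new ArrayList<>(responseCounts);
    sorted.sort(Collections.reverseOrder());
    // The majority-th highest count is held by at least `majority` nodes.
    return sorted.get(majority - 1);
  }

  public static void main(String[] args) {
    // Scenario above: 3 JNs; JN0 reports 0 txns, JN1 reports 3, JN2 is down.
    System.out.println(maxDurableTxns(List.of(0L, 3L), 3)); // prints 0
  }
}
```

With only JN1 reporting the 3 new edits, no transaction is acknowledged by a majority of the three JournalNodes, which is why *maxAllowedTxns* ends up 0 and the standby replays nothing.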
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581715#comment-17581715 ] ASF GitHub Bot commented on HDFS-16689: --- ZanderXu commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1220377078

> If the active crashed, then the segment won't be finalized, right?

If the active crashed, then while starting active services the standby recovers unclosed streams via `recoverUnclosedStreams`. So by the time `catchupDuringFailover` runs, the last segment should always be closed.
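The ordering argument above can be sketched as a toy model. The class and method bodies are hypothetical, invented for illustration; they only mirror the sequence of steps described in the comment (recovery of unclosed segments, then catch-up, then opening the log for write), not the real `FSNamesystem` code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the ordering guarantee described above: during a
// transition to active, unclosed segments are recovered (finalized) before
// catchupDuringFailover() replays the remaining edits, and only then is the
// edit log opened for write.
public class FailoverOrderSketch {
  final List<String> steps = new ArrayList<>();

  void recoverUnclosedStreams() { steps.add("recoverUnclosedStreams"); }
  void catchupDuringFailover()  { steps.add("catchupDuringFailover"); }
  void openForWrite()           { steps.add("openForWrite"); }

  // Mirrors the order described in the discussion, not the actual code path.
  void startActiveServices() {
    recoverUnclosedStreams();
    catchupDuringFailover();
    openForWrite();
  }
}
```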
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581481#comment-17581481 ] ASF GitHub Bot commented on HDFS-16689: --- xkrogen commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1219804467

> Before `catchupDuringFailover`, whatever active crashed or successfully changed to standby, the last segment in majority journalnode should be finalized.

If the active crashed, then the segment won't be finalized, right?

> The same processing idea has also appeared in [HDFS-14806](https://issues.apache.org/jira/browse/HDFS-14806). [Here](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/BootstrapStandby.java#L113)

This is different. Bootstrap standby doesn't need to get _all_ transactions; it's more of a best effort to load most of the transactions. Later, when the standby transitions to active, _then_ it will call `catchupDuringFailover` to load the remainder. Though generally I agree that the idea is similar. Perhaps we should add a way for callers of `QuorumJournalManager` to indicate that they want to use the streaming mechanism, as opposed to RPC. Disabling in-progress edits achieves this, but is too strong (note that the streaming mechanism can also load in-progress edits).
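One way to read the suggestion above — letting callers request the streaming mechanism without globally disabling in-progress tailing — is sketched below. The `preferStreaming` parameter and the class itself are hypothetical, invented for illustration; they are not part of the real `QuorumJournalManager` API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a caller-supplied flag that skips the RPC tailing
// path while still allowing in-progress edits via the streaming mechanism.
// `preferStreaming` is an invented parameter, not the real QJM API.
public class SelectorSketch {
  final List<String> pathsUsed = new ArrayList<>();
  boolean rpcPathHealthy = true;

  void selectRpcInputStreams(long fromTxnId, boolean onlyDurableTxns) {
    if (!rpcPathHealthy) {
      throw new RuntimeException("RPC tailing failed");
    }
    pathsUsed.add("rpc");
  }

  void selectStreamingInputStreams(long fromTxnId, boolean inProgressOk,
      boolean onlyDurableTxns) {
    pathsUsed.add("streaming");
  }

  void selectInputStreams(long fromTxnId, boolean inProgressOk,
      boolean onlyDurableTxns, boolean preferStreaming) {
    if (inProgressOk && !preferStreaming) {
      try {
        selectRpcInputStreams(fromTxnId, onlyDurableTxns);
        return;
      } catch (RuntimeException e) {
        // Fall back to streaming, mirroring the existing fallback behavior.
      }
    }
    // Streaming can also serve in-progress segments, so a caller like
    // catchupDuringFailover() keeps in-progress edits without RPC tailing.
    selectStreamingInputStreams(fromTxnId, inProgressOk, onlyDurableTxns);
  }
}
```

A caller that wants the streaming path (e.g. during failover catch-up) would pass `preferStreaming=true`, while the tailer's hot path keeps using RPC.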
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581115#comment-17581115 ] ASF GitHub Bot commented on HDFS-16689: --- ZanderXu commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1218994368

The same processing idea has also appeared in HDFS-14806. [Here](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/BootstrapStandby.java#L113)
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581113#comment-17581113 ] ASF GitHub Bot commented on HDFS-16689: --- ZanderXu commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1218979907

@xkrogen Thanks for your review and comment, and sorry for the late reply. Before `catchupDuringFailover`, whether the active crashed or successfully transitioned to standby, the last segment on a majority of JournalNodes should be finalized. My idea is that `catchupDuringFailover` should just skip `selectRpcInputStreams` and fall back to `selectStreamingInputStreams` with `getEditLogManifest`. Unfortunately, `catchupDuringFailover` can currently only skip `selectRpcInputStreams` by disabling in-progress tailing, because `inProgressTailingEnabled` in `QuorumJournalManager` is unchangeable. The key code is as follows:

```java
@Override
public void selectInputStreams(Collection<EditLogInputStream> streams,
    long fromTxnId, boolean inProgressOk,
    boolean onlyDurableTxns) throws IOException {
  // Here, catchupDuringFailover should skip this branch and fall back
  // to selectStreamingInputStreams
  if (inProgressOk && inProgressTailingEnabled) {
    LOG.debug("Tailing edits starting from txn ID {} via RPC mechanism", fromTxnId);
    try {
      Collection<EditLogInputStream> rpcStreams = new ArrayList<>();
      selectRpcInputStreams(rpcStreams, fromTxnId, onlyDurableTxns);
      streams.addAll(rpcStreams);
      return;
    } catch (IOException ioe) {
      LOG.warn("Encountered exception while tailing edits >= " + fromTxnId
          + " via RPC; falling back to streaming.", ioe);
    }
  }
  selectStreamingInputStreams(streams, fromTxnId, inProgressOk, onlyDurableTxns);
}
```
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580410#comment-17580410 ] ASF GitHub Bot commented on HDFS-16689: --- xkrogen commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1216959253

It looks like the current diff is disallowing in-progress edits during `catchupDuringFailover()`. But this isn't right; we _do_ want in-progress edits, since those can be durable and have been ack'ed to clients. Right?

I'm also a little confused because your [last comment](https://github.com/apache/hadoop/pull/4744#issuecomment-1215017852) says you solved the issue via `getEditLogManifest()`, but I don't see that method called at all in the current diff.
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580129#comment-17580129 ] ASF GitHub Bot commented on HDFS-16689: --- hadoop-yetus commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1216277870 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 38s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 38m 16s | | trunk passed | | +1 :green_heart: | compile | 1m 45s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | compile | 1m 39s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | checkstyle | 1m 23s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 48s | | trunk passed | | +1 :green_heart: | javadoc | 1m 25s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | javadoc | 1m 48s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 3m 44s | | trunk passed | | +1 :green_heart: | shadedclient | 23m 2s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 26s | | the patch passed | | +1 :green_heart: | compile | 1m 27s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | javac | 1m 27s | | the patch passed | | +1 :green_heart: | compile | 1m 18s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | javac | 1m 18s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 1s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 29s | | the patch passed | | +1 :green_heart: | javadoc | 0m 57s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | javadoc | 1m 32s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 3m 25s | | the patch passed | | +1 :green_heart: | shadedclient | 22m 53s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 240m 23s | | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 1m 14s | | The patch does not generate ASF License warnings. 
| | | | 350m 50s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/4744 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 790ea73b89ab 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 8c4d268898c28a97c51bd7a35dfb0ca8a6b236b6 | | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/2/testReport/ | | Max. process+thread count | 3453 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/2/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This message was automatically generated. > Standby
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579871#comment-17579871 ] ASF GitHub Bot commented on HDFS-16689: --- hadoop-yetus commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1215647723 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 35s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 38m 4s | | trunk passed | | +1 :green_heart: | compile | 1m 43s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | compile | 1m 38s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | checkstyle | 1m 26s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 50s | | trunk passed | | +1 :green_heart: | javadoc | 1m 26s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | javadoc | 1m 48s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 3m 41s | | trunk passed | | +1 :green_heart: | shadedclient | 25m 2s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 30s | | the patch passed | | +1 :green_heart: | compile | 1m 40s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | javac | 1m 40s | | the patch passed | | +1 :green_heart: | compile | 1m 24s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | javac | 1m 24s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 59s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 10 new + 113 unchanged - 0 fixed = 123 total (was 113) | | +1 :green_heart: | mvnsite | 1m 27s | | the patch passed | | +1 :green_heart: | javadoc | 0m 58s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | javadoc | 1m 32s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 3m 29s | | the patch passed | | +1 :green_heart: | shadedclient | 23m 7s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 241m 15s | | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 1m 14s | | The patch does not generate ASF License warnings. 
| | | | 353m 59s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/4744 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 9cd64b6f59b4 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 0ca77ca0e9976dc0aef231841ac0bcd632e1a63f | | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4744/1/testReport/ | | Max. process+thread count | 3757 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579815#comment-17579815 ] ASF GitHub Bot commented on HDFS-16689: --- xkrogen commented on PR #4628: URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1215390129 Thanks all, appreciate you looking into this. @ZanderXu, I will try to look into HDFS-16659 and HDFS-16645.

> Standby NameNode crashes when transitioning to Active with in-progress tailer
> -
>
> Key: HDFS-16689
> URL: https://issues.apache.org/jira/browse/HDFS-16689
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> The Standby NameNode crashes when transitioning to Active with an in-progress tailer, with an error message like the one below:
> {code:java}
> Caused by: java.lang.IllegalStateException: Cannot start writing at txid X when there is a stream available for read: ByteStringEditLog[X, Y], ByteStringEditLog[X, 0]
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
>     ... 36 more
> {code}
> After tracing, I found a critical bug in *EditLogTailer#catchupDuringFailover()* when *DFS_HA_TAILEDITS_INPROGRESS_KEY* is true: *catchupDuringFailover()* tries to replay all missed edits from the JournalNodes with *onlyDurableTxns=true*, so it may be unable to replay any edits when some JournalNodes are abnormal.
> To reproduce, suppose:
> - There are 2 NameNodes, namely NN0 and NN1, in the Active and Standby states respectively, and 3 JournalNodes, namely JN0, JN1 and JN2.
> - NN0 tries to sync 3 edits to the JNs starting at txid 3, but only successfully syncs them to JN1 and JN2. JN0 is abnormal, e.g. due to GC, a bad network, or a restart.
> - NN1's lastAppliedTxId is 2, and at that moment we try to fail over from NN0 to NN1.
> - NN1 gets only two responses, from JN0 and JN1, when it tries to select input streams with *fromTxnId=3* and *onlyDurableTxns=true*; the txn counts of the responses are 0 and 3 respectively. JN2 is abnormal, e.g. due to GC, a bad network, or a restart.
> - NN1 cannot replay any edits with *fromTxnId=3* from the JournalNodes because *maxAllowedTxns* is 0.
> So I think the Standby NameNode should call *catchupDuringFailover()* with *onlyDurableTxns=false*, so that it can replay all missed edits from the JournalNodes.

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
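The `maxAllowedTxns` arithmetic behind the last two reproduction steps can be sketched as follows. This is a hypothetical illustration of the durable-quorum rule, not the actual `QuorumJournalManager#selectRpcInputStreams` code; the class and method names here are invented:

```java
import java.util.Arrays;

// Sketch: with onlyDurableTxns=true, a txn is only considered replayable if a
// majority of ALL JournalNodes (not just the responders) reported it. Given the
// per-JN counts of txns available from fromTxId, that is the majority-th
// largest reported count.
public class DurableTxnSketch {
  public static int maxDurableTxns(int[] reportedCounts, int totalJns) {
    int majority = totalJns / 2 + 1;      // e.g. 2 out of 3 JNs
    int[] sorted = reportedCounts.clone();
    Arrays.sort(sorted);                  // ascending
    if (sorted.length < majority) {
      return 0;                           // no majority responded at all
    }
    return sorted[sorted.length - majority];
  }

  public static void main(String[] args) {
    // Scenario from the report: JN0 responds with 0 txns (it missed the sync),
    // JN1 responds with 3, JN2 never responds. The 2nd-largest count is 0, so
    // the new active cannot replay anything even though JN1 has all 3 edits.
    System.out.println(maxDurableTxns(new int[] {0, 3}, 3)); // prints 0
  }
}
```

With a healthy JN2 also reporting 3 txns, the majority-th largest count would be 3 and all edits would be replayable; the failure mode is that a single empty responder plus one non-responder pins the durable count to 0.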
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579709#comment-17579709 ] ASF GitHub Bot commented on HDFS-16689: --- ZanderXu commented on PR #4744: URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1215017852 @ferhui @xkrogen This PR uses `getEditLogManifest()` to fix the problem. Please help review it, thanks.
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579707#comment-17579707 ] ASF GitHub Bot commented on HDFS-16689: --- ZanderXu opened a new pull request, #4744: URL: https://github.com/apache/hadoop/pull/4744

### Description of PR

The Standby NameNode may crash when transitioning to Active with an in-progress tailer if some JNs are abnormal. The crash error message is as below:

```java
Caused by: java.lang.IllegalStateException: Cannot start writing at txid X when there is a stream available for read: ByteStringEditLog[X, Y], ByteStringEditLog[X, 0]
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
    ... 36 more
```

After tracing, I found a critical bug in `EditLogTailer#catchupDuringFailover()` with `DFS_HA_TAILEDITS_INPROGRESS_KEY=true`: `catchupDuringFailover()` may be unable to replay all edits from the JournalNodes with `onlyDurableTxns=true` if some JournalNodes are abnormal, because the JournalNodes may return empty responses, causing `maxAllowedTxns=0` in `QuorumJournalManager#selectRpcInputStreams`.

To reproduce, suppose:
- There are 2 NameNodes, namely NN0 and NN1, in the Active and Standby states respectively, and 3 JournalNodes, namely JN0, JN1 and JN2.
- NN0 tries to sync 3 edits to the JNs starting at txid 3, but only successfully syncs them to JN1 and JN2. JN0 is abnormal, e.g. due to GC, a bad network, or a restart.
- NN1's lastAppliedTxId is 2, and at that moment we try to fail over from NN0 to NN1.
- NN1 gets only two responses, from JN0 and JN1, when it tries to select input streams with `fromTxnId=3` and `onlyDurableTxns=true`; the txn counts of the responses are 0 and 3 respectively. JN2 is abnormal, e.g. due to GC, a bad network, or a restart.
- NN1 cannot replay any edits with `fromTxnId=3` from the JournalNodes because `maxAllowedTxns` is 0.

So I think the Standby NameNode should call `catchupDuringFailover()` with `onlyDurableTxns=false`, so that it can replay all missed edits from the JournalNodes.
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579664#comment-17579664 ] ASF GitHub Bot commented on HDFS-16689: --- hadoop-yetus commented on PR #4743: URL: https://github.com/apache/hadoop/pull/4743#issuecomment-1214926068

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 0s | | Docker mode activated. |
| -1 :x: | patch | 0m 19s | | https://github.com/apache/hadoop/pull/4743 does not apply to trunk. Rebase required? Wrong branch? See https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute for help. |

| Subsystem | Report/Notes |
|----------:|:-------------|
| GITHUB PR | https://github.com/apache/hadoop/pull/4743 |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4743/1/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579665#comment-17579665 ] ASF GitHub Bot commented on HDFS-16689: --- ferhui commented on PR #4628: URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1214926249 @xkrogen Thanks for pointing it out. Reverted. @ZanderXu can raise a new PR and we can discuss it there.
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579662#comment-17579662 ] ASF GitHub Bot commented on HDFS-16689: --- ferhui merged PR #4743: URL: https://github.com/apache/hadoop/pull/4743
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579661#comment-17579661 ] ASF GitHub Bot commented on HDFS-16689: --- ferhui opened a new pull request, #4743: URL: https://github.com/apache/hadoop/pull/4743 Reverts apache/hadoop#4628
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579546#comment-17579546 ] ASF GitHub Bot commented on HDFS-16689: --- ZanderXu commented on PR #4628: URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1214661593 @xkrogen I'm waiting for your feedback on this PR. Thanks.

> Because while NN0 stops its active service, it closes the current segment, so the last finalized segment on JN0 contains txn 3. So while NN1 starts its active service, it can load and apply txn 3 through getEditLogManifest(). Please correct me if I'm wrong. If that's correct, how about letting the JournalNodes find and sync the lost data while starting active services, to fix the problem you mentioned above, which already existed before [HDFS-12943](https://issues.apache.org/jira/browse/HDFS-12943)?
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579191#comment-17579191 ] ASF GitHub Bot commented on HDFS-16689: --- ZanderXu commented on PR #4628: URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1213596142 By the way, @ferhui @xkrogen, can you help me review some other JN issues? Maybe we can fix them together.
- [HDFS-16659](https://github.com/apache/hadoop/pull/4560) JournalNode should throw CacheMissException when SinceTxId is bigger than HighestWrittenTxId
- [HDFS-16645](https://github.com/apache/hadoop/pull/4518) [JN] Bugfix: java.lang.IllegalStateException: Invalid log manifest
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579189#comment-17579189 ] ASF GitHub Bot commented on HDFS-16689: --- ZanderXu commented on PR #4628: URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1213594732 Thanks @xkrogen for your detailed explanation. I have through this corner case during coding this patch. Please correct me if I'm wrong. I think above scenario always existed before HDFS-12943, so I fixed this bug like this way. And we can find a good idea to fix data loss issue you mentioned above. > 1. Currently NN0 is active and JN0-2 all have txn 2 committed. 2. NN0 attempts to write txn 3. It only succeeds to JN0, and crashes before writing to JN1/JN2. 3. We fail over to NN1, which currently has txns up to 1 4. NN1 attempts to load most recent state from JNs 4a. Before HDFS-12943, NN1 uses `getEditLogManifest()`, it will load and apply txn 2 AND 3. Because during NN0 stoping active service, it will close the current segment, last finalize segment of JN0 contains the txn 3. So during NN1 starting active service, it can load and apply txn 3 through `getEditLogManifest()`. And the current logic in `startActiveServices` is confusing. 1. using `onlyDurableTxns=true` to catchup all edits from JNs. 2. using `onlyDurableTxns=false` to check if there are newer txid readable in `openForWrite`. There is indeed a probability of data loss, if the disk in JN0 corrupted before the segment is not synchronized by JN1 and JN2 in time. But maybe we need add a new logic to find this case and let JNs synchronously sync the missing txid, such as in `startActive()` method. 
> Standby NameNode crashes when transitioning to Active with in-progress tailer
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-16689
>                 URL: https://issues.apache.org/jira/browse/HDFS-16689
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Standby NameNode crashes when transitioning to Active with an in-progress tailer. The error message is like below:
> {code:java}
> Caused by: java.lang.IllegalStateException: Cannot start writing at txid X
> when there is a stream available for read: ByteStringEditLog[X, Y],
> ByteStringEditLog[X, 0]
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
>         ... 36 more
> {code}
> After tracing, we found a critical bug in *EditLogTailer#catchupDuringFailover()* when *DFS_HA_TAILEDITS_INPROGRESS_KEY* is true: *catchupDuringFailover()* tries to replay all missed edits from the JournalNodes with *onlyDurableTxns=true*, and it may not be able to replay any edits when there are some abnormal JournalNodes.
> To reproduce, suppose:
> - There are 2 NameNodes, namely NN0 and NN1, whose states are Active and Standby respectively, and there are 3 JournalNodes, namely JN0, JN1 and JN2.
> - NN0 tries to sync 3 edits to the JNs starting at txid 3, but only successfully syncs them to JN1 and JN2; JN0 is abnormal (e.g. GC, bad network, or restarted).
> - NN1's lastAppliedTxId is 2, and at this moment we try to fail over from NN0 to NN1.
> - NN1 gets only two responses, from JN0 and JN1, when it tries to select inputStreams with *fromTxnId=3* and *onlyDurableTxns=true*; the txid counts of the responses are 0 and 3 respectively. JN2 is abnormal (e.g. GC, bad network, or restarted).
> - NN1 cannot replay any edits with *fromTxnId=3* from the JournalNodes because *maxAllowedTxns* is 0.
> So I think the Standby NameNode should run *catchupDuringFailover()* with *onlyDurableTxns=false*, so that it can replay all missed edits from the JournalNodes.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
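The *maxAllowedTxns* arithmetic in the issue description above can be sketched with a toy model. This is a simplified illustration, not the actual `QuorumJournalManager.selectInputStreams` code; the method name and signature are invented. The idea: with `onlyDurableTxns=true`, a txn may only be replayed if a majority of JournalNodes holds it, so the reader is bounded by the majority-th largest txn count among the responses.

```java
import java.util.Arrays;

/**
 * Simplified model (NOT the real Hadoop code) of how the reader limits
 * replayable transactions when onlyDurableTxns=true: only txns present
 * on a majority of JournalNodes count as durable.
 */
public class DurableTxnModel {
    /**
     * @param txnCounts txns (from fromTxnId) reported by each responding JN
     * @param totalJns  total JournalNodes in the quorum set
     * @return how many txns may be replayed when only durable txns are allowed
     */
    static int maxAllowedTxns(int[] txnCounts, int totalJns) {
        int majority = totalJns / 2 + 1;
        if (txnCounts.length < majority) {
            return 0; // not even a quorum of responses
        }
        int[] sorted = txnCounts.clone();
        Arrays.sort(sorted); // ascending
        // A txn is durable only if at least `majority` JNs hold it, so the
        // majority-th largest count bounds what may be read.
        return sorted[sorted.length - majority];
    }

    public static void main(String[] args) {
        // Bug scenario above: JN2 is silent, JN0 reports 0 txns, JN1 reports
        // 3 txns. With 3 JNs the majority is 2, so the 2nd-largest count is
        // 0 -> nothing can be replayed and the failover crashes.
        System.out.println(maxAllowedTxns(new int[]{0, 3}, 3)); // prints 0
    }
}
```

In this model, had JN2 also responded with 3 txns, the 2nd-largest count would be 3 and the catch-up would succeed — matching the intuition that the bug only bites when an abnormal JN shrinks the visible quorum.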
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579100#comment-17579100 ] ASF GitHub Bot commented on HDFS-16689: --- xkrogen commented on PR #4628: URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1213426216

@ZanderXu @ferhui sorry for being late to the discussion here. Please correct me if I'm wrong, but I believe this introduces a correctness issue. Unless you can dispute the situation I outline below, we will need to revert this PR. Using `onlyDurableTxns` when initializing the new active is essential to maintain a correct transaction history. Envision the following scenario:
1. Currently NN0 is active and JN0-2 all have txn 2 committed.
2. NN0 attempts to write txn 3. It only succeeds to JN0, and crashes before writing to JN1/JN2.
3. We fail over to NN1, which currently has txns up to 1.
4. NN1 attempts to load the most recent state from the JNs.
4a. Previously, NN1 used `onlyDurableTxns=true`, so it would only load txn 2. Good!
4b. With this PR, NN1 uses `onlyDurableTxns=false`, so if it gets a response from JN0, it will load and apply txn 2 AND 3.
In situation (4b), txn 3 should NOT be loaded by NN1, because it was not durably committed.
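xkrogen's scenario can be contrasted with the bug fix's behavior in a small sketch. These helper names are hypothetical, not Hadoop's real API; the sketch only models the two read bounds: with `onlyDurableTxns=false` the reader may follow the furthest-ahead JournalNode, while the durable bound is the highest txn held by a majority.

```java
import java.util.Arrays;

/**
 * Hypothetical illustration (invented names, not Hadoop code) of why
 * onlyDurableTxns=false can surface a txn that only one JournalNode
 * ever acknowledged.
 */
public class NonDurableRead {
    /** onlyDurableTxns=false: read as far as the furthest JN, quorum or not. */
    static int maxReadableTxn(int[] lastTxnPerJn) {
        int max = 0;
        for (int t : lastTxnPerJn) {
            max = Math.max(max, t);
        }
        return max;
    }

    /** onlyDurableTxns=true: highest txn held by a majority of the JNs. */
    static int durableTxn(int[] lastTxnPerJn) {
        int[] s = lastTxnPerJn.clone();
        Arrays.sort(s); // ascending
        int majority = lastTxnPerJn.length / 2 + 1;
        return s[s.length - majority];
    }

    public static void main(String[] args) {
        int[] jns = {3, 2, 2}; // JN0 holds txn 3; JN1/JN2 stop at txn 2
        // With onlyDurableTxns=false NN1 would read through txn 3, even
        // though txn 3 was never durably committed.
        System.out.println(maxReadableTxn(jns)); // 3
        System.out.println(durableTxn(jns));     // 2
    }
}
```

The gap between the two bounds is exactly the window xkrogen describes: a txn visible to the reader but not yet guaranteed by a quorum, which a subsequent JN0 failure could silently erase.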
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578868#comment-17578868 ] ASF GitHub Bot commented on HDFS-16689: --- ZanderXu commented on PR #4628: URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1212881166 @ferhui Thank you very much for helping me review this PR and very nice suggestion.
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578782#comment-17578782 ] ASF GitHub Bot commented on HDFS-16689: --- ferhui commented on PR #4628: URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1212715520 @ZanderXu Thanks for your contribution! Merged
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578778#comment-17578778 ] ASF GitHub Bot commented on HDFS-16689: --- ferhui merged PR #4628: URL: https://github.com/apache/hadoop/pull/4628
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578385#comment-17578385 ] ASF GitHub Bot commented on HDFS-16689: --- hadoop-yetus commented on PR #4628: URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1211814838

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 1m 24s | | Docker mode activated. |
|||| _ Prechecks _ ||
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
|||| _ trunk Compile Tests _ ||
| +1 :green_heart: | mvninstall | 42m 16s | | trunk passed |
| +1 :green_heart: | compile | 1m 44s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 1m 36s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 22s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 45s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 25s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 45s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 4m 8s | | trunk passed |
| +1 :green_heart: | shadedclient | 26m 21s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ ||
| +1 :green_heart: | mvninstall | 1m 27s | | the patch passed |
| +1 :green_heart: | compile | 1m 43s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 1m 43s | | the patch passed |
| +1 :green_heart: | compile | 1m 31s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 31s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 9s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 33s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 7s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 33s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 57s | | the patch passed |
| +1 :green_heart: | shadedclient | 26m 4s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ ||
| -1 :x: | unit | 375m 35s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4628/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 21s | | The patch does not generate ASF License warnings. |
| | | | 497m 27s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithShortCircuitRead |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4628/5/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4628 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 0f0416ed0c69 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 5e6adb43e490330c7aa5fd71216308fb0915d3b9 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4628/5/testReport/ |
| Max. process+thread count | 2109 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U:
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578386#comment-17578386 ] ASF GitHub Bot commented on HDFS-16689: --- hadoop-yetus commented on PR #4628: URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1211815047

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 1m 1s | | Docker mode activated. |
|||| _ Prechecks _ ||
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
|||| _ trunk Compile Tests _ ||
| +1 :green_heart: | mvninstall | 42m 24s | | trunk passed |
| +1 :green_heart: | compile | 1m 50s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 1m 36s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 23s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 47s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 25s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 43s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 4m 11s | | trunk passed |
| +1 :green_heart: | shadedclient | 26m 31s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ ||
| +1 :green_heart: | mvninstall | 1m 33s | | the patch passed |
| +1 :green_heart: | compile | 1m 40s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 1m 40s | | the patch passed |
| +1 :green_heart: | compile | 1m 28s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 28s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 9s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 36s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 4s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 40s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 52s | | the patch passed |
| +1 :green_heart: | shadedclient | 26m 10s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ ||
| +1 :green_heart: | unit | 374m 1s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 12s | | The patch does not generate ASF License warnings. |
| | | | 496m 15s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4628/6/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4628 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 6211aff2a338 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 5e6adb43e490330c7aa5fd71216308fb0915d3b9 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4628/6/testReport/ |
| Max. process+thread count | 2195 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4628/6/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578215#comment-17578215 ] ASF GitHub Bot commented on HDFS-16689: --- ZanderXu commented on code in PR #4628: URL: https://github.com/apache/hadoop/pull/4628#discussion_r943049964

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumJournalManager.java: ##
@@ -134,7 +134,6 @@ public QuorumJournalManager(Configuration conf,
   }

Review Comment: Thanks, I rolled it back in the latest commit.
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578123#comment-17578123 ] ASF GitHub Bot commented on HDFS-16689: --- hadoop-yetus commented on PR #4628: URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1211152590

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 54s | | Docker mode activated. |
|||| _ Prechecks _ ||
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
|||| _ trunk Compile Tests _ ||
| +1 :green_heart: | mvninstall | 41m 19s | | trunk passed |
| +1 :green_heart: | compile | 1m 43s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 1m 30s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 19s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 40s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 22s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 38s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 48s | | trunk passed |
| +1 :green_heart: | shadedclient | 26m 11s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ ||
| +1 :green_heart: | mvninstall | 1m 22s | | the patch passed |
| +1 :green_heart: | compile | 1m 30s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 1m 30s | | the patch passed |
| +1 :green_heart: | compile | 1m 21s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 21s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 1m 1s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4628/4/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 10 new + 155 unchanged - 0 fixed = 165 total (was 155) |
| +1 :green_heart: | mvnsite | 1m 27s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 59s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 29s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 31s | | the patch passed |
| +1 :green_heart: | shadedclient | 25m 56s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ ||
| -1 :x: | unit | 325m 54s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4628/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 57s | | The patch does not generate ASF License warnings. |
| | | | 444m 14s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.mover.TestMover |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4628/4/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4628 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 1168497bc6e0 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 5eb3768b8e9fb0a7262cc5eecc7887414f8e4120 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578042#comment-17578042 ]

ASF GitHub Bot commented on HDFS-16689:
---
ferhui commented on code in PR #4628:
URL: https://github.com/apache/hadoop/pull/4628#discussion_r942607037

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumJournalManager.java:
## @@ -134,7 +134,6 @@ public QuorumJournalManager(Configuration conf,

Review Comment: avoid unnecessary changes

> Standby NameNode crashes when transitioning to Active with in-progress tailer
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-16689
>                 URL: https://issues.apache.org/jira/browse/HDFS-16689
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The Standby NameNode crashes when transitioning to Active with an in-progress tailer. The error message looks like this:
> {code:java}
> Caused by: java.lang.IllegalStateException: Cannot start writing at txid X when there is a stream available for read: ByteStringEditLog[X, Y], ByteStringEditLog[X, 0]
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
>         ... 36 more
> {code}
> After tracing, I found a critical bug in *EditLogTailer#catchupDuringFailover()* when *DFS_HA_TAILEDITS_INPROGRESS_KEY* is true: *catchupDuringFailover()* tries to replay all missed edits from the JournalNodes with *onlyDurableTxns=true*, so it may be unable to replay any edits when some JournalNodes are abnormal.
> To reproduce, suppose:
> - There are 2 NameNodes, NN0 and NN1, in Active and Standby state respectively, and 3 JournalNodes, JN0, JN1 and JN2.
> - NN0 tries to sync 3 edits starting at txid 3 to the JNs, but only successfully syncs them to JN1 and JN2; JN0 is abnormal (e.g. GC pause, bad network, or a restart).
> - NN1's lastAppliedTxId is 2, and at that moment we try to fail over from NN0 to NN1.
> - NN1 gets only two responses, from JN0 and JN1, when selecting input streams with *fromTxnId=3* and *onlyDurableTxns=true*; the reported txn counts are 0 and 3 respectively. JN2 is abnormal (e.g. GC pause, bad network, or a restart).
> - NN1 cannot replay any edits from *fromTxnId=3* because *maxAllowedTxns* is 0.
> So I think the Standby NameNode should run *catchupDuringFailover()* with *onlyDurableTxns=false*, so that it can replay all missed edits from the JournalNodes.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
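The failure in the reproduction steps above comes down to how the reader computes the number of durably-replicated transactions from the JournalNode responses: a txn only counts as durable once a majority of all JNs report having it, even if fewer JNs actually responded. Below is a minimal sketch of that quorum math. This is an illustration only, not the actual QuorumJournalManager code; the class and method names are hypothetical.

```java
import java.util.Arrays;

/**
 * Illustrative sketch (NOT the real QuorumJournalManager code) of why
 * selecting input streams with onlyDurableTxns=true can yield zero
 * replayable edits when one responding JournalNode lags behind.
 */
public class DurableTxnSketch {

    /**
     * A txn is "durable" once a majority of all numJournals JNs have it.
     * Sorting the reported txn counts and taking the majority-th highest
     * value gives the largest count that at least a majority can serve.
     */
    static long maxDurableTxns(long[] reportedTxnCounts, int numJournals) {
        int majority = numJournals / 2 + 1;
        long[] sorted = reportedTxnCounts.clone();
        Arrays.sort(sorted); // ascending
        // majority-th highest reported count
        return sorted[sorted.length - majority];
    }

    public static void main(String[] args) {
        // Scenario from the bug report: 3 JNs, JN2 never responds;
        // JN0 reports 0 txns available from txid 3, JN1 reports 3.
        long[] responses = {0, 3};
        System.out.println("maxAllowedTxns = " + maxDurableTxns(responses, 3));
        // Prints 0: nothing is replayed even though JN1 holds the edits.
        // With onlyDurableTxns=false the reader could instead trust the
        // highest response (3) and replay the edits present on JN1.
    }
}
```

With a healthy cluster (e.g. responses {3, 3}) the same computation returns 3, which is why the problem only surfaces when JNs are abnormal during the failover window.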
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577963#comment-17577963 ]

ASF GitHub Bot commented on HDFS-16689:
---
ZanderXu commented on PR #4628:
URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1210583051

@ferhui Thanks for your nice suggestion, I learned a lot from it. I have updated this patch; please help review it again. Thanks.
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577708#comment-17577708 ]

ASF GitHub Bot commented on HDFS-16689:
---
ferhui commented on PR #4628:
URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1210072454

I'm not sure. How about adding a test utility class to that package?
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577571#comment-17577571 ]

ASF GitHub Bot commented on HDFS-16689:
---
hadoop-yetus commented on PR #4628:
URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1209714098

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------|:-------:|
| +0 :ok: | reexec | 1m 22s | | Docker mode activated. |
|| _ Prechecks _ | | | |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|| _ trunk Compile Tests _ | | | |
| +1 :green_heart: | mvninstall | 44m 30s | | trunk passed |
| +1 :green_heart: | compile | 1m 43s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 1m 30s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 19s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 41s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 19s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 39s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 46s | | trunk passed |
| +1 :green_heart: | shadedclient | 25m 50s | | branch has no errors when building and testing our client artifacts. |
|| _ Patch Compile Tests _ | | | |
| +1 :green_heart: | mvninstall | 1m 23s | | the patch passed |
| +1 :green_heart: | compile | 1m 29s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 1m 29s | | the patch passed |
| +1 :green_heart: | compile | 1m 18s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 18s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 2s | | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 154 unchanged - 1 fixed = 154 total (was 155) |
| +1 :green_heart: | mvnsite | 1m 26s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 58s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 32s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 33s | | the patch passed |
| +1 :green_heart: | shadedclient | 25m 42s | | patch has no errors when building and testing our client artifacts. |
|| _ Other Tests _ | | | |
| +1 :green_heart: | unit | 338m 6s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 56s | | The patch does not generate ASF License warnings. |
| | | 459m 50s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4628/3/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4628 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 528bdc9a040f 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 209275054faaa7faf24781fb40e0a36079559706 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4628/3/testReport/ |
| Max. process+thread count | 2251 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4628/3/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 |
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577333#comment-17577333 ]

ASF GitHub Bot commented on HDFS-16689:
---
ZanderXu commented on PR #4628:
URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1209185985

@ferhui Thanks for helping review.

> Change visibility because of Test cases, right? Is there a way to avoid changing it and make it concise?

Yes, changing the visibility is only for the UT. The related classes are in different packages, and I need to write some edits to the JNs and verify them after HA, so it's difficult to code the UT without changing the visibility. Do you have any good ideas?
[jira] [Commented] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer
[ https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577271#comment-17577271 ]

ASF GitHub Bot commented on HDFS-16689:
---
ferhui commented on PR #4628:
URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1209082414

@ZanderXu Good catch! Change visibility because of Test cases, right? Is there a way to avoid changing it and make it concise?