[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641405#comment-17641405 ]

ASF GitHub Bot commented on HDFS-16550:
---------------------------------------

xkrogen merged PR #4209:
URL: https://github.com/apache/hadoop/pull/4209

> [SBN read] Improper cache-size for journal node may cause cluster crash
> ------------------------------------------------------------------------
>
>                 Key: HDFS-16550
>                 URL: https://issues.apache.org/jira/browse/HDFS-16550
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Tao Li
>            Assignee: Tao Li
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2022-04-21-09-54-29-751.png, image-2022-04-21-09-54-57-111.png, image-2022-04-21-12-32-56-170.png
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> When we introduced *SBN Read*, we hit a problem while upgrading the JournalNodes.
> Cluster Info:
> *Active: nn0*
> *Standby: nn1*
> 1. Rolling restart of the JournalNodes (related config: dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G).
> 2. After the cluster ran for a while, edits cache usage kept growing until memory was used up.
> 3. The active namenode (nn0) shut down because of "Timed out waiting 12ms for a quorum of nodes to respond".
> 4. nn1 was transitioned to the Active state.
> 5. The new active namenode (nn1) also shut down because of "Timed out waiting 12ms for a quorum of nodes to respond".
> 6. The cluster crashed.
>
> Related code:
> {code:java}
> JournaledEditsCache(Configuration conf) {
>   capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
>       DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
>   if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
>     Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
>         "maximum JVM memory is only %d bytes. It is recommended that you " +
>         "decrease the cache size or increase the heap size.",
>         capacity, Runtime.getRuntime().maxMemory()));
>   }
>   Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
>       "of bytes: " + capacity);
>   ReadWriteLock lock = new ReentrantReadWriteLock(true);
>   readLock = new AutoCloseableLock(lock.readLock());
>   writeLock = new AutoCloseableLock(lock.writeLock());
>   initialize(INVALID_TXN_ID);
> } {code}
> Currently, *dfs.journalnode.edit-cache-size.bytes* can be set larger than the memory the process actually requests. If *dfs.journalnode.edit-cache-size.bytes > 0.9 * Runtime.getRuntime().maxMemory()*, only a warn log is printed during JournalNode startup, which is easily overlooked by users. However, after the cluster has been running for some time, it is likely to crash.
>
> NN log:
> !image-2022-04-21-09-54-57-111.png|width=1012,height=47!
> !image-2022-04-21-12-32-56-170.png|width=809,height=218!
> IMO, we should not set the {{cache size}} to a fixed value, but to a ratio of the maximum memory, 0.2 by default.
> This avoids an oversized cache. In addition, users can actively adjust the heap size when they need to increase the cache size.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
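[Editor's note] To make the ticket's proposal concrete, here is a minimal sketch of ratio-based sizing. The `DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_*` keys match the ones discussed in the review thread below; the exact merged logic may differ.

```java
// Minimal sketch of the proposed fix: derive the cache capacity from the
// JVM's max heap rather than a fixed byte count, so the cache can never be
// configured larger than the heap itself.
float fraction = conf.getFloat(
    DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY,
    DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_DEFAULT);
int capacity = (int) (Runtime.getRuntime().maxMemory() * fraction);
```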
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641403#comment-17641403 ]

ASF GitHub Bot commented on HDFS-16550:
---------------------------------------

xkrogen commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1332367011

`TestLeaseRecovery2` failure is tracked in HDFS-16853. LGTM. Merging to trunk. Thanks for the contribution @tomscut !
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641098#comment-17641098 ]

ASF GitHub Bot commented on HDFS-16550:
---------------------------------------

hadoop-yetus commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1331717919

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------|:--------|
| +0 :ok: | reexec | 0m 44s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 1s | | xmllint was not available. |
| +0 :ok: | markdownlint | 0m 1s | | markdownlint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 40m 11s | | trunk passed |
| +1 :green_heart: | compile | 1m 36s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 1m 26s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 15s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 45s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 15s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 38s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 38s | | trunk passed |
| +1 :green_heart: | shadedclient | 23m 22s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 18s | | the patch passed |
| +1 :green_heart: | compile | 1m 21s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 1m 21s | | the patch passed |
| +1 :green_heart: | compile | 1m 16s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 16s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 0s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 24s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 53s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 30s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 20s | | the patch passed |
| +1 :green_heart: | shadedclient | 22m 41s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 305m 30s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/9/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 3s | | The patch does not generate ASF License warnings. |
| | | | 416m 21s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/9/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4209 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint markdownlint |
| uname | Linux 95eb9f9b43d3 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 7453d35f07cb264a7917ca1e291681d8a9f25dfd |
| Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/9/testReport/ |
| Max. process+thread
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641016#comment-17641016 ]

ASF GitHub Bot commented on HDFS-16550:
---------------------------------------

xkrogen commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1331512468

LGTM. Will wait for a clean Jenkins run
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641000#comment-17641000 ]

ASF GitHub Bot commented on HDFS-16550:
---------------------------------------

tomscut commented on code in PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#discussion_r1035400504

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournaledEditsCache.java:
## @@ -123,8 +125,14 @@ class JournaledEditsCache {
   JournaledEditsCache(Configuration conf) {
     float fraction = conf.getFloat(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY,
         DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_DEFAULT);
+    if (fraction <= 0 || fraction >= 1.0f) {
+      terminate(1, new IllegalArgumentException(String.format(

Review Comment:
   Thanks @xkrogen for the review. I updated it.
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640991#comment-17640991 ]

ASF GitHub Bot commented on HDFS-16550:
---------------------------------------

xkrogen commented on code in PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#discussion_r1035381835

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournaledEditsCache.java:
## @@ -123,8 +125,14 @@ class JournaledEditsCache {
   JournaledEditsCache(Configuration conf) {
     float fraction = conf.getFloat(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY,
         DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_DEFAULT);
+    if (fraction <= 0 || fraction >= 1.0f) {
+      terminate(1, new IllegalArgumentException(String.format(

Review Comment:
   `terminate()` seems too strong to me; I would expect that we just use `Preconditions.checkArgument()` to throw an exception. This is used elsewhere to validate config values, e.g. in `FSDirectory` constructor and `HAUtil.getNameNodeIdOfOtherNodes()`.
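[Editor's note] For reference, a minimal sketch of the validation style xkrogen suggests, assuming a Guava-style `Preconditions.checkArgument` with a `%s` message template (the message wording here is illustrative):

```java
// Throw IllegalArgumentException on a bad fraction instead of terminating
// the JVM; callers see a clear configuration error at construction time.
Preconditions.checkArgument(fraction > 0 && fraction < 1.0f,
    "Edit cache size fraction must be between 0 and 1, but was: %s", fraction);
```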
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640374#comment-17640374 ]

ASF GitHub Bot commented on HDFS-16550:
---------------------------------------

hadoop-yetus commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1330060886

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------|:--------|
| +0 :ok: | reexec | 0m 43s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 1s | | xmllint was not available. |
| +0 :ok: | markdownlint | 0m 1s | | markdownlint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 39m 58s | | trunk passed |
| +1 :green_heart: | compile | 1m 40s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 1m 31s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 22s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 48s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 18s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 41s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 44s | | trunk passed |
| +1 :green_heart: | shadedclient | 25m 57s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 29s | | the patch passed |
| +1 :green_heart: | compile | 1m 40s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 1m 39s | | the patch passed |
| +1 :green_heart: | compile | 1m 32s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 32s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 10s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 42s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 8s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 47s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 4m 28s | | the patch passed |
| -1 :x: | shadedclient | 36m 26s | | patch has errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 16m 54s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/8/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +0 :ok: | asflicense | 0m 39s | | ASF License check generated no output? |
| | | | 144m 56s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.namenode.ha.TestHAAppend |
| | hadoop.fs.viewfs.TestViewFileSystemLinkFallback |
| | hadoop.fs.viewfs.TestViewFsWithAcls |
| | hadoop.fs.permission.TestStickyBit |
| | hadoop.fs.contract.hdfs.TestHDFSContractSetTimes |
| | hadoop.fs.contract.hdfs.TestHDFSContractPathHandle |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/8/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4209 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint markdownlint |
| uname | Linux 9ee32daf0144 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 11b70ea65c625e55a4b9052434567563abd6432e |
| Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions |
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640340#comment-17640340 ]

ASF GitHub Bot commented on HDFS-16550:
---------------------------------------

tomscut commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1329975976

Thanks @xkrogen for your thoughtful advice. It makes perfect sense to me. I updated the code.
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640311#comment-17640311 ]

ASF GitHub Bot commented on HDFS-16550:
---------------------------------------

tomscut commented on code in PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#discussion_r1034187735

## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournaledEditsCache.java:
## @@ -221,6 +224,24 @@ public void testCacheMalformedInput() throws Exception {
     cache.retrieveEdits(-1, 10, new ArrayList<>());
   }

+  @Test
+  public void testCacheFraction() {
+    // Set dfs.journalnode.edit-cache-size.bytes.
+    Configuration config = new Configuration();
+    config.setInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY, 1);
+    config.setFloat(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY, 0.1f);
+    cache = new JournaledEditsCache(config);
+    assertEquals(1, cache.getCapacity(), 0.0);
+
+    // Don't set dfs.journalnode.edit-cache-size.bytes.
+    Configuration config1 = new Configuration();
+    config1.setFloat(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY, 0.1f);
+    cache = new JournaledEditsCache(config1);
+    assertEquals(
+        memory * config1.getFloat(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY, 0.0f),
+        cache.getCapacity(), 0.0);

Review Comment:
   Because we compute `capacity` by `capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY, (int) Runtime.getRuntime().maxMemory() * fraction)`, the assert result will be:
   ```
   java.lang.AssertionError:
   Expected :190893260
   Actual   :190893264
   ```
   So I will update it to `assertEquals((int) (Runtime.getRuntime().maxMemory() * 0.1f), cache.getCapacity())`.
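[Editor's note] The mismatch tomscut reports is ordinary float rounding: a float mantissa has only 24 bits, so a product near 1.9e8 is representable only in steps of 16, and rounding before versus after truncation disagrees. A self-contained illustration (the heap size is an example value chosen to reproduce the numbers above):

```java
public class FloatPrecisionDemo {
  public static void main(String[] args) {
    long maxMemory = 1908932608L; // example heap size, ~1.8 GiB
    // long * float promotes to float, which rounds the product
    // up to the nearest representable float, 190893264.0.
    int viaFloat = (int) (maxMemory * 0.1f);  // 190893264
    // double arithmetic keeps enough precision to truncate to 190893260.
    int viaDouble = (int) (maxMemory * 0.1);  // 190893260
    System.out.println(viaFloat + " vs " + viaDouble);
  }
}
```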
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640308#comment-17640308 ]

ASF GitHub Bot commented on HDFS-16550:
---------------------------------------

tomscut commented on code in PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#discussion_r1034183546

## hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml:
## @@ -4955,6 +4955,19 @@

Review Comment:
   Thanks for your thoughtful advice.
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640301#comment-17640301 ]

ASF GitHub Bot commented on HDFS-16550:
---------------------------------------

xkrogen commented on code in PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#discussion_r1034168326

## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournaledEditsCache.java:
## @@ -221,6 +224,24 @@ public void testCacheMalformedInput() throws Exception {
     cache.retrieveEdits(-1, 10, new ArrayList<>());
   }

+  @Test
+  public void testCacheFraction() {
+    // Set dfs.journalnode.edit-cache-size.bytes.
+    Configuration config = new Configuration();
+    config.setInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY, 1);
+    config.setFloat(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY, 0.1f);
+    cache = new JournaledEditsCache(config);
+    assertEquals(1, cache.getCapacity(), 0.0);
+
+    // Don't set dfs.journalnode.edit-cache-size.bytes.
+    Configuration config1 = new Configuration();
+    config1.setFloat(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY, 0.1f);
+    cache = new JournaledEditsCache(config1);
+    assertEquals(
+        memory * config1.getFloat(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY, 0.0f),
+        cache.getCapacity(), 0.0);

Review Comment:
   Why? `getCapacity()` returns an int, so we are doing int-to-int comparison.
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640300#comment-17640300 ]

ASF GitHub Bot commented on HDFS-16550:
---------------------------------------

tomscut commented on code in PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#discussion_r1034167277

## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournaledEditsCache.java:
## @@ -221,6 +224,24 @@ public void testCacheMalformedInput() throws Exception {
     cache.retrieveEdits(-1, 10, new ArrayList<>());
   }

+  @Test
+  public void testCacheFraction() {
+    // Set dfs.journalnode.edit-cache-size.bytes.
+    Configuration config = new Configuration();
+    config.setInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY, 1);
+    config.setFloat(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY, 0.1f);
+    cache = new JournaledEditsCache(config);
+    assertEquals(1, cache.getCapacity(), 0.0);
+
+    // Don't set dfs.journalnode.edit-cache-size.bytes.
+    Configuration config1 = new Configuration();
+    config1.setFloat(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY, 0.1f);
+    cache = new JournaledEditsCache(config1);
+    assertEquals(
+        memory * config1.getFloat(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY, 0.0f),
+        cache.getCapacity(), 0.0);

Review Comment:
   There will be a loss of accuracy.
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640224#comment-17640224 ]

ASF GitHub Bot commented on HDFS-16550:
---------------------------------------

xkrogen commented on code in PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#discussion_r1033958544

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournaledEditsCache.java:
## @@ -277,11 +279,10 @@ void storeEdits(byte[] inputData, long newStartTxn, long newEndTxn,
       initialize(INVALID_TXN_ID);
       Journal.LOG.warn(String.format("A single batch of edits was too " +
           "large to fit into the cache: startTxn = %d, endTxn = %d, " +
-          "input length = %d. The capacity of the cache (%s) must be " +
+          "input length = %d. The capacity of the cache must be " +

Review Comment:
   Can we keep the key here in the error message, but print both? Like `(%s or %s)` where one is `DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY` and one is `DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY`

## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournaledEditsCache.java:
## @@ -221,6 +224,24 @@ public void testCacheMalformedInput() throws Exception {
     cache.retrieveEdits(-1, 10, new ArrayList<>());
   }

+  @Test
+  public void testCacheFraction() {
+    // Set dfs.journalnode.edit-cache-size.bytes.
+    Configuration config = new Configuration();
+    config.setInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY, 1);
+    config.setFloat(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY, 0.1f);
+    cache = new JournaledEditsCache(config);
+    assertEquals(1, cache.getCapacity(), 0.0);

Review Comment:
   why do we do this as a floating-point comparison? can't we just do `assertEquals(1, cache.getCapacity())`

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournaledEditsCache.java:
## @@ -121,12 +121,14 @@ class JournaledEditsCache {
   // ** End lock-protected fields **

   JournaledEditsCache(Configuration conf) {
+    float fraction = conf.getFloat(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY,
+        DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_DEFAULT);

Review Comment:
   maybe enforce a check here to guarantee that `fraction < 1.0` ? to fail-fast in case of misconfigurations

## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournaledEditsCache.java:
## @@ -221,6 +224,24 @@ public void testCacheMalformedInput() throws Exception {
     cache.retrieveEdits(-1, 10, new ArrayList<>());
   }

+  @Test
+  public void testCacheFraction() {

Review Comment:
   how about `testCacheSizeConfigs` ? since we are testing both of them and how they interact

## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournaledEditsCache.java:
## @@ -221,6 +224,24 @@ public void testCacheMalformedInput() throws Exception {
+    // Don't set dfs.journalnode.edit-cache-size.bytes.
+    Configuration config1 = new Configuration();
+    config1.setFloat(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY, 0.1f);
+    cache = new JournaledEditsCache(config1);
+    assertEquals(
+        memory * config1.getFloat(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY, 0.0f),
+        cache.getCapacity(), 0.0);

Review Comment:
   how about just this?
   ```suggestion
       assertEquals(memory / 10, cache.getCapacity());
   ```

## hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml:
## @@ -4955,6 +4955,19 @@
+  <property>
+    <name>dfs.journalnode.edit-cache-size.fraction</name>
+    <value>0.5f</value>
+    <description>
+      This ratio refers to the proportion of the maximum memory of the JVM.
+      Used to calculate the size of the edits cache that is kept in the JournalNode's memory.
+    </description>
+  </property>

Review Comment:
   we should explicitly mention that this is an alternative to the `.bytes` config and either copy some of the guidance there (about when it will be enabled, txn size, etc.) or reference it like "see the documentation for `dfs.journalnode.edit-cache-size.bytes` to learn more"

## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournaledEditsCache.java:
## @@ -221,6 +224,24 @@ public void
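[Editor's note] Reading the first review comment above into code, the warning might name both keys roughly as follows. This is a sketch only: the trailing message text is abbreviated, and the variable names are taken from the quoted `storeEdits` signature.

```java
// Hypothetical wording that prints both cache-size config keys.
Journal.LOG.warn(String.format("A single batch of edits was too " +
    "large to fit into the cache: startTxn = %d, endTxn = %d, " +
    "input length = %d. The capacity of the cache (%s or %s) must be " +
    "increased.",
    newStartTxn, newEndTxn, inputData.length,
    DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
    DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_FRACTION_KEY));
```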
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17639732#comment-17639732 ]

ASF GitHub Bot commented on HDFS-16550:
---------------------------------------

tomscut commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1328405489

Hi @xkrogen @ZanderXu , I updated the code according to the suggestion, please take a look. Thanks!
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17639734#comment-17639734 ]

ASF GitHub Bot commented on HDFS-16550:
---

tomscut commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1328405738

> I am -1 on the PR as-is. We have publicly exposed the current config `dfs.journalnode.edit-cache-size.bytes`; we can't just rename it and change the behavior now. I also think there is a lot of value in being able to configure the cache size exactly, rather than as a fraction, but I do recognize the value in using a ratio as a helpful default (one less knob to tune). I would propose:
>
> * _Add_ (not replace) a new config `dfs.journalnode.edit-cache-size.fraction` (or `.ratio`? but either way I think we should maintain the `edit-cache-size` prefix)
> * If `edit-cache-size.bytes` is set, use that value. Otherwise, use the value of `edit-cache-size.fraction * Runtime#maxMemory()`, which has a default value set.
> * I would suggest 0.5 rather than 0.3 for the default value of `fraction` but am open to discussion there.
>
> This still does change the default behavior slightly, since before you would get a 1GB cache and now you get `-Xmx * 0.5`, but there is an easy way to preserve the old behavior, and if you've explicitly configured the cache size (which you probably did, if you're using the feature) then there is no change.

Hi @xkrogen @ZanderXu , I updated the code according to the suggestion. Please take a look. Thanks!
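The lookup order described in the quoted proposal is straightforward to picture. The following is a minimal sketch only, assuming the key name `dfs.journalnode.edit-cache-size.fraction`, the 0.5 default, and the helper method below for illustration; it is not the merged patch:
{code:java}
import org.apache.hadoop.conf.Configuration;

public final class EditCacheCapacity {
  // Assumed key and default names, mirroring the proposal above.
  static final String SIZE_BYTES_KEY = "dfs.journalnode.edit-cache-size.bytes";
  static final String SIZE_FRACTION_KEY = "dfs.journalnode.edit-cache-size.fraction";
  static final float SIZE_FRACTION_DEFAULT = 0.5f;

  /** Resolve the cache capacity: an explicit byte size wins, else a heap fraction. */
  static long resolve(Configuration conf) {
    // Back-compat: an explicitly configured byte size always takes precedence.
    long capacity = conf.getLong(SIZE_BYTES_KEY, -1);
    if (capacity <= 0) {
      // Otherwise derive the capacity from the maximum heap size.
      float fraction = conf.getFloat(SIZE_FRACTION_KEY, SIZE_FRACTION_DEFAULT);
      capacity = (long) (fraction * Runtime.getRuntime().maxMemory());
    }
    return capacity;
  }
}
{code}
Under this scheme, a JournalNode started with -Xmx1G and nothing configured would get roughly a 512 MB cache, while setting `dfs.journalnode.edit-cache-size.bytes` explicitly preserves the old behavior exactly.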
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17639557#comment-17639557 ]

ASF GitHub Bot commented on HDFS-16550:
---

hadoop-yetus commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1328124363

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:---:|---:|---:|:---:|:---:|
| +0 :ok: | reexec | 0m 44s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +0 :ok: | markdownlint | 0m 0s | | markdownlint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| | | | | _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 38m 49s | | trunk passed |
| +1 :green_heart: | compile | 1m 37s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 1m 25s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 18s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 45s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 26s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 39s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 26s | | trunk passed |
| +1 :green_heart: | shadedclient | 23m 14s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 22s | | the patch passed |
| +1 :green_heart: | compile | 1m 28s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 1m 28s | | the patch passed |
| +1 :green_heart: | compile | 1m 18s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 18s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 58s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 20s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 58s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 34s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 22s | | the patch passed |
| +1 :green_heart: | shadedclient | 23m 6s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| -1 :x: | unit | 299m 14s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/7/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 3s | | The patch does not generate ASF License warnings. |
| | | 409m 34s | | |

| Reason | Tests |
|---:|:---|
| Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 |

| Subsystem | Report/Notes |
|---:|:---|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/7/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4209 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint markdownlint |
| uname | Linux 89cdd00a00b1 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / b642d8d1ce241edaae46c249446e3200ed669ec0 |
| Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/7/testReport/ |
| Max. process+thread
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17639178#comment-17639178 ]

ASF GitHub Bot commented on HDFS-16550:
---

hadoop-yetus commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1328052097

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:---:|---:|---:|:---:|:---:|
| +0 :ok: | reexec | 1m 0s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 1s | | xmllint was not available. |
| +0 :ok: | markdownlint | 0m 1s | | markdownlint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| | | | | _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 41m 56s | | trunk passed |
| +1 :green_heart: | compile | 1m 31s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 1m 22s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 7s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 26s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 7s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 28s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 36s | | trunk passed |
| +1 :green_heart: | shadedclient | 25m 48s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 17s | | the patch passed |
| +1 :green_heart: | compile | 1m 23s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 1m 23s | | the patch passed |
| +1 :green_heart: | compile | 1m 15s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 15s | | the patch passed |
| -1 :x: | blanks | 0m 0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/6/artifact/out/blanks-eol.txt) | The patch has 6 line(s) that end in blanks. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| +1 :green_heart: | checkstyle | 0m 53s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 20s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 52s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 22s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 27s | | the patch passed |
| +1 :green_heart: | shadedclient | 25m 30s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| -1 :x: | unit | 411m 51s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/6/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 42s | | The patch does not generate ASF License warnings. |
| | | 527m 33s | | |

| Reason | Tests |
|---:|:---|
| Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 |
| | hadoop.hdfs.server.namenode.ha.TestObserverNode |

| Subsystem | Report/Notes |
|---:|:---|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/6/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4209 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint markdownlint |
| uname | Linux e2d3cdec852b 4.15.0-192-generic #203-Ubuntu SMP Wed Aug 10 17:40:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 14c9099a6d4dc8a162d74bb84923360e145ad791 |
| Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions |
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17638860#comment-17638860 ]

ASF GitHub Bot commented on HDFS-16550:
---

hadoop-yetus commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1328030594

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:---:|---:|---:|:---:|:---:|
| +0 :ok: | reexec | 0m 42s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 1s | | xmllint was not available. |
| +0 :ok: | markdownlint | 0m 1s | | markdownlint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 1s | | The patch appears to include 1 new or modified test files. |
| | | | | _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 38m 57s | | trunk passed |
| +1 :green_heart: | compile | 1m 40s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 1m 38s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 17s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 33s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 15s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 42s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 37s | | trunk passed |
| +1 :green_heart: | shadedclient | 22m 58s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 20s | | the patch passed |
| +1 :green_heart: | compile | 1m 29s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 1m 29s | | the patch passed |
| +1 :green_heart: | compile | 1m 16s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 16s | | the patch passed |
| -1 :x: | blanks | 0m 0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/5/artifact/out/blanks-eol.txt) | The patch has 6 line(s) that end in blanks. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| +1 :green_heart: | checkstyle | 1m 0s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 27s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 57s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 28s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 22s | | the patch passed |
| +1 :green_heart: | shadedclient | 22m 55s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| -1 :x: | unit | 279m 40s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 10s | | The patch does not generate ASF License warnings. |
| | | 389m 36s | | |

| Reason | Tests |
|---:|:---|
| Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 |

| Subsystem | Report/Notes |
|---:|:---|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/5/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4209 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint markdownlint |
| uname | Linux 3572df215a04 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / a4b0dd9b62169e9b32c5e64791e14e37eca4e67a |
| Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636952#comment-17636952 ]

ASF GitHub Bot commented on HDFS-16550:
---

tomscut commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1322873539

> I am -1 on the PR as-is. We have publicly exposed the current config `dfs.journalnode.edit-cache-size.bytes`; we can't just rename it and change the behavior now. I also think there is a lot of value in being able to configure the cache size exactly, rather than as a fraction, but I do recognize the value in using a ratio as a helpful default (one less knob to tune). I would propose:
>
> * _Add_ (not replace) a new config `dfs.journalnode.edit-cache-size.fraction` (or `.ratio`? but either way I think we should maintain the `edit-cache-size` prefix)
> * If `edit-cache-size.bytes` is set, use that value. Otherwise, use the value of `edit-cache-size.fraction * Runtime#maxMemory()`, which has a default value set.
> * I would suggest 0.5 rather than 0.3 for the default value of `fraction` but am open to discussion there.
>
> This still does change the default behavior slightly, since before you would get a 1GB cache and now you get `-Xmx * 0.5`, but there is an easy way to preserve the old behavior, and if you've explicitly configured the cache size (which you probably did, if you're using the feature) then there is no change.

Thank you for your comments and detailed suggestions. It is a good idea to add a new config.
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636773#comment-17636773 ]

ASF GitHub Bot commented on HDFS-16550:
---

xkrogen commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1322361104

I am -1 on the PR as-is. We have publicly exposed the current config `dfs.journalnode.edit-cache-size.bytes`; we can't just rename it and change the behavior now. I also think there is a lot of value in being able to configure the cache size exactly, rather than as a fraction, but I do recognize the value in using a ratio as a helpful default (one less knob to tune). I would propose:

* _Add_ (not replace) a new config `dfs.journalnode.edit-cache-size.fraction` (or `.ratio`? but either way I think we should maintain the `edit-cache-size` prefix)
* If `edit-cache-size.bytes` is set, use that value. Otherwise, use the value of `edit-cache-size.fraction * Runtime#maxMemory()`, which has a default value set.
* I would suggest 0.5 rather than 0.3 for the default value of `fraction` but am open to discussion there.

This still does change the default behavior slightly, since before you would get a 1GB cache and now you get `-Xmx * 0.5`, but there is an easy way to preserve the old behavior, and if you've explicitly configured the cache size (which you probably did, if you're using the feature) then there is no change.
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636381#comment-17636381 ]

ASF GitHub Bot commented on HDFS-16550:
---

tomscut commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1321325256

Hi @tasanuma @ayushtkn , could you also please take a look? Thanks.
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17624999#comment-17624999 ]

ASF GitHub Bot commented on HDFS-16550:
---

ZanderXu commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1293280962

Yeah, either a fixed value or a scaled value looks fine to me. If you want to limit the total cache memory, you need to consider the multiple-namespace case. As you said, the warn log may be ignored, so you need to take some measures to prevent OOM, such as failing to initialize a namespace when the cache memory already reserved is large enough. So for me, the current logic is enough, and the project admin needs to set a reasonable value according to the number of namespaces and the total heap size of the JournalNode if they want to enable this feature.

To be clear, my idea does not block this PR. If anyone thinks this modification is necessary, I will review it carefully later.
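The multiple-namespace concern arises because a JournalNode serving several namespaces builds one edits cache per namespace, so the configured capacity is effectively multiplied. A minimal sketch of the kind of fail-fast guard hinted at above, assuming a hypothetical process-wide tally (none of these names come from the Hadoop codebase):
{code:java}
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical guard for the total cache capacity reserved across namespaces. */
final class AggregateCacheGuard {
  private static final AtomicLong reserved = new AtomicLong();

  /** Reserve capacity for one namespace's cache; fail fast on over-commit. */
  static void reserve(long capacityBytes) {
    long budget = (long) (0.9 * Runtime.getRuntime().maxMemory());
    long total = reserved.addAndGet(capacityBytes);
    if (total > budget) {
      reserved.addAndGet(-capacityBytes); // roll back the failed reservation
      throw new IllegalStateException(String.format(
          "Refusing to enable the edits cache: %d bytes requested on top of "
              + "%d already reserved, but only %d bytes are budgeted",
          capacityBytes, total - capacityBytes, budget));
    }
  }
}
{code}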
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17624115#comment-17624115 ]

ASF GitHub Bot commented on HDFS-16550:
---

tomscut commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1291347727

> @tomscut Thanks for involving me. In my case, I think this PR is unnecessary. But we can print some warning logs to prompt the admin if the set memory is too large, such as more than 90% of the heap size.
>
> But, if anyone thinks this modification is necessary, I will review it carefully later.

Thanks @ZanderXu for the review. There are already warning logs, but they are easy to ignore. Because nothing ties the configured cache size to the heap size, the mismatch is easy to miss when updating the configuration.
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17623540#comment-17623540 ]

ASF GitHub Bot commented on HDFS-16550:
---

ZanderXu commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1289930632

@tomscut Thanks for involving me. In my case, I think this PR is unnecessary. But we can print some warning logs to prompt the admin if the set memory is too large, such as more than 90% of the heap size.

But, if anyone thinks this modification is necessary, I will review it carefully later.
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17623478#comment-17623478 ]

ASF GitHub Bot commented on HDFS-16550:
---

tomscut commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1289848335

Hi @xkrogen @goiri @ZanderXu , could you please take a look? Thanks. The unit test failure is unrelated to this change; it is a separate issue.
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622609#comment-17622609 ]

ASF GitHub Bot commented on HDFS-16550:
---

hadoop-yetus commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1287763426

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:---:|---:|---:|:---:|:---:|
| +0 :ok: | reexec | 0m 35s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +0 :ok: | markdownlint | 0m 0s | | markdownlint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
| | | | | _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 38m 51s | | trunk passed |
| +1 :green_heart: | compile | 1m 36s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 1m 34s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 16s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 34s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 23s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 43s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 35s | | trunk passed |
| +1 :green_heart: | shadedclient | 23m 4s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 20s | | the patch passed |
| +1 :green_heart: | compile | 1m 24s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 1m 24s | | the patch passed |
| +1 :green_heart: | compile | 1m 16s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 16s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 59s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 28s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 56s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 34s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 18s | | the patch passed |
| +1 :green_heart: | shadedclient | 22m 46s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| -1 :x: | unit | 244m 35s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 4s | | The patch does not generate ASF License warnings. |
| | | 353m 46s | | |

| Reason | Tests |
|---:|:---|
| Failed junit tests | hadoop.hdfs.server.namenode.ha.TestObserverNode |

| Subsystem | Report/Notes |
|---:|:---|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/4/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4209 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint markdownlint |
| uname | Linux 779ee7881403 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 5e87c99b7d2ad717f64a2d7180d9e736063d0739 |
| Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/4/testReport/ |
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622361#comment-17622361 ]

ASF GitHub Bot commented on HDFS-16550:
---

hadoop-yetus commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1287182286

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:---:|---:|---:|:---:|:---:|
| +0 :ok: | reexec | 1m 1s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +0 :ok: | markdownlint | 0m 0s | | markdownlint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
| | | | | _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 41m 58s | | trunk passed |
| +1 :green_heart: | compile | 1m 36s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 1m 30s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 18s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 38s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 15s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 37s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 47s | | trunk passed |
| +1 :green_heart: | shadedclient | 25m 58s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 24s | | the patch passed |
| +1 :green_heart: | compile | 1m 28s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 1m 28s | | the patch passed |
| +1 :green_heart: | compile | 1m 20s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 20s | | the patch passed |
| -1 :x: | blanks | 0m 0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/3/artifact/out/blanks-eol.txt) | The patch has 3 line(s) that end in blanks. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| +1 :green_heart: | checkstyle | 0m 59s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 25s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 56s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 28s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 30s | | the patch passed |
| +1 :green_heart: | shadedclient | 28m 37s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 :green_heart: | unit | 333m 52s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 24s | | The patch does not generate ASF License warnings. |
| | | 455m 56s | | |

| Subsystem | Report/Notes |
|---:|:---|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/3/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4209 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint markdownlint |
| uname | Linux dfe3de39c34f 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 8fe6dc3e9e24a42a8210b930a0827d77f753b361 |
| Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/3/testReport/ |
| Max. process+thread count | 2403 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621738#comment-17621738 ]

ASF GitHub Bot commented on HDFS-16550:
---

hadoop-yetus commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1286567852

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:---:|---:|---:|:---:|:---:|
| +0 :ok: | reexec | 0m 39s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 1s | | xmllint was not available. |
| +0 :ok: | markdownlint | 0m 1s | | markdownlint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
| | | | | _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 42m 0s | | trunk passed |
| +1 :green_heart: | compile | 1m 34s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 1m 29s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 21s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 37s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 20s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 44s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 39s | | trunk passed |
| +1 :green_heart: | shadedclient | 23m 18s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 24s | | the patch passed |
| +1 :green_heart: | compile | 1m 26s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 1m 26s | | the patch passed |
| +1 :green_heart: | compile | 1m 20s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 20s | | the patch passed |
| -1 :x: | blanks | 0m 0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/2/artifact/out/blanks-eol.txt) | The patch has 3 line(s) that end in blanks. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| -0 :warning: | checkstyle | 1m 2s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 201 unchanged - 0 fixed = 203 total (was 201) |
| +1 :green_heart: | mvnsite | 1m 27s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 57s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 28s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 17s | | the patch passed |
| +1 :green_heart: | shadedclient | 23m 2s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 :green_heart: | unit | 244m 16s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 3s | | The patch does not generate ASF License warnings. |
| | | 357m 42s | | |

| Subsystem | Report/Notes |
|---:|:---|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4209 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint markdownlint |
| uname | Linux 01507fad7bdc 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / d18fa4a4b6296268d56c831da39e0d26329cfb0d |
| Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526151#comment-17526151 ]

tomscut commented on HDFS-16550:
---

I have submitted a simple PR following the fast-fail approach described in the issue:

> IMO, when *dfs.journalnode.edit-cache-size.bytes > threshold * Runtime.getRuntime().maxMemory()*, we should throw an Exception and fast fail, giving users a clear hint to update the related configurations. Or, if the cache size exceeds 50% (or some other threshold) of maxMemory, force the cache size to be 25% of maxMemory.

[~sunchao] [~xkrogen] Please help take a look at it, thank you very much.
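For context, a minimal sketch of that fast-fail variant, which would tighten the warn-only branch shown in the issue description into a hard error. The method name, the 0.9 threshold, and the choice of IllegalArgumentException are assumptions for illustration; the discussion above ultimately moved toward the fraction-based default instead:
{code:java}
final class EditCacheValidation {
  /** Hypothetical fail-fast check for the configured cache capacity. */
  static void validateCapacity(long capacity) {
    long maxMemory = Runtime.getRuntime().maxMemory();
    if (capacity > 0.9 * maxMemory) {
      // The existing code only logs a warning here; failing fast forces
      // the operator to fix the configuration before the JournalNode starts.
      throw new IllegalArgumentException(String.format(
          "Cache capacity is set at %d bytes but maximum JVM memory is only "
              + "%d bytes. Decrease the cache size or increase the heap size.",
          capacity, maxMemory));
    }
  }
}
{code}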