[ https://issues.apache.org/jira/browse/HDFS-16550?focusedWorklogId=759716&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759716 ]
ASF GitHub Bot logged work on HDFS-16550:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 21/Apr/22 02:17
            Start Date: 21/Apr/22 02:17
    Worklog Time Spent: 10m
      Work Description: tomscut opened a new pull request, #4209:
URL: https://github.com/apache/hadoop/pull/4209

   JIRA: HDFS-16550. For details, please refer to the JIRA.

Issue Time Tracking
-------------------

            Worklog Id: (was: 759716)
    Remaining Estimate: 0h
            Time Spent: 10m

> [SBN read] Improper cache-size for journal node may cause cluster crash
> -----------------------------------------------------------------------
>
>                 Key: HDFS-16550
>                 URL: https://issues.apache.org/jira/browse/HDFS-16550
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: tomscut
>            Assignee: tomscut
>            Priority: Major
>         Attachments: image-2022-04-21-09-54-29-751.png, image-2022-04-21-09-54-57-111.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When we introduced {*}SBN Read{*}, we ran into the following situation while upgrading the JournalNodes.
> Cluster info:
> *Active: nn0*
> *Standby: nn1*
> 1. Rolling restart of the JournalNodes. {color:#ff0000}(related config: dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}
> 2. The cluster runs for a while.
> 3. The {color:#ff0000}active NameNode (nn0){color} shuts down with "Timed out waiting 120000ms for a quorum of nodes to respond".
> 4. nn1 is transitioned to the active state.
> 5. The {color:#ff0000}new active NameNode (nn1){color} also shuts down with "Timed out waiting 120000ms for a quorum of nodes to respond".
> 6. {color:#ff0000}The cluster crashes{color}.
>
> Related code:
> {code:java}
> JournaledEditsCache(Configuration conf) {
>   capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
>       DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
>   if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
>     Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
>         "maximum JVM memory is only %d bytes. It is recommended that you " +
>         "decrease the cache size or increase the heap size.",
>         capacity, Runtime.getRuntime().maxMemory()));
>   }
>   Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
>       "of bytes: " + capacity);
>   ReadWriteLock lock = new ReentrantReadWriteLock(true);
>   readLock = new AutoCloseableLock(lock.readLock());
>   writeLock = new AutoCloseableLock(lock.writeLock());
>   initialize(INVALID_TXN_ID);
> } {code}
> Currently, *dfs.journalnode.edit-cache-size.bytes* can be set larger than the heap of the JournalNode process. If {*}dfs.journalnode.edit-cache-size.bytes > 0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during JournalNode startup. This is easy for users to overlook, yet once the cluster has been running for some time, it is likely to crash the cluster.
> !image-2022-04-21-09-54-57-111.png|width=1227,height=57!
> IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * Runtime.getRuntime().maxMemory(){*}, we should throw an exception and {color:#ff0000}fail fast{color}, giving users a clear hint to update the related configuration.
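> A minimal sketch of what the proposed fast-fail could look like (the {{EditCacheCapacityCheck}} helper, its method name, and the reuse of the existing 0.9 threshold are illustrative assumptions, not committed Hadoop code):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hdfs.DFSConfigKeys;
>
> // Hypothetical helper: validate the configured cache size once at startup.
> class EditCacheCapacityCheck {
>   // Assumed threshold, mirroring the 0.9 already used by the warn-only check.
>   private static final double MAX_HEAP_FRACTION = 0.9;
>
>   static int validateCapacity(Configuration conf) {
>     int capacity = conf.getInt(
>         DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
>         DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
>     long maxMemory = Runtime.getRuntime().maxMemory();
>     if (capacity > MAX_HEAP_FRACTION * maxMemory) {
>       // Fail fast so the misconfiguration surfaces at JournalNode startup,
>       // not as a quorum timeout after the cache fills up under load.
>       throw new IllegalArgumentException(String.format(
>           "Cache capacity is set at %d bytes but maximum JVM memory is " +
>           "only %d bytes. Please decrease %s or increase the heap size.",
>           capacity, maxMemory,
>           DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY));
>     }
>     return capacity;
>   }
> } {code}
> The JournaledEditsCache constructor would then call this helper in place of the current warn-only branch, so an oversized cache aborts startup with an actionable message instead of surfacing later as a quorum timeout.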