[ 
https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Li updated HDFS-16550:
--------------------------
    Description: 
When we introduced {*}SBN Read{*}, we encountered a problem while upgrading 
the JournalNodes.

Cluster Info: 
*Active: nn0*
*Standby: nn1*

1. Rolling restart of the JournalNodes. {color:#ff0000}(related config: 
dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}

2. The cluster runs for a while; edits cache usage keeps increasing and the 
JournalNode heap is used up (see the sketch after this list).

3. The {color:#ff0000}Active NameNode (nn0){color} shut down because of 
“{_}Timed out waiting 120000ms for a quorum of nodes to respond{_}”.

4. nn1 was transitioned to the Active state.

5. The {color:#ff0000}new Active NameNode (nn1){color} also shut down because 
of “{_}Timed out waiting 120000ms for a quorum of nodes to respond{_}”.

6. {color:#ff0000}The cluster crashed{color}.
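
To make the failure concrete, here is a minimal sketch (not part of the issue) 
of the arithmetic behind this scenario: with -Xmx1G, 
{{Runtime.getRuntime().maxMemory()}} is roughly 1 GiB, so a 1 GiB edits cache 
can consume essentially the whole JournalNode heap, and the size check in the 
{{JournaledEditsCache}} constructor shown under "Related code" below fires only 
as a warning.
{code:java}
// Minimal sketch (not from the issue): with -Xmx1g, maxMemory() is roughly
// 1 GiB, so a 1 GiB edits cache can consume essentially the entire heap.
public class EditCacheSizeDemo {
  public static void main(String[] args) {
    long capacity = 1L << 30;                           // 1 GiB cache, as configured above
    long maxMemory = Runtime.getRuntime().maxMemory();  // ~1 GiB when started with -Xmx1g
    // The current JournaledEditsCache constructor only logs a warning when this
    // condition holds, so the JournalNode still starts and can later exhaust its heap.
    System.out.println("capacity > 0.9 * maxMemory: " + (capacity > 0.9 * maxMemory));
  }
}
{code}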

 

Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
      DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
    Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
        "maximum JVM memory is only %d bytes. It is recommended that you " +
        "decrease the cache size or increase the heap size.",
        capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
      "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *dfs.journalnode.edit-cache-size.bytes* can be set to a larger size 
than the memory available to the process. If 
{*}dfs.journalnode.edit-cache-size.bytes > 0.9 * 
Runtime.getRuntime().maxMemory(){*}, only a warning log is printed during 
JournalNode startup, which is easily overlooked by users. However, once the 
cluster has been running for a while, this misconfiguration is likely to cause 
the cluster to crash.
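
For comparison, a fail-fast variant of the existing check (the alternative from 
the previous version of this description, not the change proposed below) could 
look roughly like this sketch; the helper name is hypothetical and the 
threshold simply mirrors the current 0.9 warning condition.
{code:java}
// Hedged sketch only: fail fast instead of merely warning. The method name is
// hypothetical; the 0.9 threshold mirrors the existing warning condition.
static long validateEditCacheCapacity(long capacity) {
  long maxMemory = Runtime.getRuntime().maxMemory();
  if (capacity > 0.9 * maxMemory) {
    throw new IllegalArgumentException(String.format(
        "Cache capacity is set at %d bytes but maximum JVM memory is only %d "
            + "bytes; decrease dfs.journalnode.edit-cache-size.bytes or "
            + "increase the JournalNode heap size.", capacity, maxMemory));
  }
  return capacity;
}
{code}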

 

NN log:

!image-2022-04-21-09-54-57-111.png|width=1012,height=47!

!image-2022-04-21-12-32-56-170.png|width=809,height=218!

IMO, we should not set the {{cache size}} to a fixed byte value, but rather as 
a ratio of the maximum JVM memory, 0.2 by default.
This avoids the problem of an oversized cache. In addition, users can 
explicitly increase the heap size when they need a larger cache.
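
A minimal sketch of that idea, assuming a fraction-style config key; the key 
name and its 0.2 default here are illustrative, not necessarily the final names 
or values:
{code:java}
// Hedged sketch: derive the cache capacity from a fraction of the JVM's maximum
// memory instead of a fixed byte count. The config key and the 0.2 default are
// illustrative assumptions.
static long computeEditCacheCapacity(org.apache.hadoop.conf.Configuration conf) {
  float fraction = conf.getFloat("dfs.journalnode.edit-cache-size.fraction", 0.2f);
  return (long) (fraction * Runtime.getRuntime().maxMemory());
}
{code}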

  was:
When we introduced {*}SBN Read{*}, we encountered a problem while upgrading 
the JournalNodes.

Cluster Info: 
*Active: nn0*
*Standby: nn1*

1. Rolling restart of the JournalNodes. {color:#ff0000}(related config: 
dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}

2. The cluster runs for a while; edits cache usage keeps increasing and the 
JournalNode heap is used up.

3. The {color:#ff0000}Active NameNode (nn0){color} shut down because of 
“{_}Timed out waiting 120000ms for a quorum of nodes to respond{_}”.

4. nn1 was transitioned to the Active state.

5. The {color:#ff0000}new Active NameNode (nn1){color} also shut down because 
of “{_}Timed out waiting 120000ms for a quorum of nodes to respond{_}”.

6. {color:#ff0000}The cluster crashed{color}.

 

Related code:
{code:java}
JournaledEditsCache(Configuration conf) {
  capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
      DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
  if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
    Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
        "maximum JVM memory is only %d bytes. It is recommended that you " +
        "decrease the cache size or increase the heap size.",
        capacity, Runtime.getRuntime().maxMemory()));
  }
  Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
      "of bytes: " + capacity);
  ReadWriteLock lock = new ReentrantReadWriteLock(true);
  readLock = new AutoCloseableLock(lock.readLock());
  writeLock = new AutoCloseableLock(lock.writeLock());
  initialize(INVALID_TXN_ID);
} {code}
Currently, *dfs.journalnode.edit-cache-size.bytes* can be set to a larger size 
than the memory available to the process. If 
{*}dfs.journalnode.edit-cache-size.bytes > 0.9 * 
Runtime.getRuntime().maxMemory(){*}, only a warning log is printed during 
JournalNode startup, which is easily overlooked by users. However, once the 
cluster has been running for a while, this misconfiguration is likely to cause 
the cluster to crash.

 

NN log:

!image-2022-04-21-09-54-57-111.png|width=1012,height=47!

!image-2022-04-21-12-32-56-170.png|width=809,height=218!

IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * 
Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
{color:#ff0000}fail fast{color}, giving users a clear hint to update the 
related configuration. Alternatively, if the cache size exceeds 50% (or some 
other threshold) of maxMemory, force it to 25% of maxMemory.


> [SBN read] Improper cache-size for journal node may cause cluster crash
> -----------------------------------------------------------------------
>
>                 Key: HDFS-16550
>                 URL: https://issues.apache.org/jira/browse/HDFS-16550
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Tao Li
>            Assignee: Tao Li
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2022-04-21-09-54-29-751.png, 
> image-2022-04-21-09-54-57-111.png, image-2022-04-21-12-32-56-170.png
>
>          Time Spent: 1h
>  Remaining Estimate: 0h


