[ 
https://issues.apache.org/jira/browse/HDFS-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ekanth Sethuramalingam updated HDFS-14317:
------------------------------------------
    Description: 
The standby NameNode uses the following method to decide whether it is time to 
trigger an edit log roll on the active NameNode.
{code}
  /**
   * @return true if the configured log roll period has elapsed.
   */
  private boolean tooLongSinceLastLoad() {
    return logRollPeriodMs >= 0 && 
      (monotonicNow() - lastLoadTimeMs) > logRollPeriodMs;
  }
{code}

In doTailEdits(), lastLoadTimeMs is updated whenever the standby successfully 
tails any edits:

{code}
      if (editsLoaded > 0) {
        lastLoadTimeMs = monotonicNow();
      }
{code}

The default for {{dfs.ha.log-roll.period}} is 120 seconds and the default for 
{{dfs.ha.tail-edits.period}} is 60 seconds. With in-progress edit log tailing 
enabled, the standby typically loads new edits on every 60-second tail, so 
lastLoadTimeMs is refreshed before 120 seconds can elapse and 
tooLongSinceLastLoad() almost never returns true. As a result, edit logs are 
not rolled for a long time, until 
{{dfs.namenode.edit.log.autoroll.multiplier.threshold}} takes effect.
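
The timing interaction can be illustrated with a small standalone simulation. 
This is only a sketch, not the EditLogTailer code: the class name, the 
countRollTriggers() helper, and the inProgressTailing flag are hypothetical, 
the clock is simulated, and it assumes steady write traffic so that every tail 
finds new edits when in-progress tailing is on. Without in-progress tailing it 
assumes edits become loadable only in a cycle where a roll was triggered.

```java
public class LogRollCheckSketch {
  // Defaults quoted in the description above.
  static final long LOG_ROLL_PERIOD_MS = 120_000L;
  static final long TAIL_PERIOD_MS = 60_000L;

  /** Same comparison as tooLongSinceLastLoad(), with the clock passed in. */
  static boolean tooLongSinceLastLoad(long nowMs, long lastLoadTimeMs) {
    return LOG_ROLL_PERIOD_MS >= 0
        && (nowMs - lastLoadTimeMs) > LOG_ROLL_PERIOD_MS;
  }

  /**
   * Runs the given number of tail cycles and counts how many of them would
   * trigger a log roll. The inProgressTailing flag is a modelling assumption:
   * when true, every cycle loads edits; when false, edits are loaded only in
   * a cycle where a roll was triggered.
   */
  static int countRollTriggers(int cycles, boolean inProgressTailing) {
    long nowMs = 0L;
    long lastLoadTimeMs = 0L;
    int triggers = 0;
    for (int i = 0; i < cycles; i++) {
      nowMs += TAIL_PERIOD_MS; // one tail period passes
      boolean rolled = tooLongSinceLastLoad(nowMs, lastLoadTimeMs);
      if (rolled) {
        triggers++;
      }
      boolean editsLoaded = inProgressTailing || rolled;
      if (editsLoaded) {
        lastLoadTimeMs = nowMs; // mirrors the update in doTailEdits()
      }
    }
    return triggers;
  }

  public static void main(String[] args) {
    System.out.println("with in-progress tailing:    "
        + countRollTriggers(10, true));
    System.out.println("without in-progress tailing: "
        + countRollTriggers(10, false));
  }
}
```

Under these assumptions, ten tail cycles with in-progress tailing trigger no 
roll at all, because the elapsed time since the last load never exceeds 60 
seconds, while the same ten cycles without it trigger three rolls.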

[In our deployment, this resulted in in-progress edit logs being deleted. The 
standby was able to checkpoint twice while the in-progress edit log was still 
growing on the active. When the NNStorageRetentionManager cleaned up old 
checkpoints and edit logs, it removed the in-progress edit log from the active 
and QJM (because the txnid of the in-progress edit log was older than the 2 
most recent checkpoints), irrecoverably losing a few minutes' worth of 
metadata].
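
The sequence of events can be pictured with a deliberately simplified model. 
It is not the NNStorageRetentionManager implementation: the Segment class, the 
purgeCandidates() helper, and the rule "a segment is a purge candidate when its 
first txid is older than the oldest retained checkpoint" are assumptions made 
for illustration, based only on the explanation above. The txids are made up.

```java
import java.util.ArrayList;
import java.util.List;

public class RetentionSketch {
  /** Hypothetical edit log segment: only its first txid and its state. */
  static final class Segment {
    final long firstTxId;
    final boolean inProgress;

    Segment(long firstTxId, boolean inProgress) {
      this.firstTxId = firstTxId;
      this.inProgress = inProgress;
    }
  }

  /**
   * Returns the segments that the simplified rule would purge.
   * checkpointTxIds must be sorted in ascending order.
   */
  static List<Segment> purgeCandidates(List<Segment> segments,
      long[] checkpointTxIds, int checkpointsRetained) {
    List<Segment> candidates = new ArrayList<>();
    if (checkpointTxIds.length < checkpointsRetained) {
      return candidates; // not enough checkpoints yet, keep everything
    }
    // Txid of the oldest checkpoint that is still retained.
    long minTxIdToKeep =
        checkpointTxIds[checkpointTxIds.length - checkpointsRetained];
    for (Segment s : segments) {
      // Assumed rule: only the first txid is compared, so a long-lived
      // in-progress segment looks "old" even though it is still growing.
      if (s.firstTxId < minTxIdToKeep) {
        candidates.add(s);
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    // One in-progress segment opened at txid 1001 and never rolled.
    List<Segment> segments = new ArrayList<>();
    segments.add(new Segment(1001L, true));
    // The standby checkpoints twice while the segment keeps growing.
    long[] checkpoints = {1000L, 5000L, 9000L};
    List<Segment> purged = purgeCandidates(segments, checkpoints, 2);
    System.out.println("segments selected for purge: " + purged.size());
  }
}
```

In this model the never-rolled segment starts before both retained checkpoints, 
so it is selected for cleanup even though it still holds the newest 
transactions.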


> Standby does not trigger edit log rolling when in-progress edit log tailing 
> is enabled
> --------------------------------------------------------------------------------------
>
>                 Key: HDFS-14317
>                 URL: https://issues.apache.org/jira/browse/HDFS-14317
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.9.0, 3.0.0
>            Reporter: Ekanth Sethuramalingam
>            Assignee: Ekanth Sethuramalingam
>            Priority: Critical


