[ https://issues.apache.org/jira/browse/BOOKKEEPER-946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sijie Guo resolved BOOKKEEPER-946.
----------------------------------
Resolution: Fixed
Issue resolved by merging pull request 58
[https://github.com/apache/bookkeeper/pull/58]
{noformat}
commit 0abf37c64ced0fe49a6470bc0e2be632e47902d6
Author: Rithin <[email protected]>
AuthorDate: Tue Nov 8 18:02:34 2016 -0800
Commit: Sijie Guo <[email protected]>
CommitDate: Tue Nov 8 18:02:34 2016 -0800
BOOKKEEPER-946: Provide an option to delay auto recovery of lost bookies
If auto recovery is enabled and a bookie goes down for an upgrade, or even if
it loses its zk connection intermittently, the auditor detects it as a lost
bookie and starts under-replication detection, and the replication workers on
other bookie nodes start replicating the under-replicated ledgers. All of this
stops once the bookie comes back up, but by then a few ledgers will already
have been replicated. Given that we have multiple copies of the data, it is
probably not necessary to start recovery as soon as a bookie goes down. We can
wait for an hour or so and then start recovery. This should cover cases like
planned upgrades, intermittent network connectivity loss, etc.
This change:
1) Provides a bookie option 'lostBookieRecoveryDelay', in seconds, which when
   set to a non-zero value delays the start of recovery by that number of
   seconds. By default this option is set to 0, which means the audit is not
   delayed.
2) If another bookie goes down during this interval, recovery is started
   immediately and the one scheduled for the future is canceled.
3) Adds counters to track how many audits were delayed (#1) and how many
   scheduled audits were canceled due to multiple bookie failures (#2).
4) Adds three JUnit tests to verify the new feature.
Author: Rithin <[email protected]>
Reviewers: [email protected] <[email protected]>, Enrico
Olivelli <[email protected]>
Closes #58 from rithin-shetty/audit_delay
{noformat}
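The delay-and-cancel behavior described in points 1 to 3 of the commit message can be sketched roughly as follows. This is an illustration only, not the code merged in the pull request: the class `DelayedAuditSketch`, the method `onBookieLost`, and the counter names `delayedAudits` and `canceledAudits` are all hypothetical, and running the audit on the caller's thread is a simplification. Only the option name `lostBookieRecoveryDelay` and its unit (seconds) come from the commit message.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch only: class, method, and counter names are invented
// for illustration and are not taken from the BookKeeper source.
final class DelayedAuditSketch {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "delayed-audit-sketch");
                t.setDaemon(true); // do not keep the JVM alive for a pending audit
                return t;
            });
    private final long lostBookieRecoveryDelaySecs; // 0 means "do not delay"
    private final Runnable audit;
    private ScheduledFuture<?> pendingAudit; // guarded by "this"

    // Counters corresponding to (#1) delayed audits and (#2) canceled audits.
    final AtomicInteger delayedAudits = new AtomicInteger();
    final AtomicInteger canceledAudits = new AtomicInteger();

    DelayedAuditSketch(long lostBookieRecoveryDelaySecs, Runnable audit) {
        this.lostBookieRecoveryDelaySecs = lostBookieRecoveryDelaySecs;
        this.audit = audit;
    }

    /** Called each time a bookie is detected as lost. */
    synchronized void onBookieLost() {
        if (lostBookieRecoveryDelaySecs == 0) {
            audit.run(); // default: no delay, audit right away
            return;
        }
        if (pendingAudit != null && !pendingAudit.isDone()) {
            // Another bookie was lost inside the delay window:
            // cancel the scheduled audit and run one immediately.
            if (pendingAudit.cancel(false)) {
                canceledAudits.incrementAndGet();
            }
            pendingAudit = null;
            audit.run();
            return;
        }
        // First loss: give the bookie a chance to come back before auditing.
        pendingAudit = executor.schedule(
                audit, lostBookieRecoveryDelaySecs, TimeUnit.SECONDS);
        delayedAudits.incrementAndGet();
    }

    void shutdown() {
        executor.shutdownNow();
    }
}
```

In this sketch, constructing the object with a delay of 3600 would correspond to the "hour or so" suggested in the commit message: the first lost bookie only schedules an audit, and a second loss inside the window cancels that schedule and audits at once.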
> Provide an option to delay auto recovery of lost bookies
> --------------------------------------------------------
>
> Key: BOOKKEEPER-946
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-946
> Project: Bookkeeper
> Issue Type: Improvement
> Components: bookkeeper-server
> Affects Versions: 4.5.0
> Reporter: Rithin Shetty
> Assignee: Rithin Shetty
> Priority: Minor
> Fix For: 4.5.0
>
> Attachments:
> org.apache.bookkeeper.replication.AuditorLedgerCheckerTest-output.txt,
> org.apache.bookkeeper.replication.AuditorLedgerCheckerTest-output.txt
>
>
> If auto recovery is enabled, and a bookie goes down for upgrade or even if it
> loses zk connection intermittently, the auditor detects it as a lost bookie
> and starts under replication detection and the replication workers on other
> bookie nodes start replicating the under replicated ledgers. All of this
> stops once the bookie comes up but by then a few ledgers would get
> replicated. Given the fact that we have multiple copies of data, it is
> probably not necessary to start the recovery as soon as a bookie goes down.
> We can probably wait for an hour or so and then start recovery. This should
> cover cases like planned upgrade, intermittent network connectivity loss,
> etc. The amount of time to wait can be an option and the default would be to
> not wait at all (i.e. retain current behavior).
> Of course, if more than one bookie goes down within a short interval, we
> could decide to start auto recovery without waiting.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)