Hi guys, I found that bookie decommissioning in BookKeeper can be
blocked by ledgers that cannot be replicated.

The current bookie decommission process:
  - Step 1: Use the command `bin/bookkeeper shell listunderreplicated`
to check whether any ledgers are in the under-replicated state
  - Step 2: After all of those ledgers have been fully replicated, stop
the bookie and use the command `bin/bookkeeper shell decommissionbookie
-bookieid <bookieaddress>` to trigger the decommission
  - Step 3: Wait for all the ledgers on that bookie to be replicated,
at which point the decommission process completes

However, there is a bug in the decommissioning process.

In Step 1, ledgers are marked as under-replicated by the following
checks:
  - Auditor lost-bookie check: triggered in two cases: a) a bookie has
been lost for longer than `lostBookieRecoveryDelay`, b) periodically,
every `auditorPeriodicBookieCheckInterval` (default: 24 hours).
  - Auditor check of all ledgers: triggered every
`auditorPeriodicCheckInterval` (default: 7 days). It checks every
ledger's fragments with the following steps:
    - For every fragment, select the entries to read according to
`auditorLedgerVerificationPercentage`. The default is `0`, which means
only the first and last entries of the fragment are checked.
    - Read the selected entries from all the bookies in the fragment's
ensemble. If any read fails, mark the ledger as under-replicated.


When we use the `bin/bookkeeper shell listunderreplicated` command to
check for under-replicated ledgers, the result only reflects replicas
that were already missing at the time of the last check. With the
default settings, the last lost-bookie check may have run up to 24
hours ago, and the last check of all ledgers up to seven days ago.
Replicas that went missing between the last check and now are not
marked. Suppose we set EnsembleSize=3, WriteQuorumSize=2, and
AckQuorumSize=1, and decommission one bookie with the current
decommission process. Some ledgers may then be impossible to replicate
because their only available replica was on the decommissioned (and
already stopped) bookie.
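
To make the scenario concrete, here is a small Python sketch. It is a
toy model, not BookKeeper code: the bookie names, the `write_set` and
`unrecoverable_entries` helpers, and the round-robin placement of
entries across the ensemble are all assumptions made for illustration.

```python
# Toy model (not BookKeeper code). Round-robin striping of entries across
# the ensemble is an assumption made for this illustration.
ENSEMBLE = ["bookie-0", "bookie-1", "bookie-2"]  # EnsembleSize = 3
WRITE_QUORUM = 2                                  # WriteQuorumSize = 2


def write_set(entry_id):
    """Bookies that should hold a replica of this entry."""
    start = entry_id % len(ENSEMBLE)
    return [ENSEMBLE[(start + i) % len(ENSEMBLE)] for i in range(WRITE_QUORUM)]


def unrecoverable_entries(entry_ids, lost_replicas, stopped_bookie):
    """Entries with no readable replica once `stopped_bookie` is shut down.

    `lost_replicas` is a set of (bookie, entry_id) pairs that went missing
    after the last auditor check, so they are not yet marked under-replicated.
    """
    result = []
    for entry_id in entry_ids:
        readable = [
            b for b in write_set(entry_id)
            if b != stopped_bookie and (b, entry_id) not in lost_replicas
        ]
        if not readable:
            result.append(entry_id)
    return result


# Entry 3 is placed on bookie-0 and bookie-1. Its copy on bookie-1 was lost
# after the last check, so the only remaining replica is on bookie-0.
lost = {("bookie-1", 3)}
stuck = unrecoverable_entries(range(6), lost, stopped_bookie="bookie-0")
print(stuck)  # [3]
```

In this model, `listunderreplicated` would have reported nothing before
the decommission, yet entry 3 cannot be read from any running bookie
once bookie-0 is stopped.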

Moreover, the Auditor's check of all ledgers only reads the first and
last entries of each fragment by default. If a bookie has journal
writes disabled and loses some entries in the middle of a fragment
while the first and last entries still exist, the checker won't detect
the loss.
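
The blind spot can be sketched in a few lines of Python. Again this is
only a toy model under my own assumptions; the helper names are
hypothetical and the real checker logic is more involved.

```python
def entries_checked_by_default(first_entry, last_entry):
    """Entries probed when auditorLedgerVerificationPercentage is 0."""
    return {first_entry, last_entry}


def check_finds_loss(first_entry, last_entry, entries_on_bookie):
    """True if probing only the first and last entries hits a missing one."""
    probed = entries_checked_by_default(first_entry, last_entry)
    return any(e not in entries_on_bookie for e in probed)


# The fragment covers entries 0..9; the bookie lost entries 4 and 5.
stored = set(range(10)) - {4, 5}
print(check_finds_loss(0, 9, stored))  # False: the loss goes unnoticed

# The same check does catch a missing last entry.
print(check_finds_loss(0, 9, set(range(9))))  # True
```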

### Options
There are two options to tune the decommissioning process.

1. Trigger a check of all ledgers before Step 1. It has the following
disadvantages:
   - It costs a lot of resources.
   - By default it only checks the first and last entries of each
fragment, so it cannot cover all entries.

2. Turn the bookie into read-only mode instead of shutting it down
before using the `bin/bookkeeper shell decommissionbookie -bookieid
<bookieaddress>` command to trigger the decommission. When replicating
ledgers located on the bookie being decommissioned, replication
succeeds as long as at least one replica is available, including the
one on that bookie.
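
The difference between the current process and option 2 comes down to
whether the bookie being decommissioned can still serve reads. A
minimal Python sketch, with hypothetical bookie names and a
hypothetical `can_replicate` helper:

```python
def can_replicate(replica_holders, readable_bookies):
    """An entry can be re-replicated if any holder of a copy is readable."""
    return any(b in readable_bookies for b in replica_holders)


# Assumed state: the entry's only surviving copy is on bookie-0, the bookie
# being decommissioned.
holders = ["bookie-0"]

# Current process: bookie-0 is stopped, so it cannot serve reads.
stopped = can_replicate(holders, readable_bookies={"bookie-1", "bookie-2"})

# Option 2: bookie-0 is read-only, so it still serves reads.
read_only = can_replicate(
    holders, readable_bookies={"bookie-0", "bookie-1", "bookie-2"}
)

print(stopped, read_only)  # False True
```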

I suggest choosing the second option to tune the current bookie
decommission process. Do you have any suggestions?

Thanks,
Hang
