Hi Andrey,
Sorry for the late reply. I double-checked the code and found that the
`recover` command can solve the problem, but it has a performance
issue.

When we use the `bin/bookkeeper shell recover` command to decommission
a bookie, the ledger replication runs on the node where we execute the
recover command, not on the auto-recovery pods. If the bookie holds
4TB of ledger data to be replicated, the replication can't be
parallelized by adding more auto-recovery instances.
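
As a rough illustration of the gap, here is a back-of-the-envelope
sketch in Python. The 100 MB/s per-worker throughput and the worker
count of 8 are assumptions picked for illustration, not measured
numbers, and `replication_hours` is a hypothetical helper. It also
assumes ideal linear scaling across workers, which is optimistic.

```python
def replication_hours(data_bytes, per_worker_bytes_per_sec, workers=1):
    """Estimate hours needed to re-replicate data_bytes.

    Assumes every worker sustains the same throughput and the work
    splits evenly (ideal linear scaling).
    """
    if workers < 1 or per_worker_bytes_per_sec <= 0:
        raise ValueError("workers must be >= 1 and throughput must be > 0")
    seconds = data_bytes / (per_worker_bytes_per_sec * workers)
    return seconds / 3600


TB = 10**12
MB = 10**6

data = 4 * TB          # ledger data held by the bookie (figure from above)
throughput = 100 * MB  # ASSUMPTION: sustained bytes/sec per worker

# `recover` command: all replication runs on the one node that invoked it.
single = replication_hours(data, throughput, workers=1)

# Hypothetical alternative: the same work spread over 8 auto-recovery pods.
parallel = replication_hours(data, throughput, workers=8)

print(f"single node: {single:.1f} h")
print(f"8 workers:   {parallel:.1f} h")
```

With these assumed numbers, the single node needs about 11 hours,
while 8 workers would need about 1.4 hours.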

IMO, we need another way to decommission a bookie instead of the
`recover` command.

Thanks,
Hang

Andrey Yegorov <andrey.yego...@datastax.com> wrote on Thu, Mar 30, 2023 at 05:54:
>
> Hi,
>
> You can use the "recover" command instead.
>
> Switch bookie to read-only (via REST API)
> bin/bookkeeper shell recover ..
> recover command also has a flag to delete the cookie in ZK.
> As an additional benefit, this way you can decomm bookie with ledgers
> created with write quorum = 1.
>
> HTH.
>
> On Sun, Mar 26, 2023 at 9:27 PM Hang Chen <chenh...@apache.org> wrote:
>
> > Hi guys, I found that BookKeeper decommission may be blocked by
> > ledgers that cannot be replicated.
> >
> > Current bookie decommission process:
> >   - Step 1: Use the command `bin/bookkeeper shell listunderreplicated`
> > to check whether any ledgers are in the under-replicated state
> >   - Step 2: After all the ledgers are fully replicated, stop the
> > bookie and use the command `bin/bookkeeper shell decommissionbookie
> > -bookieid <bookieaddress>` to trigger decommission
> >   - Step 3: Wait for all the ledgers to be replicated; the bookie
> > decommission process will then complete
> >
> > However, there is a bug in the decommissioning process.
> >
> > In Step 1, ledgers are marked as under-replicated by the following
> > checks:
> >   - Auditor lost-bookie check: triggered in two cases: a) a bookie
> > has been lost for longer than `lostBookieRecoveryDelay`, b) every
> > `auditorPeriodicBookieCheckInterval` (default: 24 hours).
> >   - Auditor all-ledgers check: triggered every
> > `auditorPeriodicCheckInterval` (default: 7 days). It checks every
> > ledger's fragments with the following steps:
> >     - For every fragment, select the entries to read according to
> > `auditorLedgerVerificationPercentage`. The default is `0`, which
> > means only the first and last entries of the fragment are checked.
> >     - Read those entries from all the bookies in the fragment's
> > ensemble. If any read fails, mark the ledger as under-replicated.
> >
> >
> >
> > When we use the `bin/bookkeeper shell listunderreplicated` command to
> > check whether any ledgers are under-replicated, the result only covers
> > ledgers that were missing replicas as of the last check. The last
> > lost-bookie check may have been up to 24 hours ago, and the last
> > all-ledgers check up to seven days ago. Replicas lost between the
> > last check and now are not marked. Suppose we set EnsembleSize=3,
> > WriteQuorumSize=2, and AckQuorumSize=1, and decommission one bookie
> > with the current decommission process. In that case, some ledgers
> > may become impossible to replicate because their only available
> > replica is on the decommissioned bookie.
> >
> > Moreover, the Auditor's all-ledgers check only reads the first and
> > last entries of each fragment. If the bookie has journal writes
> > disabled and some entries in a fragment are lost while the first and
> > last entries still exist, the checker won't detect the loss.
> >
> > ### Options
> > There are two options to tune the decommissioning process.
> >
> > 1. Trigger a check of all ledgers before Step 1. It has the following
> > disadvantages.
> >    - It will cost a lot of resources
> >    - It only checks the first and last entries of each fragment of
> > those ledgers by default, so it can't cover all the entries
> >
> > 2. Turn the bookie into read-only mode instead of shutting it down
> > before using the `bin/bookkeeper shell decommissionbookie -bookieid
> > <bookieaddress>` command to trigger decommission. When replicating
> > ledgers located on the decommissioning bookie, the ledgers can be
> > replicated successfully as long as one replica is available.
> >
> > I suggest choosing the second option to tune the current bookie
> > decommission process. Do you have any suggestions?
> >
> > Thanks,
> > Hang
> >
>
>
> --
> Andrey Yegorov
