Hi Andrey,

Sorry for the late reply. I double-checked the code and found the `recover` command can solve the problem, but it has a performance issue.
When we use the `bin/bookkeeper shell recover` command to decommission a bookie, the ledger replication runs on the node where we run the recover command, not on the auto-recovery pods. If one bookie holds 4TB of ledger data to be replicated, the ledger replication can't be parallelized by adding more auto-recovery instances.

IMO, we need another way to decommission a bookie instead of the `recover` command.

Thanks,
Hang

On Thu, Mar 30, 2023 at 05:54, Andrey Yegorov <andrey.yego...@datastax.com> wrote:
>
> Hi,
>
> You can use "recover" command instead.
>
> Switch bookie to read-only (via REST API)
> bin/bookkeeper shell recover ..
> recover command also has a flag to delete the cookie in ZK.
> As an additional benefit, this way you can decomm bookie with ledgers
> created with write quorum = 1.
>
> HTH.
>
> On Sun, Mar 26, 2023 at 9:27 PM Hang Chen <chenh...@apache.org> wrote:
>
> > Hi guys,
> >
> > I found that the BookKeeper decommission may be blocked by ledgers
> > that cannot be replicated.
> >
> > The current bookie decommission process:
> > - Step 1: Use the command `bin/bookkeeper shell listunderreplicated`
> >   to check whether any ledgers are in the under-replicated state.
> > - Step 2: After all the ledgers are fully replicated, stop the
> >   bookie and use the command `bin/bookkeeper shell decommissionbookie
> >   -bookieid <bookieaddress>` to trigger decommission.
> > - Step 3: Wait for all the ledgers to be replicated, and the bookie
> >   decommission process will complete.
> >
> > However, there is a bug in the decommissioning process.
> >
> > In Step 1, ledgers are marked as under-replicated by the following
> > checks:
> > - Auditor lost-bookie check: triggered in two cases: a) one bookie
> >   is lost, after `lostBookieRecoveryDelay`, b) every
> >   `auditorPeriodicBookieCheckInterval`. The default is 24 hours.
> > - Auditor all-ledgers check: triggered every
> >   `auditorPeriodicCheckInterval`. The default is 7 days.
> >   It will check every ledger's fragments with the following steps:
> >     - For every fragment, calculate the pending read entries
> >       according to `auditorLedgerVerificationPercentage`. The default
> >       is `0`, which means only the first and last entries of the
> >       fragment are checked.
> >     - Read the pending entries from all the bookies in the ensemble
> >       list. If any entry read fails, mark the ledger as
> >       under-replicated.
> >
> >
> > When we use the `bin/bookkeeper shell listunderreplicated` command to
> > check whether some ledgers are under-replicated, the result only
> > covers the ledgers that were missing replicas as of the last check.
> > The lost-bookie check may have run up to 24 hours ago, and the
> > all-ledgers check up to seven days ago. Ledgers that lost replicas
> > between the last check and now are not marked. Suppose we set
> > EnsembleSize=3, WriteQuorumSize=2, and AckQuorumSize=1, and
> > decommission one bookie with the current decommission process. In
> > that case, some ledgers may become impossible to replicate because
> > their only available replica is on the decommissioned bookie.
> >
> > Moreover, the Auditor's all-ledgers check only reads the first and
> > last entries of each fragment of those ledgers. If the bookie disabled
> > journal writes and some entries are lost in one fragment, but the
> > first and last entries still exist, the checker won't find the loss.
> >
> > ### Options
> > There are two options to tune the decommissioning process.
> >
> > 1. Trigger the all-ledgers check before Step 1. It has the following
> >    disadvantages:
> >    - It will cost a lot of resources.
> >    - It only checks the first and last entries of each fragment of
> >      those ledgers by default, so it can't cover all the entries.
> >
> > 2. Turn the bookie into read-only mode instead of shutting it down
> >    before using the `bin/bookkeeper shell decommissionbookie -bookieid
> >    <bookieaddress>` command to trigger decommission.
> >    When replicating ledgers located on the decommission bookie, the
> >    ledgers can be replicated successfully if one replica is available.
> >
> > I suggest choosing the second option to tune the current bookie
> > decommission process. Do you have any suggestions?
> >
> > Thanks,
> > Hang
> >
>
>
> --
> Andrey Yegorov
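
To illustrate the detection gap described in the original message (with `auditorLedgerVerificationPercentage` at its default of `0`, only the first and last entries of a fragment are read), here is a minimal Python sketch. It is a simplified model and not BookKeeper's actual checker: the function names and the evenly spaced sampling rule used for non-zero percentages are assumptions made purely for illustration.

```python
def entries_to_check(first_id, last_id, percentage=0.0):
    """Entry IDs a simplified checker would read for one fragment.

    The first and last entries are always included. For a non-zero
    percentage, an ASSUMED evenly spaced sample of the interior is added.
    """
    if last_id < first_id:
        return []
    picked = {first_id, last_id}
    interior = range(first_id + 1, last_id)
    extra = int(len(interior) * percentage / 100)
    if extra > 0:
        step = len(interior) / extra
        picked.update(interior[int(i * step)] for i in range(extra))
    return sorted(picked)


def fragment_looks_healthy(first_id, last_id, stored_entries, percentage=0.0):
    """True if every sampled entry is still present on the bookie."""
    sampled = entries_to_check(first_id, last_id, percentage)
    return all(entry in stored_entries for entry in sampled)


# A fragment covering entries 0..9 where the bookie has lost entries 4 and 5.
stored = set(range(10)) - {4, 5}

print(entries_to_check(0, 9))                      # [0, 9]
print(fragment_looks_healthy(0, 9, stored))        # True: the loss goes unnoticed
print(fragment_looks_healthy(0, 9, stored, 100))   # False: full verification finds it
```

With the default percentage, the fragment passes the check even though two entries are gone, which is why a clean `listunderreplicated` result does not guarantee that every replica is intact.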