Hi guys, I found that a BookKeeper bookie decommission may be blocked by ledgers that cannot be replicated.

The current bookie decommission process:

- Step 1: Use the command `bin/bookkeeper shell listunderreplicated` to check whether any ledgers are in the under-replicated state.
- Step 2: After all the ledgers are fully replicated, stop the bookie and use the command `bin/bookkeeper shell decommissionbookie -bookieid <bookieaddress>` to trigger the decommission.
- Step 3: Wait for all the ledgers to be replicated, and the bookie decommission process will complete.

However, there is a bug in the decommissioning process. The under-replicated ledgers listed in Step 1 are marked by the following checks:

- Auditor lost-bookie check: it is triggered in two cases: a) a bookie is lost, after `lostBookieRecoveryDelay`, b) periodically, every `auditorPeriodicBookieCheckInterval`. The default is 24 hours.
- Auditor check of all ledgers: triggered every `auditorPeriodicCheckInterval`. The default is 7 days. It checks every ledger's fragments with the following steps:
  - For every fragment, calculate the entries to read according to `auditorLedgerVerificationPercentage`. The default is `0`, which means only the first and last entries of the fragment are checked.
  - Read those entries from all the bookies in the ensemble list. If any read fails, mark the ledger as under-replicated.

When we use the `bin/bookkeeper shell listunderreplicated` command to check whether any ledgers are under-replicated, the result only covers replicas that were already missing at the time of the last check. The last lost-bookie check may have run up to 24 hours ago, and the last check of all ledgers up to seven days ago. Replicas lost between the last check and the current time are not marked.

Suppose we set EnsembleSize=3, WriteQuorumSize=2, and AckQuorumSize=1, and decommission one bookie with the current decommission process. In that case, some ledgers may become impossible to replicate because their only available replica is on the decommissioned bookie, which has already been stopped.
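
To make the scenario above concrete, here is a minimal Python sketch. It is only a toy model, not BookKeeper code: the bookie IDs and all function names are hypothetical, it assumes entries are striped round-robin across the ensemble, and it simplifies the Auditor check to "read the first and last entries of the fragment" (the `auditorLedgerVerificationPercentage=0` case described above).

```python
from typing import Dict, List, Optional, Set

# Hypothetical bookie IDs; EnsembleSize=3 and WriteQuorumSize=2 as in the example.
ENSEMBLE = ["bookie-0", "bookie-1", "bookie-2"]
WRITE_QUORUM = 2


def write_set(entry_id: int) -> List[str]:
    """Bookies that should hold entry_id, assuming round-robin striping."""
    size = len(ENSEMBLE)
    return [ENSEMBLE[(entry_id + i) % size] for i in range(WRITE_QUORUM)]


def healthy_placement(first: int, last: int) -> Dict[str, Set[int]]:
    """Map each bookie to the entries it stores when no replica is missing."""
    stored: Dict[str, Set[int]] = {bookie: set() for bookie in ENSEMBLE}
    for entry_id in range(first, last + 1):
        for bookie in write_set(entry_id):
            stored[bookie].add(entry_id)
    return stored


def auditor_would_flag(stored: Dict[str, Set[int]], first: int, last: int) -> bool:
    """Simplified check: only the first and last entries of the fragment are read."""
    return any(
        entry_id not in stored[bookie]
        for entry_id in (first, last)
        for bookie in write_set(entry_id)
    )


def unrecoverable_entries(
    stored: Dict[str, Set[int]], first: int, last: int, stopped: Optional[str]
) -> List[int]:
    """Entries with no readable replica once the `stopped` bookie is shut down."""
    lost = []
    for entry_id in range(first, last + 1):
        readable = [
            bookie
            for bookie in write_set(entry_id)
            if bookie != stopped and entry_id in stored[bookie]
        ]
        if not readable:
            lost.append(entry_id)
    return lost


# One fragment with entries 0..5. bookie-1 never stored (or later lost) its copy
# of entry 3, which is possible when AckQuorumSize=1 acknowledges a single copy.
stored = healthy_placement(0, 5)
stored["bookie-1"].discard(3)

print(auditor_would_flag(stored, 0, 5))                 # False
print(unrecoverable_entries(stored, 0, 5, "bookie-0"))  # [3]
print(unrecoverable_entries(stored, 0, 5, None))        # []
```

In this toy model the check does not flag the fragment, yet stopping `bookie-0` leaves entry 3 with no readable replica. If no bookie is stopped (`stopped=None`), the same entry still has one source to copy from.
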

Moreover, the Auditor's check of all ledgers only checks the first and last entries of each fragment. If the bookie has journal writes disabled and some entries in a fragment are lost while the first and last entries still exist, the checker won't detect it.

### Options

There are two options to tune the decommissioning process.

1. Trigger a check of all ledgers before Step 1. It has the following disadvantages.
   - It costs a lot of resources.
   - By default, it only checks the first and last entries of each fragment, so it can't cover all the entries.
2. Turn the bookie into read-only mode instead of shutting it down before using the `bin/bookkeeper shell decommissionbookie -bookieid <bookieaddress>` command to trigger the decommission. When replicating ledgers located on the decommissioning bookie, the ledgers can be replicated successfully as long as one replica is available.

I suggest choosing the second option to tune the current bookie decommission process. Do you have any suggestions?

Thanks,
Hang