[ https://issues.apache.org/jira/browse/BOOKKEEPER-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290273#comment-13290273 ]
Ivan Kelly commented on BOOKKEEPER-237:
---------------------------------------

I think the chaining mechanism over-complicates things. In fact, I don't think we should be bookie-focused at all. Rather, we should focus on the ledgers and on keeping them fully replicated. If we detect underreplication for a ledger, we have detected the loss of the bookie anyhow.

I propose an alternative approach. Each bookie has a Recovery worker running. Bookies elect an Auditor among themselves.

Auditor
- Scans the full list of ledgers periodically.
- Builds an in-memory bookie -> ledger index.
- Watches /ledgers/available.
- Periodically scans all ledgers.

On bookie failure:
- Get the ledgers for the failed bookies from the index.
- Scan each of these ledgers.

Scanning a ledger will return a number of LedgerFragmentReplicas, each corresponding to a missing ledger fragment replica. These are stored in /ledgers/underreplicated/L<ledgerid>-E<startentry>-R<replicaindex>

The Recovery worker on each bookie reads the list from /ledgers/underreplicated/, picks an entry, locks it and rereplicates. If a recovery worker crashes halfway, its lock will evaporate, and another recovery worker will be able to do the replication.

> Automatic recovery of under-replicated ledgers and its entries
> --------------------------------------------------------------
>
>                 Key: BOOKKEEPER-237
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-237
>             Project: Bookkeeper
>          Issue Type: New Feature
>          Components: bookkeeper-client, bookkeeper-server
>    Affects Versions: 4.0.0
>            Reporter: Rakesh R
>            Assignee: Rakesh R
>         Attachments: Auto Recovery Detection - distributed chain approach.doc, Auto Recovery and Bookie sync-ups.pdf
>
>
> As per the current design of BookKeeper, if one of the BookKeeper server dies, there is no automatic mechanism to identify and recover the under replicated ledgers and its corresponding entries. This would lead to losing the successfully written entries, which will be a critical problem in sensitive systems.
> This document is trying to describe few proposals to overcome these limitations.
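
To make the Auditor / Recovery worker split proposed in the comment above a bit more concrete, here is a rough in-memory sketch in Java. It is only an illustration under stated assumptions, not BookKeeper code: the class names (AutoRecoverySketch, Fragment, Auditor, LockTable) are hypothetical, the replica index is assumed to be the failed bookie's position in the fragment's ensemble, and a plain map stands in for both /ledgers/underreplicated/ and the lock that "evaporates" when a worker crashes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the proposal; none of these types exist in BookKeeper. */
public class AutoRecoverySketch {

    /** A ledger fragment: entries from startEntry onward, stored on an ensemble of bookies. */
    static final class Fragment {
        final long ledgerId;
        final long startEntry;
        final List<String> ensemble;

        Fragment(long ledgerId, long startEntry, String... ensemble) {
            this.ledgerId = ledgerId;
            this.startEntry = startEntry;
            this.ensemble = Arrays.asList(ensemble);
        }
    }

    /** Builds the name L<ledgerid>-E<startentry>-R<replicaindex> used in the proposal. */
    static String underreplicatedName(long ledgerId, long startEntry, int replicaIndex) {
        return "L" + ledgerId + "-E" + startEntry + "-R" + replicaIndex;
    }

    /** Keeps the in-memory bookie -> ledger index and reacts to bookie failures. */
    static final class Auditor {
        private final Map<String, Set<Long>> bookieToLedgers = new HashMap<>();
        private final Map<Long, List<Fragment>> ledgers = new HashMap<>();

        /** Periodic full scan: rebuild the index from all known fragments. */
        void scanAll(List<Fragment> allFragments) {
            bookieToLedgers.clear();
            ledgers.clear();
            for (Fragment f : allFragments) {
                ledgers.computeIfAbsent(f.ledgerId, k -> new ArrayList<>()).add(f);
                for (String bookie : f.ensemble) {
                    bookieToLedgers.computeIfAbsent(bookie, k -> new TreeSet<>()).add(f.ledgerId);
                }
            }
        }

        /** On bookie failure: look up its ledgers, scan them, name each missing replica. */
        Set<String> onBookieFailure(String failedBookie) {
            Set<String> underreplicated = new TreeSet<>();
            Set<Long> affected = bookieToLedgers.getOrDefault(failedBookie, Collections.emptySet());
            for (long ledgerId : affected) {
                for (Fragment f : ledgers.get(ledgerId)) {
                    for (int i = 0; i < f.ensemble.size(); i++) {
                        if (f.ensemble.get(i).equals(failedBookie)) {
                            underreplicated.add(underreplicatedName(f.ledgerId, f.startEntry, i));
                        }
                    }
                }
            }
            return underreplicated;
        }
    }

    /** Stand-in for per-entry locks that disappear when their owner crashes. */
    static final class LockTable {
        private final Map<String, String> owners = new ConcurrentHashMap<>();

        /** Returns true only for the first worker to claim the entry. */
        boolean tryLock(String name, String worker) {
            return owners.putIfAbsent(name, worker) == null;
        }

        /** Simulates the crashed worker's locks evaporating. */
        void workerCrashed(String worker) {
            owners.values().removeIf(worker::equals);
        }
    }

    public static void main(String[] args) {
        Auditor auditor = new Auditor();
        auditor.scanAll(Arrays.asList(
                new Fragment(1, 0, "bk1", "bk2", "bk3"),
                new Fragment(1, 100, "bk2", "bk3", "bk4"),
                new Fragment(2, 0, "bk3", "bk4", "bk5")));

        Set<String> todo = auditor.onBookieFailure("bk2");
        System.out.println("underreplicated: " + todo);

        LockTable locks = new LockTable();
        String entry = todo.iterator().next();
        System.out.println("worker-A locks: " + locks.tryLock(entry, "worker-A"));
        System.out.println("worker-B locks: " + locks.tryLock(entry, "worker-B"));
        locks.workerCrashed("worker-A");
        System.out.println("worker-B retries: " + locks.tryLock(entry, "worker-B"));
    }
}
```

In a real deployment the lock would presumably be something like a ZooKeeper ephemeral node, so that it goes away with the crashed worker's session; the map above only mimics that behaviour.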