[ https://issues.apache.org/jira/browse/BOOKKEEPER-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284505#comment-13284505 ]
Uma Maheswara Rao G commented on BOOKKEEPER-237:
------------------------------------------------

For work assignment, how about having the bookies compete for the replication work? We already use this approach in HBase for distributed log splitting.

The idea is as follows:

The current distributed chain of watchers can identify the failed nodes and add them at some place in ZK. All bookies can watch that node. Whenever a new failed node is added, the bookies will get a notification and can start competing for the work. The winner will take the replication work. It can also update the state of the replication under the acquired lock node.
If the cluster restarts, the bookies can again take part in the competition for the failed nodes' replication work.
Whenever replication completes, the winner can delete the lock entry and the failed bookie entry from ZK.

In fact, in HBase we have master coordination. But here we will depend on distributed watching to identify failed bookies.

@Rakesh/Flavio, what are your thoughts on this?

> Automatic recovery of under-replicated ledgers and its entries
> --------------------------------------------------------------
>
>                 Key: BOOKKEEPER-237
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-237
>             Project: Bookkeeper
>          Issue Type: New Feature
>          Components: bookkeeper-client, bookkeeper-server
>    Affects Versions: 4.0.0
>            Reporter: Rakesh R
>            Assignee: Rakesh R
>         Attachments: Auto Recovery Detection - distributed chain approach.doc, Auto Recovery and Bookie sync-ups.pdf
>
>
> As per the current design of BookKeeper, if one of the BookKeeper servers
> dies, there is no automatic mechanism to identify and recover the
> under-replicated ledgers and their corresponding entries. This could lead to
> losing successfully written entries, which would be a critical problem in
> sensitive systems. This document describes a few proposals to overcome
> these limitations.

--
This message is automatically generated by JIRA.
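The competition for replication work proposed in the comment above can be illustrated with a small sketch. This is not BookKeeper or HBase code: it is an in-memory stand-in for the ZK nodes, and the class name `ReplicationCompetitionSketch`, its methods, the state strings and the bookie ids are all hypothetical names chosen for illustration. It only assumes that claiming a lock is atomic, so that exactly one competing bookie wins.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** In-memory stand-in for the ZK layout; every name here is hypothetical. */
class ReplicationCompetitionSketch {
    /** Stand-in for a lock node: who owns the work and the replication state. */
    static final class Lock {
        final String owner;
        String state = "STARTED";
        Lock(String owner) { this.owner = owner; }
    }

    // Stand-in for the place in ZK where failed bookies are registered.
    private final Set<String> failedBookies = new HashSet<>();
    // Stand-in for the lock nodes, keyed by failed bookie.
    private final Map<String, Lock> locks = new HashMap<>();

    /** The failure detector registers a failed bookie. */
    synchronized void reportFailure(String failedBookie) {
        failedBookies.add(failedBookie);
    }

    /** A bookie competes for the work; only the first caller gets true. */
    synchronized boolean tryAcquire(String failedBookie, String worker) {
        if (!failedBookies.contains(failedBookie) || locks.containsKey(failedBookie)) {
            return false;
        }
        locks.put(failedBookie, new Lock(worker));
        return true;
    }

    /** The lock owner records progress under its lock. */
    synchronized boolean updateState(String failedBookie, String worker, String state) {
        Lock lock = locks.get(failedBookie);
        if (lock == null || !lock.owner.equals(worker)) {
            return false;
        }
        lock.state = state;
        return true;
    }

    /** On completion the owner deletes the lock entry and the failed bookie entry. */
    synchronized boolean complete(String failedBookie, String worker) {
        Lock lock = locks.get(failedBookie);
        if (lock == null || !lock.owner.equals(worker)) {
            return false;
        }
        locks.remove(failedBookie);
        failedBookies.remove(failedBookie);
        return true;
    }

    synchronized boolean isPending(String failedBookie) {
        return failedBookies.contains(failedBookie);
    }

    public static void main(String[] args) throws InterruptedException {
        ReplicationCompetitionSketch zk = new ReplicationCompetitionSketch();
        zk.reportFailure("bookie-3");

        // Three live bookies are "notified" and compete concurrently.
        String[] workers = {"bookie-1", "bookie-2", "bookie-4"};
        List<String> winners = Collections.synchronizedList(new ArrayList<>());
        Thread[] threads = new Thread[workers.length];
        for (int i = 0; i < workers.length; i++) {
            final String worker = workers[i];
            threads[i] = new Thread(() -> {
                if (zk.tryAcquire("bookie-3", worker)) {
                    winners.add(worker);
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }

        String winner = winners.get(0);
        zk.updateState("bookie-3", winner, "REPLICATING");
        zk.complete("bookie-3", winner);
        System.out.println("winners=" + winners.size()
                + " pending=" + zk.isPending("bookie-3"));
    }
}
```

Which bookie wins varies from run to run, but the program always prints `winners=1 pending=false`. In a real deployment the `synchronized` methods would presumably be replaced by ZooKeeper operations (for example, creating a lock znode, which fails if the node already exists); that mapping is an assumption and is not spelled out in the comment.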