[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290273#comment-13290273
 ] 

Ivan Kelly commented on BOOKKEEPER-237:
---------------------------------------

I think the chaining mechanism over-complicates things. In fact, I don't think 
we should be bookie-focused at all. Rather, we should focus on the ledgers and 
keeping them fully replicated. If we detect under-replication for a ledger, we 
have detected the loss of the bookie anyhow.

I propose an alternative approach.

Each bookie has a Recovery worker running. 
Bookies elect an Auditor among themselves.

Auditor
   - Periodically scans the full list of ledgers.
   - Builds an in-memory bookie -> ledger index.
   - Watches /ledgers/available.

   On bookie failure:
      - Get the ledgers for the failed bookie from the index.
      - Scan each of these ledgers.
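
The Auditor steps above can be sketched roughly as follows. This is only an 
illustration, not BookKeeper code: the function names, the shape of the ledger 
metadata (a plain dict of ledger id to bookie list) and the way the available 
set is compared are all assumptions made for the example.

```python
from collections import defaultdict


def build_bookie_index(ledger_metadata):
    """Invert ledger -> bookies metadata into a bookie -> ledger ids index."""
    index = defaultdict(set)
    for ledger_id, bookies in ledger_metadata.items():
        for bookie in bookies:
            index[bookie].add(ledger_id)
    return index


def ledgers_to_scan(index, available_before, available_now):
    """Return the ledgers held by bookies that left the available set."""
    failed = set(available_before) - set(available_now)
    affected = set()
    for bookie in failed:
        # .get avoids creating empty index entries for unknown bookies
        affected |= index.get(bookie, set())
    return sorted(affected)


# Hypothetical metadata: ledger id -> bookies storing that ledger
metadata = {1: ["b1", "b2"], 2: ["b2", "b3"], 3: ["b1", "b3"]}
index = build_bookie_index(metadata)

# b2 disappears from the available set, so ledgers 1 and 2 need scanning
print(ledgers_to_scan(index, {"b1", "b2", "b3"}, {"b1", "b3"}))
```

Only the ledgers found through the index are scanned on failure; the periodic 
full scan remains as a backstop.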

Scanning a ledger will return a number of LedgerFragmentReplicas, each 
corresponding to a missing ledger fragment replica.
These are stored in 
/ledgers/underreplicated/L<ledgerid>-E<startentry>-R<replicaindex>
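
A small sketch of how such names could be built and parsed. The helper names 
are hypothetical, and the example assumes that the ledger id, start entry and 
replica index are all written as plain decimal integers, which the proposal 
does not specify.

```python
import re

PREFIX = "/ledgers/underreplicated/"
# Assumption: all three fields are non-negative decimal integers
_NAME = re.compile(r"^L(\d+)-E(\d+)-R(\d+)$")


def replica_path(ledger_id, start_entry, replica_index):
    """Build the path for one missing ledger fragment replica."""
    return f"{PREFIX}L{ledger_id}-E{start_entry}-R{replica_index}"


def parse_replica_path(path):
    """Return (ledger_id, start_entry, replica_index) from a path or bare name."""
    name = path[len(PREFIX):] if path.startswith(PREFIX) else path
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not an underreplicated entry name: {path!r}")
    return tuple(int(group) for group in match.groups())


path = replica_path(42, 0, 1)
print(path)
print(parse_replica_path(path))
```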

The recovery worker on each bookie reads the list from /ledgers/underreplicated/, 
picks an entry, locks it and re-replicates it.
If a recovery worker crashes halfway, its lock will evaporate, and another 
recovery worker will be able to do the replication. 
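
The pick, lock and re-replicate loop can be simulated in memory as below. This 
is a sketch under stated assumptions: InMemoryLocks stands in for whatever 
lock mechanism is actually used, drop_worker models a crashed worker's locks 
evaporating, and all names here are hypothetical.

```python
class InMemoryLocks:
    """Stand-in for the real lock service; tracks entry -> owning worker."""

    def __init__(self):
        self._owners = {}

    def try_lock(self, entry, worker):
        if entry in self._owners:
            return False
        self._owners[entry] = worker
        return True

    def release(self, entry):
        self._owners.pop(entry, None)

    def drop_worker(self, worker):
        # Models the locks of a crashed worker evaporating
        for entry in [e for e, w in self._owners.items() if w == worker]:
            del self._owners[entry]


def run_worker(worker, pending, locks, rereplicate):
    """Re-replicate every pending entry this worker manages to lock."""
    done = []
    for entry in sorted(pending):  # sorted() copies, so pending can shrink
        if not locks.try_lock(entry, worker):
            continue  # another worker holds it
        rereplicate(entry)
        pending.discard(entry)
        locks.release(entry)
        done.append(entry)
    return done


pending = {"L1-E0-R0", "L2-E0-R1"}
locks = InMemoryLocks()
locks.try_lock("L1-E0-R0", "worker-a")  # worker-a locks an entry, then crashes
first = run_worker("worker-b", pending, locks, lambda entry: None)
locks.drop_worker("worker-a")           # worker-a's lock evaporates
second = run_worker("worker-b", pending, locks, lambda entry: None)
print(first, second, pending)
```

While worker-a's lock exists, worker-b skips that entry; once the lock is 
gone, worker-b picks it up on its next pass.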
                
> Automatic recovery of under-replicated ledgers and its entries
> --------------------------------------------------------------
>
>                 Key: BOOKKEEPER-237
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-237
>             Project: Bookkeeper
>          Issue Type: New Feature
>          Components: bookkeeper-client, bookkeeper-server
>    Affects Versions: 4.0.0
>            Reporter: Rakesh R
>            Assignee: Rakesh R
>         Attachments: Auto Recovery Detection - distributed chain 
> approach.doc, Auto Recovery and Bookie sync-ups.pdf
>
>
> As per the current design of BookKeeper, if one of the BookKeeper servers 
> dies, there is no automatic mechanism to identify and recover the 
> under-replicated ledgers and their corresponding entries. This would lead to 
> losing successfully written entries, which would be a critical problem in 
> sensitive systems. This document describes a few proposals to overcome 
> these limitations. 
