[jira] Commented: (QPID-2220) Assisting manual recovery from a complete persistent cluster crash.

Alan Conway (JIRA) Thu, 26 Nov 2009 10:53:12 -0800

    [ 
https://issues.apache.org/jira/browse/QPID-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782961#action_12782961
 ]


Alan Conway commented on QPID-2220:
-----------------------------------

To clarify the situation: the problem is recovering from a total cluster
failure where there are no clean stores. We want to identify the store that is
the most up to date, i.e. the last one modified with respect to cluster order.
We can do a pretty good job just in cluster code by recording config changes.

Now if 2 or more brokers were killed at the same configuration, we'd like a
more fine-grained comparison.

Using the Persistence ID (PID) works for the Red Hat store because it is a
monotonically increasing value that gets incremented for (almost) every change
to the store (currently it is not incremented for deleting
queues/exchanges/bindings). So if we record the PID value N with the
config change and in recovery we find the store is at PID M, then we know there
were M-N changes to that store since the config change. That's a number we can
meaningfully compare for brokers that died at the same membership.
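
As a rough illustration of the comparison described above, here is a minimal
C++ sketch. The StoreState struct and the helper names are hypothetical, not
part of the Qpid code base. It assumes each store yields three numbers at
recovery: the last recorded config-change counter, the PID N saved with that
config change, and the PID M found in the store. It also assumes the
config-change counter itself does not wrap.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical snapshot of one broker's store as seen during recovery.
struct StoreState {
    std::string broker;          // illustrative label for the broker/store
    std::uint64_t configChange;  // last config-change counter recorded
    std::uint64_t pidAtConfig;   // N: PID saved with that config change
    std::uint64_t pidAtRecovery; // M: PID found in the store at recovery
};

// Changes made to the store since the last recorded config change (M - N).
// Unsigned subtraction is modulo 2^64, so a single wrap-around of the PID
// between the two readings still gives the right difference.
inline std::uint64_t changesSinceConfig(const StoreState& s) {
    return s.pidAtRecovery - s.pidAtConfig;
}

// True if store a looks more up to date than store b: a higher config-change
// counter wins, and M - N is only used as a tiebreaker at equal membership.
inline bool moreRecent(const StoreState& a, const StoreState& b) {
    if (a.configChange != b.configChange)
        return a.configChange > b.configChange;
    return changesSinceConfig(a) > changesSinceConfig(b);
}

// Returns the most up-to-date store, or nullptr if the list is empty.
inline const StoreState* pickBestStore(const std::vector<StoreState>& stores) {
    const StoreState* best = nullptr;
    for (const StoreState& s : stores)
        if (!best || moreRecent(s, *best)) best = &s;
    return best;
}
```

If two stores tie on both values, this sketch keeps the first one it saw; a
real tool would probably report the tie to the operator instead.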

Factors that make this work:
 - value that increases with each change to the db.
 - at runtime we can query the current value to save at each config change  
 - in recovery we can find the value associated with the database

Is that something we could have as an optional API on a MessageStore, or should
we put it on a separate plugin that can optionally be provided by the store
plugin?
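
To make the second option concrete, here is a hedged C++ sketch of a separate,
optional interface that a store plugin may or may not implement. The
MessageStore class below is a simplified stand-in, not the real broker
interface, and ChangeCounterSource, PlainStore, CountingStore and
queryChangeCounter are hypothetical names used only for illustration. The
first option would instead be a virtual getChangeCounter() on MessageStore
with a default that returns 0, as in the issue description.

```cpp
#include <cstdint>

// Simplified stand-in for the broker's store interface (not the real class).
class MessageStore {
  public:
    virtual ~MessageStore() {}
};

// Hypothetical optional interface: a store implements it only if it can
// report a value that increases with each change to the database.
class ChangeCounterSource {
  public:
    virtual ~ChangeCounterSource() {}
    virtual std::uint64_t getChangeCounter() = 0;
};

// Illustrative store with no change counter.
class PlainStore : public MessageStore {};

// Illustrative store that counts its own changes.
class CountingStore : public MessageStore, public ChangeCounterSource {
    std::uint64_t pid;
  public:
    CountingStore() : pid(0) {}
    void recordChange() { ++pid; }  // stands in for an enqueue or dequeue
    std::uint64_t getChangeCounter() { return pid; }
};

// Cluster-side query: returns true and sets value if the store provides a
// counter; returns false and leaves value untouched otherwise, so the caller
// can fall back to config-change counts alone.
inline bool queryChangeCounter(MessageStore& store, std::uint64_t& value) {
    ChangeCounterSource* src = dynamic_cast<ChangeCounterSource*>(&store);
    if (!src) return false;
    value = src->getChangeCounter();
    return true;
}
```

One point in favour of a separate interface is that "not supported" is
distinguishable from a counter that happens to be 0, which a default
return value of 0 cannot express.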

> Assisting manual recovery from a complete persistent cluster crash.
> -------------------------------------------------------------------
>
>                 Key: QPID-2220
>                 URL: https://issues.apache.org/jira/browse/QPID-2220
>             Project: Qpid
>          Issue Type: Improvement
>          Components: C++ Broker
>    Affects Versions: 0.5
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>
> If every member of a persistent cluster crashes then manual intervention is 
> required to identify which store is most up-to-date, so it can be used to 
> recover. We need to provide tools to assist in this identification.
> The cluster can save a config-change counter with each config change (cluster 
> membership change). In recovery, the broker with the highest config-change 
> counter has the best store. 
> However if the last brokers in the cluster crash so close together that none 
> can record a config-change we need an additional decider.
> The store at http://qpidcomponents.org/download.html#persistence maintains a 
> global Persistence ID (PID), a 64-bit value that is incremented for each
> enqueue and dequeue. If the cluster stores (config-change, PID) pairs then in
> recovery we can use (actual PID) - (config-change PID) as a tiebreaker.
> Proposed change to MessageStore API:
>   /** Returns a monotonically increasing value reflecting changes to the store.
>    * The value can wrap around to 0.
>    * Stores need not implement this function; they can simply return 0.
>    */
>   uint64_t getChangeCounter();
> The default implementation just returns 0, in which case the cluster must
> fall back to relying on config-change counts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:[email protected]
