Shutting down upon an exception is consistent with how the old cluster used to work, and it's presumably the least code to write. The process manager (e.g. rgmanager) would then just start the failed qpidd process again and it would re-sync.
On the other hand, re-syncing only the failed queue would let the rest of the backup remain usable: instead of losing an entire backup, you potentially lose only one queue's worth of messages (i.e. the rest of the queues stay "ready"). Is it better to have most of the queues ready, or is it safer to just shoot the entire backup when any part of it is inconsistent?

Andy

On Sep 27, 2012, at 1:27 PM, Alan Conway wrote:

> I'm thinking about error handling for HA backups, and I'm interested
> in opinions about what's the right way to go.
>
> Presently, if a backup encounters an exception while replicating, it
> shuts down. That avoids possible spinning on trying to re-connect and
> allows a new and hopefully error-free replica to be started.
>
> Is there a softer option? In theory it would be possible for a broker
> to try to reset and re-start replication on just the queue that
> failed. Is that desirable, or does it just mask the fact that
> something has gone wrong? A replica doing such a restart is no longer
> "ready": there are messages in the restarted queue that have not been
> replicated, so fail-over to this backup before it catches up could
> lose messages.
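The two recovery policies being weighed can be sketched in a few lines. This is a minimal, hypothetical illustration, not Qpid code: the names (Backup, QueueReplica, on_replication_error) and the policy flag are invented for the example, and the real broker's replication machinery is of course far more involved.

```python
import sys

class QueueReplica:
    """One replicated queue on a backup broker (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.ready = True   # caught up with the primary

    def restart(self):
        # Re-sync this queue from scratch. Until it catches up, the
        # replica is not safe to fail over to, so mark it not ready.
        self.ready = False

class Backup:
    """A backup broker holding several queue replicas (illustrative only)."""
    def __init__(self, queue_names, policy="shutdown"):
        self.queues = {n: QueueReplica(n) for n in queue_names}
        self.policy = policy

    def on_replication_error(self, queue_name):
        if self.policy == "shutdown":
            # Fail-fast: kill the whole broker. The process manager
            # (e.g. rgmanager) restarts it and everything re-syncs.
            sys.exit(1)
        else:
            # Softer option: restart only the failed queue. The backup
            # keeps serving the other queues, but it is no longer fully
            # "ready" -- failing over now could lose messages on the
            # restarted queue.
            self.queues[queue_name].restart()

    def is_ready(self):
        # Safe fail-over target only if every queue is caught up.
        return all(q.ready for q in self.queues.values())
```

Under the per-queue policy, a single replication error leaves `is_ready()` false until the affected queue catches up again, which is exactly the "no longer ready" state the thread worries about masking.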
