[ https://issues.apache.org/jira/browse/QPID-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Conway updated QPID-5007:
------------------------------

    Description:
rgmanager has the notion of an ordered domain, where it will try to start services on the highest priority node in the domain.
(see https://fedorahosted.org/cluster/wiki/FailoverDomains)

The problem arises like this:
- start a 2 node cluster with an ordered domain.
- create a queue and put enough messages on it so that catchup takes longer than the time to restart node1
- kill node1, rgmanager relocates the qpidd-primary service to node2
- immediately restart node1
- rgmanager wants to relocate the service to node1 so it:
 - kills the primary on node2 as the first step of relocation
 - attempts to restart the primary on node1, which fails because node1 is still in catchup and there is no primary to catch up from.
- at this point we get into an infinite loop of failed attempts to restart the primary.

The workaround is to set the nofailback option on the domain.

See also: https://bugzilla.redhat.com/show_bug.cgi?id=970657

> Qpid HA cluster does not support failback in an ordered domain.
> ---------------------------------------------------------------
>
>                 Key: QPID-5007
>                 URL: https://issues.apache.org/jira/browse/QPID-5007
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.22
>            Reporter: Alan Conway
>            Assignee: Alan Conway
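
For reference, the nofailback workaround named in the description might look roughly like the following in cluster.conf. This is only a sketch under assumptions: the domain name "qpidd-domain" and the priority values are hypothetical, and node1/node2 are the two nodes from the scenario in the description. The rest of the file (cluster nodes, fencing, the qpidd-primary service definition) is omitted.

```xml
<!-- Sketch only: "qpidd-domain" and the priority values are assumed for illustration. -->
<rm>
  <failoverdomains>
    <!-- ordered="1" makes rgmanager prefer the highest priority node;
         nofailback="1" stops it from moving a running service back to
         node1 when node1 rejoins, which avoids killing the primary on
         node2 while node1 is still in catchup. -->
    <failoverdomain name="qpidd-domain" ordered="1" nofailback="1">
      <failoverdomainnode name="node1" priority="1"/>
      <failoverdomainnode name="node2" priority="2"/>
    </failoverdomain>
  </failoverdomains>
</rm>
```

With nofailback set, the domain ordering still decides where the service goes when its current node fails, but a higher priority node coming back online no longer triggers a relocation by itself.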