Hi, Andrew

> Ah good, I understood it correctly then :)
> I would be interested in your opinion of how the other agent does the bootstrapping (ie. without notifications or master/slave).

> That makes sense, the part I’m struggling with is that it sounds like the other agent shouldn’t work at all.
> Yet we’ve used it extensively and not experienced these kinds of hangs.

Regarding other scripts: I am not aware of any other scripts that actually handle a cloned rabbitmq server. I may be mistaken, of course. So if you know that these scripts succeed in creating a rabbitmq cluster that actually survives 1-node or all-node failure scenarios and reassembles itself automatically, please let us know.

> Changing the state isn’t ideal but there is precedent, the part that has me concerned is the error codes coming out of notify.
> Apart from producing some log messages, I can’t think how it would produce any recovery.
> Unless you’re relying on the subsequent monitor operation to notice the error state.
> I guess that would work but you might be waiting a while for it to notice.

Yes, we are relying on subsequent monitor operations. We also have several OCF check levels to catch the case when a node does not have the rabbitmq application started properly (btw, there was a strange bug where we had to wait for several non-zero checks to fail before the resource was restarted: http://bugs.clusterlabs.org/show_bug.cgi?id=5243). I now remember why we return errors from notify: for error logging, I guess.

On Thu, Nov 12, 2015 at 1:30 AM, Andrew Beekhof <abeek...@redhat.com> wrote:

>
> > On 11 Nov 2015, at 11:35 PM, Vladimir Kuklin <vkuk...@mirantis.com> wrote:
> >
> > Hi, Andrew
> >
> > Let me answer your questions.
> >
> > This agent is active/active which actually marks one of the nodes as a 'pseudo'-master which is used as a target for other nodes to join to. We also check which node is a master and use it in monitor action to check whether this node is clustered with this 'master' node. When we do cluster bootstrap, we need to decide which node to mark as a master node. Then, when it starts (actually, promotes), we can finally pick its name through notification mechanism and ask other nodes to join this cluster.
>
> Ah good, I understood it correctly then :)
> I would be interested in your opinion of how the other agent does the bootstrapping (ie. without notifications or master/slave).
>
> >
> > Regarding disconnect_node+forget_cluster_node this is quite simple - we need to eject node from the cluster. Otherwise it is mentioned in the list of cluster nodes and a lot of cluster actions, e.g. list_queues, will hang forever as well as forget_cluster_node action.
>
> That makes sense, the part I’m struggling with is that it sounds like the other agent shouldn’t work at all.
> Yet we’ve used it extensively and not experienced these kinds of hangs.
>
> >
> > We also handle this case whenever a node leaves the cluster. If you remember, I wrote an email to Pacemaker ML regarding getting notifications on node unjoin event '[openstack-dev] [Fuel][Pacemaker][HA] Notifying clones of offline nodes'.
>
> Oh, I recall that now.
>
> > So we went another way and added a dbus daemon listener that does the same when a node leaves the corosync cluster (we know that this is a little bit racy, but disconnect+forget actions pair is idempotent).
> >
> > Regarding notification commands - we changed behaviour to the one that fits our use cases better and passed our destructive tests. It could be Pacemaker-version dependent, so I agree we should consider changing this behaviour. But so far it worked for us.
>
> Changing the state isn’t ideal but there is precedent, the part that has me concerned is the error codes coming out of notify.
> Apart from producing some log messages, I can’t think how it would produce any recovery.
>
> Unless you’re relying on the subsequent monitor operation to notice the error state.
> I guess that would work but you might be waiting a while for it to notice.
>
> >
> > On Wed, Nov 11, 2015 at 2:12 PM, Andrew Beekhof <abeek...@redhat.com> wrote:
> >
> > > On 11 Nov 2015, at 6:26 PM, bdobre...@mirantis.com wrote:
> > >
> > > Thank you Andrew.
> > > Answers below.
> > > >>>
> > > Sounds interesting, can you give any comment about how it differs to the other[i] upstream agent?
> > > Am I right that this one is effectively A/P and wont function without some kind of shared storage?
> > > Any particular reason you went down this path instead of full A/A?
> > >
> > > [i] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster
> > > <<<
> > > It is based on multistate clone notifications. It requires nothing shared but Corosync info base CIB where all Pacemaker resources are stored anyway.
> > > And it is fully A/A.
> >
> > Oh! So I should skip the A/P parts before "Auto-configuration of a cluster with a Pacemaker"?
> > Is the idea that the master mode is for picking a node to bootstrap the cluster?
> >
> > If so I don’t believe that should be necessary provided you specify ordered=true for the clone.
> > This allows you to assume in the agent that your instance is the only one currently changing state (by starting or stopping).
> > I notice that rabbitmq.com explicitly sets this to false… any particular reason?
> >
> >
> > Regarding the pcs command to create the resource, you can simplify it to:
> >
> > pcs resource create --force --master p_rabbitmq-server ocf:rabbitmq:rabbitmq-server-ha \
> >   erlang_cookie=DPMDALGUKEOMPTHWPYKC node_port=5672 \
> >   op monitor interval=30 timeout=60 \
> >   op monitor interval=27 role=Master timeout=60 \
> >   op monitor interval=103 role=Slave timeout=60 OCF_CHECK_LEVEL=30 \
> >   meta notify=true ordered=false interleave=true master-max=1 master-node-max=1
> >
> > If you update the stop/start/notify/promote/demote timeouts in the agent’s metadata.
> >
> >
> > Lines 1602,1565,1621,1632,1657, and 1678 have the notify command returning an error.
> > Was this logic tested? Because pacemaker does not currently support/allow notify actions to fail.
> > IIRC pacemaker simply ignores them.
> >
> > Modifying the resource state in notifications is also highly unusual.
> > What was the reason for that?
> >
> > I notice that on node down, this agent makes disconnect_node and forget_cluster_node calls.
> > The other upstream agent does not, do you have any information about the bad things that might happen as a result?
> >
> > Basically I’m looking for what each option does differently/better with a view to converging on a single implementation.
> > I don’t much care in which location it lives.
> >
> > I’m CC’ing the other upstream maintainer, it would be good if you guys could have a chat :-)
> >
> > > All running rabbit nodes may process AMQP connections. Master state is only for a cluster initial point at which other slaves may join to it.
> > > Note, here you can find events flow charts as well [0]
> > > [0] https://www.rabbitmq.com/pacemaker.html
> > > Regards,
> > > Bogdan
> > >
> > > __________________________________________________________________________
> > > OpenStack Development Mailing List (not for usage questions)
> > > Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> > > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >
> > --
> > Yours Faithfully,
> > Vladimir Kuklin,
> > Fuel Library Tech Lead,
> > Mirantis, Inc.
> > +7 (495) 640-49-04
> > +7 (926) 702-39-68
> > Skype kuklinvv
> > 35bk3, Vorontsovskaya Str.
> > Moscow, Russia,
> > www.mirantis.com
> > www.mirantis.ru
> > vkuk...@mirantis.com

--
Yours Faithfully,
Vladimir Kuklin,
Fuel Library Tech Lead,
Mirantis, Inc.
+7 (495) 640-49-04
+7 (926) 702-39-68
Skype kuklinvv
35bk3, Vorontsovskaya Str.
Moscow, Russia,
www.mirantis.com
www.mirantis.ru
vkuk...@mirantis.com