Folks, JFYI: several major RabbitMQ HA failover bugs were fixed in the Fuel 6.1 release scope.

Short story:
1) The AMQP cluster failover time was dramatically shortened, from ~350 to ~220 seconds on average.
2) A full cluster downtime is *no longer* expected while the failover is in progress.
These fixes are also about to be backported to the 5.1.x/6.0.x milestones.

Long story:
* RabbitMQ fence daemon startup bug [0]. W/o this daemon running, the rabbit node failover time was *significantly* higher.
* Fix for the full RabbitMQ cluster downtime issue [1] during failover of the master of the multistate pacemaker resource. W/o this fix, all of the rabbit nodes would have been kept down until the failover finished.
* Decreased mnesia_table_loading_timeout to 10 seconds [2]. This makes the failover a bit faster.
* Incomplete mnesia files removal [3]. W/o this fix, the rabbit app may sometimes fail to start.
* Some other fixes in the OCF logic for the demote/stop/promote actions [4] (ready for review, testing in progress). W/o these fixes, the failover took much longer than it should and could sometimes even fail, requiring manual steps (restarting the RabbitMQ cluster resource in pacemaker) to finish.

Also, several fixes related to the bug [5] were merged: [6], [7], but an issue in the OCF script design still persists. Namely, a node might sometimes miss its join event, and the OCF monitor action might not detect this, because the RabbitMQ pacemaker resource agent keeps the rabbit app stopped unless it is really safe to start it. Hence, the monitor/start/promote actions must be drastically redesigned in order to fix this. The issue does not happen often: for example, in the long-run failover test I've been running for a while, it may appear at the 23rd iteration, and it looks completely random.

Note, no additional troubleshooting steps need to be described in the ops documentation, as the related patch [8] covers this case as well. However, these changes require an update to the RabbitMQ clustering flow charts [9] (in progress).
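For anyone who wants to run a similar long-run failover test, here is a minimal Python sketch of such a loop. It is not the actual test tooling: trigger_failover and cluster_is_healthy are hypothetical callables you would have to supply yourself (say, something that takes down the master rabbit node and something that checks the cluster status), and the timeout and poll interval are placeholders.

```python
import time
from typing import Callable, List, Optional, Tuple


def run_failover_iterations(
    trigger_failover: Callable[[], None],    # hypothetical: starts one failover
    cluster_is_healthy: Callable[[], bool],  # hypothetical: True once recovered
    iterations: int,
    timeout_s: float,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[int, Optional[float]]]:
    """Return (iteration, recovery time in seconds or None on timeout)."""
    results = []
    for i in range(1, iterations + 1):
        trigger_failover()
        start = clock()
        recovered = None
        # Poll until the cluster reports healthy or the timeout expires.
        while clock() - start < timeout_s:
            if cluster_is_healthy():
                recovered = clock() - start
                break
            sleep(poll_s)
        results.append((i, recovered))
    return results


def failed_iterations(results: List[Tuple[int, Optional[float]]]) -> List[int]:
    """Iterations that never recovered within the timeout."""
    return [i for i, recovered in results if recovered is None]
```

Recording the iteration number next to each result is what makes a rare, random-looking failure (like one showing up only at some late iteration) visible afterwards.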
[0] https://launchpad.net/bugs/1456791
[1] https://bugs.launchpad.net/fuel/+bug/1436812
[2] https://review.openstack.org/184671
[3] https://bugs.launchpad.net/fuel/+bug/1457766
[4] https://review.openstack.org/185044
[5] https://bugs.launchpad.net/fuel/+bug/1455761
[6] https://review.openstack.org/184911
[7] https://review.openstack.org/184671
[8] https://review.openstack.org/184014
[9] http://goo.gl/PPNrw7

--
Best regards,
Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev