Fuelers,

I have compiled a catalogue of all OpenStack HA fixes that we have implemented so far, have researched, or still need to research and implement.

Here is a summary of where things stand today (I've added the same list to https://etherpad.openstack.org/p/fuel-ha-rabbitmq):

Applied in 5.0, needs a backport to 4.1.1:
- https://review.openstack.org/78178 ocf-neutron-dhcp-orphan
- https://review.openstack.org/93927 nova-reap-deleted-instance
- https://review.openstack.org/77276 oslo-ccn-handling
- https://review.openstack.org/76686 oslo-kombu-reconnect-delay

Proposed for 5.0:
- https://review.openstack.org/93884 ocf-haproxy-vip-colocate
- https://review.openstack.org/93411 rabbitmq-keepalive
- https://review.openstack.org/93815 kernel-match-tcp-keepalive-to-nova-report-interval
- https://review.openstack.org/93883 rabbitmq-hosts-shuffle

Must be implemented in 5.0:
- python-kombu-and-amqp-upgrade (multiple CCN fixes)
- https://launchpadlibrarian.net/160766270/transport.py.patch python-amqp-tcp-user-timeout
- https://bugs.launchpad.net/fuel/+bug/1312177 pacemaker-neutron-agent-stickiness
- https://bugs.launchpad.net/fuel/+bug/1297355 ocf-galera-full-stop
- https://bugs.launchpad.net/fuel/+bug/1293680 ocf-galera-take-donor-out

Should be implemented in 5.1:
- https://bugs.launchpad.net/fuel/+bug/1318936 rabbitmq-does-not-restart

Known not to help or cause breakage:
- https://review.openstack.org/34949 rabbitmq-amqp-heartbeat (requires a heartbeat periodic task in every OpenStack component)

Below is the full catalogue:

pacemaker-haproxy-reload
- applied in 4.0
- https://bugs.launchpad.net/fuel/+bug/1259639
- https://review.openstack.org/61453

ceph-mon-list
- applied in 4.1
- https://bugs.launchpad.net/fuel/+bug/1268579
- https://review.openstack.org/73106

ocf-neutron-agent-pid-matching
- applied in 4.1
- https://bugs.launchpad.net/fuel/+bug/1269334
- https://review.openstack.org/67101

ocf-galera-restart-wait
- applied in 4.1
- https://bugs.launchpad.net/fuel/+bug/1281625
- https://review.openstack.org/74431

pacemaker-fd-leak
- applied in 4.1
- https://bugs.launchpad.net/fuel/+bug/1272840
- https://github.com/ClusterLabs/libqb/commit/b327dbec7380e7de6896f9bb6cb1ca58677f4ed8

pacemaker-broadcast-calculation
- applied in 4.1 # TODO(angdraug): report to upstream
- https://bugs.launchpad.net/fuel/+bug/1277614
- https://review.openstack.org/72438

rabbitmq-hosts
- applied in 4.1
- https://bugs.launchpad.net/fuel/+bug/1285449
- https://review.openstack.org/77409

mysql-read-timeout
- applied in 4.1
- https://bugs.launchpad.net/fuel/+bug/1285449
- https://review.openstack.org/77643

drop-mysql-on-disconnect
- applied in 4.1.1, 5.0 # TODO(angdraug): confirm all fixes are present in 5.0
- https://bugs.launchpad.net/fuel/+bug/1288438
- https://review.openstack.org/81225

haproxy-netns
- applied in 4.1.1, 5.0
- https://review.openstack.org/82518

rabbitmq3
- applied in 4.1.1, 5.0
- depends on rabbitmq3-ha-mode
- https://bugs.launchpad.net/fuel/+bug/1288831

rabbitmq3-ha-mode
- applied in 4.1.1, 5.0
- https://bugs.launchpad.net/fuel/+bug/1296922
- https://review.openstack.org/84707

rabbitmq-init-retry
- applied in 4.1.1, 5.0
- https://bugs.launchpad.net/fuel/+bug/1314617
- https://review.openstack.org/88593

ocf-gratuitous-arp
- applied in 4.1.1, 5.0
- https://bugs.launchpad.net/fuel/+bug/1310676
- https://review.openstack.org/89378

neutron-l3-rootwrap
- applied in 4.1.1, 5.0 # TODO(rmoe): confirm how this is related to the neutron umask/pid flock bug (0751)
- https://bugs.launchpad.net/fuel/+bug/1310926
- https://bugs.launchpad.net/neutron/+bug/1311804

ocf-neutron-l3-cleanup-ns
- applied in 4.1.1, 5.0
- https://review.openstack.org/89872

ocf-neutron-dhcp-cleanup-ns
- applied in 4.1.1, 5.0
- https://bugs.launchpad.net/fuel/+bug/1285929
- https://review.openstack.org/89557

rabbitmq-fd-ulimit
- applied in 4.1.1, 5.0
- https://bugs.launchpad.net/fuel/+bug/1279594
- https://gerrit.mirantis.com/10566

ocf-neutron-agent-lost-mysql
- applied in 4.1.1, 5.0
- https://bugs.launchpad.net/fuel/+bug/1287716
- https://review.openstack.org/77895

ocf-neutron-dhcp-orphan
- applied in 5.0 # TODO(xenolog): backport to 4.1.1
- https://bugs.launchpad.net/fuel/+bug/1285929
- https://review.openstack.org/78178

nova-reap-deleted-instance
- applied in 5.0, proposed for 4.1.1
- https://review.openstack.org/93927

oslo-ccn-handling
- applied in 5.0 # TODO(angdraug): backport to 4.1.1
- https://review.openstack.org/77276

oslo-kombu-reconnect-delay
- applied in 5.0 # TODO(angdraug): backport to 4.1.1
- https://review.openstack.org/76686

ocf-haproxy-vip-colocate
- https://review.openstack.org/93884

rabbitmq-keepalive
- https://review.openstack.org/93411

kernel-match-tcp-keepalive-to-nova-report-interval
- https://review.openstack.org/93815

rabbitmq-hosts-shuffle
- https://review.openstack.org/93883

python-kombu-and-amqp-upgrade
- # NOTE(angdraug): multiple CCN handling fixes
- # TODO(rmoe): try kombu 3.0.15 and amqp 1.4.5; if breaks, check whether kombu 2.5.13 and amqp 1.0.13 is enough

python-amqp-tcp-user-timeout
- depends on python-kombu-and-amqp-upgrade
- https://launchpadlibrarian.net/160766270/transport.py.patch

pacemaker-neutron-agent-stickiness
- https://bugs.launchpad.net/fuel/+bug/1312177

ocf-galera-full-stop
- # NOTE(angdraug): requires a rewrite of galera OCF script
- https://bugs.launchpad.net/fuel/+bug/1297355

ocf-galera-take-donor-out
- https://bugs.launchpad.net/fuel/+bug/1293680

rabbitmq-does-not-restart
- NOTE(angdraug): managing rabbitmq by pacemaker is proposed
- https://bugs.launchpad.net/fuel/+bug/1318936

rabbitmq-amqp-heartbeat
- reverted # NOTE(angdraug): requires a heartbeat periodic task in every OpenStack component <https://lists.launchpad.net/openstack/msg15111.html>
- https://review.openstack.org/34949

Please respond if you know about any other HA fixes and improvements that can help avoid breakage of OpenStack, RabbitMQ, and MySQL on failover.

Thanks,

--
Dmitry Borodaenko

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev