--
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to corosync in Ubuntu.
https://bugs.launchpad.net/bugs/1563089

Title:
  Memory Leak when new cluster configuration is formed.

Status in corosync package in Ubuntu:
  In Progress
Status in corosync source package in Trusty:
  In Progress

Bug description:
  [Environment]

  Trusty 14.04.3

  Packages:
  ii  corosync             2.3.3-1ubuntu1  amd64  Standards-based cluster framework (daemon and modules)
  ii  libcorosync-common4  2.3.3-1ubuntu1  amd64  Standards-based cluster framework, common library

  [Reproducer]

  1) I deployed an HA environment using this bundle
  (http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/dev/next-ha.yaml)
  with a 3-node installation of cinder related to an HACluster
  subordinate unit.

  $ juju-deployer -c next-ha.yaml -w 600 trusty-kilo

  2) I changed the default corosync transport mode to unicast.

  $ juju set cinder-hacluster corosync_transport=udpu

  3) I verified that the 3 units were quorate.

  cinder/0# corosync-quorumtool

  Votequorum information
  ----------------------
  Expected votes:   3
  Highest expected: 3
  Total votes:      3
  Quorum:           2
  Flags:            Quorate

  Membership information
  ----------------------
      Nodeid      Votes Name
        1002          1 10.5.1.57 (local)
        1001          1 10.5.1.58
        1000          1 10.5.1.59

  The primary unit was holding the VIP resource 10.5.105.1/16:

  root@juju-niedbalski-sec-machine-4:/home/ubuntu# ip addr
  2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc netem state UP group default qlen 1000
      link/ether fa:16:3e:d2:19:6f brd ff:ff:ff:ff:ff:ff
      inet 10.5.1.57/16 brd 10.5.255.255 scope global eth0
         valid_lft forever preferred_lft forever
      inet 10.5.105.1/16 brd 10.5.255.255 scope global secondary eth0
         valid_lft forever preferred_lft forever

  4) I manually added a tc netem qdisc to the eth0 interface on the
  node holding the VIP resource, introducing a 350 ms delay.

  $ sudo tc qdisc add dev eth0 root netem delay 350ms

  5) Right after adding the 350 ms delay on the cinder/0 unit, corosync
  reports that a processor failed and that a new cluster configuration
  is being formed.

  Mar 28 21:57:41 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A processor failed, forming new configuration.
  Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members
  Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [QUORUM] Members[3]: 1002 1001 1000
  Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [MAIN ] Completed service synchronization, ready to provide service.

  This happens on all of the units.

  6) After receiving this message, I removed the qdisc from eth0:

  $ sudo tc qdisc del dev eth0 root netem

  Then, the following is logged on the master node:

  Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members
  Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [QUORUM] Members[3]: 1002 1001 1000
  Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [MAIN ] Completed service synchronization, ready to provide service.
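  To repeat steps 5 and 6 in a loop (as step 7 below does), the cycle
  can be scripted. The following is a minimal sketch using the
  interface and delay from above; the iteration count and sleep
  durations are my assumptions, chosen to cover the roughly three
  minutes the logs above show between failure detection and the new
  membership.

  #!/bin/bash
  # Repeatedly inject and remove a 350 ms delay on eth0 to force
  # corosync membership reconfigurations (steps 5 and 6 above).
  # Iteration count and sleep durations are assumptions, not values
  # taken from this report.
  for i in $(seq 1 20); do
      sudo tc qdisc add dev eth0 root netem delay 350ms
      sleep 200   # wait for "A processor failed" / new membership in syslog
      sudo tc qdisc del dev eth0 root netem
      sleep 30    # let the ring re-stabilize before the next cycle
  done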
  7) While executing steps 5 and 6 repeatedly, I ran the following
  command to track the VSZ and RSS memory usage of the corosync
  process:

  root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc add dev eth0 root netem delay 350ms
  root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc del dev eth0 root netem

  $ while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '[0-9]+' | tee -a memory-usage.log; sleep 1; done

  The results show that both VSZ and RSS increase over time at a high
  rate:

  25476  4036
  ...
  (after 5 minutes)
  135644 10352

  [Fix]

  Based on this preliminary reproducer, I think that this commit
  (https://github.com/corosync/corosync/commit/600fb4084adcbfe7678b44a83fa8f3d3550f48b9)
  is a good candidate to be backported to Ubuntu Trusty.

  [Test Case]

  * See the reproducer above; a condensed sketch is appended at the
    end of this report.

  [Backport Impact]

  * Not identified
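  Appendix: a condensed, scriptable version of the test case. This is
  a sketch under the same assumptions as the loop above; the 5-cycle
  count and the 10 MB growth threshold are arbitrary values of mine,
  not from the report.

  #!/bin/bash
  # Record corosync RSS before and after several forced membership
  # reconfigurations and flag excessive growth.
  PID=$(pgrep -o corosync)
  rss() { ps -o rss= -p "$PID"; }
  BEFORE=$(rss)
  for i in $(seq 1 5); do
      sudo tc qdisc add dev eth0 root netem delay 350ms
      sleep 200
      sudo tc qdisc del dev eth0 root netem
      sleep 30
  done
  AFTER=$(rss)
  echo "RSS before: ${BEFORE} KB, after: ${AFTER} KB"
  if [ "$((AFTER - BEFORE))" -gt 10240 ]; then
      echo "FAIL: corosync RSS grew by more than 10 MB (likely leaking)"
      exit 1
  fi
  echo "PASS: no significant RSS growth"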