
https://bugs.launchpad.net/bugs/1563089

Title:
  Memory Leak when new cluster configuration is formed.

Status in corosync package in Ubuntu:
  In Progress
Status in corosync source package in Trusty:
  In Progress

Bug description:
  [Environment]

  Trusty 14.04.3

  Packages:

  ii  corosync             2.3.3-1ubuntu1  amd64  Standards-based cluster framework (daemon and modules)
  ii  libcorosync-common4  2.3.3-1ubuntu1  amd64  Standards-based cluster framework, common library

  [Reproducer]

  1) I deployed an HA environment using this bundle
  (http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/dev/next-ha.yaml),
  with a 3-node installation of cinder related to an hacluster subordinate
  unit.

  $ juju-deployer -c next-ha.yaml -w 600 trusty-kilo

  2) I changed the default corosync transport mode to unicast.

  $ juju set cinder-hacluster corosync_transport=udpu
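
  A quick sanity check, assuming the hacluster charm renders the option into
  the totem section of /etc/corosync/corosync.conf (the exact rendering is
  charm-dependent):

  cinder/0# grep transport /etc/corosync/corosync.conf
          transport: udpu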

  3) I verified that the 3 units were quorate:

  cinder/0# corosync-quorumtool
  Votequorum information
  ----------------------
  Expected votes:   3
  Highest expected: 3
  Total votes:      3
  Quorum:           2
  Flags:            Quorate

  Membership information
  ----------------------
      Nodeid      Votes Name
        1002          1 10.5.1.57 (local)
        1001          1 10.5.1.58
        1000          1 10.5.1.59

  The primary unit was holding the VIP resource 10.5.105.1/16

  root@juju-niedbalski-sec-machine-4:/home/ubuntu# ip addr
  2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc netem state UP group default qlen 1000
      link/ether fa:16:3e:d2:19:6f brd ff:ff:ff:ff:ff:ff
      inet 10.5.1.57/16 brd 10.5.255.255 scope global eth0
         valid_lft forever preferred_lft forever
      inet 10.5.105.1/16 brd 10.5.255.255 scope global secondary eth0
         valid_lft forever preferred_lft forever
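
  The placement can also be confirmed from pacemaker's side with a one-shot
  status query (the VIP resource name is generated by the charm, so the grep
  pattern below is only illustrative):

  cinder/0# sudo crm_mon -1 | grep -i vip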

  4) I manually added a tc netem qdisc on the eth0 interface of the node
  holding the VIP resource, introducing a 350 ms delay.

  $ sudo tc qdisc add dev eth0 root netem delay 350ms
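
  The qdisc can be verified with tc, and the extra latency shows up as a
  ~350 ms jump in round-trip time when pinging this node from one of its
  peers:

  $ tc qdisc show dev eth0
  $ ping -c 3 10.5.1.57      # run from another cluster unit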

  5) Right after adding the 350 ms delay on the cinder/0 unit, the corosync
  process reports that one of the processors failed and that a new cluster
  configuration is being formed.

  Mar 28 21:57:41 juju-niedbalski-sec-machine-5 corosync[4584]:  [TOTEM ] A processor failed, forming new configuration.
  Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]:  [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members
  Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]:  [QUORUM] Members[3]: 1002 1001 1000
  Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]:  [MAIN  ] Completed service synchronization, ready to provide service.

  This happens on all of the units.

  6) After receiving this message, I removed the netem qdisc from eth0:

  $ sudo tc qdisc del dev eth0 root netem

  Then, the following messages are logged on the master node:

  Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]:  [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members
  Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]:  [QUORUM] Members[3]: 1002 1001 1000
  Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]:  [MAIN  ] Completed service synchronization, ready to provide service.

  7) While executing steps 5 and 6 repeatedly, I ran the following command to
  track the VSZ and RSS memory usage of the corosync process:

  root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc add dev eth0 root netem delay 350ms
  root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc del dev eth0 root netem
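
  To repeat steps 5 and 6 without manual intervention, a toggle loop such as
  the one below can be used (the sleep values are arbitrary; they only need to
  be long enough for corosync to declare a processor failed and then form a
  new membership):

  $ sudo sh -c 'while true; do
        tc qdisc add dev eth0 root netem delay 350ms
        sleep 240
        tc qdisc del dev eth0 root netem
        sleep 30
    done'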

  $ while true; do sudo ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '[0-9]+' | tee -a memory-usage.log && sleep 1; done

  The results show that both VSZ and RSS (in KiB, as reported by ps) increase
  over time at a high rate.

  25476 4036

  ... (after 5 minutes).

  135644 10352
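
  A rough growth summary can be extracted from the collected log afterwards
  (this assumes the two-column vsz/rss format produced above, values in KiB):

  $ awk '{v=$1; r=$2} NR==1{v0=$1; r0=$2} END{printf "VSZ +%d KiB, RSS +%d KiB over %d samples\n", v-v0, r-r0, NR}' memory-usage.log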

  [Fix]

  So, preliminarily, based on this reproducer, I think that this commit
  (https://github.com/corosync/corosync/commit/600fb4084adcbfe7678b44a83fa8f3d3550f48b9)
  is a good candidate to be backported to Ubuntu Trusty.
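
  A minimal sketch of how the candidate fix could be trialled on a Trusty
  unit, assuming the source package carries a quilt patch series and that the
  upstream commit applies to 2.3.3 without refreshing (the patch file name is
  arbitrary):

  $ sudo apt-get build-dep corosync
  $ apt-get source corosync
  $ cd corosync-2.3.3
  $ wget -O debian/patches/600fb408-upstream.patch \
      https://github.com/corosync/corosync/commit/600fb4084adcbfe7678b44a83fa8f3d3550f48b9.patch
  $ echo 600fb408-upstream.patch >> debian/patches/series
  $ dpkg-buildpackage -us -uc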

  [Test Case]

  * See reproducer
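
  * An illustrative pass/fail check on top of the reproducer (the 2048 KiB
    threshold is arbitrary): record the corosync RSS before and after several
    delay/remove cycles and fail if it keeps growing.

    $ before=$(ps -o rss= -p $(pgrep -o corosync))
    $ # ...run several iterations of steps 5 and 6...
    $ after=$(ps -o rss= -p $(pgrep -o corosync))
    $ [ $((after - before)) -lt 2048 ] && echo PASS || echo FAIL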

  [Backport Impact]

  * Not identified
