Vladimir,
Hi, guys
We have problems with corosync 1.4.x (1.4.6 and 1.4.7).
This is the scenario:
We have a 3-node cluster: node-1 (10.108.2.3), node-2 (10.108.2.4), node-3 (10.108.2.5), running in UDPU mode.
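For reference, a udpu cluster with these three nodes is typically described in the totem interface section of corosync.conf roughly like this (a sketch only; bindnetaddr and mcastport are assumed defaults, not values taken from the actual config):

  totem {
      version: 2
      transport: udpu
      interface {
          ringnumber: 0
          bindnetaddr: 10.108.2.0
          mcastport: 5405
          member {
              memberaddr: 10.108.2.3
          }
          member {
              memberaddr: 10.108.2.4
          }
          member {
              memberaddr: 10.108.2.5
          }
      }
  }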
We shut down the interface used for cluster communication on one of the
nodes by issuing the `ifdown eth2` command on node-1.
Yes, this is a known problem. There is actually a patch for corosync 2.x,
but it has not been reviewed yet and has not been backported to flatiron.
Just don't do ifdown, and don't use NM in non-server mode (it does an
ifdown when the cable is unplugged).
If you need to test failover, either block the traffic on the switch,
unplug the network cable (without NM, or with NM in server mode), or
block the traffic via iptables (but carefully: you have to block
everything except localhost...)
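For example, something along these lines on node-1 would cut the cluster
traffic without touching the interface itself (eth2 as in your test; this
is only a sketch, narrow it to corosync's UDP port if you don't want to
drop everything on that interface):

  # drop everything going in and out of the cluster interface;
  # loopback is left alone, so corosync can still talk to itself
  iptables -A INPUT -i eth2 -j DROP
  iptables -A OUTPUT -o eth2 -j DROP

  # and to undo it afterwards
  iptables -D INPUT -i eth2 -j DROP
  iptables -D OUTPUT -o eth2 -j DROP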
Regards,
Honza
node-2 and node-3 detect this and understand that node-1 has left the cluster:
runtime.totem.pg.mrp.srp.members.67267594.ip=r(0) ip(10.108.2.4)
runtime.totem.pg.mrp.srp.members.67267594.join_count=1
runtime.totem.pg.mrp.srp.members.67267594.status=joined
runtime.totem.pg.mrp.srp.members.84044810.ip=r(0) ip(10.108.2.5)
runtime.totem.pg.mrp.srp.members.84044810.join_count=2
runtime.totem.pg.mrp.srp.members.84044810.status=joined
runtime.totem.pg.mrp.srp.members.50490378.ip=r(0) ip(10.108.2.3)
runtime.totem.pg.mrp.srp.members.50490378.join_count=1
runtime.totem.pg.mrp.srp.members.50490378.status=left
But node-1 does not notice it and thinks that node-2 and node-3 are still
online:
runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure=1
runtime.totem.pg.mrp.srp.members.50490378.ip=r(0) ip(10.108.2.3)
runtime.totem.pg.mrp.srp.members.50490378.join_count=1
runtime.totem.pg.mrp.srp.members.50490378.status=joined
runtime.totem.pg.mrp.srp.members.67267594.ip=r(0) ip(10.108.2.4)
runtime.totem.pg.mrp.srp.members.67267594.join_count=1
runtime.totem.pg.mrp.srp.members.67267594.status=joined
runtime.totem.pg.mrp.srp.members.84044810.ip=r(0) ip(10.108.2.5)
runtime.totem.pg.mrp.srp.members.84044810.join_count=1
runtime.totem.pg.mrp.srp.members.84044810.status=joined
runtime.blackbox.dump_flight_data=no
runtime.blackbox.dump_state=no
In node-1's logs I see the following:
2014-08-13T15:46:18.848234+01:00 warning: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
2014-08-13T15:46:18.866365+01:00 debug: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)
2014-08-13T15:46:18.866799+01:00 debug: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)
2014-08-13T15:46:18.866799+01:00 debug: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)
2014-08-13T15:46:18.866799+01:00 debug: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)
2014-08-13T15:46:18.866799+01:00 debug: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)
2014-08-13T15:46:18.866799+01:00 debug: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)
2014-08-13T15:46:18.935539+01:00 debug: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)
2014-08-13T15:46:18.935932+01:00 debug: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)
2014-08-13T15:46:18.935932+01:00 debug: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)
2014-08-13T15:46:18.935932+01:00 debug: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)
2014-08-13T15:46:18.935932+01:00 debug: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)
I am pretty sure this is a bug, as corosync should detect the ring
failure and mark dead nodes as dead on both sides.
Can you help me fix it, or give me a clue where to look for the fix?
_______________________________________________
discuss mailing list
[email protected]
http://lists.corosync.org/mailman/listinfo/discuss