Hi,


I previously sent this question to the pacemaker mailing list, but have since 
learned of this list, which is probably more appropriate. Apologies to those 
who follow both and have already seen the question:



I have been doing some testing of a fairly standard pacemaker/corosync setup 
with DRBD (with resource-level fencing) and have noticed the following when 
testing network failures:



- Handling of all ports being blocked is OK, based on hundreds of tests.

- Handling of cable-pulls seems OK, based on only 10 tests.

- ifdown ethX leads to split-brain roughly 50% of the time due to two 
underlying issues:



1. corosync (possibly by design) handles the loss of a network interface 
differently from other network failures. I can only see this from the logs: 
after ifdown I get "[TOTEM ] The network interface is down.", a message that 
does not appear after a cable-pull. This is guesswork on my part, as I don't 
know the code (the two injections I'm comparing are sketched just after this 
list).

2. corosync allows a non-quorate partition, in my case a single node, to 
update the CIB. This behaviour has been confirmed in replies to my earlier 
mails on the pacemaker mailing list, where it was also mentioned that there 
may be improvements in this area in the future. On its own, this seems like a 
bug to me.
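
For reference, the two failure injections I'm comparing in point 1 look 
roughly like this (eth1 is just a stand-in for the cluster interface, and my 
actual port-blocking rules are more specific; this is only the shape of it):

    # "blocked ports" style of test: traffic is dropped, but the
    # interface and its address stay up
    iptables -I INPUT  -i eth1 -j DROP
    iptables -I OUTPUT -o eth1 -j DROP

    # "ifdown" style of test: the interface and its address go away,
    # and corosync logs "[TOTEM ] The network interface is down."
    ifdown eth1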



My question is: can I configure corosync/drbd to handle the ifdown scenario, 
or do I simply have to tell people "do not test with ifdown", as I have seen 
suggested in a few places on the web? If I do have to leave ifdown out of 
testing, how can I be sure that I haven't missed some real network failure 
scenario?
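
For clarity, the resource-level fencing I mentioned is the usual DRBD 
arrangement, along these lines (a sketch only; "r0" is a placeholder resource 
name, and the handler scripts are the ones shipped with DRBD):

    resource r0 {
        disk {
            fencing resource-only;
        }
        handlers {
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }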



It is not feasible to do hundreds of cable-pulls, which is effectively what 
I'm trying to simulate. I will look into introducing failures via the switch, 
but ideally I'd like either to handle ifdown properly or to have a clear 
explanation/justification for corosync's handling of ifdown (or maybe this is 
a general ifdown issue, independent of corosync?).
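
The closest software-only approximation to repeated cable-pulls I can think 
of is cycling drop rules in a loop, something like the sketch below (eth1 and 
the timings are placeholders). The caveat is that this emulates the traffic 
loss but not the loss of link carrier, so it is really closer to my 
port-blocking tests than to a true cable-pull:

    # crude "pull / replug" cycle, repeated many times
    for i in $(seq 1 100); do
        iptables -I INPUT  1 -i eth1 -j DROP    # "pull the cable"
        iptables -I OUTPUT 1 -o eth1 -j DROP
        sleep 60                                # outage long enough to trigger a reaction
        iptables -D INPUT  -i eth1 -j DROP      # "plug it back in"
        iptables -D OUTPUT -o eth1 -j DROP
        sleep 120                               # give the cluster time to re-form
    done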



I would really appreciate advice on this, as it's a serious issue for me.



Thanks,

Tom
