Hi,

We ran into some problems when we bring down the ethernet interface with `ifconfig eth0 down`.
Our cluster has the following configuration and resources:

* Two network interfaces: eth0 and lo (loopback)
* 3 nodes, with one node put in maintenance mode
* no-quorum-policy=stop
* stonith-enabled=false
* PostgreSQL master/slave resource
* VIP master and VIP replication IPs
* The VIPs run on the node where the PostgreSQL master is running

(The relevant cluster properties are sketched at the end of this mail.)

The two test cases we executed are as follows (the exact commands, including the cleanup between runs, are also sketched at the end of this mail):

* Introduce delay on the ethernet interface of the PostgreSQL primary node
  (command: tc qdisc add dev eth0 root netem delay 8000ms)
* `ifconfig eth0 down` on the PostgreSQL primary node

We expected both test cases to simulate network problems in the cluster.

In the first case (ethernet interface delay):

* The cluster splits into a "partition WITH quorum" and a "partition WITHOUT quorum".
* The partition WITHOUT quorum shuts down all its services.
* The partition WITH quorum takes over the PostgreSQL primary and the VIPs.
* Everything works as expected. Wow!

In the second case (ethernet interface down):

* We see lots of errors like the following on the node:

  Feb 12 14:09:48 corosync [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
  Feb 12 14:09:49 corosync [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
  Feb 12 14:09:51 corosync [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.

* But `crm_mon -Afr` on the node whose eth0 is down always shows the cluster as fully formed:
  - It shows all the nodes as up.
  - It shows itself as the node running the PostgreSQL primary (as was the case before the interface was brought down).
* `crm_mon -Afr` on the OTHER nodes shows a different story:
  - They show the node with eth0 down as offline.
  - One of the other two nodes takes over as PostgreSQL primary.
* This leads to a split-brain situation, which was gracefully avoided in the first test case, where only delay was introduced into the interface.

Questions:

* Is this a known issue with Pacemaker when the ethernet interface is pulled down?
* Is this an incorrect way of testing the cluster?

There is some information regarding the same in this thread:
http://www.gossamer-threads.com/lists/linuxha/pacemaker/59738
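For reference, a minimal sketch of how the cluster-wide options listed above are set, in crmsh syntax (the resource definitions for PostgreSQL and the VIPs are elided; only the two properties named above are shown):

  # cluster-wide options, as described in the configuration list above
  crm configure property no-quorum-policy=stop
  crm configure property stonith-enabled=false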
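And the two fault injections, with the cleanup we used between runs (assuming eth0 is the interface corosync communicates over; the `del`/`up` counterparts are just the standard commands to undo each injection):

  # Test case 1: add 8 seconds of latency on the cluster interface ...
  tc qdisc add dev eth0 root netem delay 8000ms
  # ... and remove the qdisc again after the test
  tc qdisc del dev eth0 root netem

  # Test case 2: take the interface down ...
  ifconfig eth0 down
  # ... and bring it back up after the test
  ifconfig eth0 up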
Regards,
Deba