Hi,
We ran into some problems when taking down the Ethernet interface with
“ifconfig eth0 down”.

Our cluster has the following configuration and resources:

  *   Two network interfaces: eth0 and lo (loopback)
  *   Three nodes, with one node in maintenance mode
  *   no-quorum-policy=stop
  *   stonith-enabled=false
  *   PostgreSQL master/slave resource
  *   A master VIP and a replication VIP
  *   The VIPs run on the node where the PostgreSQL master is running
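
For reference, the relevant properties were set roughly like this (a sketch using the crm shell; the node name is a placeholder, and the equivalent pcs commands would also work):

    # cluster-wide properties
    crm configure property no-quorum-policy=stop
    crm configure property stonith-enabled=false

    # put one node into maintenance mode ("node3" is a placeholder hostname)
    crm node maintenance node3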

The two test cases we executed are as follows (exact commands sketched below):

  *   Introduce delay on the Ethernet interface of the PostgreSQL PRIMARY node
      (command: `tc qdisc add dev eth0 root netem delay 8000ms`)
  *   `ifconfig eth0 down` on the PostgreSQL PRIMARY node
  *   We expected both test cases to simulate network problems in the cluster
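
For completeness, here are both fault injections together with the usual commands to revert them between runs (eth0 is the cluster interface on the PRIMARY node):

    # test 1: add an 8-second delay to all outgoing packets
    tc qdisc add dev eth0 root netem delay 8000ms
    # revert test 1: remove the netem qdisc
    tc qdisc del dev eth0 root

    # test 2: take the cluster interface down
    ifconfig eth0 down
    # revert test 2
    ifconfig eth0 up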

In the first case (Ethernet interface delay):

  *   The cluster is divided into a “partition WITH quorum” and a “partition
WITHOUT quorum”
  *   The partition WITHOUT quorum shuts down all its services
  *   The partition WITH quorum takes over the PostgreSQL PRIMARY role and the VIPs
  *   Everything works as expected. Wow!

In the second case (Ethernet interface down):

  *   On that node, we see many errors like the following (the same message repeats every second or two):
     *   Feb 12 14:09:48 corosync [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
  *   But `crm_mon -Afr` on the node whose eth0 is down always shows the
cluster as fully formed:
     *   It shows all the nodes as UP
     *   It shows itself as still running the PostgreSQL PRIMARY (as was the
case before the interface was taken down)
  *   `crm_mon -Afr` on the OTHER nodes shows a different story:
     *   They show the disconnected node as down
     *   One of the other two nodes takes over as PostgreSQL PRIMARY
  *   This leads to a split-brain situation, which was gracefully avoided in the
test case where only “delay is introduced into the interface”
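
To make the inconsistent views concrete, this is roughly how we compare the two sides (one-shot crm_mon output; `corosync-cfgtool -s` prints corosync's own view of the totem ring):

    # on the node whose eth0 is down: Pacemaker still shows a stale, fully-formed cluster
    crm_mon -Afr -1
    corosync-cfgtool -s

    # on one of the surviving nodes: the disconnected node is reported offline
    crm_mon -Afr -1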

Questions:

  *   Is this a known issue with Pacemaker when the Ethernet interface is taken
down?
  *   Is this an incorrect way of testing the cluster? There is some related
information in this thread:
http://www.gossamer-threads.com/lists/linuxha/pacemaker/59738

Regards,
Deba
