----- Original Message -----
| I can confirm that doing an ifdown is not the source of my corosync
| issues. My cluster is in another state, so I can't pull a cable, but I
| can down a port on a switch. That had the exact same effects as doing
| an ifdown. Two machines got fenced when it should have only been one.
|
| -------
| Seth Reid
| System Operations Engineer
| Vendini, Inc.
| 415.349.7736
| sr...@vendini.com
| www.vendini.com
Hi Seth,

I don't know if your problem is the same thing I'm looking at, but I'm
currently working on a fix to the GFS2 file system for a similar
problem. The scenario is something like this:

1. Node X goes down for some reason.
2. Node X gets fenced by one of the other nodes.
3. As part of the recovery, GFS2 on all the other nodes has to replay
   the journals for all the file systems that were mounted on X.
4. The GFS2 journal replay hogs the CPU, which leaves corosync starved
   for CPU on some node (say node Y).
5. Because corosync on node Y was starved for CPU, it doesn't respond
   in time to the other nodes (say node Z).
6. Thus, node Z fences node Y.

In my case, the solution is to fix GFS2 so that it issues some
"cond_resched()" (conditional schedule) calls during journal replay,
allowing corosync (and dlm) to get some work done. That way corosync
isn't starved for CPU, it does its work on time, and therefore it
doesn't get fenced. There is a rough sketch of the idea below my
signature.

I don't know if that's what is happening in your case. Do you have a
lot of GFS2 mount points that would need recovery when the first fence
event occurs? In my case, I can recreate the problem with 60 GFS2
mount points.

Hopefully I'll be sending a GFS2 patch to the cluster-devel mailing
list for this problem soon. While testing my fix, I've periodically
seen some weirdness and other unexplained fencing, so maybe there's a
second problem lurking (or maybe there's just something odd in the
experimental kernel I'm using as a base). Further testing should show
whether my fix to GFS2 recovery is enough, or whether there's another
problem.

Regards,

Bob Peterson
Red Hat File Systems
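To make the fix concrete, here is a minimal sketch of the idea. This is
not the actual GFS2 recovery code: the journal structure, its fields,
and the replay helper are made up for illustration. Only cond_resched()
itself is the real kernel primitive being described.

    #include <linux/sched.h>

    struct journal {                  /* made-up type for this sketch */
            unsigned int nr_blocks;   /* blocks left to replay */
            /* ... */
    };

    static void replay_one_block(struct journal *jd, unsigned int blk);

    /*
     * Hypothetical replay loop (illustrative only).  Without the
     * cond_resched(), a long replay of many journals can hold the CPU
     * continuously, so corosync never runs and misses its deadlines;
     * with it, the loop voluntarily yields whenever another runnable
     * task (such as corosync or dlm) is waiting for the CPU.
     */
    static void replay_journal(struct journal *jd)
    {
            unsigned int blk;

            for (blk = 0; blk < jd->nr_blocks; blk++) {
                    replay_one_block(jd, blk);  /* made-up helper */
                    cond_resched();  /* yield the CPU if needed */
            }
    }

In a real patch the cond_resched() might be rate-limited (say, once
every N blocks), but the effect is the same: the recovery work stops
monopolizing the CPU for long stretches.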