We are only using one mount, and that mount has nothing on it currently.
I have fixed the problem. Our OS is Ubuntu 16.04 LTS (Xenial). I added the
17.04 (Zesty) repo to get a newer version of Corosync. I upgraded Corosync,
which upgraded a long list of other related packages (Pacemaker and gfs2
among them). My fencing now works properly. If a node loses network
connection to the cluster, only that node is fenced. Presumably, it is a
bug in one of the packages in the Xenial repo that has been fixed in the
versions in Zesty.

-------
Seth Reid

On Fri, Mar 31, 2017 at 10:31 AM, Bob Peterson <rpete...@redhat.com> wrote:
> ----- Original Message -----
> | I can confirm that doing an ifdown is not the source of my corosync
> | issues. My cluster is in another state, so I can't pull a cable, but
> | I can down a port on a switch. That had the exact same effect as doing
> | an ifdown. Two machines got fenced when it should have only been one.
> |
> | -------
> | Seth Reid
>
> Hi Seth,
>
> I don't know if your problem is the same thing I'm looking at, BUT:
> I'm currently working on a fix to the GFS2 file system for a
> similar problem. The scenario is something like this:
>
> 1. Node X goes down for some reason.
> 2. Node X gets fenced by one of the other nodes.
> 3. As part of the recovery, GFS2 on all the other nodes has to
>    replay the journals for all the file systems mounted on X.
> 4. GFS2 journal replay hogs the CPU, which causes corosync to be
>    starved for CPU on some node (say node Y).
> 5. Since corosync on node Y was starved for CPU, it doesn't respond
>    in time to the other nodes (say node Z).
> 6. Thus, node Z fences node Y.
>
> In my case, the solution is to fix GFS2 so that it does some
> "cond_resched()" (conditional schedule) statements to allow corosync
> (and dlm) to get some work done. Thus, corosync isn't starved for
> CPU, does its work, and therefore doesn't get fenced.
>
> I don't know if that's what is happening in your case.
> Do you have a lot of GFS2 mount points that would need recovery
> when the first fence event occurs?
> In my case, I can recreate the problem by having 60 GFS2 mount points.
>
> Hopefully I'll be sending a GFS2 patch to the cluster-devel
> mailing list for this problem soon.
>
> In testing my fix, I've periodically experienced some weirdness
> and other unexplained fencing, so maybe there's a second problem
> lurking (or maybe there's just something weird in the experimental
> kernel I'm using as a base). Hopefully testing will prove whether
> my fix to GFS2 recovery is enough or if there's another problem.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
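For anyone reading along who isn't familiar with cond_resched(): the change
Bob describes amounts to adding voluntary reschedule points to the recovery
path. Below is a rough sketch of the idea in kernel-style C. It is not the
actual GFS2 patch; the struct and helper names (recovery_journal,
replay_one_block) are invented for illustration, and cond_resched() is the
only real kernel call used.

    /*
     * Rough illustration only -- NOT the actual GFS2 patch.  The struct
     * and helper names below are invented; cond_resched() is the only
     * real kernel API used here.
     */
    #include <linux/sched.h>        /* cond_resched() */

    struct recovery_journal {
            unsigned int nr_blocks; /* journal blocks left to replay */
    };

    /* Hypothetical helper that replays a single journal block. */
    static int replay_one_block(struct recovery_journal *jnl, unsigned int i);

    static int replay_journal(struct recovery_journal *jnl)
    {
            unsigned int i;
            int err;

            for (i = 0; i < jnl->nr_blocks; i++) {
                    err = replay_one_block(jnl, i);
                    if (err)
                            return err;

                    /*
                     * Yield the CPU periodically so runnable corosync and
                     * dlm threads can make progress; otherwise a long
                     * replay can starve corosync and get this node fenced
                     * as well.
                     */
                    cond_resched();
            }
            return 0;
    }

cond_resched() costs almost nothing when no other task needs the CPU, but
when corosync or dlm is runnable it lets them in, so membership/token
traffic keeps flowing even during a long journal replay.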
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org