On 09/02/2016 08:14 AM, Dan Swartzendruber wrote:
>
> So, I was testing my ZFS dual-head JBOD 2-node cluster. Manual
> failovers worked just fine. I then went to try an acid-test by logging
> in to node A and doing 'systemctl stop network'. Sure enough, pacemaker
> told the APC fencing agent to power-cycle node A. The ZFS pool moved to
> node B as expected. As soon as node A was back up, I migrated the
> pool/IP back to node A. I *thought* all was okay, until a bit later, I
> did 'zpool status', and saw checksum errors on both sides of several of
> the vdevs. After much digging and poking, the only theory I could come
> up with was that maybe the fencing operation was considered complete too
> quickly? I googled for examples using this, and the best tutorial I
> found showed using a power-wait=5, whereas the default seems to be
> power-wait=0? (this is CentOS 7, btw...) I changed it to use 5 instead
That's a reasonable theory -- that's why power_wait is available. It
would be nice if there were a page collecting users' experience with the
ideal power_wait for various devices. Even better if fence-agents used
those values as the defaults.

> of 0, and did a several fencing operations while a guest VM (vsphere via
> NFS) was writing to the pool. So far, no evidence of corruption. BTW,
> the way I was creating and managing the cluster was with the lcmc java
> gui. Possibly the power-wait default of 0 comes from there, I can't
> really tell. Any thoughts or ideas appreciated :)

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
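
For anyone managing the cluster with pcs rather than lcmc, setting the delay might look roughly like the sketch below. This is only an illustration under assumptions: the stonith device ID `apc-fence` is a made-up placeholder, the agent is assumed to be fence_apc, and 5 seconds is just the value from the tutorial mentioned above, not a tested recommendation for any particular PDU.

```sh
# List the parameters the agent accepts, to confirm power_wait is among them
# (assumes the fence_apc agent is installed).
pcs stonith describe fence_apc

# Set power_wait on an existing fence device.
# "apc-fence" is a hypothetical device ID; substitute your own.
pcs stonith update apc-fence power_wait=5

# Show the device configuration to check that the new value was stored
# (pcs 0.9 syntax, as shipped with CentOS 7).
pcs stonith show apc-fence
```

After changing the value, it is probably worth repeating the fencing test under write load and checking 'zpool status' again before trusting the setting.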