On Mon, 2013-04-22 at 09:50 -0600, Greg Woods wrote:
> On Mon, 2013-04-22 at 10:12 +1000, Andrew Beekhof wrote:
> > On Saturday, April 20, 2013, Greg Woods wrote:
> > > Often one of the nodes gets stuck at "Stopping HA Services"
> >
> > That means pacemaker is waiting for one of your resources to stop.
> > Do you have anything that would take a long time (or fail to stop)?
>
> Not that I am aware of. But some things that came up during this
> weekend's powerdown make me think that some of the stop actions are
> failing
This particular issue has been solved. It turns out that this is one of those "perfect storm" situations.

Because of the coming powerdown, our HPSS (High Performance Storage System) was shut down several hours before the HA clusters went down. The HA clusters do not directly depend on the HPSS, but they do run backups to it.

The incremental backup script works by taking an LVM snapshot of the logical volume that holds the file system containing the virtual machine images, then mounting the virtual disk images from the snapshot, and finally running our standard system backup script on the mounted images. The system backup script normally runs a find on the file system(s) to be backed up and packages the results into multiple cpio archives (as many as it takes for either the full file system or just the files that have changed in the past two days). Once an archive file has been created, it gets sent to the HPSS.

It turns out that the script will try multiple times to send the file if the first attempt fails, which can cause it to keep running and retrying for many hours. While it is running, the snapshot is still in place. The cluster resource stop failed on one of the LVM resources, saying that the volume group could not be deactivated because there was still an active logical volume: the snapshot. So that caused the fence.

This still doesn't fully explain the original issue of why the shutdown process can hang trying to stop the heartbeat service. Or does it? Since I wasn't looking for this, I can't be certain that the HPSS wasn't offline during the times I have observed these hangs, so I'll have to start checking for that.

In the meantime, I'll have to create a shutdown script that checks for a hung backup, kills it, and deletes the snapshot before issuing the /sbin/shutdown command.
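A minimal sketch of what such a shutdown wrapper might look like, written in Python. The process pattern, snapshot volume path, and mount point below are hypothetical placeholders, not the real names from this setup, and the sketch assumes that a plain umount is enough to release the snapshot (any loop or device-mapper mappings used to mount the disk images would need their own teardown step). It only prints the commands unless it is given --run.

```python
#!/usr/bin/env python3
"""Pre-shutdown cleanup sketch: stop a hung backup, drop its snapshot, then halt."""
import subprocess
import sys

# Hypothetical placeholders; substitute the real names for your site.
BACKUP_PATTERN = "incremental_backup"      # matches the backup script's command line
SNAPSHOT_LV = "/dev/vg_vm/vmimages_snap"   # snapshot logical volume left by the backup
SNAPSHOT_MOUNTS = ["/mnt/backup_snap"]     # where images from the snapshot are mounted


def cleanup_commands(pids, mounts, snapshot_lv):
    """Return the ordered commands to run before the shutdown itself."""
    cmds = []
    for pid in pids:
        # Ask each leftover backup process to exit.
        cmds.append(["kill", "-TERM", str(pid)])
    for mount in mounts:
        # The snapshot cannot be removed while anything is mounted from it.
        cmds.append(["umount", mount])
    # Remove the snapshot so the volume group can be deactivated cleanly.
    cmds.append(["lvremove", "-f", snapshot_lv])
    cmds.append(["/sbin/shutdown", "-h", "now"])
    return cmds


def find_backup_pids(pattern):
    """Use pgrep to list PIDs whose full command line matches the pattern."""
    try:
        result = subprocess.run(
            ["pgrep", "-f", pattern], capture_output=True, text=True
        )
    except FileNotFoundError:
        # pgrep is not installed; treat this as "no backup running".
        return []
    return [int(p) for p in result.stdout.split()]


def main(argv):
    dry_run = "--run" not in argv
    pids = find_backup_pids(BACKUP_PATTERN)
    for cmd in cleanup_commands(pids, SNAPSHOT_MOUNTS, SNAPSHOT_LV):
        print(" ".join(cmd))
        if not dry_run:
            # check=False: a failed cleanup step does not block the shutdown.
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Running it with no arguments shows what would be done; a real version would probably also want to wait for the killed backup to exit before unmounting, and to skip the lvremove step when no snapshot exists.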
--Greg