On Mon, 2013-04-22 at 09:50 -0600, Greg Woods wrote:
> On Mon, 2013-04-22 at 10:12 +1000, Andrew Beekhof wrote:
> > On Saturday, April 20, 2013, Greg Woods wrote:
> >  Often one of the
> > > nodes gets stuck at "Stopping HA Services"
> > 
> > 
> > That means pacemaker is waiting for one of your resources to stop.
> > Do you have anything that would take a long time (or fail to stop)?
> 
> Not that I am aware of. But some things that came up during this
> weekend's powerdown make me think that some of the stop actions are
> failing

This particular issue has been solved. It turns out that this is one of
those "perfect storm" situations. Because of the coming powerdown, our
HPSS (High Performance Storage System) was shut down several hours prior
to the HA clusters going down. The HA clusters do not directly depend on
the HPSS, but they do run backups to it. The incremental backup script
works by taking an LVM snapshot of the logical volume that holds the
file system containing the virtual machine images, then mounting the
virtual disk images from the snapshot, and finally running our standard
system backup script on the mounted images. The system backup script
normally runs a find on the file system(s) to be backed up and packages
the results into multiple cpio archives (as many as it takes for either
the full file system or just the files that have changed in the past
two days). Once an archive file has been created, it
gets sent to the HPSS. It turns out that the script will try multiple
times to send the file if the first attempt fails, which can actually
cause it to continue running and retrying for many hours. While it is
running, the snapshot is still in place. The cluster resource stop
failed on one of the LVM resources, saying that the volume group could
not be deactivated because there was still an active logical volume:
the snapshot. That failed stop is what caused the fence.

This still doesn't fully explain the original issue of why the shutdown
process can hang trying to stop the heartbeat service. Or does it? Since
I wasn't looking for this, I can't be certain that the HPSS wasn't
offline during the times I have observed these hangs, so I'll have to
start checking for that. In the meantime, I'll have to create a shutdown
script that checks for a hung backup, kills it, and deletes the snapshot
before issuing the /sbin/shutdown command.
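
A rough sketch of what such a wrapper might look like, in Python. The
backup script name, the snapshot mount point, and the snapshot LV path
are all hypothetical placeholders (none of them come from the real
setup), and it assumes a single snapshot mount, which is simpler than
mounting several virtual disk images. It only decides which cleanup
commands are needed and runs them in order before /sbin/shutdown.

```python
import subprocess

# Hypothetical placeholders; substitute the real names for your site.
BACKUP_PATTERN = "incremental-backup"      # assumed backup script name
SNAPSHOT_MOUNT = "/mnt/vmsnap"             # assumed snapshot mount point
SNAPSHOT_LV = "/dev/vg_vm/vmimages_snap"   # assumed snapshot logical volume


def pre_shutdown_commands(backup_running, snapshot_mounted, snapshot_exists):
    """Return the commands to run, in order, ending with the shutdown."""
    cmds = []
    if backup_running:
        # Kill the backup first so nothing is still using the snapshot.
        cmds.append(["pkill", "-f", BACKUP_PATTERN])
    if snapshot_mounted:
        cmds.append(["umount", SNAPSHOT_MOUNT])
    if snapshot_exists:
        # Remove the snapshot so the volume group can be deactivated.
        cmds.append(["lvremove", "-f", SNAPSHOT_LV])
    cmds.append(["/sbin/shutdown", "-h", "now"])
    return cmds


def succeeds(cmd):
    """True if the command exists and exits with status 0."""
    try:
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        return False
    return result.returncode == 0


def main(dry_run=True):
    cmds = pre_shutdown_commands(
        backup_running=succeeds(["pgrep", "-f", BACKUP_PATTERN]),
        snapshot_mounted=succeeds(["mountpoint", "-q", SNAPSHOT_MOUNT]),
        snapshot_exists=succeeds(["lvs", SNAPSHOT_LV]),
    )
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            # Best effort: keep going even if one cleanup step fails.
            subprocess.run(cmd, check=False)
```

Calling main() with the default dry_run=True only prints what would be
run; main(dry_run=False) as root would actually do it. A real version
would probably also need to wait for the killed backup to exit before
the umount, since the mount may still be busy for a moment.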

--Greg


_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
