Thanks for your replies. I'll post more detailed log info, following Dejan's suggestions.
Meanwhile, answering your questions:

- 6 VMs total, 3 PV plus 3 HVM.
- One single DRBD for all VMs, with one ext3 FS on it; all virtual disks
  are files. No LVM involved; the DRBD device sits on top of one sw-raid
  RAID1 partition, ~250GB.
- Backup is done to the Dom0. Note: to different spindles, another
  sw-raid RAID1 on another pair of disks.
- Backups are fully serial. It did occur to me that parallel would be
  way too much to ask :)

I'll definitely try limiting the rsync speed meanwhile, maybe to 50%,
although my objective is to make the cluster more resilient to timeouts,
totally so if possible.

> -----Original Message-----
> From: Lars Ellenberg [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, June 4, 2008 22:15
> To: linux-ha@lists.linux-ha.org
> Cc: [EMAIL PROTECTED]
> Subject: Re: [Linux-HA] HB + DRBD + high I/O load = failed
> failover (sometimes)
>
> Dejan is right, I think, so look at the logs, and adjust the
> timeouts where necessary.
> But I want to add something else as well:
>
> On Wed, Jun 04, 2008 at 10:13:59PM +0200, Dejan Muhamedagic wrote:
> > Hi,
> >
> > On Wed, Jun 04, 2008 at 08:37:50PM +0100, Rodrigo Borges Pereira wrote:
> > > Hello,
> > >
> > > I have a two node cluster that occasionally has a weird behavior.
> > > The cluster runs a number of Xen VM's with virtual disk files on top
> > > of a DRBD device. Every night backups are done of each of the VM's,
> > > via rsync/ssh.
> > > Sometimes, the load this generates causes hb to try to failover.
>
> "number of Xen VM's": how much?
> "on top of a DRBD device": one DRBD per DomU, one DRBD per
> virtual disk, one DRBD for all, being a pv where you cut out
> the DomU lvs?
> what (how many spindles) is your drbd lower level device?
>
> do you backup by rsync/ssh into the VM or into the Dom0?
>
> do you backup all your DomUs at the same time?
>
> if so, and you only have few spindles below, you should
> realize that you just sent your disk thrashing.
> don't do that, spread your backup jobs somewhat, maybe even
> strictly serialize them.
>
> you may also consider using the rsync --bwlimit option...
>
> > > Why?
> >
> > > Then for
> > > some reason it fails to do so,
> >
> > Logs should say why it fails.
> >
> > > and stays on the primary node. So all the VM's shutdown and then
> > > boot again, on the same node.
> > >
> > > I'm pretty sure this has to do with timeout definitions, but what
> > > would be the best locations to tune that?
> >
> > Your feeling may be right, but only logs could give us the whole
> > story. If you're seeing "late heartbeat" messages or "node dead"
> > or "node returning after partition" then you definitely need to
> > adjust timing (keepalive, warntime, and deadtime). Note that the
> > wording of warnings may be different. Otherwise, if the monitor
> > operations are timing out, you should adjust the timeouts in the CIB.
> >
> > Thanks,
> >
> > Dejan
>
> --
> : Lars Ellenberg                          http://www.linbit.com :
> : DRBD/HA support and consulting          sales at linbit.com :
> : LINBIT Information Technologies GmbH    Tel +43-1-8178292-0 :
> : Vivenotgasse 48, A-1120 Vienna/Europe   Fax +43-1-8178292-82 :
> __
> please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
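On the heartbeat side, the timing knobs Dejan names (keepalive, warntime, deadtime) live in ha.cf. The values below are illustrative guesses for a cluster that must ride out long I/O stalls, not recommendations from this thread:

```
# /etc/ha.d/ha.cf -- timing directives only; all values illustrative
keepalive 2     # send a heartbeat every 2 seconds
warntime 10     # log "late heartbeat" warnings after 10 seconds
deadtime 30     # declare the peer dead after 30 seconds of silence
initdead 60     # extra slack while the cluster is starting up
```

The monitor-operation timeouts Dejan mentions are set separately, per resource operation in the CIB (the timeout attribute on a resource's monitor op), so both places may need tuning.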
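For what it's worth, the strictly serialized, bandwidth-limited backup Lars suggests could be sketched roughly like this. The VM names, destination path, and the ~10 MB/s cap are made-up placeholders, not values from this thread; RSYNC defaults to "echo rsync" so the sketch only prints the commands it would run:

```shell
#!/bin/sh
# Sketch: serial, bandwidth-limited DomU backups run from the Dom0.
# All names and numbers below are illustrative placeholders.
RSYNC="${RSYNC:-echo rsync}"    # set RSYNC=rsync to actually transfer

backup_all() {
    bwlimit=$1   # rsync --bwlimit is in KB/s, so 10240 is roughly 10 MB/s
    dest=$2      # the second sw-raid RAID1 pair, per the setup above
    for vm in pv1 pv2 pv3 hvm1 hvm2 hvm3; do
        # each transfer completes before the next starts: no parallel
        # jobs thrashing the same spindles
        $RSYNC -a --bwlimit="$bwlimit" "root@${vm}:/" "${dest}/${vm}/"
    done
}

backup_all 10240 /backup/vms
```

Lowering the cap further (or raising heartbeat's timeouts, as discussed above) are complementary knobs: the first reduces the I/O stall, the second makes the cluster tolerate it.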