Thanks for your replies. I'll post more detailed log info following
Dejan's suggestions.

Meanwhile, answering your questions:

- 6 VMs total, 3 PV plus 3 HVM
- A single DRBD device for all VMs, with one ext3 FS on it; all the
virtual disks are files. No LVM involved; the DRBD device sits on top of
one sw-raid RAID1 partition, ~250GB (see the sketch after this list).
- Backups go to the Dom0; note that they go to different spindles,
another sw-raid RAID1 on a separate pair of disks
- Backups are fully serial. It did occur to me that parallel would be way
too much to ask :)
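
For anyone following along, the layout above would correspond to a DRBD
resource roughly like the sketch below; hostnames, device names and
addresses are invented for illustration, not taken from my setup:

    # /etc/drbd.conf excerpt -- all names and addresses are hypothetical
    resource vmstore {
      protocol C;
      on node-a {
        device    /dev/drbd0;
        disk      /dev/md2;            # sw-raid RAID1 partition, ~250GB
        address   192.168.10.1:7789;
        meta-disk internal;
      }
      on node-b {
        device    /dev/drbd0;
        disk      /dev/md2;
        address   192.168.10.2:7789;
        meta-disk internal;
      }
    }

The single ext3 FS holding all the VM disk image files then sits
directly on /dev/drbd0.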

Meanwhile, I'll definitely try limiting the rsync speed, maybe to 50%,
although my objective is to make the cluster more resilient to timeouts
(totally resilient, if possible).
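
For the record, rsync's --bwlimit takes a value in KBytes per second, so
capping one of the serial backup jobs at roughly half of, say, 80 MB/s
of disk bandwidth would look something like this (host and paths are
made up for the example):

    # run inside a DomU, pushing to the Dom0 over ssh
    rsync -a -e ssh --bwlimit=40000 /srv/data/ root@dom0:/backup/vm1/

And since the timing knobs Dejan mentioned live in /etc/ha.d/ha.cf, here
is a sketch of the direction I mean by "more resilient to timeouts"; the
values are illustrative, not recommendations:

    # /etc/ha.d/ha.cf excerpt -- illustrative values only
    keepalive 2      # seconds between heartbeats
    warntime 10      # log a "late heartbeat" warning after this long
    deadtime 30      # declare the peer dead after this long
    initdead 60      # extra slack while the nodes boot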

> -----Original Message-----
> From: Lars Ellenberg [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, 4 June 2008 22:15
> To: linux-ha@lists.linux-ha.org
> Cc: [EMAIL PROTECTED]
> Subject: Re: [Linux-HA] HB + DRBD + high I/O load = failed 
> failover (sometimes)
> 
> 
> Dejan is right, I think, so look at the logs, and adjust the 
> timeouts where necessary.
> But I want to add something else as well:
> 
> On Wed, Jun 04, 2008 at 10:13:59PM +0200, Dejan Muhamedagic wrote:
> > Hi,
> > 
> > On Wed, Jun 04, 2008 at 08:37:50PM +0100, Rodrigo Borges Pereira wrote:
> > > Hello,
> > > 
> > > I have a two-node cluster that occasionally shows a weird behavior.
> > > The cluster runs a number of Xen VMs with virtual disk files on
> > > top of a DRBD device. Every night backups are done of each of the
> > > VMs, via rsync/ssh.
> > > Sometimes, the load this generates causes hb to try to fail over.
> 
> "number of Xen VM's": how much?
> "on top of a DRBD device": one DRBD per DomU, one DRBD per 
> virtual disk, one DRBD for all, being a pv where you cut out 
> the DomU lvs?
> what is your DRBD lower-level device (how many spindles)?
> 
> do you backup by rsync/ssh into the VM or into the Dom0?
> 
> do you backup all your DomUs at the same time?
> 
> if so, and you only have a few spindles below, you should
> realize that you just sent your disks thrashing.
> Don't do that; spread your backup jobs out somewhat, maybe
> even strictly serialize them.
> 
> you may also consider using the rsync --bwlimit option...
> 
> > Why?
> > 
> > > Then for
> > > some reason it fails to do so,
> > 
> > Logs should say why it fails.
> > 
> > > and stays on the primary node. So all the VMs shut down and then 
> > > boot again, on the same node.
> > > 
> > > I'm pretty sure this has to do with timeout definitions, but what 
> > > would be the best locations to tune that?
> > 
> > Your feeling may be right, but only logs could give us the whole
> > story. If you're seeing "late heartbeat" messages or "node dead"
> > or "node returning after partition" then you definitely need to
> > adjust the timing (keepalive, warntime, and deadtime). Note that
> > the wording of the warnings may be different. Otherwise, if the
> > monitor operations are timing out, you should adjust the timeouts
> > in the CIB.
> > 
> > Thanks,
> > 
> > Dejan
> 
> -- 
> : Lars Ellenberg                           http://www.linbit.com :
> : DRBD/HA support and consulting             sales at linbit.com :
> : LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
> : Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
> __
> please don't Cc me, but send to list -- I'm subscribed
> 
