Re: [Linux-HA] Initial dead time is smaller than deadtime

Bernd Schubert Wed, 09 Apr 2008 11:26:37 -0700

Hello Lars,

On Wednesday 09 April 2008 18:34:39 Lars Marowsky-Bree wrote:
> On 2008-04-08T19:32:58, Bernd Schubert <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > I need to set a rather huge dead time of 1200s, but the initial dead time
> > is supposed to be 120s or less. However, heartbeat tries to be
> > schoolmasterly and doesn't want to accept my settings:
> >
> > deadtime 1200 # time to declare a node dead
> > initdead 120  # time to declare a node dead on heartbeat startup
> > keepalive 120 # how often to send keepalive packets
>
> Algorithmic reasons require that initdead be larger than deadtime.


Which algorithm is this, and where can I find it in the sources?
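
To make the constraint under discussion concrete, here is a minimal Python sketch of the check as it is described in this thread. It is not heartbeat's actual code: the function name check_timings is hypothetical, and treating initdead == deadtime as acceptable is an assumption based on the wording "Initial dead time is smaller than deadtime" in the subject line.

```python
def check_timings(deadtime, initdead):
    """Return a list of complaints about the timing values (in seconds).

    Hypothetical helper, not part of heartbeat. It only models the rule
    discussed in this thread: initdead must not be smaller than deadtime.
    Accepting initdead == deadtime is an assumption.
    """
    problems = []
    if initdead < deadtime:
        problems.append(
            "initdead (%ds) is smaller than deadtime (%ds)" % (initdead, deadtime)
        )
    return problems


# The settings quoted above are rejected under this rule.
print(check_timings(deadtime=1200, initdead=120))

# Raising initdead to match deadtime passes the check in this sketch.
print(check_timings(deadtime=1200, initdead=1200))
```

Under this rule, the only way to keep deadtime at 1200 is to accept an initial wait of at least the same length, which is exactly the startup delay complained about below.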

>
> keepalive every two minutes and deadtime at 20 minutes is exceptional.
>
> Not even Lustre should create a load so high that a realtime priority
> thread which is entirely locked into memory is not reliably scheduled
> for 20 minutes at a stretch!

Actually, I don't have the slightest idea what the correct value is. However,
120s is not sufficient, and a hard shutdown of a server presently triggers
terrible hardware bugs. We simply cannot afford any false resets.

>
> (I'm not quite sure I'd consider that "HA" ... ;-)

High failover times are not nice, of course, but this is not life-critical HA.

>
> This needs to be fixed within Lustre.

Yes, sure. 

>
> > Well, heartbeat is not started up automatically here and even the nodes
> > are not powered on automatically after a hard reset. So when I start
> > heartbeat I'm actively monitoring everything and there is absolutely no
> > need to let me wait at least 20min on startup. I'm not even convinced a
> > deadtime of 20min is sufficient, since this is for a Lustre cluster and
> > Lustre sometimes manages to create such a high load that nothing other
> > than Lustre and related kernel threads gets to run on the system...
>
> A deadtime of 20m is not sufficient, but you worry about 20m on startup?

Yes, because at startup I sit in front of my system and just wait for
heartbeat to finish to start the services.
I still think there is another bug in heartbeat, though. There is simply no
reason for heartbeat to wait $deadtime on initial startup of the heartbeat
services when it knows all heartbeat nodes are up.
If I could at least manually force the nodes online, I would have no
problem with an initial-deadtime == deadtime.

>
> You're quite aware that deadtime is the time you should expect to be w/o
> service in case one node crashes, right?

Yes, and I'm also quite aware that a false shutdown may cause a service
downtime of several days.

Thanks,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH
