Re: [Linux-ha-dev] Starting heartbeat when interfaces are down

2007-10-24 Thread Dejan Muhamedagic
Hi,

On Tue, Oct 23, 2007 at 10:14:50PM -0400, Graham, Simon wrote:
  
  And indeed, the cluster does come up - without a node.  A more accurate
  summation is that a single node in the cluster doesn't come up.  So,
  the _cluster_ does recover from this error.  It just does it without
  that node.  So, service is not interrupted.
  
 
 At the end of the day then, I think my problem comes down to the fact
 that I am not using static IP addresses for the NICs -- I know you
 consider the use of DHCP (and also, I would guess zeroconf) addresses a
 bad thing - however, consider the case where one is trying to automate
 the cluster config/setup - in this case, the actual IP addresses used
 for the NIC are completely irrelevant to anyone other than the hb code
 (because users of the cluster should ONLY be using the cluster alias
 address).
 
 If you use DHCP/Zeroconf then if a NIC does not have link at boot time,
 it will not get an address assigned and HB will refuse to start with
 this error:
 
 Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: glib: Get broadcast for interface eth1 failed: Cannot assign requested address
 Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: glib: IP interface [eth1] does not exist
 Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: Illegal bcast [UDP/IP broadcast] in config file [eth1]
 Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: Heartbeat not started: configuration error.
 Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: Configuration error, heartbeat not started.
 
 This actually can lead to HB not starting anywhere (consider the case of
 a two node cluster with a direct cable connect for one of the NICs -- if
 one node is powered off, then the other one will not have link on the
 NIC and therefore will not assign an address)
 
 I'd be interested in more discussion on why DHCP/Zeroconf is considered
 anathema.

I wouldn't say that it is anathema. But if you want to use
DHCP to provide a static address, then make sure that DHCP is
always available. How do you expect to have a high availability
solution which depends on DHCP when DHCP is not there?
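
As an illustration of the dependency described above, here is a minimal Python sketch of a pre-start wait that an automated setup might run before launching heartbeat. It is not part of heartbeat; the function name `wait_for_address` and the caller-supplied `has_address` probe are hypothetical. If the address never arrives (for example, the peer on a direct cable is powered off), the wait simply times out and the caller still has to decide what to do.

```python
import time


def wait_for_address(has_address, ifname, timeout=30.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until has_address(ifname) is true or the timeout expires.

    has_address is a caller-supplied probe (hypothetical); it should
    return True once the interface has an address assigned.
    Returns True if an address appeared in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        if has_address(ifname):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Fake probe for demonstration: the address "appears" on the third poll.
    polls = {"count": 0}

    def fake_probe(name):
        polls["count"] += 1
        return polls["count"] >= 3

    print(wait_for_address(fake_probe, "eth1", timeout=5.0, interval=0.0))
```

The probe is injected so that the waiting logic can be exercised without touching real interfaces.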

 I'd also be interested in knowing if anyone is working on supporting
 IPv6 broadcast/multicast for the hb comms links (in which case a static
 address can be allocated with no configuration required).

Not to my knowledge. But that should be addressed soon.

Thanks,

Dejan

  
  This is the rationale for this behavior.  It's not perfect behavior,
  but it's not completely irrational either...
  
  --
  Alan Robertson [EMAIL PROTECTED]
 
 Thanks for the explanation - it helps a lot and is exactly what I was
 looking for.
 Simon
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Starting heartbeat when interfaces are down

2007-10-23 Thread Alan Robertson
Graham, Simon wrote:
 On 2007-10-19T21:57:17, Dejan Muhamedagic [EMAIL PROTECTED] wrote:

 http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732
 for some discussion on communication interfaces.
 discussion means the current deficits are by design ;-)

 
 So right now I'm thinking I need to modify the config and restart 
 the hb service when changes occur in the NICs...
 
 This seems somewhat counter to the idea of high availability but I'd
 like to understand the design center for this behavior before I start
 trying to 'fix' it...
 What are your circumstances? In which situations should the
 interface be down?
 Hotplug interfaces. Transient issues. Weird bugs.

 One NIC dead on start-up (which is a valid SPoF scenario which is
 currently not handled).
 
 Exactly - when you are in a degraded state you still want the cluster to
 come up.

And indeed, the cluster does come up - without a node.  A more accurate
summation is that a single node in the cluster doesn't come up.  So,
the _cluster_ does recover from this error.  It just does it without
that node.  So, service is not interrupted.

 The specific case that started me looking at this is when there is no
 address set on a link (e.g. if link is down at startup) which causes hb
 to simply refuse to start.

It's not link down.  It's hardware missing.  Link down won't keep
heartbeat from starting, but missing hardware will certainly do so.

So, to correct both of these errors in the description:

When the hardware supporting heartbeat communications is missing
on startup, then the node on which it's missing will refuse to
start, resulting in a degraded but operational cluster.

Of course, if you do recover from the error, you have the same
situation - a degraded but operational cluster.  In this case, somewhat
less degraded than the case above.

Here's why it works that way:
It is very common for people to make mistakes in configuration.
It is impossible to distinguish between a mistake and a broken
interface.
It is very hard to get people's attention to read logs.
Failing to start does a good job of doing that.

And, because of those considerations _and_ the complexity of doing
otherwise, it does not put any effort into trying to recover from it.
Because such code would be very rarely used in practice (like once every
5K-10K  cluster years - judging from past experience), the chances of it
having undiscovered bugs in it are very great.  The current behavior
exercises well-tested recovery paths (what to do when a node is down).

I don't claim that this is a perfect response, but in terms of initial
startup - you really don't want configuration errors to go unnoticed,
and you can't tell which case this is.  I would guess that in 99+% of
the cases it's a misconfiguration rather than a real failure.
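
To make the ambiguity concrete, here is a minimal Python sketch of the three states an interface named in the configuration can be in at startup. It is an illustration, not heartbeat's code: the function `classify` and its inputs are hypothetical, and the sets of present and addressed interfaces are assumed to be gathered by the caller. Per the log messages quoted elsewhere in this thread, an interface without an address is reported as "does not exist", so from the startup code's point of view the last two states look the same.

```python
def classify(configured, present, with_ipv4):
    """Classify each configured interface name.

    configured: interface names taken from the configuration
    present:    names of interfaces the system currently has
    with_ipv4:  names of interfaces that have an IPv4 address
    All three inputs are supplied by the caller (hypothetical helper).
    """
    result = {}
    for name in configured:
        if name not in present:
            # Typo in the config, or hardware that is really gone.
            result[name] = "missing"
        elif name not in with_ipv4:
            # Device exists but has no address, e.g. DHCP never answered.
            result[name] = "no-address"
        else:
            result[name] = "ok"
    return result


if __name__ == "__main__":
    states = classify(
        configured=["eth0", "eth1", "eht2"],   # "eht2" is a deliberate typo
        present={"lo", "eth0", "eth1", "eth2"},
        with_ipv4={"lo", "eth0"},
    )
    for name, state in states.items():
        print(name, state)
```

Nothing in this classification says whether a "missing" entry is a typo or dead hardware, which is the point made above.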

The other case, of an interface going away, is one the code
_probably_ should recover from.

It is also worth noting that in practice (as opposed to in testing),
this has not come up to my knowledge.  The only bug in real life I've
heard of which exhibited this is one where the system was quite probably
misconfigured (using DHCP for cluster interfaces).

Keep in mind that a cluster will not stop providing service just because
a single node doesn't come up.  So, you haven't lost service when this
happens, but you get some really nasty messages and failure to start
usually gets people's attention.

I am fully aware that subsequent failures will indeed cause things to
fail - but this behavior does not constitute a single point of failure
for the cluster.

This is the rationale for this behavior.  It's not perfect behavior, but
it's not completely irrational either...

-- 
Alan Robertson [EMAIL PROTECTED]

Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions. - William
Wilberforce


Re: [Linux-ha-dev] Starting heartbeat when interfaces are down

2007-10-23 Thread Alan Robertson
Simon Talbot wrote:
 All,
 
 Does anyone know of any Quagga/Zebra OCF scripts, either in development
 or mature? If not, I will put some proper effort into making some decent
 ones.

We have some for one specific special case, but I'm not aware of any
more general ones.

-- 
Alan Robertson [EMAIL PROTECTED]



RE: [Linux-ha-dev] Starting heartbeat when interfaces are down

2007-10-23 Thread Graham, Simon
 
 And indeed, the cluster does come up - without a node.  A more accurate
 summation is that a single node in the cluster doesn't come up.  So,
 the _cluster_ does recover from this error.  It just does it without
 that node.  So, service is not interrupted.
 

At the end of the day then, I think my problem comes down to the fact
that I am not using static IP addresses for the NICs -- I know you
consider the use of DHCP (and also, I would guess zeroconf) addresses a
bad thing - however, consider the case where one is trying to automate
the cluster config/setup - in this case, the actual IP addresses used
for the NIC are completely irrelevant to anyone other than the hb code
(because users of the cluster should ONLY be using the cluster alias
address).

If you use DHCP/Zeroconf then if a NIC does not have link at boot time,
it will not get an address assigned and HB will refuse to start with
this error:

Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: glib: Get broadcast for interface eth1 failed: Cannot assign requested address
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: glib: IP interface [eth1] does not exist
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: Illegal bcast [UDP/IP broadcast] in config file [eth1]
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: Heartbeat not started: configuration error.
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: Configuration error, heartbeat not started.
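
For anyone automating around this failure, the following Python sketch pulls the offending interface name out of messages like the ones above. It is only an illustration: the function name `failed_interfaces` is hypothetical, and the pattern is based solely on the "does not exist" message text shown in this thread, so it may not match other heartbeat versions.

```python
import re

# Matches the "does not exist" error as it appears in the log excerpt above.
_MISSING_IF = re.compile(
    r"ERROR: glib: IP interface \[([^\]]+)\] does not exist"
)


def failed_interfaces(log_text):
    """Return interface names reported as non-existent, in first-seen order."""
    # Collapse all whitespace so messages wrapped by a mail client still match.
    flattened = " ".join(log_text.split())
    names = []
    for name in _MISSING_IF.findall(flattened):
        if name not in names:
            names.append(name)
    return names


SAMPLE_LOG = """\
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: glib: Get
broadcast for interface eth1 failed: Cannot assign requested address
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: glib: IP
interface [eth1] does not exist
"""

if __name__ == "__main__":
    print(failed_interfaces(SAMPLE_LOG))
```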

This actually can lead to HB not starting anywhere (consider the case of
a two node cluster with a direct cable connect for one of the NICs -- if
one node is powered off, then the other one will not have link on the
NIC and therefore will not assign an address)

I'd be interested in more discussion on why DHCP/Zeroconf is considered
anathema.

I'd also be interested in knowing if anyone is working on supporting
IPv6 broadcast/multicast for the hb comms links (in which case a static
address can be allocated with no configuration required).

 
 This is the rationale for this behavior.  It's not perfect behavior,
 but it's not completely irrational either...
 
 --
 Alan Robertson [EMAIL PROTECTED]

Thanks for the explanation - it helps a lot and is exactly what I was
looking for.
Simon


Re: [Linux-ha-dev] Starting heartbeat when interfaces are down

2007-10-22 Thread Lars Marowsky-Bree
On 2007-10-19T21:57:17, Dejan Muhamedagic [EMAIL PROTECTED] wrote:

 http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732
 
 for some discussion on communication interfaces.

"Discussion" means the current deficits are by design ;-)

  This seems somewhat counter to the idea of high availability but I'd
  like to understand the design center for this behavior before I start
  trying to 'fix' it...
 What are your circumstances? In which situations should the
 interface be down?

Hotplug interfaces. Transient issues. Weird bugs. 

One NIC dead on start-up (which is a valid SPoF scenario which is
currently not handled).


Regards,
Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde



RE: [Linux-ha-dev] Starting heartbeat when interfaces are down

2007-10-20 Thread Simon Talbot
All,

Does anyone know of any Quagga/Zebra OCF scripts, either in development
or mature? If not, I will put some proper effort into making some decent
ones.

Thanks,

Simon

Simon Talbot MEng, ACGI 
(Chief Engineer) 
Tel: 020 3161 6001
Fax: 020 3161 6011



Re: [Linux-ha-dev] Starting heartbeat when interfaces are down

2007-10-19 Thread Dejan Muhamedagic
Hi,

On Fri, Oct 19, 2007 at 02:47:40PM -0400, Graham, Simon wrote:
 Apologies if this has been asked before, but I have noticed that the
 heartbeat service fails to start if any of the interfaces mentioned in
 the ha.cf file are down at the time the service starts - heartbeat is
 quite happy to handle interfaces going down once it is running, but if
 it can't get the broadcast/multicast address from the interface at
 startup time, it just bails.
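
To see which interfaces a given ha.cf depends on at startup, a small Python sketch along these lines could be used. It assumes the usual directive forms (`bcast` followed by one or more interface names, and `mcast` and `ucast` with the interface as their first argument); the function name `comm_interfaces` and the sample configuration are hypothetical.

```python
def comm_interfaces(ha_cf_text):
    """Return interface names used by bcast, mcast and ucast directives."""
    interfaces = []
    for raw_line in ha_cf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        directive, args = fields[0], fields[1:]
        if directive == "bcast":
            names = args        # bcast lists one or more interfaces
        elif directive in ("mcast", "ucast"):
            names = args[:1]    # first argument is the interface
        else:
            continue
        for name in names:
            if name not in interfaces:
                interfaces.append(name)
    return interfaces


# Hypothetical ha.cf fragment, for illustration only.
SAMPLE = """\
keepalive 2
bcast eth0 eth1   # two broadcast links
ucast eth2 192.168.1.2
node node1 node2
"""

if __name__ == "__main__":
    print(comm_interfaces(SAMPLE))
```

Each name returned has to be usable when the service starts, or heartbeat bails out as described above.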

See

http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732

for some discussion on communication interfaces.

 This seems somewhat counter to the idea of high availability but I'd
 like to understand the design center for this behavior before I start
 trying to 'fix' it...

What are your circumstances? In which situations should the
interface be down?

Thanks,

Dejan

 
 Thanks,
 
 Simon
 
