Re: [Linux-ha-dev] Starting heartbeat when interfaces are down
Hi,

On Tue, Oct 23, 2007 at 10:14:50PM -0400, Graham, Simon wrote:

And indeed, the cluster does come up - without a node. A more accurate summation is that a single node in the cluster doesn't come up. So, the _cluster_ does recover from this error. It just does it without that node. So, service is not interrupted.

At the end of the day then, I think my problem comes down to the fact that I am not using static IP addresses for the NICs -- I know you consider the use of DHCP (and also, I would guess zeroconf) addresses a bad thing - however, consider the case where one is trying to automate the cluster config/setup - in this case, the actual IP addresses used for the NIC are completely irrelevant to anyone other than the hb code (because users of the cluster should ONLY be using the cluster alias address). If you use DHCP/Zeroconf then if a NIC does not have link at boot time, it will not get an address assigned and HB will refuse to start with this error:

Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: glib: Get broadcast for interface eth1 failed: Cannot assign requested address
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: glib: IP interface [eth1] does not exist
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: Illegal bcast [UDP/IP broadcast] in config file [eth1]
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: Heartbeat not started: configuration error.
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: Configuration error, heartbeat not started.

This actually can lead to HB not starting anywhere (consider the case of a two node cluster with a direct cable connect for one of the NICs -- if one node is powered off, then the other one will not have link on the NIC and therefore will not assign an address)

I'd be interested in more discussion on why DHCP/Zeroconf is considered anathema.

I wouldn't say that it is an anathema.
But, if you want to use DHCP to provide a static address then make sure that DHCP is always available. How do you expect to have a high availability solution which depends on DHCP when DHCP is not there?

I'd also be interested in knowing if anyone is working on supporting IP V6 broadcast/multicast for the hb comms links (in which case a static address can be allocated with no configuration required)

Not to my knowledge. But that should be addressed soon.

Thanks,
Dejan

This is the rationale for this behavior. It's not perfect behavior, but it's not completely irrational either...
-- Alan Robertson [EMAIL PROTECTED]

Thanks for the explanation - it helps a lot and is exactly what I was looking for.
Simon

___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] Starting heartbeat when interfaces are down
Graham, Simon wrote:

On 2007-10-19T21:57:17, Dejan Muhamedagic [EMAIL PROTECTED] wrote:

http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732 for some discussion on communication interfaces.

discussion means the current deficits are by design ;-)

So right now I'm thinking I need to modify the config and restart the hb service when changes occur in the NICs... This seems somewhat counter to the idea of high availability but I'd like to understand the design center for this behavior before I start trying to 'fix' it...

What are your circumstances? In which situations should the interface be down?

Hotplug interfaces. Transient issues. Weird bugs. One NIC dead on start-up (which is a valid SPoF scenario which is currently not handled).

Exactly - when you are in a degraded state you still want the cluster to come up.

And indeed, the cluster does come up - without a node. A more accurate summation is that a single node in the cluster doesn't come up. So, the _cluster_ does recover from this error. It just does it without that node. So, service is not interrupted.

The specific case that started me looking at this is when there is no address set on a link (e.g. if link is down at startup) which causes hb to simply refuse to start.

It's not link down. It's hardware missing. Link down won't keep heartbeat from starting, but missing hardware will certainly do so.

So, to correct both of these errors in the description: When the hardware supporting heartbeat communications is missing on startup, then the node on which it's missing will refuse to start, resulting in a degraded but operational cluster.

Of course, if you do recover from the error, you have the same situation - a degraded but operational cluster. In this case, somewhat less degraded than the case above.

Here's why it works that way: It is very common for people to make mistakes in configuration. It is impossible to distinguish between a mistake and a broken interface.
It is very hard to get people's attention to read logs. Failing to start does a good job of doing that. And, because of those considerations _and_ the complexity of doing otherwise, it does not put any effort into trying to recover from it. Because such code would be very rarely used in practice (like once every 5K-10K cluster years - judging from past experience), the chances of it having undiscovered bugs in it are very great. The current behavior exercises well-tested recovery paths (what to do when a node is down).

I don't claim that this is a perfect response, but in terms of initial startup - you really don't want configuration errors to go unnoticed, and you can't tell which case this is. I would guess that in 99+% of the cases it's a misconfiguration rather than a real failure.

The other case, of an interface going away, is one the code _probably_ should recover from.

It is also worth noting that in practice (as opposed to in testing), this has not come up to my knowledge. The only bug in real life I've heard of which exhibited this is one where the system was quite probably misconfigured (using DHCP for cluster interfaces).

Keep in mind that a cluster will not stop providing service just because a single node doesn't come up. So, you haven't lost service when this happens, but you get some really nasty messages and failure to start usually gets people's attention. I am fully aware that subsequent failures will indeed cause things to fail - but this behavior does not constitute a single point of failure for the cluster.

This is the rationale for this behavior. It's not perfect behavior, but it's not completely irrational either...

-- Alan Robertson [EMAIL PROTECTED]
Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions. - William Wilberforce
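Editorial aside on the "link down" versus "hardware missing" distinction drawn in the message above: on Linux the two states can usually be told apart from sysfs. The following is only a sketch under that assumption. The function name `interface_state`, its return labels, and the `sysfs_root` parameter are made up for illustration, and nothing here is taken from the heartbeat code, which may decide differently.

```python
from pathlib import Path


def interface_state(name, sysfs_root="/sys/class/net"):
    """Classify a network interface using Linux sysfs (illustrative only).

    Returns one of:
      "missing" - no sysfs entry, i.e. the kernel knows no such device
      "down"    - device exists but its carrier state cannot be read
                  (typically because it is administratively down)
      "no-link" - device is up but reports no carrier
      "link"    - device is up and reports carrier
    """
    iface = Path(sysfs_root) / name
    if not iface.exists():
        return "missing"
    try:
        carrier = (iface / "carrier").read_text().strip()
    except OSError:
        # Reading 'carrier' raises an error while the interface is down.
        return "down"
    return "link" if carrier == "1" else "no-link"


if __name__ == "__main__":
    # 'eth1' is the interface named in the log excerpt in this thread.
    print("eth1:", interface_state("eth1"))
```

A check like this does not say whether the interface has an address assigned, which is the condition Simon describes, so it addresses only part of the disagreement in the thread.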
Re: [Linux-ha-dev] Starting heartbeat when interfaces are down
Simon Talbot wrote:

All, Does anyone know of any Quagga/Zebra OCF scripts, either in development or mature? If not, I will put some proper effort into making some decent ones.

We have some for one specific special case, but I'm not aware of any more general ones.

-- Alan Robertson [EMAIL PROTECTED]
Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions. - William Wilberforce
RE: [Linux-ha-dev] Starting heartbeat when interfaces are down
And indeed, the cluster does come up - without a node. A more accurate summation is that a single node in the cluster doesn't come up. So, the _cluster_ does recover from this error. It just does it without that node. So, service is not interrupted.

At the end of the day then, I think my problem comes down to the fact that I am not using static IP addresses for the NICs -- I know you consider the use of DHCP (and also, I would guess zeroconf) addresses a bad thing - however, consider the case where one is trying to automate the cluster config/setup - in this case, the actual IP addresses used for the NIC are completely irrelevant to anyone other than the hb code (because users of the cluster should ONLY be using the cluster alias address). If you use DHCP/Zeroconf then if a NIC does not have link at boot time, it will not get an address assigned and HB will refuse to start with this error:

Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: glib: Get broadcast for interface eth1 failed: Cannot assign requested address
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: glib: IP interface [eth1] does not exist
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: Illegal bcast [UDP/IP broadcast] in config file [eth1]
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: Heartbeat not started: configuration error.
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: Configuration error, heartbeat not started.

This actually can lead to HB not starting anywhere (consider the case of a two node cluster with a direct cable connect for one of the NICs -- if one node is powered off, then the other one will not have link on the NIC and therefore will not assign an address)

I'd be interested in more discussion on why DHCP/Zeroconf is considered anathema.
I'd also be interested in knowing if anyone is working on supporting IP V6 broadcast/multicast for the hb comms links (in which case a static address can be allocated with no configuration required)

This is the rationale for this behavior. It's not perfect behavior, but it's not completely irrational either...
-- Alan Robertson [EMAIL PROTECTED]

Thanks for the explanation - it helps a lot and is exactly what I was looking for.
Simon
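Editorial aside: the startup failure quoted in this message names the offending interface in the "IP interface [...] does not exist" line. As a rough sketch, a monitoring script could pull that name out of the log so the refusal to start gets noticed quickly. The regular expression below is written against the lines quoted in this thread only. The helper name `missing_interfaces` is hypothetical, and other heartbeat versions may word the message differently.

```python
import re

# Matches the "IP interface [eth1] does not exist" line quoted in this thread.
MISSING_IFACE = re.compile(
    r"ERROR: glib: IP interface \[(?P<iface>[^\]]+)\] does not exist"
)


def missing_interfaces(log_text):
    """Return interface names reported as nonexistent, first-seen order, no duplicates."""
    seen = []
    for match in MISSING_IFACE.finditer(log_text):
        iface = match.group("iface")
        if iface not in seen:
            seen.append(iface)
    return seen


# Three of the log lines from the message above, used as sample input.
SAMPLE = """\
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: glib: Get broadcast for interface eth1 failed: Cannot assign requested address
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: glib: IP interface [eth1] does not exist
Oct 17 05:41:47 heartbeat[10189]: 2007/10/17_05:41:49 ERROR: Illegal bcast [UDP/IP broadcast] in config file [eth1]
"""

if __name__ == "__main__":
    print(missing_interfaces(SAMPLE))  # prints ['eth1']
```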
Re: [Linux-ha-dev] Starting heartbeat when interfaces are down
On 2007-10-19T21:57:17, Dejan Muhamedagic [EMAIL PROTECTED] wrote:

http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732 for some discussion on communication interfaces.

discussion means the current deficits are by design ;-)

This seems somewhat counter to the idea of high availability but I'd like to understand the design center for this behavior before I start trying to 'fix' it...

What are your circumstances? In which situations should the interface be down?

Hotplug interfaces. Transient issues. Weird bugs. One NIC dead on start-up (which is a valid SPoF scenario which is currently not handled).

Regards,
Lars

-- Teamlead Kernel, SuSE Labs, Research and Development SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde
RE: [Linux-ha-dev] Starting heartbeat when interfaces are down
All, Does anyone know of any Quagga/Zebra OCF scripts, either in development or mature? If not, I will put some proper effort into making some decent ones.

Thanks,
Simon

Simon Talbot MEng, ACGI (Chief Engineer) Tel: 020 3161 6001 Fax: 020 3161 6011
Re: [Linux-ha-dev] Starting heartbeat when interfaces are down
Hi,

On Fri, Oct 19, 2007 at 02:47:40PM -0400, Graham, Simon wrote:

Apologies if this has been asked before, but I have noticed that the heartbeat service fails to start if any of the interfaces mentioned in the ha.cf file are down at the time the service starts - heartbeat is quite happy to handle interfaces going down once it is running, but if it can't get the broadcast/multicast address from the interface at startup time, it just bails.

See http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732 for some discussion on communication interfaces.

This seems somewhat counter to the idea of high availability but I'd like to understand the design center for this behavior before I start trying to 'fix' it...

What are your circumstances? In which situations should the interface be down?

Thanks,
Dejan

Thanks,
Simon
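Editorial aside: since heartbeat bails at startup when an interface named in ha.cf cannot supply a broadcast address, one possible workaround is a pre-flight check run before the service starts. The sketch below only covers the parsing half. It assumes the `bcast` directive takes one or more interface names on a line and that `#` starts a comment, which matches my understanding of ha.cf but should be checked against your version. The function names are hypothetical, and how the set of interfaces that currently hold an address is obtained is left to the caller.

```python
def bcast_interfaces(ha_cf_text):
    """Collect interface names from 'bcast' lines, ignoring '#' comments."""
    names = []
    for raw in ha_cf_text.splitlines():
        fields = raw.split("#", 1)[0].split()
        if fields and fields[0] == "bcast":
            names.extend(fields[1:])
    return names


def unusable_interfaces(ha_cf_text, addressed):
    """Return configured bcast interfaces that are not in `addressed`.

    `addressed` is the set of interface names the caller has found to
    hold an IP address right now (an assumption of this sketch).
    """
    return [name for name in bcast_interfaces(ha_cf_text) if name not in addressed]


# Hypothetical configuration fragment for illustration.
SAMPLE_HA_CF = """\
# heartbeat communication links
bcast eth0 eth1   # two broadcast links
node alpha beta
"""

if __name__ == "__main__":
    # Pretend only eth0 received an address at boot.
    print(unusable_interfaces(SAMPLE_HA_CF, {"eth0"}))  # prints ['eth1']
```

Whether to start heartbeat anyway, rewrite the config, or wait for the address is a policy decision the thread leaves open, and this sketch takes no position on it.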