Re: [Pacemaker] crm command line tool problem
Looks like a bug, can you post a hb_report archive of the scenario please?

On Fri, May 22, 2009 at 5:51 PM, Joe Armstrong jarmstr...@postpath.com wrote:

Hi All, I am playing around with the crm command line tool to create an HA config for Pacemaker and am bumping into a problem. If I have a configuration running already (3-node with ip + httpd, pretty simple) and I want to create a new configuration, according to the CRM CLI document I should run:

crm configure erase

and then create my new configuration. But the erase directive also blows away my cluster node definitions. If I manually put them back again:

crm configure node vm1
crm configure node vm2
crm configure node vm4

I get this:

Last updated: Fri May 22 08:32:16 2009
Current DC: NONE
3 Nodes configured, unknown expected votes
0 Resources configured.
Node vm1: UNCLEAN (offline)
Node vm2: UNCLEAN (offline)
Node vm4: UNCLEAN (offline)

Now, no matter what I seem to do, I can't get the nodes back to the online state. And worse, if I try to shut down heartbeat, all nodes try to kill all other nodes and crmd won't shut down (if I kill crmd, Bad Things(TM) happen). Is there a way I can either:

- remove all resources/constraints but leave the node definitions (I did this by manually removing each resource and it works fine, but this seems like a pain in a large config)
- get the cluster nodes to see each other again
- do something before the erase so the nodes don't go to the UNCLEAN state

Thanks. Joe

___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
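[Editor's note: for the first workaround (removing resources and constraints while keeping node definitions), a loop over the resource ids is less tedious than deleting each by hand. A minimal sketch, shown in dry-run form with echo; the ids ip1 and httpd1 are hypothetical placeholders, substitute your own:

```shell
# Dry-run sketch: delete each resource/constraint by id, leaving the
# node definitions untouched. Remove the leading 'echo' to run for real.
for id in ip1 httpd1; do
    echo crm configure delete "$id"
done
```

Piping the output to `sh` after inspecting it is one way to execute the generated commands.]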
Re: [Pacemaker] [RfC] Redesigned Debian HA packages, try 2 (was: try 1)
Simon Horman schrieb: Hi, Has there been any progress with getting these packages into experimental?

Hi folks, it's me again after a somewhat longer outage; I reworked the packages and created new ones from the latest upstream versions. openais-legacy has lately been ACCEPTed into Experimental; heartbeat and pacemaker have been in the NEW queue since today and await inclusion.

I also created Lenny packages. These packages are proven to work on Lenny and allow you to run an openais-pacemaker- or heartbeat-pacemaker-based cluster on Debian GNU/Linux 5.0 alias Lenny. They are available only for the amd64 architecture at the moment, but I hope to add x86 versions to the archive tomorrow.

Anybody who wants to test them finds the versions for Unstable alias Sid here:

deb http://people.debian.org/~madkiss/ha/sid/ ./
deb-src http://people.debian.org/~madkiss/ha/sid/ ./

Stable aka Debian GNU/Linux 5.0 (Lenny) users go here:

deb http://people.debian.org/~madkiss/ha/lenny/ ./
deb-src http://people.debian.org/~madkiss/ha/lenny/ ./

These packages are mostly lintian-clean by now. Please report back any glitches or packaging mistakes you spot.

Best Regards Martin

___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
[Pacemaker] Pacemaker on OpenAIS, RRP, and link failure
Florian Haas writes:

1. Set rrp_problem_count_timeout and/or rrp_problem_count_threshold ridiculously high so the ring status never goes to faulty. (It seems that RRP problem counting can't be disabled altogether.)
2. Have package maintainers include some magic that does openais-cfgtool -r every time a network link changes its status to UP (where the network management subsystem permits this).
3. Instruct users to install cron jobs that do openais-cfgtool -r in specified intervals, causing OpenAIS to re-check the link status periodically.

all the above sound like hacks to me. a better solution is to have a crm variable that tells whether automatic recovery is desired.

-- juha

___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
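[Editor's note: workaround 3 above would amount to a crontab entry along these lines; the file name, the five-minute interval, and the assumption that openais-cfgtool is on root's PATH are illustrative, not a recommendation:

```
# /etc/cron.d/openais-rrp-recheck (hypothetical file name)
# Re-enable any failed redundant ring every 5 minutes,
# forcing OpenAIS to re-evaluate link status.
*/5 * * * * root openais-cfgtool -r
```
]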
Re: [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure
Florian Haas writes: Agree that they're hacks, but disagree with your alternative. Why should Pacemaker be concerned with low-level OpenAIS recovery procedures? then have the variable in OpenAIS configuration. -- juha ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] PingD Failure-Timeout
On Thu, May 21, 2009 at 10:20 PM, Eliot Gable ega...@broadvox.net wrote: Is there a way to time-out the failure of PingD?

Yes, but you need version >= 1.0.0. I assume you're not running it as a clone, right?

In my configuration, I cannot run PingD all the time on every node. Only one node (the master) has public Internet access. I use PingD to cause the master to fail over to one of the slaves. When a slave becomes master, it then gains public Internet connectivity. When it is a slave, the entire interface is down, so not even the gateway is reachable.

So, I set up a PingD resource that is co-located with the master resource in the Master state. I also set up constraints that assign a -1000 score to a node for each resource if that node loses connectivity to the gateway. The result is that if I firewall off ICMP on the master, it correctly fails over to a slave. Then, it runs a stop on the master, as expected, since it has a -1000 score. The result is that my master resource runs as Master on the node that was the slave, and is Stopped on the node that was the master. However, it is still stuck with a -1000 score, and will never restart on the node that was the master until PingD thinks it has connectivity back. But that won't happen, because PingD no longer runs on that node since the interface is down, and it wouldn't see anything if it did.

I set a failure-timeout on the PingD resource, but it does not seem to do anything. Running 'crm_verify -V -L 2>&1 | less' shows that the -1000 score stays there, even well past the failure-timeout. Anybody have any suggestions how I can automatically clear that -1000 score after a certain (small) interval of time?

Eliot Gable, Senior Engineer, 1228 Euclid Ave, Suite 390, Cleveland, OH 44115. Direct: 216-373-4808, Fax: 216-373-4657, ega...@broadvox.net

___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
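[Editor's note: for reference, failure-timeout is a resource meta attribute in crm shell syntax. A sketch of what such a pingd primitive might look like; the host_list, multiplier, and timeout values are placeholders, not the poster's actual configuration:

```
primitive pingd ocf:pacemaker:pingd \
    params host_list="192.168.1.1" multiplier="1000" \
    op monitor interval="15s" \
    meta failure-timeout="60s"
```

Note that failure-timeout expires recorded operation failures; scores pushed via node attributes by pingd itself are a separate mechanism, which may explain the behavior described above.]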
Re: [Pacemaker] crm command line tool problem
On Mon, May 25, 2009 at 5:51 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:

Hi, On Fri, May 22, 2009 at 08:51:45AM -0700, Joe Armstrong wrote: Hi All, I am playing around with the crm command line tool to create an HA config for pacemaker and am bumping into a problem. If I have a configuration running already, 3-node with ip + httpd (pretty simple), and I want to create a new configuration, according to the CRM CLI document I should run: crm configure erase and then create my new configuration. But the erase directive also blows away my cluster node definitions. If I

This was fixed at the beginning of April. The crm now doesn't remove nodes on erase. Please update if possible.

1.0.4 should be out later this week (if I can get my primary machine running again!)

___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
[Pacemaker] Redesigned Debian HA packages
i replaced my older packages with the new debian packages (heartbeat and pacemaker-heartbeat) and my cluster came up automatically without a need to change anything.

regarding crm_mon, i would like to start it automatically to monitor the cluster and send alert emails if something happens, but no /etc/init.d script seems to exist for that purpose.

-- juha

___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure
On 2009-05-25 17:45, Andrew Beekhof wrote: SUSE is currently recommending NIC bonding. We've not been able to get satisfactory behavior from clusters using RRP.

I've repeatedly told customers that NIC bonding is not a valid substitute for redundant Heartbeat links, and I will stubbornly insist it isn't one for OpenAIS RRP links either. Some reasons:

- You're not protected against bugs, currently known or unknown, in the bonding driver. If bonding itself breaks, you're screwed.
- Most people actually run bonding over interfaces of the same make, model, and chipset. That's not necessarily optimal, but it's a reality. Thus, if your driver breaks, you're screwed again. Granted, this probably applies if you ran two RRP links in that same configuration, too.
- Finally, you can't bond between a switched and a direct back-to-back connection, which makes bonding entirely unsuitable for the redundant-links use case I described earlier.

1. Set rrp_problem_count_timeout and/or rrp_problem_count_threshold ridiculously high so the ring status never goes to faulty. (It seems that RRP problem counting can't be disabled altogether.) 2. Have package maintainers include some magic that does openais-cfgtool -r every time a network link changes its status to UP (where the network management subsystem permits this). 3. Instruct users to install cron jobs that do openais-cfgtool -r in specified intervals, causing OpenAIS to re-check the link status periodically.

You could add it to the drbd monitor action I guess. But it does seem sub-optimal.

I already made my point with regard to Juha's suggestion: it seems odd for Pacemaker to fiddle with its own communication infrastructure. To instead defer that task to a Pacemaker resource agent seems positively disturbing.

I think the best solution is to work with upstream to get the feature working properly.

That I fully agree with. The question is what "working properly" means in this case -- should it be capable of auto-recovery, or should it not?
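[Editor's note: the knobs named in workaround 1 live in the totem section of openais.conf. A sketch of what such a configuration might look like; the values are deliberately inflated illustrations, not recommendations:

```
totem {
        rrp_mode: passive
        # Per workaround 1: set these ridiculously high so the ring
        # is effectively never marked faulty. Illustrative values only.
        rrp_problem_count_threshold: 1000000
        rrp_problem_count_timeout: 2000
}
```
]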
Cheers, Florian ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Redesigned Debian HA packages
On Mon, May 25, 2009 at 6:05 PM, Juha Heinanen j...@tutpro.com wrote: i replaced my older packages with the new debian packages (heartbeat and pacemaker-heartbeat) and my cluster came up automatically without a need to change anything. regarding crm_mon, i would like to start it automatically to monitor the cluster and send alert emails if something happens, but no /etc/init.d script seems to exist for that purpose.

You probably want it running as a cluster resource. There may even be an RA for it (if not, we need to write one).

___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
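[Editor's note: Pacemaker does ship an agent for this, ocf:pacemaker:ClusterMon, which wraps crm_mon as a cluster resource. A hedged crm shell sketch, assuming that agent is available in your build; the htmlfile path and interval are placeholders to adapt:

```
primitive cluster-mon ocf:pacemaker:ClusterMon \
    params htmlfile="/var/www/html/cluster.html" \
    op monitor interval="60s"
```
]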
Re: [Pacemaker] crm command line tool problem
On Mon, May 25, 2009 at 06:03:20PM +0200, Andrew Beekhof wrote: On Mon, May 25, 2009 at 5:51 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:

Hi, On Fri, May 22, 2009 at 08:51:45AM -0700, Joe Armstrong wrote: Hi All, I am playing around with the crm command line tool to create an HA config for pacemaker and am bumping into a problem. If I have a configuration running already, 3-node with ip + httpd (pretty simple), and I want to create a new configuration, according to the CRM CLI document I should run: crm configure erase and then create my new configuration. But the erase directive also blows away my cluster node definitions. If I

This was fixed at the beginning of April. The crm now doesn't remove nodes on erase. Please update if possible.

1.0.4 should be out later this week (if I can get my primary machine running again!)

Must be epidemic: mine's broken too (motherboard went south, out of warranty by just about a month)

___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure
On Mon, 2009-05-25 at 18:32 +0300, Juha Heinanen wrote: Florian Haas writes: Agree that they're hacks, but disagree with your alternative. Why should Pacemaker be concerned with low-level OpenAIS recovery procedures? then have the variable in OpenAIS configuration.

Self-healing is not as obvious or easy as it sounds. Totem (the protocol) has no way to determine when the admin has replaced the faulty switch in the network.

One option is to periodically try the failed ring for liveness. The problem with this approach is that it is hard to implement. Another option is to re-enable the ring after some period of time internally and hope for the best. The problem with this approach is that it causes performance degradation every time the failed ring is re-enabled and restarted.

I think the first option is the best, but atm there isn't anyone that has written patches, and most people are focused on the 1.0 release...

Regards -steve

___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure
Steven Dake writes: Self-healing is not as obvious or easy as it sounds. Totem (the protocol) has no way to determine when the admin has replaced the faulty switch in the network.

why can't it keep on pinging the interface/ip address even if there is no response? how is it with pingd: does pinging stop if there is no response, and does the node remain dead forever after a ping failure until someone manually does something? i don't see any difference here regarding heartbeat/openais-level pinging.

One option is to periodically try the failed ring for liveness. The problem with this approach is that it is hard to implement.

try all the time, also after failure, like was done before the failure.

I think the first option is the best, but atm there isn't anyone that has written patches and most people are focused on the 1.0 release...

a 1.0 release that people cannot migrate to from a current heartbeat/pacemaker setup without losing self-healing capability makes little sense.

-- juha

___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker