Re: [Linux-HA] stonith external/rackpdu question

2010-05-25 Thread Peter Kruse
Hello, Alexander Fisher wrote: A word of warning. If the two PDUs can't communicate (perhaps a switch is down or a bad cable) and you fire the reset command at the PDU you can talk to, it'll appear to have worked (from the cluster's perspective), but obviously won't have. I think this warning

Re: [Linux-HA] [Pacemaker] new doc about stonith/fencing

2009-05-13 Thread Peter Kruse
Hello, Dejan Muhamedagic wrote: I tried to list all devices which manage host's power and may be used for fencing. I also tried to describe their deficiencies. None of the devices are recommended really as that depends on particular circumstances. I mean recommended in the sense that it

Re: [Linux-HA] [Pacemaker] new doc about stonith/fencing

2009-05-13 Thread Peter Kruse
Hi Karl, Karl Katzke wrote: Actually, I believe that the different vendor implementations of lights out systems (DRAC, HP/Compaq ILO, various others) *do* support that in various ways and fashions. Dell's RAC has a battery that lasts for up to 30 minutes last time I read its specs.

Re: [Linux-HA] [Pacemaker] new doc about stonith/fencing

2009-05-11 Thread Peter Kruse
Hi Andrew, Andrew Beekhof wrote: Any switch that shares power with the host(s) it controls clearly has a SPoF. You don't need me to tell you that. But that does not have to be a SPoF for the entire system! The problem here is that a single failure (power loss) causes not only one node to go

Re: [Linux-HA] [Pacemaker] new doc about stonith/fencing

2009-05-06 Thread Peter Kruse
Hello, thanks for your replies, Andreas Mock wrote: If the PDUs become unavailable and shortly after the host is unavailable as well, then assume the host is down and fenced successfully. 'assume' is the bad word here. Stonith is there so that the cluster does NOT have to assume

Re: [Linux-HA] [Pacemaker] new doc about stonith/fencing

2009-05-06 Thread Peter Kruse
Hi Andrew, Andrew Beekhof wrote: I'd have to agree here. However, I can imagine a (non-default) option for the agent called cause-probable-data-corruption that let the agent behave in the way Peter describes. Just so that there is no way the user can claim they didn't understand the

Re: [Linux-HA] [Pacemaker] new doc about stonith/fencing

2009-05-04 Thread Peter Kruse
Hi Dejan, Dejan Muhamedagic wrote: As usual, constructive criticism/suggestions/etc are welcome. Thanks for sharing. Allow me to bring up a topic that from my point of view is important. You have written: The lights-out devices (IBM RSA, HP iLO, Dell DRAC) are becoming increasingly popular

[Linux-ha-dev] Re: [Linux-HA] APC SNMP STONITH

2007-09-18 Thread Peter Kruse
Hello, Philip Gwyn wrote: As discussed earlier, I'm writing a new SNMP STONITH plugin. The goal is for it to seamlessly work with the new and old MIBs (AP9606 vs AP7900). Ok, the old apcmastersnmp needed work, right. Instead of fixing the current apcmastersnmp.c, I started over from

Re: [Linux-HA] APC SNMP STONITH

2007-09-18 Thread Peter Kruse
Hello, Philip Gwyn wrote: As discussed earlier, I'm writing a new SNMP STONITH plugin. The goal is for it to seamlessly work with the new and old MIBs (AP9606 vs AP7900). Ok, the old apcmastersnmp needed work, right. Instead of fixing the current apcmastersnmp.c, I started over from

Re: [Linux-HA] Confusion about MailTo RA and monitoring

2007-07-17 Thread Peter Kruse
Hi, David Lang wrote: there is a second issue with MailTo: part of the OCF spec is that it is considered 'safe' to call start or stop multiple times on an RA; with MailTo this will generate multiple e-mails. This isn't a fatal problem, but it is an annoyance (I've had the shutting down

[Linux-HA] 2.1.1 change in behaviour

2007-07-17 Thread Peter Kruse
Hello, while testing version 2.1.1 I found a change in behaviour when a resource in a group fails. If there are resources a b c d e f in the group G, and e fails, this happens: 2.0.8: stop f, stop e, start e, start f; 2.1.1: stop f, stop e, start a, start b, ..., start f. What is the
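
The two action sequences reported above can be written down as a small sketch. It only reproduces the observed orderings for an ordered group; it is not the policy engine's actual logic, and the function name `recovery_actions` and its flag are hypothetical.

```python
def recovery_actions(group, failed, restart_whole_group):
    """Return the ordered stop/start actions after `failed` breaks in an ordered group."""
    idx = group.index(failed)
    # The failed member and everything after it are stopped in reverse order.
    stops = [f"stop {name}" for name in reversed(group[idx:])]
    # 2.0.8-style: start only what was stopped.
    # 2.1.1-style: issue a start for every member of the group.
    to_start = group if restart_whole_group else group[idx:]
    starts = [f"start {name}" for name in to_start]
    return stops + starts


group = ["a", "b", "c", "d", "e", "f"]
print("2.0.8:", ", ".join(recovery_actions(group, "e", restart_whole_group=False)))
print("2.1.1:", ", ".join(recovery_actions(group, "e", restart_whole_group=True)))
```

Since start is expected to be safe to call on a resource that is already running, the extra start actions for a..d in the second sequence should be harmless, which matches the "safe by definition" remark in the follow-up.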

Re: [Linux-HA] 2.1.1 change in behaviour

2007-07-17 Thread Peter Kruse
Hi, Lars Marowsky-Bree wrote: On 2007-07-17T12:25:48, Peter Kruse wrote: Good question. I assume Andrew has a good explanation when you have the testcase (pe inputs). ;-) done, bug #1648 But, this is safe by definition. true. What problem is this causing for you

Re: [Linux-HA] Confusion about MailTo RA and monitoring

2007-07-16 Thread Peter Kruse
Hello, thanks for your fast replies. matilda matilda wrote: Peter Kruse 16.07.2007 10:58 1) MailTo RA does have the monitor call. So it can be called and the required API is fulfilled. 2) In the case of MailTo the output of 'monitor' is a warning to the log. You can

Re: [Linux-HA] Confusion about MailTo RA and monitoring

2007-07-16 Thread Peter Kruse
Hi Lars, Lars Marowsky-Bree wrote: _If_ the fs goes haywire or is forcibly unmounted somehow, _and_ you're not monitoring it, heartbeat will never detect that error, but instead restart the application on top. That will fail though (because the fs is gone), and the node be blacklisted for that

Re: [Linux-HA] How to validate meta-data of OCF RAs

2007-07-11 Thread Peter Kruse
Hello, Max Hofer wrote: 1. Use the right DTD 2. use xmlstarlet to validate the XML output of your script example: ./your_ocf_script meta-data | xmlstarlet val -d ra-api-1.dtd - ah! very good, I guess that's what I was looking for. The hard part is the right DTD. As Andrew pointed out the HA
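
Where xmlstarlet or the DTD is not at hand, a rough first check can be done with the Python standard library. The sketch below is not DTD validation: it only tests well-formedness and a few structural points. The set of required actions and the sample meta-data are assumptions made for illustration, and `check_meta_data` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed minimum set of actions for this sketch; the DTD remains the authority.
REQUIRED_ACTIONS = {"start", "stop", "monitor", "meta-data"}


def check_meta_data(xml_text):
    """Return a list of problems found in an RA's meta-data output (empty if none)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    problems = []
    if root.tag != "resource-agent":
        problems.append(f"unexpected root element <{root.tag}>")
    if not root.get("name"):
        problems.append("resource-agent has no name attribute")
    declared = {action.get("name") for action in root.findall("./actions/action")}
    for missing in sorted(REQUIRED_ACTIONS - declared):
        problems.append(f"action '{missing}' is not declared")
    return problems


# Hypothetical meta-data, as a script might print it for the meta-data action.
SAMPLE = """
<resource-agent name="example" version="0.1">
  <version>1.0</version>
  <shortdesc lang="en">Example agent</shortdesc>
  <parameters/>
  <actions>
    <action name="start" timeout="20"/>
    <action name="stop" timeout="20"/>
    <action name="monitor" timeout="20" interval="10"/>
  </actions>
</resource-agent>
"""

for problem in check_meta_data(SAMPLE):
    print(problem)
```

In practice the input would be the captured output of `./your_ocf_script meta-data` rather than a literal string.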

Re: [Linux-HA] Stale NFS File Handles, even with fsid=1234

2007-07-10 Thread Peter Kruse
Hi, Stefan Lasiewski wrote: OS: RHEL4 u4 , x86_64 Heartbeat version: heartbeat-2.0.8-2.el4.centos Two servers: fs1 is 'primary'. fs2 is 'standby'. Client name is app1 , running RHEL4 u4 i386 What kernel version is that, make sure it is not vulnerable to this:

[Linux-HA] How to validate meta-data of OCF RAs

2007-07-10 Thread Peter Kruse
Hello all, OCF Resource agents are required to provide information about this resource as an XML snippet (http://www.linux-ha.org/OCFResourceAgent) How can I validate that XML output? Thanks for any hint, Peter

[Linux-HA] Help understand an incident

2007-07-03 Thread Peter Kruse
Hello list! today in one of our clusters a failover occurred. Good news: it succeeded. But... while looking through the logs we found that messages are missing on one node so we can not say exactly what happened. Attached is the syslog from node-2 from the time where there are no messages on

Re: [Linux-HA] ERROR: write_last_sequence: /var/lib/heartbeat/pengine/pe-input.last does not exist

2007-06-27 Thread Peter Kruse
Hi, Andrew Beekhof wrote: later versions do not seem to contain that message anymore That means it's nothing to worry about? Even in the latest (released) version (which I think is 2.0.8)? I can continue to ignore it then? Peter

Re: [Linux-HA] setting is_managed to true triggers restart

2007-05-08 Thread Peter Kruse
Hi, thanks for your replies. Andrew Beekhof wrote: On 5/8/07, Peter Kruse wrote: for this reason, and to avoid polluting the parameter namespace with CRM options, we created meta attributes at some point. you can operate on these by simply adding the --meta option to your

Re: [Linux-HA] setting is_managed to true triggers restart

2007-05-08 Thread Peter Kruse
Hi, Andrew Beekhof wrote: the blocks look the same, just use meta_attributes instead of instance_attributes and put the options in there (and use cibadmin to update them). That works, thanks! Peter

[Linux-HA] BadThingsHappen with v2.0.5.

2007-04-19 Thread Peter Kruse
Hello, thanks for reading this. As it's with the ancient v2.0.5, please tell me that this problem cannot happen with a recent version of heartbeat. Problem description: yesterday in one of our two-node HA clusters a successful takeover happened, where the failed node was reset, so far so good.

Re: [Linux-HA] BadThingsHappen with v2.0.5.

2007-04-19 Thread Peter Kruse
Hi Andrew! Andrew Beekhof wrote: beosrv-c-2 is the failed node right? it was beosrv-c-1 that failed, beosrv-c-2 took over. do you have logs from there too? attached (messages about Gmain_timeout removed, there were too many of them) The problem now is that cibadmin -m reports: CIB on

Re: [Linux-HA] BadThingsHappen with v2.0.5.

2007-04-19 Thread Peter Kruse
Andrew Beekhof wrote: then i'm afraid your use of the "don't fence nodes on startup" option has come back to haunt you. beosrv-c-1 came up but was not able to find beosrv-c-2 (even though it _was_ running) and because of that option beosrv-c-1 just pretended beosrv-c-2 wasn't running and happily

Re: [Linux-HA] Distinguish probe and monitor

2007-04-19 Thread Peter Kruse
Hello, thanks for this discussion. Andrew Beekhof wrote: On 4/19/07, Peter Kruse wrote: the PE makes zero distinction between them and since it's the one doing the asking i believe that it is its meaning that counts. yes, think so, too. both ask the same question

Re: [Linux-ha-dev] new apc firmware breaks apcmastersnmp.so

2007-04-05 Thread Peter Kruse
Hello, Alan Robertson wrote: Dave Blaschke wrote: Also, is there some way to determine what firmware is on the APC and then pass the appropriate OID_ constant? This plugin must work for some folks (at least the original author anyway ;-) so these changes would probably break folks who are

Re: [Linux-ha-dev] new apc firmware breaks apcmastersnmp.so

2007-04-04 Thread Peter Kruse
Hi Dave, Dave Blaschke wrote: I cannot find the Config info syntax: message in the latest or any of the most recent 2.0.x code - what version of heartbeat are you using? Oops, yes that was an old version, but that doesn't make a difference concerning the oids. Regardless, you should get a

[Linux-ha-dev] new apc firmware breaks apcmastersnmp.so

2007-04-03 Thread Peter Kruse
Hello, with the v3 firmware of APC's PDUs (models AP7920 and AP7921 at least) the apcmastersnmp.so plugin to stonith does not work anymore. In apcmastersnmp.c there is: #define OID_IDENT .1.3.6.1.4.1.318.1.1.4.1.4.0 #define OID_NUM_OUTLETS .1.3.6.1.4.1.318.1.1.4.4.1.0

Re: [Linux-ha-dev] What happened to rsc_state?

2006-05-12 Thread Peter Kruse
Hi, Andrew Beekhof wrote: i ran ptest and it wants to start fence1:1 and fence2:1 the CRM probably just needs a little poke to rerun the PE. try: crm_attribute -n last_cleanup -v `date -r` ah! that did the trick, but I had to use `date -R` ;) i cleaned this up for 2.0.6 earlier this

Re: [Linux-ha-dev] What happened to rsc_state?

2006-05-10 Thread Peter Kruse
Hi, Andrew Beekhof wrote: On 5/9/06, Peter Kruse wrote: although cibadmin -Ql -o status does not show the failed resource anymore. How can I recover from this situation? cib contents? Oh, thanks for reminding me (I should know by now...) attached is output of cibadmin

[Linux-ha-dev] What happened to rsc_state?

2006-05-09 Thread Peter Kruse
Hello, it seems that in 2.0.5 the attribute rsc_state to lrm_rsc_op has disappeared. And has been replaced by rc_code and op_status. But it is not the same. In order to remove errors in the cib, so that resources are started again, or nodes can take over again, I used to do something like this:

Re: [Linux-ha-dev] What happened to rsc_state?

2006-05-09 Thread Peter Kruse
Hi, Andrew Beekhof wrote: if you want a list of failed resources: crm_mon -1 | grep failed if you just want the lrm_rsc_op's that failed, look for rc_code != 0 and rc_code != 7 (where 7 is LSB for Safely Stopped) in the result of cibadmin -Ql -o status Is that also true for fencing resources?
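
The rc_code filter described above can be sketched in a few lines. The attribute names `rc_code` and `op_status` on `lrm_rsc_op` come from the thread; the surrounding element layout, the ids and the values in the sample are hypothetical and stand in for real `cibadmin -Ql -o status` output.

```python
import xml.etree.ElementTree as ET

# 0 = success, 7 = not running ("Safely Stopped" in the thread).
OK_CODES = {"0", "7"}


def failed_ops(status_xml):
    """Return (id, rc_code) for every lrm_rsc_op whose rc_code is neither 0 nor 7."""
    root = ET.fromstring(status_xml)
    return [
        (op.get("id"), op.get("rc_code"))
        for op in root.iter("lrm_rsc_op")
        # Operations without an rc_code attribute are skipped, not reported as failed.
        if op.get("rc_code") is not None and op.get("rc_code") not in OK_CODES
    ]


# Hypothetical status section; real output has more attributes and nesting.
SAMPLE_STATUS = """
<status>
  <node_state uname="node-1">
    <lrm>
      <lrm_resources>
        <lrm_resource id="fs1">
          <lrm_rsc_op id="fs1_start_0" rc_code="0" op_status="0"/>
          <lrm_rsc_op id="fs1_monitor_10000" rc_code="1" op_status="4"/>
        </lrm_resource>
        <lrm_resource id="ip1">
          <lrm_rsc_op id="ip1_monitor_0" rc_code="7" op_status="0"/>
        </lrm_resource>
      </lrm_resources>
    </lrm>
  </node_state>
</status>
"""

for op_id, rc in failed_ops(SAMPLE_STATUS):
    print(op_id, rc)
```

Because the function walks all `lrm_rsc_op` elements regardless of depth, it does not depend on the exact nesting assumed in the sample.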

Re: [Linux-ha-dev] File descriptor left open

2006-02-14 Thread Peter Kruse
Hello, Alan Robertson wrote: Do you have any idea where this message is coming from? Hm, no, they are from lrmd? When I started v2.0.3 yesterday there came these messages: Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 3 left open Feb

Re: [Linux-ha-dev] File descriptor left open

2006-02-13 Thread Peter Kruse
is still open? Peter Peter Kruse wrote: Hello, In my logs I get these messages like this: Feb 7 18:23:57 ha-test-1 lrmd: [2000]: info: RA output: (rg1:fpbs1:start:stderr) File descriptor 3 left open File descriptor 4 left open File descriptor 5 left open File descriptor 6 left open File

[Linux-ha-dev] File descriptor left open

2006-02-08 Thread Peter Kruse
Hello, In my logs I get these messages like this: Feb 7 18:23:57 ha-test-1 lrmd: [2000]: info: RA output: (rg1:fpbs1:start:stderr) File descriptor 3 left open File descriptor 4 left open File descriptor 5 left open File descriptor 6 left open File descriptor 7 left open File descriptor 8

Re: [Linux-ha-dev] Re: [Linux-ha-cvs] Linux-HA CVS: crm by andrew from

2006-02-05 Thread Peter Kruse
Good Morning, Huang Zhen wrote: It looks as if the code treats HA_CCMUID as the group id and HA_APIGID as the user id. Right, I just stumbled across that problem, too. The error message is: ERROR: mask(io.c:readCibXmlFile): /var/lib/heartbeat/crm/cib.xml must be owned and read/writeable by user 17,

Re: [Linux-ha-dev] Tracking 2.0.3 release

2006-01-20 Thread Peter Kruse
Hello, Lars Marowsky-Bree wrote: On 2006-01-20T10:03:53, Andrew Beekhof wrote: Woah, what are you calling crm_attribute for all the time? It's either an ipfail replacement or his way of getting resources to run on the node where they've failed the least... I

Re: [Linux-ha-dev] problem with some RA (output: cat: write error: Broken pipe)

2006-01-16 Thread Peter Kruse
Hi, Francis Montagnac wrote: I think it would be better to only reset SIGPIPE to SIG_DFL (perhaps also other signals) in the LRM just before exec'ing any external (ie: not pertaining to heartbeat itself) commands like the RA's. Is that hard to do? Or has somebody already done so? Should I
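
The effect of resetting SIGPIPE before exec can be reproduced outside heartbeat. The sketch below is an analogy, not LRM code: Python's `subprocess` stands in for the parent, and `cat /dev/zero | head -c 1` stands in for the RA pipeline whose reader exits early (as `grep -q` does). It assumes a POSIX system with `sh`, `cat` and `head` available.

```python
import subprocess

# The reader (head) exits after one byte while the writer (cat) keeps writing.
PIPELINE = "cat /dev/zero | head -c 1"


def pipeline_stderr(reset_sigpipe):
    """Run PIPELINE through sh and return whatever it wrote to stderr."""
    result = subprocess.run(
        ["sh", "-c", PIPELINE],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        # True: SIGPIPE is set back to SIG_DFL in the child before exec,
        #       so the writer is terminated quietly by the signal.
        # False: the child inherits the parent's disposition. CPython ignores
        #        SIGPIPE at startup, so the writer sees a failed write (EPIPE)
        #        and reports it on stderr.
        restore_signals=reset_sigpipe,
        timeout=30,
    )
    return result.stderr.decode(errors="replace")


print("SIGPIPE reset to default:", repr(pipeline_stderr(True)))
print("SIGPIPE ignored:", repr(pipeline_stderr(False)))
```

With SIGPIPE ignored, the writer typically complains along the lines of the "write error: Broken pipe" message in the subject; the exact wording depends on the `cat` implementation.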

Re: [Linux-ha-dev] problem with some RA (output: cat: write error: Broken pipe)

2006-01-16 Thread Peter Kruse
Hello, Anyway, I have not tested it yet, so I am not sure if it's really the fix for your issue. Could you please test it and post the result to the mailing list? TIA! Yes, the problem is gone, there are no more messages like that in syslog. Great! Peter

[Linux-ha-dev] problem with some RA (output: cat: write error: Broken pipe)

2006-01-12 Thread Peter Kruse
Hello, In one of my RAs there is a line like this: ( exportfs ; cat /proc/fs/nfs/exports ) | grep -q ^${export_dir}[ ] This line apparently produces these errors: Jan 12 13:40:08 ha-test-1 lrmd: [16217]: info: RA output: (rg1:nfs1:monitor:stderr) cat: Jan 12 13:40:08 ha-test-1 lrmd: