Re: [Linux-HA] Antw: Re: pcs or crmsh?

2012-11-14 Thread Lars Marowsky-Bree
On 2012-11-14T09:33:22, Digimer li...@alteeve.ca wrote: As it was told to me, pcs was going to be what was used officially, but that anyone and everyone was welcome to continue using and developing crm or any other existing or new management tool. My take-away was that the devs wanted pcs,

Re: [Linux-HA] pcs or crmsh?

2012-11-14 Thread Lars Marowsky-Bree
On 2012-11-14T09:24:45, Rasto Levrinc rasto.levr...@gmail.com wrote: What doesn't work? I think that at this point in time, it'd be easier to get crmsh going/fixed with pcmk 1.1.8. It's probably just some path somewhere. If really nothing works, you *must* use LCMC, Pacemaker GUI. :) crmsh's

Re: [Linux-HA] Antw: Re: pcs or crmsh?

2012-11-14 Thread Lars Marowsky-Bree
On 2012-11-14T12:44:53, Digimer li...@alteeve.ca wrote: Not really, to be honest. The way I see it is that Pacemaker is in tech preview (on rhel, which is where I live). So almost by definition, anything can change at any time. This is what happened here, so I don't see a problem. That is a

Re: [Linux-HA] cib_replace failed?

2012-11-13 Thread Lars Marowsky-Bree
On 2012-11-12T10:07:50, Andrew Beekhof and...@beekhof.net wrote: Um, are you setting a nodeid in corosync.conf? Because I see this: Nov 09 09:07:25 [2609] ha09a.mycharts.md crmd: crit: crm_get_peer: Node ha09a.mycharts.md and ha09a share the same cluster node id

Re: [Linux-HA] cib_replace failed?

2012-11-13 Thread Lars Marowsky-Bree
On 2012-11-13T16:34:23, Robinson, Eric eric.robin...@psmnv.com wrote: bump. Could someone please review the logs in the links below and tell me what the heck is going on with this cluster? I've never encountered anything like this before. Basically, corosync thinks the cluster is healthy

Re: [Linux-HA] cib_replace failed?

2012-11-13 Thread Lars Marowsky-Bree
On 2012-11-13T17:06:31, Robinson, Eric eric.robin...@psmnv.com wrote: I'm not sure how to correct this. Here are the results of my name resolution test on node ha09a... I'd probably strip everything except the short names out of /etc/HOSTNAME and /etc/hosts, though it may be sufficient to
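For illustration, a stripped-down /etc/hosts along those lines might look like this (addresses and the peer name are made up; /etc/HOSTNAME would then contain only the short name ha09a):

  10.0.0.10   ha09a
  10.0.0.11   ha09b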

Re: [Linux-HA] Bug around on-fail on op monitor ?

2012-11-12 Thread Lars Marowsky-Bree
On 2012-11-12T15:01:47, alain.mou...@bull.net wrote: Thanks but no, in older releases, a failed monitor op led to fencing as required by on-fail=fence . yes, that's what should happen. You can file a crm_report with the PE inputs showing this for 1.1.7, or directly retest with 1.1.8.
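A crm_report covering the failure window can be collected roughly like this (time range and output name are placeholders; exact options vary between versions):

  crm_report -f "2012-11-12 14:00" -t "2012-11-12 16:00" onfail-fence-report

The resulting tarball includes the PE inputs from that interval.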

Re: [Linux-HA] Antw: Re: Pacemaker STONITH Config Check

2012-11-08 Thread Lars Marowsky-Bree
On 2012-11-07T12:51:25, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: I agree that one shouldn't have to do it, but I've seen cases (two node cluster with quorum-policy=ignore) where one node was down while the cluster wanted to fence both nodes. So when the other node goes up, nodes
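For reference, such a two-node setup typically carries cluster properties along these lines (a sketch, with the usual caveat that ignoring quorum is only safe with reliable fencing):

  crm configure property no-quorum-policy=ignore
  crm configure property stonith-enabled=true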

Re: [Linux-HA] Q: crmd: [12771]: info: handle_request: Current ping state: S_TRANSITION_ENGINE

2012-11-06 Thread Lars Marowsky-Bree
On 2012-11-05T17:05:35, Dejan Muhamedagic de...@suse.de wrote: It's a debug instrumentation message. But it is only triggered when someone runs crmadmin -S, -H to look up the DC or something, it isn't triggered by the stack internally. If it's a debug message, why is it then at severity

Re: [Linux-HA] Q: crmd: [12771]: info: handle_request: Current ping state: S_TRANSITION_ENGINE

2012-11-05 Thread Lars Marowsky-Bree
On 2012-11-05T15:31:25, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: I just experienced that the syslog message crmd: [12771]: info: handle_request: Current ping state: S_TRANSITION_ENGINE is sent out several times per second for an extended period of time. So I wonder: Is it a

Re: [Linux-HA] cib_replace failed?

2012-10-31 Thread Lars Marowsky-Bree
On 2012-10-31T15:59:05, Robinson, Eric eric.robin...@psmnv.com wrote: Nobody has any thoughts on why my 2-node cluster has no DC? As I mentioned, corosync-cfgtool -s shows the ring active with no faults. That probably means that someone (i.e., you ;-) needs to dig more into the logs of

Re: [Linux-HA] Antw: Re: Limit for three Xen VMs in SLES11 SP2?

2012-10-25 Thread Lars Marowsky-Bree
On 2012-10-25T11:30:32, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: I just wonder: If the reason is some kind of resource shortage in the Xen Host that causes Xen guests to fail booting, it would be nice if that situation could be detected. I was just asking for an already known

Re: [Linux-HA] resource monitor timeout, Killing with signal SIGTERM (15).

2012-10-24 Thread Lars Marowsky-Bree
On 2012-10-24T11:15:14, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote: I'm happy you have something that works for you. Although even if you're using it in haresources mode, your resource agents are still years out of date. It doesn't have resource agents (that's one of its pluses in my

Re: [Linux-HA] Antw: Re: [Linux-ha-dev] glue 1.0.11 released

2012-10-24 Thread Lars Marowsky-Bree
On 2012-10-22T14:12:17, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Interesting formula: I'd use something like number of CPUs * 4, not divided by. Reason: Today's workload is usually limited by I/O, not by CPU power. However with something crazy like 32 CPUs, 32 tasks can

Re: [Linux-HA] resource monitor timeout, Killing with signal SIGTERM (15).

2012-10-24 Thread Lars Marowsky-Bree
On 2012-10-24T13:17:57, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote: I have e.g. mon script that greps 'lsof -i' to see if httpd is listening on * or cluster ip. Which IMO is a way saner check than wget'ting http://localhost/server-status -- and treating a [34]04 as a fail. Hence the plus

Re: [Linux-HA] resource monitor timeout, Killing with signal SIGTERM (15).

2012-10-24 Thread Lars Marowsky-Bree
On 2012-10-24T13:23:09, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote: PS. but for the most part, like you said: you *have* people stuck on 2.1.4 and you keep supporting them much as you hate it. Yes, but on SLES10, that was an actually shipping version with full support. EPEL has different

Re: [Linux-HA] Antw: Re: Q: Xen RA: node_ip_attribute

2012-09-28 Thread Lars Marowsky-Bree
On 2012-09-27T17:32:58, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Just a note: As it turned out, the Xen RA (SLES11 SP2, resource-agents-3.9.3-0.7.1) is broken, because migrate will never look at the node_ip_attribute you configured. It's line 369:

Re: [Linux-HA] Q: crm shell's migrate lifetime

2012-09-27 Thread Lars Marowsky-Bree
On 2012-09-27T16:36:08, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Hi Ulrich, we always appreciate your friendly, constructive and non-condescending feedback. However if you specify a duration like P2, the duration is not added to the current time; instead the current time is used
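As a sketch of the syntax in question (resource and node names are made up), the lifetime is given as an ISO8601 duration, e.g. PT2H for two hours or P2D for two days:

  crm resource migrate rsc_dummy node2 PT2H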

Re: [Linux-HA] Antw: Re: Q: Xen RA: node_ip_attribute

2012-09-24 Thread Lars Marowsky-Bree
On 2012-09-24T08:45:39, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: So I select one unique attribute name for Xen migration, specify that in the Xen resource, and then define that attribute per node, using one of the node's own IP addresses? Yes. The idea is that this allows you to

Re: [Linux-ha-dev] Q: Xen RA: node_ip_attribute

2012-09-21 Thread Lars Marowsky-Bree
On 2012-09-20T08:47:59, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: ---(resource-agents-3.9.3-0.7.1 of SLES 11 SP2)--- node_ip_attribute (string): Node attribute containing target IP address ^^ In case of a live migration, the system will

Re: [Linux-ha-dev] Slight bending of OCF specs: Re: Issues found in Apache resource agent

2012-09-12 Thread Lars Marowsky-Bree
On 2012-09-11T15:04:55, Alan Robertson al...@unix.sh wrote: Depends. Pacemaker may still care about the status of these agents. If it can't start or stop them, what can it do with them? The status from these agents may feed into operations on other resources that are fully managed.

Re: [Linux-ha-dev] Slight bending of OCF specs: Re: Issues found in Apache resource agent

2012-09-12 Thread Lars Marowsky-Bree
On 2012-09-12T09:01:05, Alan Robertson al...@unix.sh wrote: The status from these agents may feed into operations on other resources that are fully managed. Understood. I believe it will care about those other agents - not these. It shouldn't know about these, AFAIK. I guess then

Re: [Linux-ha-dev] Slight bending of OCF specs: Re: Issues found in Apache resource agent

2012-09-08 Thread Lars Marowsky-Bree
On 2012-09-07T13:46:27, Alan Robertson al...@unix.sh wrote: Well, I presume that one would not tell pacemaker about such agents, as they would not be useful to pacemaker. From the point of view of the crm command, you wouldn't consider them as valid resource agents to put in a

Re: [Linux-ha-dev] Slight bending of OCF specs: Re: Issues found in Apache resource agent

2012-09-07 Thread Lars Marowsky-Bree
On 2012-09-05T15:25:44, Dejan Muhamedagic de...@suse.de wrote: BTW, FWIW - monocf may be just like ocf, sans start and stop operations. That would make all ocf RAs eligible for this use. Thinking about this, not entirely. We'd have to fake the start/stop at least. (In particular the start.)

Re: [Linux-ha-dev] Slight bending of OCF specs: Re: Issues found in Apache resource agent

2012-09-05 Thread Lars Marowsky-Bree
On 2012-09-04T19:20:23, Alan Robertson al...@unix.sh wrote: I will likely write a monitor-only resource agent for web servers. What would you think about calling it from the other web resource agents? Sharing code - in this case, the monitor-via-network of the http agents - seems to make

Re: [Linux-ha-dev] Slight bending of OCF specs: Re: Issues found in Apache resource agent

2012-09-05 Thread Lars Marowsky-Bree
On 2012-09-05T15:25:44, Dejan Muhamedagic de...@suse.de wrote: How about a new element. Something like primitive vm1 ocf:heartbeat:VirtualDomain require vm1 web-test dns-test How we map this into Pacemaker's dependency scheme is obviously open to discussion. The require would imply that

Re: [Linux-HA] Antw: Duplicate monitor operation on a multi state resource

2012-09-05 Thread Lars Marowsky-Bree
On 2012-09-05T07:54:46, Andrew Beekhof and...@beekhof.net wrote: (Or rather, obscure enough to configure that it might well be a bug.) It'd be trivial to just append the role to the operation key too. (It'd cause a few monitors to be recreated on update, but that'd be harmless.) Not

Re: [Linux-HA] OS System update in live cluster ?

2012-09-05 Thread Lars Marowsky-Bree
On 2012-09-05T06:26:50, Stefan Schloesser sschloes...@enomic.com wrote: Hi Lars, my problem with the rolling upgrade is the drbd partition. If you migrate the service its data will move too. If you then restart the cluster and migrate back the data will not be in an upgraded state and

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-09-04 Thread Lars Marowsky-Bree
On 2012-09-04T10:50:11, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) external.martin.kon...@de.bosch.com wrote: I was reporting a serious bug in _your_ product and instead of thanking for the bugreport you simply closed it as invalid The bug was reported without a support contract. A

Re: [Linux-HA] OS System update in live cluster ?

2012-09-04 Thread Lars Marowsky-Bree
On 2012-09-04T15:56:14, Stefan Schloesser sschloes...@enomic.com wrote: Hi, I would like to know what the recommended way is to update a cluster. Every week or so, bug fixes and security patches are released for various parts of the software in use. I prefer rolling upgrades; migrate
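A rolling update then looks roughly like this, one node at a time (a sketch; the package commands depend on the distribution):

  crm node standby node1     # move resources off this node
  zypper patch               # or the distribution's update tool; reboot if needed
  crm node online node1      # rejoin, then repeat on the next node once all is healthy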

Re: [Linux-ha-dev] Note: Core Dumps with corosync-1.4.1-0.13.1 (SLES11 SP2)

2012-09-03 Thread Lars Marowsky-Bree
On 2012-08-31T14:56:22, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Hi! By chance I noticed that every node of my 5-node test cluster had at least one corosync core dump. Unfortunately they even seem to have different signatures. I can provide a rough backtrace to get you warmed

Re: [Linux-HA] Antw: Re: Q: Debug clustered IP Adress

2012-08-31 Thread Lars Marowsky-Bree
On 2012-08-31T13:41:14, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Hi! There are things I don't understand: Even after # /usr/lib64/heartbeat/send_arp -i 200 -r 5 br0 172.20.3.59 f1e991b1b951 not_used not_used neither the local arp table (arp) nor the software bridge (brctl

Re: [Linux-HA] Time based resource stickiness example with crm configure ?

2012-08-30 Thread Lars Marowsky-Bree
On 2012-08-30T12:53:45, Stefan Schloesser sschloes...@enomic.com wrote: I would like to configure the resource-stickiness to 0 tuesdays between 2 and 2:20 am local time. I could not find any examples on how to do this using crm configure ... but only the XML snippets to accomplish this.
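Whether current crmsh accepts date_spec rules in rsc_defaults directly I can't confirm, but the underlying CIB fragment is roughly the following (IDs are made up; it can be loaded with cibadmin --create -o rsc_defaults -x <file>):

  <meta_attributes id="rsc-options-maintenance" score="2">
    <rule id="maintenance-window" score="0">
      <date_expression id="maintenance-window-expr" operation="date_spec">
        <date_spec id="maintenance-window-spec" weekdays="2" hours="2" minutes="0-19"/>
      </date_expression>
    </rule>
    <nvpair id="maintenance-stickiness" name="resource-stickiness" value="0"/>
  </meta_attributes>

weekdays=2 is Tuesday, and hours/minutes restrict the rule to 02:00-02:19 local time; the score is meant to let this set override an existing default stickiness set, so check precedence on your Pacemaker version.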

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-29 Thread Lars Marowsky-Bree
On 2012-08-20T11:31:07, Lars Marowsky-Bree l...@suse.com wrote: Okay, so there's a bug in the NFS agent, point taken. I'll investigate why it took so long to release as a real maintenance update; you're right, that shouldn't happen. (I can already see it in the update queue though

Re: [Linux-HA] How HA can start systemd service.

2012-08-23 Thread Lars Marowsky-Bree
On 2012-08-23T09:35:51, Francis SOUYRI francis.sou...@apec.fr wrote: Hello Dejan, With FC 16, heartbeat is 3.0.4, not v1. I do not use crm because I can successfully implement ipfail. Dejan was referring to the v1 mode, namely the one that uses haresources. haresources can't drive

Re: [Linux-HA] Three clusters with common node

2012-08-22 Thread Lars Marowsky-Bree
On 2012-08-21T15:39:06, Carlos Pedro carlos_pe...@yahoo.com wrote: Dear Sirs, I'm working on a project and it was proposed that I build three clusters using a common node, that is: Nodes cannot be shared between clusters like this. You can either build a 2 node cluster (with all nodes in one),

Re: [Linux-HA] Antw: Duplicate monitor operation on a multi state resource

2012-08-22 Thread Lars Marowsky-Bree
On 2012-08-22T10:08:14, RaSca ra...@miamammausalinux.org wrote: Thank you Ulrich, As far as you know, Is there a way to override the ID for each cloned instance of the mysql resource? How can I resolve the problem? Just make the intervals slightly different - 31s, 30s, 29s ... Regards,
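In crm shell terms, the staggering for a master/slave resource looks like this (a sketch with made-up names; only the monitor lines matter here):

  primitive p_mysql ocf:heartbeat:mysql \
      op monitor interval=30s role=Master \
      op monitor interval=31s role=Slave
  ms ms_mysql p_mysql meta notify=true clone-max=2

Pacemaker identifies recurring operations by resource, action and interval, so monitors for different roles need different intervals to avoid colliding on the same key.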

Re: [Linux-HA] IP Clone

2012-08-21 Thread Lars Marowsky-Bree
On 2012-08-21T00:22:00, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote: CLUSTERIP which you presumably mean by fun with iptables is basically Jack gets all calls from even area codes and Jill: from odd area codes. Yeah, you could do that, I just can't imagine why. Because the commonly given
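For context, the CLUSTERIP behaviour in question is what a cloned IPaddr2 with globally-unique=true sets up, roughly like this (address and names are made up):

  primitive p_sharedip ocf:heartbeat:IPaddr2 \
      params ip=192.168.1.100 cidr_netmask=24 clusterip_hash=sourceip
  clone cl_sharedip p_sharedip \
      meta globally-unique=true clone-max=2 clone-node-max=2

Each node then answers for the hash buckets assigned to its clone instance.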

Re: [Linux-HA] Many messages form clvmd in SLES11 SP2

2012-08-21 Thread Lars Marowsky-Bree
On 2012-08-21T14:32:53, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Maybe I'm expecting too much, but isn't it possible to simply log Telling other nodes that PV blabla is being created? The problem is the error case, in which we want more logs. There is progress (libqb with the

Re: [Linux-HA] IP Clone

2012-08-21 Thread Lars Marowsky-Bree
On 2012-08-21T13:16:29, David Lang david_l...@intuit.com wrote: with ldirectord you have an extra network hop, and you have all your traffic going through one system. This is a scalability bottleneck as well as being a separate system to configure. CLUSTERIP isn't the solution to every

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-20 Thread Lars Marowsky-Bree
On 2012-08-17T16:38:01, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) external.martin.kon...@de.bosch.com wrote: I don't see an open bug for something like this right now. Are you serious? It was you who resolved this bug as INVALID in bugzilla

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-20 Thread Lars Marowsky-Bree
On 2012-08-17T16:42:42, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: obviously not, because I have the latest updates installed. It happens frequently enough to care about it: # zgrep sscan /var/log/messages-201208*.bz2 |wc -l 76 Here are some: /var/log/messages-20120816.bz2:Aug

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-20 Thread Lars Marowsky-Bree
On 2012-08-17T18:14:18, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) external.martin.kon...@de.bosch.com wrote: On the other hand you so far did not provide any case where SLES11 SP2 runs reliably unmodified in a mission-critical environment (e.g. an HA NFS server) without local bugfixes.

Re: [Linux-ha-dev] apply_xml_diff: Digest mis-match

2012-08-17 Thread Lars Marowsky-Bree
On 2012-08-13T15:39:22, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Hi! In pacemaker-1.1.6-1.29.1 (SLES11 SP2 x86_64) I see this for an idle cluster with just one stonith resource being running when doing some unrelated change: What is the unrelated change you are doing? Does

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-17 Thread Lars Marowsky-Bree
On 2012-08-16T17:54:06, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) external.martin.kon...@de.bosch.com wrote: Hi Martin, From my experience with SLES11 SP2 (with all current updates) I conclude that actually nobody is seriously running SP2 without local bugfixes. That isn't quite true.

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-17 Thread Lars Marowsky-Bree
On 2012-08-17T08:41:15, Nikita Michalko michalko.sys...@a-i-p.com wrote: I am also testing SP2 - and yes, it's true: not yet ready for production ;-( What problems did you find? Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-17 Thread Lars Marowsky-Bree
On 2012-08-17T11:43:13, Nikita Michalko michalko.sys...@a-i-p.com wrote: - e.g. the problem with SLES 11 SP2 kernel crashes - the same as described by Martin: SP2 kernels crash seriously (when a node rejoins the cluster) when using SCTP as recommended in the SLES HA documentation and

Re: [Linux-HA] Antw: Re: lrmd: [6136]: ERROR: crm_abort: crm_strdup_fn: Triggered assert at utils.c:1013 : src != NULL

2012-08-14 Thread Lars Marowsky-Bree
On 2012-08-14T12:44:43, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: The messages are coming from the stonith plugin (it's actually in pacemaker). But I think that that got fixed in the meantime. ^ Do you have the latest maintenance update? Yes, latest on SLES is

Re: [Linux-HA] crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-14 Thread Lars Marowsky-Bree
On 2012-08-14T16:59:02, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: While starting a clone resource (mount OCFS2 filesystem), I see this message in syslog: crmd: [31942]: notice: do_lrm_invoke: Not creating resource for a delete event: (null) info: notify_deleted: Notifying

Re: [Linux-HA] Antw: Bond mode for 2 node direct link

2012-07-18 Thread Lars Marowsky-Bree
On 2012-07-18T20:01:35, Arnold Krille arn...@arnoldarts.de wrote: That would mean that your system runs the same whether one or two links are present. That's not what I said. What I said (or at least meant ;-) is that, even in the degraded state, the performance must still be within

Re: [Linux-HA] Bond mode for 2 node direct link

2012-07-17 Thread Lars Marowsky-Bree
On 2012-07-16T11:53:55, Volker Poplawski volker.poplaw...@atrics.de wrote: Hello everyone. Could you please tell me the recommended mode for a bonded network interface, which is used as the direct link in a two machine cluster? There are 'balance-rr', 'active-backup', 'balance-xor' etc
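One commonly recommended option for a dedicated back-to-back link is active-backup; a SLES-style sketch (interface names and address are examples, and the variable names may differ by release):

  # /etc/sysconfig/network/ifcfg-bond0
  STARTMODE='auto'
  BONDING_MASTER='yes'
  BONDING_SLAVE0='eth1'
  BONDING_SLAVE1='eth2'
  BONDING_MODULE_OPTS='mode=active-backup miimon=100'
  IPADDR='192.168.100.1/24'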

Re: [Linux-HA] Antw: Bond mode for 2 node direct link

2012-07-17 Thread Lars Marowsky-Bree
On 2012-07-17T23:44:13, Arnold Krille arn...@arnoldarts.de wrote: Additionally: If it's two direct links dedicated to your storage network, there is no reason to go active/backup and discard half of the available bandwidth. Since the system must be designed for one link to have adequate

Re: [Linux-HA] Pacemaker and software RAID using shared storage.

2012-07-12 Thread Lars Marowsky-Bree
On 2012-07-12T10:31:53, Caspar Smit c.s...@truebit.nl wrote: Now the interesting part. I would like to create a software raid6 set (or multiple) with the disks in the JBOD and have the possibility to use the raid6 in an active/passive cluster. Sure. md RAID in a fail-over configuration is
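Such a fail-over md RAID is typically modelled with the Raid1 agent plus a filesystem on top, roughly like this (device paths and names are placeholders):

  primitive p_md0 ocf:heartbeat:Raid1 \
      params raidconf="/etc/mdadm.conf" raiddev="/dev/md0" \
      op monitor interval=60s timeout=60s
  primitive p_fs_md0 ocf:heartbeat:Filesystem \
      params device="/dev/md0" directory="/srv/data" fstype="xfs"
  group g_storage p_md0 p_fs_md0

Combined with fencing, this keeps the array assembled on only one node at a time.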

Re: [Linux-HA] Antw: Re: mount.ocfs2 in D state

2012-07-03 Thread Lars Marowsky-Bree
On 2012-07-03T11:26:11, darren.mans...@opengi.co.uk wrote: I'd like to second Lars' comments here. I was strong-armed into doing a dual-primary DRBD + OCFS2 cluster and it's a nightmare to manage. There's no reason for us to do it other than 'we could'. It just needed something simple like

Re: [Linux-HA] mount.ocfs2 in D state

2012-07-02 Thread Lars Marowsky-Bree
On 2012-07-02T10:42:33, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) external.martin.kon...@de.bosch.com wrote: when a split brain (drbd) happens mount.ocfs2 remains hanging unkillable in D-state. Unsurprising, since all IO is frozen during that time (depending on your drbd setup, but I'm

Re: [Linux-HA] Antw: Re: mount.ocfs2 in D state

2012-07-02 Thread Lars Marowsky-Bree
On 2012-07-02T12:05:33, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Unfortunately unless there's a real cluster filesystem that supports mirroring with shared devices also, DRBD on some locally mirrored device on each node seems to be the only alternative. (Talking about disasters)

Re: [Linux-HA] Antw: Re: mount.ocfs2 in D state

2012-07-02 Thread Lars Marowsky-Bree
On 2012-07-02T12:37:52, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: I've seen very few scenarios where OCFS2 was worth it over just using a regular file system like XFS in a fail-over configuration in this kind of environment. How would you fail over if your shared storage went

Re: [Linux-HA] Antw: Re: cib_process_diff: ... Failed application of an update diff

2012-06-29 Thread Lars Marowsky-Bree
On 2012-06-29T08:19:41, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: For SLE HA 11 SP1, please report these issues to NTS and SUSE support. As I'm sure they won't fix it in SP1 (that PTF is one year old now), SP1 is still supported by SUSE, and no one but our support folks know what

Re: [Linux-HA] OCFS2 - Renew node's IP address which has failed - Amazon EC2

2012-06-29 Thread Lars Marowsky-Bree
On 2012-06-28T11:37:37, Heitor Lessa heitor.le...@hotmail.com wrote: Such an issue happens because OCFS2 does not support changes (modify/del) to nodes in a running cluster; such tasks require the cluster to be down. If driven by Pacemaker, OCFS2 does support adding/removing nodes at runtime.

Re: [Linux-HA] cib_process_diff: ... Failed application of an update diff

2012-06-27 Thread Lars Marowsky-Bree
On 2012-06-27T14:18:26, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Hello, I see problems with applying configuration diffs so frequently that I suspect there's a bug in the code. This is for SLES11 SP1 on x86_64 with corosync-1.4.1-0.3.3.3518.1.PTF.712037 and

Re: [Linux-HA] Antw: Re: ocf:heartbeat:exportfs multiple exports, fsid, wait_for_leasetime_on_stop

2012-06-21 Thread Lars Marowsky-Bree
On 2012-06-21T08:02:25, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: See, it's simple. Any partially completed operation or state is not successful; ergo, a failure must be reported. Is it correct that the standard recovery procedure for this failure is node fencing then? If so it

Re: [Linux-HA] Antw: Re: ocf:heartbeat:exportfs multiple exports, fsid, wait_for_leasetime_on_stop

2012-06-20 Thread Lars Marowsky-Bree
On 2012-06-20T08:44:33, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: The problem is: What to do if 1 out of n exports fails: Is the resource started or stopped then. Likewise for unexporting and monitoring. If the operation partially failed, it is failed. But to have a clean

Re: [Linux-HA] Antw: Re: ocf:heartbeat:exportfs multiple exports, fsid, wait_for_leasetime_on_stop

2012-06-20 Thread Lars Marowsky-Bree
On 2012-06-20T16:37:35, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: so what exit code is failed? Then: With the standard logic of stop only performing when the resource is up (i.e. monitor reports stopped), a partially started resource that the monitor considers stopped may fail to

Re: [Linux-HA] What's the meaning of ... Failed application of an update diff

2012-06-20 Thread Lars Marowsky-Bree
On 2012-06-20T17:46:19, Andreas Kurz andr...@hastexo.com wrote: hb_report does not work. How do I create a report tarball? It has been renamed to crm_report. There's still both around. Just that different distributions ship different implementations. Because. Well. Because. /rant Regards,

Re: [Linux-HA] What's the meaning of ... Failed application of an update diff

2012-06-19 Thread Lars Marowsky-Bree
On 2012-06-19T08:38:11, alain.mou...@bull.net wrote: So that means that my modifications via crm configure edit, even if they are correct (I've re-checked them), have potentially corrupted the Pacemaker configuration? No. The CIB automatically recovers from this by doing a full sync. The

Re: [Linux-HA] Antw: Re: ocf:heartbeat:exportfs multiple exports, fsid, wait_for_leasetime_on_stop

2012-06-19 Thread Lars Marowsky-Bree
On 2012-06-19T14:13:06, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: The problem is: What to do if 1 out of n exports fails: Is the resource started or stopped then. Likewise for unexporting and monitoring. If the operation partially failed, it is failed. Regards, Lars --

Re: [Linux-ha-dev] [rfc] SBD with Pacemaker/Quorum integration

2012-06-15 Thread Lars Marowsky-Bree
On 2012-05-25T17:31:52, Florian Haas flor...@hastexo.com wrote: Um, right now I have no opinion. Your commit messages are pretty terse, and there's no README in the repo. Mind adding one? FWIW, there is now a manual page as well. That might help with understanding what it is supposed to do.

Re: [Linux-HA] Does globally-unique make sense on filesystems cloned resources?

2012-06-06 Thread Lars Marowsky-Bree
On 2012-06-06T17:26:41, RaSca ra...@miamammausalinux.org wrote: Thank you Florian, but how can one declare an anonymous clone? Is it implicit with the globally-unique=false? You don't need to explicitly declare that. It is the default. (But yes, the default is globally-unique=false.)
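So a plain clone definition is already anonymous; spelling out the default changes nothing (names are illustrative):

  clone cl_fs p_fs meta interleave=true
  # equivalent to:
  clone cl_fs p_fs meta globally-unique=false interleave=true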

Re: [Linux-HA] Question about stacks .

2012-06-04 Thread Lars Marowsky-Bree
On 2012-06-01T13:10:17, alain.mou...@bull.net wrote: -does that mean that it will be this Pacemaker/cman on RH and SLES? -or will RH and SLES require a different stack under Pacemaker? Right now, SLE HA is on the plugin version of pacemaker, and SLE HA 11 will likely remain on it - that's

Re: [Linux-ha-dev] sbd spinoff from cluster-glue

2012-06-01 Thread Lars Marowsky-Bree
On 2012-06-01T16:16:20, Florian Haas flor...@hastexo.com wrote: Dejan, Lars, is it confirmed from your end that sbd is moving out of cluster-glue? If so, it would be nice if we could get an cluster-glue release with sbd removed, and a release of standalone sbd, so packagers can fix the

Re: [Linux-ha-dev] [rfc] SBD with Pacemaker/Quorum integration

2012-05-29 Thread Lars Marowsky-Bree
On 2012-05-29T08:39:06, Florian Haas flor...@hastexo.com wrote: Should be packageable on every platform, though I admit that I've not tried building the pacemaker module against anything but the corosync+pacemaker+openais stuff we ship on SLE HA 11 so far. Are you expecting this to build

Re: [Linux-ha-dev] [PATCH 0 of 2] Autotoolize build

2012-05-29 Thread Lars Marowsky-Bree
On 2012-05-29T14:31:20, Florian Haas flor...@hastexo.com wrote: Forgot to mention this in the original cover message, for those who haven't been following the discussion: this is for sbd which is just spinning off from cluster-glue. Thanks, I've merged them both! Regards, Lars --

Re: [Linux-ha-dev] [PATCH 0 of 2] Autotoolize build

2012-05-29 Thread Lars Marowsky-Bree
On 2012-05-29T17:56:59, Florian Haas flor...@hastexo.com wrote: In case you're wondering why I didn't use PKG_CHECK_MODULES for the PE libraries: their pkg-config file is currently broken; Andrew has a pull request for Pacemaker for that. I was wondering more about how to build this against

Re: [Linux-ha-dev] [PATCH 0 of 2] Autotoolize build

2012-05-29 Thread Lars Marowsky-Bree
On 2012-05-29T18:34:15, Florian Haas flor...@hastexo.com wrote: Yeah, it seems you just broke the build by including cluster/stack.h and not bothering to add an AC_CHECK_HEADERS to configure.ac. Where does that come from, is that new to Pacemaker? Uh? It builds here on the 1.1.7 pacemaker

Re: [Linux-ha-dev] [PATCH 0 of 2] Autotoolize build

2012-05-29 Thread Lars Marowsky-Bree
On 2012-05-29T18:57:30, Florian Haas flor...@hastexo.com wrote: The integration with the cluster stack is rather specific to whatever pacemaker/corosync version + configuration you build against. Unfortunately. Well that's what #ifdef HAVE_CLUSTER_STACK_H and friends are good for, no? I

Re: [Linux-ha-dev] [rfc] SBD with Pacemaker/Quorum integration

2012-05-25 Thread Lars Marowsky-Bree
On 2012-05-25T17:31:52, Florian Haas flor...@hastexo.com wrote: That aside, what do you think of the idea/approach? Um, right now I have no opinion. Your commit messages are pretty terse, and there's no README in the repo. Mind adding one? Good point. I wasn't aware the commit messages were

Re: [Linux-ha-dev] [rfc] SBD with Pacemaker/Quorum integration

2012-05-25 Thread Lars Marowsky-Bree
On 2012-05-25T21:44:25, Florian Haas flor...@hastexo.com wrote: If so, the master thread will not self-fence even if the majority of devices is currently unavailable. That's it, nothing more. Does that help? It does. One naive question: what's the rationale of tying in with

Re: [Linux-ha-dev] [rfc] SBD with Pacemaker/Quorum integration

2012-05-24 Thread Lars Marowsky-Bree
On 2012-05-24T14:34:59, Florian Haas flor...@hastexo.com wrote: To give you a glance of the extended sbd code, you can check out http://hg.linux-ha.org/sbd - the new Pacemaker integration is activated using the -P option in /etc/sysconfig/sbd, otherwise sbd remains a drop-in replacement

Re: [Linux-HA] Can /var/lib/pengine files be deleted at boot?

2012-05-16 Thread Lars Marowsky-Bree
On 2012-05-15T13:17:11, William Seligman selig...@nevis.columbia.edu wrote: I can post details and logs and whatnot, but I don't think I need to do detailed debugging. My question is: I don't think your rationale holds true, though. Like Andrew said, this is only ever just written, not read.

Re: [Linux-ha-dev] [PATCH] Filesystem RA: remove a status file only when OCF_CHECK_LEVEL is set as 20

2012-05-08 Thread Lars Marowsky-Bree
On 2012-05-08T12:08:27, Dejan Muhamedagic de...@suse.de wrote: In the default case (without OCF_CHECK_LEVEL), it's enough to try to unmount the file system, isn't it? https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem#L774 I don't see a need to remove the STATUSFILE
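For reference, the read-write check that creates the status file is only run when the monitor carries OCF_CHECK_LEVEL=20, e.g. (a sketch, made-up names):

  primitive p_fs ocf:heartbeat:Filesystem \
      params device="/dev/vg1/lv_data" directory="/srv/data" fstype="ext3" \
      op monitor interval=60s timeout=60s OCF_CHECK_LEVEL=20

Without that setting, the default monitor only checks that the filesystem is mounted.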

Re: [Linux-ha-dev] [PATCH v2] resource-agents: add Linux proxy arp resource agent

2012-04-04 Thread Lars Marowsky-Bree
On 2012-04-04T01:52:12, Christian Franke nob...@nowhere.ws wrote: Hello Florian, Your question is fully justified - I sincerely apologize for ignoring that comprehensive documentation. I rewrote the patch trying to adhere to the requirements given in the documentation. Hi Christian,

Re: [Linux-HA] Cluster node hanging upon access to ocfs2 fs when second cluster node dies ?

2012-04-04 Thread Lars Marowsky-Bree
On 2012-04-04T11:28:31, Rainer Krienke krie...@uni-koblenz.de wrote: There is one basic thing however I do not understand: My setup involves only a clustered filesystem. What I do not understand is why a stonith resource is needed at all in this case which after all causes freezes of the

Re: [Linux-HA] Cluster node hanging upon access to ocfs2 fs when second cluster node dies ?

2012-04-03 Thread Lars Marowsky-Bree
On 2012-04-03T10:32:48, Rainer Krienke krie...@uni-koblenz.de wrote: Hi Rainer, I am new to HA setup and my first try was to set up a HA cluster (using SLES 11 SP2 and the SLES11 SP2 HA extension) that simply offers an OCFS2 filesystem. I did the setup according to the SLES 11 SP2 HA

Re: [Linux-HA] Cluster node hanging upon access to ocfs2 fs when second cluster node dies ?

2012-04-03 Thread Lars Marowsky-Bree
On 2012-04-03T15:50:29, Rainer Krienke krie...@uni-koblenz.de wrote: rzinstal4:~ # sbd -d /dev/disk/by-id/scsi-259316a7265713551-part1 dump ==Dumping header on disk /dev/disk/by-id/scsi-259316a7265713551-part1 Header version : 2 Number of slots: 255 Sector size: 512 Timeout

Re: [Linux-HA] Cluster node hanging upon access to ocfs2 fs when second cluster node dies ?

2012-04-03 Thread Lars Marowsky-Bree
On 2012-04-03T15:59:00, Rainer Krienke krie...@uni-koblenz.de wrote: Hi Lars, this was something I detected already. And I changed the timeout in the cluster configuration to 200sec. So the log I posted was the result of the configuration below (200sec). Is this still too small? $ crm
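The rule of thumb being applied here: the cluster's stonith-timeout must comfortably exceed sbd's msgwait. A sketch with example values (the device path is a placeholder):

  sbd -d /dev/disk/by-id/my-sbd-partition -4 120 -1 60 create   # msgwait 120s, watchdog 60s
  crm configure property stonith-timeout=144s                   # well above msgwait

The timeouts live in the device header, so changing them means re-creating it.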

Re: [Linux-HA] ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported

2012-03-29 Thread Lars Marowsky-Bree
On 2012-03-29T11:31:38, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: pengine: [17043]: WARN: pe_fence_node: Node h07 will be fenced because it is un-expectedly down The software being used is basically SLES11 SP1 with a newer corosync (corosync-1.4.1-0.3.3.3518.1.PTF.712037). Were

Re: [Linux-ha-dev] Patch: pgsql streaming replication

2012-03-19 Thread Lars Marowsky-Bree
On 2012-03-19T11:09:16, Dejan Muhamedagic de...@suse.de wrote: --- a/heartbeat/pgsql +++ b/heartbeat/pgsql @@ -1,12 +1,13 @@ -#!/bin/sh +#!/bin/bash Our policy is not to change shell. Is that absolutely necessary? He sends in many patches. bash is a 1MB install. I can't believe that

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread Lars Marowsky-Bree
On 2012-03-15T15:59:21, William Seligman selig...@nevis.columbia.edu wrote: Could this be an issue? I've noticed that my fencing agent always seems to be called with action=reboot when a node is fenced. Why is it using 'reboot' and not 'off'? Is this the standard, or am I missing a

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-14 Thread Lars Marowsky-Bree
On 2012-03-14T18:22:42, William Seligman selig...@nevis.columbia.edu wrote: Now consider a primary-primary cluster. Both run the same resource. One fails. There's no failover here; the other box still runs the resource. In my case, the only thing that has to work is cloned cluster IP

Re: [Linux-HA] Antw: Re: FW: How DC is selected?

2012-02-06 Thread Lars Marowsky-Bree
On 2012-02-06T09:05:13, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: but like with CPU affinity there should be no needless change of the DC. I also wondered why after each configuration change the DC is newly elected (it seems). It isn't (or shouldn't be). Still, the DC election

Re: [Linux-HA] Antw: Re: FW: How DC is selected?

2012-02-06 Thread Lars Marowsky-Bree
On 2012-02-06T22:13:20, Mayank mayank.mittal.1...@hotmail.com wrote: rsc_colocation id=pgsql_vip_colocation rsc=virtua_ip score=INFINITY with-rsc=pgsql9 with-rsc-role=Master/ The intention behind defining such constraints is to make sure that the postgre should always run in the master role
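In crm shell syntax, expressing that intention usually takes both a colocation on the Master role and an ordering on promote, roughly (constraint IDs are made up, resource names taken from the quoted snippet):

  colocation pgsql_vip_colocation inf: virtua_ip pgsql9:Master
  order pgsql_vip_order inf: pgsql9:promote virtua_ip:start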

Re: [Linux-HA] Antw: Re: Q: IPC Channel to 9858 is not connected

2011-12-08 Thread Lars Marowsky-Bree
On 2011-12-08T12:08:06, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Dejan Muhamedagic deja...@fastmail.fm wrote on 08.12.2011 at 11:28 in message 20111208102833.GA12338@walrus.homenet: Hi, On Wed, Dec 07, 2011 at 02:26:52PM +0100, Ulrich Windl wrote: Hi! While

Re: [Linux-HA] Light Weight Quorum Arbitration

2011-12-06 Thread Lars Marowsky-Bree
On 2011-12-04T00:57:05, Andreas Kurz andr...@hastexo.com wrote: the concept of an arbitrator for split-site clusters is already implemented and should be available with Pacemaker 1.1.6, though it seems not to be directly documented ... besides source code and this draft document: Documentation

Re: [Linux-HA] disconnecting network of any node cause both nodes fenced

2011-12-06 Thread Lars Marowsky-Bree
On 2011-12-05T22:37:03, Andreas Kurz andr...@hastexo.com wrote: Did you clone the sbd resource? If yes, don't do that. Start it as a primitive, so in case of a split brain at least one node needs to start the stonith resource which should give the other node an advantage ... adding a
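A single (non-cloned) sbd stonith primitive looks roughly like this (the device path is a placeholder):

  primitive stonith-sbd stonith:external/sbd \
      params sbd_device="/dev/disk/by-id/my-sbd-partition" \
      op monitor interval=15s timeout=60s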

Re: [Linux-HA] Antw: Re: Q: cib-last-written

2011-12-03 Thread Lars Marowsky-Bree
On 2011-12-01T13:48:56, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: I wonder about the usefulness of that value, especially as any configuration change seems to increase the epoch anyway. I never saw that CRM cares about the cib-last-written string. It is for easy inspection by

Re: [Linux-ha-dev] [PATCH 2/2] Medium: LVM: force dmevent monitoring for clones

2011-11-30 Thread Lars Marowsky-Bree
On 2011-11-28T21:14:22, Florian Haas flor...@hastexo.com wrote: Seems to make sense. Of course, an alternative would be to add a Conflicts: lvm2 x.y.z to the package on the respective versions to make sure it's only installed with a fixed lvm2 package ...? Surely you're joking.

Re: [Linux-HA] Pacemaker : how to modify configuration ?

2011-11-29 Thread Lars Marowsky-Bree
On 2011-11-28T15:04:45, alain.mou...@bull.net wrote: sorry but I forgot if there is another way than crm configure edit to modify the value of on-fail= for all resources in the configuration? If they're explicitly set, you have to modify them all. Otherwise, look at op_defaults or
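If the operations don't set it explicitly, a cluster-wide default can be kept in op_defaults instead, e.g. (a sketch; whether on-fail is honoured there depends on the Pacemaker version):

  crm configure op_defaults on-fail=restart timeout=60s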

Re: [Linux-HA] Antw: Re: Q: unmanaged MD-RAID auto-recovery

2011-11-29 Thread Lars Marowsky-Bree
On 2011-11-29T08:33:01, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: The state of an unmanaged resource is the state when it left the managed meta-state. That is not correct. An unmanaged resource is not *managed*, but its state is still relevant to other resources that possibly

Re: [Linux-HA] is it good to create order constraint for sbd resource

2011-11-29 Thread Lars Marowsky-Bree
On 2011-11-29T22:10:10, Andreas Kurz andr...@hastexo.com wrote: IIRC stonith resources are always started first and stopped last anyways ... without extra constraints ... implicitly. Please someone correct me if I'm wrong. Yes, but they are not mandatory. The configuration that was discussed

Re: [Linux-HA] Antw: Re: Q: unmanaged MD-RAID auto-recovery

2011-11-29 Thread Lars Marowsky-Bree
On 2011-11-29T12:36:39, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote: If you repeatedly try to re-sync with a dying disk, with each resync interrupted by i/o error, you will get data corruption sooner or later. No, you shouldn't. (Unless the drive returns faulty data on read, which is actually a
