[ClusterLabs] CIB: op-status=4 ?
Hi,

I have a question regarding the 'op-status' attribute getting the value 4.

In my case I see strange behavior: resources get "monitor" operation entries in the CIB with op-status=4, and the operations do not seem to be called (exec-time=0).

What does 'op-status' = 4 mean?

I would appreciate some elaboration on this, since pacemaker interprets it as an error, which causes logs like:

crm_mon:error: unpack_rsc_op:Preventing dbx_head_head from re-starting anywhere: operation monitor failed 'not configured' (6)

and I am pretty sure the resource agent was not called (no logs, exec-time=0).

There are two aspects of this:

1) harmless (pacemaker does not seem to bother about it), which I guess indicates cancelled monitoring operations: op-status=4, rc-code=189
* Example:

2) error-level (op-status=4, rc-code=6), which generates logs:
crm_mon:error: unpack_rsc_op:Preventing dbx_head_head from re-starting anywhere: operation monitor failed 'not configured' (6)
* Example:

Could it be some hardware (VM hypervisor) issue?

Thanks in advance,

--
Best Regards,
Radoslaw Garbacz
XtremeData Incorporated
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
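For anyone trying to find such entries in their own CIB, here is a minimal sketch of how the operation history could be filtered for op-status=4 combined with exec-time=0. It only uses the attribute names mentioned in the message above; the XML fragment, the operation ids and the function name suspicious_ops are made up for illustration, and the sketch does not try to explain what the status value means.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIB status fragment. The resource name comes from the log
# line quoted in the message; every other value is invented for illustration.
SAMPLE = """
<lrm_resource id="dbx_head_head">
  <lrm_rsc_op id="dbx_head_head_monitor_10000" operation="monitor"
              op-status="4" rc-code="189" exec-time="0"/>
  <lrm_rsc_op id="dbx_head_head_monitor_0" operation="monitor"
              op-status="4" rc-code="6" exec-time="0"/>
  <lrm_rsc_op id="dbx_head_head_start_0" operation="start"
              op-status="0" rc-code="0" exec-time="1200"/>
</lrm_resource>
"""


def suspicious_ops(xml_text):
    """Return (id, rc-code) for every op with op-status=4 and exec-time=0."""
    root = ET.fromstring(xml_text)
    hits = []
    for op in root.iter("lrm_rsc_op"):
        if op.get("op-status") == "4" and op.get("exec-time") == "0":
            hits.append((op.get("id"), int(op.get("rc-code"))))
    return hits


if __name__ == "__main__":
    for op_id, rc in suspicious_ops(SAMPLE):
        print(op_id, rc)
```

On a live cluster the same function could presumably be fed the output of `cibadmin --query` instead of the sample string.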
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
On 2017-05-17 06:24, Lentes, Bernd wrote:

...
> I'd like to know what the software I use is doing. Am i the only one having
> that opinion ?

No.

> How do you solve the problem of a deathmatch or killing the wrong node ?

*I* live dangerously with fencing disabled. But then my clusters only really go down for maintenance reboots, and I usually do those when I'm at work and can walk into the server room and push the power button when it comes to that.

(More accurately, the one cluster that goes down. The others fail over without any problems.)

Dima
Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate
On 05/17/2017 04:56 AM, Klaus Wenninger wrote:
> On 05/17/2017 11:28 AM, 井上 和徳 wrote:
>> Hi,
>> I'm testing Pacemaker-1.1.17-rc1.
>> The number of failures in "Too many failures (10) to fence" log does not
>> match the number of actual failures.
>
> Well it kind of does as after 10 failures it doesn't try fencing again
> so that is what failures stay at ;-)
> Of course it still sees the need to fence but doesn't actually try.
>
> Regards,
> Klaus

This feature can be a little confusing: it doesn't prevent all further fence attempts of the target, just *immediate* fence attempts. Whenever the next transition is started for some other reason (a configuration or state change, cluster-recheck-interval, node failure, etc.), it will try to fence again.

Also, it only checks this threshold if it's aborting a transition *because* of this fence failure. If it's aborting the transition for some other reason, the number can go higher than the threshold. That's what I'm guessing happened here.

>> After the 11th time fence failure, "Too many failures (10) to fence" is
>> output.
>> Incidentally, stonith-max-attempts has not been set, so it is 10 by default.
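The behaviour described in this reply can be illustrated with a small toy model. This is only a sketch of the explanation given in the thread, not Pacemaker's actual code: the function name, the "fence-failure" / "other" labels and the use of ">=" for the comparison are assumptions made for the example.

```python
def failures_before_giving_up(abort_reasons, max_attempts=10):
    """Replay fence failures and report when the threshold check triggers.

    abort_reasons[i] says why the transition was aborted after failure i+1:
    "fence-failure" means it was aborted because of that fence failure,
    anything else stands for some unrelated event.
    Returns the failure count at which the "too many failures" condition is
    hit, or None if it never is.
    """
    failures = 0
    for reason in abort_reasons:
        failures += 1
        # Assumption: the threshold is only consulted when the abort was
        # caused by the fence failure itself.
        if reason == "fence-failure" and failures >= max_attempts:
            return failures
    return None


if __name__ == "__main__":
    # Every abort is caused by the fence failure.
    print(failures_before_giving_up(["fence-failure"] * 12))
    # The 10th abort happens for another reason, so the check is skipped once.
    mixed = ["fence-failure"] * 9 + ["other"] + ["fence-failure"] * 2
    print(failures_before_giving_up(mixed))
```

In this model a single abort for an unrelated reason is enough for the failure count to pass the threshold before the message is logged, which is consistent with the guess made above.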
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
On 05/17/2017 03:33 PM, Lentes, Bernd wrote:
>
> - On May 17, 2017, at 2:58 PM, Klaus Wenninger kwenn...@redhat.com wrote:
>
>>> I don't see that.
>> fence_* are the RHCS-style fence-agents coming mainly from
>> https://github.com/ClusterLabs/fence-agents.
>>
> Ah. Ok, i see that.
>
> Do you know if they cooperate with a SuSE HAE ? I found rpm's for SLES for
> the fence agents.

There is no conditional compilation around support for RHCS fence agents, so I guess there won't be a technical issue. The question is just the degree of support you will get / want ...
But there are probably others than me who can give you a more satisfactory answer.

Regards,
Klaus

>
> Bernd
>
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
> Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons Enhsen
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
- On May 17, 2017, at 2:11 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:

> 08.05.2017 22:20, Lentes, Bernd wrote:
>> Hi,
>>
>> i remember that digimer often campaigns for a fence delay in a 2-node
>> cluster.
>> E.g. here:
>> http://oss.clusterlabs.org/pipermail/pacemaker/2013-July/019228.html
>> In my eyes it makes sense, so i try to establish that. I have two HP servers,
>> each with an ILO card.
>> I have to use the stonith:external/ipmi agent, the stonith:external/riloe
>> refused to work.
>>
>> But i don't have a delay parameter there.
>> crm ra info stonith:external/ipmi:
>
> Hi,
>
> There is another ipmi fence agent - fence_ipmilan (part of fence-agents
> package). It has 'delay' parameter.
>

I don't see that.

crm(live)# ra info stonith:ipmilan
IPMI Over LAN (stonith:ipmilan)

IPMI LAN STONITH device

Parameters (*: required, []: default):

hostname* (string):
    The hostname of the STONITH device

ipaddr* (string): IP Address
    The IP address of the STONITH device

port* (string):
    The port number to where the IPMI message is sent

auth* (string):
    The authorization type of the IPMI session ("none", "straight", "md2", or "md5")

priv* (string):
    The privilege level of the user ("operator" or "admin")

login* (string): Login
    The username used for logging in to the STONITH device

password* (string): Password
    The password used for logging in to the STONITH device

priority (integer, [0]):
    The priority of the stonith resource. Devices are tried in order of highest priority to lowest.

pcmk_host_argument (string, [port]): Advanced use only: An alternate parameter to supply instead of 'port'
    Some devices do not support the standard 'port' parameter or may provide additional ones.
    Use this to specify an alternate, device-specific, parameter that should indicate the machine to be fenced.
    A value of 'none' can be used to tell the cluster not to supply any additional parameters.

pcmk_host_map (string): A mapping of host names to ports numbers for devices that do not support host names.
    Eg. node1:1;node2:2,3 would tell the cluster to use port 1 for node1 and ports 2 and 3 for node2

pcmk_host_list (string): A list of machines controlled by this device (Optional unless pcmk_host_check=static-list).

pcmk_host_check (string, [dynamic-list]): How to determine which machines are controlled by the device.
    Allowed values: dynamic-list (query the device), static-list (check the pcmk_host_list attribute), none (assume every device can fence every machine)
...

There is no delay parameter, and all the pcmk_*** parameters are the ones from stonithd, which does not have a dedicated delay parameter, just the pcmk_delay_max parameter, which is not fixed but random.

Do you have another ipmilan RA ? I have SLES 11 SP4 boxes, maybe my RA is not recent enough ?

Bernd
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
08.05.2017 22:20, Lentes, Bernd wrote:
> Hi,
>
> i remember that digimer often campaigns for a fence delay in a 2-node
> cluster.
> E.g. here:
> http://oss.clusterlabs.org/pipermail/pacemaker/2013-July/019228.html
> In my eyes it makes sense, so i try to establish that. I have two HP servers,
> each with an ILO card.
> I have to use the stonith:external/ipmi agent, the stonith:external/riloe
> refused to work.
>
> But i don't have a delay parameter there.
> crm ra info stonith:external/ipmi:

Hi,

There is another ipmi fence agent - fence_ipmilan (part of fence-agents package). It has 'delay' parameter.

> ...
> pcmk_delay_max (time, [0s]): Enable random delay for stonith actions and
> specify the maximum of random delay
>     This prevents double fencing when using slow devices such as sbd.
>     Use this to enable random delay for stonith actions and specify the
>     maximum of random delay.
> ...
>
> This is the only delay parameter i can use. But a random delay does not seem
> to be a reliable solution.
>
> The stonith:ipmilan agent also provides just a random delay. Same with the
> riloe agent.
>
> How did anyone solve this problem ?
>
> Or do i have to edit the RA (I will get practice in that :-))?
>
> Bernd
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
- On May 10, 2017, at 9:15 PM, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote:

> On 05/10/2017 01:54 PM, Ken Gaillot wrote:
>> On 05/10/2017 12:26 PM, Dimitri Maziuk wrote:
>
>>> - fencing in 2-node clusters does not work reliably without fixed delay
>>
>> Not quite. Fixed delay allows a particular method for avoiding a death
>> match in a two-node cluster. Pacemaker's built-in random delay
>> capability is another method.
>
> Deathmatch is one problem, killing the wrong node (2 nodes, no quorum)
> is another. Fixed delay is digimer's attempt to alleviate the latter,
> so... apples and fruits not entirely unlike apples.
>
> --

Hi,

so what should i do ? Using pcmk_delay_max does not seem to be really reliable. I don't like the idea of depending on software that thinks "which delay should i choose, depending on the ... weather conditions, any mood ...". I'd like to know what the software I use is doing. Am i the only one having that opinion ?

How do you solve the problem of a deathmatch or killing the wrong node ?

Bernd
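The difference between the two delay styles discussed in this thread can be shown with a toy model of a two-node fence race. Everything here is an assumption made for illustration: the node names, the 15 second value, and the simplification that the node whose fence request fires first always wins. It is a sketch of the idea, not of how any fence agent is implemented.

```python
import random


def survivor(delay_a, delay_b):
    """Return the node that shoots first; equal delays kill both (None)."""
    if delay_a == delay_b:
        return None
    return "node-a" if delay_a < delay_b else "node-b"


def fixed_delay_race(delay_s=15.0):
    # Assumption: the fixed delay sits on the device that fences node-a,
    # so node-b has to wait before shooting and node-a always wins.
    return survivor(0.0, delay_s)


def random_delay_race(max_delay_s, rng):
    # Assumption: each side independently draws a delay up to the maximum.
    return survivor(rng.uniform(0.0, max_delay_s), rng.uniform(0.0, max_delay_s))


if __name__ == "__main__":
    rng = random.Random(0)
    print("no delay:    ", survivor(0.0, 0.0))
    print("fixed delay: ", {fixed_delay_race() for _ in range(1000)})
    print("random delay:", {random_delay_race(15.0, rng) for _ in range(1000)})
```

In this model both styles avoid the tie that stands for a death match, but only the fixed delay makes the surviving node predictable, which is the property being asked for above.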
Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate
On 05/17/2017 11:28 AM, 井上 和徳 wrote:
> Hi,
> I'm testing Pacemaker-1.1.17-rc1.
> The number of failures in "Too many failures (10) to fence" log does not
> match the number of actual failures.

Well it kind of does as after 10 failures it doesn't try fencing again so that is what failures stay at ;-)
Of course it still sees the need to fence but doesn't actually try.

Regards,
Klaus

>
> After the 11th time fence failure, "Too many failures (10) to fence" is
> output.
> Incidentally, stonith-max-attempts has not been set, so it is 10 by default.
>
> [root@x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith failed|Too many failures"
> ##Requesting fencing : 1st time
> May 12 05:51:47 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 05:52:52 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.8415167d: No data available
> May 12 05:52:52 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 2nd time
> May 12 05:52:52 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 05:53:56 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.53d3592a: No data available
> May 12 05:53:56 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 3rd time
> May 12 05:53:56 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 05:55:01 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.9177cb76: No data available
> May 12 05:55:01 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 4th time
> May 12 05:55:01 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 05:56:05 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.946531cb: No data available
> May 12 05:56:05 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 5th time
> May 12 05:56:05 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 05:57:10 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.278b3c4b: No data available
> May 12 05:57:10 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 6th time
> May 12 05:57:10 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 05:58:14 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.7a49aebb: No data available
> May 12 05:58:14 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 7th time
> May 12 05:58:14 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 05:59:19 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.83421862: No data available
> May 12 05:59:19 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 8th time
> May 12 05:59:19 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 06:00:24 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.afd7ef98: No data available
> May 12 06:00:24 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 9th time
> May 12 06:00:24 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 06:01:28 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.3b033dbe: No data available
> May 12 06:01:28 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 10th time
> May 12 06:01:28 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 06:02:33 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.5447a345: No data available
> May 12 06:02:33 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 11th time
> May 12 06:02:33 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 06:03:37 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.db50c21a: No data available
> May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence rhel73-2, giving up
> May 12 06:03:37 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>
> Regards,
> Kazunori INOUE
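The manual "1st time ... 11th time" counting in the quoted log could also be done with a short script. The following is a minimal sketch that tallies the same message types the egrep above selects; the regular expressions are written against the log lines quoted in this thread and the names PATTERNS and summarize are made up for the example, so other Pacemaker versions may need different patterns.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this thread.
PATTERNS = {
    "requested": re.compile(r"Requesting fencing \(reboot\) of node (\S+)"),
    "failed": re.compile(r"error: Operation reboot of (\S+) by"),
    "gave_up": re.compile(r"Too many failures \((\d+)\) to fence (\S+),"),
}

# The last four log lines from the message above, joined onto single lines.
SAMPLE_LINES = [
    "May 12 06:02:33 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2",
    "May 12 06:03:37 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.db50c21a: No data available",
    "May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence rhel73-2, giving up",
    "May 12 06:03:37 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed",
]


def summarize(lines):
    """Count fence requests, fence failures and give-up messages."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    for name, count in sorted(summarize(SAMPLE_LINES).items()):
        print(name, count)
```

Running it over the full log file rather than the sample would give the number of failed attempts to compare against the value printed in the "Too many failures" message.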