[ClusterLabs] CIB: op-status=4 ?

2017-05-17 Thread Radoslaw Garbacz
Hi,

I have a question regarding the 'op-status' attribute getting the value 4.

In my case I see strange behavior: resources get "monitor" operation
entries in the CIB with op-status=4, and the operations do not seem to
have been called (exec-time=0).

What does 'op-status' = 4 mean?

I would appreciate some elaboration regarding this, since this is
interpreted by pacemaker as an error, which causes logs:
crm_mon:error: unpack_rsc_op:Preventing dbx_head_head from
re-starting anywhere: operation monitor failed 'not configured' (6)

and I am pretty sure the resource agent was not called (no logs,
exec-time=0)

There are two aspects of this:

1) harmless (pacemaker seems to not bother about it), which I guess
indicates cancelled monitoring operations:
op-status=4, rc-code=189

* Example:



2) error level one (op-status=4, rc-code=6), which generates logs:
crm_mon:error: unpack_rsc_op:Preventing dbx_head_head from
re-starting anywhere: operation monitor failed 'not configured' (6)

* Example:



Could it be some hardware (VM hypervisor) issue?
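
For anyone trying to reproduce this, here is a minimal sketch of how such entries could be pulled out of the CIB status section. The XML fragment is made up for illustration (the operation ids and the exec-time of the start operation are hypothetical); only the attribute names op-status, rc-code and exec-time and the two value combinations come from the description above. On a live cluster the input would have to come from a CIB dump, which is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical status fragment; ids and values are illustrative only.
SAMPLE_STATUS = """
<lrm_resource id="dbx_head_head">
  <lrm_rsc_op id="dbx_head_head_monitor_10000" operation="monitor"
              op-status="4" rc-code="6" exec-time="0"/>
  <lrm_rsc_op id="dbx_head_head_start_0" operation="start"
              op-status="0" rc-code="0" exec-time="35"/>
  <lrm_rsc_op id="dbx_head_head_monitor_0" operation="monitor"
              op-status="4" rc-code="189" exec-time="0"/>
</lrm_resource>
"""


def find_ops(xml_text, op_status="4"):
    """Return (id, operation, rc-code, exec-time) for ops with the given op-status."""
    root = ET.fromstring(xml_text)
    hits = []
    # iter() walks the element tree in document order.
    for op in root.iter("lrm_rsc_op"):
        if op.get("op-status") == op_status:
            hits.append((op.get("id"), op.get("operation"),
                         op.get("rc-code"), op.get("exec-time")))
    return hits


if __name__ == "__main__":
    for op_id, operation, rc, exec_time in find_ops(SAMPLE_STATUS):
        print(f"{op_id}: {operation} rc-code={rc} exec-time={exec_time}")
```

Listing the entries this way makes it easy to separate the rc-code=189 cases from the rc-code=6 ones and to confirm that exec-time is 0 in both.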


Thanks in advance,

-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-17 Thread Dmitri Maziuk

On 2017-05-17 06:24, Lentes, Bernd wrote:

> ...
> I'd like to know what the software I use is doing. Am I the only one having
> that opinion?

No.

> How do you solve the problem of a deathmatch or killing the wrong node?

*I* live dangerously with fencing disabled. But then my clusters only 
really go down for maintenance reboots, and I usually do those when I'm 
at work and can walk into the server room and push the power button when 
it comes to that.


(More accurately the one cluster that goes down. The others fail over 
without any problems.)


Dima





Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate

2017-05-17 Thread Ken Gaillot
On 05/17/2017 04:56 AM, Klaus Wenninger wrote:
> On 05/17/2017 11:28 AM, 井上 和徳 wrote:
>> Hi,
>> I'm testing Pacemaker-1.1.17-rc1.
>> The number of failures in "Too many failures (10) to fence" log does not 
>> match the number of actual failures.
> 
> Well, it kind of does: after 10 failures it doesn't try fencing again,
> so that is where the failure count stays ;-)
> Of course it still sees the need to fence, but doesn't actually try.
> 
> Regards,
> Klaus

This feature can be a little confusing: it doesn't prevent all further
fence attempts of the target, just *immediate* fence attempts. Whenever
the next transition is started for some other reason (a configuration or
state change, cluster-recheck-interval, node failure, etc.), it will try
to fence again.

Also, it only checks this threshold if it's aborting a transition
*because* of this fence failure. If it's aborting the transition for
some other reason, the number can go higher than the threshold. That's
what I'm guessing happened here.
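
To see how far the number in the warning drifts from the attempts actually logged, something like the following could be run over the log. This is only a rough sketch: the regular expressions are guesses based on the lines quoted in this thread, the function and key names are made up, and the sample is just the last attempt from the posted log with the line wraps removed.

```python
import re

# A few lines copied from the log in this thread (line wraps removed).
SAMPLE_LOG = """\
May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
May 12 06:03:37 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.db50c21a: No data available
May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence rhel73-2, giving up
May 12 06:03:37 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
"""

# Assumed message formats, derived only from the excerpt above.
PATTERNS = {
    "requests": re.compile(r"Requesting fencing \(reboot\) of node (\S+)"),
    "failures": re.compile(r"error: Operation reboot of (\S+) by "),
    "gave_up": re.compile(r"Too many failures \((\d+)\) to fence (\S+),"),
}


def summarize(log_text):
    """Count fence requests, failed operations and 'giving up' warnings.

    Returns (counts, reported), where reported is the number printed in
    the last 'Too many failures (N)' warning, or None if there was none.
    """
    counts = {"requests": 0, "failures": 0, "gave_up": 0}
    reported = None
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                if name == "gave_up":
                    reported = int(match.group(1))
    return counts, reported


if __name__ == "__main__":
    counts, reported = summarize(SAMPLE_LOG)
    print(counts, "reported threshold:", reported)
```

Comparing the "failures" count with the reported number over the whole log shows whether the threshold was checked on every failure or only on some of them.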

>> After the 11th fence failure, "Too many failures (10) to fence" is output.
>> Incidentally, stonith-max-attempts has not been set, so it is 10 by default.

Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-17 Thread Klaus Wenninger
On 05/17/2017 03:33 PM, Lentes, Bernd wrote:
>
> - On May 17, 2017, at 2:58 PM, Klaus Wenninger kwenn...@redhat.com wrote:
>
>
>>> I don't see that.
>> fence_* are the RHCS-style fence-agents coming mainly from
>> https://github.com/ClusterLabs/fence-agents.
>>
> Ah. Ok, i see that.
>
> Do you know if they cooperate with a SuSE HAE ? I found rpm's for SLES for 
> the fence agents.

There is no conditional compilation around the support for RHCS-style
fence agents, so I don't expect a technical issue.
The question is just what degree of support you will get or want.
But there are probably others who can give you a more satisfactory
answer than I can.

Regards,
Klaus

>
> Bernd
>  
>
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
> Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons 
> Enhsen
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
>




Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-17 Thread Lentes, Bernd


- On May 17, 2017, at 2:11 PM, Vladislav Bogdanov bub...@hoster-ok.com 
wrote:

> 08.05.2017 22:20, Lentes, Bernd wrote:
>> Hi,
>>
>> i remember that digimer often campaigns for a fence delay in a 2-node  
>> cluster.
>> E.g. here: 
>> http://oss.clusterlabs.org/pipermail/pacemaker/2013-July/019228.html
>> In my eyes it makes sense, so i try to establish that. I have two HP servers,
>> each with an ILO card.
>> I have to use the stonith:external/ipmi agent, the stonith:external/riloe
>> refused to work.
>>
>> But i don't have a delay parameter there.
>> crm ra info stonith:external/ipmi:
> 
> Hi,
> 
> There is another ipmi fence agent - fence_ipmilan (part of fence-agents
> package). It has 'delay' parameter.
> 
>>

I don't see that.


crm(live)# ra info stonith:ipmilan
IPMI Over LAN (stonith:ipmilan)

IPMI LAN STONITH device

Parameters (*: required, []: default):

hostname* (string):
The hostname of the STONITH device

ipaddr* (string): IP Address
The IP address of the STONITH device

port* (string):
The port number to where the IPMI message is sent

auth* (string):
The authorization type of the IPMI session ("none", "straight", "md2", or 
"md5")

priv* (string):
The privilege level of the user ("operator" or "admin")

login* (string): Login
The username used for logging in to the STONITH device

password* (string): Password
The password used for logging in to the STONITH device

priority (integer, [0]): The priority of the stonith resource. Devices are 
tried in order of highest priority to lowest.
pcmk_host_argument (string, [port]): Advanced use only: An alternate parameter 
to supply instead of 'port'
Some devices do not support the standard 'port' parameter or may provide 
additional ones.
Use this to specify an alternate, device-specific, parameter that should 
indicate the machine to be fenced.
A value of 'none' can be used to tell the cluster not to supply any 
additional parameters.

pcmk_host_map (string): A mapping of host names to ports numbers for devices 
that do not support host names.
Eg. node1:1;node2:2,3 would tell the cluster to use port 1 for node1 and 
ports 2 and 3 for node2

pcmk_host_list (string): A list of machines controlled by this device (Optional 
unless pcmk_host_check=static-list).
pcmk_host_check (string, [dynamic-list]): How to determine which machines are 
controlled by the device.
Allowed values: dynamic-list (query the device), static-list (check the 
pcmk_host_list attribute), none (assume every device can fence every machine)
...
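
The pcmk_host_map syntax shown in the help text can be read as the following small parser. This is a sketch of the documented example only ("node1:1;node2:2,3"); the function name is made up, and the real parser in stonithd may accept more separators than this one does.

```python
def parse_host_map(value):
    """Parse 'node1:1;node2:2,3' into {'node1': ['1'], 'node2': ['2', '3']}."""
    mapping = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing ';'
        host, sep, ports = entry.partition(":")
        if not sep or not host.strip():
            raise ValueError(f"malformed entry: {entry!r}")
        # Ports stay strings: the device decides what a "port" means.
        mapping[host.strip()] = [p.strip() for p in ports.split(",") if p.strip()]
    return mapping


if __name__ == "__main__":
    print(parse_host_map("node1:1;node2:2,3"))
```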


There is no delay parameter. All the pcmk_* parameters come from stonithd,
which has no dedicated fixed-delay parameter, only pcmk_delay_max, and that
one is random rather than fixed. Do you have another ipmilan RA?

I have SLES 11 SP4 boxes; maybe my RA is not recent enough?

Bernd
 





Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-17 Thread Vladislav Bogdanov

08.05.2017 22:20, Lentes, Bernd wrote:
> Hi,
>
> i remember that digimer often campaigns for a fence delay in a 2-node cluster.
> E.g. here: http://oss.clusterlabs.org/pipermail/pacemaker/2013-July/019228.html
> In my eyes it makes sense, so i try to establish that. I have two HP servers,
> each with an ILO card.
> I have to use the stonith:external/ipmi agent, the stonith:external/riloe
> refused to work.
>
> But i don't have a delay parameter there.
> crm ra info stonith:external/ipmi:

Hi,

There is another ipmi fence agent - fence_ipmilan (part of the fence-agents
package). It has a 'delay' parameter.

> ...
> pcmk_delay_max (time, [0s]): Enable random delay for stonith actions and
> specify the maximum of random delay
> This prevents double fencing when using slow devices such as sbd.
> Use this to enable random delay for stonith actions and specify the maximum
> of random delay.
> ...
>
> This is the only delay parameter i can use. But a random delay does not seem to
> be a reliable solution.
>
> The stonith:ipmilan agent also provides just a random delay. Same with the
> riloe agent.
>
> How did anyone solve this problem ?
>
> Or do i have to edit the RA (I will get practice in that :-))?
>
> Bernd







Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-17 Thread Lentes, Bernd


- On May 10, 2017, at 9:15 PM, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote:

> On 05/10/2017 01:54 PM, Ken Gaillot wrote:
>> On 05/10/2017 12:26 PM, Dimitri Maziuk wrote:
> 
>>> - fencing in 2-node clusters does not work reliably without fixed delay
>> 
>> Not quite. Fixed delay allows a particular method for avoiding a death
>> match in a two-node cluster. Pacemaker's built-in random delay
>> capability is another method.
> 
> Deathmatch is one problem, killing the wrong node (2 nodes, no quorum)
> is another. Fixed delay is digimer's attempt to alleviate the latter,
> so... apples and fruits not entirely unlike apples.
> 
> --

Hi,

So what should I do? Using pcmk_delay_max does not seem to be really reliable.
I don't like the idea of depending on software that thinks "which delay
should I choose, depending on the ... weather conditions, any mood ...".
I'd like to know what the software I use is doing. Am I the only one with
that opinion?

How do you solve the problem of a deathmatch or killing the wrong node ?
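
To make the difference between the two approaches discussed above concrete, here is a toy model of a two-node split. It is not Pacemaker's logic: the function names, the 15-second values and the 1-second overlap window are all assumptions chosen for illustration. Each node waits its delay and then shoots the peer; if the delays are closer together than the overlap window, both shots go out.

```python
import random


def race(delay_a, delay_b, overlap=1.0):
    """Return the survivor ('A' or 'B') or 'both shot' for one split.

    delay_a / delay_b: seconds each node waits before shooting its peer.
    overlap: hypothetical window in which both shots still go out.
    """
    if abs(delay_a - delay_b) < overlap:
        return "both shot"
    return "A" if delay_a < delay_b else "B"


def simulate(pick_delays, runs=10000, seed=0):
    """Run many splits and tally the outcomes."""
    rng = random.Random(seed)
    tally = {"A": 0, "B": 0, "both shot": 0}
    for _ in range(runs):
        tally[race(*pick_delays(rng))] += 1
    return tally


def random_delays(rng, delay_max=15.0):
    # Both nodes draw independently from [0, delay_max].
    return rng.uniform(0, delay_max), rng.uniform(0, delay_max)


def fixed_delays(rng, head_start=15.0):
    # A shoots at once, B always waits head_start seconds.
    return 0.0, head_start


if __name__ == "__main__":
    print("random:", simulate(random_delays))
    print("fixed: ", simulate(fixed_delays))
```

In this model the fixed delay always leaves the same node standing, while the random delay mostly avoids a double kill but leaves the survivor to chance. Neither variant says anything about whether the surviving node is the healthy one.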

Bernd
 





Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate

2017-05-17 Thread Klaus Wenninger
On 05/17/2017 11:28 AM, 井上 和徳 wrote:
> Hi,
> I'm testing Pacemaker-1.1.17-rc1.
> The number of failures in "Too many failures (10) to fence" log does not 
> match the number of actual failures.

Well, it kind of does: after 10 failures it doesn't try fencing again,
so that is where the failure count stays ;-)
Of course it still sees the need to fence, but doesn't actually try.

Regards,
Klaus

>
> After the 11th fence failure, "Too many failures (10) to fence" is output.
> Incidentally, stonith-max-attempts has not been set, so it is 10 by default.
>
> [root@x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith failed|Too many failures"
> ##Requesting fencing : 1st time
> May 12 05:51:47 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 05:52:52 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.8415167d: No data available
> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 2nd time
> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 05:53:56 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.53d3592a: No data available
> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 3rd time
> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 05:55:01 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.9177cb76: No data available
> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 4th time
> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 05:56:05 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.946531cb: No data available
> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 5th time
> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 05:57:10 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.278b3c4b: No data available
> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 6th time
> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 05:58:14 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.7a49aebb: No data available
> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 7th time
> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 05:59:19 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.83421862: No data available
> May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 8th time
> May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 06:00:24 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.afd7ef98: No data available
> May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 9th time
> May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 06:01:28 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.3b033dbe: No data available
> May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 10th time
> May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 06:02:33 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.5447a345: No data available
> May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 11th time
> May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 06:03:37 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.db50c21a: No data available
> May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence 
> rhel73-2, giving up
> May 12 06:03:37 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
>
> Regards,
> Kazunori INOUE
>

