Re: [ClusterLabs] Node attribute disappears when pacemaker is started

2017-05-31 Thread Ken Gaillot
On 05/26/2017 03:21 AM, 井上 和徳 wrote:
> Hi Ken,
> 
> I got crm_report.
> 
> Regards,
> Kazunori INOUE

I don't think it attached -- my mail client says it's 0 bytes.

>> -Original Message-
>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>> Sent: Friday, May 26, 2017 4:23 AM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Node attribute disappears when pacemaker is 
>> started
>>
>> On 05/24/2017 05:13 AM, 井上 和徳 wrote:
>>> Hi,
>>>
>>> After loading a node attribute, when I start pacemaker on that node, the
>>> attribute disappears.
>>>
>>> 1. Start pacemaker on node1.
>>> 2. Load a configuration containing a node attribute for node2.
>>>    (I use multicast addresses in corosync, so I did not set "nodelist
>>> {nodeid: }" in corosync.conf.)
>>> 3. Start pacemaker on node2; the node attribute that was just loaded
>>> disappears.
>>>    Is this the expected behavior?
>>
>> Hi,
>>
>> No, this should not happen for a permanent node attribute.
>>
>> Transient node attributes (status-attr in crm shell) are erased when the
>> node starts, so it would be expected in that case.
>>
>> I haven't been able to reproduce this with a permanent node attribute.
>> Can you attach logs from both nodes around the time node2 is started?
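For reference, both kinds of attribute can be set or queried with
crm_attribute; a rough sketch using the node and attribute names from the
configuration quoted below, where --lifetime forever writes a permanent
attribute and --lifetime reboot a transient one that is erased when the node
restarts (exact output varies by version):

  # permanent (nodes section of the CIB)
  crm_attribute --node rhel73-2 --name attrname --update attr2 --lifetime forever
  # transient (status section, cleared on node start)
  crm_attribute --node rhel73-2 --name attrname --update attr2 --lifetime reboot
  # query the permanent value
  crm_attribute --node rhel73-2 --name attrname --query --lifetime forever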
>>
>>>
>>> 1.
>>> [root@rhel73-1 ~]# systemctl start corosync;systemctl start pacemaker
>>> [root@rhel73-1 ~]# crm configure show
>>> node 3232261507: rhel73-1
>>> property cib-bootstrap-options: \
>>>   have-watchdog=false \
>>>   dc-version=1.1.17-0.1.rc2.el7-524251c \
>>>   cluster-infrastructure=corosync
>>>
>>> 2.
>>> [root@rhel73-1 ~]# cat rhel73-2.crm
>>> node rhel73-2 \
>>>   utilization capacity="2" \
>>>   attributes attrname="attr2"
>>>
>>> [root@rhel73-1 ~]# crm configure load update rhel73-2.crm
>>> [root@rhel73-1 ~]# crm configure show
>>> node 3232261507: rhel73-1
>>> node rhel73-2 \
>>>   utilization capacity=2 \
>>>   attributes attrname=attr2
>>> property cib-bootstrap-options: \
>>>   have-watchdog=false \
>>>   dc-version=1.1.17-0.1.rc2.el7-524251c \
>>>   cluster-infrastructure=corosync
>>>
>>> 3.
>>> [root@rhel73-1 ~]# ssh rhel73-2 'systemctl start corosync;systemctl start pacemaker'
>>> [root@rhel73-1 ~]# crm configure show
>>> node 3232261507: rhel73-1
>>> node 3232261508: rhel73-2
>>> property cib-bootstrap-options: \
>>>   have-watchdog=false \
>>>   dc-version=1.1.17-0.1.rc2.el7-524251c \
>>>   cluster-infrastructure=corosync
>>>
>>> Regards,
>>> Kazunori INOUE

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate

2017-05-31 Thread Ken Gaillot
On 05/26/2017 03:21 AM, 井上 和徳 wrote:
> Hi Ken,
> 
> I found the cause.
> 
> When stonith is executed, stonithd sends results and notifications to crmd.
> https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/fencing/remote.c#L402-L406
> 
> - when "result" is sent (calling do_local_reply()), too many stonith failures 
> is checked in too_many_st_failures().
>   
> https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/crmd/te_callbacks.c#L638-L669
> - when "notification" is sent (calling do_stonith_notify()), the number of 
> failures is incremented in st_fail_count_increment().
>   
> https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/crmd/te_callbacks.c#L704-L726
> From this, since checking is done before incrementing, the number of failures 
> in "Too many failures (10) to fence" log does not match the number of actual 
> failures.

Thanks for this analysis!

We do want the result to be sent before the notifications, so the
solution will be slightly more complicated. The DC will have to call
st_fail_count_increment() when receiving the result, while non-DC nodes
will continue to call it when receiving the notification.

I'll put together a fix before the 1.1.17 release.

> 
> I confirmed that the expected result will be obtained from the following 
> changes.
> 
> # git diff
> diff --git a/fencing/remote.c b/fencing/remote.c
> index 4a47d49..3ff324e 100644
> --- a/fencing/remote.c
> +++ b/fencing/remote.c
> @@ -399,12 +399,12 @@ handle_local_reply_and_notify(remote_fencing_op_t * op, xmlNode * data, int rc)
>      reply = stonith_construct_reply(op->request, NULL, data, rc);
>      crm_xml_add(reply, F_STONITH_DELEGATE, op->delegate);
> 
> -    /* Send fencing OP reply to local client that initiated fencing */
> -    do_local_reply(reply, op->client_id, op->call_options & st_opt_sync_call, FALSE);
> -
>      /* bcast to all local clients that the fencing operation happend */
>      do_stonith_notify(0, T_STONITH_NOTIFY_FENCE, rc, notify_data);
> 
> +    /* Send fencing OP reply to local client that initiated fencing */
> +    do_local_reply(reply, op->client_id, op->call_options & st_opt_sync_call, FALSE);
> +
>      /* mark this op as having notify's already sent */
>      op->notify_sent = TRUE;
>      free_xml(reply);
> 
> Regards,
> Kazunori INOUE
> 
>> -Original Message-
>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>> Sent: Wednesday, May 17, 2017 11:09 PM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is 
>> not accurate
>>
>> On 05/17/2017 04:56 AM, Klaus Wenninger wrote:
>>> On 05/17/2017 11:28 AM, 井上 和徳 wrote:
 Hi,
 I'm testing Pacemaker-1.1.17-rc1.
 The number of failures in "Too many failures (10) to fence" log does not 
 match the number of actual failures.
>>>
>>> Well, it kind of does: after 10 failures it doesn't try fencing again,
>>> so that is where the failure count stays ;-)
>>> Of course it still sees the need to fence, but doesn't actually try.
>>>
>>> Regards,
>>> Klaus
>>
>> This feature can be a little confusing: it doesn't prevent all further
>> fence attempts of the target, just *immediate* fence attempts. Whenever
>> the next transition is started for some other reason (a configuration or
>> state change, cluster-recheck-interval, node failure, etc.), it will try
>> to fence again.
>>
>> Also, it only checks this threshold if it's aborting a transition
>> *because* of this fence failure. If it's aborting the transition for
>> some other reason, the number can go higher than the threshold. That's
>> what I'm guessing happened here.
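The threshold being checked here is the stonith-max-attempts cluster property
mentioned below (unset means the default of 10). If that is too low for an
environment, it can be adjusted like any other cluster property; a sketch in
crm shell syntax, with the value chosen purely for illustration:

  crm configure property stonith-max-attempts=20
  # confirm the current value
  crm_attribute --type crm_config --name stonith-max-attempts --query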
>>
 After the 11th fence failure, "Too many failures (10) to fence" is
 logged.
 Incidentally, stonith-max-attempts has not been set, so it is 10 by
 default.

 [root@x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith failed|Too many failures"
 ##Requesting fencing : 1st time
 May 12 05:51:47 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
 May 12 05:52:52 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.8415167d: No data available
 May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
 ## 2nd time
 May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
 May 12 05:53:56 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.53d3592a: No data available
 May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
 ## 3rd time
 May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
 May 12 05:55:01 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by 

Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond

2017-05-31 Thread Ken Gaillot
On 05/24/2017 08:04 AM, Attila Megyeri wrote:
> Hi Klaus,
> 
> Thank you for your response.
> I tried many things, but no luck.
> 
> We have many pacemaker clusters with 99% identical configurations, package 
> versions, and only this one causes issues. (BTW we use unicast for corosync, 
> but this is the same for our other clusters as well.)
> I checked all connection settings between the nodes (to confirm there are no 
> firewall issues), increased the number of cores on each node, but still - as 
> long as a monitor operation is pending for a resource, no other operation is 
> executed.
> 
> E.g., while resource A is being monitored with a 90-second timeout, I cannot
> do a cleanup or start/stop on any other resource until that check times out.

Do you have any constraints configured? If B depends on A, you probably
want at least an ordering constraint. Then the cluster would stop B
before stopping A, and not try to start it until A is up again.
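A minimal sketch of such constraints in crm shell syntax, assuming the
resources really are named A and B (the constraint IDs are illustrative):

  # start B only after A is started, and stop B before stopping A
  order o_A_then_B inf: A B
  # optionally also keep B on the same node as A
  colocation c_B_with_A inf: B A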

Throttling based on load wasn't added until Pacemaker 1.1.11, so the
only limit on parallel execution in 1.1.10 was batch-limit, which
defaulted to 30 at the time.
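For completeness, batch-limit is an ordinary cluster property and can be
lowered if parallel execution itself is suspected; an illustrative crm shell
sketch (the value is arbitrary):

  crm configure property batch-limit=10
  # check what is currently in effect
  crm_attribute --type crm_config --name batch-limit --query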

I'd investigate by figuring out which node was DC at the time and
checking its pacemaker log (preferably with PCMK_debug=crmd turned on).
You can see each run of the policy engine and what decisions were made,
ending with a message like "saving inputs in
/var/lib/pacemaker/pengine/pe-input-4940.bz2". You can run crm_simulate
on that file to get more information about the decision-making process.

"crm_simulate -Sx $FILE -D transition.dot" will create a dot graph of
the transition showing dependencies. You can convert the graph to an svg
with "dot transition.dot -Tsvg > transition.svg" and then look at that
file in any SVG viewer (including most browsers).

> Two more interesting things:
> - Cluster recheck is set to 2 minutes, and even though the resources are
> running properly, the fail counters are not reduced and crm_mon lists the
> resources in the failed actions section forever, or until I manually do a
> resource cleanup.
> - If I execute a crm resource cleanup RES_name from another node, sometimes
> it simply does not clean up the failed state. If I execute this from the node
> where the resource IS actually running, the resource is removed from the
> failed actions.
> 
> 
> What do you recommend? How could I start troubleshooting these issues? As I
> said, this setup works fine in several other systems, but here I am
> really, really stuck.
> 
> 
> thanks!
> 
> Attila
> 
> 
> 
> 
> 
>> -Original Message-
>> From: Klaus Wenninger [mailto:kwenn...@redhat.com]
>> Sent: Wednesday, May 10, 2017 2:04 PM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond
>>
>> On 05/09/2017 10:34 PM, Attila Megyeri wrote:
>>>
>>> Actually I found some more details:
>>>
>>>
>>>
>>> there are two resources: A and B
>>>
>>>
>>>
>>> resource B depends on resource A (when the RA monitors B, it will fail
>>> if A is not running properly)
>>>
>>>
>>>
>>> If I stop resource A, the next monitor operation of "B" will fail.
>>> Interestingly, this check happens immediately after A is stopped.
>>>
>>>
>>>
>>> B is configured to restart if monitor fails. Start timeout is rather
>>> long, 180 seconds. So pacemaker tries to restart B, and waits.
>>>
>>>
>>>
>>> If I want to start "A", nothing happens until the start operation of
>>> "B" fails - typically several minutes.
>>>
>>>
>>>
>>>
>>>
>>> Is this the right behavior?
>>>
>>> It appears that pacemaker is blocked until resource B is being
>>> started, and I cannot really start its dependency...
>>>
>>> Shouldn't it be possible to start a resource while another resource is
>>> also starting?
>>>
>>
>> As long as resources don't depend on each other, parallel starting should
>> work/happen.
>>
>> The number of parallel actions executed is derived from the number of
>> cores, and when load is detected some kind of throttling kicks in (in fact a
>> reduction of the operations executed in parallel, with the aim of reducing
>> the load induced by pacemaker). When throttling kicks in you should get log
>> messages (there is in fact a parallel discussion going on ...).
>> No idea if throttling is the reason here, but it may be worth considering
>> at least.
>>
>> Another reason I've observed for certain things happening with quite some
>> delay is that some situations are only resolved when the
>> cluster-recheck-interval triggers a pengine run, in addition to the runs
>> triggered by changes.
>> You can easily verify this by changing the cluster-recheck-interval.
>>
>> Regards,
>> Klaus
>>
>>>
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Attila
>>>
>>>
>>>
>>>
>>>
>>> *From:* Attila Megyeri [mailto:amegy...@minerva-soft.com]
>>> *Sent:* Tuesday, May 9, 2017 9:53 PM
>>> *To:* users@clusterlabs.org; kgail...@redhat.com
>>> *Subject:* [ClusterLabs] Pacemaker occasionally takes minutes to respond
>>>
>>>
>>>
>>> Hi Ken, all,
>>>
>>>
>>>

Re: [ClusterLabs] clearing failed actions

2017-05-31 Thread Ken Gaillot
On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
>> Hi Ken,
>>
>>
>>> -Original Message-
>>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>>> Sent: Tuesday, May 30, 2017 4:32 PM
>>> To: users@clusterlabs.org
>>> Subject: Re: [ClusterLabs] clearing failed actions
>>>
>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
 Hi,



 Shouldn't the



 cluster-recheck-interval="2m"



 property instruct pacemaker to recheck the cluster every 2 minutes and
 clean the failcounts?
>>>
>>> It instructs pacemaker to recalculate whether any actions need to be
>>> taken (including expiring any failcounts appropriately).
>>>
 At the primitive level I also have a



 migration-threshold="30" failure-timeout="2m"



 but whenever I have a failure, it remains there forever.





 What could be causing this?



 thanks,

 Attila
>>> Is it a single old failure, or a recurring failure? The failure timeout
>>> works in a somewhat nonintuitive way. Old failures are not individually
>>> expired. Instead, all failures of a resource are simultaneously cleared
>>> if all of them are older than the failure-timeout. So if something keeps
>>> failing repeatedly (more frequently than the failure-timeout), none of
>>> the failures will be cleared.
>>>
>>> If it's not a repeating failure, something odd is going on.
>>
>> It is not a repeating failure. Let's say a resource fails for whatever
>> action; it will remain in the failed actions (crm_mon -Af) until I issue a
>> "crm resource cleanup ". Even after days or weeks, even
>> though I see in the logs that the cluster is rechecked every 120 seconds.
>>
>> How could I troubleshoot this issue?
>>
>> thanks!
> 
> 
> Ah, I see what you're saying. That's expected behavior.
> 
> The failure-timeout applies to the failure *count* (which is used for
> checking against migration-threshold), not the failure *history* (which
> is used for the status display).
> 
> The idea is to have it no longer affect the cluster behavior, but still
> allow an administrator to know that it happened. That's why a manual
> cleanup is required to clear the history.

Hmm, I'm wrong there ... failure-timeout does expire the failure history
used for status display.

It works with the current versions. It's possible 1.1.10 had issues with
that.

Check the status to see which node is DC, and look at the pacemaker log
there after the failure occurred. There should be a message about the
failcount expiring. You can also look at the live CIB and search for
last_failure to see what is used for the display.
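A quick way to see the raw data behind the status display (a sketch; run it on
the DC, and expect the exact XML to vary by version):

  # fail-count and last_failure entries live in the status section of the CIB
  cibadmin -Q -o status | grep -E 'fail-count|last_failure'
  # per-resource fail counts are also shown by
  crm_mon -Af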

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: clearing failed actions

2017-05-31 Thread Digimer
On 31/05/17 11:02 PM, Attila Megyeri wrote:
 What type of failure do you have, and what is the status after that? Do
 you have fencing enabled?

>>>
>>> Typically a failed start, or a failed monitor.
>>> Fencing is disabled as we have  multiple nodes / quorum.
>>
>> Stonith and quorum solve different problems. Stonith is required, quorum
>> is optional.
>>
>> https://www.alteeve.com/w/The_2-Node_Myth
> 
> I see your point, but does it relate to the failcount issue? With stonith
> turned off, will the fail counters not be removed even if the service
> recovers immediately after a restart?

I don't know, but according to Ken's last email, what you're seeing is
expected. I replied because of the misunderstanding of the roles that
quorum and fencing play. Running a cluster without fencing is dangerous.
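For anyone wanting to close that gap, enabling fencing looks roughly like the
following in crm shell syntax; this is only a sketch -- the agent and its
parameters depend entirely on the hardware, and every value below is a
placeholder:

  crm configure primitive fence_node1 stonith:fence_ipmilan \
      params pcmk_host_list=node1 ipaddr=10.0.0.1 login=admin passwd=secret lanplus=true \
      op monitor interval=60s
  crm configure property stonith-enabled=true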

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] clearing failed actions

2017-05-31 Thread Ken Gaillot
On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> Hi Ken,
> 
> 
>> -Original Message-
>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>> Sent: Tuesday, May 30, 2017 4:32 PM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] clearing failed actions
>>
>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
>>> Hi,
>>>
>>>
>>>
>>> Shouldn't the
>>>
>>>
>>>
>>> cluster-recheck-interval="2m"
>>>
>>>
>>>
>>> property instruct pacemaker to recheck the cluster every 2 minutes and
>>> clean the failcounts?
>>
>> It instructs pacemaker to recalculate whether any actions need to be
>> taken (including expiring any failcounts appropriately).
>>
>>> At the primitive level I also have a
>>>
>>>
>>>
>>> migration-threshold="30" failure-timeout="2m"
>>>
>>>
>>>
>>> but whenever I have a failure, it remains there forever.
>>>
>>>
>>>
>>>
>>>
>>> What could be causing this?
>>>
>>>
>>>
>>> thanks,
>>>
>>> Attila
>> Is it a single old failure, or a recurring failure? The failure timeout
>> works in a somewhat nonintuitive way. Old failures are not individually
>> expired. Instead, all failures of a resource are simultaneously cleared
>> if all of them are older than the failure-timeout. So if something keeps
>> failing repeatedly (more frequently than the failure-timeout), none of
>> the failures will be cleared.
>>
>> If it's not a repeating failure, something odd is going on.
> 
> It is not a repeating failure. Let's say a resource fails for whatever
> action; it will remain in the failed actions (crm_mon -Af) until I issue a
> "crm resource cleanup ". Even after days or weeks, even though
> I see in the logs that the cluster is rechecked every 120 seconds.
> 
> How could I troubleshoot this issue?
> 
> thanks!


Ah, I see what you're saying. That's expected behavior.

The failure-timeout applies to the failure *count* (which is used for
checking against migration-threshold), not the failure *history* (which
is used for the status display).

The idea is to have it no longer affect the cluster behavior, but still
allow an administrator to know that it happened. That's why a manual
cleanup is required to clear the history.
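As a reference for the settings being discussed, both meta attributes are set
per resource; a minimal crm shell sketch with a hypothetical Dummy resource:

  crm configure primitive p_example ocf:pacemaker:Dummy \
      op monitor interval=10s timeout=20s \
      meta migration-threshold=30 failure-timeout=2m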

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker 1.1.17 Release Candidate 3

2017-05-31 Thread Ken Gaillot
The third release candidate for Pacemaker version 1.1.17 is now
available at:

https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.17-rc3

Significant changes in this release:

* This release adds support for setting meta-attributes on the new
bundle resource type, which will be inherited by the bundle's component
resources. This allows features such as target-role, is-managed,
maintenance mode, etc., to work with bundles.

* A node joining a cluster no longer forces a write-out of all node
attributes when atomic attrd is in use, as this is only necessary with
legacy attrd (which is used on legacy cluster stacks such as heartbeat,
corosync 1, and CMAN). This improves scalability, as the write-out could
cause a surge in IPC traffic that causes problems in large clusters.

* Recovery of failed Pacemaker Remote connections now avoids restarting
resources on the Pacemaker Remote node unless necessary.

Testing and feedback is welcome!
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: clearing failed actions

2017-05-31 Thread Attila Megyeri


> -Original Message-
> From: Digimer [mailto:li...@alteeve.ca]
> Sent: Wednesday, May 31, 2017 2:20 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Attila Megyeri 
> Subject: Re: [ClusterLabs] Antw: clearing failed actions
> 
> On 31/05/17 07:52 PM, Attila Megyeri wrote:
> > Hi,
> >
> >
> >
> >> -Original Message-
> >> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
> >> Sent: Wednesday, May 31, 2017 8:52 AM
> >> To: users@clusterlabs.org
> >> Subject: [ClusterLabs] Antw: clearing failed actions
> >>
> >> >>> Attila Megyeri wrote on 30.05.2017 at 16:13 in message <... soft.local>:
> >>> Hi,
> >>>
> >>> Shouldn't the
> >>>
> >>> cluster-recheck-interval="2m"
> >>>
> >>> property instruct pacemaker to recheck the cluster every 2 minutes and
> >>> clean the failcounts?
> >>>
> >>> At the primitive level I also have a
> >>>
> >>> migration-threshold="30" failure-timeout="2m"
> >>>
> >>> but whenever I have a failure, it remains there forever.
> >>
> >> What type of failure do you have, and what is the status after that? Do
> >> you have fencing enabled?
> >>
> >
> > Typically a failed start, or a failed monitor.
> > Fencing is disabled as we have  multiple nodes / quorum.
> 
> Stonith and quorum solve different problems. Stonith is required, quorum
> is optional.
> 
> https://www.alteeve.com/w/The_2-Node_Myth
> 
> > Pacemaker is 1.1.10.
> >

I see your point, but does it relate to the failcount issue? With stonith
turned off, will the fail counters not be removed even if the service recovers
immediately after a restart?




> >
> >
> >>>
> >>>
> >>> What could be causing this?
> >>>
> >>> thanks,
> >>> Attila
> >>
> >>
> >>
> >>
> >>
> 
> 
> --
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: clearing failed actions

2017-05-31 Thread Digimer
On 31/05/17 07:52 PM, Attila Megyeri wrote:
> Hi,
> 
> 
> 
>> -Original Message-
>> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
>> Sent: Wednesday, May 31, 2017 8:52 AM
>> To: users@clusterlabs.org
>> Subject: [ClusterLabs] Antw: clearing failed actions
>>
>> >>> Attila Megyeri wrote on 30.05.2017 at 16:13 in message <... soft.local>:
>>> Hi,
>>>
>>> Shouldn't the
>>>
>>> cluster-recheck-interval="2m"
>>>
>>> property instruct pacemaker to recheck the cluster every 2 minutes and
>>> clean the failcounts?
>>>
>>> At the primitive level I also have a
>>>
>>> migration-threshold="30" failure-timeout="2m"
>>>
>>> but whenever I have a failure, it remains there forever.
>>
>> What type of failure do you have, and what is the status after that? Do you
>> have fencing enabled?
>>
> 
> Typically a failed start, or a failed monitor.
> Fencing is disabled as we have  multiple nodes / quorum.

Stonith and quorum solve different problems. Stonith is required, quorum
is optional.

https://www.alteeve.com/w/The_2-Node_Myth

> Pacemaker is 1.1.10.
> 
> 
> 
>>>
>>>
>>> What could be causing this?
>>>
>>> thanks,
>>> Attila
>>
>>
>>
>>
>>
> 


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: clearing failed actions

2017-05-31 Thread Attila Megyeri
Hi,



> -Original Message-
> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
> Sent: Wednesday, May 31, 2017 8:52 AM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: clearing failed actions
> 
> >>> Attila Megyeri wrote on 30.05.2017 at 16:13 in message <... soft.local>:
> > Hi,
> >
> > Shouldn't the
> >
> > cluster-recheck-interval="2m"
> >
> > property instruct pacemaker to recheck the cluster every 2 minutes and
> > clean the failcounts?
> >
> > At the primitive level I also have a
> >
> > migration-threshold="30" failure-timeout="2m"
> >
> > but whenever I have a failure, it remains there forever.
> 
> What type of failure do you have, and what is the status after that? Do you
> have fencing enabled?
> 

Typically a failed start, or a failed monitor.
Fencing is disabled as we have  multiple nodes / quorum.

Pacemaker is 1.1.10.



> >
> >
> > What could be causing this?
> >
> > thanks,
> > Attila
> 
> 
> 
> 
> 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org