[Pacemaker] Fail-count and failure timeout

2010-10-01 Thread Holger . Teutsch
Hi,
I observed the following in Pacemaker version 1.1.3 and in tip up to
patch 10258.

In a small test environment to study fail-count behavior I have one
resource (anything) doing sleep 600 with a monitoring interval of 10 secs.

The failure-timeout is 300.

I would expect to never see a failcount higher than 1.

I observed some sporadic clears, but mostly the count increases by 1
every 10 minutes.

Am I mistaken or is this a bug ?

Regards
Holger

-- complete cib for reference ---

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Fail-count and failure timeout

2010-10-05 Thread Andrew Beekhof
On Fri, Oct 1, 2010 at 3:40 PM,   wrote:
> Hi,
> I observed the following in pacemaker Versions 1.1.3 and tip up to patch
> 10258.
>
> In a small test environment to study fail-count behavior I have one resource
>
> anything
> doing sleep 600 with monitoring interval 10 secs.
>
> The failure-timeout is 300.
>
> I would expect to never see a failcount higher than 1.

Why?

The fail-count is only reset when the PE (policy engine) runs... which
is on a failure and/or after the cluster-recheck-interval.
So I'd expect a maximum of two.

   cluster-recheck-interval = time [15min]
       Polling interval for time based changes to options, resource
       parameters and constraints.

       The Cluster is primarily event driven, however the configuration
       can have elements that change based on time. To ensure these
       changes take effect, we can optionally poll the cluster’s status
       for changes. Allowed values: Zero disables polling. Positive
       values are an interval in seconds (unless other SI units are
       specified, e.g. 5min).



>
> I observed some sporadic clears but mostly the count is increasing by 1 each
> 10 minutes.
>
> Am I mistaken or is this a bug ?

Hard to say without logs.  What value did it reach?

>
> Regards
> Holger
>
> -- complete cib for reference ---
>
>  validate-with="pacemaker-1.2" crm_feature_set="3.0.4" have-quorum="0"
> cib-last-written="Fri Oct  1 14:17:31 2010" dc-uuid="hotlx">
>   
>     
>       
>          value="1.1.3-09640bd6069e677d5eed65203a6056d9bf562e67"/>
>          name="cluster-infrastructure" value="openais"/>
>          name="expected-quorum-votes" value="2"/>
>          name="no-quorum-policy" value="ignore"/>
>          name="stonith-enabled" value="false"/>
>          name="start-failure-is-fatal" value="false"/>
>          name="last-lrm-refresh" value="1285926879"/>
>       
>     
>     
>       
>     
>     
>       
>         
>            value="started"/>
>            name="failure-timeout" value="300"/>
>         
>         
>            on-fail="restart" timeout="20s"/>
>            on-fail="restart" timeout="20s"/>
>         
>         
>            value="sleep 600"/>
>         
>       
>     
>     
>   
> 


Re: [Pacemaker] Fail-count and failure timeout

2010-10-05 Thread Andrew Beekhof
On Tue, Oct 5, 2010 at 11:07 AM, Andrew Beekhof  wrote:
> On Fri, Oct 1, 2010 at 3:40 PM,   wrote:
>> Hi,
>> I observed the following in pacemaker Versions 1.1.3 and tip up to patch
>> 10258.
>>
>> In a small test environment to study fail-count behavior I have one resource
>>
>> anything
>> doing sleep 600 with monitoring interval 10 secs.
>>
>> The failure-timeout is 300.
>>
>> I would expect to never see a failcount higher than 1.
>
> Why?
>
> The fail-count is only reset when the PE runs... which is on a failure
> and/or after the cluster-recheck-interval
> So I'd expect a maximum of two.

Actually, this is wrong.
There is no maximum, because 300s must have passed since the last
failure at the moment the PE runs.
And since here it only runs when the resource fails, the count is never
reset.
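
To make the timing concrete, here is a rough toy model in Python, not
Pacemaker code. It assumes that a failure is recorded before the PE looks
at expiry, and that the recheck timer restarts after every PE run. Both
points are my reading of the explanation above rather than something
checked against the source, and the function and parameter names are
invented for the sketch.

```python
def simulate_fail_count(fail_every, failure_timeout, recheck_interval, duration):
    """Toy model: return the highest fail-count seen within `duration` seconds.

    Assumptions (not taken from the Pacemaker source):
      - the PE runs on every failure, and otherwise only once
        `recheck_interval` seconds have passed since its last run;
      - a failure is recorded before the PE checks for expiry;
      - the count is cleared only if `failure_timeout` seconds have
        passed since the last failure at the moment the PE runs.
    """
    fail_count = 0
    max_seen = 0
    last_failure = None
    last_pe_run = 0
    for t in range(1, duration + 1):
        failed = t % fail_every == 0
        recheck_due = t - last_pe_run >= recheck_interval
        if not (failed or recheck_due):
            continue  # nothing triggers the PE at this second
        if failed:
            fail_count += 1
            last_failure = t
        # The PE run: expire the failures only if they are old enough.
        if last_failure is not None and t - last_failure >= failure_timeout:
            fail_count = 0
            last_failure = None
        max_seen = max(max_seen, fail_count)
        last_pe_run = t
    return max_seen


if __name__ == "__main__":
    # sleep 600, failure-timeout 300, recheck 15min, 100 minutes of runtime
    print(simulate_fail_count(600, 300, 900, 6000))
    # sleep 7200, failure-timeout 3600, recheck 15min, 10 hours of runtime
    print(simulate_fail_count(7200, 3600, 900, 36000))
```

Under these assumptions the first call prints 10 (one increment per
failure, never cleared) and the second prints 1.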




Re: [Pacemaker] Fail-count and failure timeout

2010-10-05 Thread Holger . Teutsch
The resource failed when the sleep expired, i.e. every 600 secs.
Now I changed the resource to

sleep 7200, failure-timeout 3600

i.e. to values far beyond the recheck interval of 15 min.

Now everything behaves as expected.
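
A back-of-the-envelope check of the two configurations, as a small Python
sketch. It assumes the PE only runs on a failure or one full recheck
interval after its previous run, so an expired failure is noticed at the
first recheck at or after the failure-timeout. The helper names are
invented for illustration.

```python
def expiry_delay(failure_timeout, recheck_interval):
    """Seconds after a failure until a recheck first sees it as expired.

    Rechecks happen at multiples of `recheck_interval` after the failure
    (assumed model), so round the timeout up to the next multiple.
    """
    rechecks_needed = -(-failure_timeout // recheck_interval)  # ceiling division
    return rechecks_needed * recheck_interval


def expires_between_failures(fail_every, failure_timeout, recheck_interval):
    """True if the count can be cleared before the next failure arrives."""
    return expiry_delay(failure_timeout, recheck_interval) < fail_every


if __name__ == "__main__":
    print(expires_between_failures(600, 300, 900))    # original setup
    print(expires_between_failures(7200, 3600, 900))  # changed setup
```

For the original setup the first possible clear is 900 s after a failure,
but the next failure already arrives after 600 s, so this prints False.
For the changed setup the clear comes after 3600 s, well before the next
failure at 7200 s, so it prints True.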
 
Kind regards

Holger Teutsch





