Hi Andrew,

I registered this problem in Bugzilla:
  * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2417

Best Regards,
Hideo Yamauchi.

--- renayama19661...@ybb.ne.jp wrote:
> Hi Andrew,
>
> Thank you for the comment.
>
> > After reconstructing the logs (because they were unreadable) I found
> > that they didn't go back far enough.
> > Please _attach_ logs from the time prmFsPostgreSQLDB3-2 was started by
> > the cluster.
>
> When I ran this test, I did not collect information with hb_report.
> I attach the logs of all nodes from when the problem occurred.
>
> The test procedure I performed is as follows (a rough command outline is
> sketched inline below, just before the quoted log):
>
> Step1) Start four nodes and load cib.xml.
> Step2) Modify the RA so that pingd fails to stop on srv01.
> Step3) Stop clnPrmPingd with a crm command, causing the stop error.
>        * The monitor error in question seems to occur at this point.
> Step4) Reboot srv01.
>
> Best Regards,
> Hideo Yamauchi.
>
>
> --- Andrew Beekhof <and...@beekhof.net> wrote:
>
> > After reconstructing the logs (because they were unreadable) I found
> > that they didn't go back far enough.
> > Please _attach_ logs from the time prmFsPostgreSQLDB3-2 was started by
> > the cluster.
> >
> > On Tue, May 11, 2010 at 9:01 AM, <renayama19661...@ybb.ne.jp> wrote:
> > > Hi,
> > >
> > > In a recent test of Pacemaker, the following problem occurred.
> > >
> > > * corosync 1.2.1
> > > * Pacemaker-1-0-8463260ff667
> > > * Reusable-Cluster-Components-c447fc25e119
> > > * Cluster-Resource-Agents-f92935082277
> > >
> > > The problem is that a monitor failure is reported for the
> > > prmFsPostgreSQLDB3-2 resource after it has already been stopped.
> > > This problem does not always occur.
> > >
> > > [r...@srv02 ~]# crm_mon -1
> > > ============
> > > Last updated: Tue Apr 27 11:19:26 2010
> > > Stack: openais
> > > Current DC: srv02 - partition with quorum
> > > Version: 1.0.8-8463260ff6674ad568e6bd3b2a84f03edce257f3
> > > 4 Nodes configured, 4 expected votes
> > > 13 Resources configured.
> > > ============
> > >
> > > Online: [ srv02 srv03 srv04 ]
> > > OFFLINE: [ srv01 ]
> > >
> > >  Resource Group: grpStonith1
> > >      prmStonithN1    (stonith:external/ssh):  Started srv04
> > >  Resource Group: grpStonith2
> > >      prmStonithN2    (stonith:external/ssh):  Started srv03
> > >  Resource Group: grpStonith3
> > >      prmStonithN3    (stonith:external/ssh):  Started srv02
> > >  Resource Group: grpStonith4
> > >      prmStonithN4    (stonith:external/ssh):  Started srv03
> > >  Clone Set: clnUMgroup01
> > >      Started: [ srv04 ]
> > >      Stopped: [ clnUmResource:0 ]
> > >  Clone Set: clnDiskd1
> > >      Started: [ srv02 srv03 srv04 ]
> > >      Stopped: [ clnPrmDiskd1:2 ]
> > >  Clone Set: clnG3dummy1
> > >      Started: [ srv02 srv03 srv04 ]
> > >      Stopped: [ clnG3dummy01:0 ]
> > >  Clone Set: clnG3dummy2
> > >      Started: [ srv02 srv03 srv04 ]
> > >      Stopped: [ clnG3dummy02:0 ]
> > >
> > > Failed actions:
> > >     prmFsPostgreSQLDB3-2_monitor_10000 (node=srv03, call=49, rc=7,
> > >         status=complete): not running
> > >
> > > I suspect the cause is that the cancelled monitor operation is not
> > > removed from the operation list correctly.
> > >
> > > It may be triggered by the execution of a postponed monitor operation.
> > >
> > > Incrementing the fail count for a monitor error after the resource has
> > > already stopped is a problem: it can lead to unexpected resource
> > > placement.
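For reference, here is a rough command-level outline of the test procedure above
(Step1-Step4). This is only a sketch: clnPrmPingd and cib.xml come from our test
configuration, cibadmin is just one way to load the configuration, and the exact
RA edit in Step2 is test-specific.

  # Step1) start corosync/pacemaker on all four nodes, then load the test configuration
  cibadmin --replace --xml-file cib.xml

  # Step2) edit the pingd RA on srv01 so that its "stop" action returns an error
  #        (for example, make it return $OCF_ERR_GENERIC; the exact edit is test-specific)

  # Step3) stop the pingd clone; this triggers the stop error on srv01
  crm resource stop clnPrmPingd

  # Step4) reboot srv01 to recover it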
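Also, as a temporary workaround until this is fixed, the stale failure can be
inspected and cleared by hand; a minimal sketch, assuming the crm shell syntax
of this release:

  crm_mon -1 -f                                      # -f also lists per-resource fail counts
  crm resource failcount prmFsPostgreSQLDB3-2 show srv03
  crm resource cleanup prmFsPostgreSQLDB3-2 srv03    # clears the fail count and the failed action

Of course this only removes the symptom; it does not address the
postponed-monitor race itself.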
> > > -----------------------------
> > > Apr 27 11:09:19 srv03 lrmd: [24383]: info: rsc:prmFsPostgreSQLDB3-3:59: stop
> > > Apr 27 11:09:19 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmFsPostgreSQLDB3-3_monitor_10000 (call=51, status=1, cib-update=0, confirmed=true) Cancelled
> > > Apr 27 11:09:19 srv03 lrmd: [24383]: info: Managed prmFsPostgreSQLDB3-3:stop process 26210 exited with return code 0.
> > > Apr 27 11:09:19 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmFsPostgreSQLDB3-3_stop_0 (call=59, rc=0, cib-update=65, confirmed=true) ok
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: cancel_op: operation monitor[49] on ocf::Dummy::prmFsPostgreSQLDB3-2 for client 24386, its parameters: CRM_meta_interval=[10000] CRM_meta_on_fail=[restart] CRM_meta_timeout=[60000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] cancelled
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: flush_op: process for operation monitor[49] on ocf::Dummy::prmFsPostgreSQLDB3-2 for client 24386 still running, flush delayed
> > > Apr 27 11:09:20 srv03 crmd: [24386]: info: do_lrm_rsc_op: Performing key=88:8:0:2587fade-309b-454d-8a81-19e8884439ec op=prmFsPostgreSQLDB3-2_stop_0 )
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: perform_op:2873: operation monitor[49] on ocf::Dummy::prmFsPostgreSQLDB3-2 for client 24386, its parameters: CRM_meta_interval=[10000] CRM_meta_on_fail=[restart] CRM_meta_timeout=[60000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] for rsc is already running.
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: perform_op:2883: postponing all ops on resource prmFsPostgreSQLDB3-2 by 1000 ms
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: rsc:prmFsPostgreSQLDB3-2:60: stop
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: Managed prmFsPostgreSQLDB3-2:stop process 26225 exited with return code 0.
> > > Apr 27 11:09:20 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmFsPostgreSQLDB3-2_stop_0 (call=60, rc=0, cib-update=66, confirmed=true) ok
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: cancel_op: operation monitor[47] on ocf::Dummy::prmFsPostgreSQLDB3-1 for client 24386, its parameters: CRM_meta_interval=[10000] CRM_meta_on_fail=[restart] CRM_meta_timeout=[60000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] cancelled
> > > Apr 27 11:09:20 srv03 crmd: [24386]: info: do_lrm_rsc_op: Performing key=87:8:0:2587fade-309b-454d-8a81-19e8884439ec op=prmFsPostgreSQLDB3-1_stop_0 )
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: rsc:prmFsPostgreSQLDB3-1:61: stop
> > > Apr 27 11:09:20 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmFsPostgreSQLDB3-1_monitor_10000 (call=47, status=1, cib-update=0, confirmed=true) Cancelled
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: Managed prmFsPostgreSQLDB3-1:stop process 26233 exited with return code 0.
> > > Apr 27 11:09:20 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmFsPostgreSQLDB3-1_stop_0 (call=61, rc=0, cib-update=67, confirmed=true) ok
> > > Apr 27 11:09:21 srv03 lrmd: [24383]: info: cancel_op: operation monitor[45] on ocf::Dummy::prmExPostgreSQLDB3 for client 24386, its parameters: CRM_meta_interval=[10000] CRM_meta_on_fail=[restart] CRM_meta_timeout=[90000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] cancelled
> > > Apr 27 11:09:21 srv03 crmd: [24386]: info: do_lrm_rsc_op: Performing key=86:8:0:2587fade-309b-454d-8a81-19e8884439ec op=prmExPostgreSQLDB3_stop_0 )
> > > Apr 27 11:09:21 srv03 lrmd: [24383]: info: rsc:prmExPostgreSQLDB3:62: stop
> > > Apr 27 11:09:21 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmExPostgreSQLDB3_monitor_10000 (call=45, status=1, cib-update=0, confirmed=true) Cancelled
> > > Apr 27 11:09:21 srv03 lrmd: [24383]: info: Managed prmExPostgreSQLDB3:stop process 26241 exited with return code 0.
> > > Apr 27 11:09:21 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmExPostgreSQLDB3_stop_0 (call=62, rc=0, cib-update=68, confirmed=true) ok
> > > Apr 27 11:09:31 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmFsPostgreSQLDB3-2_monitor_10000 (call=49, rc=7, cib-update=69, confirmed=false) not running
> > > Apr 27 11:09:31 srv03 attrd: [24384]: info: attrd_ais_dispatch: Update relayed from srv01
> > > Apr 27 11:09:31 srv03 attrd: [24384]: info: find_hash_entry: Creating hash entry for fail-count-prmFsPostgreSQLDB3-2
> > > Apr 27 11:09:31 srv03 attrd: [24384]: info: attrd_local_callback: Expanded fail-count-prmFsPostgreSQLDB3-2=value++ to 1
> > > Apr 27 11:09:31 srv03 attrd: [24384]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prmFsPostgreSQLDB3-2 (1)
> > > Apr 27 11:09:31 srv03 attrd: [24384]: info: attrd_perform_update: Sent update 52: fail-count-prmFsPostgreSQLDB3-2=1
> > > -----------------------------
> > >
> > > I have not confirmed whether this problem occurs with the most recent
> > > version. It may also be a problem that somebody has already reported.
> > >
> > > Best Regards,
> > > Hideo Yamauchi.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf