Hi Andrew,

I registered this problem in Bugzilla:
  * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2417

Best Regards,
Hideo Yamauchi.

--- renayama19661...@ybb.ne.jp wrote:
> Hi Andrew,
>
> Thank you for the comment.
>
> > After reconstructing the logs (because they were unreadable) I found
> > that they didn't go back far enough.
> > Please _attach_ logs from the time prmFsPostgreSQLDB3-2 was started by
> > the cluster.
>
> When I ran this test, I did not collect information with hb_report.
> I attach the logs of all nodes from when the problem occurred.
>
> The test procedure I performed is as follows (a rough command outline is
> sketched inline below, just before the quoted log):
>
> Step1) Start four nodes and load cib.xml.
> Step2) Modify the RA so that pingd fails to stop on srv01.
> Step3) Stop clnPrmPingd with a crm command, causing the stop error.
>        * The monitor error in question seems to occur at this point.
> Step4) Reboot srv01.
>
> Best Regards,
> Hideo Yamauchi.
>
>
> --- Andrew Beekhof <and...@beekhof.net> wrote:
>
> > After reconstructing the logs (because they were unreadable) I found
> > that they didn't go back far enough.
> > Please _attach_ logs from the time prmFsPostgreSQLDB3-2 was started by
> > the cluster.
> >
> > On Tue, May 11, 2010 at 9:01 AM, <renayama19661...@ybb.ne.jp> wrote:
> > > Hi,
> > >
> > > In a recent test of Pacemaker, the following problem occurred.
> > >
> > > * corosync 1.2.1
> > > * Pacemaker-1-0-8463260ff667
> > > * Reusable-Cluster-Components-c447fc25e119
> > > * Cluster-Resource-Agents-f92935082277
> > >
> > > The problem is that a monitor failure is reported for the
> > > prmFsPostgreSQLDB3-2 resource after it has already been stopped.
> > > This problem does not always occur.
> > >
> > > [r...@srv02 ~]# crm_mon -1
> > > ============
> > > Last updated: Tue Apr 27 11:19:26 2010
> > > Stack: openais
> > > Current DC: srv02 - partition with quorum
> > > Version: 1.0.8-8463260ff6674ad568e6bd3b2a84f03edce257f3
> > > 4 Nodes configured, 4 expected votes
> > > 13 Resources configured.
> > > ============
> > >
> > > Online: [ srv02 srv03 srv04 ]
> > > OFFLINE: [ srv01 ]
> > >
> > >  Resource Group: grpStonith1
> > >      prmStonithN1    (stonith:external/ssh):  Started srv04
> > >  Resource Group: grpStonith2
> > >      prmStonithN2    (stonith:external/ssh):  Started srv03
> > >  Resource Group: grpStonith3
> > >      prmStonithN3    (stonith:external/ssh):  Started srv02
> > >  Resource Group: grpStonith4
> > >      prmStonithN4    (stonith:external/ssh):  Started srv03
> > >  Clone Set: clnUMgroup01
> > >      Started: [ srv04 ]
> > >      Stopped: [ clnUmResource:0 ]
> > >  Clone Set: clnDiskd1
> > >      Started: [ srv02 srv03 srv04 ]
> > >      Stopped: [ clnPrmDiskd1:2 ]
> > >  Clone Set: clnG3dummy1
> > >      Started: [ srv02 srv03 srv04 ]
> > >      Stopped: [ clnG3dummy01:0 ]
> > >  Clone Set: clnG3dummy2
> > >      Started: [ srv02 srv03 srv04 ]
> > >      Stopped: [ clnG3dummy02:0 ]
> > >
> > > Failed actions:
> > >     prmFsPostgreSQLDB3-2_monitor_10000 (node=srv03, call=49, rc=7,
> > >         status=complete): not running
> > >
> > > I suspect the cause is that the cancelled monitor operation is not
> > > removed from the operation list correctly.
> > >
> > > It may be triggered by the execution of a postponed monitor operation.
> > >
> > > Incrementing the fail count for a monitor error after the resource has
> > > already stopped is a problem: it can lead to unexpected resource
> > > placement.
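For reference, here is a rough command-level outline of the test procedure above
(Step1-Step4). This is only a sketch: clnPrmPingd and cib.xml come from our test
configuration, cibadmin is just one way to load the configuration, and the exact
RA edit in Step2 is test-specific.

  # Step1) start corosync/pacemaker on all four nodes, then load the test configuration
  cibadmin --replace --xml-file cib.xml

  # Step2) edit the pingd RA on srv01 so that its "stop" action returns an error
  #        (for example, make it return $OCF_ERR_GENERIC; the exact edit is test-specific)

  # Step3) stop the pingd clone; this triggers the stop error on srv01
  crm resource stop clnPrmPingd

  # Step4) reboot srv01 to recover it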
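Also, as a temporary workaround until this is fixed, the stale failure can be
inspected and cleared by hand; a minimal sketch, assuming the crm shell syntax
of this release:

  crm_mon -1 -f                                      # -f also lists per-resource fail counts
  crm resource failcount prmFsPostgreSQLDB3-2 show srv03
  crm resource cleanup prmFsPostgreSQLDB3-2 srv03    # clears the fail count and the failed action

Of course this only removes the symptom; it does not address the
postponed-monitor race itself.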
> > > -----------------------------
> > > Apr 27 11:09:19 srv03 lrmd: [24383]: info: rsc:prmFsPostgreSQLDB3-3:59: stop
> > > Apr 27 11:09:19 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmFsPostgreSQLDB3-3_monitor_10000 (call=51, status=1, cib-update=0, confirmed=true) Cancelled
> > > Apr 27 11:09:19 srv03 lrmd: [24383]: info: Managed prmFsPostgreSQLDB3-3:stop process 26210 exited with return code 0.
> > > Apr 27 11:09:19 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmFsPostgreSQLDB3-3_stop_0 (call=59, rc=0, cib-update=65, confirmed=true) ok
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: cancel_op: operation monitor[49] on ocf::Dummy::prmFsPostgreSQLDB3-2 for client 24386, its parameters: CRM_meta_interval=[10000] CRM_meta_on_fail=[restart] CRM_meta_timeout=[60000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] cancelled
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: flush_op: process for operation monitor[49] on ocf::Dummy::prmFsPostgreSQLDB3-2 for client 24386 still running, flush delayed
> > > Apr 27 11:09:20 srv03 crmd: [24386]: info: do_lrm_rsc_op: Performing key=88:8:0:2587fade-309b-454d-8a81-19e8884439ec op=prmFsPostgreSQLDB3-2_stop_0 )
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: perform_op:2873: operation monitor[49] on ocf::Dummy::prmFsPostgreSQLDB3-2 for client 24386, its parameters: CRM_meta_interval=[10000] CRM_meta_on_fail=[restart] CRM_meta_timeout=[60000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] for rsc is already running.
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: perform_op:2883: postponing all ops on resource prmFsPostgreSQLDB3-2 by 1000 ms
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: rsc:prmFsPostgreSQLDB3-2:60: stop
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: Managed prmFsPostgreSQLDB3-2:stop process 26225 exited with return code 0.
> > > Apr 27 11:09:20 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmFsPostgreSQLDB3-2_stop_0 (call=60, rc=0, cib-update=66, confirmed=true) ok
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: cancel_op: operation monitor[47] on ocf::Dummy::prmFsPostgreSQLDB3-1 for client 24386, its parameters: CRM_meta_interval=[10000] CRM_meta_on_fail=[restart] CRM_meta_timeout=[60000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] cancelled
> > > Apr 27 11:09:20 srv03 crmd: [24386]: info: do_lrm_rsc_op: Performing key=87:8:0:2587fade-309b-454d-8a81-19e8884439ec op=prmFsPostgreSQLDB3-1_stop_0 )
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: rsc:prmFsPostgreSQLDB3-1:61: stop
> > > Apr 27 11:09:20 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmFsPostgreSQLDB3-1_monitor_10000 (call=47, status=1, cib-update=0, confirmed=true) Cancelled
> > > Apr 27 11:09:20 srv03 lrmd: [24383]: info: Managed prmFsPostgreSQLDB3-1:stop process 26233 exited with return code 0.
> > > Apr 27 11:09:20 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmFsPostgreSQLDB3-1_stop_0 (call=61, rc=0, cib-update=67, confirmed=true) ok
> > > Apr 27 11:09:21 srv03 lrmd: [24383]: info: cancel_op: operation monitor[45] on ocf::Dummy::prmExPostgreSQLDB3 for client 24386, its parameters: CRM_meta_interval=[10000] CRM_meta_on_fail=[restart] CRM_meta_timeout=[90000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] cancelled
> > > Apr 27 11:09:21 srv03 crmd: [24386]: info: do_lrm_rsc_op: Performing key=86:8:0:2587fade-309b-454d-8a81-19e8884439ec op=prmExPostgreSQLDB3_stop_0 )
> > > Apr 27 11:09:21 srv03 lrmd: [24383]: info: rsc:prmExPostgreSQLDB3:62: stop
> > > Apr 27 11:09:21 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmExPostgreSQLDB3_monitor_10000 (call=45, status=1, cib-update=0, confirmed=true) Cancelled
> > > Apr 27 11:09:21 srv03 lrmd: [24383]: info: Managed prmExPostgreSQLDB3:stop process 26241 exited with return code 0.
> > > Apr 27 11:09:21 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmExPostgreSQLDB3_stop_0 (call=62, rc=0, cib-update=68, confirmed=true) ok
> > > Apr 27 11:09:31 srv03 crmd: [24386]: info: process_lrm_event: LRM operation prmFsPostgreSQLDB3-2_monitor_10000 (call=49, rc=7, cib-update=69, confirmed=false) not running
> > > Apr 27 11:09:31 srv03 attrd: [24384]: info: attrd_ais_dispatch: Update relayed from srv01
> > > Apr 27 11:09:31 srv03 attrd: [24384]: info: find_hash_entry: Creating hash entry for fail-count-prmFsPostgreSQLDB3-2
> > > Apr 27 11:09:31 srv03 attrd: [24384]: info: attrd_local_callback: Expanded fail-count-prmFsPostgreSQLDB3-2=value++ to 1
> > > Apr 27 11:09:31 srv03 attrd: [24384]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prmFsPostgreSQLDB3-2 (1)
> > > Apr 27 11:09:31 srv03 attrd: [24384]: info: attrd_perform_update: Sent update 52: fail-count-prmFsPostgreSQLDB3-2=1
> > > -----------------------------
> > >
> > > I have not confirmed whether this problem occurs with the most recent
> > > version. It may also be a problem that somebody has already reported.
> > >
> > > Best Regards,
> > > Hideo Yamauchi.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf