Re: [Pacemaker] resource starts but then fails right away
On 10/05/2013, at 9:23 PM, Brian J. Murrell wrote:

> On 13-05-09 09:53 PM, Andrew Beekhof wrote:
>>
>> May  7 02:36:16 node1 crmd[16836]: info: delete_resource: Removing
>> resource testfs-resource1 for 18002_crm_resource (internal) on node1
>> May  7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation
>> monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running,
>> flush delayed
>> May  7 02:36:16 node1 crmd[16836]: info: lrm_remove_deleted_op: Removing
>> op testfs-resource1_monitor_0:8 for deleted resource testfs-resource1
>>
>> So apparently a badly timed cleanup was run.
>
> :-( I didn't know there could be such timing problems. I might have to
> change my process a bit then, perhaps.
>
>> Did you do that or was it the crm shell?
>
> That was "me" doing a "crm resource cleanup" (soon to become
> "crm_resource -r ... --cleanup"). The process is typically:
>
> - create resource
> - start resource
> - wait for resource to start
>
> where "start resource" is:
>
> - "clean it to start with a known clean resource"
>   (crm resource cleanup)
> - "start resource"
>   (crm_resource -r ... -p target-role -m -v Started)
>
> and "wait for resource" is a loop of "crm resource status ..." (soon to
> be "crm_resource -r ... --locate").
>
> So the create, clean, start operations happen in quite quick succession
> (i.e. scripted). Is that pathological? Is a clean between create and
> start known to be problematic?

It's certainly known to be unnecessary. In some older versions it is also
problematic.

> FWIW, the reason for the clean before the start, even after just creating
> the resource, is that "clean" and "start" are lumped together into a
> function that is called after create but can also be called at other
> times during the life-cycle, so it could be necessary to clean a resource
> before trying to start it. I was hoping the cleanup of a just-created
> resource was going to be effectively a NOOP.

It's never a no-op, and at that particular point the cluster is trying to
discover the status of the resource. Running a cleanup in the middle of
that interferes with the discovery.

> I guess for completeness, I should add here that creating the resource
> is a "cibadmin -o resource -C ..." operation.
>
>> If the machine is heavily loaded, or just very busy with file I/O, that can
>> still take quite a long time.
>
> Yeah, not very loaded at all, especially at this point. This is all
> happening before anything really gets started on the machine... this is
> the process of getting the resources up and running, and the machine is
> dedicated to running the tasks associated with these resources.
>
> Cheers,
> b.
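[Editorial note: in effect, the advice above is to drop the cleanup between
create and start entirely. A minimal sketch of that corrected sequence, using
the commands and resource name from this thread; the primitive XML is a hedged
guess based on the "ocf::foo:Target" class shown later in the thread, with its
instance attributes elided as in the original message.]

    # create the resource; the cluster then probes it (monitor_0) on its own
    cibadmin -o resources -C -X \
        '<primitive id="testfs-resource1" class="ocf" provider="foo" type="Target"/>'

    # start it directly; no cleanup in between, so the probe is not flushed
    crm_resource -r testfs-resource1 -p target-role -m -v Started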
Re: [Pacemaker] resource starts but then fails right away
On 13-05-09 09:53 PM, Andrew Beekhof wrote:
>
> May  7 02:36:16 node1 crmd[16836]: info: delete_resource: Removing
> resource testfs-resource1 for 18002_crm_resource (internal) on node1
> May  7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation
> monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running,
> flush delayed
> May  7 02:36:16 node1 crmd[16836]: info: lrm_remove_deleted_op: Removing
> op testfs-resource1_monitor_0:8 for deleted resource testfs-resource1
>
> So apparently a badly timed cleanup was run.

:-( I didn't know there could be such timing problems. I might have to
change my process a bit then, perhaps.

> Did you do that or was it the crm shell?

That was "me" doing a "crm resource cleanup" (soon to become
"crm_resource -r ... --cleanup"). The process is typically (see the
sketch after this message):

- create resource
- start resource
- wait for resource to start

where "start resource" is:

- "clean it to start with a known clean resource"
  (crm resource cleanup)
- "start resource"
  (crm_resource -r ... -p target-role -m -v Started)

and "wait for resource" is a loop of "crm resource status ..." (soon to
be "crm_resource -r ... --locate").

So the create, clean, start operations happen in quite quick succession
(i.e. scripted). Is that pathological? Is a clean between create and
start known to be problematic?

FWIW, the reason for the clean before the start, even after just creating
the resource, is that "clean" and "start" are lumped together into a
function that is called after create but can also be called at other
times during the life-cycle, so it could be necessary to clean a resource
before trying to start it. I was hoping the cleanup of a just-created
resource was going to be effectively a NOOP.

I guess for completeness, I should add here that creating the resource
is a "cibadmin -o resource -C ..." operation.

> If the machine is heavily loaded, or just very busy with file I/O, that can
> still take quite a long time.

Yeah, not very loaded at all, especially at this point. This is all
happening before anything really gets started on the machine... this is
the process of getting the resources up and running, and the machine is
dedicated to running the tasks associated with these resources.

Cheers,
b.
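[Editorial note: a hypothetical rendering of the scripted process described
above, to make the race concrete. The RSC variable, the primitive XML, and
the one-second poll interval are assumptions; the commands themselves are the
ones named in the message.]

    #!/bin/sh
    RSC=testfs-resource1

    # "create resource"; instance attributes elided, as in the message
    cibadmin -o resources -C -X \
        "<primitive id=\"$RSC\" class=\"ocf\" provider=\"foo\" type=\"Target\"/>"

    # "start resource", step 1: cleanup. This is the step that can race the
    # probe (monitor_0) that the create above has just triggered.
    crm resource cleanup "$RSC"

    # "start resource", step 2: set the target-role
    crm_resource -r "$RSC" -p target-role -m -v Started

    # "wait for resource": poll until the status line reports it running
    until crm resource status "$RSC" | grep -q "is running on"; do
        sleep 1
    done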
Re: [Pacemaker] resource starts but then fails right away
On 10/05/2013, at 12:26 AM, Brian J. Murrell wrote:

> I do see the:
>
> May  7 02:37:32 node1 crmd[16836]: error: print_elem: Aborting transition,
> action lost: [Action 5]: In-flight (id: testfs-resource1_monitor_0, loc:
> node1, priority: 0)
>
> in the log. Is that the root cause of the problem?

Ordinarily I'd have said yes, but I also see:

May  7 02:36:16 node1 crmd[16836]: info: delete_resource: Removing
resource testfs-resource1 for 18002_crm_resource (internal) on node1
May  7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation
monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running,
flush delayed
May  7 02:36:16 node1 crmd[16836]: info: lrm_remove_deleted_op: Removing
op testfs-resource1_monitor_0:8 for deleted resource testfs-resource1

So apparently a badly timed cleanup was run. Did you do that or was it
the crm shell?

> If so, what's that trying to tell me, exactly? If not, what is the
> cause of the problem?
>
> It really can't be the RA timing out, since I give the monitor operation
> a 60-second timeout and the status action of the RA only takes a few
> seconds at most to run, and it is not really an operation that can get
> blocked on anything. It's effectively the grepping of a file.

If the machine is heavily loaded, or just very busy with file I/O, that can
still take quite a long time. I've seen IPaddr monitor actions take over a
minute, for example.
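[Editorial note: the thread does not pursue this further, but if a monitor can
legitimately take over a minute under load, its timeout can be raised where the
primitive is defined. A hypothetical crm shell sketch reusing the resource
class from this thread; the interval and timeout values are illustrative
assumptions, and any instance parameters are elided.]

    # define the resource with a generous monitor timeout
    crm configure primitive testfs-resource1 ocf:foo:Target \
        op monitor interval="20s" timeout="120s"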
[Pacemaker] resource starts but then fails right away
Using Pacemaker 1.1.7 on EL6.3, I'm getting an intermittent recurrence of
a situation where I add a resource and start it, and it seems to start but
then fails right away. i.e.:

# clean up the resource before trying to start it, just to make sure we
# start with a clean slate
# crm resource cleanup testfs-resource1
Cleaning up testfs-resource1 on node1
Waiting for 2 replies from the CRMd.. OK

# now try to start it
# crm_resource -r testfs-resource1 -p target-role -m -v Started

# monitor the startup for success
# crm resource status testfs-resource1
resource testfs-resource1 is NOT running
# crm resource status testfs-resource1
resource testfs-resource1 is NOT running
# crm resource status testfs-resource1
resource testfs-resource1 is NOT running
...
# crm resource status testfs-resource1
resource testfs-resource1 is NOT running
# crm resource status testfs-resource1
resource testfs-resource1 is NOT running
# crm resource status testfs-resource1
resource testfs-resource1 is running on: node1

# it started. check once more:
# crm status

Last updated: Tue May  7 02:37:34 2013
Last change: Tue May  7 02:36:17 2013 via crm_resource on node1
Stack: openais
Current DC: node1 - partition WITHOUT quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
1 Nodes configured, 2 expected votes
3 Resources configured.

Online: [ node1 ]

st-fencing        (stonith:fence_foo):  Started node1
resource2         (ocf::foo:Target):    Started node1
testfs-resource1  (ocf::foo:Target):    Started node1 FAILED

Failed actions:
    testfs-resource1_monitor_0 (node=node1, call=-1, rc=1,
        status=Timed Out): unknown error

# but lo and behold, it failed, with a monitor operation failing.
# stop it
# crm_resource -r testfs-resource1 -p target-role -m -v Stopped

The syslog for this whole operation, starting with adding the resource,
is as follows:

May  7 02:36:12 node1 cib[16831]: info: cib:diff: -
May  7 02:36:12 node1 crmd[16836]: info: abort_transition_graph: te_update_diff:126 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.16.1) : Non-status change
May  7 02:36:12 node1 crmd[16836]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +
    [19 "cib:diff: +" lines; the XML they carried was stripped by the list archive]
May  7 02:36:12 node1 cib[16831]: info: cib_process_request: Operation complete: op cib_create for section resources (origin=local/cibadmin/2, version=0.16.1): ok (rc=0)
May  7 02:36:12 node1 pengine[16835]: notice: unpack_config: On loss of CCM Quorum: Ignore
May  7 02:36:12 node1 crmd[16836]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
May  7 02:36:12 node1 crmd[16836]: info: do_te_invoke: Processing graph 13 (ref=pe_calc-dc-1367894172-72) derived from /var/lib/pengine/pe-input-124.bz2
May  7 02:36:12 node1 crmd[16836]: info: te_rsc_command: Initiating action 5: monitor testfs-resource1_monitor_0 on node1 (local)
May  7 02:36:12 node1 lrmd: [16833]: info: rsc:testfs-resource1:8: probe
May  7 02:36:12 node1 pengine[16835]: notice: process_pe_message: Transition 13: PEngine Input stored in: /var/lib/pengine/pe-input-124.bz2
May  7 02:36:14 node1 crmd[16836]: info: abort_transition_graph: te_update_diff:126 - Triggered transition abort (complete=0, tag=diff, id=(null), magic=NA, cib=0.17.1) : Non-status change
May  7 02:36:14 node1 cib[16831]: info: cib:diff: -
May  7 02:36:14 node1 cib[16831]: info: cib:diff: +
May  7 02:36:14 node1 cib[16831]: info: cib:diff: +
May  7 02:36:14 node1 cib[16831]: info: cib:diff: +
May  7 02:36:14 node1 cib[16831]: info: ci
    [remainder of the log truncated in the archive]
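[Editorial note: the archived message stops here, but the PEngine input file
named in the log above can be replayed offline to inspect what transition 13
was trying to do when it was aborted. A hypothetical debugging step, assuming
the crm_simulate tool that ships with Pacemaker 1.1.x and the file path from
the log.]

    # replay the stored transition input and show the resulting state
    crm_simulate -S -x /var/lib/pengine/pe-input-124.bz2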