Re: [Pacemaker] resource starts but then fails right away

2013-05-12 Thread Andrew Beekhof

On 10/05/2013, at 9:23 PM, Brian J. Murrell  wrote:

> On 13-05-09 09:53 PM, Andrew Beekhof wrote:
>> 
>> May  7 02:36:16 node1 crmd[16836]: info: delete_resource: Removing 
>> resource testfs-resource1 for 18002_crm_resource (internal) on node1
>> May  7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation 
>> monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running, 
>> flush delayed
>> May  7 02:36:16 node1 crmd[16836]: info: lrm_remove_deleted_op: Removing 
>> op testfs-resource1_monitor_0:8 for deleted resource testfs-resource1
>> 
>> So apparently a badly timed cleanup was run.
> 
> :-(  I didn't know there could be such timing problems.  I might have to
> change my process a bit then perhaps.
> 
>> Did you do that or was it the crm shell?
> 
> That was "me" doing a "crm resource cleanup" (soon to become
> "crm_resource -r ... --cleanup").  The process is typically:
> 
> - create resource
> - start resource
> - wait for resource to start
> 
> where "start resource" is:
> - "clean it to start with a known clean resource"
>  (crm resource cleanup)
> - "start resource"
>  (crm_resource -r ... -p target-role -m -v Started)
> 
> and "wait for resource" is a loop of "crm resource status ..." (soon to
> be "crm_resource -r ... --locate")
> 
> So the create, clean, start operations happen in quite quick succession
> (i.e. scripted).  Is that pathological?  Is a clean between create and
> start known to be problematic?

It's certainly known to be unnecessary.
In some older versions it is also problematic.

> 
> FWIW, the reason for the clean before the start, even after just creating
> the resource, is that "clean" and "start" are lumped together into a
> function that is called after create but can also be called at other
> times during the life-cycle, so it might be necessary to clean a resource
> before trying to start it.  I was hoping that cleaning a just-created
> resource would effectively be a no-op.

It's never a no-op, and at that particular point the cluster is trying to
discover the status of the resource.
Running a clean in the middle of that interferes with this.

> 
> I guess for completeness, I should add here that creating the resource
> is a "cibadmin -o resource -C ..." operation.
> 
>> If the machine is heavily loaded, or just very busy with file I/O, that can 
>> still take quite a long time.
> 
> Yeah, not very loaded at all, especially at this point.  This is all
> happening before anything really gets started on the machine... this is
> the process of getting the resources up and running and the machine is
> dedicated to running the tasks associated with these resources.
> 
> Cheers,
> b.
> 




Re: [Pacemaker] resource starts but then fails right away

2013-05-10 Thread Brian J. Murrell
On 13-05-09 09:53 PM, Andrew Beekhof wrote:
> 
> May  7 02:36:16 node1 crmd[16836]: info: delete_resource: Removing 
> resource testfs-resource1 for 18002_crm_resource (internal) on node1
> May  7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation 
> monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running, 
> flush delayed
> May  7 02:36:16 node1 crmd[16836]: info: lrm_remove_deleted_op: Removing 
> op testfs-resource1_monitor_0:8 for deleted resource testfs-resource1
> 
> So apparently a badly timed cleanup was run.

:-(  I didn't know there could be such timing problems.  I might have to
change my process a bit then perhaps.

> Did you do that or was it the crm shell?

That was "me" doing a "crm resource cleanup" (soon to become
"crm_resource -r ... --cleanup").  The process is typically:

- create resource
- start resource
- wait for resource to start

where "start resource" is:
- "clean it to start with a known clean resource"
  (crm resource cleanup)
- "start resource"
  (crm_resource -r ... -p target-role -m -v Started)

and "wait for resource" is a loop of "crm resource status ..." (soon to
be "crm_resource -r ... --locate")

So the create, clean, start operations happen in quite quick succession
(i.e. scripted).  Is that pathological?  Is a clean between create and
start known to be problematic?
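
To make that concrete, the script does roughly the following (a simplified
sketch, not the actual code; the resource name, $RESOURCE_XML and the
polling interval are placeholders):

RSC=testfs-resource1                                # placeholder resource name
cibadmin -o resources -C -X "$RESOURCE_XML"         # create ($RESOURCE_XML = the primitive definition)
crm resource cleanup $RSC                           # clean
crm_resource -r $RSC -p target-role -m -v Started   # start
# wait: poll until the resource is reported as running somewhere
until crm resource status $RSC 2>&1 | grep -q "is running on"; do
    sleep 1
done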

FWIW, the reason for the clean before the start, even after just creating
the resource, is that "clean" and "start" are lumped together into a
function that is called after create but can also be called at other
times during the life-cycle, so it might be necessary to clean a resource
before trying to start it.  I was hoping that cleaning a just-created
resource would effectively be a no-op.

I guess for completeness, I should add here that creating the resource
is a "cibadmin -o resource -C ..." operation.
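
i.e. something along these lines (the XML here is only an illustration of
the shape; the real primitive obviously has its own attributes, operations
and instance parameters):

cibadmin -o resources -C -X \
  '<primitive id="testfs-resource1" class="ocf" provider="foo" type="Target"/>'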

> If the machine is heavily loaded, or just very busy with file I/O, that can 
> still take quite a long time.

Yeah, not very loaded at all, especially at this point.  This is all
happening before anything really gets started on the machine... this is
the process of getting the resources up and running and the machine is
dedicated to running the tasks associated with these resources.

Cheers,
b.






Re: [Pacemaker] resource starts but then fails right away

2013-05-09 Thread Andrew Beekhof

On 10/05/2013, at 12:26 AM, Brian J. Murrell  wrote:

> I do see the:
> 
> May  7 02:37:32 node1 crmd[16836]:error: print_elem: Aborting transition, 
> action lost: [Action 5]: In-flight (id: testfs-resource1_monitor_0, loc: 
> node1, priority: 0)
> 
> in the log.  Is that the root cause of the problem?  

Ordinarily I'd have said yes, but I also see:

May  7 02:36:16 node1 crmd[16836]: info: delete_resource: Removing resource 
testfs-resource1 for 18002_crm_resource (internal) on node1
May  7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation 
monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running, 
flush delayed
May  7 02:36:16 node1 crmd[16836]: info: lrm_remove_deleted_op: Removing op 
testfs-resource1_monitor_0:8 for deleted resource testfs-resource1

So apparently a badly timed cleanup was run.  Did you do that or was it the crm 
shell?

> If so, what's that
> trying to tell me, exactly?  If not, what is the cause of the problem?
> 
> It really can't be the RA timing out since I give the monitor operation
> a 60 second timeout and the status action of the RA only takes a few
> seconds at most to run and is not really an operation that can get
> blocked on anything.  It's effectively the grepping of a file.

If the machine is heavily loaded, or just very busy with file I/O, that can 
still take quite a long time.
I've seen IPaddr monitor actions take over a minute for example.
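
(For reference, the 60-second timeout being discussed is the per-operation
one on the monitor op, i.e. something like the following in crm shell
configure syntax, where the interval shown is only an example:

primitive testfs-resource1 ocf:foo:Target \
    op monitor interval=20s timeout=60s
)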




[Pacemaker] resource starts but then fails right away

2013-05-09 Thread Brian J. Murrell
Using Pacemaker 1.1.7 on EL6.3, I'm getting an intermittent recurrence
of a situation where I add a resource and start it, and it seems to
start but then fails right away, i.e.:

# clean up resource before trying to start, just to make sure we start with a
# clean slate
# crm resource cleanup testfs-resource1
Cleaning up testfs-resource1 on node1

Waiting for 2 replies from the CRMd.. OK

# now try to start it
# crm_resource -r testfs-resource1 -p target-role -m -v Started

# monitor the start-up for success
# crm resource status testfs-resource1

resource testfs-resource1 is NOT running

# crm resource status testfs-resource1

resource testfs-resource1 is NOT running

# crm resource status testfs-resource1

resource testfs-resource1 is NOT running

...

# crm resource status testfs-resource1

resource testfs-resource1 is NOT running

# crm resource status testfs-resource1

resource testfs-resource1 is NOT running

# crm resource status testfs-resource1
resource testfs-resource1 is running on: node1

# it started.  check once more:

# crm status

Last updated: Tue May  7 02:37:34 2013
Last change: Tue May  7 02:36:17 2013 via crm_resource on node1
Stack: openais
Current DC: node1 - partition WITHOUT quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
1 Nodes configured, 2 expected votes
3 Resources configured.


Online: [ node1 ]

 st-fencing (stonith:fence_foo):Started node1
 resource2  (ocf::foo:Target):  Started node1
 testfs-resource1   (ocf::foo:Target):  Started node1 FAILED

Failed actions:
testfs-resource1_monitor_0 (node=node1, call=-1, rc=1, status=Timed Out): 
unknown error

# but lo and behold, it failed, with a monitor operation failing.

# stop it
# crm_resource -r testfs-resource1 -p target-role -m -v Stopped: 0

The syslog for this whole operation, starting with adding the resource,
is as follows:

May  7 02:36:12 node1 cib[16831]: info: cib:diff: - 
May  7 02:36:12 node1 crmd[16836]: info: abort_transition_graph: 
te_update_diff:126 - Triggered transition abort (complete=1, tag=diff, 
id=(null), magic=NA, cib=0.16.1) : Non-status change
May  7 02:36:12 node1 crmd[16836]:   notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
[... further cib:diff lines from 02:36:12 whose XML content was stripped in the archive ...]
May  7 02:36:12 node1 cib[16831]: info: cib_process_request: Operation 
complete: op cib_create for section resources (origin=local/cibadmin/2, 
version=0.16.1): ok (rc=0)
May  7 02:36:12 node1 pengine[16835]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
May  7 02:36:12 node1 crmd[16836]:   notice: do_state_transition: State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
May  7 02:36:12 node1 crmd[16836]: info: do_te_invoke: Processing graph 13 
(ref=pe_calc-dc-1367894172-72) derived from /var/lib/pengine/pe-input-124.bz2
May  7 02:36:12 node1 crmd[16836]: info: te_rsc_command: Initiating action 
5: monitor testfs-resource1_monitor_0 on node1 (local)
May  7 02:36:12 node1 lrmd: [16833]: info: rsc:testfs-resource1:8: probe
May  7 02:36:12 node1 pengine[16835]:   notice: process_pe_message: Transition 
13: PEngine Input stored in: /var/lib/pengine/pe-input-124.bz2
May  7 02:36:14 node1 crmd[16836]: info: abort_transition_graph: 
te_update_diff:126 - Triggered transition abort (complete=0, tag=diff, 
id=(null), magic=NA, cib=0.17.1) : Non-status change
May  7 02:36:14 node1 cib[16831]: info: cib:diff: - 
May  7 02:36:14 node1 cib[16831]: info: cib:diff: + 
May  7 02:36:14 node1 cib[16831]: info: cib:diff: +   
May  7 02:36:14 node1 cib[16831]: info: cib:diff: + 
May  7 02:36:14 node1 cib[16831]: info: ci