Re: [ClusterLabs] op stop timeout update causes monitor op to fail?

2019-09-17 Thread Dennis Jacobfeuerborn
On 11.09.19 16:51, Ken Gaillot wrote:
> On Tue, 2019-09-10 at 09:54 +0200, Dennis Jacobfeuerborn wrote:
>> Hi,
>> I just updated the timeout for the stop operation on an NFS cluster,
>> and while the timeout was updated, the status suddenly showed this:
>>
>> Failed Actions:
>> * nfsserver_monitor_1 on nfs1aqs1 'unknown error' (1): call=41,
>> status=Timed Out, exitreason='none',
>> last-rc-change='Tue Aug 13 14:14:28 2019', queued=0ms, exec=0ms
> 
> Are you sure it wasn't already showing that? The timestamp of that
> error is Aug 13, while the logs show the timeout update happening Sep
> 10.

I'm fairly certain. I did a "pcs status" before that operation to check
the state of the cluster.

> 
> Old errors will keep showing up in status until you manually clean them
> up (with crm_resource --cleanup or a higher-level tool equivalent), or
> any configured failure-timeout is reached.
> 
> In any case, the log excerpt shows that nothing went wrong during the
> time it covers. There were no actions scheduled in that transition in
> response to the timeout change (which is as expected).

What about this line:
pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for nfsserver on nfs1aqs1: unknown error (1)

I cleaned up the error and tried this again, and this time it worked. The
corresponding line in the log now reads:
pengine: info: determine_op_status: Operation monitor found resource nfsserver active on nfs1aqs1
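
For reference, the cleanup and retry boil down to roughly the following sequence
(a sketch only, using the pcs wrapper; the exact invocations weren't captured here):

  pcs resource cleanup nfsserver                       # clear the stale failure record
  pcs resource update nfsserver op stop timeout=30s    # reapply the timeout change
  pcs status                                           # no Failed Actions reported this time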

What I'm wondering is whether this could be a race condition between Pacemaker
updating the resource configuration and the monitor operation running.

Regards,
  Dennis


Re: [ClusterLabs] op stop timeout update causes monitor op to fail?

2019-09-11 Thread Ken Gaillot
On Tue, 2019-09-10 at 09:54 +0200, Dennis Jacobfeuerborn wrote:
> Hi,
> I just updated the timeout for the stop operation on an NFS cluster,
> and while the timeout was updated, the status suddenly showed this:
> 
> Failed Actions:
> * nfsserver_monitor_1 on nfs1aqs1 'unknown error' (1): call=41,
> status=Timed Out, exitreason='none',
> last-rc-change='Tue Aug 13 14:14:28 2019', queued=0ms, exec=0ms

Are you sure it wasn't already showing that? The timestamp of that
error is Aug 13, while the logs show the timeout update happening Sep
10.
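
(One way to cross-check is to look at the recorded failure and its fail count
directly; as a sketch, assuming current crm_mon/pcs tooling:

  crm_mon --one-shot                       # Failed Actions with their last-rc-change timestamps
  pcs resource failcount show nfsserver    # fail count per node for the resource

An entry dated Aug 13 there would confirm the failure predates the timeout change.)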

Old errors will keep showing up in status until you manually clean them
up (with crm_resource --cleanup or a higher-level tool equivalent), or
any configured failure-timeout is reached.
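
For example (a sketch, adjust the resource and node names to your configuration):

  crm_resource --cleanup --resource nfsserver --node nfs1aqs1
  # or, with the higher-level tool:
  pcs resource cleanup nfsserver

and an automatic expiry can be configured via the failure-timeout meta attribute,
e.g. something like:

  pcs resource meta nfsserver failure-timeout=600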

In any case, the log excerpt shows that nothing went wrong during the
time it covers. There were no actions scheduled in that transition in
response to the timeout change (which is as expected).

> 
> The command used:
> pcs resource update nfsserver op stop timeout=30s
> 
> I can't imagine that this is expected to happen. Is there another way
> to update the timeout that doesn't cause this?
> 
> I attached the log of the transition.
> 
> Regards,
>   Dennis
-- 
Ken Gaillot 



[ClusterLabs] op stop timeout update causes monitor op to fail?

2019-09-10 Thread Dennis Jacobfeuerborn
Hi,
I just updated the timeout for the stop operation on an NFS cluster, and
while the timeout was updated, the status suddenly showed this:

Failed Actions:
* nfsserver_monitor_1 on nfs1aqs1 'unknown error' (1): call=41,
status=Timed Out, exitreason='none',
last-rc-change='Tue Aug 13 14:14:28 2019', queued=0ms, exec=0ms

The command used:
pcs resource update nfsserver op stop timeout=30s

I can't imagine that this is expected to happen. Is there another way to
update the timeout that doesn't cause this?

I attached the log of the transition.

Regards,
  Dennis
Sep 10 09:39:29 [2378] nfs1a-qs1 cib: info: cib_process_request: Forwarding cib_replace operation for section configuration to all (origin=local/cibadmin/2)
Sep 10 09:39:29 [2378] nfs1a-qs1 cib: info: cib_perform_op: Diff: --- 0.76.14 2
Sep 10 09:39:29 [2378] nfs1a-qs1 cib: info: cib_perform_op: Diff: +++ 0.77.0 8b73092b4ee9744fc4eaff60f8ba8388
Sep 10 09:39:29 [2378] nfs1a-qs1 cib: info: cib_perform_op: + /cib: @epoch=77, @num_updates=0
Sep 10 09:39:29 [2378] nfs1a-qs1 cib: info: cib_perform_op: + /cib/configuration/resources/primitive[@id='nfsserver']/operations/op[@id='nfsserver-stop-interval-0s']: @timeout=30s
Sep 10 09:39:29 [2378] nfs1a-qs1 cib: info: cib_perform_op: ++ /cib/configuration/resources/primitive[@id='nfsserver']:
Sep 10 09:39:29 [2378] nfs1a-qs1 cib: info: cib_process_request: Completed cib_replace operation for section configuration: OK (rc=0, origin=nfs1aqs1/cibadmin/2, version=0.77.0)
Sep 10 09:39:29 [2383] nfs1a-qs1 crmd: info: abort_transition_graph: Transition aborted by op.nfsserver-stop-interval-0s 'modify': Configuration change | cib=0.77.0 source=te_update_diff:456 path=/cib/configuration/resources/primitive[@id='nfsserver']/operations/op[@id='nfsserver-stop-interval-0s'] complete=true
Sep 10 09:39:29 [2383] nfs1a-qs1 crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: determine_online_status: Node nfs1bqs1 is online
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: determine_online_status: Node nfs1aqs1 is online
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for nfsserver on nfs1aqs1: unknown error (1)
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: unpack_node_loop: Node 2 is already processed
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: unpack_node_loop: Node 1 is already processed
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: unpack_node_loop: Node 2 is already processed
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: unpack_node_loop: Node 1 is already processed
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: clone_print: Master/Slave Set: drbd-clone [drbd]
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: short_print: Masters: [ nfs1aqs1 ]
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: short_print: Slaves: [ nfs1bqs1 ]
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: common_print: metadata-fs (ocf::heartbeat:Filesystem): Started nfs1aqs1
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: common_print: medias-fs (ocf::heartbeat:Filesystem): Started nfs1aqs1
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: common_print: nfsserver (ocf::heartbeat:nfsserver): Started nfs1aqs1
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: common_print: vip (ocf::heartbeat:IPaddr2): Started nfs1aqs1
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: get_failcount_full: nfsserver has failed 1 times on nfs1aqs1
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: check_migration_threshold: nfsserver can fail 99 more times on nfs1aqs1 before being forced off
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: master_color: Promoting drbd:1 (Master nfs1aqs1)
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: master_color: drbd-clone: Promoted 1 instances of a possible 1 to master
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: LogActions: Leave drbd:0 (Slave nfs1bqs1)
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: LogActions: Leave drbd:1 (Master nfs1aqs1)
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: LogActions: Leave metadata-fs (Started nfs1aqs1)
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: LogActions: Leave medias-fs (Started nfs1aqs1)
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: LogActions: Leave nfsserver (Started nfs1aqs1)
Sep 10 09:39:29 [2382] nfs1a-qs1