Re: [ClusterLabs] op stop timeout update causes monitor op to fail?
On 11.09.19 16:51, Ken Gaillot wrote:
> On Tue, 2019-09-10 at 09:54 +0200, Dennis Jacobfeuerborn wrote:
>> Hi,
>> I just updated the timeout for the stop operation on an nfs cluster
>> and while the timeout was updated the status suddenly showed this:
>>
>> Failed Actions:
>> * nfsserver_monitor_1 on nfs1aqs1 'unknown error' (1): call=41,
>> status=Timed Out, exitreason='none',
>> last-rc-change='Tue Aug 13 14:14:28 2019', queued=0ms, exec=0ms
>
> Are you sure it wasn't already showing that? The timestamp of that
> error is Aug 13, while the logs show the timeout update happening Sep 10.

I'm fairly certain. I did a "pcs status" before that operation to check
the state of the cluster.

> Old errors will keep showing up in status until you manually clean them
> up (with crm_resource --cleanup or a higher-level tool equivalent), or
> until any configured failure-timeout is reached.
>
> In any case, the log excerpt shows that nothing went wrong during the
> time it covers. There were no actions scheduled in that transition in
> response to the timeout change (which is as expected).

What about this line:

pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for nfsserver on nfs1aqs1: unknown error (1)

I cleaned up the error and tried again, and this time it worked. The
corresponding line in the log now reads:

pengine: info: determine_op_status: Operation monitor found resource nfsserver active on nfs1aqs1

What I'm wondering is whether this could be a race between pacemaker
updating the resource configuration and the monitor operation.

Regards,
Dennis
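For reference, the cleanup step mentioned above can be done with either
the low-level tool or its pcs equivalent (resource and node names as
used in this thread):

    crm_resource --cleanup --resource nfsserver --node nfs1aqs1
    # or, via pcs:
    pcs resource cleanup nfsserver

Both clear the failure record and the failcount, after which the cluster
re-probes the resource's current state.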
Re: [ClusterLabs] op stop timeout update causes monitor op to fail?
On Tue, 2019-09-10 at 09:54 +0200, Dennis Jacobfeuerborn wrote:
> Hi,
> I just updated the timeout for the stop operation on an nfs cluster
> and while the timeout was updated the status suddenly showed this:
>
> Failed Actions:
> * nfsserver_monitor_1 on nfs1aqs1 'unknown error' (1): call=41,
> status=Timed Out, exitreason='none',
> last-rc-change='Tue Aug 13 14:14:28 2019', queued=0ms, exec=0ms

Are you sure it wasn't already showing that? The timestamp of that
error is Aug 13, while the logs show the timeout update happening Sep 10.

Old errors will keep showing up in status until you manually clean them
up (with crm_resource --cleanup or a higher-level tool equivalent), or
until any configured failure-timeout is reached.

In any case, the log excerpt shows that nothing went wrong during the
time it covers. There were no actions scheduled in that transition in
response to the timeout change (which is as expected).

> The command used:
> pcs resource update nfsserver op stop timeout=30s
>
> I can't imagine that this is expected to happen. Is there another way
> to update the timeout that doesn't cause this?
>
> I attached the log of the transition.
>
> Regards,
> Dennis
-- 
Ken Gaillot
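For reference, the failure-timeout mentioned above is a resource
meta-attribute; a minimal example of setting it with pcs (the 10-minute
value is purely illustrative):

    pcs resource meta nfsserver failure-timeout=10min

Once the resource goes that long without a new failure, the failure
record expires from status on its own.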
[ClusterLabs] op stop timeout update causes monitor op to fail?
Hi,
I just updated the timeout for the stop operation on an nfs cluster and
while the timeout was updated the status suddenly showed this:

Failed Actions:
* nfsserver_monitor_1 on nfs1aqs1 'unknown error' (1): call=41,
status=Timed Out, exitreason='none',
last-rc-change='Tue Aug 13 14:14:28 2019', queued=0ms, exec=0ms

The command used:
pcs resource update nfsserver op stop timeout=30s

I can't imagine that this is expected to happen. Is there another way
to update the timeout that doesn't cause this?

I attached the log of the transition.

Regards,
Dennis

Sep 10 09:39:29 [2378] nfs1a-qs1 cib: info: cib_process_request: Forwarding cib_replace operation for section configuration to all (origin=local/cibadmin/2)
Sep 10 09:39:29 [2378] nfs1a-qs1 cib: info: cib_perform_op: Diff: --- 0.76.14 2
Sep 10 09:39:29 [2378] nfs1a-qs1 cib: info: cib_perform_op: Diff: +++ 0.77.0 8b73092b4ee9744fc4eaff60f8ba8388
Sep 10 09:39:29 [2378] nfs1a-qs1 cib: info: cib_perform_op: + /cib: @epoch=77, @num_updates=0
Sep 10 09:39:29 [2378] nfs1a-qs1 cib: info: cib_perform_op: + /cib/configuration/resources/primitive[@id='nfsserver']/operations/op[@id='nfsserver-stop-interval-0s']: @timeout=30s
Sep 10 09:39:29 [2378] nfs1a-qs1 cib: info: cib_perform_op: ++ /cib/configuration/resources/primitive[@id='nfsserver']:
Sep 10 09:39:29 [2378] nfs1a-qs1 cib: info: cib_process_request: Completed cib_replace operation for section configuration: OK (rc=0, origin=nfs1aqs1/cibadmin/2, version=0.77.0)
Sep 10 09:39:29 [2383] nfs1a-qs1 crmd: info: abort_transition_graph: Transition aborted by op.nfsserver-stop-interval-0s 'modify': Configuration change | cib=0.77.0 source=te_update_diff:456 path=/cib/configuration/resources/primitive[@id='nfsserver']/operations/op[@id='nfsserver-stop-interval-0s'] complete=true
Sep 10 09:39:29 [2383] nfs1a-qs1 crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: determine_online_status: Node nfs1bqs1 is online
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: determine_online_status: Node nfs1aqs1 is online
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for nfsserver on nfs1aqs1: unknown error (1)
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: unpack_node_loop: Node 2 is already processed
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: unpack_node_loop: Node 1 is already processed
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: unpack_node_loop: Node 2 is already processed
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: unpack_node_loop: Node 1 is already processed
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: clone_print: Master/Slave Set: drbd-clone [drbd]
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: short_print: Masters: [ nfs1aqs1 ]
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: short_print: Slaves: [ nfs1bqs1 ]
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: common_print: metadata-fs (ocf::heartbeat:Filesystem): Started nfs1aqs1
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: common_print: medias-fs (ocf::heartbeat:Filesystem): Started nfs1aqs1
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: common_print: nfsserver (ocf::heartbeat:nfsserver): Started nfs1aqs1
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: common_print: vip (ocf::heartbeat:IPaddr2): Started nfs1aqs1
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: get_failcount_full: nfsserver has failed 1 times on nfs1aqs1
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: check_migration_threshold: nfsserver can fail 99 more times on nfs1aqs1 before being forced off
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: master_color: Promoting drbd:1 (Master nfs1aqs1)
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: master_color: drbd-clone: Promoted 1 instances of a possible 1 to master
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: LogActions: Leave drbd:0 (Slave nfs1bqs1)
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: LogActions: Leave drbd:1 (Master nfs1aqs1)
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: LogActions: Leave metadata-fs (Started nfs1aqs1)
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: LogActions: Leave medias-fs (Started nfs1aqs1)
Sep 10 09:39:29 [2382] nfs1a-qs1 pengine: info: LogActions: Leave nfsserver (Started nfs1aqs1)
Sep 10 09:39:29 [2382] nfs1a-qs1
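For reference, one way to check whether a failure shown in status
predates a configuration change is to compare its last-rc-change
timestamp against the stored failcount, which can be queried directly
(resource and node names as in this thread):

    crm_failcount --query --resource nfsserver --node nfs1aqs1
    # or, via pcs:
    pcs resource failcount show nfsserver

A non-zero count together with an old last-rc-change points to a stale
failure rather than one caused by the update itself.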