Dear all,

I have an issue on my cluster running Pacemaker 2.0.3.
From time to time my resources fail to start or fail to stop. The cluster runs into an timeout for the command. Between the attempt to start the resource and the timeout is only 1 second (between Jun 04 13:48:37 and Jun 04 13:48:38).

The resource is defined like this:
        Resource: nmb (class=systemd type=nmb)
Operations: monitor interval=60s timeout=100s (nmb-monitor-interval-60s)
                      start interval=0s timeout=600s (nmb-start-interval-0s)
                      stop interval=0s timeout=600s (nmb-stop-interval-0s)

So the timeout should be 600 seconds. The cluster states that the timeout has reached (timeout (elapsed=647561ms, remaining=-47561ms)), but obviously only 1 second has passed.

Do you have any hint for me ?

Below is a log from that time.

Thanks

Log:
Jun 04 13:48:37 node1 pacemaker-controld [1162] (process_graph_event) info: Transition 641 action 31 (smb_stop_0 on node1) confirmed: ok | rc=0 call-id=896 Jun 04 13:48:37 node1 pacemaker-controld [1162] (te_rsc_command) notice: Initiating stop operation nmb_stop_0 locally on node1 | action 29 Jun 04 13:48:37 node1 pacemaker-execd [1159] (cancel_recurring_action) info: Cancelling systemd operation nmb_status_60000 Jun 04 13:48:37 node1 pacemaker-based [1157] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=node1/crmd/1383, version=1.141.45) Jun 04 13:48:37 node1 pacemaker-controld [1162] (do_lrm_rsc_op) info: Performing key=29:641:0:dbf25643-c4e8-4374-8aec-3c9e3ea3d397 op=nmb_stop_0 Jun 04 13:48:37 node1 pacemaker-execd [1159] (log_execute) info: executing - rsc:nmb action:stop call_id:909 Jun 04 13:48:37 node1 pacemaker-controld [1162] (process_lrm_event) info: Result of monitor operation for nmb on node1: Cancelled | call=329 key=nmb_monitor_60000 confirmed=true Jun 04 13:48:37 node1 pacemaker-controld [1162] (exec_alert_list) info: Sending resource alert via alert to (null) Jun 04 13:48:37 node1 pacemaker-based [1157] (cib_process_request) info: Forwarding cib_modify operation for section status to all (origin=local/crmd/1384) Jun 04 13:48:37 node1 pacemaker-execd [1159] (process_lrmd_alert_exec) info: Executing alert alert for 8f105ac8-de01-4d3b-9881-10963acd7002 Jun 04 13:48:37 node1 pacemaker-based [1157] (cib_perform_op) info: Diff: --- 1.141.45 2 Jun 04 13:48:37 node1 pacemaker-based [1157] (cib_perform_op) info: Diff: +++ 1.141.46 (null) Jun 04 13:48:37 node1 pacemaker-based [1157] (cib_perform_op) info: + /cib: @num_updates=46 Jun 04 13:48:37 node1 pacemaker-based [1157] (cib_perform_op) info: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='nmb']/lrm_rsc_op[@id='nmb_last_0']: @operation_key=nmb_stop_0, @operation=stop, @crm-debug-origin=do_update_resource, @transition-key=29:641:0:dbf25643-c4e8-4374-8aec-3c9e3ea3d397, @transition-magic=-1:193;29:641:0:dbf25643-c4e8-4374-8aec-3c9e3ea3d397, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1591271317, @last-run=1591271317, @e Jun 04 13:48:37 node1 pacemaker-based [1157] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=node1/crmd/1384, version=1.141.46) Jun 04 13:48:38 node1 pacemaker-execd [1159] (systemd_exec_result) info: Call to stop passed: /org/freedesktop/systemd1/job/189965 Jun 04 13:48:38 node1 pacemaker-execd [1159] (action_complete) notice: Giving up on nmb stop (rc=0): timeout (elapsed=647561ms, remaining=-47561ms) Jun 04 13:48:38 node1 pacemaker-based [1157] (cib_process_request) info: Forwarding cib_modify operation for section status to all (origin=local/crmd/1385)
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Reply via email to