Re: [ClusterLabs] Resource start and stop run into timeout

2020-06-05 Thread joerg
Thank you! This is my issue. I cannot find anything on bugzilla, so let's hope that there will be an update soon.Jörg___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Resource start and stop run into timeout

2020-06-05 Thread Ken Gaillot
On Fri, 2020-06-05 at 13:06 +0200, jo...@die-kilians.eu wrote:
> Dear all,
> 
> I have an issue on my cluster running Pacemaker 2.0.3.
>  From time to time my resources fail to start or fail to stop. The 
> cluster runs into an timeout for the command.

This is a known bug in 2.0.3 when it is built with
-UPCMK_TIME_EMERGENCY_CGT in the C flags, which is not the default.
Fedora 31 and an Ubuntu release are known to have used that option (bug
reports are open for both distributions, and hopefully updated packages
will be available before long).

The only fix is to use a different pacemaker version on those
platforms.

> Between the attempt to start the resource and the timeout is only 1 
> second (between Jun 04 13:48:37 and Jun 04 13:48:38).
> 
> The resource is defined like this:
>   Resource: nmb (class=systemd type=nmb)
> Operations: monitor interval=60s timeout=100s 
> (nmb-monitor-interval-60s)
> start interval=0s timeout=600s (nmb-start-
> interval-0s)
> stop interval=0s timeout=600s (nmb-stop-interval-
> 0s)
> 
> So the timeout should be 600 seconds. The cluster states that the 
> timeout has reached (timeout (elapsed=647561ms, remaining=-
> 47561ms)), 
> but obviously only 1 second has passed.
> 
> Do you have any hint for me ?
> 
> Below is a log from that time.
> 
> Thanks
> 
> Log:
> Jun 04 13:48:37 node1 pacemaker-controld  [1162]
> (process_graph_event)   
>   info: Transition 641 action 31 (smb_stop_0 on node1) confirmed:
> ok 
> > rc=0 call-id=896
> 
> Jun 04 13:48:37 node1 pacemaker-controld  [1162]
> (te_rsc_command) 
> notice: Initiating stop operation nmb_stop_0 locally on node1 |
> action 
> 29
> Jun 04 13:48:37 node1 pacemaker-execd [1159] 
> (cancel_recurring_action)info: Cancelling systemd operation 
> nmb_status_6
> Jun 04 13:48:37 node1 pacemaker-based [1157]
> (cib_process_request)   
>   info: Completed cib_modify operation for section status: OK
> (rc=0, 
> origin=node1/crmd/1383, version=1.141.45)
> Jun 04 13:48:37 node1 pacemaker-controld  [1162]
> (do_lrm_rsc_op)  
> info: Performing key=29:641:0:dbf25643-c4e8-4374-8aec-3c9e3ea3d397 
> op=nmb_stop_0
> Jun 04 13:48:37 node1 pacemaker-execd [1159]
> (log_execute)
> info: executing - rsc:nmb action:stop call_id:909
> Jun 04 13:48:37 node1 pacemaker-controld  [1162]
> (process_lrm_event)  
> info: Result of monitor operation for nmb on node1: Cancelled |
> call=329 
> key=nmb_monitor_6 confirmed=true
> Jun 04 13:48:37 node1 pacemaker-controld  [1162]
> (exec_alert_list)
> info: Sending resource alert via alert to (null)
> Jun 04 13:48:37 node1 pacemaker-based [1157]
> (cib_process_request)   
>   info: Forwarding cib_modify operation for section status to
> all 
> (origin=local/crmd/1384)
> Jun 04 13:48:37 node1 pacemaker-execd [1159] 
> (process_lrmd_alert_exec)info: Executing alert alert for 
> 8f105ac8-de01-4d3b-9881-10963acd7002
> Jun 04 13:48:37 node1 pacemaker-based [1157]
> (cib_perform_op) 
> info: Diff: --- 1.141.45 2
> Jun 04 13:48:37 node1 pacemaker-based [1157]
> (cib_perform_op) 
> info: Diff: +++ 1.141.46 (null)
> Jun 04 13:48:37 node1 pacemaker-based [1157]
> (cib_perform_op) 
> info: +  /cib:  @num_updates=46
> Jun 04 13:48:37 node1 pacemaker-based [1157]
> (cib_perform_op) 
> info: +  
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resour
> ce[@id='nmb']/lrm_rsc_op[@id='nmb_last_0']: 
>   @operation_key=nmb_stop_0, @operation=stop, 
> @crm-debug-origin=do_update_resource, 
> @transition-key=29:641:0:dbf25643-c4e8-4374-8aec-3c9e3ea3d397, 
> @transition-magic=-1:193;29:641:0:dbf25643-c4e8-4374-8aec-
> 3c9e3ea3d397, 
> @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-
> change=1591271317, 
> @last-run=1591271317, @e
> Jun 04 13:48:37 node1 pacemaker-based [1157]
> (cib_process_request)   
>   info: Completed cib_modify operation for section status: OK
> (rc=0, 
> origin=node1/crmd/1384, version=1.141.46)
> Jun 04 13:48:38 node1 pacemaker-execd [1159]
> (systemd_exec_result)   
>   info: Call to stop passed: /org/freedesktop/systemd1/job/189965
> Jun 04 13:48:38 node1 pacemaker-execd [1159]
> (action_complete)
> notice: Giving up on nmb stop (rc=0): timeout (elapsed=647561ms, 
> remaining=-47561ms)
> Jun 04 13:48:38 node1 pacemaker-based [1157]
> (cib_process_request)   
>   info: Forwarding cib_modify operation for section status to
> all 
> (origin=local/crmd/1385)
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Resource start and stop run into timeout

2020-06-05 Thread joerg

Dear all,

I have an issue on my cluster running Pacemaker 2.0.3.
From time to time my resources fail to start or fail to stop. The 
cluster runs into an timeout for the command.
Between the attempt to start the resource and the timeout is only 1 
second (between Jun 04 13:48:37 and Jun 04 13:48:38).


The resource is defined like this:
Resource: nmb (class=systemd type=nmb)
	  Operations: monitor interval=60s timeout=100s 
(nmb-monitor-interval-60s)

  start interval=0s timeout=600s (nmb-start-interval-0s)
  stop interval=0s timeout=600s (nmb-stop-interval-0s)

So the timeout should be 600 seconds. The cluster states that the 
timeout has reached (timeout (elapsed=647561ms, remaining=-47561ms)), 
but obviously only 1 second has passed.


Do you have any hint for me ?

Below is a log from that time.

Thanks

Log:
Jun 04 13:48:37 node1 pacemaker-controld  [1162] (process_graph_event)   
 info: Transition 641 action 31 (smb_stop_0 on node1) confirmed: ok 
| rc=0 call-id=896
Jun 04 13:48:37 node1 pacemaker-controld  [1162] (te_rsc_command) 
notice: Initiating stop operation nmb_stop_0 locally on node1 | action 
29
Jun 04 13:48:37 node1 pacemaker-execd [1159] 
(cancel_recurring_action)info: Cancelling systemd operation 
nmb_status_6
Jun 04 13:48:37 node1 pacemaker-based [1157] (cib_process_request)   
 info: Completed cib_modify operation for section status: OK (rc=0, 
origin=node1/crmd/1383, version=1.141.45)
Jun 04 13:48:37 node1 pacemaker-controld  [1162] (do_lrm_rsc_op)  
info: Performing key=29:641:0:dbf25643-c4e8-4374-8aec-3c9e3ea3d397 
op=nmb_stop_0
Jun 04 13:48:37 node1 pacemaker-execd [1159] (log_execute)
info: executing - rsc:nmb action:stop call_id:909
Jun 04 13:48:37 node1 pacemaker-controld  [1162] (process_lrm_event)  
info: Result of monitor operation for nmb on node1: Cancelled | call=329 
key=nmb_monitor_6 confirmed=true
Jun 04 13:48:37 node1 pacemaker-controld  [1162] (exec_alert_list)
info: Sending resource alert via alert to (null)
Jun 04 13:48:37 node1 pacemaker-based [1157] (cib_process_request)   
 info: Forwarding cib_modify operation for section status to all 
(origin=local/crmd/1384)
Jun 04 13:48:37 node1 pacemaker-execd [1159] 
(process_lrmd_alert_exec)info: Executing alert alert for 
8f105ac8-de01-4d3b-9881-10963acd7002
Jun 04 13:48:37 node1 pacemaker-based [1157] (cib_perform_op) 
info: Diff: --- 1.141.45 2
Jun 04 13:48:37 node1 pacemaker-based [1157] (cib_perform_op) 
info: Diff: +++ 1.141.46 (null)
Jun 04 13:48:37 node1 pacemaker-based [1157] (cib_perform_op) 
info: +  /cib:  @num_updates=46
Jun 04 13:48:37 node1 pacemaker-based [1157] (cib_perform_op) 
info: +  
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='nmb']/lrm_rsc_op[@id='nmb_last_0']: 
 @operation_key=nmb_stop_0, @operation=stop, 
@crm-debug-origin=do_update_resource, 
@transition-key=29:641:0:dbf25643-c4e8-4374-8aec-3c9e3ea3d397, 
@transition-magic=-1:193;29:641:0:dbf25643-c4e8-4374-8aec-3c9e3ea3d397, 
@call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1591271317, 
@last-run=1591271317, @e
Jun 04 13:48:37 node1 pacemaker-based [1157] (cib_process_request)   
 info: Completed cib_modify operation for section status: OK (rc=0, 
origin=node1/crmd/1384, version=1.141.46)
Jun 04 13:48:38 node1 pacemaker-execd [1159] (systemd_exec_result)   
 info: Call to stop passed: /org/freedesktop/systemd1/job/189965
Jun 04 13:48:38 node1 pacemaker-execd [1159] (action_complete)
notice: Giving up on nmb stop (rc=0): timeout (elapsed=647561ms, 
remaining=-47561ms)
Jun 04 13:48:38 node1 pacemaker-based [1157] (cib_process_request)   
 info: Forwarding cib_modify operation for section status to all 
(origin=local/crmd/1385)

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/