Hi, I'm experiencing a timeout on a demote operation, and I'm not sure which parameter or attribute needs to be updated to extend the timeout window.
I'm using Pacemaker 1.1.16 and Corosync 2.4.2. Here is the set of log lines that shows the issue (shutdown initiated, then the demote times out after 20 seconds):

--snip--
Jan 10 09:08:13 tgtnode2 pacemakerd[1096]: notice: Caught 'Terminated' signal
Jan 10 09:08:13 tgtnode2 crmd[1104]: notice: Caught 'Terminated' signal
Jan 10 09:08:13 tgtnode2 crmd[1104]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 10 09:08:13 tgtnode2 pengine[1103]: notice: Scheduling Node tgtnode2.parodyne.com for shutdown
Jan 10 09:08:13 tgtnode2 pengine[1103]: notice: Promote p_scst_zfs_vols:0^I(Slave -> Master tgtnode1.parodyne.com)
Jan 10 09:08:13 tgtnode2 pengine[1103]: notice: Demote p_scst_zfs_vols:1^I(Master -> Stopped tgtnode2.parodyne.com)
Jan 10 09:08:13 tgtnode2 pengine[1103]: notice: Stop p_dlm:1^I(tgtnode2.parodyne.com)
Jan 10 09:08:13 tgtnode2 pengine[1103]: notice: Migrate p_dummy_g_zfs^I(Started tgtnode2.parodyne.com -> tgtnode1.parodyne.com)
Jan 10 09:08:13 tgtnode2 pengine[1103]: notice: Move p_zfs_pool_one^I(Started tgtnode2.parodyne.com -> tgtnode1.parodyne.com)
Jan 10 09:08:13 tgtnode2 pengine[1103]: notice: Calculated transition 3, saving inputs in /var/lib/pacemaker/pengine/pe-input-1441.bz2
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17449]: DEBUG: scst_notify() -> Received a 'pre' / 'demote' notification.
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17449]: DEBUG: p_scst_zfs_vols notify returned: 0
Jan 10 09:08:13 tgtnode2 crmd[1104]: notice: Result of notify operation for p_scst_zfs_vols on tgtnode2.parodyne.com: 0 (ok)
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_monitor() -> SCST version: 3.3.0-rc
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_monitor() -> Resource is running.
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_monitor() -> SCST local target group state: active
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_demote() -> Resource is currently running as Master.
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: INFO: Blocking all 'zfs_vols' devices...
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: Waiting for devices to finish blocking...
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_demote() -> Setting target group 'zfs_vols_local' ALUA state to 'transitioning'...
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: INFO: Collecting current configuration: done. -> Making requested changes. -> Setting Target Group attribute 'state' to value 'transitioning' for target group 'zfs_vols/zfs_vols_local': done. -> Done, 1 change(s) made. All done.
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_demote() -> Setting target group 'zfs_vols_local' ALUA state to 'unavailable'...
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: INFO: Collecting current configuration: done. -> Making requested changes. -> Setting Target Group attribute 'state' to value 'unavailable' for target group 'zfs_vols/zfs_vols_local': done. -> Done, 1 change(s) made. All done.
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_demote() -> Changing the group's devices to inactive...
Jan 10 09:08:33 tgtnode2 lrmd[1101]: warning: p_scst_zfs_vols_demote_0 process (PID 17473) timed out
Jan 10 09:08:33 tgtnode2 crmd[1104]: notice: Transition aborted by operation p_scst_zfs_vols_demote_0 'modify' on tgtnode2.parodyne.com: Event failed
Jan 10 09:08:33 tgtnode2 crmd[1104]: notice: Transition aborted by status-2-fail-count-p_scst_zfs_vols doing create fail-count-p_scst_zfs_vols=1: Transient attribute change
--snip--

So the demote operation times out after 20 seconds of waiting, as reported by this line:

Jan 10 09:08:33 tgtnode2 lrmd[1101]: warning: p_scst_zfs_vols_demote_0 process (PID 17473) timed out

The 20-second timeout is consistent across my tests, so I'm sure it's just a configuration issue, but it's not obvious to me which parameter, attribute, or setting needs to be modified.

The relevant metadata section from the RA referenced above:

--snip--
<actions>
  <action name="meta-data" timeout="5" />
  <action name="start" timeout="120" />
  <action name="stop" timeout="90" />
  <action name="monitor" timeout="20" depth="0" interval="10" role="Master" />
  <action name="monitor" timeout="20" depth="0" interval="20" role="Slave" />
  <action name="notify" timeout="20" />
  <action name="promote" timeout="60" />
  <action name="demote" timeout="60" />
  <action name="reload" timeout="20" />
  <action name="validate-all" timeout="20" />
</actions>
--snip--

And the actual cluster configuration of the primitive and clone (multi-state) for the referenced resource:

--snip--
primitive p_scst_zfs_vols ocf:esos:scst \
        params alua=true device_group=zfs_vols local_tgt_grp=zfs_vols_local remote_tgt_grp=zfs_vols_remote m_alua_state=active s_alua_state=unavailable use_trans_state=true set_dev_active=true \
        op monitor interval=10 role=Master \
        op monitor interval=20 role=Slave \
        op start interval=0 timeout=120 \
        op stop interval=0 timeout=90
ms ms_scst_zfs_vols p_scst_zfs_vols \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interleave=true
--snip--

I see a few values of 20 seconds in the RA's metadata action section, as well as in the interval parameter of the primitive's monitor operation, but I'm not sure which of them, if any, affects the demote timeout.

Any help would be greatly appreciated. Thanks so much for your time, and thank you for a great software product!

--Marc
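
P.S. One thing I may try in the meantime, purely as a guess: I'm assuming (not confirmed) that Pacemaker takes operation timeouts from the primitive's op definitions rather than from the RA metadata, and falls back to a 20-second cluster default for any operation that isn't listed. If that's right, declaring demote and promote explicitly on the primitive might extend the window. The 60-second values below are simply copied from the RA metadata above, not tuned, and this is an untested sketch:

```
# Untested sketch: same primitive as above, with explicit promote/demote ops added.
# Assumes the op timeout here overrides the 20-second default (my assumption).
primitive p_scst_zfs_vols ocf:esos:scst \
        params alua=true device_group=zfs_vols local_tgt_grp=zfs_vols_local remote_tgt_grp=zfs_vols_remote m_alua_state=active s_alua_state=unavailable use_trans_state=true set_dev_active=true \
        op monitor interval=10 role=Master \
        op monitor interval=20 role=Slave \
        op start interval=0 timeout=120 \
        op stop interval=0 timeout=90 \
        op promote interval=0 timeout=60 \
        op demote interval=0 timeout=60
```

If the demote then fails at 60 seconds instead of 20, that would at least confirm where the value comes from.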