01.11.2017 17:20, Ken Gaillot wrote:
On Sat, 2017-10-28 at 01:11 +0800, lkxjtu wrote:

Thank you for your response! This means that there shouldn't be a long
"sleep" in the OCF script. If my service takes 10 minutes from service
start until the healthcheck passes, then what should I do?

That is a tough situation with no great answer.

You can leave it as it is, and live with the delay. Note that it only
happens if a resource fails after the slow resource has already begun
starting ... if they fail at the same time (as with a node failure),
the cluster will schedule recovery for both at the same time.

Another possibility would be to have the start return immediately, and
make the monitor artificially return success for the first 10 minutes
after starting. It's hacky, and it depends on your situation whether
the behavior is acceptable. My first thought on how to implement this
would be to have the start action set a private node attribute
(attrd_updater -p) with a timestamp. When the monitor runs, it could do
its usual check, and if it succeeds, remove that node attribute, but if
it fails, check the node attribute to see whether it's within the
desired delay.
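A minimal sketch of that idea, assuming a POSIX shell agent; the attribute name "fm_mgt_start_ts", the helper names, and the score of the grace window are illustrative, and the attrd_updater calls are shown as comments rather than a tested agent:

```shell
#!/bin/sh
# Sketch of the timestamp-plus-grace-period idea; not a tested agent.
# "fm_mgt_start_ts" and the helper names here are made up for the example.
GRACE=600  # seconds: pretend the monitor succeeds for 10 minutes after start

# Pure helper: is "now" still within GRACE seconds of "start_ts"?
in_grace_period() {
    start_ts=$1
    now=$2
    [ -n "$start_ts" ] && [ $((now - start_ts)) -lt "$GRACE" ]
}

# The start action would record the start time as a private node attribute:
#   attrd_updater -p -n fm_mgt_start_ts -U "$(date +%s)"
# The monitor action would run its usual check; on success it removes the
# attribute (attrd_updater -p -n fm_mgt_start_ts -D), and on failure it
# queries the attribute and returns success while in_grace_period holds.
```

The pure helper keeps the time arithmetic separate from the cluster calls, so the hacky part (faking success) is confined to one clearly marked branch of the monitor.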

Or write a master-slave resource agent, like DRBD has.
It sets a low master score on an outdated node while sync is in progress and raises it once sync finishes, so the node is not promoted to master until sync is complete.

If you "map" service states to Pacemaker roles like

Starting - Slave
Started - Master

that would help.
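As an illustration of that mapping, a promotable agent could translate its own service state into a promotion score the way DRBD does; the state names and score values below are made up for the example, and only crm_master itself is a real tool:

```shell
#!/bin/sh
# Illustrative only: map the service's own state to a promotion score,
# DRBD-style. State names and scores are examples, not a real agent's.
score_for_state() {
    case $1 in
        started)  echo 100 ;;  # healthcheck passes: eligible for Master
        starting) echo 5   ;;  # still warming up: keep a low score, stay Slave
        *)        echo 0   ;;  # not running: not promotable at all
    esac
}

# Inside the agent's monitor action one would then feed the score to the
# cluster, e.g.:
#   crm_master -l reboot -v "$(score_for_state "$current_state")"
```

With that mapping, the cluster only promotes a node once its monitor reports the fully started state, which is exactly the Starting=Slave / Started=Master scheme above.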



Thank you very much!
Hi,
If I remember correctly, any pending actions from a previous transition
must be completed before a new transition can be calculated. Otherwise,
there's the possibility that the pending action could change the state
in a way that makes the second transition's decisions harmful.
Theoretically (and ideally), Pacemaker could figure out whether some of
the actions in the second transition would be needed regardless of
whether the pending actions succeeded or failed, but in practice, that
would be difficult to implement (and possibly take more time to
calculate than is desirable in a recovery situation).
On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:

I have two clone resources in my corosync/pacemaker cluster: fm_mgt and
logserver. Both of their RAs are OCF. fm_mgt takes 1 minute to start
the service (the OCF start function runs for 1 minute). Configured as
below:
# crm configure show
node 168002177: 192.168.2.177
node 168002178: 192.168.2.178
node 168002179: 192.168.2.179
primitive fm_mgt fm_mgt \
          op monitor interval=20s timeout=120s \
          op stop interval=0 timeout=120s on-fail=restart \
          op start interval=0 timeout=120s on-fail=restart \
          meta target-role=Started
primitive logserver logserver \
          op monitor interval=20s timeout=120s \
          op stop interval=0 timeout=120s on-fail=restart \
          op start interval=0 timeout=120s on-fail=restart \
          meta target-role=Started
clone fm_mgt_replica fm_mgt
clone logserver_replica logserver
property cib-bootstrap-options: \
          have-watchdog=false \
          dc-version=1.1.13-10.el7-44eb2dd \
          cluster-infrastructure=corosync \
          stonith-enabled=false \
          start-failure-is-fatal=false
When I kill the fm_mgt service on one node, Pacemaker immediately
recovers it after the monitor fails. This looks perfectly normal. But
during this 1 minute of fm_mgt starting, if I kill the logserver
service on any node, the monitor catches the failure normally too, but
Pacemaker does not restart it immediately; it waits for fm_mgt to
finish starting. Only after fm_mgt has finished starting does Pacemaker
begin restarting logserver. It seems that there is some dependency
between Pacemaker resources.
# crm status
Last updated: Thu Oct 26 06:40:24 2017          Last change: Thu Oct 26 06:36:33 2017 by root via crm_resource on 192.168.2.177
Stack: corosync
Current DC: 192.168.2.179 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
3 nodes and 6 resources configured
Online: [ 192.168.2.177 192.168.2.178 192.168.2.179 ]
Full list of resources:
   Clone Set: logserver_replica [logserver]
       logserver  (ocf::heartbeat:logserver):     FAILED 192.168.2.177
       Started: [ 192.168.2.178 192.168.2.179 ]
   Clone Set: fm_mgt_replica [fm_mgt]
       Started: [ 192.168.2.178 192.168.2.179 ]
       Stopped: [ 192.168.2.177 ]
I am very confused. Is there something wrong with my configuration?
Thank you very much!
James
best regards



_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
