On 08/18/2016 04:00 PM, Ken Gaillot wrote:
> On 08/17/2016 08:17 PM, TEG AMJG wrote:
>> Hi
>>
>> I am having a problem with a simple Active/Passive cluster, which
>> consists of the following configuration:
>>
>> Cluster Name: kamcluster
>> Corosync Nodes:
>>  kam1vs3 kam2vs3
>> Pacemaker Nodes:
>>  kam1vs3 kam2vs3
>>
>> Resources:
>>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>>   Attributes: ip=10.0.1.206 cidr_netmask=32
>>   Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>>               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>>               monitor interval=10s (ClusterIP-monitor-interval-10s)
>>  Resource: ClusterIP2 (class=ocf provider=heartbeat type=IPaddr2)
>>   Attributes: ip=10.0.1.207 cidr_netmask=32
>>   Operations: start interval=0s timeout=20s (ClusterIP2-start-interval-0s)
>>               stop interval=0s timeout=20s (ClusterIP2-stop-interval-0s)
>>               monitor interval=10s (ClusterIP2-monitor-interval-10s)
>>  Resource: rtpproxycluster (class=systemd type=rtpproxy)
>>   Operations: monitor interval=10s (rtpproxycluster-monitor-interval-10s)
>>               stop interval=0s on-fail=block (rtpproxycluster-stop-interval-0s)
>>  Resource: kamailioetcfs (class=ocf provider=heartbeat type=Filesystem)
>>   Attributes: device=/dev/drbd1 directory=/etc/kamailio fstype=ext4
>>   Operations: start interval=0s timeout=60 (kamailioetcfs-start-interval-0s)
>>               monitor interval=10s on-fail=fence (kamailioetcfs-monitor-interval-10s)
>>               stop interval=0s on-fail=fence (kamailioetcfs-stop-interval-0s)
>>  Clone: fence_kam2_xvm-clone
>>   Meta Attrs: interleave=true clone-max=2 clone-node-max=1
>>   Resource: fence_kam2_xvm (class=stonith type=fence_xvm)
>>    Attributes: port=tegamjg_kam2 pcmk_host_list=kam2vs3
>>    Operations: monitor interval=60s (fence_kam2_xvm-monitor-interval-60s)
>>  Master: kamailioetcclone
>>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
>>               notify=true on-fail=fence
>>   Resource: kamailioetc (class=ocf provider=linbit type=drbd)
>>    Attributes: drbd_resource=kamailioetc
>>    Operations: start interval=0s timeout=240 (kamailioetc-start-interval-0s)
>>                promote interval=0s on-fail=fence (kamailioetc-promote-interval-0s)
>>                demote interval=0s on-fail=fence (kamailioetc-demote-interval-0s)
>>                stop interval=0s on-fail=fence (kamailioetc-stop-interval-0s)
>>                monitor interval=10s (kamailioetc-monitor-interval-10s)
>>  Clone: fence_kam1_xvm-clone
>>   Meta Attrs: interleave=true clone-max=2 clone-node-max=1
>>   Resource: fence_kam1_xvm (class=stonith type=fence_xvm)
>>    Attributes: port=tegamjg_kam1 pcmk_host_list=kam1vs3
>>    Operations: monitor interval=60s (fence_kam1_xvm-monitor-interval-60s)
>>  Resource: kamailiocluster (class=ocf provider=heartbeat type=kamailio)
>>   Attributes: listen_address=10.0.1.206 conffile=/etc/kamailio/kamailio.cfg
>>               pidfile=/var/run/kamailio.pid monitoring_ip=10.0.1.206
>>               monitoring_ip2=10.0.1.207 port=5060 proto=udp
>>               kamctlrc=/etc/kamailio/kamctlrc shmem=128 pkg=8
>>   Meta Attrs: target-role=Stopped
>>   Operations: start interval=0s timeout=60 (kamailiocluster-start-interval-0s)
>>               stop interval=0s timeout=30 (kamailiocluster-stop-interval-0s)
>>               monitor interval=5s (kamailiocluster-monitor-interval-5s)
>>
>> Stonith Devices:
>> Fencing Levels:
>>
>> Location Constraints:
>> Ordering Constraints:
>>   start fence_kam1_xvm-clone then start fence_kam2_xvm-clone (kind:Mandatory)
>>     (id:order-fence_kam1_xvm-clone-fence_kam2_xvm-clone-mandatory)
>>   start fence_kam2_xvm-clone then promote kamailioetcclone (kind:Mandatory)
>>     (id:order-fence_kam2_xvm-clone-kamailioetcclone-mandatory)
>>   promote kamailioetcclone then start kamailioetcfs (kind:Optional)
>>     (id:order-kamailioetcclone-kamailioetcfs-Optional)
>>   Resource Sets:
>>     set kamailioetcfs sequential=true (id:pcs_rsc_set_kamailioetcfs)
>>     set ClusterIP ClusterIP2 sequential=false (id:pcs_rsc_set_ClusterIP_ClusterIP2)
>>     set rtpproxycluster kamailiocluster sequential=true
>>       (id:pcs_rsc_set_rtpproxycluster_kamailiocluster)
>>     (id:pcs_rsc_order_set_kamailioetcfs_set_ClusterIP_ClusterIP2_set_rtpproxycluster_kamailiocluster)
>> Colocation Constraints:
>>   rtpproxycluster with ClusterIP2 (score:INFINITY)
>>     (id:colocation-rtpproxycluster-ClusterIP2-INFINITY)
>>   ClusterIP2 with ClusterIP (score:INFINITY)
>>     (id:colocation-ClusterIP2-ClusterIP-INFINITY)
>>   ClusterIP with kamailioetcfs (score:INFINITY)
>>     (id:colocation-ClusterIP-kamailioetcfs-INFINITY)
>>   kamailioetcfs with kamailioetcclone (score:INFINITY) (with-rsc-role:Master)
>>     (id:colocation-kamailioetcfs-kamailioetcclone-INFINITY)
>>   fence_kam2_xvm-clone with fence_kam1_xvm-clone (score:INFINITY)
>>     (id:colocation-fence_kam2_xvm-clone-fence_kam1_xvm-clone-INFINITY)
>>   kamailioetcclone with fence_kam2_xvm-clone (score:INFINITY)
>>     (id:colocation-kamailioetcclone-fence_kam2_xvm-clone-INFINITY)
>>   kamailiocluster with rtpproxycluster (score:INFINITY)
>>     (id:colocation-kamailiocluster-rtpproxycluster-INFINITY)
>>
>> Resources Defaults:
>>  migration-threshold: 2
>>  failure-timeout: 10m
>>  resource-stickiness: 200
>> Operations Defaults:
>>  No defaults set
>>
>> Cluster Properties:
>>  cluster-infrastructure: corosync
>>  cluster-name: kamcluster
>>  dc-version: 1.1.13-10.el7_2.2-44eb2dd
>>  have-watchdog: false
>>  last-lrm-refresh: 1471479940
>>  no-quorum-policy: ignore
>>  start-failure-is-fatal: false
>>  stonith-action: reboot
>>  stonith-enabled: true
>>
>> Now my problem is that when the rtpproxy systemd resource starts/stops,
>> I am experiencing a 2-second delay between the moment the start/stop
>> request is sent to systemd and the confirmation and interpretation of
>> the action by the crmd. I am pretty sure that this service doesn't

> The problem is that systemd does not have a way to say "start this
> service, and don't tell me it's done until it's completely started".
> Systemd reports successful completion as soon as the start has been
> initiated.

First, sorry for my previous guessing in the wrong direction ...

But anyway, with 'systemctl' the waiting for a service to daemonize
(Type=forking) or for it to report status via D-Bus (Type=notify) seems
to work ...

Haven't checked so far whether this is a useful source of information,
but this is how 'systemctl' seems to do the waiting:

https://github.com/systemd/systemd/blob/master/src/systemctl/systemctl.c#L2838
https://github.com/systemd/systemd/blob/master/src/shared/bus-unit-util.c#L839
https://github.com/systemd/systemd/blob/master/src/shared/bus-unit-util.c#L788

I know that the status query for certain daemons would fail if issued
directly after they have daemonized. And of course the concept can't
work with all types of units (e.g. Type=simple). Thus the status
checking after StartUnit could still be done with a configurable delay
(via an attribute defaulting to the 2 seconds).
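To make the Type=notify contract concrete: the daemon itself tells
systemd when it is fully up, and only then does the start job finish,
so a waiting client gets a meaningful completion. A minimal sketch
(the unit name, install path, and one-second "startup work" are
hypothetical; the binary must be linked against libsystemd):

    /* ready.c - minimal sketch of a Type=notify daemon.
     * Build: gcc ready.c -o ready $(pkg-config --cflags --libs libsystemd)
     * Hypothetical unit file:
     *   [Service]
     *   Type=notify
     *   ExecStart=/usr/local/bin/ready
     */
    #include <systemd/sd-daemon.h>
    #include <unistd.h>

    int main(void)
    {
        /* do all real initialization first (bind sockets, load config) */
        sleep(1);                  /* stand-in for actual startup work */

        /* tell systemd the service is fully up; only now does the
         * start job complete and a waiting 'systemctl start' return */
        sd_notify(0, "READY=1");

        pause();                   /* main service loop would go here */
        return 0;
    }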
> By contrast, Pacemaker can't proceed with dependencies until the
> service is completely up. So, Pacemaker's workaround is to initiate
> the start, then poll the status every 2 seconds until systemd reports
> the service is completely up.
>
> The 2 seconds is hardcoded, so there's no way to avoid that at the
> moment, other than using an OCF agent.
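For reference, the behaviour described above amounts to the pattern
below. This is a standalone sketch only, not Pacemaker's actual code
(the lrmd talks to systemd over D-Bus rather than shelling out), and
"rtpproxy.service" plus the 2-second interval are just illustrative:

    /* poll_active.c - sketch of the start-then-poll pattern.
     * Build: gcc poll_active.c -o poll_active
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* --no-block mimics the D-Bus StartUnit call, which returns as
         * soon as the job is queued rather than when the unit is up */
        if (system("systemctl start --no-block rtpproxy.service") != 0)
            return 1;

        /* poll every 2 seconds; 'is-active --quiet' exits 0 only once
         * the unit has left 'activating' and is fully active */
        while (system("systemctl is-active --quiet rtpproxy.service") != 0)
            sleep(2);

        printf("rtpproxy.service is active\n");
        return 0;
    }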
>> take even half of that time to start/stop. I did the same thing with
>> a dummy service as a systemd resource and it also takes 2 seconds to
>> start/stop the resource. Also, this is not happening with any of the
>> OCF resources. Below is what I see in the logs about this behaviour.
>>
>> This is after doing "pcs resource restart rtpproxycluster":
>>
>> Aug 17 20:59:18 [13187] kam1 crmd: notice: te_rsc_command:
>>     Initiating action 14: stop rtpproxycluster_stop_0 on kam1vs3 (local)
>> Aug 17 20:59:18 [13184] kam1 lrmd: info: cancel_recurring_action:
>>     Cancelling systemd operation rtpproxycluster_status_10000
>> Aug 17 20:59:18 [13187] kam1 crmd: info: do_lrm_rsc_op:
>>     Performing key=14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b
>>     op=rtpproxycluster_stop_0
>> Aug 17 20:59:18 [13184] kam1 lrmd: info: log_execute:
>>     executing - rsc:rtpproxycluster action:stop call_id:106
>> Aug 17 20:59:18 [13187] kam1 crmd: info: process_lrm_event:
>>     Operation rtpproxycluster_monitor_10000: Cancelled (node=kam1vs3,
>>     call=104, confirmed=true)
>> Aug 17 20:59:18 [13184] kam1 lrmd: info: systemd_exec_result:
>>     Call to stop passed: /org/freedesktop/systemd1/job/8302
>> Aug 17 20:59:20 [13187] kam1 crmd: notice: process_lrm_event:
>>     Operation rtpproxycluster_stop_0: ok (node=kam1vs3, call=106, rc=0,
>>     cib-update=134, confirmed=true)
>> Aug 17 20:59:20 [13182] kam1 cib: info: cib_perform_op:
>>     + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rtpproxycluster']/lrm_rsc_op[@id='rtpproxycluster_last_0']:
>>     @operation_key=rtpproxycluster_stop_0, @operation=stop,
>>     @transition-key=14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
>>     @transition-magic=0:0;14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
>>     @call-id=106, @last-run=1471481958, @last-rc-change=1471481958,
>>     @exec-time=2116
>>
>> Aug 17 20:59:20 [13186] kam1 pengine: info: RecurringOp:
>>     Start recurring monitor (10s) for rtpproxycluster on kam1vs3
>> Aug 17 20:59:20 [13186] kam1 pengine: notice: LogActions:
>>     Start rtpproxycluster (kam1vs3)
>> Aug 17 20:59:20 [13187] kam1 crmd: notice: te_rsc_command:
>>     Initiating action 13: start rtpproxycluster_start_0 on kam1vs3 (local)
>> Aug 17 20:59:20 [13187] kam1 crmd: info: do_lrm_rsc_op:
>>     Performing key=13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b
>>     op=rtpproxycluster_start_0
>> Aug 17 20:59:20 [13184] kam1 lrmd: info: log_execute:
>>     executing - rsc:rtpproxycluster action:start call_id:107
>> Aug 17 20:59:21 [13184] kam1 lrmd: info: systemd_exec_result:
>>     Call to start passed: /org/freedesktop/systemd1/job/8303
>> Aug 17 20:59:23 [13187] kam1 crmd: notice: process_lrm_event:
>>     Operation rtpproxycluster_start_0: ok (node=kam1vs3, call=107, rc=0,
>>     cib-update=136, confirmed=true)
>> Aug 17 20:59:23 [13182] kam1 cib: info: cib_perform_op:
>>     + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rtpproxycluster']/lrm_rsc_op[@id='rtpproxycluster_last_0']:
>>     @operation_key=rtpproxycluster_start_0, @operation=start,
>>     @transition-key=13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
>>     @transition-magic=0:0;13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
>>     @call-id=107, @last-run=1471481960, @last-rc-change=1471481960,
>>     @exec-time=2068
>>
>> Why is this happening? Many of my resources depend on each other, so
>> the delay during a failover adds up quickly (4 seconds of delay
>> stopping/starting just for that one systemd resource).
>>
>> Is this normal? Is there any way to solve this? Do I really need to
>> write an OCF resource agent for that service (that would take some
>> time that I don't really have at the moment)?
>>
>> Regards
>>
>> Alejandro

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org