[ClusterLabs] Reply: Re: clone resource not get restarted on fail

he.hailong5 Mon, 13 Feb 2017 17:12:37 -0800

Hi,




＞ crm configure show

+ crm configure show
node $id="336855579" paas-controller-1
node $id="336855580" paas-controller-2
node $id="336855581" paas-controller-3
primitive apigateway ocf:heartbeat:apigateway \
        op monitor interval="2s" timeout="20s" on-fail="restart" \
        op stop interval="0" timeout="200s" on-fail="restart" \
        op start interval="0" timeout="9999h" on-fail="restart"
primitive apigateway_vip ocf:heartbeat:IPaddr2 \
        params ip="20.20.2.7" cidr_netmask="24" \
        op start interval="0" timeout="20" \
        op stop interval="0" timeout="20" \
        op monitor timeout="20s" interval="2s" depth="0"
primitive router ocf:heartbeat:router \
        op monitor interval="2s" timeout="20s" on-fail="restart" \
        op stop interval="0" timeout="200s" on-fail="restart" \
        op start interval="0" timeout="9999h" on-fail="restart"
primitive router_vip ocf:heartbeat:IPaddr2 \
        params ip="10.10.1.7" cidr_netmask="24" \
        op start interval="0" timeout="20" \
        op stop interval="0" timeout="20" \
        op monitor timeout="20s" interval="2s" depth="0"
primitive sdclient ocf:heartbeat:sdclient \
        op monitor interval="2s" timeout="20s" on-fail="restart" \
        op stop interval="0" timeout="200s" on-fail="restart" \
        op start interval="0" timeout="9999h" on-fail="restart"
primitive sdclient_vip ocf:heartbeat:IPaddr2 \
        params ip="10.10.1.8" cidr_netmask="24" \
        op start interval="0" timeout="20" \
        op stop interval="0" timeout="20" \
        op monitor timeout="20s" interval="2s" depth="0"
clone apigateway_rep apigateway
clone router_rep router
clone sdclient_rep sdclient
location apigateway_loc apigateway_vip \
        rule $id="apigateway_loc-rule" +inf: apigateway_workable eq 1
location router_loc router_vip \
        rule $id="router_loc-rule" +inf: router_workable eq 1
location sdclient_loc sdclient_vip \
        rule $id="sdclient_loc-rule" +inf: sdclient_workable eq 1
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-42f2063" \
        cluster-infrastructure="corosync" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        start-failure-is-fatal="false" \
        last-lrm-refresh="1486981647"
op_defaults $id="op_defaults-options" \
        on-fail="restart"

-------------------------------------------------------------------------------------------------




By the way, I am using "crm_attribute -N $HOSTNAME -q -l reboot --name 
＜prefix＞_workable -v ＜1 or 0＞" in the monitor action to update the transient 
attributes that control the VIP location.

I also found that the VIP resource does not get moved if the related clone 
resource fails to restart.
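
For illustration, the attribute update mentioned above could be wrapped in a small helper like the one below. This is only a sketch: the function name set_workable and its argument checking are hypothetical and are not taken from the actual resource agents; the crm_attribute options are the ones quoted in this message.

```shell
#!/bin/sh
# Sketch only: set_workable is a hypothetical helper, not part of the
# real apigateway/router/sdclient agents.

# Publish the health of a local service as a transient node attribute.
#   $1 = attribute prefix (router, sdclient or apigateway)
#   $2 = 1 if the local instance is healthy, 0 otherwise
set_workable() {
    prefix=$1
    value=$2
    case "$value" in
        0|1) ;;
        *) echo "set_workable: value must be 0 or 1" >&2; return 1 ;;
    esac
    # "-l reboot" stores the attribute as a transient (status section)
    # value rather than a permanent one.
    crm_attribute -N "${HOSTNAME:-$(hostname)}" -q -l reboot \
        --name "${prefix}_workable" -v "$value"
}
```

A monitor action might call "set_workable router 1" after a successful health check and "set_workable router 0" before returning a failure code, so that the location rule "router_workable eq 1" only matches nodes whose last check passed.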







Original Message

From: ＜kgail...@redhat.com＞
To: ＜users@clusterlabs.org＞
Date: 2017-02-13 23:04
Subject: Re: [ClusterLabs] clone resource not get restarted on fail





On 02/13/2017 07:57 AM, he.hailo...@zte.com.cn wrote:
＞ Pacemaker 1.1.10
＞ 
＞ Corosync 2.3.3
＞ 
＞ 
＞ This is a 3-node cluster configured with 3 clone resources, each
＞ paired with a VIP resource of type IPaddr2:
＞ 
＞ 
＞ ＞crm status
＞ 
＞ 
＞ Online: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
＞ 
＞ 
＞  router_vip     (ocf::heartbeat:IPaddr2):       Started paas-controller-1 
＞ 
＞  sdclient_vip   (ocf::heartbeat:IPaddr2):       Started paas-controller-3 
＞ 
＞  apigateway_vip (ocf::heartbeat:IPaddr2):       Started paas-controller-2 
＞ 
＞  Clone Set: sdclient_rep [sdclient]
＞ 
＞      Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
＞ 
＞  Clone Set: router_rep [router]
＞ 
＞      Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
＞ 
＞  Clone Set: apigateway_rep [apigateway]
＞ 
＞      Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
＞ 
＞ 
＞ Sometimes a clone resource gets stuck in monitoring when the service
＞ fails:
＞ 
＞ 
＞  router_vip     (ocf::heartbeat:IPaddr2):       Started paas-controller-1 
＞ 
＞  sdclient_vip   (ocf::heartbeat:IPaddr2):       Started paas-controller-2 
＞ 
＞  apigateway_vip (ocf::heartbeat:IPaddr2):       Started paas-controller-3 
＞ 
＞  Clone Set: sdclient_rep [sdclient]
＞ 
＞      Started: [ paas-controller-1 paas-controller-2 ]
＞ 
＞      Stopped: [ paas-controller-3 ]
＞ 
＞  Clone Set: router_rep [router]
＞ 
＞      router     (ocf::heartbeat:router):        Started
＞ paas-controller-3 FAILED 
＞ 
＞      Started: [ paas-controller-1 paas-controller-2 ]
＞ 
＞  Clone Set: apigateway_rep [apigateway]
＞ 
＞      apigateway (ocf::heartbeat:apigateway):    Started
＞ paas-controller-3 FAILED 
＞ 
＞      Started: [ paas-controller-1 paas-controller-2 ]
＞ 
＞ 
＞ In the example above, sdclient_rep gets restarted on node 3, while
＞ the other two hang in monitoring on node 3. Here are the OCF logs:
＞ 
＞ 
＞ abnormal (apigateway_rep):
＞ 
＞ 2017-02-13 18:27:53 [23586]===print_log test_monitor run_func main===
＞ Starting health check.
＞ 
＞ 2017-02-13 18:27:53 [23586]===print_log test_monitor run_func main===
＞ health check succeed.
＞ 
＞ 2017-02-13 18:27:55 [24010]===print_log test_monitor run_func main===
＞ Starting health check.
＞ 
＞ 2017-02-13 18:27:55 [24010]===print_log test_monitor run_func main===
＞ Failed: docker daemon is not running.
＞ 
＞ 2017-02-13 18:27:57 [24095]===print_log test_monitor run_func main===
＞ Starting health check.
＞ 
＞ 2017-02-13 18:27:57 [24095]===print_log test_monitor run_func main===
＞ Failed: docker daemon is not running.
＞ 
＞ 2017-02-13 18:27:59 [24159]===print_log test_monitor run_func main===
＞ Starting health check.
＞ 
＞ 2017-02-13 18:27:59 [24159]===print_log test_monitor run_func main===
＞ Failed: docker daemon is not running.
＞ 
＞ 
＞ normal (sdclient_rep):
＞ 
＞ 2017-02-13 18:27:52 [23507]===print_log sdclient_monitor run_func
＞ main=== health check succeed.
＞ 
＞ 2017-02-13 18:27:54 [23630]===print_log sdclient_monitor run_func
＞ main=== Starting health check.
＞ 
＞ 2017-02-13 18:27:54 [23630]===print_log sdclient_monitor run_func
＞ main=== Failed: docker daemon is not running.
＞ 
＞ 2017-02-13 18:27:55 [23710]===print_log sdclient_stop run_func main===
＞ Starting stop the container.
＞ 
＞ 2017-02-13 18:27:55 [23710]===print_log sdclient_stop run_func main===
＞ docker daemon lost, pretend stop succeed.
＞ 
＞ 2017-02-13 18:27:55 [23763]===print_log sdclient_start run_func main===
＞ Starting run the container.
＞ 
＞ 2017-02-13 18:27:55 [23763]===print_log sdclient_start run_func main===
＞ docker daemon lost, try again in 5 secs.
＞ 
＞ 2017-02-13 18:28:00 [23763]===print_log sdclient_start run_func main===
＞ docker daemon lost, try again in 5 secs.
＞ 
＞ 2017-02-13 18:28:05 [23763]===print_log sdclient_start run_func main===
＞ docker daemon lost, try again in 5 secs.
＞ 
＞ 
＞ If I disable two of the clone resources, the switchover test for the
＞ remaining one works as expected: fail the service -＞ monitor fails -＞
＞ stop -＞ start
＞ 
＞ 
＞ Online: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
＞ 
＞ 
＞  sdclient_vip   (ocf::heartbeat:IPaddr2):       Started paas-controller-2 
＞ 
＞  Clone Set: sdclient_rep [sdclient]
＞ 
＞      Started: [ paas-controller-1 paas-controller-2 ]
＞ 
＞      Stopped: [ paas-controller-3 ]
＞ 
＞ 
＞ What is the reason behind this?

Can you show the configuration of the three clones, their operations,
and any constraints?

Normally, the response is controlled by the monitor operation's on-fail
attribute (which defaults to restart).


_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
