[ClusterLabs] Reply: Re: Reply: Re: clone resource not get restarted on fail
Is there a reason not to use a colocation constraint instead? If X_vip
is colocated with X, it will be moved if X fails.

[hhl]: The movement should also take place if X is stopped (while its
start is still in progress). I don't know whether a colocation
constraint would satisfy this requirement.

I don't see any reason in your configuration why the services wouldn't
be restarted. It's possible the cluster tried to restart the service,
but the stop action failed. Since you have stonith disabled, the cluster
can't recover from a failed stop action.

[hhl]: The OCF logs showed that Pacemaker never entered the stop
function in this case.

Is there a reason you disabled quorum? With 3 nodes, if they get split
into groups of 1 node and 2 nodes, quorum is what keeps the groups from
both starting all resources.

[hhl]: I enabled quorum and retried; the same thing happens. By the
way, I repeated this several times today and found that when I trigger
the failure condition on one node, which would fail all the clone
resources there, only one gets restarted; the other two fail to
restart.
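The "never entered the stop function" observation above hinges on the OCF action contract: after a failed monitor, Pacemaker invokes the agent's stop action before restarting, and stop must report success even when the service is already down. The following is a minimal illustrative sketch of that dispatch in POSIX sh (an assumption-laden stand-in, not the thread's actual router/sdclient/apigateway agents; the pidfile path is invented for the example):

```shell
#!/bin/sh
# Illustrative OCF-style agent skeleton (NOT the thread's real agents).
# It demonstrates the contract discussed above: "stop" must return
# success (0) even if the service is already stopped, or Pacemaker
# records a failed stop and, with stonith disabled, cannot recover.

OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Hypothetical pidfile location, for illustration only.
PIDFILE="${PIDFILE:-/tmp/demo-agent.pid}"

agent_monitor() {
    # Running only if the pidfile exists and the process is alive.
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

agent_stop() {
    # Stopping an already-stopped resource must still report success.
    if agent_monitor; then
        kill "$(cat "$PIDFILE")" 2>/dev/null
        rm -f "$PIDFILE"
    fi
    return $OCF_SUCCESS
}

# Dispatch only when invoked with an action argument, so the functions
# can also be sourced for inspection.
if [ $# -gt 0 ]; then
    case "$1" in
        monitor) agent_monitor ;;
        stop)    agent_stop ;;
        *)       exit 1 ;;
    esac
fi
```

If an agent instead returns non-zero from stop when the service is already gone, the symptom matches this thread: the clone instance stays FAILED and is never restarted.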
> trigger the failure condition on paas-controller-1

Online: [ paas-controller-1 paas-controller-2 paas-controller-3 ]

 router_vip (ocf::heartbeat:IPaddr2): Started paas-controller-2
 sdclient_vip (ocf::heartbeat:IPaddr2): Started paas-controller-3
 apigateway_vip (ocf::heartbeat:IPaddr2): Started paas-controller-3
 Clone Set: sdclient_rep [sdclient]
     Started: [ paas-controller-2 paas-controller-3 ]
     Stopped: [ paas-controller-1 ]
 Clone Set: router_rep [router]
     router (ocf::heartbeat:router): Started paas-controller-1 FAILED
     Started: [ paas-controller-2 paas-controller-3 ]
 Clone Set: apigateway_rep [apigateway]
     apigateway (ocf::heartbeat:apigateway): Started paas-controller-1 FAILED
     Started: [ paas-controller-2 paas-controller-3 ]

> trigger the failure condition on paas-controller-3

Online: [ paas-controller-1 paas-controller-2 paas-controller-3 ]

 router_vip (ocf::heartbeat:IPaddr2): Started paas-controller-2
 sdclient_vip (ocf::heartbeat:IPaddr2): Started paas-controller-3
 apigateway_vip (ocf::heartbeat:IPaddr2): Started paas-controller-3
 Clone Set: sdclient_rep [sdclient]
     sdclient (ocf::heartbeat:sdclient): Started paas-controller-3 FAILED
     Started: [ paas-controller-1 paas-controller-2 ]
 Clone Set: router_rep [router]
     Started: [ paas-controller-1 paas-controller-2 ]
     Stopped: [ paas-controller-3 ]
 Clone Set: apigateway_rep [apigateway]
     apigateway (ocf::heartbeat:apigateway): Started paas-controller-3 FAILED
     Started: [ paas-controller-1 paas-controller-2 ]

Original mail
From: <kgail...@redhat.com>
To: He Hailong 10164561
Cc: <users@clusterlabs.org>
Date: 2017-02-15 06:14
Subject: Re: Reply: Re: [ClusterLabs] clone resource not get restarted on fail

On 02/13/2017 07:08 PM, he.hailo...@zte.com.cn wrote:
> Hi,
>
> crm configure show
> + crm configure show
> node $id="336855579" paas-controller-1
> node $id="336855580" paas-controller-2
> node $id="336855581" paas-controller-3
> primitive apigateway ocf:heartbeat:apigateway \
>     op monitor interval="2s" timeout="20s" on-fail="restart" \
>     op stop interval="0" timeout="200s" on-fail="restart" \
>     op start interval="0" timeout="h" on-fail="restart"
> primitive apigateway_vip ocf:heartbeat:IPaddr2 \
>     params ip="20.20.2.7" cidr_netmask="24" \
>     op start interval="0" timeout="20" \
>     op stop interval="0" timeout="20" \
>     op monitor timeout="20s" interval="2s" depth="0"
> primitive router ocf:heartbeat:router \
>     op monitor interval="2s" timeout="20s" on-fail="restart" \
>     op stop interval="0" timeout="200s" on-fail="restart" \
>     op start interval="0" timeout="h" on-fail="restart"
> primitive router_vip ocf:heartbeat:IPaddr2 \
>     params ip="10.10.1.7" cidr_netmask="24" \
>     op start interval="0" timeout="20" \
>     op stop interval="0" timeout="20" \
>     op monitor timeout="20s" interval="2s" depth="0"
> primitive sdclient ocf:heartbeat:sdclient \
>     op monitor interval="2s" timeout="20s" on-fail="restart" \
>     op stop interval="0" timeout="200s" on-fail="restart" \
>     op start interval="0" timeout="h" on-fail="restart"
> primitive sdclient_vip ocf:heartbeat:IPaddr2 \
>     params ip="10.10.1.8" cidr_netmask="24" \
>     op start interval="0" timeout="20" \
>     op stop interval="0" timeout="20" \
>     op monitor timeout="20s" interval="2s" depth="0"
> clone apigateway_rep apigateway
> clone router_rep router
> clone sdclient_rep sdclient
> location apigateway_loc apigateway_vip \
>     rule $id="apigateway_loc-rule" +inf: apigateway_workable eq 1
> location router_loc router_v
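The `*_workable` node attributes tested by the location rules above are maintained by the agents themselves, using the `crm_attribute` command quoted later in this thread. A hedged sketch of that monitor-side hook, for the router agent (`service_is_healthy` is a hypothetical placeholder for the agent's real health check; this requires a live cluster and is not runnable standalone):

```
# Sketch: update the transient node attribute that drives the VIP
# location rule. "service_is_healthy" is a hypothetical stand-in for
# the agent's actual health check.
if service_is_healthy; then
    crm_attribute -N "$HOSTNAME" -q -l reboot --name router_workable -v 1
else
    crm_attribute -N "$HOSTNAME" -q -l reboot --name router_workable -v 0
fi
```

Because the attribute is only updated when the monitor action actually runs, the VIP cannot move while a clone instance is wedged in a long start or a hung stop, which matches the behavior reported in this thread.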
Re: [ClusterLabs] Reply: Re: clone resource not get restarted on fail
On 02/13/2017 07:08 PM, he.hailo...@zte.com.cn wrote:
> Hi,
>
> crm configure show
> + crm configure show
> node $id="336855579" paas-controller-1
> node $id="336855580" paas-controller-2
> node $id="336855581" paas-controller-3
> primitive apigateway ocf:heartbeat:apigateway \
>     op monitor interval="2s" timeout="20s" on-fail="restart" \
>     op stop interval="0" timeout="200s" on-fail="restart" \
>     op start interval="0" timeout="h" on-fail="restart"
> primitive apigateway_vip ocf:heartbeat:IPaddr2 \
>     params ip="20.20.2.7" cidr_netmask="24" \
>     op start interval="0" timeout="20" \
>     op stop interval="0" timeout="20" \
>     op monitor timeout="20s" interval="2s" depth="0"
> primitive router ocf:heartbeat:router \
>     op monitor interval="2s" timeout="20s" on-fail="restart" \
>     op stop interval="0" timeout="200s" on-fail="restart" \
>     op start interval="0" timeout="h" on-fail="restart"
> primitive router_vip ocf:heartbeat:IPaddr2 \
>     params ip="10.10.1.7" cidr_netmask="24" \
>     op start interval="0" timeout="20" \
>     op stop interval="0" timeout="20" \
>     op monitor timeout="20s" interval="2s" depth="0"
> primitive sdclient ocf:heartbeat:sdclient \
>     op monitor interval="2s" timeout="20s" on-fail="restart" \
>     op stop interval="0" timeout="200s" on-fail="restart" \
>     op start interval="0" timeout="h" on-fail="restart"
> primitive sdclient_vip ocf:heartbeat:IPaddr2 \
>     params ip="10.10.1.8" cidr_netmask="24" \
>     op start interval="0" timeout="20" \
>     op stop interval="0" timeout="20" \
>     op monitor timeout="20s" interval="2s" depth="0"
> clone apigateway_rep apigateway
> clone router_rep router
> clone sdclient_rep sdclient
> location apigateway_loc apigateway_vip \
>     rule $id="apigateway_loc-rule" +inf: apigateway_workable eq 1
> location router_loc router_vip \
>     rule $id="router_loc-rule" +inf: router_workable eq 1
> location sdclient_loc sdclient_vip \
>     rule $id="sdclient_loc-rule" +inf: sdclient_workable eq 1
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.10-42f2063" \
>     cluster-infrastructure="corosync" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore" \
>     start-failure-is-fatal="false" \
>     last-lrm-refresh="1486981647"
> op_defaults $id="op_defaults-options" \
>     on-fail="restart"
>
> And B.T.W, I am using "crm_attribute -N $HOSTNAME -q -l reboot --name
> <prefix>_workable -v <1 or 0>" in the monitor to update the
> transient attributes, which control the vip location.

Is there a reason not to use a colocation constraint instead? If X_vip
is colocated with X, it will be moved if X fails.

I don't see any reason in your configuration why the services wouldn't
be restarted. It's possible the cluster tried to restart the service,
but the stop action failed. Since you have stonith disabled, the cluster
can't recover from a failed stop action.

Is there a reason you disabled quorum? With 3 nodes, if they get split
into groups of 1 node and 2 nodes, quorum is what keeps the groups from
both starting all resources.

> and also found, the vip resource won't get moved if the related clone
> resource failed to restart.
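In crm configure syntax, the colocation alternative suggested here might look roughly like the following (a sketch only; the constraint names are illustrative, while the resource names come from the configuration quoted above):

```
# Sketch: tie each VIP to a node running an instance of its clone,
# replacing the attribute-based location rules. Constraint names are
# illustrative.
colocation router_vip_with_router     inf: router_vip     router_rep
colocation sdclient_vip_with_clone    inf: sdclient_vip   sdclient_rep
colocation apigateway_vip_with_clone  inf: apigateway_vip apigateway_rep
```

With these constraints the cluster moves a VIP as soon as the colocated clone instance fails on its node, without the agent having to maintain `*_workable` attributes from its monitor action.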
> Original mail
> From: <kgail...@redhat.com>
> To: <users@clusterlabs.org>
> Date: 2017-02-13 23:04
> Subject: Re: [ClusterLabs] clone resource not get restarted on fail
>
> On 02/13/2017 07:57 AM, he.hailo...@zte.com.cn wrote:
> > Pacemaker 1.1.10
> > Corosync 2.3.3
> >
> > This is a 3-node cluster configured with 3 clone resources, each
> > attached with a vip resource of IPaddr2:
> >
> > > crm status
> >
> > Online: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
> >
> >  router_vip (ocf::heartbeat:IPaddr2): Started paas-controller-1
> >  sdclient_vip (ocf::heartbeat:IPaddr2): Started paas-controller-3
> >  apigateway_vip (ocf::heartbeat:IPaddr2): Started paas-controller-2
> >  Clone Set: sdclient_rep [sdclient]
> >      Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
> >  Clone Set: router_rep [router]
> >      Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
> >  Clone Set: apigateway_rep [apigateway]
> >      Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
> >
> > It is observed that sometimes the clone resource is stuck in monitor
> > when the service fails:
> >
> >  router_vip (ocf::heartbeat:IPaddr2): Started paas-controller-1
> >  sdclient_vip (ocf::heartbeat:IPaddr2):