On 02/13/2017 07:57 AM, he.hailo...@zte.com.cn wrote:
> Pacemaker 1.1.10
> Corosync 2.3.3
>
> This is a 3-node cluster configured with 3 clone resources, each
> attached to a VIP resource of type IPaddr2:
>
> >crm status
>
> Online: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
>
>  router_vip (ocf::heartbeat:IPaddr2): Started paas-controller-1
>  sdclient_vip (ocf::heartbeat:IPaddr2): Started paas-controller-3
>  apigateway_vip (ocf::heartbeat:IPaddr2): Started paas-controller-2
>  Clone Set: sdclient_rep [sdclient]
>      Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
>  Clone Set: router_rep [router]
>      Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
>  Clone Set: apigateway_rep [apigateway]
>      Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
>
> Sometimes a clone resource gets stuck in monitor when the service
> fails:
>
>  router_vip (ocf::heartbeat:IPaddr2): Started paas-controller-1
>  sdclient_vip (ocf::heartbeat:IPaddr2): Started paas-controller-2
>  apigateway_vip (ocf::heartbeat:IPaddr2): Started paas-controller-3
>  Clone Set: sdclient_rep [sdclient]
>      Started: [ paas-controller-1 paas-controller-2 ]
>      Stopped: [ paas-controller-3 ]
>  Clone Set: router_rep [router]
>      router (ocf::heartbeat:router): Started paas-controller-3 FAILED
>      Started: [ paas-controller-1 paas-controller-2 ]
>  Clone Set: apigateway_rep [apigateway]
>      apigateway (ocf::heartbeat:apigateway): Started paas-controller-3 FAILED
>      Started: [ paas-controller-1 paas-controller-2 ]
>
> In the example above, sdclient_rep gets restarted on node 3, while
> the other two hang at monitoring on node 3. Here are the OCF logs:
>
> abnormal (apigateway_rep):
>
> 2017-02-13 18:27:53 [23586]===print_log test_monitor run_func main=== Starting health check.
> 2017-02-13 18:27:53 [23586]===print_log test_monitor run_func main=== health check succeed.
> 2017-02-13 18:27:55 [24010]===print_log test_monitor run_func main=== Starting health check.
> 2017-02-13 18:27:55 [24010]===print_log test_monitor run_func main=== Failed: docker daemon is not running.
> 2017-02-13 18:27:57 [24095]===print_log test_monitor run_func main=== Starting health check.
> 2017-02-13 18:27:57 [24095]===print_log test_monitor run_func main=== Failed: docker daemon is not running.
> 2017-02-13 18:27:59 [24159]===print_log test_monitor run_func main=== Starting health check.
> 2017-02-13 18:27:59 [24159]===print_log test_monitor run_func main=== Failed: docker daemon is not running.
>
> normal (sdclient_rep):
>
> 2017-02-13 18:27:52 [23507]===print_log sdclient_monitor run_func main=== health check succeed.
> 2017-02-13 18:27:54 [23630]===print_log sdclient_monitor run_func main=== Starting health check.
> 2017-02-13 18:27:54 [23630]===print_log sdclient_monitor run_func main=== Failed: docker daemon is not running.
> 2017-02-13 18:27:55 [23710]===print_log sdclient_stop run_func main=== Starting stop the container.
> 2017-02-13 18:27:55 [23710]===print_log sdclient_stop run_func main=== docker daemon lost, pretend stop succeed.
> 2017-02-13 18:27:55 [23763]===print_log sdclient_start run_func main=== Starting run the container.
> 2017-02-13 18:27:55 [23763]===print_log sdclient_start run_func main=== docker daemon lost, try again in 5 secs.
> 2017-02-13 18:28:00 [23763]===print_log sdclient_start run_func main=== docker daemon lost, try again in 5 secs.
> 2017-02-13 18:28:05 [23763]===print_log sdclient_start run_func main=== docker daemon lost, try again in 5 secs.
>
> If I disable 2 of the clone resources, the switchover test for the
> remaining clone resource works as expected: fail the service ->
> monitor fails -> stop -> start
>
> Online: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
>
>  sdclient_vip (ocf::heartbeat:IPaddr2): Started paas-controller-2
>  Clone Set: sdclient_rep [sdclient]
>      Started: [ paas-controller-1 paas-controller-2 ]
>      Stopped: [ paas-controller-3 ]
>
> What's the reason behind this?

Can you show the configuration of the three clones, their operations, and any constraints? Normally, the response is controlled by the monitor operation's on-fail attribute (which defaults to restart).
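
For reference, here is a rough sketch in crm shell syntax of the kind of configuration being asked about. Only the resource names (router, router_rep, router_vip) come from the status output quoted above. The interval and timeout values, the constraint and its ID are assumptions made up for illustration, and the real configuration may look quite different. Running "crm configure show" on the cluster prints the actual definitions.

```
# Hypothetical fragment; values and the constraint are illustrative assumptions.
primitive router ocf:heartbeat:router \
    op monitor interval=2s timeout=20s on-fail=restart
clone router_rep router
# Assumed constraint: keep the VIP on a node with a running clone instance.
colocation router_vip_with_router inf: router_vip router_rep
```

With on-fail=restart, a failed monitor should lead to a stop followed by a start, which matches the sdclient_rep log. Other on-fail values (ignore, block, stop, standby, fence) would change that response, which is why the operation definitions matter here.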