Re: [ClusterLabs] Is corosync supposed to be restarted if it fies?
On Thu, Nov 30, 2017 at 12:42 AM, Jan Pokorný wrote:
> On 29/11/17 22:00 +0100, Jan Pokorný wrote:
>> On 28/11/17 22:35 +0300, Andrei Borzenkov wrote:
>>> 28.11.2017 13:01, Jan Pokorný wrote:
>>>> On 27/11/17 17:43 +0300, Andrei Borzenkov wrote:
>>>>> Sent from iPhone
>>>>>
>>>>>> On 27 Nov 2017, at 14:36, Ferenc Wágner wrote:
>>>>>>
>>>>>> Andrei Borzenkov writes:
>>>>>>
>>>>>>> 25.11.2017 10:05, Andrei Borzenkov wrote:
>>>>>>>
>>>>>>>> In one of the guides, the suggested procedure to simulate split
>>>>>>>> brain was to kill the corosync process. It actually worked on one
>>>>>>>> cluster, but on another the corosync process was restarted after
>>>>>>>> being killed, without the cluster noticing anything. Except that
>>>>>>>> after several attempts pacemaker died, stopping resources ... :)
>>>>>>>>
>>>>>>>> This is SLES12 SP2; I do not see any Restart in the service
>>>>>>>> definition, so it is probably not systemd.
>>>>>>>
>>>>>>> FTR - it was not corosync, but pacemaker; its unit file specifies
>>>>>>> Restart=on-failure, so killing corosync caused pacemaker to fail
>>>>>>> and be restarted by systemd.
>>>>>>
>>>>>> And starting corosync via a Requires dependency?
>>>>>
>>>>> Exactly.
>>>>
>>>> From my testing it looks like we should change
>>>> "Requires=corosync.service" to "BindsTo=corosync.service"
>>>> in pacemaker.service.
>>>>
>>>> Could you give it a try?
>>>
>>> I'm not sure what the expected outcome is, but pacemaker.service is
>>> still restarted (due to Restart=on-failure).
>>
>> Expected outcome is that pacemaker.service will become
>> "inactive (dead)" after killing corosync (as a result of being
>> "bound" to corosync). Have you indeed issued "systemctl
>> daemon-reload" after updating the pacemaker unit file?
>>

Of course. I even rebooted ... :)

ha1:~ # systemctl cat pacemaker.service | grep corosync
After=corosync.service
BindsTo=corosync.service
# ExecStopPost=/bin/sh -c 'pidof crmd || killall -TERM corosync'
ha1:~ #

Nov 30 10:41:14 ha1 sbd[1743]:cluster:error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov 30 10:41:14 ha1 systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL
Nov 30 10:41:14 ha1 sbd[1743]:cluster: warning: sbd_membership_destroy: Lost connection to corosync
Nov 30 10:41:14 ha1 systemd[1]: pacemaker.service: Main process exited, code=exited, status=107/n/a
Nov 30 10:41:14 ha1 sbd[1743]:cluster:error: set_servant_health: Cluster connection terminated
Nov 30 10:41:14 ha1 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
Nov 30 10:41:14 ha1 sbd[1743]:cluster:error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 2
Nov 30 10:41:14 ha1 systemd[1]: pacemaker.service: Unit entered failed state.
Nov 30 10:41:14 ha1 sbd[1739]: warning: inquisitor_child: cluster health check: UNHEALTHY
Nov 30 10:41:14 ha1 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
...
Nov 30 10:41:14 ha1 systemd[1]: corosync.service: Unit entered failed state.
Nov 30 10:41:14 ha1 systemd[1]: corosync.service: Failed with result 'signal'.
Nov 30 10:41:14 ha1 systemd[1]: pacemaker.service: Service hold-off time over, scheduling restart.
Nov 30 10:41:14 ha1 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
Nov 30 10:41:14 ha1 systemd[1]: Starting Corosync Cluster Engine...

Do you mean you get different results? Do not forget that the only
thing BindsTo does is stop the service if the dependency failed; it
does *not* affect the decision whether to restart the service in any
way (at least directly).

>> (FTR, I tried with systemd 235).
>>

Well ... what we have here is a race condition. We have two events -
*independent* failures of corosync.service and pacemaker.service - and
two (re-)actions - stopping pacemaker.service in response to the former
(due to BindsTo) and restarting pacemaker.service in response to the
latter (due to Restart=on-failure). The final result depends on the
order in which systemd gets those events and schedules actions (and the
relative timing of when those actions complete), and this is not
deterministic.

Now 235 includes some changes to the restart logic, which refuses to do
a restart if another action (like stop) is currently being scheduled. I
am not sure what happens if the restart is scheduled first, though (such
"implementation details" tend not to be documented in the systemd
world). I have been doing systemd troubleshooting long enough to know
that even if you observe a specific sequence of events, another system
may exhibit a completely different sequence.

Anyway, I will try to install a system with 235 on the same platform to
see how it behaves.

>>> If the intention is to unconditionally stop it when corosync dies,
>>> pacemaker should probably exit with a unique code and the unit file
>>> should have RestartPreventExitStatus set to it.
>>
>> That would be an elaborate way to reach the same.
>>

This is the *only* way to reach the same. You cannot both tell service manager to restart s
Re: [ClusterLabs] Is corosync supposed to be restarted if it fies?
On 29/11/17 22:00 +0100, Jan Pokorný wrote:
> On 28/11/17 22:35 +0300, Andrei Borzenkov wrote:
>> 28.11.2017 13:01, Jan Pokorný wrote:
>>> On 27/11/17 17:43 +0300, Andrei Borzenkov wrote:
>>>> Sent from iPhone
>>>>
>>>>> On 27 Nov 2017, at 14:36, Ferenc Wágner wrote:
>>>>>
>>>>> Andrei Borzenkov writes:
>>>>>
>>>>>> 25.11.2017 10:05, Andrei Borzenkov wrote:
>>>>>>
>>>>>>> In one of the guides, the suggested procedure to simulate split
>>>>>>> brain was to kill the corosync process. It actually worked on one
>>>>>>> cluster, but on another the corosync process was restarted after
>>>>>>> being killed, without the cluster noticing anything. Except that
>>>>>>> after several attempts pacemaker died, stopping resources ... :)
>>>>>>>
>>>>>>> This is SLES12 SP2; I do not see any Restart in the service
>>>>>>> definition, so it is probably not systemd.
>>>>>>
>>>>>> FTR - it was not corosync, but pacemaker; its unit file specifies
>>>>>> Restart=on-failure, so killing corosync caused pacemaker to fail
>>>>>> and be restarted by systemd.
>>>>>
>>>>> And starting corosync via a Requires dependency?
>>>>
>>>> Exactly.
>>>
>>> From my testing it looks like we should change
>>> "Requires=corosync.service" to "BindsTo=corosync.service"
>>> in pacemaker.service.
>>>
>>> Could you give it a try?
>>
>> I'm not sure what the expected outcome is, but pacemaker.service is
>> still restarted (due to Restart=on-failure).
>
> Expected outcome is that pacemaker.service will become
> "inactive (dead)" after killing corosync (as a result of being
> "bound" to corosync). Have you indeed issued "systemctl
> daemon-reload" after updating the pacemaker unit file?
>
> (FTR, I tried with systemd 235).
>
>> If the intention is to unconditionally stop it when corosync dies,
>> pacemaker should probably exit with a unique code and the unit file
>> should have RestartPreventExitStatus set to it.
>
> That would be an elaborate way to reach the same.
>
> But good point in questioning what's the "best intention" around these
> scenarios -- normally, fencing would happen, but as you note, the node
> had actually survived by being fast enough to put corosync back to
> life, and from there, whether it adds any value to have pacemaker
> restarted on non-clean terminations at all. I don't know.
>
> Would it make more sense to have FailureAction=reboot-immediate to
> at least in part emulate the fencing instead?

Although the restart may also be blazingly fast in some cases, not
making much difference except for taking all the previously running
resources forcibly down as an extra step, which may be either good
or bad.

-- 
Jan (Poki)

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Is corosync supposed to be restarted if it fies?
On 28/11/17 22:35 +0300, Andrei Borzenkov wrote:
> 28.11.2017 13:01, Jan Pokorný wrote:
>> On 27/11/17 17:43 +0300, Andrei Borzenkov wrote:
>>> Sent from iPhone
>>>
>>>> On 27 Nov 2017, at 14:36, Ferenc Wágner wrote:
>>>>
>>>> Andrei Borzenkov writes:
>>>>
>>>>> 25.11.2017 10:05, Andrei Borzenkov wrote:
>>>>>
>>>>>> In one of the guides, the suggested procedure to simulate split
>>>>>> brain was to kill the corosync process. It actually worked on one
>>>>>> cluster, but on another the corosync process was restarted after
>>>>>> being killed, without the cluster noticing anything. Except that
>>>>>> after several attempts pacemaker died, stopping resources ... :)
>>>>>>
>>>>>> This is SLES12 SP2; I do not see any Restart in the service
>>>>>> definition, so it is probably not systemd.
>>>>>
>>>>> FTR - it was not corosync, but pacemaker; its unit file specifies
>>>>> Restart=on-failure, so killing corosync caused pacemaker to fail
>>>>> and be restarted by systemd.
>>>>
>>>> And starting corosync via a Requires dependency?
>>>
>>> Exactly.
>>
>> From my testing it looks like we should change
>> "Requires=corosync.service" to "BindsTo=corosync.service"
>> in pacemaker.service.
>>
>> Could you give it a try?
>
> I'm not sure what the expected outcome is, but pacemaker.service is
> still restarted (due to Restart=on-failure).

Expected outcome is that pacemaker.service will become
"inactive (dead)" after killing corosync (as a result of being
"bound" to corosync). Have you indeed issued "systemctl
daemon-reload" after updating the pacemaker unit file?

(FTR, I tried with systemd 235).

> If the intention is to unconditionally stop it when corosync dies,
> pacemaker should probably exit with a unique code and the unit file
> should have RestartPreventExitStatus set to it.

That would be an elaborate way to reach the same.

But good point in questioning what's the "best intention" around these
scenarios -- normally, fencing would happen, but as you note, the node
had actually survived by being fast enough to put corosync back to
life, and from there, whether it adds any value to have pacemaker
restarted on non-clean terminations at all. I don't know.

Would it make more sense to have FailureAction=reboot-immediate to
at least in part emulate the fencing instead?

-- 
Jan (Poki)
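As a rough illustration of the FailureAction= idea raised above, a unit-file fragment might look like the sketch below. Only the setting and its value come from the thread; the section placement is an assumption. Current systemd documents FailureAction= under [Unit], while older releases may expect it in [Service], so check systemd.unit(5) and systemd.service(5) for the version in use. Note that reboot-immediate reboots without a clean shutdown, which may lose data.

```ini
[Unit]
# Sketch only: reboot the node at once when this unit enters the failed
# state, as a partial stand-in for fencing. Section placement is an
# assumption and depends on the systemd version.
FailureAction=reboot-immediate
```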
Re: [ClusterLabs] Is corosync supposed to be restarted if it fies?
28.11.2017 13:01, Jan Pokorný wrote:
> On 27/11/17 17:43 +0300, Andrei Borzenkov wrote:
>> Sent from iPhone
>>
>>> On 27 Nov 2017, at 14:36, Ferenc Wágner wrote:
>>>
>>> Andrei Borzenkov writes:
>>>
>>>> 25.11.2017 10:05, Andrei Borzenkov wrote:
>>>>
>>>>> In one of the guides, the suggested procedure to simulate split
>>>>> brain was to kill the corosync process. It actually worked on one
>>>>> cluster, but on another the corosync process was restarted after
>>>>> being killed, without the cluster noticing anything. Except that
>>>>> after several attempts pacemaker died, stopping resources ... :)
>>>>>
>>>>> This is SLES12 SP2; I do not see any Restart in the service
>>>>> definition, so it is probably not systemd.
>>>>
>>>> FTR - it was not corosync, but pacemaker; its unit file specifies
>>>> Restart=on-failure, so killing corosync caused pacemaker to fail
>>>> and be restarted by systemd.
>>>
>>> And starting corosync via a Requires dependency?
>>
>> Exactly.
>
> From my testing it looks like we should change
> "Requires=corosync.service" to "BindsTo=corosync.service"
> in pacemaker.service.
>
> Could you give it a try?
>

I'm not sure what the expected outcome is, but pacemaker.service is
still restarted (due to Restart=on-failure).

If the intention is to unconditionally stop it when corosync dies,
pacemaker should probably exit with a unique code and the unit file
should have RestartPreventExitStatus set to it.
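The RestartPreventExitStatus= approach described above could be sketched as the following [Service] fragment. This is an illustration under assumptions, not pacemaker's shipped unit: the exit status 100 is a placeholder that does not come from the thread, and pacemaker itself would have to be changed to exit with that dedicated status when it loses corosync.

```ini
[Service]
# Keep restarting pacemaker on ordinary failures ...
Restart=on-failure
# ... but never restart it when it exits with the dedicated status.
# 100 is a hypothetical placeholder; use whatever unique code pacemaker
# would be made to return when corosync dies.
RestartPreventExitStatus=100
```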
Re: [ClusterLabs] Is corosync supposed to be restarted if it fies?
On 27/11/17 17:43 +0300, Andrei Borzenkov wrote:
> Sent from iPhone
>
>> On 27 Nov 2017, at 14:36, Ferenc Wágner wrote:
>>
>> Andrei Borzenkov writes:
>>
>>> 25.11.2017 10:05, Andrei Borzenkov wrote:
>>>
>>>> In one of the guides, the suggested procedure to simulate split
>>>> brain was to kill the corosync process. It actually worked on one
>>>> cluster, but on another the corosync process was restarted after
>>>> being killed, without the cluster noticing anything. Except that
>>>> after several attempts pacemaker died, stopping resources ... :)
>>>>
>>>> This is SLES12 SP2; I do not see any Restart in the service
>>>> definition, so it is probably not systemd.
>>>
>>> FTR - it was not corosync, but pacemaker; its unit file specifies
>>> Restart=on-failure, so killing corosync caused pacemaker to fail
>>> and be restarted by systemd.
>>
>> And starting corosync via a Requires dependency?
>
> Exactly.

From my testing it looks like we should change
"Requires=corosync.service" to "BindsTo=corosync.service"
in pacemaker.service.

Could you give it a try?

-- 
Jan (Poki)
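A minimal sketch of the proposed change, shown as a fragment of pacemaker.service. Only the two corosync lines reflect the thread; the rest of the unit is assumed to stay as shipped, and "systemctl daemon-reload" is needed after editing. BindsTo= only makes systemd stop pacemaker.service when corosync.service goes away; it does not by itself change how Restart= is evaluated.

```ini
[Unit]
# Start pacemaker only after corosync, and tie its lifetime to corosync:
# BindsTo= takes the place of the former Requires=corosync.service line.
After=corosync.service
BindsTo=corosync.service
```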
Re: [ClusterLabs] Is corosync supposed to be restarted if it fies?
Sent from iPhone

> On 27 Nov 2017, at 14:36, Ferenc Wágner wrote:
>
> Andrei Borzenkov writes:
>
>> 25.11.2017 10:05, Andrei Borzenkov wrote:
>>
>>> In one of the guides, the suggested procedure to simulate split
>>> brain was to kill the corosync process. It actually worked on one
>>> cluster, but on another the corosync process was restarted after
>>> being killed, without the cluster noticing anything. Except that
>>> after several attempts pacemaker died, stopping resources ... :)
>>>
>>> This is SLES12 SP2; I do not see any Restart in the service
>>> definition, so it is probably not systemd.
>>
>> FTR - it was not corosync, but pacemaker; its unit file specifies
>> Restart=on-failure, so killing corosync caused pacemaker to fail
>> and be restarted by systemd.
>
> And starting corosync via a Requires dependency?

Exactly.

> --
> Feri
Re: [ClusterLabs] Is corosync supposed to be restarted if it fies?
Andrei Borzenkov writes:

> 25.11.2017 10:05, Andrei Borzenkov wrote:
>
>> In one of the guides, the suggested procedure to simulate split
>> brain was to kill the corosync process. It actually worked on one
>> cluster, but on another the corosync process was restarted after
>> being killed, without the cluster noticing anything. Except that
>> after several attempts pacemaker died, stopping resources ... :)
>>
>> This is SLES12 SP2; I do not see any Restart in the service
>> definition, so it is probably not systemd.
>
> FTR - it was not corosync, but pacemaker; its unit file specifies
> Restart=on-failure, so killing corosync caused pacemaker to fail
> and be restarted by systemd.

And starting corosync via a Requires dependency?

-- 
Feri
Re: [ClusterLabs] Is corosync supposed to be restarted if it fies?
25.11.2017 10:05, Andrei Borzenkov wrote:
> In one of the guides, the suggested procedure to simulate split
> brain was to kill the corosync process. It actually worked on one
> cluster, but on another the corosync process was restarted after
> being killed, without the cluster noticing anything. Except that
> after several attempts pacemaker died, stopping resources ... :)
>
> This is SLES12 SP2; I do not see any Restart in the service
> definition, so it is probably not systemd.
>

FTR - it was not corosync, but pacemaker; its unit file specifies
Restart=on-failure, so killing corosync caused pacemaker to fail and be
restarted by systemd.

I wish systemd could dynamically "unmanage" services ...