On Mon, Apr 22, 2024 at 9:51 AM NOLIBOS Christophe <christophe.noli...@thalesgroup.com> wrote:
> Classified as: {OPEN}
>
> ‘kill -9’ command.
>
> Is it a graceful exit?

Looks as if the corosync unit file has Restart=on-failure disabled per
default. I'm not aware of another mechanism that would restart corosync,
and I think the default behavior is not to restart. The comments suggest
enabling it only if using a watchdog, but that might just reference
RestartSec to provoke a watchdog reboot instead of a restart via systemd.
Any signal that isn't handled by the process - so that the exit code could
be set to 0 - should be fine.

Klaus

> *From:* Klaus Wenninger <kwenn...@redhat.com>
> *Sent:* Thursday, 18 April 2024 20:17
> *To:* NOLIBOS Christophe <christophe.noli...@thalesgroup.com>
> *Cc:* Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
>
> NOLIBOS Christophe <christophe.noli...@thalesgroup.com> wrote on Thu, 18 Apr 2024, 19:01:
>
>> Hmmm… my RHEL 8.8 OS has been hardened.
>> I am wondering whether the problem comes from that.
>>
>> On the other hand, I get the same issue (i.e. corosync not restarted
>> by systemd) with Pacemaker 2.1.5-8 deployed on RHEL 8.4 (not hardened).
>>
>> I’m checking.
>
> How did you kill corosync? If it exits gracefully it might not be
> restarted. Check the journal. Sorry, can't try - I'm on my mobile ATM.
>
> Klaus
>
> *From:* Users <users-boun...@clusterlabs.org> *On behalf of* NOLIBOS Christophe via Users
> *Sent:* Thursday, 18 April 2024 18:34
> *To:* Klaus Wenninger <kwenn...@redhat.com>; Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
> *Cc:* NOLIBOS Christophe <christophe.noli...@thalesgroup.com>
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
>
> So, the issue is in systemd?
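Not part of the original exchange, but for reference: if one did want systemd to restart corosync after a non-clean exit, a drop-in along these lines would be a minimal sketch (assuming the unit is named corosync.service; the RestartSec value is an arbitrary example and, per the unit-file comments Klaus refers to, interacts with watchdog behavior):

```ini
# /etc/systemd/system/corosync.service.d/restart.conf
# Restart only on non-clean exits (e.g. killed by a signal),
# not after a graceful stop with exit code 0.
[Service]
Restart=on-failure
RestartSec=5s
```

Followed by `systemctl daemon-reload`. Whether enabling this is advisable on a given setup depends on the watchdog configuration, so treat it as a sketch, not a recommendation.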
> If I run the same test on RHEL 7 (3.10.0-693.11.1.el7) with pacemaker
> 1.1.13-10, corosync is correctly restarted by systemd.
>
> [RHEL7 ~]# journalctl -f
> -- Logs begin at Wed 2024-01-03 13:15:41 UTC. --
> Apr 18 16:26:55 - systemd[1]: corosync.service failed.
> Apr 18 16:26:55 - systemd[1]: pacemaker.service holdoff time over, scheduling restart.
> Apr 18 16:26:55 - systemd[1]: Starting Corosync Cluster Engine...
> Apr 18 16:26:55 - corosync[12179]: Starting Corosync Cluster Engine (corosync): [ OK ]
> Apr 18 16:26:55 - systemd[1]: Started Corosync Cluster Engine.
> Apr 18 16:26:55 - systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Apr 18 16:26:55 - systemd[1]: Starting Pacemaker High Availability Cluster Manager...
> Apr 18 16:26:55 - pacemakerd[12192]: notice: Additional logging available in /var/log/pacemaker.log
> Apr 18 16:26:55 - pacemakerd[12192]: notice: Switching to /var/log/cluster/corosync.log
> Apr 18 16:26:55 - pacemakerd[12192]: notice: Additional logging available in /var/log/cluster/corosync.log
>
> *From:* Klaus Wenninger <kwenn...@redhat.com>
> *Sent:* Thursday, 18 April 2024 18:12
> *To:* NOLIBOS Christophe <christophe.noli...@thalesgroup.com>; Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
>
> On Thu, Apr 18, 2024 at 6:09 PM Klaus Wenninger <kwenn...@redhat.com> wrote:
>
>> On Thu, Apr 18, 2024 at 6:06 PM NOLIBOS Christophe <christophe.noli...@thalesgroup.com> wrote:
>>
>>> Well… why do you say "Well, if corosync isn't there then this is to
>>> be expected, and pacemaker won't recover corosync"?
>>> In my mind, Corosync is managed by Pacemaker like any other cluster
>>> resource, and the "pacemakerd: recover properly from Corosync crash"
>>> fix implemented in version 2.1.2 seems to confirm that.
>
> Nope. Startup of the stack is done by systemd, and pacemaker is just
> started after corosync is up; systemd should be responsible for keeping
> the stack up.
> For completeness: if you have sbd in the mix, it is likewise started by
> systemd, but kind of in parallel with corosync, as part of it (in
> systemd terminology).
>
> The "recover" above refers to pacemaker recovering from corosync going
> away and coming back.
>
> Klaus
>
> *From:* NOLIBOS Christophe
> *Sent:* Thursday, 18 April 2024 17:56
> *To:* 'Klaus Wenninger' <kwenn...@redhat.com>; Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
> *Cc:* Ken Gaillot <kgail...@redhat.com>
> *Subject:* RE: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
>
> [~]$ systemctl status corosync
> ● corosync.service - Corosync Cluster Engine
>    Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
>    Active: failed (Result: signal) since Thu 2024-04-18 14:58:42 UTC; 53min ago
>      Docs: man:corosync
>            man:corosync.conf
>            man:corosync_overview
>   Process: 2027251 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=0/SUCCESS)
>   Process: 1324906 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=KILL)
>  Main PID: 1324906 (code=killed, signal=KILL)
>
> Apr 18 13:16:04 - corosync[1324906]: [QUORUM] Sync joined[1]: 1
> Apr 18 13:16:04 - corosync[1324906]: [TOTEM ] A new membership (1.1c8) was formed. Members joined: 1
> Apr 18 13:16:04 - corosync[1324906]: [VOTEQ ] Waiting for all cluster members.
> Current votes: 1 expected_votes: 2
> Apr 18 13:16:04 - corosync[1324906]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> Apr 18 13:16:04 - corosync[1324906]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> Apr 18 13:16:04 - corosync[1324906]: [QUORUM] Members[1]: 1
> Apr 18 13:16:04 - corosync[1324906]: [MAIN  ] Completed service synchronization, ready to provide service.
> Apr 18 13:16:04 - systemd[1]: Started Corosync Cluster Engine.
> Apr 18 14:58:42 - systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL
> Apr 18 14:58:42 - systemd[1]: corosync.service: Failed with result 'signal'.
> [~]$
>
> *From:* Klaus Wenninger <kwenn...@redhat.com>
> *Sent:* Thursday, 18 April 2024 17:43
> *To:* Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
> *Cc:* Ken Gaillot <kgail...@redhat.com>; NOLIBOS Christophe <christophe.noli...@thalesgroup.com>
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
>
> On Thu, Apr 18, 2024 at 5:07 PM NOLIBOS Christophe via Users <users@clusterlabs.org> wrote:
>
>> I'm using RedHat 8.8 (4.18.0-477.21.1.el8_8.x86_64).
>> When I kill Corosync, no new corosync process is created and pacemaker
>> is in failure. The only solution is to restart the pacemaker service.
>>
>> [~]$ pcs status
>> Error: unable to get cib
>> [~]$
>>
>> [~]$ systemctl status pacemaker
>> ● pacemaker.service - Pacemaker High Availability Cluster Manager
>>    Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
>>    Active: active (running) since Thu 2024-04-18 13:16:04 UTC; 1h 43min ago
>>      Docs: man:pacemakerd
>>            https://clusterlabs.org/pacemaker/doc/
>>  Main PID: 1324923 (pacemakerd)
>>     Tasks: 91
>>    Memory: 132.1M
>>    CGroup: /system.slice/pacemaker.service
>> ...
>> Apr 18 14:59:02 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:03 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:04 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:05 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:06 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:07 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:08 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:09 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:10 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:11 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> [~]$
>
> Well, if corosync isn't there then this is to be expected, and pacemaker
> won't recover corosync.
> Can you check what systemd thinks about corosync (status/journal)?
>
> Klaus
>
> -----Original Message-----
> From: Ken Gaillot <kgail...@redhat.com>
> Sent: Thursday, 18 April 2024 16:40
> To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
> Cc: NOLIBOS Christophe <christophe.noli...@thalesgroup.com>
> Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
>
> What OS are you using? Does it use systemd?
>
> What happens when you kill Corosync?
>
> On Thu, 2024-04-18 at 13:13 +0000, NOLIBOS Christophe via Users wrote:
>> Dear All,
>>
>> I have a question about the "pacemakerd: recover properly from
>> Corosync crash" fix implemented in version 2.1.2.
>> I have observed the issue when testing pacemaker version 2.0.5, just
>> by killing the ‘corosync’ process: Corosync was not recovered.
>>
>> I am now using pacemaker version 2.1.5-8. Doing the same test, I get
>> the same result: Corosync is still not recovered.
>>
>> Please confirm that the "pacemakerd: recover properly from Corosync
>> crash" fix implemented in version 2.1.2 covers this scenario.
>> If it does, did I miss something in the configuration of my cluster?
>>
>> Best Regards,
>>
>> Christophe.
>
> --
> Ken Gaillot <kgail...@redhat.com>
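An aside on the exit-status point in the reply at the top: a process killed by an unhandled signal cannot exit with status 0, which is what lets systemd record Result: signal (and what would let Restart=on-failure distinguish a crash from a graceful stop). A quick illustration in a shell, using sleep as a stand-in for the daemon:

```shell
# SIGKILL cannot be caught, so the victim cannot exit cleanly.
# The shell reports 128 + signal number; SIGKILL is 9, hence 137.
sleep 60 &
pid=$!
kill -9 "$pid"
wait "$pid"
rc=$?
echo "exit status: $rc"   # prints "exit status: 137"
```

systemd sees the analogous thing for corosync in the status output above: code=killed, status=9/KILL, hence Result: 'signal', which counts as a failure rather than a clean exit.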
_______________________________________________
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/