On Tue, Apr 23, 2024 at 10:34 AM Klaus Wenninger <kwenn...@redhat.com> wrote:
>
> On Tue, Apr 23, 2024 at 9:53 AM NOLIBOS Christophe
> <christophe.noli...@thalesgroup.com> wrote:
>
>> Classified as: {OPEN}
>>
>> Other strange thing.
>>
>> On RHEL 7, corosync is restarted even though the "Restart=on-failure"
>> line is commented out.
>>
>> I also think that something changed in the pacemaker behavior, or
>> somewhere else.
>
> That is how it was working before the introduction of the reconnection
> to corosync. Previously pacemaker would fail and systemd would restart
> it, checking the services pacemaker depends on - and, finding corosync
> not running, it would be restarted.
>
> From what I've read there has also been a change a while back in how
> systemd handles the restart of dependent services, so the changed
> behavior can come from that as well. Just for completeness ...
>
> Klaus
>
>> From: Klaus Wenninger <kwenn...@redhat.com>
>> Sent: Monday, April 22, 2024 12:41
>> To: NOLIBOS Christophe <christophe.noli...@thalesgroup.com>
>> Cc: Cluster Labs - All topics related to open-source clustering
>> welcomed <users@clusterlabs.org>
>> Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>> On Mon, Apr 22, 2024 at 12:32 PM NOLIBOS Christophe
>> <christophe.noli...@thalesgroup.com> wrote:
>>
>>> You are right: the "Restart=on-failure" line is commented out and so
>>> disabled by default.
>>> Uncommenting it resolves my issue.
>>
>> Maybe pacemaker changed behavior here without syncing enough with
>> corosync behavior. We'll look into that to see which approach is
>> better - restart corosync on failure - or have pacemaker be restarted
>> by systemd, which should in turn restart corosync as well.
>>
>> Klaus
>>
>>> Thanks a lot.
>>> Christophe.
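[The fix discussed above, uncommenting Restart=on-failure, can also be applied without editing the packaged unit file, via a systemd drop-in. A minimal sketch, assuming the conventional drop-in path; the RestartSec value is an illustrative assumption, not taken from the thread:]

```ini
# /etc/systemd/system/corosync.service.d/restart.conf
# Sketch of a drop-in that makes systemd restart corosync whenever it
# exits non-cleanly (e.g. killed by SIGKILL).
# RestartSec=1 is an illustrative value, not one from the shipped unit.
[Service]
Restart=on-failure
RestartSec=1
```

[After creating the drop-in, run `systemctl daemon-reload` so systemd picks up the change. A drop-in survives package upgrades, unlike edits to the unit file under /usr/lib/systemd/system.]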
>> From: Klaus Wenninger <kwenn...@redhat.com>
>> Sent: Monday, April 22, 2024 11:06
>> To: NOLIBOS Christophe <christophe.noli...@thalesgroup.com>
>> Cc: Cluster Labs - All topics related to open-source clustering
>> welcomed <users@clusterlabs.org>
>> Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>> On Mon, Apr 22, 2024 at 9:51 AM NOLIBOS Christophe
>> <christophe.noli...@thalesgroup.com> wrote:
>>
>>> With the 'kill -9' command.
>>> Is that a graceful exit?
>>
>> Looks as if the corosync unit file has Restart=on-failure disabled by
>> default. I'm not aware of another mechanism that would restart
>> corosync, and I think the default behavior is not to restart.
>> The comments suggest enabling it only when using a watchdog, but that
>> might just reference the RestartSec meant to provoke a watchdog-reboot
>> instead of a restart via systemd.
>> Any signal that isn't handled by the process - which could otherwise
>> set the exit code to 0 - should be fine.
>>
>> Klaus
>>
>> From: Klaus Wenninger <kwenn...@redhat.com>
>> Sent: Thursday, April 18, 2024 20:17
>> To: NOLIBOS Christophe <christophe.noli...@thalesgroup.com>
>> Cc: Cluster Labs - All topics related to open-source clustering
>> welcomed <users@clusterlabs.org>
>> Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>> NOLIBOS Christophe <christophe.noli...@thalesgroup.com> wrote on Thu,
>> Apr 18, 2024, 19:01:
>>
>>> Hmm... my RHEL 8.8 OS has been hardened.
>>> I am wondering whether the problem comes from that.
>>>
>>> On the other hand, I get the same issue (i.e. corosync not restarted
>>> by systemd) with Pacemaker 2.1.5-8 deployed on RHEL 8.4 (not
>>> hardened).
>>>
>>> I'm checking.
>>
>> How did you kill corosync? If it exits gracefully it might not be
>> restarted. Check the journal.
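[To illustrate why `kill -9` qualifies for Restart=on-failure: a process terminated by an unhandled signal cannot exit with status 0; the observed exit status becomes 128 plus the signal number, and SIGKILL can never be caught or handled. A small shell sketch, using `sleep` as a stand-in for the daemon:]

```shell
# Kill a background process with SIGKILL and observe its exit status.
# 128 + 9 (SIGKILL) = 137, which counts as a failure from systemd's
# point of view, so Restart=on-failure would trigger a restart.
sleep 60 &
pid=$!
kill -9 "$pid"
wait "$pid"
echo "exit status: $?"   # 137 = 128 + SIGKILL
```

[By contrast, a daemon that catches a signal and exits cleanly reports status 0, and Restart=on-failure does nothing; that is the "graceful exit" distinction made above.]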
>> Sorry, can't try it - I am on my mobile at the moment.
>>
>> Klaus
>>
>> From: Users <users-boun...@clusterlabs.org> On Behalf Of NOLIBOS
>> Christophe via Users
>> Sent: Thursday, April 18, 2024 18:34
>> To: Klaus Wenninger <kwenn...@redhat.com>; Cluster Labs - All topics
>> related to open-source clustering welcomed <users@clusterlabs.org>
>> Cc: NOLIBOS Christophe <christophe.noli...@thalesgroup.com>
>> Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>>> So, the issue is in systemd?
>>>
>>> If I run the same test on RHEL 7 (3.10.0-693.11.1.el7) with pacemaker
>>> 1.1.13-10, corosync is correctly restarted by systemd.
>>>
>>> [RHEL7 ~]# journalctl -f
>>> -- Logs begin at Wed 2024-01-03 13:15:41 UTC. --
>>> Apr 18 16:26:55 - systemd[1]: corosync.service failed.
>>> Apr 18 16:26:55 - systemd[1]: pacemaker.service holdoff time over, scheduling restart.
>>> Apr 18 16:26:55 - systemd[1]: Starting Corosync Cluster Engine...
>>> Apr 18 16:26:55 - corosync[12179]: Starting Corosync Cluster Engine (corosync): [ OK ]
>>> Apr 18 16:26:55 - systemd[1]: Started Corosync Cluster Engine.
>>> Apr 18 16:26:55 - systemd[1]: Started Pacemaker High Availability Cluster Manager.
>>> Apr 18 16:26:55 - systemd[1]: Starting Pacemaker High Availability Cluster Manager...
>>> Apr 18 16:26:55 - pacemakerd[12192]: notice: Additional logging available in /var/log/pacemaker.log
>>> Apr 18 16:26:55 - pacemakerd[12192]: notice: Switching to /var/log/cluster/corosync.log
>>> Apr 18 16:26:55 - pacemakerd[12192]: notice: Additional logging available in /var/log/cluster/corosync.log
>>
>> From: Klaus Wenninger <kwenn...@redhat.com>
>> Sent: Thursday, April 18, 2024 18:12
>> To: NOLIBOS Christophe <christophe.noli...@thalesgroup.com>; Cluster
>> Labs - All topics related to open-source clustering welcomed
>> <users@clusterlabs.org>
>> Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>> On Thu, Apr 18, 2024 at 6:09 PM Klaus Wenninger <kwenn...@redhat.com>
>> wrote:
>>
>>> On Thu, Apr 18, 2024 at 6:06 PM NOLIBOS Christophe
>>> <christophe.noli...@thalesgroup.com> wrote:
>>>
>>>> Well... why do you say that "if corosync isn't there this is to be
>>>> expected and pacemaker won't recover corosync"?
>>>> In my mind, Corosync is managed by Pacemaker like any other cluster
>>>> resource, and the "pacemakerd: recover properly from Corosync crash"
>>>> fix implemented in version 2.1.2 seems to confirm that.
>>>
>>> Nope. Startup of the stack is done by systemd: pacemaker is just
>>> started after corosync is up, and systemd should be responsible for
>>> keeping the stack up.
>>> For completeness: if you have sbd in the mix, that is also started by
>>> systemd, but kind of in parallel with corosync, as part of it (in
>>> systemd terminology).
>>
>> The "recover" above refers to pacemaker recovering from corosync going
>> away and coming back.
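[The startup ordering described above, pacemaker being started only after corosync is up, comes from dependency directives in the pacemaker unit file, along these lines. A sketch only; verify the exact directives shipped on your system with `systemctl cat pacemaker`:]

```ini
# Excerpt-style sketch of pacemaker.service dependencies (illustrative;
# the shipped unit file contains more than this).
[Unit]
After=corosync.service
Requires=corosync.service
```

[How a corosync failure propagates through Requires= to dependent units has varied across systemd versions, which fits the changed-behavior observation earlier in this thread.]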
>>
>> Klaus
>>
>>> From: NOLIBOS Christophe
>>> Sent: Thursday, April 18, 2024 17:56
>>> To: 'Klaus Wenninger' <kwenn...@redhat.com>; Cluster Labs - All
>>> topics related to open-source clustering welcomed
>>> <users@clusterlabs.org>
>>> Cc: Ken Gaillot <kgail...@redhat.com>
>>> Subject: RE: [ClusterLabs] "pacemakerd: recover properly from
>>> Corosync crash" fix
>>>
>>> [~]$ systemctl status corosync
>>> ● corosync.service - Corosync Cluster Engine
>>>    Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
>>>    Active: failed (Result: signal) since Thu 2024-04-18 14:58:42 UTC; 53min ago
>>>      Docs: man:corosync
>>>            man:corosync.conf
>>>            man:corosync_overview
>>>   Process: 2027251 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=0/SUCCESS)
>>>   Process: 1324906 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=KILL)
>>>  Main PID: 1324906 (code=killed, signal=KILL)
>>>
>>> Apr 18 13:16:04 - corosync[1324906]: [QUORUM] Sync joined[1]: 1
>>> Apr 18 13:16:04 - corosync[1324906]: [TOTEM ] A new membership (1.1c8) was formed. Members joined: 1
>>> Apr 18 13:16:04 - corosync[1324906]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
>>> Apr 18 13:16:04 - corosync[1324906]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
>>> Apr 18 13:16:04 - corosync[1324906]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
>>> Apr 18 13:16:04 - corosync[1324906]: [QUORUM] Members[1]: 1
>>> Apr 18 13:16:04 - corosync[1324906]: [MAIN  ] Completed service synchronization, ready to provide service.
>>> Apr 18 13:16:04 - systemd[1]: Started Corosync Cluster Engine.
>>> Apr 18 14:58:42 - systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL
>>> Apr 18 14:58:42 - systemd[1]: corosync.service: Failed with result 'signal'.
>>> [~]$
>>>
>>> From: Klaus Wenninger <kwenn...@redhat.com>
>>> Sent: Thursday, April 18, 2024 17:43
>>> To: Cluster Labs - All topics related to open-source clustering
>>> welcomed <users@clusterlabs.org>
>>> Cc: Ken Gaillot <kgail...@redhat.com>; NOLIBOS Christophe
>>> <christophe.noli...@thalesgroup.com>
>>> Subject: Re: [ClusterLabs] "pacemakerd: recover properly from
>>> Corosync crash" fix
>>>
>>> On Thu, Apr 18, 2024 at 5:07 PM NOLIBOS Christophe via Users
>>> <users@clusterlabs.org> wrote:
>>>
>>>> I'm using RedHat 8.8 (4.18.0-477.21.1.el8_8.x86_64).
>>>> When I kill Corosync, no new corosync process is created and
>>>> pacemaker is in failure.
>>>> The only solution is to restart the pacemaker service.
>>>>
>>>> [~]$ pcs status
>>>> Error: unable to get cib
>>>> [~]$
>>>>
>>>> [~]$ systemctl status pacemaker
>>>> ● pacemaker.service - Pacemaker High Availability Cluster Manager
>>>>    Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
>>>>    Active: active (running) since Thu 2024-04-18 13:16:04 UTC; 1h 43min ago
>>>>      Docs: man:pacemakerd
>>>>            https://clusterlabs.org/pacemaker/doc/
>>>>  Main PID: 1324923 (pacemakerd)
>>>>     Tasks: 91
>>>>    Memory: 132.1M
>>>>    CGroup: /system.slice/pacemaker.service
>>>>    ...
>>>> Apr 18 14:59:02 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>>>> Apr 18 14:59:03 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>>>> Apr 18 14:59:04 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>>>> Apr 18 14:59:05 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>>>> Apr 18 14:59:06 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>>>> Apr 18 14:59:07 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>>>> Apr 18 14:59:08 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>>>> Apr 18 14:59:09 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>>>> Apr 18 14:59:10 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>>>> Apr 18 14:59:11 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>>>> [~]$
>>>
>>> Well, if corosync isn't there, this is to be expected and pacemaker
>>> won't recover corosync.
>>> Can you check what systemd thinks about corosync (status/journal)?
>>>
>>> Klaus
>>>
>>> -----Original Message-----
>>> From: Ken Gaillot <kgail...@redhat.com>
>>> Sent: Thursday, April 18, 2024 16:40
>>> To: Cluster Labs - All topics related to open-source clustering
>>> welcomed <users@clusterlabs.org>
>>> Cc: NOLIBOS Christophe <christophe.noli...@thalesgroup.com>
>>> Subject: Re: [ClusterLabs] "pacemakerd: recover properly from
>>> Corosync crash" fix
>>>
>>> What OS are you using? Does it use systemd?
>>>
>>> What happens when you kill Corosync?
>>>
>>> On Thu, 2024-04-18 at 13:13 +0000, NOLIBOS Christophe via Users
>>> wrote:
>>>> Dear All,
>>>>
>>>> I have a question about the "pacemakerd: recover properly from
>>>> Corosync crash" fix implemented in version 2.1.2.
>>>> I observed the issue when testing pacemaker version 2.0.5, just by
>>>> killing the 'corosync' process: Corosync was not recovered.
>>>>
>>>> I am now using pacemaker version 2.1.5-8.
>>>> Doing the same test, I get the same result: Corosync is still not
>>>> recovered.
>>>>
>>>> Please confirm that the "pacemakerd: recover properly from Corosync
>>>> crash" fix implemented in version 2.1.2 covers this scenario.
>>>> If it does, did I miss something in the configuration of my cluster?
>>>>
>>>> Best Regards.
>>>>
>>>> Christophe.
>>> --
>>> Ken Gaillot <kgail...@redhat.com>
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/