On Thu, Apr 18, 2024 at 6:09 PM Klaus Wenninger <kwenn...@redhat.com> wrote:
>
> On Thu, Apr 18, 2024 at 6:06 PM NOLIBOS Christophe <christophe.noli...@thalesgroup.com> wrote:
>
>> Classified as: {OPEN}
>>
>> Well… why do you say that "Well if corosync isn't there then this is
>> to be expected and pacemaker won't recover corosync."?
>>
>> In my mind, Corosync is managed by Pacemaker like any other cluster
>> resource, and the "pacemakerd: recover properly from Corosync crash" fix
>> implemented in version 2.1.2 seems to confirm that.
>>
>
> Nope. Startup of the stack is done by systemd. Pacemaker is just
> started after corosync is up, and systemd should be responsible for
> keeping the stack up.
> For completeness: if you have sbd in the mix, that is also started
> by systemd, but more or less in parallel with corosync, as part of it
> (systemd terminology).
>

The "recover" above refers to pacemaker recovering from corosync going
away and coming back.

>
> Klaus
>
>>
>> From: NOLIBOS Christophe
>> Sent: Thursday, April 18, 2024 17:56
>> To: 'Klaus Wenninger' <kwenn...@redhat.com>; Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
>> Cc: Ken Gaillot <kgail...@redhat.com>
>> Subject: RE: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
>>
>> [~]$ systemctl status corosync
>> ● corosync.service - Corosync Cluster Engine
>>    Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
>>    Active: failed (Result: signal) since Thu 2024-04-18 14:58:42 UTC; 53min ago
>>      Docs: man:corosync
>>            man:corosync.conf
>>            man:corosync_overview
>>   Process: 2027251 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=0/SUCCESS)
>>   Process: 1324906 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=KILL)
>>  Main PID: 1324906 (code=killed, signal=KILL)
>>
>> Apr 18 13:16:04 - corosync[1324906]: [QUORUM] Sync joined[1]: 1
>> Apr 18 13:16:04 - corosync[1324906]: [TOTEM ] A new membership (1.1c8) was formed. Members joined: 1
>> Apr 18 13:16:04 - corosync[1324906]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
>> Apr 18 13:16:04 - corosync[1324906]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
>> Apr 18 13:16:04 - corosync[1324906]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
>> Apr 18 13:16:04 - corosync[1324906]: [QUORUM] Members[1]: 1
>> Apr 18 13:16:04 - corosync[1324906]: [MAIN ] Completed service synchronization, ready to provide service.
>> Apr 18 13:16:04 - systemd[1]: Started Corosync Cluster Engine.
>> Apr 18 14:58:42 - systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL
>> Apr 18 14:58:42 - systemd[1]: corosync.service: Failed with result 'signal'.
>> [~]$
>>
>> From: Klaus Wenninger <kwenn...@redhat.com>
>> Sent: Thursday, April 18, 2024 17:43
>> To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
>> Cc: Ken Gaillot <kgail...@redhat.com>; NOLIBOS Christophe <christophe.noli...@thalesgroup.com>
>> Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
>>
>> On Thu, Apr 18, 2024 at 5:07 PM NOLIBOS Christophe via Users <users@clusterlabs.org> wrote:
>>
>> I'm using RedHat 8.8 (4.18.0-477.21.1.el8_8.x86_64).
>> When I kill Corosync, no new corosync process is created and Pacemaker
>> ends up in a failed state.
>> The only solution is to restart the pacemaker service.
>>
>> [~]$ pcs status
>> Error: unable to get cib
>> [~]$
>>
>> [~]$ systemctl status pacemaker
>> ● pacemaker.service - Pacemaker High Availability Cluster Manager
>>    Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
>>    Active: active (running) since Thu 2024-04-18 13:16:04 UTC; 1h 43min ago
>>      Docs: man:pacemakerd
>>            https://clusterlabs.org/pacemaker/doc/
>>  Main PID: 1324923 (pacemakerd)
>>     Tasks: 91
>>    Memory: 132.1M
>>    CGroup: /system.slice/pacemaker.service
>> ...
>> Apr 18 14:59:02 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:03 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:04 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:05 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:06 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:07 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:08 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:09 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:10 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:11 - pacemakerd[1324923]: crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> [~]$
>>
>> Well if corosync isn't there then this is to be expected and pacemaker
>> won't recover corosync.
>>
>> Can you check what systemd thinks about corosync (status/journal)?
>>
>> Klaus
>>
>> -----Original Message-----
>> From: Ken Gaillot <kgail...@redhat.com>
>> Sent: Thursday, April 18, 2024 16:40
>> To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
>> Cc: NOLIBOS Christophe <christophe.noli...@thalesgroup.com>
>> Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
>>
>> What OS are you using? Does it use systemd?
>>
>> What happens when you kill Corosync?
>>
>> On Thu, 2024-04-18 at 13:13 +0000, NOLIBOS Christophe via Users wrote:
>> > Dear All,
>> >
>> > I have a question about the "pacemakerd: recover properly from
>> > Corosync crash" fix implemented in version 2.1.2.
>> > I observed the issue when testing pacemaker version 2.0.5, just
>> > by killing the 'corosync' process: Corosync was not recovered.
>> >
>> > I am now using pacemaker version 2.1.5-8.
>> > Doing the same test, I get the same result: Corosync is still not
>> > recovered.
>> >
>> > Please confirm whether the "pacemakerd: recover properly from Corosync
>> > crash" fix implemented in version 2.1.2 covers this scenario.
>> > If it does, did I miss something in the configuration of my cluster?
>> >
>> > Best regards,
>> >
>> > Christophe.
>> >
>> > _______________________________________________
>> > Manage your subscription:
>> > https://lists.clusterlabs.org/mailman/listinfo/users
>> >
>> > ClusterLabs home: https://www.clusterlabs.org/
>> --
>> Ken Gaillot <kgail...@redhat.com>
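
To illustrate the point that systemd, rather than Pacemaker, decides whether a killed corosync comes back, here is a minimal sketch. It assumes the unit properties are read with `systemctl show -p Restart -p Result corosync.service`; the helper names (`parse_show_output`, `restarts_after_kill`, `query_unit`) are hypothetical, and the set of restarting values is taken from the Restart= table in systemd.service(5). Nothing in the thread states what Restart= value this distribution's corosync.service actually uses.

```python
import subprocess

# Restart= settings under which systemd restarts a service whose main
# process died from an unclean signal such as SIGKILL
# (assumption: per the Restart= table in systemd.service(5)).
RESTARTS_ON_UNCLEAN_SIGNAL = {"always", "on-failure", "on-abnormal", "on-abort"}


def parse_show_output(text):
    """Parse KEY=VALUE lines as printed by `systemctl show`."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that are not KEY=VALUE
            props[key.strip()] = value.strip()
    return props


def restarts_after_kill(show_output):
    """Return True if the unit's Restart= policy covers a SIGKILL'd process."""
    # systemd's default is Restart=no when the unit file sets nothing.
    restart = parse_show_output(show_output).get("Restart", "no")
    return restart in RESTARTS_ON_UNCLEAN_SIGNAL


def query_unit(unit="corosync.service"):
    """Hypothetical helper: fetch Restart= and Result= for a unit."""
    result = subprocess.run(
        ["systemctl", "show", "-p", "Restart", "-p", "Result", unit],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a cluster node, `restarts_after_kill(query_unit())` would indicate whether systemd is expected to bring corosync back after a `kill -9`. If it returns False, the "failed (Result: signal)" state shown in the status output above is what one would expect to see.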