Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-23 Thread Klaus Wenninger
On Tue, Apr 23, 2024 at 10:34 AM Klaus Wenninger 
wrote:

>
>
> On Tue, Apr 23, 2024 at 9:53 AM NOLIBOS Christophe <
> christophe.noli...@thalesgroup.com> wrote:
>
>> Classified as: {OPEN}
>>
>>
>>
>> Other strange thing.
>>
>> On RHEL 7, corosync is restarted even though the "Restart=on-failure"
>> line is commented out.
>>
>> I also think that something changed in pacemaker's behavior, or somewhere
>> else.
>>
>
> That is how it worked before the reconnection to corosync was introduced.
> Previously pacemaker would fail, and on restarting it systemd would check
> the services pacemaker depends on; finding corosync not running, it would
> restart corosync as well.
>

From what I've read, systemd also changed how it handles restarting
dependent services a while back, so the changed behavior can come from that
as well. Just for completeness ...

Klaus


Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-23 Thread Klaus Wenninger
On Tue, Apr 23, 2024 at 9:53 AM NOLIBOS Christophe <
christophe.noli...@thalesgroup.com> wrote:

> Other strange thing.
>
> On RHEL 7, corosync is restarted even though the "Restart=on-failure"
> line is commented out.
>
> I also think that something changed in pacemaker's behavior, or somewhere
> else.
>

That is how it worked before the reconnection to corosync was introduced.
Previously pacemaker would fail, and on restarting it systemd would check
the services pacemaker depends on; finding corosync not running, it would
restart corosync as well.

Klaus
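The dependency chain described here lives in the systemd unit files. A
simplified sketch of the relevant directives (the directive names are real
systemd options, but the values are abridged; check the `pacemaker.service`
actually shipped by your distribution, which varies by version):

```
# pacemaker.service (abridged sketch, not the literal shipped unit)
[Unit]
After=corosync.service
Requires=corosync.service

[Service]
# With this enabled, systemd restarts pacemaker after a crash, and the
# start job pulls corosync back up via the Requires= dependency above.
Restart=on-failure
```

Starting a unit queues start jobs for everything it `Requires=`, which is
why restarting pacemaker could bring corosync back in the old behavior.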



Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-23 Thread NOLIBOS Christophe via Users
Other strange thing.

On RHEL 7, corosync is restarted even though the "Restart=on-failure" line
is commented out.

I also think that something changed in pacemaker's behavior, or somewhere
else.

 


From: Users <users-boun...@clusterlabs.org> on behalf of NOLIBOS Christophe via Users
Sent: Thursday, 18 April 2024 18:34
To: Klaus Wenninger <kwenn...@redhat.com>; Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
Cc: NOLIBOS Christophe <christophe.noli...@thalesgroup.com>
Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

 

So, the issue is on systemd?

If I run the same test on RHEL 7 (3.10.0-693.11.1.el7) with pacemaker
1.1.13-10, corosync is correctly restarted by systemd.

[RHEL7 ~]# journalctl -f
-- Logs begin at Wed 2024-01-03 13:15:41 UTC. --
Apr 18 16:26:55 - systemd[1]: corosync.service failed.
Apr 18 16:26:55 - systemd[1]: pacemaker.service holdoff time over, scheduling restart.
Apr 18 16:26:55 - systemd[1]: Starting Corosync Cluster Engine...
Apr 18 16:26:55 - corosync[12179]: Starting Corosync Cluster Engine (corosync): [  OK  ]
Apr 18 16:26:55 - systemd[1]: Started Corosync Cluster Engine.
Apr 18 16:26:55 - systemd[1]: Started Pacemaker High Availability Cluster Manager.
Apr 18 16:26:55 - systemd[1]: Starting Pacemaker High Availability Cluster Manager...
Apr 18 16:26:55 - pacemakerd[12192]:   notice: Additional logging available in /var/log/pacemaker.log
Apr 18 16:26:55 - pacemakerd[12192]:   notice: Switching to /var/log/cluster/corosync.log
Apr 18 16:26:55 - pacemakerd[12192]:   notice: Additional logging available in /var/log/cluster/corosync.log

 

De : Klaus Wenninger < <mailto:kwen

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-22 Thread Klaus Wenninger
On Mon, Apr 22, 2024 at 12:32 PM NOLIBOS Christophe <
christophe.noli...@thalesgroup.com> wrote:

> Classified as: {OPEN}
>
>
>
> You are right: the "Restart=on-failure" line is commented out and so
> disabled by default.
>
> Uncommenting it resolves my issue.
>

Maybe pacemaker changed behavior here without syncing enough with corosync
behavior. We'll look into which approach is better: restart corosync on
failure, or have pacemaker be restarted by systemd, which should in turn
restart corosync as well.

Klaus


Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-22 Thread NOLIBOS Christophe via Users
You are right: the "Restart=on-failure" line is commented out and so
disabled by default.

Uncommenting it resolves my issue.

 

Thanks a lot.

Christophe.
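For reference: instead of uncommenting the line in the packaged unit file
(which a package update may overwrite), the same effect can be achieved with
a systemd drop-in. A sketch using standard systemd mechanics; the file path
is the conventional drop-in location, not something corosync ships:

```
# /etc/systemd/system/corosync.service.d/restart.conf  (drop-in sketch)
[Service]
Restart=on-failure
```

After creating the drop-in, run `systemctl daemon-reload` so systemd picks
it up.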

 


From: Klaus Wenninger <kwenn...@redhat.com>
Sent: Thursday, 18 April 2024 18:12
To: NOLIBOS Christophe <christophe.noli...@thalesgroup.com>; Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

 

 

 

On Thu, Apr 18, 2024 at 6:09 PM Klaus Wenninger <kwenn...@redhat.com> wrote:

On Thu, Apr 18, 2024 at 6:06 PM NOLIBOS Christophe <christophe.noli...@thalesgroup.com> wrote:

Well… why do you say "Well, if corosync isn't there, this is to be
expected, and pacemaker won't recover corosync"?

In my mind, Corosync is managed by Pacemaker like any other cluster
resource, and the "pacemakerd: recover properly from Corosync crash" fix
implemented in version 2.1.2 seems to confirm that.

Nope. Startup of the stack is done by systemd, and pacemaker is just
started after corosync is up; systemd should be responsible for keeping the
stack up.

For completeness: if you have sbd in the mix, that is started by systemd as
well, but kind of in parallel with corosync, as part of it (systemd
terminology).

The "recover" above refers to pacemaker recovering from corosync going away
and coming back.

Klaus

 

 


Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-22 Thread Klaus Wenninger
On Mon, Apr 22, 2024 at 9:51 AM NOLIBOS Christophe <
christophe.noli...@thalesgroup.com> wrote:

> Classified as: {OPEN}
>
>
>
> The 'kill -9' command.
>
> Does that count as a graceful exit?
>

It looks as if the corosync unit file has Restart=on-failure disabled by
default. I'm not aware of another mechanism that would restart corosync,
and I think the default behavior is not to restart. The comments suggest
enabling it only if using a watchdog, but that might just refer to
RestartSec provoking a watchdog reboot instead of a restart via systemd.
Any signal that isn't handled by the process - so that the exit code cannot
be set to 0 - should be fine.

Klaus
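The exit-status point can be checked from a plain shell, no cluster needed:
a process killed by an unhandled signal reports status 128 + signal number,
which Restart=on-failure treats as a failure. A minimal sketch in plain
bash, nothing corosync-specific:

```shell
# Kill a background process with SIGKILL, which cannot be handled,
# then read the status the parent sees.
sleep 30 &
pid=$!
kill -9 "$pid"
wait "$pid"
echo "exit status: $?"   # 137 = 128 + 9 (SIGKILL) -> a "failure" to systemd
```

A clean shutdown, by contrast - such as the unit's ExecStop via
corosync-cfgtool, visible later in this thread's systemctl status output -
ends with status 0 and would not trigger Restart=on-failure.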



Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-22 Thread NOLIBOS Christophe via Users
The 'kill -9' command.

Does that count as a graceful exit?

 


From: NOLIBOS Christophe
Sent: Thursday, 18 April 2024 17:56
To: 'Klaus Wenninger' <kwenn...@redhat.com>; Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
Cc: Ken Gaillot <kgail...@redhat.com>
Subject: RE: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

 

[~]$ systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since Thu 2024-04-18 14:58:42 UTC; 53min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 2027251 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=0/SUCCESS)
  Process: 1324906 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=KILL)
 Main PID: 1324906 (code=killed, signal=KILL)

Apr 18 13:16:04 - corosync[1324906]:   [QUORUM] Sync joined[1]: 1
Apr 18 13:16:04 - corosync[1324906]

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread Klaus Wenninger
NOLIBOS Christophe wrote on Thu, Apr 18, 2024, 19:01:

> Classified as: {OPEN}
>
>
>
> Hummm… my RHEL 8.8 OS has been hardened.
>
> I am wondering whether the problem comes from that.
>
>
>
> On another note, I get the same issue (i.e. corosync not restarted by
> systemd) with Pacemaker 2.1.5-8 deployed on RHEL 8.4 (not hardened).
>
>
>
> I’m checking.
>
>
>
How did you kill corosync? If it exits gracefully it might not be restarted.
Check the journal. Sorry, I can't try it myself right now; I'm on my mobile. Klaus


>
> {OPEN}
>
> *From:* Users  *On behalf of* NOLIBOS Christophe via Users
> *Sent:* Thursday, April 18, 2024 18:34
> *To:* Klaus Wenninger ; Cluster Labs - All topics related to open-source clustering welcomed 
> *Cc:* NOLIBOS Christophe 
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
>
>
>
> Classified as: {OPEN}
>
>
>
> So, the issue is on systemd?
>
>
>
> If I run the same test on RHEL 7 (3.10.0-693.11.1.el7) with pacemaker
> 1.1.13-10, corosync is correctly restarted by systemd.
>
>
>
> [RHEL7 ~]# journalctl -f
> -- Logs begin at Wed 2024-01-03 13:15:41 UTC. --
> Apr 18 16:26:55 - systemd[1]: corosync.service failed.
> Apr 18 16:26:55 - systemd[1]: pacemaker.service holdoff time over, scheduling restart.
> Apr 18 16:26:55 - systemd[1]: Starting Corosync Cluster Engine...
> Apr 18 16:26:55 - corosync[12179]: Starting Corosync Cluster Engine (corosync): [  OK  ]
> Apr 18 16:26:55 - systemd[1]: Started Corosync Cluster Engine.
> Apr 18 16:26:55 - systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Apr 18 16:26:55 - systemd[1]: Starting Pacemaker High Availability Cluster Manager...
> Apr 18 16:26:55 - pacemakerd[12192]:   notice: Additional logging available in /var/log/pacemaker.log
> Apr 18 16:26:55 - pacemakerd[12192]:   notice: Switching to /var/log/cluster/corosync.log
> Apr 18 16:26:55 - pacemakerd[12192]:   notice: Additional logging available in /var/log/cluster/corosync.log
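The RHEL 7 chaining shown in this log (pacemaker's holdoff restart pulling corosync up first) works because pacemaker.service declares corosync.service as an ordered hard dependency. A sketch of the relevant stanza; the exact directives are an assumption based on stock pacemaker unit files and should be verified on a node with `systemctl show pacemaker -p Requires -p After`:

```shell
# Print the dependency stanza that (per stock pacemaker unit files,
# an assumption here) makes systemd start corosync before restarting
# pacemaker.
cat <<'EOF'
[Unit]
# pacemaker.service (relevant excerpt)
After=corosync.service
Requires=corosync.service
EOF
```

With `Requires=`, restarting pacemaker forces corosync active; with only `Wants=` or no dependency, the holdoff restart would bring back pacemaker alone.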

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread NOLIBOS Christophe via Users
Classified as: {OPEN}

 

Hummm… my RHEL 8.8 OS has been hardened.

I am wondering whether the problem comes from that.

 

On another note, I get the same issue (i.e. corosync not restarted by systemd) with Pacemaker 2.1.5-8 deployed on RHEL 8.4 (not hardened).

 

I’m checking.

 

 


Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread Klaus Wenninger
On Thu, Apr 18, 2024 at 6:09 PM Klaus Wenninger  wrote:

>
>
> On Thu, Apr 18, 2024 at 6:06 PM NOLIBOS Christophe <
> christophe.noli...@thalesgroup.com> wrote:
>
>> Classified as: {OPEN}
>>
>>
>>
>> Well… why do you say that "Well, if corosync isn't there, this is to be expected and pacemaker won't recover corosync"?
>>
>> In my mind, Corosync is managed by Pacemaker like any other cluster resource, and the "pacemakerd: recover properly from Corosync crash" fix implemented in version 2.1.2 seems to confirm that.
>>
>
> Nope. Startup of the stack is done by systemd. And pacemaker is just
> started after corosync is up and
> systemd should be responsible for keeping the stack up.
> For completeness: if you have sbd in the mix that is as well being started
> by systemd but kind of
> parallel with corosync as part of it (systemd terminology).
>

The "recover" above is referring to pacemaker recovering from corosync
going away and coming back.


>
> Klaus

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread NOLIBOS Christophe via Users
Classified as: {OPEN}

 

Well… why do you say that "Well, if corosync isn't there, this is to be expected and pacemaker won't recover corosync"?

In my mind, Corosync is managed by Pacemaker like any other cluster resource, and the "pacemakerd: recover properly from Corosync crash" fix implemented in version 2.1.2 seems to confirm that.

 

 

{OPEN}

From: NOLIBOS Christophe 
Sent: Thursday, April 18, 2024 17:56
To: 'Klaus Wenninger' ; Cluster Labs - All topics related to open-source clustering welcomed 
Cc: Ken Gaillot 
Subject: RE: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

 

Classified as: {OPEN}

 

 

[~]$ systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since Thu 2024-04-18 14:58:42 UTC; 53min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 2027251 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=0/SUCCESS)
  Process: 1324906 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=KILL)
 Main PID: 1324906 (code=killed, signal=KILL)

Apr 18 13:16:04 - corosync[1324906]:   [QUORUM] Sync joined[1]: 1
Apr 18 13:16:04 - corosync[1324906]:   [TOTEM ] A new membership (1.1c8) was formed. Members joined: 1
Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Apr 18 13:16:04 - corosync[1324906]:   [QUORUM] Members[1]: 1
Apr 18 13:16:04 - corosync[1324906]:   [MAIN  ] Completed service synchronization, ready to provide service.
Apr 18 13:16:04 - systemd[1]: Started Corosync Cluster Engine.
Apr 18 14:58:42 - systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL
Apr 18 14:58:42 - systemd[1]: corosync.service: Failed with result 'signal'.
[~]$

 

 

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread Klaus Wenninger
On Thu, Apr 18, 2024 at 5:07 PM NOLIBOS Christophe via Users <
users@clusterlabs.org> wrote:

> Classified as: {OPEN}
>
> I'm using RedHat 8.8 (4.18.0-477.21.1.el8_8.x86_64).
> When I kill Corosync, no new corosync process is created and pacemaker is in failure.
> The only solution is to restart the pacemaker service.
>
> [~]$ pcs status
> Error: unable to get cib
> [~]$
>
> [~]$ systemctl status pacemaker
> ● pacemaker.service - Pacemaker High Availability Cluster Manager
>    Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
>    Active: active (running) since Thu 2024-04-18 13:16:04 UTC; 1h 43min ago
>      Docs: man:pacemakerd
>            https://clusterlabs.org/pacemaker/doc/
>  Main PID: 1324923 (pacemakerd)
>     Tasks: 91
>    Memory: 132.1M
>    CGroup: /system.slice/pacemaker.service
> ...
> Apr 18 14:59:02 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:03 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:04 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:05 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:06 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:07 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:08 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:09 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:10 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:11 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> [~]$
>
>
Well, if corosync isn't there, this is to be expected; pacemaker won't recover corosync.
Can you check what systemd thinks about corosync (status/journal)?

Klaus
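The status/journal check Klaus asks for can be scripted against systemd's machine-readable output. A sketch that classifies the unit's last exit; because no corosync unit exists outside a cluster node, a captured sample (mirroring the `Result: signal` seen in the status output quoted in this thread) stands in for the live `systemctl show` query:

```shell
# On a cluster node the live query would be:
#   systemctl show corosync -p Result -p ExecMainStatus
# A captured sample stands in for it here.
sample='Result=signal
ExecMainStatus=9'
result=$(printf '%s\n' "$sample" | awk -F= '$1 == "Result" {print $2}')
if [ "$result" = "signal" ]; then
    echo "corosync was killed by a signal; systemd restarts it only if Restart= allows"
fi
```

A `Result` of `signal` with no subsequent restart in the journal is exactly the symptom of a unit whose `Restart=` line is commented out.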

>
> {OPEN}
>
> -----Original Message-----
> From: Ken Gaillot 
> Sent: Thursday, April 18, 2024 16:40
> To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
> Cc: NOLIBOS Christophe 
> Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
>
> What OS are you using? Does it use systemd?
>
> What happens when you kill Corosync?
>
> On Thu, 2024-04-18 at 13:13 +, NOLIBOS Christophe via Users wrote:
> > Classified as: {OPEN}
> >
> > Dear All,
> >
> > I have a question about the "pacemakerd: recover properly from
> > Corosync crash" fix implemented in version 2.1.2.
> > I have observed the issue when testing pacemaker version 2.0.5, just
> > by killing the ‘corosync’ process: Corosync was not recovered.
> >
> > I am using now pacemaker version 2.1.5-8.
> > Doing the same test, I have the same result: Corosync is still not
> > recovered.
> >
> > Please confirm the "pacemakerd: recover properly from Corosync crash"
> > fix implemented in version 2.1.2 covers this scenario.
> > If it is, did I miss something in the configuration of my cluster?
> >
> > Best regards.
> >
> > Christophe.
> >
> >
> >
> > {OPEN}
> --
> Ken Gaillot 
>


Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread NOLIBOS Christophe via Users
Classified as: {OPEN}

I'm using RedHat 8.8 (4.18.0-477.21.1.el8_8.x86_64).
When I kill Corosync, no new corosync process is created and pacemaker ends up
in a failed state.
The only solution is to restart the pacemaker service.

[~]$ pcs status
Error: unable to get cib
[~]$

[~]$ systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2024-04-18 13:16:04 UTC; 1h 43min ago
 Docs: man:pacemakerd
   https://clusterlabs.org/pacemaker/doc/
 Main PID: 1324923 (pacemakerd)
Tasks: 91
   Memory: 132.1M
   CGroup: /system.slice/pacemaker.service
...
Apr 18 14:59:02 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:03 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:04 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:05 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:06 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:07 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:08 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:09 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:10 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:11 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
[~]$



{OPEN}

-Message d'origine-
De : Ken Gaillot  
Envoyé : jeudi 18 avril 2024 16:40
À : Cluster Labs - All topics related to open-source clustering welcomed 

Cc : NOLIBOS Christophe 
Objet : Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

What OS are you using? Does it use systemd?

What happens when you kill Corosync?

On Thu, 2024-04-18 at 13:13 +, NOLIBOS Christophe via Users wrote:
> Classified as: {OPEN}
> 
> Dear All,
>  
> I have a question about the "pacemakerd: recover properly from 
> Corosync crash" fix implemented in version 2.1.2.
> I have observed the issue when testing pacemaker version 2.0.5, just 
> by killing the ‘corosync’ process: Corosync was not recovered.
>  
> I am now using pacemaker version 2.1.5-8.
> Doing the same test, I have the same result: Corosync is still not 
> recovered.
>  
> Please confirm the "pacemakerd: recover properly from Corosync crash"
> fix implemented in version 2.1.2 covers this scenario.
> If it does, did I miss something in the configuration of my cluster?
>  
> Best Regards.
>  
> Christophe.
>   
>  
> 
> {OPEN}
--
Ken Gaillot 




Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread Ken Gaillot
What OS are you using? Does it use systemd?

What happens when you kill Corosync?

On Thu, 2024-04-18 at 13:13 +, NOLIBOS Christophe via Users wrote:
> Classified as: {OPEN}
> 
> Dear All,
>  
> I have a question about the "pacemakerd: recover properly from
> Corosync crash" fix implemented in version 2.1.2.
> I have observed the issue when testing pacemaker version 2.0.5, just
> by killing the ‘corosync’ process: Corosync was not recovered.
>  
> I am now using pacemaker version 2.1.5-8.
> Doing the same test, I have the same result: Corosync is still not
> recovered.
>  
> Please confirm the "pacemakerd: recover properly from Corosync crash"
> fix implemented in version 2.1.2 covers this scenario.
> If it does, did I miss something in the configuration of my cluster?
>  
> Best Regards.
>  
> Christophe.
>  
> 
> Christophe Nolibos
> DL-FEP Component Manager
> THALES Land & Air Systems
> 105, avenue du Général Eisenhower, 31100 Toulouse, FRANCE
> Tél. : +33 (0)5 61 19 79 09
> Mobile : +33 (0)6 31 22 20 58
> Email : christophe.noli...@thalesgroup.com
>  
>  
> 
> {OPEN}
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread NOLIBOS Christophe via Users
Classified as: {OPEN}


Dear All,

 

I have a question about the "pacemakerd: recover properly from Corosync
crash" fix implemented in version 2.1.2.

I have observed the issue when testing pacemaker version 2.0.5, just by
killing the ‘corosync’ process: Corosync was not recovered.

 

I am now using pacemaker version 2.1.5-8.

Doing the same test, I have the same result: Corosync is still not
recovered.

 

Please confirm the "pacemakerd: recover properly from Corosync crash" fix
implemented in version 2.1.2 covers this scenario.

If it does, did I miss something in the configuration of my cluster?

 

Best Regards.

 

Christophe.

 











Christophe Nolibos
DL-FEP Component Manager
THALES Land & Air Systems
105, avenue du Général Eisenhower, 31100 Toulouse, FRANCE
Tél. : +33 (0)5 61 19 79 09
Mobile : +33 (0)6 31 22 20 58
Email : christophe.noli...@thalesgroup.com

 

 


{OPEN}


