Re: [ClusterLabs] Peer (slave) node deleting master's transient_attributes

2021-02-08 Thread Stuart Massey
Wonderful, thank you for looking at this! I have posted uncompressed "saving inputs" files at the links below - 3241 is the immediately preceding one that exists, and 3242 is the one created upon encountering the problem state. In both cases, it looks to me like node02 is DC. There are none of

Re: [ClusterLabs] Peer (slave) node deleting master's transient_attributes

2021-02-08 Thread Ken Gaillot
On Mon, 2021-02-08 at 12:01 -0500, Stuart Massey wrote: > I'm wondering if anyone can advise us on next steps here and/or > correct our understanding. This seems like a race condition that > causes resources to be stopped unnecessarily. Is there a way to > prevent a node from processing cib

Re: [ClusterLabs] Peer (slave) node deleting master's transient_attributes

2021-02-08 Thread Stuart Massey
I'm wondering if anyone can advise us on next steps here and/or correct our understanding. This seems like a race condition that causes resources to be stopped unnecessarily. Is there a way to prevent a node from processing cib updates from a peer while DC negotiations are underway? Our "node02"

Re: [ClusterLabs] Peer (slave) node deleting master's transient_attributes

2021-02-01 Thread Stuart Massey
Sequence seems to be: - node02 is DC and master/primary, node01 is maintenance mode and slave/secondary - comms go down - node01 elects itself master, and deletes node01 status from its cib - comms come up - cluster starts reforming - node01 sends cib updates to node02 -

Re: [ClusterLabs] Peer (slave) node deleting master's transient_attributes

2021-02-01 Thread Ken Gaillot
On Mon, 2021-02-01 at 11:09 -0500, Stuart Massey wrote: > Hi Ken, > Thanks. In this case, transient_attributes for node02 in the cib on > node02 which never lost quorum seem to be deleted by a request from > node01 when node01 rejoins the cluster - IF I understand the > pacemaker.log correctly.

Re: [ClusterLabs] Peer (slave) node deleting master's transient_attributes

2021-02-01 Thread Ken Gaillot
On Mon, 2021-02-01 at 11:16 -0500, Stuart Massey wrote: > Andrei, > You are right, thank you. I have an earlier thread on which I posted > a pacemaker.log for this issue, and didn't think to point to it here. > The link is > http://project.ibss.net/samples/deidPacemakerLog.2021-01-25.txt . >

Re: [ClusterLabs] Peer (slave) node deleting master's transient_attributes

2021-02-01 Thread Stuart Massey
Andrei, You are right, thank you. I have an earlier thread on which I posted a pacemaker.log for this issue, and didn't think to point to it here. The link is http://project.ibss.net/samples/deidPacemakerLog.2021-01-25.txt . So, node01 is in maintenance mode, and constraints prevent any resources

Re: [ClusterLabs] Peer (slave) node deleting master's transient_attributes

2021-02-01 Thread Stuart Massey
Hi Ken, Thanks. In this case, transient_attributes for node02 in the cib on node02 which never lost quorum seem to be deleted by a request from node01 when node01 rejoins the cluster - IF I understand the pacemaker.log correctly. This causes node02 to stop resources, which will not be restarted

Re: [ClusterLabs] Peer (slave) node deleting master's transient_attributes

2021-02-01 Thread Ken Gaillot
On Mon, 2021-02-01 at 09:58 -0600, Ken Gaillot wrote: > On Fri, 2021-01-29 at 12:37 -0500, Stuart Massey wrote: > > Can someone help me with this? > > Background: > > > "node01" is failing, and has been placed in "maintenance" mode. > > > It > > > occasionally loses connectivity. > > > "node02" is

Re: [ClusterLabs] Peer (slave) node deleting master's transient_attributes

2021-02-01 Thread Ken Gaillot
On Fri, 2021-01-29 at 12:37 -0500, Stuart Massey wrote: > Can someone help me with this? > Background: > > "node01" is failing, and has been placed in "maintenance" mode. It > > occasionally loses connectivity. > > "node02" is able to run our resources > > Consider the following messages from

Re: [ClusterLabs] Peer (slave) node deleting master's transient_attributes

2021-01-30 Thread Andrei Borzenkov
29.01.2021 20:37, Stuart Massey wrote: > Can someone help me with this? > Background: > > "node01" is failing, and has been placed in "maintenance" mode. It > occasionally loses connectivity. > > "node02" is able to run our resources > > Consider the following messages from pacemaker.log on

[ClusterLabs] Peer (slave) node deleting master's transient_attributes

2021-01-29 Thread Stuart Massey
Can someone help me with this? Background: "node01" is failing, and has been placed in "maintenance" mode. It occasionally loses connectivity. "node02" is able to run our resources Consider the following messages from pacemaker.log on "node02", just after "node01" has rejoined the cluster (per