Re: [ClusterLabs] DC marks itself as OFFLINE, continues orchestrating the other nodes

2022-09-29 Thread Ken Gaillot
I suspect this is fixed in newer versions. It's not a join timing issue
but some sort of peer state bug, and there's been a good bit of change
in that area since this code.

A few comments inline ...

On Wed, 2022-09-14 at 12:40 +0200, Lars Ellenberg wrote:
> On Thu, Sep 08, 2022 at 10:11:46AM -0500, Ken Gaillot wrote:
> > On Thu, 2022-09-08 at 15:01 +0200, Lars Ellenberg wrote:
> > > Scenario:
> > > three nodes, no fencing (I know)
> > > break network, isolating nodes
> > > unbreak network, see how cluster partitions rejoin and resume
> > > service
> > 
> > I'm guessing the CIB changed during the break, with more changes in
> > one
> > of the other partitions than mqhavm24 ...
> 
> quite likely.
> 
> > Reconciling CIB differences in different partitions is inherently
> > lossy. Basically we gotta pick one side to win, and the current
> > algorithm just looks at the number of changes. (An "admin epoch"
> > can
> > also be bumped manually to override that.)
> 
> Yes.

That turned out to be unrelated; the CIBs re-synced after the rejoin
without a problem.
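
If anyone wants to sanity-check that kind of re-sync on their own cluster: the
version tuple on the CIB root element (admin_epoch, epoch, num_updates) should
end up identical on every node once they have rejoined. A minimal check,
assuming cibadmin's --query/--local options are available in this build:

  # run on each node; the three version attributes should match after re-sync
  cibadmin --query --local | grep -m1 '<cib '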

> 
> > > I have full crm_reports and some context knowledge about the
> > > setup.
> > > 
> > > For now I'd like to know: has anyone seen this before,
> > > is that a known bug in corner cases/races during re-join,
> > > has it even been fixed meanwhile?
> > 
> > No, yes, no

Probably no, no, yes :)

> 
> Thank you.
> That's what I thought :-|
> 
> > It does seem we could handle the specific case of the local node's
> > state being overwritten a little better. We can't just override the
> > join state if the other nodes think it is different, but we could
> > release DC and restart the join process. How did it handle the
> > situation in this case?
> 
> I think these are the most interesting lines:
> 
> -
> Aug 11 12:32:45 mqhavm24 corosync[13296]:  [QUORUM] Members[1]: 1
>    stopping stuff
> 
> Aug 11 12:33:36 mqhavm24 corosync[13296]:  [QUORUM] Members[3]: 1 3 2
> 
> Aug 11 12:33:36 [13310] mqhavm24   crmd:  warning: crmd_ha_msg_filter:   Another DC detected: mqhavm37 (op=noop)
> Aug 11 12:33:36 [13310] mqhavm24   crmd: info: update_dc:  Set DC to mqhavm24 (3.0.14)
> 
> Aug 11 12:33:36 [13308] mqhavm24  attrd:   notice: attrd_check_for_new_writer:   Detected another attribute writer (mqhavm37), starting new election
> Aug 11 12:33:36 [13308] mqhavm24  attrd:   notice: attrd_declare_winner: Recorded local node as attribute writer (was unset)
> 
> plan to start stuff on all three nodes
> Aug 11 12:33:36 [13309] mqhavm24    pengine:   notice: process_pe_message:   Calculated transition 161, saving inputs in /var/lib/pacemaker/pengine/pe-input-688.bz2
> 
> but then
> Aug 11 12:33:36 [13305] mqhavm24        cib: info: cib_perform_op:   +  /cib/status/node_state[@id='1']:  @crm-debug-origin=do_cib_replaced, @join=down
> 
> and we now keep stuff stopped locally, but continue to manage the other two nodes.
> -
> 
> 
> commented log of the most interesting node below,
> starting at the point when communication goes down.
> maybe you see something that gives you an idea how to handle this
> better.
> 
> If it helps, I have the full crm_report of all nodes,
> should you feel the urge to have a look.
> 
> Aug 11 12:32:45 mqhavm24 corosync[13296]:  [TOTEM ] Failed to receive the leave message. failed: 3 2
> Aug 11 12:32:45 mqhavm24 corosync[13296]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Aug 11 12:32:45 mqhavm24 corosync[13296]:  [QUORUM] Members[1]: 1
> Aug 11 12:32:45 mqhavm24 corosync[13296]:  [MAIN  ] Completed service synchronization, ready to provide service.
> [stripping most info level for now]
> Aug 11 12:32:45 [13306] mqhavm24 stonith-ng:   notice: crm_update_peer_state_iter:   Node mqhavm37 state is now lost | nodeid=2 previous=member source=crm_update_peer_proc
> Aug 11 12:32:45 [13306] mqhavm24 stonith-ng:   notice: reap_crm_member:  Purged 1 peer with id=2 and/or uname=mqhavm37 from the membership cache
> Aug 11 12:32:45 [13306] mqhavm24 stonith-ng:   notice: crm_update_peer_state_iter:   Node mqhavm34 state is now lost | nodeid=3 previous=member source=crm_update_peer_proc
> Aug 11 12:32:45 [13306] mqhavm24 stonith-ng:   notice: reap_crm_member:  Purged 1 peer with id=3 and/or uname=mqhavm34 from the membership cache
> Aug 11 12:32:45 [13303] mqhavm24 pacemakerd:  warning: pcmk_quorum_notification: Quorum lost | membership=3112546 members=1
> Aug 11 12:32:45 [13303] mqhavm24 pacemakerd:   notice: crm_update_peer_state_iter:   Node mqhavm34 state is now lost | nodeid=3 previous=member source=crm_reap_unseen_nodes
> Aug 11 12:32:45 [13303] mqhavm24 pacemakerd:   notice: crm_update_peer_state_iter:   Node mqhavm37 state is now lost | nodeid=2 previous=member source=crm_reap_unseen_nodes
> Aug 11 12:32:45 [13310] mqhavm24   crmd:  warning: pcmk_quorum_notification

Re: [ClusterLabs] DC marks itself as OFFLINE, continues orchestrating the other nodes

2022-09-14 Thread Lars Ellenberg
On Thu, Sep 08, 2022 at 10:11:46AM -0500, Ken Gaillot wrote:
> On Thu, 2022-09-08 at 15:01 +0200, Lars Ellenberg wrote:
> > Scenario:
> > three nodes, no fencing (I know)
> > break network, isolating nodes
> > unbreak network, see how cluster partitions rejoin and resume service
> 
> I'm guessing the CIB changed during the break, with more changes in one
> of the other partitions than mqhavm24 ...

quite likely.

> Reconciling CIB differences in different partitions is inherently
> lossy. Basically we gotta pick one side to win, and the current
> algorithm just looks at the number of changes. (An "admin epoch" can
> also be bumped manually to override that.)

Yes.

> > I have full crm_reports and some context knowledge about the setup.
> > 
> > For now I'd like to know: has anyone seen this before,
> > is that a known bug in corner cases/races during re-join,
> > has it even been fixed meanwhile?
> 
> No, yes, no

Thank you.
That's what I thought :-|

> It does seem we could handle the specific case of the local node's
> state being overwritten a little better. We can't just override the
> join state if the other nodes think it is different, but we could
> release DC and restart the join process. How did it handle the
> situation in this case?

I think these are the most interesting lines:

-
Aug 11 12:32:45 mqhavm24 corosync[13296]:  [QUORUM] Members[1]: 1
   stopping stuff

Aug 11 12:33:36 mqhavm24 corosync[13296]:  [QUORUM] Members[3]: 1 3 2

Aug 11 12:33:36 [13310] mqhavm24   crmd:  warning: crmd_ha_msg_filter:   Another DC detected: mqhavm37 (op=noop)
Aug 11 12:33:36 [13310] mqhavm24   crmd: info: update_dc:   Set DC to mqhavm24 (3.0.14)

Aug 11 12:33:36 [13308] mqhavm24  attrd:   notice: attrd_check_for_new_writer:   Detected another attribute writer (mqhavm37), starting new election
Aug 11 12:33:36 [13308] mqhavm24  attrd:   notice: attrd_declare_winner: Recorded local node as attribute writer (was unset)

plan to start stuff on all three nodes
Aug 11 12:33:36 [13309] mqhavm24    pengine:   notice: process_pe_message:   Calculated transition 161, saving inputs in /var/lib/pacemaker/pengine/pe-input-688.bz2

but then
Aug 11 12:33:36 [13305] mqhavm24        cib: info: cib_perform_op:   +  /cib/status/node_state[@id='1']:  @crm-debug-origin=do_cib_replaced, @join=down

and we now keep stuff stopped locally, but continue to manage the other two nodes.
-
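
A quick cross-check that may help here (a sketch, assuming cibadmin's --xpath
option is available in this version): the join state the cluster currently
records for node 1 can be read straight out of the live CIB, so you can see
whether it still says join="down" while the node is acting as DC.

  # what does the CIB status section currently record for node 1?
  cibadmin --query --xpath "//node_state[@id='1']" | grep -o 'join="[^"]*"'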


commented log of the most interesting node below,
starting at the point when communication goes down.
maybe you see something that gives you an idea how to handle this better.

If it helps, I have the full crm_report of all nodes,
should you feel the urge to have a look.

Aug 11 12:32:45 mqhavm24 corosync[13296]:  [TOTEM ] Failed to receive the leave message. failed: 3 2
Aug 11 12:32:45 mqhavm24 corosync[13296]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug 11 12:32:45 mqhavm24 corosync[13296]:  [QUORUM] Members[1]: 1
Aug 11 12:32:45 mqhavm24 corosync[13296]:  [MAIN  ] Completed service synchronization, ready to provide service.
[stripping most info level for now]
Aug 11 12:32:45 [13306] mqhavm24 stonith-ng:   notice: crm_update_peer_state_iter:   Node mqhavm37 state is now lost | nodeid=2 previous=member source=crm_update_peer_proc
Aug 11 12:32:45 [13306] mqhavm24 stonith-ng:   notice: reap_crm_member:  Purged 1 peer with id=2 and/or uname=mqhavm37 from the membership cache
Aug 11 12:32:45 [13306] mqhavm24 stonith-ng:   notice: crm_update_peer_state_iter:   Node mqhavm34 state is now lost | nodeid=3 previous=member source=crm_update_peer_proc
Aug 11 12:32:45 [13306] mqhavm24 stonith-ng:   notice: reap_crm_member:  Purged 1 peer with id=3 and/or uname=mqhavm34 from the membership cache
Aug 11 12:32:45 [13303] mqhavm24 pacemakerd:  warning: pcmk_quorum_notification: Quorum lost | membership=3112546 members=1
Aug 11 12:32:45 [13303] mqhavm24 pacemakerd:   notice: crm_update_peer_state_iter:   Node mqhavm34 state is now lost | nodeid=3 previous=member source=crm_reap_unseen_nodes
Aug 11 12:32:45 [13303] mqhavm24 pacemakerd:   notice: crm_update_peer_state_iter:   Node mqhavm37 state is now lost | nodeid=2 previous=member source=crm_reap_unseen_nodes
Aug 11 12:32:45 [13310] mqhavm24   crmd:  warning: pcmk_quorum_notification: Quorum lost | membership=3112546 members=1
Aug 11 12:32:45 [13310] mqhavm24   crmd:   notice: crm_update_peer_state_iter:   Node mqhavm34 state is now lost | nodeid=3 previous=member source=crm_reap_unseen_nodes
Aug 11 12:32:45 [13308] mqhavm24  attrd:   notice: crm_update_peer_state_iter:   Node mqhavm37 state is now lost | nodeid=2 previous=member source=crm_update_peer_proc
Aug 11 12:32:45 [13305] mqhavm24        cib:   notice: crm_update_peer_state_iter:   Node mqhavm37 state is now lost | nodeid=2 previous=member source=crm_update_peer

Re: [ClusterLabs] DC marks itself as OFFLINE, continues orchestrating the other nodes

2022-09-08 Thread Ken Gaillot
On Thu, 2022-09-08 at 15:01 +0200, Lars Ellenberg wrote:
> Scenario:
> three nodes, no fencing (I know)
> break network, isolating nodes
> unbreak network, see how cluster partitions rejoin and resume service

I'm guessing the CIB changed during the break, with more changes in one
of the other partitions than mqhavm24 ...

> 
> 
> Funny outcome:
> /usr/sbin/crm_mon  -x pe-input-689.bz2
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: mqhavm24 (version 1.1.24.linbit-2.0.el7-8f22be2ae) -
> partition with quorum
>   * Last updated: Thu Sep  8 14:39:54 2022
>   * Last change:  Thu Aug 11 12:33:02 2022 by root via crm_resource
> on mqhavm24
>   * 3 nodes configured
>   * 16 resource instances configured (2 DISABLED)
> 
> Node List:
>   * Online: [ mqhavm34 mqhavm37 ]
>   * OFFLINE: [ mqhavm24 ]
> 
> 
> Note how the current DC considers itself as OFFLINE!
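
(Side note, in case it is useful while digging: the same saved input can be
replayed offline to see exactly which transition the DC computed from it.
A sketch, assuming crm_simulate's usual --xml-file/--simulate options:

  crm_simulate --simulate --xml-file pe-input-689.bz2
)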
> 
> It accepted an apparently outdated cib replacement from one of the
> non-DCs
> from a previous membership while already authoritative itself,
> overwriting its own "join" status in the cib.

Reconciling CIB differences in different partitions is inherently
lossy. Basically we gotta pick one side to win, and the current
algorithm just looks at the number of changes. (An "admin epoch" can
also be bumped manually to override that.)
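
For the record, a rough sketch of that manual override, assuming cibadmin's
--modify/--xml-text options work the way I remember: on a node in the
partition whose configuration should win, set admin_epoch higher than the
other side's value, and that copy then takes precedence regardless of the
change count.

  # check the current value (admin_epoch attribute on the <cib> element)
  cibadmin --query | grep -m1 '<cib '
  # then set it past the other partition's value, e.g.:
  cibadmin --modify --xml-text '<cib admin_epoch="10"/>'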

> 
> I have full crm_reports and some context knowledge about the setup.
> 
> For now I'd like to know: has anyone seen this before,
> is that a known bug in corner cases/races during re-join,
> has it even been fixed meanwhile?

No, yes, no

It does seem we could handle the specific case of the local node's
state being overwritten a little better. We can't just override the
join state if the other nodes think it is different, but we could
release DC and restart the join process. How did it handle the
situation in this case?

> 
> Thanks,
> Lars
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/