Hi Ken, Thanks. In this case, the transient_attributes for node02 in the CIB on node02 (which never lost quorum) appear to be deleted by a request from node01 when node01 rejoins the cluster, if I understand the pacemaker.log correctly. This causes node02 to stop resources, which are not restarted until we manually refresh them on node02.
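For context, the manual refresh we run is along these lines (a sketch only; "ms_drbd_ourApp" is our master/slave resource, the commands need a running cluster, and on older Pacemaker releases "crm_resource --cleanup" plays the role of "--refresh"):

```shell
# Re-probe the resource on node02 so Pacemaker rediscovers its state
# and the DRBD agent rewrites its transient promotion score.
# (ms_drbd_ourApp and the node name are from our cluster; adjust as needed.)
crm_resource --refresh --resource ms_drbd_ourApp --node node02.example.com
```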
On Mon, Feb 1, 2021 at 10:59 AM Ken Gaillot <kgail...@redhat.com> wrote:
> On Fri, 2021-01-29 at 12:37 -0500, Stuart Massey wrote:
> > Can someone help me with this?
> > Background:
> > > "node01" is failing, and has been placed in "maintenance" mode. It
> > > occasionally loses connectivity.
> > > "node02" is able to run our resources
> >
> > Consider the following messages from pacemaker.log on "node02", just
> > after "node01" has rejoined the cluster (per "node02"):
> > > Jan 28 14:48:03 [21933] node02.example.com cib: info:
> > > cib_perform_op: --
> > > /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > > Jan 28 14:48:03 [21933] node02.example.com cib: info:
> > > cib_perform_op: + /cib: @num_updates=309
> > > Jan 28 14:48:03 [21933] node02.example.com cib: info:
> > > cib_process_request: Completed cib_delete operation for section
> > > //node_state[@uname='node02.example.com']/transient_attributes: OK
> > > (rc=0, origin=node01.example.com/crmd/3784, version=0.94.309)
> > > Jan 28 14:48:04 [21938] node02.example.com crmd: info:
> > > abort_transition_graph: Transition aborted by deletion of
> > > transient_attributes[@id='2']: Transient attribute change |
> > > cib=0.94.309 source=abort_unless_down:357
> > > path=/cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > > complete=true
> > > Jan 28 14:48:05 [21937] node02.example.com pengine: info:
> > > master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1
> > > to master
> >
> > The implication, it seems to me, is that "node01" has asked "node02"
> > to delete the transient-attributes for "node02". The transient-
> > attributes should normally be:
> > <transient_attributes id="2">
> >   <instance_attributes id="status-2">
> >     <nvpair id="status-2-master-drbd_ourApp" name="master-drbd_ourApp" value="10000"/>
> >     <nvpair id="status-2-pingd" name="pingd" value="100"/>
> >   </instance_attributes>
> > </transient_attributes>
> >
> > These attributes are necessary for "node02" to be Master/Primary,
> > correct?
> >
> > Why might this be happening and how do we prevent it?
>
> Transient attributes are always cleared when a node leaves the cluster
> (that's what makes them transient ...). It's probably coincidence it
> went through as the node rejoined.
>
> When the node rejoins, it will trigger another run of the scheduler,
> which will schedule a probe of all resources on the node. Those probes
> should reset the promotion score.
> --
> Ken Gaillot <kgail...@redhat.com>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
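To check whether the attributes have come back after the rejoin probes, the live status section can be queried directly (a sketch assuming the standard Pacemaker CLI tools and the node/attribute names shown above; requires a running cluster):

```shell
# Dump node02's transient attributes from the running CIB;
# an empty result means they are still gone.
cibadmin --query --xpath \
  "//node_state[@uname='node02.example.com']/transient_attributes"

# Or ask attrd for a single attribute, e.g. the DRBD promotion score:
attrd_updater --query --name master-drbd_ourApp --node node02.example.com
```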