On Fri, 2021-01-29 at 12:37 -0500, Stuart Massey wrote:
> Can someone help me with this?
>
> Background:
> > "node01" is failing, and has been placed in "maintenance" mode. It
> > occasionally loses connectivity.
> > "node02" is able to run our resources.
>
> Consider the following messages from pacemaker.log on "node02", just
> after "node01" has rejoined the cluster (per "node02"):
>
> > Jan 28 14:48:03 [21933] node02.example.com cib: info:
> > cib_perform_op: --
> > /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > Jan 28 14:48:03 [21933] node02.example.com cib: info:
> > cib_perform_op: + /cib: @num_updates=309
> > Jan 28 14:48:03 [21933] node02.example.com cib: info:
> > cib_process_request: Completed cib_delete operation for section
> > //node_state[@uname='node02.example.com']/transient_attributes: OK
> > (rc=0, origin=node01.example.com/crmd/3784, version=0.94.309)
> > Jan 28 14:48:04 [21938] node02.example.com crmd: info:
> > abort_transition_graph: Transition aborted by deletion of
> > transient_attributes[@id='2']: Transient attribute change |
> > cib=0.94.309 source=abort_unless_down:357
> > path=/cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > complete=true
> > Jan 28 14:48:05 [21937] node02.example.com pengine: info:
> > master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1
> > to master
>
> The implication, it seems to me, is that "node01" has asked "node02"
> to delete the transient attributes for "node02". The transient
> attributes should normally be:
>
>   <transient_attributes id="2">
>     <instance_attributes id="status-2">
>       <nvpair id="status-2-master-drbd_ourApp" name="master-drbd_ourApp" value="10000"/>
>       <nvpair id="status-2-pingd" name="pingd" value="100"/>
>     </instance_attributes>
>   </transient_attributes>
>
> These attributes are necessary for "node02" to be Master/Primary,
> correct?
>
> Why might this be happening, and how do we prevent it?
Transient attributes are always cleared when a node leaves the cluster (that's what makes them transient ...). It's probably coincidence that the deletion went through just as the node rejoined. When the node rejoins, it will trigger another run of the scheduler, which will schedule a probe of all resources on the node. Those probes should reset the promotion score.

-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
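For anyone following along: the transient attributes discussed above (including the promotion score) can be inspected with the standard Pacemaker tools. A sketch, to be run on a live cluster node, using the attribute and node names from the CIB snippet in the original question:

```shell
# Query the DRBD promotion score for node02 as a transient attribute.
# attrd_updater talks to the attribute manager directly.
attrd_updater --query --name master-drbd_ourApp --node node02.example.com

# Equivalent query via crm_attribute; lifetime "reboot" selects the
# transient (status-section) attributes rather than permanent ones.
crm_attribute --query --name master-drbd_ourApp \
    --node node02.example.com --lifetime reboot

# Dump the whole status section of the CIB to see every node's
# transient_attributes block (the XML quoted in the question).
cibadmin --query --scope status
```

After the rejoining node's probes complete, the promotion score should reappear in these queries.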