Andrei,

You are right, thank you. I posted a pacemaker.log for this issue in an earlier thread and didn't think to point to it here. The link is http://project.ibss.net/samples/deidPacemakerLog.2021-01-25.txt .

So, node01 is in maintenance mode, and constraints prevent any resources from running on it (other than drbd as Secondary). I would not want node01 to stonith node02 after a communications failure, especially not if all resources are running fine on node02. It also had not occurred to me that node01 could become DC even though it is in maintenance mode. The logs seem to me to match your explanation: the cib ops happen right in the middle of the DC negotiations.

Is there a way to tell node01 that it cannot be DC? Like a constraint?

Thanks again.
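In case it helps anyone check their own logs for the same pattern, here is a minimal Python sketch that scans pacemaker.log lines for completed cib_delete operations on a node's transient_attributes where the request originated from a different node. The function name, the regular expression, and the assumption that each log entry sits on a single line with a 15-character syslog-style timestamp are mine, based only on the messages quoted in this thread; other Pacemaker versions may format these lines differently.

```python
import re

# Matches the "Completed cib_delete operation" message as it appears in the
# log excerpt from this thread. The exact wording is an assumption taken from
# that excerpt, not from any Pacemaker specification.
DELETE_RE = re.compile(
    r"Completed cib_delete operation for section "
    r"//node_state\[@uname='(?P<target>[^']+)'\]/transient_attributes: "
    r".*?origin=(?P<origin>[^/]+)/"
)


def remote_attribute_deletions(lines):
    """Return (timestamp, origin_node, target_node) for every deletion of a
    node's transient attributes that was requested by a different node."""
    hits = []
    for line in lines:
        match = DELETE_RE.search(line)
        if match and match.group("origin") != match.group("target"):
            # Assumes a syslog-style prefix such as "Jan 28 14:48:03".
            hits.append((line[:15], match.group("origin"), match.group("target")))
    return hits
```

Passing an open log file to it, for example `remote_attribute_deletions(open("pacemaker.log"))`, lists each time one node cleared another node's transient attributes, which can then be compared against the DC election messages around the same timestamps.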
On Sun, Jan 31, 2021 at 1:55 AM Andrei Borzenkov <arvidj...@gmail.com> wrote:

> 29.01.2021 20:37, Stuart Massey wrote:
> > Can someone help me with this?
> > Background:
> >
> > "node01" is failing, and has been placed in "maintenance" mode. It
> > occasionally loses connectivity.
> >
> > "node02" is able to run our resources
> >
> > Consider the following messages from pacemaker.log on "node02", just after
> > "node01" has rejoined the cluster (per "node02"):
> >
> > Jan 28 14:48:03 [21933] node02.example.com cib: info: cib_perform_op: -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > Jan 28 14:48:03 [21933] node02.example.com cib: info: cib_perform_op: + /cib: @num_updates=309
> > Jan 28 14:48:03 [21933] node02.example.com cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='node02.example.com']/transient_attributes: OK (rc=0, origin=node01.example.com/crmd/3784, version=0.94.309)
> > Jan 28 14:48:04 [21938] node02.example.com crmd: info: abort_transition_graph: Transition aborted by deletion of transient_attributes[@id='2']: Transient attribute change | cib=0.94.309 source=abort_unless_down:357 path=/cib/status/node_state[@id='2']/transient_attributes[@id='2'] complete=true
> > Jan 28 14:48:05 [21937] node02.example.com pengine: info: master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master
> >
> > The implication, it seems to me, is that "node01" has asked "node02" to
> > delete the transient-attributes for "node02". The transient-attributes
> > should normally be:
> > <transient_attributes id="2">
> >   <instance_attributes id="status-2">
> >     <nvpair id="status-2-master-drbd_ourApp" name="master-drbd_ourApp" value="10000"/>
> >     <nvpair id="status-2-pingd" name="pingd" value="100"/>
> >   </instance_attributes>
> > </transient_attributes>
> >
> > These attributes are necessary for "node02" to be Master/Primary, correct?
> >
> > Why might this be happening and how do we prevent it?
> >
>
> You do not provide enough information to answer. At the very least you
> need to show full logs from both nodes around time it happens (starting
> with both nodes losing connectivity).
>
> But as a wild guess - you do not use stonith, node01 becomes DC and
> clears other node state.
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/