On 2021-07-26 12:50 p.m.,
kgail...@redhat.com wrote:
On Mon, 2021-07-26 at 12:25 -0400, Digimer wrote:

On 2021-07-26 9:54 a.m., kgail...@redhat.com wrote:

On Fri, 2021-07-23 at 21:46 -0400, Digimer wrote:

After a LOT of hassle, I finally got it updated, but OMG it was painful.

I degraded the cluster (unsure if needed), set maintenance mode, deleted the stonith levels, deleted the stonith devices, recreated them with the updated values, recreated the stonith levels, and finally disabled maintenance mode.

It should not have been this hard, right? Why the heck would it be that pacemaker kept "rolling back" to old configs? I'd delete the stonith

That is bizarre. It sounds like the CIB changes were taking effect locally, then being rejected by the rest of the cluster, which would send the "correct" CIB back to the originator.

The logs of interest would be pacemaker.log from both nodes at the time you made the first configuration change that failed. I'm guessing the logs you posted were from after that point?

Below are the logs. The first attempt at the change appears to be at 'Jul 23 16:22:27', made on an-a02n01; I included logs from a few minutes before in case they are relevant.

* an-a02n01: https://www.alteeve.com/an-repo/files/an-a02n01.pacemaker.log
* an-a02n02: https://www.alteeve.com/an-repo/files/an-a02n02.pacemaker.log

Note that the PDUs as originally configured (10.201.2.1/2) were not available, so I had to disable and clean up the stonith resources. They seemed to keep getting re-enabled, so I got into the habit of doing this cycle of disable -> cleanup -> disable -> cleanup before I could reliably get the resources to be 'stopped (disabled)' in 'pcs stonith status'.
digimer

The initial change happened here:

Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: Diff: --- 0.337.112 2
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: Diff: +++ 0.338.0 6a24af66df3d9f825cc2681222f8f5d6
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: + /cib: @epoch=338, @num_updates=0
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: + /cib/configuration/resources/primitive[@id='apc_snmp_node1_an-pdu03']/instance_attributes[@id='apc_snmp_node1_an-pdu03-instance_attributes']/nvpair[@id='apc_snmp_node1_an-pdu03-instance_attributes-ip']: @value=10.201.2.3
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_replace_notify) info: Replaced: 0.337.112 -> 0.338.0 from an-a02n02
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_process_request) info: Completed cib_replace operation for section configuration: OK (rc=0, origin=an-a02n02/cibadmin/2, version=0.338.0)

origin=an-a02n02/cibadmin/2 means that someone or something ran the cibadmin tool on an-a02n02. Presumably this was your interactive pcs command.

It was then reverted by:

Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: Diff: --- 0.343.3 2
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: Diff: +++ 0.344.0 (null)
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: + /cib: @epoch=344, @num_updates=0
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ /cib/configuration/resources: <primitive class="stonith" id="apc_snmp_node1_an-pdu03" type="fence_apc_snmp"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <instance_attributes id="apc_snmp_node1_an-pdu03-instance_attributes">
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-ip" name="ip" value="10.201.2.1"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="an-a02n01"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-pcmk_off_action" name="pcmk_off_action" value="reboot"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-port" name="port" value="5"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ </instance_attributes>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <operations>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <op id="apc_snmp_node1_an-pdu03-monitor-interval-60" interval="60" name="monitor"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ </operations>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ </primitive>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_process_request) info: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=an-a02n02/cibadmin/2, version=0.344.0)

Notice the origin is still cibadmin on an-a02n02. So this was either you, or a script or cron on that node. I don't see any additional details on that node.

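To make the "who changed the CIB, and when" question easier to answer across a long pacemaker.log, the "Completed ... operation" lines quoted above can be filtered for their origin and version. The following is only a minimal sketch in Python: the regular expression is an assumption based solely on the two log lines quoted in this thread (other Pacemaker versions may format these lines differently), and the names CibOp, COMPLETED and completed_ops are hypothetical helpers, not part of Pacemaker or pcs. The version is split into (admin_epoch, epoch, num_updates), which matches the "@epoch=338, @num_updates=0" entries in the diffs.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Tuple


class CibOp(NamedTuple):
    """One completed CIB operation, as reported by pacemaker-based."""
    timestamp: str
    operation: str
    origin_node: str
    origin_client: str
    version: Tuple[int, int, int]  # (admin_epoch, epoch, num_updates)


# Assumption: this pattern is modelled only on the log lines quoted in this
# thread; it is not an official description of the log format.
COMPLETED = re.compile(
    r"^(?P<ts>\w{3} +\d+ [\d:]+) .*\(cib_process_request\).*"
    r"Completed (?P<op>\w+) operation.*"
    r"origin=(?P<node>[^/,]+)/(?P<client>[^/,]+)/[^,]*, "
    r"version=(?P<ver>\d+\.\d+\.\d+)\)"
)


def completed_ops(lines: Iterable[str]) -> Iterator[CibOp]:
    """Yield a CibOp for every line that matches COMPLETED; skip the rest."""
    for line in lines:
        m = COMPLETED.match(line)
        if m is None:
            continue
        admin_epoch, epoch, num_updates = (int(p) for p in m["ver"].split("."))
        yield CibOp(m["ts"], m["op"], m["node"], m["client"],
                    (admin_epoch, epoch, num_updates))


# The two "Completed" lines quoted in this thread, used as sample input.
SAMPLE = [
    "Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] "
    "(cib_process_request) info: Completed cib_replace operation for section "
    "configuration: OK (rc=0, origin=an-a02n02/cibadmin/2, version=0.338.0)",
    "Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] "
    "(cib_process_request) info: Completed cib_apply_diff operation for section "
    "'all': OK (rc=0, origin=an-a02n02/cibadmin/2, version=0.344.0)",
]

ops = list(completed_ops(SAMPLE))
for op in ops:
    print(op.timestamp, op.operation,
          f"{op.origin_node}/{op.origin_client}", op.version)
```

To scan a real log, the same function could be fed an open file object instead of SAMPLE. Keeping only entries whose origin_client is cibadmin should narrow the list to changes made through that tool, which is the kind of origin seen in both entries above.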
I have no idea what would have / could have done that. I had
ScanCore disabled, so my software wasn't doing anything. These are
stock CentOS Stream 8 installs, so there wouldn't be anything in
cron that should do this.
I am very confused... =/
-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould