On Thu, Jul 11, 2024 at 3:27 AM Reid Wahl <nw...@redhat.com> wrote:
>
> On Thu, Jul 11, 2024 at 2:56 AM Reid Wahl <nw...@redhat.com> wrote:
> >
> > On Fri, Jul 5, 2024 at 3:48 AM Artur Novik <freish...@gmail.com> wrote:
> > >
> > > > On Thu, Jul 4, 2024 at 5:03 AM Artur Novik <freishutz at gmail.com> wrote:
> > > >
> > > >> Hi everybody,
> > > >> I faced a strange behavior, and since there was a lot of activity
> > > >> around crm_node structs in 2.1.7, I want to believe that it's a
> > > >> regression rather than a new default behavior.
> > > >>
> > > >> "crm_node -i" occasionally, but very often, returns "*exit code 68*:
> > > >> Node is not known to cluster".
> > > >>
> > > >> The quick test below (taken from two different clusters with pacemaker
> > > >> 2.1.7 and 2.1.8):
> > > >>
> > > >> ```
> > > >> [root@node1 ~]# crm_node -i
> > > >> Node is not known to cluster
> > > >> [root@node1 ~]# crm_node -i
> > > >> 1
> > > >> [root@node1 ~]# crm_node -i
> > > >> 1
> > > >> [root@node1 ~]# crm_node -i
> > > >> Node is not known to cluster
> > > >> [root@node1 ~]# for i in 1 2 3 4 5 6 7; do ssh node$i crm_node -i; done
> > > >> 1
> > > >> 2
> > > >> Node is not known to cluster
> > > >> Node is not known to cluster
> > > >> 5
> > > >> Node is not known to cluster
> > > >> 7
> > > >> [root@node1 ~]# for i in 1 2 3 4 5 6 7; do sleep 1; ssh node$i crm_node -i; done
> > > >> Node is not known to cluster
> > > >> Node is not known to cluster
> > > >> Node is not known to cluster
> > > >> Node is not known to cluster
> > > >> Node is not known to cluster
> > > >> 6
> > > >> 7
> > > >>
> > > >> [root@es-brick2 ~]# crm_node -i
> > > >> 2
> > > >> [root@es-brick2 ~]# crm_node -i
> > > >> 2
> > > >> [root@es-brick2 ~]# crm_node -i
> > > >> Node is not known to cluster
> > > >> [root@es-brick2 ~]# crm_node -i
> > > >> 2
> > > >> [root@es-brick2 ~]# rpm -qa | grep pacemaker | sort
> > > >> pacemaker-2.1.8.rc2-1.el8_10.x86_64
> > > >> pacemaker-cli-2.1.8.rc2-1.el8_10.x86_64
> > > >> pacemaker-cluster-libs-2.1.8.rc2-1.el8_10.x86_64
> > > >> pacemaker-libs-2.1.8.rc2-1.el8_10.x86_64
> > > >> pacemaker-remote-2.1.8.rc2-1.el8_10.x86_64
> > > >> pacemaker-schemas-2.1.8.rc2-1.el8_10.noarch
> > > >> ```
> > > >>
> > > >> I checked the following versions (all packages except the last one were
> > > >> taken from Rocky Linux and rebuilt against corosync 3.1.8 from Rocky
> > > >> 8.10; the distro itself is Rocky Linux 8.10 too):
> > > >>
> > > >> Pacemaker version     Status
> > > >> 2.1.5 (8.8)           OK
> > > >> 2.1.6 (8.9)           OK
> > > >> 2.1.7 (8.10)          Broken
> > > >> 2.1.8-RC2 (upstream)  Broken
> > > >>
> > > >> I'm not attaching logs for now since I believe this can be reproduced
> > > >> on absolutely any installation.
> > > >
> > > > Hi, thanks for the report. I can try to reproduce on 2.1.8 later, but so
> > > > far I'm unable to reproduce on the current upstream main branch. I don't
> > > > believe there are any major differences in the relevant code between
> > > > main and 2.1.8-rc2.
> > > >
> > > > I wonder if it's an issue where the controller is busy with a
> > > > synchronous request when you run `crm_node -i` (which would be a bug).
> > > > Can you share logs and your config?
> > >
> > > The logs can be taken from Google Drive since they are too large to
> > > attach:
> > > https://drive.google.com/file/d/1MLgjYncHXrQlZQ2FAmoGp9blvDtS-8RG/view?usp=drive_link
> > > (~65MB with all nodes)
> > > https://drive.google.com/drive/folders/13YYhAtS6zlDjoOOf8ZZQSyfTP_wzLbG_?usp=drive_link
> > > (the directory with logs)
> > >
> > > The timestamp and node:
> > > [root@es-brick1 ~]# date
> > > Fri Jul 5 10:02:35 UTC 2024
> > >
> > > Since this reproduces on multiple KVMs (RHEL 8, 9 and Fedora 40), I
> > > attached some info from the hypervisor side too.
> >
> > Thank you for the additional info. We've been looking into this, and so
> > far I'm still unable to reproduce it on my machine. However, I have an
> > idea that it's related to passing a pointer to an uninitialized `nodeid`
> > variable in `print_node_id()` within crm_node.c.
> >
> > Can you run `crm_node -i -VVVVVV` and share the output from a successful
> > run and from a failed run?
>
> Disregard. I can't reproduce it when I build from source, but I can
> reproduce it after I install the pacemaker package from the fedora repo
> via dnf.

The problem is indeed the uninitialized `uint32_t nodeid`. When the garbage
value is less than INT_MAX, we get the "not known to cluster" error. When the
garbage value is greater than INT_MAX, the controller finds a negative int,
which is invalid as an ID, so it searches for the local node name instead and
finds the correct result. We're working on a fix.
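For illustration, here is a minimal, self-contained sketch of how that pattern
produces exactly this kind of intermittent result. This is not the actual
crm_node.c or controller code: `lookup_node()`, `LOCAL_ID`, `LOCAL_NAME`, and
the two sample "garbage" values are hypothetical stand-ins.

```
/*
 * Sketch only (NOT the real crm_node.c / controller code) of how handing a
 * lookup function a pointer to a node-ID variable that still holds a stray
 * stack value leads to intermittent "Node is not known to cluster" errors.
 */
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Pretend the local node has id 1 and name "node1". */
#define LOCAL_ID   1U
#define LOCAL_NAME "node1"

/* Hypothetical stand-in for the controller-side lookup: a usable positive id
 * is answered strictly by id; anything that would be <= 0 as an int is
 * treated as "no id given" and falls back to a lookup by node name. */
static int
lookup_node(uint32_t requested_id, const char *name, uint32_t *out_id)
{
    if ((requested_id > 0) && (requested_id <= (uint32_t) INT_MAX)) {
        if (requested_id != LOCAL_ID) {
            return -1;              /* -> "Node is not known to cluster" */
        }
    } else if (strcmp(name, LOCAL_NAME) != 0) {
        return -1;
    }
    *out_id = LOCAL_ID;
    return 0;
}

int
main(void)
{
    /* In the report above, the equivalent variable was left uninitialized,
     * so its starting value was whatever happened to be on the stack.  Here
     * two sample "garbage" values are assigned explicitly to keep the demo
     * deterministic (and free of real undefined behavior). */
    uint32_t nodeid;
    const uint32_t garbage[] = { 42U, 0xFFFFFFF0U };

    for (int i = 0; i < 2; i++) {
        nodeid = garbage[i];        /* simulate the leftover stack value */

        if (lookup_node(nodeid, LOCAL_NAME, &nodeid) == 0) {
            printf("%u\n", nodeid); /* garbage > INT_MAX: name fallback wins */
        } else {
            printf("Node is not known to cluster\n");
        }
    }
    return 0;
}
```

In a pattern like this, the straightforward remedy is to initialize the
variable (e.g. `uint32_t nodeid = 0;`) before handing out its address, so the
lookup always goes down the name-fallback path; the actual upstream fix may of
course look different.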
> > > > Thanks,
> > > > A

--
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/