On Fri, Feb 25, 2022 at 4:31 AM Ulrich Windl
<ulrich.wi...@rz.uni-regensburg.de> wrote:
>
> >>> Reid Wahl <nw...@redhat.com> schrieb am 25.02.2022 um 12:31 in Nachricht
> <capiuu99iacxk4jn9_afm+wb98dz7fpvvtze-7sfloryuxrm...@mail.gmail.com>:
> > On Thu, Feb 24, 2022 at 2:28 AM Ulrich Windl
> > <ulrich.wi...@rz.uni-regensburg.de> wrote:
> >>
> >> Hi!
> >>
> >> I just discovered this oddity for a SLES15 SP3 cluster:
> >> Feb 24 11:16:17 h16 pacemaker-attrd[7274]: notice: Setting val_net_gw1[h18]: 1000 -> 139000
> >>
> >> That surprised me, because usually the value is 1000 or 0.
> >>
> >> Digging a bit further I found:
> >> Migration Summary:
> >>   * Node: h18:
> >>     * prm_ping_gw1: migration-threshold=1000000 fail-count=1 last-failure='Thu Feb 24 11:17:18 2022'
> >>
> >> Failed Resource Actions:
> >>   * prm_ping_gw1_monitor_60000 on h18 'error' (1): call=200, status='Error', exitreason='', last-rc-change='2022-02-24 11:17:18 +01:00', queued=0ms, exec=0ms
> >>
> >> Digging further:
> >> Feb 24 11:16:17 h18 kernel: BUG: Bad rss-counter state mm:00000000c620b5fe idx:1 val:17
> >> Feb 24 11:16:17 h18 pacemaker-attrd[6946]: notice: Setting val_net_gw1[h18]: 1000 -> 139000
> >> Feb 24 11:17:17 h18 kernel: traps: pacemaker-execd[38950] general protection fault ip:7f610e71cbcf sp:7ffff7c25100 error:0 in libc-2.31.so[7f610e63b000+1e6000]
> >>
> >> (That rss-counter state causing a series of core dumps seems to be a new "feature" of SLES15 SP3 kernels that is being investigated by support.)
> >>
> >> Somewhat later:
> >> Feb 24 11:17:18 h18 pacemaker-attrd[6946]: notice: Setting val_net_gw1[h18]: 139000 -> (unset)
> >> (restarted RA)
> >> Feb 24 11:17:21 h18 pacemaker-attrd[6946]: notice: Setting val_net_gw1[h18]: (unset) -> 1000
> >>
> >> Another node:
> >> Feb 24 11:16:17 h19 pacemaker-attrd[7435]: notice: Setting val_net_gw1[h18]: 1000 -> 139000
> >> Feb 24 11:17:18 h19 pacemaker-attrd[7435]: notice: Setting val_net_gw1[h18]: 139000 -> (unset)
> >> Feb 24 11:17:21 h19 pacemaker-attrd[7435]: notice: Setting val_net_gw1[h18]: (unset) -> 1000
> >>
> >> So it seems the ping RA sets some garbage value when failing. Is that correct?
> >
> > This is ocf:pacemaker:ping, right? And is use_fping enabled?
>
> Correct. use_fping is not set (default value). I found no fping on the host.
>
> > Looks like it uses ($active * $multiplier) -- see ping_update(). I'm
> > assuming your multiplier is 1000.
>
> Correct: multiplier=1000, and host_list has just one address.
>
> > $active is set by either fping_check() or ping_check(), depending on
> > your configuration. You can see what they're doing here. I'd assume
> > $active is getting set to 139 and then is multiplied by 1000 to set
> > $score later.
>
> But wouldn't that mean 139 hosts were pinged successfully?
> (${HA_BIN}/pingd is being used)
Yeah, that seems to be the intent. Hence my saying "It could also be a
side effect of the fault though, since I don't see anything in
fping_check() or ping_check() that's an obvious candidate for setting
active=139 unless you have a massive host list."

> > -
> > https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.5/extra/resources/ping#L220-L277
>
> Regards,
> Ulrich
>
> >> resource-agents-4.8.0+git30.d0077df0-150300.8.20.1.x86_64
> >> pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64
> >>
> >> Regards,
> >> Ulrich

--
Regards,

Reid Wahl (He/Him), RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
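[Editor's note] The arithmetic discussed in the thread can be sketched in shell (the language the resource agent itself is written in). This is a hedged, simplified sketch of the ping_update() logic in the linked agent source, not the actual RA code; the variable names `active`, `multiplier`, `host_list`, and `score` mirror the agent, but the loop body and the example address 192.0.2.1 are illustrative:

```shell
#!/bin/sh
# Sketch: how ocf:pacemaker:ping derives the node attribute value.
# score = (number of reachable hosts) * multiplier

multiplier=1000        # corresponds to the multiplier=1000 in the thread
host_list="192.0.2.1"  # one address, as in the reporter's configuration

# ping_check()/fping_check() count how many hosts in host_list respond;
# here we simply pretend the single host answered.
active=0
for host in $host_list; do
    active=$((active + 1))
done

score=$((active * multiplier))
echo "$score"   # 1 reachable host * 1000 = 1000
```

With a one-entry host_list, $active can legitimately be only 0 or 1, so the attribute should be 0 or 1000. A value of 139000 would require active=139, which supports the theory above that the value was a side effect of the pacemaker-execd fault rather than a normal result of the counting loop.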