On Tue, 2019-08-13 at 11:06 +0200, Ulrich Windl wrote:
> Hi,
>
> an update:
> After setting a failure-timeout for the resource, that stale monitor
> failure was removed automatically at the next cluster recheck (it
> seems). Still I wonder why a resource cleanup didn't do that (bug?).
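[For reference, a failure-timeout is a resource meta-attribute and can be set with crm_resource; a minimal sketch, reusing the prm_nfs_server name from this thread (the 90s value is an arbitrary example, not from the original mails):]

```shell
# Set a failure-timeout meta-attribute so old monitor failures expire
# automatically (90 seconds here is just an example value).
crm_resource --resource prm_nfs_server --meta \
    --set-parameter failure-timeout --parameter-value 90

# Expired failures are only cleared when the cluster rechecks itself;
# the interval is controlled by the cluster-recheck-interval property.
crm_attribute --type crm_config --name cluster-recheck-interval --query
```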
Possibly ... also possibly fixed already, as there have been a few
clean-up related fixes in the past few versions. I'm not sure what's
been backported in the build you have.

> Regards,
> Ulrich
>
> > > > "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de> wrote on
> > > > 13.08.2019 at 10:07 in message
> > > > <5d526fb0020000a100032...@gwsmtp.uni-regensburg.de>:
> > > > > Ken Gaillot <kgail...@redhat.com> wrote on 13.08.2019 at
> > > > > 01:03 in message
> > > > > <ec355e98371d22d6c83e4927f65fdfa711266c21.ca...@redhat.com>:
> > > On Mon, 2019-08-12 at 17:46 +0200, Ulrich Windl wrote:
> > > > Hi!
> > > >
> > > > I just noticed that a "crm resource cleanup <rsc>" caused some
> > > > unexpected behavior and the syslog message:
> > > > crmd[7281]: warning: new_event_notification (7281-97955-15):
> > > > Broken pipe (32)
> > > >
> > > > It's SLES14 SP4, last updated Sept. 2018 (up since then,
> > > > pacemaker-1.1.19+20180928.0d2680780-1.8.x86_64).
> > > >
> > > > The cleanup was due to a failed monitor. As an unexpected
> > > > consequence of this cleanup, CRM seemed to restart the complete
> > > > resource (and dependencies), even though it was running.
> > >
> > > I assume the monitor failure was old, and recovery had already
> > > completed? If not, recovery might have been initiated before the
> > > clean-up was recorded.
> > >
> > > > I noticed that a manual "crm_resource -C -r <rsc> -N <node>"
> > > > command has the same effect (multiple resources are "Cleaned
> > > > up", and resources are restarted seemingly before the "probe"
> > > > is done).
> > >
> > > Can you verify whether the probes were done? The DC should log a
> > > message when each <rsc>_monitor_0 result comes in.
> >
> > So here's a rough sketch of events:
> > 17:10:23 crmd[7281]: notice: State transition S_IDLE ->
> > S_POLICY_ENGINE
> > ...no probes yet...
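[To verify whether the probes ran, the DC's log can be searched for the <rsc>_monitor_0 results Ken mentions; a sketch, assuming syslog goes to /var/log/messages (it may instead be pacemaker.log or the journal on your system):]

```shell
# On the DC, look for probe (monitor_0) results and their rc codes;
# the log path is an assumption and varies by distribution.
grep -E '_monitor_0' /var/log/messages | tail

# Or via the journal on systemd-based systems:
journalctl -u pacemaker | grep -E '_monitor_0' | tail
```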
> > 17:10:24 pengine[7280]: warning: Processing failed monitor of
> > prm_nfs_server on rksaph11: not running
> > ...lots of starts/restarts...
> > 17:10:24 pengine[7280]: notice: * Restart prm_nfs_server
> > ...
> > 17:10:24 crmd[7281]: notice: Processing graph 6628
> > (ref=pe_calc-dc-1565622624-7313) derived from
> > /var/lib/pacemaker/pengine/pe-input-1810.bz2
> > ...monitors are being called...
> > 17:10:24 crmd[7281]: notice: Result of probe operation for
> > prm_nfs_vg on h11: 0 (ok)
> > ...the above was the first probe result...
> > 17:10:24 crmd[7281]: warning: Action 33 (prm_nfs_vg_monitor_0) on
> > h11 failed (target: 7 vs. rc: 0): Error
> > ...not surprising to me: The resource was running; I don't know why
> > the cluster wants to start it...

That's normal; that's how Pacemaker detects active resources after
clean-up. It schedules a probe and a start on the assumption that the
probe will find the resource not running; if it is running, the probe
result will cause a new transition where the start isn't needed.

That message will be improved in the next version (already in the
master branch), like:

notice: Transition 10 action 33 (prm_nfs_vg_monitor_0 on h11):
expected 'not running' but got 'ok'

> > 17:10:24 crmd[7281]: notice: Transition 6629 (Complete=9,
> > Pending=0, Fired=0, Skipped=0, Incomplete=0,
> > Source=/var/lib/pacemaker/pengine/pe-input-1811.bz2): Complete
> > 17:10:24 crmd[7281]: notice: State transition S_TRANSITION_ENGINE
> > -> S_IDLE
> >
> > The really bad thing after this is that the "cleaned up" resource
> > still has a failed status (dated in the past, last-rc-change='Mon
> > Aug 12 04:52:23 2019'), even though it is "running".
> >
> > I tend to believe that the cluster is in a bad state, or the
> > software has a problem cleaning the status of the monitor.

It does sound like a clean-up bug, but I'm not aware of any current
issues. I suspect it's already fixed.
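[As an aside, a transition like graph 6628 can be replayed offline from the pe-input file named in the crmd log, to see why the scheduler wanted the restarts; a sketch, using the path from the log above:]

```shell
# Replay the scheduler's decisions from the saved transition input;
# this only simulates, it does not touch the live cluster.
crm_simulate --simulate \
    --xml-file /var/lib/pacemaker/pengine/pe-input-1810.bz2
```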
> > The CIB status for the resource looks like this:
> > <lrm_resource id="prm_nfs_server" type="nfsserver" class="ocf"
> >     provider="heartbeat">
> >   <lrm_rsc_op id="prm_nfs_server_last_0"
> >     operation_key="prm_nfs_server_start_0" operation="start"
> >     crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> >     transition-key="67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >     transition-magic="0:0;67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >     exit-reason="" on_node="h11" call-id="799" rc-code="0"
> >     op-status="0" interval="0" last-run="1565582351"
> >     last-rc-change="1565582351" exec-time="708" queue-time="0"
> >     op-digest="73311a0ef4ba8e9f1f97e05e989f6348"/>
> >   <lrm_rsc_op id="prm_nfs_server_monitor_60000"
> >     operation_key="prm_nfs_server_monitor_60000" operation="monitor"
> >     crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> >     transition-key="68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >     transition-magic="0:0;68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >     exit-reason="" on_node="h11" call-id="800" rc-code="0"
> >     op-status="0" interval="60000" last-rc-change="1565582351"
> >     exec-time="499" queue-time="0"
> >     op-digest="9d8aa17b2a741c8328d7896459733e56"/>
> >   <lrm_rsc_op id="prm_nfs_server_last_failure_0"
> >     operation_key="prm_nfs_server_monitor_60000" operation="monitor"
> >     crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> >     transition-key="4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >     transition-magic="0:7;4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >     exit-reason="" on_node="h11" call-id="738" rc-code="7"
> >     op-status="0" interval="60000" last-rc-change="1565578343"
> >     exec-time="0" queue-time="0"
> >     op-digest="9d8aa17b2a741c8328d7896459733e56"/>
> > </lrm_resource>
> >
> > Regards,
> > Ulrich
> >
> > > > Actually the manual says when cleaning up a single primitive,
> > > > the whole group is cleaned up, unless using --force.
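[The stale lrm_rsc_op history above can be pulled straight from the live CIB, and the matching failcount checked; a sketch using the resource and node names from the thread:]

```shell
# Dump the operation history the scheduler sees for this resource:
cibadmin --query --xpath "//lrm_resource[@id='prm_nfs_server']"

# Check the failcount that corresponds to the last_failure entry:
crm_failcount --query --resource prm_nfs_server --node h11
```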
> > > > Well, I don't like this default, as I expect any status change
> > > > from a probe would propagate to the group anyway...
> > >
> > > In 1.1, clean-up always wipes the history of the affected
> > > resources, regardless of whether the history is for success or
> > > failure. That means all the cleaned resources will be reprobed.
> > > In 2.0, clean-up by default wipes the history only if there's a
> > > failed action (--refresh/-R is required to get the 1.1 behavior).
> > > That lessens the impact of the "default to whole group" behavior.
> > >
> > > I think the original idea was that a group indicates that the
> > > resources are closely related, so changing the status of one
> > > member might affect what status the others report.
> > >
> > > > Regards,
> > > > Ulrich
--
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
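[Based on Ken's description, the scope of a clean-up can be narrowed; a sketch, reusing the resource/node names from the thread (note that --refresh exists only in the 2.0 series):]

```shell
# 1.1: clean up just this primitive rather than its whole group:
crm_resource --cleanup --resource prm_nfs_server --node h11 --force

# 2.0: --cleanup wipes history only for failed actions; use --refresh
# to wipe all history and force reprobes, as 1.1's clean-up did:
crm_resource --refresh --resource prm_nfs_server --node h11
```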