On Wed, 2019-08-14 at 10:24 +0200, Ulrich Windl wrote:
> (subject changed for existing thread)
> 
> Hi!
> 
> After I had thought the problem with the sticky failed monitor was
> finally solved, I realized that I'm getting a message that I don't
> really understand after each cluster recheck interval:
> 
> pengine[7280]:  notice: Re-initiated expired calculated failure
> prm_nfs_server_monitor_60000 (rc=7,
> magic=0:7;4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726) on h11
> 
> The message repeats absolutely identically. So what does it mean?

That one confuses me too. An expired failure is simply ignored for
non-recurring operations. But for expired failures of a recurring
monitor, if the node is up, the monitor's restart digest is altered,
which I believe causes it to be cancelled and rescheduled. The reason
given in the commit message was "This is particularly relevant for
those with on-fail=block which stick around and are not cleaned up by
a subsequent stop/start." I don't claim to understand it. :)
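If it helps anyone following along: the expiry itself is governed by
the resource's failure-timeout meta-attribute, and expired results are
only re-evaluated when a new transition runs, which is why the message
shows up once per cluster-recheck-interval. A rough crmsh sketch
(untested; the interval values are only illustrative):

    # Let failures on the resource expire after 10 minutes
    crm resource meta prm_nfs_server set failure-timeout 10min

    # An expired failure is re-evaluated no later than this interval,
    # even if no other cluster event triggers a transition
    crm configure property cluster-recheck-interval=5min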
> The monitor did not fail between cluster rechecks, and crm_mon is
> not displaying any failed operations.

Probably because it's expired. A clean-up should still get rid of it,
though.
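For completeness, checking that a clean-up really removed the stale
entry might look like this (an untested sketch; the resource and node
names are taken from the messages quoted below):

    # Wipe the operation history for the resource on that node
    crm_resource --cleanup --resource prm_nfs_server --node h11

    # The fail count and the failed-operation entry should now be gone
    crm_failcount --query --resource prm_nfs_server --node h11
    crm_mon --one-shot --failcounts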
> 
> Regards,
> Ulrich
> 
> >>> Ulrich Windl wrote on 13.08.2019 at 11:06 in message
> >>> <5D527D91.124:161:60728>:
> > Hi,
> > 
> > an update:
> > After setting a failure-timeout for the resource, that stale
> > monitor failure was removed automatically at the next cluster
> > recheck (it seems).
> > Still I wonder why a resource cleanup didn't do that (bug?).
> > 
> > Regards,
> > Ulrich
> > 
> > >>> "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de> wrote on
> > >>> 13.08.2019 at 10:07 in message
> > >>> <5d526fb0020000a100032...@gwsmtp.uni-regensburg.de>:
> > > >>> Ken Gaillot <kgail...@redhat.com> wrote on 13.08.2019 at
> > > >>> 01:03 in message
> > > >>> <ec355e98371d22d6c83e4927f65fdfa711266c21.ca...@redhat.com>:
> > > > On Mon, 2019-08-12 at 17:46 +0200, Ulrich Windl wrote:
> > > > > Hi!
> > > > > 
> > > > > I just noticed that a "crm resource cleanup <rsc>" caused
> > > > > some unexpected behavior and the syslog message:
> > > > > crmd[7281]:  warning: new_event_notification (7281-97955-15):
> > > > > Broken pipe (32)
> > > > > 
> > > > > It's SLES12 SP4 last updated Sept. 2018 (up since then,
> > > > > pacemaker-1.1.19+20180928.0d2680780-1.8.x86_64).
> > > > > 
> > > > > The cleanup was due to a failed monitor. As an unexpected
> > > > > consequence of this cleanup, CRM seemed to restart the
> > > > > complete resource (and dependencies), even though it was
> > > > > running.
> > > > 
> > > > I assume the monitor failure was old, and recovery had already
> > > > completed? If not, recovery might have been initiated before
> > > > the clean-up was recorded.
> > > > 
> > > > > I noticed that a manual "crm_resource -C -r <rsc> -N <node>"
> > > > > command has the same effect (multiple resources are "Cleaned
> > > > > up", and resources are restarted seemingly before the
> > > > > "probe" is done).
> > > > 
> > > > Can you verify whether the probes were done? The DC should log
> > > > a message when each <rsc>_monitor_0 result comes in.
> > > 
> > > So here's a rough sketch of events:
> > > 17:10:23 crmd[7281]:   notice: State transition S_IDLE ->
> > > S_POLICY_ENGINE
> > > ...no probes yet...
> > > 17:10:24 pengine[7280]:  warning: Processing failed monitor of
> > > prm_nfs_server on rksaph11: not running
> > > ...lots of starts/restarts...
> > > 17:10:24 pengine[7280]:   notice:  * Restart    prm_nfs_server
> > > ...
> > > 17:10:24 crmd[7281]:   notice: Processing graph 6628
> > > (ref=pe_calc-dc-1565622624-7313) derived from
> > > /var/lib/pacemaker/pengine/pe-input-1810.bz2
> > > ...monitors are being called...
> > > 17:10:24 crmd[7281]:   notice: Result of probe operation for
> > > prm_nfs_vg on h11: 0 (ok)
> > > ...the above was the first probe result...
> > > 17:10:24 crmd[7281]:  warning: Action 33 (prm_nfs_vg_monitor_0)
> > > on h11 failed (target: 7 vs. rc: 0): Error
> > > ...not surprising to me: the resource was running; I don't know
> > > why the cluster wants to start it...
> > > 17:10:24 crmd[7281]:   notice: Transition 6629 (Complete=9,
> > > Pending=0, Fired=0, Skipped=0, Incomplete=0,
> > > Source=/var/lib/pacemaker/pengine/pe-input-1811.bz2): Complete
> > > 17:10:24 crmd[7281]:   notice: State transition
> > > S_TRANSITION_ENGINE -> S_IDLE
> > > 
> > > The really bad thing after this is that the "cleaned up" resource
> > > still has a failed status (dated in the past: last-rc-change='Mon
> > > Aug 12 04:52:23 2019'), even though it is "running".
> > > 
> > > I tend to believe that the cluster is in a bad state, or the
> > > software has a problem cleaning the status of the monitor.
> > > 
> > > The CIB status for the resource looks like this:
> > >   <lrm_resource id="prm_nfs_server" type="nfsserver" class="ocf"
> > > provider="heartbeat">
> > >     <lrm_rsc_op id="prm_nfs_server_last_0"
> > > operation_key="prm_nfs_server_start_0" operation="start"
> > > crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> > > transition-key="67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > > transition-magic="0:0;67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > > exit-reason="" on_node="h11" call-id="799" rc-code="0"
> > > op-status="0" interval="0" last-run="1565582351"
> > > last-rc-change="1565582351" exec-time="708" queue-time="0"
> > > op-digest="73311a0ef4ba8e9f1f97e05e989f6348"/>
> > >     <lrm_rsc_op id="prm_nfs_server_monitor_60000"
> > > operation_key="prm_nfs_server_monitor_60000" operation="monitor"
> > > crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> > > transition-key="68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > > transition-magic="0:0;68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > > exit-reason="" on_node="h11" call-id="800" rc-code="0"
> > > op-status="0" interval="60000" last-rc-change="1565582351"
> > > exec-time="499" queue-time="0"
> > > op-digest="9d8aa17b2a741c8328d7896459733e56"/>
> > >     <lrm_rsc_op id="prm_nfs_server_last_failure_0"
> > > operation_key="prm_nfs_server_monitor_60000" operation="monitor"
> > > crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> > > transition-key="4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > > transition-magic="0:7;4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > > exit-reason="" on_node="h11" call-id="738" rc-code="7"
> > > op-status="0" interval="60000" last-rc-change="1565578343"
> > > exec-time="0" queue-time="0"
> > > op-digest="9d8aa17b2a741c8328d7896459733e56"/>
> > >   </lrm_resource>
> > > 
> > > Regards,
> > > Ulrich
> > > 
> > > > > Actually the manual says when cleaning up a single
> > > > > primitive, the whole group is cleaned up, unless using
> > > > > --force. Well, I don't like this default, as I expect any
> > > > > status change from a probe would propagate to the group
> > > > > anyway...
> > > > 
> > > > In 1.1, clean-up always wipes the history of the affected
> > > > resources, regardless of whether the history is for success or
> > > > failure. That means all the cleaned resources will be
> > > > reprobed. In 2.0, clean-up by default wipes the history only
> > > > if there's a failed action (--refresh/-R is required to get
> > > > the 1.1 behavior). That lessens the impact of the "default to
> > > > whole group" behavior.
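> > > > 
> > > > I think the original idea was that a group indicates that the
> > > > resources are closely related, so changing the status of one
> > > > member might affect what status the others report.
> > > > 
> > > > > Regards,
> > > > > Ulrich

In shell terms, the 1.1-vs-2.0 difference above is roughly the
following (a sketch based on the behavior described in this thread,
not verified against either version):

    # Pacemaker 1.1: wipes the whole operation history of the
    # resource on that node, so it gets reprobed even if it never
    # failed
    crm_resource --cleanup --resource prm_nfs_server --node h11

    # Pacemaker 2.0: --cleanup acts only when a failed action exists;
    # --refresh (-R) restores the unconditional 1.1-style wipe
    crm_resource --refresh --resource prm_nfs_server --node h11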
-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/