On Wed, 2019-08-14 at 10:24 +0200, Ulrich Windl wrote:
> (subject changed for existing thread)
> 
> Hi!
> 
> After I had thought the problem with the sticky failed monitor was
> finally solved, I realized that I'm getting a message that I don't
> really understand after each cluster recheck interval:
> 
> pengine[7280]:  notice: Re-initiated expired calculated failure
> prm_nfs_server_monitor_60000 (rc=7,
> magic=0:7;4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726) on h11
> 
> The message repeats absolutely identically. So what does it mean?

That one confuses me too. An expired failure is simply ignored for
non-recurring operations. But for expired failures of a recurring
monitor, if the node is up, the monitor's restart digest is altered,
which I believe causes it to be cancelled and rescheduled. The reason
given in the commit message was "This is particularly relevant for
those with on-fail=block which stick around and are not cleaned up by
a subsequent stop/start." I don't claim to understand it. :)
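If it helps anyone following along: the expiry itself is governed by
the resource's failure-timeout meta-attribute, and expired results are
only re-evaluated when a new transition runs, which is why the message
shows up once per cluster-recheck-interval. A rough crmsh sketch
(untested; the interval values are only illustrative):

    # Let failures on the resource expire after 10 minutes
    crm resource meta prm_nfs_server set failure-timeout 10min

    # An expired failure is re-evaluated no later than this interval,
    # even if no other cluster event triggers a transition
    crm configure property cluster-recheck-interval=5min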
> The monitor did not fail between cluster rechecks, and crm_mon is
> not displaying any failed operations.

Probably because it's expired. A clean-up should still get rid of it,
though.
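For completeness, checking that a clean-up really removed the stale
entry might look like this (an untested sketch; the resource and node
names are taken from the messages quoted below):

    # Wipe the operation history for the resource on that node
    crm_resource --cleanup --resource prm_nfs_server --node h11

    # The fail count and the failed-operation entry should now be gone
    crm_failcount --query --resource prm_nfs_server --node h11
    crm_mon --one-shot --failcounts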
> 
> Regards,
> Ulrich
> 
> >>> Ulrich Windl wrote on 13.08.2019 at 11:06 in message
> >>> <5D527D91.124:161:60728>:
> > Hi,
> > 
> > an update:
> > After setting a failure-timeout for the resource, that stale
> > monitor failure was removed automatically at the next cluster
> > recheck (it seems).
> > Still I wonder why a resource cleanup didn't do that (bug?).
> > 
> > Regards,
> > Ulrich
> > 
> > >>> "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de> wrote on
> > >>> 13.08.2019 at 10:07 in message
> > >>> <5d526fb0020000a100032...@gwsmtp.uni-regensburg.de>:
> > > >>> Ken Gaillot <kgail...@redhat.com> wrote on 13.08.2019 at
> > > >>> 01:03 in message
> > > >>> <ec355e98371d22d6c83e4927f65fdfa711266c21.ca...@redhat.com>:
> > > > On Mon, 2019-08-12 at 17:46 +0200, Ulrich Windl wrote:
> > > > > Hi!
> > > > > 
> > > > > I just noticed that a "crm resource cleanup <rsc>" caused
> > > > > some unexpected behavior and the syslog message:
> > > > > crmd[7281]:  warning: new_event_notification (7281-97955-15):
> > > > > Broken pipe (32)
> > > > > 
> > > > > It's SLES12 SP4 last updated Sept. 2018 (up since then,
> > > > > pacemaker-1.1.19+20180928.0d2680780-1.8.x86_64).
> > > > > 
> > > > > The cleanup was due to a failed monitor. As an unexpected
> > > > > consequence of this cleanup, CRM seemed to restart the
> > > > > complete resource (and dependencies), even though it was
> > > > > running.
> > > > 
> > > > I assume the monitor failure was old, and recovery had already
> > > > completed? If not, recovery might have been initiated before
> > > > the clean-up was recorded.
> > > > 
> > > > > I noticed that a manual "crm_resource -C -r <rsc> -N <node>"
> > > > > command has the same effect (multiple resources are "Cleaned
> > > > > up", and resources are restarted seemingly before the
> > > > > "probe" is done).
> > > > 
> > > > Can you verify whether the probes were done? The DC should log
> > > > a message when each <rsc>_monitor_0 result comes in.
> > > 
> > > So here's a rough sketch of events:
> > > 17:10:23 crmd[7281]:   notice: State transition S_IDLE ->
> > > S_POLICY_ENGINE
> > > ...no probes yet...
> > > 17:10:24 pengine[7280]:  warning: Processing failed monitor of
> > > prm_nfs_server on rksaph11: not running
> > > ...lots of starts/restarts...
> > > 17:10:24 pengine[7280]:   notice:  * Restart    prm_nfs_server
> > > ...
> > > 17:10:24 crmd[7281]:   notice: Processing graph 6628
> > > (ref=pe_calc-dc-1565622624-7313) derived from
> > > /var/lib/pacemaker/pengine/pe-input-1810.bz2
> > > ...monitors are being called...
> > > 17:10:24 crmd[7281]:   notice: Result of probe operation for
> > > prm_nfs_vg on h11: 0 (ok)
> > > ...the above was the first probe result...
> > > 17:10:24 crmd[7281]:  warning: Action 33 (prm_nfs_vg_monitor_0)
> > > on h11 failed (target: 7 vs. rc: 0): Error
> > > ...not surprising to me: the resource was running; I don't know
> > > why the cluster wants to start it...
> > > 17:10:24 crmd[7281]:   notice: Transition 6629 (Complete=9,
> > > Pending=0, Fired=0, Skipped=0, Incomplete=0,
> > > Source=/var/lib/pacemaker/pengine/pe-input-1811.bz2): Complete
> > > 17:10:24 crmd[7281]:   notice: State transition
> > > S_TRANSITION_ENGINE -> S_IDLE
> > > 
> > > The really bad thing after this is that the "cleaned up" resource
> > > still has a failed status (dated in the past: last-rc-change='Mon
> > > Aug 12 04:52:23 2019'), even though it is "running".
> > > 
> > > I tend to believe that the cluster is in a bad state, or the
> > > software has a problem cleaning the status of the monitor.
> > > 
> > > The CIB status for the resource looks like this:
> > >   <lrm_resource id="prm_nfs_server" type="nfsserver" class="ocf"
> > > provider="heartbeat">
> > >     <lrm_rsc_op id="prm_nfs_server_last_0"
> > > operation_key="prm_nfs_server_start_0" operation="start"
> > > crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> > > transition-key="67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > > transition-magic="0:0;67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > > exit-reason="" on_node="h11" call-id="799" rc-code="0"
> > > op-status="0" interval="0" last-run="1565582351"
> > > last-rc-change="1565582351" exec-time="708" queue-time="0"
> > > op-digest="73311a0ef4ba8e9f1f97e05e989f6348"/>
> > >     <lrm_rsc_op id="prm_nfs_server_monitor_60000"
> > > operation_key="prm_nfs_server_monitor_60000" operation="monitor"
> > > crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> > > transition-key="68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > > transition-magic="0:0;68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > > exit-reason="" on_node="h11" call-id="800" rc-code="0"
> > > op-status="0" interval="60000" last-rc-change="1565582351"
> > > exec-time="499" queue-time="0"
> > > op-digest="9d8aa17b2a741c8328d7896459733e56"/>
> > >     <lrm_rsc_op id="prm_nfs_server_last_failure_0"
> > > operation_key="prm_nfs_server_monitor_60000" operation="monitor"
> > > crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> > > transition-key="4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > > transition-magic="0:7;4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > > exit-reason="" on_node="h11" call-id="738" rc-code="7"
> > > op-status="0" interval="60000" last-rc-change="1565578343"
> > > exec-time="0" queue-time="0"
> > > op-digest="9d8aa17b2a741c8328d7896459733e56"/>
> > >   </lrm_resource>
> > > 
> > > Regards,
> > > Ulrich
> > > 
> > > > > Actually the manual says when cleaning up a single
> > > > > primitive, the whole group is cleaned up, unless using
> > > > > --force. Well, I don't like this default, as I expect any
> > > > > status change from a probe would propagate to the group
> > > > > anyway...
> > > > 
> > > > In 1.1, clean-up always wipes the history of the affected
> > > > resources, regardless of whether the history is for success or
> > > > failure. That means all the cleaned resources will be
> > > > reprobed. In 2.0, clean-up by default wipes the history only
> > > > if there's a failed action (--refresh/-R is required to get
> > > > the 1.1 behavior). That lessens the impact of the "default to
> > > > whole group" behavior.
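> > > > 
> > > > I think the original idea was that a group indicates that the
> > > > resources are closely related, so changing the status of one
> > > > member might affect what status the others report.
> > > > 
> > > > > Regards,
> > > > > Ulrich

In shell terms, the 1.1-vs-2.0 difference above is roughly the
following (a sketch based on the behavior described in this thread,
not verified against either version):

    # Pacemaker 1.1: wipes the whole operation history of the
    # resource on that node, so it gets reprobed even if it never
    # failed
    crm_resource --cleanup --resource prm_nfs_server --node h11

    # Pacemaker 2.0: --cleanup acts only when a failed action exists;
    # --refresh (-R) restores the unconditional 1.1-style wipe
    crm_resource --refresh --resource prm_nfs_server --node h11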
-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/