On Mon, 2019-11-18 at 18:13 +0100, Jehan-Guillaume de Rorthais wrote:
> On Mon, 18 Nov 2019 10:45:25 -0600
> Ken Gaillot <kgail...@redhat.com> wrote:
> 
> > On Fri, 2019-11-15 at 14:35 +0100, Jehan-Guillaume de Rorthais
> > wrote:
> > > On Thu, 14 Nov 2019 11:09:57 -0600
> > > Ken Gaillot <kgail...@redhat.com> wrote:
> > >   
> > > > On Thu, 2019-11-14 at 15:22 +0100, Ulrich Windl wrote:  
> > > > > >>> Jehan-Guillaume de Rorthais <j...@dalibo.com> wrote on
> > > > > 14.11.2019 at 15:17 in message <20191114151719.6cbf4e38@firost>:
> > > > > > On Wed, 13 Nov 2019 17:30:31 -0600
> > > > > > Ken Gaillot <kgail...@redhat.com> wrote:
> > > > > > ...    
> > > > > > > A longstanding pain point in the logs has been improved.
> > > > > > > Whenever the scheduler processes resource history, it logs
> > > > > > > a warning for any failures it finds, regardless of whether
> > > > > > > they are new or old, which can confuse anyone reading the
> > > > > > > logs. Now, the log will contain the time of the failure, so
> > > > > > > it's obvious whether you're seeing the same event or not.
> > > > > > > The log will also contain the exit reason if one was
> > > > > > > provided by the resource agent, for easier troubleshooting.
> > > > > > 
> > > > > > I've been hurt by this in the past, and I was wondering:
> > > > > > what is the point of warning again and again in the logs
> > > > > > about past failures during scheduling? What does this
> > > > > > information bring to the administrator?
> > > > 
> > > > The controller will log an event just once, when it happens.
> > > > 
> > > > The scheduler, on the other hand, uses the entire recorded
> > > > resource history to determine the current resource state. Old
> > > > failures (that haven't been cleaned) must be taken into account.
> > > 
> > > OK, I wasn't aware of this. If you have a few minutes, I would be
> > > interested to know why the full history is needed rather than just
> > > the latest entry there. Or maybe there are comments in the source
> > > code that already cover this question?
> > 
> > The full *recorded* history consists of the most recent operation
> > that affects the state (like start/stop/promote/demote), the most
> > recent failed operation, and the most recent results of any
> > recurring monitors.
> > 
> > For example, there may be a failed monitor, but whether the resource
> > is considered failed or not would depend on whether there was a more
> > recent successful stop or start. Even if the failed monitor has been
> > superseded, it needs to stay in the history for display purposes
> > until the user has cleaned it up.
> 
> OK, understood.
> 
> Maybe that's why "FAILED" appears briefly in crm_mon during a resource
> move on a clean resource that has past failures? Maybe I should dig
> into this weird behavior and write up a bug report if I confirm it?

I've occasionally seen such blips myself but never had a chance to
track them down. If you can find a reliable reproducer, a bug report
would be worthwhile. I'm not sure why it happens, but hazarding a
guess, it may be related to recording pending operations -- a pending
operation doesn't affect the state, so if recording the pending
operation clears the last clean start/stop, and there's a past failure,
that could make sense.
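
If that guess is right, the blip could look something like the toy
model below. This is purely hypothetical -- it only illustrates the
guess above, and the record layout is invented for the example:

# Purely hypothetical toy model of the guess above -- not the real records.
# If recording a pending operation displaces the last clean start, an old
# failure briefly becomes the newest relevant entry.

def looks_failed(last_clean_op, last_failure):
    return last_failure is not None and (
        last_clean_op is None or last_failure["when"] > last_clean_op["when"])

last_failure = {"op": "monitor", "ok": False, "when": 100}
last_clean_op = {"op": "start", "ok": True, "when": 200}

print(looks_failed(last_clean_op, last_failure))  # False: the start is newer

# A pending operation is recorded but doesn't affect state; suppose it
# replaces the record of the clean start until its result comes back...
last_clean_op = None
print(looks_failed(last_clean_op, last_failure))  # True: brief "FAILED" blip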

> 
> > > > Every run of the scheduler is completely independent, so it
> > > > doesn't know about any earlier runs or what they logged. Think
> > > > of it like Frosty the Snowman saying "Happy Birthday!" every
> > > > time his hat is put on.
> > > 
> > > I don't have this ref :)  
> > 
> > I figured not everybody would, but it was too fun to pass up :)
> > 
> > The snowman comes to life every time his magic hat is put on, but to
> > him each time feels like he's being born for the first time, so he
> > says "Happy Birthday!"
> > 
> > https://www.youtube.com/watch?v=1PbWTEYoN8o
> 
> heh :)
> 
> > > > As far as each run is concerned, it is the first time it's seen
> > > > the history. This is what allows the DC role to move from node
> > > > to node, and the scheduler to be run as a simulation using a
> > > > saved CIB file.
> > > > 
> > > > We could change the wording further if necessary. The previous
> > > > version would log something like:
> > > > 
> > > > warning: Processing failed monitor of my-rsc on node1: not running
> > > > 
> > > > and this latest change will log it like:
> > > > 
> > > > warning: Unexpected result (not running: No process state file found) was recorded for monitor of my-rsc on node1 at Nov 12 19:19:02 2019
> > > 
> > > /result/state/ ?  
> > 
> > It's the result of a resource agent action, so it could be, for
> > example, a timeout or a permissions issue.
> 
> ok
> 
> > > > I wanted to be explicit that the message is about processing
> > > > resource history that may or may not have been processed and
> > > > logged before, but everything I came up with seemed too long for
> > > > a log line. Another possibility might be something like:
> > > > 
> > > > warning: Using my-rsc history to determine its current state on node1: Unexpected result (not running: No process state file found) was recorded for monitor at Nov 12 19:19:02 2019
> > > 
> > > I like the first one better.
> > > 
> > > However, it feels like implementation details exposed to the
> > > world, doesn't it? How useful is this information to the end user?
> > > What can the user do with it? There's nothing to fix, and this is
> > > not actually an error in the currently running process.
> > > 
> > > I still fail to understand why the scheduler doesn't process the
> > > history silently, whatever it finds there, and then warn about
> > > something really important if the final result is not what was
> > > expected...
> > 
> > From the scheduler's point of view, it's all relevant information
> > that goes into the decision making. Even an old failure can cause
> > new actions, for example if quorum was not held at the time but has
> > now been reached, or if there is a failure-timeout that just
> > expired. So any failure history is important to understanding
> > whatever the scheduler says needs to be done.
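
For instance, here's a toy sketch of the failure-timeout case
(simplified names, not the real implementation):

# Toy sketch of the failure-timeout case above -- simplified names, not
# the real code. The scheduler has to re-read the old failure on every
# run to notice the moment the timeout expires.

def failure_still_counts(failure_time, failure_timeout, now):
    """True while an uncleaned failure still counts against the resource."""
    if failure_timeout <= 0:   # no failure-timeout configured
        return True
    return (now - failure_time) < failure_timeout

failure_time = 1000      # when the monitor failed (seconds since epoch)
failure_timeout = 600    # resource meta-attribute, in seconds

print(failure_still_counts(failure_time, failure_timeout, now=1060))  # True
print(failure_still_counts(failure_time, failure_timeout, now=1700))  # False:
# the same old history entry now triggers new actions (expiring the failure)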
> > 
> > Also, the scheduler is run on the DC, which is not necessarily the
> > node that executed the action. So it's useful for troubleshooting to
> > present a picture of the whole cluster on the DC, rather than just
> > the situation on the local node.
> 
> OK, kind of got it. The scheduler needs to summarize the chain of
> events to determine the state of a resource based on the last event.
> 
> > I could see an argument for lowering it from warning to notice, but
> > it's a balance between what's most useful during normal operation
> > and what's most useful during troubleshooting.
> 
> So in my humble opinion, the messages should definitely be at notice
> level. Maybe they should even go to debug level. I have never had to
> troubleshoot a bad decision from the scheduler because of a bad state
> summary. Moreover, if needed, the admin can still study the history
> from the CIB backed up on disk, can't they?
> 
> The alternative would be to spit out the event chain in detail only if
> the result of the summary differs from what the scheduler was
> expecting?
> 
-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
