Re: [ClusterLabs] Antw: Re: Pacemaker 2.0.3-rc3 now available

2019-11-18 Thread Jehan-Guillaume de Rorthais
On Mon, 18 Nov 2019 10:45:25 -0600
Ken Gaillot  wrote:

> On Fri, 2019-11-15 at 14:35 +0100, Jehan-Guillaume de Rorthais wrote:
> > On Thu, 14 Nov 2019 11:09:57 -0600
> > Ken Gaillot  wrote:
> >   
> > > On Thu, 2019-11-14 at 15:22 +0100, Ulrich Windl wrote:  
> > > > > > > Jehan-Guillaume de Rorthais wrote on 14.11.2019 at 15:17 in
> > > > message <20191114151719.6cbf4e38@firost>:
> > > > > On Wed, 13 Nov 2019 17:30:31 -0600
> > > > > Ken Gaillot  wrote:
> > > > > ...
> > > > > > A longstanding pain point in the logs has been improved.
> > > > > > Whenever the scheduler processes resource history, it logs a
> > > > > > warning for any failures it finds, regardless of whether they
> > > > > > are new or old, which can confuse anyone reading the logs.
> > > > > > Now, the log will contain the time of the failure, so it's
> > > > > > obvious whether you're seeing the same event or not. The log
> > > > > > will also contain the exit reason if one was provided by the
> > > > > > resource agent, for easier troubleshooting.
> > > > > 
> > > > > I've been hurt by this in the past, and I was wondering: what
> > > > > is the point of warning again and again in the logs about past
> > > > > failures during scheduling? What does this information bring to
> > > > > the administrator?
> > > 
> > > The controller will log an event just once, when it happens.
> > > 
> > > The scheduler, on the other hand, uses the entire recorded resource
> > > history to determine the current resource state. Old failures (that
> > > haven't been cleaned) must be taken into account.  
> > 
> > OK, I wasn't aware of this. If you have a few minutes, I would be
> > interested to know why the full history is needed rather than just
> > finding the latest entry there. Or maybe there are some comments in
> > the source code that already cover this question?
> 
> The full *recorded* history consists of the most recent operation that
> affects the state (like start/stop/promote/demote), the most recent
> failed operation, and the most recent results of any recurring
> monitors.
> 
> For example there may be a failed monitor, but whether the resource is
> considered failed or not would depend on whether there was a more
> recent successful stop or start. Even if the failed monitor has been
> superseded, it needs to stay in the history for display purposes until
> the user has cleaned it up.

OK, understood.

Maybe that's why "FAILED" appears briefly in crm_mon during a resource move on
a clean resource that has past failures? Maybe I should dig into this odd
behavior and write up a bug report if I confirm it?

> > > Every run of the scheduler is completely independent, so it doesn't
> > > know about any earlier runs or what they logged. Think of it like
> > > Frosty the Snowman saying "Happy Birthday!" every time his hat is
> > > put on.
> > 
> > I don't get this reference :)
> 
> I figured not everybody would, but it was too fun to pass up :)
> 
> The snowman comes to life every time his magic hat is put on, but to
> him each time feels like he's being born for the first time, so he says
> "Happy Birthday!"
> 
> https://www.youtube.com/watch?v=1PbWTEYoN8o

heh :)

> > > As far as each run is concerned, it is the first time it's seen the
> > > history. This is what allows the DC role to move from node to node,
> > > and the scheduler to be run as a simulation using a saved CIB file.
> > > 
> > > We could change the wording further if necessary. The previous
> > > version would log something like:
> > > 
> > > warning: Processing failed monitor of my-rsc on node1: not running
> > > 
> > > and this latest change will log it like:
> > > 
> > > warning: Unexpected result (not running: No process state file found)
> > > was recorded for monitor of my-rsc on node1 at Nov 12 19:19:02 2019
> > 
> > /result/state/ ?  
> 
> It's the result of a resource agent action, so it could be for example
> a timeout or a permissions issue.

ok

> > > I wanted to be explicit about the message being about processing
> > > resource history that may or may not be the first time it's been
> > > processed and logged, but everything I came up with seemed too long
> > > for a log line. Another possibility might be something like:
> > > 
> > > warning: Using my-rsc history to determine its current state on node1:
> > > Unexpected result (not running: No process state file found) was
> > > recorded for monitor at Nov 12 19:19:02 2019
> > 
> > I like the first one better.
> > 
> > However, it feels like implementation details exposed to the world,
> > doesn't it? How useful is this information to the end user? What can
> > the user do with it? There's nothing to fix, and this is not actually
> > an error in the currently running process.
> > 
> > I still 

Re: [ClusterLabs] Antw: Re: Pacemaker 2.0.3-rc3 now available

2019-11-18 Thread Ken Gaillot
On Fri, 2019-11-15 at 14:35 +0100, Jehan-Guillaume de Rorthais wrote:
> On Thu, 14 Nov 2019 11:09:57 -0600
> Ken Gaillot  wrote:
> 
> > On Thu, 2019-11-14 at 15:22 +0100, Ulrich Windl wrote:
> > > > > > Jehan-Guillaume de Rorthais wrote on 14.11.2019 at 15:17 in
> > > message <20191114151719.6cbf4e38@firost>:
> > > > On Wed, 13 Nov 2019 17:30:31 -0600
> > > > Ken Gaillot  wrote:
> > > > ...  
> > > > > A longstanding pain point in the logs has been improved.
> > > > > Whenever the scheduler processes resource history, it logs a
> > > > > warning for any failures it finds, regardless of whether they
> > > > > are new or old, which can confuse anyone reading the logs.
> > > > > Now, the log will contain the time of the failure, so it's
> > > > > obvious whether you're seeing the same event or not. The log
> > > > > will also contain the exit reason if one was provided by the
> > > > > resource agent, for easier troubleshooting.
> > > > 
> > > > I've been hurt by this in the past, and I was wondering: what is
> > > > the point of warning again and again in the logs about past
> > > > failures during scheduling? What does this information bring to
> > > > the administrator?
> > 
> > The controller will log an event just once, when it happens.
> > 
> > The scheduler, on the other hand, uses the entire recorded resource
> > history to determine the current resource state. Old failures (that
> > haven't been cleaned) must be taken into account.
> 
> OK, I wasn't aware of this. If you have a few minutes, I would be
> interested to know why the full history is needed rather than just
> finding the latest entry there. Or maybe there are some comments in
> the source code that already cover this question?

The full *recorded* history consists of the most recent operation that
affects the state (like start/stop/promote/demote), the most recent
failed operation, and the most recent results of any recurring
monitors.

For example there may be a failed monitor, but whether the resource is
considered failed or not would depend on whether there was a more
recent successful stop or start. Even if the failed monitor has been
superseded, it needs to stay in the history for display purposes until
the user has cleaned it up.
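
To illustrate the idea, here is a rough Python sketch of how a few recorded
entries can be interpreted; it is not Pacemaker's actual code or data model,
and the field names are invented for the example:

from dataclasses import dataclass

@dataclass
class RecordedOp:
    op: str         # e.g. "start", "stop", "monitor"
    rc: int         # agent exit code; 0 means success here
    timestamp: int  # when the result was recorded

def evaluate_history(history: list[RecordedOp]) -> dict:
    """Derive the current state from the few recorded entries, while
    keeping old failures visible until they are cleaned up."""
    last_state_change = max(
        (h for h in history if h.op in ("start", "stop", "promote", "demote")),
        key=lambda h: h.timestamp, default=None)
    last_failure = max(
        (h for h in history if h.rc != 0),
        key=lambda h: h.timestamp, default=None)
    # A failure determines the current state only if nothing newer
    # superseded it (e.g. a later successful start or stop).
    currently_failed = bool(
        last_failure and (last_state_change is None or
                          last_failure.timestamp > last_state_change.timestamp))
    return {
        "currently_failed": currently_failed,
        # A superseded failure is still reported (e.g. by crm_mon)
        # until the administrator cleans it up.
        "displayed_failure": last_failure,
    }

With a history of [failed monitor at t=100, successful start at t=200], this
reports the resource as not currently failed but still shows the old monitor
failure, which matches the behavior described above.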

> > Every run of the scheduler is completely independent, so it doesn't
> > know about any earlier runs or what they logged. Think of it like
> > Frosty the Snowman saying "Happy Birthday!" every time his hat is
> > put on.
> 
> I don't get this reference :)

I figured not everybody would, but it was too fun to pass up :)

The snowman comes to life every time his magic hat is put on, but to
him each time feels like he's being born for the first time, so he says
"Happy Birthday!"

https://www.youtube.com/watch?v=1PbWTEYoN8o


> > As far as each run is concerned, it is the first time it's seen the
> > history. This is what allows the DC role to move from node to node,
> > and the scheduler to be run as a simulation using a saved CIB file.
> > 
> > We could change the wording further if necessary. The previous
> > version would log something like:
> > 
> > warning: Processing failed monitor of my-rsc on node1: not running
> > 
> > and this latest change will log it like:
> > 
> > warning: Unexpected result (not running: No process state file found)
> > was recorded for monitor of my-rsc on node1 at Nov 12 19:19:02 2019
> 
> /result/state/ ?

It's the result of a resource agent action, so it could be for example
a timeout or a permissions issue.
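
For what it's worth, the "result" in that message is the agent's exit status
plus the optional exit-reason string. The standard OCF return codes, shown
here as a small illustrative Python table (not Pacemaker source), are:

# Standard OCF resource agent return codes (OCF resource agent API).
# "not running" in the example log line corresponds to OCF_NOT_RUNNING (7);
# a timeout, by contrast, is an operation status reported by the executor
# rather than an agent exit code.
OCF_RETURN_CODES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",        # e.g. the permissions issue mentioned above
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
    8: "OCF_RUNNING_MASTER",
    9: "OCF_FAILED_MASTER",
}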

> > I wanted to be explicit about the message being about processing
> > resource history that may or may not be the first time it's been
> > processed and logged, but everything I came up with seemed too long
> > for a log line. Another possibility might be something like:
> > 
> > warning: Using my-rsc history to determine its current state on node1:
> > Unexpected result (not running: No process state file found) was
> > recorded for monitor at Nov 12 19:19:02 2019
> 
> I like the first one better.
> 
> However, it feels like implementation details exposed to the world,
> doesn't it? How useful is this information to the end user? What can
> the user do with it? There's nothing to fix, and this is not actually
> an error in the currently running process.
> 
> I still fail to understand why the scheduler doesn't process the
> history silently, whatever it finds there, and then warn about
> something really important only if the final result is not the
> expected one...

From the scheduler's point of view, it's all relevant information that
goes into the decision making. Even an old failure can cause new
actions, for example if quorum was not held at the time but has now
been reached, or if there is a failure-timeout that just expired. So
any failure history is important to understanding whatever the
scheduler says needs to 
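
To make the failure-timeout part of that point concrete, here is a rough,
hypothetical sketch (invented names, not Pacemaker code) of why an old
failure has to be re-evaluated on every scheduler run:

import time
from typing import Optional

def failure_still_counts(failure_timestamp: float,
                         failure_timeout: Optional[float],
                         now: Optional[float] = None) -> bool:
    """Return True if a recorded failure should still influence placement."""
    if failure_timeout is None:   # no failure-timeout configured:
        return True               # the failure counts until it is cleaned up
    now = time.time() if now is None else now
    return (now - failure_timestamp) < failure_timeout

# A failure recorded 90 seconds ago with a 60-second failure-timeout no
# longer counts, so the current run may schedule actions (such as moving
# the resource back) that the previous run did not.
print(failure_still_counts(failure_timestamp=1000.0, failure_timeout=60.0,
                           now=1090.0))   # -> False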

Re: [ClusterLabs] Antw: Re: Pacemaker 2.0.3-rc3 now available

2019-11-14 Thread Ken Gaillot
On Thu, 2019-11-14 at 15:22 +0100, Ulrich Windl wrote:
> > > > Jehan-Guillaume de Rorthais wrote on 14.11.2019 at 15:17 in
> message <20191114151719.6cbf4e38@firost>:
> > On Wed, 13 Nov 2019 17:30:31 -0600
> > Ken Gaillot  wrote:
> > ...
> > > A longstanding pain point in the logs has been improved. Whenever
> > > the scheduler processes resource history, it logs a warning for any
> > > failures it finds, regardless of whether they are new or old, which
> > > can confuse anyone reading the logs. Now, the log will contain the
> > > time of the failure, so it's obvious whether you're seeing the same
> > > event or not. The log will also contain the exit reason if one was
> > > provided by the resource agent, for easier troubleshooting.
> > 
> > I've been hurt by this in the past, and I was wondering: what is the
> > point of warning again and again in the logs about past failures
> > during scheduling? What does this information bring to the
> > administrator?

The controller will log an event just once, when it happens.

The scheduler, on the other hand, uses the entire recorded resource
history to determine the current resource state. Old failures (that
haven't been cleaned) must be taken into account.

Every run of the scheduler is completely independent, so it doesn't
know about any earlier runs or what they logged. Think of it like
Frosty the Snowman saying "Happy Birthday!" every time his hat is put
on. As far as each run is concerned, it is the first time it's seen the
history. This is what allows the DC role to move from node to node, and
the scheduler to be run as a simulation using a saved CIB file.
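
A toy illustration of that statelessness (not Pacemaker's code; the function
and field names are invented): the scheduler's output is a pure function of
the CIB snapshot it is handed, which is also why it re-logs old failures on
every run.

def schedule(cib_snapshot: dict) -> list[str]:
    """Compute the actions implied by a single CIB snapshot, with no
    memory of any previous run."""
    actions = []
    for rsc, info in cib_snapshot.get("resources", {}).items():
        # Each run re-reads the same recorded history and re-logs the same
        # old failures, because it cannot know a previous run already did.
        for failure in info.get("failures", []):
            print(f"warning: {failure} was recorded for {rsc}")
        if info.get("target_role") == "Started" and not info.get("active"):
            actions.append(f"start {rsc}")
    return actions

# Feeding the same snapshot twice (live on a new DC, or from a saved CIB
# file) produces the same warnings and the same actions.
snapshot = {"resources": {"my-rsc": {"target_role": "Started", "active": False,
            "failures": ["monitor failure at Nov 12 19:19:02 2019"]}}}
assert schedule(snapshot) == schedule(snapshot)

(In practice, crm_simulate is the tool that runs the scheduler against a
saved CIB file, which is why the decisions must be reproducible from the CIB
alone.)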

We could change the wording further if necessary. The previous version
would log something like:

warning: Processing failed monitor of my-rsc on node1: not running

and this latest change will log it like:

warning: Unexpected result (not running: No process state file found)
was recorded for monitor of my-rsc on node1 at Nov 12 19:19:02 2019

I wanted to be explicit about the message being about processing
resource history that may or may not be the first time it's been
processed and logged, but everything I came up with seemed too long for
a log line. Another possibility might be something like:

warning: Using my-rsc history to determine its current state on node1:
Unexpected result (not running: No process state file found) was
recorded for monitor at Nov 12 19:19:02 2019


> > In my humble opinion, any entry in the log file should be about
> > something happening at the time the message appears. And it should
> > appear only once, not be repeated again and again for no (apparent)
> > reason. At least, most of the time. Am I missing something?
> > 
> > I'm sure these historical failure warnings raised by the scheduler
> > have already been raised in the past by either the lrm or crm
> > process in most cases, haven't they?
> > 
> > Unless there's something else I'm not aware of, the scheduler should
> > warn about the current unexpected status of a resource, not about all
> > the past ones.
> > 
> > Could you shed some light on this mystery from the user's point of
> > view?
> 
> Hi!
> 
> I can agree that the current Pacemaker on SLES12 logs so much while
> virtually doing nothing that it's very hard to find out when Pacemaker
> actually does something. And when it does do something, it seems to
> announce the same thing at least three times before actually doing
> anything.
> 
> Regards,
> Ulrich

Part of the difficulty arises from Pacemaker's design of using multiple
independent daemons to handle different aspects of cluster management.
A single failure event might get logged by the executor (lrmd),
controller (crmd), and scheduler (pengine), but in different contexts
(potentially on different nodes).

Improving the logs is a major focus of new releases, and we're always
looking for specific suggestions as to which messages need the most
attention. There's been a lot of progress between 1.1.14 and 2.0.3, but
it takes a while for that to land in distributions.
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/