On Mon, 2019-11-18 at 18:13 +0100, Jehan-Guillaume de Rorthais wrote:
> On Mon, 18 Nov 2019 10:45:25 -0600
> Ken Gaillot <kgail...@redhat.com> wrote:
> 
> > On Fri, 2019-11-15 at 14:35 +0100, Jehan-Guillaume de Rorthais
> > wrote:
> > > On Thu, 14 Nov 2019 11:09:57 -0600
> > > Ken Gaillot <kgail...@redhat.com> wrote:
> > > 
> > > > On Thu, 2019-11-14 at 15:22 +0100, Ulrich Windl wrote:
> > > > > > > > Jehan-Guillaume de Rorthais <j...@dalibo.com> wrote on
> > > > > > > > 14.11.2019 at 15:17 in message
> > > > > > > > <20191114151719.6cbf4e38@firost>:
> > > > > > On Wed, 13 Nov 2019 17:30:31 -0600
> > > > > > Ken Gaillot <kgail...@redhat.com> wrote:
> > > > > > ...
> > > > > > > A longstanding pain point in the logs has been improved.
> > > > > > > Whenever the scheduler processes resource history, it
> > > > > > > logs a warning for any failures it finds, regardless of
> > > > > > > whether they are new or old, which can confuse anyone
> > > > > > > reading the logs. Now, the log will contain the time of
> > > > > > > the failure, so it's obvious whether you're seeing the
> > > > > > > same event or not. The log will also contain the exit
> > > > > > > reason if one was provided by the resource agent, for
> > > > > > > easier troubleshooting.
> > > > > > 
> > > > > > I've been hurt by this in the past and I was wondering
> > > > > > what the point is of warning again and again in the logs
> > > > > > about past failures during scheduling. What does this
> > > > > > information bring to the administrator?
> > > > 
> > > > The controller will log an event just once, when it happens.
> > > > 
> > > > The scheduler, on the other hand, uses the entire recorded
> > > > resource history to determine the current resource state. Old
> > > > failures (that haven't been cleaned) must be taken into
> > > > account.
> > > 
> > > OK, I wasn't aware of this. If you have a few minutes, I would
> > > be interested to know why the full history is needed rather than
> > > just the latest entry. Or maybe there are comments in the source
> > > code that already cover this question?
> > 
> > The full *recorded* history consists of the most recent operation
> > that affects the state (like start/stop/promote/demote), the most
> > recent failed operation, and the most recent results of any
> > recurring monitors.
> > 
> > For example, there may be a failed monitor, but whether the
> > resource is considered failed or not would depend on whether there
> > was a more recent successful stop or start. Even if the failed
> > monitor has been superseded, it needs to stay in the history for
> > display purposes until the user has cleaned it up.
> 
> OK, understood.
> 
> Maybe that's why "FAILED" appears briefly in crm_mon during a
> resource move on a clean resource, but with past failures? Maybe I
> should dig into this weird behavior and write up a bug report if I
> confirm it?
I've occasionally seen such blips myself but never had a chance to
track them down. If you can find a reliable reproducer, a bug report
would be worthwhile. I'm not sure why it happens, but hazarding a
guess, it may be related to recording pending operations -- a pending
operation doesn't affect the state, so if recording the pending
operation clears the last clean start/stop, and there's a past
failure, that could make sense.

> > > > Every run of the scheduler is completely independent, so it
> > > > doesn't know about any earlier runs or what they logged. Think
> > > > of it like Frosty the Snowman saying "Happy Birthday!" every
> > > > time his hat is put on.
> > > 
> > > I don't have this ref :)
> > 
> > I figured not everybody would, but it was too fun to pass up :)
> > 
> > The snowman comes to life every time his magic hat is put on, but
> > to him each time feels like he's being born for the first time, so
> > he says "Happy Birthday!"
> > 
> > https://www.youtube.com/watch?v=1PbWTEYoN8o
> 
> heh :)
> 
> > > > As far as each run is concerned, it is the first time it's
> > > > seen the history. This is what allows the DC role to move from
> > > > node to node, and the scheduler to be run as a simulation
> > > > using a saved CIB file.
> > > > 
> > > > We could change the wording further if necessary. The previous
> > > > version would log something like:
> > > > 
> > > > warning: Processing failed monitor of my-rsc on node1: not
> > > > running
> > > > 
> > > > and this latest change will log it like:
> > > > 
> > > > warning: Unexpected result (not running: No process state file
> > > > found) was recorded for monitor of my-rsc on node1 at Nov 12
> > > > 19:19:02 2019
> > > 
> > > /result/state/ ?
> > 
> > It's the result of a resource agent action, so it could be, for
> > example, a timeout or a permissions issue.
> 
> ok
> 
> > > > I wanted to be explicit about the message being about
> > > > processing resource history that may or may not be the first
> > > > time it's been processed and logged, but everything I came up
> > > > with seemed too long for a log line. Another possibility might
> > > > be something like:
> > > > 
> > > > warning: Using my-rsc history to determine its current state
> > > > on node1: Unexpected result (not running: No process state
> > > > file found) was recorded for monitor at Nov 12 19:19:02 2019
> > > 
> > > I like the first one better.
> > > 
> > > However, it feels like implementation details exposed to the
> > > world, doesn't it? How useful is this information to the end
> > > user? What can the user do with it? There's nothing to fix, and
> > > this is not actually an error in the currently running process.
> > > 
> > > I still fail to understand why the scheduler doesn't process the
> > > history silently, whatever it finds there, then warn about
> > > something really important only if the final result is not what
> > > was expected...
> > 
> > From the scheduler's point of view, it's all relevant information
> > that goes into the decision making. Even an old failure can cause
> > new actions, for example if quorum was not held at the time but
> > has now been reached, or if there is a failure-timeout that just
> > expired. So any failure history is important to understanding
> > whatever the scheduler says needs to be done.
> > Also, the scheduler is run on the DC, which is not necessarily
> > the node that executed the action. So it's useful for
> > troubleshooting to present a picture of the whole cluster on the
> > DC, rather than just the situation on the local node.
> 
> OK, kind of got it. The scheduler needs to summarize the chain of
> events to determine the state of a resource based on the last event.
> 
> > I could see an argument for lowering it from warning to notice,
> > but it's a balance between what's most useful during normal
> > operation and what's most useful during troubleshooting.
> 
> So in my humble opinion, the messages should definitely be at notice
> level. Maybe they should even go to debug level. I have never had to
> troubleshoot a bad decision from the scheduler because of a bad
> state summary. Moreover, if needed, the admin can still study the
> history from the CIB backed up on disk, can't they?
> 
> The alternative would be to spell out the event chain in detail only
> if the result of the summary differs from what the scheduler was
> expecting?
-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
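
[Editor's note: to make the "recorded history" discussed in the thread
concrete, below is a minimal, illustrative Python sketch -- not part of
Pacemaker -- that reads a saved CIB XML file and groups each resource's
lrm_rsc_op entries into the kinds of records Ken lists: the last
state-changing operation, the last failure, and recurring monitor
results. The file name is a placeholder, the grouping relies on the
conventional operation id suffixes, and the sketch only walks the CIB
layout; it does not reproduce the scheduler's actual logic.]

#!/usr/bin/env python3
"""Rough, illustrative reader for the resource history stored in a CIB.

Given a saved CIB XML file, print each resource's recorded operation
history per node: the last state-changing operation, the last failure,
and the most recent result of each recurring monitor. This only walks
the CIB layout; it does not reproduce the scheduler's logic.
"""
import sys
import time
import xml.etree.ElementTree as ET


def describe(op):
    """Summarize one <lrm_rsc_op> element in a single line."""
    when = op.get("last-rc-change", "")
    if when.isdigit():
        # Older CIBs store this as epoch seconds; format it if so.
        when = time.strftime("%b %d %H:%M:%S %Y", time.localtime(int(when)))
    reason = op.get("exit-reason")
    return "%s rc=%s at %s%s" % (
        op.get("operation"),
        op.get("rc-code"),
        when or "unknown time",
        " (%s)" % reason if reason else "",
    )


def main(path):
    cib = ET.parse(path).getroot()
    for node in cib.iter("node_state"):
        for rsc in node.iter("lrm_resource"):
            print("%s on %s:" % (rsc.get("id"), node.get("uname")))
            for op in rsc.iter("lrm_rsc_op"):
                op_id = op.get("id", "")
                if op_id.endswith("_last_failure_0"):
                    kind = "last failure"
                elif op_id.endswith("_last_0"):
                    kind = "last state-changing op"
                elif "_monitor_" in op_id:
                    kind = "recurring monitor"
                else:
                    kind = "other recorded op"
                print("  %-24s %s" % (kind + ":", describe(op)))


if __name__ == "__main__":
    # "cib.xml" is just a placeholder file name.
    main(sys.argv[1] if len(sys.argv) > 1 else "cib.xml")

[Against a live cluster, the CIB could first be dumped with something
like "cibadmin --query > cib.xml"; the scheduler inputs kept under
/var/lib/pacemaker/pengine/ are typically bzip2-compressed and would
need decompressing before being fed to a reader like this.]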