On Wed, 2017-11-01 at 10:04 +0100, Ferenc Wágner wrote:
> Ken Gaillot <kgail...@redhat.com> writes:
>
> > When an operation completes, a history entry (<lrm_rsc_op>) is added to
> > the pe-input file. If the agent supports reload, the entry will include
> > op-force-restart and op-restart-digest fields. Now I see those are
> > present in the vm-alder_last_0 entry, so agent support isn't the issue.
>
> Thanks for the explanation.
>
> > However, the operation is recorded as a *failed* probe (i.e. the
> > resource was running where it wasn't expected). This gets recorded as a
> > separate vm-alder_last_failure_0 entry, which does not get the special
> > fields. It looks to me like this failure entry is forcing the restart.
> > That would be a good idea if it's an actual failure; if we find a
> > resource unexpectedly running, we don't know how it was started, so a
> > full restart makes sense.
> >
> > However, I'm guessing it may not have been a real error, but a resource
> > cleanup. A cleanup clears the history so the resource is re-probed, and
> > I suspect that re-probe is what got recorded here as a failure. Does
> > that match what actually happened?
>
> Well, I can't really remember, it happened two months ago... I'm pretty
> sure the resource wasn't running unexpectedly, I'd surely recall such a
> grave failure. Interestingly, though, my shell history contains a
> cleanup operation shortly after the parameter change. Also, if you look
> at the logs in my thread starting mail, you'll find
>
> warning: Processing failed op monitor for vm-alder on vhbl05: not running (7)
>
> which does not seem to match up with the failure in the lrm_rsc_op entry
> in pe-input. It's sort of "normal" that such a resource disappears and
> gets restarted by the cluster. If that report survived the unexpected
> restart, I might have wanted to routinely clean it up afterwards.
>
> (I'm leaving for a short holiday now, expect longer delays.)
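
As a rough illustration of the check described in the quoted explanation above, here is a minimal Python sketch that scans history entries for the two reload-related fields. The XML fragment is a hypothetical, heavily trimmed stand-in for what a pe-input file contains (the attribute values are made-up placeholders), and the helper name reload_fields is invented for this example; it is not part of Pacemaker.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed history fragment. Real pe-input entries carry many
# more attributes, and the attribute values below are placeholders.
SAMPLE = """
<lrm_resource id="vm-alder">
  <lrm_rsc_op id="vm-alder_last_0"
              op-force-restart="placeholder"
              op-restart-digest="placeholder"/>
  <lrm_rsc_op id="vm-alder_last_failure_0"/>
</lrm_resource>
"""

# The two fields the thread says are recorded when the agent supports reload.
RELOAD_ATTRS = ("op-force-restart", "op-restart-digest")


def reload_fields(xml_text):
    """Map each lrm_rsc_op id to True if it carries both reload fields."""
    root = ET.fromstring(xml_text.strip())
    return {
        op.get("id"): all(attr in op.attrib for attr in RELOAD_ATTRS)
        for op in root.iter("lrm_rsc_op")
    }


if __name__ == "__main__":
    for op_id, has_fields in reload_fields(SAMPLE).items():
        status = "has reload fields" if has_fields else "lacks reload fields"
        print(op_id, status)
```

This only reports which entries carry the fields; it does not reproduce how the policy engine actually decides between reload and restart.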

Looking at it again with crm_simulate with 1.1.18 + patches, it does
appear that the combination of a cleanup and a parameter change in the
same transition turned the reload into a restart. The cleanup results in
a failed probe being recorded, and that history entry does not have the
magic attributes indicating reloadability.

I suspect if you changed the parameter, waited for the reload to happen,
then did the cleanup, it would have been fine.

I'll have to investigate a fix.
-- 
Ken Gaillot <kgail...@redhat.com>