Re: [ClusterLabs] Restarting a failed resource on same node
On Wed, 2017-10-04 at 10:59 -0700, Paolo Zarpellon wrote:
> Hi Ken,
> Indeed the migration-threshold was the problem :-(
>
> BTW, for a master-slave resource, is it possible to have different
> migration-thresholds?
> I.e. I'd like the slave to be restarted where it failed, but the
> master to be migrated to the other node right away (by promoting the
> slave there).

No, that's not possible currently. There's a planned overhaul of the
failure handling options that would open up the possibility, though.
No time frame on when it might get done.

> I've tried configuring something like this:
>
> [root@test-236 ~]# pcs resource show test-ha
>  Master: test-ha
>   Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=1
>               clone-node-max=1 requires=nothing migration-threshold=1
>   Resource: test (class=ocf provider=heartbeat type=test)
>    Meta Attrs: migration-threshold=INFINITY
>    Operations: start interval=0s on-fail=restart timeout=120s (test-start-interval-0s)
>                monitor interval=10s on-fail=restart timeout=60s (test-monitor-interval-10s)
>                monitor interval=11s on-fail=restart role=Master timeout=60s (test-monitor-interval-11s)
>                promote interval=0s on-fail=restart timeout=60s (test-promote-interval-0s)
>                demote interval=0s on-fail=stop timeout=60s (test-demote-interval-0s)
>                stop interval=0s on-fail=block timeout=60s (test-stop-interval-0s)
>                notify interval=0s timeout=60s (test-notify-interval-0s)
> [root@test-236 ~]#
>
> but it does not seem to help, as both master and slave are always
> restarted on the same node due to the test resource's
> migration-threshold being set to INFINITY.
>
> Thank you in advance.
> Regards,
> Paolo
>
> On Tue, Oct 3, 2017 at 7:12 AM, Ken Gaillot wrote:
> > On Mon, 2017-10-02 at 12:32 -0700, Paolo Zarpellon wrote:
> > > Hi,
> > > on a basic 2-node cluster, I have a master-slave resource where
> > > the master runs on one node and the slave on the other. If I kill
> > > the slave resource, the resource status goes to "stopped".
> > > Similarly, if I kill the master resource, the slave is promoted
> > > to master but the failed one does not restart as slave.
> > > Is there a way to restart failing resources on the same node they
> > > were running on?
> > > Thank you in advance.
> > > Regards,
> > > Paolo
> >
> > Restarting on the same node is the default behavior -- something
> > must be blocking it. For example, check your migration-threshold
> > (if restarting fails this many times, it has nowhere to go and
> > will stop).

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Is "Process pause detected" triggered too easily?
On Wed, 4 Oct 2017, Jan Friesse wrote:
> > Could you clarify the formula for me? I don't see how "- 2" and
> > "650" map to this configuration.
>
> Since Corosync 2.3.4, when a nodelist is used, totem.token is used
> only as a basis for calculating the real token timeout. You can check
> the corosync.conf man page for more information and the formula.

A-ha! I was looking for that in the corosync.conf man page shipped with
Ubuntu 14, which of course ships corosync 2.3.3. Silly me!

So with the right man page, that's indeed spelled out under
"token_coefficient". Thanks!

> > And I suppose that on our bigger system (20+5 servers) we need to
> > greatly increase the consensus timeout.
>
> Consensus timeout reflects the token value, so if it is not defined
> in the config file it's computed as token * 1.2. This is not
> reflected in the man page and needs to be fixed.

Actually, the man page I see for 2.4.2 does mention this :) so I guess
we should simply comment out our setting for "consensus".

Cheers,
JM

--
saff...@gmail.com
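For anyone else puzzling over the same numbers, the arithmetic described above can be checked directly in the shell. This is a sketch based on the thread's configuration (token=3000, 5 nodes) and assumes the default totem.token_coefficient of 650 ms from corosync >= 2.3.4 with a nodelist:

```shell
token=3000        # configured totem.token (ms)
nodes=5           # entries in the nodelist
coefficient=650   # default totem.token_coefficient (ms)

# Real (runtime) token timeout, per the token_coefficient formula;
# compare with the runtime.config.totem.token key on a live cluster.
runtime_token=$(( token + (nodes - 2) * coefficient ))
echo "runtime token: ${runtime_token} ms"     # 4950 for this config

# If consensus is left unset it defaults to 1.2 * token; applying the
# same factor to the runtime token gives roughly the ~6000 ms suggested
# earlier in the thread.
consensus=$(( runtime_token * 12 / 10 ))
echo "consensus: ${consensus} ms"             # 5940
```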
Re: [ClusterLabs] Restarting a failed resource on same node
Hi Ken,
Indeed the migration-threshold was the problem :-(

BTW, for a master-slave resource, is it possible to have different
migration-thresholds?
I.e. I'd like the slave to be restarted where it failed, but the master
to be migrated to the other node right away (by promoting the slave
there).

I've tried configuring something like this:

[root@test-236 ~]# pcs resource show test-ha
 Master: test-ha
  Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=1
              clone-node-max=1 requires=nothing migration-threshold=1
  Resource: test (class=ocf provider=heartbeat type=test)
   Meta Attrs: migration-threshold=INFINITY
   Operations: start interval=0s on-fail=restart timeout=120s (test-start-interval-0s)
               monitor interval=10s on-fail=restart timeout=60s (test-monitor-interval-10s)
               monitor interval=11s on-fail=restart role=Master timeout=60s (test-monitor-interval-11s)
               promote interval=0s on-fail=restart timeout=60s (test-promote-interval-0s)
               demote interval=0s on-fail=stop timeout=60s (test-demote-interval-0s)
               stop interval=0s on-fail=block timeout=60s (test-stop-interval-0s)
               notify interval=0s timeout=60s (test-notify-interval-0s)
[root@test-236 ~]#

but it does not seem to help, as both master and slave are always
restarted on the same node due to the test resource's
migration-threshold being set to INFINITY.

Thank you in advance.
Regards,
Paolo

On Tue, Oct 3, 2017 at 7:12 AM, Ken Gaillot wrote:
> On Mon, 2017-10-02 at 12:32 -0700, Paolo Zarpellon wrote:
> > Hi,
> > on a basic 2-node cluster, I have a master-slave resource where
> > the master runs on one node and the slave on the other. If I kill
> > the slave resource, the resource status goes to "stopped".
> > Similarly, if I kill the master resource, the slave is promoted to
> > master but the failed one does not restart as slave.
> > Is there a way to restart failing resources on the same node they
> > were running on?
> > Thank you in advance.
> > Regards,
> > Paolo
>
> Restarting on the same node is the default behavior -- something must
> be blocking it. For example, check your migration-threshold (if
> restarting fails this many times, it has nowhere to go and will
> stop).
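For anyone hitting the same symptom, a quick way to confirm that migration-threshold (rather than something else) is what blocks the in-place restart is to look at the fail count. The commands below are a sketch, not from the thread: "test" is the resource name used above, the syntax is from pcs 0.9.x-era clusters, and they only make sense on a live cluster node.

```
# Show how many times the resource has failed per node; once this
# reaches migration-threshold, the resource can no longer restart there.
pcs resource failcount show test

# Clear the fail count and failed actions so the resource is allowed
# to run on the node again.
pcs resource cleanup test

# Raise the threshold so repeated failures keep restarting in place.
pcs resource meta test migration-threshold=INFINITY
```

Check `pcs resource --help` on your own version before relying on the exact subcommand names.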
Re: [ClusterLabs] Is "Process pause detected" triggered too easily?
Jean,

> Hi Jan,
>
> On Tue, 3 Oct 2017, Jan Friesse wrote:
> > > I hope this makes sense! :)
> >
> > I would still have some questions :) but that is really not related
> > to the problem you have.
>
> Questions are welcome! I am new to this stack, so there is certainly
> room for learning and for improvement.
>
> > My personal favorite is consensus timeout. Because you've set (and
> > I must say according to the doc correctly) consensus timeout to
> > 3600 (= 1.2 * token). Problem is that the resulting token timeout
> > is not 3000, but with 5 nodes it is actually 3000 (base token) +
> > (no_nodes - 2) * 650 ms = 4950 (as you can check by observing the
> > runtime.config.totem.token key). So it may make sense to set
> > consensus timeout to ~6000.
>
> Could you clarify the formula for me? I don't see how "- 2" and "650"
> map to this configuration.

Since Corosync 2.3.4, when a nodelist is used, totem.token is used only
as a basis for calculating the real token timeout. You can check the
corosync.conf man page for more information and the formula.

> And I suppose that on our bigger system (20+5 servers) we need to
> greatly increase the consensus timeout.

Consensus timeout reflects the token value, so if it is not defined in
the config file it's computed as token * 1.2. This is not reflected in
the man page and needs to be fixed.

> Overall, tuning the timeouts seems to be Black Magic. ;) I liked the
> idea suggested in an old thread that there would be a spreadsheet (or
> even just plain formulas) exposing the relation between the various
> knobs.

The idea is to compute it in the code directly. This is implemented for
some parts, but sadly not for some others. The reason is mostly that
it's quite hard to get these timeouts right, so that failure detection
is fast enough but there are as few false membership changes as
possible.

> One thing I wonder is: would it make sense to annotate the state
> machine diagram in the Totem paper (page 15 of
> http://www.cs.jhu.edu/~yairamir/tocs.ps.gz) with those tunables?
> Assuming the paper still reflects the behavior of the current code.

Yes, the code reflects the paper (to some extent; some things are
slightly different), and I really like the idea of annotating it, or
actually having a wiki page with this diagram and some documentation of
totemsrp's insides.

> > This doesn't change the fact that the "bug" is reproducible even
> > with a "correct" consensus, so I will continue working on this
> > issue.
>
> Great! Thanks for taking the time to investigate.

Yep, np.

Honza

> Cheers,
> JM
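Putting the two points above together, an explicit totem stanza for the 5-node cluster under discussion might look like the sketch below. This is an illustration only, not configuration taken from the thread; the consensus value follows the ~6000 ms suggestion, and simply leaving consensus unset (so it defaults to 1.2 * token) is the other option mentioned:

```
totem {
    version: 2

    # Configured base token; with a 5-entry nodelist and the default
    # token_coefficient of 650 ms, the runtime token becomes
    # 3000 + (5 - 2) * 650 = 4950 ms (see runtime.config.totem.token).
    token: 3000

    # If set explicitly, consensus should exceed 1.2 * the runtime
    # token (~5940 ms here), hence ~6000. Comment it out to fall back
    # to the computed default.
    consensus: 6000
}
```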