Re: [ClusterLabs] Resource not starting correctly IV
Thanks. In the end, I found out that my target application has a setting
whereby the application becomes instantly detectable to the monitoring
side of my script. After doing this, the associated resource is created
flawlessly every time.

On Tue, Apr 16, 2019 at 1:46 PM Jan Pokorný wrote:
> [...]

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Resource not starting correctly IV
[letter-casing wise: it's either "Pacemaker" or down-to-the-terminal
"pacemaker"]

On 16/04/19 10:21 -0600, JCA wrote:
> 2. It would seem that what Pacemaker is doing is the following:
>    a. Check out whether the app is running.
>    b. If it is not, launch it.
>    c. Check out again.
>    d. If running, exit.
>    e. Otherwise, stop it.
>    f. Launch it.
>    g. Go to a.
>
> [...]
>
> 4. If the above is correct, and if I am getting the picture correctly,
> it would seem that the problem is that my monitoring function does not
> detect immediately that my app is up and running. That's clearly my
> problem. However, is there any way to get Pacemaker to introduce a
> delay between steps b and c in section 2 above?

Ah, it should have occurred to me!

The typical solution, I think, is to have a sleep loop following the
daemon launch within the "start" action that runs (a subset of) what
"monitor" normally does, so as to synchronize on the "service ready"
moment. The default timeout for "start" in the agent's metadata should
then reflect the common time needed to reach the point where "monitor"
is happy, plus some reserve.

Some agents do more elaborate things, like precisely limiting such
waiting with respect to the time they were actually given by the
resource manager/pacemaker (if I don't misremember, that value is
provided through environment variables for a sort of introspection).
Resource agent experts could advise here.

(Truth be told, "daemon readiness" used to be a rather marginalized
problem, putting barriers in the way of practical [= race-free]
dependency ordering etc.; luckily, clever people realized that the most
precise tracking can only be in the hands of the actual daemon
implementors, if an event-driven paradigm is to be applied. For
instance, if you can influence my_app, and it's a standard forking
daemon, it would be best if the parent exited only when the daemon is
truly ready to provide service -- this usually requires some, typically
signal-based, synchronization amongst the daemon processes. With
systemd, the situation is much simpler since no forking is necessary,
just a call to sd_notify(3) -- in that case, though, your agent would
need to mimic the server side of the sd_notify protocol, since nothing
would do it for you.)

> 5. Following up on 4: if my script sleeps for a few seconds immediately
> after launching my app (it's a daemon) in myapp_start, then everything
> works fine. Indeed, the call sequence in node one now becomes:
>
> monitor:
>     Status: NOT_RUNNING
>     Exit:   NOT_RUNNING
>
> start:
>     Validate: SUCCESS
>     Status:   NOT_RUNNING
>     Start:    SUCCESS
>     Exit:     SUCCESS
>
> monitor:
>     Status: SUCCESS
>     Exit:   SUCCESS

That's easier, but less effective and reliable (more opportunistic than
fact-based), than polling the "monitor" outcome privately within "start"
as sketched above.

--
Jan (Poki)
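[Editor's sketch] The approach Jan describes — launch the daemon, then
privately poll the agent's own monitor logic inside "start" — can be
sketched in POSIX sh. All names here (myapp_monitor, MYAPP_DAEMON,
MYAPP_READY_FILE) and the 60-second local budget are hypothetical; only
the pattern comes from the message above.

```shell
#!/bin/sh
# "start" that synchronizes on the "service ready" moment by polling the
# agent's own monitor logic, instead of a blind sleep.

: "${OCF_SUCCESS:=0}" "${OCF_NOT_RUNNING:=7}" "${OCF_ERR_GENERIC:=1}"

myapp_monitor() {
    # Placeholder readiness probe; a real agent would check a PID file,
    # a socket, a status URL, etc.
    [ -e "${MYAPP_READY_FILE:-/var/run/myapp.ready}" ] \
        && return "$OCF_SUCCESS"
    return "$OCF_NOT_RUNNING"
}

myapp_start() {
    myapp_monitor && return "$OCF_SUCCESS"    # already running

    # Launch the daemon (unquoted on purpose so a multi-word command
    # splits into words; the default path is hypothetical).
    ${MYAPP_DAEMON:-/usr/sbin/myapp-daemon} &

    # Poll until monitor succeeds.  Pacemaker kills the action when its
    # configured "start" timeout elapses, so that metadata timeout should
    # cover this loop plus some reserve.
    tries=0
    while [ "$tries" -lt 60 ]; do
        myapp_monitor && return "$OCF_SUCCESS"
        tries=$((tries + 1))
        sleep 1
    done
    return "$OCF_ERR_GENERIC"
}
```

With this shape, the first "monitor" that Pacemaker runs after "start"
already sees a ready service, avoiding the stop/start churn described in
the original report.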
[ClusterLabs] Coming in 2.0.2: check whether a date-based rule is expired
Hi all,

I wanted to point out an experimental feature that will be part of the
next release. We are adding a "crm_rule" command that can check whether
a particular date-based rule is currently in effect.

The motivation is a perennial user complaint: expired constraints remain
in the configuration, which can be confusing. We don't automatically
remove such constraints, for several reasons: we try to avoid modifying
any user-specified configuration; expired constraints are useful context
when investigating an issue after it happened; and crm_simulate can be
run for any configuration for an arbitrary past date to see what would
have happened at that time.

The new command gives users (and high-level tools) a way to determine
whether a rule is in effect, so they can remove it themselves, whether
manually or in an automated way such as a cron job. You can use it like:

    crm_rule -r <rule-id> [-d <date>] [-X <xml-file>]

With just -r, it will tell you whether the specified rule from the
configuration is currently in effect. If you give -d, it will check as
of that date and time (ISO 8601 format). If you give it -X, it will look
for the rule in the given XML rather than the CIB (you can also use
"-X -" to read the XML from standard input).

Example output:

    % crm_rule -r my-current-rule
    Rule my-current-rule is still in effect
    % crm_rule -r some-long-ago-rule
    Rule some-long-ago-rule is expired
    % crm_rule -r some-future-rule
    Rule some-future-rule has not yet taken effect
    % crm_rule -r some-recurring-rule
    Could not determine whether rule some-recurring-rule is expired

Scripts can use the exit status to distinguish the various cases.

The command will be considered experimental for the 2.0.2 release; its
interface and behavior may change in future versions. The current
implementation has a limitation: the rule may contain only a single
date_expression, and the expression's operation must not be date_spec.

Other capabilities may eventually be added to crm_rule, for example the
ability to evaluate the current value of any cluster or resource
property.

--
Ken Gaillot
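[Editor's sketch] A cleanup script around crm_rule can branch on either
the exit status or the printed message. Since the announcement above
lists the message strings but not the numeric exit codes, this minimal
POSIX-sh helper (the function name is hypothetical) keys off the text:

```shell
#!/bin/sh
# classify_rule_output: map crm_rule's human-readable output (the message
# strings shown in the announcement's examples) to a one-word status that
# a cleanup script can branch on.
classify_rule_output() {
    case "$1" in
        *"Could not determine"*)       echo unknown ;;
        *"is still in effect"*)        echo in-effect ;;
        *"is expired"*)                echo expired ;;
        *"has not yet taken effect"*)  echo not-yet-in-effect ;;
        *)                             echo unknown ;;
    esac
}

# Usage sketch (assumes crm_rule is on PATH and the rule id exists):
#   if [ "$(classify_rule_output "$(crm_rule -r some-long-ago-rule)")" = expired ]
#   then
#       echo "safe to remove the constraint containing some-long-ago-rule"
#   fi
```

Note the "Could not determine" case must be matched first, because that
message also contains the substring "is expired".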
Re: [ClusterLabs] SBD as watchdog daemon
Well, I checked this PR, https://github.com/ClusterLabs/sbd/pull/27,
from the author's repository,
https://github.com/jjd27/sbd/tree/cluster-quorum

The problem still exists. When corosync is frozen on one node, both
nodes are rebooted. Don't apply this PR.

> On 16 Apr 2019, at 19:13, Klaus Wenninger wrote:
[ClusterLabs] Resource not starting correctly IV
Thanks to everybody who has contributed to this. Let me summarize
things, even if only for my own benefit - I learn more quickly when I
try to explain to others what I am trying to learn.

I instrumented my script in order to find out exactly how many times it
is invoked when creating my resource, and exactly which functions in the
script are invoked. Just as a reminder, the logs I am about to describe
are created directly as a result of executing the following command:

    # pcs resource create ClusterMyApp ocf:myapp:myapp-script op monitor interval=30s

myapp-script is always the same, and the starting conditions for the app
that it is meant to launch are always exactly the same. In all cases,
before issuing the command above I made sure to delete the resource, if
already there.

What follows is a log of the way in which myapp-script was invoked as a
result of executing the command above. It consists of a series of
blocks, like the following:

    monitor:
        Status: NOT_RUNNING
        Exit:   NOT_RUNNING

This block is an invocation of myapp-script with argument 'monitor'.
The 'Status' line means myapp_monitor was invoked, and it returned
OCF_NOT_RUNNING. The 'Exit' line means that myapp-script exited with
OCF_NOT_RUNNING. In a block with more than two lines, the line
immediately preceding the 'Exit' line represents the function in the
script that was invoked as a consequence of the argument passed down to
the script. The other lines are nested function invocations made as a
consequence of that.

A typical log obtained in node one would be the following:

    monitor:
        Status: NOT_RUNNING
        Exit:   NOT_RUNNING

    start:
        Validate: SUCCESS
        Status:   NOT_RUNNING
        Start:    SUCCESS
        Exit:     SUCCESS

    monitor:
        Status: NOT_RUNNING
        Exit:   NOT_RUNNING

    stop:
        Validate: SUCCESS
        Status:   SUCCESS
        Stop:     SUCCESS
        Exit:     SUCCESS

    start:
        Validate: SUCCESS
        Status:   NOT_RUNNING
        Start:    SUCCESS
        Exit:     SUCCESS

    monitor:
        Status: SUCCESS
        Exit:   SUCCESS

A few observations:

1. The monitor/start/stop sequence above can be repeated many times, and
   the number of times it is repeated varies from one run to the next.
   Occasionally, just three calls are made: monitor, start and monitor,
   exiting with SUCCESS.

2. It would seem that what Pacemaker is doing is the following:
   a. Check out whether the app is running.
   b. If it is not, launch it.
   c. Check out again.
   d. If running, exit.
   e. Otherwise, stop it.
   f. Launch it.
   g. Go to a.

3. In node two, the log obtained as a consequence of creating the
   resource always seems to be

       monitor:
           Status: NOT_RUNNING
           Exit:   NOT_RUNNING

   which makes sense to me.

4. If the above is correct, and if I am getting the picture correctly,
   it would seem that the problem is that my monitoring function does
   not detect immediately that my app is up and running. That's clearly
   my problem. However, is there any way to get Pacemaker to introduce a
   delay between steps b and c in section 2 above?

5. Following up on 4: if my script sleeps for a few seconds immediately
   after launching my app (it's a daemon) in myapp_start, then
   everything works fine. Indeed, the call sequence in node one now
   becomes:

       monitor:
           Status: NOT_RUNNING
           Exit:   NOT_RUNNING

       start:
           Validate: SUCCESS
           Status:   NOT_RUNNING
           Start:    SUCCESS
           Exit:     SUCCESS

       monitor:
           Status: SUCCESS
           Exit:   SUCCESS
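[Editor's sketch] For readers unfamiliar with the layout of an agent
like the myapp-script described above: the actions Pacemaker invokes
(monitor, start, stop, ...) arrive as the script's first argument and
are routed through a dispatch. A minimal skeleton, with trivial
placeholder bodies; the numeric exit codes follow the OCF resource agent
convention:

```shell
#!/bin/sh
# Skeleton dispatch of an OCF resource agent; function bodies are
# placeholders only.

: "${OCF_SUCCESS:=0}" "${OCF_ERR_UNIMPLEMENTED:=3}" "${OCF_NOT_RUNNING:=7}"

myapp_validate() { return "$OCF_SUCCESS"; }      # placeholder
myapp_status()   { return "$OCF_NOT_RUNNING"; }  # placeholder probe
myapp_monitor()  { myapp_status; }               # monitor delegates to status
myapp_start()    { myapp_validate; }             # placeholder
myapp_stop()     { myapp_validate; }             # placeholder

dispatch() {
    case "$1" in
        start)        myapp_start ;;
        stop)         myapp_stop ;;
        monitor)      myapp_monitor ;;
        validate-all) myapp_validate ;;
        meta-data)    echo "<resource-agent/>" ;;  # real agents emit full XML
        *)            return "$OCF_ERR_UNIMPLEMENTED" ;;
    esac
}

# A real script would end with:  dispatch "$1"; exit $?
```

The 'Status'/'Validate'/'Start' lines in the logs above correspond to
these per-action functions being entered and returning.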
Re: [ClusterLabs] SBD as watchdog daemon
On 4/16/19 5:27 PM, Олег Самойлов wrote:
>
>> On 16 Apr 2019, at 16:21, Klaus Wenninger wrote:
>>
>> On 4/16/19 3:12 PM, Олег Самойлов wrote:
>>> Okay, it looks like I found where it must be fixed.
>>>
>>> sbd-cluster.c
>>>
>>>     /* TODO - Make a CPG call and only call notify_parent() when we
>>>        get a reply */
>>>     notify_parent();
>>>
>>> Can anyone explain to me how to make the mentioned CPG call?
>> There should be a PR already that does exactly that.
> Not only.
>
>> It just has to be rebased.
> Not true. This PR is in conflict with the master branch.

Which is what I wanted to express with 'has to be rebased' ;-)

>> But be aware that this isn't gonna solve your halted-pacemaker-daemons
>> issue.
> Also not true. I tried to merge this PR and I solved several conflicts
> intuitively. Now the watchdog fires when corosync is frozen (half of my
> problem is solved).

Exactly - which is why I was directing your attention to the pacemaker
daemons.

> But… It fires on both nodes. :) Maybe this is due to my lack of
> knowledge of the corosync infrastructure.
>
> This PR is from 2017; why didn't you fix and apply such a very
> important PR yet?

Because there were other things to do that were even more important ;-)
And as you've just discovered yourself, things are not always that
easy ... Even if the issue with the non-blocked node restarting is
solved, there are still delicate issues to be considered with
startup/shutdown, installation/deinstallation, gradually configuring up
a cluster from a single node over two-node to several nodes, ...

Klaus
Re: [ClusterLabs] SBD as watchdog daemon
> On 16 Apr 2019, at 16:21, Klaus Wenninger wrote:
>
> On 4/16/19 3:12 PM, Олег Самойлов wrote:
>> Okay, it looks like I found where it must be fixed.
>>
>> sbd-cluster.c
>>
>>     /* TODO - Make a CPG call and only call notify_parent() when we
>>        get a reply */
>>     notify_parent();
>>
>> Can anyone explain to me how to make the mentioned CPG call?
> There should be a PR already that does exactly that.

Not only.

> It just has to be rebased.

Not true. This PR is in conflict with the master branch.

> But be aware that this isn't gonna solve your halted-pacemaker-daemons
> issue.

Also not true. I tried to merge this PR and I solved several conflicts
intuitively. Now the watchdog fires when corosync is frozen (half of my
problem is solved). But… It fires on both nodes. :) Maybe this is due to
my lack of knowledge of the corosync infrastructure.

This PR is from 2017; why didn't you fix and apply such a very important
PR yet?
Re: [ClusterLabs] SBD as watchdog daemon
On 4/16/19 3:12 PM, Олег Самойлов wrote:
> Okay, it looks like I found where it must be fixed.
>
> sbd-cluster.c
>
>     /* TODO - Make a CPG call and only call notify_parent() when we
>        get a reply */
>     notify_parent();
>
> Can anyone explain to me how to make the mentioned CPG call?

There should be a PR already that does exactly that. It just has to be
rebased. But be aware that this isn't gonna solve your
halted-pacemaker-daemons issue.

Klaus

--
Klaus Wenninger
Senior Software Engineer, EMEA ENG Base Operating Systems
Red Hat
kwenn...@redhat.com

Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Michael O'Neill, Tom Savage,
Eric Shander
Re: [ClusterLabs] SBD as watchdog daemon
Okay, it looks like I found where it must be fixed.

sbd-cluster.c

    /* TODO - Make a CPG call and only call notify_parent() when we
       get a reply */
    notify_parent();

Can anyone explain to me how to make the mentioned CPG call?
Re: [ClusterLabs] Resource not starting correctly III
On 15/04/19 16:01 -0600, JCA wrote:
> This is weird. Further experiments, consisting of creating and deleting
> the resource, reveal that, on creating the resource, myapp-script may
> be invoked multiple times - sometimes four, sometimes twenty or so,
> sometimes returning OCF_SUCCESS, some other times returning
> OCF_NOT_RUNNING. And whether or not it succeeds, as per pcs status,
> this seems to be something completely random.

Please don't forget that the agent also gets invoked so as to extract
its metadata (the action is "meta-data" in that case). You would figure
this out if you followed Ulrich's advice. Apologies if this possibility
is expressly skipped in your experiments.

--
Jan (Poki)
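[Editor's sketch] One way to see those extra meta-data invocations is to
log the action argument at the very top of the agent, before any
dispatching. A minimal helper; the AGENT_LOG default path is an
arbitrary choice:

```shell
#!/bin/sh
# log_invocation: append a timestamped record of every agent invocation,
# so "meta-data" calls show up alongside monitor/start/stop.
log_invocation() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') action=${1:-<none>}" \
        >> "${AGENT_LOG:-/tmp/myapp-script.log}"
}

# At the very top of the agent:
#   log_invocation "$1"
```

Tailing that log while creating the resource makes it obvious which of
the apparently "random" invocations are just metadata queries.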