Re: [ClusterLabs] fence agent and using it with pacemaker
On 10/02/16 15:20 +0100, Stanislav Kopp wrote:
> I have a general, clarification question about how fence agents work
> with pacemaker (crmsh in particular). As far as I understood, STDIN
> arguments can be used within pacemaker resources and command line
> arguments in a terminal (for testing and scripting?).

Fencing scripts from the fence-agents package support both kinds of input; Pacemaker will pass the arguments (de facto attributes/parameters of the particular stonith resource as specified in the CIB via tools like crmsh, plus some fence-agents API specific parameters like "action", but user-provided values always take precedence when configured) by piping them into the running script, but there is no reason you could not do the same from a terminal, e.g.:

# /usr/sbin/fence_pve <

> I have the "fence_pve" [1] agent which works fine with command line
> arguments, but not with pacemaker; it says some parameters like
> "passwd" or "login" do not exist,

Can you fully specify "it" in the previous sentence, please? Or even better, can you mimic what Pacemaker pumps into the agent per the example above? There may be a bug in the interaction between the fence_pve implementation and the fencing library, which does the heavy lifting behind the scenes.

> although STDIN parameters are supported [2]
>
> [1] https://github.com/ClusterLabs/fence-agents/blob/master/fence/agents/pve/fence_pve.py
> [2] https://www.mankier.com/8/fence_pve#Stdin_Parameters

-- 
Jan (Poki)

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
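[Editor's sketch of the stdin format discussed above. The parameter values are made up for illustration, and the pipe into fence_pve is shown only as a comment since the agent need not be installed to see the format: each name=value pair goes on its own line.]

```shell
# Build the newline-separated name=value payload that Pacemaker would
# pipe into a fence agent's stdin (all values below are hypothetical).
params='action=status
ip=192.0.2.10
login=root@pam
passwd=secret
plug=101'

printf '%s\n' "$params"

# With fence_pve actually installed, the manual equivalent would be:
#   printf '%s\n' "$params" | /usr/sbin/fence_pve
```

Running the agent this way reproduces what Pacemaker feeds it, which makes it easy to check whether a "parameter does not exist" error comes from the agent itself or from the CIB configuration.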
Re: [ClusterLabs] Antw: Re: DLM fencing
On 10/02/16 02:40 AM, Ulrich Windl wrote:
> Digimer wrote on 08.02.2016 at 20:03 in message
> <56b8e68a.1060...@alteeve.ca>:
>> On 08/02/16 01:56 PM, Ferenc Wágner wrote:
>>> Ken Gaillot writes:
>>>> On 02/07/2016 12:21 AM, G Spot wrote:
>>>>> Thanks for your response, am using ocf:pacemaker:controld resource
>>>>> agent and stonith-enabled=false do I need to configure stonith device
>>>>> to make this work?
>>>>
>>>> Correct. DLM requires access to fencing.
>>>
>>> I've meant to explore this connection for long, but never found much
>>> useful material on the subject. How does DLM fencing fit into the
>>> modern Pacemaker architecture? Fencing is a confusing topic in itself
>>> already (fence_legacy, fence_pcmk, stonith, stonithd, stonith_admin),
>>> then dlm_controld can use dlm_stonith to proxy fencing requests to
>>> Pacemaker, and it becomes hopeless... :)
>>>
>>> I'd be grateful for a pointer to a good overview document, or a quick
>>> sketch if you can spare the time. To invoke some concrete questions:
>>> When does DLM fence a node? Is it necessary only when there's no
>>> resource manager running on the cluster? Does it matter whether
>>> dlm_controld is run as a standalone daemon or as a controld resource?
>>> Wouldn't Pacemaker fence a failing node itself all the same? Or is
>>> dlm_stonith for the case when only the stonithd component of Pacemaker
>>> is active somehow?
>>
>> DLM is a thing unto itself, and some tools like gfs2 and clustered-lvm
>> use it to coordinate locking across the cluster. If a node drops out,
>> the cluster informs DLM and it blocks until the lost node is confirmed
>> fenced. Then it reaps the lost locks and recovery can begin.
>>
>> If fencing fails or is not configured, DLM never unblocks and anything
>> using it is left hung (by design, better to hang than risk corruption).
>>
>> One of many reasons why fencing is critical.
> I'm not deeply into DLM, but it seems to me DLM can run standalone, or in
> the cluster infrastructure (we only use it inside the cluster). When
> running standalone, it makes sense that DLM has its own fencing, but when
> running inside the cluster infrastructure, I'd expect that the cluster's
> fencing mechanisms are used (maybe just because of the better logging of
> reasons).

To be clear: DLM does NOT have its own fencing. It relies on the cluster's fencing.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [ClusterLabs] crmsh configure delete for constraints
On Wed, Feb 10, 2016 at 07:39:27AM +0300, Vladislav Bogdanov wrote:
[...]
>>> Particularly, imho RAs should not run validate_all on stop
>>> action.
>>
>> I'd disagree here. If the environment is no good (bad
>> installation, missing configuration and similar), then the stop
>> operation probably won't do much good. Ultimately, it may depend
>> on how the resource is managed. In ocf-rarun, validate_all is
>> run, but then the operation is not carried out if the environment
>> is invalid. In particular, the resource is considered to be
>> stopped, and the stop operation exits with success. One of the
>> most common cases is when the software resides on shared
>> non-parallel storage.
>
> Well, I'd reword. Generally, RA should not exit with error if validation
> fails on stop.
> Is that better?

Much better! :) Not on probes either.

Cheers,

Dejan

>> BTW, handling the stop and monitor/probe operations was the
>> primary motivation to develop ocf-rarun. It's often quite
>> difficult to get these things right.
Re: [ClusterLabs] crmsh configure delete for constraints
On Wed, Feb 10, 2016 at 12:06:34PM +0100, Ferenc Wágner wrote:
> Dejan Muhamedagic writes:
>
>> If the environment is no good (bad installation, missing configuration
>> and similar), then the stop operation probably won't do much good.
>
> Agreed. It may not even know how to probe it.
>
>> In ocf-rarun, validate_all is run, but then the operation is not
>> carried out if the environment is invalid. In particular, the resource
>> is considered to be stopped, and the stop operation exits with
>> success.
>
> This sounds dangerous. What if the local configuration of a node gets
> damaged while a resource is running on it?

I understand your worry, but cannot imagine how that could happen, unless in case of a more serious failure such as a disk crash, which (the failure) should really cause fencing at another level. The most common case, by far, is some mistake or omission during cluster setup. Humans tend to make mistakes. As Vladislav wrote elsewhere in this thread, this can cause a fencing loop, which is no fun, in particular if pacemaker is set to start on boot. It happened to me a few times and I guess I don't need to describe the intensity of my feelings toward computers in general and the cluster stack in particular (not to mention the RA author).

> Eventually the cluster may
> try to stop it, think that it succeeded and start the resource on
> another node. Now you have two instances running. Or is the resource
> probed on each node before the start?

No, I don't think so. The probes are run only on crmd start.

> Can a probe failure save your day
> here? Or do you only mean resource parameters by "environment" (which
> should be identical on each host, so validation would fail everywhere)?

The validation typically checks the configuration and then whether various files (programs) and directories exist, and sometimes whether directories are writable. There could be more, but at least I would prefer to stop here.
Anyway, we could introduce something like an optional emergency_stop() which would be invoked in ocf-rarun in case the validation failed. And/or, say, a RUN_STOP_ANYWAY variable which would allow stop to be run regardless. But note that it is extremely difficult to prove or make sure that executing the RA _after_ the validate step failed is going to produce meaningful results. In addition, there could also be FENCE_ON_INVALID_ENVIRONMENT (to be set by the user) for the very paranoid ;-)

Cheers,

Dejan
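[Editor's sketch of the stop behaviour Dejan describes: validation runs first, and if the environment is invalid the stop reports success rather than failing, since the resource cannot be running there anyway. The function and path names are hypothetical illustrations, not the actual ocf-rarun code; only the OCF exit-code values are the real constants.]

```shell
#!/bin/sh
# Real OCF exit-code values, defined inline so the sketch is standalone.
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5

validate_all() {
    # Hypothetical check: the resource's daemon binary must be present.
    [ -x /nonexistent/path/to/some-daemon ] || return "$OCF_ERR_INSTALLED"
    return "$OCF_SUCCESS"
}

ra_stop() {
    if ! validate_all; then
        # Invalid environment: the resource cannot possibly be running
        # here, so report the stop as successful instead of failing.
        # (A stop failure would escalate to fencing.)
        echo "environment invalid; treating resource as already stopped"
        return "$OCF_SUCCESS"
    fi
    # ... the real stop logic would go here ...
    return "$OCF_SUCCESS"
}

ra_stop
echo "stop rc=$?"
```

The key design point is that a stop on a broken node still exits 0, which is what keeps a bad installation from turning into a fencing loop.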
Re: [ClusterLabs] Cluster resources migration from CMAN to Pacemaker
On 09/02/16 15:34 +0530, jaspal singla wrote:
> Hi Jan/Digiman,

(as a matter of fact, Digimer, from Digital Mermaid :-)

> Thanks for your replies. Based on your inputs, I managed to configure these
> values and results were fine but still have some doubts for which I would
> seek your help. I also tried to dig some of issues on internet but seems
> due to lack of cman -> pacemaker documentation, I couldn't find any.

That's not exactly CMAN -> Pacemaker; a better conceptual expression is (CMAN,rgmanager) -> (Corosync v2,Pacemaker) or (CMAN,rgmanager) -> (Corosync/CMAN,Pacemaker), depending on the exact target (these expressions are what "clufter -h" uses to provide a hint about the facilitated conversions).

And yes, that documentation is so non-existent that I determined to put some bits of non-code knowledge into the docs accompanying clufter: https://pagure.io/clufter/blob/master/f/__root__/doc/rgmanager-pacemaker and thus at least partially fill the vacuum (+ lay some common grounds to talk about cluster properties in a way as implementation-agnostic as possible <-- I am not aware of a similar effort, but I didn't search extensively). Any help with extending/refining it is welcome.

> I have configured 8 scripts under one resource as you recommended. But out
> of those, 2 scripts are not being executed by the cluster itself. When I
> try to execute the same script manually, I am able to, but through the
> pacemaker command I am not.
> For example, this is the output of the crm_mon command:
>
> ###
> Last updated: Mon Feb 8 17:30:57 2016
> Last change: Mon Feb 8 17:03:29 2016 by hacluster via crmd on ha1-103.cisco.com
> Stack: corosync
> Current DC: ha1-103.cisco.com (version 1.1.13-10.el7-44eb2dd) - partition with quorum
> 1 node and 10 resources configured
>
> Online: [ ha1-103.cisco.com ]
>
> Resource Group: ctm_service
>     FSCheck (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/FsCheckAgent.py): Started ha1-103.cisco.com
>     NTW_IF (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/NtwIFAgent.py): Started ha1-103.cisco.com
>     CTM_RSYNC (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/RsyncAgent.py): Started ha1-103.cisco.com
>     REPL_IF (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/ODG_IFAgent.py): Started ha1-103.cisco.com
>     ORACLE_REPLICATOR (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/ODG_ReplicatorAgent.py): Started ha1-103.cisco.com
>     CTM_SID (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/OracleAgent.py): Started ha1-103.cisco.com
>     CTM_SRV (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/CtmAgent.py): Stopped
>     CTM_APACHE (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/ApacheAgent.py): Stopped
> Resource Group: ctm_heartbeat
>     CTM_HEARTBEAT (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/HeartBeat.py): Started ha1-103.cisco.com
> Resource Group: ctm_monitoring
>     FLASHBACK (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/FlashBackMonitor.py): Started ha1-103.cisco.com
>
> Failed Actions:
> * CTM_SRV_start_0 on ha1-103.cisco.com 'unknown error' (1): call=577,
>   status=complete, exitreason='none',
>   last-rc-change='Mon Feb 8 17:12:33 2016', queued=0ms, exec=74ms
> #
>
> CTM_SRV and CTM_APACHE are in stopped state. These services are not being
> executed by the cluster, or are somehow being failed by the cluster, not
> sure why. When I manually execute the CTM_SRV script, it runs without
> issues.
> For manual execution of this script I ran the below command:
>
> # /cisco/PrimeOpticalServer/HA/bin/OracleAgent.py status
>
> Output:
> _
> 2016-02-08 17:48:41,888 INFO MainThread CtmAgent
> =
> Executing preliminary checks...
> Check Oracle and Listener availability
> => Oracle and listener are up.
> Migration check
> => Migration check completed successfully.
> Check the status of the DB archivelog
> => DB archivelog check completed successfully.
> Check of Oracle scheduler...
> => Check of Oracle scheduler completed successfully
> Initializing database tables
> => Database tables initialized successfully.
> Install in cache the store procedure
> => Installing store procedures completed successfully
> Gather the oracle system stats
> => Oracle stats completed successfully
> Preliminary checks completed.
> =
> Starting base services...
> Starting Zookeeper...
> JMX enabled by default
> Using config: /opt/CiscoTransportManagerServer/zookeeper/bin/../conf/zoo.cfg
> Starting zookeeper ... STARTED
> Retrieving name
[ClusterLabs] Antw: Re: Antw: Re: DLM fencing
>>> Digimer wrote on 10.02.2016 at 17:32 in message <56bb6637.6090...@alteeve.ca>:
> On 10/02/16 02:40 AM, Ulrich Windl wrote:
[...]
>>> If fencing fails or is not configured, DLM never unblocks and anything
>>> using it is left hung (by design, better to hang than risk corruption).
>>>
>>> One of many reasons why fencing is critical.
>>
>> I'm not deeply into DLM, but it seems to me DLM can run standalone, or in
>> the cluster infrastructure (we only use it inside the cluster). When
>> running standalone, it makes sense that DLM has its own fencing, but when
>> running inside the cluster infrastructure, I'd expect that the cluster's
>> fencing mechanisms are used (maybe just because of the better logging of
>> reasons).
>
> To be clear; DLM does NOT have its own fencing. It relies on the
> cluster's fencing.

OK, but is this true for cLVM and O2CB as well? I always felt some of those do fencing themselves as soon as they fail to communicate with DLM. So the first guess was it's DLM...
Re: [ClusterLabs] Antw: Re: Antw: Re: DLM fencing
On 11/02/16 02:37 AM, Ulrich Windl wrote:
> Digimer wrote on 10.02.2016 at 17:32 in message
> <56bb6637.6090...@alteeve.ca>:
>> On 10/02/16 02:40 AM, Ulrich Windl wrote:
>
> [...]
>>>> If fencing fails or is not configured, DLM never unblocks and anything
>>>> using it is left hung (by design, better to hang than risk corruption).
>>>>
>>>> One of many reasons why fencing is critical.
>>>
>>> I'm not deeply into DLM, but it seems to me DLM can run standalone, or in
>>> the cluster infrastructure (we only use it inside the cluster). When
>>> running standalone, it makes sense that DLM has its own fencing, but when
>>> running inside the cluster infrastructure, I'd expect that the cluster's
>>> fencing mechanisms are used (maybe just because of the better logging of
>>> reasons).
>>
>> To be clear; DLM does NOT have its own fencing. It relies on the
>> cluster's fencing.
>
> OK, is this true for cLVM and O2CB as well? I always felt some of those do
> fencing themselves as soon as they fail to communicate with DLM. So the
> first guess was it's DLM...

I can't speak to o2cb, never used it. However, clustered LVM, gfs2 and rgmanager use DLM, and in all cases DLM does nothing but block until it is told that the fence was successful. It plays no active role in fencing.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [ClusterLabs] Antw: Re: crmsh configure delete for constraints
10.02.2016 11:38, Ulrich Windl wrote:
> Vladislav Bogdanov wrote on 10.02.2016 at 05:39 in message
> <6e479808-6362-4932-b2c6-348c7efc4...@hoster-ok.com>:
[...]
>> Well, I'd reword. Generally, RA should not exit with error if validation
>> fails on stop.
>> Is that better?
[...]
> As we have different error codes, what type of error?

Any which makes pacemaker think the resource stop op failed; OCF_ERR_* particularly.

If pacemaker has got an error on start, it will run stop with the same set of parameters anyway. And it will get an error again if that one came from validation and the RA does not differentiate validation for start and stop. And then circular fencing over the whole cluster is triggered for no reason.

Of course, for safety, the RA could save its state if start was successful and skip validation on stop only if that state is not found. Otherwise a removed binary or config file would result in the resource running on several nodes.

> Well, this all seems to be very complicated to make some general algorithm ;)
>
> Regards,
> Ulrich
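[Editor's sketch of the state-file idea above: start records that it succeeded, and stop only skips validation when no such record exists. This is an illustrative pattern, not code from any shipped RA; the state path is made up (a real agent would keep it under ${HA_RSCTMP} or similar).]

```shell
#!/bin/sh
OCF_SUCCESS=0
STATE_FILE="${TMPDIR:-/tmp}/demo-ra.started"   # hypothetical location

ra_start() {
    # Full validation would run here; on a successful start we record
    # the fact so a later stop knows the resource really ran locally.
    touch "$STATE_FILE"
    return "$OCF_SUCCESS"
}

ra_stop() {
    if [ -e "$STATE_FILE" ]; then
        # We started here: run the real stop logic, validation included.
        rm -f "$STATE_FILE"
        return "$OCF_SUCCESS"
    fi
    # No state file: this node never started the resource, so a broken
    # environment must not turn stop into a failure and trigger fencing.
    return "$OCF_SUCCESS"
}

ra_start
ra_stop
echo "stop rc=$?"
```

This addresses both failure modes named in the thread: a node that never started the resource cannot enter a fencing loop over a bad environment, while a node that did start it still runs the full stop path.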
Re: [ClusterLabs] Antw: Re: crmsh configure delete for constraints
Vladislav Bogdanov writes:

> If pacemaker has got an error on start, it will run stop with the same
> set of parameters anyway. And it will get an error again if that one came
> from validation and the RA does not differentiate validation for start
> and stop. And then circular fencing over the whole cluster is triggered
> for no reason.
>
> Of course, for safety, the RA could save its state if start was successful
> and skip validation on stop only if that state is not found. Otherwise a
> removed binary or config file would result in the resource running on
> several nodes.

What would happen if we made the start operation return OCF_NOT_RUNNING if validation fails? Or more broadly: if the start operation knows that the resource is not running, a stop operation would do no good. From Pacemaker Explained B.4: "The cluster will not attempt to stop a resource that returns this for any action." The probes could still return OCF_ERR_CONFIGURED, putting real info into the logs; the stop failure could still lead to fencing, protecting data integrity, but circular fencing would not happen. I hope.

By the way, what are the reasons to run stop after a failed start? To clean up halfway-started resources? Besides OCF_ERR_GENERIC, the other error codes pretty much guarantee that the resource cannot be active.

-- 
Regards,
Feri.
Re: [ClusterLabs] crmsh configure delete for constraints
Dejan Muhamedagic writes:

> If the environment is no good (bad installation, missing configuration
> and similar), then the stop operation probably won't do much good.

Agreed. It may not even know how to probe it.

> In ocf-rarun, validate_all is run, but then the operation is not
> carried out if the environment is invalid. In particular, the resource
> is considered to be stopped, and the stop operation exits with
> success.

This sounds dangerous. What if the local configuration of a node gets damaged while a resource is running on it? Eventually the cluster may try to stop it, think that it succeeded, and start the resource on another node. Now you have two instances running. Or is the resource probed on each node before the start? Can a probe failure save your day here? Or do you only mean resource parameters by "environment" (which should be identical on each host, so validation would fail everywhere)?

-- 
Thanks,
Feri.
Re: [ClusterLabs] [Linux-HA] Anyone successfully install PAcemaker/Corosync on Freebsd?
Moving to users@clusterlabs.org.

On Sat, Dec 19, 2015 at 06:47:54PM -0400, mike wrote:
> Hi All,
>
> just curious if anyone has had any luck at one point installing
> Pacemaker and Corosync on FreeBSD.

According to the pacemaker changelog, at least David Shane Holden and Ruben Kerkhof have been submitting pull requests recently with FreeBSD compat fixes; maybe they can help?

Lars

> I've run into an issue when running ./configure while trying to
> install Corosync. The process craps out at nss with this error:
>
> checking for nss... configure: error: in `/root/heartbeat/corosync-2.3.3':
> configure: error: The pkg-config script could not be found or is too
> old. Make sure it is in your PATH or set the PKG_CONFIG environment
> variable to the full path to pkg-config.
> Alternatively, you may set the environment variables nss_CFLAGS
> and nss_LIBS to avoid the need to call pkg-config.
> See the pkg-config man page for more details.
>
> I've looked unsuccessfully for a package called pkg-config, and nss
> appears to be installed as you can see from this output:
>
> root@wellesley:~/heartbeat/corosync-2.3.3 # pkg install nss
> Updating FreeBSD repository catalogue...
> FreeBSD repository is up-to-date.
> All repositories are up-to-date.
> Checking integrity... done (0 conflicting)
> The most recent version of packages are already installed
>
> Anyway - just looking for any suggestions. Hoping that perhaps
> someone has successfully done this.
>
> thanks in advance
> -mgb

-- 
: Lars Ellenberg
: http://www.LINBIT.com
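[Editor's sketch of the two usual ways past that configure failure: install pkgconf (which provides the pkg-config binary on FreeBSD; the nss package only supplies the library the check is looking for), or export nss_CFLAGS/nss_LIBS as the error message itself suggests. The paths below are assumptions to verify against the local nss install.]

```shell
# Option 1 (commented out; needs FreeBSD and root): the pkg-config tool
# comes from the pkgconf package, not from nss.
#   pkg install -y pkgconf

# Option 2: bypass pkg-config by supplying the flags directly before
# running ./configure. These paths are assumptions -- check where the
# nss headers and libraries actually landed on your system.
export nss_CFLAGS="-I/usr/local/include/nss/nss"
export nss_LIBS="-L/usr/local/lib -lnss3 -lnssutil3 -lsmime3 -lssl3 -lnspr4"

printf '%s\n' "$nss_CFLAGS"
```

With either route, re-running ./configure should get past the nss check; option 1 is the cleaner fix since later checks will likely want pkg-config too.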
Re: [ClusterLabs] Antw: Re: crmsh configure delete for constraints
10.02.2016 13:56, Ferenc Wágner wrote:
> Vladislav Bogdanov writes:
>
>> If pacemaker has got an error on start, it will run stop with the same
>> set of parameters anyway. And it will get an error again if that one came
>> from validation and the RA does not differentiate validation for start
>> and stop. And then circular fencing over the whole cluster is triggered
>> for no reason.
>>
>> Of course, for safety, the RA could save its state if start was successful
>> and skip validation on stop only if that state is not found. Otherwise a
>> removed binary or config file would result in the resource running on
>> several nodes.
>
> What would happen if we made the start operation return OCF_NOT_RUNNING

Well, then the cluster will try to start it again, and that could be undesirable - what are OCF_ERR_INSTALLED and OCF_ERR_CONFIGURED for then?

> if validation fails? Or more broadly: if the start operation knows that
> the resource is not running, a stop operation would do no good.
> From Pacemaker Explained B.4: "The cluster will not attempt to stop a
> resource that returns this for any action." The probes could still
> return OCF_ERR_CONFIGURED, putting real info into the logs; the stop
> failure could still lead to fencing, protecting data integrity, but
> circular fencing would not happen. I hope.
>
> By the way, what are the reasons to run stop after a failed start? To
> clean up halfway-started resources? Besides OCF_ERR_GENERIC, the other
> error codes pretty much guarantee that the resource cannot be active.

That heavily depends on how the given RA is implemented...
[ClusterLabs] Antw: Re: Antw: Re: crmsh configure delete for constraints
>>> Ferenc Wágner wrote on 10.02.2016 at 11:56 in message <87mvr8n896@lant.ki.iif.hu>:
> Vladislav Bogdanov writes:
>
>> If pacemaker has got an error on start, it will run stop with the same
>> set of parameters anyway. And it will get an error again if that one came
>> from validation and the RA does not differentiate validation for start
>> and stop. And then circular fencing over the whole cluster is triggered
>> for no reason.
>>
>> Of course, for safety, the RA could save its state if start was successful
>> and skip validation on stop only if that state is not found. Otherwise a
>> removed binary or config file would result in the resource running on
>> several nodes.
>
> What would happen if we made the start operation return OCF_NOT_RUNNING
> if validation fails? Or more broadly: if the start operation knows that

I think this should NOT be done, because actually the RA doesn't know (most likely). You are trying to reduce the impact of one problem by introducing another problem (returning an incorrect exit code).

> the resource is not running, thus a stop operation would do no good.

If the configuration is NOT correct, the cluster should neither try to start nor stop the resource. Maybe the cluster should remember that bad state until the operator does a cleanup of the problem.

> From Pacemaker Explained B.4: "The cluster will not attempt to stop a
> resource that returns this for any action." The probes could still
> return OCF_ERR_CONFIGURED, putting real info into the logs, the stop
> failure could still lead to fencing, protecting data integrity, but
> circular fencing would not happen. I hope.
>
> By the way, what are the reasons to run stop after a failed start? To

Probably because the start operation is not required to be atomic; that is, the resource could be partially started. Stop ensures the resource is completely stopped (or otherwise fencing will do that).

> clean up halfway-started resources?
> Besides OCF_ERR_GENERIC, the other
> error codes pretty much guarantee that the resource cannot be active.
> --
> Regards,
> Feri.