Re: [ClusterLabs] fence agent and using it with pacemaker

2016-02-10 Thread Jan Pokorný
On 10/02/16 15:20 +0100, Stanislav Kopp wrote:
> I have a general clarification question about how fence agents work
> with pacemaker (crmsh in particular). As far as I understood, STDIN
> arguments can be used within pacemaker resources, and command-line
> arguments in a terminal (for testing and scripting?).

Fencing scripts from the fence-agents package support both kinds of
input; Pacemaker will pass the arguments (de facto the
attributes/parameters of the particular stonith resource as specified
in the CIB via tools like crmsh, plus some fence-agents API specific
parameters like "action", though user-provided values always take
precedence when configured) by piping them into the running script,
but there is no reason you could not do the same from a terminal, e.g.:

# /usr/sbin/fence_pve <<EOF
action=status
ipaddr=<PVE host address>
login=<user, e.g. root@pam>
passwd=<password>
plug=<VM ID>
EOF

> I have the "fence_pve" [1] agent, which works fine with command-line
> arguments but not with pacemaker; it says some parameters like
> "passwd" or "login" do not exist,

Can you fully specify "it" in the previous sentence, please?
Or even better, can you mimic what Pacemaker pumps into the agent
per the example above?
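
One way to do that is to capture exactly what Pacemaker pipes into the
agent by wrapping it temporarily (a rough sketch; the .real suffix and
the log path are arbitrary):

  # mv /usr/sbin/fence_pve /usr/sbin/fence_pve.real
  # cat > /usr/sbin/fence_pve <<'EOF'
  #!/bin/sh
  # Log whatever arrives on stdin, then hand it to the real agent;
  # the pipeline exits with the real agent's status.
  tee -a /tmp/fence_pve-stdin.log | /usr/sbin/fence_pve.real "$@"
  EOF
  # chmod 755 /usr/sbin/fence_pve

(And move the original back once you have the capture.)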

There may be a bug in the interaction between the fence_pve
implementation and the fencing library, which does the heavy lifting
behind the scenes.

> although STDIN parameters are supported [2]
> 
> [1] 
> https://github.com/ClusterLabs/fence-agents/blob/master/fence/agents/pve/fence_pve.py
> [2] https://www.mankier.com/8/fence_pve#Stdin_Parameters

-- 
Jan (Poki)




Re: [ClusterLabs] Antw: Re: DLM fencing

2016-02-10 Thread Digimer
On 10/02/16 02:40 AM, Ulrich Windl wrote:
> >>> Digimer schrieb am 08.02.2016 um 20:03 in Nachricht
> <56b8e68a.1060...@alteeve.ca>:
>> On 08/02/16 01:56 PM, Ferenc Wágner wrote:
>>> Ken Gaillot  writes:
>>>
>>>> On 02/07/2016 12:21 AM, G Spot wrote:
>>>>
>>>>> Thanks for your response, I am using the ocf:pacemaker:controld
>>>>> resource agent with stonith-enabled=false; do I need to configure a
>>>>> stonith device to make this work?
>>>>
>>>> Correct. DLM requires access to fencing.
>>>
>>> I've meant to explore this connection for long, but never found much
>>> useful material on the subject.  How does DLM fencing fit into the
>>> modern Pacemaker architecture?  Fencing is a confusing topic in itself
>>> already (fence_legacy, fence_pcmk, stonith, stonithd, stonith_admin),
>>> then dlm_controld can use dlm_stonith to proxy fencing requests to
>>> Pacemaker, and it becomes hopeless... :)
>>>
>>> I'd be grateful for a pointer to a good overview document, or a quick
>>> sketch if you can spare the time.  To invoke some concrete questions:
>>> When does DLM fence a node?  Is it necessary only when there's no
>>> resource manager running on the cluster?  Does it matter whether
>>> dlm_controld is run as a standalone daemon or as a controld resource?
>>> Wouldn't Pacemaker fence a failing node itself all the same?  Or is
>>> dlm_stonith for the case when only the stonithd component of Pacemaker
>>> is active somehow?
>>
>> DLM is a thing unto itself, and some tools like gfs2 and clustered LVM
>> use it to coordinate locking across the cluster. If a node drops out,
>> the cluster informs DLM and it blocks until the lost node is confirmed
>> fenced. Then it reaps the lost locks and recovery can begin.
>>
>> If fencing fails or is not configured, DLM never unblocks and anything
>> using it is left hung (by design, better to hang than risk corruption).
>>
>> One of many reasons why fencing is critical.
> 
> I'm not deeply into DLM, but it seems to me DLM can run standalone, or in the
> cluster infrastructure (we only use it inside the cluster). When running
> standalone, it makes sense that DLM has its own fencing, but when running
> inside the cluster infrastructure, I'd expect that the cluster's fencing
> mechanisms are used (maybe just because of the better logging of reasons).

To be clear: DLM does NOT have its own fencing. It relies on the
cluster's fencing.
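
That said, when dlm_controld runs standalone, the dlm_stonith proxy
mentioned above is how fence requests get handed to Pacemaker; if I
read dlm_controld(8) right, it is wired up in /etc/dlm/dlm.conf roughly
like this (a sketch, not a tested configuration):

  # /etc/dlm/dlm.conf
  # Hand every fencing request to Pacemaker's fencer via the proxy
  # instead of defining per-device fence agents here.
  fence_all /usr/sbin/dlm_stonith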

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] crmsh configure delete for constraints

2016-02-10 Thread Dejan Muhamedagic
On Wed, Feb 10, 2016 at 07:39:27AM +0300, Vladislav Bogdanov wrote:
[...]
> >> Particularly, imho RAs should not run validate_all on stop
> >> action.
> >
> >I'd disagree here. If the environment is no good (bad
> >installation, missing configuration and similar), then the stop
> >operation probably won't do much good. Ultimately, it may depend
> >on how the resource is managed. In ocf-rarun, validate_all is
> >run, but then the operation is not carried out if the environment
> >is invalid. In particular, the resource is considered to be
> >stopped, and the stop operation exits with success. One of the
> >most common cases is when the software resides on shared
> >non-parallel storage.
> 
> Well, I'd reword. Generally, an RA should not exit with an error if
> validation fails on stop.
> Is that better?

Much better! :) Not on probes either.
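
In RA terms, the rule might look roughly like this (a sketch with
ocf-shellfuncs; validate_all and the dispatch are illustrative, not
ocf-rarun's actual internals):

  . ${OCF_FUNCTIONS_DIR:-${OCF_ROOT}/lib/heartbeat}/ocf-shellfuncs

  if ! validate_all; then
      case "$1" in
      stop)    exit $OCF_SUCCESS ;;      # broken env: report "stopped"
      monitor) exit $OCF_NOT_RUNNING ;;  # probes: report "not running"
      *)       exit $OCF_ERR_CONFIGURED ;;
      esac
  fi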

Cheers,

Dejan

> >
> >BTW, handling the stop and monitor/probe operations was the
> >primary motivation to develop ocf-rarun. It's often quite
> >difficult to get these things right.
> >
> >Cheers,
> >
> >Dejan
> >
> >
> >> Best,
> >> Vladislav



Re: [ClusterLabs] crmsh configure delete for constraints

2016-02-10 Thread Dejan Muhamedagic
On Wed, Feb 10, 2016 at 12:06:34PM +0100, Ferenc Wágner wrote:
> Dejan Muhamedagic  writes:
> 
> > If the environment is no good (bad installation, missing configuration
> > and similar), then the stop operation probably won't do much good.
> 
> Agreed.  It may not even know how to probe it.
> 
> > In ocf-rarun, validate_all is run, but then the operation is not
> > carried out if the environment is invalid. In particular, the resource
> > is considered to be stopped, and the stop operation exits with
> > success.
> 
> This sounds dangerous.  What if the local configuration of a node gets
> damaged while a resource is running on it?

I understand your worry, but cannot imagine how that could
happen, except in the case of a more serious failure such as a disk
crash, and such a failure should really cause fencing at another
level.

The most common case, by far, is some mistake or omission during
cluster setup. Humans tend to make mistakes. As Vladislav wrote
elsewhere in this thread, this can cause a fencing loop, which is
no fun, in particular if pacemaker is set to start on boot. It
happened to me a few times and I guess I don't need to describe
the intensity of my feelings toward computers in general and the
cluster stack in particular (not to mention the RA author).

> Eventually the cluster may
> try to stop it, think that it succeeded and start the resource on
> another node.  Now you have two instances running.  Or is the resource
> probed on each node before the start?

No, I don't think so. The probes are run only on crmd start.

> Can a probe failure save your day
> here?  Or do you only mean resource parameters by "environment" (which
> should be identical on each host, so validation would fail everywhere)?

The validation typically checks the configuration and then
whether various files (programs) and directories exist, sometimes
whether directories are writable. There could be more, but at least I
would prefer to stop here.

Anyway, we could introduce something like an optional
emergency_stop() which would be invoked in ocf-rarun in case the
validation failed. And/or say a RUN_STOP_ANYWAY variable which
would allow stop to be run regardless. But note that it is
extremely difficult to prove or make sure that executing the RA
_after_ the validate step has failed is going to produce meaningful
results.  In addition, there could also be
FENCE_ON_INVALID_ENVIRONMENT (to be set by the user) for the very
paranoid ;-)
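
Sketched very roughly (again, emergency_stop and RUN_STOP_ANYWAY being
the proposed names, nothing that exists in ocf-rarun today):

  # assumes ocf-shellfuncs is already sourced
  if ! validate_all && [ "$1" = stop ]; then
      if ocf_is_true "${RUN_STOP_ANYWAY:-false}"; then
          :  # fall through and run the normal stop regardless
      elif type emergency_stop >/dev/null 2>&1; then
          emergency_stop; exit $?        # RA-provided best-effort cleanup
      else
          exit $OCF_SUCCESS              # current behaviour: consider stopped
      fi
  fi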

Cheers,

Dejan

> -- 
> Thanks,
> Feri.


Re: [ClusterLabs] Cluster resources migration from CMAN to Pacemaker

2016-02-10 Thread Jan Pokorný
On 09/02/16 15:34 +0530, jaspal singla wrote:
> Hi Jan/Digiman,

(as a matter of fact, Digimer, from Digital Mermaid :-)

> Thanks for your replies. Based on your inputs, I managed to configure these
> values and the results were fine, but I still have some doubts for which I
> would seek your help. I also tried to dig into some of the issues on the
> internet, but due to the lack of cman -> pacemaker documentation, I couldn't
> find anything.

That's not exactly CMAN -> Pacemaker; a better conceptual expression is
  (CMAN,rgmanager) -> (Corosync v2,Pacemaker)
or
  (CMAN,rgmanager) -> (Corosync/CMAN,Pacemaker)
depending on the exact target (these expressions are what "clufter -h"
uses to provide a hint about the facilitated conversions).

And yes, the documentation is so non-existent that I decided to put
some bits of non-code knowledge into the docs accompanying clufter:
https://pagure.io/clufter/blob/master/f/__root__/doc/rgmanager-pacemaker
to at least partially fill the vacuum (+ lay some common ground for
talking about cluster properties in a way as implementation-agnostic
as possible <-- I am not aware of a similar effort, but I didn't search
extensively).

Any help with extending/refining it is welcome.

> I have configured 8 scripts under one resource as you recommended, but 2
> of these scripts are not being executed by the cluster itself. When I
> execute the same scripts manually they work, but through pacemaker they
> don't.
> 
> For example:
> 
> This is the output of crm_mon command:
> 
> ###
> Last updated: Mon Feb  8 17:30:57 2016  Last change: Mon Feb  8
> 17:03:29 2016 by hacluster via crmd on ha1-103.cisco.com
> Stack: corosync
> Current DC: ha1-103.cisco.com (version 1.1.13-10.el7-44eb2dd) - partition
> with quorum
> 1 node and 10 resources configured
> 
> Online: [ ha1-103.cisco.com ]
> 
>  Resource Group: ctm_service
>  FSCheck
>  (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/FsCheckAgent.py):
>  Started ha1-103.cisco.com
>  NTW_IF
> (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/NtwIFAgent.py):  Started
> ha1-103.cisco.com
>  CTM_RSYNC
>  (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/RsyncAgent.py):  Started
> ha1-103.cisco.com
>  REPL_IF
>  (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/ODG_IFAgent.py): Started
> ha1-103.cisco.com
>  ORACLE_REPLICATOR
>  (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/ODG_ReplicatorAgent.py):
> Started ha1-103.cisco.com
>  CTM_SID
>  (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/OracleAgent.py): Started
> ha1-103.cisco.com
>  CTM_SRV
>  (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/CtmAgent.py):Stopped
>  CTM_APACHE
> (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/ApacheAgent.py): Stopped
>  Resource Group: ctm_heartbeat
>  CTM_HEARTBEAT
>  (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/HeartBeat.py):   Started
> ha1-103.cisco.com
>  Resource Group: ctm_monitoring
>  FLASHBACK
>  (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/FlashBackMonitor.py):
>  Started ha1-103.cisco.com
> 
> Failed Actions:
> * CTM_SRV_start_0 on ha1-103.cisco.com 'unknown error' (1): call=577,
> status=complete, exitreason='none',
> last-rc-change='Mon Feb  8 17:12:33 2016', queued=0ms, exec=74ms
> 
> #
> 
> 
> CTM_SRV and CTM_APACHE are in stopped state. These services are not being
> executed by the cluster, or the cluster is somehow failing them; not sure
> why. When I manually execute the CTM_SRV script, it runs without issues.
> 
> -> For manual execution of this script, I ran the below command:
> 
> # /cisco/PrimeOpticalServer/HA/bin/OracleAgent.py status
> 
> Output:
> 
> _
> 2016-02-08 17:48:41,888 INFO MainThread CtmAgent
> =
> Executing preliminary checks...
>  Check Oracle and Listener availability
>   => Oracle and listener are up.
>  Migration check
>   => Migration check completed successfully.
>  Check the status of the DB archivelog
>   => DB archivelog check completed successfully.
>  Check of Oracle scheduler...
>   => Check of Oracle scheduler completed successfully
>  Initializing database tables
>   => Database tables initialized successfully.
>  Install in cache the store procedure
>   => Installing store procedures completed successfully
>  Gather the oracle system stats
>   => Oracle stats completed successfully
> Preliminary checks completed.
> =
> Starting base services...
> Starting Zookeeper...
> JMX enabled by default
> Using config: /opt/CiscoTransportManagerServer/zookeeper/bin/../conf/zoo.cfg
> Starting zookeeper ... STARTED
>  Retrieving name 

[ClusterLabs] Antw: Re: Antw: Re: DLM fencing

2016-02-10 Thread Ulrich Windl
>>> Digimer  schrieb am 10.02.2016 um 17:32 in Nachricht
<56bb6637.6090...@alteeve.ca>:
> On 10/02/16 02:40 AM, Ulrich Windl wrote:

[...]
>>> If fencing fails or is not configured, DLM never unblocks and anything
>>> using it is left hung (by design, better to hang than risk corruption).
>>>
>>> One of many reasons why fencing is critical.
>> 
>> I'm not deeply into DLM, but it seems to me DLM can run standalone, or in the
>> cluster infrastructure (we only use it inside the cluster). When running
>> standalone, it makes sense that DLM has its own fencing, but when running
>> inside the cluster infrastructure, I'd expect that the cluster's fencing
>> mechanisms are used (maybe just because of the better logging of reasons).
> 
> To be clear: DLM does NOT have its own fencing. It relies on the
> cluster's fencing.

OK, is this true for cLVM and O2CB as well? I always felt some of those
do fencing themselves as soon as they fail to communicate with DLM. So
the first guess was it's DLM...

> 
> -- 
> Digimer
> Papers and Projects: https://alteeve.ca/w/ 
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
> 







Re: [ClusterLabs] Antw: Re: Antw: Re: DLM fencing

2016-02-10 Thread Digimer
On 11/02/16 02:37 AM, Ulrich Windl wrote:
> >>> Digimer schrieb am 10.02.2016 um 17:32 in Nachricht
> <56bb6637.6090...@alteeve.ca>:
>> On 10/02/16 02:40 AM, Ulrich Windl wrote:
> 
> [...]
>>>> If fencing fails or is not configured, DLM never unblocks and anything
>>>> using it is left hung (by design, better to hang than risk corruption).
>>>>
>>>> One of many reasons why fencing is critical.
>>>
>>> I'm not deeply into DLM, but it seems to me DLM can run standalone, or in the
>>> cluster infrastructure (we only use it inside the cluster). When running
>>> standalone, it makes sense that DLM has its own fencing, but when running
>>> inside the cluster infrastructure, I'd expect that the cluster's fencing
>>> mechanisms are used (maybe just because of the better logging of reasons).
>>
>> To be clear: DLM does NOT have its own fencing. It relies on the
>> cluster's fencing.
> 
> OK, is this true for cLVM and O2CB as well? I always felt some of those
> do fencing themselves as soon as they fail to communicate with DLM. So
> the first guess was it's DLM...

I can't speak to o2cb, never used it. However, clustered LVM, gfs2 and
rgmanager use DLM, and in all cases, DLM does nothing but block until it
is told that the fence was successful. It plays no active role in fencing.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] Antw: Re: crmsh configure delete for constraints

2016-02-10 Thread Vladislav Bogdanov

10.02.2016 11:38, Ulrich Windl wrote:
> >>> Vladislav Bogdanov schrieb am 10.02.2016 um 05:39 in
> Nachricht <6e479808-6362-4932-b2c6-348c7efc4...@hoster-ok.com>:
> [...]
> > Well, I'd reword. Generally, an RA should not exit with an error if
> > validation fails on stop.
> > Is that better?
> [...]
> As we have different error codes, what type of error?


Any which makes Pacemaker think the resource stop op failed;
OCF_ERR_* particularly.

If Pacemaker got an error on start, it will run stop with the same
set of parameters anyway, and it will get the error again if that one
came from validation and the RA does not differentiate validation for
start and stop. And then circular fencing over the whole cluster is
triggered for no reason.


Of course, for safety, the RA could save its state if start was
successful and skip validation on stop only if that state is not found.
Otherwise a removed binary or config file would result in the resource
running on several nodes.
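
A sketch of that (state file location and helper names illustrative):

  state_file="${HA_RSCTMP:-/run/resource-agents}/${OCF_RESOURCE_INSTANCE}.started"

  start() {
      validate_all || exit $OCF_ERR_CONFIGURED
      real_start   || exit $OCF_ERR_GENERIC
      touch "$state_file"              # remember the successful start
      exit $OCF_SUCCESS
  }

  stop() {
      if [ ! -e "$state_file" ] && ! validate_all; then
          exit $OCF_SUCCESS            # never started here: safe to say so
      fi
      real_stop || exit $OCF_ERR_GENERIC   # was started: do a real stop
      rm -f "$state_file"
      exit $OCF_SUCCESS
  }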


Well, this all seems to be too complicated to turn into some general
algorithm ;)





> Regards,
> Ulrich









Re: [ClusterLabs] Antw: Re: crmsh configure delete for constraints

2016-02-10 Thread Ferenc Wágner
Vladislav Bogdanov  writes:

> If Pacemaker got an error on start, it will run stop with the same
> set of parameters anyway, and it will get the error again if that one
> came from validation and the RA does not differentiate validation for
> start and stop. And then circular fencing over the whole cluster is
> triggered for no reason.
>
> Of course, for safety, the RA could save its state if start was
> successful and skip validation on stop only if that state is not found.
> Otherwise a removed binary or config file would result in the resource
> running on several nodes.

What would happen if we made the start operation return OCF_NOT_RUNNING
if validation fails?  Or more broadly: if the start operation knows that
the resource is not running, thus a stop operation would do no good.
From Pacemaker Explained B.4: "The cluster will not attempt to stop a
resource that returns this for any action."  The probes could still
return OCF_ERR_CONFIGURED, putting real info into the logs, the stop
failure could still lead to fencing, protecting data integrity, but
circular fencing would not happen.  I hope.

By the way, what are the reasons to run stop after a failed start?  To
clean up halfway-started resources?  Besides OCF_ERR_GENERIC, the other
error codes pretty much guarantee that the resource cannot be active.
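
(For reference, the OCF exit codes in play: 0 OCF_SUCCESS,
1 OCF_ERR_GENERIC, 2 OCF_ERR_ARGS, 3 OCF_ERR_UNIMPLEMENTED,
4 OCF_ERR_PERM, 5 OCF_ERR_INSTALLED, 6 OCF_ERR_CONFIGURED,
7 OCF_NOT_RUNNING.)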
-- 
Regards,
Feri.



Re: [ClusterLabs] crmsh configure delete for constraints

2016-02-10 Thread Ferenc Wágner
Dejan Muhamedagic  writes:

> If the environment is no good (bad installation, missing configuration
> and similar), then the stop operation probably won't do much good.

Agreed.  It may not even know how to probe it.

> In ocf-rarun, validate_all is run, but then the operation is not
> carried out if the environment is invalid. In particular, the resource
> is considered to be stopped, and the stop operation exits with
> success.

This sounds dangerous.  What if the local configuration of a node gets
damaged while a resource is running on it?  Eventually the cluster may
try to stop it, think that it succeeded and start the resource on
another node.  Now you have two instances running.  Or is the resource
probed on each node before the start?  Can a probe failure save your day
here?  Or do you only mean resource parameters by "environment" (which
should be identical on each host, so validation would fail everywhere)?
-- 
Thanks,
Feri.



Re: [ClusterLabs] [Linux-HA] Anyone successfully install PAcemaker/Corosync on Freebsd?

2016-02-10 Thread Lars Ellenberg

Moving to users@clusterlabs.org.

On Sat, Dec 19, 2015 at 06:47:54PM -0400, mike wrote:
> Hi All,
> 
> just curious if anyone has had any luck at one point installing
> Pacemaker and Corosync on FreeBSD.

According to the pacemaker changelog, at least
David Shane Holden and Ruben Kerkhof
have been submitting pull requests recently with FreeBSD compat fixes;
maybe they can help?

   Lars

> I've run into an issue when running ./configure while trying to
> install Corosync. The process craps out at nss with this error:
> checking for nss... configure: error: in `/root/heartbeat/corosync-2.3.3':
> configure: error: The pkg-config script could not be found or is too
> old. Make sure it
> is in your PATH or set the PKG_CONFIG environment variable to the full
> path to pkg-config.
> Alternatively, you may set the environment variables nss_CFLAGS
> and nss_LIBS to avoid the need to call pkg-config.
> See the pkg-config man page for more details.
> 
> I've looked unsuccessfully for a package called pkg-config and nss
> appears to be installed as you can see from this output:
> root@wellesley:~/heartbeat/corosync-2.3.3 # pkg install nss
> Updating FreeBSD repository catalogue...
> FreeBSD repository is up-to-date.
> All repositories are up-to-date.
> Checking integrity... done (0 conflicting)
> The most recent version of packages are already installed
> 
> Anyway - just looking for any suggestions. Hoping that perhaps
> someone has successfully done this.
> 
> thanks in advance
> -mgb

-- 
: Lars Ellenberg
: http://www.LINBIT.com



Re: [ClusterLabs] Antw: Re: crmsh configure delete for constraints

2016-02-10 Thread Vladislav Bogdanov

10.02.2016 13:56, Ferenc Wágner wrote:
> Vladislav Bogdanov writes:
>
> > If Pacemaker got an error on start, it will run stop with the same
> > set of parameters anyway, and it will get the error again if that one
> > came from validation and the RA does not differentiate validation for
> > start and stop. And then circular fencing over the whole cluster is
> > triggered for no reason.
> >
> > Of course, for safety, the RA could save its state if start was
> > successful and skip validation on stop only if that state is not found.
> > Otherwise a removed binary or config file would result in the resource
> > running on several nodes.
>
> What would happen if we made the start operation return OCF_NOT_RUNNING
> if validation fails?

Well, then the cluster will try to start it again, and that could be
undesirable - what are OCF_ERR_INSTALLED and OCF_ERR_CONFIGURED for then?

> Or more broadly: if the start operation knows that
> the resource is not running, thus a stop operation would do no good.
> From Pacemaker Explained B.4: "The cluster will not attempt to stop a
> resource that returns this for any action."  The probes could still
> return OCF_ERR_CONFIGURED, putting real info into the logs, the stop
> failure could still lead to fencing, protecting data integrity, but
> circular fencing would not happen.  I hope.
>
> By the way, what are the reasons to run stop after a failed start?  To
> clean up halfway-started resources?  Besides OCF_ERR_GENERIC, the other
> error codes pretty much guarantee that the resource cannot be active.

That heavily depends on how a given RA is implemented...




[ClusterLabs] Antw: Re: Antw: Re: crmsh configure delete for constraints

2016-02-10 Thread Ulrich Windl
>>> Ferenc Wágner  schrieb am 10.02.2016 um 11:56 in Nachricht
<87mvr8n896@lant.ki.iif.hu>:
> Vladislav Bogdanov  writes:
> 
>> If Pacemaker got an error on start, it will run stop with the same
>> set of parameters anyway, and it will get the error again if that one
>> came from validation and the RA does not differentiate validation for
>> start and stop. And then circular fencing over the whole cluster is
>> triggered for no reason.
>>
>> Of course, for safety, the RA could save its state if start was
>> successful and skip validation on stop only if that state is not found.
>> Otherwise a removed binary or config file would result in the resource
>> running on several nodes.
> 
> What would happen if we made the start operation return OCF_NOT_RUNNING
> if validation fails?  Or more broadly: if the start operation knows that

I think this should NOT be done, because actually the RA doesn't know (most
likely). You are trying to reduce the impact of one problem by introducing
another problem (returning an incorrect exit code).

> the resource is not running, thus a stop opration would do no good.

If the configuration is NOT correct, the cluster should try neither to start
nor to stop the resource. Maybe the cluster should remember that bad state
until the operator does a cleanup of the problem.

> From Pacemaker Explained B.4: "The cluster will not attempt to stop a
> resource that returns this for any action."  The probes could still
> return OCF_ERR_CONFIGURED, putting real info into the logs, the stop
> failure could still lead to fencing, protecting data integrity, but
> circular fencing would not happen.  I hope.
> 
> By the way, what are the reasons to run stop after a failed start?  To

Probably because the start operation is not required to be atomic, that is,
the resource could be partially started. Stop ensures the resource is
completely stopped (or otherwise fencing will do that).

> clean up halfway-started resources?  Besides OCF_ERR_GENERIC, the other
> error codes pretty much guarantee that the resource cannot be active.
> -- 
> Regards,
> Feri.
> 



