[ClusterLabs] fence-agents v4.2.1

2018-05-31 Thread Oyvind Albrigtsen

ClusterLabs is happy to announce fence-agents v4.2.1, which is a
bugfix release for v4.2.0.

The source code is available at:
https://github.com/ClusterLabs/fence-agents/releases/tag/v4.2.1

The most significant enhancements in this release are:
- bugfixes and enhancements:
 - fence_scsi: fix a Python 3 encoding issue
 - xml-check: fail properly on incorrect or missing metadata

Everyone is encouraged to download and test the new release.
We do many regression tests and simulations, but we can't cover all
possible use cases, so your feedback is important and appreciated.

Many thanks to all the contributors to this release.


Best,
The fence-agents maintainers
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [questionnaire] Do you manage your pacemaker configuration by hand and (if so) what reusability features do you use?

2018-05-31 Thread Jan Pokorný
Hello,

I am soliciting feedback on the following CIB-feature-related questions.
Please reply (preferably on-list, so we have the shared collective
knowledge) if at least one of the questions is answered positively
in your case (just tick the respective "[ ]" boxes as "[x]").

Any other commentary is also welcome -- thank you in advance.

1.  [ ] Do you edit CIB by hand (as opposed to relying on crm/pcs or
their UI counterparts)?

2.  [ ] Do you use "template" based syntactic simplification[1] in CIB?

3.  [ ] Do you use "id-ref" based syntactic simplification[2] in CIB?

3.1 [ ] When positive about 3., would you mind much if "id-refs" got
unfolded/expanded during the "cibadmin --upgrade --force"
equivalent as a reliability/safety precaution?

4.  [ ] Do you use "tag" based syntactic grouping[3] in CIB?
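
(For readers who have not used these features, here is a minimal
illustrative CIB fragment -- all ids are hypothetical -- covering a
template (2.), an id-ref (3.), and a tag (4.). See the linked
documentation below for the authoritative syntax.)

```xml
<!-- Hypothetical CIB fragment illustrating the three features asked about. -->
<resources>
  <!-- 2. template: a shared resource definition -->
  <template id="db-template" class="ocf" provider="heartbeat" type="pgsqlms">
    <instance_attributes id="db-template-ia">
      <nvpair id="db-template-ia-pgdata" name="pgdata"
              value="/var/lib/postgresql/10/main"/>
    </instance_attributes>
  </template>
  <primitive id="db1" template="db-template">
    <meta_attributes id="common-meta">
      <nvpair id="common-meta-role" name="target-role" value="Started"/>
    </meta_attributes>
  </primitive>
  <primitive id="db2" template="db-template">
    <!-- 3. id-ref: reuse the nvset defined on db1 by reference -->
    <meta_attributes id-ref="common-meta"/>
  </primitive>
</resources>
<!-- 4. tag: group several configuration elements under one name -->
<tags>
  <tag id="all-databases">
    <obj_ref id="db1"/>
    <obj_ref id="db2"/>
  </tag>
</tags>
```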


(Some of these questions tangentially touch the topic of perhaps
excessively complex means of configuration that was raised during
the 2017's cluster summit.)

[1] 
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#_reusing_resource_definitions
[2] 
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#s-reusing-config-elements
[3] 
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#_tagging_configuration_elements

-- 
Jan (Poki)




Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

2018-05-31 Thread Casey & Gina
> There is no "master node" in pacemaker. There is a master/slave resource,
> so at best it is "the node on which a specific resource has the master
> role". And we have no way to know on which node your resource had the
> master role when you did it. Please be more specific, otherwise it is
> hard or even impossible to follow.

Well, my limited understanding is that there should be one node that's the 
master at any point in time.  I don't see how it makes sense to have resources 
with masters on different nodes in the same cluster.  I'm being as specific as 
I can given my limited knowledge.  I'm not a developer; just an admin trying to 
get a simple cluster up and running.  Years ago, I did this same thing with two 
nodes and heartbeat, and it was very easy.  Anyway, I guess I mean that I 
powered off the node that was the master for all resources at the time.

> Not specifically related to your problem but I wonder what is the
> difference. For all I know for master/slave "Started" == "Slave" so I'm
> surprised to see two different states listed here.

I also wondered about that, since from the PostgreSQL side there is one master 
and two standbys which are no different from one another.  But like you said, 
it didn't seem relevant to my problem.

> Well, apparently resource agent does not like crashed instance. It is
> quite possible, I have been working with another replicated database
> where it was necessary to manually fix configuration after failover,
> *outside* of pacemaker. Pacemaker simply failed to start resource which
> had unexpected state.

I can manually start up the database in standby mode, without any errors or 
special intervention/fixing whatsoever, as long as the replication logs have 
not gotten too far ahead on the new master.  If they have, I would need to 
rebuild the standby.
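
(For context: on PostgreSQL 10, "manually start up the database in standby
mode" means starting the instance with a recovery.conf in the data
directory. A sketch -- the host, user, and application_name values here are
hypothetical and must match your own replication setup and PAF template:)

```conf
# recovery.conf sketch for PostgreSQL 10 (hypothetical connection values);
# with this file in PGDATA the instance starts as a standby.
standby_mode = 'on'
primary_conninfo = 'host=d-gp2-dbpg0-1 user=replication application_name=d-gp2-dbpg0-2'
recovery_target_timeline = 'latest'
```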

> This needs someone familiar with this RA and application to answer.

The resource agent is PAF and I've seen a lot of others discussing this on this 
list, so I hope that I am asking in the right place.

> Note that it is not quite normal use case. You explicitly disabled any
> handling by RA, thus effectively not using pacemaker high availability
> at all. Does it fail over master if you do not unmanage resource and
> kill node where resource has master role?

I was following the specific instructions in the E-mail I was replying to, 
which asked me to unmanage the resource and try manual debugging steps.  As 
I've discussed in this thread (please review the previous E-mails on this 
thread for further information), pacemaker does fail over the master, but then 
when the former master node comes back online, if I do a `pcs cluster start` on 
it without first starting up the database by hand, it fails to start the PAF 
resource and pacemaker ends up fencing the node again.

I've been told that what PAF does on resource startup is exactly the same as 
the manual commands that I can run to make it work.  In the prior E-mails on 
this thread, I was told that the reason the resource startup fails is that 
the resource agent incorrectly determines that the resource is already 
running when it's not - so it never even tries to start the resource at all. 
The debug instructions I'm attempting to follow are an attempt to figure 
out what command it runs to determine this state.  Failing over to another 
node is only half the battle - the failed node should be able to rejoin the 
cluster without immediately being fenced when I try, shouldn't it?

>> --
>> root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-check 
>> warning: unpack_rsc_op_failure:Processing failed op monitor for 
>> postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>> warning: unpack_rsc_op_failure:Processing failed op monitor for 
>> postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>> Operation monitor for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
>>> stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match 
>>> (m//) at 
>>> /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 
>>> 392.
>>> stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and 
>>> greater
>> Error performing operation: Input/output error
>> --
> 
> This looks like a bug in your version.

Version of what?  I'm using the corosync, pacemaker, and pcs versions as 
provided by Ubuntu (for version 16.04), and resource-agents-paf as provided by 
the PGDG repository.

These versions are as follows:
* corosync - 2.3.5-3ubuntu2
* pacemaker - 1.1.14-2ubuntu1.3
* pcs - 0.9.149-1ubuntu1.1
* resource-agents-paf - 2.2.0-2.pgdg16.04+1

These are the latest packaged versions available for my platform, as far as I'm 
aware, and the same as I presume other Ubuntu users on this list are running.

Regards,
-- 
Casey

Re: [ClusterLabs] [questionnaire] Do you manage your pacemaker configuration by hand and (if so) what reusability features do you use?

2018-05-31 Thread Ken Gaillot
On Thu, 2018-05-31 at 14:48 +0200, Jan Pokorný wrote:
> Hello,
> 
> I am soliciting feedback on these CIB features related questions,
> please reply (preferably on-list so we have the shared collective
> knowledge) if at least one of the questions is answered positively
> in your case (just tick the respective "[ ]" boxes as "[x]").
> 
> Any other commentary also welcome -- thank you in advance.
> 
> 1.  [ ] Do you edit CIB by hand (as opposed to relying on crm/pcs or
> their UI counterparts)?

To clarify, crm shell supports both templates and id-ref, while pcs
does not.

> 2.  [ ] Do you use "template" based syntactic simplification[1] in
> CIB?
> 
> 3.  [ ] Do you use "id-ref" based syntactic simplification[2] in CIB?
> 
> 3.1 [ ] When positive about 3., would you mind much if "id-refs" got
> unfold/exploded during the "cibadmin --upgrade --force"
> equivalent as a reliability/safety precaution?

Regardless of whether anyone minds, we're not going to do it. It would
render the feature useless and force any user using it to either
abandon it or perform potentially massive manual edits to their CIB.

If the community feels that id-ref is not a useful feature, we can
deprecate it, and in some future release, drop support and
automatically expand it as part of the upgrade transform for that
release.

Otherwise we will continue full support for it.

> 4.  [ ] Do you use "tag" based syntactic grouping[3] in CIB?
> 
> 
> (Some of these questions tangentially touch the topic of perhaps
> excessively complex means of configuration that was raised during
> the 2017's cluster summit.)
> 
> [1] https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-si
> ngle/Pacemaker_Explained/index.html#_reusing_resource_definitions
> [2] https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-si
> ngle/Pacemaker_Explained/index.html#s-reusing-config-elements
> [3] https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-
> single/Pacemaker_Explained/index.html#_tagging_configuration_elements
-- 
Ken Gaillot 


Re: [ClusterLabs] [questionnaire] Do you manage your pacemaker configuration by hand and (if so) what reusability features do you use?

2018-05-31 Thread Jan Pokorný
On 31/05/18 11:42 -0500, Ken Gaillot wrote:
> On Thu, 2018-05-31 at 14:48 +0200, Jan Pokorný wrote:
>> I am soliciting feedback on these CIB features related questions,
>> please reply (preferably on-list so we have the shared collective
>> knowledge) if at least one of the questions is answered positively
>> in your case (just tick the respective "[ ]" boxes as "[x]").
>> 
>> Any other commentary also welcome -- thank you in advance.
>> 
>> 1.  [ ] Do you edit CIB by hand (as opposed to relying on crm/pcs or
>> their UI counterparts)?
> 
> To clarify, crm shell supports both templates and id-ref, while pcs
> does not.

No implications were intended, nor expressed.

I am (possibly we are) interested in the original question regardless
how other questions are answered -- please do so as the reply to the
original post, if you wish to participate.

-- 
Jan (Poki)




[ClusterLabs] Pacemaker 2.0.0-rc5 now available

2018-05-31 Thread Ken Gaillot
Since we had a few significant bug fixes, I decided to do one more
release candidate for Pacemaker version 2.0.0. If there are no serious
issues found with this one, the final will likely be released next
week.

Source code is available at:

https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.0.0-rc5

This is a bug fix release. The two main fixes are:

* Avoid unnecessary repeated recovery of resources that have "requires"
set to "quorum" or "nothing" (i.e. they can start elsewhere before
their previously active node is fenced; this includes fence devices and
Pacemaker Remote connection resources).

* Allow a monitor to be cancelled when its resource is unmanaged.
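
(Background for the first fix: "requires" is configured per-operation in
the CIB. A minimal sketch, with hypothetical ids and device type:)

```xml
<!-- Hypothetical fragment: a fence device whose start operation only
     requires quorum, so it can start elsewhere before the previously
     active node is fenced. -->
<primitive id="fence-ipmi" class="stonith" type="fence_ipmilan">
  <operations>
    <op id="fence-ipmi-start" name="start" interval="0" requires="quorum"/>
  </operations>
</primitive>
```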

The only known issue remaining to be resolved before final release is
some tweaking of the transform of pre-2.0 configurations after an
upgrade.
-- 
Ken Gaillot 


Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)

2018-05-31 Thread Jehan-Guillaume de Rorthais
Sorry for getting back to you so late.

On Fri, 25 May 2018 11:58:59 -0600
Casey & Gina  wrote:

> > On May 25, 2018, at 7:01 AM, Casey Allen Shobe 
> > wrote: 
> >> Actually, why is Pacemaker fencing the standby node just because a
> >> resource fails to start there?  I thought only the master should be fenced
> >> if it were assumed to be broken.  
> 
> This is probably the most important thing to ask outside of the PAF resource
> agent which many may not be as fluent with as pacemaker itself, and perhaps
> the most indicative of me setting something up incorrectly outside of that
> resource agent.
> 
> My understanding of fencing was that pacemaker would only fence a node if it
> was the master but had stopped responding, to avoid a split-brain situation.
> Why would pacemaker ever fence a standby node with no resources currently
> allocated to it?

So, as discussed on IRC and for the mailing list history, here is the answer:

https://clusterlabs.github.io/PAF/administration.html#failover

In short: after a failure (either on a primary or a standby), you MUST fix
things on the node before starting Pacemaker.

If you don't, PAF will detect something incoherent and raise an error, most
likely leading Pacemaker to fence your node again.

For instance, after a primary crash, you will have to resync it as a standby
with the new master before starting Pacemaker on the node and handing control
back to PAF. This is really important if you don't want to end up with a
silently corrupted standby in your cluster.

Cheers,


Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

2018-05-31 Thread Andrei Borzenkov
On 31.05.2018 19:20, Casey & Gina wrote:
>> There is no "master node" in pacemaker. There is a master/slave
>> resource, so at best it is "the node on which a specific resource has
>> the master role". And we have no way to know on which node your
>> resource had the master role when you did it. Please be more specific,
>> otherwise it is hard or even impossible to follow.
> 
> Well my limited understanding is that there should be one node that's
> the master at any point in time.  I don't see how it makes sense to
> have resources with masters on different nodes in the same clusters.

It is entirely possible and useful for different resources to have the
master role on different nodes at the same time. "Master" simply denotes
one of two possible states; it does not convey any additional semantics.

> I'm being as specific as I can given my limited knowledge.  I'm not a
> developer; just an admin trying to get a simple cluster up and
> running.  Years ago, I did this same thing with two nodes and
> heartbeat, and it was very easy.  Anyways, I guess I mean that I
> powered off the node that was the master for all resources at the
> time.
> 
>> Not specifically related to your problem but I wonder what is the 
>> difference. For all I know for master/slave "Started" == "Slave" so
>> I'm surprised to see two different states listed here.
> 
> I also wondered about that, since from the PostgreSQL, there is one
> master and two standbys which are no different from one another.  But
> like you said, it didn't seem relevant to my problem.
> 
>> Well, apparently resource agent does not like crashed instance. It
>> is quite possible, I have been working with another replicated
>> database where it was necessary to manually fix configuration after
>> failover, *outside* of pacemaker. Pacemaker simply failed to start
>> resource which had unexpected state.
> 
> I can manually start up the database in standby mode, without any
> errors or special intervention/fixing whatsoever, as long as the
> replication logs have not gotten too far ahead on the new master.  In
> that case I would need to rebuild the standby.
> 
>> This needs someone familiar with this RA and application to
>> answer.
> 
> The resource agent is PAF and I've seen a lot of others discussing
> this on this list, so I hope that I am asking in the right place.
> 

Sure, hopefully the right person chimes in.

>> Note that it is not quite normal use case. You explicitly disabled
>> any handling by RA, thus effectively not using pacemaker high
>> availability at all. Does it fail over master if you do not
>> unmanage resource and kill node where resource has master role?
> 
> I was following the specific instructions in the E-mail I was
> replying to, which asked me to unmanage the resource and try manual
> debugging steps.  As I've discussed in this thread (please review the
> previous E-mails on this thread for further information), pacemaker
> does fail over the master, but then when the former master node comes
> back online, if I do a `pcs cluster start` on it without manually
> starting up the database by hand, it fails to start the PAF resource
> and pacemaker ends up fencing the node again.
> 

Well, it means you now have a new primary database instance (which was
failed over by pacemaker) on node A, and the old primary database instance on
node B which you now start. On node B it remains primary because that
was the state in which the node was killed. It is quite logical that the
attempt to start the resource (and hence the database instance) fails.

A quick look at the PAF manual gives:

you need to rebuild the PostgreSQL instance on the failed node

Did you do it? I am not intimately familiar with Postgres, but in this
case I expect that you need to make the database on node B secondary (slave,
whatever it is called) to the new master on node A. That is exactly what I
described as "manually fixing configuration outside of pacemaker".

> I've been told that what PAF does on resource startup is exactly the
> same as the manual commands that I can do to make it work.  In the
> prior E-mails on this thread, I was told that the reason the resource
> startup fails is because the resource agent is incorrectly
> determining that the resource is already running when it's not - so
> it's never even trying to start the resource at all.  The debug
> instructions I'm attempting to follow are in an attempt to figure out
> what command it is running to determine this state.  Fail over to
> another node is only half the battle - the failed node should be able
> to rejoin the cluster without the cluster immediately fencing it when
> I try, shouldn't it?
> 
>>> -- root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV
>>> --force-check warning: unpack_rsc_op_failure:Processing
>>> failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2:
>>> master (failed) (9) warning: unpack_rsc_op_failure:
>>> Processing failed op monitor for postgresql-10-main:2 on
>>> d-gp2-dbpg0-2: master (failed) (9) Operation moni

Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)

2018-05-31 Thread Andrei Borzenkov
On 31.05.2018 22:18, Jehan-Guillaume de Rorthais wrote:
> Sorry for getting back to you so late.
> 
> On Fri, 25 May 2018 11:58:59 -0600
> Casey & Gina  wrote:
> 
>>> On May 25, 2018, at 7:01 AM, Casey Allen Shobe 
>>> wrote: 
 Actually, why is Pacemaker fencing the standby node just because a
 resource fails to start there?  I thought only the master should be fenced
 if it were assumed to be broken.  
>>
>> This is probably the most important thing to ask outside of the PAF resource
>> agent which many may not be as fluent with as pacemaker itself, and perhaps
>> the most indicative of me setting something up incorrectly outside of that
>> resource agent.
>>
>> My understanding of fencing was that pacemaker would only fence a node if it
>> was the master but had stopped responding, to avoid a split-brain situation.
>> Why would pacemaker ever fence a standby node with no resources currently
>> allocated to it?
> 
> So, as discussed on IRC and for the mailing list history, here is the answer:
> 
> https://clusterlabs.github.io/PAF/administration.html#failover
> 
> In short: after a failure (either on a primary or a standby), you MUST fix
> things on the node before starting Pacemaker.
> 
> If you don't, PAF will detect something incoherent and raise an error, leading
> Pacemaker to most likely fence your node, again.
> 

Well, that does not sound very polite to the user :)

Another database RA I mentioned somewhere in this thread takes a different
approach: it starts the database in its monitor action, and the start action
is effectively a dummy. So start always succeeds from pacemaker's point of
view, but the database won't be started until it is manually synchronized
again by the administrator.
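
(A sketch of that "dummy start" pattern, for illustration only -- this is
not a real resource agent; the helper functions and file paths are
hypothetical stubs standing in for real database checks:)

```shell
# Illustrative "dummy start" RA pattern: start only records intent,
# monitor does the actual starting and downgrades failure to "not running".
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

STATE_FILE=/tmp/demo-ra.state          # "pacemaker wants this up" marker
DB_PID_FILE=/tmp/demo-db.pid           # stand-in for a real pid check

database_is_running() { [ -f "$DB_PID_FILE" ]; }
try_start_database()  { return 1; }    # stub: fails until admin resyncs

# start: never touches the database, so it always succeeds
ra_start() {
    touch "$STATE_FILE"
    return $OCF_SUCCESS
}

# monitor: attempts the real start; while the database still needs manual
# intervention it reports "not running" rather than a hard error
ra_monitor() {
    if database_is_running; then
        return $OCF_SUCCESS
    fi
    if [ -f "$STATE_FILE" ]; then
        try_start_database && return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}
```

The downside Andrei describes follows directly: pacemaker sees start
succeed and monitor report "not running", not the database's true state.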

The downside is that the pacemaker resource status does not reflect the
database status. I wish pacemaker supported something like a "requires manual
intervention" resource state that would not be treated as an error
(causing all sorts of fatal consequences) but would still be evaluated for
dependencies (i.e. dependent resources would not be started). That would
be ideal for such a case.



Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)

2018-05-31 Thread Jehan-Guillaume de Rorthais
On Thu, 31 May 2018 22:52:12 +0300
Andrei Borzenkov  wrote:

> On 31.05.2018 22:18, Jehan-Guillaume de Rorthais wrote:
> > Sorry for getting back to you so late.
> > 
> > On Fri, 25 May 2018 11:58:59 -0600
> > Casey & Gina  wrote:
> >   
> >>> On May 25, 2018, at 7:01 AM, Casey Allen Shobe 
> >>> wrote:   
>  Actually, why is Pacemaker fencing the standby node just because a
>  resource fails to start there?  I thought only the master should be
>  fenced if it were assumed to be broken.
> >>
> >> This is probably the most important thing to ask outside of the PAF
> >> resource agent which many may not be as fluent with as pacemaker itself,
> >> and perhaps the most indicative of me setting something up incorrectly
> >> outside of that resource agent.
> >>
> >> My understanding of fencing was that pacemaker would only fence a node if
> >> it was the master but had stopped responding, to avoid a split-brain
> >> situation. Why would pacemaker ever fence a standby node with no resources
> >> currently allocated to it?  
> > 
> > So, as discussed on IRC and for the mailing list history, here is the
> > answer:
> > 
> > https://clusterlabs.github.io/PAF/administration.html#failover
> > 
> > In short: after a failure (either on a primary or a standby), you MUST fix
> > things on the node before starting Pacemaker.
> > 
> > If you don't, PAF will detect something incoherent and raise an error,
> > leading Pacemaker to most likely fence your node, again.
> >   
> 
> Well, that does not sound very polite to user :)

Sure :)

But at least it's documented, as you pointed out earlier.

After a failure and an automatic failover, either you have some automatic
failback process somewhere... or you have to fix things yourself.

PAF is not able to do automatic failback.

> Another database RA I mentioned somewhere in this thread has different
> approach - it starts database in its monitor action and start action is
> effectively dummy.

Mh, I would have to study that. But I'm not thrilled about such behavior at
first glance.

> So start always succeeds from pacemaker point of
> view, but database won't be started until manually synchronized again by
> administrator.

It seems scary... What about the stop action? What if the monitor detects an
error? Well, I really should check this RA you are talking about to answer my
questions.

> Downside is that pacemaker resource status does not reflect database
> status. I wish pacemaker supported something like "requires manual
> intervention" resource state that would not be treated like error
> (causing all sorts of fatal consequences) but still evaluated for
> dependencies (i.e. dependent resources would not be started). That would
> be ideal for such case.

Good idea.

I have a couple more:
* handling errors from notify actions
* supporting migrate-to/from for multistate RAs
* having a real infinite master score :)

Cheers,


Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)

2018-05-31 Thread Ken Gaillot
On Thu, 2018-05-31 at 22:43 +0200, Jehan-Guillaume de Rorthais wrote:
> On Thu, 31 May 2018 22:52:12 +0300
> Andrei Borzenkov  wrote:
> 
> > On 31.05.2018 22:18, Jehan-Guillaume de Rorthais wrote:
> > > Sorry for getting back to you so late.
> > > 
> > > On Fri, 25 May 2018 11:58:59 -0600
> > > Casey & Gina  wrote:
> > >   
> > > > > On May 25, 2018, at 7:01 AM, Casey Allen Shobe
> > > > > wrote:
> > > > > > Actually, why is Pacemaker fencing the standby node just
> > > > > > because a
> > > > > > resource fails to start there?  I thought only the master
> > > > > > should be
> > > > > > fenced if it were assumed to be broken.
> > > > 
> > > > This is probably the most important thing to ask outside of the
> > > > PAF
> > > > resource agent which many may not be as fluent with as
> > > > pacemaker itself,
> > > > and perhaps the most indicative of me setting something up
> > > > incorrectly
> > > > outside of that resource agent.
> > > > 
> > > > My understanding of fencing was that pacemaker would only fence
> > > > a node if
> > > > it was the master but had stopped responding, to avoid a split-
> > > > brain
> > > > situation. Why would pacemaker ever fence a standby node with
> > > > no resources
> > > > currently allocated to it?  
> > > 
> > > So, as discussed on IRC and for the mailing list history, here is
> > > the
> > > answer:
> > > 
> > > https://clusterlabs.github.io/PAF/administration.html#failover
> > > 
> > > In short: after a failure (either on a primary or a standby), you
> > > MUST fix
> > > things on the node before starting Pacemaker.
> > > 
> > > If you don't, PAF will detect something incoherent and raise an
> > > error,
> > > leading Pacemaker to most likely fence your node, again.
> > >   
> > 
> > Well, that does not sound very polite to user :)
> 
> Sure :)
> 
> But at least, It's been documented as you pointed earlier.
> 
> After a failure and an automatic failover, either you have some
> automatic
> failback process somewhere...or you have to fix some things around.
> 
> PAF is not able to do automatic failback.
> 
> > Another database RA I mentioned somewhere in this thread has
> > different
> > approach - it starts database in its monitor action and start
> > action is
> > effectively dummy.
> 
> Mh, I would have to study that. But I'm not thrill about such
> behavior at a
> first look.
> 
> > So start always succeeds from pacemaker point of
> > view, but database won't be started until manually synchronized
> > again by
> > administrator.
> 
> It seems scary...What about the stop action? What if the monitor
> detect an
> error? Well, I really should check this RA you are talking about to
> answer my
> questions.
> 
> > Downside is that pacemaker resource status does not reflect
> > database
> > status. I wish pacemaker supported something like "requires manual
> > intervention" resource state that would not be treated like error
> > (causing all sorts of fatal consequences) but still evaluated for
> > dependencies (i.e. dependent resources would not be started). That
> > would
> > be ideal for such case.

I'm not clear what such a result would mean. Is the goal to stop
dependent resources, but not the resource itself? And/or to block all
further management of the resource?

> Good idea.
> 
> I have a couple more:
> * handling errors from notify actions

I could imagine notify supporting on-fail, defaulting to ignore. Would
that do what you want? Should notify errors count toward the resource
fail count?

> * supporting migrate-to/from for multistate RA
> * having real infinite master score :)

What behavior isn't supported by current infinity?

> 
> Cheers,
-- 
Ken Gaillot 


Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

2018-05-31 Thread Casey & Gina
> Quick look at PAF manual gives
> 
> you need to rebuild the PostgreSQL instance on the failed node
> 
> did you do it? I am not intimately familiar with Postgres, but in this
> case I expect that you need to make database on node B secondary (slave,
> whatever it is called) to new master on node A. That is exactly what I
> described as "manually fixing configuration outside of pacemaker".

I did not see this prior to today, but was pointed to it a little while ago.  
I did not realize that this would be necessary, so I have written a script to 
rebuild the db and then do the `pcs cluster start` afterwards, which I'll make 
part of our standard recovery procedure.
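
(For the archives, such a rebuild script might look roughly like the sketch
below. The hostname, replication user, and paths are hypothetical and must
be adapted; DRY_RUN defaults to 1 so the sketch only prints what it would
do rather than touching anything.)

```shell
#!/bin/sh
# Sketch of a standby-rebuild helper for the procedure described above.
# Hostnames, users, and paths are hypothetical; set DRY_RUN=0 to actually run.
NEW_PRIMARY=${NEW_PRIMARY:-d-gp2-dbpg0-1}
PGDATA=${PGDATA:-/var/lib/postgresql/10/main}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

rebuild_standby() {
    # 1. Make sure the crashed instance is down.
    run pg_ctlcluster 10 main stop
    # 2. Discard the old data directory and resync from the new primary
    #    (-R writes a recovery.conf so the copy comes up as a standby).
    run rm -rf "$PGDATA"
    run pg_basebackup -h "$NEW_PRIMARY" -U replication -D "$PGDATA" -X stream -R
    # 3. Only now hand the node back to the cluster.
    run pcs cluster start
}

rebuild_standby
```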

I guess I expected that pacemaker would be able to handle this case 
automatically - if the resource agent reported a resource in a 
potentially-corrupt state, pacemaker could then call the resource agent to 
start the rebuild.  But there are probably some reasons that's not a great 
idea, and I think that I understand things enough now to be confident in just 
using a custom script for this purpose when necessary.

When I set up clusters in the past with heartbeat, I had put the database on a 
DRBD partition, so this simplified matters since there was never a possibility 
of some new writes to the master not yet being replicated to the slave.  In 
development testing, I found that I did not need to rebuild the database, just 
start it up manually in slave mode.  But now that I've thought this through 
better, I realize that in a production environment, should the master crash, it 
is quite likely that it will have some data that has not yet replicated to the 
slaves, so it could not cleanly come up as a standby since it would have some 
data that was too new.

> pacemaker is too old. The error most likely comes from missing
> OCF_RESKEY_crm_feature_set which is exported by crm_resource starting
> with 1.1.17. I am not that familiar with debian packaging, but I'd
> expect resource-agents-paf require suitable pacemaker version. Of course
> Ubuntu package may be patched to include necessary code ...

I'm not sure why that would be - the resource agent works fine with this 
version of pacemaker, and according to 
https://github.com/ClusterLabs/PAF/releases, it only requires pacemaker 
>=1.1.13.  I think that something is wrong with the command that I was trying 
to run, as pacemaker 1.1.14 successfully uses this resource agent to 
start/stop/monitor the service generally speaking, outside of the manual 
debugging context.

Thank you!
-- 
Casey


Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)

2018-05-31 Thread Casey & Gina
> Well, that does not sound very polite to user :)

The thing that really threw me off was pacemaker rebooting the node as soon as 
I'd try to start the cluster on it without the database running.

Is there a way to prevent this from happening?  Some way to indicate to 
Pacemaker, "Hey, I'm not willing/able to start the resource here because it 
appears to be in a corrupt state", while not causing the node to be fenced 
because it thinks that the resource is running when it isn't?

It would be perfectly safe to not fence the node, in this case...

-- 
Casey