Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

2018-06-04 Thread Andrei Borzenkov
04.06.2018 18:53, Casey & Gina wrote:
>> There are different code paths when the RA is called automatically by
>> the resource manager and when the RA is called manually by
>> crm_resource. The latter did not export this environment variable
>> until 1.1.17. So the documentation is correct in that you do not need
>> 1.1.17 to use the RA normally, as part of a pacemaker configuration.
> 
> Okay, got it.
> 
>> You should be able to work around it (should you ever need to manually
>> trigger actions with crm_resource) by exporting this environment
>> variable yourself before calling crm_resource.
> 
> Awesome, thanks!  How would I know what to set the variable value to?
> 

That's the value of the crm_feature_set CIB attribute; for example:


...

You can use cibadmin -Q to display the current CIB.
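To illustrate, a minimal sketch of pulling the value out of the CIB with standard tools. The sample CIB line below is invented for demonstration; on a live node you would pipe the output of `cibadmin -Q` in instead of the echo:

```shell
# Hypothetical sample of the CIB root element; the attribute values here
# are made up for demonstration. On a real node you would run:
#   cibadmin -Q | head -n1
sample_cib='<cib crm_feature_set="3.0.10" validate-with="pacemaker-2.4" epoch="5" num_updates="0">'
# Extract just the crm_feature_set value.
echo "$sample_cib" | sed -n 's/.*crm_feature_set="\([^"]*\)".*/\1/p'
```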
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

2018-06-04 Thread Casey & Gina
> There are different code paths when the RA is called automatically by
> the resource manager and when the RA is called manually by
> crm_resource. The latter did not export this environment variable
> until 1.1.17. So the documentation is correct in that you do not need
> 1.1.17 to use the RA normally, as part of a pacemaker configuration.

Okay, got it.

> You should be able to work around it (should you ever need to manually
> trigger actions with crm_resource) by exporting this environment
> variable yourself before calling crm_resource.

Awesome, thanks!  How would I know what to set the variable value to?

Best wishes,
-- 
Casey



Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

2018-06-01 Thread Andrei Borzenkov
On Fri, Jun 1, 2018 at 12:22 AM, Casey & Gina  wrote:
>
>> pacemaker is too old. The error most likely comes from the missing
>> OCF_RESKEY_crm_feature_set, which is exported by crm_resource starting
>> with 1.1.17. I am not that familiar with Debian packaging, but I'd
>> expect resource-agents-paf to require a suitable pacemaker version. Of
>> course the Ubuntu package may be patched to include the necessary
>> code ...
>
> I'm not sure why that would be - the resource agent works fine with this 
> version of pacemaker, and according to 
> https://github.com/ClusterLabs/PAF/releases, it only requires pacemaker 
> >=1.1.13.  I think that something is wrong with the command that I was trying 
> to run, as pacemaker 1.1.14 successfully uses this resource agent to 
> start/stop/monitor the service generally speaking, outside of the manual 
> debugging context.

There are different code paths when the RA is called automatically by
the resource manager and when the RA is called manually by crm_resource.
The latter did not export this environment variable until 1.1.17. So the
documentation is correct in that you do not need 1.1.17 to use the RA
normally, as part of a pacemaker configuration.

You should be able to work around it (should you ever need to manually
trigger actions with crm_resource) by exporting this environment
variable yourself before calling crm_resource.
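For example, a hedged sketch of that workaround. The feature-set value below is an assumption for illustration; read the real one from your own CIB via `cibadmin -Q`:

```shell
# Export the variable the RA expects before invoking crm_resource by hand.
# "3.0.10" is an assumed example value, not necessarily yours; take the
# real one from the crm_feature_set attribute in `cibadmin -Q` output.
export OCF_RESKEY_crm_feature_set="3.0.10"
echo "$OCF_RESKEY_crm_feature_set"
# Then the manual action should get past the version check, e.g.:
#   crm_resource -r postgresql-ha -VV --force-check
```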


Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

2018-05-31 Thread Casey & Gina
> A quick look at the PAF manual gives
> 
> you need to rebuild the PostgreSQL instance on the failed node
> 
> Did you do it? I am not intimately familiar with Postgres, but in this
> case I expect that you need to make the database on node B a secondary
> (slave, whatever it is called) to the new master on node A. That is
> exactly what I described as "manually fixing configuration outside of
> pacemaker".

I did not see this prior to today, but was pointed to it a little while ago.  
I did not realize that this would be necessary, so I have written a script to 
rebuild the database and then run `pcs cluster start` afterwards, which I'll 
make part of our standard recovery procedure.
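Such a rebuild script might look something like the rough sketch below. The hostnames, paths, and replication user are assumptions drawn from this thread, and the destructive steps are left commented since they must only run on the failed node:

```shell
#!/bin/sh
# Rough sketch of a standby rebuild on a fenced ex-primary. All names
# here are assumptions based on this thread; adapt before any real use.
set -e
PGDATA=/var/lib/postgresql/10/main
NEW_PRIMARY=d-gp2-dbpg0-2
echo "would rebuild $PGDATA from $NEW_PRIMARY"
# 1. Make sure the local instance is down, then move the stale data aside:
#      mv "$PGDATA" "$PGDATA.old.$(date +%s)"
# 2. Re-clone from the new primary (-R writes a recovery.conf for standby):
#      pg_basebackup -h "$NEW_PRIMARY" -U replication -D "$PGDATA" -X stream -R
# 3. Rejoin the cluster; PAF should then start the instance as a standby:
#      pcs cluster start
```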

I guess I expected that pacemaker would be able to handle this case 
automatically - if the resource agent reported a resource in a 
potentially-corrupt state, pacemaker could then call the resource agent to 
start the rebuild.  But there are probably some reasons that's not a great 
idea, and I think that I understand things enough now to be confident in just 
using a custom script for this purpose when necessary.

When I set up clusters in the past with heartbeat, I had put the database on a 
DRBD partition.  This simplified matters, since there was never a possibility 
of new writes to the master not yet being replicated to the slave.  In 
development testing, I found that I did not need to rebuild the database, just 
start it up manually in slave mode.  But now that I've thought this through 
better, I realize that in a production environment, should the master crash, it 
is quite likely to have some data that has not yet replicated to the slaves, 
so it could not cleanly come up as a standby since some of its data would be 
too new.

> pacemaker is too old. The error most likely comes from the missing
> OCF_RESKEY_crm_feature_set, which is exported by crm_resource starting
> with 1.1.17. I am not that familiar with Debian packaging, but I'd
> expect resource-agents-paf to require a suitable pacemaker version. Of
> course the Ubuntu package may be patched to include the necessary
> code ...

I'm not sure why that would be - the resource agent works fine with this 
version of pacemaker, and according to 
https://github.com/ClusterLabs/PAF/releases, it only requires pacemaker 
>=1.1.13.  I think that something is wrong with the command that I was trying 
to run, as pacemaker 1.1.14 successfully uses this resource agent to 
start/stop/monitor the service generally speaking, outside of the manual 
debugging context.

Thank you!
-- 
Casey


Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

2018-05-31 Thread Andrei Borzenkov
31.05.2018 19:20, Casey & Gina wrote:
>> There is no "master node" in pacemaker. There is a master/slave
>> resource, so at best it is "the node on which a specific resource has
>> the master role". And we have no way to know on which node your
>> resource had the master role when you did it. Please be more specific;
>> otherwise it is hard to impossible to follow.
> 
> Well my limited understanding is that there should be one node that's
> the master at any point in time.  I don't see how it makes sense to
> have resources with masters on different nodes in the same cluster.

It is entirely possible and useful for different resources to have the
master role on different nodes at the same time. "Master" simply denotes
one of two possible states; it does not convey any additional semantics.

> I'm being as specific as I can given my limited knowledge.  I'm not a
> developer; just an admin trying to get a simple cluster up and
> running.  Years ago, I did this same thing with two nodes and
> heartbeat, and it was very easy.  Anyways, I guess I mean that I
> powered off the node that was the master for all resources at the
> time.
> 
>> Not specifically related to your problem, but I wonder what the
>> difference is. For all I know, for master/slave "Started" == "Slave",
>> so I'm surprised to see two different states listed here.
> 
> I also wondered about that, since from the PostgreSQL, there is one
> master and two standbys which are no different from one another.  But
> like you said, it didn't seem relevant to my problem.
> 
>> Well, apparently the resource agent does not like a crashed instance.
>> It is quite possible; I have been working with another replicated
>> database where it was necessary to manually fix the configuration
>> after failover, *outside* of pacemaker. Pacemaker simply failed to
>> start a resource which had an unexpected state.
> 
> I can manually start up the database in standby mode, without any
> errors or special intervention/fixing whatsoever, as long as the
> replication logs have not gotten too far ahead on the new master.  In
> that case I would need to rebuild the standby.
> 
>> This needs someone familiar with this RA and application to
>> answer.
> 
> The resource agent is PAF and I've seen a lot of others discussing
> this on this list, so I hope that I am asking in the right place.
> 

Sure, hopefully the right person chimes in.

>> Note that it is not quite a normal use case. You explicitly disabled
>> any handling by the RA, thus effectively not using pacemaker high
>> availability at all. Does it fail over the master if you do not
>> unmanage the resource and kill the node where the resource has the
>> master role?
> 
> I was following the specific instructions in the E-mail I was
> replying to, which asked me to unmanage the resource and try manual
> debugging steps.  As I've discussed in this thread (please review the
> previous E-mails on this thread for further information), pacemaker
> does fail over the master, but then when the former master node comes
> back online, if I do a `pcs cluster start` on it without manually
> starting up the database by hand, it fails to start the PAF resource
> and pacemaker ends up fencing the node again.
> 

Well, it means you now have a new primary database instance (which was
failed over by pacemaker) on node A, and the old primary database
instance on node B, which you are now starting. On node B it remains a
primary because that was the state the node was in when it was killed.
It is quite logical that the attempt to start the resource (and hence
the database instance) fails.

A quick look at the PAF manual gives

you need to rebuild the PostgreSQL instance on the failed node

Did you do it? I am not intimately familiar with Postgres, but in this
case I expect that you need to make the database on node B a secondary
(slave, whatever it is called) to the new master on node A. That is
exactly what I described as "manually fixing configuration outside of
pacemaker".

> I've been told that what PAF does on resource startup is exactly the
> same as the manual commands that I can do to make it work.  In the
> prior E-mails on this thread, I was told that the reason the resource
> startup fails is because the resource agent is incorrectly
> determining that the resource is already running when it's not - so
> it's never even trying to start the resource at all.  The debug
> instructions I'm attempting to follow are in an attempt to figure out
> what command it is running to determine this state.  Fail over to
> another node is only half the battle - the failed node should be able
> to rejoin the cluster without the cluster immediately fencing it when
> I try, shouldn't it?
> 
>>> --
>>> root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-check
>>> warning: unpack_rsc_op_failure:Processing failed op monitor for
>>> postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>>> warning: unpack_rsc_op_failure:Processing failed op monitor for
>>> postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>>> Operation 

Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

2018-05-31 Thread Casey & Gina
> There is no "master node" in pacemaker. There is a master/slave
> resource, so at best it is "the node on which a specific resource has
> the master role". And we have no way to know on which node your
> resource had the master role when you did it. Please be more specific;
> otherwise it is hard to impossible to follow.

Well my limited understanding is that there should be one node that's the 
master at any point in time.  I don't see how it makes sense to have resources 
with masters on different nodes in the same cluster.  I'm being as specific as 
I can given my limited knowledge.  I'm not a developer; just an admin trying to 
get a simple cluster up and running.  Years ago, I did this same thing with two 
nodes and heartbeat, and it was very easy.  Anyways, I guess I mean that I 
powered off the node that was the master for all resources at the time.

> Not specifically related to your problem, but I wonder what the
> difference is. For all I know, for master/slave "Started" == "Slave",
> so I'm surprised to see two different states listed here.

I also wondered about that, since from the PostgreSQL, there is one master and 
two standbys which are no different from one another.  But like you said, it 
didn't seem relevant to my problem.

> Well, apparently the resource agent does not like a crashed instance.
> It is quite possible; I have been working with another replicated
> database where it was necessary to manually fix the configuration after
> failover, *outside* of pacemaker. Pacemaker simply failed to start a
> resource which had an unexpected state.

I can manually start up the database in standby mode, without any errors or 
special intervention/fixing whatsoever, as long as the replication logs have 
not gotten too far ahead on the new master.  In that case I would need to 
rebuild the standby.

> This needs someone familiar with this RA and application to answer.

The resource agent is PAF and I've seen a lot of others discussing this on this 
list, so I hope that I am asking in the right place.

> Note that it is not quite a normal use case. You explicitly disabled
> any handling by the RA, thus effectively not using pacemaker high
> availability at all. Does it fail over the master if you do not
> unmanage the resource and kill the node where the resource has the
> master role?

I was following the specific instructions in the E-mail I was replying to, 
which asked me to unmanage the resource and try manual debugging steps.  As 
I've discussed in this thread (please review the previous E-mails on this 
thread for further information), pacemaker does fail over the master, but then 
when the former master node comes back online, if I do a `pcs cluster start` on 
it without manually starting up the database by hand, it fails to start the PAF 
resource and pacemaker ends up fencing the node again.

I've been told that what PAF does on resource startup is exactly the same as 
the manual commands that I can do to make it work.  In the prior E-mails on 
this thread, I was told that the reason the resource startup fails is because 
the resource agent is incorrectly determining that the resource is already 
running when it's not - so it's never even trying to start the resource at all. 
 The debug instructions I'm attempting to follow are in an attempt to figure 
out what command it is running to determine this state.  Fail over to another 
node is only half the battle - the failed node should be able to rejoin the 
cluster without the cluster immediately fencing it when I try, shouldn't it?

>> --
>> root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-check 
>> warning: unpack_rsc_op_failure:Processing failed op monitor for 
>> postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>> warning: unpack_rsc_op_failure:Processing failed op monitor for 
>> postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>> Operation monitor for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
>>> stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match 
>>> (m//) at 
>>> /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 
>>> 392.
>>> stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and 
>>> greater
>> Error performing operation: Input/output error
>> --
> 
> This looks like a bug in your version.

Version of what?  I'm using the corosync, pacemaker, and pcs versions as 
provided by Ubuntu (for version 16.04), and resource-agents-paf as provided by 
the PGDG repository.

These versions are as follows:
* corosync - 2.3.5-3ubuntu2
* pacemaker - 1.1.14-2ubuntu1.3
* pcs - 0.9.149-1ubuntu1.1
* resource-agents-paf - 2.2.0-2.pgdg16.04+1

These are the latest packaged versions available for my platform, as far as I'm 
aware, and the same as I presume other Ubuntu users on this list are running.

Regards,
-- 
Casey

Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

2018-05-30 Thread Andrei Borzenkov
31.05.2018 01:30, Casey & Gina wrote:
>> In this case, the agent is returning "master (failed)", which does not
>> mean that it previously failed when it was master -- it means it is
>> currently running as master, in a failed condition.
> 
> Well, it surely is NOT running.  So the likely problem is the way it's doing 
> this check?  I see a lot of people here using PAF - I'd be surprised if such 
> a bug weren't discovered already...
> 
>> Stopping an already stopped service does not return an error -- here,
>> the agent is saying it was unable to demote or stop a running instance.
> 
> I still don't understand.  There are *NO* postgres processes running on the 
> node, no additions to the log file.  Nothing whatsoever that supports the 
> notion that it's a running instance.
> 
>> Unfortunately clustering has some inherent complexity that gives it a
>> steep learning curve. On top of that, logging/troubleshooting
>> improvements are definitely an area of ongoing need in pacemaker. The
>> good news is that once a cluster is running successfully, it's usually
>> smooth sailing after that.
> 
> I hope so...  I just don't see what I'm doing that's outside of the standard 
> box.  I've set up PAF following its instructions.  I see that others here 
> are using it.  Hasn't anybody else gotten such a setup working already?  I 
> would think this is a pretty standard failure case that anybody would test if 
> they've set up a cluster...  In any case, I'll keep persisting as long as I 
> can...on to debugging...
> 
>> You can debug like this:
>>
>> 1. Unmanage the resource in pacemaker, so you can mess with it
>> manually.
>>
>> 2. Cause the desired failure for testing. Pacemaker should detect the
>> failure, but not do anything about it.
> 
> I executed `pcs resource unmanage postgresql-ha`, and then powered off the 
> master node.

There is no "master node" in pacemaker. There is a master/slave
resource, so at best it is "the node on which a specific resource has
the master role". And we have no way to know on which node your resource
had the master role when you did it. Please be more specific; otherwise
it is hard to impossible to follow.

>  The fencing kicked in and restarted the node.  After the node rebooted, I 
> issued a `pcs cluster start` on it as the crm_resource command complained 
> about the CIB without doing that.
> 
> I then ended up seeing this:
> 
> --
>  vfencing   (stonith:external/vcenter): Started d-gp2-dbpg0-1
>  postgresql-master-vip  (ocf::heartbeat:IPaddr2):   Started d-gp2-dbpg0-2
>  Master/Slave Set: postgresql-ha [postgresql-10-main] (unmanaged)
>  postgresql-10-main (ocf::heartbeat:pgsqlms):   Started d-gp2-dbpg0-3 
> (unmanaged)
>  postgresql-10-main (ocf::heartbeat:pgsqlms):   Slave d-gp2-dbpg0-1 
> (unmanaged)

Not specifically related to your problem, but I wonder what the
difference is. For all I know, for master/slave "Started" == "Slave", so
I'm surprised to see two different states listed here.


>  postgresql-10-main (ocf::heartbeat:pgsqlms):   FAILED Master 
> d-gp2-dbpg0-2 (unmanaged)
> 
> Failed Actions:
> * postgresql-10-main_monitor_0 on d-gp2-dbpg0-2 'master (failed)' (9): 
> call=14, status=complete, exitreason='Instance "postgresql-10-main" 
> controldata indicates a running primary instance, the instance has probably 
> crashed',

Well, apparently the resource agent does not like a crashed instance. It
is quite possible; I have been working with another replicated database
where it was necessary to manually fix the configuration after failover,
*outside* of pacemaker. Pacemaker simply failed to start a resource
which had an unexpected state.

This needs someone familiar with this RA and application to answer.

Note that it is not quite a normal use case. You explicitly disabled any
handling by the RA, thus effectively not using pacemaker high
availability at all. Does it fail over the master if you do not unmanage
the resource and kill the node where the resource has the master role?

> last-rc-change='Wed May 30 22:18:16 2018', queued=0ms, exec=190ms
> * postgresql-10-main_monitor_15000 on d-gp2-dbpg0-2 'master (failed)' (9): 
> call=16, status=complete, exitreason='Instance "postgresql-10-main" 
> controldata indicates a running primary instance, the instance has probably 
> crashed',
> last-rc-change='Wed May 30 22:18:16 2018', queued=0ms, exec=138ms
> --
> 
>> 3. Run crm_resource with the -VV option and --force-* with whatever
>> action you want to attempt (in this case, demote or stop). The -VV (aka
>> --verbose --verbose) will turn on OCF_TRACE_RA. The --force-* command
>> will read the resource configuration and do the same thing pacemaker
>> would do to execute the command.
> 
> I thought that I would want to see what the "check" is doing to do the check, 
> since you're telling me that it thinks the service is running when it's 
> definitely not.  I tried the following command which didn't work (am I doing 
> something wrong?):
> 
> --
> 

Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

2018-05-30 Thread Casey & Gina
> In this case, the agent is returning "master (failed)", which does not
> mean that it previously failed when it was master -- it means it is
> currently running as master, in a failed condition.

Well, it surely is NOT running.  So the likely problem is the way it's doing 
this check?  I see a lot of people here using PAF - I'd be surprised if such a 
bug weren't discovered already...

> Stopping an already stopped service does not return an error -- here,
> the agent is saying it was unable to demote or stop a running instance.

I still don't understand.  There are *NO* postgres processes running on the 
node, no additions to the log file.  Nothing whatsoever that supports the 
notion that it's a running instance.

> Unfortunately clustering has some inherent complexity that gives it a
> steep learning curve. On top of that, logging/troubleshooting
> improvements are definitely an area of ongoing need in pacemaker. The
> good news is that once a cluster is running successfully, it's usually
> smooth sailing after that.

I hope so...  I just don't see what I'm doing that's outside of the standard 
box.  I've set up PAF following its instructions.  I see that others here are 
using it.  Hasn't anybody else gotten such a setup working already?  I would 
think this is a pretty standard failure case that anybody would test if they've 
set up a cluster...  In any case, I'll keep persisting as long as I can...on to 
debugging...

> You can debug like this:
> 
> 1. Unmanage the resource in pacemaker, so you can mess with it
> manually.
> 
> 2. Cause the desired failure for testing. Pacemaker should detect the
> failure, but not do anything about it.

I executed `pcs resource unmanage postgresql-ha`, and then powered off the 
master node.  The fencing kicked in and restarted the node.  After the node 
rebooted, I issued a `pcs cluster start` on it as the crm_resource command 
complained about the CIB without doing that.

I then ended up seeing this:

--
 vfencing   (stonith:external/vcenter): Started d-gp2-dbpg0-1
 postgresql-master-vip  (ocf::heartbeat:IPaddr2):   Started d-gp2-dbpg0-2
 Master/Slave Set: postgresql-ha [postgresql-10-main] (unmanaged)
 postgresql-10-main (ocf::heartbeat:pgsqlms):   Started d-gp2-dbpg0-3 
(unmanaged)
 postgresql-10-main (ocf::heartbeat:pgsqlms):   Slave d-gp2-dbpg0-1 
(unmanaged)
 postgresql-10-main (ocf::heartbeat:pgsqlms):   FAILED Master 
d-gp2-dbpg0-2 (unmanaged)

Failed Actions:
* postgresql-10-main_monitor_0 on d-gp2-dbpg0-2 'master (failed)' (9): call=14, 
status=complete, exitreason='Instance "postgresql-10-main" controldata 
indicates a running primary instance, the instance has probably crashed',
last-rc-change='Wed May 30 22:18:16 2018', queued=0ms, exec=190ms
* postgresql-10-main_monitor_15000 on d-gp2-dbpg0-2 'master (failed)' (9): 
call=16, status=complete, exitreason='Instance "postgresql-10-main" controldata 
indicates a running primary instance, the instance has probably crashed',
last-rc-change='Wed May 30 22:18:16 2018', queued=0ms, exec=138ms
--
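For what it's worth, the "controldata" in that exit reason refers to pg_controldata output: after a crash, the "Database cluster state" field is left at "in production" rather than "shut down", which is presumably what the agent keys on. A sketch of checking it (the sample line below is assumed output; on the node itself you would run pg_controldata against the data directory):

```shell
# Assumed one-line excerpt of pg_controldata output after a primary crash;
# on the real node you would run something like:
#   /usr/lib/postgresql/10/bin/pg_controldata /var/lib/postgresql/10/main
sample='Database cluster state:               in production'
# A cleanly stopped primary would instead show "shut down".
echo "$sample" | sed 's/^Database cluster state:[[:space:]]*//'
```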

> 3. Run crm_resource with the -VV option and --force-* with whatever
> action you want to attempt (in this case, demote or stop). The -VV (aka
> --verbose --verbose) will turn on OCF_TRACE_RA. The --force-* command
> will read the resource configuration and do the same thing pacemaker
> would do to execute the command.

I thought that I would want to see what the "check" is doing to do the check, 
since you're telling me that it thinks the service is running when it's 
definitely not.  I tried the following command which didn't work (am I doing 
something wrong?):

--
root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-check 
 warning: unpack_rsc_op_failure:Processing failed op monitor for 
postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
 warning: unpack_rsc_op_failure:Processing failed op monitor for 
postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
Operation monitor for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
 >  stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match 
 > (m//) at 
 > /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 
 > 392.
 >  stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and 
 > greater
Error performing operation: Input/output error
--

Attempting to force-demote didn't work either:

--
root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-demote
 warning: unpack_rsc_op_failure:Processing failed op monitor for 
postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
 warning: unpack_rsc_op_failure:Processing failed op monitor for 
postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
resource postgresql-ha is running on: d-gp2-dbpg0-3 
resource postgresql-ha is running on: d-gp2-dbpg0-1 
resource postgresql-ha is running on: d-gp2-dbpg0-2 Master
It is not 

Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

2018-05-29 Thread Casey & Gina
> On May 27, 2018, at 2:28 PM, Ken Gaillot  wrote:
> 
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2pengine: info:
>> determine_op_status: Operation monitor found resource postgresql-10-
>> main:2 active on d-gp2-dbpg0-2
> 
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2pengine:   notice:
>> LogActions:  Demote  postgresql-10-main:1(Master -> Slave d-gp2-
>> dbpg0-1)
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2pengine:   notice:
>> LogActions:  Recover postgresql-10-main:1(Master d-gp2-dbpg0-1)
> 
> From the above, we can see that the initial probe after the node
> rejoined found that the resource was already running in master mode
> there (at least, that's what the agent thinks). So, the cluster wants
> to demote it, stop it, and start it again as a slave.

Are you sure you're reading the above correctly?  The first line you quoted 
says the resource is already active on node 2, which is not the node that was 
restarted, and is the node that took over as master after I powered node 1 off.

Anyways I enabled debug logging in corosync.conf, and I now see the following 
information:

May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource:debug: 
determine_op_status:  postgresql-10-main_monitor_0 on d-gp2-dbpg0-1 
returned 'master (failed)' (9) instead of the expected value: 'not running' (7)
May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource:  warning: 
unpack_rsc_op_failure:Processing failed op monitor for postgresql-10-main:1 
on d-gp2-dbpg0-1: master (failed) (9)
May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource:debug: 
determine_op_status:  postgresql-10-main_monitor_0 on d-gp2-dbpg0-1 
returned 'master (failed)' (9) instead of the expected value: 'not running' (7)
May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource:  warning: 
unpack_rsc_op_failure:Processing failed op monitor for postgresql-10-main:1 
on d-gp2-dbpg0-1: master (failed) (9)

I'm not sure why these lines appear twice (same question I've had in the past 
about some log messages), but it seems that whatever it's doing to check the 
status of the resource, it is correctly determining that PostgreSQL failed 
while in master state, rather than being shut down cleanly.  Why this results 
in the node being fenced is beyond me.

I don't feel that I'm trying to do anything complex - just have a simple 
cluster that handles PostgreSQL failover.  I'm not trying to do anything fancy 
and am pretty much following the PAF docs, plus the addition of the fencing 
resource (which it says it requires to work properly - if this is "properly" I 
don't understand what goal it is trying to achieve...).  I'm getting really 
frustrated with pacemaker as I've been fighting hard to try to get it working 
for two months now and still feel in the dark about why it's behaving the way 
it is.  I'm sorry if I seem like an idiot...this definitely makes me feel like 
one...


Here is my configuration again, in case it helps:

Cluster Name: d-gp2-dbpg0
Corosync Nodes:
 d-gp2-dbpg0-1 d-gp2-dbpg0-2 d-gp2-dbpg0-3
Pacemaker Nodes:
 d-gp2-dbpg0-1 d-gp2-dbpg0-2 d-gp2-dbpg0-3

Resources:
 Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.124.164.250 cidr_netmask=22
  Operations: start interval=0s timeout=20s (postgresql-master-vip-start-interval-0s)
              stop interval=0s timeout=20s (postgresql-master-vip-stop-interval-0s)
              monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
 Master: postgresql-ha
  Meta Attrs: notify=true
  Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/lib/postgresql/10/bin pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432 recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf"
   Operations: start interval=0s timeout=60s (postgresql-10-main-start-interval-0s)
               stop interval=0s timeout=60s (postgresql-10-main-stop-interval-0s)
               promote interval=0s timeout=30s (postgresql-10-main-promote-interval-0s)
               demote interval=0s timeout=120s (postgresql-10-main-demote-interval-0s)
               monitor interval=15s role=Master timeout=10s (postgresql-10-main-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (postgresql-10-main-monitor-interval-16s)
               notify interval=0s timeout=60s (postgresql-10-main-notify-interval-0s)

Stonith Devices:
 Resource: vfencing (class=stonith type=external/vcenter)
  Attributes: VI_SERVER=10.124.137.100 VI_CREDSTORE=/etc/pacemaker/vicredentials.xml HOSTLIST=d-gp2-dbpg0-1;d-gp2-dbpg0-2;d-gp2-dbpg0-3 RESETPOWERON=1
  Operations: monitor interval=60s (vfencing-monitor-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote postgresql-ha then start postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
  demote 

Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

2018-05-29 Thread Casey & Gina
> On May 27, 2018, at 2:28 PM, Ken Gaillot  wrote:
> 
> Pacemaker isn't fencing because the start failed, at least not
> directly:
> 
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2 pengine: info: determine_op_status: Operation monitor found resource postgresql-10-main:2 active on d-gp2-dbpg0-2
> 
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2 pengine: notice: LogActions: Demote  postgresql-10-main:1 (Master -> Slave d-gp2-dbpg0-1)
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2 pengine: notice: LogActions: Recover postgresql-10-main:1 (Master d-gp2-dbpg0-1)
> 
> From the above, we can see that the initial probe after the node
> rejoined found that the resource was already running in master mode
> there (at least, that's what the agent thinks). So, the cluster wants
> to demote it, stop it, and start it again as a slave.

Well, it was running in master mode prior to being power-cycled.  However my
understanding was that PAF always tries to initially start PostgreSQL in 
standby mode.  There would be no reason for it to promote node 1 to master 
since node 2 has already taken over the master role, and there is no location 
constraint set that would cause it to try to move this role back to node 1 
after it rejoins the cluster.

Jehan-Guillaume wrote:  "on resource start, PAF will create the 
"PGDATA/recovery.conf" file based on your template anyway. No need to create it
yourself.".  The recovery.conf file being present upon PostgreSQL startup is 
what makes it start in standby mode.
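For reference, a recovery_template along those lines might look like the sketch below. Everything in it is an assumption pieced together from the configuration posted earlier in the thread (the master VIP as the connection target, the node name as application_name); check it against the PAF documentation rather than copying it verbatim:

```
# /etc/postgresql/10/main/recovery.conf -- hypothetical template
standby_mode = on
primary_conninfo = 'host=10.124.164.250 port=5432 user=postgres application_name=d-gp2-dbpg0-1'
recovery_target_timeline = 'latest'
```

PAF copies this template into PGDATA as recovery.conf on start, which is why a start without it (or with a broken template) would not come up in standby mode.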

Since no new log output is ever written to the PostgreSQL log file, it does not 
seem that it's ever actually doing anything to try to start the resource.  The 
recovery.conf doesn't get copied in, and no new data appears in the PostgreSQL 
log.  As far as I can tell, nothing ever happens on the rejoined node at all, 
before it gets fenced.

How can I tell what the resource agent is trying to do behind the scenes?  Is 
there a way that I can see what command(s) it is trying to run, so that I may 
try them manually?
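One way to get at this (a sketch only): drive the agent's actions by hand with crm_resource on the rejoined node. As noted elsewhere in this thread, crm_resource before 1.1.17 does not export OCF_RESKEY_crm_feature_set, so export it first using the value of the crm_feature_set attribute shown by `cibadmin -Q`. The value and sequence below are illustrative, not taken from this cluster:

```shell
# Hypothetical walk-through on the rejoined node (d-gp2-dbpg0-1).
# The feature-set value is a placeholder -- read the real one from
# `cibadmin -Q | grep crm_feature_set` before exporting it.
export OCF_RESKEY_crm_feature_set="3.0.10"

# Re-run the probe/monitor the cluster would run, with verbose agent output:
crm_resource --resource postgresql-10-main --force-check -V

# If monitor claims the resource is running, exercise the stop and start
# paths the same way to see where they fail:
crm_resource --resource postgresql-10-main --force-stop -V
crm_resource --resource postgresql-10-main --force-start -V
```

The -V flag can be repeated for more verbosity, which is usually enough to see which checks the agent is performing.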

> But the demote failed

I reckon that it probably couldn't demote what was never started.

> But the stop fails too

I guess that it can't stop what is already stopped?  Although I'm surprised 
that it would error in this case, instead of just recognizing that it was 
already stopped...
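That instinct matches the OCF spec: a stop on an already-stopped resource is required to return success, not an error, so a failing stop suggests the agent is seeing something other than "cleanly stopped". A toy shell sketch of the expected idempotent-stop pattern (illustrative only; pgsqlms itself is written in Perl and its internals may differ):

```shell
# Illustrative OCF-style stop action: already-stopped must count as success.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

pg_is_running() {
    # A real agent would probe the server, e.g. via pg_ctl status.
    pg_ctl -D "$PGDATA" status >/dev/null 2>&1
}

pgsql_stop() {
    if ! pg_is_running; then
        # Nothing to do: stop is idempotent, so report success.
        return "$OCF_SUCCESS"
    fi
    pg_ctl -D "$PGDATA" -m fast stop || return "$OCF_ERR_GENERIC"
    return "$OCF_SUCCESS"
}
```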

> 
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2 pengine: warning: pe_fence_node: Node d-gp2-dbpg0-1 will be fenced because of resource failure(s)
> 
> which is why the cluster then wants to fence the node. (If a resource
> won't stop, the only way to recover it is to kill the entire node.)

But the resource is *never started*!?  There is never any postgres process 
running, and nothing appears in the PostgreSQL log file.  I'm really confused 
as to why pacemaker thinks it needs to fence something that's never running at 
all...  I guess what I need is to somehow figure out what the resource agent is 
doing that makes it think the resource is already active; is there a way to do 
this?

It would be really helpful if, somewhere within this verbose logging, there 
were an indication of which commands were actually being run to monitor, 
start, stop, etc.  As it stands, it seems like a black box.

I'm wondering if some stale PID file is getting left around after the hard 
reboot, and that is what the resource agent is checking instead of the actual 
running status, but I would hope that the resource agent would be smarter than 
that.
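A stale PGDATA/postmaster.pid after a hard power-off is a reasonable suspect, and it is easy to rule out by hand. The helper below is a diagnostic sketch (the path comes from the pgdata attribute in the configuration above; it is not a claim about how pgsqlms actually decides resource state):

```shell
# Does the PID recorded in postmaster.pid still map to a live process?
# (The first line of postmaster.pid is the postmaster's PID.)
pidfile_is_stale() {
    pidfile="$1"
    [ -f "$pidfile" ] || { echo "no pid file at $pidfile"; return 2; }
    pid=$(head -n1 "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then
        echo "pid $pid is still alive"
        return 1
    fi
    echo "pid $pid is gone: stale pid file"
    return 0
}

# Example, using the pgdata path from the cluster configuration:
# pidfile_is_stale /var/lib/postgresql/10/main/postmaster.pid
```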

Thanks,
-- 
Casey
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

2018-05-27 Thread Ken Gaillot
On Wed, 2018-05-23 at 14:22 -0600, Casey & Gina wrote:
> I have pcsd set to auto-start at boot, but not pacemaker or
> corosync.  After I power off the node in vSphere, the node is fenced
> and then powered back on.  I see it show up in `pcs status` with PCSD
> Status of Online after a few seconds but shown as OFFLINE in the list
> of nodes on top since pacemaker and corosync are not running.  If I
> then do a `pcs cluster start` on the rebooted node, it is again
> restarted.  So I cannot get it to rejoin the cluster.
> 
> The corosync log from another node in the cluster (pasted below)
> indicates that PostgreSQL fails to start after pacemaker/corosync are
> restarted (on d-gp2-dbpg0-1 in this case), but it does not seem to
> give any reason as to why.  When I look on the failed node, I see
> that the PostgreSQL log is not being appended, so it doesn't seem
> it's ever actually trying to start it.  I'm not sure where else I
> could try looking.
> 
> Strangely, if prior to running `pcs cluster start` on the rebooted
> node, I sudo to postgres, copy the recovery.conf template to the data
> directory, and use pg_ctl to start the database, it comes up just
> fine in standby mode.  Then if I do `pcs cluster start`, the node
> rejoins the cluster just fine without any problem.
> 
> Can you tell me why pacemaker is failing to start PostgreSQL in
> standby mode based on the log data below, or how I can dig deeper
> into what is going on?  Is this due to some misconfiguration on my
> part?  I thought that PAF would try to do exactly what I do manually,
> but it doesn't seem this is the case...
> 
> Actually, why is Pacemaker fencing the standby node just because the
> resource fails to start there?  I thought only the master should be
> fenced if it were assumed to be broken.
> 
> Thank you for any help you can provide,

Pacemaker isn't fencing because the start failed, at least not
directly:

> May 22 23:57:24 [2196] d-gp2-dbpg0-2 pengine: info: determine_op_status: Operation monitor found resource postgresql-10-main:2 active on d-gp2-dbpg0-2

> May 22 23:57:24 [2196] d-gp2-dbpg0-2 pengine: notice: LogActions: Demote  postgresql-10-main:1 (Master -> Slave d-gp2-dbpg0-1)
> May 22 23:57:24 [2196] d-gp2-dbpg0-2 pengine: notice: LogActions: Recover postgresql-10-main:1 (Master d-gp2-dbpg0-1)

From the above, we can see that the initial probe after the node
rejoined found that the resource was already running in master mode
there (at least, that's what the agent thinks). So, the cluster wants
to demote it, stop it, and start it again as a slave.

> May 22 23:57:24 [2197] d-gp2-dbpg0-2 crmd: notice: abort_transition_graph: Transition aborted by postgresql-10-main_demote_0 'modify' on d-gp2-dbpg0-1: Event failed (magic=0:1;13:27:0:0df60493-9320-463d-94ca-a9515d139f9f, cib=0.35.70, source=match_graph_event:381, 0)

But the demote failed

> May 22 23:57:24 [2196] d-gp2-dbpg0-2 pengine: notice: LogActions: Stop    postgresql-10-main:1 (d-gp2-dbpg0-1)

So now the cluster wants to just stop it there

> May 22 23:57:24 [2197] d-gp2-dbpg0-2 crmd: notice: abort_transition_graph: Transition aborted by postgresql-10-main_stop_0 'modify' on d-gp2-dbpg0-1: Event failed (magic=0:1;2:28:0:0df60493-9320-463d-94ca-a9515d139f9f, cib=0.35.74, source=match_graph_event:381, 0)

But the stop fails too

> May 22 23:57:24 [2196] d-gp2-dbpg0-2 pengine: warning: pe_fence_node: Node d-gp2-dbpg0-1 will be fenced because of resource failure(s)

which is why the cluster then wants to fence the node. (If a resource
won't stop, the only way to recover it is to kill the entire node.)
-- 
Ken Gaillot 


Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

2018-05-25 Thread Casey Allen Shobe
Any advice about how to fix this?  I've been struggling to get things working 
for weeks now and I think this is the final stumbling block I need to figure 
out.

On May 23, 2018, at 2:22 PM, Casey & Gina  wrote:

>>> So now my concern is this - our VM's are distributed across 32 hosts.  One 
>>> condition we were hoping to handle was when one of those host machines 
>>> fails, due to bad memory or something else, as it is likely that not all of 
>>> the nodes within a cluster are residing on the same VM host (there may even 
>>> be some way to configure them to stay on separate hosts in ESX).  In this 
>>> case, a reset command will fail as well, I'd assume.  I had thought that 
>>> when the resource was fenced, it was done with an 'off' command, and that 
>>> the resources would be brought up on a standby node.  Is there a way to 
>>> make this work?
>> 
>> Configure your stonith agent to use "off" instead of "reset".
> 
> I tried a setup with RESETPOWERON="1" for the external/vcenter stonith 
> plugin.  It does seem to work better, but I end up with a node that can't 
> rejoin the cluster without being immediately rebooted, due to the PostgreSQL 
> resource failing.
> 
> I have pcsd set to auto-start at boot, but not pacemaker or corosync.  After 
> I power off the node in vSphere, the node is fenced and then powered back on. 
>  I see it show up in `pcs status` with PCSD Status of Online after a few 
> seconds but shown as OFFLINE in the list of nodes on top since pacemaker and 
> corosync are not running.  If I then do a `pcs cluster start` on the rebooted 
> node, it is again restarted.  So I cannot get it to rejoin the cluster.
> 
> The corosync log from another node in the cluster (pasted below) indicates 
> that PostgreSQL fails to start after pacemaker/corosync are restarted (on 
> d-gp2-dbpg0-1 in this case), but it does not seem to give any reason as to 
> why.  When I look on the failed node, I see that the PostgreSQL log is not 
> being appended, so it doesn't seem it's ever actually trying to start it.  
> I'm not sure where else I could try looking.
> 
> Strangely, if prior to running `pcs cluster start` on the rebooted node, I 
> sudo to postgres, copy the recovery.conf template to the data directory, and 
> use pg_ctl to start the database, it comes up just fine in standby mode.  
> Then if I do `pcs cluster start`, the node rejoins the cluster just fine 
> without any problem.
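For readers skimming the archive, the manual workaround described in the paragraph above amounts to roughly the following (paths taken from the cluster configuration earlier in the thread; a sketch of the reporter's own steps, not of anything PAF does internally):

```shell
# On the rebooted node, before `pcs cluster start`:

# 1. Seed recovery.conf so PostgreSQL comes up as a standby:
sudo -u postgres cp /etc/postgresql/10/main/recovery.conf \
    /var/lib/postgresql/10/main/recovery.conf

# 2. Start PostgreSQL by hand, mirroring the agent's start_opts:
sudo -u postgres /usr/lib/postgresql/10/bin/pg_ctl \
    -D /var/lib/postgresql/10/main \
    -o "-c config_file=/etc/postgresql/10/main/postgresql.conf" start

# 3. With the standby up, let the node rejoin the cluster:
sudo pcs cluster start
```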
> 
> Can you tell me why pacemaker is failing to start PostgreSQL in standby mode 
> based on the log data below, or how I can dig deeper into what is going on?  
> Is this due to some misconfiguration on my part?  I thought that PAF would 
> try to do exactly what I do manually, but it doesn't seem this is the case...
> 
> Actually, why is Pacemaker fencing the standby node just because the resource 
> fails to start there?  I thought only the master should be fenced if it were 
> assumed to be broken.
> 
> Thank you for any help you can provide,
> -- 
> Casey
> 
> 
> --
> [2157] d-gp2-dbpg0-2 corosync notice  [TOTEM ] A new membership (10.124.164.63:392) was formed. Members joined: 1
> May 22 23:57:19 [2189] d-gp2-dbpg0-2 pacemakerd: info: pcmk_quorum_notification: Membership 392: quorum retained (3)
> May 22 23:57:19 [2197] d-gp2-dbpg0-2 crmd: info: pcmk_quorum_notification: Membership 392: quorum retained (3)
> May 22 23:57:19 [2189] d-gp2-dbpg0-2 pacemakerd: notice: crm_update_peer_state_iter: pcmk_quorum_notification: Node d-gp2-dbpg0-1[1] - state is now member (was lost)
> May 22 23:57:19 [2197] d-gp2-dbpg0-2 crmd: notice: crm_update_peer_state_iter: pcmk_quorum_notification: Node d-gp2-dbpg0-1[1] - state is now member (was lost)
> May 22 23:57:19 [2192] d-gp2-dbpg0-2 cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/crmd/268)
> May 22 23:57:19 [2192] d-gp2-dbpg0-2 cib: info: cib_perform_op: Diff: --- 0.35.51 2
> May 22 23:57:19 [2192] d-gp2-dbpg0-2 cib: info: cib_perform_op: Diff: +++ 0.35.52 (null)
> May 22 23:57:19 [2192] d-gp2-dbpg0-2 cib: info: cib_perform_op: +  /cib:  @num_updates=52
> May 22 23:57:19 [2192] d-gp2-dbpg0-2 cib: info: cib_perform_op: +  /cib/status/node_state[@id='1']:  @crm-debug-origin=peer_update_callback
> May 22 23:57:19 [2192] d-gp2-dbpg0-2 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=d-gp2-dbpg0-2/crmd/268, version=0.35.52)
> May 22 23:57:19 [2192] d-gp2-dbpg0-2 cib: info: cib_process_request: Forwarding cib_modify operation for section nodes to master (origin=local/crmd/272)
> May 22 23:57:19 [2192] d-gp2-dbpg0-2 cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/crmd/273)
> May 22 23:57:19