> Quick look at PAF manual gives
> 
> you need to rebuild the PostgreSQL instance on the failed node
> 
> did you do it? I am not intimately familiar with Postgres, but in this
> case I expect that you need to make database on node B secondary (slave,
> whatever it is called) to new master on node A. That is exactly what I
> described as "manually fixing configuration outside of pacemaker".

I had not seen this before today; someone pointed me to it a little while ago.
I did not realize that this would be necessary, so I have written a script to
rebuild the db and then run `pcs cluster start` afterwards, which I'll make
part of our standard recovery procedure.
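
A rebuild script along these lines might look roughly like the sketch below.
It is not the actual script referred to above: the host name, data directory
and replication user are placeholders, and it only prints the commands unless
DRY_RUN=0 is set. It assumes a streaming-replication setup where
`pg_basebackup` can reach the current master.

```shell
#!/bin/sh
# Sketch only: every value below is a placeholder, not taken from this thread.
PRIMARY_HOST="${PRIMARY_HOST:-nodeA}"          # assumed: the current master
PGDATA="${PGDATA:-/var/lib/postgresql/data}"   # assumed: local data directory
REPL_USER="${REPL_USER:-replicator}"           # assumed: replication role
DRY_RUN="${DRY_RUN:-1}"                        # 1 = print commands, do not run them

# Print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

rebuild_standby() {
    # Move the old data directory aside instead of deleting it outright.
    # (Assumes "$PGDATA.failed" does not already exist.)
    run mv "$PGDATA" "$PGDATA.failed"
    # Take a fresh base backup from the current master.
    # Normally run as the postgres OS user.
    run pg_basebackup -h "$PRIMARY_HOST" -U "$REPL_USER" -D "$PGDATA" -X stream -P
    # PAF also needs its recovery settings put back in place before the
    # instance can rejoin; that step depends on the PostgreSQL version and
    # local setup, so it is left out of this sketch.
    # Bring this node back into the cluster.
    run pcs cluster start
}
```

Calling `rebuild_standby` with the default DRY_RUN=1 just lists the three
commands, which makes it easy to review the procedure before running it for
real on the failed node.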

I guess I expected that pacemaker would be able to handle this case 
automatically - if the resource agent reported a resource in a 
potentially-corrupt state, pacemaker could then call the resource agent to 
start the rebuild.  But there are probably some reasons that's not a great 
idea, and I think that I understand things enough now to be confident in just 
using a custom script for this purpose when necessary.

When I set up clusters in the past with heartbeat, I put the database on a
DRBD partition, which simplified matters: there was never a possibility of
new writes to the master not yet being replicated to the slave.  In
development testing, I found that I did not need to rebuild the database, just
start it up manually in slave mode.  But now that I've thought this through
better, I realize that in a production environment, should the master crash, it
will quite likely hold some data that had not yet replicated to the slaves, so
it could not cleanly come up as a standby: its data would be ahead of the new
master's.

> pacemaker is too old. The error most likely comes from missing
> OCF_RESKEY_crm_feature_set which is exported by crm_resource starting
> with 1.1.17. I am not that familiar with debian packaging, but I'd
> expect resource-agents-paf require suitable pacemaker version. Of course
> Ubuntu package may be patched to include necessary code ...

I'm not sure why that would be - the resource agent works fine with this
version of pacemaker, and according to
https://github.com/ClusterLabs/PAF/releases, it only requires
pacemaker >=1.1.13.  I think that something is wrong with the command that I
was trying to run, since pacemaker 1.1.14 itself uses this resource agent to
start/stop/monitor the service without problems outside of the manual
debugging context.
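
For what it's worth, the two version numbers in play here (1.1.13 for the
resource agent, 1.1.17 for crm_resource exporting
OCF_RESKEY_crm_feature_set) can be compared against the installed version
with a small helper. This is only a sketch; `version_ge` is a made-up
function name, and the installed version still has to be filled in by hand.

```shell
#!/bin/sh
# Return 0 if dotted version $1 is >= $2, comparing each field numerically.
# Hypothetical helper, not part of pacemaker or PAF.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            # Missing fields count as 0, so 1.1 compares like 1.1.0.
            if ((x[i] + 0) > (y[i] + 0)) exit 0
            if ((x[i] + 0) < (y[i] + 0)) exit 1
        }
        exit 0
    }'
}

installed="1.1.14"   # the pacemaker version mentioned in this thread

if version_ge "$installed" 1.1.13; then
    echo "meets the 1.1.13 requirement listed for the resource agent"
fi
if ! version_ge "$installed" 1.1.17; then
    echo "older than 1.1.17, the version named for the crm_resource export"
fi
```

With 1.1.14 both messages are printed, so the agent's stated requirement and
the 1.1.17 threshold for manual debugging do not contradict each other.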

Thank you!
-- 
Casey
