[ClusterLabs] fence-agents v4.2.1
ClusterLabs is happy to announce fence-agents v4.2.1, which is a bugfix release for v4.2.0.

The source code is available at:
https://github.com/ClusterLabs/fence-agents/releases/tag/v4.2.1

The most significant enhancements in this release are:

- bugfixes and enhancements:
  - fence_scsi: fix python3 encoding issue
  - xml-check: fix not failing on incorrect or missing metadata

Everyone is encouraged to download and test the new release. We do many regression tests and simulations, but we can't cover all possible use cases, so your feedback is important and appreciated.

Many thanks to all the contributors to this release.

Best,
The fence-agents maintainers

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
[ClusterLabs] [questionnaire] Do you manage your pacemaker configuration by hand and (if so) what reusability features do you use?
Hello,

I am soliciting feedback on these CIB features related questions; please reply (preferably on-list, so we have the shared collective knowledge) if at least one of the questions is answered positively in your case (just tick the respective "[ ]" boxes as "[x]"). Any other commentary is also welcome -- thank you in advance.

1. [ ] Do you edit the CIB by hand (as opposed to relying on crm/pcs or their UI counterparts)?

2. [ ] Do you use "template" based syntactic simplification[1] in the CIB?

3. [ ] Do you use "id-ref" based syntactic simplification[2] in the CIB?

3.1 [ ] If positive about 3., would you mind much if "id-refs" got unfolded/exploded during the "cibadmin --upgrade --force" equivalent as a reliability/safety precaution?

4. [ ] Do you use "tag" based syntactic grouping[3] in the CIB?

(Some of these questions tangentially touch the topic of perhaps excessively complex means of configuration that was raised during the 2017 cluster summit.)

[1] https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#_reusing_resource_definitions
[2] https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#s-reusing-config-elements
[3] https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#_tagging_configuration_elements

-- 
Jan (Poki)
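[Editor's note: for readers unfamiliar with the three features the questionnaire asks about, here is a hypothetical CIB fragment illustrating them, based on the Pacemaker Explained chapters linked above. All resource names, IDs, and attribute values are made up.]

```xml
<!-- [1] resource template: shared class/provider/type and defaults -->
<template id="db-template" class="ocf" provider="heartbeat" type="pgsqlms"/>
<primitive id="db1" template="db-template">
  <meta_attributes id="db-common-meta">
    <nvpair id="db-common-meta-mt" name="migration-threshold" value="3"/>
  </meta_attributes>
</primitive>

<!-- [2] id-ref reuse: db2 shares db1's meta attributes by reference -->
<primitive id="db2" template="db-template">
  <meta_attributes id-ref="db-common-meta"/>
</primitive>

<!-- [3] tag grouping: refer to several resources with one name -->
<tags>
  <tag id="all-databases">
    <obj_ref id="db1"/>
    <obj_ref id="db2"/>
  </tag>
</tags>
```

The id-ref question (3.1) is about whether the `id-ref="db-common-meta"` reference above could be replaced by an expanded copy of the referenced block during a forced CIB upgrade.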
Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)
> There is no "master node" in pacemaker. There is a master/slave resource, so at best it is "the node on which a specific resource has the master role". And we have no way to know on which node your resource had the master role when you did it. Please be more specific, otherwise it is hard to impossible to follow.

Well, my limited understanding is that there should be one node that's the master at any point in time. I don't see how it makes sense to have resources with masters on different nodes in the same cluster. I'm being as specific as I can given my limited knowledge. I'm not a developer; just an admin trying to get a simple cluster up and running. Years ago, I did this same thing with two nodes and heartbeat, and it was very easy. Anyway, I guess I mean that I powered off the node that was the master for all resources at the time.

> Not specifically related to your problem, but I wonder what the difference is. For all I know, for master/slave "Started" == "Slave", so I'm surprised to see two different states listed here.

I also wondered about that, since from the PostgreSQL point of view, there is one master and two standbys which are no different from one another. But like you said, it didn't seem relevant to my problem.

> Well, apparently the resource agent does not like a crashed instance. It is quite possible; I have been working with another replicated database where it was necessary to manually fix configuration after failover, *outside* of pacemaker. Pacemaker simply failed to start a resource which had an unexpected state.

I can manually start up the database in standby mode, without any errors or special intervention/fixing whatsoever, as long as the replication logs have not gotten too far ahead on the new master. In that case I would need to rebuild the standby.

> This needs someone familiar with this RA and application to answer.

The resource agent is PAF and I've seen a lot of others discussing this on this list, so I hope that I am asking in the right place.
> Note that it is not quite a normal use case. You explicitly disabled any handling by the RA, thus effectively not using pacemaker high availability at all. Does it fail over the master if you do not unmanage the resource and kill the node where the resource has the master role?

I was following the specific instructions in the e-mail I was replying to, which asked me to unmanage the resource and try manual debugging steps. As I've discussed in this thread (please review the previous e-mails on this thread for further information), pacemaker does fail over the master, but then when the former master node comes back online, if I do a `pcs cluster start` on it without manually starting up the database by hand, it fails to start the PAF resource and pacemaker ends up fencing the node again.

I've been told that what PAF does on resource startup is exactly the same as the manual commands that I can run to make it work. In the prior e-mails on this thread, I was told that the reason the resource startup fails is that the resource agent is incorrectly determining that the resource is already running when it's not - so it never even tries to start the resource at all. The debug instructions I'm attempting to follow are an attempt to figure out what command it is running to determine this state. Failover to another node is only half the battle - the failed node should be able to rejoin the cluster without the cluster immediately fencing it when I try, shouldn't it?
>> --
>> root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-check
>> warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>> warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>> Operation monitor for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
>>> stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match (m//) at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 392.
>>> stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and greater
>> Error performing operation: Input/output error
>> --
>
> This looks like a bug in your version.

Version of what? I'm using the corosync, pacemaker, and pcs versions as provided by Ubuntu (16.04), and resource-agents-paf as provided by the PGDG repository. These versions are as follows:

* corosync - 2.3.5-3ubuntu2
* pacemaker - 1.1.14-2ubuntu1.3
* pcs - 0.9.149-1ubuntu1.1
* resource-agents-paf - 2.2.0-2.pgdg16.04+1

These are the latest packaged versions available for my platform, as far as I'm aware, and the same as I presume other Ubuntu users on this list are running.

Regards,
-- 
Casey
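[Editor's note: the numbers in the output above - "(9)" in the monitor warnings and "returned 5" - are exit codes defined by the OCF resource agent API. A minimal sketch of the mapping, for decoding such logs:]

```python
# OCF resource agent exit codes as defined by the OCF RA API.
# In the crm_resource output above, "(9)" means a failed master
# and "returned 5" means the agent reported itself not installed.
OCF_CODES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
    8: "OCF_RUNNING_MASTER",
    9: "OCF_FAILED_MASTER",
}

def describe(rc: int) -> str:
    """Map a resource agent return code to its symbolic OCF name."""
    return OCF_CODES.get(rc, "unknown")

print(describe(9))  # OCF_FAILED_MASTER - the "(9)" in the monitor warning
print(describe(5))  # OCF_ERR_INSTALLED - what crm_resource reported here
```

Here code 5 is consistent with the stderr above: the PAF agent bailed out of its version check, which Pacemaker interprets as "not installed".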
Re: [ClusterLabs] [questionnaire] Do you manage your pacemaker configuration by hand and (if so) what reusability features do you use?
On Thu, 2018-05-31 at 14:48 +0200, Jan Pokorný wrote:
> Hello,
>
> I am soliciting feedback on these CIB features related questions, please reply (preferably on-list so we have the shared collective knowledge) if at least one of the questions is answered positively in your case (just tick the respective "[ ]" boxes as "[x]").
>
> Any other commentary also welcome -- thank you in advance.
>
> 1. [ ] Do you edit CIB by hand (as opposed to relying on crm/pcs or their UI counterparts)?

To clarify, crm shell supports both templates and id-ref, while pcs does not.

> 2. [ ] Do you use "template" based syntactic simplification[1] in CIB?
>
> 3. [ ] Do you use "id-ref" based syntactic simplification[2] in CIB?
>
> 3.1 [ ] When positive about 3., would you mind much if "id-refs" got unfold/exploded during the "cibadmin --upgrade --force" equivalent as a reliability/safety precaution?

Regardless of whether anyone minds, we're not going to do it. It would render the feature useless and force any user using it to either abandon it or perform potentially massive manual edits to their CIB.

If the community feels that id-ref is not a useful feature, we can deprecate it, and in some future release, drop support and automatically expand it as part of the upgrade transform for that release. Otherwise we will continue full support for it.

> 4. [ ] Do you use "tag" based syntactic grouping[3] in CIB?
>
> (Some of these questions tangentially touch the topic of perhaps excessively complex means of configuration that was raised during the 2017's cluster summit.)
> [1] https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#_reusing_resource_definitions
> [2] https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#s-reusing-config-elements
> [3] https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#_tagging_configuration_elements

-- 
Ken Gaillot
Re: [ClusterLabs] [questionnaire] Do you manage your pacemaker configuration by hand and (if so) what reusability features do you use?
On 31/05/18 11:42 -0500, Ken Gaillot wrote:
> On Thu, 2018-05-31 at 14:48 +0200, Jan Pokorný wrote:
>> I am soliciting feedback on these CIB features related questions, please reply (preferably on-list so we have the shared collective knowledge) if at least one of the questions is answered positively in your case (just tick the respective "[ ]" boxes as "[x]").
>>
>> Any other commentary also welcome -- thank you in advance.
>>
>> 1. [ ] Do you edit CIB by hand (as opposed to relying on crm/pcs or their UI counterparts)?
>
> To clarify, crm shell supports both templates and id-ref, while pcs does not.

No implications were intended, nor expressed. I am (possibly we are) interested in the original question regardless of how the other questions are answered -- please answer as a reply to the original post, if you wish to participate.

-- 
Jan (Poki)
[ClusterLabs] Pacemaker 2.0.0-rc5 now available
Since we had a few significant bug fixes, I decided to do one more release candidate for Pacemaker version 2.0.0. If there are no serious issues found with this one, the final will likely be released next week.

Source code is available at:
https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.0.0-rc5

This is a bug fix release. The two main ones:

* Avoid unnecessary repeated recovery of resources that have "requires" set to "quorum" or "nothing" (i.e. they can start elsewhere before their previously active node is fenced; this includes fence devices and Pacemaker Remote connection resources).

* Allow a monitor to be cancelled when its resource is unmanaged.

The only known issue remaining to be resolved before the final release is some tweaking of the transform of pre-2.0 configurations after an upgrade.

-- 
Ken Gaillot
Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)
Sorry for getting back to you so late.

On Fri, 25 May 2018 11:58:59 -0600 Casey & Gina wrote:
>> On May 25, 2018, at 7:01 AM, Casey Allen Shobe wrote:
>>> Actually, why is Pacemaker fencing the standby node just because a resource fails to start there? I thought only the master should be fenced if it were assumed to be broken.
>
> This is probably the most important thing to ask outside of the PAF resource agent which many may not be as fluent with as pacemaker itself, and perhaps the most indicative of me setting something up incorrectly outside of that resource agent.
>
> My understanding of fencing was that pacemaker would only fence a node if it was the master but had stopped responding, to avoid a split-brain situation. Why would pacemaker ever fence a standby node with no resources currently allocated to it?

So, as discussed on IRC and for the mailing list history, here is the answer:

https://clusterlabs.github.io/PAF/administration.html#failover

In short: after a failure (either on a primary or a standby), you MUST fix things on the node before starting Pacemaker. If you don't, PAF will detect something incoherent and raise an error, leading Pacemaker to most likely fence your node, again.

For instance, after a primary crash, you will have to resync it as a standby with the new master before starting Pacemaker on the node and handing control back to PAF. This is actually really important if you don't want to end up with a silently corrupted standby in your cluster.

Cheers,
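[Editor's note: the "fix the node first, then start Pacemaker" sequence described above might look like the following sketch. This is not from the PAF documentation; all host names (d-gp2-dbpg0-1), the replication user, paths, and the service name are hypothetical, and the right resync method (pg_basebackup vs. pg_rewind) depends on your PostgreSQL version and configuration.]

```sh
# Hedged sketch of recovering a crashed ex-primary before rejoining the
# cluster. Adapt every name and path to your own environment.

# 1. On the crashed ex-primary, make sure PostgreSQL is not running.
systemctl stop postgresql@10-main

# 2. Resync the old primary as a standby of the new master, e.g. with a
#    full base backup (pg_rewind may be a cheaper alternative if enabled).
rm -rf /var/lib/postgresql/10/main
pg_basebackup -h d-gp2-dbpg0-1 -U replicator \
    -D /var/lib/postgresql/10/main -X stream -R

# 3. Only now hand the node back to Pacemaker/PAF.
pcs cluster start
```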
Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)
31.05.2018 19:20, Casey & Gina wrote:
>> There is no "master node" in pacemaker. There is master/slave resource so at the best it is "node on which specific resource has master role". And we have no way to know on which node your resource had master role when you did it. Please be more specific, otherwise it is hard to impossible to follow.
>
> Well my limited understanding is that there should be one node that's the master at any point in time. I don't see how it makes sense to have resources with masters on different nodes in the same clusters.

It is entirely possible and useful for different resources to have the master role on different nodes at the same time. "Master" simply denotes one of two possible states; it does not convey any additional semantics.

> I'm being as specific as I can given my limited knowledge. I'm not a developer; just an admin trying to get a simple cluster up and running. Years ago, I did this same thing with two nodes and heartbeat, and it was very easy. Anyways, I guess I mean that I powered off the node that was the master for all resources at the time.
>
>> Not specifically related to your problem but I wonder what is the difference. For all I know for master/slave "Started" == "Slave" so I'm surprised to see two different states listed here.
>
> I also wondered about that, since from the PostgreSQL, there is one master and two standbys which are no different from one another. But like you said, it didn't seem relevant to my problem.
>
>> Well, apparently resource agent does not like crashed instance. It is quite possible, I have been working with another replicated database where it was necessary to manually fix configuration after failover, *outside* of pacemaker. Pacemaker simply failed to start resource which had unexpected state.
> I can manually start up the database in standby mode, without any errors or special intervention/fixing whatsoever, as long as the replication logs have not gotten too far ahead on the new master. In that case I would need to rebuild the standby.
>
>> This needs someone familiar with this RA and application to answer.
>
> The resource agent is PAF and I've seen a lot of others discussing this on this list, so I hope that I am asking in the right place.

Sure, hopefully the right person chimes in.

>> Note that it is not quite normal use case. You explicitly disabled any handling by RA, thus effectively not using pacemaker high availability at all. Does it fail over master if you do not unmanage resource and kill node where resource has master role?
>
> I was following the specific instructions in the E-mail I was replying to, which asked me to unmanage the resource and try manual debugging steps. As I've discussed in this thread (please review the previous E-mails on this thread for further information), pacemaker does fail over the master, but then when the former master node comes back online, if I do a `pcs cluster start` on it without manually starting up the database by hand, it fails to start the PAF resource and pacemaker ends up fencing the node again.

Well, it means you now have a new primary database instance (which pacemaker failed over to) on node A and the old primary database instance on node B which you now start. On node B it remains a primary because that was the state the node was in when it was killed. It is quite logical that the attempt to start the resource (and hence the database instance) fails. A quick look at the PAF manual says "you need to rebuild the PostgreSQL instance on the failed node" - did you do it? I am not intimately familiar with Postgres, but in this case I expect that you need to make the database on node B a secondary (slave, whatever it is called) of the new master on node A.
That is exactly what I described as "manually fixing configuration outside of pacemaker".

> I've been told that what PAF does on resource startup is exactly the same as the manual commands that I can do to make it work. In the prior E-mails on this thread, I was told that the reason the resource startup fails is because the resource agent is incorrectly determining that the resource is already running when it's not - so it's never even trying to start the resource at all. The debug instructions I'm attempting to follow are in an attempt to figure out what command it is running to determine this state. Fail over to another node is only half the battle - the failed node should be able to rejoin the cluster without the cluster immediately fencing it when I try, shouldn't it?
>
>> --
>> root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-check
>> warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>> warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>> Operation moni
Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)
31.05.2018 22:18, Jehan-Guillaume de Rorthais wrote:
> Sorry for getting back to you so late.
>
> On Fri, 25 May 2018 11:58:59 -0600 Casey & Gina wrote:
>>> On May 25, 2018, at 7:01 AM, Casey Allen Shobe wrote:
>>>> Actually, why is Pacemaker fencing the standby node just because a resource fails to start there? I thought only the master should be fenced if it were assumed to be broken.
>>
>> My understanding of fencing was that pacemaker would only fence a node if it was the master but had stopped responding, to avoid a split-brain situation. Why would pacemaker ever fence a standby node with no resources currently allocated to it?
>
> So, as discussed on IRC and for the mailing list history, here is the answer:
>
> https://clusterlabs.github.io/PAF/administration.html#failover
>
> In short: after a failure (either on a primary or a standby), you MUST fix things on the node before starting Pacemaker.
>
> If you don't, PAF will detect something incoherent and raise an error, leading Pacemaker to most likely fence your node, again.

Well, that does not sound very polite to the user :)

Another database RA I mentioned somewhere in this thread has a different approach - it starts the database in its monitor action and the start action is effectively a dummy. So start always succeeds from pacemaker's point of view, but the database won't be started until it is manually synchronized again by the administrator. The downside is that the pacemaker resource status does not reflect the database status.

I wish pacemaker supported something like a "requires manual intervention" resource state that would not be treated as an error (causing all sorts of fatal consequences) but would still be evaluated for dependencies (i.e. dependent resources would not be started). That would be ideal for such a case.
Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)
On Thu, 31 May 2018 22:52:12 +0300 Andrei Borzenkov wrote:
> 31.05.2018 22:18, Jehan-Guillaume de Rorthais wrote:
>> In short: after a failure (either on a primary or a standby), you MUST fix things on the node before starting Pacemaker.
>>
>> If you don't, PAF will detect something incoherent and raise an error, leading Pacemaker to most likely fence your node, again.
>
> Well, that does not sound very polite to user :)

Sure :) But at least, it's been documented, as you pointed out earlier.

After a failure and an automatic failover, either you have some automatic failback process somewhere... or you have to fix some things around. PAF is not able to do automatic failback.

> Another database RA I mentioned somewhere in this thread has different approach - it starts database in its monitor action and start action is effectively dummy.

Mh, I would have to study that. But I'm not thrilled about such behavior at first look.

> So start always succeeds from pacemaker point of view, but database won't be started until manually synchronized again by administrator.

It seems scary... What about the stop action? What if the monitor detects an error? Well, I really should check this RA you are talking about to answer my questions.

> Downside is that pacemaker resource status does not reflect database status. I wish pacemaker supported something like "requires manual intervention" resource state that would not be treated like error (causing all sorts of fatal consequences) but still evaluated for dependencies (i.e. dependent resources would not be started). That would be ideal for such case.

Good idea. I have a couple more:

* handling errors from notify actions
* supporting migrate-to/from for multistate RA
* having real infinite master score :)

Cheers,
Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)
On Thu, 2018-05-31 at 22:43 +0200, Jehan-Guillaume de Rorthais wrote:
> On Thu, 31 May 2018 22:52:12 +0300 Andrei Borzenkov wrote:
>> Downside is that pacemaker resource status does not reflect database status. I wish pacemaker supported something like "requires manual intervention" resource state that would not be treated like error (causing all sorts of fatal consequences) but still evaluated for dependencies (i.e. dependent resources would not be started). That would be ideal for such case.

I'm not clear what such a result would mean. Is the goal to stop dependent resources, but not the resource itself? And/or to block all further management of the resource?

> Good idea.
>
> I have a couple more:
> * handling errors from notify actions

I could imagine notify supporting on-fail, defaulting to ignore. Would that do what you want? Should notify errors count toward the resource fail count?

> * supporting migrate-to/from for multistate RA
> * having real infinite master score :)

What behavior isn't supported by current infinity?

> Cheers,

-- 
Ken Gaillot
Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)
> A quick look at the PAF manual gives:
>
>> you need to rebuild the PostgreSQL instance on the failed node
>
> did you do it? I am not intimately familiar with Postgres, but in this case I expect that you need to make the database on node B secondary (slave, whatever it is called) to the new master on node A. That is exactly what I described as "manually fixing configuration outside of pacemaker".

I did not see this prior to today, but was pointed to it a little while ago. I did not realize that this would be necessary, so I have written a script to rebuild the db and then do the `pcs cluster start` afterwards, which I'll make part of our standard recovery procedure.

I guess I expected that pacemaker would be able to handle this case automatically - if the resource agent reported a resource in a potentially-corrupt state, pacemaker could then call the resource agent to start the rebuild. But there are probably some reasons that's not a great idea, and I think that I understand things enough now to be confident in just using a custom script for this purpose when necessary.

When I set up clusters in the past with heartbeat, I had put the database on a DRBD partition, which simplified matters since there was never a possibility of some new writes to the master not yet being replicated to the slave. In development testing, I found that I did not need to rebuild the database, just start it up manually in slave mode. But now that I've thought this through better, I realize that in a production environment, should the master crash, it is quite likely that it will have some data that has not yet replicated to the slaves, so it could not cleanly come up as a standby since it would have data that was too new.

> pacemaker is too old. The error most likely comes from missing OCF_RESKEY_crm_feature_set which is exported by crm_resource starting with 1.1.17. I am not that familiar with debian packaging, but I'd expect resource-agents-paf to require a suitable pacemaker version. Of course the Ubuntu package may be patched to include the necessary code ...

I'm not sure why that would be - the resource agent works fine with this version of pacemaker, and according to https://github.com/ClusterLabs/PAF/releases, it only requires pacemaker >= 1.1.13. I think that something is wrong with the command that I was trying to run, as pacemaker 1.1.14 successfully uses this resource agent to start/stop/monitor the service generally speaking, outside of the manual debugging context.

Thank you!
-- 
Casey
Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)
> Well, that does not sound very polite to user :)

The thing that really threw me off was pacemaker rebooting the node as soon as I'd try to start the cluster on it without the database running. Is there a way to prevent this from happening? Some way to indicate to Pacemaker, "Hey, I'm not willing/able to start the resource here because it appears to be in a corrupt state", while not causing the node to be fenced because it thinks that the resource is running when it isn't? It would be perfectly safe to not fence the node in this case...

-- 
Casey