Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Order set troubles
On 26.03.2021 22:18, Reid Wahl wrote: > On Fri, Mar 26, 2021 at 6:27 AM Andrei Borzenkov > wrote: > >> On Fri, Mar 26, 2021 at 10:17 AM Ulrich Windl >> wrote: >>> >> Andrei Borzenkov schrieb am 26.03.2021 um >> 06:19 in >>> Nachricht <534274b3-a6de-5fac-0ae4-d02c305f1...@gmail.com>: On 25.03.2021 21:45, Reid Wahl wrote: > FWIW we have this KB article (I seem to remember Strahil is a Red Hat > customer): > - How do I configure SAP HANA Scale-Up System Replication in a >> Pacemaker > cluster when the HANA filesystems are on NFS shares?( > https://access.redhat.com/solutions/5156571) > "How do I make the cluster resources recover when one node loses access to the NFS server?" If node loses access to NFS server then monitor operations for >> resources that depend on NFS availability will fail or timeout and pacemaker will recover (likely by rebooting this node). That's how similar configurations have been handled for the past 20 years in other HA managers. I am genuinely interested, have you encountered the case >> where it was not enough? >>> >>> That's a big problem with the SAP design (basically it's just too >> complex). >>> In the past I had written a kind of resource agent that worked without >> that >>> overly complex overhead, but since those days SAP has added much more >>> complexity. >>> If the NFS server is external, pacemaker could fence your nodes when the >> NFS >>> server is down as first the monitor operation will fail (hanging on >> NFS), the >>> the recover (stop/start) will fail (also hanging on NFS). >> >> And how exactly placing NFS resource under pacemaker control is going >> to change it? >> > > I noted earlier based on the old case notes: > > "Apparently there were situations in which the SAPHana resource wasn't > failing over when connectivity was lost with the NFS share that contained > the hdb* binaries and the HANA data. 
I don't remember the exact details > (whether demotion was failing, or whether it wasn't even trying to demote > on the primary and promote on the secondary, or what). Either way, I was > surprised that this procedure was necessary, but it seemed to be." > > Strahil may be dealing with a similar situation, not sure. I get where > you're coming from -- I too would expect the application that depends on > NFS to simply fail when NFS connectivity is lost, which in turn leads to > failover and recovery. For whatever reason, due to some weirdness of the > SAPHana resource agent, that didn't happen. > Yes. The only reason to use this workaround would be if resource agent monitor still believes that application is up when required NFS is down. Which is a bug in resource agent or possibly in application itself. While using this workaround in this case is perfectly reasonable, none of reasons listed in the message I was replying to are applicable. So far the only reason OP wanted to do it was some obscure race condition on startup outside of pacemaker. In which case this workaround simply delays NFS mount, sidestepping race. I also remember something about racing with dnsmasq, at which point I'd say that making cluster depend on availability of DNS is e-h-h-h unwise. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?
On Fri, Mar 26, 2021 at 2:44 PM Antony Stone wrote: > On Friday 26 March 2021 at 18:31:51, Ken Gaillot wrote: > > > On Fri, 2021-03-26 at 19:59 +0300, Andrei Borzenkov wrote: > > > On 26.03.2021 17:28, Antony Stone wrote: > > > > > > > > So far all is well and good, my cluster synchronises, starts the > > > > resources, and everything's working as expected. It'll move the > > > > resources from one cluster member to another (either if I ask it to, > or > > > > if there's a problem), and it seems to work just as the older version > > > > did. > > > > I'm glad this far was easy :) > > Well, I've been using corosync & pacemaker for some years now; I've got > used > to some of their quirks and foibles :) > > Now I just need to learn about the new ones for the newer versions... > > > It's worth noting that pacemaker itself doesn't try to validate the > > agent meta-data, it just checks for the pieces that are interesting to > > it and ignores the rest. > > I guess that's good, so long as what it does pay attention to is what it > wants > to see? > > > It's also worth noting that the OCF 1.0 standard is horribly outdated > > compared to actual use, and the OCF 1.1 standard is being adopted today > > (!) after many years of trying to come up with something more up-to-date. > > So, is ocf-tester no longer the right tool I should be using to check > this > sort of thing? What should I be doing instead to make sure my > configuration > is valid / acceptable to pacemaker? > > > Bottom line, it's worth installing xmllint to see if that helps, but I > > wouldn't worry about meta-data schema issues. 
> > Well, as stated in my other reply to Andrei, I now get: > > /usr/lib/ocf/resource.d/heartbeat/asterisk passed all tests > > /usr/lib/ocf/resource.d/heartbeat/anything passed all tests > > so I guess it means my configuration file is okay, and I need to look > somewhere else to find out why pacemaker 2.0.1 is throwing wobblies with exactly > the > same resources that pacemaker 1.1.16 can manage quite happily and stably... > > > > Either agent does not run as root or something blocks chown. Usual > > > suspects are apparmor or SELinux. > > > > Pacemaker itself can also return this error in certain cases, such as > > not having permissions to execute the agent. Check the pacemaker detail > > log (usually /var/log/pacemaker/pacemaker.log) and the system log > > around these times to see if there is more detail. > > I've turned on debug logging, but I'm still not sure I'm seeing *exactly* > what > the resource agent checker is doing when it gets this failure. > > > It is definitely weird that a privileges error would be sporadic. > > Hopefully the logs can shed some more light. > > I've captured a bunch of them this afternoon and will go through them on > Monday - it's pretty verbose! > > > Another possibility would be to set trace_ra=1 on the actions that are > > failing to get line-by-line info from the agents. > > So, that would be an extra parameter to the resource definition in > cluster.cib? > > Change: > > primitive Asterisk asterisk meta migration-threshold=3 op monitor > interval=5 > timeout=30 on-fail=restart failure-timeout=10s > > to: > > primitive Asterisk asterisk meta migration-threshold=3 op monitor > interval=5 > timeout=30 on-fail=restart failure-timeout=10s trace_ra=1 > > ? > It's an instance attribute, not a meta attribute. I'm not familiar with crmsh syntax but trace_ra=1 would go wherever you would configure a "normal" option, like `ip=x.x.x.x` for an IPaddr2 resource. 
It will save a shell trace of each operation to a file in /var/lib/heartbeat/trace_ra/asterisk. You would then wait for an operation to fail, find the file containing that operation's trace, and see what it tells you about the error. You might already have some more detail about the error in /var/log/messages and/or /var/log/pacemaker/pacemaker.log. Look in /var/log/messages around Fri Mar 26 13:37:08 2021 on the node where the failure occurred. See if there are any additional messages from the resource agent, or any stdout or stderr logged by lrmd/pacemaker-execd for the Asterisk resource. > > Antony. > > -- > "It is easy to be blinded to the essential uselessness of them by the > sense of > achievement you get from getting them to work at all. In other words - and > this is the rock solid principle on which the whole of the Corporation's > Galaxy-wide success is founded - their fundamental design flaws are > completely > hidden by their superficial design flaws." > > - Douglas Noel Adams > >Please reply to the > list; > please *don't* CC > me. > ___ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ > > -- Regards, Reid Wahl, RHCA Senior Software Maintenance Engineer, Red Hat
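[Editorial note: to make the advice above concrete, here is a sketch of how the primitive from this thread might look in crmsh with trace_ra=1 placed among the instance parameters (params) rather than the meta attributes. This is untested and should be checked against your own cluster.cib.]

```
primitive Asterisk asterisk \
    params trace_ra=1 \
    meta migration-threshold=3 \
    op monitor interval=5 timeout=30 on-fail=restart failure-timeout=10s
```

After an operation fails, look under /var/lib/heartbeat/trace_ra/asterisk/ for the timestamped trace file matching that operation.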
Re: [ClusterLabs] Feedback wanted: OCF Resource Agent API 1.1 proposed for adoption
OCF 1.1 is now formally adopted! https://github.com/ClusterLabs/OCF-spec/blob/master/ra/1.1/resource-agent-api.md Thanks to everyone who gave feedback. Now to add support for it ... On Tue, 2021-03-09 at 17:07 -0600, Ken Gaillot wrote: > Hi all, > > After many false starts over the years, we finally have a proposed > 1.1 > version of the resource agent standard. > > Discussion is invited here and/or on the pull request: > > https://github.com/ClusterLabs/OCF-spec/pull/24 > > One goal is to formalize widespread existing practices that deviate > from the 1.0 standard, such as the notify, promote, and demote > actions; > exit statuses 8, 9, 190, and 191; and allowing installers to choose > where agents are installed (officially /usr/ocf/resource.d in 1.0, > even > though everyone actually uses /usr/lib/ocf/resource.d). > > Another goal is to add optional new meta-data hints that user > interfaces can benefit from, such as whether a parameter is required > or > deprecated. > > The new standard deprecates the "unique" descriptor for parameters, > which was misused by Pacemaker, and replaces it with two new ones, > "reloadable" (to handle what Pacemaker used it for) and "unique- > group" > (to handle its original purpose more flexibly). A new "reload-params" > action updates any "reloadable" parameters. > > The last major change is completing the transition away from > master/slave terminology, renaming the roles to promoted/unpromoted. > > The changes are designed to be backward-compatible, so for the most > part, agents and software written to either standard can be used with > each other. However for agents that support promote/demote (which > were > not part of 1.0), it is recommended to use 1.1 agents only with > software that explicitly supports 1.1. Once the 1.1 standard is > adopted, we intend to update all ClusterLabs software to support it. 
> > The pull request description has a more detailed summary of all the > changes, and the standard itself can be compared with: > > https://github.com/ClusterLabs/OCF-spec/blob/master/ra/1.0/resource-agent-api.md > > https://github.com/kgaillot/OCF-spec/blob/ocf1.1/ra/1.1/resource-agent-api.md > > My goal is to merge the pull request formally adopting 1.1 by the end > of this month. -- Ken Gaillot ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
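[Editorial note: for quick reference, the exit statuses mentioned above (8, 9, 190, 191) receive formal promoted-role names in 1.1. The mapping below is a sketch; names are taken from the adopted spec, and the helper function itself is purely illustrative.]

```shell
#!/bin/sh
# Map the promotion-related OCF exit statuses formalized in OCF 1.1
# to their 1.1 names (formerly the master/slave-flavoured names).
ocf_status_name() {
    case "$1" in
        8)   echo "OCF_RUNNING_PROMOTED" ;;   # was OCF_RUNNING_MASTER
        9)   echo "OCF_FAILED_PROMOTED" ;;    # was OCF_FAILED_MASTER
        190) echo "OCF_DEGRADED" ;;
        191) echo "OCF_DEGRADED_PROMOTED" ;;  # was OCF_DEGRADED_MASTER
        *)   echo "UNKNOWN" ;;
    esac
}
ocf_status_name 8    # -> OCF_RUNNING_PROMOTED
ocf_status_name 191  # -> OCF_DEGRADED_PROMOTED
```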
Re: [ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?
On Friday 26 March 2021 at 18:31:51, Ken Gaillot wrote: > On Fri, 2021-03-26 at 19:59 +0300, Andrei Borzenkov wrote: > > On 26.03.2021 17:28, Antony Stone wrote: > > > > > > So far all is well and good, my cluster synchronises, starts the > > > resources, and everything's working as expected. It'll move the > > > resources from one cluster member to another (either if I ask it to, or > > > if there's a problem), and it seems to work just as the older version > > > did. > > I'm glad this far was easy :) Well, I've been using corosync & pacemaker for some years now; I've got used to some of their quirks and foibles :) Now I just need to learn about the new ones for the newer versions... > It's worth noting that pacemaker itself doesn't try to validate the > agent meta-data, it just checks for the pieces that are interesting to > it and ignores the rest. I guess that's good, so long as what it does pay attention to is what it wants to see? > It's also worth noting that the OCF 1.0 standard is horribly outdated > compared to actual use, and the OCF 1.1 standard is being adopted today > (!) after many years of trying to come up with something more up-to-date. So, is ocf-tester no longer the right tool I should be using to check this sort of thing? What should I be doing instead to make sure my configuration is valid / acceptable to pacemaker? > Bottom line, it's worth installing xmllint to see if that helps, but I > wouldn't worry about meta-data schema issues. Well, as stated in my other reply to Andrei, I now get: /usr/lib/ocf/resource.d/heartbeat/asterisk passed all tests /usr/lib/ocf/resource.d/heartbeat/anything passed all tests so I guess it means my configuration file is okay, and I need to look somewhere else to find out why pacemaker 2.0.1 is throwing wobblies with exactly the same resources that pacemaker 1.1.16 can manage quite happily and stably... > > Either agent does not run as root or something blocks chown. 
Usual > > suspects are apparmor or SELinux. > > Pacemaker itself can also return this error in certain cases, such as > not having permissions to execute the agent. Check the pacemaker detail > log (usually /var/log/pacemaker/pacemaker.log) and the system log > around these times to see if there is more detail. I've turned on debug logging, but I'm still not sure I'm seeing *exactly* what the resource agent checker is doing when it gets this failure. > It is definitely weird that a privileges error would be sporadic. > Hopefully the logs can shed some more light. I've captured a bunch of them this afternoon and will go through them on Monday - it's pretty verbose! > Another possibility would be to set trace_ra=1 on the actions that are > failing to get line-by-line info from the agents. So, that would be an extra parameter to the resource definition in cluster.cib? Change: primitive Asterisk asterisk meta migration-threshold=3 op monitor interval=5 timeout=30 on-fail=restart failure-timeout=10s to: primitive Asterisk asterisk meta migration-threshold=3 op monitor interval=5 timeout=30 on-fail=restart failure-timeout=10s trace_ra=1 ? Antony. -- "It is easy to be blinded to the essential uselessness of them by the sense of achievement you get from getting them to work at all. In other words - and this is the rock solid principle on which the whole of the Corporation's Galaxy-wide success is founded - their fundamental design flaws are completely hidden by their superficial design flaws." - Douglas Noel Adams Please reply to the list; please *don't* CC me. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
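[Editorial note: the rc=127 verdict in this thread was really just the missing xmllint binary, not a broken agent. A preflight check like the following avoids the misleading "failed 1 tests" result; the package name is the Debian one and may differ on other distributions.]

```shell
#!/bin/sh
# Check for xmllint before trusting ocf-tester's meta-data validation.
# On Debian/Ubuntu, xmllint is shipped in the libxml2-utils package.
if command -v xmllint >/dev/null 2>&1; then
    echo "xmllint present: ocf-tester meta-data validation will work"
else
    echo "xmllint missing: install libxml2-utils before trusting ocf-tester"
fi
```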
Re: [ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?
On Friday 26 March 2021 at 17:59:07, Andrei Borzenkov wrote: > On 26.03.2021 17:28, Antony Stone wrote: > > # ocf-tester -n Asterisk /usr/lib/ocf/resource.d/heartbeat/asterisk > > Beginning tests for /usr/lib/ocf/resource.d/heartbeat/asterisk... > > /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found > > * rc=127: Your agent produces meta-data which does not conform to > > ra-api-1.dtd * Your agent does not support the notify action (optional) > > * Your agent does not support the demote action (optional) > > * Your agent does not support the promote action (optional) > > * Your agent does not support master/slave (optional) > > * Your agent does not support the reload action (optional) > > Tests failed: /usr/lib/ocf/resource.d/heartbeat/asterisk failed 1 tests > As is pretty clear from error messages, ocf-tester calls xmllint which > is missing. Ah, I had not realised that this meant the rest of the output would be invalid. I thought it just meant "you don't have xmllint installed, so there's some stuff we might otherwise be able to tell you, but can't". If xmllint being installed is a requirement for the remainder of the output to be meaningful, I'd have expected that ocf-tester would simply give up at that point and tell me that until I install xmllint, ocf-tester can't do its job. That seems like a bit of a bug to me. After installing xmllint I now get: /usr/lib/ocf/resource.d/heartbeat/asterisk passed all tests /usr/lib/ocf/resource.d/heartbeat/anything passed all tests So I'm now back to working out how to debug the failures I do see in "normal" operation, which were not occurring with the older versions of corosync & pacemaker... 
> > My second question is: how can I debug what caused pacemaker to decide > > that it couldn't run Asterisk due to "insufficient privileges" > Agent returns this error if it fails to chown directory specified in its > configuration file: > > # Regardless of whether we just created the directory or it > # already existed, check whether it is writable by the configured > # user > if ! su -s /bin/sh - $OCF_RESKEY_user -c "test -w $dir"; then > ocf_log warn "Directory $dir is not writable by > $OCF_RESKEY_user, attempting chown" > ocf_run chown $OCF_RESKEY_user:$OCF_RESKEY_group $dir \ > > || exit $OCF_ERR_PERM > > Either agent does not run as root or something blocks chown. Usual > suspects are apparmor or SELinux. Well, I'm not running either of those, but your comments point me in what I think is a helpful direction - thanks. Regards, Antony. -- It may not seem obvious, but (6 x 5 + 5) x 5 - 55 equals 5! Please reply to the list; please *don't* CC me. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
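[Editorial note: the check Andrei quotes from the asterisk agent can be exercised outside the cluster. The sketch below is deliberately simplified: it tests writability as the current user instead of su'ing to $OCF_RESKEY_user, so it runs without root and without the OCF environment.]

```shell
#!/bin/sh
# Simplified stand-in for the agent's directory check: verify the
# directory is writable, mimicking the step whose failure path ends in
# a chown attempt and, if that fails too, exit OCF_ERR_PERM.
OCF_ERR_PERM=4   # exit status the real agent uses on a failed chown
dir=$(mktemp -d)
if [ -w "$dir" ]; then
    echo "writable: the agent would proceed"
else
    echo "not writable: the agent would chown, then exit $OCF_ERR_PERM on failure"
fi
rmdir "$dir"
```

Run as the cluster does (as root) versus as an unprivileged user to see whether something like AppArmor or SELinux, rather than plain file modes, is what blocks the real agent.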
Re: [ClusterLabs] Community adoption of PAF vs pgsql
If you have an enterprise support agreement, be sure to also explore whether your vendor supports one and not the other. For example, Red Hat currently supports pgsql but not PAF (though there is an open BZ to add support for PAF). On Fri, Mar 26, 2021 at 9:14 AM Jehan-Guillaume de Rorthais wrote: > Hi, > > I'm one of the PAF author, so I'm biased. > > On Fri, 26 Mar 2021 14:51:28 + > Isaac Pittman wrote: > > > My team has the opportunity to update our PostgreSQL resource agent to > either > > PAF (https://github.com/ClusterLabs/PAF) or pgsql > > ( > https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql > ), > > and I've been charged with comparing them. > > In my opinion, you should spend time to actually build some "close-to-prod" > clusters and train them. Then you'll be able to choose base on some team > experience. > > Both agent have very different spirit and very different administrative > tasks. > > Break your cluster, make some switchover, some failover, how to failback a > node > and so on. > > > After searching various mailing lists and reviewing the code and > > documentation, it seems like either could suit our needs and both are > > actively maintained. > > > > One factor that I couldn't get a sense of is community support and > adoption: > > > > * Does PAF or pgsql enjoy wider community support or adoption, > especially > > for new projects? (I would expect many older projects to be on pgsql due > to > > its longer history.) > > Sadly, I have absolutely no clues... > > > * Does either seem to be on the road to deprecation? > > PAF is not on its way to deprecation, I have a pending TODO list for it. > > I would bet pgsql is not on its way to deprecation either, but I can't > speak > for the real authors. 
> > Regards, > ___ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ > > -- Regards, Reid Wahl, RHCA Senior Software Maintenance Engineer, Red Hat CEE - Platform Support Delivery - ClusterHA ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Order set troubles
On Fri, Mar 26, 2021 at 6:27 AM Andrei Borzenkov wrote: > On Fri, Mar 26, 2021 at 10:17 AM Ulrich Windl > wrote: > > > > >>> Andrei Borzenkov schrieb am 26.03.2021 um > 06:19 in > > Nachricht <534274b3-a6de-5fac-0ae4-d02c305f1...@gmail.com>: > > > On 25.03.2021 21:45, Reid Wahl wrote: > > >> FWIW we have this KB article (I seem to remember Strahil is a Red Hat > > >> customer): > > >> - How do I configure SAP HANA Scale-Up System Replication in a > Pacemaker > > >> cluster when the HANA filesystems are on NFS shares?( > > >> https://access.redhat.com/solutions/5156571) > > >> > > > > > > "How do I make the cluster resources recover when one node loses access > > > to the NFS server?" > > > > > > If node loses access to NFS server then monitor operations for > resources > > > that depend on NFS availability will fail or timeout and pacemaker will > > > recover (likely by rebooting this node). That's how similar > > > configurations have been handled for the past 20 years in other HA > > > managers. I am genuinely interested, have you encountered the case > where > > > it was not enough? > > > > That's a big problem with the SAP design (basically it's just too > complex). > > In the past I had written a kind of resource agent that worked without > that > > overly complex overhead, but since those days SAP has added much more > > complexity. > > If the NFS server is external, pacemaker could fence your nodes when the > NFS > > server is down as first the monitor operation will fail (hanging on > NFS), the > > the recover (stop/start) will fail (also hanging on NFS). > > And how exactly placing NFS resource under pacemaker control is going > to change it? > I noted earlier based on the old case notes: "Apparently there were situations in which the SAPHana resource wasn't failing over when connectivity was lost with the NFS share that contained the hdb* binaries and the HANA data. 
I don't remember the exact details (whether demotion was failing, or whether it wasn't even trying to demote on the primary and promote on the secondary, or what). Either way, I was surprised that this procedure was necessary, but it seemed to be." Strahil may be dealing with a similar situation, not sure. I get where you're coming from -- I too would expect the application that depends on NFS to simply fail when NFS connectivity is lost, which in turn leads to failover and recovery. For whatever reason, due to some weirdness of the SAPHana resource agent, that didn't happen. > > Even when fencing the > > node it would not help (resources cannot start) if the NFS server is > still > > down. > > And how exactly placing NFS resource under pacemaker control is going > to change it? > > > So you may end up with all your nodes being fenced and the fail counts > > disabling any automatic resource restart. > > > > And how exactly placing NFS resource under pacemaker control is going > to change it? > ___ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ > > -- Regards, Reid Wahl, RHCA Senior Software Maintenance Engineer, Red Hat CEE - Platform Support Delivery - ClusterHA ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Order set troubles
On Fri, Mar 26, 2021 at 4:06 AM Strahil Nikolov wrote: > Thanks everyone! I really appreciate your help. > > Actually , I found a RH solution (#5423971) that gave me enough ideas /it > is missing some steps/ to set up the cluster properly. > Careful. That solution is for Scale-Out. The solution I gave you[1] is a similar procedure intended for HANA in a Scale-Up configuration. Use whichever one is appropriate to your deployment. I didn't think about Scale-Out at first, because most customers I interact with use Scale-Up. [1] https://access.redhat.com/solutions/5156571 > So far , I have never used node attributes, order sets and location > constraints based on 'ocf:pacemaker:attribute's active/inactive values . > > I can say that I have learned a lot today. > > > Best Regards, > Strahil Nikolov > -- Regards, Reid Wahl, RHCA Senior Software Maintenance Engineer, Red Hat CEE - Platform Support Delivery - ClusterHA ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
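[Editorial note: for readers without access to the Red Hat knowledge base, the general shape of the pattern Strahil mentions (a cloned NFS mount, a cloned ocf:pacemaker:attribute resource ordered after it, and a location rule keyed on the node attribute) might look roughly like this in crmsh. Every name and value below is an illustrative placeholder, not the content of either KB article.]

```
primitive fs_hana_shared Filesystem \
    params device="nfsserver:/export/hana_shared" directory="/hana/shared" fstype=nfs \
    op monitor interval=20s timeout=40s
primitive nfs_active ocf:pacemaker:attribute \
    params name=hana-nfs-ok active_value=true inactive_value=false
clone cl_fs_hana_shared fs_hana_shared
clone cl_nfs_active nfs_active
order o_nfs_before_attr Mandatory: cl_fs_hana_shared cl_nfs_active
location l_hana_needs_nfs msl_SAPHana \
    rule -inf: hana-nfs-ok ne true
```

The idea is that when the NFS mount fails on a node, the attribute resource stops and flips the node attribute, and the location rule then pushes the SAPHana resource off that node instead of letting its own operations hang on NFS. The real procedures involve additional constraints and steps; treat this purely as orientation.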
Re: [ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?
On Fri, 2021-03-26 at 19:59 +0300, Andrei Borzenkov wrote: > On 26.03.2021 17:28, Antony Stone wrote: > > Hi. > > > > I've just signed up to the list. I've been using corosync and > > pacemaker for > > several years, mostly under Debian 9, which means: > > > > corosync 2.4.2 > > pacemaker 1.1.16 > > > > I've recently upgraded a test cluster to Debian 10, which gives me: > > > > corosync 3.0.1 > > pacemaker 2.0.1 > > > > I've made a few adjustments to my /etc/corosync/corosync.conf > > configuration so > > that corosync seems happy, and also some minor changes (mostly to > > the cluster > > defaults) in /etc/corosync/cluster.cib so that pacemaker is happy. > > > > So far all is well and good, my cluster synchronises, starts the > > resources, > > and everything's working as expected. It'll move the resources > > from one > > cluster member to another (either if I ask it to, or if there's a > > problem), > > and it seems to work just as the older version did. I'm glad this far was easy :) > > Then, several times a day, I get resource failures such as: > > > > * Asterisk_start_0 on castor 'insufficient privileges' (4): > > call=58, > > status=complete, > > exitreason='', > > last-rc-change='Fri Mar 26 13:37:08 2021', > > queued=0ms, > > exec=55ms > > > > I have no idea why the machine might tell me it cannot start > > Asterisk due to > > insufficient privilege when it's already been able to run it before > > the cluster > > resources moved back to this machine. Asterisk *can* and *does* > > run on this > > machine. 
> > > > Another error I get is: > > > > * Kann-Bear_monitor_5000 on helen 'unknown error' (1): > > call=62, > > status=complete, > > exitreason='', > > last-rc-change='Fri Mar 26 14:23:05 2021', > > queued=0ms, > > exec=0ms > > > > Now, that second resource is one which doesn't have a standard > > resource agent > > available for it under /usr/lib/ocf/resource.d, so I'm using the > > general- > > purpose agent /usr/lib/ocf/resource.d/heartbeat/anything to manage > > it. > > > > I thought, "perhaps there's something dodgy about using this > > 'anything' agent, > > because it can't really know about the resource it's managing", so > > I tested it > > with ocf-tester: > > > > # ocf-tester -n Kann-Bear -o binfile="/usr/sbin/bearerbox" -o > > cmdline_options="/etc/kannel/kannel.conf" -o > > pidfile="/var/run/kannel/kannel_bearerbox.pid" > > /usr/lib/ocf/resource.d/heartbeat/anything > > Beginning tests for /usr/lib/ocf/resource.d/heartbeat/anything... > > /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found > > * rc=127: Your agent produces meta-data which does not conform to > > ra-api-1.dtd > > * Your agent does not support the notify action (optional) > > * Your agent does not support the demote action (optional) > > * Your agent does not support the promote action (optional) > > * Your agent does not support master/slave (optional) > > * Your agent does not support the reload action (optional) > > Tests failed: /usr/lib/ocf/resource.d/heartbeat/anything failed 1 > > tests > > > > Okay, something's not right. > > > > BUT, it doesn't matter *which* resource agent I test, it tells me > > the same > > thing every time, including for the built-in standard agents: > > > > * rc=127: Your agent produces meta-data which does not conform to > > ra-api-1.dtd > > > > For example: > > > > # ocf-tester -n Asterisk /usr/lib/ocf/resource.d/heartbeat/asterisk > > Beginning tests for /usr/lib/ocf/resource.d/heartbeat/asterisk... 
> > /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found > > * rc=127: Your agent produces meta-data which does not conform to > > ra-api-1.dtd > > * Your agent does not support the notify action (optional) > > * Your agent does not support the demote action (optional) > > * Your agent does not support the promote action (optional) > > * Your agent does not support master/slave (optional) > > * Your agent does not support the reload action (optional) > > Tests failed: /usr/lib/ocf/resource.d/heartbeat/asterisk failed 1 > > tests > > > > > > # ocf-tester -n IP-Float4 -o ip=10.1.0.42 -o cidr_netmask=28 > > /usr/lib/ocf/resource.d/heartbeat/IPaddr2 > > Beginning tests for /usr/lib/ocf/resource.d/heartbeat/IPaddr2... > > /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found > > * rc=127: Your agent produces meta-data which does not conform to > > ra-api-1.dtd > > * Your agent does not support the notify action (optional) > > * Your agent does not support the demote action (optional) > > * Your agent does not support the promote action (optional) > > * Your agent does not support master/slave (optional) > > * Your agent does not support the reload action (optional) > > Tests failed: /usr/lib/ocf/resource.d/heartbeat/IPaddr2 failed 1 > > tests > > > > > > So, it seems to be telling me that even the standard built-in > > resource
Re: [ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?
On 26.03.2021 17:28, Antony Stone wrote: > Hi. > > I've just signed up to the list. I've been using corosync and pacemaker for > several years, mostly under Debian 9, which means: > > corosync 2.4.2 > pacemaker 1.1.16 > > I've recently upgraded a test cluster to Debian 10, which gives me: > > corosync 3.0.1 > pacemaker 2.0.1 > > I've made a few adjustments to my /etc/corosync/corosync.conf configuration > so > that corosync seems happy, and also some minor changes (mostly to the cluster > defaults) in /etc/corosync/cluster.cib so that pacemaker is happy. > > So far all is well and good, my cluster synchronises, starts the resources, > and everything's working as expected. It'll move the resources from one > cluster member to another (either if I ask it to, or if there's a problem), > and it seems to work just as the older version did. > > Then, several times a day, I get resource failures such as: > > * Asterisk_start_0 on castor 'insufficient privileges' (4): >call=58, >status=complete, >exitreason='', >last-rc-change='Fri Mar 26 13:37:08 2021', >queued=0ms, >exec=55ms > > I have no idea why the machine might tell me it cannot start Asterisk due to > insufficient privilege when it's already been able to run it before the > cluster > resources moved back to this machine. Asterisk *can* and *does* run on this > machine. > > Another error I get is: > > * Kann-Bear_monitor_5000 on helen 'unknown error' (1): >call=62, >status=complete, >exitreason='', >last-rc-change='Fri Mar 26 14:23:05 2021', >queued=0ms, >exec=0ms > > Now, that second resource is one which doesn't have a standard resource agent > available for it under /usr/lib/ocf/resource.d, so I'm using the general- > purpose agent /usr/lib/ocf/resource.d/heartbeat/anything to manage it. 
> > I thought, "perhaps there's something dodgy about using this 'anything' > agent, > because it can't really know about the resource it's managing", so I tested > it > with ocf-tester: > > # ocf-tester -n Kann-Bear -o binfile="/usr/sbin/bearerbox" -o > cmdline_options="/etc/kannel/kannel.conf" -o > pidfile="/var/run/kannel/kannel_bearerbox.pid" > /usr/lib/ocf/resource.d/heartbeat/anything > Beginning tests for /usr/lib/ocf/resource.d/heartbeat/anything... > /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found > * rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd > * Your agent does not support the notify action (optional) > * Your agent does not support the demote action (optional) > * Your agent does not support the promote action (optional) > * Your agent does not support master/slave (optional) > * Your agent does not support the reload action (optional) > Tests failed: /usr/lib/ocf/resource.d/heartbeat/anything failed 1 tests > > Okay, something's not right. > > BUT, it doesn't matter *which* resource agent I test, it tells me the same > thing every time, including for the built-in standard agents: > > * rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd > > For example: > > # ocf-tester -n Asterisk /usr/lib/ocf/resource.d/heartbeat/asterisk > Beginning tests for /usr/lib/ocf/resource.d/heartbeat/asterisk... 
> /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found > * rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd > * Your agent does not support the notify action (optional) > * Your agent does not support the demote action (optional) > * Your agent does not support the promote action (optional) > * Your agent does not support master/slave (optional) > * Your agent does not support the reload action (optional) > Tests failed: /usr/lib/ocf/resource.d/heartbeat/asterisk failed 1 tests > > > # ocf-tester -n IP-Float4 -o ip=10.1.0.42 -o cidr_netmask=28 > /usr/lib/ocf/resource.d/heartbeat/IPaddr2 > Beginning tests for /usr/lib/ocf/resource.d/heartbeat/IPaddr2... > /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found > * rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd > * Your agent does not support the notify action (optional) > * Your agent does not support the demote action (optional) > * Your agent does not support the promote action (optional) > * Your agent does not support master/slave (optional) > * Your agent does not support the reload action (optional) > Tests failed: /usr/lib/ocf/resource.d/heartbeat/IPaddr2 failed 1 tests > > > So, it seems to be telling me that even the standard built-in resource agents > "produce meta-data which does not conform to ra-api-1.dtd" > > > My first question is: what's going wrong here? Am I using ocf-tester > incorrectly, or is it a bug? > As is pretty clear from error messages, ocf-tester calls xmllint which is missing. > My second question is: how can I debug what caused pacemaker to decide
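For what it's worth, rc=127 is the shell's "command not found" exit status: ocf-tester shells out to xmllint to validate each agent's meta-data against ra-api-1.dtd, so a missing xmllint makes every agent appear to fail that one test. A minimal sketch to confirm the diagnosis (the libxml2-utils package name is the usual Debian home of xmllint, stated here as an assumption):

```shell
#!/bin/sh
# rc=127 means "command not found". ocf-tester invokes xmllint to validate
# agent meta-data, so if xmllint is absent, *every* agent fails the
# meta-data test regardless of how well-formed its XML actually is.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}
check_cmd sh       # sanity check: always present on a POSIX system
check_cmd xmllint  # on Debian, xmllint ships in the libxml2-utils package
```

If xmllint is reported missing, installing libxml2-utils and re-running ocf-tester should make the rc=127 meta-data "failure" disappear for all agents.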
Re: [ClusterLabs] Community adoption of PAF vs pgsql
Hi, I'm one of the PAF authors, so I'm biased. On Fri, 26 Mar 2021 14:51:28 + Isaac Pittman wrote: > My team has the opportunity to update our PostgreSQL resource agent to either > PAF (https://github.com/ClusterLabs/PAF) or pgsql > (https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql), > and I've been charged with comparing them. In my opinion, you should spend time actually building some "close-to-prod" clusters and training on them. Then you'll be able to choose based on some team experience. The two agents have very different spirits and very different administrative tasks. Break your cluster, do some switchovers and some failovers, practice failing back a node, and so on. > After searching various mailing lists and reviewing the code and > documentation, it seems like either could suit our needs and both are > actively maintained. > > One factor that I couldn't get a sense of is community support and adoption: > > * Does PAF or pgsql enjoy wider community support or adoption, especially > for new projects? (I would expect many older projects to be on pgsql due to > its longer history.) Sadly, I have absolutely no clue... > * Does either seem to be on the road to deprecation? PAF is not on its way to deprecation; I have a pending TODO list for it. I would bet pgsql is not on its way to deprecation either, but I can't speak for its actual authors. Regards, ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] Community adoption of PAF vs pgsql
Hi All, My team has the opportunity to update our PostgreSQL resource agent to either PAF (https://github.com/ClusterLabs/PAF) or pgsql (https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql), and I've been charged with comparing them. After searching various mailing lists and reviewing the code and documentation, it seems like either could suit our needs and both are actively maintained. One factor that I couldn't get a sense of is community support and adoption: * Does PAF or pgsql enjoy wider community support or adoption, especially for new projects? (I would expect many older projects to be on pgsql due to its longer history.) * Does either seem to be on the road to deprecation? Any insight is much appreciated! Thanks, Isaac P.S. This is my first time posting to the mailing list. Please let me know if I've missed any protocols, and apologies in advance.
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Order set troubles
Thanks everyone! I really appreciate your help. Actually, I found a RH solution (#5423971) that gave me enough ideas /it is missing some steps/ to set up the cluster properly. So far, I had never used node attributes, order sets, or location constraints based on 'ocf:pacemaker:attribute's active/inactive values. I can say that I have learned a lot today. Best Regards, Strahil Nikolov
[ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?
Hi. I've just signed up to the list. I've been using corosync and pacemaker for several years, mostly under Debian 9, which means: corosync 2.4.2 pacemaker 1.1.16 I've recently upgraded a test cluster to Debian 10, which gives me: corosync 3.0.1 pacemaker 2.0.1 I've made a few adjustments to my /etc/corosync/corosync.conf configuration so that corosync seems happy, and also some minor changes (mostly to the cluster defaults) in /etc/corosync/cluster.cib so that pacemaker is happy. So far all is well and good, my cluster synchronises, starts the resources, and everything's working as expected. It'll move the resources from one cluster member to another (either if I ask it to, or if there's a problem), and it seems to work just as the older version did. Then, several times a day, I get resource failures such as: * Asterisk_start_0 on castor 'insufficient privileges' (4): call=58, status=complete, exitreason='', last-rc-change='Fri Mar 26 13:37:08 2021', queued=0ms, exec=55ms I have no idea why the machine might tell me it cannot start Asterisk due to insufficient privilege when it's already been able to run it before the cluster resources moved back to this machine. Asterisk *can* and *does* run on this machine. Another error I get is: * Kann-Bear_monitor_5000 on helen 'unknown error' (1): call=62, status=complete, exitreason='', last-rc-change='Fri Mar 26 14:23:05 2021', queued=0ms, exec=0ms Now, that second resource is one which doesn't have a standard resource agent available for it under /usr/lib/ocf/resource.d, so I'm using the general- purpose agent /usr/lib/ocf/resource.d/heartbeat/anything to manage it. 
I thought, "perhaps there's something dodgy about using this 'anything' agent, because it can't really know about the resource it's managing", so I tested it with ocf-tester: # ocf-tester -n Kann-Bear -o binfile="/usr/sbin/bearerbox" -o cmdline_options="/etc/kannel/kannel.conf" -o pidfile="/var/run/kannel/kannel_bearerbox.pid" /usr/lib/ocf/resource.d/heartbeat/anything Beginning tests for /usr/lib/ocf/resource.d/heartbeat/anything... /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found * rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd * Your agent does not support the notify action (optional) * Your agent does not support the demote action (optional) * Your agent does not support the promote action (optional) * Your agent does not support master/slave (optional) * Your agent does not support the reload action (optional) Tests failed: /usr/lib/ocf/resource.d/heartbeat/anything failed 1 tests Okay, something's not right. BUT, it doesn't matter *which* resource agent I test, it tells me the same thing every time, including for the built-in standard agents: * rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd For example: # ocf-tester -n Asterisk /usr/lib/ocf/resource.d/heartbeat/asterisk Beginning tests for /usr/lib/ocf/resource.d/heartbeat/asterisk... 
/usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found * rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd * Your agent does not support the notify action (optional) * Your agent does not support the demote action (optional) * Your agent does not support the promote action (optional) * Your agent does not support master/slave (optional) * Your agent does not support the reload action (optional) Tests failed: /usr/lib/ocf/resource.d/heartbeat/asterisk failed 1 tests # ocf-tester -n IP-Float4 -o ip=10.1.0.42 -o cidr_netmask=28 /usr/lib/ocf/resource.d/heartbeat/IPaddr2 Beginning tests for /usr/lib/ocf/resource.d/heartbeat/IPaddr2... /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found * rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd * Your agent does not support the notify action (optional) * Your agent does not support the demote action (optional) * Your agent does not support the promote action (optional) * Your agent does not support master/slave (optional) * Your agent does not support the reload action (optional) Tests failed: /usr/lib/ocf/resource.d/heartbeat/IPaddr2 failed 1 tests So, it seems to be telling me that even the standard built-in resource agents "produce meta-data which does not conform to ra-api-1.dtd" My first question is: what's going wrong here? Am I using ocf-tester incorrectly, or is it a bug? My second question is: how can I debug what caused pacemaker to decide that it couldn't run Asterisk due to "insufficient privileges" on a machine which is perfectly well capable of running Asterisk, including when it gets started by pacemaker (in fact, that's the only way Asterisk gets started on these machines; it's a floating resource which pacemaker is in charge of). Please
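A note on the failure strings in the status output above: the parenthesized numbers are the standard OCF resource-agent exit codes as pacemaker reports them, so 'insufficient privileges' (4) is OCF_ERR_PERM and 'unknown error' (1) is OCF_ERR_GENERIC, both returned by the agent itself rather than invented by pacemaker. A small lookup sketch of the standard codes (these values come from the OCF return-code convention, not from this thread):

```shell
#!/bin/sh
# Map the standard OCF resource-agent exit codes to their names, as seen in
# pacemaker failure messages like "'insufficient privileges' (4)".
ocf_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;       # reported as "unknown error"
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;          # reported as "insufficient privileges"
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        *) echo UNKNOWN ;;
    esac
}
ocf_name 4
```

So the place to look is the agent's start action: something in the environment the agent sees when pacemaker runs it (user, permissions on PID/log directories, SELinux context, ulimits) is making it return OCF_ERR_PERM, even though Asterisk starts fine in other contexts.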
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Order set troubles
On Fri, Mar 26, 2021 at 10:17 AM Ulrich Windl wrote: > > >>> Andrei Borzenkov schrieb am 26.03.2021 um 06:19 in > Nachricht <534274b3-a6de-5fac-0ae4-d02c305f1...@gmail.com>: > > On 25.03.2021 21:45, Reid Wahl wrote: > >> FWIW we have this KB article (I seem to remember Strahil is a Red Hat > >> customer): > >> - How do I configure SAP HANA Scale-Up System Replication in a Pacemaker > >> cluster when the HANA filesystems are on NFS shares?( > >> https://access.redhat.com/solutions/5156571) > >> > > > > "How do I make the cluster resources recover when one node loses access > > to the NFS server?" > > > > If node loses access to NFS server then monitor operations for resources > > that depend on NFS availability will fail or timeout and pacemaker will > > recover (likely by rebooting this node). That's how similar > > configurations have been handled for the past 20 years in other HA > > managers. I am genuinely interested, have you encountered the case where > > it was not enough? > > That's a big problem with the SAP design (basically it's just too complex). > In the past I had written a kind of resource agent that worked without that > overly complex overhead, but since those days SAP has added much more > complexity. > If the NFS server is external, pacemaker could fence your nodes when the NFS > server is down as first the monitor operation will fail (hanging on NFS), the > the recover (stop/start) will fail (also hanging on NFS). And how exactly placing NFS resource under pacemaker control is going to change it? > Even when fencing the > node it would not help (resources cannot start) if the NFS server is still > down. And how exactly placing NFS resource under pacemaker control is going to change it? > So you may end up with all your nodes being fenced and the fail counts > disabling any automatic resource restart. > And how exactly placing NFS resource under pacemaker control is going to change it? 
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Order set troubles
Just a clarification. I'm using separate NFS shares for each HANA, so even if someone wipes the NFS for DC1, the cluster will fail over to DC2 (separate NFS) and survive. Best Regards, Strahil Nikolov
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Order set troubles
On Fri, Mar 26, 2021 at 12:17 AM Ulrich Windl < ulrich.wi...@rz.uni-regensburg.de> wrote: > >>> Andrei Borzenkov schrieb am 26.03.2021 um 06:19 > in > Nachricht <534274b3-a6de-5fac-0ae4-d02c305f1...@gmail.com>: > > On 25.03.2021 21:45, Reid Wahl wrote: > >> FWIW we have this KB article (I seem to remember Strahil is a Red Hat > >> customer): > >> - How do I configure SAP HANA Scale-Up System Replication in a > Pacemaker > >> cluster when the HANA filesystems are on NFS shares?( > >> https://access.redhat.com/solutions/5156571) > >> > > > > "How do I make the cluster resources recover when one node loses access > > to the NFS server?" > > > > If node loses access to NFS server then monitor operations for resources > > that depend on NFS availability will fail or timeout and pacemaker will > > recover (likely by rebooting this node). That's how similar > > configurations have been handled for the past 20 years in other HA > > managers. I am genuinely interested, have you encountered the case where > > it was not enough? > > That's a big problem with the SAP design (basically it's just too complex). > +1000 to this. In the past I had written a kind of resource agent that worked without that > overly complex overhead, but since those days SAP has added much more > complexity. > If the NFS server is external, pacemaker could fence your nodes when the > NFS > server is down as first the monitor operation will fail (hanging on NFS), > the > the recover (stop/start) will fail (also hanging on NFS). Even when > fencing the > node it would not help (resources cannot start) if the NFS server is still > down. So you may end up with all your nodes being fenced and the fail > counts > disabling any automatic resource restart. > > > > >> I can't remember if there was some valid reason why we had to use an > >> attribute resource, or if we simply didn't think about the > sequential=false > >> require-all=false constraint set approach when planning this out. 
> >> > > > > Because as I already replied, this has different semantic - it will > > start HANA on both nodes if NFS comes up on any one node. > > > > But thank you for the pointer, it demonstrates really interesting > > technique. It also confirms that pacemaker does not have native means to > > express such ordering dependency/constraints. May be it should. > > > >> On Thu, Mar 25, 2021 at 3:39 AM Strahil Nikolov > >> wrote: > >> > >>> OCF_CHECK_LEVEL 20 > >>> NFS sometimes fails to start (systemd racing condition with dnsmasq) > >>> > >>> Best Regards, > >>> Strahil Nikolov > >>> > >>> On Thu, Mar 25, 2021 at 12:18, Andrei Borzenkov > >>> wrote: > >>> On Thu, Mar 25, 2021 at 10:31 AM Strahil Nikolov < > hunter86...@yahoo.com> > >>> wrote: > > Use Case: > > nfsA is shared filesystem for HANA running in site A > nfsB is shared filesystem for HANA running in site B > > clusterized resource of type SAPHanaTopology must run on all systems > if > >>> the FS for the HANA is running > > >>> > >>> And the reason you put NFS under pacemaker control in the first place? > >>> It is not going to switch over, just put it in /etc/fstab. > >>> > Yet, if siteA dies for some reason, I want to make SAPHanaTopology to > >>> still start on the nodes in site B. > > I think that it's a valid use case. > > Best Regards, > Strahil Nikolov > > On Thu, Mar 25, 2021 at 8:59, Ulrich Windl > wrote: > >>> Ken Gaillot schrieb am 24.03.2021 um 18:56 > in > Nachricht > <5bffded9c6e614919981dcc7d0b2903220bae19d.ca...@redhat.com>: > > On Wed, 2021‑03‑24 at 09:27 +, Strahil Nikolov wrote: > >> Hello All, > >> > >> I have a trouble creating an order set . > >> The end goal is to create a 2 node cluster where nodeA will mount > >> nfsA , while nodeB will mount nfsB.On top of that a depended cloned > >> resource should start on the node only if nfsA or nfsB has started > >> locally. > > This looks like ad odd design to me, and I wonder: What is the use > case? 
> (We are using "NFS loop-mounts" for many years, where the cluster > needs > >>> the > NFS service it provides, but that's a different design) > > Regards, > Ulrich > > > > >> > >> A prototype code would be something like: > >> pcs constraint order start (nfsA or nfsB) then start resource‑clone > >> > >> I tried to create a set like this, but it works only on nodeB: > >> pcs constraint order set nfsA nfsB resource‑clone > >> > >> Any idea how to implement that order constraint ? > >> Thanks in advance. > >> > >> Best Regards, > >> Strahil Nikolov > > > > Basically you want two sets, one with nfsA and nfsB with no ordering > > between them, and a second set with just resource‑clone, ordered > after > > the first set. > > > > I believe the pcs syntax is: > > > > pcs constraint
[ClusterLabs] Antw: Re: Antw: [EXT] Re: Order set troubles
>>> Andrei Borzenkov schrieb am 26.03.2021 um 06:19 in Nachricht <534274b3-a6de-5fac-0ae4-d02c305f1...@gmail.com>: > On 25.03.2021 21:45, Reid Wahl wrote: >> FWIW we have this KB article (I seem to remember Strahil is a Red Hat >> customer): >> - How do I configure SAP HANA Scale-Up System Replication in a Pacemaker >> cluster when the HANA filesystems are on NFS shares?( >> https://access.redhat.com/solutions/5156571) >> > > "How do I make the cluster resources recover when one node loses access > to the NFS server?" > > If node loses access to NFS server then monitor operations for resources > that depend on NFS availability will fail or timeout and pacemaker will > recover (likely by rebooting this node). That's how similar > configurations have been handled for the past 20 years in other HA > managers. I am genuinely interested, have you encountered the case where > it was not enough? That's a big problem with the SAP design (basically it's just too complex). In the past I had written a kind of resource agent that worked without that overly complex overhead, but since those days SAP has added much more complexity. If the NFS server is external, pacemaker could fence your nodes when the NFS server is down, as first the monitor operation will fail (hanging on NFS), then the recover (stop/start) will fail (also hanging on NFS). Even when fencing the node it would not help (resources cannot start) if the NFS server is still down. So you may end up with all your nodes being fenced and the fail counts disabling any automatic resource restart. > >> I can't remember if there was some valid reason why we had to use an >> attribute resource, or if we simply didn't think about the sequential=false >> require-all=false constraint set approach when planning this out. >> > > Because as I already replied, this has different semantic - it will > start HANA on both nodes if NFS comes up on any one node. 
> > But thank you for the pointer, it demonstrates really interesting > technique. It also confirms that pacemaker does not have native means to > express such ordering dependency/constraints. May be it should. > >> On Thu, Mar 25, 2021 at 3:39 AM Strahil Nikolov >> wrote: >> >>> OCF_CHECK_LEVEL 20 >>> NFS sometimes fails to start (systemd racing condition with dnsmasq) >>> >>> Best Regards, >>> Strahil Nikolov >>> >>> On Thu, Mar 25, 2021 at 12:18, Andrei Borzenkov >>> wrote: >>> On Thu, Mar 25, 2021 at 10:31 AM Strahil Nikolov >>> wrote: Use Case: nfsA is shared filesystem for HANA running in site A nfsB is shared filesystem for HANA running in site B clusterized resource of type SAPHanaTopology must run on all systems if >>> the FS for the HANA is running >>> >>> And the reason you put NFS under pacemaker control in the first place? >>> It is not going to switch over, just put it in /etc/fstab. >>> Yet, if siteA dies for some reason, I want to make SAPHanaTopology to >>> still start on the nodes in site B. I think that it's a valid use case. Best Regards, Strahil Nikolov On Thu, Mar 25, 2021 at 8:59, Ulrich Windl wrote: >>> Ken Gaillot schrieb am 24.03.2021 um 18:56 in Nachricht <5bffded9c6e614919981dcc7d0b2903220bae19d.ca...@redhat.com>: > On Wed, 2021‑03‑24 at 09:27 +, Strahil Nikolov wrote: >> Hello All, >> >> I have a trouble creating an order set . >> The end goal is to create a 2 node cluster where nodeA will mount >> nfsA , while nodeB will mount nfsB.On top of that a depended cloned >> resource should start on the node only if nfsA or nfsB has started >> locally. This looks like ad odd design to me, and I wonder: What is the use case? 
(We are using "NFS loop-mounts" for many years, where the cluster needs >>> the NFS service it provides, but that's a different design) Regards, Ulrich >> >> A prototype code would be something like: >> pcs constraint order start (nfsA or nfsB) then start resource‑clone >> >> I tried to create a set like this, but it works only on nodeB: >> pcs constraint order set nfsA nfsB resource‑clone >> >> Any idea how to implement that order constraint ? >> Thanks in advance. >> >> Best Regards, >> Strahil Nikolov > > Basically you want two sets, one with nfsA and nfsB with no ordering > between them, and a second set with just resource‑clone, ordered after > the first set. > > I believe the pcs syntax is: > > pcs constraint order set nfsA nfsB sequential=false require‑all=false > set resource‑clone > > sequential=false says nfsA and nfsB have no ordering between them, and > require‑all=false says that resource‑clone only needs one of them. > > (I don't remember for sure the order of the sets in the command, i.e. > whether it's the primary set first or the dependent set first, but I
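Ken's suggested constraint from the quoted message, written out as a complete command for reference (a sketch only: it uses the resource names from this thread, needs a live cluster to actually run, and option spelling can differ between pcs versions, as Ken himself hedges about the set order):

```shell
# Two ordering sets in one constraint:
#   set 1: nfsA and nfsB, unordered relative to each other (sequential=false),
#          with only one of the two required (require-all=false);
#   set 2: resource-clone, started only after set 1 is satisfied.
pcs constraint order set nfsA nfsB sequential=false require-all=false \
    set resource-clone

# Inspect the resulting constraint, including its set structure:
pcs constraint --full
```

Note the semantic Andrei points out above still applies: with require-all=false, resource-clone may start on both nodes once either NFS resource is up anywhere, which is not the same as "each node waits for its own local NFS mount".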
Re: [ClusterLabs] Antw: [EXT] Re: Order set troubles
On Thu, Mar 25, 2021 at 11:35 PM Reid Wahl wrote: > > > On Thu, Mar 25, 2021 at 10:20 PM Andrei Borzenkov > wrote: > >> On 25.03.2021 21:45, Reid Wahl wrote: >> > FWIW we have this KB article (I seem to remember Strahil is a Red Hat >> > customer): >> > - How do I configure SAP HANA Scale-Up System Replication in a >> Pacemaker >> > cluster when the HANA filesystems are on NFS shares?( >> > https://access.redhat.com/solutions/5156571) >> > >> >> "How do I make the cluster resources recover when one node loses access >> to the NFS server?" >> >> If node loses access to NFS server then monitor operations for resources >> that depend on NFS availability will fail or timeout and pacemaker will >> recover (likely by rebooting this node). That's how similar >> configurations have been handled for the past 20 years in other HA >> managers. I am genuinely interested, have you encountered the case where >> it was not enough? >> > > Yes, and I was perplexed by this at the time too. > > I just went back and checked the notes from the support case that led to > this article, since it's been nearly a year now. Apparently there were > situations in which the SAPHana resource wasn't failing over when > connectivity was lost with the NFS share that contained the hdb* binaries > and the HANA data. I don't remember the exact details (whether demotion was > failing, or whether it wasn't even trying to demote on the primary and > promote on the secondary, or what). Either way, I was surprised that this > procedure was necessary, but it seemed to be. > > The whole situation is a bit of a corner case in the first place. IIRC > this procedure only makes a difference if the primary loses contact with > the NFS server but the secondary can still access the NFS server. I expect > that to be relatively rare. If neither node can access the NFS server, then > we're stuck. 
> > >> >> > I can't remember if there was some valid reason why we had to use an >> > attribute resource, or if we simply didn't think about the >> sequential=false >> > require-all=false constraint set approach when planning this out. >> > >> >> Because as I already replied, this has different semantic - it will >> start HANA on both nodes if NFS comes up on any one node. >> > > Ah yes, that sounds right. > > But thank you for the pointer, it demonstrates really interesting >> technique. It also confirms that pacemaker does not have native means to >> express such ordering dependency/constraints. May be it should. >> > > I occasionally find that I have to use hacks like this to achieve certain > complex constraint behavior -- especially when it comes to colocation. I > don't know how many of these complex cases would be feasible to make > possible natively via RFE. Sometimes the way colocation is currently > implemented is incompatible with what users want to do. Probably requires > considerable effort to change it, though such requests are worth > documenting in RFEs. > > /me makes a note to do that and annoy Ken > (Not for this use case though, at least not right now) > >> > On Thu, Mar 25, 2021 at 3:39 AM Strahil Nikolov >> > wrote: >> > >> >> OCF_CHECK_LEVEL 20 >> >> NFS sometimes fails to start (systemd racing condition with dnsmasq) >> >> >> >> Best Regards, >> >> Strahil Nikolov >> >> >> >> On Thu, Mar 25, 2021 at 12:18, Andrei Borzenkov >> >> wrote: >> >> On Thu, Mar 25, 2021 at 10:31 AM Strahil Nikolov < >> hunter86...@yahoo.com> >> >> wrote: >> >>> >> >>> Use Case: >> >>> >> >>> nfsA is shared filesystem for HANA running in site A >> >>> nfsB is shared filesystem for HANA running in site B >> >>> >> >>> clusterized resource of type SAPHanaTopology must run on all systems >> if >> >> the FS for the HANA is running >> >>> >> >> >> >> And the reason you put NFS under pacemaker control in the first place? 
>> >> It is not going to switch over, just put it in /etc/fstab. >> >> >> >>> Yet, if siteA dies for some reason, I want to make SAPHanaTopology to >> >> still start on the nodes in site B. >> >>> >> >>> I think that it's a valid use case. >> >>> >> >>> Best Regards, >> >>> Strahil Nikolov >> >>> >> >>> On Thu, Mar 25, 2021 at 8:59, Ulrich Windl >> >>> wrote: >> >> Ken Gaillot schrieb am 24.03.2021 um 18:56 >> in >> >>> Nachricht >> >>> <5bffded9c6e614919981dcc7d0b2903220bae19d.ca...@redhat.com>: >> On Wed, 2021‑03‑24 at 09:27 +, Strahil Nikolov wrote: >> > Hello All, >> > >> > I have a trouble creating an order set . >> > The end goal is to create a 2 node cluster where nodeA will mount >> > nfsA , while nodeB will mount nfsB.On top of that a depended cloned >> > resource should start on the node only if nfsA or nfsB has started >> > locally. >> >>> >> >>> This looks like ad odd design to me, and I wonder: What is the use >> case? >> >>> (We are using "NFS loop-mounts" for many years, where the cluster >> needs >> >> the >> >>> NFS service it provides, but that's a different design) >> >>> >> >>>
Re: [ClusterLabs] Antw: [EXT] Re: Order set troubles
On Thu, Mar 25, 2021 at 10:20 PM Andrei Borzenkov wrote: > On 25.03.2021 21:45, Reid Wahl wrote: > > FWIW we have this KB article (I seem to remember Strahil is a Red Hat > > customer): > > - How do I configure SAP HANA Scale-Up System Replication in a > Pacemaker > > cluster when the HANA filesystems are on NFS shares?( > > https://access.redhat.com/solutions/5156571) > > > > "How do I make the cluster resources recover when one node loses access > to the NFS server?" > > If node loses access to NFS server then monitor operations for resources > that depend on NFS availability will fail or timeout and pacemaker will > recover (likely by rebooting this node). That's how similar > configurations have been handled for the past 20 years in other HA > managers. I am genuinely interested, have you encountered the case where > it was not enough? > Yes, and I was perplexed by this at the time too. I just went back and checked the notes from the support case that led to this article, since it's been nearly a year now. Apparently there were situations in which the SAPHana resource wasn't failing over when connectivity was lost with the NFS share that contained the hdb* binaries and the HANA data. I don't remember the exact details (whether demotion was failing, or whether it wasn't even trying to demote on the primary and promote on the secondary, or what). Either way, I was surprised that this procedure was necessary, but it seemed to be. The whole situation is a bit of a corner case in the first place. IIRC this procedure only makes a difference if the primary loses contact with the NFS server but the secondary can still access the NFS server. I expect that to be relatively rare. If neither node can access the NFS server, then we're stuck. > > > I can't remember if there was some valid reason why we had to use an > > attribute resource, or if we simply didn't think about the > sequential=false > > require-all=false constraint set approach when planning this out. 
> > > > Because as I already replied, this has different semantic - it will > start HANA on both nodes if NFS comes up on any one node. > Ah yes, that sounds right. But thank you for the pointer, it demonstrates really interesting > technique. It also confirms that pacemaker does not have native means to > express such ordering dependency/constraints. May be it should. > I occasionally find that I have to use hacks like this to achieve certain complex constraint behavior -- especially when it comes to colocation. I don't know how many of these complex cases would be feasible to make possible natively via RFE. Sometimes the way colocation is currently implemented is incompatible with what users want to do. Probably requires considerable effort to change it, though such requests are worth documenting in RFEs. /me makes a note to do that and annoy Ken > > On Thu, Mar 25, 2021 at 3:39 AM Strahil Nikolov > > wrote: > > > >> OCF_CHECK_LEVEL 20 > >> NFS sometimes fails to start (systemd racing condition with dnsmasq) > >> > >> Best Regards, > >> Strahil Nikolov > >> > >> On Thu, Mar 25, 2021 at 12:18, Andrei Borzenkov > >> wrote: > >> On Thu, Mar 25, 2021 at 10:31 AM Strahil Nikolov > > >> wrote: > >>> > >>> Use Case: > >>> > >>> nfsA is shared filesystem for HANA running in site A > >>> nfsB is shared filesystem for HANA running in site B > >>> > >>> clusterized resource of type SAPHanaTopology must run on all systems if > >> the FS for the HANA is running > >>> > >> > >> And the reason you put NFS under pacemaker control in the first place? > >> It is not going to switch over, just put it in /etc/fstab. > >> > >>> Yet, if siteA dies for some reason, I want to make SAPHanaTopology to > >> still start on the nodes in site B. > >>> > >>> I think that it's a valid use case. 
> >>> > >>> Best Regards, > >>> Strahil Nikolov > >>> > >>> On Thu, Mar 25, 2021 at 8:59, Ulrich Windl > >>> wrote: > >> Ken Gaillot schrieb am 24.03.2021 um 18:56 in > >>> Nachricht > >>> <5bffded9c6e614919981dcc7d0b2903220bae19d.ca...@redhat.com>: > On Wed, 2021‑03‑24 at 09:27 +, Strahil Nikolov wrote: > > Hello All, > > > > I have a trouble creating an order set . > > The end goal is to create a 2 node cluster where nodeA will mount > > nfsA , while nodeB will mount nfsB.On top of that a depended cloned > > resource should start on the node only if nfsA or nfsB has started > > locally. > >>> > >>> This looks like ad odd design to me, and I wonder: What is the use > case? > >>> (We are using "NFS loop-mounts" for many years, where the cluster needs > >> the > >>> NFS service it provides, but that's a different design) > >>> > >>> Regards, > >>> Ulrich > >>> > >>> > >>> > > > > A prototype code would be something like: > > pcs constraint order start (nfsA or nfsB) then start resource‑clone > > > > I tried to create a set like this, but it works only on nodeB: > > pcs constraint order set nfsA nfsB