Re: [ClusterLabs] ubsubscribe

2024-02-12 Thread Antony Stone
On Monday 12 February 2024 at 16:42:06, Bob Marčan via Users wrote:

> It should be in the body, not in the subject.

According to the headers, it should be in the subject, but not sent to the 
list address:

List-Id: Cluster Labs - All topics related to open-source clustering welcomed
  <users.clusterlabs.org>
List-Unsubscribe: <https://lists.clusterlabs.org/mailman/options/users>,
  <mailto:users-request@clusterlabs.org?subject=unsubscribe>
List-Archive: <https://lists.clusterlabs.org/pipermail/users/>
List-Post: <mailto:users@clusterlabs.org>
List-Help: <mailto:users-request@clusterlabs.org?subject=help>
List-Subscribe: <https://lists.clusterlabs.org/mailman/listinfo/users>,
  <mailto:users-request@clusterlabs.org?subject=subscribe>


Antony.

-- 
There's a good theatrical performance about puns on in the West End.  It's a 
play on words.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Limit the number of resources starting/stoping in parallel possible?

2023-09-18 Thread Antony Stone
On Monday 18 September 2023 at 16:24:02, Knauf Steffen wrote:

> Hi,
> 
> we have multiple Clusters (2 node + quorum setup) with more than 100
> Resources (10 x VIP + 90 Microservices) per Node. If the Resources are
> stopped/started at the same time the Server is under heavy load, which may
> result in timeouts and an unresponsive server. We configured some
> Ordering Constraints (VIP --> Microservice). Is there a way to limit the
> number of resources starting/stopping in parallel? Perhaps you have some
> other tips to handle such a situation.

Do all the services actually need to be stopped when a VIP is moved away from 
a node (and started again when the VIP is replaced)?

I've found that in many cases keeping a service running all the time (for 
example with monit) and simply moving the VIP between nodes to control which 
services get any requests from remote clients is sufficient to provide High 
Availability / Load Balancing.
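
For illustration, the pacemaker side of that approach can be as small as a 
single IPaddr2 primitive per VIP (the name and address below are made up); the 
services themselves then stay outside pacemaker's control, kept alive by monit 
or similar on every node:

primitive service_vip IPaddr2 \
  params ip=192.0.2.10 cidr_netmask=24 \
  op monitor interval=10s timeout=20s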

Antony.

-- 
"The future is already here.   It's just not evenly distributed yet."

 - William Gibson

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] MySQL cluster with auto failover

2023-09-08 Thread Antony Stone
On Thursday 07 September 2023 at 22:06:25, Damiano Giuliani wrote:

> Everything seems quite clear to me.
> 
> But, having single VIP makes a multimaster replica quite useless.

Why?

> I'm thinking about using pacemaker to create a cloned VIP bound to a cloned
> HA proxy which is health-checking the galera/mysql status to route the
> traffic.

I don't think I understand the use of the word "cloned" there.

How many Virtual IPs are you considering having?

> or, more easily, pacemaker with a single VIP to a single HA proxy which is
> routing the traffic to all the health-checked available cluster nodes.
> in case a node fails, resources (vip and haproxy) are started on a different
> node

You can do that if you wish, yes.
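
If you go that way, a minimal crmsh sketch (names and address here are only 
illustrative) is just a VIP plus an haproxy resource kept in a group, so they 
fail over together as a unit:

primitive db_vip IPaddr2 params ip=192.0.2.20 cidr_netmask=24 op monitor interval=10s
primitive db_haproxy systemd:haproxy op monitor interval=10s
group g_db_front db_vip db_haproxy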


Antony.

-- 
"640 kilobytes (of RAM) should be enough for anybody."

 - Bill Gates

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] MySQL cluster with auto failover

2023-09-07 Thread Antony Stone
On Wednesday 06 September 2023 at 17:01:24, Damiano Giuliani wrote:

> Everything is clear now.
> So the point is to use pacemaker and create the floating vip and bind it to
> sqlproxy to health check and route the traffic to the available and healthy
> galera nodes.

Good summary.

> It could be useful let pacemaker manage also galera services?

No; MySQL / Galera needs to be running on all nodes all the time.  Pacemaker 
is for managing resources which move between nodes.

If you want something that ensures processes are running on machines, 
irrespective of where the floating IP is, look at monit - it's very simple, 
easy to configure and knows how to manage resources which should run all the 
time.
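
As an illustration only (the pidfile and init script paths vary by 
distribution), a monit stanza for MySQL looks something like:

check process mysqld with pidfile /var/run/mysqld/mysqld.pid
  start program = "/etc/init.d/mysql start"
  stop program  = "/etc/init.d/mysql stop"
  if failed port 3306 protocol mysql then restart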

> Do you have any guide that puts all of this together?

No; I've largely made this stuff up myself as I've needed it.


Antony.

-- 
This sentence contains exacly three erors.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] MySQL cluster with auto failover

2023-09-06 Thread Antony Stone
On Wednesday 06 September 2023 at 13:58:51, Damiano Giuliani wrote:

> What I miss is how my application can support the connection on a multi
> master where 3 ips are available simultaneously.
> 
> JDBCmysql driver or similar support a list name/ip of clustered nodes?
> Galera provide a unique cluster ip?
> 
> Whats the point to create another cluster(pacemaker) for the VIP alongside
> galera?

Simple - as you explained above, you can't assume that clients can cope with 
the idea of 3 IPs for the actual clustered machines - clients want a single IP 
to connect to, and *you* want to be sure that that IP is going to provide a 
working connection, even if one of the machines happens to have a problem.

So, you have:

machine A - real IP address 198.51.100.1
machine B - real IP address 198.51.100.2
machine C - real IP address 198.51.100.3

All machines run MySQL, ProxySQL and corosync / pacemaker.

Pacemaker manages a floating IP address 198.51.100.42.

ProxySQL on each machine listens for connections to 198.51.100.42 and passes 
requests on to 198.51.100.1, 198.51.100.2 and 198.51.100.3.  ProxySQL knows 
how to monitor the three real IPs to make sure they can handle queries, and if 
one stops working, it stops getting sent queries.

MySQL listens for connections on 198.51.100.1 on machine A, 198.51.100.2 on 
machine B and 198.51.100.3 on machine C.

Galera makes sure that all three servers have the same DB content, so a query 
sent to any will return the same result.

That gives you a single floating IP address 198.51.100.42 which all clients can 
connect to, that address is ensured by pacemaker to be on a working server, 
and ProxySQL ensures that a request to that address is passed to a working 
instance of MySQL.
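
In that layout the only thing pacemaker itself has to manage is the floating 
address; a minimal crmsh sketch would be:

primitive cluster_vip IPaddr2 \
  params ip=198.51.100.42 cidr_netmask=24 \
  op monitor interval=10s timeout=20s

ProxySQL on each machine would normally listen on 0.0.0.0 (or the hosts would 
need non-local bind enabled) so that whichever machine currently holds 
198.51.100.42 can accept the connections.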


Antony.

-- 
This email is intended for the use of the individual addressee(s) named above 
and may contain information that is confidential, privileged or unsuitable for 
overly sensitive persons with low self-esteem, no sense of humour, or 
irrational religious beliefs.

If you have received this email in error, you are required to shred it 
immediately, add some nutmeg, three egg whites and a dessertspoonful of caster 
sugar.  Whisk until soft peaks form, then place in a warm oven for 40 minutes.  
Remove promptly and let stand for 2 hours before adding some decorative kiwi 
fruit and cream.  Then notify me immediately by return email and eat the 
original message.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] MySQL cluster with auto failover

2023-09-06 Thread Antony Stone
On Wednesday 06 September 2023 at 12:50:40, Damiano Giuliani wrote:

> Looking at some Galera cluster designs on web seems a couple of server
> proxy are placed in front.

You can certainly do it that way, although some people simply have a floating 
virtual IP across the 3 nodes, and clients connect to whichever (permanently 
running) MySQL instance has the VIP at the time.

> If I would have only 3 nodes where I clustered MySQL with galera how then I
> have to point my application to the right nodes?

Galera is a "write anywhere, read anywhere" replication system.  You do not 
get a "cluster master" which all clients have to write to (or read from), so 
clients can simply connect to any node they like.

If you're familiar with standard MySQL master-slave replication, and you know 
this can be set up in a master-master configuration, simply regard Galera as 
the same outcome but for any number of nodes.

One way to ensure that all clients connect to the same node at any given time 
is to use pacemaker or similar to run a floating virtual IP between the nodes.  
Clients connect to the VIP, and reach either MySQL itself, or the proxy (which 
is running on all 3 nodes, same as MySQL is), and the proxy connects to the 
real IPs of the 3 nodes.


Antony.

-- 
+++ Divide By Cucumber Error.  Please Reinstall Universe And Reboot +++

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] MySQL cluster with auto failover

2023-09-06 Thread Antony Stone
On Wednesday 06 September 2023 at 12:10:23, Damiano Giuliani wrote:

> Thanks for helping me.
> 
> I'm going to know more about Galera.
> What I don't like is that it seems I need many nodes: at least 3 for the
> cluster and then at least 2 other nodes for the proxy.

You didn't mention anything about wanting a proxy service in your original 
posting, and there's no reason why a proxy can't run on the same machines as 
MySQL does.

As for requiring three nodes, you'll need that for pacemaker anyway.  Trying 
to run a 2-node cluster (of anything) as a production service is a disaster 
waiting to happen.  Read up about "split brain" if you do not know why.

> Asking for 5 VM is quite consuming.

What's your actual (functional) requirement here?

> As you told drbd can work only in 2 node cluster and disk replication is
> not dbms replication.
> Probably I'm going to try drbd on very small and low usage db.

The lower the usage (and therefore unrepresentative of typical production 
activity), the more likely it is that you'll think "this is working".

Assuming you mean to set up two MySQL servers each pointing at a synchronised 
DRBD storage volume on their local systems, the main thing I expect to go 
wrong is data cached in memory, perhaps during complex updates.

If instead you mean to have only one instance of MySQL running at any given 
time, and a failover involves stopping MySQL on the first node and then 
starting it on the second, then sure, that will work, but it introduces a 
(probably multi-second at the very least) delay during which no client 
requests can be processed during the failover, whereas a replicated Galera 
cluster (especially with something like ProxySQL in front of it) offers almost 
instantaneous switchover and therefore much reduced downtime in DB 
availability.
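
For reference, a sketch of what that failover-style configuration tends to 
look like in crmsh (the resource names, DRBD resource, device and paths here 
are purely illustrative):

primitive p_drbd ocf:linbit:drbd params drbd_resource=mysql \
  op monitor interval=15s role=Master op monitor interval=30s role=Slave
ms ms_drbd p_drbd meta master-max=1 clone-max=2 notify=true
primitive p_fs Filesystem params device=/dev/drbd0 directory=/var/lib/mysql fstype=ext4
primitive p_mysql mysql op monitor interval=20s
group g_mysql p_fs p_mysql
colocation col_mysql_on_master inf: g_mysql ms_drbd:Master
order ord_promote_first inf: ms_drbd:promote g_mysql:start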

> More I know about MySQL  more postgresql seems have better replication at
> least for me.

So why not use Postgres?


Antony.

-- 
I don't know, maybe if we all waited then cosmic rays would write all our 
software for us. Of course it might take a while.

 - Ron Minnich, Los Alamos National Laboratory

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] MySQL cluster with auto failover

2023-09-06 Thread Antony Stone
On Wednesday 06 September 2023 at 11:23:54, Damiano Giuliani wrote:

> Thanks for helping.
> 
> Because I still don't know which version will be provided, probably MySQL
> enterprise or community.

I believe both support Galera replication.

> I was wondering about pacemaker because I know quite well how it works and
> I need a vip/automatic failover

I regard a VIP as being entirely independent of replicating database content.

> Galera seems a different approach that I have to study and test estensively
> before  place in production, instead pacemaker is a well know solution to
> me.

Yes, but if you are trying to replicate database content, how does pacemaker 
help with this?  Pacemaker can quite happily provide you with a floating 
virtual IP address which is guaranteed to be on one of your database servers, 
but it doesn't (as far as I know) have any mechanism for replicating the 
database content between those servers.

> Is drbd approach obsolete or solid?

I would say that DRBD is a solid solution provided you only have two nodes and 
they have very good connectivity between them.

However, replicating the data at disk level is a very different matter from 
having a DBMS such as MySQL understanding what is on the disk (and for that 
matter, what it has cached in RAM).

> About the degraded status I read on web there is a specific configuration?

I cannot comment on that; maybe someone else can.

> There is also a good Galera documentation to study?

I used https://mariadb.com/kb/en/getting-started-with-mariadb-galera-cluster


Antony.

-- 
"In fact I wanted to be John Cleese and it took me some time to realise that 
the job was already taken."

 - Douglas Adams

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] MySQL cluster with auto failover

2023-09-06 Thread Antony Stone
On Tuesday 05 September 2023 at 22:20:36, Damiano Giuliani wrote:

> Hi guys, I'm about to figure out how setup a pacemaker cluster for MySQL
> replication.

Why do you need pacemaker?

Why not just set up several machines and configure Galera to handle DB 
replication between them?

If you install MariaDB instead of MySQL it's directly compatible and Galera is 
included.

Depending on your distribution you may find that MariaDB is actually what you 
get even when you ask for MySQL (eg: Debian).
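
For what it's worth, the Galera side of a MariaDB setup is only a handful of 
settings in the server configuration; the provider path and node names below 
are illustrative and distribution-dependent:

[mysqld]
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2
wsrep_on                 = ON
wsrep_provider           = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name       = "my_cluster"
wsrep_cluster_address    = "gcomm://node1,node2,node3"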

I don't see the need for pacemaker here.


Antony.

-- 
"Have you been drinking brake fluid again?  I think you're addicted to the 
stuff."

"No, no, it's alright - I can stop any time I want to."

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Load balancing, of a sort

2023-01-25 Thread Antony Stone
Hi.

I have a corosync / pacemaker 3-node cluster with a resource group which can 
run on any node in the cluster.

Every night a cron job on the node which is running the resources performs 
"crm_standby -v on" followed a short while later by "crm_standby -v off" in 
order to force the resources to migrate to another node member.
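
The cron entry itself is nothing more exotic than something along these lines 
(the time of day and the pause are illustrative):

0 3 * * *   root   crm_standby -v on && sleep 120 && crm_standby -v off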

We do this partly to verify that all nodes are capable of running the 
resources, and partly because some of those resources generate significant log 
files, and if one machine just keeps running them day after day, we run out of 
disk space (which effectively means we just need to add more capacity to the 
machines, which can be done, but at a cost).

So long as a machine gets a day when it's not running the resources, a 
combination of migrating the log files to a central server, plus standard 
logfile rotation, takes care of managing the disk space.

What I notice, though, is that two of the machines tend to swap the resources 
between them, and the third machine hardly ever becomes the active node.

Is there some way of influencing the node selection mechanism when resources 
need to move away from the currently active node, so that, for example, the 
least recently used node could be favoured over the rest?


Thanks,


Antony.

-- 
I want to build a machine that will be proud of me.

 - Danny Hillis, creator of The Connection Machine

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Stonith external/ssh "device"?

2022-12-21 Thread Antony Stone
On Wednesday 21 December 2022 at 17:19:34, Antony Stone wrote:

> > pacemaker-fenced[3262]:   notice: Operation reboot of nodeB by 
> > for pacemaker-controld.26852@nodeA.93b391b2: No such device

> pacemaker-controld[3264]:   notice: Peer nodeB was not terminated (reboot)
> by  on behalf of pacemaker-controld.26852: No such device

I have resolved this - there was a discrepancy between the node names (some 
simple hostnames, some FQDNs) in my main cluster configuration, and the 
hostlist parameter for the external/ssh fencing plugin.

I have set them all to be simple hostnames with no domain and now all is 
working as expected.
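
In other words, the node names the cluster itself reports have to match the 
hostlist entries exactly; with everything as bare hostnames the configuration 
looks like this (repeated per node):

# names as reported by the cluster (e.g. "crm_node -l") must match hostlist
primitive reboot_nodeA stonith:external/ssh params hostlist="nodeA"
location only_nodeA reboot_nodeA -inf: nodeA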

I still find the log message "no such device" rather confusing.


Thanks,


Antony.

-- 
 yes, but this is #lbw, we don't do normal

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Stonith external/ssh "device"?

2022-12-21 Thread Antony Stone
On Wednesday 21 December 2022 at 16:59:16, Antony Stone wrote:

> Hi.
> 
> I'm implementing fencing on a 7-node cluster as described recently:
> https://lists.clusterlabs.org/pipermail/users/2022-December/030714.html
> 
> I'm using external/ssh for the time being, and it works if I test it using:
> 
> stonith -t external/ssh -p "nodeA nodeB nodeC" -T reset nodeB
> 
> 
> However, when it's supposed to be invoked because a node has got stuck, I
> simply find syslog full of the following (one from each of the other six
> nodes in the cluster):
> 
> pacemaker-fenced[3262]:   notice: Operation reboot of nodeB by  for
> pacemaker-controld.26852@nodeA.93b391b2: No such device
> 
> I have defined seven stonith resources, one for rebooting each machine, and
> I can see from "crm status" that they have been assigned randomly amongst
> the other servers, usually one per server, so that looks good.
> 
> 
> The main things that puzzle me about the log message are:
> 
> a) why does it say ""?  Is this more like "anyone", meaning that
> no- one in particular is required to do this task, provided that at least
> someone does it?  Does this indicate a configuration problem?

PS: I've just noticed that I'm also getting log entries immediately 
afterwards:

pacemaker-controld[3264]:   notice: Peer nodeB was not terminated (reboot) by 
 on behalf of pacemaker-controld.26852: No such device

> b) what is this "device" referred to?  I'm using "external/ssh" so there is
> no actual Stonith device for power-cycling hardware machines - am I
> supposed to define some sort of dummy device somewhere?
> 
> For clarity, this is what I have added to my cluster configuration to set
> this up:
> 
> primitive reboot_nodeA stonith:external/ssh params hostlist="nodeA"
> location only_nodeA reboot_nodeA -inf: nodeA
> 
> ...repeated for all seven nodes.
> 
> I also have "stonith-enabled=yes" in the cib-bootstrap-options.
> 
> 
> Ideas, anyone?
> 
> Thanks,
> 
> 
> Antony.

-- 
Normal people think "If it ain't broke, don't fix it".
Engineers think "If it ain't broke, it doesn't have enough features yet".

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Stonith external/ssh "device"?

2022-12-21 Thread Antony Stone
Hi.

I'm implementing fencing on a 7-node cluster as described recently:
https://lists.clusterlabs.org/pipermail/users/2022-December/030714.html

I'm using external/ssh for the time being, and it works if I test it using:

stonith -t external/ssh -p "nodeA nodeB nodeC" -T reset nodeB


However, when it's supposed to be invoked because a node has got stuck, I 
simply find syslog full of the following (one from each of the other six nodes 
in the cluster):

pacemaker-fenced[3262]:   notice: Operation reboot of nodeB by  for 
pacemaker-controld.26852@nodeA.93b391b2: No such device

I have defined seven stonith resources, one for rebooting each machine, and I 
can see from "crm status" that they have been assigned randomly amongst the 
other servers, usually one per server, so that looks good.


The main things that puzzle me about the log message are:

a) why does it say ""?  Is this more like "anyone", meaning that no-
one in particular is required to do this task, provided that at least someone 
does it?  Does this indicate a configuration problem?

b) what is this "device" referred to?  I'm using "external/ssh" so there is no 
actual Stonith device for power-cycling hardware machines - am I supposed to 
define some sort of dummy device somewhere?

For clarity, this is what I have added to my cluster configuration to set this 
up:

primitive reboot_nodeA stonith:external/ssh params hostlist="nodeA"
location only_nodeA reboot_nodeA -inf: nodeA

...repeated for all seven nodes.

I also have "stonith-enabled=yes" in the cib-bootstrap-options.


Ideas, anyone?

Thanks,


Antony.

-- 
This sentence contains exacly three erors.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Stonith

2022-12-19 Thread Antony Stone
On Monday 19 December 2022 at 13:55:45, Andrei Borzenkov wrote:

> On Mon, Dec 19, 2022 at 3:44 PM Antony Stone
> 
>  wrote:
> > So, do I simply create one stonith resource for each server, and rely on
> > some other random server to invoke it when needed?
> 
> Yes, this is the most simple approach. You need to restrict this
> stonith resource to only one cluster node (set pcmk_host_list).

So, just to be clear, I create one stonith resource for each machine which 
needs to be able to be shut down by some other server?

I ask simply because the acronym stonith refers to "the other node", so it 
sounds to me more like something I need to define so that a working machine can 
kill another one.

> No, that is not needed, by default any node can use any stonith agent.

Okay, thanks for the quick clarification :)


Antony.

-- 
It is also possible that putting the birds in a laboratory setting 
inadvertently renders them relatively incompetent.

 - Daniel C Dennett

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Stonith

2022-12-19 Thread Antony Stone
Hi.

I have a 7-node corosync / pacemaker cluster which is working nicely as a 
proof-of-concept.

Three machines are in data centre 1, three are in data centre 2, and one 
machine is in data centre 3.

I'm using location constraints to run one set of resources on any of the 
machines in DC1, another set of resources in DC2, and DC3 does nothing 
except act as a quorum server in case DC1 and DC2 lose sight of each other.

Everything is currently on externally-hosted virtual machines (by which I 
mean, I have no access to the hosting environment).

I now want to implement fencing, and for the PoC VMs I plan to use 
external/ssh in order to reboot a problem server - once things move to real 
hardware we shall have some sort of IPMI/RAC/PDU control.

Reading https://clusterlabs.org/pacemaker/doc/crm_fencing.html it seems that I 
define this just the same as any other resource in the system, however it's not 
clear to me how many resources I need to define.

When a machine needs restarting, any other machine in the cluster can do it - 
all have public-key SSH access to all others, and for IPMI/RAC/PDU every 
machine will have credentials to connect to the power controller for every 
other machine.

So, do I simply create one stonith resource for each server, and rely on some 
other random server to invoke it when needed?

Or do I in fact create one stonith resource for each server, and that resource 
then means that this server can shut down any other server?

Or, do I need to create 6 x 7 = 42 stonith resources so that any machine can 
shut down any other?

Thanks for any guidance, or pointers to more comprehensive documentation.


Thanks,


Antony

-- 
BASIC is to computer languages what Roman numerals are to arithmetic.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] QDevice not found after reboot but appears after cluster restart

2022-07-28 Thread Antony Stone
On Thursday 28 July 2022 at 22:17:01, john tillman wrote:

> I have a two-node cluster setup with a qdevice. 'pcs quorum status' from a
> cluster node shows the qdevice casting a vote.  On the qdevice node
> 'corosync-qnetd-tool -s' says I have 2 connected clients and 1 cluster.
> The vote count looks correct when I shutdown either one of the cluster
> nodes or the qdevice.  So the voting seems to be working at this point.

Indeed - shutting down 1 of 3 nodes leaves quorum intact, therefore everything 
still awake knows what's going on.

> From this state, if I reboot both my cluster nodes at the same time

Ugh!

> but leave the qdevice node running, the cluster will not see the qdevice
> when the nodes come back up: 'pcs quorum status' show 3 votes expected but
> only 2 votes cast (from the cluster nodes).

I would think this is to be expected, since if you reboot 2 out of 3 nodes, 
you completely lose quorum, so the single node left has no idea what to trust 
when the other nodes return.

Starting from a situation such as this, your only hope is to rebuild the 
cluster from scratch, IMHO.


Antony.

-- 
Police have found a cartoonist dead in his house.  They say that details are 
currently sketchy.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Question regarding the security of corosync

2022-06-21 Thread Antony Stone
On Friday 17 June 2022 at 11:39:14, Mario Freytag wrote:

> I’d like to ask about the security of corosync. We’re using a Proxmox HA
> setup in our testing environment and need to confirm it’s compliance with
> PCI guidelines.
> 
> We have a few questions:
> 
> Is the communication encrypted?
> What method of encryption is used?
> What method of authentication is used?
> What is the recommended way of separation for the corosync network? VLAN?

Your first three questions are probably well-answered by 
https://github.com/fghaas/corosync/blob/master/SECURITY

For the fourth, I agree with Jan Friesse - a dedicated physical network is 
best; a dedicated VLAN is second best.
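
For completeness, with recent corosync (3.x / knet) the encryption settings 
live in the totem section of corosync.conf; a typical fragment (cluster name 
illustrative) is:

totem {
    version: 2
    cluster_name: example
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
}

The shared key itself is generated with corosync-keygen and distributed to all 
nodes as /etc/corosync/authkey.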


Antony.

-- 
There's no such thing as bad weather - only the wrong clothes.

 - Billy Connolly

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] ethernet link up/down - ?

2022-02-07 Thread Antony Stone
On Monday 07 February 2022 at 20:09:02, lejeczek via Users wrote:

> Hi guys
> 
> How do you guys go about doing link up/down as a resource?

I apply or remove addresses on the interface, using "IPaddr2" and "IPv6addr", 
which I know is not the same thing.

Why do you separately want to control link up/down?  I can't think what I 
would use this for.


Antony.

-- 
https://tools.ietf.org/html/rfc6890 - providing 16 million IPv4 addresses for 
talking to yourself.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] no systemd - ?

2021-12-18 Thread Antony Stone
On Saturday 18 December 2021 at 16:46:55, lejeczek via Users wrote:

> hi guys
> 
> I've always been an RHEL/Fedora user and memories of times before 'systemd'
> have almost completely vacated my brain - nowadays, is it possible to have HA
> without systemd, would you know, and if so, then how would that work?

It definitely works, but I rather doubt you can do this under RHEL/Fedora (RH 
being the primary developers of systemd).

I have been using https://www.devuan.org/ very happily for ~4 years, together 
with corosync / pacemaker.

There might be RPM-based distros which exclude systemd, but RPM was never my 
personal preference, so I can't point you in that direction :(

However, there's absolutely no requirement to be using systemd in order to 
enjoy corosync & pacemaker :)


Antony.

-- 
2 days of trial and error can easily save you 5 minutes spent reading the 
manual.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] 8 node cluster

2021-09-07 Thread Antony Stone
On Tuesday 07 September 2021 at 19:37:33, M N S H SNGHL wrote:

> I am looking for some suggestions here. I have created an 8 node HA cluster
> on my SuSE hosts.

An even number of nodes is never a good idea.

> 1) The resources should work fine even if 7 nodes go down, which means
> surviving node should still be running the resources.

> I did set "last_man_standing (and last_man_standing_window) option, with
> ATB .. but it didn't really work or didn't dynamically reduce the expected
> votes.

What do the log files (especially on that "last man") tell you happened as you 
gradually reduced the number of nodes online?
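
For reference, those options live in the quorum section of corosync.conf; a 
minimal sketch (the window value is illustrative) looks like:

quorum {
    provider: corosync_votequorum
    auto_tie_breaker: 1
    last_man_standing: 1
    last_man_standing_window: 20000
}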

> 2) Another requirement is - If all nodes in the cluster go down, and just
> one (anyone) comes back up, it should pick up the resources and should run
> them.

So, how should this one node realise that it is the only node awake and should 
be running the reources, and that there aren't {1..7} other nodes somewhere 
else on the network, all in the same situation, thinking "I can't connect to 
anyone else, but I'm alive, so I'll take on the resources"?

> I tried setting ignore-quorum-policy to ignore, and which worked most of
> the time... (yet to find the case where it doesn't work).. but I am
> suspecting, wouldn't this setting cause split-brain in some cases?

I think you're taking the wrong approach to HA.  Some number of nodes (plural) 
need to be in communication with each other in order for them to decide 
whether they have quorum or not, and can decide to be in charge of the 
resources.

Two basic rules of HA:

1. One node on its own has no clue whatever else is going on with the rest of 
the cluster, and therefore cannot decide to take charge

2. Quorum (unless you override it and really know what you're doing) requires 
>50% of nodes to be in agreement, and an even number of nodes can split into 
50:50, where neither half (literally) is >50%, so everything stops.  This is 
"split brain".

I have two questions:

 - why do you feel you need as many as 8 nodes when the resources will only be 
running on one node?

 - why do you specifically want 8 nodes instead of 7 or 9?


Antony.

-- 
The Royal Society for the Prevention of Cruelty to Animals was formed in 1824.
The National Society for the Prevention of Cruelty to Children was not formed 
until 1884.
That says something about the British.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Sub‑clusters / super‑clusters - working :)

2021-08-06 Thread Antony Stone
On Friday 06 August 2021 at 15:12:57, Andrei Borzenkov wrote:

> On Fri, Aug 6, 2021 at 3:42 PM Antony Stone wrote:
> > On Friday 06 August 2021 at 14:14:09, Andrei Borzenkov wrote:
> > > 
> > > If connectivity between (any two) sites is lost you may end up with
> > > one of A or B going out of quorum.
> > 
> > Agreed.
> > 
> > > While this will stop active resources and restart them on another site,
> > 
> > No.  Resources do not start on the "wrong" site because of:
> > location pref_A GroupA rule -inf: site ne cityA
> > location pref_B GroupB rule -inf: site ne cityB
> > 
> > The resources in GroupA either run in cityA or they do not run at all.
> 
> Where did I say anything about group A or B? You have single resource
> that can migrate between sites
> 
> location no_pref Anywhere rule -inf: site ne cityA and site ne cityB

In fact that rule turns out to be unnecessary, because of:

colocation Ast 100: Anywhere [ GroupA GroupB ]

(apologies for the typo the first time I posted that, corrected in my previous 
reply to this one).

This ensures that the "Anywhere" resource group runs either on the machine 
which is running the "GroupA" group or the one which is running the "GroupB" 
group.  This is an added bonus which I find useful, so that only one machine at 
each site is running all the resources at that site.

> I have no idea what "Asterisk in cityA" means because I see only one
> resource named Asterisk which is not restricted to a single site
> according to your configuration.

Ah, I see the confusion.  I used Asterisk as a simple resource in my example, 
as the thing I wanted to run just once, somewhere.

In fact, for the real setup, where GroupA and GroupB each comprise 10 
resources, and the Anywhere group comprises two, Asterisk is one of the 10 
resources which do run at both sites.

> The only resource that allegedly can migrate between sites in
> configuration you have shown so far is Asterisk.

Yes, in my example documented here.

> Now you say this resource never migrates between sites.

Yes, for my real configuration, which contains 10 resources (one of which is 
Asterisk) in each of GroupA and GroupB, and is therefore over-complicated to 
quote as a proof-of-concept here.

> I'm not sure how helpful this will be to anyone reading archives because I
> completely lost all track of what you tried to achieve.

That can be expressed very simply:

1. A group of resources named GroupA which either run in cityA or do not run 
at all.

2. A group of resources named GroupB which either run in cityB or do not run 
at all.

3. A group of resources name Anywhere which run in either cityA or cityB but 
not both.


Antony.

-- 
Numerous psychological studies over the years have demonstrated that the 
majority of people genuinely believe they are not like the majority of people.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Sub‑clusters / super‑clusters - working :)

2021-08-06 Thread Antony Stone
On Friday 06 August 2021 at 14:47:03, Ulrich Windl wrote:

> Antony Stone wrote on 06.08.2021 at 14:41
> 
> > location pref_A GroupA rule -inf: site ne cityA
> > location pref_B GroupB rule -inf: site ne cityB
> 
> I'm wondering whether the first is equivalent to
> location pref_A GroupA rule inf: site eq cityA

I certainly believe it is.

> In that case I think it's more clear (avoiding double negation).

Fair point :)


Antony.

-- 
3 logicians walk into a bar. The bartender asks "Do you all want a drink?"
The first logician says "I don't know."
The second logician says "I don't know."
The third logician says "Yes!"

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Sub‑clusters / super‑clusters - working :)

2021-08-06 Thread Antony Stone
On Friday 06 August 2021 at 14:14:09, Andrei Borzenkov wrote:

> On Thu, Aug 5, 2021 at 3:44 PM Antony Stone wrote:
> > 
> > For anyone interested in the detail of how to do this (without needing
> > booth), here is my cluster.conf file, as in "crm configure load replace
> > cluster.conf":
> > 
> > 
> > node tom attribute site=cityA
> > node dick attribute site=cityA
> > node harry attribute site=cityA
> > 
> > node fred attribute site=cityB
> > node george attribute site=cityB
> > node ron attribute site=cityB
> > 
> > primitive A-float IPaddr2 \
> >   params ip=192.168.32.250 cidr_netmask=24 \
> >   meta migration-threshold=3 failure-timeout=60 \
> >   op monitor interval=5 timeout=20 on-fail=restart
> > primitive B-float IPaddr2 \
> >   params ip=192.168.42.250 cidr_netmask=24 \
> >   meta migration-threshold=3 failure-timeout=60 \
> >   op monitor interval=5 timeout=20 on-fail=restart
> > primitive Asterisk asterisk \
> >   meta migration-threshold=3 failure-timeout=60 \
> >   op monitor interval=5 timeout=20 on-fail=restart
> > 
> > group GroupA A-float meta resource-stickiness=100
> > group GroupB B-float meta resource-stickiness=100
> > group Anywhere Asterisk meta resource-stickiness=100
> > 
> > location pref_A GroupA rule -inf: site ne cityA
> > location pref_B GroupB rule -inf: site ne cityB
> > location no_pref Anywhere rule -inf: site ne cityA and site ne cityB
> > 
> > colocation Ast 100: Anywhere [ cityA cityB ]
> 
> You define a resource set, but there are no resources cityA or cityB,
> at least you do not show them. So it is not quite clear what this
> colocation does.

Apologies - I had used different names in my test setup, and converted them to 
cityA etc for the sake of continuity in this discussion.

That should be:

colocation Ast 100: Anywhere [ GroupA GroupB ]

> > property cib-bootstrap-options: stonith-enabled=no no-quorum-policy=stop
> 
> If connectivity between (any two) sites is lost you may end up with
> one of A or B going out of quorum.

Agreed.

> While this will stop active resources and restart them on another site,

No.  Resources do not start on the "wrong" site because of:

location pref_A GroupA rule -inf: site ne cityA
location pref_B GroupB rule -inf: site ne cityB

The resources in GroupA either run in cityA or they do not run at all.

> there is no coordination between stopping and starting so for some time
> resources will be active on both sites. It is up to you to evaluate whether
> this matters.

Any resource which tried to start at the wrong site would simply fail, because 
the IP addresses involved do not work at the "other" site.

> If this matters your solution does not protect against it.
> 
> If this does not matter, the usual response is - why do you need a
> cluster in the first place? Why not simply always run asterisk on both
> sites all the time?

Because Asterisk at cityA is bound to a floating IP address, which is held on 
one of the three machines in cityA.  I can't run Asterisk on all three 
machines there because only one of them has the IP address.

Asterisk _does_ normally run on both sites all the time, but only on one 
machine at each site.

> > start-failure-is-fatal=false cluster-recheck-interval=60s
> > 
> > 
> > Of course, the group definitions are not needed for single resources, but
> > I shall in practice be using multiple resources which do need groups, so
> > I wanted to ensure I was creating something which would work with that.
> 
> > I have tested it by:
> ...
> >  - causing a network failure at one city (so it simply disappears without
> > stopping corosync neatly): the other city continues its resources (plus
> > the "anywhere" resource), the isolated city stops
> 
> If the site is completely isolated it probably does not matter whether
> anything is active there. It is partial connectivity loss where it
> becomes interesting.

Agreed, however my testing shows that resources which I want running in cityA 
are either running there or they're not (they never move to cityB or cityC), 
similarly for cityB, and the resources I want just a single instance of are 
doing just that, and on the same machine at cityA or cityB as the local 
resources are running on.


Thanks for the feedback,


Antony.

-- 
"Measuring average network latency is about as useful as measuring the mean 
temperature of patients in a hospital."

 - Stéphane Bortzmeyer

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Sub‑clusters / super‑clusters - working :)

2021-08-05 Thread Antony Stone
On Thursday 05 August 2021 at 15:44:18, Ulrich Windl wrote:

> Hi!
> 
> Nice to hear. What could be "interesting" is how stable the WAN-type of
> corosync communication works.

Well, between cityA and cityB it should be pretty good, because these are two 
data centres on opposite sides of England run by the same hosting provider 
(with private dark fibre between them, not dependent on the Internet).

> If it's not that stable, the cluster could try to fence nodes rather
> frequently. OK, you disabled fencing; maybe it works without.

I'm going to find out :)

> Did you tune the parameters?

No:

a) I only just got it working today :)

b) I got it working on a bunch of VMs in my own personal hosting environment; 
I haven't tried it in the real data centres yet.

At the moment I regard it as a Proof of Concept to show that the design works.


Antony.

-- 
Heisenberg, Gödel, and Chomsky walk in to a bar.
Heisenberg says, "Clearly this is a joke, but how can we work out if it's 
funny or not?"
Gödel replies, "We can't know that because we're inside the joke."
Chomsky says, "Of course it's funny. You're just saying it wrong."

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Sub‑clusters / super‑clusters - working :)

2021-08-05 Thread Antony Stone
On Thursday 05 August 2021 at 10:51:37, Antony Stone wrote:

> On Thursday 05 August 2021 at 07:48:37, Ulrich Windl wrote:
> > 
> > Have you ever tried to find out why this happens? (Talking about logs)
> 
> Not in detail, no, but just in case there's a chance of getting this
> working as suggested simply using location constraints, I shall look
> further.

I now have a working solution - thank you to everyone who has helped.

The answer to the problem above was simple - with a 6-node cluster, 3 votes is 
not quorum.

I added a 7th node (in "city C") and adjusted the location constraints to 
ensure that cluster A resources run in city A, cluster B resources run in city 
B, and the "anywhere" resource runs in either city A or city B.

I've even added a colocation constraint to ensure that the "anywhere" resource 
runs on the same machine in either city A or city B as is running the local 
resources there (which wasn't a strict requirement, but is very useful).

For anyone interested in the detail of how to do this (without needing booth), 
here is my cluster.conf file, as in "crm configure load replace cluster.conf":


node tom attribute site=cityA
node dick attribute site=cityA
node harry attribute site=cityA

node fred attribute site=cityB
node george attribute site=cityB
node ron attribute site=cityB

primitive A-float IPaddr2 \
  params ip=192.168.32.250 cidr_netmask=24 \
  meta migration-threshold=3 failure-timeout=60 \
  op monitor interval=5 timeout=20 on-fail=restart
primitive B-float IPaddr2 \
  params ip=192.168.42.250 cidr_netmask=24 \
  meta migration-threshold=3 failure-timeout=60 \
  op monitor interval=5 timeout=20 on-fail=restart
primitive Asterisk asterisk \
  meta migration-threshold=3 failure-timeout=60 \
  op monitor interval=5 timeout=20 on-fail=restart

group GroupA A-float meta resource-stickiness=100
group GroupB B-float meta resource-stickiness=100
group Anywhere Asterisk meta resource-stickiness=100

location pref_A GroupA rule -inf: site ne cityA
location pref_B GroupB rule -inf: site ne cityB
location no_pref Anywhere rule -inf: site ne cityA and site ne cityB

colocation Ast 100: Anywhere [ cityA cityB ]

property cib-bootstrap-options: stonith-enabled=no no-quorum-policy=stop \
  start-failure-is-fatal=false cluster-recheck-interval=60s


Of course, the group definitions are not needed for single resources, but I 
shall in practice be using multiple resources which do need groups, so I 
wanted to ensure I was creating something which would work with that.

I have tested it by:

 - bringing up one node at a time: as soon as any 4 nodes are running, all 
possible resources are running

 - bringing up 5 or more nodes: all resources run

 - taking down one node at a time to a maximum of three nodes offline: if at 
least one node in a given city is running, the resources at that city are 
running

 - turning off (using "halt", so that corosync dies nicely) all three nodes in 
a city simultaneously: that city's resources stop running, the other city 
continues working, as well as the "anywhere" resource

 - causing a network failure at one city (so it simply disappears without 
stopping corosync neatly): the other city continues its resources (plus the 
"anywhere" resource), the isolated city stops

For me, this is the solution I wanted, and in fact it's even slightly better 
than the previous two isolated 3-node clusters I had, because I can now have 
resources running on a single active node in cityA (provided it can see at 
least 3 other nodes in cityB or cityC), which wasn't possible before.


Once again, thanks to everyone who has helped me to achieve this result :)


Antony.

-- 
"The future is already here.   It's just not evenly distributed yet."

 - William Gibson

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Sub‑clusters / super‑clusters?

2021-08-05 Thread Antony Stone
On Thursday 05 August 2021 at 07:43:30, Andrei Borzenkov wrote:

> On 05.08.2021 00:01, Antony Stone wrote:
> > 
> > Requirements 1, 2 and 3 are easy to achieve - don't connect the clusters.
> > 
> > Requirement 4 is the one I'm stuck with how to implement.
> 
> You either have single cluster and define appropriate location
> constraints or you have multiple clusters and configure geo-cluster on
> top of them. But you already have been told it multiple times.
> 
> > If the three nodes comprising cluster A can manage resources such that
> > they run on only one of the three nodes at any time, surely there must
> > be a way of doing the same thing with a resource running on one of three
> > clusters?
> 
> You need something that coordinates resources between three clusters and
> that is booth.

Indeed:

On Wednesday 04 August 2021 at 12:48:37, Antony Stone wrote:

> I'm going to look into booth as suggested by others.

Thanks,


Antony.

-- 
+++ Divide By Cucumber Error.  Please Reinstall Universe And Reboot +++

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Sub‑clusters / super‑clusters?

2021-08-05 Thread Antony Stone
On Thursday 05 August 2021 at 07:48:37, Ulrich Windl wrote:

> Antony Stone wrote on 04.08.2021 at 21:27:
> > 
> > As soon as I connect the clusters at city A and city B, and apply the
> > location constraints and weighting rules you have suggested:
> > 
> > 1. everything works, including the single resource at either city A or
> > city B, so long as both clusters are operational.
> > 
> > 2. as soon as one cluster fails (all three of its nodes nodes become
> > unavailable), then the other cluster stops running all its resources as
> > well. This is even with quorum=2.
> 
> Have you ever tried to find out why this happens? (Talking about logs)

Not in detail, no, but just in case there's a chance of getting this working 
as suggested simply using location constraints, I shall look further.

Thanks,


Antony.

-- 
This sentence contains exacly three erors.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Sub‑clusters / super‑clusters?

2021-08-04 Thread Antony Stone
On Wednesday 04 August 2021 at 22:06:39, Frank D. Engel, Jr. wrote:

> There is no safe way to do what you are trying to do.
> 
> If the resource is on cluster A and contact is lost between clusters A
> and B due to a network failure, how does cluster B know if the resource
> is still running on cluster A or not?
>
> It has no way of knowing if cluster A is even up and running.
> 
> In that situation it cannot safely start the resource.

I am perfectly happy to have an additional machine at a third location in 
order to avoid this split-brain between two clusters.

However, what I cannot have is for the resources which should be running on 
cluster A to get started on cluster B.

If cluster A is down, then its resources should simply not run - as happens 
right now with two independent clusters.

Suppose for a moment I had three clusters at three locations: A, B and C.

Is there a method by which I can have:

1. Cluster A resources running on cluster A if cluster A is functional and not 
running anywhere if cluster A is non-functional.

2. Cluster B resources running on cluster B if cluster B is functional and not 
running anywhere if cluster B is non-functional.

3. Cluster C resources running on cluster C if cluster C is functional and not 
running anywhere if cluster C is non-functional.

4. Resource D running _somewhere_ on clusters A, B or C, but only a single 
instance of D at a single location at any time.

Requirements 1, 2 and 3 are easy to achieve - don't connect the clusters.

Requirement 4 is the one I'm stuck with how to implement.

If the three nodes comprising cluster A can manage resources such that they 
run on only one of the three nodes at any time, surely there must be a way of 
doing the same thing with a resource running on one of three clusters?


Antony.

-- 
I don't know, maybe if we all waited then cosmic rays would write all our 
software for us. Of course it might take a while.

 - Ron Minnich, Los Alamos National Laboratory

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Sub‑clusters / super‑clusters?

2021-08-04 Thread Antony Stone
On Wednesday 04 August 2021 at 20:57:49, Strahil Nikolov wrote:

> That's why you need a qdisk at a 3-rd location, so you will have 7 votes in
> total.When 3 nodes in cityA die, all resources will be started on the
> remaining 3 nodes.

I think I have not explained this properly.

I have three nodes in city A which run resources which have to run in city A.  
They are based on IP addresses which are only valid on the network in city A.

I have three nodes in city B which run resources which have to run in city B.  
They are based on IP addresses which are only valid on the network in city B.

I have redundant routing between my upstream provider, and cities A and B, so 
that I only _need_ resources to be running in one of the two cities for 
everything to work as required.  City A can go completely offline and not run 
its resources, and everything I need continues to work via city B.

I now have an additional requirement to run a single resource at either city A 
or city B but not both.

As soon as I connect the clusters at city A and city B, and apply the location 
constraints and weighting rules you have suggested:

1. everything works, including the single resource at either city A or city B, 
so long as both clusters are operational.

2. as soon as one cluster fails (all three of its nodes nodes become 
unavailable), then the other cluster stops running all its resources as well.  
This is even with quorum=2.

This means I have lost the redundancy between my two clusters, which is based 
on the expectation that only one cluster will fail at a time.  If the failure 
of one automatically _causes_ the failure of the other, I have no high 
availability any more.

What I require is for cluster A to continue running its own resources, plus 
the single resource which can run anywhere, in the event that cluster B fails.

In other words, I need the exact same outcome as I have at present if cluster 
B fails (its resources stop, cluster A is unaffected), except that cluster A 
continues to run the single resource which I need just a single instance of.

It is impossible for the nodes at city A to run the resources which should be 
running at city B, partly because some of them are identical ("Asterisk" as a 
resource, for example, is already running at city A), and partly because some 
of them are bound to the networking arrangements (I cannot set a floating IP 
address which belongs in city A on a machine which exists in city B - it just 
doesn't work).

Therefore if adding a seventh node at a third location would try to start 
_all_ resources in city A if city B goes down, it is not a working solution.  
If city B goes down then I simply do not want its resources to be running 
anywhere, just the same as I have now with the two independent clusters.


Thanks,


Antony.

-- 
"In fact I wanted to be John Cleese and it took me some time to realise that 
the job was already taken."

 - Douglas Adams

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Sub‑clusters / super‑clusters?

2021-08-04 Thread Antony Stone
On Wednesday 04 August 2021 at 16:07:39, Andrei Borzenkov wrote:

> On Wed, Aug 4, 2021 at 5:03 PM Antony Stone wrote:
> > On Wednesday 04 August 2021 at 13:31:12, Andrei Borzenkov wrote:
> > > On Wed, Aug 4, 2021 at 1:48 PM Antony Stone wrote:
> > > > On Tuesday 03 August 2021 at 12:12:03, Strahil Nikolov via Users
> > > > wrote:
> > > > > Won't something like this work ? Each node in LA will have same
> > > > > score of 5000, while other cities will be -5000.
> > > > > 
> > > > > pcs constraint location DummyRes1 rule score=5000 city eq LA
> > > > > pcs constraint location DummyRes1 rule score=-5000 city ne LA
> > > > > stickiness -> 1
> > > > 
> > > > Thanks for the idea, but no difference.
> > > > 
> > > > Basically, as soon as zero nodes in one city are available, all
> > > > resources, including those running perfectly at the other city, stop.
> > > 
> > > That is not what you originally said.
> > > 
> > > You said you have 6 node cluster (3 + 3) and 2 nodes are not available.
> > 
> > No, I don't think I said that?
> 
> "With the new setup, if two machines in city A fail, then _both_
> clusters stop working"

Ah, apologies - that was a typo.  "With the new setup, if the machines in city 
A fail, then _both_ clusters stop working".

So, basically what I'm saying is that with two separate clusters, if one 
fails, the other keeps going (as one would expect).

Joining the two clusters together so that I can have a single floating resource 
which can run anywhere (as well as the exact same location-specific resources 
as before) results in one cluster failure taking the other cluster down too.

I need one fully-working 3-node cluster to keep going, no matter what the 
other cluster does.


Antony.

-- 
It is also possible that putting the birds in a laboratory setting 
inadvertently renders them relatively incompetent.

 - Daniel C Dennett

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Sub‑clusters / super‑clusters?

2021-08-04 Thread Antony Stone
On Wednesday 04 August 2021 at 13:31:12, Andrei Borzenkov wrote:

> On Wed, Aug 4, 2021 at 1:48 PM Antony Stone wrote:
> > On Tuesday 03 August 2021 at 12:12:03, Strahil Nikolov via Users wrote:
> > > Won't something like this work ? Each node in LA will have same score
> > > of 5000, while other cities will be -5000.
> > > 
> > > pcs constraint location DummyRes1 rule score=5000 city eq LA
> > > pcs constraint location DummyRes1 rule score=-5000 city ne LA
> > > stickiness -> 1
> > 
> > Thanks for the idea, but no difference.
> > 
> > Basically, as soon as zero nodes in one city are available, all
> > resources, including those running perfectly at the other city, stop.
> 
> That is not what you originally said.
> 
> You said you have 6 node cluster (3 + 3) and 2 nodes are not available.

No, I don't think I said that?

With the new setup, if 2 nodes are not available, everything carries on 
working; it doesn't matter whether the two nodes are in the same or different 
locations.  That's fine.

My problem is that with the new setup, if three nodes at one location go down, 
then *everything* stops, including the resources I want to carry on running at 
the other location.

Under my previous, working arrangement with two separate clusters, one data 
centre going down does not affect the other, therefore I have a fully working 
system (since the two data centres provide identical services with redundant 
routing).

A failure of one data centre taking down working services in the other data 
centre is not the high availability solution I'm looking for - it's more like 
high unavailability :)


Antony.

-- 
BASIC is to computer languages what Roman numerals are to arithmetic.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Sub-clusters / super-clusters?

2021-08-04 Thread Antony Stone
On Tuesday 03 August 2021 at 12:12:03, Strahil Nikolov via Users wrote:

> Won't something like this work ? Each node in LA will have same score of
> 5000, while other cities will be -5000.
>
> pcs constraint location DummyRes1 rule score=5000 city eq LA
> pcs constraint location DummyRes1 rule score=-5000 city ne LA
> stickiness -> 1

Thanks for the idea, but no difference.

Basically, as soon as zero nodes in one city are available, all resources, 
including those running perfectly at the other city, stop.

I'm going to look into booth as suggested by others.

Thanks,


Antony.

-- 
Atheism is a non-prophet-making organisation.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Sub-clusters / super-clusters?

2021-08-03 Thread Antony Stone
On Tuesday 11 May 2021 at 12:56:01, Strahil Nikolov wrote:

> Here is the example I had promised:
>
> pcs node attribute server1 city=LA
> pcs node attribute server2 city=NY
>
> # Don't run on any node that is not in LA
> pcs constraint location DummyRes1 rule score=-INFINITY city ne LA
> 
> #Don't run on any node that is not in NY
> pcs constraint location DummyRes2 rule score=-INFINITY city ne NY
>
> The idea is that if you add a node and you forget to specify the attribute
> with the name 'city' , DummyRes1 & DummyRes2 won't be started on it.
> 
> For resources that do not have a constraint based on the city -> they will
> run everywhere unless you specify a colocation constraint between the
> resources.

Excellent - thanks.  I happen to use crmsh rather than pcs, but I've adapted 
the above and got it working.
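
For anyone else adapting it, a rough crmsh equivalent of the pcs commands above 
might look like this - a sketch only, with the node, resource and constraint 
names obviously being placeholders:

crm node attribute server1 set city LA
crm node attribute server2 set city NY

# Don't run on any node that is not in LA
crm configure location DummyRes1-in-LA DummyRes1 rule -inf: city ne LA

# Don't run on any node that is not in NY
crm configure location DummyRes2-in-NY DummyRes2 rule -inf: city ne NY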

Unfortunately, there is a problem.

My current setup is:

One 3-machine cluster in city A running a bunch of resources between them, the 
most important of which for this discussion is Asterisk telephony.

One 3-machine cluster in city B doing exactly the same thing.

The two clusters have no knowledge of each other.

I have high-availability routing between my clusters and my upstream telephony 
provider, such that a call can be handled by Cluster A or Cluster B, and if 
one is unavailable, the call gets routed to the other.

Thus, a total failure of Cluster A means I still get phone calls, via Cluster 
B.


To implement the above "one resource which can run anywhere, but only a single 
instance", I joined together clusters A and B, and placed the corresponding 
location constraints on the resources I want only at A and the ones I want 
only at B.  I then added the resource with no location constraint, and it runs 
anywhere, just once.

So far, so good.


The problem is:

With the two independent clusters, if two machines in city A fail, then 
Cluster A fails completely (no quorum), and Cluster B continues working.  That 
means I still get phone calls.

With the new setup, if two machines in city A fail, then _both_ clusters stop 
working and I have no functional resources anywhere.


So, my question now is:

How can I have a 3-machine Cluster A running local resources, and a 3-machine 
Cluster B running local resources, plus one resource running on either Cluster 
A or Cluster B, but without a failure of one cluster causing _everything_ to 
stop?


Thanks,


Antony.

-- 
One tequila, two tequila, three tequila, floor.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] One Failed Resource = Failover the Cluster?

2021-06-07 Thread Antony Stone
On Monday 07 June 2021 at 21:49:45, Eric Robinson wrote:

> > -Original Message-
> > From: kgail...@redhat.com 
> > Sent: Monday, June 7, 2021 2:39 PM
> > To: Strahil Nikolov ; Cluster Labs - All topics
> > related to open-source clustering welcomed ; Eric
> > Robinson 
> > Subject: Re: [ClusterLabs] One Failed Resource = Failover the Cluster?
> > 
> > By default, dependent resources in a colocation will affect the placement
> > of the resources they depend on.
> > 
> > In this case, if one of the mysql instances fails and meets its migration
> > threshold, all of the resources will move to another node, to maximize
> > the chance of all of them being able to run.
> 
> Which is what I don't want to happen. I only want the cluster to failover
> if one of the lower dependencies fails (drbd or filesystem). If one of the
> MySQL instances fails, I do not want the cluster to move everything for
> the sake of that one resource. That's like a teacher relocating all the
> students in the classroom to a new classroom because one of then lost his
> pencil.

Okay, so let's focus on what you *do* want to happen.

One MySQL instance fails.  Nothing else does.

What do you want next?

 - Cluster continues with a failed MySQL resource?

 - MySQL resource moves to another node but no other resources move?

 - something else I can't really imagine right now?


I'm sure that if you can define what you want the cluster to do in this 
situation (MySQL fails, all else continues okay), someone here can help you 
explain that to pacemaker.


Antony.

-- 
This email was created using 100% recycled electrons.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Sub-clusters / super-clusters?

2021-05-10 Thread Antony Stone
On Monday 10 May 2021 at 16:49:07, Strahil Nikolov wrote:

> You can use  node attributes to define in which  city is each host and then
> use a location constraint to control in which city to run/not run the
> resources. I will try to provide an example tomorrow.

Thank you - that would be helpful.

I did think that a location constraint could be a way to do this, but I wasn't 
sure how to label three machines in one cluster as a "single location".

Any pointers most welcome :)

>   On Mon, May 10, 2021 at 15:52, Antony Stone wrote:
> >   On Monday 10 May 2021 at 14:41:37, Klaus Wenninger wrote:
> > On 5/10/21 2:32 PM, Antony Stone wrote:
> > > Hi.
> > > 
> > > I'm using corosync 3.0.1 and pacemaker 2.0.1, currently in the
> > > following way:
> > > 
> > > I have two separate clusters of three machines each, one in a data
> > > centre in city A, and one in a data centre in city B.
> > > 
> > > Several of the resources being managed by these clusters are based on
> > > floating IP addresses, which are tied to the data centre, therefore the
> > > resources in city A can run on any of the three machines there (alfa,
> > > bravo and charlie), but cannot run on any machine in city B (delta,
> > > echo and foxtrot).
> > > 
> > > I now have a need to create a couple of additional resources which can
> > > operate from anywhere, so I'm wondering if there is a way to configure
> > > corosync / pacemaker so that:
> > > 
> > > Machines alfa, bravo, charlie live in city A and manage resources X, Y
> > > and Z between them.
> > > 
> > > Machines delta, echo and foxtrot live in city B and manage resources U,
> > > V and W between them.
> > > 
> > > All of alpha to foxtrot are also in a "super-cluster" managing
> > > resources P and Q, so these two can be running on any of the 6
> > > machines.
> > > 
> > > 
> > > I hope the question is clear.  Is there an answer :) ?
> > 
> > Sounds like a use-case for https://github.com/ClusterLabs/booth
> 
> Interesting - hadn't come across that feature before.
> 
> Thanks - I'll look into further documentation.
> 
> If anyone else has any other suggestions I'm happy to see whether something
> else might work better for my setup.
> 
> 
> Antony.

-- 
90% of networking problems are routing problems.
9 of the remaining 10% are routing problems in the other direction.
The remaining 1% might be something else, but check the routing anyway.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Sub-clusters / super-clusters?

2021-05-10 Thread Antony Stone
On Monday 10 May 2021 at 14:41:37, Klaus Wenninger wrote:

> On 5/10/21 2:32 PM, Antony Stone wrote:
> > Hi.
> > 
> > I'm using corosync 3.0.1 and pacemaker 2.0.1, currently in the following
> > way:
> > 
> > I have two separate clusters of three machines each, one in a data centre
> > in city A, and one in a data centre in city B.
> > 
> > Several of the resources being managed by these clusters are based on
> > floating IP addresses, which are tied to the data centre, therefore the
> > resources in city A can run on any of the three machines there (alfa,
> > bravo and charlie), but cannot run on any machine in city B (delta, echo
> > and foxtrot).
> > 
> > I now have a need to create a couple of additional resources which can
> > operate from anywhere, so I'm wondering if there is a way to configure
> > corosync / pacemaker so that:
> > 
> > Machines alfa, bravo, charlie live in city A and manage resources X, Y
> > and Z between them.
> > 
> > Machines delta, echo and foxtrot live in city B and manage resources U, V
> > and W between them.
> > 
> > All of alpha to foxtrot are also in a "super-cluster" managing resources
> > P and Q, so these two can be running on any of the 6 machines.
> > 
> > 
> > I hope the question is clear.  Is there an answer :) ?
> 
> Sounds like a use-case for https://github.com/ClusterLabs/booth

Interesting - hadn't come across that feature before.

Thanks - I'll look into further documentation.

If anyone else has any other suggestions I'm happy to see whether something 
else might work better for my setup.


Antony.

-- 
What do you get when you cross a joke with a rhetorical question?

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Sub-clusters / super-clusters?

2021-05-10 Thread Antony Stone
Hi.

I'm using corosync 3.0.1 and pacemaker 2.0.1, currently in the following way:

I have two separate clusters of three machines each, one in a data centre in 
city A, and one in a data centre in city B.

Several of the resources being managed by these clusters are based on floating 
IP addresses, which are tied to the data centre, therefore the resources in 
city A can run on any of the three machines there (alfa, bravo and charlie), 
but cannot run on any machine in city B (delta, echo and foxtrot).

I now have a need to create a couple of additional resources which can operate 
from anywhere, so I'm wondering if there is a way to configure corosync / 
pacemaker so that:

Machines alfa, bravo, charlie live in city A and manage resources X, Y and Z 
between them.

Machines delta, echo and foxtrot live in city B and manage resources U, V and 
W between them.

All of alpha to foxtrot are also in a "super-cluster" managing resources P and 
Q, so these two can be running on any of the 6 machines.


I hope the question is clear.  Is there an answer :) ?


Thanks,


Antony.

-- 
Ramdisk is not an installation procedure.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Question about ping nodes

2021-04-17 Thread Antony Stone
On Saturday 17 April 2021 at 21:41:16, Piotr Kandziora wrote:

> Hi,
> 
> Hope some guru will advise here ;)
> 
> I've got two nodes cluster with some resource placement dependent on ping
> node visibility (
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/ht
> ml/high_availability_add-on_reference/s1-moving_resources_due_to_connectivi
> ty_changes-haar ).
> 
> Is it possible to do nothing with these resources when both nodes do not
> have access to the ping node?

The whole purpose of a ping node is to give a resource node confidence that it 
can safely provide resources to the network.  If it loses connectivity to the 
ping node, it considers itself dead / offline / non-functional.

If both resource nodes lose contact with the ping node, then neither of them 
have the confidence to provide resources, so both of them stop doing so.

All I can say is that if this is possible then either your resource nodes have 
poor network connectivity and should not be used as resource nodes, or else 
the ping node has poor reachability and should not be used as a ping node.

> Currently, when the ping node is unavailable (node itself becomes
> unavailable) both nodes stop the resources.

This is correct.  Neither node can tell whether it should take charge of the 
resources, so for the sake of safety, they both decline to do so.

Two node clusters are intrinsically fragile in this respect.
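
For context, a typical ping-node setup looks something like the following crmsh 
sketch (the address and names are made up); it is the -inf rule on the pingd 
attribute which stops resources wherever the ping node cannot be reached:

primitive p_ping ocf:pacemaker:ping \
    params host_list="192.0.2.1" multiplier=1000 dampen=5s \
    op monitor interval=10s
clone cl_ping p_ping
# keep MyResource away from any node that cannot see the ping node
location MyResource-needs-ping MyResource \
    rule -inf: not_defined pingd or pingd lte 0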


Antony.

-- 
https://tools.ietf.org/html/rfc6890 - providing 16 million IPv4 addresses for 
talking to yourself.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Single-node automated startup question

2021-04-14 Thread Antony Stone
On Wednesday 14 April 2021 at 19:33:39, Strahil Nikolov wrote:

> What about a small form factor device to serve as a quorum maker ?
> Best Regards,Strahil Nikolov

If you're going to take that approach, why not a virtual machine or two, 
hosted inside the physical machine which is your single real node?

I'm not necessarily advocating this method of achieving quorum, but it's 
probably an idea worth considering for specific situations.
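
For what it's worth, one common way to provide such an arbiter (physical or 
virtual) is corosync-qnetd on the third box plus corosync-qdevice on the 
cluster nodes; the node-side corosync.conf fragment then looks roughly like 
this (the address is made up):

quorum {
    provider: corosync_votequorum
    device {
        model: net
        net {
            host: 10.0.0.3
            algorithm: ffsplit
        }
    }
}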


Antony.

-- 
Someone has stolen all the toilets from New Scotland Yard.  Police say they 
have absolutely nothing to go on.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Why my node1 couldn't back to the clustering chain?

2021-04-09 Thread Antony Stone
On Friday 09 April 2021 at 11:06:14, Ulrich Windl wrote:

> # lscpu
> CPU(s):  144

> # free -h
> Mem:  754Gi

Nice :)

No doubt Jason would like to connect 8 of these together in a cluster...


Antony.

-- 
Numerous psychological studies over the years have demonstrated that the 
majority of people genuinely believe they are not like the majority of people.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why my node1 couldn't back to the clustering chain?

2021-04-09 Thread Antony Stone
On Friday 09 April 2021 at 10:34:33, Jason Long wrote:

> Thanks.
> I meant was a Cheat sheet.

I don't understand that sentence.

> Yes, something like rendering a 3D movie or... . The Corosync and Pacemaker
> are not OK for it? What kind of clustering using for rendering? Beowulf
> cluster?

Corosync and pacemaker are for High Availability, which generally means that 
you have more computing resources than you need at any given time, in order 
that a failed machine can be efficiently replaced by a working one.  If all 
your 
machines are busy, and one fails, you have no spare computing resources to 
take over from the failed one.

The setup you were asking about is High Performance computing, where you are 
trying to use the resources you have as efficiently and continuously as 
possible, therefore you don't have any spare capacity (since 'spare' means 
'wasted' in this regard).

A Beowulf Cluster is one example of the sort of thing you're asking about; for 
others, see the "Implementations" section of the URL I previously provided.


Antony.

-- 
https://tools.ietf.org/html/rfc6890 - providing 16 million IPv4 addresses for 
talking to yourself.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why my node1 couldn't back to the clustering chain?

2021-04-09 Thread Antony Stone
On Friday 09 April 2021 at 08:58:39, Jason Long wrote:

> Thank you so much for your great answers.
> As the final questions:

Really :) ?

> 1- Which commands are useful to monitoring and managing my pacemaker
> cluster?

Some people prefer https://crmsh.github.io/documentation/ and some people 
prefer https://github.com/ClusterLabs/pcs

> 2- I don't know if this is a right question or not. Consider 100 PCs that
> each of them have an Intel Core 2 Duo Processor (2 cores) with 4GB of RAM.
> How can I merge these PCs together so that I have a system with 200 CPUs
> and 400GB of RAM?

The answer to that depends on what you want to do with them.

As a general-purpose computing resource, you can't.  The CPU on machine A has 
no (reasonable) access to the RAM on machine B, so no part of the system can 
actually work with 400GBytes RAM.

For specialist purposes (generally speaking, performing the same tasks on 
small pieces of data all at the same time and then putting the results 
together at the end), you can create a very different type of "cluster" than 
the ones we talk about here with corosync and pacemaker.

https://en.wikipedia.org/wiki/Computer_cluster

A common usage for such a setup is frame rendering of computer generated films; 
give each of your 100 PCs one frame to render, put all the frames together in 
the right order at the end, and you've created your film in just over 1% of the 
time it would have taken on one computer (of the same type).


Regards,


Antony.

-- 
Most people have more than the average number of legs.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why my node1 couldn't back to the clustering chain?

2021-04-08 Thread Antony Stone
On Thursday 08 April 2021 at 21:33:48, Jason Long wrote:

> Yes, I just wanted to know. In clustering, when a node is down and
> go online again, then the cluster will not use it again until another node
> fails. Am I right?

Think of it like this:

You can have as many nodes in your cluster as you think you need, and I'm 
going to assume that you only need the resources running on one node at any 
given time.

Cluster management (eg: corosync / pacemaker) will ensure that the resources 
are running on *a* node.

The resources will be moved *away* from that node if they can't run there any 
more, for some reason (the node going down is a good reason).

However, there is almost never any concept of the resources being moved *to* a 
(specific) node.  If they get moved away from one node, then obviously they 
need to be moved to another one, but the move happens because the resources 
have to be moved *away* from the first node, not because the cluster thinks 
they need to be moved *to* the second node.

So, if a node is running its resources quite happily, it doesn't matter what 
happens to all the other nodes (provided quorum remains); the resources will 
stay running on that same node all the time.
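
If you ever *did* want a resource to gravitate towards a particular node 
whenever that node is available, that would be expressed as a location 
preference, along the lines of this crmsh sketch (names made up):

location prefer-node1 MyResource 100: node1

but, as above, nothing like that is configured by default.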


Antony.

-- 
Was ist braun, liegt ins Gras, und raucht?
Ein Kaminchen...

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why my node1 couldn't back to the clustering chain?

2021-04-08 Thread Antony Stone
On Thursday 08 April 2021 at 21:33:48, Jason Long wrote:

> Yes, I just wanted to know. In clustering, when a node is down and
> go online again, then the cluster will not use it again until another node
> fails. Am I right?

In general, yes - unless you have specified a location constraint for resources; 
however, as already discussed, this is unusual and doesn't apply by default.


Antony.

-- 
A user interface is like a joke.
If you have to explain it, it means it doesn't work.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why my node1 couldn't back to the clustering chain?

2021-04-08 Thread Antony Stone
On Thursday 08 April 2021 at 21:24:02, Jason Long wrote:

> Thanks.
> Thus, my cluster uses Node1 when Node2 is down?

Judging from your previous emails, you have a two node cluster.

What else is it going to use?


Antony.

-- 
Anything that improbable is effectively impossible.

 - Murray Gell-Mann, Nobel Prizewinner in Physics

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why my node1 couldn't back to the clustering chain?

2021-04-08 Thread Antony Stone
On Thursday 08 April 2021 at 16:55:47, Ken Gaillot wrote:

> On Thu, 2021-04-08 at 14:32 +, Jason Long wrote:
> > Why, when node1 is back, then web server still on node2? Why not
> > switched?
> 
> By default, there are no preferences as to where a resource should run.
> The cluster is free to move or leave resources as needed.
> 
> If you want a resource to prefer a particular node, you can use
> location constraints to express that. However there is rarely a need to
> do so; in most clusters, nodes are equally interchangeable.

I would add that it is generally preferable, in fact, to leave a resource 
where it is unless there's a good reason to move it.

Imagine for example that you have three nodes and a resource is running on 
node A.

Node A fails, the resource moves to node C, and node A then comes back again.

If the resource then got moved back to node A just because it had recovered, 
you've now had two transitions of the resource (each of which means *some* 
downtime, however small that may be), whereas if it remains running on node C 
until such time as there's a good reason to move it away, you've only had one 
transition to cope with.
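
That "stay where you are" behaviour is what resource-stickiness expresses; 
something like the following sketch makes the cost of moving explicit, so a 
recovered node does not pull resources back (the value 100 is just an example):

crm configure rsc_defaults resource-stickiness=100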


Antony.

-- 
"The future is already here.   It's just not evenly distributed yet."

 - William Gibson

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: cluster-recheck-interval and failure-timeout

2021-04-07 Thread Antony Stone
On Wednesday 07 April 2021 at 10:40:54, Ulrich Windl wrote:

> >>> Ken Gaillot  schrieb am 06.04.2021 um 15:58
> > On Tue, 2021-04-06 at 09:15 +0200, Ulrich Windl wrote:

> >> Sorry I don't get it: If you have a timestamp for each failure-
> >> timeout, what's so hard to put all the fail counts that are older than
> >> failure-timeout on a list, and then reset that list to zero?
> > 
> > That's exactly the issue -- we don't have a timestamp for each failure.
> > Only the most recent failed operation, and the total fail count (per
> > resource and operation), are stored in the CIB status.
> > 
> > We could store all failures in the CIB, but that would be a significant
> > project, and we'd need new options to keep the current behavior as the
> > default.
> 
> I still don't quite get it: Some failing operation increases the
> fail-count, and the time stamp for the failing operation is recorded
> (crm_mon can display it). So solving this problem (saving the last time
> for each fail count) doesn't look so hard to do.

For the avoidance of doubt, I (who started this thread) have solved my problem 
by following the advice from Reid Wahl - I was putting the "failure-timeout" 
parameter into the incorrect section of my resource definition.  Moving it to 
the "meta" section has resolved my problem.

The way it works now makes completely good sense to me:

1. A failure happens, and gets corrected.

2. Provided no further failure of that resource occurs within the failure-
timeout setting, the failure gets forgotten about.

3. If a further failure of the resource does occur within failure-timeout, the 
original timestamp is discarded, the failure count is incremented, and the 
timestamp of the new failure is used to check whether there's another failure 
within failure-timeout of *that*

4. If no further failure occurs within failure-timeout of the most recent 
failure timestamp, all previous failures are forgotten.

5. If enough failures occur within failure-timeout *of each other* then the 
failure count gets incremented to the point where the resource gets moved to 
another node.
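
For reference, the corrected definition now looks roughly like this (trimmed to 
the relevant parts) - the key point being that failure-timeout lives under 
meta, not under the monitor op:

primitive IP-float4 IPaddr2 \
    params ip=10.1.0.5 cidr_netmask=24 \
    meta migration-threshold=3 failure-timeout=180 \
    op monitor interval=10 timeout=30 on-fail=restart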

Regards,


Antony.

-- 
"It wouldn't be a good idea to talk about him behind his back in front of 
him."

 - murble

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] failure-timeout not working in corosync 2.0.1

2021-03-31 Thread Antony Stone
On Wednesday 31 March 2021 at 23:09:38, Antony Stone wrote:

> On Wednesday 31 March 2021 at 22:53:53, Reid Wahl wrote:
> > Hi, Antony. failure-timeout should be a resource meta attribute, not an
> > attribute of the monitor operation. At least I'm not aware of it being
> > configurable per-operation -- maybe it is. Can't check at the moment :)
> 
> Okay, I'll try moving it - but that still leaves me wondering why it works
> fine in pacemaker 1.1.16 and not in 2.0.1.

*Thank you, Reid*

It works.

Moving the failure-timeout specification to the "meta" section of the resource 
definition has caused the failures to disappear from "crm status -f" after the 
expected amount of time.

I am sure that this also means the resources are no longer going to move from 
node 1 to node 2 to node 3 and then get totally stuck.

I shall find out for sure by tomorrow (it's nearly midnight where I am now).

I already know what I need to do to stop this particular resource from having 
to be restarted so frequently, but the fact that the 2.0.1 cluster couldn't 
cope with it at all made me nervous about just doing that, and then never 
being confident that the cluster _could_ cope if a resource really needed to be 
restarted several times.

Pacemaker 1.1.16 could cope with the configuration fine, even though I was 
clearly putting failure-timeout into the wrong place in cluster.cib.

Once again, thank you Reid.


Antony.

-- 
What do you get when you cross a joke with a rhetorical question?

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] failure-timeout not working in corosync 2.0.1

2021-03-31 Thread Antony Stone
On Wednesday 31 March 2021 at 23:11:50, Reid Wahl wrote:

> Maybe Pacemaker-1 was looser in its handling of resource meta attributes vs
> operation meta attributes. Good question.

Returning to my suspicion that it's more likely that I simply did something 
wrong: what command can I use to find out what pacemaker thinks my cluster.cib 
file really means, so I can be sure pacemaker has interpreted it the same way 
I do?


Antony.

-- 
"I don't mind that he got rich, but I do mind that he peddles himself as the 
ultimate hacker and God's own gift to technology when his track record 
suggests that he wouldn't know a decent design idea or a well-written hunk of 
code if it bit him in the face. He's made his billions selling elaborately 
sugar-coated crap that runs like a pig on [sedatives], crashes at the drop of 
an electron, and has set the computing world back by at least a decade."

 - Eric S Raymond, about Bill Gates

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] failure-timeout not working in corosync 2.0.1

2021-03-31 Thread Antony Stone
On Wednesday 31 March 2021 at 22:53:53, Reid Wahl wrote:

> Hi, Antony. failure-timeout should be a resource meta attribute, not an
> attribute of the monitor operation. At least I'm not aware of it being
> configurable per-operation -- maybe it is. Can't check at the moment :)

Okay, I'll try moving it - but that still leaves me wondering why it works fine 
in pacemaker 1.1.16 and not in 2.0.1.


Antony.

-- 
Python is executable pseudocode.
Perl is executable line noise.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] failure-timeout not working in corosync 2.0.1

2021-03-31 Thread Antony Stone
Hi.

I've pared my configuration down to almost a bare minimum to demonstrate the 
problem I'm having.

I have two questions:

1. What command can I use to find out what pacemaker thinks my cluster.cib file 
really means?

I know what I put in it, but I want to see what pacemaker has understood from 
it, to make sure that pacemaker has the same idea about how to manage my 
resources as I do.


2. Can anyone tell me what the problem is with the following cluster.cib 
(lines split on spaces to make things more readable, the actual file consists 
of four lines of text):

primitive IP-float4
IPaddr2
params
ip=10.1.0.5
cidr_netmask=24
meta
migration-threshold=3
op
monitor
interval=10
timeout=30
on-fail=restart
failure-timeout=180
primitive IPsecVPN
lsb:ipsecwrapper
meta
migration-threshold=3
op
monitor
interval=10
timeout=30
on-fail=restart
failure-timeout=180
group Everything
IP-float4
IPsecVPN
resource-stickiness=100
property cib-bootstrap-options:
stonith-enabled=no
no-quorum-policy=stop
start-failure-is-fatal=false
cluster-recheck-interval=60s

My problem is that "failure-timeout" is not being honoured.  A resource 
failure simply never times out, and 3 failures (over a fortnight, if that's 
how long it takes to get 3 failures) mean that the resources move.

I want a failure to be forgotten about after 180 seconds (or at least, soon 
after that - 240 seconds would be fine, if cluster-recheck-interval means that 
180 can't quite be achieved).

Somehow or other, _far_ more than 180 seconds go by, and I *still* have:

fail-count=1 last-failure='Wed Mar 31 21:23:11 2021'

as part of the output of "crm status -f" (the above timestamp is BST, so 
that's 70 minutes ago now).


Thanks for any help,


Antony.

-- 
Don't procrastinate - put it off until tomorrow.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] cluster-recheck-interval and failure-timeout

2021-03-31 Thread Antony Stone
On Wednesday 31 March 2021 at 16:58:30, Antony Stone wrote:

> I'm only interested in the most recent failure.  I'm saying that once that
> failure is more than "failure-timeout" seconds old, I want the fact that
> the resource failed to be forgotten, so that it can be restarted or moved
> between nodes as normal, and not either be moved to another node just
> because (a) there were two failures last Friday and then one today, or (b)
> get stuck and not run on any nodes at all because all three nodes had
> three failures sometime in the past month.

I've just confirmed that this is working as expected on pacemaker 1.1.16 
(Debian 9) and is not working on pacemaker 2.0.1 (Debian 10).

I have one cluster of 3 machines running pacemaker 1.1.16 and I have another 
cluster of 3 machines running pacemaker 2.0.1

They are both running the same set of resources.

I just deliberately killed the same resource on each cluster, and sure enough 
"crm status -f" on both told me it had a fail-count of 1, with a last-failure 
timestamp.

I waited 5 minutes (well above my failure-timeout value) and asked for "crm 
status -f" again.

On pacemaker 1.1.16 there was simply a list of resources; no mention of 
failures.  Just what I want.

On pacemaker 2.0.1 there was a list of resources plus a fail-count=1 and a 
last-failure timestamp of 5 minutes earlier.

To be sure I'm not being impatient, I've left it an hour (I did this test 
earlier, while I was still trying to understand the timing interactions) and 
the fail-count does not go away.
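
For anyone wanting to reproduce the comparison, the checks boil down to 
something like this on each cluster (the resource name is just an example):

crm status -f                              # shows fail-count and last-failure
crm_failcount --query --resource Asterisk  # query the stored fail count directly
crm resource cleanup Asterisk              # clears it by hand - roughly what
                                           # failure-timeout should do automatically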


Does anyone have suggestions on how to debug this difference in behaviour 
between pacemaker 1.1.16 and 2.0.1?  At present it prevents me from upgrading 
an operational cluster, as the result is simply unusable.


Thanks,


Antony.

-- 
Perfection in design is achieved not when there is nothing left to add, but 
rather when there is nothing left to take away.

 - Antoine de Saint-Exupery

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] cluster-recheck-interval and failure-timeout

2021-03-31 Thread Antony Stone
On Wednesday 31 March 2021 at 15:48:15, Ken Gaillot wrote:

> On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
> > 
> > So, what am I misunderstanding about "failure-timeout", and what
> > configuration setting do I need to use to tell pacemaker that "provided the
> > resource hasn't failed within the past X seconds, forget the fact that it
> > failed more than X seconds ago"?
> 
> Unfortunately, there is no way. failure-timeout expires *all* failures
> once the *most recent* is that old.

I've re-read the above sentence, and in fact you seem to be agreeing with my 
expectation (which is not what happens).

> It's a bit counter-intuitive but currently, Pacemaker only remembers a
> resource's most recent failure and the total count of failures, and changing
> that would be a big project.

I'm only interested in the most recent failure.  I'm saying that once that 
failure is more than "failure-timeout" seconds old, I want the fact that the 
resource failed to be forgotten, so that it can be restarted or moved between 
nodes as normal, and not either be moved to another node just because (a) 
there were two failures last Friday and then one today, or (b) get stuck and 
not run on any nodes at all because all three nodes had three failures 
sometime in the past month.


Thanks,


Antony.


-- 
The Magic Words are Squeamish Ossifrage.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] cluster-recheck-interval and failure-timeout

2021-03-31 Thread Antony Stone
On Wednesday 31 March 2021 at 15:48:15, Ken Gaillot wrote:

> On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
>
> > So, what am I misunderstanding about "failure-timeout", and what
> > configuration setting do I need to use to tell pacemaker that "provided the
> > resource hasn't failed within the past X seconds, forget the fact that it
> > failed more than X seconds ago"?
> 
> Unfortunately, there is no way. failure-timeout expires *all* failures
> once the *most recent* is that old. It's a bit counter-intuitive but
> currently, Pacemaker only remembers a resource's most recent failure
> and the total count of failures, and changing that would be a big
> project.

So, are you saying that if a resource failed last Friday, and then again on 
Saturday, but has been running perfectly happily ever since, a failure today 
will trigger "that's it, we're moving it, it doesn't work here"?

That seems bizarre.

Surely the length of time a resource has been running without problem should 
be taken into account when deciding whether the node it's running on is fit to 
handle it or not?

My problem is also bigger than that - and I can't believe there isn't a way 
round the following, otherwise people couldn't use pacemaker:

I have "migration-threshold=3" on most of my resources, and I have three 
nodes.

If a resource fails for the third time (in any period of time) on a node, it 
gets moved (along with the rest in the group) to another node.  The cluster 
does not forget that it failed and was moved away from the first node, though.

"crm status -f" confirms that to me.

If it then fails three times (in an hour, or a fortnight, whatever) on the 
second node, it gets moved to node 3, and from that point on the cluster 
thinks there's nowhere else to move it to, so another failure means a total 
failure of the cluster.

There must be _something_ I'm doing wrong for the cluster to behave in this 
way?  I can't believe it's by design.


Regards,


Antony.

-- 
Anyone that's normal doesn't really achieve much.

 - Mark Blair, Australian rocket engineer

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] cluster-recheck-interval and failure-timeout

2021-03-31 Thread Antony Stone
Hi.

I'm trying to understand what looks to me like incorrect behaviour in the 
interaction between cluster-recheck-interval and failure-timeout, under 
pacemaker 2.0.1

I have three machines in a corosync (3.0.1 if it matters) cluster, managing 12 
resources in a single group.

I'm following documentation from:

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html

and

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-options.html

I have set a cluster property:

cluster-recheck-interval=60s

I have set a resource property:

failure-timeout=180

The docs say failure-timeout is "How many seconds to wait before acting as if 
the failure had not occurred, and potentially allowing the resource back to 
the node on which it failed."

I think this should mean that if the resource fails and gets restarted, the 
fact that it failed will be "forgotten" after 180 seconds (or maybe a little 
longer, depending on exactly when the next cluster recheck is done).

However what I'm seeing is that if the resource fails and gets restarted, and 
this then happens an hour later, it's still counted as two failures.  If it 
fails and gets restarted another hour after that, it's recorded as three 
failures and (because I have "migration-threshold=3") it gets moved to another 
node (and therefore all the other resources in group are moved as well).

So, what am I misunderstanding about "failure-timeout", and what configuration 
setting do I need to use to tell pacemaker that "provided the resource hasn't 
failed within the past X seconds, forget the fact that it failed more than X 
seconds ago"?


Thanks,


Antony.

-- 
The first fifty percent of an engineering project takes ninety percent of the 
time, and the remaining fifty percent takes another ninety percent of the time.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: ocf-tester always claims failure, even with built-in resource agents?

2021-03-29 Thread Antony Stone
On Monday 29 March 2021 at 09:03:10, Ulrich Windl wrote:

> >> So, that would be an extra parameter to the resource definition in
> >> cluster.cib?
> >> 
> >> Change:
> >> 
> >> primitive Asterisk asterisk meta migration-threshold=3 op monitor
> >> interval=5 timeout=30 on-fail=restart failure-timeout=10s
> >> 
> >> to:
> >> 
> >> primitive Asterisk asterisk meta migration-threshold=3 op monitor
> >> interval=5 timeout=30 on-fail=restart failure-timeout=10s trace_ra=1
> >> 
> >> ?
> 
> IMHO it does not make sense to have failure-timeout smaller than the
> monitoring interval;

Um, 10 seconds is not smaller than 5 seconds...


Antony.

-- 
Your work is both good and original.  Unfortunately the parts that are good 
aren't original, and the parts that are original aren't good.

 - Samuel Johnson

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?

2021-03-26 Thread Antony Stone
On Friday 26 March 2021 at 18:31:51, Ken Gaillot wrote:

> On Fri, 2021-03-26 at 19:59 +0300, Andrei Borzenkov wrote:
> > On 26.03.2021 17:28, Antony Stone wrote:
> > > 
> > > So far all is well and good, my cluster synchronises, starts the
> > > resources, and everything's working as expected.  It'll move the
> > > resources from one cluster member to another (either if I ask it to, or
> > > if there's a problem), and it seems to work just as the older version
> > > did.
> 
> I'm glad this far was easy :)

Well, I've been using corosync & pacemaker for some years now; I've got used 
to some of their quirks and foibles :)

Now I just need to learn about the new ones for the newer versions...

> It's worth noting that pacemaker itself doesn't try to validate the
> agent meta-data, it just checks for the pieces that are interesting to
> it and ignores the rest.

I guess that's good, so long as what it does pay attention to is what it wants 
to see?

> It's also worth noting that the OCF 1.0 standard is horribly outdated
> compared to actual use, and the OCF 1.1 standard is being adopted today
> (!) after many years of trying to come up with something more up-to-
> date.

So, is ocf-tester no longer the right tool I should be using to check this 
sort of thing?  What should I be doing instead to make sure my configuration 
is valid / acceptable to pacemaker?

> Bottom line, it's worth installing xmllint to see if that helps, but I
> wouldn't worry about meta-data schema issues.

Well, as stated in my other reply to Andrei, I now get:

/usr/lib/ocf/resource.d/heartbeat/asterisk passed all tests

/usr/lib/ocf/resource.d/heartbeat/anything passed all tests

so I guess it means my configuration file is okay, and I need to look somewhere 
else to find out why pacemaker 2.0.1 is throwing wobblies with exactly the 
same resources that pacemaker 1.1.16 can manage quite happily and stably...

> > Either agent does not run as root or something blocks chown. Usual
> > suspects are apparmor or SELinux.
> 
> Pacemaker itself can also return this error in certain cases, such as
> not having permissions to execute the agent. Check the pacemaker detail
> log (usually /var/log/pacemaker/pacemaker.log) and the system log
> around these times to see if there is more detail.

I've turned on debug logging, but I'm still not sure I'm seeing *exactly* what 
the resource agent checker is doing when it gets this failure.

> It is definitely weird that a privileges error would be sporadic.
> Hopefully the logs can shed some more light.

I've captured a bunch of them this afternoon and will go through them on 
Monday - it's pretty verbose!

> Another possibility would be to set trace_ra=1 on the actions that are
> failing to get line-by-line info from the agents.

So, that would be an extra parameter to the resource definition in cluster.cib?

Change:

primitive Asterisk asterisk meta migration-threshold=3 op monitor interval=5 
timeout=30 on-fail=restart failure-timeout=10s

to:

primitive Asterisk asterisk meta migration-threshold=3 op monitor interval=5 
timeout=30 on-fail=restart failure-timeout=10s trace_ra=1

?


Antony.

-- 
"It is easy to be blinded to the essential uselessness of them by the sense of 
achievement you get from getting them to work at all. In other words - and 
this is the rock solid principle on which the whole of the Corporation's 
Galaxy-wide success is founded - their fundamental design flaws are completely 
hidden by their superficial design flaws."

 - Douglas Noel Adams

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?

2021-03-26 Thread Antony Stone
On Friday 26 March 2021 at 17:59:07, Andrei Borzenkov wrote:

> On 26.03.2021 17:28, Antony Stone wrote:

> > # ocf-tester -n Asterisk /usr/lib/ocf/resource.d/heartbeat/asterisk
> > Beginning tests for /usr/lib/ocf/resource.d/heartbeat/asterisk...
> > /usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found
> > * rc=127: Your agent produces meta-data which does not conform to
> > ra-api-1.dtd * Your agent does not support the notify action (optional)
> > * Your agent does not support the demote action (optional)
> > * Your agent does not support the promote action (optional)
> > * Your agent does not support master/slave (optional)
> > * Your agent does not support the reload action (optional)
> > Tests failed: /usr/lib/ocf/resource.d/heartbeat/asterisk failed 1 tests

> As is pretty clear from error messages, ocf-tester calls xmllint which
> is missing.

Ah, I had not realised that this meant the rest of the output would be 
invalid.

I thought it just meant "you don't have xmllint installed, so there's some 
stuff we might otherwise be able to tell you, but can't".

If xmllint being installed is a requirement for the remainder of the output to 
be meaningful, I'd have expected that ocf-tester would simply give up at that 
point and tell me that until I install xmllint, it can't do its job.

That seems like a bit of a bug to me.
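
For anyone else hitting this: on Debian, xmllint ships in the libxml2-utils 
package, so installing it is simply:

apt install libxml2-utils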

After installing xmllint I now get:

/usr/lib/ocf/resource.d/heartbeat/asterisk passed all tests

/usr/lib/ocf/resource.d/heartbeat/anything passed all tests

So I'm now back to working out how to debug the failures I do see in "normal" 
operation, which were not occurring with the older versions of corosync & 
pacemaker...

> > My second question is: how can I debug what caused pacemaker to decide
> > that it couldn't run Asterisk due to "insufficient privileges"

> Agent returns this error if it fails to chown directory specified in its
> configuration file:
> 
> # Regardless of whether we just created the directory or it
> # already existed, check whether it is writable by the configured
> # user
> if ! su -s /bin/sh - $OCF_RESKEY_user -c "test -w $dir"; then
> ocf_log warn "Directory $dir is not writable by
> $OCF_RESKEY_user, attempting chown"
> ocf_run chown $OCF_RESKEY_user:$OCF_RESKEY_group $dir \
> 
> || exit $OCF_ERR_PERM
> 
> Either agent does not run as root or something blocks chown. Usual
> suspects are apparmor or SELinux.

Well, I'm not running either of those, but your comments point me in what I 
think is a helpful direction - thanks.


Regards,

Antony.

-- 
It may not seem obvious, but (6 x 5 + 5) x 5 - 55 equals 5!

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] ocf-tester always claims failure, even with built-in resource agents?

2021-03-26 Thread Antony Stone
Hi.

I've just signed up to the list.  I've been using corosync and pacemaker for 
several years, mostly under Debian 9, which means:

corosync 2.4.2
pacemaker 1.1.16

I've recently upgraded a test cluster to Debian 10, which gives me:

corosync 3.0.1
pacemaker 2.0.1

I've made a few adjustments to my /etc/corosync/corosync.conf configuration so 
that corosync seems happy, and also some minor changes (mostly to the cluster 
defaults) in /etc/corosync/cluster.cib so that pacemaker is happy.

So far all is well and good, my cluster synchronises, starts the resources, 
and everything's working as expected.  It'll move the resources from one 
cluster member to another (either if I ask it to, or if there's a problem), 
and it seems to work just as the older version did.

Then, several times a day, I get resource failures such as:

* Asterisk_start_0 on castor 'insufficient privileges' (4):
 call=58,
 status=complete,
 exitreason='',
 last-rc-change='Fri Mar 26 13:37:08 2021',
 queued=0ms,
 exec=55ms

I have no idea why the machine might tell me it cannot start Asterisk due to 
insufficient privilege when it's already been able to run it before the cluster 
resources moved back to this machine.  Asterisk *can* and *does* run on this 
machine.

Another error I get is:

* Kann-Bear_monitor_5000 on helen 'unknown error' (1):
 call=62,
 status=complete,
 exitreason='',
 last-rc-change='Fri Mar 26 14:23:05 2021',
 queued=0ms,
 exec=0ms

Now, that second resource is one which doesn't have a standard resource agent 
available for it under /usr/lib/ocf/resource.d, so I'm using the general-
purpose agent /usr/lib/ocf/resource.d/heartbeat/anything to manage it.

I thought, "perhaps there's something dodgy about using this 'anything' agent, 
because it can't really know about the resource it's managing", so I tested it 
with ocf-tester:

# ocf-tester -n Kann-Bear -o binfile="/usr/sbin/bearerbox" -o 
cmdline_options="/etc/kannel/kannel.conf" -o 
pidfile="/var/run/kannel/kannel_bearerbox.pid" 
/usr/lib/ocf/resource.d/heartbeat/anything
Beginning tests for /usr/lib/ocf/resource.d/heartbeat/anything...
/usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found
* rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd
* Your agent does not support the notify action (optional)
* Your agent does not support the demote action (optional)
* Your agent does not support the promote action (optional)
* Your agent does not support master/slave (optional)
* Your agent does not support the reload action (optional)
Tests failed: /usr/lib/ocf/resource.d/heartbeat/anything failed 1 tests

Okay, something's not right.

BUT, it doesn't matter *which* resource agent I test, it tells me the same 
thing every time, including for the built-in standard agents:

* rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd

For example:

# ocf-tester -n Asterisk /usr/lib/ocf/resource.d/heartbeat/asterisk
Beginning tests for /usr/lib/ocf/resource.d/heartbeat/asterisk...
/usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found
* rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd
* Your agent does not support the notify action (optional)
* Your agent does not support the demote action (optional)
* Your agent does not support the promote action (optional)
* Your agent does not support master/slave (optional)
* Your agent does not support the reload action (optional)
Tests failed: /usr/lib/ocf/resource.d/heartbeat/asterisk failed 1 tests


# ocf-tester -n IP-Float4 -o ip=10.1.0.42 -o cidr_netmask=28 
/usr/lib/ocf/resource.d/heartbeat/IPaddr2
Beginning tests for /usr/lib/ocf/resource.d/heartbeat/IPaddr2...
/usr/sbin/ocf-tester: 226: /usr/sbin/ocf-tester: xmllint: not found
* rc=127: Your agent produces meta-data which does not conform to ra-api-1.dtd
* Your agent does not support the notify action (optional)
* Your agent does not support the demote action (optional)
* Your agent does not support the promote action (optional)
* Your agent does not support master/slave (optional)
* Your agent does not support the reload action (optional)
Tests failed: /usr/lib/ocf/resource.d/heartbeat/IPaddr2 failed 1 tests


So, it seems to be telling me that even the standard built-in resource agents 
"produce meta-data which does not conform to ra-api-1.dtd"


My first question is: what's going wrong here?  Am I using ocf-tester 
incorrectly, or is it a bug?

My second question is: how can I debug what caused pacemaker to decide that it 
couldn't run Asterisk due to "insufficient privileges" on a machine which is 
perfectly well capable of running Asterisk, including when it gets 
started by pacemaker (in fact, that's the only way Asterisk gets started on 
these machines; it's a floating resource which pacemaker is in charge of).


Please