Re: [ClusterLabs] Stonith external/ssh "device"?

2022-12-21 Thread Antony Stone
On Wednesday 21 December 2022 at 17:19:34, Antony Stone wrote:

> > pacemaker-fenced[3262]:   notice: Operation reboot of nodeB by <no-one>
> > for pacemaker-controld.26852@nodeA.93b391b2: No such device

> pacemaker-controld[3264]:   notice: Peer nodeB was not terminated (reboot)
> by <anyone> on behalf of pacemaker-controld.26852: No such device

I have resolved this - there was a discrepancy between the node names (some 
simple hostnames, some FQDNs) in my main cluster configuration, and the 
hostlist parameter for the external/ssh fencing plugin.

I have set them all to be simple hostnames with no domain and now all is 
working as expected.
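
For anyone hitting the same thing, here is a rough sketch of the check that would have caught it. It is not part of Pacemaker or the external/ssh plugin; the function names are made up for illustration, and it assumes you have already collected the cluster's node names (for example from "crm_node -l") and the hostlist values by hand. It simply flags hostlist entries that do not match a cluster node name exactly, and notes when they would match once the domain is stripped.

```python
def short_name(name):
    """Return the host part of a name, dropping any domain suffix."""
    return name.split(".", 1)[0].lower()


def find_name_mismatches(cluster_nodes, hostlist):
    """List hostlist entries that are not exact cluster node names.

    Each result is (hostlist_entry, likely_cluster_name), where the second
    item is the cluster node with the same short hostname, or None if no
    cluster node shares that short name at all.
    """
    exact = set(cluster_nodes)
    by_short = {short_name(n): n for n in cluster_nodes}
    mismatches = []
    for entry in hostlist:
        if entry in exact:
            continue  # names agree exactly, nothing to report
        mismatches.append((entry, by_short.get(short_name(entry))))
    return mismatches


if __name__ == "__main__":
    # Hypothetical names: one node known to the cluster by its FQDN.
    nodes = ["nodeA", "nodeB.example.com", "nodeC"]
    hosts = ["nodeA", "nodeB", "nodeC"]
    for entry, candidate in find_name_mismatches(nodes, hosts):
        print(f"hostlist entry {entry!r} vs cluster node {candidate!r}")
```

An empty result means every hostlist entry is spelled exactly as the cluster knows the node.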

I still find the log message "no such device" rather confusing.


Thanks,


Antony.

-- 
 yes, but this is #lbw, we don't do normal

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Stonith external/ssh "device"?

2022-12-21 Thread Antony Stone
On Wednesday 21 December 2022 at 16:59:16, Antony Stone wrote:

> Hi.
> 
> I'm implementing fencing on a 7-node cluster as described recently:
> https://lists.clusterlabs.org/pipermail/users/2022-December/030714.html
> 
> I'm using external/ssh for the time being, and it works if I test it using:
> 
> stonith -t external/ssh -p "nodeA nodeB nodeC" -T reset nodeB
> 
> 
> However, when it's supposed to be invoked because a node has got stuck, I
> simply find syslog full of the following (one from each of the other six
> nodes in the cluster):
> 
> pacemaker-fenced[3262]:   notice: Operation reboot of nodeB by <no-one> for
> pacemaker-controld.26852@nodeA.93b391b2: No such device
> 
> I have defined seven stonith resources, one for rebooting each machine, and
> I can see from "crm status" that they have been assigned randomly amongst
> the other servers, usually one per server, so that looks good.
> 
> 
> The main things that puzzle me about the log message are:
> 
> a) why does it say "<no-one>"?  Is this more like "anyone", meaning that
> no-one in particular is required to do this task, provided that at least
> someone does it?  Does this indicate a configuration problem?

PS: I've just noticed that I'm also getting log entries immediately 
afterwards:

pacemaker-controld[3264]:   notice: Peer nodeB was not terminated (reboot) by
<anyone> on behalf of pacemaker-controld.26852: No such device

> b) what is this "device" referred to?  I'm using "external/ssh" so there is
> no actual Stonith device for power-cycling hardware machines - am I
> supposed to define some sort of dummy device somewhere?
> 
> For clarity, this is what I have added to my cluster configuration to set
> this up:
> 
> primitive reboot_nodeA stonith:external/ssh params hostlist="nodeA"
> location only_nodeA reboot_nodeA -inf: nodeA
> 
> ...repeated for all seven nodes.
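
To spell out what "repeated for all seven nodes" means, the following is a small sketch that prints the same pair of lines for each node in a list. The helper name and the three-node list are only illustrative, and it assumes the node names are exactly the ones the cluster itself uses.

```python
def fencing_stanzas(nodes):
    """Build one primitive and one location constraint per node.

    The location constraint has a score of -inf so that the resource
    which reboots a node never runs on that node itself.
    """
    lines = []
    for node in nodes:
        lines.append(
            f'primitive reboot_{node} stonith:external/ssh '
            f'params hostlist="{node}"'
        )
        lines.append(f"location only_{node} reboot_{node} -inf: {node}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative list; the real cluster has seven nodes.
    print(fencing_stanzas(["nodeA", "nodeB", "nodeC"]))
```
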
> 
> I also have "stonith-enabled=yes" in the cib-bootstrap-options.
> 
> 
> Ideas, anyone?
> 
> Thanks,
> 
> 
> Antony.

-- 
Normal people think "If it ain't broke, don't fix it".
Engineers think "If it ain't broke, it doesn't have enough features yet".