Re: [ClusterLabs] Stonith external/ssh "device"?

2022-12-21 Thread Antony Stone
On Wednesday 21 December 2022 at 17:19:34, Antony Stone wrote:

> > pacemaker-fenced[3262]:   notice: Operation reboot of nodeB by 
> > for pacemaker-controld.26852@nodeA.93b391b2: No such device

> pacemaker-controld[3264]:   notice: Peer nodeB was not terminated (reboot)
> by  on behalf of pacemaker-controld.26852: No such device

I have resolved this - there was a discrepancy between the node names (some 
simple hostnames, some FQDNs) in my main cluster configuration, and the 
hostlist parameter for the external/ssh fencing plugin.

I have set them all to be simple hostnames with no domain and now all is 
working as expected.
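
For anyone hitting the same thing: the hostlist entries have to match the node names exactly as the cluster itself reports them (what "crm_node -n" prints on each node, or what "crm status" lists). A minimal sketch with a hypothetical node "nodeA", using simple hostnames throughout:

primitive reboot_nodeA stonith:external/ssh params hostlist="nodeA"
location only_nodeA reboot_nodeA -inf: nodeA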

I still find the log message "no such device" rather confusing.


Thanks,


Antony.

-- 
 yes, but this is #lbw, we don't do normal

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Stonith external/ssh "device"?

2022-12-21 Thread Antony Stone
On Wednesday 21 December 2022 at 16:59:16, Antony Stone wrote:

> Hi.
> 
> I'm implementing fencing on a 7-node cluster as described recently:
> https://lists.clusterlabs.org/pipermail/users/2022-December/030714.html
> 
> I'm using external/ssh for the time being, and it works if I test it using:
> 
> stonith -t external/ssh -p "nodeA nodeB nodeC" -T reset nodeB
> 
> 
> However, when it's supposed to be invoked because a node has got stuck, I
> simply find syslog full of the following (one from each of the other six
> nodes in the cluster):
> 
> pacemaker-fenced[3262]:   notice: Operation reboot of nodeB by  for
> pacemaker-controld.26852@nodeA.93b391b2: No such device
> 
> I have defined seven stonith resources, one for rebooting each machine, and
> I can see from "crm status" that they have been assigned randomly amongst
> the other servers, usually one per server, so that looks good.
> 
> 
> The main things that puzzle me about the log message are:
> 
> a) why does it say ""?  Is this more like "anyone", meaning that
> no-one in particular is required to do this task, provided that at least
> someone does it?  Does this indicate a configuration problem?

PS: I've just noticed that I'm also getting log entries immediately 
afterwards:

pacemaker-controld[3264]:   notice: Peer nodeB was not terminated (reboot) by 
 on behalf of pacemaker-controld.26852: No such device

> b) what is this "device" referred to?  I'm using "external/ssh" so there is
> no actual Stonith device for power-cycling hardware machines - am I
> supposed to define some sort of dummy device somewhere?
> 
> For clarity, this is what I have added to my cluster configuration to set
> this up:
> 
> primitive reboot_nodeA stonith:external/ssh params hostlist="nodeA"
> location only_nodeA reboot_nodeA -inf: nodeA
> 
> ...repeated for all seven nodes.
> 
> I also have "stonith-enabled=yes" in the cib-bootstrap-options.
> 
> 
> Ideas, anyone?
> 
> Thanks,
> 
> 
> Antony.

-- 
Normal people think "If it ain't broke, don't fix it".
Engineers think "If it ain't broke, it doesn't have enough features yet".

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Stonith

2022-12-19 Thread Ken Gaillot
On Mon, 2022-12-19 at 16:17 +0300, Andrei Borzenkov wrote:
> On Mon, Dec 19, 2022 at 4:01 PM Antony Stone
>  wrote:
> > On Monday 19 December 2022 at 13:55:45, Andrei Borzenkov wrote:
> > 
> > > On Mon, Dec 19, 2022 at 3:44 PM Antony Stone
> > > 
> > >  wrote:
> > > > So, do I simply create one stonith resource for each server,
> > > > and rely on
> > > > some other random server to invoke it when needed?
> > > 
> > > Yes, this is the simplest approach. You need to restrict this
> > > stonith resource to only one cluster node (set pcmk_host_list).
> > 
> > So, just to be clear, I create one stonith resource for each
> > machine which
> > needs to be able to be shut down by some other server?
> > 
> 
> Correct.
> 
> > I ask simply because the acronym stonith refers to "the other
> > node", so it
> > sounds to me more like something I need to define so that a working
> > machine can
> > kill another one.
> > 
> 
> Yes, you define a stonith resource that can kill node A and nodes B,
> C, D, ... will use this resource to kill A when needed. As long as
> your stonith resource can actually work on any node it does not
> matter
> which one will do the killing. You can restrict which nodes can use
> this stonith agent using usual location constraints if necessary.
> 
> But keep in mind that if the whole site is down (or inaccessible) you
> will not have access to IPMI/PDU/whatever on this site so your
> stonith
> agents will fail ...

This is the main problem I see. Presumably the goal of the three-center 
setup is to handle network interruptions to one of them, but without
network, the fencing will fail and the cluster will be unable to
recover the resources from that center.

You may want to look at designing this as two independent clusters,
coordinated via booth. The third site only needs to run a booth
"arbitrator" (quorum server), not pacemaker. With this design, if one
site loses network access, it will shut itself down, and fencing only
needs to be able to work locally at each site.

https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/singlehtml/index.html#document-multi-site-clusters
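
As a very rough sketch of that layout, a booth.conf along these lines (the addresses and ticket name are made up; check the booth documentation for the exact options):

transport = UDP
port = 9929
arbitrator = 203.0.113.10    # third site, runs only the booth arbitrator
site = 192.0.2.10            # cluster at site A
site = 198.51.100.10         # cluster at site B
ticket = "serviceA-ticket"

Each cluster then ties its resources to the ticket with rsc_ticket constraints, so only the site currently holding the ticket runs them.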

-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Stonith

2022-12-19 Thread Andrei Borzenkov
On Mon, Dec 19, 2022 at 4:01 PM Antony Stone
 wrote:
>
> On Monday 19 December 2022 at 13:55:45, Andrei Borzenkov wrote:
>
> > On Mon, Dec 19, 2022 at 3:44 PM Antony Stone
> >
> >  wrote:
> > > So, do I simply create one stonith resource for each server, and rely on
> > > some other random server to invoke it when needed?
> >
> > Yes, this is the simplest approach. You need to restrict this
> > stonith resource to only one cluster node (set pcmk_host_list).
>
> So, just to be clear, I create one stonith resource for each machine which
> needs to be able to be shut down by some other server?
>

Correct.

> I ask simply because the acronym stonith refers to "the other node", so it
> sounds to me more like something I need to define so that a working machine 
> can
> kill another one.
>

Yes, you define a stonith resource that can kill node A and nodes B,
C, D, ... will use this resource to kill A when needed. As long as
your stonith resource can actually work on any node it does not matter
which one will do the killing. You can restrict which nodes can use
this stonith agent using usual location constraints if necessary.

But keep in mind that if the whole site is down (or inaccessible) you
will not have access to IPMI/PDU/whatever on this site so your stonith
agents will fail ...
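
As a sketch (crm shell syntax, hypothetical addresses and credentials): one stonith resource that can kill nodeA, kept off nodeA itself with an ordinary location constraint so any of the surviving nodes can run it:

primitive fence_nodeA stonith:external/ipmi params hostname=nodeA pcmk_host_list=nodeA ipaddr=192.0.2.11 userid=root passwd=secret interface=lan
location fence_nodeA_not_on_its_target fence_nodeA -inf: nodeA
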
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Stonith

2022-12-19 Thread Antony Stone
On Monday 19 December 2022 at 13:55:45, Andrei Borzenkov wrote:

> On Mon, Dec 19, 2022 at 3:44 PM Antony Stone
> 
>  wrote:
> > So, do I simply create one stonith resource for each server, and rely on
> > some other random server to invoke it when needed?
> 
> Yes, this is the simplest approach. You need to restrict this
> stonith resource to only one cluster node (set pcmk_host_list).

So, just to be clear, I create one stonith resource for each machine which 
needs to be able to be shut down by some other server?

I ask simply because the acronym stonith refers to "the other node", so it 
sounds to me more like something I need to define so that a working machine can 
kill another one.

> No, that is not needed, by default any node can use any stonith agent.

Okay, thanks for the quick clarification :)


Antony.

-- 
It is also possible that putting the birds in a laboratory setting 
inadvertently renders them relatively incompetent.

 - Daniel C Dennett

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Stonith

2022-12-19 Thread Andrei Borzenkov
On Mon, Dec 19, 2022 at 3:44 PM Antony Stone
 wrote:
>
> So, do I simply create one stonith resource for each server, and rely on some
> other random server to invoke it when needed?
>

Yes, this is the simplest approach. You need to restrict this
stonith resource to only one cluster node (set pcmk_host_list).
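
A minimal sketch of that pattern for a setup like this (crm shell syntax, hypothetical node names): one stonith resource per node, each limited to its own target via pcmk_host_list:

primitive reboot_nodeA stonith:external/ssh params hostlist="nodeA" pcmk_host_list="nodeA"
primitive reboot_nodeB stonith:external/ssh params hostlist="nodeB" pcmk_host_list="nodeB"
(and so on for each node that needs to be fenceable)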

> Or do I in fact create one stonith resource for each server, and that resource
> then means that this server can shut down any other server?
>

If a stonith agent supports mapping between cluster nodes and IP
addresses (or whatever is needed to identify the correct instance to
kill the selected cluster node) - this would be an option. I do not
think either ssh or IPMI agents support it though.
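
For agents that do support such a mapping (a shared PDU or blade chassis, say, where one device controls several outlets), the usual mechanism is the pcmk_host_map parameter. A hedged sketch with made-up outlet numbers (pcs syntax):

pcs stonith create fence_pdu fence_apc ipaddr=192.0.2.50 login=apc passwd=secret pcmk_host_map="nodeA:1;nodeB:2;nodeC:3"

Here pcmk_host_map tells the cluster which outlet on the device corresponds to which node, so a single resource can fence several nodes.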

> Or, do I need to create 6 x 7 = 42 stonith resources so that any machine can
> shut down any other?
>

No, that is not needed, by default any node can use any stonith agent.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Stonith configuration

2020-02-14 Thread Dan Swartzendruber

On 2020-02-14 13:06, Strahil Nikolov wrote:

On February 14, 2020 4:44:53 PM GMT+02:00, "BASDEN, ALASTAIR G."
 wrote:

Hi Strahil,




Note2: Consider adding a third node (for example a VM) or a qdevice on a
separate node (it is allowed to be on a separate network, so simple
routing is the only requirement) and reconfigure the cluster so that
you have 'Expected votes: 3'.
This will protect you from split brain and is highly recommended.


Highly recommend qdevice.  I spun one up on a small (paperback size) 
'router' running CentOS7.
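
For reference, the usual wiring with pcs looks roughly like this (a sketch; package names and exact syntax vary by distribution, and "qnetd-host" is a placeholder):

# on the small third machine (runs corosync-qnetd only, no pacemaker)
pcs qdevice setup model net --enable --start

# on one of the two cluster nodes
pcs quorum device add model net host=qnetd-host algorithm=ffsplit

corosync-quorumtool -s should then report 'Expected votes: 3'.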


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Stonith configuration

2020-02-14 Thread BASDEN, ALASTAIR G.
Hi Strahil,
corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id  = 172.17.150.20
status  = ring 0 active with no faults
RING ID 1
id  = 10.0.6.20
status  = ring 1 active with no faults

corosync-quorumtool -s
Quorum information
--
Date: Fri Feb 14 14:41:11 2020
Quorum provider:  corosync_votequorum
Nodes:    2
Node ID:  1
Ring ID:  1/96
Quorate:  Yes

Votequorum information
--
Expected votes:   2
Highest expected: 2
Total votes:  2
Quorum:   1
Flags:    2Node Quorate WaitForAll

Membership information
--
 Nodeid  Votes Name
  1  1 node1.primary.network (local)
  2  1 node2.primary.network


On the surviving node, the 10.0.6.21 interface flipflopped (though nothing 
detected on the other node), and that is what started it all off.

We have no firewall running.

Cheers,
Alastair.


On Fri, 14 Feb 2020, Strahil Nikolov wrote:

> On February 14, 2020 12:41:58 PM GMT+02:00, "BASDEN, ALASTAIR G." 
>  wrote:
>> Hi,
>> I wonder whether anyone could give me some advice about a stonith
>> configuration.
>>
>> We have 2 nodes, which form a HA cluster.
>>
>> These have 3 networks:
>> A generic network over which they are accessed (eg ssh)
>> (node1.primary.network, node2.primary.network)
>> A directly connected cable between them (10.0.6.20, 10.0.6.21).
>> A management network, on which ipmi is (172.16.150.20, 172.16.150.21)
>>
>> We have done:
>> pcs cluster setup --name hacluster node1.primary.network,10.0.6.20
>> node2.primary.network,10.0.6.21 --token 2
>> pcs cluster start --all
>> pcs property set no-quorum-policy=ignore
>> pcs property set stonith-enabled=true
>> pcs property set symmetric-cluster=true
>> pcs stonith create node1_ipmi fence_ipmilan ipaddr="172.16.150.20"
>> lanplus=true login="root" passwd="password"
>> pcmk_host_list="node1.primary.network" power_wait=10
>> pcs stonith create node2_ipmi fence_ipmilan ipaddr="172.16.150.21"
>> lanplus=true login="root" passwd="password"
>> pcmk_host_list="node2.primary.network" power_wait=10
>>
>> /etc/corosync/corosync.conf has:
>> totem {
>> version: 2
>> cluster_name: hacluster
>> secauth: off
>> transport: udpu
>> rrp_mode: passive
>> token: 2
>> }
>>
>> nodelist {
>> node {
>> ring0_addr: node1.primary.network
>> ring1_addr: 10.0.6.20
>> nodeid: 1
>> }
>>
>> node {
>> ring0_addr: node2.primary.network
>> ring1_addr: 10.0.6.21
>>  nodeid: 2
>> }
>> }
>>
>> quorum {
>> provider: corosync_votequorum
>> two_node: 1
>> }
>>
>> logging {
>> to_logfile: yes
>> logfile: /var/log/cluster/corosync.log
>> to_syslog: no
>> }
>>
>>
>> What I find is that if there is a problem with the directly connected
>> cable, the nodes stonith each other, even though the generic network is
>>
>> fine.
>>
>> What I would expect is that they would only shoot each other when both
>> networks are down (generic and directly connected).
>>
>> Any ideas?
>>
>> Thanks,
>> Alastair.
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>
> What is  the output of :
> corosync-cfgtool -s
> corosync-quorumtool -s
>
> Also check the logs of the suvived node for clues.
>
> What about firewall ?
> Have you enabled 'high-availability' service on firewalld on all zones for 
> your interfaces ?
>
> Best Regards,
> Strahil Nikolov
>
>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

2019-09-16 Thread Ken Gaillot
On Tue, 2019-09-03 at 10:09 +0200, Marco Marino wrote:
> Hi, I have a problem with fencing on a two node cluster. It seems
> that randomly the cluster cannot complete monitor operation for fence
> devices. In log I see:
> crmd[8206]:   error: Result of monitor operation for fence-node2 on
> ld2.mydomain.it: Timed Out
> As attachment there is 
> - /var/log/messages for node1 (only the important part)
> - /var/log/messages for node2 (only the important part) <-- Problem
> starts here
> - pcs status
> - pcs stonith show (for both fence devices)
> 
> I think it could be a timeout problem, so how can I see timeout value
> for monitor operation in stonith devices?
> Please, someone can help me with this problem?
> Furthermore, how can I fix the state of fence devices without
> downtime?
> 
> Thank you

How to investigate depends on whether this is an occasional monitor
failure, or happens every time the device start is attempted. From the
status you attached, I'm guessing it's at start.

In that case, my next step (since you've already verified ipmitool
works directly) would be to run the fence agent manually using the same
arguments used in the cluster configuration.

Check the man page for the fence agent, looking at the section for
"Stdin Parameters". These are what's used in the cluster configuration,
so make a note of what values you've configured. Then run the fence
agent like this:

echo -e "action=status\nPARAMETER=VALUE\nPARAMETER=VALUE\n..." | /path/to/agent

where PARAMETER=VALUE entries are what you have configured in the
cluster. If the problem isn't obvious from that, you can try adding a
debug_file parameter.
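
As a concrete (hypothetical) illustration for fence_ipmilan, using parameters like the ones in this thread's configuration; the exact names to use are whatever the man page lists as stdin parameters:

echo -e "action=status\nipaddr=192.168.254.250\nlogin=root\npasswd=XXXX\nlanplus=1" | /usr/sbin/fence_ipmilan

A zero exit status and a sensible power status on stdout suggests the agent itself is happy with those values.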
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

2019-09-11 Thread Marco Marino
Hi, some updates about this?
Thank you

On Wed, Sep 4, 2019, 10:46 Marco Marino  wrote:

> First of all, thank you for your support.
> Andrey: sure, I can reach machines through IPMI.
> Here is a short "log":
>
> #From ld1 trying to contact ld1
> [root@ld1 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P XX
> sdr elist all
> SEL  | 72h | ns  |  7.1 | No Reading
> Intrusion| 73h | ok  |  7.1 |
> iDRAC8   | 00h | ok  |  7.1 | Dynamic MC @ 20h
> ...
>
> #From ld1 trying to contact ld2
> ipmitool -I lanplus -H 192.168.254.251 -U root -P XX sdr elist all
> SEL  | 72h | ns  |  7.1 | No Reading
> Intrusion| 73h | ok  |  7.1 |
> iDRAC7   | 00h | ok  |  7.1 | Dynamic MC @ 20h
> ...
>
>
> #From ld2 trying to contact ld1:
> root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P X sdr
> elist all
> SEL  | 72h | ns  |  7.1 | No Reading
> Intrusion| 73h | ok  |  7.1 |
> iDRAC8   | 00h | ok  |  7.1 | Dynamic MC @ 20h
> System Board | 00h | ns  |  7.1 | Logical FRU @00h
> .
>
> #From ld2 trying to contact ld2
> [root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.251 -U root -P  sdr
> elist all
> SEL  | 72h | ns  |  7.1 | No Reading
> Intrusion| 73h | ok  |  7.1 |
> iDRAC7   | 00h | ok  |  7.1 | Dynamic MC @ 20h
> System Board | 00h | ns  |  7.1 | Logical FRU @00h
> 
>
> Jan: Actually the cluster uses /etc/hosts in order to resolve names:
> 172.16.77.10ld1.mydomain.it  ld1
> 172.16.77.11ld2.mydomain.it  ld2
>
> Furthermore I'm using ip addresses for ipmi interfaces in the
> configuration:
> [root@ld1 ~]# pcs stonith show fence-node1
>  Resource: fence-node1 (class=stonith type=fence_ipmilan)
>   Attributes: ipaddr=192.168.254.250 lanplus=1 login=root passwd=X
> pcmk_host_check=static-list pcmk_host_list=ld1.mydomain.it
>   Operations: monitor interval=60s (fence-node1-monitor-interval-60s)
>
>
> Any idea?
> How can I reset the state of the cluster without downtime? "pcs resource
> cleanup" is enough?
> Thank you,
> Marco
>
>
> On Wed, Sep 4, 2019 at 10:29 Jan Pokorný 
> wrote:
>
>> On 03/09/19 20:15 +0300, Andrei Borzenkov wrote:
>> > 03.09.2019 11:09, Marco Marino wrote:
>> >> Hi, I have a problem with fencing on a two node cluster. It seems that
>> >> randomly the cluster cannot complete monitor operation for fence
>> devices.
>> >> In log I see:
>> >> crmd[8206]:   error: Result of monitor operation for fence-node2 on
>> >> ld2.mydomain.it: Timed Out
>> >
>> > Can you actually access IP addresses of your IPMI ports?
>>
>> [
>> Tangentially, interesting aspect beyond that and applicable for any
>> non-IP cross-host referential needs, which I haven't seen mentioned
>> anywhere so far, is the risk of DNS resolution (when /etc/hosts will
>> come short) getting to troubles (stale records, port blocked, DNS
>> server overload [DNSSEC, etc.], IPv4/IPv6 parallel records that the SW
>> cannot handle gracefully, etc.).  In any case, just a single DNS
>> server would apparently be an undesired SPOF, and would be unfortunate
>> when unable to fence a node because of that.
>>
>> I think the most robust approach is to use IP addresses whenever
>> possible, and unambiguous records in /etc/hosts when practical.
>> ]
>>
>> >> As attachment there is
>> >> - /var/log/messages for node1 (only the important part)
>> >> - /var/log/messages for node2 (only the important part) <-- Problem
>> starts
>> >> here
>> >> - pcs status
>> >> - pcs stonith show (for both fence devices)
>> >>
>> >> I think it could be a timeout problem, so how can I see timeout value
>> for
>> >> monitor operation in stonith devices?
>> >> Please, someone can help me with this problem?
>> >> Furthermore, how can I fix the state of fence devices without downtime?
>>
>> --
>> Jan (Poki)
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>
>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

2019-09-04 Thread Marco Marino
First of all, thank you for your support.
Andrey: sure, I can reach machines through IPMI.
Here is a short "log":

#From ld1 trying to contact ld1
[root@ld1 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P XX sdr
elist all
SEL  | 72h | ns  |  7.1 | No Reading
Intrusion| 73h | ok  |  7.1 |
iDRAC8   | 00h | ok  |  7.1 | Dynamic MC @ 20h
...

#From ld1 trying to contact ld2
ipmitool -I lanplus -H 192.168.254.251 -U root -P XX sdr elist all
SEL  | 72h | ns  |  7.1 | No Reading
Intrusion| 73h | ok  |  7.1 |
iDRAC7   | 00h | ok  |  7.1 | Dynamic MC @ 20h
...


#From ld2 trying to contact ld1:
root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P X sdr
elist all
SEL  | 72h | ns  |  7.1 | No Reading
Intrusion| 73h | ok  |  7.1 |
iDRAC8   | 00h | ok  |  7.1 | Dynamic MC @ 20h
System Board | 00h | ns  |  7.1 | Logical FRU @00h
.

#From ld2 trying to contact ld2
[root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.251 -U root -P  sdr
elist all
SEL  | 72h | ns  |  7.1 | No Reading
Intrusion| 73h | ok  |  7.1 |
iDRAC7   | 00h | ok  |  7.1 | Dynamic MC @ 20h
System Board | 00h | ns  |  7.1 | Logical FRU @00h


Jan: Actually the cluster uses /etc/hosts in order to resolve names:
172.16.77.10ld1.mydomain.it  ld1
172.16.77.11ld2.mydomain.it  ld2

Furthermore I'm using ip addresses for ipmi interfaces in the configuration:
[root@ld1 ~]# pcs stonith show fence-node1
 Resource: fence-node1 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=192.168.254.250 lanplus=1 login=root passwd=X
pcmk_host_check=static-list pcmk_host_list=ld1.mydomain.it
  Operations: monitor interval=60s (fence-node1-monitor-interval-60s)


Any idea?
How can I reset the state of the cluster without downtime? "pcs resource
cleanup" is enough?
Thank you,
Marco


On Wed, Sep 4, 2019 at 10:29 Jan Pokorný 
wrote:

> On 03/09/19 20:15 +0300, Andrei Borzenkov wrote:
> > 03.09.2019 11:09, Marco Marino wrote:
> >> Hi, I have a problem with fencing on a two node cluster. It seems that
> >> randomly the cluster cannot complete monitor operation for fence
> devices.
> >> In log I see:
> >> crmd[8206]:   error: Result of monitor operation for fence-node2 on
> >> ld2.mydomain.it: Timed Out
> >
> > Can you actually access IP addresses of your IPMI ports?
>
> [
> Tangentially, interesting aspect beyond that and applicable for any
> non-IP cross-host referential needs, which I haven't seen mentioned
> anywhere so far, is the risk of DNS resolution (when /etc/hosts will
> come short) getting to troubles (stale records, port blocked, DNS
> server overload [DNSSEC, etc.], IPv4/IPv6 parallel records that the SW
> cannot handle gracefully, etc.).  In any case, just a single DNS
> server would apparently be an undesired SPOF, and would be unfortunate
> when unable to fence a node because of that.
>
> I think the most robust approach is to use IP addresses whenever
> possible, and unambiguous records in /etc/hosts when practical.
> ]
>
> >> As attachment there is
> >> - /var/log/messages for node1 (only the important part)
> >> - /var/log/messages for node2 (only the important part) <-- Problem
> starts
> >> here
> >> - pcs status
> >> - pcs stonith show (for both fence devices)
> >>
> >> I think it could be a timeout problem, so how can I see timeout value
> for
> >> monitor operation in stonith devices?
> >> Please, someone can help me with this problem?
> >> Furthermore, how can I fix the state of fence devices without downtime?
>
> --
> Jan (Poki)
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

2019-09-04 Thread Jan Pokorný
On 03/09/19 20:15 +0300, Andrei Borzenkov wrote:
> 03.09.2019 11:09, Marco Marino wrote:
>> Hi, I have a problem with fencing on a two node cluster. It seems that
>> randomly the cluster cannot complete monitor operation for fence devices.
>> In log I see:
>> crmd[8206]:   error: Result of monitor operation for fence-node2 on
>> ld2.mydomain.it: Timed Out
> 
> Can you actually access IP addresses of your IPMI ports?

[
Tangentially, interesting aspect beyond that and applicable for any
non-IP cross-host referential needs, which I haven't seen mentioned
anywhere so far, is the risk of DNS resolution (when /etc/hosts will
come short) getting to troubles (stale records, port blocked, DNS
server overload [DNSSEC, etc.], IPv4/IPv6 parallel records that the SW
cannot handle gracefully, etc.).  In any case, just a single DNS
server would apparently be an undesired SPOF, and would be unfortunate
when unable to fence a node because of that.

I think the most robust approach is to use IP addresses whenever
possible, and unambiguous records in /etc/hosts when practical.
]

>> As attachment there is
>> - /var/log/messages for node1 (only the important part)
>> - /var/log/messages for node2 (only the important part) <-- Problem starts
>> here
>> - pcs status
>> - pcs stonith show (for both fence devices)
>> 
>> I think it could be a timeout problem, so how can I see timeout value for
>> monitor operation in stonith devices?
>> Please, someone can help me with this problem?
>> Furthermore, how can I fix the state of fence devices without downtime?

-- 
Jan (Poki)


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

2019-09-03 Thread Andrei Borzenkov
03.09.2019 11:09, Marco Marino wrote:
> Hi, I have a problem with fencing on a two node cluster. It seems that
> randomly the cluster cannot complete monitor operation for fence devices.
> In log I see:
> crmd[8206]:   error: Result of monitor operation for fence-node2 on
> ld2.mydomain.it: Timed Out

Can you actually access IP addresses of your IPMI ports?

> As attachment there is
> - /var/log/messages for node1 (only the important part)
> - /var/log/messages for node2 (only the important part) <-- Problem starts
> here
> - pcs status
> - pcs stonith show (for both fence devices)
> 
> I think it could be a timeout problem, so how can I see timeout value for
> monitor operation in stonith devices?
> Please, someone can help me with this problem?
> Furthermore, how can I fix the state of fence devices without downtime?
> 
> Thank you
> 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Stonith two-node cluster shot each other

2018-12-05 Thread Klaus Wenninger
If you are not so sure which of the nodes you want to give
precedence then you can at least add some random-delay
to it as to at least prevent that they at the same time
decide to kill each other (fence-race).

If your fencing-agent doesn't support a delay or you don't
want to use that for some reason, pacemaker offers a
generic approach with the 2 meta attributes pcmk_delay_max
and pcmk_delay_base that will give you a random delay
between pcmk_delay_base and pcmk_delay_max (either of them
can be 0 if you like - not at the same time for obvious reasons ;-) ).
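
A hedged sketch of what setting those could look like (pcs syntax, made-up device names and values; check which of the two attributes your pacemaker version supports):

pcs stonith update fence_node1 pcmk_delay_base=5s pcmk_delay_max=15s
pcs stonith update fence_node2 pcmk_delay_base=0s pcmk_delay_max=15s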

Regards,
Klaus

On 12/04/2018 07:32 PM, Digimer wrote:
> You need to set a fence delay on the node you want to win in a case like
> this. So say, for example, node 1 is hosting services. You will want to
> add 'delay="15"' to the stonith config for node 1.
>
> This way, when both nodes try to fence each other, node 2 looks up how
> to fence node 1, sees a delay and pauses for 15 seconds. Node 1 looks up
> how to fence node 2, sees no delay, and fences immediately. Node 1
> lives, node 2 gets fenced.
>
> digimer
>
> On 2018-12-04 12:48 p.m., Daniel Ragle wrote:
>> I *think* the two nodes of my cluster shot each other in the head this
>> weekend and I can't figure out why.
>>
>> Looking at corosync.log on node1 I see this:
>>
>> [143747] node1.mydomain.com corosyncnotice  [TOTEM ] A processor failed,
>> forming new configuration.
>> [143747] node1.mydomain.com corosyncnotice  [TOTEM ] A new membership
>> (192.168.10.25:236) was formed. Members joined: 2 left: 2
>> [143747] node1.mydomain.com corosyncnotice  [TOTEM ] Failed to receive
>> the leave message. failed: 2
>> [143747] node1.mydomain.com corosyncnotice  [TOTEM ] Retransmit List: 1
>> Dec 01 07:03:50 [143768] node1.mydomain.com   crmd: info:
>> pcmk_cpg_membership: Node 2 left group crmd (peer=node2.mydomain.com,
>> counter=1.0)
>> Dec 01 07:03:50 [143766] node1.mydomain.com  attrd: info:
>> pcmk_cpg_membership: Node 2 left group attrd (peer=node2.mydomain.com,
>> counter=1.0)
>> Dec 01 07:03:50 [143764] node1.mydomain.com stonith-ng: info:
>> pcmk_cpg_membership: Node 2 left group stonith-ng
>> (peer=node2.vselect.com, counter=1.0)
>> Dec 01 07:03:50 [143762] node1.mydomain.com pacemakerd: info:
>> pcmk_cpg_membership: Node 2 left group pacemakerd
>> (peer=node2.vselect.com, counter=1.0)
>>
>> Followed by a whole slew of messages generally saying node2 was
>> dead/could not be reached, culminating in:
>>
>> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng:   notice:
>> initiate_remote_stonith_op:  Requesting peer fencing (reboot) of
>> node2.mydomain.com | id=a041d1df-e857-4815-91db-00f448106a33 state=0
>> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
>> process_remote_stonith_query:    Query result 1 of 2 from
>> node1.mydomain.com for node2.mydomain.com/reboot (1 devices)
>> a041d1df-e857-4815-91db-00f448106a33
>> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
>> call_remote_stonith: Total timeout set to 300 for peer's fencing of
>> node2.mydomain.com for
>> stonith-api.139901|id=a041d1df-e857-4815-91db-00f448106a33
>> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
>> call_remote_stonith: Requesting that 'node1.mydomain.com' perform op
>> 'node2.mydomain.com reboot' for stonith-api.139901 (360s, 0s)
>> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
>> process_remote_stonith_query:    Query result 2 of 2 from
>> node2.mydomain.com for node2.mydomain.com/reboot (1 devices)
>> a041d1df-e857-4815-91db-00f448106a33
>> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
>> stonith_fence_get_devices_cb:    Found 1 matching devices for
>> 'node2.mydomain.com'
>> Dec 01 07:04:21 [143768] node1.mydomain.com   crmd: info:
>> crm_update_peer_expected:    handle_request: Node node2.mydomain.com[2]
>> - expected state is now down (was member)
>> Dec 01 07:04:21 [143766] node1.mydomain.com  attrd: info:
>> attrd_peer_update:   Setting shutdown[node2.mydomain.com]: (null) ->
>> 1543665861 from node2.mydomain.com
>> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
>> cib_perform_op:  Diff: --- 0.188.66 2
>> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
>> cib_perform_op:  Diff: +++ 0.188.67 (null)
>> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
>> cib_perform_op:  +  /cib:  @num_updates=67
>> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
>> cib_perform_op:  ++ /cib/status/node_state[@id='2']:
>> 
>> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
>> cib_perform_op:  ++ 
>> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
>> cib_perform_op:  ++   > id="status-2-shutdown" name="shutdown" value="1543665861"/>
>> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
>> cib_perform_op: 

Re: [ClusterLabs] Stonith two-node cluster shot each other

2018-12-04 Thread Digimer
You need to set a fence delay on the node you want to win in a case like
this. So say, for example, node 1 is hosting services. You will want to
add 'delay="15"' to the stonith config for node 1.

This way, when both nodes try to fence each other, node 2 looks up how
to fence node 1, sees a delay and pauses for 15 seconds. Node 1 looks up
how to fence node 2, sees no delay, and fences immediately. Node 1
lives, node 2 gets fenced.
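
With pcs-managed devices that is roughly a one-liner on the device which fences node 1 (hypothetical resource name):

pcs stonith update fence_node1 delay=15

so whoever tries to fence node 1 waits 15 seconds first, giving node 1 the chance to shoot node 2 and survive.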

digimer

On 2018-12-04 12:48 p.m., Daniel Ragle wrote:
> I *think* the two nodes of my cluster shot each other in the head this
> weekend and I can't figure out why.
> 
> Looking at corosync.log on node1 I see this:
> 
> [143747] node1.mydomain.com corosyncnotice  [TOTEM ] A processor failed,
> forming new configuration.
> [143747] node1.mydomain.com corosyncnotice  [TOTEM ] A new membership
> (192.168.10.25:236) was formed. Members joined: 2 left: 2
> [143747] node1.mydomain.com corosyncnotice  [TOTEM ] Failed to receive
> the leave message. failed: 2
> [143747] node1.mydomain.com corosyncnotice  [TOTEM ] Retransmit List: 1
> Dec 01 07:03:50 [143768] node1.mydomain.com   crmd: info:
> pcmk_cpg_membership: Node 2 left group crmd (peer=node2.mydomain.com,
> counter=1.0)
> Dec 01 07:03:50 [143766] node1.mydomain.com  attrd: info:
> pcmk_cpg_membership: Node 2 left group attrd (peer=node2.mydomain.com,
> counter=1.0)
> Dec 01 07:03:50 [143764] node1.mydomain.com stonith-ng: info:
> pcmk_cpg_membership: Node 2 left group stonith-ng
> (peer=node2.vselect.com, counter=1.0)
> Dec 01 07:03:50 [143762] node1.mydomain.com pacemakerd: info:
> pcmk_cpg_membership: Node 2 left group pacemakerd
> (peer=node2.vselect.com, counter=1.0)
> 
> Followed by a whole slew of messages generally saying node2 was
> dead/could not be reached, culminating in:
> 
> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng:   notice:
> initiate_remote_stonith_op:  Requesting peer fencing (reboot) of
> node2.mydomain.com | id=a041d1df-e857-4815-91db-00f448106a33 state=0
> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
> process_remote_stonith_query:    Query result 1 of 2 from
> node1.mydomain.com for node2.mydomain.com/reboot (1 devices)
> a041d1df-e857-4815-91db-00f448106a33
> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
> call_remote_stonith: Total timeout set to 300 for peer's fencing of
> node2.mydomain.com for
> stonith-api.139901|id=a041d1df-e857-4815-91db-00f448106a33
> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
> call_remote_stonith: Requesting that 'node1.mydomain.com' perform op
> 'node2.mydomain.com reboot' for stonith-api.139901 (360s, 0s)
> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
> process_remote_stonith_query:    Query result 2 of 2 from
> node2.mydomain.com for node2.mydomain.com/reboot (1 devices)
> a041d1df-e857-4815-91db-00f448106a33
> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
> stonith_fence_get_devices_cb:    Found 1 matching devices for
> 'node2.mydomain.com'
> Dec 01 07:04:21 [143768] node1.mydomain.com   crmd: info:
> crm_update_peer_expected:    handle_request: Node node2.mydomain.com[2]
> - expected state is now down (was member)
> Dec 01 07:04:21 [143766] node1.mydomain.com  attrd: info:
> attrd_peer_update:   Setting shutdown[node2.mydomain.com]: (null) ->
> 1543665861 from node2.mydomain.com
> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
> cib_perform_op:  Diff: --- 0.188.66 2
> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
> cib_perform_op:  Diff: +++ 0.188.67 (null)
> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
> cib_perform_op:  +  /cib:  @num_updates=67
> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
> cib_perform_op:  ++ /cib/status/node_state[@id='2']:
> 
> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
> cib_perform_op:  ++ 
> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
> cib_perform_op:  ++    id="status-2-shutdown" name="shutdown" value="1543665861"/>
> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
> cib_perform_op:  ++ 
> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
> cib_perform_op:  ++ 
> Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info:
> cib_process_request: Completed cib_modify operation for section status:
> OK (rc=0, origin=node2.mydomain.com/attrd/6, version=0.188.67)
> 
> And on node2 I see this:
> 
> [50215] node2.mydomain.com corosyncnotice  [TOTEM ] A new membership
> (192.168.10.25:228) was formed. Members
> [50215] node2.mydomain.com corosyncnotice  [TOTEM ] A new membership
> (192.168.10.25:236) was formed. Members joined: 1 left: 1
> [50215] node2.mydomain.com corosyncnotice  [TOTEM ] Failed to receive
> the leave message. failed: 1
> Dec 01 07:03:50 

Re: [ClusterLabs] Stonith two-node cluster shot each other

2018-12-04 Thread Daniel Ragle
Once again I botched the obfuscation. The corosync.conf should in fact 
be 'node1.mydomain.com' and 'node2.mydomain.com' (i.e., it matches the 
rest of the configuration).


Thanks!

Dan

On 12/4/2018 12:48 PM, Daniel Ragle wrote:
I *think* the two nodes of my cluster shot each other in the head this 
weekend and I can't figure out why.


Looking at corosync.log on node1 I see this:

[143747] node1.mydomain.com corosyncnotice  [TOTEM ] A processor failed, 
forming new configuration.
[143747] node1.mydomain.com corosyncnotice  [TOTEM ] A new membership 
(192.168.10.25:236) was formed. Members joined: 2 left: 2
[143747] node1.mydomain.com corosyncnotice  [TOTEM ] Failed to receive 
the leave message. failed: 2

[143747] node1.mydomain.com corosyncnotice  [TOTEM ] Retransmit List: 1
Dec 01 07:03:50 [143768] node1.mydomain.com   crmd: info: 
pcmk_cpg_membership: Node 2 left group crmd (peer=node2.mydomain.com, 
counter=1.0)
Dec 01 07:03:50 [143766] node1.mydomain.com  attrd: info: 
pcmk_cpg_membership: Node 2 left group attrd (peer=node2.mydomain.com, 
counter=1.0)
Dec 01 07:03:50 [143764] node1.mydomain.com stonith-ng: info: 
pcmk_cpg_membership: Node 2 left group stonith-ng 
(peer=node2.vselect.com, counter=1.0)
Dec 01 07:03:50 [143762] node1.mydomain.com pacemakerd: info: 
pcmk_cpg_membership: Node 2 left group pacemakerd 
(peer=node2.vselect.com, counter=1.0)


Followed by a whole slew of messages generally saying node2 was 
dead/could not be reached, culminating in:


Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng:   notice: 
initiate_remote_stonith_op:  Requesting peer fencing (reboot) of 
node2.mydomain.com | id=a041d1df-e857-4815-91db-00f448106a33 state=0
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info: 
process_remote_stonith_query:    Query result 1 of 2 from 
node1.mydomain.com for node2.mydomain.com/reboot (1 devices) 
a041d1df-e857-4815-91db-00f448106a33
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info: 
call_remote_stonith: Total timeout set to 300 for peer's fencing of 
node2.mydomain.com for 
stonith-api.139901|id=a041d1df-e857-4815-91db-00f448106a33
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info: 
call_remote_stonith: Requesting that 'node1.mydomain.com' perform op 
'node2.mydomain.com reboot' for stonith-api.139901 (360s, 0s)
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info: 
process_remote_stonith_query:    Query result 2 of 2 from 
node2.mydomain.com for node2.mydomain.com/reboot (1 devices) 
a041d1df-e857-4815-91db-00f448106a33
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info: 
stonith_fence_get_devices_cb:    Found 1 matching devices for 
'node2.mydomain.com'
Dec 01 07:04:21 [143768] node1.mydomain.com   crmd: info: 
crm_update_peer_expected:    handle_request: Node node2.mydomain.com[2] 
- expected state is now down (was member)
Dec 01 07:04:21 [143766] node1.mydomain.com  attrd: info: 
attrd_peer_update:   Setting shutdown[node2.mydomain.com]: (null) -> 
1543665861 from node2.mydomain.com
Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info: 
cib_perform_op:  Diff: --- 0.188.66 2
Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info: 
cib_perform_op:  Diff: +++ 0.188.67 (null)
Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info: 
cib_perform_op:  +  /cib:  @num_updates=67
Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info: 
cib_perform_op:  ++ /cib/status/node_state[@id='2']: 

Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info: 
cib_perform_op:  ++ 
Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info: 
cib_perform_op:  ++   id="status-2-shutdown" name="shutdown" value="1543665861"/>
Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info: 
cib_perform_op:  ++ 
Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info: 
cib_perform_op:  ++ 
Dec 01 07:04:21 [143763] node1.mydomain.com    cib: info: 
cib_process_request: Completed cib_modify operation for section status: 
OK (rc=0, origin=node2.mydomain.com/attrd/6, version=0.188.67)


And on node2 I see this:

[50215] node2.mydomain.com corosyncnotice  [TOTEM ] A new membership 
(192.168.10.25:228) was formed. Members
[50215] node2.mydomain.com corosyncnotice  [TOTEM ] A new membership 
(192.168.10.25:236) was formed. Members joined: 1 left: 1
[50215] node2.mydomain.com corosyncnotice  [TOTEM ] Failed to receive 
the leave message. failed: 1
Dec 01 07:03:50 [50224] node2.mydomain.com    cib: info: 
pcmk_cpg_membership:  Node 1 left group cib (peer=node1.mydomain.com, 
counter=2.0)
Dec 01 07:03:50 [50224] node2.mydomain.com    cib: info: 
crm_update_peer_proc: pcmk_cpg_membership: Node node1.mydomain.com[1] - 
corosync-cpg is now offline
Dec 01 07:03:50 [50229] node2.mydomain.com   crmd: info: 

Re: [ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Salvatore D'angelo
Thank you. It's clear now.

On Wed, Jul 11, 2018, 7:18 PM Andrei Borzenkov  wrote:

> 11.07.2018 20:12, Salvatore D'angelo wrote:
> > Does this mean that even if the STONITH resource p_ston_pg1 runs on
> > node pg2, when pacemaker sends it a signal, pg1 is powered off and not pg2?
> > Am I correct?
>
> Yes. Resource will be used to power off whatever hosts are listed in its
> pcmk_host_list. It is totally unrelated to where it is active currently.
>
> >
> >> On 11 Jul 2018, at 19:10, Andrei Borzenkov  wrote:
> >>
> >> 11.07.2018 19:44, Salvatore D'angelo wrote:
> >>> Hi all,
> >>>
> >>> in my cluster doing cam_mon -1ARrf I noticed my STONITH resources are
> not correctly located:
> >>
> >> Actual location of stonith resources does not really matter in up to
> >> date pacemaker. It only determines where resource will be monitored;
> >> resource will be used by whatever node will be selected to perform
> stonith.
> >>
> >> The only requirement is that stonith resource is not prohibited from
> >> running on node by constraints.
> >>
> >>> p_ston_pg1  (stonith:external/ipmi):Started pg2
> >>> p_ston_pg2  (stonith:external/ipmi):Started pg1
> >>> p_ston_pg3  (stonith:external/ipmi):Started pg1
> >>>
> >>> I have three nodes: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3
> (10.0.0.13). I expected p_ston_pg3 was running on pg3, but I see it on pg1.
> >>>
> >>> Here my configuration:
> >>> primitive p_ston_pg1 stonith:external/ipmi \\
> >>> params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list
> ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass"
> passwd_method=file interface=lan priv=OPERATOR
> >>> primitive p_ston_pg2 stonith:external/ipmi \\
> >>> params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list
> ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass"
> passwd_method=file interface=lan priv=OPERATOR
> >>> primitive p_ston_pg3 stonith:external/ipmi \\
> >>> params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list
> ipaddr=10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass"
> passwd_method=file interface=lan priv=OPERATOR
> >>>
> >>> location l_ston_pg1 p_ston_pg1 -inf: pg1
> >>> location l_ston_pg2 p_ston_pg2 -inf: pg2
> >>> location l_ston_pg3 p_ston_pg3 -inf: pg3
> >>>
> >>> this seems work fine on bare metal.
> >>> Any suggestion what could be root cause?
> >>>
> >>
> >> Root cause of what? Locations match your constraints.
> >> ___
> >> Users mailing list: Users@clusterlabs.org
> >> https://lists.clusterlabs.org/mailman/listinfo/users
> >>
> >> Project Home: http://www.clusterlabs.org
> >> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >> Bugs: http://bugs.clusterlabs.org
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
>
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Andrei Borzenkov
11.07.2018 20:12, Salvatore D'angelo wrote:
> Does this mean that even if the STONITH resource p_ston_pg1 runs on node
> pg2, when pacemaker sends it a signal, pg1 is powered off and not pg2?
> Am I correct?

Yes. Resource will be used to power off whatever hosts are listed in its
pcmk_host_list. It is totally unrelated to where it is active currently.

> 
>> On 11 Jul 2018, at 19:10, Andrei Borzenkov  wrote:
>>
>> 11.07.2018 19:44, Salvatore D'angelo wrote:
>>> Hi all,
>>>
>>> in my cluster doing crm_mon -1ARrf I noticed my STONITH resources are not 
>>> correctly located:
>>
>> Actual location of stonith resources does not really matter in up to
>> date pacemaker. It only determines where resource will be monitored;
>> resource will be used by whatever node will be selected to perform stonith.
>>
>> The only requirement is that stonith resource is not prohibited from
>> running on node by constraints.
>>
>>> p_ston_pg1  (stonith:external/ipmi):Started pg2
>>> p_ston_pg2  (stonith:external/ipmi):Started pg1
>>> p_ston_pg3  (stonith:external/ipmi):Started pg1
>>>
>>> I have three nodes: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3 (10.0.0.13). I 
>>> expected p_ston_pg3 was running on pg3, but I see it on pg1.
>>>
>>> Here my configuration:
>>> primitive p_ston_pg1 stonith:external/ipmi \\
>>> params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list 
>>> ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass" 
>>> passwd_method=file interface=lan priv=OPERATOR
>>> primitive p_ston_pg2 stonith:external/ipmi \\
>>> params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list 
>>> ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass" 
>>> passwd_method=file interface=lan priv=OPERATOR
>>> primitive p_ston_pg3 stonith:external/ipmi \\
>>> params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list 
>>> ipaddr=10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass" 
>>> passwd_method=file interface=lan priv=OPERATOR
>>>
>>> location l_ston_pg1 p_ston_pg1 -inf: pg1
>>> location l_ston_pg2 p_ston_pg2 -inf: pg2
>>> location l_ston_pg3 p_ston_pg3 -inf: pg3
>>>
>>> this seems work fine on bare metal.
>>> Any suggestion what could be root cause?
>>>
>>
>> Root cause of what? Locations match your constraints.
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Salvatore D'angelo
Does this mean that even if the STONITH resource p_ston_pg1 runs on node pg2,
when pacemaker sends it a signal, pg1 is powered off and not pg2?
Am I correct?

> On 11 Jul 2018, at 19:10, Andrei Borzenkov  wrote:
> 
> 11.07.2018 19:44, Salvatore D'angelo wrote:
>> Hi all,
>> 
>> in my cluster doing crm_mon -1ARrf I noticed my STONITH resources are not 
>> correctly located:
> 
> Actual location of stonith resources does not really matter in up to
> date pacemaker. It only determines where resource will be monitored;
> resource will be used by whatever node will be selected to perform stonith.
> 
> The only requirement is that stonith resource is not prohibited from
> running on node by constraints.
> 
>> p_ston_pg1   (stonith:external/ipmi):Started pg2
>> p_ston_pg2   (stonith:external/ipmi):Started pg1
>> p_ston_pg3   (stonith:external/ipmi):Started pg1
>> 
>> I have three nodes: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3 (10.0.0.13). I 
>> expected p_ston_pg3 was running on pg3, but I see it on pg1.
>> 
>> Here my configuration:
>> primitive p_ston_pg1 stonith:external/ipmi \\
>>  params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list 
>> ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass" 
>> passwd_method=file interface=lan priv=OPERATOR
>> primitive p_ston_pg2 stonith:external/ipmi \\
>>  params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list 
>> ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass" 
>> passwd_method=file interface=lan priv=OPERATOR
>> primitive p_ston_pg3 stonith:external/ipmi \\
>>  params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list 
>> ipaddr=10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass" 
>> passwd_method=file interface=lan priv=OPERATOR
>> 
>> location l_ston_pg1 p_ston_pg1 -inf: pg1
>> location l_ston_pg2 p_ston_pg2 -inf: pg2
>> location l_ston_pg3 p_ston_pg3 -inf: pg3
>> 
>> this seems work fine on bare metal.
>> Any suggestion what could be root cause?
>> 
> 
> Root cause of what? Locations match your constraints.
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Andrei Borzenkov
11.07.2018 19:44, Salvatore D'angelo wrote:
> Hi all,
> 
> in my cluster doing crm_mon -1ARrf I noticed my STONITH resources are not 
> correctly located:

Actual location of stonith resources does not really matter in up to
date pacemaker. It only determines where resource will be monitored;
resource will be used by whatever node will be selected to perform stonith.

The only requirement is that stonith resource is not prohibited from
running on node by constraints.

> p_ston_pg1(stonith:external/ipmi):Started pg2
> p_ston_pg2(stonith:external/ipmi):Started pg1
> p_ston_pg3(stonith:external/ipmi):Started pg1
> 
> I have three nodes: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3 (10.0.0.13). I 
> expected p_ston_pg3 was running on pg3, but I see it on pg1.
> 
> Here my configuration:
> primitive p_ston_pg1 stonith:external/ipmi \\
>   params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list 
> ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass" 
> passwd_method=file interface=lan priv=OPERATOR
> primitive p_ston_pg2 stonith:external/ipmi \\
>   params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list 
> ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass" 
> passwd_method=file interface=lan priv=OPERATOR
> primitive p_ston_pg3 stonith:external/ipmi \\
>   params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list 
> ipaddr=10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass" 
> passwd_method=file interface=lan priv=OPERATOR
> 
> location l_ston_pg1 p_ston_pg1 -inf: pg1
> location l_ston_pg2 p_ston_pg2 -inf: pg2
> location l_ston_pg3 p_ston_pg3 -inf: pg3
> 
> this seems work fine on bare metal.
> Any suggestion what could be root cause?
> 

Root cause of what? Locations match your constraints.
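
To spell out the score semantics with the names from this thread:

location l_ston_pg3 p_ston_pg3 -inf: pg3   # p_ston_pg3 may never run on pg3
location l_ston_pg3 p_ston_pg3  inf: pg3   # p_ston_pg3 strongly prefers pg3

With the -inf form, each stonith resource running on some node other than its own target is exactly the placement the constraints ask for.
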
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Salvatore D'angelo
Suppose I do the following:

crm configure delete l_ston_pg1
crm configure delete l_ston_pg2
crm configure delete l_ston_pg3
crm configure location l_ston_pg1 p_ston_pg1 inf: pg1
crm configure location l_ston_pg2 p_ston_pg2 inf: pg2
crm configure location l_ston_pg3 p_ston_pg3 inf: pg3

How long should I wait to see each STONITH resource on the correct node? 
Should I do something to adjust things on the fly?
Thanks for support.

> On 11 Jul 2018, at 18:47, Emmanuel Gelati  wrote:
> 
> You need to use location l_ston_pg3 p_ston_pg3 inf: pg3, because -inf is 
> negative.
> 
> 2018-07-11 18:44 GMT+02:00 Salvatore D'angelo:
> Hi all,
> 
> in my cluster doing crm_mon -1ARrf I noticed my STONITH resources are not 
> correctly located:
> p_ston_pg1(stonith:external/ipmi):Started pg2
> p_ston_pg2(stonith:external/ipmi):Started pg1
> p_ston_pg3(stonith:external/ipmi):Started pg1
> 
> I have three nodes: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3 (10.0.0.13). I 
> expected p_ston_pg3 was running on pg3, but I see it on pg1.
> 
> Here my configuration:
> primitive p_ston_pg1 stonith:external/ipmi \\
>   params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list 
> ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass" 
> passwd_method=file interface=lan priv=OPERATOR
> primitive p_ston_pg2 stonith:external/ipmi \\
>   params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list 
> ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass" 
> passwd_method=file interface=lan priv=OPERATOR
> primitive p_ston_pg3 stonith:external/ipmi \\
>   params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list 
> ipaddr=10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass" 
> passwd_method=file interface=lan priv=OPERATOR
> 
> location l_ston_pg1 p_ston_pg1 -inf: pg1
> location l_ston_pg2 p_ston_pg2 -inf: pg2
> location l_ston_pg3 p_ston_pg3 -inf: pg3
> 
> this seems work fine on bare metal.
> Any suggestion what could be root cause?
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org 
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> 
> Bugs: http://bugs.clusterlabs.org 
> 
> 
> 
> 
> -- 
>   .~.
>   /V\
>  //  \\
> /(   )\
> ^`~'^
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Emmanuel Gelati
You need to use location l_ston_pg3 p_ston_pg3 inf: pg3, because -inf is
negative.

2018-07-11 18:44 GMT+02:00 Salvatore D'angelo :

> Hi all,
>
> in my cluster doing crm_mon -1ARrf I noticed my STONITH resources are not
> correctly located:
> p_ston_pg1 (stonith:external/ipmi): Started pg2
> p_ston_pg2 (stonith:external/ipmi): Started pg1
> p_ston_pg3 (stonith:external/ipmi): Started pg1
>
> I have three nodes: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3 (10.0.0.13).
> I expected p_ston_pg3 was running on pg3, but I see it on pg1.
>
> Here my configuration:
> primitive p_ston_pg1 stonith:external/ipmi \\
> params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list
> ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass"
> passwd_method=file interface=lan priv=OPERATOR
> primitive p_ston_pg2 stonith:external/ipmi \\
> params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list
> ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass"
> passwd_method=file interface=lan priv=OPERATOR
> primitive p_ston_pg3 stonith:external/ipmi \\
> params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list ipaddr=
> 10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass" passwd_method=file
> interface=lan priv=OPERATOR
>
> location l_ston_pg1 p_ston_pg1 -inf: pg1
> location l_ston_pg2 p_ston_pg2 -inf: pg2
> location l_ston_pg3 p_ston_pg3 -inf: pg3
>
> this seems to work fine on bare metal.
> Any suggestion on what the root cause could be?
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>


-- 
  .~.
  /V\
 //  \\
/(   )\
^`~'^
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH forever?

2018-04-10 Thread Ken Gaillot
On Tue, 2018-04-10 at 07:26 +, Stefan Schlösser wrote:
> Hi,
>  
> I have a 3 node setup on ubuntu 16.04. Corosync/Pacemaker services
> are not started automatically.
>  
> If I put all 3 nodes to offline mode, with 1 node in an „unclean“
> state I get a never ending STONITH.
>  
> What happens is that the STONITH causes a reboot of the unclean node.
>  
> 1) I would have thought with all nodes in standby no STONITH can
> occur. Why does it?

Standby prevents a node from running resources, but it still
participates in quorum voting. I suspect *starting* a node in standby
mode would prevent it from using fence devices, but *changing* a node
to standby will have no effect on whether it can fence.

> 2) Why does it keep on killing the unclean node?

Good question. The DC's logs will have the most useful information --
each pengine run should say why fencing is being scheduled.
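
As a rough sketch only (the log path varies by distribution and is just an
assumption here), something like this on the DC can surface those decisions:

  grep pengine /var/log/cluster/corosync.log | grep -i -e stonith -e fence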

>  
> The only way to stop it is to temporarily disable stonith, bring
> the unclean node back online manually, and then enable it again.
>  
> Here is a log extract of node c killing node a:
> Apr 10 09:08:30 [2276] xxx-c stonith-ng:   notice: log_operation:  
> Operation 'reboot' [2428] (call 5 from crmd.2175) for host 'xxx-a'
> with device 'stonith_a' returned: 0 (OK)
> Apr 10 09:08:30 [2276] xxx-c stonith-ng:   notice: remote_op_done: 
> Operation reboot of xxx-a by xxx-c for crmd.2175@xxx-b.20531831: OK
> Apr 10 09:08:30 [2275] xxx-c    cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=xxx-b/crmd/83, version=0.164.37)
> Apr 10 09:08:30 [2275] xxx-c    cib: info:
> cib_process_request: Completed cib_delete operation for section
> //node_state[@uname='xxx-a']/lrm: OK (rc=0, origin=xxx-b/crmd/84,
> version=0.164.37)
> Apr 10 09:08:30 [2275] xxx-c    cib: info:
> cib_process_request: Completed cib_delete operation for section
> //node_state[@uname='xxx-a']/transient_attributes: OK (rc=0,
> origin=xxx-b/crmd/85, version=0.164.37)
> Apr 10 09:08:30 [2275] xxx-c    cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=xxx-b/crmd/86, version=0.164.37)
> Apr 10 09:08:30 [2275] xxx-c    cib: info:
> cib_process_request: Completed cib_delete operation for section
> //node_state[@uname='xxx-a']/lrm: OK (rc=0, origin=xxx-b/crmd/87,
> version=0.164.37)
> Apr 10 09:08:30 [2275] xxx-c    cib: info:
> cib_process_request: Completed cib_delete operation for section
> //node_state[@uname='xxx-a']/transient_attributes: OK (rc=0,
> origin=xxx-b/crmd/88, version=0.164.37)
>  
> This then repeats forevermore ...
>  
> Thanks for any hints,
>  
> cheers,
>  
> Stefan
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stonith stops after vSphere restart

2018-04-03 Thread jota
Hi again,

After restarting the vCenter, all worked as expected.
Thanks to all.

Have a nice day.

On 23 February 2018 at 7:59, j...@disroot.org wrote:

> Hi all,
> 
> Thanks for your responses.
> With your advice I was able to configure it. I still have to test its 
> operation. When it is
> possible to restart the vCenter, I will post the results.
> Have a nice weekend!
> 
> 22 de febrero de 2018 16:00, "Tomas Jelinek"  escribió:
> 
>> Try this:
>> 
>> pcs resource meta vmware_soap failure-timeout=
>> 
>> Tomas
>> 
>> Dne 22.2.2018 v 14:55 j...@disroot.org napsal(a):
>> 
>>> Hi,
>>> 
>>> I am trying to configure the failure-timeout for stonith, but I only can do 
>>> it for the other
>>> resources.
>>> When try to enable it for stonith, I get this error: "Error: resource 
>>> option(s): 'failure-timeout',
>>> are not recognized for resource type: 'stonith::fence_vmware_soap'".
>>> 
>>> Thanks.
>>> 
>>> 22 de febrero de 2018 13:46, "Andrei Borzenkov"  
>>> escribió:
>> 
>> On Thu, Feb 22, 2018 at 2:40 PM,  wrote:
>>> Thanks for the responses.
>>> 
>>> So, if I understand, this is the right behaviour and it does not affect to 
>>> the stonith mechanism.
>>> 
>>> If I remember correctly, the fault status persists for hours until I fix it 
>>> manually.
>>> Is there any way to modify the expiry time to clean itself?.
>> 
>> Yes, as mentioned set failure-timeout resource meta-attribute.
>>> 22 de febrero de 2018 12:28, "Andrei Borzenkov"  
>>> escribió:
>>> 
>>> Stonith resource state should have no impact on actual stonith
>>> operation. It only reflects whether monitor was successful or not and
>>> serves as warning to administrator that something may be wrong. It
>>> should automatically clear itself after failure-timeout has expired.
>>> 
>>> On Thu, Feb 22, 2018 at 1:58 PM,  wrote:
>>> 
>>> Hi,
>>> 
>>> I have a 2 node pacemaker cluster configured with the fence agent
>>> vmware_soap.
>>> Everything works fine until the vCenter is restarted. After that, stonith
>>> fails and stop.
>>> 
>>> [root@node1 ~]# pcs status
>>> Cluster name: psqltest
>>> Stack: corosync
>>> Current DC: node2 (version 1.1.16-12.el7_4.7-94ff4df) - partition with
>>> quorum
>>> Last updated: Thu Feb 22 11:30:22 2018
>>> Last change: Mon Feb 19 09:28:37 2018 by root via crm_resource on node1
>>> 
>>> 2 nodes configured
>>> 6 resources configured
>>> 
>>> Online: [ node1 node2 ]
>>> 
>>> Full list of resources:
>>> 
>>> Master/Slave Set: ms_drbd_psqltest [drbd_psqltest]
>>> Masters: [ node1 ]
>>> Slaves: [ node2 ]
>>> Resource Group: pgsqltest
>>> psqltestfs (ocf::heartbeat:Filesystem): Started node1
>>> psqltest_vip (ocf::heartbeat:IPaddr2): Started node1
>>> postgresql-94 (ocf::heartbeat:pgsql): Started node1
>>> vmware_soap (stonith:fence_vmware_soap): Stopped
>>> 
>>> Failed Actions:
>>> * vmware_soap_start_0 on node1 'unknown error' (1): call=38, status=Error,
>>> exitreason='none',
>>> last-rc-change='Thu Feb 22 10:55:46 2018', queued=0ms, exec=5374ms
>>> * vmware_soap_start_0 on node2 'unknown error' (1): call=56, status=Error,
>>> exitreason='none',
>>> last-rc-change='Thu Feb 22 10:55:39 2018', queued=0ms, exec=5479ms
>>> 
>>> Daemon Status:
>>> corosync: active/enabled
>>> pacemaker: active/enabled
>>> pcsd: active/enabled
>>> 
>>> [root@node1 ~]# pcs stonith show --full
>>> Resource: vmware_soap (class=stonith type=fence_vmware_soap)
>>> Attributes: inet4_only=1 ipaddr=192.168.1.1 ipport=443 login=MYDOMAIN\User
>>> passwd=mypass pcmk_host_list=node1,node2 power_wait=3 ssl_insecure=1 action=
>>> pcmk_list_timeout=120s pcmk_monitor_timeout=120s pcmk_status_timeout=120s
>>> Operations: monitor interval=60s (vmware_soap-monitor-interval-60s)
>>> 
>>> I need to manually perform a "resource cleanup vmware_soap" to put it online
>>> again.
>>> Is there any way to do this automatically?.
>>> Is it possible to detect vSphere online again and enable stonith?.
>>> 
>>> Thanks.
>>> 
>>> ___
>>> Users mailing list: Users@clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>> ___
>>> Users mailing list: Users@clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>> 
>>> ___
>>> Users mailing list: Users@clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>> 

Re: [ClusterLabs] Stonith stops after vSphere restart

2018-02-22 Thread jota
Hi all,

Thanks for your responses.
With your advice I was able to configure it. I still have to test its 
operation. When it is possible to restart the vCenter, I will post the results.
Have a nice weekend!


On 22 February 2018 at 16:00, "Tomas Jelinek"  wrote:

> Try this:
> 
> pcs resource meta vmware_soap failure-timeout=
> 
> Tomas
> 
> Dne 22.2.2018 v 14:55 j...@disroot.org napsal(a):
> 
>> Hi,
>> 
>> I am trying to configure the failure-timeout for stonith, but I only can do 
>> it for the other
>> resources.
>> When try to enable it for stonith, I get this error: "Error: resource 
>> option(s): 'failure-timeout',
>> are not recognized for resource type: 'stonith::fence_vmware_soap'".
>> 
>> Thanks.
>> 
>> 22 de febrero de 2018 13:46, "Andrei Borzenkov"  
>> escribió:
>> 
>>> On Thu, Feb 22, 2018 at 2:40 PM,  wrote:
>> 
>> Thanks for the responses.
>> 
>> So, if I understand, this is the right behaviour and it does not affect to 
>> the stonith mechanism.
>> 
>> If I remember correctly, the fault status persists for hours until I fix it 
>> manually.
>> Is there any way to modify the expiry time to clean itself?.
>>> Yes, as mentioned set failure-timeout resource meta-attribute.
>> 
>> 22 de febrero de 2018 12:28, "Andrei Borzenkov"  
>> escribió:
>> 
>> Stonith resource state should have no impact on actual stonith
>> operation. It only reflects whether monitor was successful or not and
>> serves as warning to administrator that something may be wrong. It
>> should automatically clear itself after failure-timeout has expired.
>> 
>> On Thu, Feb 22, 2018 at 1:58 PM,  wrote:
>> 
>> Hi,
>> 
>> I have a 2 node pacemaker cluster configured with the fence agent
>> vmware_soap.
>> Everything works fine until the vCenter is restarted. After that, stonith
>> fails and stop.
>> 
>> [root@node1 ~]# pcs status
>> Cluster name: psqltest
>> Stack: corosync
>> Current DC: node2 (version 1.1.16-12.el7_4.7-94ff4df) - partition with
>> quorum
>> Last updated: Thu Feb 22 11:30:22 2018
>> Last change: Mon Feb 19 09:28:37 2018 by root via crm_resource on node1
>> 
>> 2 nodes configured
>> 6 resources configured
>> 
>> Online: [ node1 node2 ]
>> 
>> Full list of resources:
>> 
>> Master/Slave Set: ms_drbd_psqltest [drbd_psqltest]
>> Masters: [ node1 ]
>> Slaves: [ node2 ]
>> Resource Group: pgsqltest
>> psqltestfs (ocf::heartbeat:Filesystem): Started node1
>> psqltest_vip (ocf::heartbeat:IPaddr2): Started node1
>> postgresql-94 (ocf::heartbeat:pgsql): Started node1
>> vmware_soap (stonith:fence_vmware_soap): Stopped
>> 
>> Failed Actions:
>> * vmware_soap_start_0 on node1 'unknown error' (1): call=38, status=Error,
>> exitreason='none',
>> last-rc-change='Thu Feb 22 10:55:46 2018', queued=0ms, exec=5374ms
>> * vmware_soap_start_0 on node2 'unknown error' (1): call=56, status=Error,
>> exitreason='none',
>> last-rc-change='Thu Feb 22 10:55:39 2018', queued=0ms, exec=5479ms
>> 
>> Daemon Status:
>> corosync: active/enabled
>> pacemaker: active/enabled
>> pcsd: active/enabled
>> 
>> [root@node1 ~]# pcs stonith show --full
>> Resource: vmware_soap (class=stonith type=fence_vmware_soap)
>> Attributes: inet4_only=1 ipaddr=192.168.1.1 ipport=443 login=MYDOMAIN\User
>> passwd=mypass pcmk_host_list=node1,node2 power_wait=3 ssl_insecure=1 action=
>> pcmk_list_timeout=120s pcmk_monitor_timeout=120s pcmk_status_timeout=120s
>> Operations: monitor interval=60s (vmware_soap-monitor-interval-60s)
>> 
>> I need to manually perform a "resource cleanup vmware_soap" to put it online
>> again.
>> Is there any way to do this automatically?.
>> Is it possible to detect vSphere online again and enable stonith?.
>> 
>> Thanks.
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>> ___
>>> Users mailing list: Users@clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: 

Re: [ClusterLabs] Stonith stops after vSphere restart

2018-02-22 Thread Tomas Jelinek

Try this:

pcs resource meta vmware_soap failure-timeout=


Tomas


On 22.2.2018 at 14:55, j...@disroot.org wrote:

Hi,

I am trying to configure the failure-timeout for stonith, but I only can do it 
for the other resources.
When try to enable it for stonith, I get this error: "Error: resource option(s): 
'failure-timeout', are not recognized for resource type: 
'stonith::fence_vmware_soap'".

Thanks.

On 22 February 2018 at 13:46, "Andrei Borzenkov"  wrote:


On Thu, Feb 22, 2018 at 2:40 PM,  wrote:


Thanks for the responses.

So, if I understand, this is the right behaviour and it does not affect to the 
stonith mechanism.

If I remember correctly, the fault status persists for hours until I fix it 
manually.
Is there any way to modify the expiry time to clean itself?.


Yes, as mentioned set failure-timeout resource meta-attribute.


On 22 February 2018 at 12:28, "Andrei Borzenkov"  wrote:


Stonith resource state should have no impact on actual stonith
operation. It only reflects whether monitor was successful or not and
serves as warning to administrator that something may be wrong. It
should automatically clear itself after failure-timeout has expired.

On Thu, Feb 22, 2018 at 1:58 PM,  wrote:


Hi,

I have a 2 node pacemaker cluster configured with the fence agent
vmware_soap.
Everything works fine until the vCenter is restarted. After that, stonith
fails and stop.

[root@node1 ~]# pcs status
Cluster name: psqltest
Stack: corosync
Current DC: node2 (version 1.1.16-12.el7_4.7-94ff4df) - partition with
quorum
Last updated: Thu Feb 22 11:30:22 2018
Last change: Mon Feb 19 09:28:37 2018 by root via crm_resource on node1

2 nodes configured
6 resources configured

Online: [ node1 node2 ]

Full list of resources:

Master/Slave Set: ms_drbd_psqltest [drbd_psqltest]
Masters: [ node1 ]
Slaves: [ node2 ]
Resource Group: pgsqltest
psqltestfs (ocf::heartbeat:Filesystem): Started node1
psqltest_vip (ocf::heartbeat:IPaddr2): Started node1
postgresql-94 (ocf::heartbeat:pgsql): Started node1
vmware_soap (stonith:fence_vmware_soap): Stopped

Failed Actions:
* vmware_soap_start_0 on node1 'unknown error' (1): call=38, status=Error,
exitreason='none',
last-rc-change='Thu Feb 22 10:55:46 2018', queued=0ms, exec=5374ms
* vmware_soap_start_0 on node2 'unknown error' (1): call=56, status=Error,
exitreason='none',
last-rc-change='Thu Feb 22 10:55:39 2018', queued=0ms, exec=5479ms

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled

[root@node1 ~]# pcs stonith show --full
Resource: vmware_soap (class=stonith type=fence_vmware_soap)
Attributes: inet4_only=1 ipaddr=192.168.1.1 ipport=443 login=MYDOMAIN\User
passwd=mypass pcmk_host_list=node1,node2 power_wait=3 ssl_insecure=1 action=
pcmk_list_timeout=120s pcmk_monitor_timeout=120s pcmk_status_timeout=120s
Operations: monitor interval=60s (vmware_soap-monitor-interval-60s)

I need to manually perform a "resource cleanup vmware_soap" to put it online
again.
Is there any way to do this automatically?.
Is it possible to detect vSphere online again and enable stonith?.

Thanks.

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stonith stops after vSphere restart

2018-02-22 Thread Klaus Wenninger
On 02/22/2018 02:55 PM, j...@disroot.org wrote:
> Hi,
>
> I am trying to configure the failure-timeout for stonith, but I only can do 
> it for the other resources.
> When try to enable it for stonith, I get this error: "Error: resource 
> option(s): 'failure-timeout', are not recognized for resource type: 
> 'stonith::fence_vmware_soap'".

It is a meta-attribute, so 'pcs stonith update ... meta
failure-timeout=...' should work.
Although I'm not 100% sure it is being adhered to properly.
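
Filling in the placeholders as a purely hypothetical example (the device name
is the one from this thread, the 5-minute value is arbitrary):

  pcs stonith update vmware_soap meta failure-timeout=300s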

Regards,
Klaus
 
>
> Thanks.
>
> 22 de febrero de 2018 13:46, "Andrei Borzenkov"  
> escribió:
>
>> On Thu, Feb 22, 2018 at 2:40 PM,  wrote:
>>
>>> Thanks for the responses.
>>>
>>> So, if I understand, this is the right behaviour and it does not affect to 
>>> the stonith mechanism.
>>>
>>> If I remember correctly, the fault status persists for hours until I fix it 
>>> manually.
>>> Is there any way to modify the expiry time to clean itself?.
>> Yes, as mentioned set failure-timeout resource meta-attribute.
>>
>>> 22 de febrero de 2018 12:28, "Andrei Borzenkov"  
>>> escribió:
>>>
 Stonith resource state should have no impact on actual stonith
 operation. It only reflects whether monitor was successful or not and
 serves as warning to administrator that something may be wrong. It
 should automatically clear itself after failure-timeout has expired.

 On Thu, Feb 22, 2018 at 1:58 PM,  wrote:
>>> Hi,
>>>
>>> I have a 2 node pacemaker cluster configured with the fence agent
>>> vmware_soap.
>>> Everything works fine until the vCenter is restarted. After that, stonith
>>> fails and stop.
>>>
>>> [root@node1 ~]# pcs status
>>> Cluster name: psqltest
>>> Stack: corosync
>>> Current DC: node2 (version 1.1.16-12.el7_4.7-94ff4df) - partition with
>>> quorum
>>> Last updated: Thu Feb 22 11:30:22 2018
>>> Last change: Mon Feb 19 09:28:37 2018 by root via crm_resource on node1
>>>
>>> 2 nodes configured
>>> 6 resources configured
>>>
>>> Online: [ node1 node2 ]
>>>
>>> Full list of resources:
>>>
>>> Master/Slave Set: ms_drbd_psqltest [drbd_psqltest]
>>> Masters: [ node1 ]
>>> Slaves: [ node2 ]
>>> Resource Group: pgsqltest
>>> psqltestfs (ocf::heartbeat:Filesystem): Started node1
>>> psqltest_vip (ocf::heartbeat:IPaddr2): Started node1
>>> postgresql-94 (ocf::heartbeat:pgsql): Started node1
>>> vmware_soap (stonith:fence_vmware_soap): Stopped
>>>
>>> Failed Actions:
>>> * vmware_soap_start_0 on node1 'unknown error' (1): call=38, status=Error,
>>> exitreason='none',
>>> last-rc-change='Thu Feb 22 10:55:46 2018', queued=0ms, exec=5374ms
>>> * vmware_soap_start_0 on node2 'unknown error' (1): call=56, status=Error,
>>> exitreason='none',
>>> last-rc-change='Thu Feb 22 10:55:39 2018', queued=0ms, exec=5479ms
>>>
>>> Daemon Status:
>>> corosync: active/enabled
>>> pacemaker: active/enabled
>>> pcsd: active/enabled
>>>
>>> [root@node1 ~]# pcs stonith show --full
>>> Resource: vmware_soap (class=stonith type=fence_vmware_soap)
>>> Attributes: inet4_only=1 ipaddr=192.168.1.1 ipport=443 login=MYDOMAIN\User
>>> passwd=mypass pcmk_host_list=node1,node2 power_wait=3 ssl_insecure=1 action=
>>> pcmk_list_timeout=120s pcmk_monitor_timeout=120s pcmk_status_timeout=120s
>>> Operations: monitor interval=60s (vmware_soap-monitor-interval-60s)
>>>
>>> I need to manually perform a "resource cleanup vmware_soap" to put it online
>>> again.
>>> Is there any way to do this automatically?.
>>> Is it possible to detect vSphere online again and enable stonith?.
>>>
>>> Thanks.
>>>
>>> ___
>>> Users mailing list: Users@clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
 ___
 Users mailing list: Users@clusterlabs.org
 https://lists.clusterlabs.org/mailman/listinfo/users

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
>>> ___
>>> Users mailing list: Users@clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> ___
> Users mailing list: Users@clusterlabs.org
> 

Re: [ClusterLabs] Stonith stops after vSphere restart

2018-02-22 Thread jota
Hi,

I am trying to configure the failure-timeout for stonith, but I can only do it 
for the other resources.
When I try to enable it for stonith, I get this error: "Error: resource 
option(s): 'failure-timeout', are not recognized for resource type: 
'stonith::fence_vmware_soap'".

Thanks.

On 22 February 2018 at 13:46, "Andrei Borzenkov"  wrote:

> On Thu, Feb 22, 2018 at 2:40 PM,  wrote:
> 
>> Thanks for the responses.
>> 
>> So, if I understand, this is the right behaviour and it does not affect to 
>> the stonith mechanism.
>> 
>> If I remember correctly, the fault status persists for hours until I fix it 
>> manually.
>> Is there any way to modify the expiry time to clean itself?.
> 
> Yes, as mentioned set failure-timeout resource meta-attribute.
> 
>> 22 de febrero de 2018 12:28, "Andrei Borzenkov"  
>> escribió:
>> 
>>> Stonith resource state should have no impact on actual stonith
>>> operation. It only reflects whether monitor was successful or not and
>>> serves as warning to administrator that something may be wrong. It
>>> should automatically clear itself after failure-timeout has expired.
>>> 
>>> On Thu, Feb 22, 2018 at 1:58 PM,  wrote:
>> 
>> Hi,
>> 
>> I have a 2 node pacemaker cluster configured with the fence agent
>> vmware_soap.
>> Everything works fine until the vCenter is restarted. After that, stonith
>> fails and stop.
>> 
>> [root@node1 ~]# pcs status
>> Cluster name: psqltest
>> Stack: corosync
>> Current DC: node2 (version 1.1.16-12.el7_4.7-94ff4df) - partition with
>> quorum
>> Last updated: Thu Feb 22 11:30:22 2018
>> Last change: Mon Feb 19 09:28:37 2018 by root via crm_resource on node1
>> 
>> 2 nodes configured
>> 6 resources configured
>> 
>> Online: [ node1 node2 ]
>> 
>> Full list of resources:
>> 
>> Master/Slave Set: ms_drbd_psqltest [drbd_psqltest]
>> Masters: [ node1 ]
>> Slaves: [ node2 ]
>> Resource Group: pgsqltest
>> psqltestfs (ocf::heartbeat:Filesystem): Started node1
>> psqltest_vip (ocf::heartbeat:IPaddr2): Started node1
>> postgresql-94 (ocf::heartbeat:pgsql): Started node1
>> vmware_soap (stonith:fence_vmware_soap): Stopped
>> 
>> Failed Actions:
>> * vmware_soap_start_0 on node1 'unknown error' (1): call=38, status=Error,
>> exitreason='none',
>> last-rc-change='Thu Feb 22 10:55:46 2018', queued=0ms, exec=5374ms
>> * vmware_soap_start_0 on node2 'unknown error' (1): call=56, status=Error,
>> exitreason='none',
>> last-rc-change='Thu Feb 22 10:55:39 2018', queued=0ms, exec=5479ms
>> 
>> Daemon Status:
>> corosync: active/enabled
>> pacemaker: active/enabled
>> pcsd: active/enabled
>> 
>> [root@node1 ~]# pcs stonith show --full
>> Resource: vmware_soap (class=stonith type=fence_vmware_soap)
>> Attributes: inet4_only=1 ipaddr=192.168.1.1 ipport=443 login=MYDOMAIN\User
>> passwd=mypass pcmk_host_list=node1,node2 power_wait=3 ssl_insecure=1 action=
>> pcmk_list_timeout=120s pcmk_monitor_timeout=120s pcmk_status_timeout=120s
>> Operations: monitor interval=60s (vmware_soap-monitor-interval-60s)
>> 
>> I need to manually perform a "resource cleanup vmware_soap" to put it online
>> again.
>> Is there any way to do this automatically?.
>> Is it possible to detect vSphere online again and enable stonith?.
>> 
>> Thanks.
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>> ___
>>> Users mailing list: Users@clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stonith stops after vSphere restart

2018-02-22 Thread Andrei Borzenkov
On Thu, Feb 22, 2018 at 2:40 PM,   wrote:
> Thanks for the responses.
>
> So, if I understand, this is the right behaviour and it does not affect to 
> the stonith mechanism.
>
> If I remember correctly, the fault status persists for hours until I fix it 
> manually.
> Is there any way to modify the expiry time to clean itself?.
>

Yes, as mentioned set failure-timeout resource meta-attribute.

> 22 de febrero de 2018 12:28, "Andrei Borzenkov"  
> escribió:
>
>> Stonith resource state should have no impact on actual stonith
>> operation. It only reflects whether monitor was successful or not and
>> serves as warning to administrator that something may be wrong. It
>> should automatically clear itself after failure-timeout has expired.
>>
>> On Thu, Feb 22, 2018 at 1:58 PM,  wrote:
>>
>>> Hi,
>>>
>>> I have a 2 node pacemaker cluster configured with the fence agent
>>> vmware_soap.
>>> Everything works fine until the vCenter is restarted. After that, stonith
>>> fails and stop.
>>>
>>> [root@node1 ~]# pcs status
>>> Cluster name: psqltest
>>> Stack: corosync
>>> Current DC: node2 (version 1.1.16-12.el7_4.7-94ff4df) - partition with
>>> quorum
>>> Last updated: Thu Feb 22 11:30:22 2018
>>> Last change: Mon Feb 19 09:28:37 2018 by root via crm_resource on node1
>>>
>>> 2 nodes configured
>>> 6 resources configured
>>>
>>> Online: [ node1 node2 ]
>>>
>>> Full list of resources:
>>>
>>> Master/Slave Set: ms_drbd_psqltest [drbd_psqltest]
>>> Masters: [ node1 ]
>>> Slaves: [ node2 ]
>>> Resource Group: pgsqltest
>>> psqltestfs (ocf::heartbeat:Filesystem): Started node1
>>> psqltest_vip (ocf::heartbeat:IPaddr2): Started node1
>>> postgresql-94 (ocf::heartbeat:pgsql): Started node1
>>> vmware_soap (stonith:fence_vmware_soap): Stopped
>>>
>>> Failed Actions:
>>> * vmware_soap_start_0 on node1 'unknown error' (1): call=38, status=Error,
>>> exitreason='none',
>>> last-rc-change='Thu Feb 22 10:55:46 2018', queued=0ms, exec=5374ms
>>> * vmware_soap_start_0 on node2 'unknown error' (1): call=56, status=Error,
>>> exitreason='none',
>>> last-rc-change='Thu Feb 22 10:55:39 2018', queued=0ms, exec=5479ms
>>>
>>> Daemon Status:
>>> corosync: active/enabled
>>> pacemaker: active/enabled
>>> pcsd: active/enabled
>>>
>>> [root@node1 ~]# pcs stonith show --full
>>> Resource: vmware_soap (class=stonith type=fence_vmware_soap)
>>> Attributes: inet4_only=1 ipaddr=192.168.1.1 ipport=443 login=MYDOMAIN\User
>>> passwd=mypass pcmk_host_list=node1,node2 power_wait=3 ssl_insecure=1 action=
>>> pcmk_list_timeout=120s pcmk_monitor_timeout=120s pcmk_status_timeout=120s
>>> Operations: monitor interval=60s (vmware_soap-monitor-interval-60s)
>>>
>>> I need to manually perform a "resource cleanup vmware_soap" to put it online
>>> again.
>>> Is there any way to do this automatically?.
>>> Is it possible to detect vSphere online again and enable stonith?.
>>>
>>> Thanks.
>>>
>>> ___
>>> Users mailing list: Users@clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stonith stops after vSphere restart

2018-02-22 Thread jota
Thanks for the responses.

So, if I understand correctly, this is the right behaviour and it does not affect the 
stonith mechanism.

If I remember correctly, the fault status persists for hours until I fix it 
manually.
Is there any way to modify the expiry time to clean itself?.

On 22 February 2018 at 12:28, "Andrei Borzenkov"  wrote:

> Stonith resource state should have no impact on actual stonith
> operation. It only reflects whether monitor was successful or not and
> serves as warning to administrator that something may be wrong. It
> should automatically clear itself after failure-timeout has expired.
> 
> On Thu, Feb 22, 2018 at 1:58 PM,  wrote:
> 
>> Hi,
>> 
>> I have a 2 node pacemaker cluster configured with the fence agent
>> vmware_soap.
>> Everything works fine until the vCenter is restarted. After that, stonith
>> fails and stop.
>> 
>> [root@node1 ~]# pcs status
>> Cluster name: psqltest
>> Stack: corosync
>> Current DC: node2 (version 1.1.16-12.el7_4.7-94ff4df) - partition with
>> quorum
>> Last updated: Thu Feb 22 11:30:22 2018
>> Last change: Mon Feb 19 09:28:37 2018 by root via crm_resource on node1
>> 
>> 2 nodes configured
>> 6 resources configured
>> 
>> Online: [ node1 node2 ]
>> 
>> Full list of resources:
>> 
>> Master/Slave Set: ms_drbd_psqltest [drbd_psqltest]
>> Masters: [ node1 ]
>> Slaves: [ node2 ]
>> Resource Group: pgsqltest
>> psqltestfs (ocf::heartbeat:Filesystem): Started node1
>> psqltest_vip (ocf::heartbeat:IPaddr2): Started node1
>> postgresql-94 (ocf::heartbeat:pgsql): Started node1
>> vmware_soap (stonith:fence_vmware_soap): Stopped
>> 
>> Failed Actions:
>> * vmware_soap_start_0 on node1 'unknown error' (1): call=38, status=Error,
>> exitreason='none',
>> last-rc-change='Thu Feb 22 10:55:46 2018', queued=0ms, exec=5374ms
>> * vmware_soap_start_0 on node2 'unknown error' (1): call=56, status=Error,
>> exitreason='none',
>> last-rc-change='Thu Feb 22 10:55:39 2018', queued=0ms, exec=5479ms
>> 
>> Daemon Status:
>> corosync: active/enabled
>> pacemaker: active/enabled
>> pcsd: active/enabled
>> 
>> [root@node1 ~]# pcs stonith show --full
>> Resource: vmware_soap (class=stonith type=fence_vmware_soap)
>> Attributes: inet4_only=1 ipaddr=192.168.1.1 ipport=443 login=MYDOMAIN\User
>> passwd=mypass pcmk_host_list=node1,node2 power_wait=3 ssl_insecure=1 action=
>> pcmk_list_timeout=120s pcmk_monitor_timeout=120s pcmk_status_timeout=120s
>> Operations: monitor interval=60s (vmware_soap-monitor-interval-60s)
>> 
>> I need to manually perform a "resource cleanup vmware_soap" to put it online
>> again.
>> Is there any way to do this automatically?.
>> Is it possible to detect vSphere online again and enable stonith?.
>> 
>> Thanks.
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stonith stops after vSphere restart

2018-02-22 Thread Andrei Borzenkov
Stonith resource state should have no impact on actual stonith
operation. It only reflects whether monitor was successful or not and
serves as warning to administrator that something may be wrong. It
should automatically clear itself after failure-timeout has expired.

On Thu, Feb 22, 2018 at 1:58 PM,   wrote:
>
> Hi,
>
> I have a 2 node pacemaker cluster configured with the fence agent
> vmware_soap.
> Everything works fine until the vCenter is restarted. After that, stonith
> fails and stop.
>
> [root@node1 ~]# pcs status
> Cluster name: psqltest
> Stack: corosync
> Current DC: node2 (version 1.1.16-12.el7_4.7-94ff4df) - partition with
> quorum
> Last updated: Thu Feb 22 11:30:22 2018
> Last change: Mon Feb 19 09:28:37 2018 by root via crm_resource on node1
>
> 2 nodes configured
> 6 resources configured
>
> Online: [ node1 node2 ]
>
> Full list of resources:
>
> Master/Slave Set: ms_drbd_psqltest [drbd_psqltest]
> Masters: [ node1 ]
> Slaves: [ node2 ]
> Resource Group: pgsqltest
> psqltestfs (ocf::heartbeat:Filesystem): Started node1
> psqltest_vip (ocf::heartbeat:IPaddr2): Started node1
> postgresql-94 (ocf::heartbeat:pgsql): Started node1
> vmware_soap (stonith:fence_vmware_soap): Stopped
>
> Failed Actions:
> * vmware_soap_start_0 on node1 'unknown error' (1): call=38, status=Error,
> exitreason='none',
> last-rc-change='Thu Feb 22 10:55:46 2018', queued=0ms, exec=5374ms
> * vmware_soap_start_0 on node2 'unknown error' (1): call=56, status=Error,
> exitreason='none',
> last-rc-change='Thu Feb 22 10:55:39 2018', queued=0ms, exec=5479ms
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
>
> [root@node1 ~]# pcs stonith show --full
> Resource: vmware_soap (class=stonith type=fence_vmware_soap)
> Attributes: inet4_only=1 ipaddr=192.168.1.1 ipport=443 login=MYDOMAIN\User
> passwd=mypass pcmk_host_list=node1,node2 power_wait=3 ssl_insecure=1 action=
> pcmk_list_timeout=120s pcmk_monitor_timeout=120s pcmk_status_timeout=120s
> Operations: monitor interval=60s (vmware_soap-monitor-interval-60s)
>
>
> I need to manually perform a "resource cleanup vmware_soap" to put it online
> again.
> Is there any way to do this automatically?.
> Is it possible to detect vSphere online again and enable stonith?.
>
> Thanks.
>
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stonith stops after vSphere restart

2018-02-22 Thread Marek Grac
Hi,

On Thu, Feb 22, 2018 at 11:58 AM,  wrote:

>
> Hi,
>
> I have a 2 node pacemaker cluster configured with the fence agent
> vmware_soap.
> Everything works fine until the vCenter is restarted. After that, stonith
> fails and stop.
>

This is expected, as we run the 'monitor' action to find out if the fence
device is working. I assume it is not responding while vCenter is restarting.
If your fencing device fails then manual intervention makes sense, as you have
to have fencing working in order to prevent data corruption.
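
Once vCenter is reachable again, the agent can usually be exercised by hand as
a sanity check. A sketch only, reusing the values from the configuration quoted
below (option names can differ slightly between fence-agents versions):

  fence_vmware_soap --ip 192.168.1.1 --username 'MYDOMAIN\User' \
    --password mypass --ssl-insecure --action list

If that lists the virtual machines, a 'pcs resource cleanup' on the stonith
resource should bring it back to Started.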

m,


>
> [root@node1 ~]# pcs status
> Cluster name: psqltest
> Stack: corosync
> Current DC: node2 (version 1.1.16-12.el7_4.7-94ff4df) - partition with
> quorum
> Last updated: Thu Feb 22 11:30:22 2018
> Last change: Mon Feb 19 09:28:37 2018 by root via crm_resource on node1
>
> 2 nodes configured
> 6 resources configured
>
> Online: [ node1 node2 ]
>
> Full list of resources:
>
> Master/Slave Set: ms_drbd_psqltest [drbd_psqltest]
> Masters: [ node1 ]
> Slaves: [ node2 ]
> Resource Group: pgsqltest
> psqltestfs (ocf::heartbeat:Filesystem): Started node1
> psqltest_vip (ocf::heartbeat:IPaddr2): Started node1
> postgresql-94 (ocf::heartbeat:pgsql): Started node1
> vmware_soap (stonith:fence_vmware_soap): Stopped
>
> Failed Actions:
> * vmware_soap_start_0 on node1 'unknown error' (1): call=38, status=Error,
> exitreason='none',
> last-rc-change='Thu Feb 22 10:55:46 2018', queued=0ms, exec=5374ms
> * vmware_soap_start_0 on node2 'unknown error' (1): call=56, status=Error,
> exitreason='none',
> last-rc-change='Thu Feb 22 10:55:39 2018', queued=0ms, exec=5479ms
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
>
> [root@node1 ~]# pcs stonith show --full
> Resource: vmware_soap (class=stonith type=fence_vmware_soap)
> Attributes: inet4_only=1 ipaddr=192.168.1.1 ipport=443 login=MYDOMAIN\User
> passwd=mypass pcmk_host_list=node1,node2 power_wait=3 ssl_insecure=1
> action= pcmk_list_timeout=120s pcmk_monitor_timeout=120s
> pcmk_status_timeout=120s
> Operations: monitor interval=60s (vmware_soap-monitor-interval-60s)
>
>
> I need to manually perform a "resource cleanup vmware_soap" to put it
> online again.
> Is there any way to do this automatically?.
> Is it possible to detect vSphere online again and enable stonith?.
>
> Thanks.
>
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stonith hostname vs port vs plug

2017-07-31 Thread ArekW
> The "plug" should match the name used by the hypervisor, not the actual
host name (if they differ).
I understand the difference between plug and hostname. I don't clearly
understand which fence config is correct (I refer to pcs stonith describe
fence_...):

the same entry on every node:
pcmk_host_map="node1:Centos1;node2:Centos2"

or different entry at every node like this:

port=Centos1 (on node1)
port=Centos2 (on node2)
?
Thanks,
Regards

2017-07-31 14:28 GMT+02:00 Digimer :

> On 2017-07-31 03:18 AM, ArekW wrote:
> > Hi, I'm confused how to properly set stonith when a hostname is
> > different than port/plug name. I have 2 vms on vbox/vmware with
> > hostnames: node1, node2. The port's names are: Centos1, Centos2.
> > According to my understanding the stonith device must know which vm to
> > control (each other) so I set:
> > pcmk_host_map="node1:Centos1;node2:Centos2" and it seems to work well,
> > however documentation describes port as a decimal "port number"(?).
> > Would it be correct to use something like pcmk_host_list="node1
> > node2"? But how the fence device will combine the hostname with port
> > (or plug)? I presume that node1 must somehow know that node2's plug is
> > Centos2, otherwise It could reboot itself (?)
> > Thank you.
>
> The "plug" should match the name used by the hypervisor, not the actual
> host name (if they differ).
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of
> Einstein's brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stonith hostname vs port vs plug

2017-07-31 Thread Digimer
On 2017-07-31 03:18 AM, ArekW wrote:
> Hi, I'm confused how to properly set stonith when a hostname is
> different than port/plug name. I have 2 vms on vbox/vmware with
> hostnames: node1, node2. The port's names are: Centos1, Centos2.
> According to my understanding the stonith device must know which vm to
> control (each other) so I set:
> pcmk_host_map="node1:Centos1;node2:Centos2" and it seems to work well,
> however documentation describes port as a decimal "port number"(?).
> Would it be correct to use something like pcmk_host_list="node1
> node2"? But how the fence device will combine the hostname with port
> (or plug)? I presume that node1 must somehow know that node2's plug is
> Centos2, otherwise It could reboot itself (?)
> Thank you.

The "plug" should match the name used by the hypervisor, not the actual
host name (if they differ).
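
As a sketch of how that mapping is usually expressed (the agent name, address
and credentials below are assumptions, only the host map matters), a single
shared device can carry the hostname-to-plug translation itself:

  pcs stonith create fence_vc fence_vmware_soap \
    ipaddr=vcenter.example.com login=user passwd=secret ssl_insecure=1 \
    pcmk_host_map="node1:Centos1;node2:Centos2"

With pcmk_host_map in place the same definition is valid on every node, so
there is no need for a different port= entry per node.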


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein's brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] stonith disabled, but pacemaker tries to reboot

2017-07-20 Thread Ken Gaillot
On 07/20/2017 03:46 AM, Daniel.L wrote:
> Hi Pacemaker Users,
> 
> 
> We have a 2 node pacemaker cluster (v1.1.14).
> Stonith at this moment is disabled:
> 
> $ pcs property --all | grep stonith
> stonith-action: reboot
> stonith-enabled: false
> stonith-timeout: 60s
> stonith-watchdog-timeout: (null)
> 
> $ pcs property --all | grep fenc
> startup-fencing: true
> 
> 
> But when there is a network outage - it looks like pacemaker tries to
> restart the other node:
> 
> fence_pcmk[5739]: Requesting Pacemaker fence *node1* (reset)
> stonith-ng[31022]:   notice: Client stonith_admin.cman.xxx.
> wants to fence (reboot) '*node1*' with device '(any)'
> stonith-ng[31022]:   notice: Initiating remote operation reboot for
> *node1*: (0)
> stonith-ng[31022]:   notice: Couldn't find anyone to fence (reboot)
> *node1* with any device
> stonith-ng[31022]:   error: Operation reboot of *node1* by  for
> stonith_admin.cman.@xxx: No such device
> crmd[31026]:   notice: Peer *node1* was not terminated (reboot) by
>  for *node2*: No such device (ref=0) by
> client stonith_admin.cman.

stonith-enabled=false stops *Pacemaker* from requesting fencing, but it
doesn't stop external software from requesting fencing.

One hint in the logs is that the client starts with "stonith_admin"
which is the command-line tool that external apps can use to request
fencing.

Another hint is "fence_pcmk", which is not a Pacemaker fence agent, but
software that provides an interface to Pacemaker's fencing that CMAN can
understand. So, something asked CMAN to fence the node, and CMAN asked
Pacemaker to do it.

You'll have to figure out what requested it, and see whether there's a
way to disable fence requests in that app. DLM (used by clvmd and some
cluster filesystems) is a prime suspect, and I believe there's no way to
disable fencing inside it.

Of course, disabling fencing is a bad idea anyway :-)

> I've been looking into it for quite a while already, but to be honest I still
> don't understand this behavior...
> I would expect pacemaker not to try to reboot the other node if stonith is
> disabled...
> Can anyone help me understand this behavior? (and hopefully help to
> avoid those reboot attempts)
> 
> Many thanks in advance!
> 
> best regards
> daniel

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] stonith device locate on same host in active/passive cluster

2017-05-11 Thread Albert Weng
Hi Ken,

thank you for your comment.

I think this case can be closed; I used your suggested constraint and the
problem is resolved.

thanks a lot~~

On Thu, May 4, 2017 at 10:28 PM, Ken Gaillot  wrote:

> On 05/03/2017 09:04 PM, Albert Weng wrote:
> > Hi Marek,
> >
> > Thanks your reply.
> >
> > On Tue, May 2, 2017 at 5:15 PM, Marek Grac  > > wrote:
> >
> >
> >
> > On Tue, May 2, 2017 at 11:02 AM, Albert Weng  > > wrote:
> >
> >
> > Hi Marek,
> >
> > thanks for your quickly responding.
> >
> > According to you opinion, when i type "pcs status" then i saw
> > the following result of fence :
> > ipmi-fence-node1(stonith:fence_ipmilan):Started cluaterb
> > ipmi-fence-node2(stonith:fence_ipmilan):Started clusterb
> >
> > Does it means both ipmi stonith devices are working correctly?
> > (rest of resources can failover to another node correctly)
> >
> >
> > Yes, they are working correctly.
> >
> > When it becomes important to run fence agents to kill the other
> > node. It will be executed from the other node, so the fact where
> > fence agent resides currently is not important
> >
> > Does "started on node" means which node is controlling fence behavior?
> > even all fence agents and resources "started on same node", the cluster
> > fence behavior still work correctly?
> >
> >
> > Thanks a lot.
> >
> > m,
>
> Correct. Fencing is *executed* independently of where or even whether
> fence devices are running. The node that is "running" a fence device
> performs the recurring monitor on the device; that's the only real effect.
>
> > should i have to use location constraint to avoid stonith device
> > running on same node ?
> > # pcs constraint location ipmi-fence-node1 prefers clustera
> > # pcs constraint location ipmi-fence-node2 prefers clusterb
> >
> > thanks a lot
>
> It's a good idea, so that a node isn't monitoring its own fence device,
> but that's the only reason -- it doesn't affect whether or how the node
> can be fenced. I would configure it as an anti-location, e.g.
>
>pcs constraint location ipmi-fence-node1 avoids node1=100
>
> In a 2-node cluster, there's no real difference, but in a larger
> cluster, it's the simplest config. I wouldn't use INFINITY (there's no
> harm in a node monitoring its own fence device if it's the last node
> standing), but I would use a score high enough to outweigh any stickiness.
>
> > On Tue, May 2, 2017 at 4:25 PM, Marek Grac  > > wrote:
> >
> > Hi,
> >
> >
> >
> > On Tue, May 2, 2017 at 3:39 AM, Albert Weng
> > >
> wrote:
> >
> > Hi All,
> >
> > I have created active/passive pacemaker cluster on RHEL
> 7.
> >
> > here is my environment:
> > clustera : 192.168.11.1
> > clusterb : 192.168.11.2
> > clustera-ilo4 : 192.168.11.10
> > clusterb-ilo4 : 192.168.11.11
> >
> > both nodes are connected SAN storage for shared storage.
> >
> > i used the following cmd to create my stonith devices on
> > each node :
> > # pcs -f stonith_cfg stonith create ipmi-fence-node1
> > fence_ipmilan parms lanplus="ture"
> > pcmk_host_list="clustera" pcmk_host_check="static-list"
> > action="reboot" ipaddr="192.168.11.10"
> > login=adminsitrator passwd=1234322 op monitor
> interval=60s
> >
> > # pcs -f stonith_cfg stonith create ipmi-fence-node02
> > fence_ipmilan parms lanplus="true"
> > pcmk_host_list="clusterb" pcmk_host_check="static-list"
> > action="reboot" ipaddr="192.168.11.11" login=USERID
> > passwd=password op monitor interval=60s
> >
> > # pcs status
> > ipmi-fence-node1 clustera
> > ipmi-fence-node2 clusterb
> >
> > but when i failover to passive node, then i ran
> > # pcs status
> >
> > ipmi-fence-node1clusterb
> > ipmi-fence-node2clusterb
> >
> > why both fence device locate on the same node ?
> >
> >
> > When node 'clustera' is down, is there any place where
> > ipmi-fence-node* can be executed?
> >
> > If you are worrying that node can not self-fence itself you
> > are right. But if 'clustera' will become available then
> > attempt to fence clusterb will work as expected.
> >
> >   

Re: [ClusterLabs] stonith device locate on same host in active/passive cluster

2017-05-04 Thread Ken Gaillot
On 05/03/2017 09:04 PM, Albert Weng wrote:
> Hi Marek,
> 
> Thanks your reply.
> 
> On Tue, May 2, 2017 at 5:15 PM, Marek Grac  > wrote:
> 
> 
> 
> On Tue, May 2, 2017 at 11:02 AM, Albert Weng  > wrote:
> 
> 
> Hi Marek,
> 
> thanks for your quickly responding.
> 
> According to you opinion, when i type "pcs status" then i saw
> the following result of fence :
> ipmi-fence-node1(stonith:fence_ipmilan):Started cluaterb
> ipmi-fence-node2(stonith:fence_ipmilan):Started clusterb
> 
> Does it means both ipmi stonith devices are working correctly?
> (rest of resources can failover to another node correctly)
> 
> 
> Yes, they are working correctly. 
> 
> When it becomes important to run fence agents to kill the other
> node. It will be executed from the other node, so the fact where
> fence agent resides currently is not important
> 
> Does "started on node" means which node is controlling fence behavior?
> even all fence agents and resources "started on same node", the cluster
> fence behavior still work correctly?
>  
> 
> Thanks a lot.
> 
> m,

Correct. Fencing is *executed* independently of where or even whether
fence devices are running. The node that is "running" a fence device
performs the recurring monitor on the device; that's the only real effect.

> should i have to use location constraint to avoid stonith device
> running on same node ?
> # pcs constraint location ipmi-fence-node1 prefers clustera
> # pcs constraint location ipmi-fence-node2 prefers clusterb
> 
> thanks a lot

It's a good idea, so that a node isn't monitoring its own fence device,
but that's the only reason -- it doesn't affect whether or how the node
can be fenced. I would configure it as an anti-location, e.g.

   pcs constraint location ipmi-fence-node1 avoids node1=100

In a 2-node cluster, there's no real difference, but in a larger
cluster, it's the simplest config. I wouldn't use INFINITY (there's no
harm in a node monitoring its own fence device if it's the last node
standing), but I would use a score high enough to outweigh any stickiness.
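
A sketch with the node names used earlier in this thread (the score of 100 is
only an example) would be:

   pcs constraint location ipmi-fence-node1 avoids clustera=100
   pcs constraint location ipmi-fence-node2 avoids clusterb=100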

> On Tue, May 2, 2017 at 4:25 PM, Marek Grac  > wrote:
> 
> Hi,
> 
> 
> 
> On Tue, May 2, 2017 at 3:39 AM, Albert Weng
> > wrote:
> 
> Hi All,
> 
> I have created active/passive pacemaker cluster on RHEL 7.
> 
> here is my environment:
> clustera : 192.168.11.1
> clusterb : 192.168.11.2
> clustera-ilo4 : 192.168.11.10
> clusterb-ilo4 : 192.168.11.11
> 
> both nodes are connected SAN storage for shared storage.
> 
> i used the following cmd to create my stonith devices on
> each node :
> # pcs -f stonith_cfg stonith create ipmi-fence-node1
> fence_ipmilan parms lanplus="ture"
> pcmk_host_list="clustera" pcmk_host_check="static-list"
> action="reboot" ipaddr="192.168.11.10"
> login=adminsitrator passwd=1234322 op monitor interval=60s
> 
> # pcs -f stonith_cfg stonith create ipmi-fence-node02
> fence_ipmilan parms lanplus="true"
> pcmk_host_list="clusterb" pcmk_host_check="static-list"
> action="reboot" ipaddr="192.168.11.11" login=USERID
> passwd=password op monitor interval=60s
> 
> # pcs status
> ipmi-fence-node1 clustera
> ipmi-fence-node2 clusterb
> 
> but when i failover to passive node, then i ran
> # pcs status
> 
> ipmi-fence-node1clusterb
> ipmi-fence-node2clusterb
> 
> why both fence device locate on the same node ? 
> 
> 
> When node 'clustera' is down, is there any place where
> ipmi-fence-node* can be executed?
> 
> If you are worrying that node can not self-fence itself you
> are right. But if 'clustera' will become available then
> attempt to fence clusterb will work as expected.
> 
> m, 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> 
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> 
> Project Home: http://www.clusterlabs.org
> Getting started:
> 

Re: [ClusterLabs] stonith device locate on same host in active/passive cluster

2017-05-02 Thread Albert Weng
Hi Marek,

thanks for your quick response.

According to your opinion, when I type "pcs status" I see the following
result for the fence devices:
ipmi-fence-node1 (stonith:fence_ipmilan): Started clusterb
ipmi-fence-node2 (stonith:fence_ipmilan): Started clusterb

Does it mean both ipmi stonith devices are working correctly? (the rest of the
resources can fail over to another node correctly)

Should I use a location constraint to avoid the stonith devices running on
the same node?
# pcs constraint location ipmi-fence-node1 prefers clustera
# pcs constraint location ipmi-fence-node2 prefers clusterb

thanks a lot

On Tue, May 2, 2017 at 4:25 PM, Marek Grac  wrote:

> Hi,
>
>
>
> On Tue, May 2, 2017 at 3:39 AM, Albert Weng  wrote:
>
>> Hi All,
>>
>> I have created active/passive pacemaker cluster on RHEL 7.
>>
>> here is my environment:
>> clustera : 192.168.11.1
>> clusterb : 192.168.11.2
>> clustera-ilo4 : 192.168.11.10
>> clusterb-ilo4 : 192.168.11.11
>>
>> both nodes are connected SAN storage for shared storage.
>>
>> i used the following cmd to create my stonith devices on each node :
>> # pcs -f stonith_cfg stonith create ipmi-fence-node1 fence_ipmilan parms
>> lanplus="ture" pcmk_host_list="clustera" pcmk_host_check="static-list"
>> action="reboot" ipaddr="192.168.11.10" login=adminsitrator passwd=1234322
>> op monitor interval=60s
>>
>> # pcs -f stonith_cfg stonith create ipmi-fence-node02 fence_ipmilan parms
>> lanplus="true" pcmk_host_list="clusterb" pcmk_host_check="static-list"
>> action="reboot" ipaddr="192.168.11.11" login=USERID passwd=password op
>> monitor interval=60s
>>
>> # pcs status
>> ipmi-fence-node1 clustera
>> ipmi-fence-node2 clusterb
>>
>> but when i failover to passive node, then i ran
>> # pcs status
>>
>> ipmi-fence-node1clusterb
>> ipmi-fence-node2clusterb
>>
>> why both fence device locate on the same node ?
>>
>
> When node 'clustera' is down, is there any place where ipmi-fence-node*
> can be executed?
>
> If you are worrying that node can not self-fence itself you are right. But
> if 'clustera' will become available then attempt to fence clusterb will
> work as expected.
>
> m,
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>


-- 
Kind regards,
Albert Weng


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] stonith device locate on same host in active/passive cluster

2017-05-02 Thread Marek Grac
Hi,



On Tue, May 2, 2017 at 3:39 AM, Albert Weng  wrote:

> Hi All,
>
> I have created active/passive pacemaker cluster on RHEL 7.
>
> here is my environment:
> clustera : 192.168.11.1
> clusterb : 192.168.11.2
> clustera-ilo4 : 192.168.11.10
> clusterb-ilo4 : 192.168.11.11
>
> both nodes are connected SAN storage for shared storage.
>
> i used the following cmd to create my stonith devices on each node :
> # pcs -f stonith_cfg stonith create ipmi-fence-node1 fence_ipmilan parms
> lanplus="ture" pcmk_host_list="clustera" pcmk_host_check="static-list"
> action="reboot" ipaddr="192.168.11.10" login=adminsitrator passwd=1234322
> op monitor interval=60s
>
> # pcs -f stonith_cfg stonith create ipmi-fence-node02 fence_ipmilan parms
> lanplus="true" pcmk_host_list="clusterb" pcmk_host_check="static-list"
> action="reboot" ipaddr="192.168.11.11" login=USERID passwd=password op
> monitor interval=60s
>
> # pcs status
> ipmi-fence-node1 clustera
> ipmi-fence-node2 clusterb
>
> but when i failover to passive node, then i ran
> # pcs status
>
> ipmi-fence-node1clusterb
> ipmi-fence-node2clusterb
>
> why both fence device locate on the same node ?
>

When node 'clustera' is down, is there any place where ipmi-fence-node* can
be executed?

If you are worried that a node cannot fence itself, you are right. But if
'clustera' becomes available again, then an attempt to fence clusterb will
work as expected.

m,
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] stonith device locate on same host in active/passive cluster

2017-05-01 Thread Albert Weng
Hi All,

the following logs from corosync.log that might help.

Apr 25 10:29:32 [15334] gmlcdbw02pengine: info: native_print:
ipmi-fence-db01(stonith:fence_ipmilan):Started gmlcdbw01
Apr 25 10:29:32 [15334] gmlcdbw02pengine: info: native_print:
ipmi-fence-db02(stonith:fence_ipmilan):Started gmlcdbw02

Apr 25 10:29:32 [15334] gmlcdbw02pengine: info: RecurringOp:
 Start recurring monitor (60s) for ipmi-fence-db01 on gmlcdbw02
Apr 25 10:29:32 [15334] gmlcdbw02pengine:   notice: LogActions:
Moveipmi-fence-db01(Started gmlcdbw01 -> gmlcdbw02)
Apr 25 10:29:32 [15334] gmlcdbw02pengine: info: LogActions:
Leave   ipmi-fence-db02(Started gmlcdbw02)
Apr 25 10:29:32 [15335] gmlcdbw02   crmd:   notice: te_rsc_command:
Initiating action 11: stop ipmi-fence-db01_stop_0 on gmlcdbw01
Apr 25 10:29:32 [15330] gmlcdbw02cib: info: cib_perform_op:
+
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='ipmi-fence-db01']/lrm_rsc_op[@id='ipmi-fence-db01_last_0']:
@operation_key=ipmi-fence-db01_stop_0, @operation=stop,
@transition-key=11:2485:0:27a91aab-060a-4de9-80b1-18abeb7bd850,
@transition-magic=0:0;11:2485:0:27a91aab-060a-4de9-80b1-18abeb7bd850,
@call-id=75, @last-run=1493087372, @last-rc-change=1493087372, @exec-time=0
Apr 25 10:29:32 [15330] gmlcdbw02cib: info: cib_perform_op:
+
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='ipmi-fence-db01']/lrm_rsc_op[@id='ipmi-fence-db01_last_0']:
@operation_key=ipmi-fence-db01_stop_0, @operation=stop,
@transition-key=11:2485:0:27a91aab-060a-4de9-80b1-18abeb7bd850,
@transition-magic=0:0;11:2485:0:27a91aab-060a-4de9-80b1-18abeb7bd850,
@call-id=75, @last-run=1493087372, @last-rc-change=1493087372, @exec-time=0
Apr 25 10:29:32 [15330] gmlcdbw02cib: info: cib_perform_op:
+
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='ipmi-fence-db01']/lrm_rsc_op[@id='ipmi-fence-db01_last_0']:
@operation_key=ipmi-fence-db01_stop_0, @operation=stop,
@transition-key=11:2485:0:27a91aab-060a-4de9-80b1-18abeb7bd850,
@transition-magic=0:0;11:2485:0:27a91aab-060a-4de9-80b1-18abeb7bd850,
@call-id=75, @last-run=1493087372, @last-rc-change=1493087372, @exec-time=0
Apr 25 10:29:32 [15335] gmlcdbw02   crmd: info:
match_graph_event:Action ipmi-fence-db01_stop_0 (11) confirmed on
gmlcdbw01 (rc=0)
Apr 25 10:29:32 [15335] gmlcdbw02   crmd:   notice: te_rsc_command:
Initiating action 12: start ipmi-fence-db01_start_0 on gmlcdbw02 (local)
Apr 25 10:29:32 [15335] gmlcdbw02   crmd: info: do_lrm_rsc_op:
Performing key=12:2485:0:27a91aab-060a-4de9-80b1-18abeb7bd850
op=ipmi-fence-db01_start_0
Apr 25 10:29:32 [15332] gmlcdbw02   lrmd: info: log_execute:
executing - rsc:ipmi-fence-db01 action:start call_id:65
Apr 25 10:29:32 [15332] gmlcdbw02   lrmd: info: log_finished:
finished - rsc:ipmi-fence-db01 action:start call_id:65  exit-code:0
exec-time:45ms queue-time:0ms
Apr 25 10:29:33 [15335] gmlcdbw02   crmd:   notice:
process_lrm_event:Operation ipmi-fence-db01_start_0: ok
(node=gmlcdbw02, call=65, rc=0, cib-update=2571, confirmed=true)
Apr 25 10:29:33 [15330] gmlcdbw02cib: info: cib_perform_op:
+
/cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='ipmi-fence-db01']/lrm_rsc_op[@id='ipmi-fence-db01_last_0']:
@operation_key=ipmi-fence-db01_start_0, @operation=start,
@transition-key=12:2485:0:27a91aab-060a-4de9-80b1-18abeb7bd850,
@transition-magic=0:0;12:2485:0:27a91aab-060a-4de9-80b1-18abeb7bd850,
@call-id=65, @last-run=1493087372, @last-rc-change=1493087372, @exec-time=45
Apr 25 10:29:33 [15330] gmlcdbw02cib: info: cib_perform_op:
+
/cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='ipmi-fence-db01']/lrm_rsc_op[@id='ipmi-fence-db01_last_0']:
@operation_key=ipmi-fence-db01_start_0, @operation=start,
@transition-key=12:2485:0:27a91aab-060a-4de9-80b1-18abeb7bd850,
@transition-magic=0:0;12:2485:0:27a91aab-060a-4de9-80b1-18abeb7bd850,
@call-id=65, @last-run=1493087372, @last-rc-change=1493087372, @exec-time=45
Apr 25 10:29:33 [15330] gmlcdbw02cib: info: cib_perform_op:
+
/cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='ipmi-fence-db01']/lrm_rsc_op[@id='ipmi-fence-db01_last_0']:
@operation_key=ipmi-fence-db01_start_0, @operation=start,
@transition-key=12:2485:0:27a91aab-060a-4de9-80b1-18abeb7bd850,
@transition-magic=0:0;12:2485:0:27a91aab-060a-4de9-80b1-18abeb7bd850,
@call-id=65, @last-run=1493087372, @last-rc-change=1493087372, @exec-time=45
Apr 25 10:29:33 [15335] gmlcdbw02   crmd: info:
match_graph_event:Action ipmi-fence-db01_start_0 (12) confirmed on
gmlcdbw02 (rc=0)
Apr 25 10:29:33 [15335] gmlcdbw02   crmd:   notice: te_rsc_command:
Initiating action 13: monitor ipmi-fence-db01_monitor_6 on gmlcdbw02
(local)
Apr 25 10:29:33 [15335] gmlcdbw02   crmd: info: 

Re: [ClusterLabs] STONITH not communicated back to initiator until token expires

2017-04-26 Thread Chris Walker
Just to close the loop on this issue, discussions with Redhat have
confirmed that this behavior is as designed, that all membership
changes must first be realized by the Corosync layer.  So the full
trajectory of a STONITH action in response to, for example, a failed
stop operation looks like:

crmd requests STONITH
stonith-ng successfully STONITHs node

corosync communicates membership change to stonith-ng
stonith-ng communicates successful STONITH to crmd
cluster reacts to down node

Thanks,
Chris

On Wed, Apr 5, 2017 at 5:07 PM, Chris Walker
 wrote:
> Thanks very much for your reply Ken.  Unfortunately, the same delay
> happens when the DC is not the node that's being STONITHed. In either
> case, the failure looks the same to me: the stonithd instance that
> does the STONITH operation does not pass back the result to the
> original stonithd, so remote_op_done can't be invoked to send the
> result to the original initiator (in this case, crmd).
>
> Sorry, this problem is probably too in-depth for the mailing list.
> I've created RH ticket 01812422 for this issue (seems stuck in L1/L2
> support at the moment :( )
>
> Thanks again,
> Chris
>
>
>
> On Tue, Apr 4, 2017 at 12:47 PM, Ken Gaillot  wrote:
>> On 03/13/2017 10:43 PM, Chris Walker wrote:
>>> Thanks for your reply Digimer.
>>>
>>> On Mon, Mar 13, 2017 at 1:35 PM, Digimer >> > wrote:
>>>
>>> On 13/03/17 12:07 PM, Chris Walker wrote:
>>> > Hello,
>>> >
>>> > On our two-node EL7 cluster (pacemaker: 1.1.15-11.el7_3.4; corosync:
>>> > 2.4.0-4; libqb: 1.0-1),
>>> > it looks like successful STONITH operations are not communicated from
>>> > stonith-ng back to theinitiator (in this case, crmd) until the 
>>> STONITHed
>>> > node is removed from the cluster when
>>> > Corosync notices that it's gone (i.e., after the token timeout).
>>>
>>> Others might have more useful info, but my understanding of a lost node
>>> sequence is this;
>>>
>>> 1. Node stops responding, corosync declares it lost after token timeout
>>> 2. Corosync reforms the cluster with remaining node(s), checks if it is
>>> quorate (always true in 2-node)
>>> 3. Corosync informs Pacemaker of the membership change.
>>> 4. Pacemaker invokes stonith, waits for the fence agent to return
>>> "success" (exit code of the agent as per the FenceAgentAPI
>>> [https://docs.pagure.org/ClusterLabs.fence-agents/FenceAgentAPI.md]
>>> ).
>>> If
>>> the method fails, it moves on to the next method. If all methods fail,
>>> it goes back to the first method and tries again, looping indefinitely.
>>>
>>>
>>> That's roughly my understanding as well for the case when a node
>>> suddenly leaves the cluster (e.g., poweroff), and this case is working
>>> as expected for me.  I'm seeing delays when a node is marked for STONITH
>>> while it's still up (e.g., after a stop operation fails).  In this case,
>>> what I expect to see is something like:
>>> 1.  crmd requests that stonith-ng fence the node
>>> 2.  stonith-ng (might be a different stonith-ng) fences the node and
>>> sends a message that it has succeeded
>>> 3.  stonith-ng (the original from step 1) receives this message and
>>> communicates back to crmd that the node has been fenced
>>>
>>> but what I'm seeing is
>>> 1.  crmd requests that stonith-ng fence the node
>>> 2.  stonith-ng fences the node and sends a message saying that it has
>>> succeeded
>>> 3.  nobody hears this message
>>> 4.  Corosync eventually realizes that the fenced node is no longer part
>>> of the config and broadcasts a config change
>>> 5.  stonith-ng finishes the STONITH operation that was started earlier
>>> and communicates back to crmd that the node has been STONITHed
>>
>> In your attached log, bug1 was DC at the time of the fencing, and bug0
>> takes over DC after the fencing. This is what I expect is happening
>> (logs from bug1 would help confirm):
>>
>> 1. crmd on the DC (bug1) runs pengine which sees the stop failure and
>> schedules fencing (of bug1)
>>
>> 2. stonithd on bug1 sends a query to all nodes asking who can fence bug1
>>
>> 3. Each node replies, and stonithd on bug1 chooses bug0 to execute the
>> fencing
>>
>> 4. stonithd on bug0 fences bug1. At this point, it would normally report
>> the result to the DC ... but that happens to be bug1.
>>
>> 5. Once crmd on bug0 takes over DC, it can decide that the fencing
>> succeeded, but it can't take over DC until it sees that the old DC is
>> gone, which takes a while because of your long token timeout. So, this
>> is where the delay is coming in.
>>
>> I'll have to think about whether we can improve this, but I don't think
>> it would be easy. There are complications if for example a fencing
>> topology is used, such that the result being reported in step 4 might
>> not be 

Re: [ClusterLabs] STONITH not communicated back to initiator until token expires

2017-04-04 Thread Ken Gaillot
On 03/13/2017 10:43 PM, Chris Walker wrote:
> Thanks for your reply Digimer.
> 
> On Mon, Mar 13, 2017 at 1:35 PM, Digimer  > wrote:
> 
> On 13/03/17 12:07 PM, Chris Walker wrote:
> > Hello,
> >
> > On our two-node EL7 cluster (pacemaker: 1.1.15-11.el7_3.4; corosync:
> > 2.4.0-4; libqb: 1.0-1),
> > it looks like successful STONITH operations are not communicated from
> > stonith-ng back to theinitiator (in this case, crmd) until the STONITHed
> > node is removed from the cluster when
> > Corosync notices that it's gone (i.e., after the token timeout).
> 
> Others might have more useful info, but my understanding of a lost node
> sequence is this;
> 
> 1. Node stops responding, corosync declares it lost after token timeout
> 2. Corosync reforms the cluster with remaining node(s), checks if it is
> quorate (always true in 2-node)
> 3. Corosync informs Pacemaker of the membership change.
> 4. Pacemaker invokes stonith, waits for the fence agent to return
> "success" (exit code of the agent as per the FenceAgentAPI
> [https://docs.pagure.org/ClusterLabs.fence-agents/FenceAgentAPI.md]
> ).
> If
> the method fails, it moves on to the next method. If all methods fail,
> it goes back to the first method and tries again, looping indefinitely.
> 
> 
> That's roughly my understanding as well for the case when a node
> suddenly leaves the cluster (e.g., poweroff), and this case is working
> as expected for me.  I'm seeing delays when a node is marked for STONITH
> while it's still up (e.g., after a stop operation fails).  In this case,
> what I expect to see is something like:
> 1.  crmd requests that stonith-ng fence the node
> 2.  stonith-ng (might be a different stonith-ng) fences the node and
> sends a message that it has succeeded
> 3.  stonith-ng (the original from step 1) receives this message and
> communicates back to crmd that the node has been fenced
> 
> but what I'm seeing is
> 1.  crmd requests that stonith-ng fence the node
> 2.  stonith-ng fences the node and sends a message saying that it has
> succeeded
> 3.  nobody hears this message
> 4.  Corosync eventually realizes that the fenced node is no longer part
> of the config and broadcasts a config change
> 5.  stonith-ng finishes the STONITH operation that was started earlier
> and communicates back to crmd that the node has been STONITHed

In your attached log, bug1 was DC at the time of the fencing, and bug0
takes over DC after the fencing. This is what I expect is happening
(logs from bug1 would help confirm):

1. crmd on the DC (bug1) runs pengine which sees the stop failure and
schedules fencing (of bug1)

2. stonithd on bug1 sends a query to all nodes asking who can fence bug1

3. Each node replies, and stonithd on bug1 chooses bug0 to execute the
fencing

4. stonithd on bug0 fences bug1. At this point, it would normally report
the result to the DC ... but that happens to be bug1.

5. Once crmd on bug0 takes over DC, it can decide that the fencing
succeeded, but it can't take over DC until it sees that the old DC is
gone, which takes a while because of your long token timeout. So, this
is where the delay is coming in.

I'll have to think about whether we can improve this, but I don't think
it would be easy. There are complications if for example a fencing
topology is used, such that the result being reported in step 4 might
not be the entire result.

> I'm less convinced that the sending of the STONITH notify in step 2 is
> at fault; it also seems possible that a callback is not being run when
> it should be.
> 
> 
> The Pacemaker configuration is not important (I've seen this happen on
> our production clusters and on a small sandbox), but the config is:
> 
> primitive bug0-stonith stonith:fence_ipmilan \
> params pcmk_host_list=bug0 ipaddr=bug0-ipmi action=off
> login=admin passwd=admin \
> meta target-role=Started
> primitive bug1-stonith stonith:fence_ipmilan \
> params pcmk_host_list=bug1 ipaddr=bug1-ipmi action=off
> login=admin passwd=admin \
> meta target-role=Started
> primitive prm-snmp-heartbeat snmptrap_heartbeat \
> params snmphost=bug0 community=public \
> op monitor interval=10 timeout=300 \
> op start timeout=300 interval=0 \
> op stop timeout=300 interval=0
> clone cln-snmp-heartbeat prm-snmp-heartbeat \
> meta interleave=true globally-unique=false ordered=false
> notify=false
> location bug0-stonith-loc bug0-stonith -inf: bug0
> location bug1-stonith-loc bug1-stonith -inf: bug1
> 
> The corosync config might be more interesting:
> 
> totem {
> version: 2
> crypto_cipher: none
> crypto_hash: none
> secauth: off
> rrp_mode: passive
> transport: udpu
> token: 24
> consensus: 1000
> 
> interface {
> ringnumber 0
>  

Re: [ClusterLabs] Stonith

2017-03-31 Thread Alexander Markov

Kristoffer Grönlund writes:


The only solution I know which allows for a configuration like this is
using separate clusters in each data center, and using booth for
transferring ticket ownership between them. Booth requires a data
center-level quorum (meaning at least 3 locations), though the third
location can be just a small daemon without an actual cluster, and can
run in a public cloud or similar for example.


Looks like it's really impossible to solve the situation without an arbiter
(3rd party).

Thank you, guys.

--
Regards,
Alexander

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stonith

2017-03-30 Thread Kristoffer Grönlund
Alexander Markov  writes:

> Hello, Kristoffer
>
>> Did you test failover through pacemaker itself?
>
> Yes, I did, no problems here.
>
>> However: Am I understanding it correctly that you have one node in each
>> data center, and a stonith device in each data center?
>
> Yes.
>
>> If the
>> data center is lost, the stonith device for the node in that data 
>> center
>> would also be lost and thus not able to fence.
>
> Exactly what happens!
>
>> In such a hardware configuration, only a poison pill solution like SBD
>> could work, I think.
>
> I've got no shared storage here. Every datacenter has its own storage 
> and they have replication on top (similar to drbd). I can organize a 
> cross-shared solution though if it help, but don't see how.

The only solution I know which allows for a configuration like this is
using separate clusters in each data center, and using booth for
transferring ticket ownership between them. Booth requires a data
center-level quorum (meaning at least 3 locations), though the third
location can be just a small daemon without an actual cluster, and can
run in a public cloud or similar for example.

Cheers,
Kristoffer

>
>> --
>> Regards,
>> Alexander
>
>

-- 
// Kristoffer Grönlund
// kgronl...@suse.com

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stonith

2017-03-30 Thread Alexander Markov

Hello, Kristoffer


Did you test failover through pacemaker itself?


Yes, I did, no problems here.


However: Am I understanding it correctly that you have one node in each
data center, and a stonith device in each data center?


Yes.


If the
data center is lost, the stonith device for the node in that data 
center

would also be lost and thus not able to fence.


Exactly what happens!


In such a hardware configuration, only a poison pill solution like SBD
could work, I think.


I've got no shared storage here. Every datacenter has its own storage
and they have replication on top (similar to drbd). I can organize a
cross-shared solution if it helps, but I don't see how.



--
Regards,
Alexander



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stonith

2017-03-30 Thread Kristoffer Grönlund
Alexander Markov  writes:

> Hello guys,
>
> it looks like I miss something obvious, but I just don't get what has 
> happened.
>
> I've got a number of stonith-enabled clusters within my big POWER boxes. 
> My stonith devices are two HMC (hardware management consoles) - separate 
> servers from IBM that can reboot separate LPARs (logical partitions) 
> within POWER boxes - one per every datacenter.
>
> So my definition for stonith devices was pretty straightforward:
>
> primitive st_dc2_hmc stonith:ibmhmc \
> params ipaddr=10.1.2.9
> primitive st_dc1_hmc stonith:ibmhmc \
> params ipaddr=10.1.2.8
> clone cl_st_dc2_hmc st_dc2_hmc
> clone cl_st_dc1_hmc st_dc1_hmc
>
> Everything was ok when we tested failover. But today upon power outage 

Did you test failover through pacemaker itself?

Otherwise, the logs for the attempted stonith should reveal more about
how Pacemaker tried to call the stonith device, and what went wrong.

However: Am I understanding it correctly that you have one node in each
data center, and a stonith device in each data center? That doesn't
sound like a setup that can recover from data center failure: If the
data center is lost, the stonith device for the node in that data center
would also be lost and thus not able to fence.

In such a hardware configuration, only a poison pill solution like SBD
could work, I think.
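
For reference, a disk-based SBD setup would look roughly like this (a sketch
only; the device path is a placeholder, and it assumes a small LUN reachable
from both nodes plus a working watchdog device):

# once, initialize the device:
sbd -d /dev/disk/by-id/my-shared-sbd-lun create

# on each node, /etc/sysconfig/sbd:
SBD_DEVICE="/dev/disk/by-id/my-shared-sbd-lun"
SBD_WATCHDOG_DEV="/dev/watchdog"

# in the cluster configuration (crm shell):
primitive stonith-sbd stonith:external/sbd
property stonith-enabled=true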

Cheers,
Kristoffer

> we lost one DC completely. Shortly after that cluster just literally 
> hanged itself upong trying to reboot nonexistent node. No failover 
> occured. Nonexistent node was marked OFFLINE UNCLEAN and resources were 
> marked "Started UNCLEAN" on nonexistent node.
>
> UNCLEAN seems to flag a problems with stonith configuration. So my 
> question is: how to avoid such behaviour?
>
> Thank you!
>
> -- 
> Regards,
> Alexander
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>

-- 
// Kristoffer Grönlund
// kgronl...@suse.com

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] stonith in dual HMC environment

2017-03-28 Thread Dejan Muhamedagic
On Tue, Mar 28, 2017 at 04:20:12PM +0300, Alexander Markov wrote:
> Hello, Dejan,
> 
> >Why? I don't have a test system right now, but for instance this
> >should work:
> >
> >$ stonith -t ibmhmc ipaddr=10.1.2.9 -lS
> >$ stonith -t ibmhmc ipaddr=10.1.2.9 -T reset {nodename}
> 
> Ah, I see. Everything (including stonith methods, fencing and failover)
> works just fine under normal circumstances. Sorry if I wasn't clear about
> that. The problem occurs only when I have one datacenter (i.e. one IBM
> machine and one HMC) lost due to power outage.
> 
> For example:
> test01:~ # stonith -t ibmhmc ipaddr=10.1.2.8 -lS | wc -l
> info: ibmhmc device OK.
> 39
> test01:~ # stonith -t ibmhmc ipaddr=10.1.2.9 -lS | wc -l
> info: ibmhmc device OK.
> 39
> 
> As I had said stonith device can see and manage all the cluster nodes.

That's great :)

> >If so, then your configuration does not appear to be correct. If
> >both are capable of managing all nodes then you should tell
> >pacemaker about it.
> 
> Thanks for the hint. But if stonith device return node list, isn't it
> obvious for cluster that it can manage those nodes?

Did you try that? Just drop the location constraints and see if
it works. The pacemaker should actually keep the list of resources
(stonith) capable of managing the node.

> Could you please be more
> precise about what you refer to? I currently changed configuration to two
> fencing levels (one per HMC) but still don't think I get an idea here.
> 
> >Survived node, running stonith resource for dead node tries to
> >contact ipmi device (which is also dead). How does cluster understand that
> >lost node is really dead and it's not just a network issue?
> >
> >It cannot.
> 
> How do people then actually solve the problem of two node metro cluster?

That depends, but if you have a communication channel for stonith
devices which is _independent_ of the cluster communication then
you should be OK. Of course, a fencing device which goes down
together with its node is of no use, but that doesn't seem to be
the case here.

> I mean, I know one option: stonith-enabled=false, but it doesn't seem right
> for me.

Certainly not.

Thanks,

Dejan

> 
> Thank you.
> 
> Regards,
> Alexander Markov
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] stonith in dual HMC environment

2017-03-28 Thread Ken Gaillot
On 03/28/2017 08:20 AM, Alexander Markov wrote:
> Hello, Dejan,
> 
>> Why? I don't have a test system right now, but for instance this
>> should work:
>>
>> $ stonith -t ibmhmc ipaddr=10.1.2.9 -lS
>> $ stonith -t ibmhmc ipaddr=10.1.2.9 -T reset {nodename}
> 
> Ah, I see. Everything (including stonith methods, fencing and failover)
> works just fine under normal circumstances. Sorry if I wasn't clear
> about that. The problem occurs only when I have one datacenter (i.e. one
> IBM machine and one HMC) lost due to power outage.

If the datacenters are completely separate, you might want to take a
look at booth. With booth, you set up a separate cluster at each
datacenter, and booth coordinates which one can host resources. Each
datacenter must have its own self-sufficient cluster with its own
fencing, but one site does not need to be able to fence the other.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm139683855002656
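
For reference, a minimal booth setup looks roughly like this (a sketch only;
all addresses and the ticket name are placeholders, and the arbitrator is a
third location running only boothd):

# /etc/booth/booth.conf, identical on both sites and the arbitrator
transport = UDP
port = 9929
site = 192.0.2.10
site = 198.51.100.10
arbitrator = 203.0.113.10
ticket = "ticket-sap"

# in each cluster, tie the resources to the ticket (crm shell syntax)
rsc_ticket ticket-sap-req ticket-sap: g_sap loss-policy=stop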

> 
> For example:
> test01:~ # stonith -t ibmhmc ipaddr=10.1.2.8 -lS | wc -l
> info: ibmhmc device OK.
> 39
> test01:~ # stonith -t ibmhmc ipaddr=10.1.2.9 -lS | wc -l
> info: ibmhmc device OK.
> 39
> 
> As I had said stonith device can see and manage all the cluster nodes.
> 
>> If so, then your configuration does not appear to be correct. If
>> both are capable of managing all nodes then you should tell
>> pacemaker about it.
> 
> Thanks for the hint. But if stonith device return node list, isn't it
> obvious for cluster that it can manage those nodes? Could you please be
> more precise about what you refer to? I currently changed configuration
> to two fencing levels (one per HMC) but still don't think I get an idea
> here.

I believe Dejan is referring to fencing topology (levels). That would be
preferable to booth if the datacenters are physically close, and even if
one fence device fails, the other can still function.

In this case you'd probably want level 1 = the main fence device, and
level 2 = the fence device to use if the main device fails.

A common implementation (which Digimer uses to great effect) is to use
IPMI as level 1 and an intelligent power switch as level 2. If your
second device can function regardless of what hosts are up or down, you
can do something similar.
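
In crm shell syntax a fencing topology for this pair might look like (a sketch
only; which HMC to list first for each node is an assumption -- put whichever
one you consider "main" for that node first):

fencing_topology \
    crmapp01: st_hq_hmc st_ch_hmc \
    crmapp02: st_ch_hmc st_hq_hmc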

> 
>> Survived node, running stonith resource for dead node tries to
>> contact ipmi device (which is also dead). How does cluster understand
>> that
>> lost node is really dead and it's not just a network issue?
>>
>> It cannot.

And it will be unable to recover resources that were running on the
questionable partition.

> 
> How do people then actually solve the problem of two node metro cluster?
> I mean, I know one option: stonith-enabled=false, but it doesn't seem
> right for me.
> 
> Thank you.
> 
> Regards,
> Alexander Markov

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] stonith in dual HMC environment

2017-03-28 Thread Alexander Markov

Hello, Dejan,


Why? I don't have a test system right now, but for instance this
should work:

$ stonith -t ibmhmc ipaddr=10.1.2.9 -lS
$ stonith -t ibmhmc ipaddr=10.1.2.9 -T reset {nodename}


Ah, I see. Everything (including stonith methods, fencing and failover) 
works just fine under normal circumstances. Sorry if I wasn't clear 
about that. The problem occurs only when I have one datacenter (i.e. one 
IBM machine and one HMC) lost due to power outage.


For example:
test01:~ # stonith -t ibmhmc ipaddr=10.1.2.8 -lS | wc -l
info: ibmhmc device OK.
39
test01:~ # stonith -t ibmhmc ipaddr=10.1.2.9 -lS | wc -l
info: ibmhmc device OK.
39

As I had said stonith device can see and manage all the cluster nodes.


If so, then your configuration does not appear to be correct. If
both are capable of managing all nodes then you should tell
pacemaker about it.


Thanks for the hint. But if a stonith device returns a node list, isn't it
obvious to the cluster that it can manage those nodes? Could you please be
more precise about what you refer to? I have currently changed the configuration
to two fencing levels (one per HMC) but still don't think I get the idea
here.



Survived node, running stonith resource for dead node tries to
contact ipmi device (which is also dead). How does cluster understand 
that

lost node is really dead and it's not just a network issue?

It cannot.


How do people then actually solve the problem of a two-node metro cluster?
I mean, I know one option: stonith-enabled=false, but it doesn't seem
right to me.


Thank you.

Regards,
Alexander Markov


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] stonith in dual HMC environment

2017-03-28 Thread Dejan Muhamedagic
On Mon, Mar 27, 2017 at 01:17:31PM +0300, Alexander Markov wrote:
> Hello, Dejan,
> 
> 
> >The first thing I'd try is making sure you can fence each node from the
> >command line by manually running the fence agent. I'm not sure how to do
> >that for the "stonith:" type agents.
> >
> >There's a program stonith(8). It's easy to replicate the
> >configuration on the command line.
> 
> Unfortunately, it is not.

Why? I don't have a test system right now, but for instance this
should work:

$ stonith -t ibmhmc ipaddr=10.1.2.9 -lS
$ stonith -t ibmhmc ipaddr=10.1.2.9 -T reset {nodename}

Read the examples in the man page:

$ man stonith

Check also the documentation of your agent:

$ stonith -t ibmhmc -h
$ stonith -t ibmhmc -n

> The landscape I refer to is similar to VMWare. We use cluster for virtual
> machines (LPARs) and everything works OK but the real pain occurs when whole
> host system is down. Keeping in mind that it's actually used now in
> production, I just can't afford to turn it off for test reason.

Yes, I understand. However, I was just talking about how to use
the stonith agents and how to do the testing outside of
pacemaker.

> >Stonith agents are to be queried for the list of nodes they can
> >manage. It's part of the interface. Some agents can figure that
> >out by themself and some need a parameter defining the node list.
> 
> And this is just the place I'm stuck. I've got two stonith devices (ibmhmc)
> for redundancy. Both of them are capable to manage every node.

If so, then your configuration does not appear to be correct. If
both are capable of managing all nodes then you should tell
pacemaker about it. Digimer had a fairly extensive documentation
on how to configure complex fencing configurations. You can also
check with your vendor's documentation.

> The problem starts when
> 
> 1) one stonith device is completely lost and inaccessible (due to power
> outage in datacenter)
> 2) survived stonith device cannot access nor cluster node neither hosting
> system (in VMWare terms) for this cluster node, for both of them are also
> lost due to power outage.

Both lost? What remained? Why do you mention vmware? I thought
that your nodes are LPARs.

> What is the correct solution for this situation?
> 
> >Well, this used to be a standard way to configure one kind of
> >stonith resources, one common representative being ipmi, and
> >served exactly the purpose of restricting the stonith resource
> >from being enabled ("running") on a node which this resource
> >manages.
> 
> Unfortunately, there's no such thing as ipmi in IBM Power boxes.

I mentioned ipmi as an example, not that it has anything to do
with your setup.

> But it
> triggers interesting question for me: if both one node and its complementary
> ipmi device are lost (due to power outage) - what's happening with a
> cluster?

The cluster gets stuck trying to fence the node. Typically this
would render your cluster unusable. There are some IPMI devices
which have a battery to allow for some extra time to manage the
host.

> Survived node, running stonith resource for dead node tries to
> contact ipmi device (which is also dead). How does cluster understand that
> lost node is really dead and it's not just a network issue?

It cannot.

Thanks,

Dejan

> 
> Thank you.
> 
> -- 
> Regards,
> Alexander Markov
> +79104531955
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] stonith in dual HMC environment

2017-03-27 Thread Alexander Markov

Hello, Dejan,



The first thing I'd try is making sure you can fence each node from the
command line by manually running the fence agent. I'm not sure how to 
do

that for the "stonith:" type agents.

There's a program stonith(8). It's easy to replicate the
configuration on the command line.


Unfortunately, it is not.

The landscape I refer to is similar to VMWare. We use the cluster for
virtual machines (LPARs) and everything works OK, but the real pain
occurs when the whole host system is down. Keeping in mind that it's
actually used in production now, I just can't afford to turn it off for
testing.




Stonith agents are to be queried for the list of nodes they can
manage. It's part of the interface. Some agents can figure that
out by themself and some need a parameter defining the node list.


And this is just where I'm stuck. I've got two stonith devices
(ibmhmc) for redundancy. Both of them are capable of managing every node.
The problem starts when


1) one stonith device is completely lost and inaccessible (due to a power
outage in the datacenter)
2) the surviving stonith device can access neither the cluster node nor the
hosting system (in VMWare terms) for that cluster node, because both of them
are also lost due to the power outage.


What is the correct solution for this situation?


Well, this used to be a standard way to configure one kind of
stonith resources, one common representative being ipmi, and
served exactly the purpose of restricting the stonith resource
from being enabled ("running") on a node which this resource
manages.


Unfortunately, there's no such thing as ipmi in IBM Power boxes. But it
triggers an interesting question for me: if both a node and its
complementary ipmi device are lost (due to a power outage), what
happens to the cluster? The surviving node, running the stonith resource
for the dead node, tries to contact the ipmi device (which is also dead).
How does the cluster understand that the lost node is really dead and it's
not just a network issue?


Thank you.

--
Regards,
Alexander Markov
+79104531955

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] stonith in dual HMC environment

2017-03-24 Thread Ken Gaillot
On 03/22/2017 09:42 AM, Alexander Markov wrote:
> 
>> Please share your config along with the logs from the nodes that were
>> effected.
> 
> I'm starting to think it's not about how to define stonith resources. If
> the whole box is down with all the logical partitions defined, then HMC
> cannot define if LPAR (partition) is really dead or just inaccessible.
> This leads to UNCLEAN OFFLINE node status and pacemaker refusal to do
> anything until it's resolved. Am I right? Anyway, the simples pacemaker
> config from my partitions is below.

Yes, it looks like you are correct. The fence agent is returning an
error when pacemaker tries to use it to reboot crmapp02. From the stderr
in the logs, the message is "ssh: connect to host 10.1.2.9 port 22: No
route to host".

The first thing I'd try is making sure you can fence each node from the
command line by manually running the fence agent. I'm not sure how to do
that for the "stonith:" type agents.

Once that's working, make sure the cluster can do the same, by manually
running "stonith_admin -B $NODE" for each $NODE.

> 
> primitive sap_ASCS SAPInstance \
> params InstanceName=CAP_ASCS01_crmapp \
> op monitor timeout=60 interval=120 depth=0
> primitive sap_D00 SAPInstance \
> params InstanceName=CAP_D00_crmapp \
> op monitor timeout=60 interval=120 depth=0
> primitive sap_ip IPaddr2 \
> params ip=10.1.12.2 nic=eth0 cidr_netmask=24

> primitive st_ch_hmc stonith:ibmhmc \
> params ipaddr=10.1.2.9 \
> op start interval=0 timeout=300
> primitive st_hq_hmc stonith:ibmhmc \
> params ipaddr=10.1.2.8 \
> op start interval=0 timeout=300

I see you have two stonith devices defined, but they don't specify which
nodes they can fence -- pacemaker will assume that either device can be
used to fence either node.

> group g_sap sap_ip sap_ASCS sap_D00 \
> meta target-role=Started

> location l_ch_hq_hmc st_ch_hmc -inf: crmapp01
> location l_st_hq_hmc st_hq_hmc -inf: crmapp02

These constraints restrict which node monitors which device, not which
node the device can fence.

Assuming st_ch_hmc is intended to fence crmapp01, this will make sure
that crmapp02 monitors that device -- but you also want something like
pcmk_host_list=crmapp01 in the device configuration.
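
In crm shell syntax that would be something like (a sketch only; I'm assuming
st_ch_hmc is the device meant to fence crmapp01 and st_hq_hmc the one for
crmapp02 -- swap the host lists if it's the other way around):

primitive st_ch_hmc stonith:ibmhmc \
    params ipaddr=10.1.2.9 pcmk_host_list=crmapp01 \
    op start interval=0 timeout=300
primitive st_hq_hmc stonith:ibmhmc \
    params ipaddr=10.1.2.8 pcmk_host_list=crmapp02 \
    op start interval=0 timeout=300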

> location prefer_node_1 g_sap 100: crmapp01
> property cib-bootstrap-options: \
> stonith-enabled=true \
> no-quorum-policy=ignore \
> placement-strategy=balanced \
> expected-quorum-votes=2 \
> dc-version=1.1.12-f47ea56 \
> cluster-infrastructure="classic openais (with plugin)" \
> last-lrm-refresh=1490009096 \
> maintenance-mode=false
> rsc_defaults rsc-options: \
> resource-stickiness=200 \
> migration-threshold=3
> op_defaults op-options: \
> timeout=600 \
> record-pending=true
> 
> Logs are pretty much going in circle: stonith cannot reset logical
> partition via HMC, node stays unclean offline, resources are shown to
> stay on node that is down.
> 
> 
> stonith-ng:error: log_operation:Operation 'reboot' [6942] (call
> 6 from crmd.4568) for host 'crmapp02' with device 'st_ch_hmc:0'
> Trying: st_ch_hmc:0
> stonith-ng:  warning: log_operation:st_ch_hmc:0:6942 [ Performing:
> stonith -t ibmhmc -T reset crmapp02 ]
> stonith-ng:  warning: log_operation:st_ch_hmc:0:6942 [ failed:
> crmapp02 3 ]
> stonith-ng: info: internal_stonith_action_execute:  Attempt 2 to
> execute fence_legacy (reboot). remaining timeout is 59
> stonith-ng: info: update_remaining_timeout: Attempted to
> execute agent fence_legacy (reboot) the maximum number of times (2)
> 
> stonith-ng:error: log_operation:Operation 'reboot' [6955] (call
> 6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc' re
> Trying: st_hq_hmc
> stonith-ng:  warning: log_operation:st_hq_hmc:6955 [ Performing:
> stonith -t ibmhmc -T reset crmapp02 ]
> stonith-ng:  warning: log_operation:st_hq_hmc:6955 [ failed:
> crmapp02 8 ]
> stonith-ng: info: internal_stonith_action_execute:  Attempt 2 to
> execute fence_legacy (reboot). remaining timeout is 60
> stonith-ng: info: update_remaining_timeout: Attempted to
> execute agent fence_legacy (reboot) the maximum number of times (2)
> 
> stonith-ng:error: log_operation:Operation 'reboot' [6976] (call
> 6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc:0'
> 
> stonith-ng:  warning: log_operation:st_hq_hmc:0:6976 [ Performing:
> stonith -t ibmhmc -T reset crmapp02 ]
> stonith-ng:  warning: log_operation:st_hq_hmc:0:6976 [ failed:
> crmapp02 8 ]
> stonith-ng:   notice: stonith_choose_peer:  Couldn't find anyone to
> fence crmapp02 with 
> stonith-ng: info: call_remote_stonith:  None of the 1 peers are
> capable of terminating crmapp02 for crmd.4568 (1)
> stonith-ng:error: remote_op_done:   Operation reboot of crmapp02 by
>  for crmd.4568@crmapp01.6bf66b9c: No route to host
> crmd:   notice: tengine_stonith_callback: Stonith 

Re: [ClusterLabs] stonith in dual HMC environment

2017-03-21 Thread Digimer
On 20/03/17 12:22 PM, Alexander Markov wrote:
> Hello guys,
> 
> it looks like I miss something obvious, but I just don't get what has
> happened.
> 
> I've got a number of stonith-enabled clusters within my big POWER boxes.
> My stonith devices are two HMC (hardware management consoles) - separate
> servers from IBM that can reboot separate LPARs (logical partitions)
> within POWER boxes - one per every datacenter.
> 
> So my definition for stonith devices was pretty straightforward:
> 
> primitive st_dc2_hmc stonith:ibmhmc \
> params ipaddr=10.1.2.9
> primitive st_dc1_hmc stonith:ibmhmc \
> params ipaddr=10.1.2.8
> clone cl_st_dc2_hmc st_dc2_hmc
> clone cl_st_dc1_hmc st_dc1_hmc
> 
> Everything was ok when we tested failover. But today upon power outage
> we lost one DC completely. Shortly after that cluster just literally
> hanged itself upong trying to reboot nonexistent node. No failover
> occured. Nonexistent node was marked OFFLINE UNCLEAN and resources were
> marked "Started UNCLEAN" on nonexistent node.
> 
> UNCLEAN seems to flag a problems with stonith configuration. So my
> question is: how to avoid such behaviour?
> 
> Thank you!

Please share your config along with the logs from the nodes that were
effected.

cheers,

digimer

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH not communicated back to initiator until token expires

2017-03-13 Thread Chris Walker
Thanks for your reply Digimer.

On Mon, Mar 13, 2017 at 1:35 PM, Digimer  wrote:

> On 13/03/17 12:07 PM, Chris Walker wrote:
> > Hello,
> >
> > On our two-node EL7 cluster (pacemaker: 1.1.15-11.el7_3.4; corosync:
> > 2.4.0-4; libqb: 1.0-1),
> > it looks like successful STONITH operations are not communicated from
> > stonith-ng back to theinitiator (in this case, crmd) until the STONITHed
> > node is removed from the cluster when
> > Corosync notices that it's gone (i.e., after the token timeout).
>
> Others might have more useful info, but my understanding of a lost node
> sequence is this;
>
> 1. Node stops responding, corosync declares it lost after token timeout
> 2. Corosync reforms the cluster with remaining node(s), checks if it is
> quorate (always true in 2-node)
> 3. Corosync informs Pacemaker of the membership change.
> 4. Pacemaker invokes stonith, waits for the fence agent to return
> "success" (exit code of the agent as per the FenceAgentAPI
> [https://docs.pagure.org/ClusterLabs.fence-agents/FenceAgentAPI.md]). If
> the method fails, it moves on to the next method. If all methods fail,
> it goes back to the first method and tries again, looping indefinitely.
>
>
That's roughly my understanding as well for the case when a node suddenly
leaves the cluster (e.g., poweroff), and this case is working as expected
for me.  I'm seeing delays when a node is marked for STONITH while it's
still up (e.g., after a stop operation fails).  In this case, what I expect
to see is something like:
1.  crmd requests that stonith-ng fence the node
2.  stonith-ng (might be a different stonith-ng) fences the node and sends
a message that it has succeeded
3.  stonith-ng (the original from step 1) receives this message and
communicates back to crmd that the node has been fenced

but what I'm seeing is
1.  crmd requests that stonith-ng fence the node
2.  stonith-ng fences the node and sends a message saying that it has
succeeded
3.  nobody hears this message
4.  Corosync eventually realizes that the fenced node is no longer part of
the config and broadcasts a config change
5.  stonith-ng finishes the STONITH operation that was started earlier and
communicates back to crmd that the node has been STONITHed

I'm less convinced that the sending of the STONITH notify in step 2 is at
fault; it also seems possible that a callback is not being run when it
should be.


The Pacemaker configuration is not important (I've seen this happen on our
production clusters and on a small sandbox), but the config is:

primitive bug0-stonith stonith:fence_ipmilan \
params pcmk_host_list=bug0 ipaddr=bug0-ipmi action=off login=admin
passwd=admin \
meta target-role=Started
primitive bug1-stonith stonith:fence_ipmilan \
params pcmk_host_list=bug1 ipaddr=bug1-ipmi action=off login=admin
passwd=admin \
meta target-role=Started
primitive prm-snmp-heartbeat snmptrap_heartbeat \
params snmphost=bug0 community=public \
op monitor interval=10 timeout=300 \
op start timeout=300 interval=0 \
op stop timeout=300 interval=0
clone cln-snmp-heartbeat prm-snmp-heartbeat \
meta interleave=true globally-unique=false ordered=false
notify=false
location bug0-stonith-loc bug0-stonith -inf: bug0
location bug1-stonith-loc bug1-stonith -inf: bug1

The corosync config might be more interesting:

totem {
version: 2
crypto_cipher: none
crypto_hash: none
secauth: off
rrp_mode: passive
transport: udpu
token: 24
consensus: 1000

interface {
ringnumber: 0
bindnetaddr: 203.0.113.0
mcastport: 5405
ttl: 1
}
}
nodelist {
node {
ring0_addr: 203.0.113.1
nodeid: 1
}
node {
ring0_addr: 203.0.113.2
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}


> In trace debug logs, I see the STONITH reply sent via the
> > cpg_mcast_joined (libqb) function in crm_cs_flush
> > (stonith_send_async_reply->send_cluster_text->send_cluster_
> text->send_cpg_iov->crm_cs_flush->cpg_mcast_joined)
> >
> > Mar 13 07:18:22 [6466] bug0 stonith-ng: (  commands.c:1891  )   trace:
> > stonith_send_async_reply:Reply> t="stonith-ng" st_op="st_fence" st_device_id="ustonith:0"
> > st_remote_op="39b1f1e0-b76f-4d25-bd15-77b956c914a0"
> > st_clientid="823e92da-8470-491a-b662-215526cced22"
> > st_clientname="crmd.3973" st_target="bug1" st_device_action="st_fence"
> > st_callid="3" st_callopt="0" st_rc="0" st_output="Chassis Power Control:
> > Reset\nChassis Power Control: Down/Off\nChassis Power Control:
> Down/Off\nC
> > Mar 13 07:18:22 [6466] bug0 stonith-ng: (   cpg.c:636   )   trace:
> > send_cluster_text:   Queueing CPG message 9 to all (1041 bytes, 449
> > bytes payload):  > st_op="st_notify" st_device_id="ustonith:0"
> > st_remote_op="39b1f1e0-b76f-4d25-bd15-77b956c914a0"
> > 

Re: [ClusterLabs] STONITH not communicated back to initiator until token expires

2017-03-13 Thread Digimer
On 13/03/17 12:07 PM, Chris Walker wrote:
> Hello,
> 
> On our two-node EL7 cluster (pacemaker: 1.1.15-11.el7_3.4; corosync:
> 2.4.0-4; libqb: 1.0-1),
> it looks like successful STONITH operations are not communicated from
> stonith-ng back to theinitiator (in this case, crmd) until the STONITHed
> node is removed from the cluster when
> Corosync notices that it's gone (i.e., after the token timeout).

Others might have more useful info, but my understanding of a lost node
sequence is this;

1. Node stops responding, corosync declares it lost after token timeout
2. Corosync reforms the cluster with remaining node(s), checks if it is
quorate (always true in 2-node)
3. Corosync informs Pacemaker of the membership change.
4. Pacemaker invokes stonith, waits for the fence agent to return
"success" (exit code of the agent as per the FenceAgentAPI
[https://docs.pagure.org/ClusterLabs.fence-agents/FenceAgentAPI.md]). If
the method fails, it moves on to the next method. If all methods fail,
it goes back to the first method and tries again, looping indefinitely.

> In trace debug logs, I see the STONITH reply sent via the
> cpg_mcast_joined (libqb) function in crm_cs_flush
> (stonith_send_async_reply->send_cluster_text->send_cluster_text->send_cpg_iov->crm_cs_flush->cpg_mcast_joined)
> 
> Mar 13 07:18:22 [6466] bug0 stonith-ng: (  commands.c:1891  )   trace:
> stonith_send_async_reply:Replyt="stonith-ng" st_op="st_fence" st_device_id="ustonith:0"
> st_remote_op="39b1f1e0-b76f-4d25-bd15-77b956c914a0"
> st_clientid="823e92da-8470-491a-b662-215526cced22"
> st_clientname="crmd.3973" st_target="bug1" st_device_action="st_fence"
> st_callid="3" st_callopt="0" st_rc="0" st_output="Chassis Power Control:
> Reset\nChassis Power Control: Down/Off\nChassis Power Control: Down/Off\nC
> Mar 13 07:18:22 [6466] bug0 stonith-ng: (   cpg.c:636   )   trace:
> send_cluster_text:   Queueing CPG message 9 to all (1041 bytes, 449
> bytes payload):  st_op="st_notify" st_device_id="ustonith:0"
> st_remote_op="39b1f1e0-b76f-4d25-bd15-77b956c914a0"
> st_clientid="823e92da-8470-491a-b662-215526cced22" st_clientna
> Mar 13 07:18:22 [6466] bug0 stonith-ng: (   cpg.c:207   )   trace:
> send_cpg_iov:Queueing CPG message 9 (1041 bytes)
> Mar 13 07:18:22 [6466] bug0 stonith-ng: (   cpg.c:170   )   trace:
> crm_cs_flush:CPG message sent, size=1041
> Mar 13 07:18:22 [6466] bug0 stonith-ng: (   cpg.c:185   )   trace:
> crm_cs_flush:Sent 1 CPG messages  (0 remaining, last=9): OK (1)
> 
> But I see no further action from stonith-ng until minutes later;
> specifically, I don't see remote_op_done run, so the dead node is still
> an 'online (unclean)' member of the array and no failover can take place.
> 
> When the token expires (yes, we use a very long token), I see the following:
> 
> Mar 13 07:22:11 [6466] bug0 stonith-ng: (membership.c:1018  )  notice:
> crm_update_peer_state_iter:  Node bug1 state is now lost | nodeid=2
> previous=member source=crm_update_peer_proc
> Mar 13 07:22:11 [6466] bug0 stonith-ng: (  main.c:1275  )   debug:
> st_peer_update_callback: Broadcasting our uname because of node 2
> Mar 13 07:22:11 [6466] bug0 stonith-ng: (   cpg.c:636   )   trace:
> send_cluster_text:   Queueing CPG message 10 to all (666 bytes, 74
> bytes payload):  t="stonith-ng" st_op="poke"/>
> ...
> Mar 13 07:22:11 [6466] bug0 stonith-ng: (  commands.c:2582  )   debug:
> stonith_command: Processing st_notify reply 0 from bug0 (   0)
> Mar 13 07:22:11 [6466] bug0 stonith-ng: (remote.c:1945  )   debug:
> process_remote_stonith_exec: Marking call to poweroff for bug1 on
> behalf of crmd.3973@39b1f1e0-b76f-4d25-bd15-77b956c914a0.bug1: OK (0)
> 
> and the STONITH command is finally communicated back to crmd as complete
> and failover commences.
> 
> Is this delay a feature of the cpg_mcast_joined function?  If I
> understand correctly (unlikely), it looks like cpg_mcast_joined is not
> completing because one of the nodes in the group is missing, but I
> haven't looked at that code closely yet.  Is it advisable to have
> stonith-ng broadcast a membership change when it successfully fences a node?
> 
> Attaching logs with PCMK_debug=stonith-ng
> and 
> PCMK_trace_functions=stonith_send_async_reply,send_cluster_text,send_cpg_iov,crm_cs_flush
> 
> Thanks in advance,
> Chris

Can you share your full pacemaker config (please obfuscate passwords).

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: 

Re: [ClusterLabs] Stonith : meta-data contains no resource-agent element

2016-11-28 Thread bliu

Hi,
   SSH stonith is just a devel demo, if I remember correctly. If
you are using openSUSE, you need to install libglue-devel.

I think there are similar packages on other distributions.
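
For example, on openSUSE (a sketch; the exact package name can vary between
releases, and the external/* plugins may come from the cluster-glue package
instead):

zypper install libglue-devel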


On 11/21/2016 04:05 AM, jitendra.jaga...@dell.com wrote:


Hello Pacemaker admins,

We have recently updated to Pacemaker version 1.1.15

Before that we were using Pacemaker version 1.1.10

With Pacemaker version 1.1.10, our Stonith ssh agent use to work.

Now once we upgraded to Pacemaker version 1.1.15 we see below errors 
for basic Stonith configuration as explained in 
[http://clusterlabs.org/doc/crm_fencing.html]


=

crm configure primitive st-ssh stonith:external/ssh params hostlist="node1 node2"

ERROR: stonith:external/ssh: meta-data contains no resource-agent element
ERROR: None
ERROR: stonith:external/ssh: meta-data contains no resource-agent element
ERROR: stonith:external/ssh: no such resource agent



Additional info

Below is stonith -L output

=

/home/root# stonith -L
apcmaster
apcsmart
baytech
cyclades
drac3
external/drac5
external/dracmc-telnet
external/hetzner
external/hmchttp
external/ibmrsa
external/ibmrsa-telnet
external/ipmi
external/ippower9258
external/kdumpcheck
external/libvirt
external/nut
external/rackpdu
external/riloe
external/sbd
*external/ssh*
external/vcenter
external/vmware
external/xen0
external/xen0-ha
ibmhmc
meatware
null
nw_rpc100s
rcd_serial
rps10
ssh
suicide
wti_nps

=

Above agents are in /usr/lib/stonith/plugins directory

Please, can anyone provide a solution to resolve the above issue?

Thanks

Jitendra



___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH Fencing for Amazon EC2

2016-08-04 Thread Jason A Ramsey
Is there some other [updated] fencing module I can use in this use case?

--
 
[ jR ]
  M: +1 (703) 628-2621
  @: ja...@eramsey.org
 
  there is no path to greatness; greatness is the path

On 8/2/16, 11:59 AM, "Digimer"  wrote:

On 02/08/16 10:02 AM, Jason A Ramsey wrote:
> I’ve found [oldish] references on the internet to a fencing module for 
Amazon EC2, but it doesn’t seem to be included in any the fencing yum packages 
for CentOS. Is this module not part of the canonical distribution? Is there 
something else I should be looking for?

I *think* it fell behind (fence_ec2, iirc). It might need to be picked
up, updated/tested and then it can be re-added to the official list.

I'm not 100% on this though, so if someone contradicts me, ignore me.
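
A quick way to check what, if anything, is shipped on a given box (exact
commands depend on which tools are installed, so this is only a sketch):

  pcs stonith list | grep -i ec2
  stonith_admin --list-installed | grep -i ec2
  ls /usr/sbin/fence_* 2>/dev/null | grep -i ec2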

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH Fencing for Amazon EC2

2016-08-02 Thread Digimer
On 02/08/16 10:02 AM, Jason A Ramsey wrote:
> I’ve found [oldish] references on the internet to a fencing module for Amazon 
> EC2, but it doesn’t seem to be included in any of the fencing yum packages for 
> CentOS. Is this module not part of the canonical distribution? Is there 
> something else I should be looking for?

I *think* it fell behind (fence_ec2, iirc). It might need to be picked
up, updated/tested and then it can be re-added to the official list.

I'm not 100% on this though, so if someone contradicts me, ignore me.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stonith ignores resource stop errors

2016-03-10 Thread Klechomir

Thanks,
That was it!

On 10.03.2016 17:08, Ken Gaillot wrote:

On 03/10/2016 04:42 AM, Klechomir wrote:

Hi List

I'm testing stonith now (pacemaker 1.1.8), and noticed that it properly kills
a node with stopped pacemaker, but ignores resource stop errors.

I'm pretty sure that the same version worked properly with stonith before.
Maybe I'm missing some setting?

Regards,
Klecho

The only setting that should be relevant there is on-fail for the
resource's stop operation, which defaults to fence but can be set to
other actions.

That said, 1.1.8 is pretty old at this point, so I'm not sure of its
behavior.




___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stonith ignores resource stop errors

2016-03-10 Thread Ken Gaillot
On 03/10/2016 04:42 AM, Klechomir wrote:
> Hi List
> 
> I'm testing stonith now (pacemaker 1.1.8), and noticed that it properly kills 
> a node with stopped pacemaker, but ignores resource stop errors.
> 
> I'm pretty sure that the same version worked properly with stonith before.
> Maybe I'm missing some setting?
> 
> Regards,
> Klecho

The only setting that should be relevant there is on-fail for the
resource's stop operation, which defaults to fence but can be set to
other actions.

That said, 1.1.8 is pretty old at this point, so I'm not sure of its
behavior.
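
For reference, the setting lives on the stop operation itself; a minimal crm
shell sketch with made-up resource and device names (on-fail=fence is already
the default for a failed stop when stonith is enabled):

  primitive example_fs ocf:heartbeat:Filesystem \
      params device="/dev/vg0/lv0" directory="/mnt/data" fstype="ext4" \
      op stop interval=0 timeout=120 on-fail=fence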


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH when both IB interfaces are down, and how to trigger Filesystem mount/umount failure to test STONITH?

2015-08-20 Thread Marcin Dulak
Hi, thanks for the answers,

I've performed the test of shutting down both IPoIB interfaces on an OSS
server while a Lustre client was writing a large file to the OST on that
server; the umount still succeeded, and writing to the file continued after
a short delay on the same OST mounted on the failed-over server.
I found, however, that if one incorrectly formats a Lustre OST (wrong index)
then it fails to mount, and STONITH is triggered. I may test the exit
$OCF_ERR_GENERIC solution, but I would like to go back now to the first
question: how can one trigger STONITH when a server loses both IB interfaces?
How do I make it cooperate with the existing Filesystem-mount-based STONITH?
Is it a good idea at all? Any examples on the net?

Marcin


On Thu, Aug 20, 2015 at 9:00 AM, Andrei Borzenkov arvidj...@gmail.com
wrote:

 19.08.2015 13:31, Marcin Dulak wrote:
  However if instead both IPoIB interfaces go down on server-02,
  the mdt is moved to server-01, but no STONITH is performed on server-02.
  This is expected, because there is nothing in the configuration that
  triggers
  STONITH in case of IB connection loss.
  However, if IPoIB is flapping, this setup could lead to the mdt moving
  back and forth between server-01 and server-02.
  Should I have STONITH shut down a node that loses both IPoIB
  (remember they are passively redundant, only one active at a time)
  interfaces?

 It is really up to the agent. Note that on-fail is triggered only if the
 operation fails. So as long as the stop invocation does not return an error,
 no fencing happens.

  If so, how to achieve that?
 

 If you really want to trigger fencing when access to the block device
 fails, you probably need to define it as a separate resource with its own
 agent and set on-fail=fence on the monitor operation for this block
 device. Otherwise you cannot really distinguish a filesystem-level error
 from a block-device-level one.
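
(As an illustration of that idea for the IB side of the question: a cloned
monitor-only resource whose monitor op carries on-fail=fence; the interface
name and timings here are made up, so this is only a sketch:

  primitive ib0-monitor ocf:heartbeat:ethmonitor \
      params interface="ib0" repeat_count=3 repeat_interval=10 \
      op monitor interval=15s timeout=60s on-fail=fence
  clone cl-ib0-monitor ib0-monitor

Whether fencing on interface loss is sensible depends on how flappy the links
are, as noted earlier in the thread.)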

  The context for the second question: the configuration contains the
  following Filesystem template:
 
  rsc_template lustre-target-template ocf:heartbeat:Filesystem \
 op monitor interval=120 timeout=60 OCF_CHECK_LEVEL=10 \
 op start   interval=0   timeout=300 on-fail=fence \
 op stopinterval=0   timeout=300 on-fail=fence
 
  How can I make umount/mount of Filesystem fail in order to test STONITH
  action in these cases?
 

 Insert exit $OCF_ERR_GENERIC in stop method? :)
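
(Concretely, that would mean temporarily editing the installed agent, e.g.
/usr/lib/ocf/resource.d/heartbeat/Filesystem, so that its stop path fails;
the function name below is from memory and may differ between versions:

  Filesystem_stop() {
      # force a stop failure so that on-fail=fence gets exercised
      exit $OCF_ERR_GENERIC
      # ...original unmount logic, now unreachable...
  }

and restoring the original agent once the test is done.)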

  Extra question: where can I find the documentation/source for what
  on-fail=fence does?

 Pacemaker Explained has some description. It should initiate fencing
 of the node where the resource had been active.

  Or what does on-fail=stop mean in the ethmonitor template below (what is
  stopped?)?
 

 on-fail=stop sets the resource's target role to Stopped, so Pacemaker tries
 to stop it and leaves it stopped.

  rsc_template netmonitor-30sec ethmonitor \
 params repeat_count=3 repeat_interval=10 \
 op monitor interval=15s timeout=60s \
 op start   interval=0s  timeout=60s on-fail=stop \
 
  Marcin
 
 
 
  ___
  Users mailing list: Users@clusterlabs.org
  http://clusterlabs.org/mailman/listinfo/users
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 

 ___
 Users mailing list: Users@clusterlabs.org
 http://clusterlabs.org/mailman/listinfo/users

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org