Re: [ClusterLabs] stonith in dual HMC environment
Hello, Dejan,

> If the datacenters are completely separate, you might want to take a
> look at booth. With booth, you set up a separate cluster at each
> datacenter, and booth coordinates which one can host resources. Each
> datacenter must have its own self-sufficient cluster with its own
> fencing, but one site does not need to be able to fence the other.

This seems like overkill to me ;) If I choose not to fence,
stonith-enabled=false would be a much simpler solution.

> Yes, it's just that the name escaped me at the time. But I'm not sure
> which pacemaker version is used and if it supports the fencing topology.

Doesn't help in my case. The problem is I just don't have any way to fence
the node at all (because it's already offline together with its whole
datacenter environment).

I actually built a simple cluster and played a little with different
stonith schemes and solutions. I tried the hostlist analogue of the ibmhmc
stonith device, changed locations - nothing helps. Every time I end up
with the following:

Last updated: Thu Mar 30 05:19:48 2017
Last change: Thu Mar 30 05:07:32 2017 by root via cibadmin on test01
Stack: classic openais (with plugin)
Current DC: test01 - partition WITHOUT quorum
Version: 1.1.12-f47ea56
2 Nodes configured, 2 expected votes
3 Resources configured

Node test02: UNCLEAN (offline)
Online: [ test01 ]

Full list of resources:

 Resource Group: g_ip
     rsc_ip_TST_HDB00   (ocf::heartbeat:IPaddr2):  Started test02 (UNCLEAN)
 st-hq  (stonith:ibmhmc):  Started test01
 st-ch  (stonith:ibmhmc):  Started test02 (UNCLEAN)

and logs like

Mar 30 05:10:32 [5112] test01 crmd: notice: too_many_st_failures: No devices found in cluster to fence test02, giving up

and I totally second this. There's no device able to fence a node that is
already offline. I just need to know how to resolve it without manual
intervention. The ideal solution for me would be to do a failover.

--
Regards,
Alexander
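For reference, when no fencing device can possibly reach the lost node,
the usual stop-gap is to acknowledge the node's death by hand once it has
been verified out-of-band. A rough sketch (the node name test02 comes from
the status output above; the exact crmsh subcommand may vary by version):

    # verify out-of-band (HMC, power, console) that test02 is really off, then:
    stonith_admin --confirm test02    # tell stonithd to treat test02 as already down
    # or, with crmsh:
    crm node clearstate test02

After that, stonithd treats the fencing as complete and the cluster can
recover the resources on the surviving node. It is still manual
intervention, which is exactly what is being asked to avoid here; the
automatic alternatives (fencing levels, booth with a tie-breaking third
site) are discussed further down the thread.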
Re: [ClusterLabs] stonith in dual HMC environment
On Tue, Mar 28, 2017 at 04:20:12PM +0300, Alexander Markov wrote:
> Hello, Dejan,
>
> > Why? I don't have a test system right now, but for instance this
> > should work:
> >
> > $ stonith -t ibmhmc ipaddr=10.1.2.9 -lS
> > $ stonith -t ibmhmc ipaddr=10.1.2.9 -T reset {nodename}
>
> Ah, I see. Everything (including stonith methods, fencing and failover)
> works just fine under normal circumstances. Sorry if I wasn't clear about
> that. The problem occurs only when I have one datacenter (i.e. one IBM
> machine and one HMC) lost due to a power outage.
>
> For example:
> test01:~ # stonith -t ibmhmc ipaddr=10.1.2.8 -lS | wc -l
> info: ibmhmc device OK.
> 39
> test01:~ # stonith -t ibmhmc ipaddr=10.1.2.9 -lS | wc -l
> info: ibmhmc device OK.
> 39
>
> As I said, the stonith device can see and manage all the cluster nodes.

That's great :)

> > If so, then your configuration does not appear to be correct. If
> > both are capable of managing all nodes then you should tell
> > pacemaker about it.
>
> Thanks for the hint. But if the stonith device returns a node list,
> isn't it obvious to the cluster that it can manage those nodes?

Did you try that? Just drop the location constraints and see if it works.
Pacemaker should actually keep the list of (stonith) resources capable of
managing the node.

> Could you please be more precise about what you refer to? I currently
> changed the configuration to two fencing levels (one per HMC) but still
> don't think I get the idea here.
>
> > > The surviving node, running the stonith resource for the dead node,
> > > tries to contact the ipmi device (which is also dead). How does the
> > > cluster understand that the lost node is really dead and it's not
> > > just a network issue?
> >
> > It cannot.
>
> How do people then actually solve the problem of a two-node metro cluster?

That depends, but if you have a communication channel for stonith devices
which is _independent_ of the cluster communication then you should be OK.
Of course, a fencing device which goes down together with its node is of
no use, but that doesn't seem to be the case here.

> I mean, I know one option: stonith-enabled=false, but it doesn't seem
> right to me.

Certainly not.

Thanks,

Dejan

> Thank you.
>
> Regards,
> Alexander Markov
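Dejan's "just drop the location constraints" suggestion, spelled out
against the configuration shown later in this thread, would look roughly
like the sketch below. It assumes, as Dejan does, that the ibmhmc agent
reports every LPAR it can manage when pacemaker queries it:

    primitive st_hq_hmc stonith:ibmhmc \
            params ipaddr=10.1.2.8 \
            op start interval=0 timeout=300
    primitive st_ch_hmc stonith:ibmhmc \
            params ipaddr=10.1.2.9 \
            op start interval=0 timeout=300
    # deliberately no "location ... -inf: ..." rules for the stonith
    # resources, so the surviving node is free to run, and fence with,
    # either device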
Re: [ClusterLabs] stonith in dual HMC environment
On Tue, Mar 28, 2017 at 09:54:55AM -0500, Ken Gaillot wrote:
> On 03/28/2017 08:20 AM, Alexander Markov wrote:
> > Hello, Dejan,
> >
> >> Why? I don't have a test system right now, but for instance this
> >> should work:
> >>
> >> $ stonith -t ibmhmc ipaddr=10.1.2.9 -lS
> >> $ stonith -t ibmhmc ipaddr=10.1.2.9 -T reset {nodename}
> >
> > Ah, I see. Everything (including stonith methods, fencing and failover)
> > works just fine under normal circumstances. Sorry if I wasn't clear
> > about that. The problem occurs only when I have one datacenter (i.e.
> > one IBM machine and one HMC) lost due to a power outage.
>
> If the datacenters are completely separate, you might want to take a
> look at booth. With booth, you set up a separate cluster at each
> datacenter, and booth coordinates which one can host resources. Each
> datacenter must have its own self-sufficient cluster with its own
> fencing, but one site does not need to be able to fence the other.
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm139683855002656
>
> >
> > For example:
> > test01:~ # stonith -t ibmhmc ipaddr=10.1.2.8 -lS | wc -l
> > info: ibmhmc device OK.
> > 39
> > test01:~ # stonith -t ibmhmc ipaddr=10.1.2.9 -lS | wc -l
> > info: ibmhmc device OK.
> > 39
> >
> > As I said, the stonith device can see and manage all the cluster nodes.
> >
> >> If so, then your configuration does not appear to be correct. If
> >> both are capable of managing all nodes then you should tell
> >> pacemaker about it.
> >
> > Thanks for the hint. But if the stonith device returns a node list,
> > isn't it obvious to the cluster that it can manage those nodes? Could
> > you please be more precise about what you refer to? I currently changed
> > the configuration to two fencing levels (one per HMC) but still don't
> > think I get the idea here.
>
> I believe Dejan is referring to fencing topology (levels).

Yes, it's just that the name escaped me at the time. But I'm not sure
which pacemaker version is used and if it supports the fencing topology.

Thanks,

Dejan

> That would be preferable to booth if the datacenters are physically
> close, and even if one fence device fails, the other can still function.
>
> In this case you'd probably want level 1 = the main fence device, and
> level 2 = the fence device to use if the main device fails.
>
> A common implementation (which Digimer uses to great effect) is to use
> IPMI as level 1 and an intelligent power switch as level 2. If your
> second device can function regardless of what hosts are up or down, you
> can do something similar.
>
> >
> >> The surviving node, running the stonith resource for the dead node,
> >> tries to contact the ipmi device (which is also dead). How does the
> >> cluster understand that the lost node is really dead and it's not
> >> just a network issue?
> >>
> >> It cannot.
>
> And it will be unable to recover resources that were running on the
> questionable partition.
>
> >
> > How do people then actually solve the problem of a two-node metro cluster?
> > I mean, I know one option: stonith-enabled=false, but it doesn't seem
> > right to me.
> >
> > Thank you.
> >
> > Regards,
> > Alexander Markov
Re: [ClusterLabs] stonith in dual HMC environment
On 03/28/2017 08:20 AM, Alexander Markov wrote:
> Hello, Dejan,
>
>> Why? I don't have a test system right now, but for instance this
>> should work:
>>
>> $ stonith -t ibmhmc ipaddr=10.1.2.9 -lS
>> $ stonith -t ibmhmc ipaddr=10.1.2.9 -T reset {nodename}
>
> Ah, I see. Everything (including stonith methods, fencing and failover)
> works just fine under normal circumstances. Sorry if I wasn't clear
> about that. The problem occurs only when I have one datacenter (i.e. one
> IBM machine and one HMC) lost due to a power outage.

If the datacenters are completely separate, you might want to take a
look at booth. With booth, you set up a separate cluster at each
datacenter, and booth coordinates which one can host resources. Each
datacenter must have its own self-sufficient cluster with its own
fencing, but one site does not need to be able to fence the other.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm139683855002656

>
> For example:
> test01:~ # stonith -t ibmhmc ipaddr=10.1.2.8 -lS | wc -l
> info: ibmhmc device OK.
> 39
> test01:~ # stonith -t ibmhmc ipaddr=10.1.2.9 -lS | wc -l
> info: ibmhmc device OK.
> 39
>
> As I said, the stonith device can see and manage all the cluster nodes.
>
>> If so, then your configuration does not appear to be correct. If
>> both are capable of managing all nodes then you should tell
>> pacemaker about it.
>
> Thanks for the hint. But if the stonith device returns a node list,
> isn't it obvious to the cluster that it can manage those nodes? Could
> you please be more precise about what you refer to? I currently changed
> the configuration to two fencing levels (one per HMC) but still don't
> think I get the idea here.

I believe Dejan is referring to fencing topology (levels). That would be
preferable to booth if the datacenters are physically close, and even if
one fence device fails, the other can still function.

In this case you'd probably want level 1 = the main fence device, and
level 2 = the fence device to use if the main device fails.

A common implementation (which Digimer uses to great effect) is to use
IPMI as level 1 and an intelligent power switch as level 2. If your
second device can function regardless of what hosts are up or down, you
can do something similar.

>
>> The surviving node, running the stonith resource for the dead node,
>> tries to contact the ipmi device (which is also dead). How does the
>> cluster understand that the lost node is really dead and it's not just
>> a network issue?
>>
>> It cannot.

And it will be unable to recover resources that were running on the
questionable partition.

>
> How do people then actually solve the problem of a two-node metro cluster?
> I mean, I know one option: stonith-enabled=false, but it doesn't seem
> right to me.
>
> Thank you.
>
> Regards,
> Alexander Markov
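Expressed in the crm shell syntax used elsewhere in this thread, Ken's
level-1/level-2 idea might look like the sketch below. The device and node
names are taken from the configuration quoted later in the thread; which
HMC should be tried first for which node is only an assumption:

    fencing_topology \
            crmapp01: st_hq_hmc st_ch_hmc \
            crmapp02: st_ch_hmc st_hq_hmc

Each device name after the node is one fencing level; pacemaker only falls
through to the next level when every device of the previous level has
failed.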
Re: [ClusterLabs] stonith in dual HMC environment
Hello, Dejan,

> Why? I don't have a test system right now, but for instance this
> should work:
>
> $ stonith -t ibmhmc ipaddr=10.1.2.9 -lS
> $ stonith -t ibmhmc ipaddr=10.1.2.9 -T reset {nodename}

Ah, I see. Everything (including stonith methods, fencing and failover)
works just fine under normal circumstances. Sorry if I wasn't clear about
that. The problem occurs only when I have one datacenter (i.e. one IBM
machine and one HMC) lost due to a power outage.

For example:

test01:~ # stonith -t ibmhmc ipaddr=10.1.2.8 -lS | wc -l
info: ibmhmc device OK.
39
test01:~ # stonith -t ibmhmc ipaddr=10.1.2.9 -lS | wc -l
info: ibmhmc device OK.
39

As I said, the stonith device can see and manage all the cluster nodes.

> If so, then your configuration does not appear to be correct. If
> both are capable of managing all nodes then you should tell
> pacemaker about it.

Thanks for the hint. But if the stonith device returns a node list, isn't
it obvious to the cluster that it can manage those nodes? Could you please
be more precise about what you refer to? I currently changed the
configuration to two fencing levels (one per HMC) but still don't think I
get the idea here.

> > The surviving node, running the stonith resource for the dead node,
> > tries to contact the ipmi device (which is also dead). How does the
> > cluster understand that the lost node is really dead and it's not just
> > a network issue?
>
> It cannot.

How do people then actually solve the problem of a two-node metro cluster?
I mean, I know one option: stonith-enabled=false, but it doesn't seem
right to me.

Thank you.

Regards,
Alexander Markov
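On the "two-node metro cluster" question: the booth setup Ken mentions
earlier in the thread usually adds a small arbitrator at a third site and
makes the resources depend on a ticket, so a site that loses power simply
loses the ticket and the survivor does not have to fence anything at the
dead site. A rough sketch only; every address and the ticket name below
are made up for illustration:

    # /etc/booth/booth.conf - identical on both sites and on the arbitrator
    transport = UDP
    port = 9929
    site = 10.1.1.10
    site = 10.1.2.10
    arbitrator = 10.1.3.10
    ticket = "ticket-sap"

    # pacemaker side (crm shell): the group may only run where the ticket
    # is currently granted
    rsc_ticket g_sap-with-ticket ticket-sap: g_sap loss-policy=stop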
Re: [ClusterLabs] stonith in dual HMC environment
On Mon, Mar 27, 2017 at 01:17:31PM +0300, Alexander Markov wrote:
> Hello, Dejan,
>
> > The first thing I'd try is making sure you can fence each node from the
> > command line by manually running the fence agent. I'm not sure how to
> > do that for the "stonith:" type agents.
> >
> > There's a program stonith(8). It's easy to replicate the
> > configuration on the command line.
>
> Unfortunately, it is not.

Why? I don't have a test system right now, but for instance this
should work:

$ stonith -t ibmhmc ipaddr=10.1.2.9 -lS
$ stonith -t ibmhmc ipaddr=10.1.2.9 -T reset {nodename}

Read the examples in the man page:

$ man stonith

Check also the documentation of your agent:

$ stonith -t ibmhmc -h
$ stonith -t ibmhmc -n

> The landscape I refer to is similar to VMWare. We use the cluster for
> virtual machines (LPARs) and everything works OK, but the real pain
> occurs when the whole host system is down. Keeping in mind that it's
> actually used in production now, I just can't afford to turn it off for
> testing.

Yes, I understand. However, I was just talking about how to use the
stonith agents and how to do the testing outside of pacemaker.

> > Stonith agents are to be queried for the list of nodes they can
> > manage. It's part of the interface. Some agents can figure that
> > out by themselves and some need a parameter defining the node list.
>
> And this is just the place where I'm stuck. I've got two stonith devices
> (ibmhmc) for redundancy. Both of them are capable of managing every node.

If so, then your configuration does not appear to be correct. If
both are capable of managing all nodes then you should tell
pacemaker about it.

Digimer has fairly extensive documentation on how to configure complex
fencing setups. You can also check with your vendor's documentation.

> The problem starts when
>
> 1) one stonith device is completely lost and inaccessible (due to a power
> outage in the datacenter)
> 2) the surviving stonith device can access neither the cluster node nor
> the hosting system (in VMWare terms) for that cluster node, because both
> of them are also lost due to the power outage.

Both lost? What remained? Why do you mention vmware? I thought that your
nodes are LPARs.

> What is the correct solution for this situation?
>
> > Well, this used to be a standard way to configure one kind of
> > stonith resources, one common representative being ipmi, and
> > served exactly the purpose of restricting the stonith resource
> > from being enabled ("running") on a node which this resource
> > manages.
>
> Unfortunately, there's no such thing as ipmi in IBM Power boxes.

I mentioned ipmi as an example, not that it has anything to do with your
setup.

> But it raises an interesting question for me: if both a node and its
> complementary ipmi device are lost (due to a power outage), what happens
> to the cluster?

The cluster gets stuck trying to fence the node. Typically this would
render your cluster unusable. There are some IPMI devices which have a
battery to allow for some extra time to manage the host.

> The surviving node, running the stonith resource for the dead node, tries
> to contact the ipmi device (which is also dead). How does the cluster
> understand that the lost node is really dead and it's not just a network
> issue?

It cannot.

Thanks,

Dejan

> Thank you.
>
> --
> Regards,
> Alexander Markov
> +79104531955
Re: [ClusterLabs] stonith in dual HMC environment
Hello, Dejan,

> The first thing I'd try is making sure you can fence each node from the
> command line by manually running the fence agent. I'm not sure how to do
> that for the "stonith:" type agents.
>
> There's a program stonith(8). It's easy to replicate the
> configuration on the command line.

Unfortunately, it is not. The landscape I refer to is similar to VMWare.
We use the cluster for virtual machines (LPARs) and everything works OK,
but the real pain occurs when the whole host system is down. Keeping in
mind that it's actually used in production now, I just can't afford to
turn it off for testing.

> Stonith agents are to be queried for the list of nodes they can
> manage. It's part of the interface. Some agents can figure that
> out by themselves and some need a parameter defining the node list.

And this is just the place where I'm stuck. I've got two stonith devices
(ibmhmc) for redundancy. Both of them are capable of managing every node.
The problem starts when

1) one stonith device is completely lost and inaccessible (due to a power
outage in the datacenter)
2) the surviving stonith device can access neither the cluster node nor
the hosting system (in VMWare terms) for that cluster node, because both
of them are also lost due to the power outage.

What is the correct solution for this situation?

> Well, this used to be a standard way to configure one kind of
> stonith resources, one common representative being ipmi, and
> served exactly the purpose of restricting the stonith resource
> from being enabled ("running") on a node which this resource
> manages.

Unfortunately, there's no such thing as ipmi in IBM Power boxes. But it
raises an interesting question for me: if both a node and its
complementary ipmi device are lost (due to a power outage), what happens
to the cluster? The surviving node, running the stonith resource for the
dead node, tries to contact the ipmi device (which is also dead). How does
the cluster understand that the lost node is really dead and it's not just
a network issue?

Thank you.

--
Regards,
Alexander Markov
+79104531955
Re: [ClusterLabs] stonith in dual HMC environment
Hi,

On Fri, Mar 24, 2017 at 11:01:45AM -0500, Ken Gaillot wrote:
> On 03/22/2017 09:42 AM, Alexander Markov wrote:
> >
> >> Please share your config along with the logs from the nodes that were
> >> affected.
> >
> > I'm starting to think it's not about how to define stonith resources.
> > If the whole box is down with all the logical partitions defined, then
> > the HMC cannot tell if the LPAR (partition) is really dead or just
> > inaccessible. This leads to UNCLEAN OFFLINE node status and pacemaker's
> > refusal to do anything until it's resolved. Am I right? Anyway, the
> > simplest pacemaker config from my partitions is below.
>
> Yes, it looks like you are correct. The fence agent is returning an
> error when pacemaker tries to use it to reboot crmapp02. From the stderr
> in the logs, the message is "ssh: connect to host 10.1.2.9 port 22: No
> route to host".
>
> The first thing I'd try is making sure you can fence each node from the
> command line by manually running the fence agent. I'm not sure how to do
> that for the "stonith:" type agents.

There's a program stonith(8). It's easy to replicate the
configuration on the command line.

> Once that's working, make sure the cluster can do the same, by manually
> running "stonith_admin -B $NODE" for each $NODE.
>
> >
> > primitive sap_ASCS SAPInstance \
> >     params InstanceName=CAP_ASCS01_crmapp \
> >     op monitor timeout=60 interval=120 depth=0
> > primitive sap_D00 SAPInstance \
> >     params InstanceName=CAP_D00_crmapp \
> >     op monitor timeout=60 interval=120 depth=0
> > primitive sap_ip IPaddr2 \
> >     params ip=10.1.12.2 nic=eth0 cidr_netmask=24
> > primitive st_ch_hmc stonith:ibmhmc \
> >     params ipaddr=10.1.2.9 \
> >     op start interval=0 timeout=300
> > primitive st_hq_hmc stonith:ibmhmc \
> >     params ipaddr=10.1.2.8 \
> >     op start interval=0 timeout=300
>
> I see you have two stonith devices defined, but they don't specify which
> nodes they can fence -- pacemaker will assume that either device can be
> used to fence either node.

Stonith agents are to be queried for the list of nodes they can
manage. It's part of the interface. Some agents can figure that
out by themselves and some need a parameter defining the node list.
This parameter is usually named hostlist, but that is not a
requirement. At any rate, the CRM should get the list of nodes by
invoking the agent and not from the resource configuration. It is up
to the stonith agent to tell what it can manage.

> > group g_sap sap_ip sap_ASCS sap_D00 \
> >     meta target-role=Started
> > location l_ch_hq_hmc st_ch_hmc -inf: crmapp01
> > location l_st_hq_hmc st_hq_hmc -inf: crmapp02
>
> These constraints restrict which node monitors which device, not which
> node the device can fence.

Well, this used to be a standard way to configure one kind of
stonith resources, one common representative being ipmi, and
served exactly the purpose of restricting the stonith resource
from being enabled ("running") on a node which this resource
manages.

> Assuming st_ch_hmc is intended to fence crmapp01, this will make sure
> that crmapp02 monitors that device -- but you also want something like
> pcmk_host_list=crmapp01 in the device configuration.

pcmk_host_list shouldn't be required for the stonith class agents.

***

There's a document describing fencing and stonith at clusterlabs.org:

http://clusterlabs.org/doc/crm_fencing.html

If it doesn't hold anymore, then something should be done about it.

Thanks,

Dejan
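For completeness, if someone does want to pin the host list statically
instead of relying on the dynamic query Dejan describes, the
pacemaker-level attributes go into the device's params. A sketch using the
node names from this thread:

    primitive st_hq_hmc stonith:ibmhmc \
            params ipaddr=10.1.2.8 \
                    pcmk_host_check="static-list" \
                    pcmk_host_list="crmapp01 crmapp02" \
            op start interval=0 timeout=300

With pcmk_host_check=static-list, stonithd trusts pcmk_host_list rather
than asking the agent which hosts it can manage.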
Re: [ClusterLabs] stonith in dual HMC environment
On 03/22/2017 09:42 AM, Alexander Markov wrote:
>
>> Please share your config along with the logs from the nodes that were
>> affected.
>
> I'm starting to think it's not about how to define stonith resources. If
> the whole box is down with all the logical partitions defined, then the
> HMC cannot tell if the LPAR (partition) is really dead or just
> inaccessible. This leads to UNCLEAN OFFLINE node status and pacemaker's
> refusal to do anything until it's resolved. Am I right? Anyway, the
> simplest pacemaker config from my partitions is below.

Yes, it looks like you are correct. The fence agent is returning an
error when pacemaker tries to use it to reboot crmapp02. From the stderr
in the logs, the message is "ssh: connect to host 10.1.2.9 port 22: No
route to host".

The first thing I'd try is making sure you can fence each node from the
command line by manually running the fence agent. I'm not sure how to do
that for the "stonith:" type agents.

Once that's working, make sure the cluster can do the same, by manually
running "stonith_admin -B $NODE" for each $NODE.

>
> primitive sap_ASCS SAPInstance \
>     params InstanceName=CAP_ASCS01_crmapp \
>     op monitor timeout=60 interval=120 depth=0
> primitive sap_D00 SAPInstance \
>     params InstanceName=CAP_D00_crmapp \
>     op monitor timeout=60 interval=120 depth=0
> primitive sap_ip IPaddr2 \
>     params ip=10.1.12.2 nic=eth0 cidr_netmask=24
> primitive st_ch_hmc stonith:ibmhmc \
>     params ipaddr=10.1.2.9 \
>     op start interval=0 timeout=300
> primitive st_hq_hmc stonith:ibmhmc \
>     params ipaddr=10.1.2.8 \
>     op start interval=0 timeout=300

I see you have two stonith devices defined, but they don't specify which
nodes they can fence -- pacemaker will assume that either device can be
used to fence either node.

> group g_sap sap_ip sap_ASCS sap_D00 \
>     meta target-role=Started
> location l_ch_hq_hmc st_ch_hmc -inf: crmapp01
> location l_st_hq_hmc st_hq_hmc -inf: crmapp02

These constraints restrict which node monitors which device, not which
node the device can fence.

Assuming st_ch_hmc is intended to fence crmapp01, this will make sure
that crmapp02 monitors that device -- but you also want something like
pcmk_host_list=crmapp01 in the device configuration.

> location prefer_node_1 g_sap 100: crmapp01
> property cib-bootstrap-options: \
>     stonith-enabled=true \
>     no-quorum-policy=ignore \
>     placement-strategy=balanced \
>     expected-quorum-votes=2 \
>     dc-version=1.1.12-f47ea56 \
>     cluster-infrastructure="classic openais (with plugin)" \
>     last-lrm-refresh=1490009096 \
>     maintenance-mode=false
> rsc_defaults rsc-options: \
>     resource-stickiness=200 \
>     migration-threshold=3
> op_defaults op-options: \
>     timeout=600 \
>     record-pending=true
>
> Logs are pretty much going in circles: stonith cannot reset the logical
> partition via the HMC, the node stays UNCLEAN (offline), and resources
> are shown as staying on the node that is down.
>
> stonith-ng:   error: log_operation: Operation 'reboot' [6942] (call 6 from crmd.4568) for host 'crmapp02' with device 'st_ch_hmc:0'
> Trying: st_ch_hmc:0
> stonith-ng: warning: log_operation: st_ch_hmc:0:6942 [ Performing: stonith -t ibmhmc -T reset crmapp02 ]
> stonith-ng: warning: log_operation: st_ch_hmc:0:6942 [ failed: crmapp02 3 ]
> stonith-ng:    info: internal_stonith_action_execute: Attempt 2 to execute fence_legacy (reboot). remaining timeout is 59
> stonith-ng:    info: update_remaining_timeout: Attempted to execute agent fence_legacy (reboot) the maximum number of times (2)
>
> stonith-ng:   error: log_operation: Operation 'reboot' [6955] (call 6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc' re
> Trying: st_hq_hmc
> stonith-ng: warning: log_operation: st_hq_hmc:6955 [ Performing: stonith -t ibmhmc -T reset crmapp02 ]
> stonith-ng: warning: log_operation: st_hq_hmc:6955 [ failed: crmapp02 8 ]
> stonith-ng:    info: internal_stonith_action_execute: Attempt 2 to execute fence_legacy (reboot). remaining timeout is 60
> stonith-ng:    info: update_remaining_timeout: Attempted to execute agent fence_legacy (reboot) the maximum number of times (2)
>
> stonith-ng:   error: log_operation: Operation 'reboot' [6976] (call 6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc:0'
> stonith-ng: warning: log_operation: st_hq_hmc:0:6976 [ Performing: stonith -t ibmhmc -T reset crmapp02 ]
> stonith-ng: warning: log_operation: st_hq_hmc:0:6976 [ failed: crmapp02 8 ]
> stonith-ng:  notice: stonith_choose_peer: Couldn't find anyone to fence crmapp02 with
> stonith-ng:    info: call_remote_stonith: None of the 1 peers are capable of terminating crmapp02 for crmd.4568 (1)
> stonith-ng:   error: remote_op_done: Operation reboot of crmapp02 by for crmd.4568@crmapp01.6bf66b9c: No route to host
> crmd:  notice: tengine_stonith_callback: Stonith ope
[ClusterLabs] stonith in dual HMC environment
> Please share your config along with the logs from the nodes that were
> affected.

I'm starting to think it's not about how to define stonith resources. If
the whole box is down with all the logical partitions defined, then the
HMC cannot tell if the LPAR (partition) is really dead or just
inaccessible. This leads to UNCLEAN OFFLINE node status and pacemaker's
refusal to do anything until it's resolved. Am I right? Anyway, the
simplest pacemaker config from my partitions is below.

primitive sap_ASCS SAPInstance \
        params InstanceName=CAP_ASCS01_crmapp \
        op monitor timeout=60 interval=120 depth=0
primitive sap_D00 SAPInstance \
        params InstanceName=CAP_D00_crmapp \
        op monitor timeout=60 interval=120 depth=0
primitive sap_ip IPaddr2 \
        params ip=10.1.12.2 nic=eth0 cidr_netmask=24
primitive st_ch_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.9 \
        op start interval=0 timeout=300
primitive st_hq_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.8 \
        op start interval=0 timeout=300
group g_sap sap_ip sap_ASCS sap_D00 \
        meta target-role=Started
location l_ch_hq_hmc st_ch_hmc -inf: crmapp01
location l_st_hq_hmc st_hq_hmc -inf: crmapp02
location prefer_node_1 g_sap 100: crmapp01
property cib-bootstrap-options: \
        stonith-enabled=true \
        no-quorum-policy=ignore \
        placement-strategy=balanced \
        expected-quorum-votes=2 \
        dc-version=1.1.12-f47ea56 \
        cluster-infrastructure="classic openais (with plugin)" \
        last-lrm-refresh=1490009096 \
        maintenance-mode=false
rsc_defaults rsc-options: \
        resource-stickiness=200 \
        migration-threshold=3
op_defaults op-options: \
        timeout=600 \
        record-pending=true

Logs are pretty much going in circles: stonith cannot reset the logical
partition via the HMC, the node stays UNCLEAN (offline), and resources are
shown as staying on the node that is down.

stonith-ng:   error: log_operation: Operation 'reboot' [6942] (call 6 from crmd.4568) for host 'crmapp02' with device 'st_ch_hmc:0'
Trying: st_ch_hmc:0
stonith-ng: warning: log_operation: st_ch_hmc:0:6942 [ Performing: stonith -t ibmhmc -T reset crmapp02 ]
stonith-ng: warning: log_operation: st_ch_hmc:0:6942 [ failed: crmapp02 3 ]
stonith-ng:    info: internal_stonith_action_execute: Attempt 2 to execute fence_legacy (reboot). remaining timeout is 59
stonith-ng:    info: update_remaining_timeout: Attempted to execute agent fence_legacy (reboot) the maximum number of times (2)

stonith-ng:   error: log_operation: Operation 'reboot' [6955] (call 6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc' re
Trying: st_hq_hmc
stonith-ng: warning: log_operation: st_hq_hmc:6955 [ Performing: stonith -t ibmhmc -T reset crmapp02 ]
stonith-ng: warning: log_operation: st_hq_hmc:6955 [ failed: crmapp02 8 ]
stonith-ng:    info: internal_stonith_action_execute: Attempt 2 to execute fence_legacy (reboot). remaining timeout is 60
stonith-ng:    info: update_remaining_timeout: Attempted to execute agent fence_legacy (reboot) the maximum number of times (2)

stonith-ng:   error: log_operation: Operation 'reboot' [6976] (call 6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc:0'
stonith-ng: warning: log_operation: st_hq_hmc:0:6976 [ Performing: stonith -t ibmhmc -T reset crmapp02 ]
stonith-ng: warning: log_operation: st_hq_hmc:0:6976 [ failed: crmapp02 8 ]
stonith-ng:  notice: stonith_choose_peer: Couldn't find anyone to fence crmapp02 with
stonith-ng:    info: call_remote_stonith: None of the 1 peers are capable of terminating crmapp02 for crmd.4568 (1)
stonith-ng:   error: remote_op_done: Operation reboot of crmapp02 by for crmd.4568@crmapp01.6bf66b9c: No route to host
crmd:  notice: tengine_stonith_callback: Stonith operation 6/31:3700:0:b1fed277-9156-48da-8afd-35db672cd1c8: No route to
crmd:  notice: tengine_stonith_callback: Stonith operation 6 for crmapp02 failed (No route to host): aborting transition.
crmd:  notice: abort_transition_graph: Transition aborted: Stonith failed (source=tengine_stonith_callback:699, 0)
crmd:  notice: tengine_stonith_notify: Peer crmapp02 was not terminated (reboot) by for crmapp01: No route to host (re
crmd:  notice: run_graph: Transition 3700 (Complete=1, Pending=0, Fired=0, Skipped=18, Incomplete=2, Source=/var/lib/pacem
crmd:    info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_IN
pengine:    info: process_pe_message: Input has not changed since last time, not saving to disk
pengine:  notice: unpack_config: On loss of CCM Quorum: Ignore
pengine:    info: determine_online_status_fencing: Node crmapp01 is active
pengine:    info: determine_online_status: Node crmapp01 is online
pengine: warning: pe_fence_node: Node crmapp02 will be fenced because t
Re: [ClusterLabs] stonith in dual HMC environment
On 20/03/17 12:22 PM, Alexander Markov wrote:
> Hello guys,
>
> it looks like I'm missing something obvious, but I just don't get what
> has happened.
>
> I've got a number of stonith-enabled clusters within my big POWER boxes.
> My stonith devices are two HMCs (hardware management consoles) - separate
> servers from IBM that can reboot individual LPARs (logical partitions)
> within the POWER boxes - one per datacenter.
>
> So my definition for the stonith devices was pretty straightforward:
>
> primitive st_dc2_hmc stonith:ibmhmc \
>     params ipaddr=10.1.2.9
> primitive st_dc1_hmc stonith:ibmhmc \
>     params ipaddr=10.1.2.8
> clone cl_st_dc2_hmc st_dc2_hmc
> clone cl_st_dc1_hmc st_dc1_hmc
>
> Everything was OK when we tested failover. But today, upon a power
> outage, we lost one DC completely. Shortly after that the cluster just
> literally hung itself upon trying to reboot the nonexistent node. No
> failover occurred. The nonexistent node was marked OFFLINE UNCLEAN and
> resources were marked "Started UNCLEAN" on the nonexistent node.
>
> UNCLEAN seems to flag a problem with the stonith configuration. So my
> question is: how to avoid such behaviour?
>
> Thank you!

Please share your config along with the logs from the nodes that were
affected.

cheers,

digimer

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
[ClusterLabs] stonith in dual HMC environment
Hello guys,

it looks like I'm missing something obvious, but I just don't get what
has happened.

I've got a number of stonith-enabled clusters within my big POWER boxes.
My stonith devices are two HMCs (hardware management consoles) - separate
servers from IBM that can reboot individual LPARs (logical partitions)
within the POWER boxes - one per datacenter.

So my definition for the stonith devices was pretty straightforward:

primitive st_dc2_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.9
primitive st_dc1_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.8
clone cl_st_dc2_hmc st_dc2_hmc
clone cl_st_dc1_hmc st_dc1_hmc

Everything was OK when we tested failover. But today, upon a power outage,
we lost one DC completely. Shortly after that the cluster just literally
hung itself upon trying to reboot the nonexistent node. No failover
occurred. The nonexistent node was marked OFFLINE UNCLEAN and resources
were marked "Started UNCLEAN" on the nonexistent node.

UNCLEAN seems to flag a problem with the stonith configuration. So my
question is: how to avoid such behaviour?

Thank you!

--
Regards,
Alexander