Re: [ClusterLabs] Stonith

2017-03-31 Thread Alexander Markov

Kristoffer Grönlund writes:


The only solution I know which allows for a configuration like this is
using separate clusters in each data center, and using booth for
transferring ticket ownership between them. Booth requires a data
center-level quorum (meaning at least 3 locations), though the third
location can be just a small daemon without an actual cluster, and can
run in a public cloud or similar for example.


It looks like it's really impossible to solve this situation without an 
arbiter (third party).
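
For the record, my understanding of what a minimal booth setup would look
like is roughly this (a sketch only: the addresses, ticket name and
rsc_ticket constraint below are made up for illustration):

# /etc/booth/booth.conf on both sites and on the arbitrator
transport  = UDP
port       = 9929
arbitrator = 10.1.3.1      # small booth daemon at the third location
site       = 10.1.1.1      # datacenter A
site       = 10.1.2.1      # datacenter B
ticket     = "ticket-sap"
    expire = 600

# in the CIB of each cluster, tie the resources to the ticket
rsc_ticket sap-req-ticket ticket-sap: g_sap loss-policy=stop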

Thank you, guys.

--
Regards,
Alexander

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stonith

2017-03-30 Thread Alexander Markov

Hello, Kristoffer


Did you test failover through pacemaker itself?


Yes, I did, no problems here.


However: Am I understanding it correctly that you have one node in each
data center, and a stonith device in each data center?


Yes.


If the data center is lost, the stonith device for the node in that data
center would also be lost and thus not able to fence.


Exactly what happens!


In such a hardware configuration, only a poison pill solution like SBD
could work, I think.


I've got no shared storage here. Every datacenter has its own storage, 
and they have replication on top (similar to drbd). I could organize a 
cross-shared solution if it helps, but I don't see how.
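
If I were to carve out a small LUN and map it to both nodes, I suppose an
SBD setup would look roughly like this (a sketch only: the device path is
made up, and the device count/placement across sites would need more
thought):

# initialize the poison-pill device (hypothetical multipath device)
sbd -d /dev/mapper/sbd_lun create

# /etc/sysconfig/sbd
SBD_DEVICE="/dev/mapper/sbd_lun"
SBD_WATCHDOG_DEV="/dev/watchdog"

# stonith resource in the CIB
primitive stonith-sbd stonith:external/sbd \
        op start interval=0 timeout=20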



--
Regards,
Alexander



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] stonith in dual HMC environment

2017-03-28 Thread Alexander Markov

Hello, Dejan,


Why? I don't have a test system right now, but for instance this
should work:

$ stonith -t ibmhmc ipaddr=10.1.2.9 -lS
$ stonith -t ibmhmc ipaddr=10.1.2.9 -T reset {nodename}


Ah, I see. Everything (including stonith methods, fencing and failover) 
works just fine under normal circumstances. Sorry if I wasn't clear 
about that. The problem occurs only when one datacenter (i.e. one IBM 
machine and one HMC) is lost due to a power outage.


For example:
test01:~ # stonith -t ibmhmc ipaddr=10.1.2.8 -lS | wc -l
info: ibmhmc device OK.
39
test01:~ # stonith -t ibmhmc ipaddr=10.1.2.9 -lS | wc -l
info: ibmhmc device OK.
39

As I said, the stonith device can see and manage all the cluster nodes.


If so, then your configuration does not appear to be correct. If
both are capable of managing all nodes then you should tell
pacemaker about it.


Thanks for the hint. But if the stonith device returns a node list, isn't 
it obvious to the cluster that it can manage those nodes? Could you 
please be more specific about what you are referring to? I have now 
changed the configuration to two fencing levels (one per HMC) but still 
don't think I get the idea here.
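
To be concrete, I read "telling pacemaker about it" as something like the
following (a sketch in crm syntax, reusing the names from my config; the
pcmk_host_* values and the topology are only my guess at what is meant):

primitive st_ch_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.9 pcmk_host_check=static-list \
                pcmk_host_list="crmapp01 crmapp02" \
        op start interval=0 timeout=300
primitive st_hq_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.8 pcmk_host_check=static-list \
                pcmk_host_list="crmapp01 crmapp02" \
        op start interval=0 timeout=300
# try one HMC first, then the other, when fencing a node
fencing_topology \
        crmapp01: st_hq_hmc st_ch_hmc \
        crmapp02: st_ch_hmc st_hq_hmc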



The surviving node, running the stonith resource for the dead node,
tries to contact the ipmi device (which is also dead). How does the
cluster understand that the lost node is really dead and it's not just
a network issue?

It cannot.


How do people then actually solve the problem of a two-node metro 
cluster? I mean, I know one option (stonith-enabled=false), but it 
doesn't seem right to me.


Thank you.

Regards,
Alexander Markov


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] stonith in dual HMC environment

2017-03-27 Thread Alexander Markov

Hello, Dejan,



The first thing I'd try is making sure you can fence each node from the
command line by manually running the fence agent. I'm not sure how to do
that for the "stonith:" type agents.

There's a program stonith(8). It's easy to replicate the
configuration on the command line.


Unfortunately, it is not.

The landscape I refer to is similar to VMWare. We use the cluster for 
virtual machines (LPARs), and everything works OK, but the real pain 
occurs when the whole host system is down. Keeping in mind that it's 
now in production use, I just can't afford to turn it off for testing.




Stonith agents are to be queried for the list of nodes they can
manage. It's part of the interface. Some agents can figure that
out by themselves and some need a parameter defining the node list.


And this is just where I'm stuck. I've got two stonith devices (ibmhmc) 
for redundancy. Both of them are capable of managing every node. The 
problem starts when:


1) one stonith device is completely lost and inaccessible (due to the 
power outage in its datacenter), and
2) the surviving stonith device can reach neither the cluster node nor 
the hosting system (in VMWare terms) for that cluster node, because both 
of them are also lost in the power outage.


What is the correct solution for this situation?


Well, this used to be a standard way to configure one kind of
stonith resources, one common representative being ipmi, and
served exactly the purpose of restricting the stonith resource
from being enabled ("running") on a node which this resource
manages.
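
(I take that to mean the pattern of banning each fencing resource from
the node it is responsible for, i.e. something like this sketch with
made-up names and addresses:

primitive st_ipmi_node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=10.0.0.11 userid=admin
# never run node1's fencing resource on node1 itself
location l_st_ipmi_node1 st_ipmi_node1 -inf: node1
)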


Unfortunately, there's no such thing as ipmi in IBM Power boxes. But it 
raises an interesting question for me: if both a node and its 
complementary ipmi device are lost (due to a power outage), what happens 
to the cluster? The surviving node, running the stonith resource for the 
dead node, tries to contact the ipmi device (which is also dead). How 
does the cluster understand that the lost node is really dead and it's 
not just a network issue?


Thank you.

--
Regards,
Alexander Markov
+79104531955

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] stonith in dual HMC environment

2017-03-23 Thread Alexander Markov



Please share your config along with the logs from the nodes that were
affected.


I'm starting to think it's not about how to define stonith resources. If 
the whole box is down with all the logical partitions defined on it, then 
the HMC cannot tell whether an LPAR (partition) is really dead or just 
inaccessible. This leads to an UNCLEAN OFFLINE node status and 
pacemaker's refusal to do anything until that is resolved. Am I right? 
Anyway, the simplest pacemaker config from my partitions is below.


primitive sap_ASCS SAPInstance \
        params InstanceName=CAP_ASCS01_crmapp \
        op monitor timeout=60 interval=120 depth=0
primitive sap_D00 SAPInstance \
        params InstanceName=CAP_D00_crmapp \
        op monitor timeout=60 interval=120 depth=0
primitive sap_ip IPaddr2 \
        params ip=10.1.12.2 nic=eth0 cidr_netmask=24
primitive st_ch_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.9 \
        op start interval=0 timeout=300
primitive st_hq_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.8 \
        op start interval=0 timeout=300
group g_sap sap_ip sap_ASCS sap_D00 \
        meta target-role=Started
location l_ch_hq_hmc st_ch_hmc -inf: crmapp01
location l_st_hq_hmc st_hq_hmc -inf: crmapp02
location prefer_node_1 g_sap 100: crmapp01
property cib-bootstrap-options: \
        stonith-enabled=true \
        no-quorum-policy=ignore \
        placement-strategy=balanced \
        expected-quorum-votes=2 \
        dc-version=1.1.12-f47ea56 \
        cluster-infrastructure="classic openais (with plugin)" \
        last-lrm-refresh=1490009096 \
        maintenance-mode=false
rsc_defaults rsc-options: \
        resource-stickiness=200 \
        migration-threshold=3
op_defaults op-options: \
        timeout=600 \
        record-pending=true

The logs are pretty much going in circles: stonith cannot reset the 
logical partition via the HMC, the node stays UNCLEAN OFFLINE, and 
resources are shown as still running on the node that is down.



stonith-ng:error: log_operation:Operation 'reboot' [6942] (call 6 from crmd.4568) for host 'crmapp02' with device 'st_ch_hmc:0'
Trying: st_ch_hmc:0
stonith-ng:  warning: log_operation:st_ch_hmc:0:6942 [ Performing: stonith -t ibmhmc -T reset crmapp02 ]
stonith-ng:  warning: log_operation:st_ch_hmc:0:6942 [ failed: crmapp02 3 ]
stonith-ng: info: internal_stonith_action_execute:  Attempt 2 to execute fence_legacy (reboot). remaining timeout is 59
stonith-ng: info: update_remaining_timeout: Attempted to execute agent fence_legacy (reboot) the maximum number of times (2)

stonith-ng:error: log_operation:Operation 'reboot' [6955] (call 6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc' re
Trying: st_hq_hmc
stonith-ng:  warning: log_operation:st_hq_hmc:6955 [ Performing: stonith -t ibmhmc -T reset crmapp02 ]
stonith-ng:  warning: log_operation:st_hq_hmc:6955 [ failed: crmapp02 8 ]
stonith-ng: info: internal_stonith_action_execute:  Attempt 2 to execute fence_legacy (reboot). remaining timeout is 60
stonith-ng: info: update_remaining_timeout: Attempted to execute agent fence_legacy (reboot) the maximum number of times (2)

stonith-ng:error: log_operation:Operation 'reboot' [6976] (call 6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc:0'
stonith-ng:  warning: log_operation:st_hq_hmc:0:6976 [ Performing: stonith -t ibmhmc -T reset crmapp02 ]
stonith-ng:  warning: log_operation:st_hq_hmc:0:6976 [ failed: crmapp02 8 ]
stonith-ng:   notice: stonith_choose_peer:  Couldn't find anyone to fence crmapp02 with
stonith-ng: info: call_remote_stonith:  None of the 1 peers are capable of terminating crmapp02 for crmd.4568 (1)
stonith-ng:error: remote_op_done:   Operation reboot of crmapp02 by  for crmd.4568@crmapp01.6bf66b9c: No route to host
crmd:   notice: tengine_stonith_callback: Stonith operation 6/31:3700:0:b1fed277-9156-48da-8afd-35db672cd1c8: No route to
crmd:   notice: tengine_stonith_callback: Stonith operation 6 for crmapp02 failed (No route to host): aborting transition.
crmd:   notice: abort_transition_graph:   Transition aborted: Stonith failed (source=tengine_stonith_callback:699, 0)
crmd:   notice: tengine_stonith_notify:   Peer crmapp02 was not terminated (reboot) by  for crmapp01: No route to host (re
crmd:   notice: run_graph:Transition 3700 (Complete=1, Pending=0, Fired=0, Skipped=18, Incomplete=2, Source=/var/lib/pacem
crmd: info: do_state_transition:  State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_IN
pengine: info: process_pe_message:   Input has not changed since last time, not saving to disk
pengine:   notice: unpack_config:On loss of CCM Quorum: Ignore
pengine: info: determine_online_status_fencing:  Node crmapp01 is active
pengine: info: determine_online_status:  Node crmapp01 is online
pengine:  warning: pe_fence_node:Node crmapp02 will be fenced because