Re: [ClusterLabs] group resources not grouped ?!?

2015-10-09 Thread zulucloud



On 10/08/2015 12:57 PM, Jorge Fábregas wrote:

On 10/08/2015 06:04 AM, zulucloud wrote:

  are there any other ways?


Hi,

You might want to check external/vmware or external/vcenter.  I've never
used them but apparently one is used to fence via the hypervisor (ESXi
itself) and the other thru vCenter.



Hello,
thank you very much, this looks interesting and may be another option. 
I'll give it a closer look.


brgds

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] group resources not grouped ?!?

2015-10-08 Thread Jorge Fábregas
On 10/08/2015 06:04 AM, zulucloud wrote:
>  are there any other ways?

Hi,

You might want to check external/vmware or external/vcenter.  I've never
used them but apparently one is used to fence via the hypervisor (ESXi
itself) and the other thru vCenter.
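
For reference, a stonith primitive for external/vcenter would look roughly
like the sketch below (untested; VI_SERVER, VI_CREDSTORE, HOSTLIST and
RESETPOWERON are the parameter names documented for that plugin, and every
value shown is a placeholder for your vCenter server, credential store file
and node-to-VM-name mapping):

primitive st_vcenter stonith:external/vcenter \
    params VI_SERVER="vcenter.example.com" \
        VI_CREDSTORE="/root/.vmware/credstore/vicredentials.xml" \
        HOSTLIST="ali=ali_vm;baba=baba_vm" RESETPOWERON="0" \
    op monitor interval="60s"

Something like "stonith -t external/vcenter -n" should list the exact
parameters the version installed on your nodes expects.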

-- 
Jorge



[ClusterLabs] group resources not grouped ?!?

2015-10-07 Thread zulucloud

Hi,
I've got a problem I don't understand; maybe someone can give me a hint.

My 2-node cluster (nodes named ali and baba) is configured to run mysql, 
an IP for mysql and the filesystem resource (on the drbd master) together 
as a GROUP. After doing some crash tests I ended up with the filesystem 
and mysql running happily on one host (ali) and the related IP on the 
other (baba). Although the IP is not really up and running there, crm_mon 
just SHOWS it as started. In fact it's up nowhere, neither on ali nor 
on baba.


crm_mon shows that pacemaker tried to start it on baba, but gave up 
after fail-count=100.


Q1: Why doesn't pacemaker put the IP on ali, where all the rest of its 
group lives?
Q2: Why doesn't pacemaker try to start the IP on ali after the max 
fail count has been reached on baba?
Q3: Why is crm_mon showing the IP as "started" when it's down after 
10 tries?


Thanks :)


config (some parts removed):
---
node ali
node baba

primitive res_drbd ocf:linbit:drbd \
params drbd_resource="r0" \
op stop interval="0" timeout="100" \
op start interval="0" timeout="240" \
op promote interval="0" timeout="90" \
op demote interval="0" timeout="90" \
op notify interval="0" timeout="90" \
op monitor interval="40" role="Slave" timeout="20" \
op monitor interval="20" role="Master" timeout="20"
primitive res_fs ocf:heartbeat:Filesystem \
params device="/dev/drbd0" directory="/drbd_mnt" fstype="ext4" \
op monitor interval="30s"
primitive res_hamysql_ip ocf:heartbeat:IPaddr2 \
params ip="XXX.XXX.XXX.224" nic="eth0" cidr_netmask="23" \
op monitor interval="10s" timeout="20s" depth="0"
primitive res_mysql lsb:mysql \
op start interval="0" timeout="15" \
op stop interval="0" timeout="15" \
op monitor start-delay="30" interval="15" time-out="15"

group gr_mysqlgroup res_fs res_mysql res_hamysql_ip \
meta target-role="Started"
ms ms_drbd res_drbd \
	meta master-max="1" master-node-max="1" clone-max="2" \
	clone-node-max="1" notify="true"


colocation col_fs_on_drbd_master inf: res_fs:Started ms_drbd:Master

order ord_drbd_master_then_fs inf: ms_drbd:promote res_fs:start

property $id="cib-bootstrap-options" \
dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
cluster-infrastructure="openais" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
expected-quorum-votes="2" \
last-lrm-refresh="1438857246"


crm_mon -rnf (some parts removed):
-
Node ali: online
res_fs  (ocf::heartbeat:Filesystem) Started
res_mysql   (lsb:mysql) Started
res_drbd:0  (ocf::linbit:drbd) Master
Node baba: online
res_hamysql_ip  (ocf::heartbeat:IPaddr2) Started
res_drbd:1  (ocf::linbit:drbd) Slave

Inactive resources:

Migration summary:

* Node baba:
   res_hamysql_ip: migration-threshold=100 fail-count=100

Failed actions:
res_hamysql_ip_stop_0 (node=a891vl107s, call=35, rc=1, 
status=complete): unknown error


corosync.log:
--
pengine: [1223]: WARN: should_dump_input: Ignoring requirement that 
res_hamysql_ip_stop_0 comeplete before gr_mysqlgroup_stopped_0: 
unmanaged failed resources cannot prevent shutdown


pengine: [1223]: WARN: should_dump_input: Ignoring requirement that 
res_hamysql_ip_stop_0 comeplete before gr_mysqlgroup_stopped_0: 
unmanaged failed resources cannot prevent shutdown


Software:
--
corosync 1.2.1-4
pacemaker 1.0.9.1+hg15626-1
drbd8-utils 2:8.3.7-2.1
(for some reason it's not possible to update at this time)





Re: [ClusterLabs] group resources not grouped ?!?

2015-10-07 Thread Ken Gaillot
On 10/07/2015 09:12 AM, zulucloud wrote:
> Hi,
> I've got a problem I don't understand; maybe someone can give me a hint.
> 
> My 2-node cluster (nodes named ali and baba) is configured to run mysql,
> an IP for mysql and the filesystem resource (on the drbd master) together
> as a GROUP. After doing some crash tests I ended up with the filesystem
> and mysql running happily on one host (ali) and the related IP on the
> other (baba). Although the IP is not really up and running there, crm_mon
> just SHOWS it as started. In fact it's up nowhere, neither on ali nor
> on baba.
> 
> crm_mon shows that pacemaker tried to start it on baba, but gave up
> after fail-count=100.
> 
> Q1: Why doesn't pacemaker put the IP on ali, where all the rest of its
> group lives?
> Q2: Why doesn't pacemaker try to start the IP on ali after the max
> fail count has been reached on baba?
> Q3: Why is crm_mon showing the IP as "started" when it's down after
> 10 tries?
> 
> Thanks :)
> 
> 
> config (some parts removed):
> ---
> node ali
> node baba
> 
> primitive res_drbd ocf:linbit:drbd \
> params drbd_resource="r0" \
> op stop interval="0" timeout="100" \
> op start interval="0" timeout="240" \
> op promote interval="0" timeout="90" \
> op demote interval="0" timeout="90" \
> op notify interval="0" timeout="90" \
> op monitor interval="40" role="Slave" timeout="20" \
> op monitor interval="20" role="Master" timeout="20"
> primitive res_fs ocf:heartbeat:Filesystem \
> params device="/dev/drbd0" directory="/drbd_mnt" fstype="ext4" \
> op monitor interval="30s"
> primitive res_hamysql_ip ocf:heartbeat:IPaddr2 \
> params ip="XXX.XXX.XXX.224" nic="eth0" cidr_netmask="23" \
> op monitor interval="10s" timeout="20s" depth="0"
> primitive res_mysql lsb:mysql \
> op start interval="0" timeout="15" \
> op stop interval="0" timeout="15" \
> op monitor start-delay="30" interval="15" time-out="15"
> 
> group gr_mysqlgroup res_fs res_mysql res_hamysql_ip \
> meta target-role="Started"
> ms ms_drbd res_drbd \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> 
> colocation col_fs_on_drbd_master inf: res_fs:Started ms_drbd:Master
> 
> order ord_drbd_master_then_fs inf: ms_drbd:promote res_fs:start
> 
> property $id="cib-bootstrap-options" \
> dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
> cluster-infrastructure="openais" \
> stonith-enabled="false" \

Not having stonith is part of the problem (see below).

Without stonith, if the two nodes go into split brain (both up but can't
communicate with each other), Pacemaker will try to promote DRBD to
master on both nodes, mount the filesystem on both nodes, and start
MySQL on both nodes.
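
A minimal sketch of switching fencing back on once a working fence device
(for example one of the external/vmware or external/vcenter agents
mentioned earlier in the thread) has been configured and tested, in crm
shell syntax:

crm configure property stonith-enabled="true"
crm_verify -L -V    # re-check the live configuration for warnings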

> no-quorum-policy="ignore" \
> expected-quorum-votes="2" \
> last-lrm-refresh="1438857246"
> 
> 
> crm_mon -rnf (some parts removed):
> -
> Node ali: online
> res_fs  (ocf::heartbeat:Filesystem) Started
> res_mysql   (lsb:mysql) Started
> res_drbd:0  (ocf::linbit:drbd) Master
> Node baba: online
> res_hamysql_ip  (ocf::heartbeat:IPaddr2) Started
> res_drbd:1  (ocf::linbit:drbd) Slave
> 
> Inactive resources:
> 
> Migration summary:
> 
> * Node baba:
>res_hamysql_ip: migration-threshold=100 fail-count=100
> 
> Failed actions:
> res_hamysql_ip_stop_0 (node=a891vl107s, call=35, rc=1,
> status=complete): unknown error

The "_stop_" above means that a *stop* action on the IP failed.
Pacemaker tried to migrate the IP by first stopping it on baba, but it
couldn't. (Since the IP is the last member of the group, its failure
didn't prevent the other members from moving.)

Normally, when a stop fails, Pacemaker fences the node so it can safely
bring up the resource on the other node. But you disabled stonith, so it
got into this state.
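
Once it is safe to do so, the stuck "Started" state and the fail count can
be cleared by hand so Pacemaker tries again. A minimal sketch in crm shell
syntax, using the resource and node names from the configuration above
(assuming your crm shell version has the failcount subcommand):

crm resource cleanup res_hamysql_ip
# or inspect/reset only the fail count recorded on baba:
crm resource failcount res_hamysql_ip show baba
crm resource failcount res_hamysql_ip delete baba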

So, to proceed:

1) Stonith would help :)

2) Figure out why it couldn't stop the IP. There might be a clue in the
logs on baba (though they are indeed hard to follow; search for
"res_hamysql_ip_stop_0" around this time, and look around there). You could
also try adding and removing the IP manually, first with the usual OS
commands, and if that works, by calling the IP resource agent directly.
That often turns up the problem.
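
A rough sketch of that manual test on baba, using the address, netmask and
NIC from the configuration above (replace the XXX placeholder with the real
address; the resource agent path may differ on your distribution):

# 1) plain OS commands:
ip addr add XXX.XXX.XXX.224/23 dev eth0
ip addr show dev eth0
ip addr del XXX.XXX.XXX.224/23 dev eth0

# 2) calling the IPaddr2 agent directly with the same parameters:
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_ip="XXX.XXX.XXX.224"
export OCF_RESKEY_nic="eth0"
export OCF_RESKEY_cidr_netmask="23"
/usr/lib/ocf/resource.d/heartbeat/IPaddr2 start;   echo "start: $?"
/usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor; echo "monitor: $?"
/usr/lib/ocf/resource.d/heartbeat/IPaddr2 stop;    echo "stop: $?"

# or, if the resource-agents package ships ocf-tester:
ocf-tester -n test_ip -o ip="XXX.XXX.XXX.224" -o nic="eth0" \
    -o cidr_netmask="23" /usr/lib/ocf/resource.d/heartbeat/IPaddr2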

> 
> corosync.log:
> --
> pengine: [1223]: WARN: should_dump_input: Ignoring requirement that
> res_hamysql_ip_stop_0 comeplete before gr_mysqlgroup_stopped_0:
> unmanaged failed resources cannot prevent shutdown
> 
> pengine: [1223]: WARN: should_dump_input: Ignoring requirement that
> res_hamysql_ip_stop_0 comeplete before gr_mysqlgroup_stopped_0:
> unmanaged failed resources cannot prevent shutdown
> 
> Software:
> --
> corosync 1.2.1-4
> pacemaker 1.0.9.1+hg15626-1
> drbd8-utils 2:8.3.7-2.1
> (for some reason it's not possible to update at this time)

It should be possible to get