On 10/07/2015 09:12 AM, zulucloud wrote:
> Hi,
> i got a problem i don't understand, maybe someone can give me a hint.
>
> My 2-node cluster (named ali and baba) is configured to run mysql, an IP
> for mysql and the filesystem resource (on drbd master) together as a
> GROUP. After doing some crash-tests i ended up having filesystem and
> mysql running happily on one host (ali), and the related IP on the other
> (baba) .... although, the IP's not really up and running, crm_mon just
> SHOWS it as started there. In fact it's nowhere up, neither on ali nor
> on baba.
>
> crm_mon shows that pacemaker tried to start it on baba, but gave up
> after fail-count=1000000.
>
> Q1: why doesn't pacemaker put the IP on ali, where all the rest of it's
> group lives?
> Q2: why doesn't pacemaker try to start the IP on ali, after max
> failcount had been reached on baba?
> Q3: why is crm_mon showing the IP as "started", when it's down after
> 100000 tries?
>
> Thanks :)
>
>
> config (some parts removed):
> -------------------------------
> node ali
> node baba
>
> primitive res_drbd ocf:linbit:drbd \
>   params drbd_resource="r0" \
>   op stop interval="0" timeout="100" \
>   op start interval="0" timeout="240" \
>   op promote interval="0" timeout="90" \
>   op demote interval="0" timeout="90" \
>   op notify interval="0" timeout="90" \
>   op monitor interval="40" role="Slave" timeout="20" \
>   op monitor interval="20" role="Master" timeout="20"
> primitive res_fs ocf:heartbeat:Filesystem \
>   params device="/dev/drbd0" directory="/drbd_mnt" fstype="ext4" \
>   op monitor interval="30s"
> primitive res_hamysql_ip ocf:heartbeat:IPaddr2 \
>   params ip="XXX.XXX.XXX.224" nic="eth0" cidr_netmask="23" \
>   op monitor interval="10s" timeout="20s" depth="0"
> primitive res_mysql lsb:mysql \
>   op start interval="0" timeout="15" \
>   op stop interval="0" timeout="15" \
>   op monitor start-delay="30" interval="15" time-out="15"
>
> group gr_mysqlgroup res_fs res_mysql res_hamysql_ip \
>   meta target-role="Started"
> ms ms_drbd res_drbd \
>   meta master-max="1" master-node-max="1" clone-max="2" \
>   clone-node-max="1" notify="true"
>
> colocation col_fs_on_drbd_master inf: res_fs:Started ms_drbd:Master
>
> order ord_drbd_master_then_fs inf: ms_drbd:promote res_fs:start
>
> property $id="cib-bootstrap-options" \
>   dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
>   cluster-infrastructure="openais" \
>   stonith-enabled="false" \
Not having stonith is part of the problem (see below). Without stonith, if the two nodes go into split brain (both up but unable to communicate with each other), Pacemaker will try to promote DRBD to master on both nodes, mount the filesystem on both nodes, and start MySQL on both nodes.

>   no-quorum-policy="ignore" \
>   expected-quorum-votes="2" \
>   last-lrm-refresh="1438857246"
>
>
> crm_mon -rnf (some parts removed):
> ---------------------------------
> Node ali: online
>   res_fs (ocf::heartbeat:Filesystem) Started
>   res_mysql (lsb:mysql) Started
>   res_drbd:0 (ocf::linbit:drbd) Master
> Node baba: online
>   res_hamysql_ip (ocf::heartbeat:IPaddr2) Started
>   res_drbd:1 (ocf::linbit:drbd) Slave
>
> Inactive resources:
>
> Migration summary:
>
> * Node baba:
>   res_hamysql_ip: migration-threshold=1000000 fail-count=1000000
>
> Failed actions:
>   res_hamysql_ip_stop_0 (node=a891vl107s, call=35, rc=1,
>     status=complete): unknown error

The "_stop_" above means that a *stop* action on the IP failed. Pacemaker tried to migrate the IP by first stopping it on baba, but it couldn't. (Since the IP is the last member of the group, its failure didn't prevent the other members from moving.) Normally, when a stop fails, Pacemaker fences the node so that it can safely bring up the resource on the other node. But you disabled stonith, so it got stuck in this state.

So, to proceed:

1) Stonith would help :)

2) Figure out why the IP couldn't be stopped. There might be a clue in the logs on baba (though they are indeed hard to follow; search for "res_hamysql_ip_stop_0" around this time, and look at the surrounding messages). You could also try adding and removing the IP manually, first with the usual OS commands, and if that works, by calling the IP resource agent directly. That often turns up the problem.
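For reference, the manual test in step 2 might look like the sketch below, run as root on baba. The address is a placeholder (the real one is masked as XXX.XXX.XXX.224 above), and the OCF root path is the usual Linux location but can vary by distribution:

```shell
# 1) Plain OS commands first: can the address be added and removed at all?
ip addr add 192.0.2.224/23 dev eth0   # placeholder address; use the real one
ip addr show dev eth0                 # confirm it appears
ip addr del 192.0.2.224/23 dev eth0   # confirm it can be removed again

# 2) If that works, call the resource agent directly, the way the cluster
#    would, passing the resource parameters as OCF_RESKEY_* variables:
export OCF_ROOT=/usr/lib/ocf          # typical location; may differ
export OCF_RESKEY_ip=192.0.2.224
export OCF_RESKEY_nic=eth0
export OCF_RESKEY_cidr_netmask=23
$OCF_ROOT/resource.d/heartbeat/IPaddr2 start;   echo "start rc=$?"
$OCF_ROOT/resource.d/heartbeat/IPaddr2 monitor; echo "monitor rc=$?"
$OCF_ROOT/resource.d/heartbeat/IPaddr2 stop;    echo "stop rc=$?"
```

A nonzero rc from the agent's stop action (anything but 0) reproduces exactly the kind of failure shown in "Failed actions" above, and running it by hand lets you see the agent's error output directly.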
> corosync.log:
> --------------
> pengine: [1223]: WARN: should_dump_input: Ignoring requirement that
>   res_hamysql_ip_stop_0 comeplete before gr_mysqlgroup_stopped_0:
>   unmanaged failed resources cannot prevent shutdown
>
> pengine: [1223]: WARN: should_dump_input: Ignoring requirement that
>   res_hamysql_ip_stop_0 comeplete before gr_mysqlgroup_stopped_0:
>   unmanaged failed resources cannot prevent shutdown
>
> Software:
> ----------
> corosync 1.2.1-4
> pacemaker 1.0.9.1+hg15626-1
> drbd8-utils 2:8.3.7-2.1
> (for some reason it's not possible to update at this time)

It should be possible to get a simple setup like this working on those versions, but there have been legions of bug fixes, and a much better model for the corosync layer, since then.

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
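One more practical note: once the underlying stop failure is fixed, the recorded fail-count=1000000 will still keep the IP off baba until it is cleared. With the crm shell (matching the configuration syntax above; resource name taken from the post), a cleanup clears the failure history so the policy engine re-evaluates placement:

```shell
# Clear the failure record for the IP resource cluster-wide:
crm resource cleanup res_hamysql_ip

# Equivalently, with the lower-level Pacemaker tool:
crm_resource --cleanup --resource res_hamysql_ip
```

After the cleanup, crm_mon should stop showing the stale "Started" state on baba, and the IP should be started alongside the rest of its group.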