Re: [Pacemaker] Issues with HA cluster for mysqld

2012-08-24 Thread Dejan Muhamedagic
Hi,

On Thu, Aug 23, 2012 at 04:47:11PM -0400, David Parker wrote:
 
 On 08/23/2012 04:19 PM, Jake Smith wrote:
 Okay, I think I've almost got this.  I updated my Pacemaker config and
 made a few changes.  I put the MysqlIP and mysqld primitives into a
 resource group called mysqld-resources, ordered them such that mysqld
 will always wait for MysqlIP to be ready first, and added constraints to
 make ha1 the preferred host for the mysqld-resources group and ha2 the
 failover host.  I also created STONITH devices for both ha1 and ha2, and
 added constraints to fix the STONITH location issues.  My new
 constraints section looks like this:
 
 <constraints>
 <rsc_location id="loc-1" rsc="stonith-ha1" node="ha2" score="INFINITY"/>
 <rsc_location id="loc-2" rsc="stonith-ha2" node="ha1" score="INFINITY"/>
 Don't need the 2 above as long as you have the 2 negative locations below
 for stonith locations.  I prefer the negative below because if you ever
 expanded to greater than 2 nodes the stonith for any node could run on any
 node but itself.
 
 Good call.  I'll take those out of the config.
 
 <rsc_location id="loc-3" rsc="stonith-ha1" node="ha1" score="-INFINITY"/>
 <rsc_location id="loc-4" rsc="stonith-ha2" node="ha2" score="-INFINITY"/>
 <rsc_location id="loc-5" rsc="mysql-resources" node="ha1" score="200"/>
 Don't need the 0 score below either - the 200 above will take care of it.
 Pretty sure no location constraint is the same as a 0 score location.
 
 That was based on the example found in the documentation.  If I
 don't have the 0 score entry, will the service still fail over?
 
 <rsc_location id="loc-6" rsc="mysql-resources" node="ha2" score="0"/>
 </constraints>
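
 To make the question about the 0-score entry concrete, here is a toy
 model of how location scores pick a node. It is only a sketch under
 stated assumptions, not Pacemaker's actual placement algorithm: the
 function name pick_node is made up for illustration, the cluster is
 assumed to be symmetric (the default), and resource-stickiness,
 colocation and failcounts are all ignored. A node with no location
 constraint is treated as having score 0, which is why dropping loc-6
 should not stop the group from failing over to ha2.

```python
NEG_INFINITY = float("-inf")


def pick_node(scores, online_nodes):
    """Toy placement: highest location score among online, allowed nodes.

    scores       -- dict of node name -> location score for one resource
                    (hypothetical format; nodes without an entry count as 0)
    online_nodes -- iterable of node names that are currently up
    Returns the chosen node name, or None if no node is allowed.
    """
    candidates = {}
    for node in online_nodes:
        score = scores.get(node, 0)  # no constraint behaves like score 0
        if score != NEG_INFINITY:    # -INFINITY means "never run here"
            candidates[node] = score
    if not candidates:
        return None
    # Sort names first so that ties are broken deterministically.
    return max(sorted(candidates), key=lambda node: candidates[node])


# loc-5 only: ha1 is preferred while it is up, ha2 takes over when it is not.
print(pick_node({"ha1": 200}, ["ha1", "ha2"]))
print(pick_node({"ha1": 200}, ["ha2"]))
# loc-3 only: stonith-ha1 may run anywhere except on ha1 itself.
print(pick_node({"ha1": NEG_INFINITY}, ["ha1", "ha2"]))
```

 In this model, adding an explicit score of 0 for ha2 changes nothing,
 and the negative STONITH constraints keep working unchanged if a third
 node is added later.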
 
 Everything seems to work.  I had the virtual IP and mysqld running on
 ha1, and not on ha2.  I shut down ha1 using poweroff -n and both the
 virtual IP and mysqld came up on ha2 almost instantly.  When I powered
 ha1 on again, ha2 shut down the virtual IP and mysqld.  The virtual
 IP moved over instantly; a continuous ping of the IP produced one
 "Time to live exceeded" message and one packet was lost, but that's to be
 expected.  However, mysqld took almost 30 seconds to start up on ha1
 after being stopped on ha2, and I'm not exactly sure why.
 
 Here's the relevant log output from ha2:
 
 Aug 23 11:42:48 ha2 crmd: [1166]: info: te_rsc_command: Initiating action 16: stop mysqld_stop_0 on ha2 (local)
 Aug 23 11:42:48 ha2 crmd: [1166]: info: do_lrm_rsc_op: Performing key=16:1:0:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_stop_0 )
 Aug 23 11:42:48 ha2 lrmd: [1163]: info: rsc:mysqld:10: stop
 Aug 23 11:42:50 ha2 lrmd: [1163]: info: RA output: (mysqld:stop:stdout) Stopping MySQL daemon: mysqld_safe.
 Aug 23 11:42:50 ha2 crmd: [1166]: info: process_lrm_event: LRM operation mysqld_stop_0 (call=10, rc=0, cib-update=57, confirmed=true) ok
 Aug 23 11:42:50 ha2 crmd: [1166]: info: match_graph_event: Action mysqld_stop_0 (16) confirmed on ha2 (rc=0)
 
 And here's the relevant log output from ha1:
 
 Aug 23 11:42:47 ha1 crmd: [1243]: info: do_lrm_rsc_op: Performing key=8:1:7:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_monitor_0 )
 Aug 23 11:42:47 ha1 lrmd: [1240]: info: rsc:mysqld:5: probe
 Aug 23 11:42:47 ha1 crmd: [1243]: info: process_lrm_event: LRM operation mysqld_monitor_0 (call=5, rc=7, cib-update=10, confirmed=true) not running
 Aug 23 11:43:36 ha1 crmd: [1243]: info: do_lrm_rsc_op: Performing key=11:3:0:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_start_0 )
 Aug 23 11:43:36 ha1 lrmd: [1240]: info: rsc:mysqld:11: start
 Aug 23 11:43:36 ha1 lrmd: [1240]: info: RA output: (mysqld:start:stdout) Starting MySQL daemon: mysqld_safe.#012(See /usr/local/mysql/data/mysql.messages for messages).
 Aug 23 11:43:36 ha1 crmd: [1243]: info: process_lrm_event: LRM operation mysqld_start_0 (call=11, rc=0, cib-update=18, confirmed=true) ok
 
 So, ha2 stopped mysqld at 11:42:50, but ha1 didn't start mysqld until
 11:43:36, a full 46 seconds after it was stopped on ha2.  Any ideas why
 the delay for mysqld was so long, when the MysqlIP resource moved almost
 instantly?
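
 The gap between the two log excerpts can be checked mechanically. The
 snippet below is a minimal sketch: the helper name syslog_gap_seconds is
 made up, the year is an assumption (syslog timestamps do not carry one),
 and it only makes sense if both hosts' clocks are in sync.

```python
from datetime import datetime


def syslog_gap_seconds(earlier, later, year=2012):
    """Return the number of seconds between two syslog-style timestamps.

    Both arguments look like "Aug 23 11:42:50". The year is supplied
    separately because the log lines themselves do not include it.
    """
    fmt = "%Y %b %d %H:%M:%S"
    start = datetime.strptime(f"{year} {earlier}", fmt)
    end = datetime.strptime(f"{year} {later}", fmt)
    return int((end - start).total_seconds())


# mysqld_stop_0 confirmed on ha2 vs. mysqld_start_0 issued on ha1
print(syslog_gap_seconds("Aug 23 11:42:50", "Aug 23 11:43:36"))  # 46
```

 The same helper shows that ha1 logged nothing for mysqld between the
 probe at 11:42:47 and the start at 11:43:36, so the time was spent
 before the start operation was issued rather than inside it.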
 Couple thoughts.
 
 Are you sure both servers have the same time (in sync)?
 
 Yep.  They're both using NTP.
 
 On ha2, did you verify that mysqld had actually finished stopping at the
 11:42:50 mark?  I don't use mysql so I can't say from experience.
 
 Yes, I kept checking (with ps -ef | grep mysqld) every few
 seconds, and it stopped running around that time.  As soon as it
 stopped running on ha2, I started checking on ha1 and it was quite a
 while before mysqld started.  I knew it was at least 30 seconds, and
 I believe it was actually 46 seconds as the logs indicate.
 
 Just curious, but do you really want it to fail back if it's actively
 running on ha2?
 
 Interesting point.  I had just assumed that it was good practice to
 have a preferred node for a service, but I guess it doesn't matter.
 If I don't care which node the services run on, do I just remove the
 location constraints for the mysql-resources group 

Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

2012-08-24 Thread Florian Crouzat

On 24/08/2012 01:36, Andrew Martin wrote:

The dampen parameter tells the cluster to wait before making any decision, so 
that if the IP comes back online within the dampen period then no action is 
taken. Is this correct?


This is also my understanding of this parameter.
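
As an illustration of that reading of dampen, here is a small simulation.
It is a simplified model of the behaviour described above, not the actual
attribute-handling code in Pacemaker; the function name, the observation
format, and the values 1 (ping target reachable) and 0 (unreachable) are
all assumptions made for the example.

```python
def dampened_commits(observations, dampen):
    """Return the (time, value) changes that survive a dampen window.

    observations -- list of (time_seconds, value) pairs sorted by time;
                    the first entry is taken as the already-committed state
    dampen       -- seconds to wait after a change before acting on it
    """
    commits = []
    committed = observations[0][1]
    latest = committed
    deadline = None  # when the pending change would be written out

    for when, value in observations[1:]:
        # A timer that expired before this observation fires first.
        if deadline is not None and when >= deadline:
            if latest != committed:
                committed = latest
                commits.append((deadline, committed))
            deadline = None
        latest = value
        # A differing value starts the timer unless one is already running.
        if value != committed and deadline is None:
            deadline = when + dampen

    # Nothing else arrives, so a still-pending timer fires with the last value.
    if deadline is not None and latest != committed:
        commits.append((deadline, latest))
    return commits


# Target drops at t=10 and is back at t=13: inside a 5 s window, no action.
print(dampened_commits([(0, 1), (10, 0), (13, 1)], dampen=5))
# Target drops at t=10 and stays down: the change is acted on at t=15.
print(dampened_commits([(0, 1), (10, 0)], dampen=5))
```

In this model a blip shorter than the dampen period produces no committed
change at all, so nothing would prompt the cluster to move resources.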

--
Cheers,
Florian Crouzat

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Apparent error passing stonith resource parameters (external/libvirt)

2012-08-24 Thread Nathan Bird
I'm trying to set up an external/libvirt stonith fencing thingiemadoodle 
but ran into an error.


My Pacemaker configuration is (read back out, it looks the same):

primitive p-fence-om0101 stonith:external/libvirt \
  params hostlist="proxy1 mysql1" \
    hypervisor_uri="qemu+ssh://root@om01/system?keyfile=/root/.ssh/om01" \
  op monitor interval=60
# there's also a location rule that is working fine


If I just try to connect with libvirt, that URI is correct. When the 
stonith script runs, though, it gets an incomplete value, 
qemu+ssh://root@om01/system?keyfile, as observed via log messages.


Apparently the equal symbol '=' is causing a problem for the parameter 
passing somewhere.


When I read the external/libvirt plugin's code, it appears to rely on 
the environment variable '$hypervisor_uri', and the log message printing 
it indicates that the value is invalid.


I don't know where to look to find what sets that environment variable; 
any suggestions?
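
One way a value can get cut off at exactly that point is a name=value 
parser that splits on every '=' instead of only the first one. The sketch 
below is just an illustration of that class of bug; it is a guess, not a 
statement about what the stonith layer actually does, and both function 
names are made up.

```python
PAIR = "hypervisor_uri=qemu+ssh://root@om01/system?keyfile=/root/.ssh/om01"


def split_every_equals(pair):
    """Buggy variant: split on every '=' and keep only the first two fields."""
    name, value = pair.split("=")[:2]
    return name, value


def split_first_equals(pair):
    """Safer variant: split once, so later '=' characters stay in the value."""
    name, value = pair.split("=", 1)
    return name, value


print(split_every_equals(PAIR)[1])  # qemu+ssh://root@om01/system?keyfile
print(split_first_equals(PAIR)[1])  # qemu+ssh://root@om01/system?keyfile=/root/.ssh/om01
```

The buggy variant yields exactly the truncated value seen in the logs, 
which is why any code that parses name=value pairs on the way from the 
CIB to the plugin's environment would be worth checking first.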





Re: [Pacemaker] Apparent error passing stonith resource parameters (external/libvirt)

2012-08-24 Thread Nathan Bird

On 08/24/2012 10:47 AM, Nathan Bird wrote:
I'm trying to set up an external/libvirt stonith fencing 
thingiemadoodle but ran into an error.


My Pacemaker configuration is (read back out, it looks the same):

primitive p-fence-om0101 stonith:external/libvirt \
  params hostlist="proxy1 mysql1" \
    hypervisor_uri="qemu+ssh://root@om01/system?keyfile=/root/.ssh/om01" \
  op monitor interval=60
# there's also a location rule that is working fine


If I just try to connect with libvirt, that URI is correct. When the 
stonith script runs, though, it gets an incomplete value, 
qemu+ssh://root@om01/system?keyfile, as observed via log messages.


Apparently the equal symbol '=' is causing a problem for the parameter 
passing somewhere.


When I read the external/libvirt plugin's code, it appears to rely on 
the environment variable '$hypervisor_uri', and the log message 
printing it indicates that the value is invalid.


I don't know where to look to find what sets that environment variable; 
any suggestions?


I worked around this by copying the resource file into a new one named 
after the hypervisor I'm trying to talk to and embedding the correct URI 
in the file.


Additionally, I did a bit more bash quoting in the file, e.g.:
-out=$($VIRSH -c $hypervisor_uri start $domain_id 2>&1)
+out=$($VIRSH -c "$hypervisor_uri" start $domain_id 2>&1)

Though I'm confident that isn't the only issue: even with those quotes 
nothing works, because this script already has the wrong value in that 
variable before we get to those lines.



