Re: [Pacemaker] Apparent error passing stonith resource parameters (external/libvirt)

2012-08-24 Thread Nathan Bird

On 08/24/2012 10:47 AM, Nathan Bird wrote:
I'm trying to set up an external/libvirt stonith fencing 
thingiemadoodle, but I ran into an error.


My pacemaker configuration is (when read back out, it looks the same):

primitive p-fence-om0101 stonith:external/libvirt \
  params hostlist="proxy1 mysql1" \
hypervisor_uri="qemu+ssh://root@om01/system?keyfile=/root/.ssh/om01" \
  op monitor interval="60"
#there's also a location rule that is working fine


If I connect with libvirt directly, that URI works. When the stonith 
script runs, though, it gets a truncated value, 
"qemu+ssh://root@om01/system?keyfile", as seen in the log messages.


Apparently the equals sign '=' is causing a problem somewhere in the 
parameter passing.
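As an illustration of the suspected failure mode (this is just a sketch, not the actual plugin or glue code): splitting name=value on every '=' yields exactly the truncated string I'm observing.

```shell
#!/bin/sh
# Sketch only: reproduce the suspected '='-splitting bug, not the real code.
line='hypervisor_uri=qemu+ssh://root@om01/system?keyfile=/root/.ssh/om01'

# Naive split: cut keeps only the text between the 1st and 2nd '='.
broken=$(printf '%s' "$line" | cut -d= -f2)
echo "broken: $broken"

# Safe split: strip only up to the first '=', preserving any later ones.
safe=${line#*=}
echo "safe:   $safe"
```

Note that the "broken" result is byte-for-byte the value showing up in my logs.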


Reading the external/libvirt plugin's code, it appears to rely on 
the environment variable '$hypervisor_uri', and the log message that 
prints this variable shows the invalid value.


I don't know where to look for what fills in that environment variable; 
any suggestions?
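For reference, external/* stonith plugins receive each configured parameter as an environment variable of the same name, set by stonithd before the plugin runs (which is why the plugin reads $hypervisor_uri directly). A minimal simulation of that hand-off, using the URI from my config:

```shell
#!/bin/sh
# Sketch: simulate stonithd exporting a resource parameter into the
# environment of a child process, the way external/* plugins receive it.
seen=$(hypervisor_uri='qemu+ssh://root@om01/system?keyfile=/root/.ssh/om01' \
  sh -c 'printf %s "$hypervisor_uri"')
echo "plugin would see: $seen"
```

If the value is already truncated at this point, the mangling happens before the plugin ever runs.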


I worked around this by copying the resource script to a new file named 
after the hypervisor I'm trying to talk to and hard-coding the correct 
URI in it.


Additionally I did a bit more bash quoting in the file, e.g.:
-out=$($VIRSH -c $hypervisor_uri start $domain_id 2>&1)
+out=$($VIRSH -c "$hypervisor_uri" start $domain_id 2>&1)

I'm confident that isn't the only issue, though: even with those quotes 
nothing works, since the script already has the wrong value in that 
variable before it ever reaches those lines.
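For completeness, here is the hazard those added quotes guard against: '?' is a shell glob character, so an unquoted expansion can be silently rewritten by filename matching. A contrived sketch:

```shell
#!/bin/sh
# Sketch of the globbing hazard: run in a scratch directory containing a
# file that the unquoted pattern happens to match.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch 'systemXkeyfile'

pat='system?keyfile'
unquoted=$(echo $pat)      # glob fires: expands to systemXkeyfile
quoted=$(echo "$pat")      # quotes suppress globbing: stays literal
echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

With a full URI (slashes included) a match is unlikely in practice, but quoting makes the behavior deterministic either way.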




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

2012-08-24 Thread Florian Crouzat

On 24/08/2012 01:36, Andrew Martin wrote:

The dampen parameter tells the cluster to wait before making any decision, so 
that if the IP comes back online within the dampen period then no action is 
taken. Is this correct?


This is also my understanding of this parameter.
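For context, a minimal crm sketch of where dampen sits (resource names and the ping target here are illustrative, not from this thread):

```
primitive p-ping ocf:pacemaker:ping \
  params host_list="192.168.1.254" multiplier="1000" dampen="15s" \
  op monitor interval="10s"
clone c-ping p-ping
location l-need-connectivity g-myservices \
  rule -inf: not_defined pingd or pingd lte 0
```

With dampen="15s", a lost ping target has to stay lost for 15 seconds before the pingd attribute change takes effect, so a brief blip doesn't move resources.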

--
Cheers,
Florian Crouzat



Re: [Pacemaker] Issues with HA cluster for mysqld

2012-08-24 Thread Dejan Muhamedagic
Hi,

On Thu, Aug 23, 2012 at 04:47:11PM -0400, David Parker wrote:
> 
> On 08/23/2012 04:19 PM, Jake Smith wrote:
> >>Okay, I think I've almost got this.  I updated my Pacemaker config
> >>and
> >>made a few changes.  I put the MysqlIP and mysqld primitives into a
> >>resource group called "mysqld-resources", ordered them such that
> >>mysqld
> >>will always wait for MysqlIP to be ready first, and added constraints
> >>to
> >>make ha1 the preferred host for the mysqld-resources group and ha2
> >>the
> >>failover host.  I also created STONITH devices for both ha1 and ha2,
> >>and
> >>added constraints to fix the STONITH location issues.  My new
> >>constraints section looks like this:
> >>
> >>
> >><rsc_location ... score="INFINITY"/>
> >><rsc_location ... score="INFINITY"/>
> >Don't need the 2 above as long as you have the 2 negative locations below 
> >for stonith locations.  I prefer the negative below because if you ever 
> >expanded to greater than 2 nodes the stonith for any node could run on any 
> >node but itself.
> 
> Good call.  I'll take those out of the config.
> 
> >><rsc_location ... score="-INFINITY"/>
> >><rsc_location ... score="-INFINITY"/>
> >><rsc_location ... score="200"/>
> >Don't need the 0 score below either - the 200 above will take care of it.  
> >Pretty sure no location constraint is the same as a 0 score location.
> 
> That was based on the example found in the documentation.  If I
> don't have the 0 score entry, will the service still fail over?
> 
> >><rsc_location ... score="0"/>
> >>
> >>
> >>Everything seems to work.  I had the virtual IP and mysqld running on
> >>ha1, and not on ha2.  I shut down ha1 using "poweroff -n" and both
> >>the
> >>virtual IP and mysqld came up on ha2 almost instantly.  When I
> >>powered
> >>ha1 on again, ha2 shut down the virtual IP and mysqld.  The
> >>virtual
> >>IP moved over instantly; a continuous ping of the IP produced one
> >>"Time
> >>to live exceeded" message and one packet was lost, but that's to be
> >>expected.  However, mysqld took almost 30 seconds to start up on ha1
> >>after being stopped on ha2, and I'm not exactly sure why.
> >>
> >>Here's the relevant log output from ha2:
> >>
> >>Aug 23 11:42:48 ha2 crmd: [1166]: info: te_rsc_command: Initiating
> >>action 16: stop mysqld_stop_0 on ha2 (local)
> >>Aug 23 11:42:48 ha2 crmd: [1166]: info: do_lrm_rsc_op: Performing
> >>key=16:1:0:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_stop_0 )
> >>Aug 23 11:42:48 ha2 lrmd: [1163]: info: rsc:mysqld:10: stop
> >>Aug 23 11:42:50 ha2 lrmd: [1163]: info: RA output:
> >>(mysqld:stop:stdout)
> >>Stopping MySQL daemon: mysqld_safe.
> >>Aug 23 11:42:50 ha2 crmd: [1166]: info: process_lrm_event: LRM
> >>operation
> >>mysqld_stop_0 (call=10, rc=0, cib-update=57, confirmed=true) ok
> >>Aug 23 11:42:50 ha2 crmd: [1166]: info: match_graph_event: Action
> >>mysqld_stop_0 (16) confirmed on ha2 (rc=0)
> >>
> >>And here's the relevant log output from ha1:
> >>
> >>Aug 23 11:42:47 ha1 crmd: [1243]: info: do_lrm_rsc_op: Performing
> >>key=8:1:7:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_monitor_0 )
> >>Aug 23 11:42:47 ha1 lrmd: [1240]: info: rsc:mysqld:5: probe
> >>Aug 23 11:42:47 ha1 crmd: [1243]: info: process_lrm_event: LRM
> >>operation
> >>mysqld_monitor_0 (call=5, rc=7, cib-update=10, confirmed=true) not
> >>running
> >>Aug 23 11:43:36 ha1 crmd: [1243]: info: do_lrm_rsc_op: Performing
> >>key=11:3:0:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_start_0 )
> >>Aug 23 11:43:36 ha1 lrmd: [1240]: info: rsc:mysqld:11: start
> >>Aug 23 11:43:36 ha1 lrmd: [1240]: info: RA output:
> >>(mysqld:start:stdout)
> >>Starting MySQL daemon: mysqld_safe.#012(See
> >>/usr/local/mysql/data/mysql.messages for messages).
> >>Aug 23 11:43:36 ha1 crmd: [1243]: info: process_lrm_event: LRM
> >>operation
> >>mysqld_start_0 (call=11, rc=0, cib-update=18, confirmed=true) ok
> >>
> >>So, ha2 stopped mysqld at 11:42:50, but ha1 didn't start mysqld until
> >>11:43:36, a full 46 seconds after it was stopped on ha2.  Any ideas
> >>why
> >>the delay for mysqld was so long, when the MysqlIP resource moved
> >>almost
> >>instantly?
> >Couple thoughts.
> >
> >Are you sure both servers have the same time (in sync)?
> 
> Yep.  They're both using NTP.
> 
> >On HA2 did you verify mysqld was actually done stopping at the 11:42:50 mark?
> >I don't use mysql so I can't say from experience.
> 
> Yes, I kept checking (with "ps -ef | grep mysqld") every few
> seconds, and it stopped running around that time.  As soon as it
> stopped running on ha2, I started checking on ha1 and it was quite a
> while before mysqld started.  I knew it was at least 30 seconds, and
> I believe it was actually 46 seconds, as the logs indicate.
> 
> >Just curious but do you really want it to failback if it's actively running 
> >on ha2?
> 
> Interesting point.  I had just assumed that it was good practice to
> have a preferred node for a service, but I guess it doesn't matter.
> If I don't care which node the services run on, do I just remove the
> location constraints for the "mysql-resources" group altogether?
>