Re: [Pacemaker] mysql ocf resource agent - resource stays unmanaged if binary unavailable

2013-05-17 Thread Andrew Beekhof

Re: [Pacemaker] mysql ocf resource agent - resource stays unmanaged if binary unavailable

2013-05-17 Thread Andreas Kurz
On 2013-05-17 00:24, Vladimir wrote:
> Hi,
> 
> our pacemaker setup provides a mysql resource using the ocf resource agent.
> Today I tested with my colleagues forcing the mysql resource to fail. I
> don't understand the following behaviour. When I remove the mysqld_safe
> binary (whose path is specified in the crm config) from one server and
> move the mysql resource to this server, the resource will not fail
> back and stays in the "unmanaged" status. We can see that the function
> check_binary() is called within the mysql ocf resource agent and
> exits with error code "5". The fail-count gets raised to INFINITY and
> pacemaker tries to "stop" the resource, which fails. This results in an
> "unmanaged" status.
> 
> How to reproduce:
> 
> 1. mysql resource is running on node1
> 2. on node2 mv /usr/bin/mysqld_safe{,.bak}
> 3. crm resource move group-MySQL node2
> 4. observe corosync.log and crm_mon
> 
> # cat /var/log/corosync/corosync.log
> [...]
> May 16 10:53:41 node2 lrmd: [1893]: info: operation start[119] on res-MySQL-IP1 for client 1896: pid 5137 exited with return code 0
> May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL-IP1_start_0 (call=119, rc=0, cib-update=98, confirmed=true) ok
> May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=94:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL-IP1_monitor_3 )
> May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL-IP1 monitor[120] (pid 5222)
> May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=96:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_start_0 )
> May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL start[121] (pid 5223)
> May 16 10:53:41 node2 lrmd: [1893]: info: RA output: (res-MySQL:start:stderr) 2013/05/16_10:53:41 ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
> May 16 10:53:41 node2 lrmd: [1893]: info: operation start[121] on res-MySQL for client 1896: pid 5223 exited with return code 5
> May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL_start_0 (call=121, rc=5, cib-update=99, confirmed=true) not installed
> May 16 10:53:41 node2 lrmd: [1893]: info: operation monitor[120] on res-MySQL-IP1 for client 1896: pid 5222 exited with return code 0
> May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL-IP1_monitor_3 (call=120, rc=0, cib-update=100, confirmed=false) ok
> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_ais_dispatch: Update relayed from node1
> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-res-MySQL (INFINITY)
> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_perform_update: Sent update 44: fail-count-res-MySQL=INFINITY
> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_ais_dispatch: Update relayed from node1
> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-res-MySQL (1368694421)
> May 16 10:53:41 node2 attrd: [1894]: notice: attrd_perform_update: Sent update 47: last-failure-res-MySQL=1368694421
> May 16 10:53:41 node2 lrmd: [1893]: info: cancel_op: operation monitor[117] on res-DRBD-MySQL:1 for client 1896, its parameters: drbd_resource=[mysql] CRM_meta_role=[Master] CRM_meta_timeout=[2] CRM_meta_name=[monitor] crm_feature_set=[3.0.5] CRM_meta_notify=[true] CRM_meta_clone_node_max=[1] CRM_meta_clone=[1] CRM_meta_clone_max=[2] CRM_meta_master_node_max=[1] CRM_meta_interval=[29000] CRM_meta_globally_unique=[false] CRM_meta_master_max=[1]  cancelled
> May 16 10:53:41 node2 crmd: [1896]: info: send_direct_ack: ACK'ing resource op res-DRBD-MySQL:1_monitor_29000 from 3:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e: lrm_invoke-lrmd-1368694421-57
> May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=8:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_stop_0 )
> May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL stop[122] (pid 5278)
> [...]
> 
> I cannot figure out why the fail-count gets raised to INFINITY and
> especially why pacemaker tries to stop the resource after it fails.
> Wouldn't it be best for the resource to fail back to another node
> instead of ending up in an "unmanaged" status on the node? Is it
> possible to force this behaviour in any way?

By default, start failures are fatal: raising the fail-count to
INFINITY disallows future starts on this node until the resource, and
with it its fail-count, is cleaned up.
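
For reference, clearing that state with crmsh (resource name as in your config), or relaxing the default behaviour, might look like this - a minimal sketch:

  crm resource cleanup res-MySQL
  crm configure property start-failure-is-fatal=false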

On a start failure Pacemaker tries to stop the resource to be sure it is
really not started or stuck somewhere in-between ... the stop also fails in
your case, so the cluster gets stuck and sets the resource to unmanaged mode.

Why? Because you obviously have no stonith configured that could make
sure the resource is really stopped by fencing that node.

Solution for your problem: configure stonith correctly and enable it in
your cluster.
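
A minimal sketch of such a setup, reusing the external/xen0 agent from the other thread (hostnames and dom0 values are placeholders for whatever fencing device actually fits your hardware):

  primitive st-node1 stonith:external/xen0 \
    params hostlist="node1" dom0="dom0-host1"
  primitive st-node2 stonith:external/xen0 \
    params hostlist="node2" dom0="dom0-host2"
  location l-st-node1 st-node1 -inf: node1
  location l-st-node2 st-node2 -inf: node2
  property stonith-enabled="true"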

Best regards,
Andreas

Re: [Pacemaker] Stonith: How to avoid deathmatch cluster partitioning

2013-05-17 Thread Andreas Kurz
On 2013-05-16 11:01, Lars Marowsky-Bree wrote:
> On 2013-05-15T22:55:43, Andreas Kurz  wrote:
> 
>> start-delay is an option of the monitor operation ... in fact means
>> "don't trust that start was successfull, wait for the initial monitor
>> some more time"
> 
> It can be used on start here though to avoid exactly this situation; and
> it works fine for that, effectively being equivalent to the "delay"
> option on stonith (since the start always precedes the fence).

Hmm ... looking at the configuration there are two stonith resources,
each one locked to a node and they are started all the time so I can't
see how that would help here in case of a split-brain ... but please
correct me if I miss something here.

> 
>> The problem is, this would only make sense for one single stonith
>> resource that can fence more nodes. In case of a split-brain that would
>> delay the start on that node where the stonith resource was not running
>> before and gives that node a "penalty".
> 
> Sure. In a split-brain scenario, one side will receive a penalty, that's
> the whole point of this exercise. In particular for the external/sbd
> agent.

So you are confirming my explanation, thanks ;-)

Best regards,
Andreas

> 
> Or by grouping all fencing resources to always run on one node; if you
> don't have access to RHT fence agents, for example.
> 
> external/sbd also has code to avoid a death-match cycle in case of
> persistent split-brain scenarios now; after a reboot, the node that was
> fenced will not join unless the fence is cleared first.
> 
> (The RHT world calls that "unfence", I believe.)
> 
> That should be a win for the fence_sbd that I hope to get around to
> sometime in the next few months, too ;-)
> 
>> In your example with two stonith resources running all the time,
>> Digimer's suggestion is a good idea: use one of the redhat fencing
>> agents, most of them have some sort of "stonith-delay" parameter that
>> you can use with one instance.
> 
> It'd make sense to have logic for this embedded at a higher level,
> somehow; the problem is all too common.
> 
> Of course, it is most relevant in scenarios where "split brain" is a
> significantly higher probability than "node down". Which is true for
> most test scenarios (admins love yanking cables), but in practice, it's
> mostly truly the node down.
> 
> 
> Regards,
> Lars
> 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Stonith: How to avoid deathmatch cluster partitioning

2013-05-17 Thread Andreas Kurz
On 2013-05-16 11:31, Klaus Darilion wrote:
> Hi Andreas!
> 
> On 15.05.2013 22:55, Andreas Kurz wrote:
>> On 2013-05-15 15:34, Klaus Darilion wrote:
>>> On 15.05.2013 14:51, Digimer wrote:
 On 05/15/2013 08:37 AM, Klaus Darilion wrote:
> primitive st-pace1 stonith:external/xen0 \
>   params hostlist="pace1" dom0="xentest1" \
>   op start start-delay="15s" interval="0"

 Try;

 primitive st-pace1 stonith:external/xen0 \
   params hostlist="pace1" dom0="xentest1" delay="15" \
   op start start-delay="15s" interval="0"

 The idea here is that, when both nodes lose contact and initiate a
 fence, 'st-pace1' will get a 15 second reprieve. That is, 'st-pace2'
 will wait 15 seconds before trying to fence 'st-pace1'. If st-pace1 is
 still alive, it will fence 'st-pace2' without delay, so pace2 will be
 dead before its timer expires, preventing a dual-fence. However, if
 pace1 really is dead, pace2 will fence it and recover, just with a 15
 second delay.
>>>
>>> Sounds good, but pacemaker does not accept the parameter:
>>>
>>> ERROR: st-pace1: parameter delay does not exist
>>
>> start-delay is an option of the monitor operation ... in fact means
>> "don't trust that start was successfull, wait for the initial monitor
>> some more time"
>>
>> The problem is, this would only make sense for one single stonith
>> resource that can fence more nodes. In case of a split-brain that would
>> delay the start on that node where the stonith resource was not running
>> before and gives that node a "penalty".
> 
> Thanks for the clarification. I already thought that the start-delay
> workaround is not useful in my setup.
> 
>> In your example with two stonith resources running all the time,
>> Digimer's suggestion is a good idea: use one of the redhat fencing
>> agents, most of them have some sort of "stonith-delay" parameter that
>> you can use with one instance.
> 
> I found it somehow confusing that a generic parameter (delay is useful
> for all stonith agents) is implemented in the agent, not in pacemaker.
> Further, downloading the RH source RPMS and extracting the agents is
> also quite cumbersome.

If you are on an Ubuntu >=12.04 or Debian Wheezy the fence-agents
package is available ... so no need for extra work ;-)
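
For example, something along these lines should then work; the fence_ipmilan parameter names and IPMI details below are placeholders and should be checked with "crm ra info stonith:fence_ipmilan":

  apt-get install fence-agents
  primitive st-pace1 stonith:fence_ipmilan \
    params pcmk_host_list="pace1" ipaddr="10.0.0.1" login="admin" passwd="secret" delay="15" \
    op monitor interval="60s"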

> 
> I think I will add the delay parameter to the relevant fencing agent
> myself. I guess I also have to increase the stonith-timeout by the
> configured delay.
> 
> Do you know how to submit patches for the stonith agents?

Sending them e.g. to the linux-ha-dev mailinglist is an option.

Best regards,
Andreas

> 
> Thanks
> Klaus
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


-- 
Need help with Pacemaker?
http://www.hastexo.com/now




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] IPaddr2 cloned address doesn't survive node standby

2013-05-17 Thread Jake Smith



- Original Message -
> From: "Andreas Ntaflos" 
> To: "The Pacemaker cluster resource manager" 
> Sent: Friday, May 17, 2013 3:25:32 PM
> Subject: [Pacemaker] IPaddr2 cloned address doesn't survive node standby
> 
> In a two-node cluster I am trying to use a cloned IP address with a
> cloned Bind 9 instance, in an active-active way. Why? Because simple
> IP
> failover does not work well with Bind, as it only answers queries on
> the
> addresses that are bound to the NIC when starting up (I know about
> Bind's "interface-interval" setting, but the minimum of one minute is
> far too long). Using Ubuntu 12.04.2, Corosync 1.4.2 and Pacemaker
> 1.1.6.
> 
> So my configuration sees to it that the cloned address is set on both
> nodes and Bind is started afterwards (op params omitted for
> readability):
> 
> node dns01
> node dns02
> primitive p_bind9 lsb:bind9
> primitive p_ip_service_ns ocf:heartbeat:IPaddr2 \
>params ip="192.168.114.17" cidr_netmask="24" nic="eth0" \
>  clusterip_hash="sourceip-sourceport"

netmask should be 32 if that's supposed to be a single IP load balanced.
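
I.e. something like this (only the netmask changed from your original primitive):

  primitive p_ip_service_ns ocf:heartbeat:IPaddr2 \
    params ip="192.168.114.17" cidr_netmask="32" nic="eth0" \
      clusterip_hash="sourceip-sourceport"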

> clone cl_bind9 p_bind9 \
>meta interleave="false"
> clone cl_ip_service_ns p_ip_service_ns \
>meta globally-unique="true" clone-max="2" \
>  clone-node-max="2" interleave="true"
> order o_ip_before_bind9 inf: cl_ip_service_ns cl_bind9
> 
> (suggestions to improve or correct this configuration gladly
> accepted)
> 
> After Corosync starts up the first time everything seems correct, I
> can
> see the cluster/cloned/service IP address and the CLUSTERIP iptables
> rules on both nodes.
> 
> But after putting dns01 in standby and then bringing it online again
> the
> cloned address is no longer present on dns01, only on dns02. iptables
> rules are also gone from dns01.
> 
> Then, putting dns02 into standby the IP address is moved to dns01,
> and
> after going online again no longer present on dns01 (neither are
> iptables rules).
> 
> So the IP address is moved between the nodes, each move accompanied
> by a
> restart of the Bind service (cl_bind9/p_bind9).
> 
> All of this doesn't seem right to me. Shouldn't the cloned IP address
> always be present on *both* nodes when they are online?
> 
> Andreas


Without thinking too hard about it, these might help:

Don't you need colocation also between the clones so that bind can only start 
on a node that has already started an ip instance?
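
Roughly like this, if you go that route (untested sketch, names from your config):

  colocation col_bind9_with_ip inf: cl_bind9 cl_ip_service_ns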

For the number of restarts, it's likely because of the interleaving settings.
Setting interleave to true for both clones would likely help, but that wouldn't work in your case - more
here:
http://www.hastexo.com/resources/hints-and-kinks/interleaving-pacemaker-clones

When you put dns01 in standby, does dns02 have both instances of the IP there?
If not, it should (you are just load balancing a single IP, correct?). You
need clone-node-max=2 for the IP clone.
If so, does one instance just not move back to dns01 when you bring it out of standby? I
would look at resource-stickiness=0 for the IP clone resource only, so the
cluster will redistribute when the node comes out of standby (I think that
would work). Clones have a default stickiness of 1 if you don't have a default
set for the cluster.

And/or you can write location constraints for the clone instances of ip to 
prefer one node over the other causing them to fail back if the node returns 
i.e. location ip0_prefers_dns01 cl_ip_service_ns:0 200: dns01 and location 
ip1_prefers_dns02 cl_ip_service_ns:1 200: dns02
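
Putting the stickiness and location ideas together, the relevant pieces might look like this (untested, names from your config):

  clone cl_ip_service_ns p_ip_service_ns \
    meta globally-unique="true" clone-max="2" clone-node-max="2" \
      interleave="true" resource-stickiness="0"
  location ip0_prefers_dns01 cl_ip_service_ns:0 200: dns01
  location ip1_prefers_dns02 cl_ip_service_ns:1 200: dns02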

HTH

Jake

> 
> PS: In the end this configuration works since the Bind 9 service is
> always available to answer queries on the cluster address (as long as
> there is one node online) but it seems that the Bind 9 clones are
> restarted too often and too liberally when things change. This,
> however,
> may be a separate issue, possibly related to the order directive and
> the
> interleave meta params.
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 
> 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] IPaddr2 cloned address doesn't survive node standby

2013-05-17 Thread Andreas Ntaflos
In a two-node cluster I am trying to use a cloned IP address with a 
cloned Bind 9 instance, in an active-active way. Why? Because simple IP 
failover does not work well with Bind, as it only answers queries on the 
addresses that are bound to the NIC when starting up (I know about 
Bind's "interface-interval" setting, but the minimum of one minute is 
far too long). Using Ubuntu 12.04.2, Corosync 1.4.2 and Pacemaker 1.1.6.


So my configuration sees to it that the cloned address is set on both 
nodes and Bind is started afterwards (op params omitted for readability):


node dns01
node dns02
primitive p_bind9 lsb:bind9
primitive p_ip_service_ns ocf:heartbeat:IPaddr2 \
  params ip="192.168.114.17" cidr_netmask="24" nic="eth0" \
clusterip_hash="sourceip-sourceport"
clone cl_bind9 p_bind9 \
  meta interleave="false"
clone cl_ip_service_ns p_ip_service_ns \
  meta globally-unique="true" clone-max="2" \
clone-node-max="2" interleave="true"
order o_ip_before_bind9 inf: cl_ip_service_ns cl_bind9

(suggestions to improve or correct this configuration gladly accepted)

After Corosync starts up the first time everything seems correct, I can 
see the cluster/cloned/service IP address and the CLUSTERIP iptables 
rules on both nodes.


But after putting dns01 in standby and then bringing it online again the 
cloned address is no longer present on dns01, only on dns02. iptables 
rules are also gone from dns01.


Then, putting dns02 into standby the IP address is moved to dns01, and 
after going online again no longer present on dns01 (neither are 
iptables rules).


So the IP address is moved between the nodes, each move accompanied by a 
restart of the Bind service (cl_bind9/p_bind9).


All of this doesn't seem right to me. Shouldn't the cloned IP address 
always be present on *both* nodes when they are online?


Andreas

PS: In the end this configuration works since the Bind 9 service is 
always available to answer queries on the cluster address (as long as 
there is one node online) but it seems that the Bind 9 clones are 
restarted too often and too liberally when things change. This, however, 
may be a separate issue, possibly related to the order directive and the 
interleave meta params.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-17 Thread Lars Marowsky-Bree
On 2013-05-17T14:15:00, Nikita Michalko  wrote:

> I'm just wondering: why is lrm gone?

Rewritten by the pacemaker project upstream, which prefers to no longer
build with cluster-glue at all.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-17 Thread Nikita Michalko
I'm just wondering: why is lrm gone?

TIA!

Nikita Michalko


Am Freitag, 17. Mai 2013 05:15:10 schrieb Andrew Widdersheim:
> I'm attaching 3 patches I made fairly quickly to fix the installation
>  issues and also an issue I noticed with the ping ocf from the latest
>  pacemaker. 
> 
> One is for cluster-glue to prevent lrmd from building and later installing.
>  May also want to modify this patch to take lrmd out of both spec files
>  included when you download the source if you plan to build an rpm. I'm not
>  sure if what I did here is the best way to approach this problem so if
>  anyone has anything better please let me know.
> 
> One is for pacemaker to create the lrmd symlink when building with
>  heartbeat support. Note the spec does not need anything changed here.
> 
> Finally, saw the following errors in messages with the latest ping ocf and
>  the attached patch seems to fix the issue.
> 
> May 16 01:10:13 node2 lrmd[16133]:   notice: operation_finished:
>  p_ping_monitor_5000:17758 [ /usr/lib/ocf/resource.d/pacemaker/ping: line
>  296: [: : integer expression expected ]
> 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] having problem with crm cib shadow

2013-05-17 Thread George G. Gibat
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Thanks, this worked great.



On 2013-05-17 11:22, John McCabe wrote:
> Looks like its due to the directory where the cib files are stored changing 
> between Pacemaker 1.1.7 and 1.1.8. (alluded to in
> this ticket - http://savannah.nongnu.org/bugs/index.php?37760). The cib new 
> does create the shadow, but its looking in the
> wrong directory when it then tries to list/work with that shadow.
> 
> Creating a symlink as follows lets the crm utility work with the shadow as 
> expected for some quick tests (tried a new, edit,
> commit on an existing cluster).
> 
> ln -s /var/lib/pacemaker/cib /var/lib/heartbeat/crm
> 
> Will raise at http://savannah.nongnu.org/bugs/?group=crmsh
> 
> On Fri, May 17, 2013 at 11:01 AM, John McCabe wrote:
> 
> 
> On Fri, May 17, 2013 at 8:43 AM, Lars Marowsky-Bree wrote:
> 
> On 2013-05-16T21:09:35, John McCabe wrote:
> 
>> Worth trying crm_shadow as described here - 
>> http://www.gossamer-threads.com/lists/linuxha/pacemaker/84969
>> 
>> I had the same problem and took it as a sign that I should just move to pcs 
>> (from the RHEL repo, not the latest source),
>> which went pretty smoothly, only had a few problems with assigning 
>> parameters to resources.. but that could easily be worked
>> around using crm_resource.
> 
> So a single bug in the crm shell is a reason to move, while working around "a 
> few problems" isn't? I'm getting too old for this
> world ;-)
> 
> 
> Have been trying to stick with RH/CentOS supplied packages and since the 
> issues I'd hit were all fixed upstream I was happy 
> enough to proceed with pcs. (Meta Attr and Operation handling in the pcs 
> resource create command weren't working properly in
> the 0.9.25-10.el6_4.1 package but are fixed in 0.9.41 on github. To stick 
> with the current 0.9.25 I just created the resource 
> without meta/ops using pcs and then added the meta/ops using crm_resource).
> 
> 
> 
> Have you reported the crm shell issue via bugzilla?
> 
> 
> It hadn't even crossed my mind at the time, my bad. I'll check out the latest 
> source, compare the behaviour then raise a
> ticket.
> 
> 
> 
> Probably a permission problem on centos.
> 
> 
> Regards, Lars
> 
> -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer 
> Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) 
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
> 
> 
> ___ Pacemaker mailing list: 
> Pacemaker@oss.clusterlabs.org
>  
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
> http://bugs.clusterlabs.org
> 
> 
> 
> 
> 
> ___ Pacemaker mailing list: 
> Pacemaker@oss.clusterlabs.org 
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
> http://bugs.clusterlabs.org
> 

- -- 




 ---
George Gibat, Technical Director, CCNP, MSCE, CISSP, CNE
TTFN
PGP public key - http://www.gibat.com/ggibat-pub.asc

Gibat Enterprises, Inc
Connecting you to the world (R)
Your Portal to the Future (R)

http://www.gibat.com
http://www.spi.net
817.265.9962
9260 Walker Rd.
Ovid, MI 48866

The information contained in and transmitted with this email is or may be
confidential and/or privileged. It is intended only for the individual or
entity designated. You are hereby notified that any dissemination,
distribution, copying, use of or reliance upon the information contained in
and transmitted with this email by or to anyone other than the intended
recipient designated by the sender is unauthorized and strictly prohibited.
If you have received this email in error, please contact the sender at
(817)265-9962. Any email erroneously transmitted to you should be
immediately deleted.

-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.16 (MingW32)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlGWG4AACgkQaWdaxHduXciTswCeIqZVp8Wcx91rJqjQxxYpafhC
zLAAniv8ULJbP9Gcf0GAWMaYE9eip3ps
=J//M
-END PGP SIGNATURE-


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] having problem with crm cib shadow

2013-05-17 Thread John McCabe
Looks like it's due to the directory where the cib files are stored changing
between Pacemaker 1.1.7 and 1.1.8 (alluded to in this ticket -
http://savannah.nongnu.org/bugs/index.php?37760). The "cib new" does create
the shadow, but it's looking in the wrong directory when it then tries to
list/work with that shadow.

Creating a symlink as follows lets the crm utility work with the shadow as
expected for some quick tests (tried a new, edit, commit on an existing
cluster).

ln -s /var/lib/pacemaker/cib /var/lib/heartbeat/crm

Will raise at http://savannah.nongnu.org/bugs/?group=crmsh
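
A quick test like the one described above might look like this with crmsh (the shadow name is arbitrary):

  crm cib new test      # create a shadow CIB
  crm cib use test      # point crmsh at the shadow
  crm configure edit    # edit against the shadow
  crm cib commit test   # push the shadow back to the live CIB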

On Fri, May 17, 2013 at 11:01 AM, John McCabe  wrote:

>
> On Fri, May 17, 2013 at 8:43 AM, Lars Marowsky-Bree  wrote:
>
>> On 2013-05-16T21:09:35, John McCabe  wrote:
>>
>> > Worth trying crm_shadow as described here -
>> > http://www.gossamer-threads.com/lists/linuxha/pacemaker/84969
>> >
>> > I had the same problem and took it as a sign that I should just move to
>> pcs
>> > (from the RHEL repo, not the latest source), which went pretty smoothly,
>> > only had a few problems with assigning parameters to resources.. but
>> that
>> > could easily be worked around using crm_resource.
>>
>> So a single bug in the crm shell is a reason to move, while working
>> around "a few problems" isn't? I'm getting too old for this world ;-)
>>
>
> Have been trying to stick with RH/CentOS supplied packages and since the
> issues I'd hit were all fixed upstream I was happy enough to proceed with
> pcs. (Meta Attr and Operation handling in the pcs resource create command
> weren't working properly in the 0.9.25-10.el6_4.1 package but are fixed in
> 0.9.41 on github. To stick with the current 0.9.25 I just created the
> resource without meta/ops using pcs and then added the meta/ops using
> crm_resource).
>
>
>>
>> Have you reported the crm shell issue via bugzilla?
>>
>
> It hadn't even crossed my mind at the time, my bad. I'll check out the
> latest source, compare the behaviour then raise a ticket.
>
>
>>
>> Probably a permission problem on centos.
>>
>>
>> Regards,
>> Lars
>>
>> --
>> Architect Storage/HA
>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
>> Imendörffer, HRB 21284 (AG Nürnberg)
>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>
>
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] having problem with crm cib shadow

2013-05-17 Thread John McCabe
On Fri, May 17, 2013 at 8:43 AM, Lars Marowsky-Bree  wrote:

> On 2013-05-16T21:09:35, John McCabe  wrote:
>
> > Worth trying crm_shadow as described here -
> > http://www.gossamer-threads.com/lists/linuxha/pacemaker/84969
> >
> > I had the same problem and took it as a sign that I should just move to
> pcs
> > (from the RHEL repo, not the latest source), which went pretty smoothly,
> > only had a few problems with assigning parameters to resources.. but that
> > could easily be worked around using crm_resource.
>
> So a single bug in the crm shell is a reason to move, while working
> around "a few problems" isn't? I'm getting too old for this world ;-)
>

Have been trying to stick with RH/CentOS supplied packages and since the
issues I'd hit were all fixed upstream I was happy enough to proceed with
pcs. (Meta Attr and Operation handling in the pcs resource create command
weren't working properly in the 0.9.25-10.el6_4.1 package but are fixed in
0.9.41 on github. To stick with the current 0.9.25 I just created the
resource without meta/ops using pcs and then added the meta/ops using
crm_resource).


>
> Have you reported the crm shell issue via bugzilla?
>

It hadn't even crossed my mind at the time, my bad. I'll check out the
latest source, compare the behaviour then raise a ticket.


>
> Probably a permission problem on centos.
>
>
> Regards,
> Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
> Imendörffer, HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crm subshell 1.2.4 incompatible to pacemaker 1.1.9?

2013-05-17 Thread Dejan Muhamedagic
Hi Rainer,

On Thu, May 16, 2013 at 06:53:24PM +0200, Rainer Brestan wrote:
> 
> The bug is in the function is_normal_node.
> 
> This function checks the attribute "type" for state 
> "normal".
> 
> But this attribute is not used any more.

I'm very well aware of that and crmsh can deal since v1.2.5 with
both pacemaker versions.

Thanks,

Dejan

>  
> 
> CIB output from Pacemaker 1.1.8
> 
> <nodes>
>   <node uname="int2node1">
>     <instance_attributes id="nodes-int2node1">
>       <nvpair id="nodes-int2node1-standby" name="standby" value="off"/>
>     </instance_attributes>
>   </node>
>   <node uname="int2node2">
>     <instance_attributes id="nodes-int2node2">
>       <nvpair id="nodes-int2node2-standby" name="standby" value="on"/>
>     </instance_attributes>
>   </node>
> </nodes>
> 
> CIB output from Pacemaker 1.1.7
> 
> <nodes>
>   <node type="normal" uname="int1node1">
>   </node>
>   <node type="normal" uname="int1node2">
>   </node>
> </nodes>
> 
>  
> 
> Therefore, function listnodes will not return any node and function 
> standby will use the current node as node and the first argument as 
> lifetime.
> 
> In case of specified both (node and lifetime) it works because of other 
> else path.
> 
> Rainer
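
For what it's worth, until upgrading to crmsh >= 1.2.5 the problem can be sidestepped by always passing both the node and the lifetime explicitly, which hits the working code path described above; for example (node name taken from the CIB snippet quoted above):

  crm node standby int2node1 reboot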
> 
> 
> 
>  
> 
> Sent: Wednesday, 15 May 2013 at 21:31
> From: "Lars Ellenberg"
> To: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] crm subshell 1.2.4 incompatible to
> pacemaker 1.1.9?
> 
> On Wed, May 15, 2013 at 03:34:14PM +0200, Dejan 
> Muhamedagic wrote:
> > On Tue, May 14, 2013 at 10:03:59PM +0200, Lars Ellenberg wrote:
> > > On Tue, May 14, 2013 at 09:59:50PM +0200, Lars Ellenberg wrote:
> > > > On Mon, May 13, 2013 at 01:53:11PM +0200, Michael 
> Schwartzkopff wrote:
> > > > > Hi,
> > > > >
> > > > > crm tells me it is version 1.2.4
> > > > > pacemaker tell me it is verison 1.1.9
> > > > >
> > > > > So it should work since incompatibilities are resolved in 
> crm higher that
> > > > > version 1.2.1. Anywas crm tells me nonsense:
> > > > >
> > > > > # crm
> > > > > crm(live)# node
> > > > > crm(live)node# standby node1
> > > > > ERROR: bad lifetime: node1
> > > >
> > > > Your node is not named node1.
> > > > check: crm node list
> > > >
> > > > Maybe a typo, maybe some case-is-significant nonsense,
> > > > maybe you just forgot to use the fqdn.
> > > > maybe the check for "is this a known node name" is 
> (now) broken?
> > > >
> > > >
> > > > standby with just one argument checks if that argument
> > > > happens to be a known node name,
> > > > and assumes that if it is not,
> > > > it "has to be" a lifetime,
> > > > and the current node is used as node name...
> > > >
> > > > Maybe we should invert that logic, and instead compare the 
> single
> > > > argument against allowed lifetime values (reboot, forever), 
> and assume
> > > > it is supposed to be a node name otherwise?
> > > >
> > > > Then the error would become
> > > > ERROR: unknown node name: node1
> > > >
> > > > Which is probably more useful most of the time.
> > > >
> > > > Dejan?
> > >
> > > Something like this maybe:
> > >
> > > diff --git a/modules/ui.py.in b/modules/ui.py.in
> > > --- a/modules/ui.py.in
> > > +++ b/modules/ui.py.in
> > > @@ -1185,7 +1185,7 @@ class NodeMgmt(UserInterface):
> > > if not args:
> > > node = vars.this_node
> > > if len(args) == 1:
> > > - if not args[0] in listnodes():
> > > + if args[0] in ("reboot", "forever"):
> >
> > Yes, I wanted to look at it again. Another complication is that
> > the lifetime can be just about anything in that date ISO format.
> 
> That may well be, but right now those would be rejected by crmsh
> anyways:
> 
> if lifetime not in (None,"reboot","forever"):
> common_err("bad lifetime: %s" % lifetime)
> return False
> 
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 
> 
> 
> 

> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___

Re: [Pacemaker] question about interface failover

2013-05-17 Thread Florian Crouzat

On 16/05/2013 21:45, christopher barry wrote:

Greetings,

I've setup a new 2-node mysql cluster using
* drbd 8.3.1.3
* corosync 1.4.2
* pacemaker 117
on Debian Wheezy nodes.

failover seems to be working fine for everything except the ips manually
configured on the interfaces.


This sentence makes no sense to me.
The cluster will not fail over something that is not managed by the cluster (a
'manually' configured IP...).


What are you trying to achieve exactly?
Also, could you pastebin the output of "crm_mon -Arf1"? I find it
easier to read.





see config here:
http://pastebin.aquilenet.fr/?9eb51f6fb7d65fda#/YvSiYFocOzogAmPU9g
+g09RcJvhHbgrY1JuN7D+gA4=

If I bring down an interface, when the cluster restarts it, it only
starts it with the vip - the original ip and route have been removed.


Makes sense if you added the 'original' IP manually...
You should have the non-VIP address in your static network configuration
(/etc/network/interfaces on Debian; /etc/sysconfig/network/ifcfg-* on RHEL-style systems).
But then again, please be more precise about what you are trying to achieve.
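
As a rough sketch (all addresses below are made up), the usual split is that the permanent address and routes live in the OS network configuration and only the floating IP is a cluster resource:

  # /etc/network/interfaces (Debian) - permanent address, managed by the OS
  auto eth0
  iface eth0 inet static
      address 192.168.1.10
      netmask 255.255.255.0
      gateway 192.168.1.1

  # cluster side - only the VIP
  primitive p_vip ocf:heartbeat:IPaddr2 \
    params ip="192.168.1.20" cidr_netmask="24" nic="eth0" \
    op monitor interval="10s"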



not sure what to do to make sure the permanent ip and the routes get
restored. I'm not all that versed on the cluster commandline yet, and
I'm using LCMC for most of my usage.



--
Cheers,
Florian Crouzat

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Does "stonith_admin --confirm" work?

2013-05-17 Thread Староверов Никита Александрович
Hello, pacemaker users and developers.

First, many thanks to clusterlabs.org for their software, Pacemaker helps us 
very much!

I am testing a cluster configuration based on Pacemaker+CMAN. I configured
fencing as described in the Pacemaker documentation about CMAN-based clusters and
it works.
Maybe I misunderstood something, but I can't acknowledge node fencing
manually.
I use fence_ipmilan as the device, and when I pull the power cable from a server,
stonith fails. I expected this, of course, but I don't know how to
acknowledge the fencing manually.
When I try stonith_admin -C node_name, it does nothing.
I see this in logs:

May 17 11:46:52 NODE1 stonith-ng[5434]:   notice: stonith_manual_ack: Injecting manual confirmation that NODE2 is safely off/down
May 17 11:46:52 NODE1 stonith-ng[5434]:    error: crm_abort: do_local_reply: Triggered assert at main.c:241 : client_obj->request_id
May 17 11:46:52 NODE1 stonith-ng[5434]:   notice: log_operation: Operation 'off' [0] (call 2 from stonith_admin.10959) for host 'NODE2' with device 'manual_ack' returned: 0 (OK)
May 17 11:46:52 NODE1 stonith-ng[5434]:    error: crm_abort: crm_ipcs_sendv: Triggered assert at ipc.c:575 : header->qb.id != 0
May 17 11:47:35 NODE1 stonith_admin[11162]:   notice: crm_log_args: Invoked: stonith_admin -C NODE2
May 17 11:47:35 NODE1 stonith-ng[5434]:   notice: merge_duplicates: Merging stonith action off for node NODE2 originating from client stonith_admin.11162.b42172b1 with identical request from stonith_admin.10959@NODE1.f2048550 (0s)
May 17 11:47:35 NODE1 stonith-ng[5434]:   notice: stonith_manual_ack: Injecting manual confirmation that NODE2 is safely off/down
May 17 11:47:35 NODE1 stonith-ng[5434]:   notice: log_operation: Operation 'off' [0] (call 2 from stonith_admin.11162) for host 'NODE2' with device 'manual_ack' returned: 0 (OK)
May 17 11:47:35 NODE1 stonith-ng[5434]:    error: crm_abort: do_local_reply: Triggered assert at main.c:241 : client_obj->request_id
May 17 11:47:35 NODE1 stonith-ng[5434]:    error: crm_abort: crm_ipcs_sendv: Triggered assert at ipc.c:575 : header->qb.id != 0

Nothing happened after stonith_admin -C.
fenced keeps trying fence_pcmk, and I see lots of "Timer expired" messages from
stonith-ng, and failed fence_ipmilan operations.

Yes, I can do fence_ack_manual on the cman master node and then clean up the node
state with cibadmin, but that is a very slow way.
If I lose many servers in the cluster, for example losing power to one rack with two
or more servers, I need a way to get services running again on the remaining nodes as
quickly as possible.

I think manual fencing acknowledgement must be fast and simple, and I suppose
that stonith_admin --confirm is meant to do that.



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.

2013-05-17 Thread Vladislav Bogdanov
Hi Hideo-san,

17.05.2013 10:29, renayama19661...@ybb.ne.jp wrote:
> Hi Vladislav,
> 
> Thank you for advice.
> 
> I try the patch which you showed.
> 
> We use Pacemaker1.0, but apply a patch there because there is a similar code.
> 
> If there is a question by setting, I ask you a question by an email.
>  * At first I only use tmpfs, and I intend to test it.

For just this, the patch is unneeded. It only comes into play when you have the
pengine files symlinked from stable storage to tmpfs. Without the patch,
pengine would try to rewrite the file the symlink points to - directly on
stable storage. With the patch, pengine removes the symlink (and just the
symlink) and opens a new file on tmpfs for writing. Thus it will not
block if the stable storage is inaccessible (in my case because of
connectivity problems, in yours because of a backing storage outage).

If you decide to go with tmpfs *and* use the same synchronization method
as I do, then you'd need to bake a similar patch for 1.0: just add
unlink() before pengine writes its data (I suspect that code differs
between 1.0 and 1.1.10; even in 1.1.6 it was different from current master).
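
If it helps, putting the state directory on tmpfs can be as simple as an fstab entry like the one below (size and mode are arbitrary placeholders, and the path is the 1.1.x layout; Pacemaker 1.0 uses different directories). The directory has to be repopulated from stable storage before pacemaker starts:

  # /etc/fstab
  tmpfs  /var/lib/pacemaker  tmpfs  defaults,size=64m,mode=0750  0  0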

> 
>> P.S. Andrew, is this patch ok to apply?
> 
> To Andrew...
>   Does the patch in conjunction with the write_xml processing in your 
> repository have to apply it before the confirmation of the patch of Vladislav?
> 
> Many Thanks!
> Hideo Yamauchi.
> 
> 
> 
> 
> --- On Fri, 2013/5/17, Vladislav Bogdanov  wrote:
> 
>> Hi Hideo-san,
>>
>> You may try the following patch (with trick below)
>>
>> From 2c4418d11c491658e33c149f63e6a2f2316ef310 Mon Sep 17 00:00:00 2001
>> From: Vladislav Bogdanov 
>> Date: Fri, 17 May 2013 05:58:34 +
>> Subject: [PATCH] Feature: PE: Unlink pengine output files before writing.
>>  This should help guys who store them to tmpfs and then copy to a stable 
>> storage
>>  on (inotify) events with symlink creation in the original place to survive 
>> when
>>  stable storage is not accessible.
>>
>> ---
>>  pengine/pengine.c |1 +
>>  1 files changed, 1 insertions(+), 0 deletions(-)
>>
>> diff --git a/pengine/pengine.c b/pengine/pengine.c
>> index c7e1c68..99a81c6 100644
>> --- a/pengine/pengine.c
>> +++ b/pengine/pengine.c
>> @@ -184,6 +184,7 @@ process_pe_message(xmlNode * msg, xmlNode * xml_data, 
>> crm_client_t * sender)
>>  }
>>  
>>  if (is_repoke == FALSE && series_wrap != 0) {
>> +unlink(filename);
>>  write_xml_file(xml_data, filename, HAVE_BZLIB_H);
>>  write_last_sequence(PE_STATE_DIR, series[series_id].name, seq + 
>> 1, series_wrap);
>>  } else {
>> -- 
>> 1.7.1
>>
>> You just need to ensure that /var/lib/pacemaker is on tmpfs. Then you may 
>> watch on directories there
>> with inotify or so and take actions to move (copy) files to a stable storage 
>> (RAM is not of infinite size).
>> In my case that is CIFS. And I use lsyncd to synchronize that directories. 
>> If you are interested, I can
>> provide you with relevant lsyncd configuration. Frankly speaking, there is 
>> no big need to create symlinks
>> in tmpfs to stable storage, as pacemaker does not use existing pengine files 
>> (except sequences). That sequence
>> files and cib.xml are the only exceptions which you may want to exist in two 
>> places (and you may want to copy
>> them from stable storage to tmpfs before pacemaker start), and you can just 
>> move everything else away from
>> tmpfs once it is written. In this case you do not need this patch.
>>
>> Best,
>> Vladislav
>>
>> P.S. Andrew, is this patch ok to apply?
>>
>> 17.05.2013 03:27, renayama19661...@ybb.ne.jp wrote:
>>> Hi Andrew,
>>> Hi Vladislav,
>>>
>>> I try whether this correction is effective for this problem.
>>>   * 
>>> https://github.com/beekhof/pacemaker/commit/eb6264bf2db395779e65dadf1c626e050a388c59
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>> --- On Thu, 2013/5/16, Andrew Beekhof  wrote:
>>>

 On 16/05/2013, at 3:49 PM, Vladislav Bogdanov  wrote:

> 16.05.2013 02:46, Andrew Beekhof wrote:
>>
>> On 15/05/2013, at 6:44 PM, Vladislav Bogdanov  
>> wrote:
>>
>>> 15.05.2013 11:18, Andrew Beekhof wrote:

 On 15/05/2013, at 5:31 PM, Vladislav Bogdanov  
 wrote:

> 15.05.2013 10:25, Andrew Beekhof wrote:
>>
>> On 15/05/2013, at 3:50 PM, Vladislav Bogdanov  
>> wrote:
>>
>>> 15.05.2013 08:23, Andrew Beekhof wrote:

 On 15/05/2013, at 3:11 PM, renayama19661...@ybb.ne.jp wrote:

> Hi Andrew,
>
> Thank you for comments.
>
>>> The guest located it to the shared disk.
>>
>> What is on the shared disk?  The whole OS or app-specific data 
>> (i.e. nothing pacemaker needs directly)?
>
> Shared disk has all the OS and the all data.

 Oh. I can imagine that being problematic.

Re: [Pacemaker] having problem with crm cib shadow

2013-05-17 Thread Lars Marowsky-Bree
On 2013-05-16T21:09:35, John McCabe  wrote:

> Worth trying crm_shadow as described here -
> http://www.gossamer-threads.com/lists/linuxha/pacemaker/84969
> 
> I had the same problem and took it as a sign that I should just move to pcs
> (from the RHEL repo, not the latest source), which went pretty smoothly,
> only had a few problems with assigning parameters to resources.. but that
> could easily be worked around using crm_resource.

So a single bug in the crm shell is a reason to move, while working
around "a few problems" isn't? I'm getting too old for this world ;-)

Have you reported the crm shell issue via bugzilla?

Probably a permission problem on centos.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.

2013-05-17 Thread renayama19661014
Hi Vladislav,

Thank you for advice.

I try the patch which you showed.

We use Pacemaker1.0, but apply a patch there because there is a similar code.

If there is a question by setting, I ask you a question by an email.
 * At first I only use tmpfs, and I intend to test it.

> P.S. Andrew, is this patch ok to apply?

To Andrew...
  Does the patch in conjunction with the write_xml processing in your 
repository have to apply it before the confirmation of the patch of Vladislav?

Many Thanks!
Hideo Yamauchi.




--- On Fri, 2013/5/17, Vladislav Bogdanov  wrote:

> Hi Hideo-san,
> 
> You may try the following patch (with trick below)
> 
> From 2c4418d11c491658e33c149f63e6a2f2316ef310 Mon Sep 17 00:00:00 2001
> From: Vladislav Bogdanov 
> Date: Fri, 17 May 2013 05:58:34 +
> Subject: [PATCH] Feature: PE: Unlink pengine output files before writing.
>  This should help guys who store them to tmpfs and then copy to a stable 
> storage
>  on (inotify) events with symlink creation in the original place to survive 
> when
>  stable storage is not accessible.
> 
> ---
>  pengine/pengine.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/pengine/pengine.c b/pengine/pengine.c
> index c7e1c68..99a81c6 100644
> --- a/pengine/pengine.c
> +++ b/pengine/pengine.c
> @@ -184,6 +184,7 @@ process_pe_message(xmlNode * msg, xmlNode * xml_data, 
> crm_client_t * sender)
>          }
>  
>          if (is_repoke == FALSE && series_wrap != 0) {
> +            unlink(filename);
>              write_xml_file(xml_data, filename, HAVE_BZLIB_H);
>              write_last_sequence(PE_STATE_DIR, series[series_id].name, seq + 
> 1, series_wrap);
>          } else {
> -- 
> 1.7.1
> 
> You just need to ensure that /var/lib/pacemaker is on tmpfs. Then you may 
> watch on directories there
> with inotify or so and take actions to move (copy) files to a stable storage 
> (RAM is not of infinite size).
> In my case that is CIFS. And I use lsyncd to synchronize that directories. If 
> you are interested, I can
> provide you with relevant lsyncd configuration. Frankly speaking, there is no 
> big need to create symlinks
> in tmpfs to stable storage, as pacemaker does not use existing pengine files 
> (except sequences). That sequence
> files and cib.xml are the only exceptions which you may want to exist in two 
> places (and you may want to copy
> them from stable storage to tmpfs before pacemaker start), and you can just 
> move everything else away from
> tmpfs once it is written. In this case you do not need this patch.
> 
> Best,
> Vladislav
> 
> P.S. Andrew, is this patch ok to apply?
> 
> 17.05.2013 03:27, renayama19661...@ybb.ne.jp wrote:
> > Hi Andrew,
> > Hi Vladislav,
> > 
> > I try whether this correction is effective for this problem.
> >  * 
> >https://github.com/beekhof/pacemaker/commit/eb6264bf2db395779e65dadf1c626e050a388c59
> > 
> > Best Regards,
> > Hideo Yamauchi.
> > 
> > --- On Thu, 2013/5/16, Andrew Beekhof  wrote:
> > 
> >>
> >> On 16/05/2013, at 3:49 PM, Vladislav Bogdanov  wrote:
> >>
> >>> 16.05.2013 02:46, Andrew Beekhof wrote:
> 
>  On 15/05/2013, at 6:44 PM, Vladislav Bogdanov  
>  wrote:
> 
> > 15.05.2013 11:18, Andrew Beekhof wrote:
> >>
> >> On 15/05/2013, at 5:31 PM, Vladislav Bogdanov  
> >> wrote:
> >>
> >>> 15.05.2013 10:25, Andrew Beekhof wrote:
> 
>  On 15/05/2013, at 3:50 PM, Vladislav Bogdanov  
>  wrote:
> 
> > 15.05.2013 08:23, Andrew Beekhof wrote:
> >>
> >> On 15/05/2013, at 3:11 PM, renayama19661...@ybb.ne.jp wrote:
> >>
> >>> Hi Andrew,
> >>>
> >>> Thank you for comments.
> >>>
> > The guest located it to the shared disk.
> 
>  What is on the shared disk?  The whole OS or app-specific data 
>  (i.e. nothing pacemaker needs directly)?
> >>>
> >>> Shared disk has all the OS and the all data.
> >>
> >> Oh. I can imagine that being problematic.
> >> Pacemaker really isn't designed to function without disk access.
> >>
> >> You might be able to get away with it if you turn off saving PE 
> >> files to disk though.
> >
> > I store CIB and PE files to tmpfs, and sync them to remote storage
> > (CIFS) with lsyncd level 1 config (I may share it on request). It 
> > copies
> > critical data like cib.xml, and moves everything else, symlinking 
> > it to
> > original place. The same technique may apply here, but with local fs
> > instead of cifs.
> >
> > Btw, the following patch is needed for that, otherwise pacemaker
> > overwrites remote files instead of creating new ones on tmpfs:
> >
> > --- a/lib/common/xml.c  2011-02-11 11:42:37.0 +0100
> > +++ b/lib/common/xml.c  2011-02-24 15:07:48.541870829 +0100
> > @@