Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

2013-05-16 Thread emmanuel segura
Hello

How did you configure your cluster network? Are you using a private network
for the cluster and a public one for the services?


2013/5/15 Andrew Widdersheim awiddersh...@hotmail.com

 Sorry to bring up old issues but I am having the exact same problem as the
 original poster. A simultaneous disconnect on my two-node cluster causes
 the resources to start transitioning to the other node, but mid-flight
 the transition is aborted and the resources are started again on
 the original node once the cluster realizes connectivity is the same on
 both nodes.

 I have tried various dampen settings without any luck. It seems the
 nodes report the outages at slightly different times, which results in a
 partial transition of resources instead of waiting until the connectivity
 of all nodes in the cluster is known before taking action, which is what I
 would have thought dampen would help solve.

 Ideally, connectivity status would be shared between all cluster nodes and the
 cluster wouldn't start the transition if another cluster node is having a
 connectivity issue as well. Find my configuration below. Let me know if there
 is something I can change to fix this, or if this behavior is expected.

 primitive p_drbd ocf:linbit:drbd \
 params drbd_resource=r1 \
 op monitor interval=30s role=Slave \
 op monitor interval=10s role=Master
 primitive p_fs ocf:heartbeat:Filesystem \
 params device=/dev/drbd/by-res/r1 directory=/drbd/r1
 fstype=ext4 options=noatime \
 op start interval=0 timeout=60s \
 op stop interval=0 timeout=180s \
 op monitor interval=30s timeout=40s
 primitive p_mysql ocf:heartbeat:mysql \
 params binary=/usr/libexec/mysqld config=/drbd/r1/mysql/my.cnf
 datadir=/drbd/r1/mysql \
 op start interval=0 timeout=120s \
 op stop interval=0 timeout=120s \
 op monitor interval=30s \
 meta target-role=Started
 primitive p_ping ocf:pacemaker:ping \
 params host_list=192.168.5.1 dampen=30s multiplier=1000
 debug=true \
 op start interval=0 timeout=60s \
 op stop interval=0 timeout=60s \
 op monitor interval=5s timeout=10s
 group g_mysql_group p_fs p_mysql \
 meta target-role=Started
 ms ms_drbd p_drbd \
 meta notify=true master-max=1 clone-max=2
 target-role=Started
 clone cl_ping p_ping
 location l_connected g_mysql \
 rule $id=l_connected-rule pingd: defined pingd
 colocation c_mysql_on_drbd inf: g_mysql ms_drbd:Master
 order o_drbd_before_mysql inf: ms_drbd:promote g_mysql:start
 property $id=cib-bootstrap-options \
 dc-version=1.1.6-1.el6-8b6c6b9b6dc2627713f870850d20163fad4cc2a2 \
 cluster-infrastructure=Heartbeat \
 no-quorum-policy=ignore \
 stonith-enabled=false \
 cluster-recheck-interval=5m \
 last-lrm-refresh=1368632470
 rsc_defaults $id=rsc-options \
 migration-threshold=5 \
 resource-stickiness=200
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




-- 
this is my life and I live it for as long as God wills
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] pcs/crmsh Cheat sheet

2013-05-16 Thread Andrew Beekhof
By popular request, I've taken a stab at a cheat-sheet for those switching 
between pcs and crmsh.

https://github.com/ClusterLabs/pacemaker/blob/master/doc/pcs-crmsh-quick-ref.md 

Any and all assistance expanding it and ensuring it is accurate will be 
gratefully received.

-- Andrew
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.

2013-05-16 Thread Andrew Beekhof

On 16/05/2013, at 3:49 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:

 16.05.2013 02:46, Andrew Beekhof wrote:
 
 On 15/05/2013, at 6:44 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:
 
 15.05.2013 11:18, Andrew Beekhof wrote:
 
 On 15/05/2013, at 5:31 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:
 
 15.05.2013 10:25, Andrew Beekhof wrote:
 
 On 15/05/2013, at 3:50 PM, Vladislav Bogdanov bub...@hoster-ok.com 
 wrote:
 
 15.05.2013 08:23, Andrew Beekhof wrote:
 
 On 15/05/2013, at 3:11 PM, renayama19661...@ybb.ne.jp wrote:
 
 Hi Andrew,
 
 Thank you for comments.
 
 The guest located it to the shared disk.
 
 What is on the shared disk?  The whole OS or app-specific data (i.e. 
 nothing pacemaker needs directly)?
 
 Shared disk has all the OS and the all data.
 
 Oh. I can imagine that being problematic.
 Pacemaker really isn't designed to function without disk access.
 
 You might be able to get away with it if you turn off saving PE files 
 to disk though.
 
  I store CIB and PE files on tmpfs and sync them to remote storage
  (CIFS) with an lsyncd level-1 config (which I can share on request). It copies
  critical data like cib.xml and moves everything else, symlinking it back to
  the original place. The same technique may apply here, but with a local fs
  instead of CIFS.
 
 Btw, the following patch is needed for that, otherwise pacemaker
 overwrites remote files instead of creating new ones on tmpfs:
 
 --- a/lib/common/xml.c  2011-02-11 11:42:37.0 +0100
 +++ b/lib/common/xml.c  2011-02-24 15:07:48.541870829 +0100
 @@ -529,6 +529,8 @@ write_file(const char *string, const char *filename)
  return -1;
  }
 
 +unlink(filename);
 
 Seems like it should be safe to include for normal operation.
 
 Exactly.
 
 Small flaw in that logic... write_file() is not used anywhere.
 
 Heh, thanks for spotting this.
 
  I recall write_file() was used for pengine, but some other function for
  the CIB. You probably optimized that but forgot to remove the unused function,
  which is why I was sure the patch was still valid. And I only ran tests (CIFS
  storage outage simulation) after the initial patch, not in recent years,
  which is why I didn't notice the regression - the storage uses pacemaker too ;) .
  
  This should go into write_xml_file() (and probably other places just
  before fopen(..., "w"), e.g. for the series file).
 
 I've consolidated the code, however adding the unlink() would break things 
 for anyone intentionally symlinking cib.xml from somewhere else (like a git 
 repo).
 So I'm not so sure I should make the unlink() change :(
 
 Agree.
 I originally made it specific to pengine files.
 What do you prefer: a simple wrapper in xml.c (e.g.
 unlink_and_write_xml_file()) or just adding an unlink() call to pengine before
 it calls write_xml_file()?

The last one :)
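
For anyone following along, the effect under discussion looks roughly like this
in plain shell (paths made up for illustration, not taken from an actual setup):

# a write through a symlink follows the link and overwrites the (remote) target
ln -s /cifs/pe-input-1.bz2 /var/lib/pengine/pe-input-1.bz2
echo data > /var/lib/pengine/pe-input-1.bz2
# unlinking first removes the symlink, so the write creates a fresh local file
rm -f /var/lib/pengine/pe-input-1.bz2
echo data > /var/lib/pengine/pe-input-1.bz2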
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker Digest, Vol 66, Issue 58

2013-05-16 Thread Wolfgang Routschka
Hi Andreas,

thanks for your answer,

crm_simulate -s -L (node2 is offline - r_postfix is running on node1)

native_color: r_haproxy allocation score on node1: -INFINITY
native_color: r_haproxy allocation score on node2: -INFINITY

crm_simulate -s -L (both nodes are online - r_postfix is running on node1)

native_color: r_haproxy allocation score on node1: -INFINITY
native_color: r_haproxy allocation score on node2: 0

With the two colocations below we can see that the colocation with score 100
is not taking effect:

colocation cl_r_haproxy_not_on_r_postfix -inf: r_haproxy r_postfix
colocation cl_r_haproxy_on_r_postfix 100: r_haproxy r_postfix

I don't understand this, because the score on node2 is 0.

Regards,

Wolfgang

On 2013-05-15 21:30, Wolfgang Routschka wrote:
 Hi everybody,
  
 one question today about colocation rule on a 2-node cluster on
 scientific linux 6.4 and pacemaker/cman.
  
 2-Node Cluster
  
 first node haproxy load balancer proxy service - second node with
 postfix service.
  
 colocation for running a group called g_ip-address (haproxy lsb-resource
 and ipaddress resource) on the other node of the postfix server is
  
 cl_g_ip-address_not_on_r_postfix -inf: g_ip-address r_postfix

-INF == never-ever ;-)

  
 The problem is now that when the node with haproxy is down, pacemaker cannot
 move/migrate the services to the other node - OK, second colocation with a
 lower score, but it doesn't work for me
  
 colocation cl_g_ip-address_on_r_postfix -1: g_ip-address r_postfix
  
 What's my fault in this section?

Hard to say without seeing the rest of your configuration, but you can
run crm_simulate -s -L to see all the scores taken into account.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

  
 How can I migrate my group to the other if the master node for it is dead?
  
 Greetings Wolfgang
  
  
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 







___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Stonith: How to avoid deathmatch cluster partitioning

2013-05-16 Thread Lars Marowsky-Bree
On 2013-05-15T22:55:43, Andreas Kurz andr...@hastexo.com wrote:

 start-delay is an option of the monitor operation ... in fact it means
 "don't trust that the start was successful, wait some more time for the
 initial monitor"

It can be used on start here though to avoid exactly this situation; and
it works fine for that, effectively being equivalent to the delay
option on stonith (since the start always precedes the fence).

 The problem is, this would only make sense for one single stonith
 resource that can fence more nodes. In case of a split-brain that would
 delay the start on that node where the stonith resource was not running
 before and gives that node a penalty.

Sure. In a split-brain scenario, one side will receive a penalty, that's
the whole point of this exercise. In particular for the external/sbd
agent.

Or by grouping all fencing resources to always run on one node; if you
don't have access to RHT fence agents, for example.

external/sbd also has code to avoid a death-match cycle in case of
persistent split-brain scenarios now; after a reboot, the node that was
fenced will not join unless the fence is cleared first.

(The RHT world calls that unfence, I believe.)

That should be a win for the fence_sbd that I hope to get around to
sometime in the next few months, too ;-)

 In your example with two stonith resources running all the time,
 Digimer's suggestion is a good idea: use one of the redhat fencing
 agents, most of them have some sort of stonith-delay parameter that
 you can use with one instance.

It'd make sense to have logic for this embedded at a higher level,
somehow; the problem is all too common.

Of course, it is most relevant in scenarios where a split brain is
significantly more probable than a node going down. That is true for
most test scenarios (admins love yanking cables), but in practice it
really is mostly the node going down.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Stonith: How to avoid deathmatch cluster partitioning

2013-05-16 Thread Klaus Darilion

Hi Andreas!

On 15.05.2013 22:55, Andreas Kurz wrote:

On 2013-05-15 15:34, Klaus Darilion wrote:

On 15.05.2013 14:51, Digimer wrote:

On 05/15/2013 08:37 AM, Klaus Darilion wrote:

primitive st-pace1 stonith:external/xen0 \
  params hostlist=pace1 dom0=xentest1 \
  op start start-delay=15s interval=0


Try;

primitive st-pace1 stonith:external/xen0 \
  params hostlist=pace1 dom0=xentest1 delay=15 \
  op start start-delay=15s interval=0

The idea here is that, when both nodes lose contact and initiate a
fence, 'st-pace1' will get a 15 second reprieve. That is, 'st-pace2'
will wait 15 seconds before trying to fence 'st-pace1'. If st-pace1 is
still alive, it will fence 'st-pace2' without delay, so pace2 will be
dead before it's timer expires, preventing a dual-fence. However, if
pace1 really is dead, pace2 will fence it and recovery, just with a 15
second delay.


Sounds good, but pacemaker does not accept the parameter:

ERROR: st-pace1: parameter delay does not exist


start-delay is an option of the monitor operation ... in fact it means
"don't trust that the start was successful, wait some more time for the
initial monitor"

The problem is, this would only make sense for one single stonith
resource that can fence more nodes. In case of a split-brain that would
delay the start on that node where the stonith resource was not running
before and gives that node a penalty.


Thanks for the clarification. I already thought that the start-delay 
workaround is not useful in my setup.



In your example with two stonith resources running all the time,
Digimer's suggestion is a good idea: use one of the redhat fencing
agents, most of them have some sort of stonith-delay parameter that
you can use with one instance.


I found it somewhat confusing that a generic parameter (a delay is useful 
for all stonith agents) is implemented in the agent rather than in pacemaker. 
Further, downloading the RH source RPMs and extracting the agents is 
also quite cumbersome.


I think I will add the delay parameter to the relevant fencing agent 
myself. I guess I also have to increase the stonith-timeout to account for 
the configured delay.
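
Roughly what I have in mind, as a sketch only - the delay parameter here is the
one I would add to the agent myself, and all values are illustrative:

property stonith-timeout=75s
primitive st-pace1 stonith:external/xen0 \
        params hostlist=pace1 dom0=xentest1 delay=15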


Do you know how to submit patches for the stonith agents?

Thanks
Klaus

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] error with cib synchronisation on disk

2013-05-16 Thread Халезов Иван

On 16.05.2013 07:14, Andrew Beekhof wrote:

On 15/05/2013, at 9:53 PM, Халезов Иван i.khale...@rts.ru wrote:


Hello everyone!

Some problems occurred while synchronising the CIB configuration to disk.
I have these errors in pacemaker's logfile:

What were the messages before this?
Did it happen once or many times?
At startup or while the cluster was running?


I had updated the cluster configuration before, so there is some output 
about it in the logfile (not shown from the beginning here, because it is 
rather big):


May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - primitive 
id=Security_A 
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - 
meta_attributes id=Security_A-meta_attributes 
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - nvpair 
id=Security_A-meta_attributes-target-role name=target-role 
value=Stopped __crm_diff_marker__=removed:top /

May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /meta_attributes
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /primitive
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - primitive 
id=Security_B 
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - 
meta_attributes id=SPBEX_Security_B-meta_attributes 
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - nvpair 
id=Security_B-meta_attributes-target-role name=target-role 
value=Started __crm_diff_marker__=removed:top /

May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /meta_attributes
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /primitive
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /group
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /resources
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /configuration
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /cib
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + cib 
epoch=496 num_updates=1 admin_epoch=0 
validate-with=pacemaker-1.2 cib-last-written=Mon May 13 18:50:25 
2013 crm_feature_set=3.0.6 update-origin=iblade6.net.rts 
update-client=cibadmin have-quorum=1 dc-uuid=2130706433 

May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + configuration 
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + resources 
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + group 
id=FAST_SENDERS 
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + 
meta_attributes id=FAST_SENDERS-meta_attributes 
__crm_diff_marker__=added:top 
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + nvpair 
id=FAST_SENDERS-meta_attributes-target-role name=target-role 
value=Started /

May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + /meta_attributes
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + /group
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + /resources
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + /configuration
May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + /cib
May 14 13:29:13 iblade6 cib[2848]: info: cib_process_request: 
Operation complete: op cib_replace for section resources 
(origin=local/cibadmin/2, version=0.496.1): ok (rc=0)
May 14 13:29:13 iblade6 pengine[2852]:   notice: LogActions: Start 
Trades_INCR_A#011(iblade6.net.rts)
May 14 13:29:13 iblade6 pengine[2852]:   notice: LogActions: Start 
Trades_INCR_B#011(iblade6.net.rts)
May 14 13:29:13 iblade6 pengine[2852]:   notice: LogActions: Start 
Security_A#011(iblade6.net.rts)
May 14 13:29:13 iblade6 pengine[2852]:   notice: LogActions: Start 
Security_B#011(iblade6.net.rts)
May 14 13:29:13 iblade6 crmd[2853]:   notice: do_state_transition: State 
transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
May 14 13:29:13 iblade6 crmd[2853]: info: do_te_invoke: Processing 
graph 41 (ref=pe_calc-dc-1368523753-125) derived from 
/var/lib/pengine/pe-input-452.bz2
May 14 13:29:13 iblade6 crmd[2853]: info: te_rsc_command: Initiating 
action 80: start Trades_INCR_A_start_0 on iblade6.net.rts (local)
May 14 13:29:13 iblade6 cluster:error: validate_cib_digest: Digest 
comparision failed: expected 2c91194022c98636f90df9dd5e7176c6 
(/var/lib/heartbeat/crm/cib.Zm249H), calculated 
bc160870924630b3907c8cb1c3128eee
May 14 13:29:13 iblade6 cluster:error: retrieveCib: Checksum of 
/var/lib/heartbeat/crm/cib.a024wF failed!  Configuration contents ignored!
May 14 13:29:13 iblade6 cluster:error: retrieveCib: Usually this is 
caused by manual changes, please refer to 
http://clusterlabs.org/wiki/FAQ#cib_changes_detected
May 14 13:29:13 iblade6 cluster:error: crm_abort: 
write_cib_contents: Triggered fatal assert at io.c:662 : 
retrieveCib(tmp1, tmp2, FALSE) != NULL
May 14 13:29:13 iblade6 pengine[2852]:   notice: process_pe_message: 
Transition 41: PEngine Input stored in: /var/lib/pengine/pe-input-452.bz2
May 14 13:29:13 iblade6 cib[2848]:error: cib_diskwrite_complete: 
Disk write failed: status=134, signo=6, exitcode=0
May 14 13:29:13 iblade6 cib[2848]:error: 

Re: [Pacemaker] pacemaker colocation after one node is down

2013-05-16 Thread Wolfgang Routschka
Hi Andreas,

thank you for your answer.

The solution is one colocation with a negative score:

colocation cl_g_ip-address_not_on_r_postfix -1: g_ip-address r_postfix

Greetings Wolfgang


On 2013-05-15 21:30, Wolfgang Routschka wrote:
 Hi everybody,
  
 one question today about colocation rule on a 2-node cluster on
 scientific linux 6.4 and pacemaker/cman.
  
 2-Node Cluster
  
 first node haproxy load balancer proxy service - second node with
 postfix service.
  
  colocation for running a group called g_ip-address (haproxy lsb-resource
 and ipaddress resource) on the other node of the postfix server is
  
 cl_g_ip-address_not_on_r_postfix -inf: g_ip-address r_postfix

-INF == never-ever ;-)

  
  The problem is now that when the node with haproxy is down, pacemaker cannot
  move/migrate the services to the other node - OK, second colocation with a
  lower score, but it doesn't work for me
   
  colocation cl_g_ip-address_on_r_postfix -1: g_ip-address r_postfix
   
  What's my fault in this section?

Hard to say without seeing the rest of your configuration, but you can
run crm_simulate -s -L to see all the scores taken into account.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

  
 How can I migrate my group to the other if the master node for it is dead?
  
 Greetings Wolfgang
  
  
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 







___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] stonith-ng: error: remote_op_done: Operation reboot of node2 by node1 for stonith_admin: Timer expired

2013-05-16 Thread Brian J. Murrell
Using Pacemaker 1.1.8 on EL6.4 with the pacemaker plugin, I'm seeing
strange behavior with stonith_admin -B node2.  It seems to shut the
node down but not start it back up, and it ends up reporting that a timer
expired:

# stonith_admin -B node2
Command failed: Timer expired

The pacemaker log for the operation is:

May 16 13:50:41 node1 stonith_admin[23174]:   notice: crm_log_args: Invoked: 
stonith_admin -B node2 
May 16 13:50:41 node1 stonith-ng[1673]:   notice: handle_request: Client 
stonith_admin.23174.4a093de2 wants to fence (reboot) 'node2' with device '(any)'
May 16 13:50:41 node1 stonith-ng[1673]:   notice: initiate_remote_stonith_op: 
Initiating remote operation reboot for node2: 
aa230634-6a38-42b7-8ed4-0a0eb64af39a (0)
May 16 13:50:41 node1 cibadmin[23176]:   notice: crm_log_args: Invoked: 
cibadmin --query 
May 16 13:50:49 node1 corosync[1376]:   [TOTEM ] A processor failed, forming 
new configuration.
May 16 13:50:55 node1 corosync[1376]:   [pcmk  ] notice: pcmk_peer_update: 
Transitional membership event on ring 76: memb=1, new=0, lost=1
May 16 13:50:55 node1 corosync[1376]:   [pcmk  ] info: pcmk_peer_update: memb: 
node1 4252674240
May 16 13:50:55 node1 corosync[1376]:   [pcmk  ] info: pcmk_peer_update: lost: 
node2 2608507072
May 16 13:50:55 node1 corosync[1376]:   [pcmk  ] notice: pcmk_peer_update: 
Stable membership event on ring 76: memb=1, new=0, lost=0
May 16 13:50:55 node1 corosync[1376]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
node1 4252674240
May 16 13:50:55 node1 corosync[1376]:   [pcmk  ] info: 
ais_mark_unseen_peer_dead: Node node2 was not seen in the previous transition
May 16 13:50:55 node1 corosync[1376]:   [pcmk  ] info: update_member: Node 
2608507072/node2 is now: lost
May 16 13:50:55 node1 corosync[1376]:   [pcmk  ] info: 
send_member_notification: Sending membership update 76 to 2 children
May 16 13:50:55 node1 corosync[1376]:   [TOTEM ] A processor joined or left the 
membership and a new membership was formed.
May 16 13:50:55 node1 corosync[1376]:   [CPG   ] chosen downlist: sender r(0) 
ip(192.168.122.253) r(1) ip(10.0.0.253) ; members(old:2 left:1)
May 16 13:50:55 node1 corosync[1376]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
May 16 13:50:55 node1 cib[1672]:   notice: ais_dispatch_message: Membership 76: 
quorum lost
May 16 13:50:55 node1 cib[1672]:   notice: crm_update_peer_state: 
crm_update_ais_node: Node node2[2608507072] - state is now lost
May 16 13:50:55 node1 crmd[1677]:   notice: ais_dispatch_message: Membership 
76: quorum lost
May 16 13:50:55 node1 crmd[1677]:   notice: crm_update_peer_state: 
crm_update_ais_node: Node node2[2608507072] - state is now lost
May 16 13:50:55 node1 crmd[1677]:  warning: match_down_event: No match for 
shutdown action on node2
May 16 13:50:55 node1 crmd[1677]:   notice: peer_update_callback: 
Stonith/shutdown of node2 not matched
May 16 13:50:55 node1 crmd[1677]:   notice: do_state_transition: State 
transition S_IDLE - S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL 
origin=check_join_state ]
May 16 13:50:57 node1 attrd[1675]:   notice: attrd_local_callback: Sending full 
refresh (origin=crmd)
May 16 13:50:57 node1 attrd[1675]:   notice: attrd_trigger_update: Sending 
flush op to all hosts for: last-failure-resource1 (1368710825)
May 16 13:50:57 node1 attrd[1675]:   notice: attrd_trigger_update: Sending 
flush op to all hosts for: probe_complete (true)
May 16 13:50:58 node1 pengine[1676]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 
'now'
May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: 

Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

2013-05-16 Thread Andrew Widdersheim
The cluster has 3 connections total. The first connection is the outside 
interface where services can communicate and is also used for cluster 
communication using mcast. The second interface is a cross-over that is solely 
for cluster communication. The third connection is another cross-over solely 
for DRBD replication.

This issue happens when the first connection that is used for both the services 
and cluster communication is pulled on both nodes at the same time. 
  
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker colocation after one node is down

2013-05-16 Thread Andreas Kurz
On 2013-05-16 13:42, Wolfgang Routschka wrote:
 Hi Andreas,
 
 thank you for your answer.
 
 solutions is one coloation with -score

ah, yes ... only _one_ of them, with a non-INFINITY score, is needed.
The scores of all constraints are added up.
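
To spell it out with a sketch (the constraint names are made up and the score
value is purely illustrative; use one style or the other, not both):

# never together - the group stays stopped if only the postfix node survives:
colocation cl_never_together -inf: g_ip-address r_postfix
# preference only - the group may still move to the postfix node when no
# other node is left:
colocation cl_prefer_apart -100: g_ip-address r_postfix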

Regards,
Andreas

 
 colocation cl_g_ip-address_not_on_r_postfix -1: g_ip-address r_postfix
 
 Greetings Wolfgang
 
 
 On 2013-05-15 21:30, Wolfgang Routschka wrote:
 Hi everybody,
  
 one question today about colocation rule on a 2-node cluster on
 scientific linux 6.4 and pacemaker/cman.
  
 2-Node Cluster
  
 first node haproxy load balancer proxy service - second node with
 postfix service.
  
  colocation for running a group called g_ip-address (haproxy lsb-resource
 and ipaddress resource) on the other node of the postfix server is
  
 cl_g_ip-address_not_on_r_postfix -inf: g_ip-address r_postfix
 
 -INF == never-ever ;-)
 
  
  The problem is now that when the node with haproxy is down, pacemaker cannot
  move/migrate the services to the other node - OK, second colocation with a
  lower score, but it doesn't work for me
   
  colocation cl_g_ip-address_on_r_postfix -1: g_ip-address r_postfix
   
  What's my fault in this section?
 
 Hard to say without seeing the rest of your configuration, but you can
 run crm_simulate -s -L to see all the scores taken into account.
 
 Regards,
 Andreas
 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crm subshell 1.2.4 incompatible to pacemaker 1.1.9?

2013-05-16 Thread Rainer Brestan

The bug is in the function is_normal_node.

This function checks the attribute "type" for the value "normal".

But this attribute is not used any more.



CIB output from Pacemaker 1.1.8


 <nodes>
   <node id="int2node1" uname="int2node1">
     <instance_attributes id="nodes-int2node1">
       <nvpair id="nodes-int2node1-standby" name="standby" value="off"/>
     </instance_attributes>
   </node>
   <node id="int2node2" uname="int2node2">
     <instance_attributes id="nodes-int2node2">
       <nvpair id="nodes-int2node2-standby" name="standby" value="on"/>
     </instance_attributes>
   </node>
 </nodes>


CIB output from Pacemaker 1.1.7


 <nodes>
   <node id="int1node1" type="normal" uname="int1node1">
   </node>
   <node id="int1node2" type="normal" uname="int1node2">
   </node>
 </nodes>



Therefore the function listnodes will not return any nodes, and the standby function will use the current node as the node and the first argument as the lifetime.

If both node and lifetime are specified, it works because of the other else path.
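
A quick illustration of the symptom (node names taken from the 1.1.8 CIB output
above, crmsh 1.2.4 against pacemaker >= 1.1.8):

crm node standby int2node2 forever   # works: node and lifetime both given
crm node standby int2node2           # fails: the single argument is treated as
                                     # a lifetime -> ERROR: bad lifetime: int2node2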

Rainer





Sent: Wednesday, 15 May 2013 at 21:31
From: Lars Ellenberg lars.ellenb...@linbit.com
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] crm subshell 1.2.4 incompatible to pacemaker 1.1.9?

On Wed, May 15, 2013 at 03:34:14PM +0200, Dejan Muhamedagic wrote:
 On Tue, May 14, 2013 at 10:03:59PM +0200, Lars Ellenberg wrote:
  On Tue, May 14, 2013 at 09:59:50PM +0200, Lars Ellenberg wrote:
   On Mon, May 13, 2013 at 01:53:11PM +0200, Michael Schwartzkopff wrote:
Hi,
   
    crm tells me it is version 1.2.4
    pacemaker tells me it is version 1.1.9
   
    So it should work, since incompatibilities were resolved in crm versions
    higher than 1.2.1. Anyway, crm gives me nonsense:
   
# crm
crm(live)# node
crm(live)node# standby node1
ERROR: bad lifetime: node1
  
   Your node is not named node1.
   check: crm node list
  
   Maybe a typo, maybe some case-is-significant nonsense,
   maybe you just forgot to use the fqdn.
   maybe the check for is this a known node name is (now) broken?
  
  
   standby with just one argument checks if that argument
   happens to be a known node name,
   and assumes that if it is not,
   it has to be a lifetime,
   and the current node is used as node name...
  
   Maybe we should invert that logic, and instead compare the single
   argument against allowed lifetime values (reboot, forever), and assume
   it is supposed to be a node name otherwise?
  
   Then the error would become
   ERROR: unknown node name: node1
  
   Which is probably more useful most of the time.
  
   Dejan?
 
  Something like this maybe:
 
  diff --git a/modules/ui.py.in b/modules/ui.py.in
  --- a/modules/ui.py.in
  +++ b/modules/ui.py.in
  @@ -1185,7 +1185,7 @@ class NodeMgmt(UserInterface):
           if not args:
               node = vars.this_node
           if len(args) == 1:
  -            if not args[0] in listnodes():
  +            if args[0] in ("reboot", "forever"):

 Yes, I wanted to look at it again. Another complication is that
 the lifetime can be just about anything in that ISO date format.

That may well be, but right now those would be rejected by crmsh
anyways:

    if lifetime not in (None, "reboot", "forever"):
        common_err("bad lifetime: %s" % lifetime)
        return False

--
: Lars Ellenberg
: LINBIT  Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] having problem with crm cib shadow

2013-05-16 Thread George Gibat
crm(live)cib# use gfs2
ERROR: gfs2: no such shadow CIB
crm(live)cib# new gfs2
A shadow instance 'gfs2' already exists.
  To prevent accidental destruction of the cluster, the --force flag is
required in order to proceed.
crm(live)cib# list
crm(live)cib# use gfs2
ERROR: gfs2: no such shadow CIB
crm(live)cib#



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

2013-05-16 Thread Andrew Martin
Andrew,

I'd recommend adding more than one host to your p_ping resource and see if that 
improves the situation. When I had this problem, I observed better behavior 
after adding more than one IP to the list of hosts and changing the p_ping 
location constraint to be as follows:
location loc_run_on_most_connected g_mygroup \
rule $id=loc_run_on_most_connected-rule -inf: not_defined p_ping or 
p_ping lte 0

More information:
http://www.gossamer-threads.com/lists/linuxha/pacemaker/81502#81502
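
For example, the corresponding ping clone could look roughly like this (the IPs
and timings here are only placeholders, not values from the original config):

primitive p_ping ocf:pacemaker:ping \
        params host_list="192.168.5.1 192.168.5.2 192.168.5.3" \
        dampen="60s" multiplier="1000" \
        op monitor interval="15s" timeout="60s"
clone cl_ping p_ping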

Hope this helps,

Andrew

- Original Message -
 From: Andrew Widdersheim awiddersh...@hotmail.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Thursday, May 16, 2013 9:35:56 AM
 Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources 
 to restart?
 
 The cluster has 3 connections total. The first connection is the
 outside interface where services can communicate and is also used
 for cluster communication using mcast. The second interface is a
 cross-over that is solely for cluster communication. The third
 connection is another cross-over solely for DRBD replication.
 
 This issue happens when the first connection that is used for both
 the services and cluster communication is pulled on both nodes at
 the same time.
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

2013-05-16 Thread Andrew Widdersheim
Thanks for the help. Adding another node to the ping host_list may help in some 
situations, but the root issue doesn't really get solved. Also, the location 
constraint you posted is very different from mine: your constraint requires 
connectivity, whereas the one I am trying to use looks for the best connectivity. 

I have used the location constraint you posted with success in the past, but I 
don't want my resource to be shut off in the event of a network outage that is 
happening across all nodes at the same time. Don't get me wrong, in some cluster 
configurations I do use the configuration you posted, but this setup is not one 
of them, for specific reasons.
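
To make the difference concrete, the two styles side by side (assuming the ping
agent's default attribute name, pingd; adjust if the resource sets a name parameter):

# best connectivity: prefer whichever node has the higher pingd score
location l_connected g_mysql \
        rule $id=l_connected-rule pingd: defined pingd
# required connectivity: ban any node whose ping attribute is undefined or 0
location loc_run_on_most_connected g_mygroup \
        rule $id=loc_run_on_most_connected-rule -inf: not_defined pingd or pingd lte 0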
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] having problem with crm cib shadow

2013-05-16 Thread John McCabe
Which Linux distribution and version of pacemaker are you using?
/John

On Thursday, 16 May 2013, George Gibat wrote:

 crm(live)cib# use gfs2
 ERROR: gfs2: no such shadow CIB
 crm(live)cib# new gfs2
 A shadow instance 'gfs2' already exists.
   To prevent accidental destruction of the cluster, the --force flag is
 required in order to proceed.
 crm(live)cib# list
 crm(live)cib# use gfs2
 ERROR: gfs2: no such shadow CIB
 crm(live)cib#



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org javascript:;
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] having problem with crm cib shadow

2013-05-16 Thread George G. Gibat
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

CentOS 6.4, pacemaker 1.1.8-7.el6

On 2013-05-16 18:57, John McCabe wrote:
 Which Linux distribution and version of pacemaker are you using? /John
 
 On Thursday, 16 May 2013, George Gibat wrote:
 
 crm(live)cib# use gfs2 ERROR: gfs2: no such shadow CIB crm(live)cib# new gfs2 
 A shadow instance 'gfs2' already exists. To
 prevent accidental destruction of the cluster, the --force flag is required 
 in order to proceed. crm(live)cib# list 
 crm(live)cib# use gfs2 ERROR: gfs2: no such shadow CIB crm(live)cib#
 
 
 
 ___ Pacemaker mailing list: 
 Pacemaker@oss.clusterlabs.org javascript:; 
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org Getting started: 
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
 http://bugs.clusterlabs.org
 
 
 
 ___ Pacemaker mailing list: 
 Pacemaker@oss.clusterlabs.org 
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org Getting started: 
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
 http://bugs.clusterlabs.org
 

- -- 




 ---
George Gibat, Technical Director, CCNP, MSCE, CISSP, CNE
TTFN
PGP public key - http://www.gibat.com/ggibat-pub.asc

Gibat Enterprises, Inc
Connecting you to the world (R)
Your Portal to the Future (R)

http://www.gibat.com
http://www.spi.net
817.265.9962
9260 Walker Rd.
Ovid, MI 48866

The information contained in and transmitted with this email is or may be
confidential and/or privileged. It is intended only for the individual or
entity designated. You are hereby notified that any dissemination,
distribution, copying, use of or reliance upon the information contained in
and transmitted with this email by or to anyone other than the intended
recipient designated by the sender is unauthorized and strictly prohibited.
If you have received this email in error, please contact the sender at
(817)265-9962. Any email erroneously transmitted to you should be
immediately deleted.

-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.16 (MingW32)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlGVMfoACgkQaWdaxHduXchnAACcDnnu3cWSKjfp4aDg8y+65jvW
GmQAnR4PP1AYntV0qGZ87q8o0BdTRHjD
=eSll
-END PGP SIGNATURE-


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] question about interface failover

2013-05-16 Thread christopher barry
Greetings,

I've setup a new 2-node mysql cluster using
* drbd 8.3.1.3
* corosync 1.4.2
* pacemaker 117
on Debian Wheezy nodes.

Failover seems to be working fine for everything except the IPs manually
configured on the interfaces.

see config here:
http://pastebin.aquilenet.fr/?9eb51f6fb7d65fda#/YvSiYFocOzogAmPU9g
+g09RcJvhHbgrY1JuN7D+gA4=

If I bring down an interface, when the cluster restarts it, it only
comes up with the VIP - the original IP and route have been removed.

I'm not sure what to do to make sure the permanent IP and the routes get
restored. I'm not all that versed in the cluster command line yet, and
I'm using LCMC for most of my usage.

Thanks for your help,
-C


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] having problem with crm cib shadow

2013-05-16 Thread John McCabe
Worth trying crm_shadow as described here -
http://www.gossamer-threads.com/lists/linuxha/pacemaker/84969
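
In short, something along these lines (flags are the standard crm_shadow ones;
the shadow name is taken from your session, the configure step is just a placeholder):

crm_shadow --delete gfs2 --force    # discard the half-created instance
crm_shadow --create gfs2            # recreate it; this drops you into a shell
                                    # with CIB_shadow=gfs2 set
crm configure ...                   # changes now go to the shadow CIB
crm_shadow --commit gfs2            # push the shadow CIB to the live cluster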

I had the same problem and took it as a sign that I should just move to pcs
(from the RHEL repo, not the latest source), which went pretty smoothly; I
only had a few problems with assigning parameters to resources, but those
could easily be worked around using crm_resource.

On 16 May 2013 20:23, George G. Gibat ggi...@gibat.com wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 centos 6.4,  pacemaker 1.1.8-7.el6

 On 2013-05-16 18:57, John McCabe wrote:
  Which Linux distribution and version of pacemaker are you using? /John
 
  On Thursday, 16 May 2013, George Gibat wrote:
 
  crm(live)cib# use gfs2 ERROR: gfs2: no such shadow CIB crm(live)cib# new
 gfs2 A shadow instance 'gfs2' already exists. To
  prevent accidental destruction of the cluster, the --force flag is
 required in order to proceed. crm(live)cib# list
  crm(live)cib# use gfs2 ERROR: gfs2: no such shadow CIB crm(live)cib#
 
 
 
  ___ Pacemaker mailing list:
 Pacemaker@oss.clusterlabs.org javascript:;
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
  http://bugs.clusterlabs.org
 
 
 
  ___ Pacemaker mailing list:
 Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
  http://bugs.clusterlabs.org
 

 - --




  ---
 George Gibat, Technical Director, CCNP, MSCE, CISSP, CNE
 TTFN
 PGP public key - http://www.gibat.com/ggibat-pub.asc

 Gibat Enterprises, Inc
 Connecting you to the world (R)
 Your Portal to the Future (R)

 http://www.gibat.com
 http://www.spi.net
 817.265.9962
 9260 Walker Rd.
 Ovid, MI 48866

 The information contained in and transmitted with this email is or may be
 confidential and/or privileged. It is intended only for the individual or
 entity designated. You are hereby notified that any dissemination,
 distribution, copying, use of or reliance upon the information contained in
 and transmitted with this email by or to anyone other than the intended
 recipient designated by the sender is unauthorized and strictly prohibited.
 If you have received this email in error, please contact the sender at
 (817)265-9962. Any email erroneously transmitted to you should be
 immediately deleted.

 -BEGIN PGP SIGNATURE-
 Version: GnuPG v2.0.16 (MingW32)
 Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

 iEYEARECAAYFAlGVMfoACgkQaWdaxHduXchnAACcDnnu3cWSKjfp4aDg8y+65jvW
 GmQAnR4PP1AYntV0qGZ87q8o0BdTRHjD
 =eSll
 -END PGP SIGNATURE-


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] pacemaker-remote tls handshaking

2013-05-16 Thread Lindsay Todd
I've built pacemaker 1.1.10rc2 and am trying to get the pacemaker-remote
features working on my Scientific Linux 6.4 system.  It almost works...

The /etc/pacemaker/authkey file is on all the cluster nodes, as well as my
test VM (readable to all users, and checksums are the same everywhere).  I
can connect via telnet to port 3121 of the VM.  I even see the ghost node
appear for my VM when I use either 'crm status' or 'pcs status'.  (Aside:
 crmsh doesn't know about the new meta attributes for remote...)
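(For anyone reproducing this: the key itself is just random data, e.g. generated
with something like

  dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1

and then copied to every cluster node and to the VM.)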

But the communication isn't quite working.  In my log I see:

May 16 15:58:34 cvmh04 crmd[4893]:  warning: lrmd_tcp_connect_cb: Client tls handshake failed for server swbuildsl6:3121. Disconnecting
May 16 15:58:34 swbuildsl6 pacemaker_remoted[2308]:    error: lrmd_remote_client_msg: Remote lrmd tls handshake failed
May 16 15:58:35 cvmh04 crmd[4893]:  warning: lrmd_tcp_connect_cb: Client tls handshake failed for server swbuildsl6:3121. Disconnecting
May 16 15:58:35 swbuildsl6 pacemaker_remoted[2308]:    error: lrmd_remote_client_msg: Remote lrmd tls handshake failed

and it isn't long before pacemaker stops trying.

Is there some additional configuration I need?

/Lindsay
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] mysql ocf resource agent - resource stays unmanaged if binary unavailable

2013-05-16 Thread Vladimir
Hi,

our pacemaker setup provides a mysql resource using the OCF resource agent.
Today my colleagues and I tested forcing the mysql resource to fail. I
don't understand the following behaviour. When I remove the mysqld_safe
binary (whose path is specified in the crm config) from one server and
move the mysql resource to this server, the resource does not fail
back and stays in the unmanaged status. We can see that the function
check_binary() is called within the mysql OCF resource agent and
exits with error code 5. The fail-count gets raised to INFINITY,
pacemaker tries to stop the resource, and the stop fails. This results in
an unmanaged status.

How to reproduce:

1. mysql resource is running on node1
2. on node2 mv /usr/bin/mysqld_safe{,.bak}
3. crm resource move group-MySQL node2
4. observe corosync.log and crm_mon

# cat /var/log/corosync/corosync.log
[...]
May 16 10:53:41 node2 lrmd: [1893]: info: operation start[119] on
res-MySQL-IP1 for client 1896: pid 5137 exited with return code 0 May
16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation
res-MySQL-IP1_start_0 (call=119, rc=0, cib-update=98, confirmed=true)
ok May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing
key=94:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e
op=res-MySQL-IP1_monitor_3 ) May 16 10:53:41 node2 lrmd: [1893]:
info: rsc:res-MySQL-IP1 monitor[120] (pid 5222) May 16 10:53:41 node2
crmd: [1896]: info: do_lrm_rsc_op: Performing
key=96:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_start_0
) May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL start[121]
(pid 5223) May 16 10:53:41 node2 lrmd: [1893]: info: RA output:
(res-MySQL:start:stderr) 2013/05/16_10:53:41 ERROR: Setup problem:
couldn't find command: /usr/bin/mysqld_safe

May 16 10:53:41 node2 lrmd: [1893]: info: operation start[121] on
res-MySQL for client 1896: pid 5223 exited with return code 5 May 16
10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation
res-MySQL_start_0 (call=121, rc=5, cib-update=99, confirmed=true) not
installed May 16 10:53:41 node2 lrmd: [1893]: info: operation
monitor[120] on res-MySQL-IP1 for client 1896: pid 5222 exited with
return code 0 May 16 10:53:41 node2 crmd: [1896]: info:
process_lrm_event: LRM operation res-MySQL-IP1_monitor_3 (call=120,
rc=0, cib-update=100, confirmed=false) ok May 16 10:53:41 node2 attrd:
[1894]: notice: attrd_ais_dispatch: Update relayed from node1 May 16
10:53:41 node2 attrd: [1894]: notice: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-res-MySQL (INFINITY) May 16
10:53:41 node2 attrd: [1894]: notice: attrd_perform_update: Sent update
44: fail-count-res-MySQL=INFINITY May 16 10:53:41 node2 attrd: [1894]:
notice: attrd_ais_dispatch: Update relayed from node1 May 16 10:53:41
node2 attrd: [1894]: notice: attrd_trigger_update: Sending flush op to
all hosts for: last-failure-res-MySQL (1368694421) May 16 10:53:41
node2 attrd: [1894]: notice: attrd_perform_update: Sent update 47:
last-failure-res-MySQL=1368694421 May 16 10:53:41 node2 lrmd: [1893]:
info: cancel_op: operation monitor[117] on res-DRBD-MySQL:1 for client
1896, its parameters: drbd_resource=[mysql] CRM_meta_role=[Master]
CRM_meta_timeout=[2] CRM_meta_name=[monitor]
crm_feature_set=[3.0.5] CRM_meta_notify=[true]
CRM_meta_clone_node_max=[1] CRM_meta_clone=[1] CRM_meta_clone_max=[2]
CRM_meta_master_node_max=[1] CRM_meta_interval=[29000]
CRM_meta_globally_unique=[false] CRM_meta_master_max=[1]  cancelled May
16 10:53:41 node2 crmd: [1896]: info: send_direct_ack: ACK'ing resource
op res-DRBD-MySQL:1_monitor_29000 from
3:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e:
lrm_invoke-lrmd-1368694421-57 May 16 10:53:41 node2 crmd: [1896]: info:
do_lrm_rsc_op: Performing
key=8:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_stop_0 )
May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL stop[122] (pid
5278) [...]

I cannot figure out why the fail-count gets raised to INFINITY, and
especially why pacemaker tries to stop the resource after it fails.
Wouldn't it be best for the resource to fail over to another node
instead of ending up in an unmanaged status on that node? Is it
possible to force this behaviour in any way?
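
(As far as I understand, the unmanaged state comes from the failed stop: with
stonith disabled, pacemaker cannot escalate a stop failure, so it blocks the
resource. A sketch of what escalation would need - values illustrative, and
fencing has to actually work for this to be safe:

property stonith-enabled=true
primitive res-MySQL ocf:heartbeat:mysql \
        params binary="/usr/bin/mysqld_safe" \
        op stop interval=0 timeout=120s on-fail=fence
)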

Here some specs of the software used on our cluster nodes:

node1:~# lsb_release -d && dpkg -l pacemaker | awk '/ii/{print $2,$3}' && uname -ri
Description:    Ubuntu 12.04.2 LTS
pacemaker 1.1.6-2ubuntu3
3.2.0-41-generic x86_64

Best regards
Vladimir

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker-remote tls handshaking

2013-05-16 Thread David Vossel
- Original Message -
 From: Lindsay Todd rltodd@gmail.com
 To: The Pacemaker cluster resource manager Pacemaker@oss.clusterlabs.org
 Sent: Thursday, May 16, 2013 3:44:09 PM
 Subject: [Pacemaker] pacemaker-remote tls handshaking
 
 I've built pacemaker 1.1.10rc2 and am trying to get the pacemaker-remote
 features working on my Scientific Linux 6.4 system. It almost works...
 
 The /etc/pacemaker/authkey file is on all the cluster nodes, as well as my
 test VM (readable to all users, and checksums are the same everywhere). I
 can connect via telnet to port 3121 of the VM.

 I even see the ghost node
 appear for my VM when I use either 'crm status' or 'pcs status'. (Aside:
 crmsh doesn't know about the new meta attributes for remote...)
 
 But the communication isn't quite working. In my log I see:
 
 May 16 15:58:34 cvmh04 crmd[4893]: warning: lrmd_tcp_connect_cb: Client tls
 han
 dshake failed for server swbuildsl6:3121. Disconnecting
 May 16 15:58:34 swbuildsl6 pacemaker_remoted[2308]: error: lrmd_remote_client
 _msg: Remote lrmd tls handshake failed
 May 16 15:58:35 cvmh04 crmd[4893]: warning: lrmd_tcp_connect_cb: Client tls
 han
 dshake failed for server swbuildsl6:3121. Disconnecting
 May 16 15:58:35 swbuildsl6 pacemaker_remoted[2308]: error: lrmd_remote_client
 _msg: Remote lrmd tls handshake failed
 
 and it isn't long before pacemaker stops trying.
 
 Is there some additional configuration I need?

Ah, you dared to try my new feature, and this is what you get! :D

It looks like you have it covered.  If you can telnet into the VM from the host 
(it should kick you off pretty quickly), then all the firewall rules are 
correct. I'm not sure what is going on.  The only thing I can think of is 
perhaps your gnutls version doesn't like that I'm using a non-blocking socket 
during the tls handshake.

I doubt this will make a difference, but here's the key I use during testing, 
lrmd:ce9db0bc3cec583d3b3bf38b0ac9ff91

Has anyone else had success or run into something similar yet?  I'll help 
investigate this next week. I'll be out of the office until Tuesday.

-- Vossel

 /Lindsay
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.

2013-05-16 Thread renayama19661014
Hi Andrew,
Hi Vladislav,

I try whether this correction is effective for this problem.
 * 
https://github.com/beekhof/pacemaker/commit/eb6264bf2db395779e65dadf1c626e050a388c59

Best Regards,
Hideo Yamauchi.

--- On Thu, 2013/5/16, Andrew Beekhof and...@beekhof.net wrote:

 
 On 16/05/2013, at 3:49 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:
 
  16.05.2013 02:46, Andrew Beekhof wrote:
  
  On 15/05/2013, at 6:44 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:
  
  15.05.2013 11:18, Andrew Beekhof wrote:
  
  On 15/05/2013, at 5:31 PM, Vladislav Bogdanov bub...@hoster-ok.com 
  wrote:
  
  15.05.2013 10:25, Andrew Beekhof wrote:
  
  On 15/05/2013, at 3:50 PM, Vladislav Bogdanov bub...@hoster-ok.com 
  wrote:
  
  15.05.2013 08:23, Andrew Beekhof wrote:
  
  On 15/05/2013, at 3:11 PM, renayama19661...@ybb.ne.jp wrote:
  
  Hi Andrew,
  
  Thank you for comments.
  
  The guest located it to the shared disk.
  
  What is on the shared disk?  The whole OS or app-specific data 
  (i.e. nothing pacemaker needs directly)?
  
  Shared disk has all the OS and the all data.
  
  Oh. I can imagine that being problematic.
  Pacemaker really isn't designed to function without disk access.
  
  You might be able to get away with it if you turn off saving PE 
  files to disk though.
  
  I store CIB and PE files to tmpfs, and sync them to remote storage
  (CIFS) with lsyncd level 1 config (I may share it on request). It 
  copies
  critical data like cib.xml, and moves everything else, symlinking it 
  to
  original place. The same technique may apply here, but with local fs
  instead of cifs.
  
  Btw, the following patch is needed for that, otherwise pacemaker
  overwrites remote files instead of creating new ones on tmpfs:
  
  --- a/lib/common/xml.c  2011-02-11 11:42:37.0 +0100
  +++ b/lib/common/xml.c  2011-02-24 15:07:48.541870829 +0100
   @@ -529,6 +529,8 @@ write_file(const char *string, const char *filename)
       return -1;
   }
  
  +    unlink(filename);
  
  Seems like it should be safe to include for normal operation.
  
  Exactly.
  
  Small flaw in that logic... write_file() is not used anywhere.
  
  Heh, thanks for spotting this.
  
   I recall write_file() was used for the pengine, but some other function
   for the CIB. You probably optimized that but forgot to remove the unused
   function, which is why I was sure the patch was still valid. And I only
   ran tests (a CIFS storage outage simulation) after the initial patch, not
   in recent years, which is why I didn't notice the regression - the
   storage uses pacemaker too ;).
  
   This should go into write_xml_file() (and probably into other places just
   before fopen(..., "w"), e.g. for the series file).
  
   I've consolidated the code; however, adding the unlink() would break things 
  for anyone intentionally symlinking cib.xml from somewhere else (like a 
  git repo).
  So I'm not so sure I should make the unlink() change :(
  
  Agree.
  I originally made it specific to pengine files.
   Which do you prefer: a simple wrapper in xml.c (e.g.
   unlink_and_write_xml_file()) or just adding an unlink() call to the
   pengine before it calls write_xml_file()?
 
 The last one :)
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-16 Thread Andrew Widdersheim
Just tried the patch you gave and it worked fine. Any plans on putting this
patch in officially, or was this a one-off? Aside from this patch, I guess the
only things needed to get this working are to install things slightly
differently and to add a symlink from cluster-glue's lrmd to pacemaker's.
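
For anyone else hitting this, the symlink amounts to something like the
following - the exact paths depend on how cluster-glue and pacemaker were
built (these are typical RHEL6 x86_64 locations), so verify them on your
system first:

    # make heartbeat start pacemaker's lrmd instead of cluster-glue's
    ln -sf /usr/libexec/pacemaker/lrmd /usr/lib64/heartbeat/lrmd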

 Subject: Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the 
 LRM 7
 From: and...@beekhof.net
 Date: Thu, 16 May 2013 15:20:59 +1000
 CC: pacemaker@oss.clusterlabs.org
 To: awiddersh...@hotmail.com
 
 
 On 16/05/2013, at 3:16 PM, Andrew Widdersheim awiddersh...@hotmail.com 
 wrote:
 
  I'll look into moving over to the cman option since that is preferred for 
  RHEL6.4 now if I'm not mistaken.
 
 Correct
 
  I'll also try out the patch provided and see how that goes. So was the LRMD
  not part of pacemaker previously and added later? Was it originally part of
  heartbeat/cluster-glue? I'm just trying to figure out all of the pieces so
  that I know how to fix this if I choose to go down that road.
 
 
 Originally everything was part of heartbeat.
 Then what was then called the crm became pacemaker and the lrmd v1 became 
 part of cluster-glue (because the theory was that someone might use it for a 
 pacemaker alternative).
 That never happened, and we stopped using almost everything else from
 cluster-glue, so when lrmd v2 was written, it was done as part of
 pacemaker.
 
 or, tl;dr - yes and yes :)
  ___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-16 Thread Andrew Beekhof

On 17/05/2013, at 11:38 AM, Andrew Widdersheim awiddersh...@hotmail.com wrote:

 Just tried the patch you gave and it worked fine. Any plans on putting this
 patch in officially, or was this a one-off?

It will be in 1.1.10-rc3 soon

 Aside from this patch, I guess the only things needed to get this working are
 to install things slightly differently and to add a symlink from
 cluster-glue's lrmd to pacemaker's.

Excellent

 
  Subject: Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to 
  the LRM 7
  From: and...@beekhof.net
  Date: Thu, 16 May 2013 15:20:59 +1000
  CC: pacemaker@oss.clusterlabs.org
  To: awiddersh...@hotmail.com
  
  
  On 16/05/2013, at 3:16 PM, Andrew Widdersheim awiddersh...@hotmail.com 
  wrote:
  
   I'll look into moving over to the cman option since that is preferred for 
   RHEL6.4 now if I'm not mistaken.
  
  Correct
  
   I'll also try out the patch provided and see how that goes. So was the
   LRMD not part of pacemaker previously and added later? Was it originally
   part of heartbeat/cluster-glue? I'm just trying to figure out all of the
   pieces so that I know how to fix this if I choose to go down that road.
  
  
  Originally everything was part of heartbeat.
  Then what was then called the crm became pacemaker and the lrmd v1 became 
  part of cluster-glue (because the theory was that someone might use it for 
  a pacemaker alternative).
   That never happened, and we stopped using almost everything else from
   cluster-glue, so when lrmd v2 was written, it was done as part of
   pacemaker.
  
  or, tl;dr - yes and yes :)
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-16 Thread Andrew Widdersheim
I'm attaching 3 patches I made fairly quickly to fix the installation issues
and also an issue I noticed with the ping OCF agent from the latest pacemaker.

One is for cluster-glue, to prevent lrmd from being built and installed. You may
also want to modify this patch to take lrmd out of both spec files included
with the source if you plan to build an RPM. I'm not sure whether what I did
here is the best way to approach this problem, so if anyone has anything
better please let me know.

One is for pacemaker, to create the lrmd symlink when building with heartbeat
support. Note that the spec does not need anything changed here.

Finally, I saw the following errors in the messages log with the latest ping
OCF agent, and the attached patch seems to fix the issue.

May 16 01:10:13 node2 lrmd[16133]:   notice: operation_finished: 
p_ping_monitor_5000:17758 [ /usr/lib/ocf/resource.d/pacemaker/ping: line 296: 
[: : integer expression expected ] 
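
For context, "[: : integer expression expected" means an empty string reached a
numeric test on that line.  As an illustration of the usual cause and fix (the
variable name here is hypothetical, and this is not necessarily what the
attached patch does):

    [ "$active" -gt 0 ]        # breaks when $active is empty
    [ "${active:-0}" -gt 0 ]   # supplying a default avoids the error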

cluster-glue-no-lrmd.patch
Description: Binary data


pacemaker-lrmd-hb.patch
Description: Binary data


pacemaker-ping-failure.patch
Description: Binary data
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.

2013-05-16 Thread Andrew Beekhof

On 17/05/2013, at 10:27 AM, renayama19661...@ybb.ne.jp wrote:

 Hi Andrew,
 Hi Vladislav,
 
 I will test whether this correction is effective for this problem.
 * https://github.com/beekhof/pacemaker/commit/eb6264bf2db395779e65dadf1c626e050a388c59
 

Doubtful - it just reduces code duplication.
But it would also be a single place to put a deployment-specific patch :)
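
(For reference, what is being discussed in the quoted thread below amounts to
something like the sketch that follows.  It is only a sketch: the wrapper name
comes from Vladislav's suggestion, and the write_xml_file() signature is
assumed from a 1.1.x tree, so check it against your sources before using it as
a deployment-specific patch.)

    #include <errno.h>
    #include <unistd.h>
    #include <crm/common/xml.h>   /* xmlNode, gboolean, write_xml_file() */

    /* Remove any existing file or symlink first, so the rewrite creates a
     * fresh local file (e.g. on tmpfs) instead of writing through a link to
     * remote storage.  A missing file is not an error. */
    static int
    unlink_and_write_xml_file(xmlNode *xml, const char *filename,
                              gboolean compress)
    {
        if (unlink(filename) < 0 && errno != ENOENT) {
            return -errno;
        }
        return write_xml_file(xml, filename, compress);
    }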

 Best Regards,
 Hideo Yamauchi.
 
 --- On Thu, 2013/5/16, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 16/05/2013, at 3:49 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:
 
 16.05.2013 02:46, Andrew Beekhof wrote:
 
 On 15/05/2013, at 6:44 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:
 
 15.05.2013 11:18, Andrew Beekhof wrote:
 
 On 15/05/2013, at 5:31 PM, Vladislav Bogdanov bub...@hoster-ok.com 
 wrote:
 
 15.05.2013 10:25, Andrew Beekhof wrote:
 
 On 15/05/2013, at 3:50 PM, Vladislav Bogdanov bub...@hoster-ok.com 
 wrote:
 
 15.05.2013 08:23, Andrew Beekhof wrote:
 
 On 15/05/2013, at 3:11 PM, renayama19661...@ybb.ne.jp wrote:
 
 Hi Andrew,
 
 Thank you for comments.
 
  The guest is located on the shared disk.
 
 What is on the shared disk?  The whole OS or app-specific data 
 (i.e. nothing pacemaker needs directly)?
 
  The shared disk holds the whole OS and all the data.
 
 Oh. I can imagine that being problematic.
 Pacemaker really isn't designed to function without disk access.
 
 You might be able to get away with it if you turn off saving PE 
 files to disk though.
 
  I store CIB and PE files on tmpfs and sync them to remote storage
  (CIFS) with an lsyncd level-1 config (I can share it on request). It
  copies critical data like cib.xml, and moves everything else,
  symlinking it back to the original place. The same technique may
  apply here, but with a local fs instead of CIFS.
 
 Btw, the following patch is needed for that, otherwise pacemaker
 overwrites remote files instead of creating new ones on tmpfs:
 
 --- a/lib/common/xml.c  2011-02-11 11:42:37.0 +0100
 +++ b/lib/common/xml.c  2011-02-24 15:07:48.541870829 +0100
  @@ -529,6 +529,8 @@ write_file(const char *string, const char *filename)
   return -1;
   }
 
  +    unlink(filename);
 
 Seems like it should be safe to include for normal operation.
 
 Exactly.
 
 Small flaw in that logic... write_file() is not used anywhere.
 
 Heh, thanks for spotting this.
 
  I recall write_file() was used for the pengine, but some other function
  for the CIB. You probably optimized that but forgot to remove the unused
  function, which is why I was sure the patch was still valid. And I only
  ran tests (a CIFS storage outage simulation) after the initial patch, not
  in recent years, which is why I didn't notice the regression - the
  storage uses pacemaker too ;).
 
  This should go into write_xml_file() (and probably into other places just
  before fopen(..., "w"), e.g. for the series file).
 
  I've consolidated the code; however, adding the unlink() would break things 
 for anyone intentionally symlinking cib.xml from somewhere else (like a 
 git repo).
 So I'm not so sure I should make the unlink() change :(
 
 Agree.
 I originally made it specific to pengine files.
  Which do you prefer: a simple wrapper in xml.c (e.g.
  unlink_and_write_xml_file()) or just adding an unlink() call to the
  pengine before it calls write_xml_file()?
 
 The last one :)
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org