Re: [Pacemaker] Does "stonith_admin --confirm" work?
On 20/05/2013, at 3:00 PM, Староверов Никита Александрович wrote:

>> Well, that's not nothing, but it certainly doesn't look right either.
>> I will investigate. Which version is this?
>
> I've tried this with pacemaker 1.1.8 from the CentOS 6.4 repos, and then
> updated from the clusterlabs.org repo to pacemaker 1.1.9-2.
> I got the same issue again with pacemaker 1.1.9-2 and then posted it to the
> mailing list.

Ok, I'll see what I can dig up.
Re: [Pacemaker] error with cib synchronisation on disk
On 16/05/2013, at 9:31 PM, Халезов Иван wrote:

> On 16.05.2013 07:14, Andrew Beekhof wrote:
>> On 15/05/2013, at 9:53 PM, Халезов Иван wrote:
>>
>>> Hello everyone!
>>>
>>> Some problems occurred with synchronisation of the CIB configuration to disk.
>>> I have these errors in pacemaker's logfile:
>>
>> What were the messages before this?
>> Did it happen once or many times?
>> At startup or while the cluster was running?
>
> I had updated the cluster configuration before, so there was some output
> about it in the logfile (not from the beginning here, because it is rather big):

I'm guessing some whitespace crept into the configuration.
We've had problems with that in the past;
https://github.com/beekhof/pacemaker/commit/c2550cbd33a3b2ab7efcd6ef516ba124fbae9a81
is one patch that you don't have, for example.

> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - id="Security_A" >
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - id="Security_A-meta_attributes" >
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - id="Security_A-meta_attributes-target-role" name="target-role" value="Stopped" __crm_diff_marker__="removed:top" />
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: -
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: -
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - id="Security_B" >
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - id="SPBEX_Security_B-meta_attributes" >
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - id="Security_B-meta_attributes-target-role" name="target-role" value="Started" __crm_diff_marker__="removed:top" />
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: -
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: -
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: -
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: -
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: -
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: -
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Mon May 13 18:50:25 2013" crm_feature_set="3.0.6" update-origin="iblade6.net.rts" update-client="cibadmin" have-quorum="1" dc-uuid="2130706433" >
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: +
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: +
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + id="FAST_SENDERS" >
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + id="FAST_SENDERS-meta_attributes" __crm_diff_marker__="added:top" >
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + id="FAST_SENDERS-meta_attributes-target-role" name="target-role" value="Started" />
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: +
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: +
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: +
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: +
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: +
> May 14 13:29:13 iblade6 cib[2848]: info: cib_process_request: Operation complete: op cib_replace for section resources (origin=local/cibadmin/2, version=0.496.1): ok (rc=0)
> May 14 13:29:13 iblade6 pengine[2852]: notice: LogActions: Start Trades_INCR_A#011(iblade6.net.rts)
> May 14 13:29:13 iblade6 pengine[2852]: notice: LogActions: Start Trades_INCR_B#011(iblade6.net.rts)
> May 14 13:29:13 iblade6 pengine[2852]: notice: LogActions: Start Security_A#011(iblade6.net.rts)
> May 14 13:29:13 iblade6 pengine[2852]: notice: LogActions: Start Security_B#011(iblade6.net.rts)
> May 14 13:29:13 iblade6 crmd[2853]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> May 14 13:29:13 iblade6 crmd[2853]: info: do_te_invoke: Processing graph 41 (ref=pe_calc-dc-1368523753-125) derived from /var/lib/pengine/pe-input-452.bz2
> May 14 13:29:13 iblade6 crmd[2853]: info: te_rsc_command: Initiating action 80: start Trades_INCR_A_start_0 on iblade6.net.rts (local)
> May 14 13:29:13 iblade6 cluster: error: validate_cib_digest: Digest comparision failed: expected 2c91194022c98636f90df9dd5e7176c6 (/var/lib/heartbeat/crm/cib.Zm249H), calculated bc160870924630b3907c8cb1c3128eee
> May 14 13:29:13 iblade6 cluster: error: retrieveCib: Checksum of /var/lib/heartbeat/crm/cib.a024wF failed! Configuration contents ignored!
> May 14 13:29:13 iblade6 cluster: error: retrieveCib: Usually this is caused by manual changes, please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected
> May 14 13:29:13 iblade6 cluster: error: crm_abort: write_cib_contents: Triggered fatal assert at io.c:662 : retrieveCib(tmp1, tmp2, FALSE) != NULL
> May 14 13:29:13 iblade6 pengine[2852]: notice: process_pe_message: Transition 41: PEngine Input s
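For reference, the usual recovery when a node's on-disk CIB fails its digest
check is roughly the following. This is only an illustrative sketch of the
procedure behind the FAQ link in the log above, not something confirmed in this
thread; the backup path is a placeholder, and the CIB directory is the
heartbeat-era /var/lib/heartbeat/crm that appears in the errors:

  # on the affected node only, with the cluster stack stopped there
  mkdir -p /root/cib-backup
  mv /var/lib/heartbeat/crm/cib* /root/cib-backup/   # keep a copy rather than deleting
  # restart the cluster stack on that node; it then pulls a fresh,
  # correctly signed copy of the CIB from the current DC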
Re: [Pacemaker] Does "stonith_admin --confirm" work?
> Well, that's not nothing, but it certainly doesn't look right either.
> I will investigate. Which version is this?

I've tried this with pacemaker 1.1.8 from the CentOS 6.4 repos, and then
updated from the clusterlabs.org repo to pacemaker 1.1.9-2.
I got the same issue again with pacemaker 1.1.9-2 and then posted it to the
mailing list.
Re: [Pacemaker] Does "stonith_admin --confirm" work?
On 17/05/2013, at 6:22 PM, Староверов Никита Александрович wrote:

> Hello, pacemaker users and developers.
>
> First, many thanks to clusterlabs.org for their software; Pacemaker helps us
> very much!
>
> I am testing a cluster configuration based on Pacemaker+CMAN. I configured
> fencing as described in the Pacemaker documentation about CMAN-based
> clusters, and it works.
> Maybe I misunderstood something, but I can't acknowledge node fencing
> manually.
> I use fence_ipmilan as the device, and when I pull the power cable from a
> server, stonith fails. I expected this, of course, but I don't know how to
> acknowledge the fencing manually.
> When I try stonith_admin -C node_name, it does nothing.
> I see this in the logs:
>
> May 17 11:46:52 NODE1 stonith-ng[5434]: notice: stonith_manual_ack: Injecting manual confirmation that NODE2 is safely off/down
> May 17 11:46:52 NODE1 stonith-ng[5434]: notice: log_operation: Operation 'off' [0] (call 2 from stonith_admin.10959) for host 'NODE2' with device 'manual_ack' returned: 0 (OK)
> May 17 11:46:52 NODE1 stonith-ng[5434]: error: crm_abort: do_local_reply: Triggered assert at main.c:241 : client_obj->request_id
> May 17 11:46:52 NODE1 stonith-ng[5434]: error: crm_abort: crm_ipcs_sendv: Triggered assert at ipc.c:575 : header->qb.id != 0
> May 17 11:47:35 NODE1 stonith_admin[11162]: notice: crm_log_args: Invoked: stonith_admin -C NODE2
> May 17 11:47:35 NODE1 stonith-ng[5434]: notice: merge_duplicates: Merging stonith action off for node NODE2 originating from client stonith_admin.11162.b42172b1 with identical request from stonith_admin.10959@NODE1.f2048550 (0s)
> May 17 11:47:35 NODE1 stonith-ng[5434]: notice: stonith_manual_ack: Injecting manual confirmation that NODE2 is safely off/down
> May 17 11:47:35 NODE1 stonith-ng[5434]: notice: log_operation: Operation 'off' [0] (call 2 from stonith_admin.11162) for host 'NODE2' with device 'manual_ack' returned: 0 (OK)
> May 17 11:47:35 NODE1 stonith-ng[5434]: error: crm_abort: do_local_reply: Triggered assert at main.c:241 : client_obj->request_id
> May 17 11:47:35 NODE1 stonith-ng[5434]: error: crm_abort: crm_ipcs_sendv: Triggered assert at ipc.c:575 : header->qb.id != 0

Well, that's not nothing, but it certainly doesn't look right either.
I will investigate. Which version is this?

> Nothing happened after stonith_admin -C.
> Fenced is still trying fence_pcmk, and I see lots of "Timer expired" messages
> from stonith-ng, and failed fence_ipmilan operations.
>
> Yes, I can do fence_ack_manual on the cman master node and then clean up the
> node state with cibadmin, but that is a very slow way.
> If I lose many servers in the cluster, for example power to one rack with
> two or more servers, I need a way to get services running again on the
> remaining nodes as quickly as possible.
>
> I think manual fencing acknowledgement must be fast and simple, and I suppose
> that stonith_admin --confirm is meant to do that.
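For reference, the manual recovery path being described amounts to roughly the
following. This is a sketch only, using the node names from this thread; the
fence_ack_manual syntax varies between cman releases, and, as the asserts above
show, the stonith_admin step does not yet behave as intended on 1.1.8/1.1.9:

  # 1. acknowledge the failed fence to fenced, on the node where it is waiting
  fence_ack_manual NODE2

  # 2. record the manual confirmation in stonith-ng - the step this thread is
  #    about, which currently only logs the crm_abort asserts shown above
  stonith_admin --confirm NODE2

  # 3. check whether the cluster still considers the node unclean before
  #    resorting to cleaning up its state with cibadmin
  crm_mon -1 | grep -i unclean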
Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.
Hi Vladislav,

> For just this, the patch is unneeded. It only plays when you have the
> pengine files symlinked from stable storage to tmpfs. Without the patch,
> pengine would try to rewrite the file where the symlink points - directly
> on stable storage. With the patch, pengine will remove the symlink (and
> just the symlink) and will open a new file on tmpfs for writing. Thus, it
> will not block if the stable storage is inaccessible (in my case because
> of connectivity problems, in yours because of a backing-storage outage).
>
> If you decide to go with tmpfs *and* use the same synchronization method
> as I do, then you'd need to bake a similar patch for 1.0 - just add
> unlink() before pengine writes its data (I suspect that code differs
> between 1.0 and 1.1.10; even in 1.1.6 it was different from current master).

Thank you for the detailed explanation.
At first I will confirm the behaviour using tmpfs only.

Many Thanks!
Hideo Yamauchi.

--- On Fri, 2013/5/17, Vladislav Bogdanov wrote:

> Hi Hideo-san,
>
> 17.05.2013 10:29, renayama19661...@ybb.ne.jp wrote:
>> Hi Vladislav,
>>
>> Thank you for the advice.
>>
>> I will try the patch which you showed.
>>
>> We use Pacemaker 1.0, but will apply the patch there because there is
>> similar code.
>>
>> If there is a question about the setup, I will ask you by email.
>> * At first I will only use tmpfs, and I intend to test it.
>
> For just this, the patch is unneeded. It only plays when you have the
> pengine files symlinked from stable storage to tmpfs. Without the patch,
> pengine would try to rewrite the file where the symlink points - directly
> on stable storage. With the patch, pengine will remove the symlink (and
> just the symlink) and will open a new file on tmpfs for writing. Thus, it
> will not block if the stable storage is inaccessible (in my case because
> of connectivity problems, in yours because of a backing-storage outage).
>
> If you decide to go with tmpfs *and* use the same synchronization method
> as I do, then you'd need to bake a similar patch for 1.0 - just add
> unlink() before pengine writes its data (I suspect that code differs
> between 1.0 and 1.1.10; even in 1.1.6 it was different from current master).
>
>>> P.S. Andrew, is this patch ok to apply?
>>
>> To Andrew...
>> Does the write_xml-related patch in your repository have to be applied
>> before Vladislav's patch can be confirmed?
>>
>> Many Thanks!
>> Hideo Yamauchi.
>>
>> --- On Fri, 2013/5/17, Vladislav Bogdanov wrote:
>>
>>> Hi Hideo-san,
>>>
>>> You may try the following patch (with the trick below):
>>>
>>> From 2c4418d11c491658e33c149f63e6a2f2316ef310 Mon Sep 17 00:00:00 2001
>>> From: Vladislav Bogdanov
>>> Date: Fri, 17 May 2013 05:58:34 +0000
>>> Subject: [PATCH] Feature: PE: Unlink pengine output files before writing.
>>>
>>> This should help guys who store them to tmpfs and then copy to a stable
>>> storage on (inotify) events with symlink creation in the original place,
>>> to survive when stable storage is not accessible.
>>> ---
>>>  pengine/pengine.c |    1 +
>>>  1 files changed, 1 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/pengine/pengine.c b/pengine/pengine.c
>>> index c7e1c68..99a81c6 100644
>>> --- a/pengine/pengine.c
>>> +++ b/pengine/pengine.c
>>> @@ -184,6 +184,7 @@ process_pe_message(xmlNode * msg, xmlNode * xml_data, crm_client_t * sender)
>>>      }
>>>
>>>      if (is_repoke == FALSE && series_wrap != 0) {
>>> +        unlink(filename);
>>>          write_xml_file(xml_data, filename, HAVE_BZLIB_H);
>>>          write_last_sequence(PE_STATE_DIR, series[series_id].name, seq + 1, series_wrap);
>>>      } else {
>>> --
>>> 1.7.1
>>>
>>> You just need to ensure that /var/lib/pacemaker is on tmpfs. Then you may
>>> watch the directories there with inotify or so and take actions to move
>>> (copy) files to stable storage (RAM is not of infinite size).
>>> In my case that is CIFS, and I use lsyncd to synchronize those directories.
>>> If you are interested, I can provide you with the relevant lsyncd
>>> configuration. Frankly speaking, there is no big need to create symlinks
>>> in tmpfs to stable storage, as pacemaker does not use existing pengine
>>> files (except sequences). The sequence files and cib.xml are the only
>>> exceptions which you may want to exist in two places (and you may want to
>>> copy them from stable storage to tmpfs before pacemaker starts), and you
>>> can just move everything else away from tmpfs once it is written. In this
>>> case you do not need this patch.
>>>
>>> Best,
>>> Vladislav
>>>
>>> P.S. Andrew, is this patch ok to apply?
>>>
>>> 17.05.2013 03:27, renayama19661...@ybb.ne.jp wrote:
>>>> Hi Andrew,
>>>> Hi Vladislav,
>>>>
>>>> I will test whether this fix is effective for this problem.
>>>> * https://github.com/beekhof/pacemaker/commit/eb6264bf2db395779e
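The arrangement Vladislav describes can be sketched roughly as follows. This is
only an illustration: the mount options, the /srv/stable path and the
inotifywait loop are assumptions standing in for his actual lsyncd setup, and
the /var/lib/pacemaker path applies to 1.1.x (Pacemaker 1.0 keeps its state
under /var/lib/pengine and /var/lib/heartbeat/crm instead):

  # /etc/fstab: keep the policy-engine output directory on tmpfs
  tmpfs  /var/lib/pacemaker  tmpfs  size=128m,mode=0750  0 0

  # before starting pacemaker, seed tmpfs with the files it still wants
  # (the CIB and the pengine sequence files)
  cp -a /srv/stable/pacemaker/cib /srv/stable/pacemaker/pengine /var/lib/pacemaker/ 2>/dev/null || true

  # crude stand-in for the lsyncd job: copy new pengine outputs to stable
  # storage as they appear, so the tmpfs does not fill up
  inotifywait -m -e close_write /var/lib/pacemaker/pengine |
  while read -r dir events file; do
      cp "/var/lib/pacemaker/pengine/$file" /srv/stable/pacemaker/pengine/
  done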
Re: [Pacemaker] IPaddr2 cloned address doesn't survive node standby
On 2013-05-17 22:07, Jake Smith wrote:

>> primitive p_ip_service_ns ocf:heartbeat:IPaddr2 \
>>         params ip="192.168.114.17" cidr_netmask="24" nic="eth0" \
>>         clusterip_hash="sourceip-sourceport"
>
> netmask should be 32 if that's supposed to be a single IP load balanced.

I've been wondering about that, but I think 24 is correct. The address is
recognized as "secondary" by Linux, as can be seen in this "ip addr" output:

2: eth0: mtu 1500 qdisc pfifo_fast state UP qlen 1000
    inet 192.168.114.16/24 brd 192.168.114.255 scope global eth0
    inet 192.168.114.17/24 brd 192.168.114.255 scope global secondary eth0

Setting it this way has been working fine for a long time now. *shrug*

> Don't you need colocation also between the clones so that bind can only start
> on a node that has already started an ip instance?

I thought that since clones are started on all nodes anyway, a simple "order"
directive would suffice. But I've added a colocation constraint as well, to be
sure. Thanks for the hint.

> For the number of restarts it's likely because of the interleaving settings.
> True for both would likely help that but wouldn't work in your case - more
> here:
> http://www.hastexo.com/resources/hints-and-kinks/interleaving-pacemaker-clones

Yes, there doesn't seem to be a way to interleave these cloned resources in a
way that avoids restarting Bind on such cluster state changes.

> When you put dns01 in standby does dns02 have both instances of the IP there?
> If not it should be (you are just load balancing a single IP, correct?). You
> need clone-node-max=2 for the ip clone.

clone-node-max was always set to "2", yes.

> If so, one just doesn't move back to dns01 when you bring it out of standby?
> I would look at resource stickiness=0 for the ip clone resource only, so the
> cluster will redistribute when the node comes out of standby (I think that
> would work). Clones have a default stickiness of 1 if you don't have a
> default set for the cluster.

Bingo, the resource stickiness was the problem! I've set it to 0 and now the
IP resource gets started again when the node comes back online. Thanks a lot,
I would not have thought of that.

As stated above, shouldn't cloned resources be (re-)started on all nodes by
definition?

> And/or you can write location constraints for the clone instances of ip to
> prefer one node over the other, causing them to fail back if the node
> returns, i.e. location ip0_prefers_dns01 cl_ip_service_ns:0 200: dns01 and
> location ip1_prefers_dns02 cl_ip_service_ns:1 200: dns02

That doesn't seem necessary, now with resource-stickiness="0". Thanks again!
Andreas

PS: Here's the complete configuration for the archives, in case someone might
be interested in the future:

node dns01
node dns02
primitive p_bind9 lsb:bind9 \
        op monitor interval="10s" timeout="15s" \
        op start interval="0" timeout="15s" \
        op stop interval="0" timeout="15s" \
        meta target-role="Started"
primitive p_ip_service_ns ocf:heartbeat:IPaddr2 \
        params ip="192.168.114.17" cidr_netmask="24" nic="eth0" clusterip_hash="sourceip-sourceport" \
        op monitor interval="10s" \
        op start interval="0" timeout="20s" \
        op stop interval="0" timeout="20s"
clone cl_bind9 p_bind9 \
        meta globally-unique="false" clone-max="2" clone-node-max="1" interleave="false" target-role="Started"
clone cl_ip_service_ns p_ip_service_ns \
        meta globally-unique="true" clone-max="2" clone-node-max="2" interleave="false" target-role="Started"
colocation co_ip_before_bind9 inf: cl_ip_service_ns cl_bind9
order o_ip_before_bind9 inf: cl_ip_service_ns cl_bind9
property $id="cib-bootstrap-options" \
        dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        no-quorum-policy="ignore" \
        stonith-enabled="no" \
        last-lrm-refresh="1368814808"
rsc_defaults $id="rsc-options" \
        resource-stickiness="0"
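A quick way to check the behaviour discussed in this thread is sketched below.
These are generic commands using the names from the configuration above; they
are not part of Andreas' message:

  # with dns01 in standby, dns02 should carry both CLUSTERIP instances
  crm_mon -1 | grep cl_ip_service_ns
  ip -o addr show dev eth0 | grep 192.168.114.17
  iptables -L INPUT -n | grep CLUSTERIP   # shows the CLUSTERIP rule and its hash mode

  # after bringing dns01 back online, resource-stickiness="0" should let one
  # instance move back automatically
  crm node online dns01
  crm_mon -1 | grep cl_ip_service_ns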
Re: [Pacemaker] Does "stonith_admin --confirm" work?
On 17.05.2013 10:22, Староверов Никита Александрович wrote:

> Nothing happened after stonith_admin -C.
> Fenced is still trying fence_pcmk, and I see lots of "Timer expired" messages
> from stonith-ng, and failed fence_ipmilan operations.
>
> Yes, I can do fence_ack_manual on the cman master node and then clean up the
> node state with cibadmin, but that is a very slow way.
> If I lose many servers in the cluster, for example power to one rack with
> two or more servers, I need a way to get services running again on the
> remaining nodes as quickly as possible.
>
> I think manual fencing acknowledgement must be fast and simple, and I suppose
> that stonith_admin --confirm is meant to do that.

I would also like to know a solution to this problem.

My current situation: I am using IPMI as a stonith device. However, if there
is a problem with the (redundant) power supply and the IPMI device is
therefore not working, I'm having a hard time troubleshooting my 2-node
cluster.

Cheers,
Raoul
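One way to reduce the dependence on manual confirmation in that scenario is a
second fencing device on a different power path, registered as a fallback
level. The snippet below is only an illustration in crm shell syntax: device
names, addresses and credentials are placeholders, and it assumes a switched
PDU is available and a Pacemaker version with fencing-topology support:

  primitive st-ipmi-node2 stonith:fence_ipmilan \
          params ipaddr="10.0.0.2" login="admin" passwd="secret" lanplus="1" \
                 pcmk_host_list="node2"
  primitive st-pdu-node2 stonith:fence_apc_snmp \
          params ipaddr="10.0.0.10" port="2" pcmk_host_list="node2"
  fencing_topology \
          node2: st-ipmi-node2 st-pdu-node2

With two levels configured like this, stonith-ng only falls through to the PDU
when the IPMI device cannot complete the fence, which is exactly the
dead-power-supply case described above.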