Re: [Pacemaker] Does "stonith_admin --confirm" work?

2013-05-19 Thread Andrew Beekhof

On 20/05/2013, at 3:00 PM, Староверов Никита Александрович wrote:

>> Well, that's not nothing, but it certainly doesn't look right either.
>> I will investigate.  Which version is this?
> 
> I've tried this with pacemaker 1.1.8 from the CentOS 6.4 repos, and then updated 
> from the clusterlabs.org repo to pacemaker 1.1.9-2. 
> I got the same issue again with pacemaker 1.1.9-2 and then posted it to the 
> mailing list.

Ok, I'll see what I can dig up.


Re: [Pacemaker] error with cib synchronisation on disk

2013-05-19 Thread Andrew Beekhof

On 16/05/2013, at 9:31 PM, Халезов Иван  wrote:

> On 16.05.2013 07:14, Andrew Beekhof wrote:
>> On 15/05/2013, at 9:53 PM, Халезов Иван  wrote:
>> 
>>> Hello everyone!
>>> 
>>> Some problems occurred with synchronising the CIB configuration to disk.
>>> I have these errors in Pacemaker's logfile:
>> What were the messages before this?
>> Did it happen once or many times?
>> At startup or while the cluster was running?
> 
> I had updated the cluster configuration before, so there was some output about it 
> in the logfile (not from the beginning, because it is rather big):

I'm guessing some whitespace crept into the configuration.
We've had problems with that in the past; 
https://github.com/beekhof/pacemaker/commit/c2550cbd33a3b2ab7efcd6ef516ba124fbae9a81
is one patch that you don't have, for example.
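
As a general aside for anyone following along: stray whitespace usually gets in
when the on-disk cib.xml is edited by hand, so the safer route is to push changes
through the cib daemon. A minimal shell sketch, assuming standard Pacemaker 1.1.x
tools and an illustrative temp path:

# keep a backup of the live CIB before touching anything
cibadmin --query > /tmp/cib-backup.xml
# sanity-check the configuration the cluster is actually running
crm_verify --live-check -V
# push an edited copy back; the cib daemon then rewrites the on-disk file itself
cibadmin --replace --xml-file /tmp/cib-edited.xml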

> 
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: -  id="Security_A" >
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: -  id="Security_A-meta_attributes" >
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: -  id="Security_A-meta_attributes-target-role" name="target-role" 
> value="Stopped" __crm_diff_marker__="r
> emoved:top" />
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - 
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - 
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: -  id="Security_B" >
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: -  id="SPBEX_Security_B-meta_attributes" >
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: -  id="Security_B-meta_attributes-target-role" name="target-role" 
> value="Started" __crm_diff_marker__="removed:top" />
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - 
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - 
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - 
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - 
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - 
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - 
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: +  num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" 
> cib-last-written="Mon May 13 18:50:25 2013" crm_feature_set="3.0.6" 
> update-origin="iblade6.net.rts" update-client="cibadmin" have-quorum="1" 
> dc-uuid="2130706433" >
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + 
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + 
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: +  id="FAST_SENDERS" >
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: +  id="FAST_SENDERS-meta_attributes" __crm_diff_marker__="added:top" >
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: +  id="FAST_SENDERS-meta_attributes-target-role" name="target-role" 
> value="Started" />
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + 
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + 
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + 
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + 
> May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + 
> May 14 13:29:13 iblade6 cib[2848]: info: cib_process_request: Operation 
> complete: op cib_replace for section resources (origin=local/cibadmin/2, 
> version=0.496.1): ok (rc=0)
> May 14 13:29:13 iblade6 pengine[2852]:   notice: LogActions: Start 
> Trades_INCR_A#011(iblade6.net.rts)
> May 14 13:29:13 iblade6 pengine[2852]:   notice: LogActions: Start 
> Trades_INCR_B#011(iblade6.net.rts)
> May 14 13:29:13 iblade6 pengine[2852]:   notice: LogActions: Start 
> Security_A#011(iblade6.net.rts)
> May 14 13:29:13 iblade6 pengine[2852]:   notice: LogActions: Start 
> Security_B#011(iblade6.net.rts)
> May 14 13:29:13 iblade6 crmd[2853]:   notice: do_state_transition: State 
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
> cause=C_IPC_MESSAGE origin=handle_response ]
> May 14 13:29:13 iblade6 crmd[2853]: info: do_te_invoke: Processing graph 
> 41 (ref=pe_calc-dc-1368523753-125) derived from 
> /var/lib/pengine/pe-input-452.bz2
> May 14 13:29:13 iblade6 crmd[2853]: info: te_rsc_command: Initiating 
> action 80: start Trades_INCR_A_start_0 on iblade6.net.rts (local)
> May 14 13:29:13 iblade6 cluster:error: validate_cib_digest: Digest 
> comparision failed: expected 2c91194022c98636f90df9dd5e7176c6 
> (/var/lib/heartbeat/crm/cib.Zm249H), calculated 
> bc160870924630b3907c8cb1c3128eee
> May 14 13:29:13 iblade6 cluster:error: retrieveCib: Checksum of 
> /var/lib/heartbeat/crm/cib.a024wF failed!  Configuration contents ignored!
> May 14 13:29:13 iblade6 cluster:error: retrieveCib: Usually this is 
> caused by manual changes, please refer to 
> http://clusterlabs.org/wiki/FAQ#cib_changes_detected
> May 14 13:29:13 iblade6 cluster:error: crm_abort: write_cib_contents: 
> Triggered fatal assert at io.c:662 : retrieveCib(tmp1, tmp2, FALSE) != NULL
> May 14 13:29:13 iblade6 pengine[2852]:   notice: process_pe_message: 
> Transition 41: PEngine Input s
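
For the archives, one common recovery path for the validate_cib_digest/retrieveCib
errors above, sketched under the assumption that a peer node still holds a good CIB
copy (check the FAQ link from the log before doing this):

# on the affected node only
service pacemaker stop
# move the suspect on-disk CIB aside rather than deleting it
mv /var/lib/heartbeat/crm /var/lib/heartbeat/crm.bad
mkdir /var/lib/heartbeat/crm
chown hacluster:haclient /var/lib/heartbeat/crm
chmod 750 /var/lib/heartbeat/crm
# on restart the node re-syncs the CIB from its peer
service pacemaker start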

Re: [Pacemaker] Does "stonith_admin --confirm" work?

2013-05-19 Thread Староверов Никита Александрович
> Well, that's not nothing, but it certainly doesn't look right either.
> I will investigate.  Which version is this?

I've tried this with pacemaker 1.1.8 from the CentOS 6.4 repos, and then updated 
from the clusterlabs.org repo to pacemaker 1.1.9-2. 
I got the same issue again with pacemaker 1.1.9-2 and then posted it to the 
mailing list.
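
For reference, a quick way to double-check which build actually ended up installed
and running after such an update (output format differs a little between 1.1.x
releases):

rpm -q pacemaker pacemaker-cli pacemaker-libs
cibadmin --version
crm_mon --version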




Re: [Pacemaker] Does "stonith_admin --confirm" work?

2013-05-19 Thread Andrew Beekhof

On 17/05/2013, at 6:22 PM, Староверов Никита Александрович wrote:

> Hello, pacemaker users and developers.
> 
> First, many thanks to clusterlabs.org for their software; Pacemaker helps us 
> very much!
> 
> I am testing a cluster configuration based on Pacemaker+CMAN. I configured 
> fencing as described in the Pacemaker documentation for CMAN-based clusters, 
> and it works.
> Maybe I misunderstood something, but I can't acknowledge node fencing 
> manually.
> I use fence_ipmilan as the device, and when I unplug the power cable from the 
> server, stonith fails. I expected this, of course, but I don't know how to 
> acknowledge the fencing manually.
> When I try stonith_admin -C node_name, it does nothing. 
> I see this in the logs:
> 
> May 17 11:46:52 NODE1 stonith-ng[5434]:   notice: stonith_manual_ack: 
> Injecting manual confirmation that NODE2 is safely off/down
> May 17 11:46:52 NODE1 stonith-ng[5434]:   notice: log_operation: Operation 
> 'off' [0] (call 2 from stonith_admin.10959) for host 'NODE2' with device 
> 'manual_ack' returned: 0 (OK)
> May 17 11:46:52 NODE1 stonith-ng[5434]:error: crm_abort: do_local_reply: 
> Triggered assert at main.c:241 : client_obj->request_id   
>
> May 17 11:46:52 NODE1 stonith-ng[5434]:error: crm_abort: crm_ipcs_sendv: 
> Triggered assert at ipc.c:575 : header->qb.id != 0
>
> May 17 11:47:35 NODE1 stonith_admin[11162]:   notice: crm_log_args: Invoked: 
> stonith_admin -C NODE2
>   
> May 17 11:47:35 NODE1 stonith-ng[5434]:   notice: merge_duplicates: Merging 
> stonith action off for node NODE2 originating from client 
> stonith_admin.11162.b42172b1 with identical request from 
> stonith_admin.10959@NODE1.f2048550 (0s)   
>
> May 17 11:47:35 NODE1 stonith-ng[5434]:   notice: stonith_manual_ack: 
> Injecting manual confirmation that NODE2 is safely off/down   
>
> May 17 11:47:35 NODE1 stonith-ng[5434]:   notice: log_operation: Operation 
> 'off' [0] (call 2 from stonith_admin.11162) for host 'NODE2' with device 
> 'manual_ack' returned: 0 (OK)  
> May 17 11:47:35 NODE1 stonith-ng[5434]:error: crm_abort: do_local_reply: 
> Triggered assert at main.c:241 : client_obj->request_id   
>  
> May 17 11:47:35 NODE1 stonith-ng[5434]:error: crm_abort: crm_ipcs_sendv: 
> Triggered assert at ipc.c:575 : header->qb.id != 0

Well, that's not nothing, but it certainly doesn't look right either.
I will investigate.  Which version is this?

> 
> Nothing happened after stonith_admin -C.
> Fenced is still trying fence_pcmk, and I see lots of "Timer expired" messages 
> from stonith-ng, and failed fence_ipmilan operations.
> 
> Yes, I can do fence_ack_manual on the cman master node and then clean up the 
> node state with cibadmin, but that is a very slow way. 
> If I lose many servers in the cluster, for example when one rack with two or 
> more servers loses power, I need a way to get services running again on the 
> remaining nodes as quickly as possible.
> 
> I think manual fencing acknowledgement must be fast and simple, and I suppose 
> that stonith_admin --confirm is meant to do that.
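
To spell out the two paths being compared above as commands (the node name is just
an example, and the cibadmin cleanup step is whatever you already use):

# intended path: tell stonith-ng the node is already safely down
stonith_admin --confirm NODE2
stonith_admin --history NODE2     # check whether the confirmation was recorded
# CMAN-side workaround mentioned above, run on the fence master
# (exact fence_ack_manual syntax varies between cman versions)
fence_ack_manual NODE2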




Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.

2013-05-19 Thread renayama19661014
Hi Vladislav,

> For just this, the patch is unneeded. It only comes into play when you have the
> pengine files symlinked from stable storage to tmpfs. Without the patch,
> pengine would try to rewrite the file where the symlink points - directly on
> stable storage. With the patch, pengine will remove the symlink (and just the
> symlink) and will open a new file on tmpfs for writing. Thus, it will not
> block if the stable storage is inaccessible (in my case because of
> connectivity problems, in yours because of a backing storage outage).
> 
> If you decide to go with tmpfs *and* use the same synchronization method
> as I do, then you'd need to bake a similar patch for 1.0: just add
> unlink() before pengine writes its data (I suspect that code differs
> between 1.0 and 1.1.10; even in 1.1.6 it was different from the current master).
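
In shell terms, the effect of that unlink() is roughly the following; this is only
a sketch with illustrative paths, not what pengine literally executes:

# a pe-input file on tmpfs has been replaced by a symlink to stable storage
ln -s /mnt/cifs/pe-archive/pe-input-1.bz2 /var/lib/pengine/pe-input-1.bz2
# writing through the symlink would touch the (possibly hung) CIFS share;
# removing the symlink first makes the new file land on tmpfs instead
rm -f /var/lib/pengine/pe-input-1.bz2
echo "new pe-input" > /var/lib/pengine/pe-input-1.bz2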

Thank you for the detailed explanation.
At first I will confirm the behaviour using only tmpfs.

Many Thanks!
Hideo Yamauchi.

--- On Fri, 2013/5/17, Vladislav Bogdanov  wrote:

> Hi Hideo-san,
> 
> 17.05.2013 10:29, renayama19661...@ybb.ne.jp wrote:
> > Hi Vladislav,
> > 
> > Thank you for the advice.
> > 
> > I will try the patch which you showed.
> > 
> > We use Pacemaker 1.0, but I will apply the patch there because there is 
> > similar code.
> > 
> > If I have any questions about the setup, I will ask you by email.
> >  * At first I will only use tmpfs, and I intend to test it.
> 
> For just this, the patch is unneeded. It only comes into play when you have the
> pengine files symlinked from stable storage to tmpfs. Without the patch,
> pengine would try to rewrite the file where the symlink points - directly on
> stable storage. With the patch, pengine will remove the symlink (and just the
> symlink) and will open a new file on tmpfs for writing. Thus, it will not
> block if the stable storage is inaccessible (in my case because of
> connectivity problems, in yours because of a backing storage outage).
> 
> If you decide to go with tmpfs *and* use the same synchronization method
> as I do, then you'd need to bake a similar patch for 1.0: just add
> unlink() before pengine writes its data (I suspect that code differs
> between 1.0 and 1.1.10; even in 1.1.6 it was different from the current master).
> 
> > 
> >> P.S. Andrew, is this patch ok to apply?
> > 
> > To Andrew...
> >   Does the patch related to the write_xml processing in your 
> > repository have to be applied before Vladislav's patch is confirmed?
> > 
> > Many Thanks!
> > Hideo Yamauchi.
> > 
> > 
> > 
> > 
> > --- On Fri, 2013/5/17, Vladislav Bogdanov  wrote:
> > 
> >> Hi Hideo-san,
> >>
> >> You may try the following patch (with trick below)
> >>
> >> From 2c4418d11c491658e33c149f63e6a2f2316ef310 Mon Sep 17 00:00:00 2001
> >> From: Vladislav Bogdanov 
> >> Date: Fri, 17 May 2013 05:58:34 +
> >> Subject: [PATCH] Feature: PE: Unlink pengine output files before writing.
> >>  This should help guys who store them to tmpfs and then copy to a stable 
> >>storage
> >>  on (inotify) events with symlink creation in the original place to 
> >>survive when
> >>  stable storage is not accessible.
> >>
> >> ---
> >>  pengine/pengine.c |    1 +
> >>  1 files changed, 1 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/pengine/pengine.c b/pengine/pengine.c
> >> index c7e1c68..99a81c6 100644
> >> --- a/pengine/pengine.c
> >> +++ b/pengine/pengine.c
> >> @@ -184,6 +184,7 @@ process_pe_message(xmlNode * msg, xmlNode * xml_data, 
> >> crm_client_t * sender)
> >>          }
> >>  
> >>          if (is_repoke == FALSE && series_wrap != 0) {
> >> +            unlink(filename);
> >>              write_xml_file(xml_data, filename, HAVE_BZLIB_H);
> >>              write_last_sequence(PE_STATE_DIR, series[series_id].name, seq 
> >>+ 1, series_wrap);
> >>          } else {
> >> -- 
> >> 1.7.1
> >>
> >> You just need to ensure that /var/lib/pacemaker is on tmpfs. Then you can 
> >> watch the directories there with inotify or similar and take action to move 
> >> (copy) files to stable storage (RAM is not of infinite size).
> >> In my case that is CIFS, and I use lsyncd to synchronize those directories. 
> >> If you are interested, I can provide you with the relevant lsyncd 
> >> configuration. Frankly speaking, there is no big need to create symlinks 
> >> in tmpfs to stable storage, as pacemaker does not use existing pengine 
> >> files (except the sequence files). The sequence files and cib.xml are the 
> >> only exceptions that you may want to exist in both places (and you may want 
> >> to copy them from stable storage to tmpfs before pacemaker starts), and you 
> >> can just move everything else away from tmpfs once it is written. In this 
> >> case you do not need this patch.
> >>
> >> Best,
> >> Vladislav
> >>
> >> P.S. Andrew, is this patch ok to apply?
> >>
> >> 17.05.2013 03:27, renayama19661...@ybb.ne.jp wrote:
> >>> Hi Andrew,
> >>> Hi Vladislav,
> >>>
> >>> I will test whether this fix is effective for this problem.
> >>>   * 
> >>>https://github.com/beekhof/pacemaker/commit/eb6264bf2db395779e
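
For readers without lsyncd, roughly the same copy-off-tmpfs idea can be sketched
with inotifywait and rsync; the paths are illustrative and this is untested against
any particular setup:

#!/bin/sh
SRC=/var/lib/pengine          # tmpfs-backed pengine directory
DST=/mnt/cifs/pe-archive      # stable storage, e.g. a CIFS share
while inotifywait -e close_write -e moved_to "$SRC"; do
    # copy whatever pengine has finished writing to stable storage
    rsync -a "$SRC"/ "$DST"/
done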

Re: [Pacemaker] IPaddr2 cloned address doesn't survive node standby

2013-05-19 Thread Andreas Ntaflos
On 2013-05-17 22:07, Jake Smith wrote:
>> primitive p_ip_service_ns ocf:heartbeat:IPaddr2 \
>>     params ip="192.168.114.17" cidr_netmask="24" nic="eth0" \
>>       clusterip_hash="sourceip-sourceport"
> 
> netmask should be 32 if that's supposed to be a single IP load balanced.

I've been wondering about that, but I think 24 is correct. The address
is recognized as "secondary" by Linux, as can be seen in this "ip addr"
output:

2: eth0:  mtu 1500 qdisc pfifo_fast state UP qlen 1000
inet 192.168.114.16/24 brd 192.168.114.255 scope global eth0
inet 192.168.114.17/24 brd 192.168.114.255 scope global secondary eth0

Setting it this way has been working fine for a long time now. *shrug*
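
For anyone else puzzling over this in the archives: a cloned IPaddr2 in this mode
balances the address with an iptables CLUSTERIP rule, which you can inspect
alongside the secondary address; a small sketch (chain placement may differ):

ip addr show eth0 | grep 192.168.114.17      # the cloned secondary address
iptables -n -L INPUT | grep -i clusterip     # the CLUSTERIP rule IPaddr2 installs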

> Don't you need colocation also between the clones so that bind can only start 
> on a node that has already started an ip instance?

I thought since clones are started on all nodes anyway that a simple
"order" directive would suffice. But I've added a colocation constraint
as well, to be sure. Thanks for the hint.

> For the number of restarts it's likely because of the interleaving settings.  
> True for both would likely help that but wouldn't work in your case - more 
> here: 
> http://www.hastexo.com/resources/hints-and-kinks/interleaving-pacemaker-clones

Yes, there doesn't seem to be a way to interleave these cloned resources
in a way that avoids restarting Bind on such cluster state changes.

> When you put dns01 in standby does dns02 have both instances of the IP there?
> If not it should be (you are just load balancing a single IP correct?).  You 
> need clone-node-max=2 for the ip clone.

clone-node-max was always set to "2", yes.

> If so one just doesn't move back to dns01 when you bring it out of standby?  
> I would look at resource stickiness=0 for the ip close resource only so the 
> cluster will redistribute when the node comes out of standby (I think that 
> would work).  Clones have a default stickiness of 1 if you don't have a 
> default set for the cluster.

Bingo, the resource stickiness was the problem! I've set it to 0 and now
the IP resource gets started again when the node comes back online.
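
For the archives, the same effect can also be set on just the IP clone instead of
cluster-wide via rsc_defaults; a sketch using the resource name from the
configuration below:

# set resource-stickiness=0 as a meta attribute on the IP clone only
crm_resource --resource cl_ip_service_ns --meta \
    --set-parameter resource-stickiness --parameter-value 0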

Thanks a lot, I would not have thought of that. As stated above,
shouldn't cloned resources be (re-)started on all nodes by definition?

> And/or you can write location constraints for the clone instances of ip to 
> prefer one node over the other causing them to fail back if the node returns 
> i.e. location ip0_prefers_dns01 cl_ip_service_ns:0 200: dns01 and location 
> ip1_prefers_dns02 cl_ip_service_ns:1 200: dns02

That doesn't seem necessary, now with resource-stickiness="0".

Thanks again!

Andreas

PS: Here's the complete configuration for the archives, in case someone
might be interested in the future:

node dns01
node dns02
primitive p_bind9 lsb:bind9 \
op monitor interval="10s" timeout="15s" \
op start interval="0" timeout="15s" \
op stop interval="0" timeout="15s" \
meta target-role="Started"
primitive p_ip_service_ns ocf:heartbeat:IPaddr2 \
params ip="192.168.114.17" cidr_netmask="24" nic="eth0" \
clusterip_hash="sourceip-sourceport" \
op monitor interval="10s" \
op start interval="0" timeout="20s" \
op stop interval="0" timeout="20s"
clone cl_bind9 p_bind9 \
meta globally-unique="false" clone-max="2" clone-node-max="1" \
interleave="false" target-role="Started"
clone cl_ip_service_ns p_ip_service_ns \
meta globally-unique="true" clone-max="2" clone-node-max="2" \
interleave="false" target-role="Started"
colocation co_ip_before_bind9 inf: cl_ip_service_ns cl_bind9
order o_ip_before_bind9 inf: cl_ip_service_ns cl_bind9
property $id="cib-bootstrap-options" \
dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
no-quorum-policy="ignore" \
stonith-enabled="no" \
last-lrm-refresh="1368814808"
rsc_defaults $id="rsc-options" \
resource-stickiness="0"






Re: [Pacemaker] Does "stonith_admin --confirm" work?

2013-05-19 Thread Raoul Bhatia [IPAX]

On 17.05.2013 10:22, Староверов Никита Александрович wrote:

> Nothing happened after stonith_admin -C.
> Fenced is still trying fence_pcmk, and I see lots of "Timer expired" messages 
> from stonith-ng, and failed fence_ipmilan operations.
> 
> Yes, I can do fence_ack_manual on the cman master node and then clean up the 
> node state with cibadmin, but that is a very slow way.
> If I lose many servers in the cluster, for example when one rack with two or 
> more servers loses power, I need a way to get services running again on the 
> remaining nodes as quickly as possible.
> 
> I think manual fencing acknowledgement must be fast and simple, and I suppose 
> that stonith_admin --confirm is meant to do that.


I would also like to know a solution to this problem.
My current situation: I am using IPMI as a stonith device.

However, if there is a problem with the (redundant) power supply
and the IPMI device is therefore not working, I have a hard time
troubleshooting my 2-node cluster.
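
One way people mitigate exactly this (IPMI dying together with the host's power)
is a second fencing level per node; a crm sketch, where the st-ipmi-* and st-pdu-*
stonith primitives are hypothetical and assumed to already exist:

# try IPMI first, fall back to the second device if it fails
crm configure fencing_topology \
    node1: st-ipmi-node1 st-pdu-node1 \
    node2: st-ipmi-node2 st-pdu-node2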

Cheers,
Raoul
