Re: [Openais] problem to delete resource

2015-02-23 Thread Andrew Beekhof
Looks like the resource is badly configured - to the point that the RA doesn't 
know how to stop it.
That's what this means:

> p_drbd_ora_stop_0 on node1 'not configured' (6): call=6, status=complete, 
> last-rc-change='Mon Feb  2 16:54:19 2015', queued=0ms, exec=26ms
> p_drbd_ora_stop_0 on node1 'not configured' (6): call=6, status=complete, 
> last-rc-change='Mon Feb  2 16:54:19 2015', queued=0ms, exec=26ms
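
For the record, ocf:linbit:drbd is meant to run under a master/slave wrapper with
notify enabled; a bare primitive is one common reason the agent bails out with
"not configured". A rough sketch of the usual shape (crm shell, resource names taken
from the config quoted below; the ops and the exact cleanup order are assumptions,
not a verified fix):

    # clear the failed stop so the CIB no longer thinks the resource is running,
    # then stopping and deleting should work
    crm resource cleanup p_drbd_ora
    crm resource stop p_drbd_ora
    crm configure delete p_drbd_ora

    # re-create it wrapped in a master/slave resource
    crm configure primitive p_drbd_ora ocf:linbit:drbd \
            params drbd_resource="clusterdb_res_ora" \
            op monitor interval="60s"
    crm configure ms ms_drbd_ora p_drbd_ora \
            meta master-max="1" master-node-max="1" \
                 clone-max="2" clone-node-max="1" notify="true"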

> On 3 Feb 2015, at 2:13 am, Vladimir Berezovski (vberezov) 
>  wrote:
> 
> Hi ,
>  
> I added a new resource like 
>  
> crm(live)configure# primitive p_drbd_ora ocf:linbit:drbd params 
> drbd_resource="clusterdb_res_ora" op monitor interval="60s"
>  
>  
> but its status is FAILED (unmanaged). I tried to stop and delete it, but to 
> no result – it's still running. How can I manage this issue?
>  
>  
> [root@node1 ~]#  crm configure show
> node node1 \
> attributes standby=off
> node node2
> primitive p_drbd_ora ocf:linbit:drbd \
> params drbd_resource=clusterdb_res_ora \
> op monitor interval=60s \
> meta target-role=Stopped is-managed=true
> property cib-bootstrap-options: \
> dc-version=1.1.11-97629de \
> cluster-infrastructure="classic openais (with plugin)" \
> expected-quorum-votes=2 \
> stonith-enabled=false \
> no-quorum-policy=ignore \
> last-lrm-refresh=1422887129
> rsc_defaults rsc-options: \
> resource-stickiness=100
>  
>  
> [root@node1 ~]# crm_mon -1
> Last updated: Mon Feb  2 17:12:40 2015
> Last change: Mon Feb  2 16:44:52 2015
> Stack: classic openais (with plugin)
> Current DC: node1 - partition WITHOUT quorum
> Version: 1.1.11-97629de
> 2 Nodes configured, 2 expected votes
> 1 Resources configured
>  
>  
> Online: [ node1 ]
> OFFLINE: [ node2 ]
>  
> p_drbd_ora (ocf::linbit:drbd): FAILED node1 (unmanaged)
>  
> Failed actions:
> p_drbd_ora_stop_0 on node1 'not configured' (6): call=6, status=complete, 
> last-rc-change='Mon Feb  2 16:54:19 2015', queued=0ms, exec=26ms
> p_drbd_ora_stop_0 on node1 'not configured' (6): call=6, status=complete, 
> last-rc-change='Mon Feb  2 16:54:19 2015', queued=0ms, exec=26ms
>  
>  
>  
> #crm resource stop  p_drbd_ora
>  
> [root@node1 ~]# crm configure delete p_drbd_ora
> ERROR: resource p_drbd_ora is running, can't delete it
>  
> Regards ,
>  
>  
> Vladimir Berezovski

___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/openais

Re: [Openais] pgsql troubles.

2014-12-03 Thread Andrew Beekhof
You're probably better off taking this to the pacemaker list.
I don't think the guys that wrote the postgres agent subscribe here.
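
One thing that might help in the meantime: errors like "NODENAME not defined" when
running promote/demote by hand often just mean the OCF environment isn't set up.
ocf-tester (from resource-agents) runs the agent's actions with that environment in
place. A sketch, assuming the stock install path and reusing parameters from the
config quoted below:

    ocf-tester -n pgsql \
        -o pgctl=/usr/bin/pg_ctl \
        -o pgdata=/database/9.3 \
        -o rep_mode=sync \
        -o node_list="tstdb03 tstdb04" \
        /usr/lib/ocf/resource.d/heartbeat/pgsql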

> On 2 Dec 2014, at 11:50 pm, steve  wrote:
> 
> Good Afternoon,
> 
> Sending again now that the holidays are over.
> 
> I am having loads of trouble with pacemaker/corosync/postgres. Defining the 
> symptoms is rather difficult. The primary one is that postgres starts as a 
> slave on both nodes. I have tested the pgsql RA start/stop/status/monitor and 
> they work from the command line after I set up the environment. I have not 
> been able to get promote/demote to work; there are issues with NODENAME not 
> being defined.
> 
> I am able to run postgres in master/slave mode outside of pacemaker.
> 
> I can provide additional logs but here is a start.
> 
> Distributor ID:   Ubuntu
> Description:  Ubuntu 12.04.3 LTS
> Release:  12.04
> Codename: precise
> 
> latest versions of pgsql RA (yesterday)
> pacemaker  1.1.6-2ubuntu3.1   HA cluster resource manager
> corosync   1.4.2-2Standards-based cluster framework 
> (daemon and module
> resource-agents  1:3.9.2-5ubuntu4.1   Cluster 
> Resource Agents
> I have upgraded the pgsql RA to the latest from git.
> 
> 
> 
> Last updated: Wed Nov 26 13:55:59 2014
> Last change: Wed Nov 26 13:55:58 2014 via crm_attribute on tstdb04
> Stack: openais
> Current DC: tstdb04 - partition with quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> 
> 
> Online: [ tstdb03 tstdb04 ]
> 
> Full list of resources:
> 
> Resource Group: master-group
> vip-master (ocf::heartbeat:IPaddr2):   Stopped
> vip-rep(ocf::heartbeat:IPaddr2):   Stopped
> Master/Slave Set: msPostgresql [pgsql]
> Slaves: [ tstdb04 ]
> Stopped: [ pgsql:0 ]
> 
> Node Attributes:
> * Node tstdb03:
>+ master-pgsql:0: -INFINITY
>+ pgsql-data-status : DISCONNECT
> * Node tstdb04:
>+ master-pgsql:1: -INFINITY
>+ pgsql-data-status : DISCONNECT
> 
> Migration summary:
> * Node tstdb04:
> * Node tstdb03:
>   pgsql:0: migration-threshold=1 fail-count=100
> 
> Failed actions:
>pgsql:0_start_0 (node=tstdb03, call=5, rc=1, status=complete): unknown 
> error
> 
> 
> config:
> property \
> no-quorum-policy="ignore" \
> stonith-enabled="false" \
> crmd-transition-delay="0"
> 
> rsc_defaults \
> resource-stickiness="INFINITY" \
> migration-threshold="1"
> 
> group master-group \
>   vip-master \
>   vip-rep
> 
> primitive vip-master ocf:heartbeat:IPaddr2 \
> params \
> ip="10.132.101.95" \
> nic="eth0" \
> cidr_netmask="24" \
> op start   timeout="60s" interval="0"  on-fail="restart" \
> op monitor timeout="60s" interval="10s" on-fail="restart" \
> op stoptimeout="60s" interval="0"  on-fail="block"
> 
> primitive vip-rep ocf:heartbeat:IPaddr2 \
> params \
> ip="10.132.101.96" \
> nic="eth0" \
> cidr_netmask="24" \
> meta \
> migration-threshold="0" \
> op start   timeout="60s" interval="0"  on-fail="stop" \
> op monitor timeout="60s" interval="10s" on-fail="restart" \
> op stoptimeout="60s" interval="0"  on-fail="ignore"
> 
> master msPostgresql pgsql \
> meta \
> master-max="1" \
> master-node-max="1" \
> clone-max="2" \
> clone-node-max="1" \
> notify="true"
> 
> primitive pgsql ocf:heartbeat:pgsql \
> params \
> pgctl="/usr/bin/pg_ctl" \
> psql="/usr/bin/psql" \
> pgdata="/database/9.3" \
>config="/etc/postgresql/9.3/main/postgresql.conf" \
>socketdir=/var/run/postgresql \
> rep_mode="sync" \
> node_list="tstdb03 tstdb04" \
> restore_command="cp /database/archive/%f %p" \
> primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 
> keepalives_count=5" \
> master_ip="10.132.101.95" \
> restart_on_promote="true" \
>logfile=/var/log/postgresql/postgresql-9.3-main.log \
> op start   timeout="60s" interval="0"  on-fail="restart" \
> op monitor timeout="60s" interval="4s" on-fail="restart" \
> op monitor timeout="60s" interval="3s"  on-fail="restart" role="Master" \
> op promote timeout="60s" interval="0"  on-fail="restart" \
> op demote  timeout="60s" interval="0"  on-fail="stop" \
> op stoptimeout="60s" interval="0"  on-fail="block" \
> op notify  timeout="60s" interval="0"
> 
> #colocation rsc_colocation-1 inf: vip-master msPostgresql:Master
> #order rsc_order-1 0: msPostgresql:promote  vip-master:start  
> symmetrical=false
> #order rsc_order-2 0: msPostgresql:demote   vip-rep:stop   symmetrical=false
> 
> colocation rsc_colocation-1 inf: master-group msPostgresql:Master
> order rsc_order-1 0: msPostgresql:promote  m

Re: [Openais] unmanaged resource failed - how to get back?

2014-06-30 Thread Andrew Beekhof

On 30 Jun 2014, at 9:25 pm, Senftleben, Stefan (itsc) 
 wrote:

> Hello,
>  
> I set the cluster into maintenance mode with: crm configure property 
> maintenance-mode=true .
> Afterwards I stopped one resource manually, but after turning off 
> maintenance mode, the resource is in status "unmanaged" FAILED.
> But the resource is already running.
> What should I do now to get the resource managed by pacemaker again?

1. configure fencing
2. run: crm resource cleanup
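
For example (a sketch, using the resource name from the status output below):

    crm resource cleanup omd_itsc
    # or with the lower-level tool:
    crm_resource --cleanup --resource omd_itsc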

>  
> Greetings
> Stefan
>  
>  
> 
> Last updated: Mon Jun 30 12:42:45 2014
> Last change: Mon Jun 30 12:41:33 2014
> Stack: openais
> Current DC: lxds05 - partition with quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 2 Nodes configured, 2 expected votes
> 10 Resources configured.
> 
>  
> Online: [ lxds05 lxds07 ]
>  
> Full list of resources:
>  
> Resource Group: group_omd
>  pri_fs_omd (ocf::heartbeat:Filesystem):Started lxds05
>  pri_apache2(ocf::heartbeat:apache):Started lxds05
>  pri_nagiosIP   (ocf::heartbeat:IPaddr2):   Started lxds05
> Master/Slave Set: ms_drbd_omd [pri_drbd_omd]
>  Masters: [ lxds05 ]
>  Slaves: [ lxds07 ]
> Clone Set: clone_ping [pri_ping]
>  Started: [ lxds07 lxds05 ]
> res_MailTo_omd_group(ocf::heartbeat:MailTo):Stopped
> omd_itsc(ocf::omd:omdnagios):   Started lxds05 (unmanaged) FAILED
> res_MailTo_omd_itsc (ocf::heartbeat:MailTo):Stopped
>  
> Node Attributes:
> * Node lxds05:
> + master-pri_drbd_omd:0 : 1
> + pingd : 3000
> * Node lxds07:
> + master-pri_drbd_omd:1 : 1
> + pingd : 3000
>  
> Migration summary:
> * Node lxds07:
> * Node lxds05:
>omd_itsc: migration-threshold=100 fail-count=2 last-failure='Mon Jun 
> 30 12:39:03 2014'
>  
> Failed actions:
> omd_itsc_stop_0 (node=lxds05, call=49, rc=1, status=complete): unknown 
> error



___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/openais

Re: [Openais] Error: -> Need help! cib: [1539]: WARN: cib_peer_callback: Discarding cib_modify message (3) from lxds05: not in our membership

2014-05-19 Thread Andrew Beekhof

On 19 May 2014, at 6:53 pm, Senftleben, Stefan (itsc) 
 wrote:

> Hello,
> 
> thanks for the answers.
> I found this link with a howto for upgrading corosync and pacemaker on Ubuntu 
> 10.04: 
> http://martinloschwitz.wordpress.com/2011/10/24/updated-linux-cluster-stack-packages-for-ubuntu-10-04/
> Can somebody confirm that upgrade procedure?

Looks reasonable


> 
> Regards
> Stefan
> 
> -Ursprüngliche Nachricht-
> Von: Jan Friesse [mailto:jfrie...@redhat.com] 
> Gesendet: Montag, 19. Mai 2014 10:44
> An: Andrew Beekhof; Senftleben, Stefan (itsc)
> Cc: openais@lists.linux-foundation.org
> Betreff: Re: [Openais] Error: -> Need help! cib: [1539]: WARN: 
> cib_peer_callback: Discarding cib_modify message (3) from lxds05: not in our 
> membership
> 
> Stefan,
>> 
>> On 16 May 2014, at 11:11 pm, Senftleben, Stefan (itsc) 
>>  wrote:
>> 
>>> Hello,
>>> 
>>> I hope that someone can help me. I have a two node pacemaker cluster, 
>>> with to corosync rings. Ubuntu 10.04, 64 bit. Pacemaker 
>>> 1.0.8+hg15494-2ubuntu2, corosync 1.2.0-0ubuntu1.
>> 
>> It _could_ be a pacemaker issue, but 1.0.8 is over 4 years old and I 
>> have no idea what additional changes went into hg15494. So 
>> unfortunately your options are upgrade to something a little more 
>> recent that upstream can help you with, or see if you can get some 
>> support for that version from ubuntu.
>> 
> 
> Also please upgrade corosync. 1.2.0 is also 4 years old and current flatiron 
> has 441 patches on top of 1.2.0.
> 
> 
>> Have you tried simply stopping the cluster on both nodes before 
>> starting it again? That has been known to help on occasion.
>> 
>> To do so without stopping the resources managed by the cluster you 
>> could draw inspiration from:
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explaine
>> d/_disconnect_and_reattach.html
>> 
>> 
>>> One node (lxds05) disconnected, "crm status" marked it as offline. I 
>>> searched for the reason and found a defective BBU on the RAID controller. 
>>> After 14 days, the BBU was replaced. But the node
>>> lxds05 has not rejoined as a member. One corosync ring is marked 
>>> as faulty, the other as okay. "corosync-cfgtool -r" temporarily marks 
>>> the faulty ring as okay.
>>> 
>>> Any help is appreciated! I do not know, how to solve the problem.
>>> 
>>> Regards Stefan
>>> 
>>> 
>>> part of syslog: May 16 10:01:17 lxds07 crmd: [1543]: info:
>>> handle_shutdown_request: Creating shutdown request for lxds05
>>> (state=S_IDLE) May 16 10:01:17 lxds07 cib: [1539]: WARN:
>>> cib_peer_callback: Discarding cib_modify message (3) from lxds05:
>>> not in our membership
>>> 
>>> crm_mon -rf:  Last updated: Fri May 16 14:28:58 2014
>>> Stack: openais Current DC: lxds07 - partition with quorum
>>> Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd 2 Nodes 
>>> configured, 2 expected votes 6 Resources configured.
>>> 
>>> 
>>> Online: [ lxds07 ] OFFLINE: [ lxds05 ]
>>> 
>>> Full list of resources:
>>> 
>>> Resource Group: group_omd pri_fs_omd (ocf::heartbeat:Filesystem):
>>> Started lxds07 pri_apache2(ocf::heartbeat:apache):
>>> Started lxds07 pri_nagiosIP   (ocf::heartbeat:IPaddr2):
>>> Started lxds07 Master/Slave Set: ms_drbd_omd Masters: [ lxds07 ]
>>> Stopped: [ pri_drbd_omd:1 ] Clone Set: clone_ping Started: [
>>> lxds07 ] Stopped: [ pri_ping:1 ] res_MailTo_omd_group
>>> (ocf::heartbeat:MailTo):Started lxds07 omd_itsc
>>> (ocf::omd:omdnagios):   Started lxds07 res_MailTo_omd_itsc
>>> (ocf::heartbeat:MailTo):Started lxds07
>>> 
>>> Migration summary: * Node lxds07:  pingd=3000 omd_itsc:
>>> migration-threshold=100 fail-count=16 last-failure='Fri May
>>> 16 10:46:05 2014'
>>> 
>>> 
> 



___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/openais

Re: [Openais] Error: -> Need help! cib: [1539]: WARN: cib_peer_callback: Discarding cib_modify message (3) from lxds05: not in our membership

2014-05-18 Thread Andrew Beekhof

On 16 May 2014, at 11:11 pm, Senftleben, Stefan (itsc) 
 wrote:

> Hello,
>  
> I hope that someone can help me…
> I have a two-node pacemaker cluster with two corosync rings.
> Ubuntu 10.04, 64 bit. Pacemaker 1.0.8+hg15494-2ubuntu2, corosync 
> 1.2.0-0ubuntu1.

It _could_ be a pacemaker issue, but 1.0.8 is over 4 years old and I have no 
idea what additional changes went into hg15494.
So unfortunately your options are to upgrade to something a little more recent 
that upstream can help you with, or to see if you can get some support for that 
version from Ubuntu.

Have you tried simply stopping the cluster on both nodes before starting it 
again?
That has been known to help on occasion.

To do so without stopping the resources managed by the cluster you could draw 
inspiration from:
   
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_disconnect_and_reattach.html
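
Roughly (a sketch of the idea on that page, not a verbatim recipe - check the
document for the exact property names for your version):

    # tell pacemaker to leave resources alone
    crm configure property maintenance-mode=true
    # restart the stack on both nodes; the managed services keep running
    /etc/init.d/corosync stop      # on each node
    /etc/init.d/corosync start     # on each node, once both are stopped
    # when both nodes have rejoined, hand control back
    crm configure property maintenance-mode=false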

> One node (lxds05) disconnected; "crm status" marked it as offline. I searched 
> for the reason and found a defective BBU on the RAID controller. After 14 days, 
> the BBU was replaced.
> But the node lxds05 has not rejoined as a member. One corosync ring is 
> marked as faulty, the other as okay. "corosync-cfgtool -r" temporarily marks 
> the faulty ring as okay.
>  
> Any help is appreciated! I do not know how to solve the problem.
>  
> Regards
> Stefan
>  
>  
> part of syslog:
> May 16 10:01:17 lxds07 crmd: [1543]: info: handle_shutdown_request: Creating 
> shutdown request for lxds05 (state=S_IDLE)
> May 16 10:01:17 lxds07 cib: [1539]: WARN: cib_peer_callback: Discarding 
> cib_modify message (3) from lxds05: not in our membership
>  
> crm_mon –rf:
> 
> Last updated: Fri May 16 14:28:58 2014
> Stack: openais
> Current DC: lxds07 - partition with quorum
> Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
> 2 Nodes configured, 2 expected votes
> 6 Resources configured.
> 
>  
> Online: [ lxds07 ]
> OFFLINE: [ lxds05 ]
>  
> Full list of resources:
>  
> Resource Group: group_omd
>  pri_fs_omd (ocf::heartbeat:Filesystem):Started lxds07
>  pri_apache2(ocf::heartbeat:apache):Started lxds07
>  pri_nagiosIP   (ocf::heartbeat:IPaddr2):   Started lxds07
> Master/Slave Set: ms_drbd_omd
>  Masters: [ lxds07 ]
>  Stopped: [ pri_drbd_omd:1 ]
> Clone Set: clone_ping
>  Started: [ lxds07 ]
>  Stopped: [ pri_ping:1 ]
> res_MailTo_omd_group(ocf::heartbeat:MailTo):Started lxds07
> omd_itsc(ocf::omd:omdnagios):   Started lxds07
> res_MailTo_omd_itsc (ocf::heartbeat:MailTo):Started lxds07
>  
> Migration summary:
> * Node lxds07:  pingd=3000
>omd_itsc: migration-threshold=100 fail-count=16 last-failure='Fri May 
> 16 10:46:05 2014'
>  
>  



___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/openais

Re: [Openais] Failure in failover, trouble determining cause and how to correct

2014-04-28 Thread Andrew Beekhof

On 29 Apr 2014, at 4:49 am, Joey D.  wrote:

> dc-version="1.1.8-7.el6-394e906" \
> cluster-infrastructure="classic openais (with plugin)" \

Please don't use the custom plugin on RHEL6 (and clones); it's likely to go away 
RealSoonNow(tm).
See: http://clusterlabs.org/quickstart-redhat.html

Also, there is an update for 1.1.10 available.
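
In practice the quickstart boils down to running pacemaker under cman instead of
the plugin. A sketch from memory (node names are placeholders and the ccs options
may differ - follow the page for the real steps):

    yum install pacemaker cman ccs
    ccs -f /etc/cluster/cluster.conf --createcluster mycluster
    ccs -f /etc/cluster/cluster.conf --addnode node1.example.com
    ccs -f /etc/cluster/cluster.conf --addnode node2.example.com
    chkconfig corosync off
    chkconfig cman on
    chkconfig pacemaker on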


___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/openais

Re: [Openais] very slow pacemaker/corosync shutdown

2013-09-19 Thread Andrew Beekhof

On 19/09/2013, at 8:25 AM, David Lang  wrote:

> I have been using heartbeat for many years, but am now setting up some new 
> clusters with pacemaker/corosync. I'm not sure which component is having 
> problems so I'm sending to both lists.
> 
> These are two machine clusters, configured per the RHEL quickstart on 
> clusterlabs.org
> 
> I'm frequently running into a problem that shutting down pacemaker/corosync 
> takes a very long time (several minutes)
> 
> this happens if we do 'service pacemaker stop' or just reboot the system.
> 
> If we do service pacemaker stop, it seems to pause at different places at 
> different times, but frequently seems to pause at 'unloading cluster'
> 
> What's the best way to see what it's getting stuck doing?

Log files.

> Is there a good way to tell if this is a pacemaker or corosync problem (so I 
> can drop one of the lists from the thread)?

Not without further information
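
Concretely, something like this while the shutdown is hanging (a sketch; adjust
the paths to wherever corosync logs on your systems):

    tail -f /var/log/cluster/corosync.log
    grep -e crmd -e pengine -e stonith /var/log/messages | tail -n 100
    crm_mon -1    # shows which resources are still being stopped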


___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/openais

Re: [Openais] Heartbeat to Openais conversion. cib.xml verification errors

2013-01-03 Thread Andrew Beekhof
On Thu, Jan 3, 2013 at 3:43 AM, Nick Hoare  wrote:
> Hi
>
> I am updating SLES 10 SP3 to SLES 11 SP2 and am trying to upgrade a
> heartbeat configuration to openais.
>
> I am following the conversion process as documented and have run the test
> conversion $/usr/lib/heartbeat/hb2openais.sh -T /tmp/hb2openais-testdir
>
> My problem is that I get to the crm_verify as documented:-
>
> Read and verify the resulting corosync.conf/openais.conf and
> cib-out.xml:
>
> $ cd /tmp/hb2openais-testdir
> $ less openais.conf
> $ crm_verify -V -x cib-out.xml
>
> But then I get an error.  I can find nothing in any documentation that gives
> me any indication of what I should do if crm_verify produces an error.
>
> cib-out.xml:2: element configuration: Relax-NG validity error : Element cib
> failed to validate content
> crm_verify[3679]: 2012/08/29_09:54:47 ERROR: main: CIB did not pass
> DTD/schema validation
> Errors found during check: config not valid
>
> I have found that the error becomes more helpful if I add a "validate-with"
> entry to the cib tag. BUT I can't just change the cib.xml by hand, can I?
> That is a no-no.

In this case, not really - since it's not in a location that the cluster uses.
I don't know how that conversion script works; you'll probably need to
wait to hear from someone at SUSE.
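
In the meantime, editing that test copy just to get a more useful error out of
crm_verify is harmless. A sketch (the schema name is a guess - use whichever
matches your pacemaker version):

    sed -i 's/<cib /<cib validate-with="pacemaker-1.0" /' cib-out.xml
    crm_verify -V -x cib-out.xml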

>
> Anyway if I make any changes on the one node it then won't be the same as on
> the other node (does that matter, if the other node will be overwritten? I
> only have two nodes)
>
> So where do I go from here?
>
> I found a similar question to this posted previously
> (http://lists.linux-ha.org/pipermail/linux-ha/2012-August/045608.html)
> but this question went unanswered.
>
> Thanks in advance
>
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/openais


Re: [Openais] Corosync 2.0 Feature Request: Replace objdb/confdb with something easier to use

2011-09-07 Thread Andrew Beekhof
On Thu, Aug 25, 2011 at 12:56 PM, Angus Salkeld  wrote:
> On Mon, Aug 08, 2011 at 09:41:10AM +0200, Jan Friesse wrote:
>> Current objdb/confdb is really hard to use because of all the iterating,
>> ... It would be nice to replace it with a hash table so that for a simple get
>> item or set item, no iteration is needed. But iteration functionality
>> should still somehow be there to allow the user to select, for example, all
>> totem.* items. API proposal to be sent later.
>
> Honza, here are some ideas...
>
> The Good
> 
> it stores
> - current config
> - stats + statistics
>
> What is nice is that you can:
> - dump this out at runtime and inspect it's state
> - you can also modify the objects using objctl
> - you get notification when values change
>
> The Bad
> ===
> 1] The API is not easy to use
>    iteration and searching are clumsy
>    creating/finding a nested object is tedious

Shrug.  It's not so different from recursing through XML without XPath.

> 2] Typing is not done well
>   so config is added as strings but needed later as say ints
>   and there are no helper functions for this
> 3] Updating the statistics is starting to use more cpu than it should
>   To update an entry we have to retrieve it (memcpy's + refcounting)
>
> Possible Solutions
> ==
>
> 1] API
> We really just want to get/set values do we really need a tree?
>
> Use a map instead of tree data type
> [+] This will make finding and creating objects easier (less code)
> [-] adding notifications to each node is a bit expensive in memory
>
> /* API (ignoring typing & storage for now) */
> map_get(key, ...)
> map_put(key, ...)
> map_del(key)
> map_track(key, notifications)
> map_untrack(key, notifications)
>
> /* of course if we use a hashtable there won't be any sensible order
>  * but we could use a skiplist or binary tree.
>  *
>  * so pass in the prev key (don't store current node) in case of
>  * nodes getting destroyed whilst we iterate.
>  */
> char* key;
> char* val;
> map_iter_t* i = map_iter_create(m, "statistics/totem/*")
> while (map_iter_next(i, &key, &val) == 0) {
>         printf("%s = %s\n", key, val);
> }
> map_iter_free(i);
>
> 2]
> The problem with the typing is we insert different types
> both as keys and values
>
> With C++ collections you would:
> std::map m;
> m["bla"] = obj;
>
> So it's typed and you use these collections for a single purpose (no mixed 
> types)
> This makes it really easy to use.
>
> If we want typing and allow different key and value types it will be clumsy 
> again:
> map_insert(m, "a_key", QB_TYPE_STR, "a_value", QB_TYPE_STR);
>
> if we assume the key is a string it's a bit better (we also don't need a 
> length for the key):
> map_insert(m, "a_key", "a_value", QB_TYPE_STR);
>
> With some help function it gets better again
> int32_t v = map_get_int32(m, "an_int");
> and we could have str to int helpers:
> int32_t v = map_get_str_as_int32(m, "a_config_option");
>
> 3] objdb currently stores the key and value data (it alloc's the memory and 
> memcopies it)
> since we have a deleted_callback can't the owner alloc/free the mem?
> This will make the statistics more efficient if we don't have to repeatedly 
> lookup the value
> to increment it. we could just have a ref counting func like:
> map_ref(key)
> map_unref(key)
>
> So it doesn't get deleted under our feet.
>
>
> Regards
> Angus
>
>
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Installing corosync from source

2011-09-07 Thread Andrew Beekhof
On Wed, Sep 7, 2011 at 4:01 PM, Dan Frincu  wrote:
> Hi,
>
> On Wed, Sep 7, 2011 at 4:05 AM, Nick Khamis  wrote:
>> Hello Everyone,
>>
>> We are moving everything over from heartbeat, after the last update
>> brought the cluster to its knees... What we are interested in is
>> using corosync and pacemaker to LVS mysql and asterisk. We have not
>> looked into asterisk yet, and we don't know if it's even possible
>> (i.e. if there is already an OCF RA created).
>
> If it has an LSB-compliant init script, then it shouldn't be a problem
> to run it.

The last time I looked it wasn't.  But I mentioned it to Russell and
he told me he was going to fix it.

> However if you need an OCF RA, you can start by taking
> a look at this guide:
> http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html
>
>> Regardless, our attempt to install corosync from source using the
>> directions found in "http://www.clusterlabs.org/wiki/Install" seemed
>> to go OK; however, nothing was created. We had to manually copy:
>> cp /usr/etc/corosync/corosync.conf.example /etc/corosync/corosync.conf
>> cp /usr/etc/init.d/corosync /etc/init.d/corosync
>> We have a long way to go; your help is greatly appreciated.
>>
>> simple startup conf
>>
>> totem {
>>        version: 2
>>        secauth: off
>>        threads: 0
>>        interface {
>>                ringnumber: 0
>>                bindnetaddr: 192.168.1.1
>>                mcastaddr: 226.94.1.1
>>                mcastport: 5405
>>                ttl: 1
>>        }
>> }
>>
>> logging {
>>        fileline: off
>>        to_stderr: no
>>        to_logfile: yes
>>        to_syslog: yes
>>        logfile: /var/log/cluster/corosync.log
>>        debug: off
>>        timestamp: on
>>        logger_subsys {
>>                subsys: AMF
>>                debug: off
>>        }
>> }
>>
>> amf {
>>        mode: disabled
>> }
>
> Since you've already compiled it, best thing would be to continue with
> the configuration. This will help:
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Clusters_from_Scratch/index.html#s-configure-corosync
>
>>
>> Are there scripts that I am supposed to run that will create
>> everything?
>
> Not really, even with binary versions you don't get the corosync.conf
> file, but the corosync.conf.example that you edit according to your
> environment.
>
>> Do I need to install OpenAIS as well?
>
> Only if you plan on using cluster aware filesystems.
>
>> We downloaded the
>> latest version of resource agents, cluster glue, corosync, pacemaker.
>> I know we can install everything from gentoo source tree, but we are
>> trying to avoid that...
>
> Why? (just curious)
>
> Regards,
> Dan
>
>>
>> Your help is greatly appreciated,
>>
>> Nick.
>
>
>
> --
> Dan Frincu
> CCNA, RHCE
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] corosync didn't do what I expected

2011-07-31 Thread Andrew Beekhof
Read up on the no-quorum-policy setting.
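
With two nodes, quorum is lost as soon as one node goes away, and the default
policy is to stop everything on the survivor. The usual setting for a two-node
cluster (a sketch; only really sensible together with working fencing) is:

    crm configure property no-quorum-policy=ignore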

On Sat, Jul 30, 2011 at 5:36 AM, Keith Stevens  wrote:
> I have the following configuration on two servers netbox1 and netbox2:
>
> crm(live)configure# show
> node netbox1 \
>         attributes standby="off"
> node netbox2
> primitive failover-ip ocf:heartbeat:IPaddr \
>         params ip="216.105.20.43" \
>         op monitor interval="10s"
> location cli-prefer-failover-ip failover-ip \
>         rule $id="cli-prefer-rule-failover-ip" inf: #uname eq netbox1
> property $id="cib-bootstrap-options" \
>         dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false"
>
> If I put netbox1 on standby the ip address migrates to netbox2 and back
> to netbox1 when
> I bring it back online.
> The ip address was on netbox1 when I powered down netbox2 to move it
> into a cabinet.
> To my surprise, netbox1 lost the ip address and didn't get it back until
> I booted netbox2.
> Apparently I have a huge conceptual hole in my understanding; I expected
> netbox1 to keep the ip address.
> Why didn't it?
>
> Thanks,
> -Keith
>
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Corosync Compatability

2011-07-28 Thread Andrew Beekhof
On Wed, Jul 27, 2011 at 1:28 PM,   wrote:
>
>    Thank you Steve,
>    We are currently using corosync-1.2.1 and pacemaker 1.0.10.
>    Can we use the same version of pacemaker with corosync-1.4?

I'd say it's likely.  Try on one node - you'll find out pretty quickly
if it's not going to work.

>
>
> On Tue, July 26, 2011 7:12 pm, Steven Dake wrote:
>> On 07/26/2011 01:52 AM, manish.gu...@ionidea.com wrote:
>>
>>> Hi,
>>>
>>>
>>> I am facing a problem with the redundant communication channel.
>>> I am using Corosync 1.2; in this, auto failback of the redundant
>>> channel is not supported. But 1.4 provides support.
>>>
>>> Corosync-1.4 id compatiable with which version of pacemaker
>>>
>
>>>
>>>
>>
>> corosync 1.4 should work with all versions of pacemaker.  What version of
>> pm are you using?
>>
>> Regards
>> -steve
>>
>>>
>>
>>
>
>
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Corosync quesion - ps auxf output

2011-07-25 Thread Andrew Beekhof
2011/7/26 José Pablo Méndez Soto :
> Hello,
>
> According to http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo, if one
> installs pacemaker package alone on a debian based distro, it will install
> on top of Corosync, but if one installs as:
>
> aptitude install pacemaker heartbeat
>
> then Pacemaker would be installed on top of heartbeat. I didn´t install
> heartbeat as it seems the community is moving away from it and toward
> Corosync. Can someone please explain why then my ps auxf shows  heartbeat
> all over the places?

Those are pacemaker processes, not heartbeat ones.
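
You can confirm it with the package manager, e.g. (a sketch):

    dpkg -S /usr/lib/heartbeat/crmd /usr/lib/heartbeat/cib
    # both paths should be reported as belonging to the pacemaker package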

>
> root 17767  0.2  0.3 212236  5256 ?    Ssl  00:35   0:00
> /usr/sbin/corosync
> root 17775  0.0  0.7  77684 12232 ?    SLs  00:35   0:00        \_
> /usr/lib/heartbeat/stonithd
> 103  17776  0.1  0.3  80544  5008 ?    S    00:35   0:00  \_
> /usr/lib/heartbeat/cib
> root 1  0.0  0.1  92616  2776 ?    S    00:35   0:00
> \_ /usr/lib/heartbeat/lrmd
> 103  17778  0.0  0.2  81568  3340 ?    S    00:35   0:00  \_
> /usr/lib/heartbeat/attrd
> 103  17779  0.0  0.1  81916  2840 ?    S    00:35   0:00  \_
> /usr/lib/heartbeat/pengine
> 103  17780  0.0  0.2  87796  3644 ?    S    00:35   0:00  \_
> /usr/lib/heartbeat/crmd
>
> root@shekel:~/corosync# apt-cache policy heartbeat
> heartbeat:
>   Installed: (none)
>   Candidate: 1:3.0.3-2
>
> Is this just the name of a folder where the "heartbeating" lives?
>
> Thanks,
>
>
>  José
>
>
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Compilation error in HEAD

2011-07-04 Thread Andrew Beekhof
On Mon, Jul 4, 2011 at 4:45 PM, Jan Friesse  wrote:
> Andrew,
> you are right in the fact that coroipcs uses functions defined in utils.c
> which are not linked to coroipcs itself. On the other hand, libcoroipcs is
> always used only with corosync and corosync executable contains this
> symbols, so there shouldn't be any problem.
>
> What compiler + OS + configure options are you using? Because at least on
> RHEL/FC14 with default configure there is no link time checking of symbols
> availability.

I'm on OSX.
The library parts of the build are completely borked there.

I found it simpler to just allow pacemaker to build without corosync.

>
> Regards,
>  Honza
>
> Andrew Beekhof napsal(a):
>>
>> On Mon, Jul 4, 2011 at 12:16 PM, Andrew Beekhof 
>> wrote:
>>>
>>> [12:06 pm] beekhof@iMac ~/Development/cluster/corosync # make
>>> Making all in include
>>> Making all in lcr
>>> Built Live Component Replacement System
>>> Making all in lib
>>> Built shared libs
>>> Making all in exec
>>>  cd .. && /bin/sh /Users/beekhof/Development/cluster/corosync/missing
>>> --run automake-1.10 --gnu  exec/Makefile
>>>  cd .. && /bin/sh ./config.status exec/Makefile depfiles
>>> config.status: creating exec/Makefile
>>> config.status: executing depfiles commands
>>> Undefined symbols for architecture x86_64:
>>>  "_short_service_name_get", referenced from:
>>>     _coroipcs_handler_dispatch in coroipcs.o
>>> ld: symbol(s) not found for architecture x86_64
>>> collect2: ld returned 1 exit status
>>> make[2]: *** [libcoroipcs.so.4.0.0] Error 1
>>> make[1]: *** [all-recursive] Error 1
>>> make: *** [all] Error 2
>>>
>>> short_service_name_get is defined in utils.c which is only linked into
>>> corosync itself and therefore not available to users of libcoroipcs.
>>>
>>
>> Likewise for some functions in coropoll.c which is linked into
>> libtotempg which doesn't help the service plugins.
>>
>> Making all in services
>> Undefined symbols for architecture x86_64:
>>  "_poll_dispatch_delete", referenced from:
>>      _confdb_exec_exit_fn in confdb.o
>>  "_poll_dispatch_add", referenced from:
>>      _confdb_exec_init_fn in confdb.o
>> ld: symbol(s) not found for architecture x86_64
>> collect2: ld returned 1 exit status
>> make[2]: *** [service_confdb.lcrso] Error 1
>> make[1]: *** [all-recursive] Error 1
>> make: *** [all] Error 2
>
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Where can we find information on Corosync/OpenAIS w/out Pacemaker?

2011-07-03 Thread Andrew Beekhof
On Fri, Jul 1, 2011 at 11:23 PM, Whit Blauvelt  wrote:
> On Fri, Jul 01, 2011 at 11:45:22AM +1000, Andrew Beekhof wrote:
>
>> Being a generic cluster manager, rather than purpose-built for a specific
>> app like a filesystem or database, we don't get involved with things like
>> data replication or anything else that requires an understanding of the
>> services we're managing. This puts a natural limit on the kinds of
>> scenarios we can be involved in.
>>
>> I guess you could say that we aim for "fits-most". But its a free world,
>> people are entitled to build/deploy whatever solution they like.
>
> Thanks for the clarifications. I do appreciate all the discussion following
> my initial question. It helps a lot with the overview.
>
> Here's my admittedly naive concern. In aiming to be "fits-most" Pacemaker
> has become something very complex.

Most of the complexity actually comes with supporting >2 nodes.
I've written more coherently on the topic previously:
  
http://theclusterguy.clusterlabs.org/post/178680309/configuring-heartbeat-v1-was-so-simple

But without doubt Pacemaker is more complex than a purpose-built solution.
The question, which only you can answer, is whether it's going to be
"cheaper" to build and maintain a custom solution or to use something
off the shelf (albeit with a non-trivial learning curve).

Btw. have you read Clusters from Scratch?
It's at http://www.clusterlabs.org/doc/ and is probably the closest to
the type of book you mention below unless you read German.
Oh, there is also
http://oss.clusterlabs.org/pipermail/pacemaker/2010-February/005179.html
which can be purchased in paper form.

-- Andrew

> It may well look simpler from the inside,
> but from the outside it's not just the complexity of the Pacemaker system
> itself, but the difficulty making sense of the scattered documentation.
> There is no book on it. There's no online equivalent of a book on it. There
> are some wonderful diagrams of it on a meta level, and some very low-level
> docs generated from the source, and some specific examples which are pretty
> much only useful if one of them coincides with the desired deployment. In
> our case none of them do.
>
> Now, as a sysadmin, I damn well better understand the deployment of my HA
> scheme. It has to be something I can handle when it goes wrong at 4 a.m.,
> and there's no time to make coffee. I can arrive at that by one of two ways:
> deploy something developed and proved elsewhere that has good reference
> material, or develop something myself to the specific purpose and make sure
> I document it well as I go. Yeah, I could never build something with the
> scope of Pacemaker. But for a narrowly-defined, specific purpose it's within
> the reach of even the moderately skilled, providing that concepts and logic
> are clear.
>
> Or - this all being components - I might take for instance Corosync and use
> it without Pacemaker. That's why I asked the initial question about the
> feasibility of doing so. In a year or two, if someone publishes a good book
> or two on Pacemaker, in classic O'Reilly style, we might do well to throw
> out whatever I currently deploy and do the next version with Pacemaker - or
> more likely Pacemaker plus specific bits of what I presently build.
>
> I'm sure Pacemaker is beautiful. But without better reference materials the
> beauty is hidden. I'm sure if we hired someone intimate with the project as
> a consultant we could get a sterling deployment. But that still wouldn't
> solve the "4 a.m. with no time to brew coffee" dilemma. Also, the scale of
> our project is on the small side, so it would be hard to justify a
> consultant's cost. On the other hand, if that consultant would write the
> book, we'll be first in line to buy a copy.
>
> Best,
> Whit
>
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Compilation error in HEAD

2011-07-03 Thread Andrew Beekhof
On Mon, Jul 4, 2011 at 12:16 PM, Andrew Beekhof  wrote:
> [12:06 pm] beekhof@iMac ~/Development/cluster/corosync # make
> Making all in include
> Making all in lcr
> Built Live Component Replacement System
> Making all in lib
> Built shared libs
> Making all in exec
>  cd .. && /bin/sh /Users/beekhof/Development/cluster/corosync/missing
> --run automake-1.10 --gnu  exec/Makefile
>  cd .. && /bin/sh ./config.status exec/Makefile depfiles
> config.status: creating exec/Makefile
> config.status: executing depfiles commands
> Undefined symbols for architecture x86_64:
>  "_short_service_name_get", referenced from:
>      _coroipcs_handler_dispatch in coroipcs.o
> ld: symbol(s) not found for architecture x86_64
> collect2: ld returned 1 exit status
> make[2]: *** [libcoroipcs.so.4.0.0] Error 1
> make[1]: *** [all-recursive] Error 1
> make: *** [all] Error 2
>
> short_service_name_get is defined in utils.c which is only linked into
> corosync itself and therefore not available to users of libcoroipcs.
>

Likewise for some functions in coropoll.c which is linked into
libtotempg which doesn't help the service plugins.

Making all in services
Undefined symbols for architecture x86_64:
  "_poll_dispatch_delete", referenced from:
  _confdb_exec_exit_fn in confdb.o
  "_poll_dispatch_add", referenced from:
  _confdb_exec_init_fn in confdb.o
ld: symbol(s) not found for architecture x86_64
collect2: ld returned 1 exit status
make[2]: *** [service_confdb.lcrso] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


[Openais] Compilation error in HEAD

2011-07-03 Thread Andrew Beekhof
[12:06 pm] beekhof@iMac ~/Development/cluster/corosync # make
Making all in include
Making all in lcr
Built Live Component Replacement System
Making all in lib
Built shared libs
Making all in exec
 cd .. && /bin/sh /Users/beekhof/Development/cluster/corosync/missing
--run automake-1.10 --gnu  exec/Makefile
 cd .. && /bin/sh ./config.status exec/Makefile depfiles
config.status: creating exec/Makefile
config.status: executing depfiles commands
Undefined symbols for architecture x86_64:
  "_short_service_name_get", referenced from:
  _coroipcs_handler_dispatch in coroipcs.o
ld: symbol(s) not found for architecture x86_64
collect2: ld returned 1 exit status
make[2]: *** [libcoroipcs.so.4.0.0] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2

short_service_name_get is defined in utils.c which is only linked into
corosync itself and therefore not available to users of libcoroipcs.
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Where can we find information on Corosync/OpenAIS w/out Pacemaker?

2011-06-30 Thread Andrew Beekhof
On Fri, Jul 1, 2011 at 2:58 AM, Steven Dake  wrote:
> On 06/30/2011 06:57 AM, Digimer wrote:
>> On 06/30/2011 08:48 AM, Whit Blauvelt wrote:
>>> On Thu, Jun 30, 2011 at 01:30:43PM +1000, Andrew Beekhof wrote:
>>>
>>>> I'll agree that Pacemaker isn't for everyone, I don't know about the
>>>> one-size-fits-all comment though.
>>>
>>> Generally, when clothing is advertised as "one-size-fits-all," that means
>>> "for everyone." If Pacemaker's focus is on meeting the needs of a subset of
>>> projects rather than on fitting everyone, how is that subset defined? The
>>> Pacemaker docs, such as they are, seem to suggest it's the solution for all.
>>>
>>> As they say, "To a hammer, everything looks like a nail." That is, even with
>>> a good general-purpose tool like a hammer (or Pacemaker), a skilled
>>> carpenter sometimes is going to reach for something else.
>>>
>>> Best,
>>> Whit
>>
>> If I may jump in;
>>
>> Clustering, as a topic or concept, is extremely wide. One can not help
>> but use analogies or ways of speaking that, on close analysis, might not
>> be accurate.
>>
>> Pacemaker, compared to alternatives like rgmanager, is very flexible and
>> adaptable. In this regard, it fits far more scenarios than rgmanager and
>> can be described as "one size fits all".
>>
>> The trick is that flexibility comes at the cost of a certain amount of
>> complexity. To use rgmanager (which I like, by the way) to compare
>> against, it is simple to understand. Its syntax is pretty trivial and
>> thus is easy to learn and use. However, this results in a lot of
>> restrictions.
>>
>> Imagine that you have HA virtual machines, and they can't start until
>> storage resources have started. The VMs are configured to run on a given
>> node, ideally, but can run on either. In rgmanager, there is no way at
>> all to say "If the storage starts on the other node, but fails here,
>> start my VMs over there". So then, you have a serious restriction.
>>
>> In pacemaker, this is possible. However, configuring such a scenario is
>> inherently complex, by comparison. So, to summarize; Pacemaker is easy
>> and accessible, given the complexity of the problem. It can be adapted
>> to most any need, so it is "one size fits all".
>
> The bottom line is Pacemaker rocks.  Nobody is arguing it doesn't.  The
> question Whit asks is "if I really want to spend the bucks making a
> custom HA implementation in my software, can I get better results."
>
> The answer is yes and in fact this type of development is exactly why
> Corosync was created.

Not questioning that.
Just wanted to point out that Pacemaker isn't trying to be a "fits-all".

Being a generic cluster manager, rather than purpose-built for a
specific app like a filesystem or database, we don't get involved with
things like data replication or anything else that requires an
understanding of the services we're managing.
This puts a natural limit on the kinds of scenarios we can be involved in.

I guess you could say that we aim for "fits-most".
But it's a free world; people are entitled to build/deploy whatever
solution they like.

>
> With n-way replication, every node has a copy of the information used to
> drive the application.  This moves an application from the realm of
> stateless cold failover to warm failover (reducing MTTR).

To be fair, we can help manage these kinds of applications using our
Master/Slave construct.
So it's not an either/or scenario.  I'd see Pacemaker as being
complementary to this kind of application.

> Instead of
> waiting for a restart, which damages availability (A), the application may
> continue along happily.  This is why the cluster infrastructure is
> implemented on top of corosync.  It provides the best mechanism for
> providing the lowest MTTR.
>
> A better description of availability and some of the solutions can be
> found here:
> http://www.redhat.com/summit/2011/presentations/summit/whats_new/thursday/dake_th_1130_high_availability_in_the_cloud.pdf
>
>
>>
>
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Where can we find information on Corosync/OpenAIS w/out Pacemaker?

2011-06-29 Thread Andrew Beekhof
On Tue, Jun 28, 2011 at 1:12 AM, Whit Blauvelt  wrote:
> Hi,
>
> While the corosync.org site is sparse, some of the presentation slides
> linked from there there promise that Corosync is designed to approach an
> ideal of simplicity and clarity, to allow a variety of HA projects to be
> developed against it. Since the Corosync site is sparse, is there anywhere
> to find documentation on setting up Corosync as part of a project which is
> not using Pacemaker, let alone one of the whole, pre-packaged cluster
> stacks? Pacemaker and the pre-packaged stacks come at a cost of
> one-size-fits-all complexity and opacity, and don't fit the existing
> architecture we're wanting to improve the HA attributes of.

I'll agree that Pacemaker isn't for everyone, I don't know about the
one-size-fits-all comment though.

> What Corosync
> or OpenAIS can do on their own in our context is what we hope to understand
> and take advantage of.
>
> Is documentation of this in development somewhere? Or has the goal of
> Corosync being available to a wider universe of uses than simply being an
> implicit component of Pacemaker-based cluster stacks been abandoned?
>
> Thanks,
> Whit
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Meatware Configuration Errors

2011-06-26 Thread Andrew Beekhof
Looks like the metadata this agent is returning is borked.
Can you use stonith_admin -M to check the output?

On Fri, Jun 3, 2011 at 1:09 PM, imnotpc  wrote:
> Hi again,
>
> I've got a 3 node cluster running with correct firewall rules this time. I set
> up a meatware fence device using crm:
>
> [XML snippet lost in the archive - the element tags were stripped; what
> remains shows a meatware STONITH device whose host list parameter was
> value="JeffDesk.LAN Server2.LAN Server4.LAN"]
>
> One node, an x86_64 box running Fedora 15, posted a bunch of error messages
> when I loaded this configuration:
>
> Jun 02 22:01:45 JeffDesk.LAN cib: [12082]: info: cib_stats: Processed 1
> operations (0.00us average, 0% utilization) in the last 10min
> Jun 02 22:07:49 JeffDesk.LAN crmd: [12086]: info: do_lrm_rsc_op: Performing
> key=9:14:0:92a50663-b97b-4807-8842-5b556fc45df2 op=meatware:0_start_0 )
> Jun 02 22:07:49 JeffDesk.LAN lrmd: [12083]: info: rsc:meatware:0:6: start
> Jun 02 22:07:49 JeffDesk.LAN stonith-ng: [12081]: info:
> stonith_device_register: Added 'meatware:0' to the device list (1 active
> devices)
> Jun 02 22:07:49 JeffDesk.LAN stonith-ng: [12081]: info: stonith_command:
> Processed st_device_register from lrmd: rc=0
> Jun 02 22:07:49 JeffDesk.LAN lrmd: [12083]: info: stonith_api_device_metadata:
> looking up external/meatware/heartbeat metadata
> Jun 02 22:07:49 JeffDesk.LAN lrmd: [12083]: ERROR: crm_abort: crm_strdup_fn:
> Triggered assert at utils.c:995 : src != NULL
> Jun 02 22:07:49 JeffDesk.LAN lrmd: [12083]: ERROR: crm_strdup_fn: Could not
> perform copy at st_client.c:510 (stonith_api_device_metadata)
> Jun 02 22:07:49 JeffDesk.LAN lrmd: [12083]: WARN: stonith_api_device_metadata:
> no long description in external/meatware's metadata.
> Jun 02 22:07:49 JeffDesk.LAN lrmd: [12083]: ERROR: crm_abort: crm_strdup_fn:
> Triggered assert at utils.c:995 : src != NULL
> Jun 02 22:07:49 JeffDesk.LAN lrmd: [12083]: ERROR: crm_strdup_fn: Could not
> perform copy at st_client.c:516 (stonith_api_device_metadata)
> Jun 02 22:07:49 JeffDesk.LAN lrmd: [12083]: WARN: stonith_api_device_metadata:
> no short description in external/meatware's metadata.
> Jun 02 22:07:49 JeffDesk.LAN lrmd: [12083]: ERROR: crm_abort: crm_strdup_fn:
> Triggered assert at utils.c:995 : src != NULL
> Jun 02 22:07:49 JeffDesk.LAN lrmd: [12083]: ERROR: crm_strdup_fn: Could not
> perform copy at st_client.c:522 (stonith_api_device_metadata)
> Jun 02 22:07:49 JeffDesk.LAN lrmd: [12083]: WARN: stonith_api_device_metadata:
> no list of parameters in external/meatware's metadata.
> Jun 02 22:07:49 JeffDesk.LAN crmd: [12086]: info: process_lrm_event: LRM
> operation meatware:0_start_0 (call=6, rc=0, cib-update=11, confirmed=true) ok
> Jun 02 22:11:45 JeffDesk.LAN cib: [12082]: info: cib_stats: Processed 8
> operations (0.00us average, 0% utilization) in the last 10min
>
> If I try to test a node leaving I get similar errors. Configuration issue? 
> Bug?
>
> Jeff
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Remote Access not Working

2011-06-26 Thread Andrew Beekhof
On Thu, May 19, 2011 at 8:29 PM,   wrote:
> Hi,
>
>   I have configured 2-Node(Linux1, Linux2) cluster that is working fine.
> But I am not able to access the cluster remotely (from Linux3).
>
>  I have configured both parameter remote-tls-port and remote-clear-port
> in cib.xml file
>
>  On Linux3(Remote machine) I am doing these operation..
>
>  export CIB_port=1234
>  export CIB_server=Linux1 (dc)
>  export CIB_user=hacluster
>  export CIB_passwd=hacluster
>  cibadmin -Q
>
>  After execution of command cibadmin -Q it goes in wait status..
>
>  These logs are written on DC machine..
>
>  ERROR : cib_recv_remote_msg : Empty reply
>  info:print_xml_formatted cib_remote_listen Login null
>  Error: cib_xml_err
>  ERROR :string2xml could not parse 3 chars T
>  ERROR :cib_recv_remote_msg  could not parse...
>  ERROR :do_pe_invoke_callback can't retrive the cib remote node did not
> respond
>  ERROR :do_log FSA input l_error from do_pe_invoke_callback() received in
> state S_POLICY_ENGINE
>  info: do_state_transition state transition S_POLICY_ENGINE -> S_RECOVERY
> [ input l_ERROR cause =C_FSA_INTERVAL origin do_pe_invoke_callback]
>
>  Please can you help me to solve the problem.


Looks like the remote connection attempt caused the cib on linux1 to crash.
What version of pacemaker are you running?
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Corosync goes into endless loop when same hostname is used on more than one node

2011-05-12 Thread Andrew Beekhof
On Thu, May 12, 2011 at 4:04 PM, Dan Frincu  wrote:
> Hi,
> When using the same hostname on 2 nodes

Don't do that. Ever.
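
Each node's name comes from uname -n, so the quick check is (a sketch):

    uname -n       # run on every node; each must be unique
    crm_node -l    # should list one entry per node, with no repeats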

> (debian squeeze, corosync 1.3.0-3
> from unstable) the following happens:
> May 12 08:36:27 debian cib: [3125]: info: cib_process_request: Operation
> complete: op cib_sync for section 'all' (origin=local/crmd/84,
> version=0.5.1): ok (rc=0)
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now has
> id: 620757002
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_FSA_INTERNAL origin=check_join_state ]
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: All 1
> cluster nodes responded to the join offer.
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_finalize: join-29:
> Syncing the CIB from debian to the rest of the cluster
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now has
> id: 603979786
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_JOIN_REQUEST
> cause=C_HA_MESSAGE origin=route_message ]
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Unset DC debian
> May 12 08:36:27 debian cib: [3125]: info: cib_process_request: Operation
> complete: op cib_sync for section 'all' (origin=local/crmd/86,
> version=0.5.1): ok (rc=0)
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_offer_all: join-30:
> Waiting on 1 outstanding join acks
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Set DC to debian
> (3.0.1)
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now has
> id: 620757002
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_FSA_INTERNAL origin=check_join_state ]
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: All 1
> cluster nodes responded to the join offer.
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_finalize: join-30:
> Syncing the CIB from debian to the rest of the cluster
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now has
> id: 603979786
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_JOIN_REQUEST
> cause=C_HA_MESSAGE origin=route_message ]
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Unset DC debian
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_offer_all: join-31:
> Waiting on 1 outstanding join acks
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Set DC to debian
> (3.0.1)
> May 12 08:36:27 debian cib: [3125]: info: cib_process_request: Operation
> complete: op cib_sync for section 'all' (origin=local/crmd/88,
> version=0.5.1): ok (rc=0)
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now has
> id: 620757002
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_FSA_INTERNAL origin=check_join_state ]
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: All 1
> cluster nodes responded to the join offer.
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_finalize: join-31:
> Syncing the CIB from debian to the rest of the cluster
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now has
> id: 603979786
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_JOIN_REQUEST
> cause=C_HA_MESSAGE origin=route_message ]
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Unset DC debian
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_offer_all: join-32:
> Waiting on 1 outstanding join acks
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Set DC to debian
> (3.0.1)
> Basically it goes into an endless loop. This is an improperly configured
> option, but it would help users if this were handled, or a
> relevant message printed in the logfile, such as "duplicate hostname found".
> Regards.
> Dan
> --
> Dan Frincu
> CCNA, RHCE
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] cibadmin usages....

2011-04-18 Thread Andrew Beekhof
On Mon, Apr 18, 2011 at 10:40 AM,   wrote:
> Hi ,
>
>    Using cibadmin or any other command I want to extract all the
> configured nodes for ResgGrp1 here (server150, server151).

Try the --xpath option?
Or perhaps use ptest --show-scores
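
For reference, a rough sketch of the --xpath approach, assuming the group is
referenced by a location constraint via its rsc attribute (the grep is just one
way to pull the node names out of the returned XML):

    cibadmin --query --xpath "//rsc_location[@rsc='ResgGrp1']" | grep -o 'value="[^"]*"'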

>  <rsc_location ... rsc="ResgGrp1">
>      <rule ...>
>          <expression ... operation="eq" value="server150"/>
>      </rule>
>      <rule ...>
>          <expression ... operation="eq" value="server151"/>
>      </rule>
>  </rsc_location>
>
> Please can you help me ... How can I do this..
>
>
> Regards
> Manish
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Corosync failover too long with many resources

2011-04-14 Thread Andrew Beekhof
On Thu, Apr 14, 2011 at 12:09 PM, Jonathan Amiez
 wrote:
> Hello,
>
> I would need some advice to configure a 2-node cluster of load balancers
> with
> Pacemaker/Corosync.
>
> I already have that cluster set up, but it does not work as expected.
>
> It's running Haproxy/Nginx in active/passive setup.
> In addition of these 2 resources, there are about a hundred of virtual IPs
> configured (ocf:heartbeat:IPaddr2), all tied in a group of resources.
>
> The problem is that the failover takes about 30 seconds to complete, which
> makes some services unreachable.

 group services mail haproxy-all nginx vip-xxx [...]

By this, do you mean that all the IPs are in a single group?
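
If so, a sketch of the split proposed further down, once the daemons are taken
out of the existing group (the resource and constraint ids are illustrative):

    crm configure clone cl-haproxy haproxy-all
    crm configure clone cl-nginx nginx
    crm configure group vips vip-xxx vip-yyy [...]
    crm configure colocation vips-with-haproxy inf: vips cl-haproxy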

>
> Is the number of VIPs too large?
> Is there a limit on the number of resources to keep good performance?
>
> I thought about adding one or more nodes to the cluster, and split the group
> into multiple groups, as it hosts different services, and make haproxy and
> nginx "clone resources".
> Does it seems a reliable solution?
>
> Here is the crm setup (I kept only 1 ocf:heartbeat:IPaddr2, as they are all
> identical) :
>
> node lb-01 \
>        attributes standby="off"
> node lb-02
> primitive haproxy-all lsb:haproxy-all \
>        op monitor on-fail="standby" interval="2s"
> primitive mail ocf:heartbeat:MailTo \
>        params email="xxx" subject="[LB]" \
>        op monitor interval="10s"
> primitive nginx lsb:nginx \
>        op monitor on-fail="standby" interval="2s"
> primitive vip-xxx ocf:heartbeat:IPaddr2 \
>        params ip="xx.xx.xx.xx" broadcast="xx.xx.xx.xx" nic="bond0" \
>        op monitor interval="2s"
> [...]
> group services mail haproxy-all nginx vip-xxx [...]
> property $id="cib-bootstrap-options" \
>        expected-quorum-votes="3" \
>        stonith-enabled="false" \
>        no-quorum-policy="ignore"
>
> Thanks by advance,
>
> Jonathan Amiez
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Issues with order of fencing

2011-04-11 Thread Andrew Beekhof
On Thu, Apr 7, 2011 at 4:12 AM, Richard  wrote:
> Hi,
>    I'm rather new to openais and have run into some issues with the order of
> fencing plus refusal to failover once one fencing method fails. Any help
> would be much appreciated.
>    Even though I've set priority lower on my fence_node2_ipmi device it will
> not fence first. But fence_node2_apc is picked (also tried setting the
> priority the other way, no effect). Only when I delete fence_node2_ipmi and
> add it again does it get used first.

stonith in 1.1.x doesn't strictly observe the priorities.
It's one of the things we need to fix soon.

> The second issue i'm running into is
> that if fence_node2_ipmi fails OR fence_node2_apc for that matter it just
> keeps reattempting that same fencing device over and over again.

It shouldn't do that - can you file a bug and include a crm_report
archive please?
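
For reference, a sketch of gathering such an archive (the time window and
destination are illustrative):

    crm_report -f "2011-04-06 18:00" -t "2011-04-06 20:00" /tmp/fencing-report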

> Also every time it executes the reboot the physical ipmi card is issuing a
> restart and the server endlessly rebooting.
> Log output:
> Apr 06 19:05:39 node1 stonith-ng: [2591]: info: log_data_element:
> process_remote_stonith_exec: ExecResult  st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="st_notify"
> st_remote_op="2311741c-fc3b-4094-badd-0ac9e10a209b" st_callid="0"
> st_callopt="0" st_rc="1" st_output="Rebooting machine @
> IPMI:192.168.1.161...Failed
> " src="node1" seq="268" />
> Apr 06 19:05:49 node1 stonith-ng: [2591]: ERROR: remote_op_timeout: Action
> reboot (2311741c-fc3b-4094-badd-0ac9e10a209b) for node2 timed out
> Apr 06 19:05:49 node1 stonith-ng: [2591]: info: remote_op_done: Notifing
> clients of 2311741c-fc3b-4094-badd-0ac9e10a209b (reboot of node2 from
> fc25a065-3355-455d-937f-360b07f9dda9 by (null)): 1, rc=-7
> Apr 06 19:05:49 node1 stonith-ng: [2591]: info: stonith_notify_client:
> Sending st_fence-notification to client
> 2596/17cafaec-7078-4972-937e-1cf5636c8523
> Apr 06 19:05:50 node1 stonith-ng: [2591]: info: initiate_remote_stonith_op:
> Initiating remote operation reboot for node2:
> e5e1a936-c038-42bc-acff-18c2a41e9ae2
> Apr 06 19:05:50 node1 stonith-ng: [2591]: info: log_data_element:
> stonith_query: Query  st_async_id="e5e1a936-c038-42bc-acff-18c2a41e9ae2" st_op="st_query"
> st_callid="0" st_callopt="0"
> st_remote_op="e5e1a936-c038-42bc-acff-18c2a41e9ae2" st_target="node2"
> st_device_action="reboot" st_clientid="fc25a065-3355-455d-937f-360b07f9dda9"
> src="node1" seq="269" />
> Apr 06 19:05:50 node1 stonith-ng: [2591]: info: can_fence_host_with_device:
> fence_node2_ipmi can fence node2: static-list
> Apr 06 19:05:50 node1 stonith-ng: [2591]: info: can_fence_host_with_device:
> fence_node2_apc can fence node2: static-list
> Apr 06 19:05:50 node1 stonith-ng: [2591]: info: stonith_query: Found 2
> matching devices for 'node2'
> Apr 06 19:05:50 node1 stonith-ng: [2591]: info: call_remote_stonith:
> Requesting that node1 perform op reboot node2
> Apr 06 19:05:50 node1 stonith-ng: [2591]: info: log_data_element:
> stonith_fence: Exec  st_async_id="e5e1a936-c038-42bc-acff-18c2a41e9ae2" st_op="st_fence"
> st_callid="0" st_callopt="0"
> st_remote_op="e5e1a936-c038-42bc-acff-18c2a41e9ae2" st_target="node2"
> st_device_action="reboot" src="node1" seq="271" />
> Apr 06 19:05:50 node1 stonith-ng: [2591]: info: can_fence_host_with_device:
> fence_node2_ipmi can fence node2: static-list
> Apr 06 19:05:50 node1 stonith-ng: [2591]: info: can_fence_host_with_device:
> fence_node2_apc can fence node2: static-list
> Apr 06 19:05:50 node1 stonith-ng: [2591]: info: stonith_fence: Found 2
> matching devices for 'node2'
>
>    I'm running version: 1.1.2.
>
>   Here is the relevant part of my cluster config:
> node node1 \
>         attributes standby="off"
> node node2 \
>         attributes standby="off"
> primitive fence_node1 stonith:fence_ipmilan \
>         params action="reboot" ipaddr="192.168.1.160" login="ADMIN"
> passwd="ADMIN" pcmk_host_check="static-list" pcmk_host_list="node1"
> primitive fence_node1_apc stonith:fence_apc_snmp \
>         params ipaddr="192.168.1.180" action="reboot" port="node1"
> community="private" pcmk_host_check="static-list" pcmk_host_list="node1"
> priority="20"
> primitive fence_node2_apc stonith:fence_apc_snmp \
>         params ipaddr="192.168.1.180" action="reboot" port="node2"
> community="private" pcmk_host_check="static-list" pcmk_host_list="node2"
> priority="100"
> primitive fence_node2_ipmi stonith:fence_ipmilan \
>         params action="reboot" ipaddr="192.168.1.161" login="ADMIN"
> passwd="ADMIN" pcmk_host_check="static-list" pcmk_host_list="node2"
> priority="10"
> location fence-node1_apc-on-node2 fence_node1_apc -inf: node1
> location fence_node1-on-node2 fence_node1 -inf: node1
> location fence_node2-on-node1 fence_node2_apc -inf: node2
> location fence_node2_ipmi-on-node1 fence_node2_ipmi -inf: node2
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe" \
>         cluster-infrastructure="openais" \
>       

Re: [Openais] How to add bind as a resource?

2011-04-09 Thread Andrew Beekhof
Sounds like the init script is not LSB compliant (wrong return code)
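
A quick way to check is to compare the script's exit codes against what
pacemaker expects from an LSB script, e.g. (a sketch):

    /etc/init.d/named start;  echo "start: $?"    # expected 0
    /etc/init.d/named status; echo "status: $?"   # expected 0 when running, 3 when stopped
    /etc/init.d/named stop;   echo "stop: $?"     # expected 0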

On Sat, Apr 9, 2011 at 5:43 AM, Neil Aggarwal  wrote:
> Hello:
>
> I would like to add bind as a resource to my cluster.
>
> I tried this command:
> crm configure primitive named lsb:named op monitor interval="30s"
> timeout="20s"
>
> It gave me no errors, but when I look at crm_mon, I see this
> message:
>
> Failed actions:
>    named_start_0 (node=lb1h, call=10, rc=7, status=complete): not running
>
> When I do a ps aux | grep named on the load balancer, I see that
> named is not running.
>
> That is strange since I can do service named start and it
> starts up.
>
> Is there a better configure command to use for bind?
>
> Thanks,
>        Neil
>
>
> --
> Neil Aggarwal, (281)846-8957, http://UnmeteredVPS.net/centos
> Virtual private server with CentOS 5.5 preinstalled
> Unmetered bandwidth = no overage charges
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Resource start/stop linear dependency

2011-04-07 Thread Andrew Beekhof
On Thu, Apr 7, 2011 at 9:12 AM,   wrote:
> Hi,
>
>    I have configured three resources (X, Y, Z) in one resource group.
>
>    Order of resource in cib.xml file X then Y then Z.
>
>    No rsc-order constraint is added in the cib.xml file

Yes there is - ordering is implied by the use of a group.
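
Roughly, a group of X, Y and Z behaves as if these constraints had been written
by hand (a sketch; the constraint ids are illustrative):

    crm configure order X-then-Y inf: X Y
    crm configure order Y-then-Z inf: Y Z
    crm configure colocation Y-with-X inf: Y X
    crm configure colocation Z-with-Y inf: Z Y

With those implied constraints, stopping Y forces Z to stop first, which matches
the behaviour described below.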

>
>    If I stop resource Z then only resource Z is stopped.
>
>    If I start resource Z then only resource Z is started.
>
>    If I stop resource Y then resources Y and Z are stopped.
>
>    If I start resource Y then resources Y and Z are started.
>
>
> I want to start/stop only resource Y. How can I do it?
>
> Is it supported in any version of pacemaker?
>
> Cluster Stack
> Pacemaker
> Heartbeat
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] need help for stonith configuration on RHEL6

2011-03-11 Thread Andrew Beekhof
Are you using pacemaker or rhcs/rgmanager?
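
If it's pacemaker (with the crm shell available), a minimal sketch of a fencing
primitive built on one of the fence agents listed below (the address,
credentials and node name are placeholders):

    crm configure primitive st-ipmi-node2 stonith:fence_ipmilan \
        params ipaddr="192.168.0.2" login="admin" passwd="secret" \
               pcmk_host_list="node2" pcmk_host_check="static-list" \
        op monitor interval="60s"
    crm configure property stonith-enabled="true"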

On Fri, Mar 11, 2011 at 12:01 PM, Amit Jathar  wrote:
> Hi,
>
>
>
> I am working on RHEL6.
>
> I am using two-node corosync cluster. I have configured it for Apache,
> tomcat & Mysql. It is running fine & doing good job in the failover
> scenarios.
>
> I want to add the stonith configuration for the split-brain scenario, but I
> could see only stonith_admin on the prompt.
>
>
>
> If I put command :-
>
> [root@OEL6_VIP_1 /]# stonith -L
>
> -bash: stonith: command not found
>
>
>
> [root@OEL6_VIP_1 /]# stonith_admin -L
>
> stonith_admin[11801]: 2011/03/11_01:58:57 info: crm_log_init_worker: Changed
> active directory to /var/lib/heartbeat/cores/root
>
> stonith_admin[11801]: 2011/03/11_01:58:57 notice: log_data_element:
> st_callback: st_notify_disconnect  subt="st_notify_disconnect" />
>
> [root@OEL6_VIP_1 /]#
>
>
>
> Also, I can see the following output :-
>
> [root@OEL6_VIP_1 /]# ls /usr/sbin/fence_*
>
> /usr/sbin/fence_ack_manual   /usr/sbin/fence_bladecenter_snmp
> /usr/sbin/fence_egenera   /usr/sbin/fence_ilo
> /usr/sbin/fence_node /usr/sbin/fence_sanbox2
> /usr/sbin/fence_virt   /usr/sbin/fence_xvm
>
> /usr/sbin/fence_apc  /usr/sbin/fence_cisco_mds
> /usr/sbin/fence_eps   /usr/sbin/fence_ilo_mp
> /usr/sbin/fence_nss_wrapper  /usr/sbin/fence_scsi /usr/sbin/fence_vmware
>
> /usr/sbin/fence_apc_snmp /usr/sbin/fence_drac
> /usr/sbin/fence_ibmblade  /usr/sbin/fence_intelmodular
> /usr/sbin/fence_rsa  /usr/sbin/fence_tool
> /usr/sbin/fence_vmware_helper
>
> /usr/sbin/fence_bladecenter  /usr/sbin/fence_drac5
> /usr/sbin/fence_ifmib /usr/sbin/fence_ipmilan
> /usr/sbin/fence_rsb  /usr/sbin/fence_virsh    /usr/sbin/fence_wti
>
> [root@OEL6_VIP_1 /]#
>
>
>
> I am not sure how to move forward with the stonith configuration.
>
>
>
> Can you guide me, if is it possible to configure stonith using my current
> setup & how to configure it from now ?
>
>
>
> Thanks,
>
> Amit
>
>
>
> 
> This email (message and any attachment) is confidential and may be
> privileged. If you are not certain that you are the intended recipient,
> please notify the sender immediately by replying to this message, and delete
> all copies of this message and attachments. Any other use of this email by
> you is prohibited.
> 
>
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] corosync shutdown process

2011-03-09 Thread Andrew Beekhof
Not enough information.
Create and attach a hb_report for the shutdown case.
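
For reference, a sketch of generating one (the time window and destination are
illustrative):

    hb_report -f "2011-03-08 11:40" -t "2011-03-08 11:50" /tmp/shutdown-report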

On Tue, Mar 8, 2011 at 8:08 PM, Beau Sapach  wrote:
> Hello everyone,
>
> I’ve got a 2-node cluster that exposes iSCSI targets backed by LVM volumes
> on top of a DRBD device.  For the most part I’ve got everything working as
> I’d like.  Manually moving resources works just fine, either using ‘move’ or
> by putting a node on standby.  Shutting down the corosync service on one
> node is another story though.  I have an order constraint in place to make
> iscsi-scst shutdown before stopping the LVM volume group but in the logs I
> see this:
>
> Mar 08 11:43:07 iscsitest2 crmd: [20755]: info: process_lrm_event: LRM
> operation clusterip_stop_0 (call=72, rc=0, cib-update=91, confirmed=true) ok
> Mar 08 11:43:07 iscsitest2 crmd: [20755]: info: do_lrm_rsc_op: Performing
> key=88:89:0:1effe13b-3093-4bf7-ae29-f764aaf22933 op=iscsi_target_stop_0 )
> Mar 08 11:43:07 iscsitest2 lrmd: [20752]: info: rsc:iscsi_target:73: stop
> Mar 08 11:43:07 iscsitest2 lrmd: [24805]: WARN: For LSB init script, no
> additional parameters are needed.
> Mar 08 11:43:07 iscsitest2 lrmd: [20752]: info: RA output:
> (iscsi_target:stop:stdout) Stopping iSCSI-SCST target service:
> Mar 08 11:43:07 iscsitest2 lrmd: [20752]: info: RA output:
> (iscsi_target:stop:stdout) succeeded.
> Mar 08 11:43:07 iscsitest2 lrmd: [20752]: info: RA output:
> (iscsi_target:stop:stdout) Removing iSCSI-SCST target modules:
> Mar 08 11:43:07 iscsitest2 crmd: [20755]: info: do_lrm_rsc_op: Performing
> key=51:89:0:1effe13b-3093-4bf7-ae29-f764aaf22933 op=drbd_lvm_stor:1_demote_0
> )
> Mar 08 11:43:07 iscsitest2 lrmd: [20752]: info: rsc:drbd_lvm_stor:1:74:
> demote
> Mar 08 11:43:07 iscsitest2 lrmd: [20752]: info: RA output:
> (drbd_lvm_stor:1:demote:stderr) 1: State change failed: (-12) Device is held
> open by someone
>
> The first line is fine, the iSCSI target IP should be shutdown first, then
> the target service and its modules are stopped/unloaded.  Next though I see
> corosync trying to demote the DRBD device that sits ‘under’ the LVM volume
> group, BEFORE it shuts down LVM… why are these things being done out of
> order?  Based on my constraint corosync should:
>
>
> Shutdown iscsi IP
> Shutdown iscsi-scst
> Shutdown LVM
> Demote drbd device
>
>
> The order constraint in my configuration looks like this:
>
> order san_startup inf: ms_drbd_lvm_stor:promote lvm_vg0 iscsi_target
> clusterip
>
> Lastly, I see, near the end of the log:
>
> Mar 08 11:43:11 iscsitest2 lrmd: [20752]: info: RA output:
> (drbd_lvm_stor:1:demote:stdout)
>
> Which, to me, looks like an incomplete line, followed by a number of attrd,
> crmd, stonithd & cib  ERROR messages indicating that the connection to the
> OpenAIS service has been lost.   I suppose this means that corosync doesn’t
> wait for proper resource migration before it shuts down which seems very
> strange to me.  Unless I’m missing something here, has anyone else run into
> anything like this?
>
> Beau
>
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] [Pacemaker] Finalizing installation

2011-02-24 Thread Andrew Beekhof
On Thu, Feb 24, 2011 at 6:18 PM, Alessio Gennari  wrote:
> Hello,
> I installed Cluster-glue, agents, Corosync, Openais and Pacemaker on an
> Ubuntu Server 10.10 64-bit machine. I could start the services and configure
> a ClusterIP resource that responds correctly. I installed two nodes (openais
> and openais2); on both I started corosync, and only on openais did I start the
> pacemaker service. openais can connect to the cluster and retrieve information
> such as:
>
> root@openais:~# crm_mon
> Defaulting to one-shot mode
> You need to have curses available at compile time to enable console mode
> 
> Last updated: Thu Feb 24 18:13:37 2011
> Stack: openais
> Current DC: openais2.clustertest - partition with quorum
> Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> 
>
> Online: [ openais.clustertest openais2.clustertest ]
>
>  ClusterIP  (ocf::heartbeat:IPaddr2):   Started openais2.clustertest
>
>
>
> root@openais:~# crm configure show
> node openais.clustertest
> node openais2.clustertest
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>     params ip="192.168.56.199" cidr_netmask="32" \
>     op monitor interval="30s" \
>     meta target-role="Started"
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>     cluster-infrastructure="openais" \
>     expected-quorum-votes="2" \
>     stonith-enabled="false"
>
>
>
>
> Instead, if I launch the same commands on openais2 I get these errors:
>
>
> user@openais2:~$ crm_mon

Is user part of the clustering group?
What if you try using sudo?
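
For reference, the crm tools talk to the CIB through a socket that is normally
only accessible to root and the haclient group, so a sketch of the usual fix
(the user name is illustrative):

    id user                        # check current group membership
    usermod -a -G haclient user    # as root; log out and back in afterwards
    sudo crm_mon -1                # or simply run the tools with sudo
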

> Defaulting to one-shot mode
> You need to have curses available at compile time to enable console mode
>
> Connection to cluster failed: connection failed
>
>
>
> user@openais2:~$ crm configure show
> Signon to CIB failed: connection failed
> Init failed, could not perform requested operations
> ERROR: cannot parse xml: no element found: line 1, column 0
> ERROR: No CIB!
>
>
>
>
> I was not able to solve the problem on openais2, can anyone help me to
> correct my errors?
>
> Thanks in advance.
>
> Alessio
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Problems with Pacemaker + Corosync after reboot

2011-01-18 Thread Andrew Beekhof
On Mon, Dec 20, 2010 at 12:55 AM, Daniel Bareiro  wrote:
> Hi all!
>
> I hope this is the right group to discuss my problem.
>
> I'm beginning to test HA clusters with GNU/Linux and for that I decided
> to try Pacemaker + Corosync in Debian Lenny following this [1] howto.
>
> Both packages were installed from the Backports repositories. But I am
> observing that if after configuration I reboot a node, it fails to join
> to the cluster after the boot.
>
> This is what I see in /var/log/daemon.log:
>
> --
> Dec 19 17:13:13 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: 
> Sending message to local.crmd failed: unknown (rc=-2)
> Dec 19 17:13:13 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: 
> Sending message to local.cib failed: unknown (rc=-2)
> Dec 19 17:13:13 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: 
> Sending message to local.attrd failed: unknown (rc=-2)
> Dec 19 17:13:13 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: 
> Sending message to local.cib failed: unknown (rc=-2)
> Dec 19 17:13:14 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: 
> Sending message to local.cib failed: unknown (rc=-2)
> Dec 19 17:13:14 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: 
> Sending message to local.cib failed: unknown (rc=-2)
> Dec 19 17:13:21 atlantis corosync[1508]:   [TOTEM ] A processor failed, 
> forming new configuration.
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] notice: pcmk_peer_update: 
> Transitional membership event on ring 72: memb=1, new=0, lost=1
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: pcmk_peer_update: 
> memb: atlantis 335544586
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: pcmk_peer_update: 
> lost: daedalus 369099018
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] notice: pcmk_peer_update: 
> Stable membership event on ring 72: memb=1, new=0, lost=0
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: pcmk_peer_update: 
> MEMB: atlantis 335544586
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: 
> ais_mark_unseen_peer_dead: Node daedalus was not seen in the previous 
> transition
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: update_member: Node 
> 369099018/daedalus is now: lost
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: 
> send_member_notification: Sending membership update 72 to 0 children
> Dec 19 17:13:25 atlantis corosync[1508]:   [TOTEM ] A processor joined or 
> left the membership and a new membership was formed.
> Dec 19 17:13:25 atlantis corosync[1508]:   [MAIN  ] Completed service 
> synchronization, ready to provide service.
> --
>
>
> # ps auxf
> [...]
> root      1508  0.1  1.9 182624  4880 ?        Ssl  15:52   0:22 
> /usr/sbin/corosync
> root      1539  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ 
> /usr/sbin/corosync
> root      1540  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ 
> /usr/sbin/corosync
> root      1541  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ 
> /usr/sbin/corosync
> root      1542  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ 
> /usr/sbin/corosync
> root      1543  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ 
> /usr/sbin/corosync
> root      1544  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ 
> /usr/sbin/corosync

You're hitting a deadlock between the calls to fork() and exec() when
pacemaker is trying to start.
This is the reason we created the MCP

http://theclusterguy.clusterlabs.org/post/907043024/introducing-the-pacemaker-master-control-process-for
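
For reference, a sketch of the MCP-style setup described in that post. The
service block goes in corosync.conf; ver: 1 tells the plugin not to spawn the
pacemaker daemons itself, so they become children of pacemakerd instead of
corosync and the fork() deadlock above is avoided:

    service {
        name: pacemaker
        ver:  1
    }

    # then start the pieces separately:
    /etc/init.d/corosync start
    /etc/init.d/pacemaker start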

>
>
> From what I see in the howto, the output should be something like this:
>
>
> root     29980  0.0  0.8  44304  3808 ?        Ssl  20:55   0:00 
> /usr/sbin/corosync
> root     29986  0.0  2.4  10812 10812 ?        SLs  20:55   0:00  \_ 
> /usr/lib/heartbeat/stonithd
> 102      29987  0.0  0.8  13012  3804 ?        S    20:55   0:00  \_ 
> /usr/lib/heartbeat/cib
> root     29988  0.0  0.4   5444  1800 ?        S    20:55   0:00  \_ 
> /usr/lib/heartbeat/lrmd
> 102      29989  0.0  0.5  12364  2368 ?        S    20:55   0:00  \_ 
> /usr/lib/heartbeat/attrd
> 102      29990  0.0  0.5   8604  2304 ?        S    20:55   0:00  \_ 
> /usr/lib/heartbeat/pengine
> 102      29991  0.0  0.6  12648  3080 ?        S    20:55   0:00  \_ 
> /usr/lib/heartbeat/crmd
>
>
>
> I also tried compiling Pacemaker using these [2] steps, but I get the
> same result.
>
>
>
> Thanks in advance for your reply.
>
> Regards,
> Daniel
>
> [1] http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo
> [2] http://www.clusterlabs.org/wiki/Install#Building_from_Source
> --
> Daniel Bareiro - GNU/Linux registered user #188.598
> Proudly running Debian GNU/Linux with uptime:
> 20:31:04 up 67 days, 20:57, 10 users,  load average: 0.11, 0.05, 0.01
>
> -BEGIN PGP SIGNAT

Re: [Openais] Large delay when restarting active node

2010-12-03 Thread Andrew Beekhof
This looks like a drbd issue, you might have more luck on that list.
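
For reference, a few commands commonly used to inspect the state drbd is
complaining about below (a sketch; the resource name is a placeholder):

    cat /proc/drbd              # connection state, roles and disk states
    drbdadm cstate all
    drbdadm dstate all
    drbdadm outdate <resource>  # run on the peer when it must be marked outdated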

On Fri, Dec 3, 2010 at 4:15 PM, Dan Frincu  wrote:
> Hi,
>
> Don't know how to summarize what I've encountered, therefore the rather lame
> subject. I'm running a HA setup of 2 nodes on RHEL5U3, and I have done the
> following tests:
> - configure resources, everything runs on cluster1
> - hard reboot of both active and passive nodes (cluster1 and cluster2) at
> the same time (echo b > /proc/sysrq-trigger)
> - after reboot, checked resources, all went fine, they started on cluster1
> - hard rebooted active node (cluster1)
> - second node detects failure of cluster1, but gets stuck when it tries to
> start the drbd resources
>
> Dec 03 16:39:40 cluster2 crmd: [1699]: info: process_lrm_event: LRM
> operation ping_gw:0_monitor_0 (call=13, rc=7, cib-update=16, confirmed=true)
> not running
> Dec 03 16:39:41 cluster2 crm_attribute: [2585]: info: Invoked: crm_attribute
> -N cluster2 -n master-drbd_mysql:1 -l reboot -D
> Dec 03 16:39:41 cluster2 crm_attribute: [2584]: info: Invoked: crm_attribute
> -N cluster2 -n master-drbd_home:1 -l reboot -D
> Dec 03 16:39:41 cluster2 crm_attribute: [2583]: info: Invoked: crm_attribute
> -N cluster2 -n master-drbd_storage:1 -l reboot -D
> Dec 03 16:39:41 cluster2 attrd: [1697]: info: find_hash_entry: Creating hash
> entry for master-drbd_home:1
> Dec 03 16:39:41 cluster2 attrd: [1697]: info: find_hash_entry: Creating hash
> entry for master-drbd_storage:1
> Dec 03 16:39:41 cluster2 attrd: [1697]: info: find_hash_entry: Creating hash
> entry for master-drbd_mysql:1
> Dec 03 16:39:41 cluster2 crmd: [1699]: info: process_lrm_event: LRM
> operation drbd_storage:1_monitor_0 (call=12, rc=7, cib-update=17,
> confirmed=true) not running
> Dec 03 16:39:41 cluster2 crmd: [1699]: info: process_lrm_event: LRM
> operation drbd_home:1_monitor_0 (call=10, rc=7, cib-update=18,
> confirmed=true) not running
> Dec 03 16:39:41 cluster2 crmd: [1699]: info: process_lrm_event: LRM
> operation drbd_mysql:1_monitor_0 (call=11, rc=7, cib-update=19,
> confirmed=true) not running
> Dec 03 16:39:41 cluster2 attrd: [1697]: info: find_hash_entry: Creating hash
> entry for probe_complete
> Dec 03 16:39:41 cluster2 attrd: [1697]: info: attrd_trigger_update: Sending
> flush op to all hosts for: probe_complete (true)
> Dec 03 16:39:41 cluster2 attrd: [1697]: info: attrd_perform_update: Sent
> update 10: probe_complete=true
> Dec 03 16:39:41 cluster2 crmd: [1699]: info: do_lrm_rsc_op: Performing
> key=96:2:0:35c8ee9b-f99d-4d0a-9d16-41d119326713 op=ping_gw:0_start_0 )
> Dec 03 16:39:41 cluster2 lrmd: [1696]: info: rsc:ping_gw:0:14: start
> Dec 03 16:39:41 cluster2 crmd: [1699]: info: do_lrm_rsc_op: Performing
> key=14:2:0:35c8ee9b-f99d-4d0a-9d16-41d119326713 op=drbd_home:1_start_0 )
> Dec 03 16:39:41 cluster2 lrmd: [1696]: info: rsc:drbd_home:1:15: start
> Dec 03 16:39:41 cluster2 crmd: [1699]: info: do_lrm_rsc_op: Performing
> key=42:2:0:35c8ee9b-f99d-4d0a-9d16-41d119326713 op=drbd_mysql:1_start_0 )
> Dec 03 16:39:41 cluster2 lrmd: [1696]: info: rsc:drbd_mysql:1:16: start
> Dec 03 16:39:41 cluster2 crmd: [1699]: info: do_lrm_rsc_op: Performing
> key=70:2:0:35c8ee9b-f99d-4d0a-9d16-41d119326713 op=drbd_storage:1_start_0 )
> Dec 03 16:39:41 cluster2 lrmd: [1696]: info: rsc:drbd_storage:1:17: start
> Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output:
> (drbd_mysql:1:start:stdout)
>
> Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output:
> (drbd_storage:1:start:stdout)
>
> Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output:
> (drbd_home:1:start:stdout)
>
> Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output:
> (drbd_storage:1:start:stdout)
>
> Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output:
> (drbd_mysql:1:start:stdout)
>
> Dec 03 16:39:41 cluster2 crm_attribute: [2746]: info: Invoked: crm_attribute
> -N cluster2 -n master-drbd_storage:1 -l reboot -v 5
> Dec 03 16:39:41 cluster2 attrd: [1697]: info: attrd_trigger_update: Sending
> flush op to all hosts for: master-drbd_storage:1 (5)
> Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output:
> (drbd_mysql:1:start:stderr) 0: Failure: (124) Device is attached to a disk
> (use detach first)
>
> Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output:
> (drbd_mysql:1:start:stderr) Command 'drbdsetup 0 disk
> Dec 03 16:39:41 cluster2 lrmd: [1696]: info: RA output:
> (drbd_mysql:1:start:stderr) /dev/xvda5 /dev/xvda5 internal --set-defaults
> --create-device --fencing=resource-only --on-io-e
> rror=detach' terminated with exit code 10
>
> And then continues with
>
> Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output:
> (drbd_mysql:1:promote:stderr) 0: State change failed: (-7) Refusing to be
> Primary while peer is not outdated
>
> Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output:
> (drbd_mysql:1:promote:stderr) Command '
> Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output:
> (drbd_mysql:1:promote:stderr) drbdsetup
> Dec 03 16:40:42 cluster2 lrmd: [1696]: info: RA output

Re: [Openais] Pb with ais Library Error

2010-12-01 Thread Andrew Beekhof
On Wed, Dec 1, 2010 at 7:49 AM, Alain.Moulle  wrote:
> Hi Steve,
>
> I have some difficulties to follow the developpments ... what is
> exactly the "MCP deployment model" ?


http://theclusterguy.clusterlabs.org/post/907043024/introducing-the-pacemaker-master-control-process-for

>
> Thanks
> Alain
>
> Steven Dake a écrit :
>
> On 11/30/2010 12:36 AM, Alain.Moulle wrote:
>
>
> Hi Steve,
>
> thanks for your response.
> A clue for my problem: I have the problem when I set the heartbeat in
> corosync.conf on a bridge IF (br0); when I set the heartbeat with two ring
> numbers on two eth IFs, I no longer have the problem.
> I've tested with the latest release delivered with RHEL6, meaning:
> corosynclib-1.2.3-21.el6.x86_64
> corosync-1.2.3-21.el6.x86_64
> I can also test with 1.2.8 if you think it is relevant, but perhaps the
> fact that it
> happens when using a bridge IF can tell you if it has been fixed or not
> in any release ?
>
>
>
> A bug only related to bridging doesn't sound like a known issue.
>
> Are you using the pacemaker MCP deployment model?  If not, give that a
> shot - the plugin model seems to still have problems on some peoples
> deployments (which is why Andrew wrote the MCP system).
>
> 1.2.8 and rhel6 are about the same bits.
>
>
>
> By the way, could you tell me for this time and future eventual problems how
> I can get "a backtrace of the core file" so I can send it to you everytime ?
>
>
>
> http://www.corosync.org/doku.php?id=faq:crash
>
>
>
> And by the way again, what do you mean by "z stream" ?
>
>
>
> 1.2.8 X.Y.Z x=1, y=2, Z=8 - zstream is a rev of the z version number
> indicating a bug fix release.
>
>
>
> Thanks for your help
> Alain
>
>
> A backtrace of the core dump would be very helpful.  Since you are using
> prebuild packages, make sure to install the debuginfo packages of corosync.
>
>
>
> There are several bugs fixed between 1.2.1 and our current stable
> version 1.2.8.  I would recommend updating to the latest z stream.  For
> your specific failure scenario, I'd need to see a backtrace of the core
> file to tell you which patch you would want to cherrypick.
>
> Regards
> -steve
>
>
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
>
>
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] pingd - monitoring different subnet

2010-11-30 Thread Andrew Beekhof
On Thu, Nov 25, 2010 at 8:25 PM, Luc Paulin  wrote:
> Hi, I am looking to set up monitoring of the gre/tunnel interfaces of our
> clustered firewalls.
>
> Both firewall do have a gre/ipsec tunnel to a another host/site.
> What I would like to do is to add a pingd ressource which will monitor the 
> other endpoint of the gre tunnel. fw01 and fw02 share the same internet 
> connection, however to make the routing redundant I created a gre tunnel per 
> fw, so obviously they both have different set of ip. Here's a brief diag..
>
> fw01<-->gre(172.20.1.11/32)<-->gre(172.20.1.10)<-->ipsec1
> fw02<-->gre(172.20.1.21/32)<-->gre(172.20.1.20)<-->ipsec2
>
> I have tried the following, however it looks like when it fails over to the
> other system the resource is started on the other firewall.
>
> primitive conn-check-gre-fw01 ocf:pacemaker:pingd \
>        params host_list="172.20.1.11" multiplier="100" \
>        op monitor interval="15s" timeout="5s" dampen="6s"
> primitive conn-check-gre-fw02 ocf:pacemaker:pingd \
>        params host_list="172.20.1.22" multiplier="100" \
>        op monitor interval="15s" timeout="5s" dampen="6s"
> location fw01-check-gre conn-check-gre-fw01 inf: fw01
> location fw02-check-gre conn-check-gre-fw02 inf: fw02

try

location fw01-check-gre conn-check-gre-fw01 -inf: fw02
location fw02-check-gre conn-check-gre-fw02 -inf: fw01

this will ensure the pingd resources only run on the _other_ node.
then you need to use the pingd attributes somewhere:
   
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ch09s03s03s02.html
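
i.e. something along these lines (a sketch; "fw-services" stands for whatever
resource or group should move, and the default attribute name "pingd" is
assumed):

    crm configure location fw-services-on-connected-node fw-services \
        rule -inf: not_defined pingd or pingd lte 0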


>
> What I want to do is ..
>
> conn-check-gre-fw01 should be started only on fw01
> conn-check-gre-fw02 should be started only on fw02
>
> if one of the 2 resources fails, the other firewall should take over the
> other defined resources, but those 2 resources shouldn't be made redundant/started on
> the other node
>
>     -Luc
>
>
> CONFIDENTIALITY CAUTION
> This e-mail and any attachments may be confidential or legally privileged. If 
> you received this message in error or are not the intended recipient, you 
> should destroy the e-mail message and any attachments or copies, and you are 
> prohibited from retaining, distributing, disclosing or using any information 
> contained herein. Please inform us of the erroneous delivery by return 
> e-mail. Thank you for your cooperation.
> DOCUMENT CONFIDENTIEL
> Le présent courriel et tout fichier joint à celui-ci peuvent contenir des 
> renseignements confidentiels ou privilégiés. Si cet envoi ne s'adresse pas à 
> vous ou si vous l'avez reçu par erreur, vous devez l'effacer. Vous ne pouvez 
> conserver, distribuer, communiquer ou utiliser les renseignements qu'il 
> contient. Nous vous prions de nous signaler l'erreur par courriel. Merci de 
> votre collaboration.
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Child process of corosync outputs a core

2010-11-11 Thread Andrew Beekhof
On Thu, Nov 11, 2010 at 4:39 PM, Steven Dake  wrote:
> On 11/11/2010 02:35 AM, Andrew Beekhof wrote:
>> On Wed, Oct 27, 2010 at 5:15 PM, Steven Dake  wrote:
>>> On 10/26/2010 11:17 PM, Andrew Beekhof wrote:
>>>>
>>>> On Wed, Oct 27, 2010 at 7:32 AM, nozawat  wrote:
>>>>>
>>>>> Hi Andrew,
>>>>>
>>>>>  I send two log files of terminal.log and ha.log.
>>>>>  The contents of the terminal log are command results of "ps -ef|grep
>>>>> coro"
>>>>> and "crm_mon -f -1".
>>>>>
>>>>>  It is what processing completes normally when what did not understand me
>>>>> well watches log though corosync outputs core.
>>>>
>>>> Oct 27 10:53:12 hb0101 corosync[6695]:   [pcmk  ] plugin.c:1526 ERROR:
>>>> send_cluster_msg_raw: Child 7016 spawned to record non-fatal assertion
>>>> failure line 1526: rc == 0
>>>>
>>>> Oct 27 10:53:12 hb0101 corosync[6695]:   [pcmk  ] plugin.c:1526 ERROR:
>>>> send_cluster_msg_raw: Message not sent (-1):>>> cib_op="cib_replace" cib_delegated_from="hb0102"
>>>> cib_clientname="hb0102" cib_isreplyto="hb0102" original_c
>>>>
>>>> For some reason
>>>>     rc = pcmk_api->totem_mcast(&iovec, 1, TOTEMPG_SAFE);
>>>> is returning -1
>>>>
>>>>
>>>> Steve: would this happen if membership was in flux?
>>>> I thought only IPC got stopped.
>>>>
>>>
>>> it could
>>>
>>> If api->totem_mcast sends many messages it can fill up the totem queue and
>>> return -1.  The best solution to handling sending messages outside of IPC is
>>> to use the schedwrk api.  It will request a piece of work be done when the
>>> token is sent (and hopefully there are more spots in the new message queue).
>>>  It will continue to schedule work until 0 is retuned by the callback
>>> registered with schedwrk.
>>
>> what about a while-loop with a sleep in it?
>>
>>>
>
> That could cause all kinds of problems with the membership system timers
> resulting in wierd behavior and bad membership states.  That is why
> there is a schedwrk api.

Looks painful.
I think I'd prefer people moved to the MCP instead.

>
> Regards
> -steve
>
>>> Regards
>>> -steve
>>>
>>>>>
>>>>> Regards,
>>>>> Tomo
>>>>>
>>>>>
>>>>> 2010/10/27 Andrew Beekhof
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 11:22 AM, nozawat  wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> My environment is as follows.
>>>>>>>  * cluster-glue-1.0.6
>>>>>>>  * resource-agents-1.0.3
>>>>>>>  * corosync-1.2.8 (svn revision '3059')
>>>>>>>  * pacemaker-1.1.3-2f0326468a33acb1ada8fa744c7d36d0b315bd35
>>>>>>>
>>>>>>> Core file was output by corosync of the DC node when I load a crm file.
>>>>>>>
>>>>>>> It is the infomation of the core file as follows.
>>>>>>
>>>>>> log file?
>>>>>> you're tripping over an assertion, it would be good to know which one
>>>>>>
>>>>>>>
>>>>>>> [r...@hb0101 ~]$ file /var/lib/corosync/core.32727
>>>>>>> /var/lib/corosync/core.32727: ELF 64-bit LSB core file AMD x86-64,
>>>>>>> version 1
>>>>>>> (SYSV), SVR4-style, from 'corosync'
>>>>>>>
>>>>>>> [r...@hb0101 ~]$ gdb /usr/sbin/corosync /var/lib/corosync/core.32727
>>>>>>> GNU gdb Fedora (6.8-37.el5)
>>>>>>> Copyright (C) 2008 Free Software Foundation, Inc.
>>>>>>> License GPLv3+: GNU GPL version 3 or later
>>>>>>> <http://gnu.org/licenses/gpl.html>
>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>> There is NO WARRANTY, to the extent permitted by law.  Type "show
>>>>>>> copying"
>>>>>>> and "show warranty" for details.
>>>>>>> This GDB was configured as "x86_64-redhat-linux-gnu"...
>>>>>>> Reading symbols from

Re: [Openais] Child process of corosync outputs a core

2010-11-11 Thread Andrew Beekhof
On Wed, Oct 27, 2010 at 5:15 PM, Steven Dake  wrote:
> On 10/26/2010 11:17 PM, Andrew Beekhof wrote:
>>
>> On Wed, Oct 27, 2010 at 7:32 AM, nozawat  wrote:
>>>
>>> Hi Andrew,
>>>
>>>  I send two log files of terminal.log and ha.log.
>>>  The contents of the terminal log are command results of "ps -ef|grep
>>> coro"
>>> and "crm_mon -f -1".
>>>
>>>  It is what processing completes normally when what did not understand me
>>> well watches log though corosync outputs core.
>>
>> Oct 27 10:53:12 hb0101 corosync[6695]:   [pcmk  ] plugin.c:1526 ERROR:
>> send_cluster_msg_raw: Child 7016 spawned to record non-fatal assertion
>> failure line 1526: rc == 0
>>
>> Oct 27 10:53:12 hb0101 corosync[6695]:   [pcmk  ] plugin.c:1526 ERROR:
>> send_cluster_msg_raw: Message not sent (-1):> cib_op="cib_replace" cib_delegated_from="hb0102"
>> cib_clientname="hb0102" cib_isreplyto="hb0102" original_c
>>
>> For some reason
>>     rc = pcmk_api->totem_mcast(&iovec, 1, TOTEMPG_SAFE);
>> is returning -1
>>
>>
>> Steve: would this happen if membership was in flux?
>> I thought only IPC got stopped.
>>
>
> it could
>
> If api->totem_mcast sends many messages it can fill up the totem queue and
> return -1.  The best solution to handling sending messages outside of IPC is
> to use the schedwrk api.  It will request a piece of work be done when the
> token is sent (and hopefully there are more spots in the new message queue).
>  It will continue to schedule work until 0 is retuned by the callback
> registered with schedwrk.

what about a while-loop with a sleep in it?

>
> Regards
> -steve
>
>>>
>>> Regards,
>>> Tomo
>>>
>>>
>>> 2010/10/27 Andrew Beekhof
>>>>
>>>> On Tue, Oct 26, 2010 at 11:22 AM, nozawat  wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> My environment is as follows.
>>>>>  * cluster-glue-1.0.6
>>>>>  * resource-agents-1.0.3
>>>>>  * corosync-1.2.8 (svn revision '3059')
>>>>>  * pacemaker-1.1.3-2f0326468a33acb1ada8fa744c7d36d0b315bd35
>>>>>
>>>>> Core file was output by corosync of the DC node when I load a crm file.
>>>>>
>>>>> It is the infomation of the core file as follows.
>>>>
>>>> log file?
>>>> you're tripping over an assertion, it would be good to know which one
>>>>
>>>>>
>>>>> [r...@hb0101 ~]$ file /var/lib/corosync/core.32727
>>>>> /var/lib/corosync/core.32727: ELF 64-bit LSB core file AMD x86-64,
>>>>> version 1
>>>>> (SYSV), SVR4-style, from 'corosync'
>>>>>
>>>>> [r...@hb0101 ~]$ gdb /usr/sbin/corosync /var/lib/corosync/core.32727
>>>>> GNU gdb Fedora (6.8-37.el5)
>>>>> Copyright (C) 2008 Free Software Foundation, Inc.
>>>>> License GPLv3+: GNU GPL version 3 or later
>>>>> <http://gnu.org/licenses/gpl.html>
>>>>> This is free software: you are free to change and redistribute it.
>>>>> There is NO WARRANTY, to the extent permitted by law.  Type "show
>>>>> copying"
>>>>> and "show warranty" for details.
>>>>> This GDB was configured as "x86_64-redhat-linux-gnu"...
>>>>> Reading symbols from /usr/lib64/libtotem_pg.so.4...done.
>>>>> Loaded symbols for /usr/lib64/libtotem_pg.so.4
>>>>> Reading symbols from /usr/lib64/liblogsys.so.4...done.
>>>>> Loaded symbols for /usr/lib64/liblogsys.so.4
>>>>> Reading symbols from /usr/lib64/libcoroipcs.so.4...done.
>>>>> Loaded symbols for /usr/lib64/libcoroipcs.so.4
>>>>> Reading symbols from /lib64/librt.so.1...done.
>>>>> Loaded symbols for /lib64/librt.so.1
>>>>> Reading symbols from /lib64/libpthread.so.0...done.
>>>>> Loaded symbols for /lib64/libpthread.so.0
>>>>> Reading symbols from /lib64/libdl.so.2...done.
>>>>> Loaded symbols for /lib64/libdl.so.2
>>>>> Reading symbols from /lib64/libc.so.6...done.
>>>>> Loaded symbols for /lib64/libc.so.6
>>>>> Reading symbols from /usr/lib64/libssl3.so...done.
>>>>> Loaded symbols for /usr/lib64/libssl3.so
>>>>> Reading symbols from /usr/lib64/libsmime3.so...done.
>>

Re: [Openais] All Resources shutting down on the master in a two node cluster when corosync is stopped on the slave

2010-10-27 Thread Andrew Beekhof
On Wed, Oct 27, 2010 at 12:42 PM, Andrew Beekhof  wrote:
> On Wed, Oct 20, 2010 at 7:06 PM, Tom Pride  wrote:
>> Hi there,
>> Could someone please help me diagnose this problem, where if I run "service
>> corosync stop" on the slave server of a 2-node cluster, all of the resources
>> on the master server suddenly get shut down.
>
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch05s03.html

Oh, my bad.  You already had no-quorum-policy=ignore.
Better attach the logs as a file (reading logs that have been wrapped
by mail programs is impossible).
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Child process of corosync outputs a core

2010-10-26 Thread Andrew Beekhof
On Wed, Oct 27, 2010 at 7:32 AM, nozawat  wrote:
> Hi Andrew,
>
>  I send two log files of terminal.log and ha.log.
>  The contents of the terminal log are command results of "ps -ef|grep coro"
> and "crm_mon -f -1".
>
>  As far as I can tell from the log (though I don't understand it well),
> processing completes normally even though corosync outputs a core.

Oct 27 10:53:12 hb0101 corosync[6695]:   [pcmk  ] plugin.c:1526 ERROR:
send_cluster_msg_raw: Child 7016 spawned to record non-fatal assertion
failure line 1526: rc == 0

Oct 27 10:53:12 hb0101 corosync[6695]:   [pcmk  ] plugin.c:1526 ERROR:
send_cluster_msg_raw: Message not sent (-1): cib_op="cib_replace" cib_delegated_from="hb0102"
cib_clientname="hb0102" cib_isreplyto="hb0102" original_c

For some reason
    rc = pcmk_api->totem_mcast(&iovec, 1, TOTEMPG_SAFE);
is returning -1


Steve: would this happen if membership was in flux?
I thought only IPC got stopped.

>
> Regards,
> Tomo
>
>
> 2010/10/27 Andrew Beekhof 
>>
>> On Tue, Oct 26, 2010 at 11:22 AM, nozawat  wrote:
>> > Hi all,
>> >
>> > My environment is as follows.
>> >  * cluster-glue-1.0.6
>> >  * resource-agents-1.0.3
>> >  * corosync-1.2.8 (svn revision '3059')
>> >  * pacemaker-1.1.3-2f0326468a33acb1ada8fa744c7d36d0b315bd35
>> >
>> > Core file was output by corosync of the DC node when I load a crm file.
>> >
>> > It is the infomation of the core file as follows.
>>
>> log file?
>> you're tripping over an assertion, it would be good to know which one
>>
>> >
>> > [r...@hb0101 ~]$ file /var/lib/corosync/core.32727
>> > /var/lib/corosync/core.32727: ELF 64-bit LSB core file AMD x86-64,
>> > version 1
>> > (SYSV), SVR4-style, from 'corosync'
>> >
>> > [r...@hb0101 ~]$ gdb /usr/sbin/corosync /var/lib/corosync/core.32727
>> > GNU gdb Fedora (6.8-37.el5)
>> > Copyright (C) 2008 Free Software Foundation, Inc.
>> > License GPLv3+: GNU GPL version 3 or later
>> > <http://gnu.org/licenses/gpl.html>
>> > This is free software: you are free to change and redistribute it.
>> > There is NO WARRANTY, to the extent permitted by law.  Type "show
>> > copying"
>> > and "show warranty" for details.
>> > This GDB was configured as "x86_64-redhat-linux-gnu"...
>> > Reading symbols from /usr/lib64/libtotem_pg.so.4...done.
>> > Loaded symbols for /usr/lib64/libtotem_pg.so.4
>> > Reading symbols from /usr/lib64/liblogsys.so.4...done.
>> > Loaded symbols for /usr/lib64/liblogsys.so.4
>> > Reading symbols from /usr/lib64/libcoroipcs.so.4...done.
>> > Loaded symbols for /usr/lib64/libcoroipcs.so.4
>> > Reading symbols from /lib64/librt.so.1...done.
>> > Loaded symbols for /lib64/librt.so.1
>> > Reading symbols from /lib64/libpthread.so.0...done.
>> > Loaded symbols for /lib64/libpthread.so.0
>> > Reading symbols from /lib64/libdl.so.2...done.
>> > Loaded symbols for /lib64/libdl.so.2
>> > Reading symbols from /lib64/libc.so.6...done.
>> > Loaded symbols for /lib64/libc.so.6
>> > Reading symbols from /usr/lib64/libssl3.so...done.
>> > Loaded symbols for /usr/lib64/libssl3.so
>> > Reading symbols from /usr/lib64/libsmime3.so...done.
>> > Loaded symbols for /usr/lib64/libsmime3.so
>> > Reading symbols from /usr/lib64/libnss3.so...done.
>> > Loaded symbols for /usr/lib64/libnss3.so
>> > Reading symbols from /usr/lib64/libnssutil3.so...done.
>> > Loaded symbols for /usr/lib64/libnssutil3.so
>> > Reading symbols from /usr/lib64/libplds4.so...done.
>> > Loaded symbols for /usr/lib64/libplds4.so
>> > Reading symbols from /usr/lib64/libplc4.so...done.
>> > Loaded symbols for /usr/lib64/libplc4.so
>> > Reading symbols from /usr/lib64/libnspr4.so...done.
>> > Loaded symbols for /usr/lib64/libnspr4.so
>> > Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
>> > Loaded symbols for /lib64/ld-linux-x86-64.so.2
>> > Reading symbols from /usr/libexec/lcrso/objdb.lcrso...done.
>> > Loaded symbols for /usr/libexec/lcrso/objdb.lcrso
>> > Reading symbols from /usr/libexec/lcrso/coroparse.lcrso...done.
>> > Loaded symbols for /usr/libexec/lcrso/coroparse.lcrso
>> > Reading symbols from /usr/libexec/lcrso/pacemaker.lcrso...done.
>> > Loaded symbols for /usr/libexec/lcrso/pacemaker.lcrso
>> > Reading symbols from /usr/lib64/libplumb.so.2...done.
>> > Loaded symbols for /usr/lib64/libplumb.so.2
>> > Reading symbols from /usr/lib64/libpils.so.2...done.
>> > Loaded symbols for /usr/lib64/libpils.so.2
>> > Reading symbols from /u

Re: [Openais] Child process of corosync outputs a core

2010-10-26 Thread Andrew Beekhof
On Tue, Oct 26, 2010 at 11:22 AM, nozawat  wrote:
> Hi all,
>
> My environment is as follows.
>  * cluster-glue-1.0.6
>  * resource-agents-1.0.3
>  * corosync-1.2.8 (svn revision '3059')
>  * pacemaker-1.1.3-2f0326468a33acb1ada8fa744c7d36d0b315bd35
>
> Core file was output by corosync of the DC node when I load a crm file.
>
> It is the infomation of the core file as follows.

log file?
you're tripping over an assertion, it would be good to know which one

>
> [r...@hb0101 ~]$ file /var/lib/corosync/core.32727
> /var/lib/corosync/core.32727: ELF 64-bit LSB core file AMD x86-64, version 1
> (SYSV), SVR4-style, from 'corosync'
>
> [r...@hb0101 ~]$ gdb /usr/sbin/corosync /var/lib/corosync/core.32727
> GNU gdb Fedora (6.8-37.el5)
> Copyright (C) 2008 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...
> Reading symbols from /usr/lib64/libtotem_pg.so.4...done.
> Loaded symbols for /usr/lib64/libtotem_pg.so.4
> Reading symbols from /usr/lib64/liblogsys.so.4...done.
> Loaded symbols for /usr/lib64/liblogsys.so.4
> Reading symbols from /usr/lib64/libcoroipcs.so.4...done.
> Loaded symbols for /usr/lib64/libcoroipcs.so.4
> Reading symbols from /lib64/librt.so.1...done.
> Loaded symbols for /lib64/librt.so.1
> Reading symbols from /lib64/libpthread.so.0...done.
> Loaded symbols for /lib64/libpthread.so.0
> Reading symbols from /lib64/libdl.so.2...done.
> Loaded symbols for /lib64/libdl.so.2
> Reading symbols from /lib64/libc.so.6...done.
> Loaded symbols for /lib64/libc.so.6
> Reading symbols from /usr/lib64/libssl3.so...done.
> Loaded symbols for /usr/lib64/libssl3.so
> Reading symbols from /usr/lib64/libsmime3.so...done.
> Loaded symbols for /usr/lib64/libsmime3.so
> Reading symbols from /usr/lib64/libnss3.so...done.
> Loaded symbols for /usr/lib64/libnss3.so
> Reading symbols from /usr/lib64/libnssutil3.so...done.
> Loaded symbols for /usr/lib64/libnssutil3.so
> Reading symbols from /usr/lib64/libplds4.so...done.
> Loaded symbols for /usr/lib64/libplds4.so
> Reading symbols from /usr/lib64/libplc4.so...done.
> Loaded symbols for /usr/lib64/libplc4.so
> Reading symbols from /usr/lib64/libnspr4.so...done.
> Loaded symbols for /usr/lib64/libnspr4.so
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> Reading symbols from /usr/libexec/lcrso/objdb.lcrso...done.
> Loaded symbols for /usr/libexec/lcrso/objdb.lcrso
> Reading symbols from /usr/libexec/lcrso/coroparse.lcrso...done.
> Loaded symbols for /usr/libexec/lcrso/coroparse.lcrso
> Reading symbols from /usr/libexec/lcrso/pacemaker.lcrso...done.
> Loaded symbols for /usr/libexec/lcrso/pacemaker.lcrso
> Reading symbols from /usr/lib64/libplumb.so.2...done.
> Loaded symbols for /usr/lib64/libplumb.so.2
> Reading symbols from /usr/lib64/libpils.so.2...done.
> Loaded symbols for /usr/lib64/libpils.so.2
> Reading symbols from /usr/lib64/libbz2.so.1...done.
> Loaded symbols for /usr/lib64/libbz2.so.1
> Reading symbols from /usr/lib64/libxslt.so.1...done.
> Loaded symbols for /usr/lib64/libxslt.so.1
> Reading symbols from /usr/lib/libxml2.so.2...done.
> Loaded symbols for /usr/lib/libxml2.so.2
> Reading symbols from /lib64/libuuid.so.1...done.
> Loaded symbols for /lib64/libuuid.so.1
> Reading symbols from /lib64/libpam.so.0...done.
> Loaded symbols for /lib64/libpam.so.0
> Reading symbols from /lib64/libglib-2.0.so.0...done.
> Loaded symbols for /lib64/libglib-2.0.so.0
> Reading symbols from /usr/lib64/libz.so.1...done.
> Loaded symbols for /usr/lib64/libz.so.1
> Reading symbols from /lib64/libm.so.6...done.
> Loaded symbols for /lib64/libm.so.6
> Reading symbols from /lib64/libaudit.so.0...done.
> Loaded symbols for /lib64/libaudit.so.0
> Reading symbols from /lib64/libnss_files.so.2...done.
> Loaded symbols for /lib64/libnss_files.so.2
> Reading symbols from /usr/libexec/lcrso/service_evs.lcrso...done.
> Loaded symbols for /usr/libexec/lcrso/service_evs.lcrso
> Reading symbols from /usr/libexec/lcrso/service_cfg.lcrso...done.
> Loaded symbols for /usr/libexec/lcrso/service_cfg.lcrso
> Reading symbols from /usr/libexec/lcrso/service_cpg.lcrso...done.
> Loaded symbols for /usr/libexec/lcrso/service_cpg.lcrso
> Reading symbols from /usr/libexec/lcrso/service_confdb.lcrso...done.
> Loaded symbols for /usr/libexec/lcrso/service_confdb.lcrso
> Reading symbols from /usr/libexec/lcrso/service_pload.lcrso...done.
> Loaded symbols for /usr/libexec/lcrso/service_pload.lcrso
> Reading symbols from /usr/libexec/lcrso/vsf_quorum.lcrso...done.
> Loaded symbols for /usr/libexec/lcrso/vsf_quorum.lcrso
> Core was generated by `corosync'.
> Program terminated with signal 6, Aborted.
> [New process 32727]
> #0  0x003fff430265 in raise (

Re: [Openais] superfluous dependency in corosync spec file

2010-10-14 Thread Andrew Beekhof
On Thu, Oct 14, 2010 at 2:08 PM, Vadym Chepkov  wrote:
>
> On Oct 14, 2010, at 2:12 AM, Andrew Beekhof wrote:
>>
>> Since when was common sense a basis for reading distro packaging policies?
>> I'm just grateful they don't make us create a separate subpackage for
>> each library because they have different version numbers.
>>
>> *cough* debian *cough*
>
> I hate to break it to you, but they actually do :)

I know, that's why I mentioned them.

> And there are two ways of doing it.
> First, compat packages, like this one:
>
> compat-openldap-2.4.19_2.3.43-15.el6.i686.rpm
>
> Second, change the package name (mostly for "technology preview" packages)
>
> postgresql84-libs-8.4.5-1.el5_5.1.i386.rpm
>
> Vadym
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] superfluous dependency in corosync spec file

2010-10-13 Thread Andrew Beekhof
On Wed, Oct 13, 2010 at 10:44 PM, Vadym Chepkov  wrote:
>
> On Oct 12, 2010, at 6:14 PM, Vadym Chepkov wrote:
>
>>
>> On Oct 12, 2010, at 1:43 PM, Fabio M. Di NItto wrote:
>>
>>>
>>> what distribution are you looking at? In Fedora, where the spec file was
>>> first done as template for others to use and modify as needed, it's
>>> pretty much mandatory to have the subpackage Require the main package.
>>>
>>> http://fedoraproject.org/wiki/Packaging/Guidelines#RequiringBasePackage
>>>
>>> http://fedoraproject.org/wiki/Packaging/ReviewGuidelines#Things_To_Check_On_Review
>>>
>>> "SHOULD: Usually, subpackages other than devel should require the base
>>> package using a fully versioned dependency. [21]"
>>>
>>> If all the packages you mention above do not Require the main package,
>>> either they have an exception from the Fedora Board, or they are not
>>> strictly following the Fedora packaging guidelines.
>>
> …
>
>> I have asked the author of the guidelines, Tom Callaway,  hopefully he will 
>> respond.
>>
>> Vadym
>>
>
>
> Here is Tom's response:
> "
> I would agree with you. In the specific case of a %{name}-libs
> subpackage, which only contains shared libraries, that package does not
> need to explicitly depend on %{name} = %{version}-%{release} (unless
> there is some other technical reason, of course).
>
> I've proposed amending the Packaging Guidelines to reflect common sense
> here:
>
> https://fedorahosted.org/fpc/ticket/20
>
> The Fedora Packaging Committee should address this next week.
>
> ~spot
> "
>
> Seems "common sense" to him too

Since when was common sense a basis for reading distro packaging policies?
I'm just grateful they don't make us create a separate subpackage for
each library because they have different version numbers.

*cough* debian *cough*
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Corosync failing to start

2010-09-26 Thread Andrew Beekhof
On Sat, Sep 25, 2010 at 5:58 AM, Steven Dake  wrote:
> On 09/24/2010 05:55 PM, Lars Kellogg-Stedman wrote:
>>> pacemaker is waiting for something in nanosleep.  Not sure what.
>>
>> Should I ping the pacemaker list separately?  I'm not sure how much
>> overlap there is between here and there.
>>
>>>   The
>>> symptom you describe sounds like a inability for corosync to form a
>>> membership because of switch-default STP settings.
>>
>> I had a brief a-ha! moment: these systems are KVM guests.  Network
>> connectivity is through bridges on the Linux host, which default to a
>> 30 second forwarding delay.  Tragically, we had already set this to
>> zero:
>>
>>    # brctl showstp br613 | grep -i delay
>>    forward delay             0.00                 bridge forward delay       
>> 0.00
>>
>> And in fact these sytems use DHCP to acquire network settings, and if
>> the issue was STP this would prevent them from receiving a lease from
>> the DHCP server.
>>
>>> Try running the following on the node after a lockup:
>>> killall -SEGV corosync
>>> corosync-fplay
>>> attach output
>>
>> I've attached the output to this message.
>
>  From the fplay records, it looks like corosync has started up perfectly
> and acquired all nodes in the network (248,249,250).  I would suggest
> pinging the pacemaker list for further investigation.

This is the whole thing with child processes getting stuck between
fork() and exec().
Basically the corosync process space doesn't like forking children.

The solution is to grab 1.1.3 from www.clusterlabs.org/rpm-next and
use pacemaker's new mcp.
Option 2 as described at:
   
http://theclusterguy.clusterlabs.org/post/907043024/introducing-the-pacemaker-master-control-process-for
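For reference, "option 2" amounts to starting pacemakerd separately and telling the
corosync plugin not to fork the daemons itself. A minimal sketch only, assuming the
/etc/corosync/service.d/pcmk drop-in mentioned elsewhere in this archive and the init
scripts shipped with pacemaker >= 1.1.3:

    cat <<'EOF' > /etc/corosync/service.d/pcmk
    service {
        name: pacemaker
        ver:  1      # 1 = expect a separately started pacemakerd (the MCP)
    }
    EOF
    service corosync start
    service pacemaker start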
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Announcement: Perl bindings for Corosync's CPG

2010-09-13 Thread Andrew Beekhof
On Mon, Sep 13, 2010 at 2:36 PM, Florian Haas  wrote:
> On 2010-09-13 11:21, Chase Venters wrote:
>> On Monday 13 September 2010 3:45:50 am Florian Haas wrote:
>>> I realize I may be asking for a lot, but is there any chance you could
>>> rewrite your module to use SWIG, thereby making it more easily portable
>>> to languages other than Perl?
>>
>> I definitely see the value of having pieces of the corosync stack exposed to
>> more scripting languages.
>>
>> I've only glanced briefly at SWIG before. That was a few years ago. To be
>> honest I don't remember why I've tended to pass up SWIG for XS, but I'm happy
>> to take another look at SWIG when I find some time. Thankfully the CPG API
>> is nice and simple which made these bindings easy to create.
>>
>> Along these same lines, I wonder what other parts of the stack would be most
>> useful to expose? I've been thinking about doing something with the confdb...
>
> Are you using Pacemaker at all? If so, an object oriented wrapper around
> libcib would be high on my personal list. Probably best done as a C++
> wrapper around the libcib C API,

/me cries

> and then SWIG interfaces to expose the
> CIB to OO languages like Python and Ruby.
>
> Cheers,
> Florian
>
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] openais trunk - change shutdown priority to 80

2010-09-07 Thread Andrew Beekhof
On Tue, Sep 7, 2010 at 3:36 PM, Ryan O'Hara  wrote:
> On Sat, Sep 04, 2010 at 09:42:28PM +0200, Fabio M. Di NItto wrote:
>> On 09/04/2010 07:23 PM, Steven Dake wrote:
>> > On 09/03/2010 09:33 PM, Fabio M. Di NItto wrote:
>> >> On 09/03/2010 09:13 PM, Ryan O'Hara wrote:
>> >>>
>> >>> Same as Steve's patch to corosync init script.
>> >>>
>> >>
>> >> Can you also consider adding "Provides: corosync" as suggested in the
>> >> thread?
>> >>
>> >
>> > wouldn't openais need a requires?
>>
>> Not in the init script, no.
>>
>> The init script header can declare "Provides: corosync"; that will
>> allow any other package to require only corosync, and it would make no
>> difference to the init order if the user decides to start openais instead.
>>
>> Fabio
>
> I'm with Steve on this. Stating that the openais init "provides
> corosync" just seems wrong.

Makes perfect sense to me given that starting openais also starts corosync.
I'd agree with you if openais just loaded and unloaded plugins into an
already running corosync - but AFAIK it doesn't.
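For reference, the pattern under discussion is the LSB header at the top of the init
script. A sketch only; the Required-Start/Stop entries and runlevels here are
assumptions, not taken from the actual openais script:

    ### BEGIN INIT INFO
    # Provides:          openais corosync
    # Required-Start:    $network $remote_fs
    # Required-Stop:     $network $remote_fs
    # Default-Start:     3 5
    # Default-Stop:      0 1 2 6
    # Short-Description: openais/corosync cluster stack
    ### END INIT INFO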

> Then again, I don't see what the point of
> even having an openais init script is. At this point, openais should
> just provide AIS services as libraries. Nothing to "init".
>
> Ryan
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] corosync & syslog dependencies

2010-08-09 Thread Andrew Beekhof
On Sat, Aug 7, 2010 at 3:08 AM, Angus Salkeld  wrote:
> On Fri, Aug 06, 2010 at 11:06:14AM +0200, Alain.Moulle wrote:
>> Hi,
>>
>> About corosync & Pacemaker use :
>>
>> in my current release on RHEL6 : corosync-1.2.1-2.el6.x86_64 ,
>> the start of corosync requires the service syslog-ng to be started before,
>> otherwise, corosync  does not start correctly (and there are multiple
>> corosync processes launched, but this is another subject).

Likely these are child processes forked by pacemaker that get hung
before the call to exec().
Please try option 2 or 3:
   
http://theclusterguy.clusterlabs.org/post/907043024/introducing-the-pacemaker-master-control-process-for

>
> Hi Alain
>
> I would have thought that this is the problem to solve (instead of working
> around the problem below).
>
> A trivial corosync alone setup doesn't seem to do this, but I'll have a look
> at a corosync+pacemaker setup and see what I can do.
>
> Can you give me the logging section of your corosync config and what the
> symptoms are?
>
> Thanks
> -Angus
>
>>
>> I wonder: is this requirement that syslog-ng be started definitive, or will
>> it be possible (in some future release of the corosync rpm) to start corosync
>> while syslog-ng is stopped?
>>
>> Thanks
>> Regards
>>
>> Alain Moullé
>>
>>
>> ___
>> Openais mailing list
>> Openais@lists.linux-foundation.org
>> https://lists.linux-foundation.org/mailman/listinfo/openais
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] [Corosync] The corosync shared memory keeps increasing

2010-08-05 Thread Andrew Beekhof
On Wed, Aug 4, 2010 at 8:24 PM, Steven Dake  wrote:
> On 08/03/2010 10:02 AM, hj lee wrote:
>> Hi,
>>
>> I tried the latest version corosync 1.2.7 rpms from clusterlabs. The
>> problem is still there. Actually the latest version gets worse. In old
>> 1.1.2 version, the shared memory increases only when cib exachanges
>> messages through corosync. In the 1.2.7 version, corosync shared memory
>> keeps increasing even when the cluster is idle. Simply run the corosync
>> and watch the memory in the top every a few minutes, the shared memory
>> just keeps increasing. Isn't this a memory leak?
>>
>> 1. Shared memory increases when the cluster is idle. I highly suspect the
>> leak comes from the circular mmap in logsys.c.
>> [r...@silverthorne4 epel]# top -b -n1 | egrep "coro|cib|attrd"
>> 16579 root      RT   0  207m 4200 1920 S  0.0  0.1   0:00.07 corosync
>> 16587 hacluste  -8   0 69044 4536 2544 S  0.0  0.1   0:00.35 cib
>> 16589 hacluste  -8   0 69808 2436 2024 S  0.0  0.1   0:00.00 attrd
>>  after a few minutes later
>> [r...@silverthorne4 epel]# top -b -n1 | egrep "coro|cib|attrd"
>> 16579 root      RT   0  207m 4212 1932 S  0.0  0.1   0:00.07 corosync
>> 16587 hacluste  -8   0 69044 4536 2544 S  0.0  0.1   0:00.35 cib
>> 16589 hacluste  -8   0 69808 2436 2024 S  0.0  0.1   0:00.00 attrd
>>
>>
>> 2. Shared memory increases whenever pacemaker resource is started or
>> stopped. In this case, cib's shared memory also increases. I highly
>> suspect this leak comes from mmap in corosync ipc code.
>> [r...@silverthorne4 epel]# top -b -n1 | egrep "coro|cib|attrd"
>> 16579 root      RT   0  207m 4316 2036 S  0.0  0.1   0:00.11 corosync
>> 16587 hacluste  -8   0 69048 4596 2584 S  0.0  0.1   0:00.39 cib
>> 16589 hacluste  -8   0 69808 2436 2024 S  0.0  0.1   0:00.00 attrd
>>
>> [r...@silverthorne4 epel]# crm resource stop faultymon-clone
>>
>> [r...@silverthorne4 epel]# top -b -n1 | egrep "coro|cib|attrd"
>> 16579 root      RT   0  207m 4336 2056 S  0.0  0.1   0:00.13 corosync
>> 16587 hacluste  -8   0 69068 4620 2596 S  0.0  0.1   0:00.41 cib
>> 16589 hacluste  -8   0 69808 2436 2024 S  0.0  0.1   0:00.00 attrd
>>
>> Thanks
>> hj
>>
>
> corosync doesn't leak without pacemaker.
>
> The major difference in pacemaker is the use of fork.  Angus, one thing
> to try is madvise (DONTFORK) on the mmap sections.

perhaps try the latest 1.1 code which does the forking outside of corosync:
   
http://theclusterguy.clusterlabs.org/post/907043024/introducing-the-pacemaker-master-control-process-for
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] >>: drbd + pacemaker failback problems

2010-07-29 Thread Andrew Beekhof
On Thu, Jul 29, 2010 at 2:53 PM, Пленкин Алексей  wrote:
> Hi,
>
> I am using drbd 8.3.4, pacemaker 1.0.1 and openais 0.80.3.
> Everything works well, but I can't stop the fail-back.
> I tried resource-stickiness 100, but nothing changes: when the failed
> node comes back online, the resources migrate back.

Usually this is because some of the services are being started outside
of the cluster at boot time.
A more recent pacemaker version might help too.
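The usual check is whether anything outside pacemaker starts the stacked services at
boot. A sketch, assuming a Red Hat-style init; the script names (drbd, xendomains) are
illustrative, not taken from the reporter's setup:

    chkconfig --list drbd xendomains   # 'on' in any runlevel means boot-time startup
    chkconfig drbd off
    chkconfig xendomains off
    # stickiness only helps once nothing starts services behind the cluster's back
    crm configure rsc_defaults resource-stickiness=100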

>
> thanks
>
>
> there is my xml dump
>
> [CIB XML dump; the markup was stripped when this message was archived.
> Recoverable details: dc-version 1.0.2, expected-quorum-votes=2,
> no-quorum-policy=ignore, default-resource-stickiness=10, stonith-enabled=false;
> Filesystem resources for /dev/drbd4, /dev/drbd5 and /dev/drbd7 mounted at
> /xen/r4, /xen/r5 and /xen/r7; Xen guests dns1.cfg, dns2.cfg and monitoring.cfg;
> master/slave DRBD resources for r4, r5 and r7; location constraints preferring
> node dom0a; colocations tying each group to its ms_drbd_* Master; order
> constraints starting dns1, dns2 and monitor1 after the corresponding DRBD
> promote.]
>
>
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais

Re: [Openais] stonithd

2010-07-12 Thread Andrew Beekhof
On Mon, Jul 12, 2010 at 5:02 PM, Steven Dake  wrote:
> On 07/12/2010 07:09 AM, morphium wrote:
>> Hi,
>>
>> I today installed pacemaker from Debian squeeze and configured it, but
>> my syslog is filling with
>>
>> Jul 12 15:53:47 host crmd: [2174]: ERROR: stonithd_signon: Can't
>> initiate connection to stonithd
>> Jul 12 15:53:47 host crmd: [2174]: notice: Not currently connected.
>> Jul 12 15:53:47 host crmd: [2174]: ERROR: te_connect_stonith: Sign-in
>> failed: triggered a retry
>> Jul 12 15:53:47 host crmd: [2174]: info: te_connect_stonith:
>> Attempting connection to fencing daemon...
>>
>> What should I do? I already disabled stonith with stonith-enabled="false"
>>
>
> email the pacemaker list

Better yet, the debian maintainers.
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] corosync offline

2010-07-06 Thread Andrew Beekhof
On Tue, Jul 6, 2010 at 1:53 PM,   wrote:
>
> Hello,
>
> I've built a cluster with just two nodes; both of them see each other, but
>  they don't like to go online. This is my config:
>
> interface {
>         bindnetaddr:    172.28.87.0
>         mcastaddr:      226.94.1.1
>                 mcastport:      5420
>                 ringnumber:     0
> }
> Both nodes have the same config.
> ..
>
> # crm_mon --one-shot
> 
> Last updated: Tue Jul  6 13:38:39 2010
> Stack: openais
> Current DC: NONE
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> 
>
> OFFLINE: [ lis01 lis11 ]
> ..
>
>
> I made a tcpdump:
> ...
> 13:40:15.870996 IP 172.28.87.64.5419 > 226.94.1.1.5420: UDP, length 119
> 13:40:16.085725 IP 172.28.87.66.5419 > 226.94.1.1.5420: UDP, length 75
> 13:40:16.086270 IP 172.28.87.66.5419 > 226.94.1.1.5420: UDP, length 919
> 13:40:16.296619 IP 172.28.87.64.5419 > 226.94.1.1.5420: UDP, length 119
> 13:40:16.539215 IP 172.28.87.64.5419 > 226.94.1.1.5420: UDP, length 119
> 13:40:16.773796 IP 172.28.87.64.5419 > 226.94.1.1.5420: UDP, length 119
> 
>
> Most of the time only the .64 node is sending packets; just this excerpt also
> shows the .66 node after a long time.
> The tcpdump on the other node looks nearly the same: .64 sends most of the
> packets there too.
>
> When I stop openais (corosync) on .64, the other node sends continuously until
> .64 is online again.
> That suggests both nodes see each other.
>
>
> The syslog output:
>
>  # tail -f /var/log/messages
> Jul  6 13:42:55 lis11 crmd: [13107]: WARN: do_lrm_control: Failed to sign on
> to the LRM 6 (30 max) times
> Jul  6 13:42:57 lis11 crmd: [13107]: info: crm_timer_popped: Wait Timer
> (I_NULL) just popped!
> Jul  6 13:42:57 lis11 crmd: [13107]: WARN: lrm_signon: can not initiate
> connection
> Jul  6 13:42:57 lis11 crmd: [13107]: WARN: do_lrm_control: Failed to sign on
> to the LRM 7 (30 max) times
> Jul  6 13:42:59 lis11 crmd: [13107]: info: crm_timer_popped: Wait Timer
> (I_NULL) just popped!
> Jul  6 13:42:59 lis11 crmd: [13107]: WARN: lrm_signon: can not initiate
> connection
> ... and so on

So did you check if the lrmd was running (and if not, why not)?
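A quick way to check that (sketch; these are the daemon names used by the
corosync/pacemaker stack discussed in this thread):

    ps axf | egrep 'corosync|lrmd|crmd|cib|pengine|attrd|stonithd'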


> Jul  6 13:46:17 lis11 cib: [13507]: WARN: do_local_notify: A-Sync reply to
> crmd failed: reply failed
> Jul  6 13:46:17 lis11 corosync[13445]:   [pcmk  ] info: pcmk_ipc_exit:
> Client crmd (conn=0x68eba0, async-conn=0x68eba0) left
> Jul  6 13:46:17 lis11 corosync[13445]:   [pcmk  ] ERROR: pcmk_wait_dispatch:
> Child process crmd exited (pid=15909, rc=2)
> Jul  6 13:46:17 lis11 corosync[13445]:   [pcmk  ] ERROR: pcmk_wait_dispatch:
> Child respawn count exceeded by crmd
> Jul  6 13:46:17 lis11 corosync[13445]:   [pcmk  ] info: update_member: Node
> hhloklis11 now has process list: 0012 (1118482)
> Jul  6 13:46:17 lis11 corosync[13445]:   [pcmk  ] WARN: route_ais_message:
> Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Jul  6 13:47:06 lis11 corosync[13445]:   [pcmk  ] WARN: route_ais_message:
> Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Jul  6 13:47:54 lis11 cib: [13507]: info: cib_stats: Processed 28 operations
> (1071.00us average, 0% utilization) in the last 10min
> 
>
>
>
> OS is SuSE SLES11 SP1
>
> pacemaker-1.1.2-0.2.1
> pacemaker-mgmt-2.0.0-0.2.19
> corosync-1.2.1-0.5.1
> libcorosync4-1.2.1-0.5.1
> openais-1.1.2-0.5.19
> libopenais3-1.1.2-0.5.19
>
> openais config is empty.
>
>
> Kernel: 2.6.32.12-0.7-default      x86_64
>
>
> Any help?
>
>
> Thomas Schreiber
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Corosync 1.2.5 still hangs on startup

2010-07-01 Thread Andrew Beekhof
On Thu, Jul 1, 2010 at 3:09 PM, Keisuke MORI  wrote:
> Bad news...
>
> 2010/6/30 Andrew Beekhof :
>> On Wed, Jun 30, 2010 at 12:06 PM, Keisuke MORI
>>  wrote:
>>> 2010/6/29 Andrew Beekhof :
>>>> On Mon, Jun 28, 2010 at 2:20 PM, Keisuke MORI  
>>>> wrote:
>>>>> I've upgraded to pacemaker-1.0.9.1 / corosync-1.2.5 from clusterlabs on
>>>>> CentOS 5.5 using yum but it still hangs on its startup sometimes.
>>>>>
>>>>> The symptom is exactly same as this:
>>>>>  https://lists.linux-foundation.org/pipermail/openais/2010-June/014854.html
>>>>
>>>> Arrgghhh!!!
>>>>
>>>> Can you try the following patch?
>>>
>>> With the patch the problem disappeared!
>>> I've not been able to reproduce the hang with rebooting the node more
>>> than 10 times (which was enough to reproduce it previously).
>
> It didn't happen yesterday, but the same hang occurred again today.
>
> I also tried with corosync-1.2.6 but things didn't get better.
>
> Here is the stack trace and the corosync.conf when I reproduce it with
>  corosync-1.2.6.
> According to the core, fileno=10 looks broken, while fileno=0,1,2,3 seem sane.

Any chance you could do some digging to figure out where fileno=10 is
coming from?
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Corosync 1.2.5 still hangs on startup

2010-06-30 Thread Andrew Beekhof
On Wed, Jun 30, 2010 at 12:06 PM, Keisuke MORI
 wrote:
> 2010/6/29 Andrew Beekhof :
>> On Mon, Jun 28, 2010 at 2:20 PM, Keisuke MORI  
>> wrote:
>>> I've upgraded to pacemaker-1.0.9.1 / corosync-1.2.5 from clusterlabs on
>>> CentOS 5.5 using yum but it still hangs on its startup sometimes.
>>>
>>> The symptom is exactly same as this:
>>>  https://lists.linux-foundation.org/pipermail/openais/2010-June/014854.html
>>
>> Arrgghhh!!!
>>
>> Can you try the following patch?
>
> With the patch the problem disappeared!
> I've not been able to reproduce the hang with rebooting the node more
> than 10 times (which was enough to reproduce it previously).
>
> The patch should be definitely included to the flatiron.
> Thanks!

Awesome!
Checked into trunk as r2975.

Steve, can you backport to flatiron please?
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Corosync 1.2.5 still hangs on startup

2010-06-28 Thread Andrew Beekhof
On Mon, Jun 28, 2010 at 2:20 PM, Keisuke MORI  wrote:
> I've upgraded to pacemaker-1.0.9.1 / corosync-1.2.5 from clusterlabs on
> CentOS 5.5 using yum but it still hangs on its startup sometimes.
>
> The symptom is exactly same as this:
>  https://lists.linux-foundation.org/pipermail/openais/2010-June/014854.html

Arrgghhh!!!

Can you try the following patch?

Index: exec/main.c
===
--- exec/main.c.orig2010-06-21 18:59:32.0 +0200
+++ exec/main.c 2010-06-29 08:29:48.834736539 +0200
@@ -425,20 +425,9 @@ static void corosync_tty_detach (void)
/*
 * Map stdin/out/err to /dev/null.
 */
-   fd = open("/dev/null", O_RDWR);
-   if (fd >= 0) {
-   /* dup2 to 0 / 1 / 2 (stdin / stdout / stderr) */
-   close (STDIN_FILENO);
-   close (STDOUT_FILENO);
-   close (STDERR_FILENO);
-   dup2(fd, STDIN_FILENO);  /* 0 */
-   dup2(fd, STDOUT_FILENO); /* 1 */
-   dup2(fd, STDERR_FILENO); /* 2 */
-
-   /* Should be 0, but just in case it isn't... */
-   if (fd > 2)
-   close(fd);
-   }
+   freopen("/dev/null", "r", stdin);
+   freopen("/dev/null", "a", stderr);
+   freopen("/dev/null", "a", stdout);
 }

 static void corosync_mlockall (void)



>
> Any hints on what I should look into next?
> I have the core with me (taken by gcore) so if you want to look into
> it then I can show you.
>
> Here is the backtrace.
>
> 8<8<8<8<8<8<8<
> [r...@pm01 ~]# gdb /usr/sbin/corosync core.2596
> (...)
> (gdb) where
> #0  0x00377a607b35 in pthread_join () from /lib64/libpthread.so.0
> #1  0x2b12ea5528d9 in logsys_atexit () at logsys.c:1642
> #2  0x00405a85 in sigsegv_handler (num=)
> at main.c:222
> #3  
> #4  0x003779a9a2fa in fork () from /lib64/libc.so.6
> #5  0x2aba84de in spawn_child () from 
> /usr/libexec/lcrso/pacemaker.lcrso
> #6  0x2abacb9b in pcmk_startup () from
> /usr/libexec/lcrso/pacemaker.lcrso
> #7  0x004082c9 in corosync_service_link_and_init
> (corosync_api=0x613900, service_name=0x1f76c850 "pacemaker",
>    service_ver=0) at service.c:201
> #8  0x00408673 in corosync_service_defaults_link_and_init
> (corosync_api=0x613900) at service.c:534
> #9  0x00405086 in main_service_ready () at main.c:1224
> #10 0x2b12ea332425 in main_iface_change_fn
> (context=0x2aaae010, iface_addr=,
>    iface_no=) at totemsrp.c:4363
> #11 0x2b12ea3291a7 in timer_function_netif_check_timeout
> (data=0x1f793520) at totemudp.c:1380
> #12 0x2b12ea326459 in timerlist_expire (handle=150346236434579456)
> at tlist.h:309
> #13 poll_run (handle=150346236434579456) at coropoll.c:448
> #14 0x00406693 in main (argc=,
> argv=) at main.c:1576
> (gdb) q
> [r...@pm01 ~]# rpm -qa | grep corosync
> corosync-1.2.5-1.3.el5
> corosynclib-1.2.5-1.3.el5
> corosync-debuginfo-1.2.5-1.3.el5
> [r...@pm01 ~]# rpm -qa | grep pacemaker
> drbd-pacemaker-8.3.8-1
> pacemaker-libs-1.0.9.1-1.el5
> pacemaker-1.0.9.1-1.el5
> [r...@pm01 ~]# pgrep -lf corosync
> 2589 corosync
> 2596 corosync
> [r...@pm01 ~]# pgrep -lf heartbeat
> 2595 /usr/lib64/heartbeat/stonithd
> 2597 /usr/lib64/heartbeat/lrmd
> 2599 /usr/lib64/heartbeat/pengine
> 3473 /usr/lib64/heartbeat/crmd
>
> 8<8<8<8<8<8<8<
>
> Regards,
> --
> Keisuke MORI
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] recover from corosync daemon restart and cpg_finalize timing

2010-06-24 Thread Andrew Beekhof
On Thu, Jun 24, 2010 at 9:16 AM, Steven Dake  wrote:
> On 06/23/2010 11:35 PM, Andrew Beekhof wrote:
>>
>> On Thu, Jun 24, 2010 at 1:50 AM, dan clark<2cla...@gmail.com>  wrote:
>>>
>>> Dear Gentle Reader
>>>
>>> Attached is a small test program to stress initializing and finalizing
>>> communication between a corosync cpg client and the corosync daemon.
>>> The test was run under version 1.2.4.  Initial testing was with a
>>> single node, subsequent testing occurred on a system consisting of 3
>>> nodes.
>>>
>>> 1) If the program is run in such a way that it loops on the
>>> initialize/mcast_joined/dispatch/finalize AND the corosync daemon is
>>> restarted while the program is looping (service corosync restart) then
>>> the application locks up in the corosync client library in a variety
>>> of interesting locations.  This is easiest to reproduce in a single
>>> node system with a large iteration count and a usleep value between
>>> joins.  'stress_finalize -t 500 -i 1 -u 1000 -v'  Sometimes it
>>> recovers in a few seconds (analysis of strace indicated
>>> futex(...FUTEX_WAIT, 0, {1, 997888000}) ... which would account for
>>> multiple 2 second delays in error recovery from a lost corosync
>>> daemon).  Sometimes it locks up solid!   What is the proper way of
>>> handling the loss of the corosync daemon?  Is it possible to have the
>>> cpg library have a fast error recovery in the case of a failed daemon?
>>>
>>> sample back trace of lockup:
>>> #0  0x00363c60c711 in sem_wait () from /lib64/libpthread.so.0
>>> #1  0x00302a34 in coroipcc_msg_send_reply_receive (
>>>   handle=, iov=, iov_len=1,
>>>   res_msg=0x7fffaefecac0, res_len=24) at coroipcc.c:465
>>> #2  0x003000802db1 in cpg_leave (handle=1648075416440668160,
>>>   group=) at cpg.c:458
>>> #3  0x00400df8 in coInit (handle=0x7fffaefecdb0,
>>>   groupNameStr=0x7fffaefeccb0 "./stress_finalize_groupName-0", ctx=0x6e1)
>>>   at stress_finalize.c:101
>>> #4  0x0040138a in main (argc=8, argv=0x7fffaefecf28)
>>>   at stress_finalize.c:243
>>
>> I've also started getting semaphore related stack traces.
>>
>
> the stack trace from Dan is different from yours Andrew.  Yours is during
> startup.   Dan is more concerned about the fact that sem_timedwait sits
> around for 2 seconds before returning information indicating the server has
> exited or stopped.  (along with other issues)
>
>> #0  __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at
>> sem_init.c:45
>> 45        isem->value = value;
>> Missing separate debuginfos, use: debuginfo-install
>> audit-libs-2.0.1-1.fc12.x86_64 libgcrypt-1.4.4-8.fc12.x86_64
>> libgpg-error-1.6-4.x86_64 libtasn1-2.3-1.fc12.x86_64
>> libuuid-2.16-10.2.fc12.x86_64
>> (gdb) where
>> #0  __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at
>> sem_init.c:45
>> #1  0x7ff01e601e8e in coroipcc_service_connect (socket_name=<optimized out>, service=<optimized out>, request_size=1048576,
>> response_size=1048576, dispatch_size=1048576, handle=<optimized out>)
>>     at coroipcc.c:706
>> #2  0x7ff01ec1bb81 in init_ais_connection_once (dispatch=0x40e798
>> , destroy=0x40e8f2, our_uuid=0x0,
>> our_uname=0x6182c0, nodeid=0x0) at ais.c:622
>> #3  0x7ff01ec1ba22 in init_ais_connection (dispatch=0x40e798
>> , destroy=0x40e8f2, our_uuid=0x0,
>> our_uname=0x6182c0, nodeid=0x0) at ais.c:585
>> #4  0x7ff01ec16b90 in crm_cluster_connect (our_uname=0x6182c0,
>> our_uuid=0x0, dispatch=0x40e798, destroy=0x40e8f2, hb_conn=0x6182b0)
>> at cluster.c:56
>> #5  0x0040e9fb in cib_init () at main.c:424
>> #6  0x0040df78 in main (argc=1, argv=0x7194aaf8) at main.c:218
>> (gdb) print *isem
>> Cannot access memory at address 0x7ff01f81a008
>>
>> sigh
>>
>
> This code literally hasn't been modified for over a year - strange to start
> seeing errors now.
>
> Is your /dev/shm full?

Looks like it.
Also probably explains:

Program terminated with signal 7, Bus error.
#0  memset () at ../sysdeps/x86_64/memset.S:1050
1050movntdq %xmm0,(%rdi)
Missing separate debuginfos, use: debuginfo-install
libibverbs-1.1.3-3.fc12.x86_64 librdmacm-1.0.10-1.fc12.x86_64
nspr-4.8.2-1.fc12.x86_64 nss-3.12.4-14.fc12.x86_64
nss-util-3.12.4-8.fc12.x86_64
(gdb) where
#0  memset () at ../sysdeps/x86_64/memset.S:1050
#1  0x7f08d98453ad in circular_memory_map (bytes=, buf=0x208388) at /usr/include/bits/string
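A quick way to confirm the shared-memory situation (sketch):

    df -h /dev/shm     # corosync's IPC buffers are mmap'd here; a full filesystem
                       # explains a SIGBUS while writing to the mapping
    ls -l /dev/shm     # stale corosync buffers can be removed once the daemons are stopped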

Re: [Openais] recover from corosync daemon restart and cpg_finalize timing

2010-06-23 Thread Andrew Beekhof
On Thu, Jun 24, 2010 at 1:50 AM, dan clark <2cla...@gmail.com> wrote:
> Dear Gentle Reader
>
> Attached is a small test program to stress initializing and finalizing
> communication between a corosync cpg client and the corosync daemon.
> The test was run under version 1.2.4.  Initial testing was with a
> single node, subsequent testing occurred on a system consisting of 3
> nodes.
>
> 1) If the program is run in such a way that it loops on the
> initialize/mcast_joined/dispatch/finalize AND the corosync daemon is
> restarted while the program is looping (service corosync restart) then
> the application locks up in the corosync client library in a variety
> of interesting locations.  This is easiest to reproduce in a single
> node system with a large iteration count and a usleep value between
> joins.  'stress_finalize -t 500 -i 1 -u 1000 -v'  Sometimes it
> recovers in a few seconds (analysis of strace indicated
> futex(...FUTEX_WAIT, 0, {1, 997888000}) ... which would account for
> multiple 2 second delays in error recovery from a lost corosync
> daemon).  Sometimes it locks up solid!   What is the proper way of
> handling the loss of the corosync daemon?  Is it possible to have the
> cpg library have a fast error recovery in the case of a failed daemon?
>
> sample back trace of lockup:
> #0  0x00363c60c711 in sem_wait () from /lib64/libpthread.so.0
> #1  0x00302a34 in coroipcc_msg_send_reply_receive (
>   handle=, iov=, iov_len=1,
>   res_msg=0x7fffaefecac0, res_len=24) at coroipcc.c:465
> #2  0x003000802db1 in cpg_leave (handle=1648075416440668160,
>   group=) at cpg.c:458
> #3  0x00400df8 in coInit (handle=0x7fffaefecdb0,
>   groupNameStr=0x7fffaefeccb0 "./stress_finalize_groupName-0", ctx=0x6e1)
>   at stress_finalize.c:101
> #4  0x0040138a in main (argc=8, argv=0x7fffaefecf28)
>   at stress_finalize.c:243

I've also started getting semaphore related stack traces.

#0  __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at sem_init.c:45
45isem->value = value;
Missing separate debuginfos, use: debuginfo-install
audit-libs-2.0.1-1.fc12.x86_64 libgcrypt-1.4.4-8.fc12.x86_64
libgpg-error-1.6-4.x86_64 libtasn1-2.3-1.fc12.x86_64
libuuid-2.16-10.2.fc12.x86_64
(gdb) where
#0  __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at sem_init.c:45
#1  0x7ff01e601e8e in coroipcc_service_connect (socket_name=, service=, request_size=1048576,
response_size=1048576, dispatch_size=1048576, handle=)
at coroipcc.c:706
#2  0x7ff01ec1bb81 in init_ais_connection_once (dispatch=0x40e798
, destroy=0x40e8f2 , our_uuid=0x0,
our_uname=0x6182c0, nodeid=0x0) at ais.c:622
#3  0x7ff01ec1ba22 in init_ais_connection (dispatch=0x40e798
, destroy=0x40e8f2 , our_uuid=0x0,
our_uname=0x6182c0, nodeid=0x0) at ais.c:585
#4  0x7ff01ec16b90 in crm_cluster_connect (our_uname=0x6182c0,
our_uuid=0x0, dispatch=0x40e798, destroy=0x40e8f2, hb_conn=0x6182b0)
at cluster.c:56
#5  0x0040e9fb in cib_init () at main.c:424
#6  0x0040df78 in main (argc=1, argv=0x7194aaf8) at main.c:218
(gdb) print *isem
Cannot access memory at address 0x7ff01f81a008

sigh

>
> 2) If the test program is run with an iteration count of greater than
> about 10, group joins for the specified group name tend to start
> failing (CS_ERR_TRY_AGAIN) and never recover (trying again doesn't
> help :).  This test was run on a single node of a 3 node system (but
> may be reproduce similar problems on a smaller number of nodes).
> ' ./stress_finalize -i 10 -j 1 junk'
>
> 3) An unrelated observation is that if the corosync daemon is setup on
> two nodes that participate in multicast through a tunnel, the
> corosync daemon runs in a tight loop at very high priority level
> effectively halting the machine.  Is this because the basic daemon
> communication relies on message reflection of the underlying transport
> which would occur on an ethernet multicast but would not on a tunnel?
>
> An example setup for an ip tunnel might be something along the following 
> lines:
> modprobe ip_gre
> echo 1 > /proc/sys/net/ipv4/ip_forward
> ip tunnel add gre1 mode gre remote 10.x.y.z local 20.z.y.x ttl 127
> ip addr add 192.168.100.33/24 peer 192.168.100.11/24 dev gre1
> ip link set gre1 up multicast on
>
> Thank you for taking the time to consider these tests.  Perhaps future
> versions of the software package could include a similar set of tests
> illustrating proper behavior?
>
> dan
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] corosync 1.2.5 still doesn't shutdown properly

2010-06-22 Thread Andrew Beekhof
On Wed, Jun 23, 2010 at 8:22 AM, Alain.Moulle  wrote:
> Hi,
> With whatever release (i.e. currently with corosync-1.2.1-2.el6.x86_64),
> I always have trouble with the stop of corosync. And each
> time it failed when there were some failed actions reported
> by crm_mon.

That would seem to be a different issue.
Vadym hasn't got pacemaker running at all.

> Regards
> Alain
>
> On 06/22/2010 03:56 AM, Vadym Chepkov wrote:
>
>> Hi,
>>
>> I decided to check if I can start using corosync again on several of
>> my clusters (have to use heartbeat there at the moment).
>> I don't even have any services defined in corosync.conf, commented
>> pacemaker out, just plain corosync and it never goes down:
>>
>> # ps axf|grep corosync
>> 26294 pts/0S+ 0:00  |   \_ /bin/sh /sbin/service
>> corosync restart
>> 26299 pts/0S+ 0:01  |   \_ /bin/bash
>> /etc/init.d/corosync restart
>> 29249 pts/1S+ 0:00  \_ grep corosync
>> 25959 ?Ssl0:00 corosync
>>
>>
>> I attached to the process and this is where it hangs:
>>
>> (gdb) where
>> #0  0x0fe14134 in poll () from /lib/libc.so.6
>> #1  0x0ffbc530 in poll_run (handle=150346236434579456) at coropoll.c:413
>> #2  0x10006e50 in main (argc=<optimized out>, argv=<optimized out>) at main.c:1576
>>
>> How can I help to debug this problem?
>> It is 100% reproducible.
>>
>> Thank you,
>> Vadym
>> 
>
> Vadym,
>
> Thanks for the feedback.  I do test this scenario and it works for me:
>
> [r...@cast flatiron]# service corosync start
> Starting Corosync Cluster Engine (corosync):   [  OK  ]
> [r...@cast flatiron]# service corosync restart
> Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
> Waiting for corosync services to unload:.  [  OK  ]
> Starting Corosync Cluster Engine (corosync):   [  OK  ]
> [r...@cast flatiron]# service corosync stop
> Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
> Waiting for corosync services to unload:.  [  OK  ]
> [r...@cast flatiron]# service corosync start
> Starting Corosync Cluster Engine (corosync):   [  OK  ]
> [r...@cast flatiron]# /etc/init.d/corosync restart
> Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
> Waiting for corosync services to unload:.  [  OK  ]
> Starting Corosync Cluster Engine (corosync):   [  OK  ]
>
>
> One thing that would stop corosync from shutting down is if it couldn't
> enter operational state.  This often happens because of a firewall
> enabled on the ports corosync uses to communicate.
>
> The system logs would be helpful (with debug: on).
>
> Regards
> -steve
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] {patch] Corosync hangs on startup

2010-06-18 Thread Andrew Beekhof
Checked in as r2948.
Please backport to 1.2

On Fri, Jun 11, 2010 at 6:23 PM, Steven Dake  wrote:
> On 06/11/2010 09:00 AM, Andrew Beekhof wrote:
>>
>> This is a bit convoluted, but hang in there.
>>
>>
>> So there is this bug:
>>   http://developerbugs.linux-foundation.org/show_bug.cgi?id=2379
>>
>> Essentially, to reproduce, you stop syslog but leave it enabled in
>> corosync.conf.
>> Here is the logging section I used:
>>
>> logging {
>>   debug: on
>>   fileline: off
>>   to_syslog: yes
>>   to_stderr: no
>>   to_logfile: yes
>>   logfile: /tmp/corosync.log
>>   syslog_facility: daemon
>>   timestamp: on
>> }
>>
>> What would happen is that on startup, the (pacemaker) child processes
>> would deadlock _inside_ the call to fork().
>> This seemed to happen more often if the logfile didnt yet exist.
>>
>>
>> Here's the gdb stack trace:
>>
>> #0  0x003268407bfd in pthread_join (threadid=140599098124048,
>> thread_return=0x0) at pthread_join.c:89
>> #1  0x00406805 in sigsegv_handler (num=) at
>> main.c:212
>> #2
>> #3  fresetlockfiles () at ../nptl/sysdeps/unix/sysv/linux/fork.c:48
>> #4  __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/fork.c:158
>> #5  0x7fdfc5d95772 in spawn_child () from
>> /usr/libexec/lcrso/pacemaker.lcrso
>> #6  0x7fdfc5d99e4a in pcmk_startup () from
>> /usr/libexec/lcrso/pacemaker.lcrso
>> #7  0x00407f23 in corosync_service_link_and_init
>> (corosync_api=0x613c40, service_name=0x1020430 "pacemaker",
>> service_ver=) at service.c:201
>> #8  0x004082c1 in corosync_service_defaults_link_and_init
>> (corosync_api=0x613c40) at service.c:535
>> #9  0x00405df6 in main_service_ready () at main.c:1204
>> #10 0x00326e00f81f in main_iface_change_fn (context=0x7fdfc6027010,
>> iface_addr=, iface_no=) at
>> totemsrp.c:4347
>> #11 0x00326e009f7a in timer_function_netif_check_timeout
>> (data=0x10474d0) at totemudp.c:1359
>> #12 0x00326e006709 in timerlist_expire (handle=1197105576937521152) at
>> tlist.h:309
>> #13 poll_run (handle=1197105576937521152) at coropoll.c:409
>> #14 0x0040568b in main (argc=<optimized out>, argv=<optimized out>) at main.c:1556
>>
>>
>> As I see it, there are three problems here.
>>
>> 1) fork() is segfaulting
>> 2) sigsegv_handler() is doing _WAY_ too much.
>>    I don't believe it should be calling logsys_atexit() - certainly not if
>> logsys_atexit() then calls pthread_join
>> 3) logsys_atexit() is calling pthread_join() on a thread that doesn't
>> exist in the new process
>>
>
> the signal handling needs some work around segfault.
>
>>
>> I'll leave 2) and 3) for someone more knowledgeable. I investigated 1) as
>> the other two don't matter if fork() isn't segfaulting.
>>
>> I tried running valgrind and every fork produced the same complaint:
>>
>> ==00:00:00:03.081 23392== Invalid write of size 4
>> ==00:00:00:03.082 23392==    at 0x3267CA4E72: fork (fork.c:48)
>> ==00:00:00:03.082 23392==    by 0x7633771: spawn_child (in
>> /usr/libexec/lcrso/pacemaker.lcrso)
>> ==00:00:00:03.082 23392==    by 0x7637E49: pcmk_startup (in
>> /usr/libexec/lcrso/pacemaker.lcrso)
>> ==00:00:00:03.082 23392==    by 0x407E07: corosync_service_link_and_init
>> (service.c:201)
>> ==00:00:00:03.082 23392==    by 0x4081A0:
>> corosync_service_defaults_link_and_init (service.c:534)
>> ==00:00:00:03.082 23392==    by 0x405C85: main_service_ready (main.c:1211)
>> ==00:00:00:03.082 23392==    by 0x4C4B2BE: main_iface_change_fn
>> (totemsrp.c:4363)
>> ==00:00:00:03.082 23392==    by 0x4C42AD9:
>> timer_function_netif_check_timeout (totemudp.c:1380)
>> ==00:00:00:03.082 23392==    by 0x4C3F8DC: poll_run (tlist.h:309)
>> ==00:00:00:03.082 23392==    by 0x405543: main (main.c:1563)
>> ==00:00:00:03.082 23392==  Address 0x0 is not stack'd, malloc'd or
>> (recently) free'd
>>
>> Note that this is the same file and line that gdb reported.
>>
>> And if you look at line 48 of ./nptl/sysdeps/unix/sysv/linux/fork.c, you
>> see that its the body of this for-loop
>>
>> static void
>> fresetlockfiles (void)
>> {
>>   _IO_ITER i;
>>
>>   for (i = _IO_iter_begin(); i != _IO_iter_end(); i = _IO_iter_next(i))
>>     _IO_lock_init (*((_IO_lock_t *) _IO_iter_file(i)->_lock));
>> }
>>
>> I tracked down _IO_iter_begin(), which led me to _IO_list_all, which

[Openais] {patch] Corosync hangs on startup

2010-06-11 Thread Andrew Beekhof
This is a bit convoluted, but hang in there.


So there is this bug:
  http://developerbugs.linux-foundation.org/show_bug.cgi?id=2379

Essentially, to reproduce, you stop syslog but leave it enabled in 
corosync.conf.
Here is the logging section I used:

logging {
  debug: on
  fileline: off
  to_syslog: yes
  to_stderr: no
  to_logfile: yes
  logfile: /tmp/corosync.log
  syslog_facility: daemon
  timestamp: on
}

What would happen is that on startup, the (pacemaker) child processes would 
deadlock _inside_ the call to fork().
This seemed to happen more often if the logfile didnt yet exist.


Here's the gdb stack trace:

#0  0x003268407bfd in pthread_join (threadid=140599098124048, 
thread_return=0x0) at pthread_join.c:89
#1  0x00406805 in sigsegv_handler (num=) at 
main.c:212
#2  
#3  fresetlockfiles () at ../nptl/sysdeps/unix/sysv/linux/fork.c:48
#4  __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/fork.c:158
#5  0x7fdfc5d95772 in spawn_child () from /usr/libexec/lcrso/pacemaker.lcrso
#6  0x7fdfc5d99e4a in pcmk_startup () from 
/usr/libexec/lcrso/pacemaker.lcrso
#7  0x00407f23 in corosync_service_link_and_init 
(corosync_api=0x613c40, service_name=0x1020430 "pacemaker", service_ver=) at service.c:201
#8  0x004082c1 in corosync_service_defaults_link_and_init 
(corosync_api=0x613c40) at service.c:535
#9  0x00405df6 in main_service_ready () at main.c:1204
#10 0x00326e00f81f in main_iface_change_fn (context=0x7fdfc6027010, 
iface_addr=, iface_no=) at 
totemsrp.c:4347
#11 0x00326e009f7a in timer_function_netif_check_timeout (data=0x10474d0) 
at totemudp.c:1359
#12 0x00326e006709 in timerlist_expire (handle=1197105576937521152) at 
tlist.h:309
#13 poll_run (handle=1197105576937521152) at coropoll.c:409
#14 0x0040568b in main (argc=, argv=) at main.c:1556


As I see it, there are three problems here.

1) fork() is segfaulting
2) sigsegv_handler() is doing _WAY_ too much. 
   I don't believe it should be calling logsys_atexit() - certainly not if 
logsys_atexit() then calls pthread_join
3) logsys_atexit() is calling pthread_join() on a thread that doesn't exist in 
the new process


I'll leave 2) and 3) for someone more knowledgeable. I investigated 1) as the 
other two don't matter if fork() isn't segfaulting.

I tried running valgrind and every fork produced the same complaint:

==00:00:00:03.081 23392== Invalid write of size 4
==00:00:00:03.082 23392==at 0x3267CA4E72: fork (fork.c:48)
==00:00:00:03.082 23392==by 0x7633771: spawn_child (in 
/usr/libexec/lcrso/pacemaker.lcrso)
==00:00:00:03.082 23392==by 0x7637E49: pcmk_startup (in 
/usr/libexec/lcrso/pacemaker.lcrso)
==00:00:00:03.082 23392==by 0x407E07: corosync_service_link_and_init 
(service.c:201)
==00:00:00:03.082 23392==by 0x4081A0: 
corosync_service_defaults_link_and_init (service.c:534)
==00:00:00:03.082 23392==by 0x405C85: main_service_ready (main.c:1211)
==00:00:00:03.082 23392==by 0x4C4B2BE: main_iface_change_fn 
(totemsrp.c:4363)
==00:00:00:03.082 23392==by 0x4C42AD9: timer_function_netif_check_timeout 
(totemudp.c:1380)
==00:00:00:03.082 23392==by 0x4C3F8DC: poll_run (tlist.h:309)
==00:00:00:03.082 23392==by 0x405543: main (main.c:1563)
==00:00:00:03.082 23392==  Address 0x0 is not stack'd, malloc'd or (recently) 
free'd

Note that this is the same file and line that gdb reported. 

And if you look at line 48 of ./nptl/sysdeps/unix/sysv/linux/fork.c, you see 
that its the body of this for-loop

static void
fresetlockfiles (void)
{
  _IO_ITER i;

  for (i = _IO_iter_begin(); i != _IO_iter_end(); i = _IO_iter_next(i))
_IO_lock_init (*((_IO_lock_t *) _IO_iter_file(i)->_lock));
}

I tracked down _IO_iter_begin(), which led me to _IO_list_all, which led me to 
_IO_2_1_stderr_, which led me to test the following patch:

Index: exec/main.c
===
--- exec/main.c (revision 2943)
+++ exec/main.c (working copy)
@@ -418,6 +418,9 @@
fd = open("/dev/null", O_RDWR);
if (fd >= 0) {
/* dup2 to 0 / 1 / 2 (stdin / stdout / stderr) */
+   close (STDIN_FILENO);
+   close (STDOUT_FILENO);
+   close (STDERR_FILENO);
dup2(fd, STDIN_FILENO);  /* 0 */
dup2(fd, STDOUT_FILENO); /* 1 */
dup2(fd, STDERR_FILENO); /* 2 */


With the patch I've not been able to reproduce the hang and valgrind no longer 
complains.
So I'm reasonably certain its the correct fix.

Please ACK.
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] [announce] corosync 1.2.4 released

2010-06-11 Thread Andrew Beekhof
On Fri, Jun 11, 2010 at 4:03 PM, Colin  wrote:
> On Thu, Jun 10, 2010 at 12:22 AM, Steven Dake  wrote:
>>
>> This version has the following changes:
>> * Fixes defects in logsys which are crashing pacemaker installations.
>
> Hm, don't know whether I did something wrong, but I just compiled
> corosync 1.2.4, and then pacemaker 1.0.8, and the pair immediately
> crashes on startup because pacemaker invokes corosync's getpwnam() in
> exec/tsafe.c and the assert(0) gets triggered.

The latest versions from Hg have this fixed and it will be in 1.0.9
that's due "soon"

>
> It seems that tsafe.c wants to enforce, at runtime, that no library
> loaded into corosync calls a non-thread-safe function in the presence
> of threads; now the first thing I thought on seeing and (hopefully)
> understanding tsafe.c was to tell everybody writing extension for
> corosync to keep their code thread-safe, and throw it away.
>
> If, for whatever reasons, it is deemed necessary to keep this run-time
> work-around, one might go the whole way and implement thread-safe
> versions of the "offending" functions (on operating systems where it's
> necessary, for example on Darwin getpwnam() is already
> thread-safe)...?
>
> Regards, Colin
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] [Pacemaker] corosync/openais fails to start

2010-05-30 Thread Andrew Beekhof
On Thu, May 27, 2010 at 5:50 PM, Steven Dake  wrote:
> On 05/27/2010 08:40 AM, Diego Remolina wrote:
>>
>> Is there any workaround for this? Perhaps a slightly older version of
>> the rpms? If so where do I find those?
>>
>
> Corosync 1.2.1 doesn't have this issue apparently.  With corosync 1.2.1,
> please don't use "debug: on" keyword in your config options.  I am not sure
> where Andrew has corosync 1.2.1 rpms available.

Normal place: http://www.clusterlabs.org/rpm
People would need to explicitly specify a version when talking to yum though.
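For example (sketch; the version string must match what the repository actually
carries):

    yum install corosync-1.2.1-1.el5 corosynclib-1.2.1-1.el5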

>
> The corosync project itself doesn't release rpms.  See our policy on this
> topic:
>
> http://www.corosync.org/doku.php?id=faq:release_binaries
>
> Regards
> -steve
>
>> I cannot get the opensuse-ha rpms any more so I am stuck with a
>> non-functioning cluster.
>>
>> Diego
>>
>> Steven Dake wrote:
>>>
>>> This is a known issue on some platforms, although the exact cause is
>>> unknown. I have tried RHEL 5.5 as well as CentOS 5.5 with clusterrepo
>>> rpms and been unable to reproduce. I'll keep looking.
>>>
>>> Regards
>>> -steve
>>>
>>> On 05/27/2010 06:07 AM, Diego Remolina wrote:

 Hi,

 I was running the old rpms from the opensuse repo and wanted to change
 over to the latest packages from the clusterlabs repo in my RHEL 5.5
 machines.

 Steps I took
 1. Disabled the old repo
 2. Set the nodes to standby (two node drbd cluster) and turned of
 openais
 3. Enabled the new repo.
 4. Performed an update with yum -y update which replaced all packages.
 5. The configuration file for ais was renamed openais.conf.rpmsave
 6. I ran corosync-keygen and copied the key to the second machine
 7. I copied the file openais.conf.rpmsave to /etc/corosync/corosync.conf
 and modified it by removing the service section and moving that to
 /etc/corosync/service.d/pcmk
 8. I copied the configurations to the other machine.
 9. When I try to start either openais or corosync with the init scripts
 I get a failure and nothing that can really point me to an error in the
 logs.

 Updated packages:
 May 26 14:29:32 Updated: cluster-glue-libs-1.0.5-1.el5.x86_64
 May 26 14:29:32 Updated: resource-agents-1.0.3-2.el5.x86_64
 May 26 14:29:34 Updated: cluster-glue-1.0.5-1.el5.x86_64
 May 26 14:29:34 Installed: libibverbs-1.1.3-2.el5.x86_64
 May 26 14:29:34 Installed: corosync-1.2.2-1.1.el5.x86_64
 May 26 14:29:34 Installed: librdmacm-1.0.10-1.el5.x86_64
 May 26 14:29:34 Installed: corosynclib-1.2.2-1.1.el5.x86_64
 May 26 14:29:34 Installed: openaislib-1.1.0-2.el5.x86_64
 May 26 14:29:34 Updated: openais-1.1.0-2.el5.x86_64
 May 26 14:29:34 Installed: libnes-0.9.0-2.el5.x86_64
 May 26 14:29:35 Installed: heartbeat-libs-3.0.3-2.el5.x86_64
 May 26 14:29:35 Updated: pacemaker-libs-1.0.8-6.1.el5.x86_64
 May 26 14:29:36 Updated: heartbeat-3.0.3-2.el5.x86_64
 May 26 14:29:36 Updated: pacemaker-1.0.8-6.1.el5.x86_64

 Apparently corosync is segfaulting when run from the command line:

 # /usr/sbin/corosync -f
 Segmentation fault

 Any help would be greatly appreciated.

 Diego



 ___
 Pacemaker mailing list: pacema...@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>
>
>
> ___
> Pacemaker mailing list: pacema...@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] fusion-io card, drbd, corosync, pacemaker stop issue

2010-05-25 Thread Andrew Beekhof
On Sat, May 22, 2010 at 1:13 AM, Dean Patterson  wrote:
> We are using the following to create a 2-node highly-available cluster:
>
> Disk device - fusion-io cards (PCIe SSD's)
> DRBD/Corosync/Pacemaker
>
> [r...@motest16 log]# rpm -qa | egrep "drbd|corosync|pacemaker"
> drbd-pacemaker-8.3.7-1
> drbd-8.3.7-1
> drbd-bash-completion-8.3.7-1
> drbd-xen-8.3.7-1
> drbd-km-debuginfo-8.3.7-12
> corosynclib-1.2.1-1.el5
> drbd-utils-8.3.7-1
> drbd-udev-8.3.7-1
> drbd-km-2.6.18_164.15.1.0.1.el5-8.3.7-12
> corosynclib-1.2.1-1.el5
> pacemaker-1.0.8-6.el5
> drbd-debuginfo-8.3.7-1
> drbd-heartbeat-8.3.7-1
> corosync-1.2.1-1.el5
> pacemaker-libs-1.0.8-6.el5
>
> [r...@motest16 log]# uname -r
> 2.6.18-164.15.1.0.1.el5
>
> Terminology:
> Pacemaker - Master/Slave
> DRBD      - Primary/Secondary
>

[snip]

> ### TEST CASE #2 ###
> OVERVIEW: Using dd from /dev/zero to test the switchover of drbd/pacemaker, and it
> fails. Pacemaker
> does not switch over the master/slave, indicating an issue with the
> corosync/pacemaker layer.

The cluster can't start fsFusion somewhere else until it safely
stopped on motest17.
Unfortunately the stop action failed (by timing out) and since stonith
was disabled, there was no way for the cluster to complete the
recovery.

Step 1, increase the timeouts.
Step 2, enable stonith (and add a stonith device)
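Roughly, in crm shell terms (a sketch only; the stonith agent and every parameter
value below are assumptions to be replaced with ones matching the real hardware):

    # Step 1: give the Filesystem resource a more generous stop timeout
    crm configure edit fsFusion
    #   ...and add inside the primitive:  op stop interval="0" timeout="180s"

    # Step 2: define a fencing device, then re-enable stonith
    crm configure primitive st-ipmi stonith:external/ipmi \
        params hostname="motest17.apple.com" ipaddr="192.0.2.10" userid="admin" passwd="secret"
    crm configure property stonith-enabled="true"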

[snip]

> fsFusion        (ocf::heartbeat:Filesystem):    Started motest17.apple.com 
> (unmanaged) FAILED
>
> Failed actions:   fsFusion_stop_0 (node=motest17.apple.com, call=54, rc=-2, 
> status=Timed Out): unknown exec error
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Corosync Node not rejoining cluster.

2010-05-20 Thread Andrew Beekhof
On Wed, May 19, 2010 at 6:42 PM, James Mackie  wrote:
> I have had this small 2 node cluster running since February. This morning
> one of the servers (Node2) stopped responding on the external network
> interface. To remedy this the server was rebooted at the console. (Not by
> me). When the node came back up it was showing the other node offline, and
> tried to take over all the services. The Node that was online the whole time
> (Node1) had taken over the services of Node2 when it was rebooted (the
> internal network on Node2 was still active and responding). Node1 shows
> Node2 offline, Node2 shows Node1 offline. I’ve put Node2 in standby using
> crm so it stopped trying to take back the services, since it was not
> co-ordinating with the other node.

Versions of corosync/pacemaker?

> How do I get the node back re-joined to the cluster properly? All my
> previous experience was that it just rejoined, and the services failed back
> over as expected. This is the first time that the expected behavior has not
> occurred.
>
>
>
> I read another mailing list post regarding something similar, having to do
> with nodeid changes. This is not the case here, I verified that the nodeid
> in the previous logs matches what the node currently has registered as its
> nodeid.
>
>
>
> That same post recommended deleting Node2 with crm on Node1 and restarting
> Node2, along with deleting all of /var/lib/heartbeat/* on Node2 to flush the
> CIB. My assumption is that this will sync to the cluster and update
> automatically.  Doesn’t sound like advice I’d prefer to take blindly, I hate
> assuming.

I'd not do that.


> Does anyone have any input that will point me in the right direction? Any
> input would be helpful. Thank you.

Easiest method is probably to set is-managed-default=false and restart
corosync on both hosts.
Then once they see each other, set it back to true.
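In crm shell terms that is roughly (sketch):

    crm configure property is-managed-default="false"
    service corosync restart              # on both hosts
    # wait until crm_mon shows both nodes online, then:
    crm configure property is-managed-default="true"

On builds where is-managed-default is deprecated, the per-resource equivalent is
crm configure rsc_defaults is-managed=false.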
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] plan for resolving corosync services unloading, problem blocking shutdown on opensuse

2010-05-10 Thread Andrew Beekhof
On Tue, May 11, 2010 at 7:52 AM, Steven Dake  wrote:
> On Tue, 2010-05-11 at 07:48 +0200, Alain.Moulle wrote:
>> Hi,
>> FYI: me too, I have debug: on and I faced the problem on RHEL5 as well
>> as on fc12.
>> Alain
>
> I have found the root cause, which I believe is related to your issues.
> Basically, with debug: on the internal buffers inside logsys are
> overflowed, triggering a spinning condition and a lack of proper logging
> operation.

Very strange, I run with "debug: on" most of the time and never hit this.
Very happy you've managed to track down the issue though :-)
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] What can I do when facing "Waiting for corosync services to unload:........."

2010-05-10 Thread Andrew Beekhof
On Mon, May 10, 2010 at 8:31 AM, Alain.Moulle  wrote:
>
> I meant  "/etc/init.d/corosync stop" never returns.

Ok. Can you show us the logs and "ps axf" please?
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] What can I do when facing "Waiting for corosync services to unload:........."

2010-05-07 Thread Andrew Beekhof
On Fri, May 7, 2010 at 2:10 PM, Alain.Moulle  wrote:
> Hi,
>
> Good news, I think: I got a good clue to identify the "unload stalled"
> problem; you'll
> tell me if it really helps.
> In fact, I got the message again:
> "Waiting for corosync services to unload:..."
> and from there I did from another window :
> crm_mon
> and saw that a resource was marked as "failed" :
> restofencenode6_start_0 (node=node6, call=3, rc=2, status=complete): invalid
> parameter
> (I know why it fails to start, that's not the issue here in this email)
> so I did:
> crm resource stop restofencenode6
> crm resource cleanup restofencenode6
> and there just after the cleanup, the unload is effective
> and the /etc/init.d/corosync stop returns ok.
>
> Note that, although there is always the same invalid parameter, sometimes it
> stalls
> unloading the module, and sometimes it completes despite the resource not being cleaned up
> ...
> But 1 time in 4 it stalls in the unload phase ...
>
> Hope it helps ... ?
> Or tell me if I'm wrong anywhere ...

Almost certainly you're wrong here. Sorry.
The fact that "sometimes it completes despite the resource not being cleaned
up" is a convincing argument that it's unrelated.

What we need are logs from a failed shutdown.
I would expect them to contain something like:

   corosync[32440]:   [pcmk  ] notice: pcmk_shutdown: Shutdown complete

If you see that, the problem is not anything related to pacemaker.
Beyond that, please try to be more specific about what "hang" means.
Is the whole machine unresponsive or just that "/etc/init.d/corosync
stop" never returns?
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] What can I do when facing "Waiting for corosync services to unload:........."

2010-05-04 Thread Andrew Beekhof
Alain, clusterlabs has 1.2.1 now.  Could you try updating?

On Tue, May 4, 2010 at 2:48 PM, Jan Friesse  wrote:
> Hi,
> 1.2.0 has some shutdown issues. Try to upgrade to 1.2.1 (1.2.2 when
> released), and the problem should disappear.
>
> Regards,
>  Honza
>
>
> Alain.Moulle wrote:
>> Hi everybody,
>>
>> Thanks for all your responses... I have not seen the stall
>> again since this morning; it happens rarely but often
>> enough to be annoying. Note that I have no
>> resources configured, except the two stonith resources (for a two-node
>> cluster)
>> or 4 stonith resources (for a 4-node cluster).
>>
>> My corosync release is :
>> corosync-1.2.0-1.el5
>> so on RHEL5, but I already have encountered the problem also of fc12.
>>
>> And yes, it is under Pacemaker, and my rpms are :
>> pacemaker-1.0.8-2.el5
>> cluster-glue-1.0.3-1.el5
>> resource-agents-1.0.1-1.el5
>> and I also have (despite not useful for HA stack):
>> openais-1.1.0-1.el5
>>
>> Thanks
>> Regards.
>> Alain
>>
>> Jan Friesse a écrit :
>>> Alain,
>>> what version of corosync are you using?
>>>
>>> Are you using pacemaker?
>>>
>>> If you are using corosync 1.2.1 please try to send gdb bt of threads.
>>>
>>> Regards,
>>>   Honza
>>>
>>> Alain.Moulle wrote:
>>>
 Hi,

 When stopping corosync with /etc/init.d/corosync stop", I'm from time
 to time stalled
 during unload services :
 Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
 Waiting for corosync services to
 unload:.

 What could be the reasons ?
 What could I do to avoid this ?
 What could I do to force the unload without rebooting the node ?

 Thanks for help.
 Alain Moullé
 ___
 Openais mailing list
 Openais@lists.linux-foundation.org
 https://lists.linux-foundation.org/mailman/listinfo/openais

>>>
>>>
>>>
>>>
>>
>>
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] What can I do when facing "Waiting for corosync services to unload:........."

2010-05-04 Thread Andrew Beekhof
On Tue, May 4, 2010 at 9:10 AM, Alain.Moulle  wrote:
> Hi,
>
> When stopping corosync with /etc/init.d/corosync stop", I'm from time to
> time stalled
> during unload services :
> Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
> Waiting for corosync services to unload:.
>
> What could be the reasons ?

Waiting for a cluster resource (like a database) that takes a long
time to shut down.
Or possibly a bug.

> What could I do to avoid this ?

Don't run a database, or
check the logs to see what the node is doing.

"ps axf" is also often useful.
If corosync has active children, chances are it's a resource problem.
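
For example, something along these lines (exact ps options vary, so treat this as a sketch):

    # full process tree - look for corosync and anything still running underneath it
    ps axf
    # or just corosync's direct children
    ps --ppid "$(pidof corosync)" -o pid,args

If that list is empty, the hang is less likely to be a resource that is still stopping.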

> What could I do to force the unload without rebooting the node ?
>
> Thanks for help.
> Alain Moullé
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] What can I do when facing "Waiting for corosync services to unload:........."

2010-05-04 Thread Andrew Beekhof
On Tue, May 4, 2010 at 9:41 AM, Andreas Mock  wrote:
> -Ursprüngliche Nachricht-
> Von: "Alain.Moulle" 
>>What could I do to avoid this ?
>
> Don't use it.  ;-)

That sort of comment isn't going to win many friends on this list, is it?
Even with a smiley face.

It may not even be a corosync problem in this case, so let's not jump the gun.
I forget, are you seeing the problem on VMs or real hardware?
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Failover constraint problem

2010-04-19 Thread Andrew Beekhof
2010/4/19 Sándor Fehér :
> Hi,
>
> I changed the config as you suggested:
> ---
> colocation apache-group-on-ms-drbd0 inf: apache-group ms-drbd0:Master
> colocation co_nfs_client -inf: nfs_client ms-drbd0:Master
> order ms-drbd0-before-apache-group inf: ms-drbd0:promote apache-group:start
> ---
>
> Now I get this:

When you do what?
Make the change? Repeat the test? Something else?

> Online: [ node0 node1 ]
>
>  Resource Group: apache-group
>  fs0    (ocf::heartbeat:Filesystem):    Started node0
>  virtual-ip (ocf::heartbeat:IPaddr2):   Started node0
>  nfs_server (lsb:nfs-kernel-server):    Started node0
>  Master/Slave Set: ms-drbd0
>  Masters: [ node0 ]
>  Slaves: [ node1 ]
>  nfs_client (ocf::heartbeat:Filesystem):    Started node0 (unmanaged)
> FAILED
>
> Failed actions:
>     nfs_client_stop_0 (node=node0, call=21, rc=1, status=complete): unknown
> error
> node1:~#
>
> Here is the relevant part of daemon.log http://pastebin.com/L9scU4fy
>
> Thank you !
>
> Andrew Beekhof írta:
>
> On Sat, Apr 17, 2010 at 12:21 AM, Sandor Feher  wrote:
>
>
> Hi,
>
> First of all my goal is to set up a two-node cluster with pacemaker to
> serve our webhosting service.
> This config sites on two vmware virtual machines for testing purposes
> now. Both of them runs Debian Lenny.
>
> Here are the basic rules I set up:
>
> node0  has
>
> virtual ip
> drbd primary filesystem mounted under /mnt
> nfs server offers /mnt mount point to node1
>
> node1
>
> drbd secondary node
> nfs_client mounts node0's /mnt dir and it should be rw for both nodes
>
> If  node0 fails then node1 will act as primary drbd node, take over
> virtual ip and mount drbd partition under /mnt dir and will not start
> nfs_client resource because it makes no sense (nfs_client should be take
> down before drbd partition get mounted under /mnt).
> If node1 fails the nothing should be happen because nfs_client only run
> node which has secondary drbd partition
>
> So my problems are the following.
>
> 1.  If I migrate apache-group resorce to another node then nfs_client
> won't release the /mnt mount point (I know according to this config it
> should not).
>     I think I need some clever constraint to achieve this.
>
>
> Perhaps instead of:
>colocation co_nfs_client inf: nfs_client ms-drbd0:Slave
> try:
>colocation co_nfs_client -inf: nfs_client ms-drbd0:Master
>
>
>
>
> 2. If I shot down node1 (suppose that node0 the master at the moment and
> runs apache-group) then nothing happens as expected but if node1 comes
> online again the apache-group start to migrate to node1. I don't
> understand why
>
>
> because you told it to:
>location cli-prefer-apache-group apache-group \
>  rule $id="cli-prefer-rule-apache-group" inf: #uname eq node0
>
> Change inf to (for example) 1000
>
>
>
> because there is a constraint for this to get
> apache-group run on node which primary drbd resource and in this
> situation node0 is.
>
>
> crm configure show
>
> node node0 \
>        attributes standby="off"
> node node1 \
>        attributes standby="off"
> primitive drbd0 ocf:heartbeat:drbd \
>        params drbd_resource="r0" \
>        op monitor interval="59s" role="Master" timeout="30s" \
>        op monitor interval="60s" role="Slave" timeout="30s"
> primitive fs0 ocf:heartbeat:Filesystem \
>        params fstype="ext3" directory="/mnt" device="/dev/drbd0" \
>        meta target-role="Started"
> primitive nfs_client ocf:heartbeat:Filesystem \
>        params fstype="nfs" directory="/mnt/"
> device="192.168.1.40:/mnt/"
> options="hard,intr,noatime,rw,nolock,tcp,timeo=50" \
>        meta target-role="Stopped"
> primitive nfs_server lsb:nfs-kernel-server \
>        op monitor interval="1min"
> primitive virtual-ip ocf:heartbeat:IPaddr2 \
>        params ip="192.168.1.40" broadcast="192.168.1.255" nic="eth0"
> cidr_netmask="24" \
>        op monitor interval="21s" timeout="5s" target-role="Started"
> group apache-group fs0 virtual-ip nfs_server \
>        meta target-role="Started"
> ms ms-drbd0 drbd0 \
>        meta clone-max="2" notify="true" globally-unique="false"
> target-role="Started"
> location cli-prefer-apache-group apache-group \
>        rule $id="cli-prefer-rule-apache-group" inf: #un

Re: [Openais] Missing shutdown messages with corosync 1.2.1 and pacemaker

2010-04-19 Thread Andrew Beekhof
On Mon, Apr 12, 2010 at 3:19 PM, Andreas Mock  wrote:
> -Ursprüngliche Nachricht-
> Von: Andrew Beekhof 
> Gesendet: 12.04.2010 08:58:44
> An: Andreas Mock 
> Betreff: Re: [Openais] Missing shutdown messages with corosync 1.2.1 and 
> pacemaker
>
> Hi all,
>
>>You might want to include your corosync config file.
>
> See attached (for the second try with logging to file)
>
>>Does the same happen if you configure log-to-file?
>
> Yes, it's the same.

Just to be sure, you're saying the logs are also missing from /tmp/corosync.log?
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Failover constraint problem

2010-04-19 Thread Andrew Beekhof
On Sat, Apr 17, 2010 at 12:21 AM, Sandor Feher  wrote:
> Hi,
>
> First of all my goal is to set up a two-node cluster with pacemaker to
> serve our webhosting service.
> This config sites on two vmware virtual machines for testing purposes
> now. Both of them runs Debian Lenny.
>
> Here are the basic rules I set up:
>
> node0  has
>
> virtual ip
> drbd primary filesystem mounted under /mnt
> nfs server offers /mnt mount point to node1
>
> node1
>
> drbd secondary node
> nfs_client mounts node0's /mnt dir and it should be rw for both nodes
>
> If  node0 fails then node1 will act as primary drbd node, take over
> virtual ip and mount drbd partition under /mnt dir and will not start
> nfs_client resource because it makes no sense (nfs_client should be take
> down before drbd partition get mounted under /mnt).
> If node1 fails the nothing should be happen because nfs_client only run
> node which has secondary drbd partition
>
> So my problems are the following.
>
> 1.  If I migrate apache-group resorce to another node then nfs_client
> won't release the /mnt mount point (I know according to this config it
> should not).
>     I think I need some clever constraint to achieve this.

Perhaps instead of:
   colocation co_nfs_client inf: nfs_client ms-drbd0:Slave
try:
   colocation co_nfs_client -inf: nfs_client ms-drbd0:Master


> 2. If I shot down node1 (suppose that node0 the master at the moment and
> runs apache-group) then nothing happens as expected but if node1 comes
> online again the apache-group start to migrate to node1. I don't
> understand why

because you told it to:
   location cli-prefer-apache-group apache-group \
 rule $id="cli-prefer-rule-apache-group" inf: #uname eq node0

Change inf to (for example) 1000
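For example (untested sketch, keeping the same rule id):

    location cli-prefer-apache-group apache-group \
            rule $id="cli-prefer-rule-apache-group" 1000: #uname eq node0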

> because there is a constraint for this to get
> apache-group run on node which primary drbd resource and in this
> situation node0 is.
>
>
> crm configure show
>
> node node0 \
>        attributes standby="off"
> node node1 \
>        attributes standby="off"
> primitive drbd0 ocf:heartbeat:drbd \
>        params drbd_resource="r0" \
>        op monitor interval="59s" role="Master" timeout="30s" \
>        op monitor interval="60s" role="Slave" timeout="30s"
> primitive fs0 ocf:heartbeat:Filesystem \
>        params fstype="ext3" directory="/mnt" device="/dev/drbd0" \
>        meta target-role="Started"
> primitive nfs_client ocf:heartbeat:Filesystem \
>        params fstype="nfs" directory="/mnt/"
> device="192.168.1.40:/mnt/"
> options="hard,intr,noatime,rw,nolock,tcp,timeo=50" \
>        meta target-role="Stopped"
> primitive nfs_server lsb:nfs-kernel-server \
>        op monitor interval="1min"
> primitive virtual-ip ocf:heartbeat:IPaddr2 \
>        params ip="192.168.1.40" broadcast="192.168.1.255" nic="eth0"
> cidr_netmask="24" \
>        op monitor interval="21s" timeout="5s" target-role="Started"
> group apache-group fs0 virtual-ip nfs_server \
>        meta target-role="Started"
> ms ms-drbd0 drbd0 \
>        meta clone-max="2" notify="true" globally-unique="false"
> target-role="Started"
> location cli-prefer-apache-group apache-group \
>        rule $id="cli-prefer-rule-apache-group" inf: #uname eq node0
> colocation apache-group-on-ms-drbd0 inf: apache-group ms-drbd0:Master
> colocation co_nfs_client inf: nfs_client ms-drbd0:Slave
> order ms-drbd0-before-apache-group inf: ms-drbd0:promote apache-group:start
> order ms-drbd0-before-nfs_client inf: ms-drbd0:promote nfs_client:start
> property $id="cib-bootstrap-options" \
>        dc-version="1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75" \
>        cluster-infrastructure="openais" \
>        stonith-enabled="false" \
>        no-quorum-policy="ignore" \
>        expected-quorum-votes="2" \
>        last-lrm-refresh="1271453094"
>
> node1:~# crm_mon -1
> 
> Last updated: Fri Apr 16 23:49:30 2010
> Stack: openais
> Current DC: node0 - partition with quorum
> Version: 1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75
> 2 Nodes configured, 2 expected votes
> 3 Resources configured.
> 
>
> Online: [ node0 node1 ]
>
>  Resource Group: apache-group
>     fs0        (ocf::heartbeat:Filesystem):    Started node1
> (unmanaged) FAILED
>     virtual-ip (ocf::heartbeat:IPaddr2):       Stopped
>     nfs_server (lsb:nfs-kernel-server):        Stopped
>  Master/Slave Set: ms-drbd0
>     Masters: [ node0 ]
>     Slaves: [ node1 ]
>  nfs_client     (ocf::heartbeat:Filesystem):    Started node1
> (unmanaged) FAILED
>
> Failed actions:
>    nfs_client_start_0 (node=node0, call=98, rc=1, status=complete):
> unknown error
>    fs0_stop_0 (node=node1, call=9, rc=-2, status=Timed Out): unknown
> exec error
>    nfs_client_stop_0 (node=node1, call=7, rc=-2, status=Timed Out):
> unknown exec error
>
>
> I really appreciate any idea. Thank you in advance.
>
> Regards,   Sandor
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailm

Re: [Openais] [solved] Re: problem running ocfs2/o2cb with openais/pacemaker

2010-04-16 Thread Andrew Beekhof
On Wed, Apr 14, 2010 at 1:06 PM, Jürgen Herrmann
 wrote:
>
> On Tue, 13 Apr 2010 16:52:01 +0200, Andrew Beekhof 
> wrote:
>> On Tue, Apr 13, 2010 at 3:33 PM, Jürgen Herrmann
>>  wrote:
>>>
>>> On Mon, 12 Apr 2010 14:46:39 +0200, Andrew Beekhof 
>>> wrote:
>>>> Please keep all replies on the list.
>>>>
>>>> On Apr 12, 2010, at 2:44 PM, Jürgen Herrmann wrote:
>>>>
>>>>>
>>>>> On Mon, 12 Apr 2010 14:25:55 +0200, Andrew Beekhof
> 
>>>>> wrote:
>>>>>> What versions of openais (corosync?) and pacemaker are you using?
>>>>>
>>>>> app1a:~# apt-show-versions |grep pacemaker
>>>>> pacemaker/sid upgradeable from 1.0.8-3~bpo50+1 to 1.0.8+hg15494-2
>>>>>
>>>>> app1a:~# apt-show-versions |grep openais
>>>>> libopenais-dev/lenny uptodate 1.1.2-1~bpo50+1
>>>>> libopenais3/lenny uptodate 1.1.2-1~bpo50+1
>>>>> openais/lenny uptodate 1.1.2-1~bpo50+1
>>>>
>>>> Looks ok.
>>>> Perhaps ping the ocfs2 guys to see what control device its trying
> open.
>>> hmm, no response on the ocfs2 list yet. do you have *any* idea about
>>> which control device this error msg is talking about? ...or where to
>>> configure more verbose logging to dig deeper?
>>
>> Sorry no.
>> I don't have much to do with ocfs2 these days.
>
> SOLVED:
>
> after reading the code for ocfs2_controld.pcmk i figured the missing
> control device on my machines was "/dev/misc/ocfs2_control".
>
> i added a /etc/udev/rules.d/52-ocfs2.rules file with following content:
> KERNEL=="ocfs2_control", NAME="misc/ocfs2_control", MODE="0666"
>
> the control device is automagically created now an mounting ocfs2
> volumes works.

Excellent!
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Failover problem

2010-04-16 Thread Andrew Beekhof
On Fri, Apr 16, 2010 at 3:28 PM, Haussecker, Armin
 wrote:
> Hi,
>
> we have a 2-node-cluster based on SLES11 , openais (0.80.3-26.8.1) and
> pacemaker (1.0.5-0.5.6).

You're best off contacting Novell support for older versions.

There's really not enough in the log fragments below to make any
meaningful comment, but if you _attach_ the complete logs we might be
able to help.


> Sometimes the failover from one node (named
> cuzzonib) to the second node (named cuzzonia) fails with the following
> messages:
>
> Apr 16 13:16:14 cuzzonib lrmd: [6706]: info: Try to stop STONITH resource
>  : Device=external/ipmi
> Apr 16 13:16:14 cuzzonib crmd: [18479]: info: process_lrm_event: LRM
> operation iRMC_cuzzoniaInstance:0_stop_0 (call=51, rc=0, cib-update=108,
> confirmed=true) ok
> Apr 16 13:16:14 cuzzonib crmd: [18479]: info: match_graph_event: Action
> iRMC_cuzzoniaInstance:0_stop_0 (25) confirmed on cuzzonib (rc=0)
>
> Apr 16 13:16:14 cuzzonib crmd: [18479]: info: te_pseudo_action: Pseudo
> action 29 fired and confirmed
> Apr 16 13:16:14 cuzzonib crmd: [18479]: info: te_crm_command: Executing
> crm-event (79): do_shutdown on cuzzonib
> Apr 16 13:16:14 cuzzonib crmd: [18479]: info: te_crm_command: crm-event (79)
> is a local shutdown
>
> Apr 16 13:16:17 cuzzonib logger: /etc/xen/scripts/xen-hotplug-cleanup:
> XENBUS_PATH=backend/vkbd/4/0
> Apr 16 13:16:17 cuzzonib logger: /etc/xen/scripts/xen-hotplug-cleanup:
> XENBUS_PATH=backend/console/4/0
> Apr 16 13:16:17 cuzzonib logger: /etc/xen/scripts/xen-hotplug-cleanup:
> XENBUS_PATH=backend/vfb/4/0
> Apr 16 13:16:17 cuzzonib logger: /etc/xen/scripts/xen-hotplug-cleanup:
> XENBUS_PATH=backend/vif/4/0
> Apr 16 13:16:17 cuzzonib logger: /etc/xen/scripts/block: remove
> XENBUS_PATH=backend/vbd/4/51712
> Apr 16 13:16:17 cuzzonib logger: /etc/xen/scripts/block: remove
> XENBUS_PATH=backend/vbd/4/51744
> Apr 16 13:16:17 cuzzonib logger: /etc/xen/scripts/xen-hotplug-cleanup:
> XENBUS_PATH=backend/vbd/4/51712
> Apr 16 13:16:17 cuzzonib logger: /etc/xen/scripts/xen-hotplug-cleanup:
> XENBUS_PATH=backend/vbd/4/51744
>
> Apr 16 13:16:32 cuzzonib openais[18468]: [crm  ] notice: pcmk_shutdown:
> Still waiting for crmd (pid=18479, seq=6) to terminate..
> .
> Apr 16 13:16:38 cuzzonib openais[18468]: [TOTEM] The token was lost in the
> OPERATIONAL state.
> Apr 16 13:16:38 cuzzonib openais[18468]: [TOTEM] Receive multicast socket
> recv buffer size (262142 bytes).
> Apr 16 13:16:38 cuzzonib openais[18468]: [TOTEM] Transmit multicast socket
> send buffer size (262142 bytes).
> Apr 16 13:16:38 cuzzonib openais[18468]: [TOTEM] entering GATHER state from
> 2.
> Apr 16 13:16:58 cuzzonib openais[18468]: [TOTEM] entering GATHER state from
> 0.
> Apr 16 13:16:58 cuzzonib openais[18468]: [TOTEM] Creating commit token
> because I am the rep.
> Apr 16 13:16:58 cuzzonib openais[18468]: [TOTEM] Saving state aru 14b high
> seq received 14b
> Apr 16 13:16:58 cuzzonib openais[18468]: [TOTEM] Storing new sequence id for
> ring bb4
> Apr 16 13:16:58 cuzzonib openais[18468]: [TOTEM] entering COMMIT state.
> Apr 16 13:16:58 cuzzonib openais[18468]: [TOTEM] entering RECOVERY state.
> Apr 16 13:16:58 cuzzonib openais[18468]: [TOTEM] position [0] member
> 192.168.10.5:
> Apr 16 13:16:58 cuzzonib openais[18468]: [TOTEM] previous ring seq 2992 rep
> 192.168.10.3
> Apr 16 13:16:58 cuzzonib openais[18468]: [TOTEM] aru 14b high delivered 14b
> received flag 1
> Apr 16 13:16:58 cuzzonib openais[18468]: [TOTEM] Did not need to originate
> any messages in recovery.
> Apr 16 13:16:58 cuzzonib openais[18468]: [TOTEM] Sending initial ORF token
> Apr 16 13:16:58 cuzzonib openais[18468]: [CLM  ] CLM CONFIGURATION CHANGE
> Apr 16 13:16:58 cuzzonib openais[18468]: [CLM  ] New Configuration:
> Apr 16 13:16:58 cuzzonib openais[18468]: [CLM  ]    r(0)
> ip(192.168.10.5)
>
> Apr 16 13:16:58 cuzzonib openais[18468]: [CLM  ] Members Left:
> Apr 16 13:16:58 cuzzonib crmd: [18479]: notice: ais_dispatch: Membership
> 2996: quorum lost
> Apr 16 13:16:58 cuzzonib cib: [18475]: notice: ais_dispatch: Membership
> 2996: quorum lost
> Apr 16 13:16:58 cuzzonib crmd: [18479]: info: ais_status_callback: status:
> cuzzonia is now lost (was member)
>
> Apr 16 13:16:58 cuzzonib cib: [18475]: info: crm_update_peer: Node cuzzonia:
> id=51030208 state=lost (new) addr=r(0) ip(192.168.10.3)  votes=1 born=2992
> seen=2992 proc=00053312
>
> Afterwards the second cluster node (cuzzonia) is rebooted.
> What could be the reason for the problem ?
>
> Regards,
> Armin Haussecker
>
>
>
>
>
>
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] corosync shutdown timeout

2010-04-16 Thread Andrew Beekhof
On Thu, Apr 15, 2010 at 8:06 PM, Vadym Chepkov  wrote:
> pacemaker-1.0.8-4.el5

:-(

Can you create a bug for this please:
   http://developerbugs.linux-foundation.org/

Also, please include a hb_report for the period just before shutdown began.
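
Roughly something like this, adjusting the time window to just before the shutdown:

    hb_report -f "2010-04-15 15:00" /tmp/shutdown-report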
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] corosync shutdown timeout

2010-04-15 Thread Andrew Beekhof
On Thu, Apr 15, 2010 at 6:34 PM, Vadym Chepkov  wrote:
> In case of a shutdown yes, but in this particular case I did
>
> crm configure property is-managed-default=false.
>
> and it seems brings shutdown procedure to a stupor.

Then that's definitely a bug in pacemaker.
What version of pacemaker are you running?
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] corosync shutdown timeout

2010-04-15 Thread Andrew Beekhof
On Thu, Apr 15, 2010 at 5:29 PM, Vadym Chepkov  wrote:
> Hi,
>
> Is there a way to configure corosync timeout shutdown?
>
> # grep 'Still waiting' /var/log/messages
> Apr 15 15:13:37 ashlin02 corosync[3017]:   [pcmk  ] notice: pcmk_shutdown: 
> Still waiting for crmd (pid=3029, seq=6) to terminate...
> Apr 15 15:18:07 ashlin02 corosync[3017]:   [pcmk  ] notice: pcmk_shutdown: 
> Still waiting for crmd (pid=3029, seq=6) to terminate...
> Apr 15 15:22:07 ashlin02 corosync[3017]:   [pcmk  ] notice: pcmk_shutdown: 
> Still waiting for crmd (pid=3029, seq=6) to terminate...
> Apr 15 15:23:37 ashlin02 corosync[3017]:   [pcmk  ] notice: pcmk_shutdown: 
> Still waiting for crmd (pid=3029, seq=6) to terminate...
> Apr 15 15:24:07 ashlin02 corosync[3017]:   [pcmk  ] notice: pcmk_shutdown: 
> Still waiting for crmd (pid=3029, seq=6) to terminate...
>
>
> Because of this code in init.d script:
>
>        echo -n "Waiting for $prog services to unload:"
>        while status $prog > /dev/null 2>&1; do
>                sleep 1
>                echo -n "."
>        done

No, that's just the symptom.
Corosync is waiting for Pacemaker which is waiting for one of your
services to stop.

Do you really want the node to terminate with your $important_app not
cleanly stopped?

> The system doesn't go down and in case of a shutdown initiated by UPS it can 
> be disastrous.
>
>
> Sincerely yours,
>  Vadym Chepkov
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] problem running ocfs2/o2cb with openais/pacemaker

2010-04-13 Thread Andrew Beekhof
On Tue, Apr 13, 2010 at 3:33 PM, Jürgen Herrmann
 wrote:
>
> On Mon, 12 Apr 2010 14:46:39 +0200, Andrew Beekhof 
> wrote:
>> Please keep all replies on the list.
>>
>> On Apr 12, 2010, at 2:44 PM, Jürgen Herrmann wrote:
>>
>>>
>>> On Mon, 12 Apr 2010 14:25:55 +0200, Andrew Beekhof 
>>> wrote:
>>>> What versions of openais (corosync?) and pacemaker are you using?
>>>
>>> app1a:~# apt-show-versions |grep pacemaker
>>> pacemaker/sid upgradeable from 1.0.8-3~bpo50+1 to 1.0.8+hg15494-2
>>>
>>> app1a:~# apt-show-versions |grep openais
>>> libopenais-dev/lenny uptodate 1.1.2-1~bpo50+1
>>> libopenais3/lenny uptodate 1.1.2-1~bpo50+1
>>> openais/lenny uptodate 1.1.2-1~bpo50+1
>>
>> Looks ok.
>> Perhaps ping the ocfs2 guys to see what control device its trying open.
> hmm, no response on the ocfs2 list yet. do you have *any* idea about
> which control device this error msg is talking about? ...or where to
> configure more verbose logging to dig deeper?

Sorry no.
I don't have much to do with ocfs2 these days.
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] problem running ocfs2/o2cb with openais/pacemaker

2010-04-12 Thread Andrew Beekhof
Please keep all replies on the list.

On Apr 12, 2010, at 2:44 PM, Jürgen Herrmann wrote:

> 
> On Mon, 12 Apr 2010 14:25:55 +0200, Andrew Beekhof 
> wrote:
>> What versions of openais (corosync?) and pacemaker are you using?
> 
> app1a:~# apt-show-versions |grep pacemaker
> pacemaker/sid upgradeable from 1.0.8-3~bpo50+1 to 1.0.8+hg15494-2
> 
> app1a:~# apt-show-versions |grep openais
> libopenais-dev/lenny uptodate 1.1.2-1~bpo50+1
> libopenais3/lenny uptodate 1.1.2-1~bpo50+1
> openais/lenny uptodate 1.1.2-1~bpo50+1

Looks ok.
Perhaps ping the ocfs2 guys to see what control device it's trying to open.

> 
>> 
>> On Mon, Apr 12, 2010 at 2:00 PM, Jürgen Herrmann
>>  wrote:
>>> 
>>> hi!
>>> 
>>> i'm on debian lenny and trying to run ocfs2 on a dual primary
>>> drbd device. the drbd device is already set up as msDRBD0.
>>> 
>>> to get dlm_controld.pcmk i installed it from source (from
>>> cluster-suite-3.0.10)
>>> now i configured a resource "resDLM" with 2 clones:
>>>  primitive resDLM ocf:pacemaker:controld op monitor interval="120s"
>>>  clone cloneDLM resDLM meta globally-unique="false" interleave="true"
>>>  colocation colDLM_DRBD0 inf: cloneDLM msDRBD0:Master
>>>  order ordDRBD0_DLM inf: msDRBD0:promote cloneDLM:start
>>> -> seems to work.
>>> 
>>> 
>>> to get ocfs2_controld.pcmk i installed ocfs2-tools-1.4.3 from source.
>>> after adding the resource:
>>>  primitive resO2CB ocf:pacemaker:o2cb op monitor interval="120s"
>>>  clone cloneO2CB resO2CB meta globally-unique="false" interleave="true"
>>>  colocation colO2CB_DLM inf: cloneO2CB cloneDLM
>>>  order ordDLM_O2CB inf: cloneDLM cloneO2CB
>>> 
>>> i get the following errors in crm_mon:
>>> ==
>>> Failed actions:
>>>resO2CB:0_start_0 (node=app1b.xlhost.de, call=28, rc=1,
>>> status=complete): unknown error
>>>resO2CB:0_start_0 (node=app1a.xlhost.de, call=38, rc=1,
>>> status=complete): unknown error
>>> 
>>> 
>>> the relevant syslog entries:
>>> 
>>> Apr 12 13:15:18 app1a corosync[4638]:   [pcmk  ] info: pcmk_notify:
>>> Enabling node
>>>  notifications for child 8311 (0xd83090)
>>> Apr 12 13:15:18 app1a ocfs2_controld.pcmk: Error opening control
> device:
>>> Unable to  access cluster service
>>> 
>>> 
>>> 
>>> if i start "ocfs2_controld.pcmk -D" i get:
>>> ==
>>> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: init_ais_connection:
>>> Creating connection to our AIS plugin
>>> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: init_ais_connection:
> AIS
>>> connection established
>>> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: get_ais_nodeid: Server
>>> details: id=569559765 uname=app1a.xlhost.de cname=pcmk
>>> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: crm_new_peer: Node
>>> app1a.xlhost.de now has id: 569559765
>>> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: crm_new_peer: Node
>>> 569559765 is now known as app1a.xlhost.de
>>> 1271072439 setup_st...@168: Cluster connection established.  Local node
>>> id: 569559765
>>> 1271072439 setup_st...@172: Added Pacemaker as client 1 with fd 5
>>> 1271072439 setup_c...@609: Initializing CKPT service (try 1)
>>> 1271072439 setup_c...@615: Connected to CKPT service with handle
>>> 0x327b23c6
>>> 1271072439 call_ckpt_o...@160: Opening checkpoint
>>> "ocfs2:controld:21f2cad5" (try 1)
>>> 1271072439 call_ckpt_o...@170: Opened checkpoint
>>> "ocfs2:controld:21f2cad5"
>>> with handle 0x66334873
>>> 1271072439 call_section_wr...@340: Writing to section
>>> "daemon_max_protocol" on checkpoint "ocfs2:controld:21f2cad5" (try 1)
>>> 1271072439 call_section_cre...@292: Creating section
>>> "daemon_max_protocol"
>>> on checkpoint "ocfs2:controld:21f2cad5" (try 1)
>>> 1271072439 call_section_cre...@300: Created section
> "daemon_max_protocol"
>>> on checkpoint "ocfs2:controld:21f2cad5"
>>> 1271072439 call_section_wr...@340: Writing to section
>>> "ocfs2_max_protocol"
>>> on checkpoint "ocfs2:controld:21f2cad5" (try 1)
>>> 1271072439 call_section_cre...@292: C

Re: [Openais] problem running ocfs2/o2cb with openais/pacemaker

2010-04-12 Thread Andrew Beekhof
What versions of openais (corosync?) and pacemaker are you using?

On Mon, Apr 12, 2010 at 2:00 PM, Jürgen Herrmann
 wrote:
>
> hi!
>
> i'm on debian lenny and trying to run ocfs2 on a dual primary
> drbd device. the drbd device is already set up as msDRBD0.
>
> to get dlm_controld.pcmk i installed it from source (from
> cluster-suite-3.0.10)
> now i configured a resource "resDLM" with 2 clones:
>  primitive resDLM ocf:pacemaker:controld op monitor interval="120s"
>  clone cloneDLM resDLM meta globally-unique="false" interleave="true"
>  colocation colDLM_DRBD0 inf: cloneDLM msDRBD0:Master
>  order ordDRBD0_DLM inf: msDRBD0:promote cloneDLM:start
> -> seems to work.
>
>
> to get ocfs2_controld.pcmk i installed ocfs2-tools-1.4.3 from source.
> after adding the resource:
>  primitive resO2CB ocf:pacemaker:o2cb op monitor interval="120s"
>  clone cloneO2CB resO2CB meta globally-unique="false" interleave="true"
>  colocation colO2CB_DLM inf: cloneO2CB cloneDLM
>  order ordDLM_O2CB inf: cloneDLM cloneO2CB
>
> i get the following errors in crm_mon:
> ==
> Failed actions:
>    resO2CB:0_start_0 (node=app1b.xlhost.de, call=28, rc=1,
> status=complete): unknown error
>    resO2CB:0_start_0 (node=app1a.xlhost.de, call=38, rc=1,
> status=complete): unknown error
>
>
> the relevant syslog entries:
> 
> Apr 12 13:15:18 app1a corosync[4638]:   [pcmk  ] info: pcmk_notify:
> Enabling node
>  notifications for child 8311 (0xd83090)
> Apr 12 13:15:18 app1a ocfs2_controld.pcmk: Error opening control device:
> Unable to  access cluster service
>
>
>
> if i start "ocfs2_controld.pcmk -D" i get:
> ==
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: init_ais_connection:
> Creating connection to our AIS plugin
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: init_ais_connection: AIS
> connection established
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: get_ais_nodeid: Server
> details: id=569559765 uname=app1a.xlhost.de cname=pcmk
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: crm_new_peer: Node
> app1a.xlhost.de now has id: 569559765
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: crm_new_peer: Node
> 569559765 is now known as app1a.xlhost.de
> 1271072439 setup_st...@168: Cluster connection established.  Local node
> id: 569559765
> 1271072439 setup_st...@172: Added Pacemaker as client 1 with fd 5
> 1271072439 setup_c...@609: Initializing CKPT service (try 1)
> 1271072439 setup_c...@615: Connected to CKPT service with handle
> 0x327b23c6
> 1271072439 call_ckpt_o...@160: Opening checkpoint
> "ocfs2:controld:21f2cad5" (try 1)
> 1271072439 call_ckpt_o...@170: Opened checkpoint "ocfs2:controld:21f2cad5"
> with handle 0x66334873
> 1271072439 call_section_wr...@340: Writing to section
> "daemon_max_protocol" on checkpoint "ocfs2:controld:21f2cad5" (try 1)
> 1271072439 call_section_cre...@292: Creating section "daemon_max_protocol"
> on checkpoint "ocfs2:controld:21f2cad5" (try 1)
> 1271072439 call_section_cre...@300: Created section "daemon_max_protocol"
> on checkpoint "ocfs2:controld:21f2cad5"
> 1271072439 call_section_wr...@340: Writing to section "ocfs2_max_protocol"
> on checkpoint "ocfs2:controld:21f2cad5" (try 1)
> 1271072439 call_section_cre...@292: Creating section "ocfs2_max_protocol"
> on checkpoint "ocfs2:controld:21f2cad5" (try 1)
> 1271072439 call_section_cre...@300: Created section "ocfs2_max_protocol"
> on checkpoint "ocfs2:controld:21f2cad5"
> 1271072439 start_j...@588: Starting join for group "ocfs2:controld"
> 1271072439 start_j...@592: cpg_join succeeded
> 1271072439 l...@975: setup done
> ocfs2_controld[18489]: 2010/04/12_13:40:39 notice: ais_dispatch:
> Membership 156: quorum acquired
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: crm_update_peer: Node
> app1a.xlhost.de: id=569559765 state=member (new) addr=r(0)
> ip(213.202.242.161)  (new) votes=1 (new) born=156 seen=156
> proc=00013312 (new)
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: crm_new_peer: Node
> app1b.xlhost.de now has id: 586336981
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: crm_new_peer: Node
> 586336981 is now known as app1b.xlhost.de
> ocfs2_controld[18489]: 2010/04/12_13:40:39 info: crm_update_peer: Node
> app1b.xlhost.de: id=586336981 state=member (new) addr=r(0)
> ip(213.202.242.162)  votes=1 born=148 seen=156
> proc=00013312
> 1271072439 confchg...@495: confchg called
> 1271072439 daemon_cha...@398: ocfs2_controld (group "ocfs2:controld")
> confchg: members 1, left 0, joined 1
> 1271072439 cpg_joi...@909: CPG is live, we are the first daemon
> 1271072439 call_ckpt_o...@160: Opening checkpoint "ocfs2:controld" (try 1)
> 1271072439 call_ckpt_o...@170: Opened checkpoint "ocfs2:controld" with
> handle 0x2ae8944a0001
> 1271072439 call_section_wr...@340: Writing to section "daemon_protocol" on
> checkpoint "ocfs2:controld" (try 1)
> 1271072439 cal

Re: [Openais] Corosync Patch: Fix the default for COROSYNC_RUN_DIR

2010-04-12 Thread Andrew Beekhof
On Mon, Apr 12, 2010 at 12:46 AM, Steven Dake  wrote:
> On Sun, 2010-04-11 at 10:30 +0200, Andrew Beekhof wrote:
>> On Sun, Apr 11, 2010 at 1:59 AM, Steven Dake  wrote:
>> > On Sat, 2010-04-10 at 13:35 +0200, Andrew Beekhof wrote:
>> >> On Sat, Apr 10, 2010 at 6:18 AM, Fabio M. Di Nitto  
>> >> wrote:
>> >> > On 4/9/2010 8:17 PM, Steven Dake wrote:
>> >> >> On Fri, 2010-04-09 at 15:05 +0200, Andrew Beekhof wrote:
>> >> >>> This looks like a copy/paste error to me...
>> >> >>>
>> >> >>> The "RUN" in COROSYNC_RUN_DIR would seem to imply /var/run
>> >> >>> Also /var/lib is persistent and doesn't need to be created at startup.
>> >> >>> On the other-hand, LSB states that the contents of /var/run is blow
>> >> >>> away at boot time.
>> >> >>>
>> >> >>> So I'm reasonably sure the following patch is correct.
>> >> >>> Please ACK.
>> >> >>
>> >> >> In general "rundir" should probably be renamed to "libdir" since the
>> >> >> idea is that data stored there is persistent.
>> >> >>
>> >> >> Totem requires persistence between node boots of data stored with the
>> >> >> rundir path.
>> >> >
>> >> > /var/lib/corosync should be created at "make install" time and it's
>> >> > guaranteed to be there by packaging and after each reboot.
>> >> >
>> >> > /var/run/corosync is more complicated. As Andrew already mentioned LSB,
>> >> > we need to make sure that it's created at startup time. Most daemons can
>> >> > do that in the init script and be done with it. Corosync doesn't have
>> >> > that luxury because it can be invoked in several different ways (cman
>> >> > for example), therefor it needs to do the dir creation/check within the
>> >> > code as the init script is not always used.
>> >> >
>> >> > This is the problem we need to address basically.
>> >>
>> >> And what the patch does :-)
>> >>
>> >> There is no need, at runtime, to create /var/lib/corosync.
>> >> Particularly if its required to be persistent.
>> >> /var/run/corosync is a different story as Fabbio reiterated above.
>> >>
>> >> So given all that, the original patch makes the most sense.
>> >
>> > Oh missed the patch sorry.
>> >
>> > I did review it just now.  Hate to be a stickler to details, but the
>> > rundir environment + variable names should be something like lib instead
>> > (what is this called?).
>>
>> Oh I see what you mean.
>> rundir is used elsewhere in totemsrp.c
>>
>
> The issue is COROSYNC_RUN_DIR is used in ipc

Is it though?
I trawled the code last night and all I could find was:
  /var/run/some_ipc_file
not
  /var/run/corosync/some_ipc_file

So now I'm confused, do we actually need a /var/run/corosync directory
to ever be created?

> (required in some cases for
> non-persistent data) while a libdir is used in totemsrp (in all cases
> the use here is persistent).  We can create a specific lib dir
> environment variable for ipc - although i'm not sure what to call it.
> Suggestions welcome.
>
> Regards
> -steve
>
>
>> I'll send through a new patch which leaves that part intact.
>>
>> > I guess we can continue to create the dirs if
>> > they don't exist for self-installs.
>> >
>> > Regards
>> > -steve
>> >
>> >
>
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais

Re: [Openais] Missing shutdown messages with corosync 1.2.1 and pacemaker

2010-04-12 Thread Andrew Beekhof
You might want to include your corosync config file.
Does the same happen if you configure log-to-file?

On Fri, Apr 9, 2010 at 11:43 PM, Andreas Mock  wrote:
> Hi all,
>
> while trying to test corosync 1.2.1 and pacemaker 1.0.8 with CTS I found the 
> following
> problem. The expected shutdown messages of corosync can't be found in the 
> debug
> log.
>
> Attached you find the log of the cluster after doing a /etc/init.d/corosync 
> stop
> A string like 'Service engine unloaded: corosync' is not there.
>
> By the way: There are some WARN messages which are probably also interesting 
> to
> look at.
>
> Best regards
> Andreas Mock
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Corosync Patch: Fix the default for COROSYNC_RUN_DIR

2010-04-11 Thread Andrew Beekhof
On Sun, Apr 11, 2010 at 1:59 AM, Steven Dake  wrote:
> On Sat, 2010-04-10 at 13:35 +0200, Andrew Beekhof wrote:
>> On Sat, Apr 10, 2010 at 6:18 AM, Fabio M. Di Nitto  
>> wrote:
>> > On 4/9/2010 8:17 PM, Steven Dake wrote:
>> >> On Fri, 2010-04-09 at 15:05 +0200, Andrew Beekhof wrote:
>> >>> This looks like a copy/paste error to me...
>> >>>
>> >>> The "RUN" in COROSYNC_RUN_DIR would seem to imply /var/run
>> >>> Also /var/lib is persistent and doesn't need to be created at startup.
>> >>> On the other-hand, LSB states that the contents of /var/run is blow
>> >>> away at boot time.
>> >>>
>> >>> So I'm reasonably sure the following patch is correct.
>> >>> Please ACK.
>> >>
>> >> In general "rundir" should probably be renamed to "libdir" since the
>> >> idea is that data stored there is persistent.
>> >>
>> >> Totem requires persistence between node boots of data stored with the
>> >> rundir path.
>> >
>> > /var/lib/corosync should be created at "make install" time and it's
>> > guaranteed to be there by packaging and after each reboot.
>> >
>> > /var/run/corosync is more complicated. As Andrew already mentioned LSB,
>> > we need to make sure that it's created at startup time. Most daemons can
>> > do that in the init script and be done with it. Corosync doesn't have
>> > that luxury because it can be invoked in several different ways (cman
>> > for example), therefor it needs to do the dir creation/check within the
>> > code as the init script is not always used.
>> >
>> > This is the problem we need to address basically.
>>
>> And what the patch does :-)
>>
>> There is no need, at runtime, to create /var/lib/corosync.
>> Particularly if its required to be persistent.
>> /var/run/corosync is a different story as Fabbio reiterated above.
>>
>> So given all that, the original patch makes the most sense.
>
> Oh missed the patch sorry.
>
> I did review it just now.  Hate to be a stickler to details, but the
> rundir environment + variable names should be something like lib instead
> (what is this called?).

Oh I see what you mean.
rundir is used elsewhere in totemsrp.c

I'll send through a new patch which leaves that part intact.

> I guess we can continue to create the dirs if
> they don't exist for self-installs.
>
> Regards
> -steve
>
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais

Re: [Openais] Corosync Patch: Fix the default for COROSYNC_RUN_DIR

2010-04-10 Thread Andrew Beekhof
On Sat, Apr 10, 2010 at 6:18 AM, Fabio M. Di Nitto  wrote:
> On 4/9/2010 8:17 PM, Steven Dake wrote:
>> On Fri, 2010-04-09 at 15:05 +0200, Andrew Beekhof wrote:
>>> This looks like a copy/paste error to me...
>>>
>>> The "RUN" in COROSYNC_RUN_DIR would seem to imply /var/run
>>> Also /var/lib is persistent and doesn't need to be created at startup.
>>> On the other-hand, LSB states that the contents of /var/run is blow
>>> away at boot time.
>>>
>>> So I'm reasonably sure the following patch is correct.
>>> Please ACK.
>>
>> In general "rundir" should probably be renamed to "libdir" since the
>> idea is that data stored there is persistent.
>>
>> Totem requires persistence between node boots of data stored with the
>> rundir path.
>
> /var/lib/corosync should be created at "make install" time and it's
> guaranteed to be there by packaging and after each reboot.
>
> /var/run/corosync is more complicated. As Andrew already mentioned LSB,
> we need to make sure that it's created at startup time. Most daemons can
> do that in the init script and be done with it. Corosync doesn't have
> that luxury because it can be invoked in several different ways (cman
> for example), therefor it needs to do the dir creation/check within the
> code as the init script is not always used.
>
> This is the problem we need to address basically.

And that's what the patch does :-)

There is no need, at runtime, to create /var/lib/corosync.
Particularly if it's required to be persistent.
/var/run/corosync is a different story as Fabbio reiterated above.

So given all that, the original patch makes the most sense.
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais

[Openais] Corosync Patch: Fix the default for COROSYNC_RUN_DIR

2010-04-09 Thread Andrew Beekhof
This looks like a copy/paste error to me...

The "RUN" in COROSYNC_RUN_DIR would seem to imply /var/run
Also /var/lib is persistent and doesn't need to be created at startup.
On the other hand, LSB states that the contents of /var/run are blown
away at boot time.

So I'm reasonably sure the following patch is correct.
Please ACK.

[02:59 PM] beek...@f12 ~/Development/sources/corosync # svn diff
./exec/totemsrp.c
Index: exec/totemsrp.c
===
--- exec/totemsrp.c (revision 2756)
+++ exec/totemsrp.c (working copy)
@@ -775,7 +775,7 @@

rundir = getenv ("COROSYNC_RUN_DIR");
if (rundir == NULL) {
-   rundir = LOCALSTATEDIR "/lib/corosync";
+   rundir = LOCALSTATEDIR "/run/corosync";
}

res = mkdir (rundir, 0700);
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Unusual exit code with /etc/init.d/corosync stop (Steve - Please ack new patch)

2010-03-25 Thread Andrew Beekhof
On Thu, Mar 25, 2010 at 9:32 AM, Andreas Mock  wrote:
> -Ursprüngliche Nachricht-
> Von: Andrew Beekhof 
> Gesendet: 25.03.2010 09:15:11
> An: Andreas Mock 
> Betreff: Re: [Openais] Unusual exit code with /etc/init.d/corosync stop
>
>>On Tue, Mar 23, 2010 at 12:42 AM, Andreas Mock [ wrote:
>>> Hi all,
>>>
>>> I'm using corosync 1.2.0 from the packages of clusterlabs.org on openSuSE 
>>> 11.2.
>>> A correct /etc/init.d/corosync stop issues a return code of 1
>>
>>The rc code isn't coming from corosync at all.
>>Its coming from the last command in stop(), which is "echo".
>
> Where in my original post did I say that the return code comes from  corosync 
> (binary)??
>
> Please read the mail completely. In the first sentence I just described the
> version and platform I'm using and that the script /etc/init.d/corosync 
> issues a
> return code of 1 when stopping worked correctly.
>
> Some lines further - you can see them in your quoted post - I'll explain - 
> probably in bad English -
> what the reason for this return code is, as I investigated this problem by 
> debugging
> the script /etc/init.d/corosync.
>
> Read the rest of my mail carefully and you get the reason for that behaviour.
> a) The very last line is: exit $rtrn
> b) Where is the global variable $rtrn initialized and set??
> c) It gets set in shell function status!!
> d) When you do a stop and the stop works status is called the last time in 
> the while
> loop setting $rtrn to 1.
> e) This variable is never changed afterwards.
> f) It is returned by the last statement, look at a)

Do try to calm down a little.
I made a mistake, it happens when one tries responding to 40-50
conversations a day.

Patching after stop is the wrong place though; the root cause is status() not
using a local variable.

--- ./etc/init.d/corosync.old   2010-03-25 10:21:19.673779309 +0100
+++ ./etc/init.d/corosync   2010-03-25 10:23:47.318779319 +0100
@@ -40,13 +40,13 @@ failure()
 status()
 {
pid=$(pidof $1 2>/dev/null)
-   rtrn=$?
-   if [ $rtrn -ne 0 ]; then
+   rc=$?
+   if [ $rc -ne 0 ]; then
echo "$1 is stopped"
else
echo "$1 (pid $pid) is running..."
fi
-   return $rtrn
+   return $rc
 }
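
With that change a clean stop should report success, which can be checked with e.g.:

    /etc/init.d/corosync stop ; echo $?    # expect 0 once corosync has actually stopped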
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Unusual exit code with /etc/init.d/corosync stop

2010-03-25 Thread Andrew Beekhof
On Tue, Mar 23, 2010 at 12:42 AM, Andreas Mock  wrote:
> Hi all,
>
> I'm using corosync 1.2.0 from the packages of clusterlabs.org on openSuSE 
> 11.2.
> A correct /etc/init.d/corosync stop issues a return code of 1

The rc code isn't coming from corosync at all.
It's coming from the last command in stop(), which is "echo".

Please run the following and report the result:
   echo ; echo $?

On Fedora it produces:

[09:14 AM] r...@f12 ~/tmp # echo ; echo $?

0
[09:14 AM] r...@f12 ~/tmp #


> which definitely hurts
> the Cluster Test Suite when stopping the cluster stack asuming (IMHO 
> correctly)
> that a problem free execution of the rc script should return 0 and not 1.
>
>
>
> The problem is indirectly the setting of the return code variable $rtrn in 
> the while
>
> loop waiting for corosync to die. While loop is exited exactly when the status
>
> call delivers a 1 meaning that the process isn't there any more. This rc of 1
>
> will then be delivered as return code of the "stop"-call.
>
>
>
> Here's the patch just to show the little change.
>
> ---8<--
>
> --- /etc/init.d/corosync 2010-01-20 21:23:53.0 +0100
> +++ /tmp/corosync 2010-03-23 00:25:12.794065102 +0100
> @@ -138,6 +138,7 @@
> ;;
> stop)
> stop
> + rtrn=0
> ;;
> *)
> echo "usage: $0 
> {start|stop|restart|reload|force-reload|condrestart|try-restart|status}"
> ---8<--
>
>
>
> Best regards
>
> Andreas Mock
>
>
>
>
>
>
>
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais

Re: [Openais] Corosync can't start pacemaker due to syslog and creates a lots of corosync child processes

2010-03-25 Thread Andrew Beekhof
On Thu, Mar 25, 2010 at 2:50 AM, Thomas Guthmann  wrote:
> Hey Steven,
>
>> This is a distro specific bug.  Please file a bugzilla with the
>> appropriate distro to work out the runlevels on their system.  For
>> fedora which I test on mostly, rsyslog is runlevel 12.  Other distros
>> may be different.  The distributed init script is only a guide - it
>> isn't perfect for all distros by default.
>
> Ok, no worries. RPM is done by Fabbione and is available on clusterlabs
> so... is there a bugzilla for that ?
>
> I reckon chkconfig - 60 20 is fine for RHEL/Centos/EPEL-5

I just tried to replicate this and got nothing.
I completely disabled syslog from starting at boot, and corosync and
the rest of the cluster still come up just fine.
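
For anyone comparing setups, the boot ordering is easy to check (runlevel details differ per distro):

    chkconfig --list corosync
    chkconfig --list syslog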
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Strange behaviour of corosync

2010-03-23 Thread Andrew Beekhof
On Tue, Mar 23, 2010 at 9:07 PM, Andreas Mock  wrote:
> -Ursprüngliche Nachricht-
> Von: Andrew Beekhof 
> Gesendet: 23.03.2010 20:37:12
> An: Andreas Mock 
> Betreff: Re: [Openais] Strange behaviour of corosync
>>
>>Because the amount of time is determined by whatever resources you're running.
>>Someone with a couple of IPs needs only seconds but someone with a
>>dozen thumping big databases might need hours.
>>
>>So almost certainly any chosen period of time will be completely wrong.
>
> That's an argument...
>
>>> Because something has gone wrong if corosync is not able to
>>> stop properly. Am I wrong?
>>
>>Yes. Sorry.
>
> That's nothing new...I'm used to it as it happens from time to time.  :-)
>
> So there's no way to find out that corosync is doing nothing anymore and
> could be killed?

Best thing to do is run "ps axf" on the node.
That will tell you which stage it's up to.
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Strange behaviour of corosync

2010-03-23 Thread Andrew Beekhof
On Tue, Mar 23, 2010 at 8:19 PM, Andreas Mock  wrote:
> -Ursprüngliche Nachricht-
> Von: Andrew Beekhof 
> Gesendet: 23.03.2010 16:35:01
> An: sd...@redhat.com
> Betreff: Re: [Openais] Strange behaviour of corosync
>
>>> Andrew really did all the work on the init script so he should comment.
>>> I believe it is designed to allow pacemaker to shutdown in an orderly
>>> fashion as to not stonith the node (which may happen with a kill -9).
>>
>>Correct.  kill -9 == bad.
>
> IMHO my proposal was a little bit more differentiated.
> Besides "kill -9 == bad" I don't see a reason after sending
> a kill -TERM and waiting for seconds/minutes or whatever amount of time
> not to send a finite kill -9 to corosync.

Because the amount of time is determined by whatever resources you're running.
Someone with a couple of IPs needs only seconds but someone with a
dozen thumping big databases might need hours.

So almost certainly any chosen period of time will be completely wrong.

> Because something has gone wrong if corosync is not able to
> stop properly. Am I wrong?

Yes. Sorry.
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Strange behaviour of corosync

2010-03-23 Thread Andrew Beekhof
On Tue, Mar 23, 2010 at 12:28 AM, Steven Dake  wrote:
> On Tue, 2010-03-23 at 00:18 +0100, Andreas Mock wrote:
>> -Ursprüngliche Nachricht-
>> Von: Steven Dake 
>> Gesendet: 22.03.2010 22:56:03
>> An: Andreas Mock 
>> Betreff: Re: [Openais] Strange behaviour of corosync
>>
>> >
>> >Thank you for going to the trouble of gathering a backtrace.  This is a
>> >different defect fixed in openais which we couldn't duplicate in
>> >corosync.  The problem is line #18 pthread_join() after an exit
>> >function.  This means pthread_join() was called in an atexit() handler
>> >which posix is iffy on.
>>
>>
>> Hi Steven,
>>
>> this error showed IMHO room for improvement at another piece of code.
>> After your response I knew that the corosync process is not needed any more 
>> and
>> I wanted to realease the cpu from their 200%CPU usage burden. ;-)
>>
>> A /etc/init.d/corosync stop ended in printing:
>> Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
>> Waiting for corosync services to unload:...   many many many dots
>>
>> Probably the rc-script should be changed in a way that after waiting for
>> corosync to stop gracefully for a certain amount of time the script
>> should hit corosync with a kill -9. What do you think?
>>
>
> Andreas,
>
> Andrew really did all the work on the init script so he should comment.
> I believe it is designed to allow pacemaker to shutdown in an orderly
> fashion as to not stonith the node (which may happen with a kill -9).

Correct.  kill -9 == bad.
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Corosync cluster stack won't start

2010-03-22 Thread Andrew Beekhof
On Mon, Mar 22, 2010 at 8:44 PM, Tom Pride  wrote:
> Thanks for all your help Andrew and Thomas,
>
> I finally worked out what was going wrong.  A separate cluster of 2 servers
> on the same network were configured to multicast over the same port numbers
> as those I was specifying in the corosync.confs of the cluster I was working
> on.  So every time I tried to start my cluster is was failing because it was
> receiving conflicting communications from the other cluster.  After changing
> the port numbers, corosync now starts up without a problem.
>
> However,  I do have one more question: If pacemaker is supposed to replace
> heartbeat as the crm,

It doesn't replace heartbeat, it augments it.
It's a layer on top.

Or, if you prefer, it's a layer on top of corosync.
The packaging was recently changed to require the admin to choose a stack.

So while it will require heartbeat-libs and corosync-libs, you'll need
to explicitly specify one of heartbeat or corosync.
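
In other words, pick one stack explicitly at install time, e.g. (illustrative only):

    yum install -y pacemaker corosync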

>why is it that the pacemaker rpms that I am using from
> http://www.clusterlabs.org/rpm/epel-5/x86_64/ have a dependency on the
> heartbeat rpm?  You cannot install the pacemaker rpms without the heartbeat
> rpm (unless of course you use --no-deps).  The instructions on this page
> http://www.clusterlabs.org/wiki/Install#Binary_Packages specifically tell
> you to do the following in order to install the required software for a
> pacemaker cluster on Redhat Enterprise:
>
> yum install -y pacemaker corosync heartbeat
>
> Is it just that there are some shared scripts or binaries or libraries that
> pacemaker needs from heartbeat?
>
> Cheers,
> Tom
>
>
>
> On Mon, Mar 22, 2010 at 2:35 PM, Andrew Beekhof  wrote:
>>
>> On Sat, Mar 20, 2010 at 1:06 AM, Thomas Guthmann 
>> wrote:
>> > Hi Tom,
>> >
>> >> heartbeat-libs-3.0.2-2.el5.x86_64.rpm
>> >> heartbeat-3.0.2-2.el5.x86_64.rpm
>> >> openais-1.1.0-1.el5.x86_64.rpm
>> >> openaislib-1.1.0-1.el5.x86_64.rpm
>> >
>> > I reckon it could be due to the presence of openais _and_ corosync.
>> > If you want to use corosync you don't need openais. Same than before,
>> > you don't need heartbeat if you plan to use pacemaker (or the opposite)
>> > though that shouldn't hurt.
>>
>> Pacemaker needs either corosync or heartbeat.
>> If you have corosync, you can also add openais on top - but thats only
>> necessary when using GFS2.
>>
>> Try this getting started doc:
>>
>> http://www.clusterlabs.org/mediawiki/images/5/56/Cluster_from_Scratch_-_Fedora_12.pdf
>>
>> > Then, start simple, use a copy of the default corosync.conf in
>> > /etc/corosync/ and use one ring. It seems you are trying to use an old
>> > openais configuration which actually could work but to debug correctly
>> > add your needs one by one (2nd ring, new parameters, etc). Starting with
>> > the lot is usually more complicated to debug than progressively
>> > increasing complexity.
>> >
>> > Good luck
>> > Thomas.
>> > ___
>> > Openais mailing list
>> > Openais@lists.linux-foundation.org
>> > https://lists.linux-foundation.org/mailman/listinfo/openais
>> >
>
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Corosync cluster stack won't start

2010-03-22 Thread Andrew Beekhof
On Sat, Mar 20, 2010 at 1:06 AM, Thomas Guthmann  wrote:
> Hi Tom,
>
>> heartbeat-libs-3.0.2-2.el5.x86_64.rpm
>> heartbeat-3.0.2-2.el5.x86_64.rpm
>> openais-1.1.0-1.el5.x86_64.rpm
>> openaislib-1.1.0-1.el5.x86_64.rpm
>
> I reckon it could be due to the presence of openais _and_ corosync.
> If you want to use corosync you don't need openais. Same than before,
> you don't need heartbeat if you plan to use pacemaker (or the opposite)
> though that shouldn't hurt.

Pacemaker needs either corosync or heartbeat.
If you have corosync, you can also add openais on top - but that's only
necessary when using GFS2.

Try this getting started doc:
   
http://www.clusterlabs.org/mediawiki/images/5/56/Cluster_from_Scratch_-_Fedora_12.pdf

> Then, start simple, use a copy of the default corosync.conf in
> /etc/corosync/ and use one ring. It seems you are trying to use an old
> openais configuration which actually could work but to debug correctly
> add your needs one by one (2nd ring, new parameters, etc). Starting with
> the lot is usually more complicated to debug than progressively
> increasing complexity.
>
> Good luck
> Thomas.
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] Quorum Debian Lenny

2010-03-15 Thread Andrew Beekhof
do you have both nodes up and running?

On Mon, Mar 15, 2010 at 8:31 PM, Olivier BATARD  wrote:
> Hi,
>
>
> I'm trying to setup an active/passive cluster.
>
>
> I follow the cluster from scratch and Debian Lenny howto but I'm having some 
> errors :
>
>
> #crm_mon --one-shot -V
>
>
> Last updated: Mon Mar 15 21:29:22 2010
> Stack: openais
> Current DC: debian2 - partition WITHOUT quorum
> Version: 1.0.7-54d7869bfe3691eb723b1d47810e5585d8246b58
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
>
>
>
> Don't why quorum are not in ...
>
>
> Each time I change a property :
>
> crm_verify[2451]: 2010/03/15_21:29:36 WARN: cluster_status: We do not have 
> quorum - fencing and resource management disabled
>
>
> Don't know what I missed ... Any Ideas ?
>
>
> Thanks,
>
> Olivier
>
>
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
>
> totem {
>        version: 2
>        secauth: off
>        threads: 0
>        interface {
>                ringnumber: 0
>                bindnetaddr: 192.168.5.0
>                mcastaddr: 226.94.1.1
>                mcastport: 5405
>        }
> }
>
> logging {
>        fileline: off
>        to_stderr: yes
>        to_logfile: yes
>        to_syslog: yes
>        logfile: /tmp/corosync.log
>        debug: off
>        timestamp: on
>        logger_subsys {
>                subsys: AMF
>                debug: off
>        }
> }
>
> amf {
>        mode: disabled
> }
>
> aisexec {
>        user : root
>        group : root
> }
>
> service {
>        ver: 0
>        name: pacemaker
> }
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] How Corosync identifies a new node?

2010-03-15 Thread Andrew Beekhof
On Sat, Mar 13, 2010 at 6:02 AM, Steven Dake  wrote:
> On Fri, 2010-03-12 at 16:02 +0530, S, Prashanth wrote:
>> Hi!
>>
>> I need to clarify my understanding on how corosync handles addition of a new 
>> node.
>> I think whenever a new node is up it will multicast about its arrival.  This 
>> will result in gather->recovery->operational state changes and notifying via 
>> the config change callback.
>
> The protocol operates according to this document:
> www.cs.jhu.edu/~yairamir/tocs.ps
>
>> I have another question: Does corosync/pacemaker maintain any data about old 
>> nodes? If so, is there any significance for maintaining old nodes' data?
>>
>
> Corosync does not, but I am not sure about pacemaker.

Pacemaker keeps some (basically just the name) because it assumes it
will be coming back.
The documentation has details on how to purge it.
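
A sketch of what that purge usually looks like ("node2" here is only a
placeholder; the exact procedure depends on your Pacemaker version, so check
the documentation first):

  cibadmin -D -o nodes  -X '<node uname="node2"/>'
  cibadmin -D -o status -X '<node_state uname="node2"/>'
  # newer Pacemaker versions also offer: crm_node --force -R node2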
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] How to debug?

2010-03-10 Thread Andrew Beekhof
On Tue, Mar 9, 2010 at 7:40 PM, Lucian Romi  wrote:
> Great. By referring to the link you sent, I now know what these rc codes mean.
> But how can I know which line of the configuration is causing the error or
> which module is missing?
> Is there any log for this? I couldn't figure out corosync's log. Is
> this a corosync (openais) error or a pacemaker error?

It's a resource agent error.
Usually you'll find the details in /var/log/messages

If in doubt, look in the resource agent itself and look for places
that might return that error.
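
A rough sketch of doing that by hand for the Route agent above (paths are the
usual defaults; the OCF_RESKEY_* value is just a placeholder for whatever
parameters your primitive actually defines):

  export OCF_ROOT=/usr/lib/ocf
  export OCF_RESKEY_destination=10.0.0.0/8           # placeholder parameter
  sh -x /usr/lib/ocf/resource.d/heartbeat/Route monitor; echo "rc=$?"
  grep -i route /var/log/messages | tail -20         # what was logged for the failed ops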

> Thanks.
>
> On Mon, Mar 8, 2010 at 11:16 PM, Andrew Beekhof  wrote:
>> You might want to cross reference the return codes from the failed
>> operations with:
>>   
>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-ocf-return-codes.html
>>
>> Looks like you have some missing packages and invalid configuration options.
>>
>> On Tue, Mar 9, 2010 at 1:38 AM, Lucian Romi  wrote:
>>> Hi,
>>>
>>> I have put a lot of effort into setting up active/passive replication.
>>> The backup service I tried is pbxnsip, because it looks like somebody
>>> already managed to set up such a system.
>>> As I'm a newbie at this, I think this is a very good starting point.
>>> I followed this link http://forum.pbxnsip.com/index.php?showtopic=3070
>>> for a pbxnsip/drbd cluster.
>>> The IP address failover is working perfectly, but drbd and pbxnsip failed.
>>> There are errors, but I don't know how to debug them.
>>> I tried to find a solution on the pbxnsip forum, but got no reply. My post is
>>> the last one.
>>> So if any of you can tell me how to troubleshoot this, that would be great.
>>>
>>> Here are crm_mon output:
>>> "
>>> rhu...@advocado:~$ sudo crm_mon --one-shot
>>> 
>>> Last updated: Sun Feb 21 10:32:51 2010
>>> Stack: openais
>>> Current DC: cherry - partition WITHOUT quorum
>>> Version: 1.0.7-54d7869bfe3691eb723b1d47810e5585d8246b58
>>> 3 Nodes configured, unknown expected votes
>>> 3 Resources configured.
>>> 
>>>
>>> Online: [ cherry advocado ]
>>> OFFLINE: [  ]
>>>
>>> Master/Slave Set: ms_drbd_pbxnsip
>>>     Masters: [ advocado ]
>>>     Slaves: [ cherry ]
>>> gwsrc_route    (ocf::heartbeat:Route): Started advocado (unmanaged) FAILED
>>> Resource Group: pbxnsip
>>>     fs_pbxnsip (ocf::heartbeat:Filesystem):    Started advocado
>>>     ip_pbxnsip (ocf::heartbeat:IPaddr):        Started advocado
>>>     pbxnsipd   (lsb:pbxnsip):  Started advocado
>>>
>>> Failed actions:
>>>    drbd_pbxnsip_monitor_0 (node=cherry, call=3, rc=6,
>>> status=complete): not configured
>>>    fs_pbxnsip_start_0 (node=cherry, call=10, rc=1, status=complete):
>>> unknown error
>>>    gwsrc_route_monitor_0 (node=cherry, call=12, rc=5,
>>> status=complete): not installed
>>>    drbd_pbxnsip_monitor_0 (node=advocado, call=3, rc=6,
>>> status=complete): not configured
>>>    gwsrc_route_start_0 (node=advocado, call=26, rc=5,
>>> status=complete): not installed
>>>    gwsrc_route_stop_0 (node=advocado, call=27, rc=5,
>>> status=complete): not installed
>>> "
>>> ___
>>> Openais mailing list
>>> Openais@lists.linux-foundation.org
>>> https://lists.linux-foundation.org/mailman/listinfo/openais
>>>
>>
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] How to debug?

2010-03-08 Thread Andrew Beekhof
You might want to cross reference the return codes from the failed
operations with:
   
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-ocf-return-codes.html

Looks like you have some missing packages and invalid configuration options.
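
For instance (a sketch; the agent names are the ones from the crm_mon output
below, and the install path is the usual default):

  ls /usr/lib/ocf/resource.d/*/          # are the referenced agents installed? (rc=5 means "not installed")
  crm ra info ocf:heartbeat:Route        # lists the parameters the agent expects (rc=6 means "not configured")
  crm ra info ocf:heartbeat:Filesystem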

On Tue, Mar 9, 2010 at 1:38 AM, Lucian Romi  wrote:
> Hi,
>
> I have put a lot of effort into setting up active/passive replication.
> The backup service I tried is pbxnsip, because it looks like somebody
> already managed to set up such a system.
> As I'm a newbie at this, I think this is a very good starting point.
> I followed this link http://forum.pbxnsip.com/index.php?showtopic=3070
> for a pbxnsip/drbd cluster.
> The IP address failover is working perfectly, but drbd and pbxnsip failed.
> There are errors, but I don't know how to debug them.
> I tried to find a solution on the pbxnsip forum, but got no reply. My post is
> the last one.
> So if any of you can tell me how to troubleshoot this, that would be great.
>
> Here are crm_mon output:
> "
> rhu...@advocado:~$ sudo crm_mon --one-shot
> 
> Last updated: Sun Feb 21 10:32:51 2010
> Stack: openais
> Current DC: cherry - partition WITHOUT quorum
> Version: 1.0.7-54d7869bfe3691eb723b1d47810e5585d8246b58
> 3 Nodes configured, unknown expected votes
> 3 Resources configured.
> 
>
> Online: [ cherry advocado ]
> OFFLINE: [  ]
>
> Master/Slave Set: ms_drbd_pbxnsip
>     Masters: [ advocado ]
>     Slaves: [ cherry ]
> gwsrc_route    (ocf::heartbeat:Route): Started advocado (unmanaged) FAILED
> Resource Group: pbxnsip
>     fs_pbxnsip (ocf::heartbeat:Filesystem):    Started advocado
>     ip_pbxnsip (ocf::heartbeat:IPaddr):        Started advocado
>     pbxnsipd   (lsb:pbxnsip):  Started advocado
>
> Failed actions:
>    drbd_pbxnsip_monitor_0 (node=cherry, call=3, rc=6,
> status=complete): not configured
>    fs_pbxnsip_start_0 (node=cherry, call=10, rc=1, status=complete):
> unknown error
>    gwsrc_route_monitor_0 (node=cherry, call=12, rc=5,
> status=complete): not installed
>    drbd_pbxnsip_monitor_0 (node=advocado, call=3, rc=6,
> status=complete): not configured
>    gwsrc_route_start_0 (node=advocado, call=26, rc=5,
> status=complete): not installed
>    gwsrc_route_stop_0 (node=advocado, call=27, rc=5,
> status=complete): not installed
> "
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] resource restart

2010-03-08 Thread Andrew Beekhof
On Wed, Mar 3, 2010 at 8:21 AM, Haussecker, Armin
 wrote:
> Hi,
>
> we have an openais cluster consisting of two nodes. A resource is started on
> the first node and should remain there via a suitable location constraint;
> it should also be started on the same node as another resource via a
> colocation constraint.
>
> If the resource is stopped, and afterwards started again, we can see that
> first it is started on second node, and afterwards stopped on second node
> and restarted on first node again. So, finally everything seems to work
> correctly.
>
> But how can we avoid the resource being started on the second node and
> afterwards stopped on the second node and started on the first node? If the
> resource is stopped on first node, and afterwards started again, it should
> be immediately restarted on first node and not started and stopped on second
> node in the meantime.

It depends on a number of things, the most important of which is the
monitor function of your resource.
If it returns 7 (i.e. safely stopped), then the cluster has no reason
to put it back on the first node.
Also related is resource-stickiness.
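
A sketch of the pieces being described (resource and node names are
placeholders):

  crm configure location loc-myres-node1 myres inf: node1        # keep it on node1 when possible
  crm configure colocation myres-with-other inf: myres other_res # same node as the other resource
  crm configure rsc_defaults resource-stickiness=100             # discourage needless moves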
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] constraint problem

2010-02-24 Thread Andrew Beekhof
On Wed, Feb 24, 2010 at 3:05 PM, Haussecker, Armin
 wrote:
> Hi,
>
> calling command cibadmin to create resource constraints in a CIB, the
> following problem occurred:
>
> if file constraints.xml (containing xml snippets) contains more than one
> constraint definition, for example:
>
> <constraints>
>    <rsc_order ... score="INFINITY" then="MONITOR"/>
>    <rsc_order ... score="INFINITY" then="MONITOR"/>
> </constraints>
>
> or only
>
>  <rsc_order ... score="INFINITY" then="MONITOR"/>
>  <rsc_order ... score="INFINITY" then="MONITOR"/>,
>
> command cibadmin -C -o constraints -x constraints.xml updates the cluster
> information base (CIB) with both constraints.
> But command cibadmin -D -o constraints -x constraints.xml only deletes the
> first of the constraints defined within constraints.xml; the second constraint
> remains in the CIB.
>
> If only a single constraint definition is given, everything works well.

You can only supply one object at a time to operate on.
Always use the first form; the second should have complained loudly.
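
A sketch of the per-object form (assuming these are rsc_order constraints, as
the then= attribute suggests; the ids are placeholders for whatever ids the
two constraints carry in your CIB):

  cibadmin -D -o constraints -X '<rsc_order id="order-1"/>'
  cibadmin -D -o constraints -X '<rsc_order id="order-2"/>'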
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] [PATCH corosync_trunk] Add a test harness to corosync that uses CTS from pacemaker.

2010-02-23 Thread Andrew Beekhof
On Tue, Feb 23, 2010 at 4:00 AM, Angus Salkeld  wrote:
> On Mon, 2010-02-22 at 15:23 -0700, Steven Dake wrote:
>> On Thu, 2010-02-18 at 11:17 +1100, Angus Salkeld wrote:
>> > Hi
>> >
>> > This adds a test harness to corosync. It reuses the
>> > Cluster Test System (CTS) from pacemaker. It also
>> > has a test agent that runs on the cluster node that
>> > can perform any necessary application interaction with
>> > corosync. I have added only a few test cases, but
>> > once the mechanism in committed to corosync I will
>> > work to add more test cases (and hopefully others
>> > will be able to contribute too).
>> >
>> > Have a look in the README file in cts/ to get started.
>> >
>>
>> I have taken a crack at rewriting the readme to be a little more clear.
>> Hope you like.  It is attached.
>>
>> Regards
>> -steve
> Hi
>
> Here is a new version of the test harness that uses the latest from
> pacemaker-devel.
>
> Note:
> 1) for all the tests to pass you will also need to patch
> /usr/lib/python/site-packages/cts/CTS.py with the cts.patch (attached).
>
> 2) CTS now will by default not use remote syslog, so you don't need the
> syslog config supplied in the README.

It's not the default yet, but I'll probably make it an automatic fallback option.

>
> 3) It includes Steve's README
>
> Regards
> Angus
>
>
>
>
>
> ___
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
>
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] does self-fencing makes sense?

2010-02-23 Thread Andrew Beekhof
On Tue, Feb 23, 2010 at 8:29 AM, Steven Dake  wrote:
> On Tue, 2010-02-23 at 08:25 +0100, Dietmar Maurer wrote:
>> > > There are thousands of interactions with power fencing and every one
>> > > of them needs to work perfectly for power fencing to work.
>> >
>> > That's not the problem.
>> > It's the false positives you need to worry about (devices that report
>> > success when power fencing failed).
>> >
>> > When power fencing fails healthy nodes get some sort of indication and
>> > can take appropriate action.
>> > If suicide fails, um...
>>
>> Ok, for that reason power fencing is better.
>>
>> But what I've heard so far is that many users do not understand
>> why fencing is required, and worse, they do not configure and test
>> it correctly.
>>
>> So the question is if we can combine those approaches? Or is that
>> mutually exclusive for some reason?
>>
>
> It would be beneficial to have implementations that supported one or the
> other or both models at the same time.  Maximum flexibility for the
> user.  Then the user can decide what their viewpoint is on reliability
> just as I have outlined in this previous thread.  If they are super
> paranoid, they might use both.  If they believe simplicity is superior,
> they might choose self fencing.  If they feel that operating in a well
> defined operating environment with more complexity is better, they could
> choose that.
>
> Currently there are two choices 1) power fencing 2) no fencing.

Well, you could configure both.
But you'd end up with the node being power-fenced after committing
seppuku, which wouldn't be ideal.
___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] does self-fencing makes sense?

2010-02-22 Thread Andrew Beekhof
On Fri, Feb 19, 2010 at 11:31 PM, Steven Dake  wrote:
> On Fri, 2010-02-19 at 18:41 +0100, Andrew Beekhof wrote:
>> On Fri, Feb 19, 2010 at 5:36 PM, Dietmar Maurer  wrote:
>> > Hi all, I just found a whitepaper from XenServer - seem they implement some
>> > kind of self-fencing:
>> >
>> > -text from XenServer High Availability Whitepaper---
>> > The worst-case scenario for HA is the situation where a host is thought to 
>> > be off-line but is actually
>> > still writing to the shared storage, because this can result in corruption 
>> > of persistent data. To
>> > prevent this situation without requiring active power strip controls, 
>> > XenServer employs
>> > hypervisor-level fencing. This is a Xen modification which hard-powers off 
>> > the host at a very
>> > low-level if it does not hear regularly from a watchdog process running in 
>> > the control domain.
>> > Because it is implemented at a very low-level, this also protects the 
>> > storage in the case where the
>> > control domain becomes unresponsive for some reason.
>> > --
>> >
>> > Does that really make sense? That seems to be a very unreliable solution,
>> > because there is no guarantee that a failed node will 'self-fence' itself? Or
>> > do I miss something?
>>
>> Do you trust a host, that has already failed in some way, to now start
>> behaving correctly and fence itself?  I wouldn't.
>
> It really depends on the fencing model and what you believe to be more
> reliable.  One model says "tell node X to fence" (power fencing) while
> the alternative model says "if I don't tell you my health is good,
> please self-fence" (watchdog fencing).
>
> There are millions of lines of C code involved in directing a power
> fencing device to fence a node.  Generally in this case, the system
> directing the fencing is operating from a known good state.
>
> There are several hundred lines of C code that trigger a reboot when a
> watchdog timer isn't fed.  Generally in this case, the system directing
> the fencing (itself) has entered an undefined failure state.
>
> So a quick matrix:
> model            LOC       operating environment
> power fencing    millions  well-defined
> self fencing     hundreds  undefined
>
> Knowing well how software works, I personally would trust the code with
> hundreds of lines (orders of magnitude less LOC), even when operating in an
> undefined state.
> simple, and relies only on timer interrupts.  It is possible the timer
> interrupts won't be delivered, in which case an NMI watchdog timer
> (which is hardware based) can be used to watch for that situation.  It
> is possible for errant kernel code to corrupt the timer list that the
> kernel uses to expire timers.  If this happens, self-fencing using
> software watchdogs will fail gloriously.
>
> When considering hardware watchdog timer devices, the decision becomes
> even more clear, since a hardware watchdog timer has almost complete
> isolation from the system in which it is integrated.  Also it is
> designed and hardened around one purpose - to powercycle a system if it
> is not fed a healthcheck.
>
> Expanding the matrix:
> model             LOC       operating environment
> power fencing     millions  well-defined
> software watchdog hundreds  undefined
> hardware watchdog ASIC      well-defined
>
> In the case of a hardware watchdog, the LOC is hidden behind a self
> contained ASIC.  This ASIC could be defective in some way.  But it is
> also isolated from the remaining system so that it operates in a
> well-defined environment.
>
> Compare those with the failure scenarios of power fencing:
> 1) the power fencing device could have failed in some way
> 2) the power fencing device could process a request incorrectly
> 3) the code that interfaces with the power fencing device could be
> defective in some conditions
> 4) the power fencing hardware could fail to reset its relays for the
> node to be rebooted
> 5) the fencing system directing the fencing could fail in its
> communication to the fencing device
> 6) the network switch connecting the fencing device to the host systems
> could have a transient failure to the particular port on which the power
> fencing device is configured
> ... think up your own ...
>
> There are thousands of interactions with power fencing and every one of
> them needs to work perfectly for power fencing to work.

That's not the problem.
It's the false positives you need to worry about (devices that report
success when power fencing failed).

When power fencing fails healthy nodes get some sort of indication and
can take appropriate action.
If suicide fails, um...
