[Pacemaker] Suggestions for managing HA of containers from within a Pacemaker container?

2015-02-07 Thread Steven Dake (stdake)
Hi,

I am working on Containerizing OpenStack in the Kolla project 
(http://launchpad.net/kolla).  One of the key things we want to do over the 
next few months is add H/A support to our container tech.  David Vossel had 
suggested using systemctl to monitor the containers themselves by running 
healthchecking scripts within the containers.  That idea is sound.

There is another technology called “super-privileged containers”.  Essentially
it gives the container more access to the host, which allows Pacemaker to be
delivered as a container rather than an RPM or DEB package.  I’d like corosync to
run in a separate container.  These containers will communicate using their
normal mechanisms in a super-privileged mode.  We will implement this in Kolla.

Where I am stuck is how Pacemaker, running inside a container, can control
other containers on the host OS.  One approach I have considered is the docker
--pid=host flag, which would allow Pacemaker to talk directly to the host's
systemd.  The catch is that our containers don't run via systemctl, but are
instead launched by shell scripts that are executed by third-party deployment
software.
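
To make this concrete, the sort of invocation I have in mind for the
super-privileged Pacemaker container looks roughly like the following (the
image name and bind mounts are illustrative only, not an existing Kolla image):

docker run -d --name pacemaker \
  --net=host --pid=host --privileged \
  -v /etc/pacemaker:/etc/pacemaker \
  -v /var/lib/pacemaker:/var/lib/pacemaker \
  kollaglue/pacemaker        # hypothetical image name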

An example:
Let's say a rabbitmq container needs to be started.

The user would run
kolla-mgr deploy messaging

This would run a small bit of code to launch the docker container set for 
messaging.

Could Pacemaker run something like

kolla-mgr status messaging

to control the lifecycle of the processes?

Or would we be better off with some systemd integration with kolla-mgr?
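
If kolla-mgr grows suitable subcommands, a thin OCF-style wrapper might be all
Pacemaker needs.  A rough sketch (the stop subcommand and the exit-code
behaviour of kolla-mgr are assumptions on my part, nothing here exists yet):

#!/bin/sh
# Sketch of an OCF-style agent wrapping kolla-mgr; untested.
# Assumes "kolla-mgr status <service>" exits 0 when the service is healthy.
service="${OCF_RESKEY_service:-messaging}"

case "$1" in
  start)   kolla-mgr deploy "$service" ;;
  stop)    kolla-mgr stop "$service" ;;             # hypothetical subcommand
  monitor) kolla-mgr status "$service" || exit 7 ;; # 7 = OCF_NOT_RUNNING
  *)       exit 3 ;;                                # 3 = OCF_ERR_UNIMPLEMENTED
esac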

Thoughts welcome

Regards,
-steve
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Need to relax corosync due to backup of VM through snapshot

2013-11-24 Thread Steven Dake


On 11/21/2013 06:26 AM, Gianluca Cecchi wrote:

On Thu, Nov 21, 2013 at 9:09 AM, Lars Marowsky-Bree wrote:

On 2013-11-20T16:58:01, Gianluca Cecchi  wrote:


Based on docs  I thought that the timeout should be

token x token_retransmits_before_loss_const

No, the comments in the corosync.conf.example and man corosync.conf
should be pretty clear, I hope. Can you recommend which phrasing we
should improve?

I have not understood the exact relationship between token and
token_retransmits_before_loss_const -
when one comes into play and when the other does...
So perhaps the second one could be given more detail,
or some web links.


The token retransmit is a timer that is started each time a token is 
transmitted.  This is the maximum timer that exists - it is not token * 
retransmits_before_loss_const.


The retrans_before_loss_const says "please transmit a replacement token 
x many times in the token period".  Since the token is UDP, it could be 
lost in network overflow situations or other scenarios.


Using a real-world example:
token: 10000
retrans_before_loss_const: 10

the token will be retransmitted roughly every 1000 msec and will be
determined lost after 10000 msec.
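
In corosync.conf terms the example above would look like this (values are for
illustration only; tune them to your network):

totem {
        version: 2
        token: 10000
        token_retransmits_before_loss_const: 10
}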


Regards
-steve


So my current test config is:
   # diff corosync.conf corosync.conf.pre181113
24,25c24
< #token: 5000
< token: 120000

A 120s node timeout? That is really, really long. Why is the backup tool
interfering with the scheduling of high priority processes so much? That
sounds like the real bug.

In fact I inherited the analysis from a previous production cluster, and I'm
setting up a test environment to demonstrate that one realistic outcome
could well be that a cluster is not the right solution here, because the
underlying infrastructure is not stable enough.
I'm not given much visibility into the VMware and SAN details,
but I'm pushing to get them.
I have sometimes seen disk latencies of 8000 milliseconds ;-(
So another possible outcome could be to build a more reliable infrastructure
before going with a cluster.
I'm deliberately setting high values to see what happens, and will lower
them step by step.
BTW: I remember a past thread where others had problems
with NetBackup (or similar backup software) using snapshots, and that
setting higher values solved the sporadic problems (possibly 2 for
token and 10 for retransmit, but I couldn't find the thread ...)



Any comment?
Any different strategies successfully used in similar environments
where high latencies occur at snapshot deletion, when the
disk consolidation phase is executed?

A setup where a VM apparently can freeze for almost 120s is not suitable
for HA.


I see from previous logs that sometimes DRBD disconnects and reconnects
only after 30-40 seconds with the default timeouts...

Thanks for your inputs.

Gianluca

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org





Re: [Pacemaker] Building corosync from source on Angstrom

2013-05-31 Thread Steven Dake

On 05/31/2013 12:57 PM, Simon Platten wrote:

Hi,
I have been struggling to build corosync in Angstrom Linux on a beaglebone
black which runs an ARM Cortex A8.  I have been using this page as a guide:

http://clusterlabs.org/wiki/SourceInstall

So far I've downloaded and built libqb, no problems there right through to
corosync.  I've managed to perform corosync up to the make, where I encounter:

gcc: error: @nss_CFLAGS@: No such file or directory

If there is any further information required, please let me know and I will
do my best to provide it.

Kind Regards,
Simon

Simon,

Corosync requires libnss-devel and friends.  Not sure if that is 
packaged for Angstrom Linux or not.  If not, you may have to install it 
first and pass the nss CFLAGS (see ./configure --help, I believe) and 
LDFLAGS to the configure script of corosync.


Normally these auto-configure, and nss is installed by default in most distros.
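
Something along these lines should work, although I have not tried it on
Angstrom (paths are illustrative for a cross/embedded build):

# build without NSS support entirely
./configure --disable-nss

# or point configure at an NSS install by hand
./configure nss_CFLAGS="-I/path/to/nss/include" \
            nss_LIBS="-L/path/to/nss/lib -lnss3 -lnspr4" \
            LDFLAGS="-L/path/to/nss/lib"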

Regards
-steve



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org





[Pacemaker] Need HA for OpenStack instances? Check out heat V5!

2012-08-01 Thread Steven Dake
Hi folks,

A few developers from HA community have been hard at work on a project
called heat which provides native HA for OpenStack virtual machines.
Heat provides a template based system with API matching AWS
CloudFormation semantics specifically for OpenStack.

In v5, instance healthchecking has been added.  To get started on Fedora
16+ check out the getting started guide:

https://github.com/heat-api/heat/blob/master/docs/GettingStarted.rst#readme

or on Ubuntu Precise check out the devstack guide:
https://github.com/heat-api/heat/wiki/Getting-Started-with-Heat-using-Master-on-Ubuntu

An example template with instance HA features is here:

https://github.com/heat-api/heat/blob/master/templates/WordPress_Single_Instance_With_IHA.template

An example template with application HA features that includes
escalation is here:

https://github.com/heat-api/heat/blob/master/templates/WordPress_Single_Instance_With_HA.template

Our website is here:

http://www.heat-api.org

The software can be downloaded from:
https://github.com/heat-api/heat/downloads

Enjoy
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [corosync] Different Corosync Rings for Different Nodes in Same Cluster?

2012-07-08 Thread Steven Dake
On 07/02/2012 08:19 AM, Andrew Martin wrote:
> Hi Steve,
> 
> Thanks for the clarification. Am I correct in understanding that in a
> complete network, corosync will automatically re-add nodes that drop out
> and reappear for any reason (e.g. maintenance, network connectivity
> loss, STONITH, etc)?
> 

Apologies for delay - was on PTO.

That is correct.

Regards
-steve

> Thanks,
> 
> Andrew
> 
> ----
> *From: *"Steven Dake" 
> *To: *"The Pacemaker cluster resource manager"
> 
> *Cc: *disc...@corosync.org
> *Sent: *Friday, June 29, 2012 9:40:43 AM
> *Subject: *Re: [Pacemaker] Different Corosync Rings for Different Nodes
> in Same Cluster?
> 
> On 06/29/2012 01:42 AM, Dan Frincu wrote:
>> Hi,
>>
>> On Thu, Jun 28, 2012 at 6:13 PM, Andrew Martin 
> wrote:
>>> Hi Dan,
>>>
>>> Thanks for the help. If I configure the network as I described - ring
> 0 as
>>> the network all 3 nodes are on, ring 1 as the network only 2 of the nodes
>>> are on, and using "passive" - and the ring 0 network goes down, corosync
>>> will start using ring 1. Does this mean that the quorum node will
> appear to
>>> be offline to the cluster? Will the cluster attempt to STONITH it?
> Once the
>>> ring 0 network is available again, will corosync transition back to
> using it
>>> as the communication ring, or will it continue to use ring 1 until it
> fails?
>>>
>>> The ideal behavior would be when ring 0 fails it then communicates
> over ring
>>> 1, but keeps periodically checking to see if ring 0 is working again.
> Once
>>> it is, it returns to using ring 0. Is this possible?
>>
>> Added corosync ML in CC as I think this is better asked here as well.
>>
>> Regards,
>> Dan
>>
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> 
>>> From: "Dan Frincu" 
>>> To: "The Pacemaker cluster resource manager"
> 
>>> Sent: Wednesday, June 27, 2012 3:42:42 AM
>>> Subject: Re: [Pacemaker] Different Corosync Rings for Different Nodes
>>> inSame Cluster?
>>>
>>>
>>> Hi,
>>>
>>> On Tue, Jun 26, 2012 at 9:53 PM, Andrew Martin 
> wrote:
>>>> Hello,
>>>>
>>>> I am setting up a 3 node cluster with Corosync + Pacemaker on Ubuntu
> 12.04
>>>> server. Two of the nodes are "real" nodes, while the 3rd is in standby
>>>> mode
>>>> as a quorum node. The two "real" nodes each have two NICs, one that is
>>>> connected to a shared LAN and the other that is directly connected
> between
>>>> the two nodes (for DRBD replication). The quorum node is only
> connected to
>>>> the shared LAN. I would like to have multiple Corosync rings for
>>>> redundancy,
>>>> however I do not know if this would cause problems for the quorum
> node. Is
>>>> it possible for me to configure the shared LAN as ring 0 (which all 3
>>>> nodes
>>>> are connected to) and set the rrp_mode to passive so that it will
> use ring
>>>> 0
>>>> unless there is a failure, but to also configure the direct link between
>>>> the
>>>> two "real" nodes as ring 1?
>>>
> 
> In general I think you cannot do what you describe.  Let me repeat it so
> its clear:
> 
> A B C - NET #1
> A B   - Net #2
> 
> Where A, B are your cluster nodes, and C is your quorum node.
> 
> You want Net #1 and Net #2 to serve as redundant rings.  Since C is
> missing, Net #2 will automatically be detected as faulty.
> 
> The part about corosync automatically repairing nodes is correct, that
> would work (If you had a complete network).
> 
> Regards
> -steve
> 
>>> Short answer, yes.
>>>
>>> Longer answer. I have a setup with two nodes with two interfaces, one
>>> is connected via a switch to the other node and one is a back-to-back
>>> link for DRBD replication. In Corosync I have two rings, one that goes
>>> via the switch and one via the back-to-back link (rrp_mode: active).
>>> With rrp_mode: passive it should work the way you mentioned.
>>>
>>> HTH,
>>> Dan
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>> ___
>>>> Pacemaker mailing list

Re: [Pacemaker] Different Corosync Rings for Different Nodes in Same Cluster?

2012-06-29 Thread Steven Dake
On 06/29/2012 01:42 AM, Dan Frincu wrote:
> Hi,
> 
> On Thu, Jun 28, 2012 at 6:13 PM, Andrew Martin  wrote:
>> Hi Dan,
>>
>> Thanks for the help. If I configure the network as I described - ring 0 as
>> the network all 3 nodes are on, ring 1 as the network only 2 of the nodes
>> are on, and using "passive" - and the ring 0 network goes down, corosync
>> will start using ring 1. Does this mean that the quorum node will appear to
>> be offline to the cluster? Will the cluster attempt to STONITH it? Once the
>> ring 0 network is available again, will corosync transition back to using it
>> as the communication ring, or will it continue to use ring 1 until it fails?
>>
>> The ideal behavior would be when ring 0 fails it then communicates over ring
>> 1, but keeps periodically checking to see if ring 0 is working again. Once
>> it is, it returns to using ring 0. Is this possible?
> 
> Added corosync ML in CC as I think this is better asked here as well.
> 
> Regards,
> Dan
> 
>>
>> Thanks,
>>
>> Andrew
>>
>> 
>> From: "Dan Frincu" 
>> To: "The Pacemaker cluster resource manager" 
>> Sent: Wednesday, June 27, 2012 3:42:42 AM
>> Subject: Re: [Pacemaker] Different Corosync Rings for Different Nodes
>> inSame Cluster?
>>
>>
>> Hi,
>>
>> On Tue, Jun 26, 2012 at 9:53 PM, Andrew Martin  wrote:
>>> Hello,
>>>
>>> I am setting up a 3 node cluster with Corosync + Pacemaker on Ubuntu 12.04
>>> server. Two of the nodes are "real" nodes, while the 3rd is in standby
>>> mode
>>> as a quorum node. The two "real" nodes each have two NICs, one that is
>>> connected to a shared LAN and the other that is directly connected between
>>> the two nodes (for DRBD replication). The quorum node is only connected to
>>> the shared LAN. I would like to have multiple Corosync rings for
>>> redundancy,
>>> however I do not know if this would cause problems for the quorum node. Is
>>> it possible for me to configure the shared LAN as ring 0 (which all 3
>>> nodes
>>> are connected to) and set the rrp_mode to passive so that it will use ring
>>> 0
>>> unless there is a failure, but to also configure the direct link between
>>> the
>>> two "real" nodes as ring 1?
>>

In general I think you cannot do what you describe.  Let me repeat it so
it's clear:

A B C - NET #1
A B   - Net #2

Where A, B are your cluster nodes, and C is your quorum node.

You want Net #1 and Net #2 to serve as redundant rings.  Since C is
missing, Net #2 will automatically be detected as faulty.

The part about corosync automatically repairing nodes is correct, that
would work (If you had a complete network).
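
For reference, the layout under discussion would amount to roughly the
following corosync.conf (addresses are placeholders), and it is the second
interface that would be marked faulty, because node C has no address on that
network:

totem {
        version: 2
        rrp_mode: passive
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.1.0   # shared LAN - nodes A, B and C
                mcastaddr: 239.255.1.1
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 10.0.0.0      # back-to-back link - nodes A and B only
                mcastaddr: 239.255.1.2
                mcastport: 5415
        }
}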

Regards
-steve

>> Short answer, yes.
>>
>> Longer answer. I have a setup with two nodes with two interfaces, one
>> is connected via a switch to the other node and one is a back-to-back
>> link for DRBD replication. In Corosync I have two rings, one that goes
>> via the switch and one via the back-to-back link (rrp_mode: active).
>> With rrp_mode: passive it should work the way you mentioned.
>>
>> HTH,
>> Dan
>>
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> ___
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>
>>
>>
>> --
>> Dan Frincu
>> CCNA, RHCE
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
> 
> 
> 



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] If you want High Availability on OpenStack, check out Heat! (details inside)

2012-06-27 Thread Steven Dake
As some may know, Angus and I were working previously on a project
called pacemaker-cloud, with the intention of adding high availability to
guests in cloud environments.  We stopped developing that project in
March 2012 and took our experiences to a new project called Heat.  For
more details of why that decision was made, have a look at:

http://sdake.wordpress.com/2012/04/24/the-heat-api-a-template-based-orchestration-framework/

We have just released Heat API (v4) which has a really nice HA feature
for users moving work to OpenStack cloud environments.  Heat API uses
templates that describe a cloud application.  Our goal is to provide
parity with Amazon's AWS CloudFormation API and template specification
and we are closing in.

Heat's High Availability feature set will restart failed applications
and escalate repeated failures by restarting the entire VM.  All of this
is defined in one template file with the rest of the application
definition, and can be launched via our AWS CloudFormation API
implementation.

Heat does a ton of great things, which is why I ask you to give it a
spin, especially if you are evaluating OpenStack.

Check out our docs here:
https://github.com/heat-api/heat/wiki

Especially the using HA guide:
https://github.com/heat-api/heat/wiki/Using-HA

Our github project is here:
https://github.com/heat-api

Our mailing list is here:
http://lists.heat-api.org/mailman/listinfo/discuss

Even if you're not immediately able to try out the software, follow our
project on github by using the github Watch feature.  If you have other
feedback, feel free to send to this list or join the heat-api mailing
list and respond there.

Thanks!
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [corosync] Unable to join cluster from a newly-installed centos 6.2 node

2012-03-02 Thread Steven Dake
On 03/02/2012 05:29 PM, Diego Lima wrote:
> Hello,
> 
> I've recently installed Corosync on two CentOS 6.2 machines. One is
> working fine but on the other machine I've been unable to connect to
> the cluster. On the logs I can see this whenever I start
> corosync+pacemaker:
> 
> Mar  2 21:33:16 no2 corosync[15924]:   [MAIN  ] Corosync Cluster
> Engine ('1.4.1'): started and ready to provide service.
> Mar  2 21:33:16 no2 corosync[15924]:   [MAIN  ] Corosync built-in
> features: nss dbus rdma snmp
> Mar  2 21:33:16 no2 corosync[15924]:   [MAIN  ] Successfully read main
> configuration file '/etc/corosync/corosync.conf'.
> Mar  2 21:33:16 no2 corosync[15924]:   [TOTEM ] Initializing transport
> (UDP/IP Multicast).
> Mar  2 21:33:16 no2 corosync[15924]:   [TOTEM ] Initializing
> transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Mar  2 21:33:16 no2 corosync[15924]:   [TOTEM ] The network interface
> [172.16.100.2] is now up.
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
> process_ais_conf: Reading configure
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
> config_find_init: Local handle: 4730966301143465987 for logging
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
> config_find_next: Processing additional logging options...
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
> Found 'off' for option: debug
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
> Found 'no' for option: to_logfile
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
> Found 'yes' for option: to_syslog
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
> Found 'daemon' for option: syslog_facility
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
> config_find_init: Local handle: 7739444317642555396 for quorum
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
> config_find_next: No additional configuration supplied for: quorum
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
> No default for option: provider
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
> config_find_init: Local handle: 5650605097994944517 for service
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
> config_find_next: Processing additional service options...
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
> Found '0' for option: ver
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
> Defaulting to 'pcmk' for option: clustername
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
> Defaulting to 'no' for option: use_logd
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: get_config_opt:
> Defaulting to 'no' for option: use_mgmtd
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: pcmk_startup:
> CRM: Initialized
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] Logging: Initialized
> pcmk_startup
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: pcmk_startup:
> Maximum core file size is: 18446744073709551615
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: pcmk_startup: Service: 
> 10
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: pcmk_startup:
> Local hostname: no2.informidia.int
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info:
> pcmk_update_nodeid: Local node id: 40112300
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: update_member:
> Creating entry for node 40112300 born on 0
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: update_member:
> 0x766520 Node 40112300 now known as no2.informidia.int (was: (null))
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: update_member:
> Node no2.informidia.int now has 1 quorum votes (was 0)
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: update_member:
> Node 40112300/no2.informidia.int is now: member
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: spawn_child:
> Forked child 15930 for process stonith-ng
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: spawn_child:
> Forked child 15931 for process cib
> Mar  2 21:33:16 no2 corosync[15924]:   [pcmk  ] info: spawn_child:
> Forked child 15932 for process lrmd
> Mar  2 21:33:16 no2 lrmd: [15932]: info: G_main_add_SignalHandler:
> Added signal handler for signal 15
> Mar  2 21:33:16 no2 stonith-ng: [15930]: info: Invoked:
> /usr/lib64/heartbeat/stonithd
> Mar  2 21:33:16 no2 stonith-ng: [15930]: info: crm_log_init_worker:
> Changed active directory to /var/lib/heartbeat/cores/root
> Mar  2 21:33:16 no2 stonith-ng: [15930]: info:
> G_main_add_SignalHandler: Added signal handler for signal 17
> Mar  2 21:33:16 no2 stonith-ng: [15930]: info: get_cluster_type:
> Cluster type is: 'openais'
> Mar  2 21:33:16 no2 stonith-ng: [15930]: notice: crm_cluster_connect:
> Connecting to cluster infrastructure: classic openais (with plugin)
> Mar  2 21:33:16 no2 stonith-ng: [15930]: info:
> init_ais_connection_classic: Creating connection to our Corosync
> plugin
> Mar  2 21:

Re: [Pacemaker] OCFS2 in Pacemaker, post Corosync 2.0

2012-03-01 Thread Steven Dake
On 03/01/2012 07:19 AM, Lars Marowsky-Bree wrote:
> On 2012-03-01T09:52:29, Florian Haas  wrote:
> 
>> Future situation (Pacemaker with Corosync 2.x):
>> - OpenAIS goes away, no CKPT service, ocfs2_controld.pcmk stops working;
>> - cman goes away, ocfs2_controld.cman stops working.
>>
>> Is that summary correct?
>>
>> Do you happen to know whether the OCFS2 folks are informed about this?
> 
> Yes, at least we at SUSE are.
> 
> The reason why we've not been active ourselves on this is that this will
> most definitely be a non-wire-protocol compatible change (something
> corosync/openais folks seem very careless about, if I may rant for a

I am not sure what you're talking about.  We have kept on-wire (rolling
upgrade) compatibility between openais and corosync 1.4.2 even though it's a
huge pain.

We guarantee wire compatibility across 1.x releases (i.e. all 1.y.z are
compatible).  If something breaks, let us know and we will address it.

Regards
-steve

> second), not to mention a total ABI mess - so it was unfortunately a low
> priority for us during the SP2 development phase.
> 
> As you may have become aware, we shipped that yesterday, and are now
> looking to the future. This will definitely be on the list of issues to
> resolve.
> 
> 
> Regards,
> Lars
> 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] need cluster-wide variables

2012-01-11 Thread Steven Dake
On 12/21/2011 12:01 AM, Nirmala S wrote:
> Hi,
> 
>  
> 
> This is a followup on earlier thread
> (http://www.gossamer-threads.com/lists/linuxha/pacemaker/76705).
> 
>  
> 
> My situation is somewhat similar. I need a cluster which contains 3
> kinds of nodes – master, preferred slave, slave. Preferred slave is an
> entity that becomes the master in case of switchover/failover. Master is
> the master for pref_slave and pref_slave is master for other slaves. The
> master election is easy – it is done by crm, all I need to do is use
> crm_master.
> 
> 

RE subject, the cpg interface is perfect for maintaining replicated
state among your cluster nodes.  man cpg_overview.

Regards
-steve

> 
> But for the preferred slave, there needs to be an election amongst existing
> slaves. As of now I am using a variable in CIB with
> pref_slave|pref_slave_score|temp_score. If temp_score is 0, then the
> slave will update pref_slave and pref_slave_score and temp_score. If
> temp_score is non-zero, then the node compares its score with
> pref_slave_score and updates only if it is bigger.
> 
>  
> 
> Now I have 2 problems
> 
>  1. Everytime I change the CIB(which I am doing in pre-promote), the
> event (pre-promote) is getting retriggered.
>  2. The event(pre-promote) is sent in parallel to all the slaves. So
> each slave thinks temp_score is 0, and overwrites with its score. Is
> there any way to serialize this using some sort of lock ? Or is
> there a provision to store cluster-wide attributes apart from CIB ?
> 
>  
> 
> Regards
> 
> Nirmala
> 
>  
> 
>  
> 
> This e-mail and attachments contain confidential information from
> HUAWEI, which is intended only for the person or entity whose address is
> listed above. Any use of the information contained herein in any way
> (including, but not limited to, total or partial disclosure,
> reproduction, or
> 
> dissemination) by persons other than the intended recipient's) is
> prohibited. If you receive this e-mail in error, please notify the
> sender by phone or email immediately and delete it!
> 
>  
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] corosync mailing list address change

2011-10-20 Thread Steven Dake
Sending one last reminder that the Corosync mailing list has changed
homes from the Linux Foundation's servers.  I have been unable to obtain
the previous subscriber list, so please resubscribe.

http://lists.corosync.org/mailman/listinfo

The list is called "discuss".

Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Questions about reasonable cluster size...

2011-10-20 Thread Steven Dake
On 10/20/2011 07:42 AM, Alan Robertson wrote:
> On 10/20/2011 03:11 AM, Proskurin Kirill wrote:
>> On 10/20/2011 03:15 AM, Steven Dake wrote:
>>> On 10/19/2011 01:50 PM, Alan Robertson wrote:
>>>> Hi,
>>>>
>>>> I have an application where having a 12-node cluster with about 250
>>>> resources would be desirable.
>>>>
>>>> Is this reasonable?  Can Pacemaker+Corosync be expected to reliably
>>>> handle a cluster of this size?
>>>>
>>>> If not, what is the current recommendation for maximum number of nodes
>>>> and resources?
> Steven Dake wrote:
> 
> We regularly test 16 nodes.  As far as resources go, Andrew could answer
> that.
> 
>>
>> I start to have problems with 10+ nodes. It's heavily dependent on the
>> corosync configuration, AFAIK. You should test it.
> This is somewhat different from Steven's comment.  Exactly what things
> did you have in mind for the corosync configuration that could either
> help or hurt with larger clusters?
> 
> Steven:  Proskurin seems to think that there are some particular things
> to watch out for in the Corosync configuration for larger clusters. 
> Does anything come to mind for you about this?
> 
> 

We do 16-node testing with token=10000 (10 seconds).  The rest of the
parameters autoconfigure.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Questions about reasonable cluster size...

2011-10-19 Thread Steven Dake
On 10/19/2011 01:50 PM, Alan Robertson wrote:
> Hi,
> 
> I have an application where having a 12-node cluster with about 250
> resources would be desirable.
> 
> Is this reasonable?  Can Pacemaker+Corosync be expected to reliably
> handle a cluster of this size?
> 
> If not, what is the current recommendation for maximum number of nodes
> and resources?
> 
> Many thanks!
> 

We regularly test 16 nodes.  As far as resources go, Andrew could answer
that.

Regards
-steve



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] reminder - new corosync mailng list location

2011-10-10 Thread Steven Dake
A few weeks ago I posted that we had moved the corosync mailing list from the
Linux Foundation servers because they are down.  Please join the
corosync list if you're interested in the cluster stack or corosync, and
ask your questions there.

To join:
http://lists.corosync.org/mailman/listinfo

The list is called "discuss".

Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] New Corosync Mailing list - Please register for it!

2011-09-20 Thread Steven Dake
Hi,

Over the past several years, we have been sharing a mailing list with
the openais project.  I have made a new mailing list specifically for
corosync.  This will be the permanent new list for corosync.

Please register at:
http://lists.corosync.org/mailman/listinfo

The list is called "discuss"

Q Why are we making this change now?

A Several weeks ago Linux Foundation was hacked into (see
http://www.linuxfoundation.org).  They hosted our mailing list service.
 During this event, the mailing list has been unusable.  The Linux
Foundation staff is busy rebuilding their network, but in the interim
this seems like a good opportunity to move everything to our core
infrastructure at corosync.org.

Q What about the archives?

A I hope to restore the archives once I can get the records from Linux
Foundation.  There is no guarantee I can get a restored copy of the
archive however.  Fortunately several services over the years have
archived our mailing list.

Q What about my registration on the openais mailing list?

A I don't have the records to transfer the registrations to the corosync
list, so you will have to sign up for the mailing list again.

Q Is the password I used to register on the openais mailing list
compromised?

A I do not know to what extent the systems were hacked, but I'd recommend
treating the password as compromised.  If you shared this password with
other services, please change it.  Mailman stores passwords in plaintext
so that it can mail them to you once a month.  Always use unique
passwords on mailman mailing lists.

Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Building a Corosync 1.4.1 RPM package for SLES11 SP1

2011-09-01 Thread Steven Dake
On 08/31/2011 11:39 PM, Sebastian Kaps wrote:
> Hi,
> 
> I'm trying to compile Corosync v1.4.1 from source[1] and create an RPM
> x86_64 package for SLES11 SP1.
> When running "make rpm" the build process complains about a broken
> dependency for the nss-devel package.
> The package is not installed on the system - mozilla-nss (non-devel),
> however, is.
> 
> I'd be fine if I could just build the package without using the nss libs.
> I have no problem compiling Corosync using "./configure --disable-nss &&
> make", but I see no way for
> doing that with the "make rpm" command.
> 
> Alternatively I'd compile everything --with-nss, but I can't install the
> mozilla-nss-devel package,
> because the version on the SLE11-SP1-SDK DVD is older than the installed
> mozilla-nss package (3.12.6-3.1.1
> vs. 3.12.8-1.2.1) and creates a conflict when I try to install it.
> 
> [1] ftp://corosync.org/downloads/corosync-1.4.1/corosync-1.4.1.tar.gz

Thanks for pointing out this problem with the build tools for corosync.
 nss should be conditionalized.  This would allow rpmbuild --with-nss or
rpmbuild --without-nss from the default rpm builds.  I would send a
patch to the openais mailing list to resolve this problem, but the list is not
operating at the moment, so I'll send one here for you to give a spin.
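
For illustration, with the spec conditionalized the choice would then be made
at build time, roughly like this (exact syntax depends on how the conditional
ends up in the spec file):

rpmbuild -ba corosync.spec --with nss      # build against nss-devel
rpmbuild -ba corosync.spec --without nss   # skip the NSS dependency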

Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

2011-08-15 Thread Steven Dake
On 08/12/2011 03:19 AM, Vladislav Bogdanov wrote:
> ...
>>> I would really like someone that has these process pause problems to
>>> test a patch I have posted to see if it rectifies the situation.  Our
>>> significant QE team at Red Hat doesn't see these problems and I can't
>>> generate them in engineering.  It is possible your device drivers are
>>> taking spinlocks for extended periods or some other kernel problem is
>>> occurring.
>>>
>>> If you feel up to the task of building your own corosync, try out this
>>> patch:
>>>
>>> http://marc.info/?l=openais&m=130989380207300&w=2
> 

Vladislav,

> I have not seen any corosync pauses since applying it (right after it was
> posted). Although I was on vacation for two weeks, the rest of the time I
> have tested the cluster under really high CPU load (frankly, I lowered it a
> lot because of optimizations) and did not catch any pause (yet). One
> more thing I did was update the igb driver and return its buffers to the
> original 256 (bearing in mind that I originally hit the pause problem after
> I increased those buffers to 4096). I do not know if that has any influence.
> 

Thanks for the feedback (I did read your original response on this).
Unfortunately it is difficult to tell whether the other changes you made
fixed the problem or the patch did.

Regards
-steve

>> I'd love to test this, but it'll take a few weeks.
>> The machines are already in production and we don't have comparable test
>> machines. I'm currently (actually ;) having a few days off, and when I'm
>> back at the office, I'll update the Corosync version to v1.4.1 (because of
>> the retransmit list problem) -- does the patch cleanly apply to v1.4.1?
> 
> yes
> 
> Best,
> Vladislav
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

2011-08-11 Thread Steven Dake
On 08/11/2011 03:05 AM, Sebastian Kaps wrote:
> Hi,
> 
> On 04.08.2011, at 18:21, Steven Dake wrote:
> 
>>> Jul 31 03:51:02 node01 corosync[5870]:  [TOTEM ] Process pause detected
>>> for 11149 ms, flushing membership messages.
>>
>> This process pause message indicates the scheduler doesn't schedule
>> corosync for 11 seconds which is greater then the failure detection
>> timeouts.  What does your config file look like?  What load are you running?
> 
> 
> We've had another one of these this morning:
> "Process pause detected for 11763 ms, flushing membership messages."
> According to the graphs that are generated from Nagios data, the load of
> that system jumped from 1.0 to 5.1 ca. 2 minutes before this event, stayed
> at that value for ~5 minutes, then dropped to below 1 afterwards. 10 minutes
> later the system got shot,

Did Nagios possibly block for 10+ seconds during this time as well?  In that
case, it wouldn't detect any spikes or delays in scheduling.

Are you running in a virtual machine or on old/slow hardware?

Re the deadline CPU scheduler, the only thing I can find about that topic is
a new scheduling class.  Corosync doesn't take advantage of that
scheduling class (it's not in the Linux 3.0 glibc man pages - if it is
there, we don't know how to use it).

> probably because the OCFS2 got confused by the node leaving the cluster.
> At that time, the machine was only the standby node. The only things that
> could have been running then are a daily backup run (TSM) that starts the
> night before and takes a few hours to complete - and the OCFS2-related
> processes (the backup of the OCFS2 filesystem is done on that machine).
> 

I would really like someone that has these process pause problems to
test a patch I have posted to see if it rectifies the situation.  Our
significant QE team at Red Hat doesn't see these problems and I can't
generate them in engineering.  It is possible your device drivers are
taking spinlocks for extended periods or some other kernel problem is
occurring.

If you feel up to the task of building your own corosync, try out this
patch:

http://marc.info/?l=openais&m=130989380207300&w=2

Regards
-steve



> What can I do to investigate this behavior? We've switched to the "deadline"
> CPU scheduler before the July 31st event. Could this cause this kind of
> behavior? I was under the impression that 'deadline' was designed to prevent
> exactly these kinds of situations.
> Further increasing the timeout above the current value of 10s doesn't look
> like the solution for this problem.
> 
> The configuration is unchanged from the one I posted on August 4th.
> The funny thing is that the cluster did not show any problems since July
> 31st.
> 
> Thanks in advance!
> 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Backup ring is marked faulty

2011-08-07 Thread Steven Dake
On 08/04/2011 02:04 PM, Sebastian Kaps wrote:
> Hi Steven,
> 
> On 04.08.2011, at 20:59, Steven Dake wrote:
> 
>> meaning the corosync community doesn't investigate redundant ring issues
>> prior to corosync versions 1.4.1.
> 
> Sadly, we need to use the SLES version for support reasons.
> I'll try to convince them to supply us with a fix for this problem.
> 
> In the meantime: would it be safe to leave the backup ring marked faulty
> the next time this happens? Would this result in a state that is effectively
> like having no second ring, or is there a chance that this might still
> affect the cluster's stability?

If a ring is marked faulty, it is no longer operational and there is no
longer a redundant network.

> To my knowledge, changing the ring configuration requires a complete 
> restart of the cluster framework on all nodes, right?
> 

Yes, although fixing the retransmit list problem will not require a restart.
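
For reference, the runtime check and re-enable are the same commands mentioned
elsewhere in this thread:

corosync-cfgtool -s   # print ring status on the local node
corosync-cfgtool -r   # re-enable rings currently marked FAULTY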

Regards
-steve

>> I expect the root of your problem is already fixed (the retransmit list
>> problem) however in the repos and latest released versions.
> 
> 
> I'll try to get an update as soon as possible. Thanks a lot!
> 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Steven Dake
On 08/04/2011 11:43 AM, Sebastian Kaps wrote:
> Hi Steven,
> 
> On 04.08.2011, at 18:27, Steven Dake wrote:
> 
>> redundant ring is only supported upstream in corosync 1.4.1 or later.
> 
> What does "supported" mean in this context, exactly? 
> 

meaning the corosync community doesn't investigate redundant ring issues
prior to corosync versions 1.4.1.

I expect the root of your problem (the retransmit list problem) is, however,
already fixed in the repos and latest released versions.

Regards
-steve

> I'm asking, because we're having serious issues with these systems since 
> they went into production (the testing phase did not show any problems, 
> but we also couldn't use real workloads then).
> 
> Since the cluster went productive, we're having issues with seemingly random 
> STONITH events that seem to be related to a high I/O load on a DRBD-mirrored
> OCFS2 volume - but I don't see any pattern yet. We've had these machines 
> running for nearly two weeks without major problems and suddenly they went 
> back to killing each other :-(
> 
>> The retransmit list message issue you are having is fixed in corosync
>> 1.3.3 and later.  This is what is triggering the redundant ring faulty
>> error.
> 
> Could it also cause the instability problems we're seeing?
> Thanks again, for helping!

yes

> 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Steven Dake
On 08/02/2011 11:53 PM, Sebastian Kaps wrote:
> Hi Steven!
> 
> On Tue, 02 Aug 2011 17:45:46 -0700, Steven Dake wrote:
>> Which version of corosync?
> 
> # corosync -v
> Corosync Cluster Engine, version '1.3.1'
> Copyright (c) 2006-2009 Red Hat, Inc.
> 
> It's the version that comes with SLES11-SP1-HA.
> 

redundant ring is only supported upstream in corosync 1.4.1 or later.

The retransmit list message issue you are having is fixed in corosync
1.3.3 and later.  This is what is triggering the redundant ring faulty
error.

Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Live demo of Pacemaker Cloud on Fedora: Friday August 5th at 8am PST

2011-08-04 Thread Steven Dake
On 08/03/2011 06:39 PM, Bob Schatz wrote:
> Steven,
> 
> Are you planning on recording/taping it if I want to watch it later?
> 
> Thanks,
> 
> Bob

Bob,

Yes I will record if I can beat elluminate into submission.

Regards
-steve


> 
> ----
> *From:* Steven Dake 
> *To:* pcmk-cl...@oss.clusterlabs.org
> *Cc:* aeolus-de...@lists.fedorahosted.org; Fedora Cloud SIG
> ; "open...@lists.linux-foundation.org"
> ; The Pacemaker cluster resource
> manager 
> *Sent:* Wednesday, August 3, 2011 9:42 AM
> *Subject:* [Pacemaker] Live demo of Pacemaker Cloud on Fedora: Friday
> August 5th at 8am PST
> 
> Extending a general invitation to the high availability communities and
> other cloud community contributors to participate in a live demo I am
> giving on Friday August 5th 8am PST (GMT-7).  Demo portion of session is
> 15 minutes and will be provided first followed by more details of our
> approach to high availability.
> 
> I will use elluminate to show the demo on my desktop machine.  To make
> elluminate work, you will need icedtea-web installed on your system
> which is not typically installed by default.
> 
> You will also need a conference # and bridge code.  Please contact me
> offlist with your location and I'll provide you with a hopefully toll
> free conference # and bridge code.
> 
> Elluminate link:
> https://sas.elluminate.com/m.jnlp?sid=819&password=M.13AB020AEBE358D265FD925A07335F
> <https://sas.elluminate.com/m.jnlp?sid=819&password=M.13AB020AEBE358D265FD925A07335F>
> 
> Bridge Code:  Please contact me off list with your location and I'll
> respond back with dial-in information.
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> <mailto:Pacemaker@oss.clusterlabs.org>
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Steven Dake
On 08/03/2011 11:31 PM, Tegtmeier.Martin wrote:
> Hello again,
> 
> in my case it is always the slower ring that fails (the 100 Mbit network).
> Does rrp_mode passive expect both rings to have the same speed?
> 
> Sebastian, can you confirm that in your environment it is also the slower
> ring that fails?
> 
> Thanks,
>   -Martin
> 
> 

Martin,

I have never tested faster+slower networks in redundant ring configs.
We just recently added support for this feature in the corosync project
meaning we can start to tackle some of these issues going forward.

The protocol is designed to limit to the speed of the slowest ring -
perhaps this is not working as intended.

Regards
-steve

> -Original Message-
> From: Tegtmeier.Martin [mailto:martin.tegtme...@realtech.com] 
> Sent: Mittwoch, 3. August 2011 11:03
> To: The Pacemaker cluster resource manager
> Subject: AW: [Pacemaker] Backup ring is marked faulty
> 
> Hello,
> 
> we have exactly the same issue! Same version of corosync (1.3.1), also 
> running on SuSE Linux Enterprise Server 11 SP1 with HAE.
> 
> Aug 01 15:45:18 corosync [TOTEM ] Received ringid(172.20.16.2:308) seq 6a
> 
> Aug 01 15:45:18 corosync [TOTEM ] Received ringid(172.20.16.2:308) seq 63
> 
> Aug 01 15:45:18 corosync [TOTEM ] releasing messages up to and including 60
> 
> Aug 01 15:45:18 corosync [TOTEM ] releasing messages up to and including 6d
> 
> Aug 01 15:45:18 corosync [TOTEM ] Marking seqid 162 ringid 1 interface 
> 10.2.2.6 FAULTY - administrative intervention required.
> 
> rksaph06:/var/log/cluster # corosync-cfgtool -s
> 
> Printing ring status.
> 
> Local node ID 101717164
> 
> RING ID 0
> 
> id  = 172.20.16.6
> 
> status  = ring 0 active with no faults
> 
> RING ID 1
> 
> id  = 10.2.2.6
> 
> status  = Marking seqid 162 ringid 1 interface 10.2.2.6 FAULTY - 
> administrative intervention required.
> 
> 
> 
> rrp_mode is set to "passive"
> Ring 0 (172.20.16.0) supports 1GB and ring 1 (10.2.2.0) supports 100 MBit. 
> There was no other network traffic on ring 1 - only corosync (!)
> 
> After re-activating both rings with "corosync-cfgtool -r" the problem is
> reproducible by simply connecting a crm_gui and hitting "refresh" inside the
> GUI 3-5 times. After that, ring 1 (10.2.2.0) will be marked as "faulty" again.
> 
> Thanks and best regards,
>   -Martin Tegtmeier
> 
> 
> 
> 
> -Original Message-
> From: Sebastian Kaps [mailto:sebastian.k...@imail.de]
> Sent: Wed 03.08.2011 08:53
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Backup ring is marked faulty
>  
>  Hi Steven!
> 
>  On Tue, 02 Aug 2011 17:45:46 -0700, Steven Dake wrote:
>> Which version of corosync?
> 
>  # corosync -v
>  Corosync Cluster Engine, version '1.3.1'
>  Copyright (c) 2006-2009 Red Hat, Inc.
> 
>  It's the version that comes with SLES11-SP1-HA.
> 
> --
>  Sebastian
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

2011-08-04 Thread Steven Dake
On 08/04/2011 05:46 AM, Sebastian Kaps wrote:
> Hello,
> 
> here's another problem we're having:
> 
> Jul 31 03:51:02 node01 corosync[5870]:  [TOTEM ] Process pause detected
> for 11149 ms, flushing membership messages.

This process pause message indicates the scheduler doesn't schedule
corosync for 11 seconds, which is greater than the failure detection
timeouts.  What does your config file look like?  What load are you running?

Regards
-steve

> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] CLM CONFIGURATION CHANGE
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] New Configuration:
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ]   r(0) ip(192.168.1.1)
> r(1) ip(x.y.z.3)
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Left:
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ]   r(0) ip(192.168.1.2)
> r(1) ip(x.y.z.1)
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Joined:
> Jul 31 03:51:11 node01 corosync[5870]:  [pcmk  ] notice:
> pcmk_peer_update: Transitional membership event on ring 9708: memb=1,
> new=0, lost=1
> Jul 31 03:51:11 node01 corosync[5870]:  [pcmk  ] info: pcmk_peer_update:
> memb: node01 16885952
> Jul 31 03:51:11 node01 corosync[5870]:  [pcmk  ] info: pcmk_peer_update:
> lost: node02 33663168
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] CLM CONFIGURATION CHANGE
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] New Configuration:
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ]   r(0) ip(192.168.1.1)
> r(1) ip(x.y.z.3)
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Left:
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Joined:
> Jul 31 03:51:11 node01 crmd: [5912]: notice: ais_dispatch_message:
> Membership 9708: quorum lost
> 
> Node01 gets Stonith'd shortly after that. There is no indication
> whatsoever that this would happen in the logs.
> For at least half an hour before that there's only the normal
> status-message noise from monitor ops etc.
> 
> Jul 31 03:51:01 node02 corosync[5810]:  [TOTEM ] A processor failed,
> forming new configuration.
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] CLM CONFIGURATION CHANGE
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] New Configuration:
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ]   r(0) ip(192.168.1.2)
> r(1) ip(x.y.z.1)
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Left:
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ]   r(0) ip(192.168.1.1)
> r(1) ip(x.y.z.3)
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Joined:
> Jul 31 03:51:11 node02 corosync[5810]:  [pcmk  ] notice:
> pcmk_peer_update: Transitional membership event on ring 9708: memb=1,
> new=0, lost=1
> Jul 31 03:51:11 node02 corosync[5810]:  [pcmk  ] info: pcmk_peer_update:
> memb: node02 33663168
> Jul 31 03:51:11 node02 corosync[5810]:  [pcmk  ] info: pcmk_peer_update:
> lost: node01 16885952
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] CLM CONFIGURATION CHANGE
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] New Configuration:
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ]   r(0) ip(192.168.1.2)
> r(1) ip(x.y.z.1)
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Left:
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Joined:
> 
> What does "Process pause detected" mean?
> 
> Quoting from my other recent post regarding the backup ring being marked
> faulty sporadically:
> 
> |We're running a two-node cluster with redundant rings.
> |Ring 0 is a 10 GB direct connection; ring 1 consists of two 1GB
> interfaces that are bonded in
> |active-backup mode and routed through two independent switches for each
> node. The ring 1 network
> |is our "normal" 1G LAN and should only be used in case the direct 10G
> connection should fail.
> |
> |Corosync Cluster Engine, version '1.3.1'
> |Copyright (c) 2006-2009 Red Hat, Inc.
> |
> |It's the version that comes with SLES11-SP1-HA.
> 
> Thanks in advance!
> 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Live demo of Pacemaker Cloud on Fedora: Friday August 5th at 8am PST

2011-08-03 Thread Steven Dake
Extending a general invitation to the high availability communities and
other cloud community contributors to participate in a live demo I am
giving on Friday August 5th at 8am PST (GMT-7).  The demo portion of the
session is 15 minutes and comes first, followed by more details of our
approach to high availability.

I will use Elluminate to show the demo on my desktop machine.  To make
Elluminate work, you will need icedtea-web installed on your system,
which is not typically installed by default.

You will also need a conference # and bridge code.  Please contact me
offlist with your location and I'll provide you with a hopefully toll
free conference # and bridge code.

Elluminate link:
https://sas.elluminate.com/m.jnlp?sid=819&password=M.13AB020AEBE358D265FD925A07335F

Bridge Code:  Please contact me off list with your location and I'll
respond back with dial-in information.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Backup ring is marked faulty

2011-08-02 Thread Steven Dake
Which version of corosync?
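
For reference, the check-and-recover sequence you describe below would
look roughly like this (a sketch; the peer address is a placeholder to
be taken from your ring 1 network):

# show the status of both rings on the local node
corosync-cfgtool -s
# confirm the peer is still reachable on the ring 1 network
ping -c 3 <peer ring 1 address>
# clear the faulty state and re-enable redundant ring operation
corosync-cfgtool -r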

On 08/02/2011 07:35 AM, Sebastian Kaps wrote:
> Hi,
> 
> we're running a two-node cluster with redundant rings.
> Ring 0 is a 10 GB direct connection; ring 1 consists of two 1GB
> interfaces that are bonded in
> active-backup mode and routed through two independent switches for each
> node. The ring 1 network
> is our "normal" 1G LAN and should only be used in case the direct 10G
> connection should fail.
> I often (once a day on average, I'd guess) see that ring 1 (an only that
> one) is marked as
> FAULTY without any obvious reasons.
> 
> Aug  2 08:56:15 node02 corosync[5752]:  [TOTEM ] Retransmit List: c76
> c7a c7c c7e c80 c82 c84
> Aug  2 08:56:15 node02 corosync[5752]:  [TOTEM ] Retransmit List: c82
> Aug  2 08:56:15 node02 corosync[5752]:  [TOTEM ] Marking seqid 568416
> ringid 1 interface x.y.z.1 FAULTY - administrative intervention required.
> 
> Whenever I see this, I check if the other node's address can be pinged
> (I never saw any
> connectivity problems there), then reenable the ring with
> "corosync-cfgtool -r" and
> everything looks ok for a while (i.e. hours or days).
> 
> How could I find out why this happens?
> What do these "Retransmit List" or seqid (sequence id, I assume?) values
> tell me?
> Is it safe to reenable the second ring when the partner node can be
> pinged successfully?
> 
> The totem section on our config looks like this:
> 
> totem {
>rrp_mode:   passive
>join:   60
>max_messages:   20
>vsftype:none
>consensus:  1
>secauth:on
>token_retransmits_before_loss_const:10
>threads:16
>token:  1
>version:2
>interface {
>bindnetaddr:192.168.1.0
>mcastaddr:  239.250.1.1
>mcastport:  5405
>ringnumber: 0
>}
>interface {
>bindnetaddr:x.y.z.0
>mcastaddr:  239.250.1.2
>mcastport:  5415
>ringnumber: 1
>}
>clear_node_high_bit:yes
> }
> 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Announcing Pacemaker Cloud 0.4.1 - Available now for download!

2011-07-27 Thread Steven Dake
Angus and I announced a project to apply high availability best known
practice to the field of cloud computing in late March 2011.  We reuse
the policy engine of Pacemaker.  Our first tarball is available today
containing a functional prototype demonstrating these best known practices.

Today the software supports a deployable/assembly model.  Assemblies
represent a virtual machine and deployables represent a collection of
virtual machines.  Resources within a virtual machine can be monitored
for failure and recovered.  Assemblies and deployables are also
monitored for failure and recovered.

Currently the significant limitation with the software is that it
operates on a single node only.  As a result it is not suitable for deployment
today.  We plan to address this in the future by integrating with other
cloud infrastructure systems such as Aeolus (developer ml on CC list).

The software will be available in Fedora 16 for all who run Fedora to
evaluate.  Your feedback is greatly appreciated.  To provide feedback,
join the mailing list:

http://oss.clusterlabs.org/mailman/listinfo/pcmk-cloud/

If you have interest in developing for cloud environments around the
topic of high availability, please feel free to download our git repo
and submit patches.  We also are interested in user feedback!

To get the software, check out:

http://pacemaker-cloud.org/

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Sending message via cpg FAILED: (rc=12) Doesn't exist

2011-07-22 Thread Steven Dake
On 07/22/2011 01:15 AM, Proskurin Kirill wrote:
> Hello all.
> 
> 
> pacemaker-1.1.5
> corosync-1.4.0
> 
> 4 nodes in cluster. 3 online 1 not.
> In logs:
> 
> Jul 22 11:50:23 my106.example.com crmd: [28030]: info:
> pcmk_quorum_notification: Membership 0: quorum retained (0)
> Jul 22 11:50:23 my106.example.com crmd: [28030]: info: do_started:
> Delaying start, no membership data (0010)
> Jul 22 11:50:23 my106.example.com crmd: [28030]: info:
> config_query_callback: Shutdown escalation occurs after: 120ms
> Jul 22 11:50:23 my106.example.com crmd: [28030]: info:
> config_query_callback: Checking for expired actions every 90ms
> Jul 22 11:50:23 my106.example.com crmd: [28030]: info: do_started:
> Delaying start, no membership data (0010)
> Jul 22 11:50:27 my106.example.com attrd: [28028]: info: cib_connect:
> Connected to the CIB after 1 signon attempts
> Jul 22 11:50:27 my106.example.com attrd: [28028]: info: cib_connect:
> Sending full refresh
> Jul 22 11:52:18 corosync [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> Jul 22 11:52:18 corosync [CPG   ] chosen downlist: sender r(0)
> ip(10.3.1.107) ; members(old:4 left:1)
> Jul 22 11:52:18 corosync [MAIN  ] Completed service synchronization,
> ready to provide service.
> Jul 22 11:52:19 my106.example.com pacemakerd: [28021]: ERROR:
> send_cpg_message: Sending message via cpg FAILED: (rc=12) Doesn't exist
> Jul 22 11:52:19 my106.example.com pacemakerd: [28021]: ERROR:
> send_cpg_message: Sending message via cpg FAILED: (rc=12) Doesn't exist
> Jul 22 11:52:19 my106.example.com pacemakerd: [28021]: ERROR:
> send_cpg_message: Sending message via cpg FAILED: (rc=12) Doesn't exist
> 
> 
> 
> DC:
> 
> Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
> Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
> Jul 22 11:50:07 my107.example.com pacemakerd: [22388]: info:
> update_node_processes: Node my106.example.com now has process list:
> 0002 (was 00
> 12)
> Jul 22 11:50:07 my107.example.com attrd: [22397]: info: crm_update_peer:
> Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0
> seen=0 proc=00
> 02 (new)
> Jul 22 11:50:07 my107.example.com cib: [22395]: info: crm_update_peer:
> Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0
> seen=0 proc=0002
>  (new)
> Jul 22 11:50:07 my107.example.com stonith-ng: [22394]: info:
> crm_update_peer: Node my106.example.com: id=0 state=unknown addr=(null)
> votes=0 born=0 seen=0 proc=0
> 002 (new)
> Jul 22 11:50:07 my107.example.com crmd: [22399]: info: crm_update_peer:
> Node my106.example.com: id=0 state=unknown addr=(null) votes=0 born=0
> seen=0 proc=000
> 2 (new)
> Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
> Jul 22 11:50:07 corosync [TOTEM ] Retransmit List: e4 e5 e7 e8 ea eb ed ee
> 
> 
> There is a problem?
> 

Does your retransmit list continually display e4 e5 etc. for the rest of
the cluster's lifetime, or is this short-lived?



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] corosync-quorumtool configuration

2011-06-21 Thread Steven Dake
On 06/20/2011 07:35 PM, Andrew Beekhof wrote:
> I don't think this is legal:
> 
> service {
> 
> name: corosync_quorum
> 
> ver: 0
> 
> name: pacemaker
> 
> use_mgmtd: yes
> 
> use_logd: yes
> 
> }
> 
> 
> and even if it were, corosync's native quorum implementation (or our use
> of it) was a but buggy last time i checked. 
> 

Probably a little of both.  I suggest holding off on deploying quorum
integration until our upstreams have worked out a well-integrated solution.
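
(For what it's worth, each service block normally carries a single name,
so the pacemaker entry on its own would look roughly like:

service {
        name: pacemaker
        ver: 0
}

but, as above, I'd wait on the votequorum side until the integration has
settled.)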

Regards
-steve
> 
> 
> On Sat, May 28, 2011 at 2:19 AM, Roman Schartlmüller
> mailto:roman_schartlmuel...@gmx.at>> wrote:
> 
> Hi, 
> 
> 
> I have Problems with corosync-quorumtool. I have set up a cluster
> with 2 nodes which uses corosync. 
> 
> For testing purposes, I am trying to establish a quorum to the
> cluster, because I want to understand the quorum-corosync tool. 
> 
> I want to tag a node with 2 votes so that it can continue on failure
> and doesn't stop to run all resources. 
> 
> Unfortunately, the node with 2 out of 3 votes loses yet always the
> quorum. 
> 
> 
> Can anybody please help me with configuration? Maybe there is
> something wrong.
> 
> 
> Thanks!
> 
> Best regards,
> 
> Roman
> 
> 
> Node1:
> 
> # less /etc/corosync/corosync.conf
> 
> ...
> 
> service {
> 
> name: corosync_quorum
> 
> ver: 0
> 
> name: pacemaker
> 
> use_mgmtd: yes
> 
> use_logd: yes
> 
> }...
> 
> quorum {
> 
> provider: corosync_votequorum
> 
> expected_votes: 3
> 
> votes: 2
> 
> }
> 
> 
> #corosync-quorumtool -s
> 
> Version: 1.3.1
> 
> Nodes:  2
> 
> Ring ID: 19936
> 
> Quorum type:  corosync_votequorum
> 
> Quorate:Yes
> 
> Node votes:2
> 
> Expected votes:   3
> 
> Highest expected:3
> 
> Total votes:3
> 
> Quorum:2  
> 
> Flags:Quorate
> 
> 
> #corosync-quorumtool -l
> 
> Nodeid Votes  Name
> 
> 553756864 2  192.168.200.33
> 
> 570534080 2  192.168.200.34
> 
> 
>   
> 
> Node2:
> 
> # less /etc/corosync/corosync.conf
> 
> ...
> 
> service {
> 
> name: corosync_quorum
> 
> ver: 0
> 
> name: pacemaker
> 
> use_mgmtd: yes
> 
> use_logd: yes
> 
> }...
> 
> quorum {
> 
> provider: corosync_votequorum
> 
> expected_votes: 3
> 
> votes: 1
> 
> }
> 
> 
> #corosync-quorumtool -s
> 
> Version: 1.3.1
> 
> Nodes:  2
> 
> Ring ID: 19936
> 
> Quorum type:  corosync_votequorum
> 
> Quorate:Yes
> 
> Node votes:1
> 
> Expected votes:   3
> 
> Highest expected:3
> 
> Total votes:3
> 
> Quorum:2  
> 
> Flags:Quorate
> 
> 
> #corosync-quorumtool -l
> 
> Nodeid Votes  Name
> 
> 553756864 1  192.168.200.33
> 
> 570534080 1  192.168.200.34
> 
> 
> #crm configure show
> 
> expected-quorum-votes="2" \
> 
> dc-version="1.1.5-1.fc14-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
> 
> cluster-infrastructure="openais" \
> 
> last-lrm-refresh="1306499083" \
> 
> stonith-enabled="false" \
> 
> no-quorum-policy="ignore"
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> 
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org

Re: [Pacemaker] [Openais] Linux HA on debian sparc

2011-06-07 Thread Steven Dake
On 06/07/2011 04:44 AM, william felipe_welter wrote:
> More two questions.. The patch for mmap calls will be on the mainly
> development for all archs ?
> Any problems if i send this patch's for Debian project ?
> 

These patches will go into the maintenance branches.

You can send them to whoever you like ;)

Regards
-steve

> 2011/6/3 Steven Dake :
>> On 06/02/2011 08:16 PM, william felipe_welter wrote:
>>> Well,
>>>
>>> Now with this patch, the pacemakerd process starts and up his other
>>> process ( crmd, lrmd, pengine) but after the process pacemakerd do
>>> a fork, the forked  process pacemakerd dies due to "signal 10, Bus
>>> error".. And  on the log, the process of pacemark ( crmd, lrmd,
>>> pengine) cant connect to open ais plugin (possible because the
>>> "death" of the pacemakerd process).
>>> But this time when the forked pacemakerd dies, he generates a coredump.
>>>
>>> gdb  -c "/usr/var/lib/heartbeat/cores/root/ pacemakerd 7986"  -se
>>> /usr/sbin/pacemakerd :
>>> GNU gdb (GDB) 7.0.1-debian
>>> Copyright (C) 2009 Free Software Foundation, Inc.
>>> License GPLv3+: GNU GPL version 3 or later 
>>> <http://gnu.org/licenses/gpl.html>
>>> This is free software: you are free to change and redistribute it.
>>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>>> and "show warranty" for details.
>>> This GDB was configured as "sparc-linux-gnu".
>>> For bug reporting instructions, please see:
>>> <http://www.gnu.org/software/gdb/bugs/>...
>>> Reading symbols from /usr/sbin/pacemakerd...done.
>>> Reading symbols from /usr/lib64/libuuid.so.1...(no debugging symbols
>>> found)...done.
>>> Loaded symbols for /usr/lib64/libuuid.so.1
>>> Reading symbols from /usr/lib/libcoroipcc.so.4...done.
>>> Loaded symbols for /usr/lib/libcoroipcc.so.4
>>> Reading symbols from /usr/lib/libcpg.so.4...done.
>>> Loaded symbols for /usr/lib/libcpg.so.4
>>> Reading symbols from /usr/lib/libquorum.so.4...done.
>>> Loaded symbols for /usr/lib/libquorum.so.4
>>> Reading symbols from /usr/lib64/libcrmcommon.so.2...done.
>>> Loaded symbols for /usr/lib64/libcrmcommon.so.2
>>> Reading symbols from /usr/lib/libcfg.so.4...done.
>>> Loaded symbols for /usr/lib/libcfg.so.4
>>> Reading symbols from /usr/lib/libconfdb.so.4...done.
>>> Loaded symbols for /usr/lib/libconfdb.so.4
>>> Reading symbols from /usr/lib64/libplumb.so.2...done.
>>> Loaded symbols for /usr/lib64/libplumb.so.2
>>> Reading symbols from /usr/lib64/libpils.so.2...done.
>>> Loaded symbols for /usr/lib64/libpils.so.2
>>> Reading symbols from /lib/libbz2.so.1.0...(no debugging symbols 
>>> found)...done.
>>> Loaded symbols for /lib/libbz2.so.1.0
>>> Reading symbols from /usr/lib/libxslt.so.1...(no debugging symbols
>>> found)...done.
>>> Loaded symbols for /usr/lib/libxslt.so.1
>>> Reading symbols from /usr/lib/libxml2.so.2...(no debugging symbols
>>> found)...done.
>>> Loaded symbols for /usr/lib/libxml2.so.2
>>> Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
>>> Loaded symbols for /lib/libc.so.6
>>> Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done.
>>> Loaded symbols for /lib/librt.so.1
>>> Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
>>> Loaded symbols for /lib/libdl.so.2
>>> Reading symbols from /lib/libglib-2.0.so.0...(no debugging symbols
>>> found)...done.
>>> Loaded symbols for /lib/libglib-2.0.so.0
>>> Reading symbols from /usr/lib/libltdl.so.7...(no debugging symbols
>>> found)...done.
>>> Loaded symbols for /usr/lib/libltdl.so.7
>>> Reading symbols from /lib/ld-linux.so.2...(no debugging symbols 
>>> found)...done.
>>> Loaded symbols for /lib/ld-linux.so.2
>>> Reading symbols from /lib/libpthread.so.0...(no debugging symbols 
>>> found)...done.
>>> Loaded symbols for /lib/libpthread.so.0
>>> Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
>>> Loaded symbols for /lib/libm.so.6
>>> Reading symbols from /usr/lib/libz.so.1...(no debugging symbols 
>>> found)...done.
>>> Loaded symbols for /usr/lib/libz.so.1
>>> Reading symbols from /lib/libpcre.so.3...(no debugging symbols 
>>> found)...done.
>>> Loaded symbols for /lib/libpcre.so.3
>>

[Pacemaker] Updated pacemaker-cloud.org website

2011-06-06 Thread Steven Dake
Hi,

I want to spend a moment to tell you about our new website at
http://pacemaker-cloud.org.  This website will serve as our information
store and tarball repo location for the Pacemaker-Cloud project.  The
features page contains the feature set we plan to deliver.

Please have a look and forward any questions or comments to:

pcmk-cl...@oss.clusterlabs.org.

A big thanks to Adam Stokes who worked on the Matahari website design.
We used his design as our inspiration for most of our website.  Also a
thanks to Angus Salkeld for contributing to moving our hosting to github.

Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] [Openais] Linux HA on debian sparc

2011-06-03 Thread Steven Dake
xxx crmd: [7994]: debug: cib_native_signon_raw:
> Connection to callback channel failed
> Jun 02 23:12:20 xx crmd: [7994]: debug: cib_native_signon_raw:
> Connection to CIB failed: connection failed
> Jun 02 23:12:20 xx crmd: [7994]: debug: cib_native_signoff:
> Signing out of the CIB Service
> Jun 02 23:12:20 xx cib: [7990]: debug: activateCibXml:
> Triggering CIB write for start op
> Jun 02 23:12:20 xx cib: [7990]: info: startCib: CIB
> Initialization completed successfully
> Jun 02 23:12:20 xx cib: [7990]: info: get_cluster_type:
> Cluster type is: 'openais'.
> Jun 02 23:12:20 xx cib: [7990]: info: crm_cluster_connect:
> Connecting to cluster infrastructure: classic openais (with plugin)
> Jun 02 23:12:20 xx cib: [7990]: info:
> init_ais_connection_classic: Creating connection to our Corosync
> plugin
> Jun 02 23:12:20 xx cib: [7990]: info:
> init_ais_connection_classic: Connection to our AIS plugin (9) failed:
> Doesn't exist (12)
> Jun 02 23:12:20 xx cib: [7990]: CRIT: cib_init: Cannot sign in
> to the cluster... terminating
> Jun 02 23:12:21 corosync [CPG   ] exit_fn for conn=0x62500
> Jun 02 23:12:21 corosync [TOTEM ] mcasted message added to pending queue
> Jun 02 23:12:21 corosync [TOTEM ] Delivering 15 to 16
> Jun 02 23:12:21 corosync [TOTEM ] Delivering MCAST message with seq 16
> to pending delivery queue
> Jun 02 23:12:21 corosync [CPG   ] got procleave message from cluster
> node 1377289226
> Jun 02 23:12:21 corosync [TOTEM ] releasing messages up to and including 16
> Jun 02 23:12:21 xx attrd: [7992]: info: Invoked:
> /usr/lib64/heartbeat/attrd
> Jun 02 23:12:21 xx attrd: [7992]: info: crm_log_init_worker:
> Changed active directory to /usr/var/lib/heartbeat/cores/hacluster
> Jun 02 23:12:21 xx attrd: [7992]: info: main: Starting up
> Jun 02 23:12:21 xx attrd: [7992]: info: get_cluster_type:
> Cluster type is: 'openais'.
> Jun 02 23:12:21 xx attrd: [7992]: info: crm_cluster_connect:
> Connecting to cluster infrastructure: classic openais (with plugin)
> Jun 02 23:12:21 xx attrd: [7992]: info:
> init_ais_connection_classic: Creating connection to our Corosync
> plugin
> Jun 02 23:12:21 xx attrd: [7992]: info:
> init_ais_connection_classic: Connection to our AIS plugin (9) failed:
> Doesn't exist (12)
> Jun 02 23:12:21 xx attrd: [7992]: ERROR: main: HA Signon failed
> Jun 02 23:12:21 xx attrd: [7992]: info: main: Cluster connection 
> active
> Jun 02 23:12:21 xx attrd: [7992]: info: main: Accepting
> attribute updates
> Jun 02 23:12:21 xx attrd: [7992]: ERROR: main: Aborting startup
> Jun 02 23:12:21 xx crmd: [7994]: debug:
> init_client_ipc_comms_nodispatch: Attempting to talk on:
> /usr/var/run/crm/cib_rw
> Jun 02 23:12:21 xx crmd: [7994]: debug:
> init_client_ipc_comms_nodispatch: Could not init comms on:
> /usr/var/run/crm/cib_rw
> Jun 02 23:12:21 xx crmd: [7994]: debug: cib_native_signon_raw:
> Connection to command channel failed
> Jun 02 23:12:21 xx crmd: [7994]: debug:
> init_client_ipc_comms_nodispatch: Attempting to talk on:
> /usr/var/run/crm/cib_callback
> ...
> 
> 
> 2011/6/2 Steven Dake :
>> On 06/01/2011 11:05 PM, william felipe_welter wrote:
>>> I recompile my kernel without hugetlb .. and the result are the same..
>>>
>>> My test program still resulting:
>>> PATH=/dev/shm/teste123XX
>>> page size=2
>>> fd=3
>>> ADDR_ORIG:0xe000a000  ADDR:0x
>>> Erro
>>>
>>> And Pacemaker still resulting because the mmap error:
>>> Could not initialize Cluster Configuration Database API instance error 2
>>>
>>
>> Give the patch I posted recently a spin - corosync WFM with this patch
>> on sparc64 with hugetlb set.  Please report back results.
>>
>> Regards
>> -steve
>>
>>> For make sure that i have disable the hugetlb there is my /proc/meminfo:
>>> MemTotal:   33093488 kB
>>> MemFree:32855616 kB
>>> Buffers:5600 kB
>>> Cached:53480 kB
>>> SwapCached:0 kB
>>> Active:45768 kB
>>> Inactive:  28104 kB
>>> Active(anon):  18024 kB
>>> Inactive(anon): 1560 kB
>>> Active(file):  27744 kB
>>> Inactive(file):26544 kB
>>> Unevictable:   0 kB
>>> Mlocked:   0 kB
>>> SwapTotal:   6104680 kB
>>> SwapFree:6104680 kB
>>> Dirty: 0 kB
>>> Writeback

Re: [Pacemaker] [Openais] Linux HA on debian sparc

2011-06-02 Thread Steven Dake
On 06/01/2011 11:05 PM, william felipe_welter wrote:
> I recompile my kernel without hugetlb .. and the result are the same..
> 
> My test program still resulting:
> PATH=/dev/shm/teste123XX
> page size=2
> fd=3
> ADDR_ORIG:0xe000a000  ADDR:0x
> Erro
> 
> And Pacemaker still resulting because the mmap error:
> Could not initialize Cluster Configuration Database API instance error 2
> 

Give the patch I posted recently a spin - corosync WFM with this patch
on sparc64 with hugetlb set.  Please report back results.

Regards
-steve

> For make sure that i have disable the hugetlb there is my /proc/meminfo:
> MemTotal:   33093488 kB
> MemFree:32855616 kB
> Buffers:5600 kB
> Cached:53480 kB
> SwapCached:0 kB
> Active:45768 kB
> Inactive:  28104 kB
> Active(anon):  18024 kB
> Inactive(anon): 1560 kB
> Active(file):  27744 kB
> Inactive(file):26544 kB
> Unevictable:   0 kB
> Mlocked:   0 kB
> SwapTotal:   6104680 kB
> SwapFree:6104680 kB
> Dirty: 0 kB
> Writeback: 0 kB
> AnonPages: 14936 kB
> Mapped: 7736 kB
> Shmem:  4624 kB
> Slab:  39184 kB
> SReclaimable:  10088 kB
> SUnreclaim:29096 kB
> KernelStack:7088 kB
> PageTables: 1160 kB
> Quicklists:17664 kB
> NFS_Unstable:  0 kB
> Bounce:0 kB
> WritebackTmp:  0 kB
> CommitLimit:22651424 kB
> Committed_AS: 519368 kB
> VmallocTotal:   1069547520 kB
> VmallocUsed:   11064 kB
> VmallocChunk:   1069529616 kB
> 
> 
> 2011/6/1 Steven Dake :
>> On 06/01/2011 07:42 AM, william felipe_welter wrote:
>>> Steven,
>>>
>>> cat /proc/meminfo
>>> ...
>>> HugePages_Total:   0
>>> HugePages_Free:0
>>> HugePages_Rsvd:0
>>> HugePages_Surp:0
>>> Hugepagesize:   4096 kB
>>> ...
>>>
>>
>> It definitely requires a kernel compile and setting the config option to
>> off.  I don't know the debian way of doing this.
>>
>> The only reason you may need this option is if you have very large
>> memory sizes, such as 48GB or more.
>>
>> Regards
>> -steve
>>
>>> Its 4MB..
>>>
>>> How can i disable hugetlb ? ( passing CONFIG_HUGETLBFS=n at boot to
>>> kernel ?)
>>>
>>> 2011/6/1 Steven Dake mailto:sd...@redhat.com>>
>>>
>>> On 06/01/2011 01:05 AM, Steven Dake wrote:
>>> > On 05/31/2011 09:44 PM, Angus Salkeld wrote:
>>> >> On Tue, May 31, 2011 at 11:52:48PM -0300, william felipe_welter
>>> wrote:
>>> >>> Angus,
>>> >>>
>>> >>> I make some test program (based on the code coreipcc.c) and i
>>> now i sure
>>> >>> that are problems with the mmap systems call on sparc..
>>> >>>
>>> >>> Source code of my test program:
>>> >>>
>>> >>> #include 
>>> >>> #include 
>>> >>> #include 
>>> >>>
>>> >>> #define PATH_MAX  36
>>> >>>
>>> >>> int main()
>>> >>> {
>>> >>>
>>> >>> int32_t fd;
>>> >>> void *addr_orig;
>>> >>> void *addr;
>>> >>> char path[PATH_MAX];
>>> >>> const char *file = "teste123XX";
>>> >>> size_t bytes=10024;
>>> >>>
>>> >>> snprintf (path, PATH_MAX, "/dev/shm/%s", file);
>>> >>> printf("PATH=%s\n",path);
>>> >>>
>>> >>> fd = mkstemp (path);
>>> >>> printf("fd=%d \n",fd);
>>> >>>
>>> >>>
>>> >>> addr_orig = mmap (NULL, bytes, PROT_NONE,
>>> >>>   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>> >>>
>>> >>>
>>> >>> addr = mmap (addr_orig, bytes, PROT_READ | PROT_WRITE,
>>> >>>   MAP_FIXED | MAP_SHARED, fd, 0);
>>> >>>
>>> >>> printf("ADDR_ORIG:%p  ADDR:%p\n",addr_orig,addr);
>>> >>>
>>> >>>
>>>

Re: [Pacemaker] [Openais] Linux HA on debian sparc

2011-06-01 Thread Steven Dake
On 06/01/2011 07:42 AM, william felipe_welter wrote:
> Steven,
> 
> cat /proc/meminfo
> ...
> HugePages_Total:   0
> HugePages_Free:0
> HugePages_Rsvd:0
> HugePages_Surp:0
> Hugepagesize:   4096 kB
> ...
> 

It definitely requires a kernel compile and setting the config option to
off.  I don't know the debian way of doing this.

The only reason you may need this option is if you have very large
memory sizes, such as 48GB or more.
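
If it helps, a quick generic way to confirm how the running kernel was
built (command names only, nothing Debian-specific):

grep HUGETLB /boot/config-$(uname -r)
grep -i huge /proc/meminfo

CONFIG_HUGETLBFS=y in the first output means a kernel rebuild with that
option turned off is needed; the second just shows the values you already
quoted.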

Regards
-steve

> Its 4MB..
> 
> How can i disable hugetlb ? ( passing CONFIG_HUGETLBFS=n at boot to
> kernel ?)
> 
> 2011/6/1 Steven Dake mailto:sd...@redhat.com>>
> 
> On 06/01/2011 01:05 AM, Steven Dake wrote:
> > On 05/31/2011 09:44 PM, Angus Salkeld wrote:
> >> On Tue, May 31, 2011 at 11:52:48PM -0300, william felipe_welter
> wrote:
> >>> Angus,
> >>>
> >>> I make some test program (based on the code coreipcc.c) and i
> now i sure
> >>> that are problems with the mmap systems call on sparc..
> >>>
> >>> Source code of my test program:
> >>>
> >>> #include 
> >>> #include 
> >>> #include 
> >>>
> >>> #define PATH_MAX  36
> >>>
> >>> int main()
> >>> {
> >>>
> >>> int32_t fd;
> >>> void *addr_orig;
> >>> void *addr;
> >>> char path[PATH_MAX];
> >>> const char *file = "teste123XX";
> >>> size_t bytes=10024;
> >>>
> >>> snprintf (path, PATH_MAX, "/dev/shm/%s", file);
> >>> printf("PATH=%s\n",path);
> >>>
> >>> fd = mkstemp (path);
> >>> printf("fd=%d \n",fd);
> >>>
> >>>
> >>> addr_orig = mmap (NULL, bytes, PROT_NONE,
> >>>   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >>>
> >>>
> >>> addr = mmap (addr_orig, bytes, PROT_READ | PROT_WRITE,
> >>>   MAP_FIXED | MAP_SHARED, fd, 0);
> >>>
> >>> printf("ADDR_ORIG:%p  ADDR:%p\n",addr_orig,addr);
> >>>
> >>>
> >>>   if (addr != addr_orig) {
> >>>printf("Erro");
> >>> }
> >>> }
> >>>
> >>> Results on x86:
> >>> PATH=/dev/shm/teste123XX
> >>> fd=3
> >>> ADDR_ORIG:0x7f867d8e6000  ADDR:0x7f867d8e6000
> >>>
> >>> Results on sparc:
> >>> PATH=/dev/shm/teste123XX
> >>> fd=3
> >>> ADDR_ORIG:0xf7f72000  ADDR:0x
> >>
> >> Note: 0x == MAP_FAILED
> >>
> >> (from man mmap)
> >> RETURN VALUE
> >>On success, mmap() returns a pointer to the mapped area.  On
> >>error, the value MAP_FAILED (that is, (void *) -1) is
> returned,
> >>and errno is  set appropriately.
> >>
> >>>
> >>>
> >>> But im wondering if is really needed to call mmap 2 times ?
>  What are the
> >>> reason to call the mmap 2 times, on the second time using the
> address of the
> >>> first?
> >>>
> >>>
> >> Well there are 3 calls to mmap()
> >> 1) one to allocate 2 * what you need (in pages)
> >> 2) maps the first half of the mem to a real file
> >> 3) maps the second half of the mem to the same file
> >>
> >> The point is when you write to an address over the end of the
> >> first half of memory it is taken care of the the third mmap which
> maps
> >> the address back to the top of the file for you. This means you
> >> don't have to worry about ringbuffer wrapping which can be a
> headache.
> >>
> >> -Angus
> >>
> >
> > interesting this mmap operation doesn't work on sparc linux.
> >
> > Not sure how I can help here - Next step would be a follow up with the
> > sparc linux mailing list.  I'll do that and cc you on the message
> - see
> > if we get any response.
> >
> > http://vger.kernel.org/vger-lists.html
> >

Re: [Pacemaker] [Openais] Linux HA on debian sparc

2011-06-01 Thread Steven Dake
On 06/01/2011 01:05 AM, Steven Dake wrote:
> On 05/31/2011 09:44 PM, Angus Salkeld wrote:
>> On Tue, May 31, 2011 at 11:52:48PM -0300, william felipe_welter wrote:
>>> Angus,
>>>
>>> I make some test program (based on the code coreipcc.c) and i now i sure
>>> that are problems with the mmap systems call on sparc..
>>>
>>> Source code of my test program:
>>>
>>> #include 
>>> #include 
>>> #include 
>>>
>>> #define PATH_MAX  36
>>>
>>> int main()
>>> {
>>>
>>> int32_t fd;
>>> void *addr_orig;
>>> void *addr;
>>> char path[PATH_MAX];
>>> const char *file = "teste123XX";
>>> size_t bytes=10024;
>>>
>>> snprintf (path, PATH_MAX, "/dev/shm/%s", file);
>>> printf("PATH=%s\n",path);
>>>
>>> fd = mkstemp (path);
>>> printf("fd=%d \n",fd);
>>>
>>>
>>> addr_orig = mmap (NULL, bytes, PROT_NONE,
>>>   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>>
>>>
>>> addr = mmap (addr_orig, bytes, PROT_READ | PROT_WRITE,
>>>   MAP_FIXED | MAP_SHARED, fd, 0);
>>>
>>> printf("ADDR_ORIG:%p  ADDR:%p\n",addr_orig,addr);
>>>
>>>
>>>   if (addr != addr_orig) {
>>>printf("Erro");
>>> }
>>> }
>>>
>>> Results on x86:
>>> PATH=/dev/shm/teste123XX
>>> fd=3
>>> ADDR_ORIG:0x7f867d8e6000  ADDR:0x7f867d8e6000
>>>
>>> Results on sparc:
>>> PATH=/dev/shm/teste123XX
>>> fd=3
>>> ADDR_ORIG:0xf7f72000  ADDR:0x
>>
>> Note: 0x == MAP_FAILED
>>
>> (from man mmap)
>> RETURN VALUE
>>On success, mmap() returns a pointer to the mapped area.  On
>>error, the value MAP_FAILED (that is, (void *) -1) is returned,
>>and errno is  set appropriately.
>>
>>>
>>>
>>> But im wondering if is really needed to call mmap 2 times ?  What are the
>>> reason to call the mmap 2 times, on the second time using the address of the
>>> first?
>>>
>>>
>> Well there are 3 calls to mmap()
>> 1) one to allocate 2 * what you need (in pages)
>> 2) maps the first half of the mem to a real file
>> 3) maps the second half of the mem to the same file
>>
>> The point is when you write to an address over the end of the
>> first half of memory it is taken care of the the third mmap which maps
>> the address back to the top of the file for you. This means you
>> don't have to worry about ringbuffer wrapping which can be a headache.
>>
>> -Angus
>>
> 
> interesting this mmap operation doesn't work on sparc linux.
> 
> Not sure how I can help here - Next step would be a follow up with the
> sparc linux mailing list.  I'll do that and cc you on the message - see
> if we get any response.
> 
> http://vger.kernel.org/vger-lists.html
> 
>>>
>>>
>>>
>>>
>>> 2011/5/31 Angus Salkeld 
>>>
>>>> On Tue, May 31, 2011 at 06:25:56PM -0300, william felipe_welter wrote:
>>>>> Thanks Steven,
>>>>>
>>>>> Now im try to run on the MCP:
>>>>> - Uninstall the pacemaker 1.0
>>>>> - Compile and install 1.1
>>>>>
>>>>> But now i have problems to initialize the pacemakerd: Could not
>>>> initialize
>>>>> Cluster Configuration Database API instance error 2
>>>>> Debbuging with gdb i see that the error are on the confdb.. most
>>>> specificaly
>>>>> the errors start on coreipcc.c  at line:
>>>>>
>>>>>
>>>>> 448if (addr != addr_orig) {
>>>>> 449goto error_close_unlink;  <- enter here
>>>>> 450   }
>>>>>
>>>>> Some ideia about  what can cause this  ?
>>>>>
>>>>
>>>> I tried porting a ringbuffer (www.libqb.org) to sparc and had the same
>>>> failure.
>>>> There are 3 mmap() calls and on sparc the third one keeps failing.
>>>>
>>>> This is a common way of creating a ring buffer, see:
>>>> http://en.wikipedia.org/wiki/Circular_buffer#Exemplary_POSIX_Implementation
>>>>
>>>> I couldn't get it working in the short time I tried. It's probably
>>

Re: [Pacemaker] [Openais] Linux HA on debian sparc

2011-06-01 Thread Steven Dake
On 05/31/2011 09:44 PM, Angus Salkeld wrote:
> On Tue, May 31, 2011 at 11:52:48PM -0300, william felipe_welter wrote:
>> Angus,
>>
>> I make some test program (based on the code coreipcc.c) and i now i sure
>> that are problems with the mmap systems call on sparc..
>>
>> Source code of my test program:
>>
>> #include 
>> #include 
>> #include 
>>
>> #define PATH_MAX  36
>>
>> int main()
>> {
>>
>> int32_t fd;
>> void *addr_orig;
>> void *addr;
>> char path[PATH_MAX];
>> const char *file = "teste123XX";
>> size_t bytes=10024;
>>
>> snprintf (path, PATH_MAX, "/dev/shm/%s", file);
>> printf("PATH=%s\n",path);
>>
>> fd = mkstemp (path);
>> printf("fd=%d \n",fd);
>>
>>
>> addr_orig = mmap (NULL, bytes, PROT_NONE,
>>   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>
>>
>> addr = mmap (addr_orig, bytes, PROT_READ | PROT_WRITE,
>>   MAP_FIXED | MAP_SHARED, fd, 0);
>>
>> printf("ADDR_ORIG:%p  ADDR:%p\n",addr_orig,addr);
>>
>>
>>   if (addr != addr_orig) {
>>printf("Erro");
>> }
>> }
>>
>> Results on x86:
>> PATH=/dev/shm/teste123XX
>> fd=3
>> ADDR_ORIG:0x7f867d8e6000  ADDR:0x7f867d8e6000
>>
>> Results on sparc:
>> PATH=/dev/shm/teste123XX
>> fd=3
>> ADDR_ORIG:0xf7f72000  ADDR:0x
> 
> Note: 0x == MAP_FAILED
> 
> (from man mmap)
> RETURN VALUE
>On success, mmap() returns a pointer to the mapped area.  On
>error, the value MAP_FAILED (that is, (void *) -1) is returned,
>and errno is  set appropriately.
> 
>>
>>
>> But im wondering if is really needed to call mmap 2 times ?  What are the
>> reason to call the mmap 2 times, on the second time using the address of the
>> first?
>>
>>
> Well there are 3 calls to mmap()
> 1) one to allocate 2 * what you need (in pages)
> 2) maps the first half of the mem to a real file
> 3) maps the second half of the mem to the same file
> 
> The point is when you write to an address over the end of the
> first half of memory it is taken care of the the third mmap which maps
> the address back to the top of the file for you. This means you
> don't have to worry about ringbuffer wrapping which can be a headache.
> 
> -Angus
> 

Interesting that this mmap operation doesn't work on sparc linux.

Not sure how I can help here - the next step would be a follow-up with the
sparc linux mailing list.  I'll do that and cc you on the message - see
if we get any response.

http://vger.kernel.org/vger-lists.html
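
To make the double-mapping trick quoted above concrete, here is a small
self-contained sketch of the same idea (my own illustration, not the
actual coreipcc.c code; the buffer size and file name are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    size_t bytes = (size_t)page_size;   /* must be a multiple of the page size */
    char path[] = "/dev/shm/rbtestXXXXXX";
    char *base, *first, *second;
    int fd;

    fd = mkstemp(path);
    if (fd < 0 || ftruncate(fd, bytes) < 0) {
        perror("mkstemp/ftruncate");
        return 1;
    }

    /* 1) reserve one contiguous region twice the buffer size */
    base = mmap(NULL, bytes * 2, PROT_NONE,
                MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (base == MAP_FAILED) {
        perror("mmap reserve");
        return 1;
    }

    /* 2) map the file over the first half */
    first = mmap(base, bytes, PROT_READ | PROT_WRITE,
                 MAP_FIXED | MAP_SHARED, fd, 0);

    /* 3) map the same file again over the second half */
    second = mmap(base + bytes, bytes, PROT_READ | PROT_WRITE,
                  MAP_FIXED | MAP_SHARED, fd, 0);

    if (first != base || second != base + bytes) {
        fprintf(stderr, "double mapping failed: first=%p second=%p\n",
                (void *)first, (void *)second);
        return 1;
    }

    /* a write that runs past the end of the first mapping shows up at
     * the start of the buffer, so the writer never has to handle
     * wrapping explicitly */
    strcpy(first + bytes - 2, "wrap");
    printf("buffer now starts with: %s\n", first);   /* prints "ap" */

    munmap(base, bytes * 2);
    close(fd);
    unlink(path);
    return 0;
}

The MAP_FIXED re-mapping over the reserved region is the step that was
reported to fail on sparc.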

>>
>>
>>
>>
>> 2011/5/31 Angus Salkeld 
>>
>>> On Tue, May 31, 2011 at 06:25:56PM -0300, william felipe_welter wrote:
 Thanks Steven,

 Now im try to run on the MCP:
 - Uninstall the pacemaker 1.0
 - Compile and install 1.1

 But now i have problems to initialize the pacemakerd: Could not
>>> initialize
 Cluster Configuration Database API instance error 2
 Debbuging with gdb i see that the error are on the confdb.. most
>>> specificaly
 the errors start on coreipcc.c  at line:


 448if (addr != addr_orig) {
 449goto error_close_unlink;  <- enter here
 450   }

 Some ideia about  what can cause this  ?

>>>
>>> I tried porting a ringbuffer (www.libqb.org) to sparc and had the same
>>> failure.
>>> There are 3 mmap() calls and on sparc the third one keeps failing.
>>>
>>> This is a common way of creating a ring buffer, see:
>>> http://en.wikipedia.org/wiki/Circular_buffer#Exemplary_POSIX_Implementation
>>>
>>> I couldn't get it working in the short time I tried. It's probably
>>> worth looking at the clib implementation to see why it's failing
>>> (I didn't get to that).
>>>
>>> -Angus
>>>
>>>
>>> ___
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs:
>>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>>
>>
>>
>>
>> -- 
>> William Felipe Welter
>> --
>> Consultor em Tecnologias Livres
>> william.wel...@4linux.com.br
>> www.4linux.com.br
> 
>> ___
>> Openais mailing list
>> open...@lists.linux-foundation.org
>> https://lists.linux-foundation.org/mailman/listinfo/openais
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

Re: [Pacemaker] Linux HA on debian sparc

2011-05-31 Thread Steven Dake
Note: there are three signals you could possibly see that generate a
core file.

SIGABRT (assert() called in the codebase)
SIGSEGV (segmentation violation)
SIGBUS (alignment error)

Make sure you don't have a sigbus.

Opening the core file with gdb will tell you which signal triggered the
fault.
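
For example (the binary path and core file name here are only
illustrative):

gdb /usr/sbin/corosync core.5870
(gdb) bt full

The "Program terminated with signal ..." line gdb prints when it loads
the core tells you which of the three it was.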

Regards
-steve

On 05/31/2011 08:34 AM, william felipe_welter wrote:
> Im trying to setup HA with corosync and pacemaker using the debian
> packages on SPARC Architecture. Using Debian package corosync  process
> dies after initializate pacemaker process. I make some tests with ltrace
> and strace and this tools tell me that corosync died because a
> segmentation fault. I try a lot of thing to solve this problem, but
> nothing made corosync works.
> 
> My second try is to compile from scratch (using this
> docs:http://www.clusterlabs.org/wiki/Install#From_Source)
>  . This way
> corosync process startup perfectly! but some process of pacemaker don't
> start.. Analyzing log i see the probably reason:
> 
> attrd: [2283]: info: init_ais_connection_once: Connection to our AIS
> plugin (9) failed: Library error (2)
> 
> stonithd: [2280]: info: init_ais_connection_once: Connection to our AIS
> plugin (9) failed: Library error (2)
> .
> cib: [2281]: info: init_ais_connection_once: Connection to our AIS
> plugin (9) failed: Library error (2)
> .
> crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to
> talk on: /usr/var/run/crm/cib_rw
> crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
> comms on: /usr/var/run/crm/cib_rw
> crmd: [3320]: debug: cib_native_signon_raw: Connection to command
> channel failed
> crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to
> talk on: /usr/var/run/crm/cib_callback
> crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
> comms on: /usr/var/run/crm/cib_callback
> crmd: [3320]: debug: cib_native_signon_raw: Connection to callback
> channel failed
> crmd: [3320]: debug: cib_native_signon_raw: Connection to CIB failed:
> connection failed
> crmd: [3320]: debug: cib_native_signoff: Signing out of the CIB Service
> crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to
> talk on: /usr/var/run/crm/cib_rw
> crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
> comms on: /usr/var/run/crm/cib_rw
> crmd: [3320]: debug: cib_native_signon_raw: Connection to command
> channel failed
> crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to
> talk on: /usr/var/run/crm/cib_callback
> crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
> comms on: /usr/var/run/crm/cib_callback
> crmd: [3320]: debug: cib_native_signon_raw: Connection to callback
> channel failed
> crmd: [3320]: debug: cib_native_signon_raw: Connection to CIB failed:
> connection failed
> crmd: [3320]: debug: cib_native_signoff: Signing out of the CIB Service
> crmd: [3320]: info: do_cib_control: Could not connect to the CIB
> service: connection failed
> 
> 
> 
> 
> 
> 
> My conf:
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
> 
> totem {
> version: 2
> join: 60
> token: 3000
> token_retransmits_before_loss_const: 10
> secauth: off
> threads: 0
> consensus: 8601
> vsftype: none
> threads: 0
> rrp_mode: none
> clear_node_high_bit: yes
> max_messages: 20
> interface {
> ringnumber: 0
> bindnetaddr: 10.10.23.0
> mcastaddr: 226.94.1.1
> mcastport: 5405
> }
> }
> 
> logging {
> fileline: off
> to_stderr: no
> to_logfile: yes
> to_syslog: yes
> logfile: /var/log/cluster/corosync.log
> debug: on
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: on
> }
> }
> 
> amf {
> mode: disabled
> }
> 
> service {
> # Load the Pacemaker Cluster Resource Manager
> ver:   0
> name:  pacemaker
> }
> 
> aisexec {
> user:   root
> group:  root
> }
> 
> 
> My Question is: why attrd, cib ... can't connect to  AIS Plugin?  What
> could be the reasons for the connection failed ?
> (Yes, my /dev/shm are tmpfs)
> 
> 
> 
> 
> -- 
> William Felipe Welter
> --
> Consultor em Tecnologias Livres
> william.wel...@4linux.com.br 
> www.4linux.com.br 
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Re: [Pacemaker] [Openais] Linux HA on debian sparc

2011-05-31 Thread Steven Dake
Try running pacemaker using the MCP.  The plugin mode of pacemaker
never really worked very well because of the complexities of POSIX mmap
and fork.  Not having sparc hardware personally, YMMV.  We have recently,
with corosync 1.3.1, gone through an alignment fixing process for ARM
arches - hopefully that solves your alignment problems on sparc as well.
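
As a rough sketch, MCP mode means telling the plugin not to spawn the
pacemaker daemons itself (ver: 1 instead of ver: 0) and starting
pacemakerd separately once corosync is up:

service {
        name:  pacemaker
        ver:   1
}

This assumes a pacemaker 1.1 build with the MCP enabled; the daemons are
then started via pacemakerd (or its init script) rather than by the
plugin.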

Regards
-steve

On 05/31/2011 08:38 AM, william felipe_welter wrote:
> Im trying to setup HA with corosync and pacemaker using the debian
> packages on SPARC Architecture. Using Debian package corosync  process
> dies after initializate pacemaker process. I make some tests with ltrace
> and strace and this tools tell me that corosync died because a
> segmentation fault. I try a lot of thing to solve this problem, but
> nothing made corosync works.
> 
> My second try is to compile from scratch (using this
> docs:http://www.clusterlabs.org/wiki/Install#From_Source)
>  . This way
> corosync process startup perfectly! but some process of pacemaker don't
> start.. Analyzing log i see the probably reason:
> 
> attrd: [2283]: info: init_ais_connection_once: Connection to our AIS
> plugin (9) failed: Library error (2)
> 
> stonithd: [2280]: info: init_ais_connection_once: Connection to our AIS
> plugin (9) failed: Library error (2)
> .
> cib: [2281]: info: init_ais_connection_once: Connection to our AIS
> plugin (9) failed: Library error (2)
> .
> crmd: [3320]: debug: init_client_ipc_comms_
> nodispatch: Attempting to talk on: /usr/var/run/crm/cib_rw
> crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
> comms on: /usr/var/run/crm/cib_rw
> crmd: [3320]: debug: cib_native_signon_raw: Connection to command
> channel failed
> crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to
> talk on: /usr/var/run/crm/cib_callback
> crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
> comms on: /usr/var/run/crm/cib_callback
> crmd: [3320]: debug: cib_native_signon_raw: Connection to callback
> channel failed
> crmd: [3320]: debug: cib_native_signon_raw: Connection to CIB failed:
> connection failed
> crmd: [3320]: debug: cib_native_signoff: Signing out of the CIB Service
> crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to
> talk on: /usr/var/run/crm/cib_rw
> crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
> comms on: /usr/var/run/crm/cib_rw
> crmd: [3320]: debug: cib_native_signon_raw: Connection to command
> channel failed
> crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Attempting to
> talk on: /usr/var/run/crm/cib_callback
> crmd: [3320]: debug: init_client_ipc_comms_nodispatch: Could not init
> comms on: /usr/var/run/crm/cib_callback
> crmd: [3320]: debug: cib_native_signon_raw: Connection to callback
> channel failed
> crmd: [3320]: debug: cib_native_signon_raw: Connection to CIB failed:
> connection failed
> crmd: [3320]: debug: cib_native_signoff: Signing out of the CIB Service
> crmd: [3320]: info: do_cib_control: Could not connect to the CIB
> service: connection failed
> 
> 
> 
> 
> 
> 
> My conf:
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
> 
> totem {
> version: 2
> join: 60
> token: 3000
> token_retransmits_before_loss_const: 10
> secauth: off
> threads: 0
> consensus: 8601
> vsftype: none
> threads: 0
> rrp_mode: none
> clear_node_high_bit: yes
> max_messages: 20
> interface {
> ringnumber: 0
> bindnetaddr: 10.10.23.0
> mcastaddr: 226.94.1.1
> mcastport: 5405
> }
> }
> 
> logging {
> fileline: off
> to_stderr: no
> to_logfile: yes
> to_syslog: yes
> logfile: /var/log/cluster/corosync.log
> debug: on
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: on
> }
> }
> 
> amf {
> mode: disabled
> }
> 
> service {
> # Load the Pacemaker Cluster Resource Manager
> ver:   0
> name:  pacemaker
> }
> 
> aisexec {
> user:   root
> group:  root
> }
> 
> 
> My Question is: why attrd, cib ... can't connect to  AIS Plugin?  What
> could be the reasons for the connection failed ?
> (Yes, my /dev/shm are tmpfs)
> 
> 
> 
> -- 
> William Felipe Welter
> --
> Consultor em Tecnologias Livres
> william.wel...@4linux.com.br 
> www.4linux.com.br 
> 
> 
> 
> ___
> Openais mailing list
> open...@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

Re: [Pacemaker] [Openais] Corosync goes into endless loop when same hostname is used on more than one node

2011-05-12 Thread Steven Dake
On 05/12/2011 07:04 AM, Dan Frincu wrote:
> Hi,
> 
> When using the same hostname on 2 nodes (debian squeeze, corosync
> 1.3.0-3 from unstable) the following happens:
> 
> May 12 08:36:27 debian cib: [3125]: info: cib_process_request: Operation
> complete: op cib_sync for section 'all' (origin=local/crmd/84,
> version=0.5.1): ok (rc=0)
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
> has id: 620757002
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_FSA_INTERNAL origin=check_join_state ]
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: All 1
> cluster nodes responded to the join offer.
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_finalize: join-29:
> Syncing the CIB from debian to the rest of the cluster
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
> has id: 603979786
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_JOIN_REQUEST
> cause=C_HA_MESSAGE origin=route_message ]
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Unset DC debian
> May 12 08:36:27 debian cib: [3125]: info: cib_process_request: Operation
> complete: op cib_sync for section 'all' (origin=local/crmd/86,
> version=0.5.1): ok (rc=0)
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_offer_all:
> join-30: Waiting on 1 outstanding join acks
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Set DC to debian
> (3.0.1)
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
> has id: 620757002
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_FSA_INTERNAL origin=check_join_state ]
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: All 1
> cluster nodes responded to the join offer.
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_finalize: join-30:
> Syncing the CIB from debian to the rest of the cluster
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
> has id: 603979786
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_JOIN_REQUEST
> cause=C_HA_MESSAGE origin=route_message ]
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Unset DC debian
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_offer_all:
> join-31: Waiting on 1 outstanding join acks
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Set DC to debian
> (3.0.1)
> May 12 08:36:27 debian cib: [3125]: info: cib_process_request: Operation
> complete: op cib_sync for section 'all' (origin=local/crmd/88,
> version=0.5.1): ok (rc=0)
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
> has id: 620757002
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_FSA_INTERNAL origin=check_join_state ]
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: All 1
> cluster nodes responded to the join offer.
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_finalize: join-31:
> Syncing the CIB from debian to the rest of the cluster
> May 12 08:36:27 debian crmd: [3129]: info: crm_get_peer: Node debian now
> has id: 603979786
> May 12 08:36:27 debian crmd: [3129]: info: do_state_transition: State
> transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_JOIN_REQUEST
> cause=C_HA_MESSAGE origin=route_message ]
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Unset DC debian
> May 12 08:36:27 debian crmd: [3129]: info: do_dc_join_offer_all:
> join-32: Waiting on 1 outstanding join acks
> May 12 08:36:27 debian crmd: [3129]: info: update_dc: Set DC to debian
> (3.0.1)
> 
> Basically it goes into an endless loop. This is a improperly configured
> option, but it would help the users if there was a handling of this or a
> relevant message printed in the logfile, such as "duplicate hostname found".
> 

Dan,

I believe this is a pacemaker RFE.  Corosync operates entirely on IP
addresses and never does any hostname to IP resolution (because the
resolver can block and cause bad things to happen).

> Regards.
> Dan
> 
> -- 
> Dan Frincu
> CCNA, RHCE
> 
> 
> 
> ___
> Openais mailing list
> open...@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Pacemaker Cloud Policy Engine Red Hat Summit slides and Mailing List

2011-05-08 Thread Steven Dake
In February we announced our intentions to work on a cloud-specific high
availability solution on this list.  The code is coming along, and we
have reached a point where we should have a mailing list dedicated to
cloud-specific topics of Pacemaker.

The mailing list subscription page is:

http://oss.clusterlabs.org/mailman/listinfo/pcmk-cloud

To see how we have progressed since February, have a look at the source
in our git repo, or take a look at the Red Hat Summit 2011 slides where
our work was presented this last week:
http://www.redhat.com/summit/2011/presentations/summit/whats_new/thursday/dake_th_1130_high_availability_in_the_cloud.pdf

If you're interested in cloud high availability technology, please feel
free to participate on our mailing lists.  Your input there is
invaluable to ensuring we deliver a great project that downstream
distros and administrators can use.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] announcing the Pacemaker Cloud Policy Engine subproject

2011-03-01 Thread Steven Dake
Hi,

I'd like to spend a moment to tell you about a new project that Angus
Salkeld and I are working on.  The project, called the Pacemaker Cloud
Policy Engine, is a cloud-specific policy engine and will act as a
sub-project of the Pacemaker project.

We are doing a ground-up implementation of a cloud policy engine, using
a few other open source building block components.

Our dependencies are:
QPID/QMF (provides a management bus for communication of the various
components)
Upstart (provides a mechanism to launch our internal processes)
Pacemaker Policy Engine library (provides a mechanism for us to make
policy decisions)
Matahari (provides a mechanism to monitor VM images)

We have decided on a general model for managing cloud deployments:
Assembly: A VM image with a Matahari instance
Deployable: Collection of Assemblies

We are working on 5 components at the moment:
1. QMF model of the methods/events for various components
2. CLI Shell
2.1 requests the CPE start a deployable
2.2 requests the CPE stop a deployable
2.3 Displays assembly failures
3. Cloud Policy Engine
3.1 starts a deployable policy engine
3.2 Stops a deployable policy engine
4. VM Launcher
4.1 Starts VM images
5. Deployable Policy Engine
5.1 Requests VM launcher to start images
5.2 Monitors a VM image via matahari
5.3 Starts applications via matahari
5.4 Stops applications via matahari
5.5 Recovers failures detected by matahari managed applications
5.6 Uses the Pacemaker policy engine library to make decisions about
which sort of matahari actions to take
5.7 generates assembly failure events for the cli shell to display based
upon matahari output

Our repository is here:
git://gitorious.org/cloud-policy-engine/mainline.git

For the moment we will share the pacemaker list for our development:
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Come join us!
-steve


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Cluster Communication fails after VMWare Migration

2011-03-01 Thread Steven Dake
On 02/25/2011 12:40 AM, Andrew Beekhof wrote:
> On Wed, Feb 23, 2011 at 10:31 AM,   wrote:
>>
>> Have build a 2 node apache cluster on VMWare virtual machines, which was 
>> running as expected. We had to migrate the machines to another computing 
>> center and after that the cluster communication didn't work anymore. 
>> Migration of vmS causes a change of the networks mac address. Maybe that's 
>> the reason for my problem. After removing one node from the cluster and 
>> adding it again the communication worked. Because migrations between 
>> computing centers can happen at any time (mirrored esx infrastructure), I 
>> have to find out, if this breaks the cluster communication.
> 
> Cluster communication issues are the domain of corosync/heartbeat -
> their mailing lists may be able to provide more information.
> We're just the poor consumer of their services :-)
> 

poor consumer lol

Regarding migration, I doubt a MAC address migration will work properly
with modern switches and IGMP.  For that type of operation you will
definitely want to take multicast out of the equation and instead use
the udpu transport mode.
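
A rough sketch of what that looks like in corosync.conf (addresses are
placeholders, and it needs a corosync build that ships the udpu
transport):

totem {
        version: 2
        transport: udpu
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.1.0
                mcastport: 5405
                member {
                        memberaddr: 192.168.1.1
                }
                member {
                        memberaddr: 192.168.1.2
                }
        }
}

Every cluster member is listed explicitly, so no multicast or IGMP
behaviour in the switches is involved.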

Keep in mind the corosync devs don't test the types of things you talk
about as we don't have proprietary software licenses.

Regards
-steve

>>
>> regards Uwe
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: 
>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>
>>
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] corosync crash

2011-03-01 Thread Steven Dake
On 02/25/2011 12:38 AM, Andrew Beekhof wrote:
> This is the same one you sent to the openais list right?
> 

Andrew,

This was root caused to a faulty network setup resulting in the failed
to receive abort we are working on currently.  One key detail missing
from this thread is the implementation worked great on VMW ESX 4.0 but
then started having problem in ESX 4.1.

Regards
-steve

> On Thu, Feb 24, 2011 at 10:32 AM,   wrote:
>>
>> Hi,
>>
>> my configuration has 2 nodes, one has a set of virtual adresses and a 
>> webservice. The situation before crash:
>> node1: has all resources
>> node2: online, no resources
>>
>> action on node2: crm standby node2
>> result on node1: corosync crashes, the child processes consume all available 
>> cpu time
>>
>> my actions: stop all child processes on node1 (kill -9) and restart corosync
>>
>> result on node1:
>> node1: online, all resources
>> node2: offline
>>
>> result on node2:
>> node1: offline
>> node2: online, all resources
>>
>> The only way I found to workaround this problem: remove node2 from the 
>> cluster and add it again.
>> There should be other solutions, maybe someone can help. Appended the 
>> coredump and fplay.
>>
>> Update: If I keep the cluster in the split brain state, it recovers after 
>> about 9 hours (logfile available)
>>
>> regards Uwe
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: 
>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>
>>
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Article on HA in the IBM cloud using Pacemaker and Heartbeat

2011-01-28 Thread Steven Dake
On 01/28/2011 08:02 AM, Alan Robertson wrote:
>  Hi,
> 
> I recently co-authored an article on HA in the IBM cloud using Pacemaker
> and Heartbeat.
> 
> http://www.ibm.com/developerworks/cloud/library/cl-highavailabilitycloud/
> 
> The cool thing is that the IBM cloud supports virtual IPs.  With most of
> the other clouds you have to do DNS failover - which is sub-optimal
> ;-).  Of course, they added this after we harangued them ;-) - but still
> it's very nice to have.
> 
> It uses Heartbeat rather than Corosync because (for good reason) clouds
> don't support multicast or broadcast.
> 

Corosync works in non-broadcast/non-multicast modes (the transport is
called udpu).

Regards
-steve

> There will be a follow-up article on setting up DRBD in the cloud as
> well...  Probably a month away or so...
> 
> -- 
> Alan Robertson 
> 
> "Openness is the foundation and preservative of friendship...  Let me claim 
> from you at all times your undisguised opinions." - William Wilberforce
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] pacemaker + corosync in the cloud

2010-12-15 Thread Steven Dake
On 12/14/2010 05:14 PM, ruslan usifov wrote:
> Hi
> 
> Is it possible to use pacemaker based on corosync in the cloud hosting
> like amazon or soflayer?
> 
> 
> 

yes with corosync 1.3.0 in udpu mode.  The udpu mode avoids the use of
multicast allowing operation in amazon's cloud.

Regards
-steve

> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] UDPU transport patch added, when will the RPMs be available

2010-11-22 Thread Steven Dake
On 11/22/2010 09:27 AM, Dan Frincu wrote:
> Hi Steven,
> 
> Steven Dake wrote:
>> On 11/19/2010 11:42 AM, Andrew Beekhof wrote:
>>   
>>> On Fri, Nov 19, 2010 at 11:38 AM, Dan Frincu  wrote:
>>> 
>>>> Hi,
>>>>
>>>> The subject is pretty self-explanatory but I'll ask anyway, the patch for
>>>> UDPU has been released, this adds the ability to set unicast peer addresses
>>>> of nodes in a cluster, in network environments where multicast is not an
>>>> option. When will it be available as an RPM?
>>>>   
>>> When upstream does a new release.
>>>
>>> 
>>
>> Dan,
>>
>> The flatiron branch (containing the udpu patches) is going through
>> testing for 1.3.0.  We find currently that single CPU virtual machine
>> systems seem to have problems with these patches which we will sort out
>> before release.
>>
>> Regards
>> -steve
>>
>>
>>   
> I've taken the (tip I think it is called) of corosync.git and compiled
> the RPM's on RH5U3 64-bit (I got the code the day it was first released,
> haven't had a chance to post yet).
> 

First off, we release from the flatiron branch.  It is our stable
branch.  From git, do

git checkout flatiron

This will provide the full flatiron branch for building.

> # git show
> commit 565b32c2621c08f82cab57420217060d100d4953
> Author: Fabio M. Di Nitto 
> Date:   Fri Nov 19 09:21:47 2010 +0100
> 
> There were some issues when compiling, deps mostly, some in the spec
> related to version which was UNKNOWN, I did a sed, placed 1.2.9 as a

We are aware of this problem.  We just moved from svn to git, and there
is some pain associated.  This particular problem comes from a lack of a
specific type of tag in the git repo for version numbers.  It will be
fixed once 1.3.0 is released.  Then RPM builds will work as expected.

> number instead of UNKNOWN and it compiled OK. I've installed it on two
> Xen VM's I use for testing and found some issues so the question is:
> where can I send feedback (and what kind of feedback is required) about
> development code? I'm not saying that you guys haven't run into these
> errors, maybe you did and they were fixed and maybe some are specific to
> my setup and haven't been found so, if I can provide some feedback on
> development code, I'd be more than happy to, if that's OK.
> 

We are certainly interested in contributions to the master branch.  What
most people use is the flatiron branch.  If you see defects on the tip
of flatiron, let us know, and we will work to address them.

The best way to report an issue is to start a conversation on our
mailing list (in the cc) "Is XYZ supposed to happen?".  The developers
can say yes or no and ask for further information if there is a defect.

Regards
-steve

> Also, I've read about the cluster test suite, but I'm not actually sure
> how it works, could somebody provide some details as to how I can use
> the cluster test suite on a cluster to check for issues and then how can
> I report if there are any issues found (again, what kind of feedback is
> required).
> 
> Regards,
> Dan
> 
> p.s.: ignore my other email, I didn't see the reply on this one.
> 
>>>> If I'm barking up the wrong tree, please direct me to the proper channel to
>>>> direct this request, I'm really looking forward to testing the UDPU.
>>>>
>>>> Regards,
>>>>
>>>> Dan
>>>>
>>>> --
>>>> Dan FRINCU
>>>> Systems Engineer
>>>> CCNA, RHCE
>>>> Streamwide Romania
>>>>
>>>>
>>>> ___
>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs:
>>>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>>>
>>>>   
>>> ___
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: 
>>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>> 
>>
>&

Re: [Pacemaker] service corosync start failed

2010-11-22 Thread Steven Dake
On 11/22/2010 01:27 AM, jiaju liu wrote:
> Hi all
> If I use command like this
>  service corosync start
> it shows
> Starting Corosync Cluster Engine (corosync):   [FAILED]
>  
> and I do nothing just reboot my computer it will be OK what is the
> reason?
> Thanks a lot
>  
> my pacemaker packages are
>  pacemaker-1.0.8-6.1.el5
>  pacemaker-libs-devel-1.0.8-6.1.el5
>  pacemaker-libs-1.0.8-6.1.el5
> 
>  openais packages?are
>  openaislib-devel-1.1.0-1.el5
>  openais-1.1.0-1.el5
>  openaislib-1.1.0-1.el5
> 
> corosync packages are
>  corosync-1.2.2-1.1.el5
>  corosynclib-devel-1.2.2-1.1.el5
>  corosynclib-1.2.2-1.1.el5
>  who know why thanks a lot
> 
> 
>  
> 

Your packages are about 1 year old.  I'd suggest updating - we release z
streams to fix bugs and problems that people run into.

Regards
-steve

> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] UDPU transport patch added, when will the RPMs be available

2010-11-19 Thread Steven Dake
On 11/19/2010 11:42 AM, Andrew Beekhof wrote:
> On Fri, Nov 19, 2010 at 11:38 AM, Dan Frincu  wrote:
>> Hi,
>>
>> The subject is pretty self-explanatory but I'll ask anyway, the patch for
>> UDPU has been released, this adds the ability to set unicast peer addresses
>> of nodes in a cluster, in network environments where multicast is not an
>> option. When will it be available as an RPM?
> 
> When upstream does a new release.
> 

Dan,

The flatiron branch (containing the udpu patches) is going through
testing for 1.3.0.  We find currently that single CPU virtual machine
systems seem to have problems with these patches which we will sort out
before release.

Regards
-steve


>>
>> If I'm barking up the wrong tree, please direct me to the proper channel to
>> direct this request, I'm really looking forward to testing the UDPU.
>>
>> Regards,
>>
>> Dan
>>
>> --
>> Dan FRINCU
>> Systems Engineer
>> CCNA, RHCE
>> Streamwide Romania
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs:
>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Corosync using unicast instead of multicast

2010-11-08 Thread Steven Dake

On 11/08/2010 05:50 AM, Dan Frincu wrote:

Hi,

Steven Dake wrote:

On 11/05/2010 01:30 AM, Dan Frincu wrote:

Hi,

Alan Jones wrote:

This question should be on the openais list, however, I happen to know
the answer.
To get up and running quickly you can configure broadcast with the
version you have.


I've done that already, however I was a little concerned as to what
Steven Dake said on the openais mailing list about using broadcast
"Broadcast and redundant ring probably don't work to well together.".

I've also done some testing and saw that the broadcast address used is
255.255.255.255, regardless of what the bindnetaddr network address is,
and quite frankly, I was hoping to see a directed broadcast address.
This wasn't the case, therefore I wonder whether this was the issue that
Steven was referring to, because by using the 255.255.255.255 as a
broadcast address, there is the slight chance that some application
running in the same network might send a broadcast packet using the same


This can happen with multicast or unicast modes as well. If a third
party application communicates on the multicast/port combo or unicast
port of a cluster node, there is conflict.

With encryption, corosync encrypts and authenticates all packets,
ignoring packets without a proper signature. The signatures are
difficult to spoof. Without encryption, bad things happen in this
condition.

For more details, read "SECURITY" file in our source distribution.


OK, I read the SECURITY file, a lot of overhead is added, I understand
the reasons why it does it this way, not going to go into the details
right now. Basically enabling encryption ensures that any traffic going
between the nodes is both encrypted and authenticated, so rogue messages
that happen to reach the exact network socket will be discarded. I'll
come back to this a little bit later.

Then again, I have this sentence in my head that I can't seem to get rid
of "Broadcast and redundant ring probably don't work to well together,
broadcast and redundant ring probably don't work to well together"
and also I read "OpenAIS now provides broadcast network communication in
addition to multicast. This functionality is considered Technology
Preview for standalone usage of OpenAIS", therefore I'm a little bit
more concerned.

Can you shed some light on this please? Two questions:

1) What do you mean by "Broadcast and redundant ring probably don't work
to well together"?



broadcast requires a specific port to run on.  As a result, the ports 
should be different for each interface.  I have not done any specific 
testing on broadcast with redundant ring - you would probably be the first.
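
Purely as a sketch (untested, per the caveat above, and with made-up
addresses), such a configuration would look something like:

totem {
        version: 2
        rrp_mode: passive
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.1.0
                broadcast: yes
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.2.0
                broadcast: yes
                mcastport: 5415
        }
}

with a different mcastport per ring, since with broadcast there is no
multicast address left to tell the rings apart.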



2) Is using Corosync's broadcast feature instead of multicast stable
enough to be used in production systems?



Personally I'd wait for 2.0 for this feature and use bonding for the moment.
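
As a rough sketch (the interface names, addresses and RHEL/CentOS 5
ifcfg conventions here are my assumptions, not something from this
thread), an active-backup bond for the cluster interconnect looks like:

/etc/sysconfig/network-scripts/ifcfg-bond0:
        DEVICE=bond0
        IPADDR=192.168.1.10
        NETMASK=255.255.255.0
        BOOTPROTO=none
        ONBOOT=yes
        BONDING_OPTS="mode=active-backup miimon=100"

/etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for eth1):
        DEVICE=eth0
        MASTER=bond0
        SLAVE=yes
        BOOTPROTO=none
        ONBOOT=yes

(On older RHEL 5 releases the bonding options go in /etc/modprobe.conf
instead of BONDING_OPTS.)  corosync's bindnetaddr then simply points at
the bond0 network and a single ring rides on top of it.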


Thank you in advance.

Best regards,

Dan

port as configured on the cluster. How would the cluster react to that,
would it ignore the packet, would it wreak havoc?

Regards,

Dan

That's my main concern right now.

Corosync can distinguish separate clusters with the multicast address
and port that become payload to the messages.
The patch you referred to can be applied to the top of tree for
corosync or you can wait for a new release 1.3.0 planned for the end
of November.
Alan

On Thu, Nov 4, 2010 at 1:02 AM, Dan Frincu
wrote:


Hi all,

I'm having an issue with a setup using the following:
cluster-glue-1.0.6-1.6.el5.x86_64.rpm
cluster-glue-libs-1.0.6-1.6.el5.x86_64.rpm
corosync-1.2.7-1.1.el5.x86_64.rpm
corosynclib-1.2.7-1.1.el5.x86_64.rpm
drbd83-8.3.2-6.el5_3.x86_64.rpm
kmod-drbd83-8.3.2-6.el5_3.x86_64.rpm
openais-1.1.3-1.6.el5.x86_64.rpm
openaislib-1.1.3-1.6.el5.x86_64.rpm
pacemaker-1.0.9.1-1.el5.x86_64.rpm
pacemaker-libs-1.0.9.1-1.el5.x86_64.rpm
resource-agents-1.0.3-2.el5.x86_64.rpm

This is a two-node HA cluster, with the nodes interconnected via
bonded
interfaces through the switch. The issue is that I have no control
of the
switch itself, can't do anything about that, and from what I
understand the
environment doesn't allow enabling multicast on the switch. In this
situation, how can I have the setup functional (with redundant rings,
rrp_mode: active) without using multicast.

I've seen that individual network sockets are formed between nodes,
unicast
sockets, as well as the multicast sockets. I'm interested in
knowing how
will the lack of multicast affect the redundant rings, connectivity,
failover, etc.

I've also seen this page
https://lists.linux-foundation.org/pipermail/openais/2010-October/015271.html

And here it states using UDPU transport mode avoids using multicast or
broadcast, but it's a patch, is this integrated in any of the newer
versions
of corosync?

Thank you in advance.

Regards,

Dan

--
Dan FRINCU
Sys

Re: [Pacemaker] Corosync using unicast instead of multicast

2010-11-05 Thread Steven Dake

On 11/05/2010 01:30 AM, Dan Frincu wrote:

Hi,

Alan Jones wrote:

This question should be on the openais list, however, I happen to know
the answer.
To get up and running quickly you can configure broadcast with the
version you have.


I've done that already, however I was a little concerned as to what
Steven Dake said on the openais mailing list about using broadcast
"Broadcast and redundant ring probably don't work to well together.".

I've also done some testing and saw that the broadcast address used is
255.255.255.255, regardless of what the bindnetaddr network address is,
and quite frankly, I was hoping to see a directed broadcast address.
This wasn't the case, therefore I wonder whether this was the issue that
Steven was referring to, because by using the 255.255.255.255 as a
broadcast address, there is the slight chance that some application
running in the same network might send a broadcast packet using the same


This can happen with multicast or unicast modes as well.  If a third 
party application communicates on the multicast/port combo or unicast 
port of a cluster node, there is conflict.


With encryption, corosync encrypts and authenticates all packets, 
ignoring packets without a proper signature.  The signatures are 
difficult to spoof.  Without encryption, bad things happen in this 
condition.


For more details, read "SECURITY" file in our source distribution.
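
As a minimal sketch of what that involves (values illustrative), secauth
is turned on in the totem section and the generated key is copied to
every node:

totem {
        version: 2
        secauth: on
        # interface { } section as usual
}

# run once; creates /etc/corosync/authkey from /dev/random
corosync-keygen
scp /etc/corosync/authkey othernode:/etc/corosync/

There is a CPU cost for the HMAC/encryption work, as the SECURITY file
describes, but it is what protects the ring from stray or spoofed
packets.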


port as configured on the cluster. How would the cluster react to that,
would it ignore the packet, would it wreak havoc?

Regards,

Dan

That's my main concern right now.

Corosync can distinguish separate clusters with the multicast address
and port that become payload to the messages.
The patch you referred to can be applied to the top of tree for
corosync or you can wait for a new release 1.3.0 planned for the end
of November.
Alan

On Thu, Nov 4, 2010 at 1:02 AM, Dan Frincu  wrote:


Hi all,

I'm having an issue with a setup using the following:
cluster-glue-1.0.6-1.6.el5.x86_64.rpm
cluster-glue-libs-1.0.6-1.6.el5.x86_64.rpm
corosync-1.2.7-1.1.el5.x86_64.rpm
corosynclib-1.2.7-1.1.el5.x86_64.rpm
drbd83-8.3.2-6.el5_3.x86_64.rpm
kmod-drbd83-8.3.2-6.el5_3.x86_64.rpm
openais-1.1.3-1.6.el5.x86_64.rpm
openaislib-1.1.3-1.6.el5.x86_64.rpm
pacemaker-1.0.9.1-1.el5.x86_64.rpm
pacemaker-libs-1.0.9.1-1.el5.x86_64.rpm
resource-agents-1.0.3-2.el5.x86_64.rpm

This is a two-node HA cluster, with the nodes interconnected via bonded
interfaces through the switch. The issue is that I have no control of the
switch itself, can't do anything about that, and from what I understand the
environment doesn't allow enabling multicast on the switch. In this
situation, how can I have the setup functional (with redundant rings,
rrp_mode: active) without using multicast.

I've seen that individual network sockets are formed between nodes, unicast
sockets, as well as the multicast sockets. I'm interested in knowing how
will the lack of multicast affect the redundant rings, connectivity,
failover, etc.

I've also seen this page
https://lists.linux-foundation.org/pipermail/openais/2010-October/015271.html
And here it states using UDPU transport mode avoids using multicast or
broadcast, but it's a patch, is this integrated in any of the newer versions
of corosync?

Thank you in advance.

Regards,

Dan

--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list:Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home:http://www.clusterlabs.org
Getting started:http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs:
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker





___
Pacemaker mailing list:Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home:http://www.clusterlabs.org
Getting started:http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs:http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Fail over algorithm used by Pacemaker

2010-10-04 Thread Steven Dake

On 10/03/2010 07:01 AM, hudan studiawan wrote:

Hi,

I want to start to contribute to Pacemaker project. I start to read
Documentation and try some basic configurations. I have a question: what
kind of algorithm used by Pacemaker to choose another node when a node
die in a cluster? Is there any manual or documentation I can read?

Thank you,
Hudan




In the case of using Corosync, we use a protocol designed in the 90s to 
determine membership.  It is called The Totem Single Ring Protocol:


http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.767&rep=rep1&type=pdf

Its full operation is described in that PDF.

Regards
-steve



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Corosync node detection working too good

2010-10-04 Thread Steven Dake

On 10/04/2010 02:04 AM, Stephan-Frank Henry wrote:

Hello all,

still working on my nodes and although the last problem is not officially 
solved (I hard coded certain versions of the packages and that seems to be ok 
now) I have a different interesting feature I need to handle.

I am setting up my nodes by default as single node setups. But today when I set 
up another node, *without* doing any special config to make them know each 
other, the corosyncs on each nodes found each other and distributed the cib.xml 
between each other.
They both also show up together in crm_mon.

Not quite what I wanted. :)

I presume I have a config that is too generic and thus the nodes are finding 
each other and thinking they should link up.
What configs do I have to look into to avoid this?

thanks


A unique cluster is defined by mcastaddr and mcastport in the file

/etc/corosync/corosync.conf

If you simply installed them, you may have the same corosync.conf file 
for each unique cluster which would result in the problem you describe.
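
For example (addresses made up), two clusters sharing a network stay
separate as long as each has its own pair:

# cluster A, /etc/corosync/corosync.conf
        mcastaddr: 226.94.1.1
        mcastport: 5405

# cluster B
        mcastaddr: 226.94.2.1
        mcastport: 5407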


Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Timeout after nodejoin

2010-09-22 Thread Steven Dake

On 09/22/2010 05:43 AM, Dan Frincu wrote:

Hi all,

I have the following packages:

# rpm -qa | grep -i "(openais|cluster|heartbeat|pacemaker|resource)"
openais-0.80.5-15.2
cluster-glue-1.0-12.2
pacemaker-1.0.5-4.2
cluster-glue-libs-1.0-12.2
resource-agents-1.0-31.5
pacemaker-libs-1.0.5-4.2
pacemaker-mgmt-1.99.2-7.2
libopenais2-0.80.5-15.2
heartbeat-3.0.0-33.3
pacemaker-mgmt-client-1.99.2-7.2

When I start openais, I get nodejoin immediately, as seen in the logs
below. However, it takes some time before the nodes are visible in
crm_mon output. Any idea how to minimize this delay?

Sep 22 15:27:24 bench1 openais[12935]: [crm ] info:
send_member_notification: Sending membership update 8 to 1 children
Sep 22 15:27:24 bench1 openais[12935]: [CLM ] got nodejoin message
192.168.165.33
Sep 22 15:27:24 bench1 openais[12935]: [CLM ] got nodejoin message
192.168.165.35
Sep 22 15:27:24 bench1 mgmtd: [12947]: info: Started.
Sep 22 15:27:24 bench1 openais[12935]: [crm ] WARN: route_ais_message:
Sending message to local.crmd failed: unknown (rc=-2)
Sep 22 15:27:24 bench1 openais[12935]: [crm ] WARN: route_ais_message:
Sending message to local.crmd failed: unknown (rc=-2)
Sep 22 15:27:24 bench1 openais[12935]: [crm ] info: pcmk_ipc: Recorded
connection 0x174840d0 for crmd/12946
Sep 22 15:27:24 bench1 openais[12935]: [crm ] info: pcmk_ipc: Sending
membership update 8 to crmd
Sep 22 15:27:24 bench1 openais[12935]: [crm ] info:
update_expected_votes: Expected quorum votes 1024 -> 2
Sep 22 15:27:25 bench1 crmd: [12946]: notice: ais_dispatch: Membership
8: quorum aquired
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_election_count_vote:
Election 2 (owner: bench2) pass: vote from bench2 (Host name)
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_state_transition: State
transition S_PENDING -> S_ELECTION [ input=I_ELECTION
cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_state_transition: State
transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_election_check ]
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_te_control: Registering
TE UUID: 87c28ab8-ba93-4111-a26a-67e88dd927fb
Sep 22 15:28:15 bench1 crmd: [12946]: WARN:
cib_client_add_notify_callback: Callback already present
Sep 22 15:28:15 bench1 crmd: [12946]: info: set_graph_functions: Setting
custom graph functions
Sep 22 15:28:15 bench1 crmd: [12946]: info: unpack_graph: Unpacked
transition -1: 0 actions in 0 synapses
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_dc_takeover: Taking over
DC status for this partition
Sep 22 15:28:15 bench1 cib: [12942]: info: cib_process_readwrite: We are
now in R/W mode

Regards,

Dan



Where did you get that version of openais?  openais 0.80.x is deprecated 
in the community (and hence, no support).  We recommend using corosync 
instead which has improved testing with pacemaker.


Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Connection to our AIS plugin (9) failed: Library error

2010-09-22 Thread Steven Dake

On 09/22/2010 04:02 AM, Szymon Hersztek wrote:


Wiadomość napisana w dniu 2010-09-22, o godz. 10:26, przez Andrew Beekhof:


2010/9/21 Szymon Hersztek :


Wiadomość napisana w dniu 2010-09-21, o godz. 09:08, przez Andrew
Beekhof:


2010/9/21 Szymon Hersztek :


Wiadomość napisana w dniu 2010-09-21, o godz. 08:34, przez Andrew
Beekhof:


On Mon, Sep 20, 2010 at 3:34 PM, Szymon Hersztek 
wrote:


Hi
Im trying to setup corosync to work as drbd cluster but after
installing
follow by http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
i got error like below:


Unusual, but did pacemaker fork a replacement attrd process?
At what time did corosync start?



corosync was started manually or do you want to have exact time of
start
?


well you included at most 1 second's worth of logging.
so its kinda hard to know if something took too long or what recovery
was attempted.


Ok it is not a problem to send more. Do you need debug logging or
standard
I have to install server once again so in half of hour i can
reproduce logs



Here's your issue:

corosynclib   i386     1.2.7-1.1.el5   clusterlabs   155 k
corosynclib   x86_64   1.2.7-1.1.el5   clusterlabs   172 k

Why do you have both i386 and x86_64 versions installed on your machine??





There should be no problems installing lib files for both i386 and 
x86_64.  These rpms only contain the *.so files (and a LICENSE file).


Regards
-steve


Because yum installed it in this way .. as many other packeges
The problem was that i do not use /dev/shm as tmpfs
But thanks for trying
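
(For anyone else hitting this: corosync's IPC needs /dev/shm to be a
tmpfs mount.  A typical fstab line, shown only as an illustration:

tmpfs   /dev/shm   tmpfs   defaults   0 0

followed by "mount /dev/shm", or a reboot, takes care of it.)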






___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs:
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs:
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] MCP init script to 21/79?

2010-09-03 Thread Steven Dake

On 09/03/2010 09:56 AM, Vladislav Bogdanov wrote:

03.09.2010 19:34, Steven Dake wrote:

Nope, they are in a natural order for both start and stop sequences.
So lower number means 'do start or stop earlier'.

grep '# chkconfig' /etc/init.d/*



Ok, thanks.  Changed to 10



Given that corosync default is 20/80, shouldn't mcp be 21/79?


I think that pcmk may require additional services to be started (I at
least see reference to cooperation with cman for GFS as one of pcmk MCP
scenarios in Andrew's wiki, but that scenario is still unclear to me),
so it is safer to have it start later, 90 is ok for me. That is also
what Vadim wrote about.

Best,
Vladislav


I was mistaken, not having read the current code.  Ignore the noise.

Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] MCP init script to 21/79?

2010-09-03 Thread Steven Dake

On 08/24/2010 11:06 PM, Andrew Beekhof wrote:

On Wed, Aug 25, 2010 at 8:02 AM, Vladislav Bogdanov
  wrote:

25.08.2010 08:56, Andrew Beekhof wrote:

On Wed, Aug 25, 2010 at 7:39 AM, Vladislav Bogdanov
  wrote:

Hi all,

pacemaker has
# chkconfig - 90 90
in its MCP initscript.

Shouldn't it be corrected to 90 10?


I thought higher numbers started later and shut down earlier... no?


Nope, they are in a natural order for both start and stop sequences.
So lower number means 'do start or stop earlier'.

grep '# chkconfig' /etc/init.d/*



Ok, thanks.  Changed to 10



Given that corosync default is 20/80, shouldn't mcp be 21/79?

Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Corosync + Pacemaker New Install: Corosync Fails Without Error Message

2010-06-22 Thread Steven Dake

On 06/18/2010 09:42 AM, Eliot Gable wrote:

I don’t have an “aisexec” section at all. I simply copied the sample
file, which did not have one.

I did figure out why it wasn’t logging. It was set to AMF mode and
‘mode’ was ‘disabled’ in the AMF configuration section. After changing
that to ‘enabled’, I now have logging. That allowed me to figure out
that I needed to set rrp_mode to something other than ‘none’, because I
have two interfaces to run the totem protocol over. However, with it set
to ‘passive’ or ‘active’, corosync tries to start, then seg faults:

Jun 18 07:33:23 corosync [MAIN ] Corosync Cluster Engine ('1.2.2'):
started and ready to provide service.

Jun 18 07:33:23 corosync [MAIN ] Corosync built-in features: nss rdma

Jun 18 07:33:23 corosync [MAIN ] Successfully read main configuration
file '/etc/corosync/corosync.conf'.

Jun 18 07:33:23 corosync [TOTEM ] Token Timeout (1000 ms) retransmit
timeout (238 ms)

Jun 18 07:33:23 corosync [TOTEM ] token hold (180 ms) retransmits before
loss (4 retrans)

Jun 18 07:33:23 corosync [TOTEM ] join (50 ms) send_join (0 ms)
consensus (1200 ms) merge (200 ms)

Jun 18 07:33:23 corosync [TOTEM ] downcheck (1000 ms) fail to recv const
(50 msgs)

Jun 18 07:33:23 corosync [TOTEM ] seqno unchanged const (30 rotations)
Maximum network MTU 1402

Jun 18 07:33:23 corosync [TOTEM ] window size per rotation (50 messages)
maximum messages per rotation (17 messages)

Jun 18 07:33:23 corosync [TOTEM ] send threads (0 threads)

Jun 18 07:33:23 corosync [TOTEM ] RRP token expired timeout (238 ms)

Jun 18 07:33:23 corosync [TOTEM ] RRP token problem counter (2000 ms)

Jun 18 07:33:23 corosync [TOTEM ] RRP threshold (10 problem count)

Jun 18 07:33:23 corosync [TOTEM ] RRP mode set to passive.

Jun 18 07:33:23 corosync [TOTEM ] heartbeat_failures_allowed (0)

Jun 18 07:33:23 corosync [TOTEM ] max_network_delay (50 ms)

Jun 18 07:33:23 corosync [TOTEM ] HeartBeat is Disabled. To enable set
heartbeat_failures_allowed > 0

Jun 18 07:33:23 corosync [TOTEM ] Initializing transport (UDP/IP).

Jun 18 07:33:23 corosync [TOTEM ] Initializing transmit/receive
security: libtomcrypt SOBER128/SHA1HMAC (mode 0).

Jun 18 07:33:23 corosync [TOTEM ] Initializing transport (UDP/IP).

Jun 18 07:33:23 corosync [TOTEM ] Initializing transmit/receive
security: libtomcrypt SOBER128/SHA1HMAC (mode 0).

Jun 18 07:33:23 corosync [IPC ] you are using ipc api v2

Jun 18 07:33:23 corosync [TOTEM ] Receive multicast socket recv buffer
size (262142 bytes).

Jun 18 07:33:23 corosync [TOTEM ] Transmit multicast socket send buffer
size (262142 bytes).

Jun 18 07:33:23 corosync [TOTEM ] The network interface is down.

Jun 18 07:33:23 corosync [TOTEM ] Created or loaded sequence id
0.127.0.0.1 for this ring.

Jun 18 07:33:23 corosync [pcmk ] info: process_ais_conf: Reading configure

Jun 18 07:33:23 corosync [pcmk ] info: config_find_init: Local handle:
2013064636357672962 for logging

Jun 18 07:33:23 corosync [pcmk ] info: config_find_next: Processing
additional logging options...

Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Found 'on' for
option: debug

Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Defaulting to
'off' for option: to_file

Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Found 'yes' for
option: to_syslog

Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Defaulting to
'daemon' for option: syslog_facility

Jun 18 07:33:23 corosync [pcmk ] info: config_find_init: Local handle:
4730966301143465987 for service

Jun 18 07:33:23 corosync [pcmk ] info: config_find_next: Processing
additional service options...

Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Defaulting to
'pcmk' for option: clustername

Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Defaulting to
'no' for option: use_logd

Jun 18 07:33:23 corosync [pcmk ] info: get_config_opt: Defaulting to
'no' for option: use_mgmtd

Jun 18 07:33:23 corosync [pcmk ] info: pcmk_startup: CRM: Initialized

Jun 18 07:33:23 corosync [pcmk ] Logging: Initialized pcmk_startup

Jun 18 07:33:23 corosync [pcmk ] info: pcmk_startup: Maximum core file
size is: 18446744073709551615

Segmentation fault

(gdb) where full

#0 0x00332de797c0 in strlen () from /lib64/libc.so.6

No symbol table info available.

#1 0x2acefb9b in logsys_worker_thread (data=) at logsys.c:760

rec = 0x2aef0c28

dropped = 0

#2 0x00332e60673d in start_thread () from /lib64/libpthread.so.0

No symbol table info available.

#3 0x00332ded3d1d in clone () from /lib64/libc.so.6

No symbol table info available.

(gdb)

Downgrading again back to 1.2.1-1.el5 seems to resolve the issue, and
Corosync runs.

Eliot Gable
Senior Product Developer
1228 Euclid Ave, Suite 390
Cleveland, OH 44115

Direct: 216-373-4808
Fax: 216-373-4657
ega...@broadvox.net 



Re: [Pacemaker] use_logd or use_mgmtd kills corosync

2010-06-08 Thread Steven Dake

On 06/08/2010 11:20 PM, Andrew Beekhof wrote:

On Wed, Jun 9, 2010 at 7:27 AM, Devin Reade  wrote:

I was following the instructions for a new installation of corosync
and was wanting to make use of hb_gui so, following an installation
via yum per the docs, built Pacemaker-Python-GUI-pacemaker-mgmt-2.0.0
from source.

Starting corosync works normally without mgmtd in the picture, but as
soon as *either* of the two lines are added to /etc/corosync/service.d/pcmk,
corosync fails to start with no diagnostics in the logfile or syslog:
use_logd: 1
use_mgmtd: 1

I ran 'strace corosync -f' and got rather uninformative information, the
tail end of it shown here:

statfs("/etc/corosync/service.d", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096,
f_blocks=507860, f_bfree=388733, f_bavail=362519, f_files=524288,
f_ffree=517073, f_fsid={0, 0}, f_namelen=255, f_frsize=4096}) = 0
getdents(3, /* 3 entries */, 32768) = 72
stat("/etc/corosync/service.d/pcmk", {st_mode=S_IFREG|0644, st_size=101,
...}) = 0
open("/etc/corosync/service.d/pcmk", O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=101, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0x2acb16dd5000
read(4, "service {\n \t# Load the Pacemaker"..., 4096) = 101
close(4)= 0
munmap(0x2acb16dd5000, 4096)= 0
close(3)= 0
exit_group(8)   = ?


Any thoughts?


Not really.
Do any other children start up?
Where is the mgmtd binary installed to?


# uname -srv
Linux 2.6.18-194.3.1.el5 #1 SMP Thu May 13 13:08:30 EDT 2010

# rpm -q -a | grep openais | sort
openais-1.1.0-2.el5.i386
openais-1.1.0-2.el5.x86_64
openaislib-1.1.0-2.el5.i386
openaislib-1.1.0-2.el5.x86_64
openaislib-devel-1.1.0-2.el5.i386
openaislib-devel-1.1.0-2.el5.x86_64


### /etc/corosync/corosync.conf 
compatibility: none

totem {
version: 2
secauth: off
threads: 0
interface {
ringnumber: 0
# but with a real netaddr, obviously
bindnetaddr: A.B.C.D
mcastaddr: 226.94.1.1
mcastport: 5405
}
}

logging {
fileline: off
to_stderr: no
to_file: yes
to_syslog: yes
logfile: /var/log/corosync.log
# debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}

amf {
mode: disabled
}

aisexec {
user: root
group: root
}

 /etc/corosync/service.d/pcmk #
service {
# Load the Pacemaker Cluster Resource Manager
name: pacemaker
ver:  0
use_logd: 1
}


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


This is likely the sem_wait issue related to some CentOS deployments. 
An update for corosync is pending release.  Hopefully new source 
tarballs will be available Wednesday.


Regards
-steve

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] handle EINTR in sem_wait (pacemaker & corosync 1.2.2+ crash)

2010-06-01 Thread Steven Dake

Hello,

I have found the cause of the crash that was occurring only on some 
deployments.  The cause is that sem_wait is interrupted by signal, and 
the wait operation is not retried (as is customary in posix).


Patch attached to fix

A big thank you to Vladislav Bogdanov for running the test case and 
verifying it fixes the problem.



Regards
-steve
Index: logsys.c
===
--- logsys.c(revision 2915)
+++ logsys.c(working copy)
@@ -661,7 +661,18 @@
sem_post (&logsys_thread_start);
for (;;) {
dropped = 0;
-   sem_wait (&logsys_print_finished);
+retry_sem_wait:
+   res = sem_wait (&logsys_print_finished);
+   if (res == -1 && errno == EINTR) {
+   goto retry_sem_wait;
+   } else
+   if (res == -1) {
+   /*
+    * This case shouldn't happen
+    */
+   pthread_exit (NULL);
+   }
+   
 
logsys_wthread_lock();
if (wthread_should_exit) {
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] corosync/openais fails to start

2010-05-27 Thread Steven Dake

On 05/27/2010 10:20 AM, Gianluca Cecchi wrote:

On Thu, May 27, 2010 at 5:50 PM, Steven Dake  wrote:

On 05/27/2010 08:40 AM, Diego Remolina wrote:

Is there any workaround for this? Perhaps a slightly older
version of
the rpms? If so where do I find those?


Corosync 1.2.1 doesn't have this issue apparently.  With corosync
1.2.1, please don't use "debug: on" keyword in your config options.
  I am not sure where Andrew has corosync 1.2.1 rpms available.

The corosync project itself doesn't release rpms.  See our policy on
this topic:

http://www.corosync.org/doku.php?id=faq:release_binaries

Regards
-steve



In my case, using pacemaker/corosync from clusterlabs repo on rh el 5.5
32 bit I had:
- both nodes ha1 and ha2 with
[r...@ha1 ~]# rpm -qa corosync\* pacemaker\*
pacemaker-1.0.8-6.el5
corosynclib-1.2.1-1.el5
corosync-1.2.1-1.el5
pacemaker-libs-1.0.8-6.el5

- stop of corosync on node ha1
- update (using clusterlabs repo proposed and applied packages for
pacemaker with same version... donna if same bits..)
This takes corosync to 1.2.2
- start of corosync on ha1 and successfull join with the still corosync
1.2.1 one
  May 27 18:59:23 ha1 corosync[5136]:   [MAIN  ] Corosync Cluster Engine
exiting with status -1 at main.c:160.
May 27 19:06:19 ha1 yum: Updated: corosynclib-1.2.2-1.1.el5.i386
May 27 19:06:19 ha1 yum: Updated: pacemaker-libs-1.0.8-6.1.el5.i386
May 27 19:06:19 ha1 yum: Updated: corosync-1.2.2-1.1.el5.i386
May 27 19:06:20 ha1 yum: Updated: pacemaker-1.0.8-6.1.el5.i386
May 27 19:06:20 ha1 yum: Updated: corosynclib-devel-1.2.2-1.1.el5.i386
May 27 19:06:22 ha1 yum: Updated: pacemaker-libs-devel-1.0.8-6.1.el5.i386
May 27 19:06:59 ha1 corosync[7442]:   [MAIN  ] Corosync Cluster Engine
('1.2.2'): started and ready to provide service.
May 27 19:06:59 ha1 corosync[7442]:   [MAIN  ] Corosync built-in
features: nss rdma
May 27 19:06:59 ha1 corosync[7442]:   [MAIN  ] Successfully read main
configuration file '/etc/corosync/corosync.conf'.
May 27 19:06:59 ha1 corosync[7442]:   [TOTEM ] Initializing transport
(UDP/IP).
May 27 19:06:59 ha1 corosync[7442]:   [TOTEM ] Initializing
transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).

this implies also start of resources on it (nfsclient and apache in my case)

- move (and unmove to be able to take them again) of resources from ha2
to the updated node ha1 (nfs-group in my case)
  Resource Group: nfs-group
  lv_drbd0   (ocf::heartbeat:LVM):   Started ha1
  ClusterIP  (ocf::heartbeat:IPaddr2):   Started ha1
  NfsFS  (ocf::heartbeat:Filesystem):Started ha1
  nfssrv (ocf::heartbeat:nfsserver): Started ha1

- stop of corosync 1.2.1 on ha2
- update of pacemaker and corosync on ha2
- startup of corosync on ha2 and correct join to cluster with start of
its resources (nfsclient and apache in my case)
May 27 19:14:42 ha2 corosync[30954]:   [pcmk  ] notice: pcmk_shutdown:
cib confirmed stopped
May 27 19:14:42 ha2 corosync[30954]:   [pcmk  ] notice: stop_child: Sent
-15 to stonithd: [30961]
May 27 19:14:42 ha2 stonithd: [30961]: notice:
/usr/lib/heartbeat/stonithd normally quit.
May 27 19:14:42 ha2 corosync[30954]:   [pcmk  ] info: pcmk_ipc_exit:
Client stonithd (conn=0x82aee48, async-conn=0x82aee48) left
May 27 19:14:43 ha2 corosync[30954]:   [pcmk  ] notice: pcmk_shutdown:
stonithd confirmed stopped
May 27 19:14:43 ha2 corosync[30954]:   [pcmk  ] info: update_member:
Node ha2 now has process list: 0002 (2)
May 27 19:14:43 ha2 corosync[30954]:   [pcmk  ] notice: pcmk_shutdown:
Shutdown complete
May 27 19:14:43 ha2 corosync[30954]:   [SERV  ] Service engine unloaded:
Pacemaker Cluster Manager 1.0.8
May 27 19:14:43 ha2 corosync[30954]:   [SERV  ] Service engine unloaded:
corosync extended virtual synchrony service
May 27 19:14:43 ha2 corosync[30954]:   [SERV  ] Service engine unloaded:
corosync configuration service
May 27 19:14:43 ha2 corosync[30954]:   [SERV  ] Service engine unloaded:
corosync cluster closed process group service v1.01
May 27 19:14:43 ha2 corosync[30954]:   [SERV  ] Service engine unloaded:
corosync cluster config database access v1.01
May 27 19:14:43 ha2 corosync[30954]:   [SERV  ] Service engine unloaded:
corosync profile loading service
May 27 19:14:43 ha2 corosync[30954]:   [SERV  ] Service engine unloaded:
corosync cluster quorum service v0.1
May 27 19:14:43 ha2 corosync[30954]:   [MAIN  ] Corosync Cluster Engine
exiting with status -1 at main.c:160.
May 27 19:15:51 ha2 yum: Updated: corosynclib-1.2.2-1.1.el5.i386
May 27 19:15:51 ha2 yum: Updated: pacemaker-libs-1.0.8-6.1.el5.i386
May 27 19:15:52 ha2 yum: Updated: corosync-1.2.2-1.1.el5.i386
May 27 19:15:52 ha2 yum: Updated: pacemaker-1.0.8-6.1.el5.i386
May 27 19:17:00 ha2 corosync[3430]:   [MAIN  ] Corosync Cluster Engine
('1.2.2'): started and ready to provide service.
May

Re: [Pacemaker] corosync/openais fails to start

2010-05-27 Thread Steven Dake

On 05/27/2010 08:40 AM, Diego Remolina wrote:

Is there any workaround for this? Perhaps a slightly older version of
the rpms? If so where do I find those?



Corosync 1.2.1 doesn't have this issue apparently.  With corosync 1.2.1, 
please don't use "debug: on" keyword in your config options.  I am not 
sure where Andrew has corosync 1.2.1 rpms available.


The corosync project itself doesn't release rpms.  See our policy on 
this topic:


http://www.corosync.org/doku.php?id=faq:release_binaries

Regards
-steve


I cannot get the opensuse-ha rpms any more so I am stuck with a
non-functioning cluster.

Diego

Steven Dake wrote:

This is a known issue on some platforms, although the exact cause is
unknown. I have tried RHEL 5.5 as well as CentOS 5.5 with clusterrepo
rpms and been unable to reproduce. I'll keep looking.

Regards
-steve

On 05/27/2010 06:07 AM, Diego Remolina wrote:

Hi,

I was running the old rpms from the opensuse repo and wanted to change
over to the latest packages from the clusterlabs repo in my RHEL 5.5
machines.

Steps I took
1. Disabled the old repo
2. Set the nodes to standby (two node drbd cluster) and turned off
openais
3. Enabled the new repo.
4. Performed an update with yum -y update which replaced all packages.
5. The configuration file for ais was renamed openais.conf.rpmsave
6. I ran corosync-keygen and copied the key to the second machine
7. I copied the file openais.conf.rpmsave to /etc/corosync/corosync.conf
and modified it by removing the service section and moving that to
/etc/corosync/service.d/pcmk
8. I copied the configurations to the other machine.
9. When I try to start either openais or corosync with the init scripts
I get a failure and nothing that can really point me to an error in the
logs.

Updated packages:
May 26 14:29:32 Updated: cluster-glue-libs-1.0.5-1.el5.x86_64
May 26 14:29:32 Updated: resource-agents-1.0.3-2.el5.x86_64
May 26 14:29:34 Updated: cluster-glue-1.0.5-1.el5.x86_64
May 26 14:29:34 Installed: libibverbs-1.1.3-2.el5.x86_64
May 26 14:29:34 Installed: corosync-1.2.2-1.1.el5.x86_64
May 26 14:29:34 Installed: librdmacm-1.0.10-1.el5.x86_64
May 26 14:29:34 Installed: corosynclib-1.2.2-1.1.el5.x86_64
May 26 14:29:34 Installed: openaislib-1.1.0-2.el5.x86_64
May 26 14:29:34 Updated: openais-1.1.0-2.el5.x86_64
May 26 14:29:34 Installed: libnes-0.9.0-2.el5.x86_64
May 26 14:29:35 Installed: heartbeat-libs-3.0.3-2.el5.x86_64
May 26 14:29:35 Updated: pacemaker-libs-1.0.8-6.1.el5.x86_64
May 26 14:29:36 Updated: heartbeat-3.0.3-2.el5.x86_64
May 26 14:29:36 Updated: pacemaker-1.0.8-6.1.el5.x86_64

Apparently corosync is seg faulting when run from the command line:

# /usr/sbin/corosync -f
Segmentation fault

Any help would be greatly appreciated.

Diego



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf





___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


Re: [Pacemaker] corosync/openais fails to start

2010-05-27 Thread Steven Dake
This is a known issue on some platforms, although the exact cause is 
unknown.  I have tried RHEL 5.5 as well as CentOS 5.5 with clusterrepo 
rpms and been unable to reproduce.  I'll keep looking.


Regards
-steve

On 05/27/2010 06:07 AM, Diego Remolina wrote:

Hi,

I was running the old rpms from the opensuse repo and wanted to change
over to the latest packages from the clusterlabs repo in my RHEL 5.5
machines.

Steps I took
1. Disabled the old repo
2. Set the nodes to standby (two node drbd cluster) and turned off openais
3. Enabled the new repo.
4. Performed an update with yum -y update which replaced all packages.
5. The configuration file for ais was renamed openais.conf.rpmsave
6. I ran corosync-keygen and copied the key to the second machine
7. I copied the file openais.conf.rpmsave to /etc/corosync/corosync.conf
and modified it by removing the service section and moving that to
/etc/corosync/service.d/pcmk
8. I copied the configurations to the other machine.
9. When I try to start either openais or corosync with the init scripts
I get a failure and nothing that can really point me to an error in the
logs.

Updated packages:
May 26 14:29:32 Updated: cluster-glue-libs-1.0.5-1.el5.x86_64
May 26 14:29:32 Updated: resource-agents-1.0.3-2.el5.x86_64
May 26 14:29:34 Updated: cluster-glue-1.0.5-1.el5.x86_64
May 26 14:29:34 Installed: libibverbs-1.1.3-2.el5.x86_64
May 26 14:29:34 Installed: corosync-1.2.2-1.1.el5.x86_64
May 26 14:29:34 Installed: librdmacm-1.0.10-1.el5.x86_64
May 26 14:29:34 Installed: corosynclib-1.2.2-1.1.el5.x86_64
May 26 14:29:34 Installed: openaislib-1.1.0-2.el5.x86_64
May 26 14:29:34 Updated: openais-1.1.0-2.el5.x86_64
May 26 14:29:34 Installed: libnes-0.9.0-2.el5.x86_64
May 26 14:29:35 Installed: heartbeat-libs-3.0.3-2.el5.x86_64
May 26 14:29:35 Updated: pacemaker-libs-1.0.8-6.1.el5.x86_64
May 26 14:29:36 Updated: heartbeat-3.0.3-2.el5.x86_64
May 26 14:29:36 Updated: pacemaker-1.0.8-6.1.el5.x86_64

Apparently corosync is seg faulting when run from the command line:

# /usr/sbin/corosync -f
Segmentation fault

Any help would be greatly appreciated.

Diego



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


Re: [Pacemaker] Redundant rings vs one bond based ring

2010-05-18 Thread Steven Dake
On Tue, 2010-05-18 at 23:16 +0200, Gianluca Cecchi wrote:
> Hello,
> based on pacemaker 1.0.8 + corosync 1.2.2, having two network
> interfaces to dedicate to cluster communication, what is better/safer
> at this moment:
> 
bonding

> 
> a) only one corosync ring on top of a bond interface
> b) two different rings, each one associated with one interface
> ?
> 
> 
> Question based also on corosync roadmap document, containing this
> goal:
> Improved redundant ring support:
> The redundant ring support in corosync needs more testing, especially
> around boundary areas such as 0x7FFF seqids.
> Redundant ring should have an automatic way to recover from failures
> by periodically checking the link and instituting a recovery of the
> ring.
> 
> 
> BTW: if a link fail, what is the current "manual" command to notify
> the CCE when it becomes available again? 

corosync-cfgtool -r

> 
> 
> Thanks,
> Gianluca 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


Re: [Pacemaker] Being fenced node is killed again and again even the connection is recovered!

2010-05-14 Thread Steven Dake
ifconfig eth0 down is not a valid test case.  That will likely lead to
bad things happening.

I recommend using iptables to test the software.
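
Something along these lines, purely as a sketch (the interface name and
ports are assumptions; 5405 is the default mcastport and corosync also
binds the port just below it):

# on the node under test, drop cluster traffic
iptables -A INPUT  -i eth1 -p udp --dport 5404:5405 -j DROP
iptables -A OUTPUT -o eth1 -p udp --sport 5404:5405 -j DROP

# to "repair" the link afterwards, delete the same rules
iptables -D INPUT  -i eth1 -p udp --dport 5404:5405 -j DROP
iptables -D OUTPUT -o eth1 -p udp --sport 5404:5405 -j DROP

That exercises the failure path without yanking the interface and its
addresses out from under corosync the way ifconfig down does.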

Also Corosync 1.2.2 is out which fixes bugs vs corosync 1.2.0.

Regards
-steve

On Fri, 2010-05-14 at 18:02 +0800, Javen Wu wrote:
> I forget mention the version I used. 
> I used SLES11-SP1-HAE Beta5
> Pacemaker 1.0.7
> Corosync 1.2.0
> Cluster Glue 1.0.3
> 
> 
> 2010/5/14 Javen Wu 
> Hi Folks,
> 
> I setup a three nodes cluster with SBD STONITH configured.
> After I manually isolate one node by running "ifconfig eth1
> down" on the node. The node is fenced as expected.
> But after reboot, even the network is recovered, the node is
> killed again once I start openais&pacemaker.
> I saw the state of the node become from OFFLINE to ONLINE from
> `crm_mon -n` before being killed. And I saw SBD slot from
> reset->clear->reset.
> 
> I attached the syslog and corosync log.
> And my CIB configuration is very simple.
> 
> Could you help me check what's the problem? In my mind, it's
> not expected behaviour.
> 
> ===% 
>  have-quorum="1" admin_epoch="0" epoch="349" num_updates="99"
> cib-last-written="Fri May 14 14:50:21 2010" dc-uuid="vm209">
>   
> 
>   
>  name="dc-version"
> value="1.1.1-530add2a3721a0ecccb24660a97dbfdaa3e68f51"/>
>  id="cib-bootstrap-options-cluster-infrastructure"
> name="cluster-infrastructure" value="openais"/>
>  id="cib-bootstrap-options-expected-quorum-votes"
> name="expected-quorum-votes" value="3"/>
>   
> 
> 
>   
>   
>   
> 
> 
>   
>  type="external/sbd">
>id="sbd-fencing-instance_attributes">
>  id="sbd-fencing-instance_attributes-sbd_device"
> name="sbd_device" value="/dev/sdc"/>
>   
>   
>  name="monitor"/>
>   
> 
>   
> 
> 
> 
> 
>   
>   
>  in_ccm="true" crmd="online" join="member" expected="member"
> crm-debug-origin="post_cache_update" shutdown="0">
>   
> 
>name="probe_complete" value="true"/>
> 
>   
>   
> 
>class="stonith">
>  operation="monitor" crm-debug-origin="build_active_RAs"
> crm_feature_set="3.0.1"
> transition-key="4:1:7:f0adcb5c-10d1-4525-b094-b5ab1f776ee0"
> transition-magic="0:7;4:1:7:f0adcb5c-10d1-4525-b094-b5ab1f776ee0" 
> call-id="2" rc-code="7" op-status="0" interval="0" last-run="1273820137" 
> last-rc-change="1273820137" exec-time="60" queue-time="0" 
> op-digest="4c3fd39434577fbb6540606d808ed050"/>
>  operation="start" crm-debug-origin="build_active_RAs"
> crm_feature_set="3.0.1"
> transition-key="5:1:0:f0adcb5c-10d1-4525-b094-b5ab1f776ee0"
> transition-magic="0:0;5:1:0:f0adcb5c-10d1-4525-b094-b5ab1f776ee0" 
> call-id="3" rc-code="0" op-status="0" interval="0" last-run="1273820137" 
> last-rc-change="1273820137" exec-time="10" queue-time="0" 
> op-digest="4c3fd39434577fbb6540606d808ed050"/>
>  operation="monitor" crm-debug-origin="build_active_RAs"
> crm_feature_set="3.0.1"
> transition-key="6:2:0:f0adcb5c-10d1-4525-b094-b5ab1f776ee0"
> transition-magic="0:0;6:2:0:f0adcb5c-10d1-4525-b094-b5ab1f776ee0" 
> call-id="4" rc-code="0" op-status="0" interval="2" last-run="1273822956" 
> last-rc-change="1273820137" exec-time="1170" queue-time="0" 
> op-digest="4029bbaef749649e82d602afb46dd872"/>
>   
> 
>   
> 
>  in_ccm="false" crmd="offline"
> crm-debug-origin="send_stonith_update" join="down"
> expected="down" shutdown="0"/>
>  in_ccm="true" crmd="online"
> crm-debug-origin="post_cache_update" join="member"
> expected="member" shutdown="0">
>   
> 
>name="probe_complete" value="true"/>
> 
>   
>   
> 
>class="stonith">
>  operation="monitor" crm-debug-origin="build_active_RAs"
> crm_feature_set="3.0.1"
> transition-key="8:5:7:f0adcb5c-10d1-4525-b094-b5ab1f776ee0"
> transition-mag

Re: [Pacemaker] Corosync crashes when cluster NIC disabled (Something strange happened)

2010-03-31 Thread Steven Dake
On Wed, 2010-03-31 at 16:07 -0400, Simpson, John R wrote:
> Greetings all,
> 
> I have a lab cluster using Pacemaker 1.0.8 and Corosync 1.2.0-1
> (see packages below) on CentOS 5.4 (32-bit) VM's running under
> VMware ESXi 3.5.  My location constraints and connectivity
> tests were working well, so I was feeling really good when 
> I decided to shut down the interface used for cluster 
> communication and verify that it resulted in a split-brain cluster.
> 
> Much to my dismay, corosync crashed almost immediately on the node
> where I shut down the Ethernet interface.  I can recreate the issue
> at will on this cluster and a different cluster running a slightly
> more recent version of Pacemaker 1.0.8 and the same version of 
> Corosync on CentOS 5.4 64-bit VMs.
> 
> I've attached the log, but here is the most suspicious message:
> 
> Mar 31 15:35:16 corosync [pcmk  ] ERROR: pcmk_peer_update: Something strange 
> happened: 1
> 
> Cluster communication is on 172.16.0.0/24 (eth1) and Apache, etc. are on 
> 10.127.252.0/24 (eth0).
> 
> I've tried to include or attach all the relevant information -- please let me 
> know if there's anything else that would be useful.
> 
> Regards,
> 
> John Simpson
> 

I've answered this so many times on the ml I've created a faq for it.
If the faq is unclear, let me know, and we can add to it.

http://www.corosync.org/doku.php?id=faq:ifdown

You mentioned Corosync crashed (segfault?), which it should not do.

To report that crash, see the following faq:

http://www.corosync.org/doku.php?id=faq:crash


> [r...@cy-ha01 ~]# netstat -rn
> Kernel IP routing table
> Destination Gateway Genmask Flags   MSS Window  irtt Iface
> 10.0.0.00.0.0.0 255.255.255.0   U 0 0 0 eth3
> 172.16.0.0  0.0.0.0 255.255.255.0   U 0 0 0 eth1
> 192.168.0.0 0.0.0.0 255.255.255.0   U 0 0 0 eth2
> 10.127.252.00.0.0.0 255.255.255.0   U 0 0 0 eth0
> 169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth3
> 224.0.0.0   0.0.0.0 240.0.0.0   U 0 0 0 eth1
> 0.0.0.0 10.127.252.10.0.0.0 UG0 0 0 eth0
> 
> [r...@cy-ha01 ~]# date ; ifconfig eth1 down
> Wed Mar 31 15:35:03 EDT 2010
> 
> Output from crm_mon when eth1 is shut down.
> 
> Last updated: Wed Mar 31 15:31:50 2010
> Stack: openais
> Current DC: cy-ha02 - partition with quorum
> Version: 1.0.8-2a76c6ac04bcccf42b89a08e55bfbd90da2fb49a
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> 
> 
> Online: [ cy-ha01 cy-ha02 ]
> 
>  Resource Group: WebSiteGroup
>  ServiceIP  (ocf::heartbeat:IPaddr2):   Started cy-ha01
>  WebSite(ocf::heartbeat:apache):Started cy-ha01
>  Clone Set: CloneConnectivityTest
>  Started: [ cy-ha02 cy-ha01 ]
> Connection to the CIB terminated
> Reconnecting
> 
> [r...@cy-ha01 ~]# rpm -qa | grep pace
> pacemaker-libs-devel-1.0.8-1.el5
> pacemaker-1.0.8-1.el5
> pacemaker-libs-1.0.8-1.el5
> [r...@cy-ha01 ~]# rpm -qa | grep coros
> corosynclib-1.2.0-1.el5
> corosync-1.2.0-1.el5
> corosynclib-devel-1.2.0-1.el5
> 
> --
> John Simpson 
> Senior Software Engineer, I. T. Engineering and Operations
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Dropping HeartBeat Stack?

2010-03-04 Thread Steven Dake
On Thu, 2010-03-04 at 21:29 +0100, Dennis J. wrote:
> On 03/04/2010 03:37 PM, Andrew Beekhof wrote:
> > On Thu, Mar 4, 2010 at 2:54 PM, Dennis J.  wrote:
> >
> >> Pacemaker pulls in hearbeat and corosync as dependency. This is what 
> >> happens
> >> on a freshly install centos 5.4 VM:
> >
> > Ah, so I just imagined making that change :-(
> > The next round of packages wont do that
> 
> There are two other things that should be changed:
> 
> 1) The default values for consensus/token don't seem to be right. Starting 
> corosync with the example corosync config file yields the following in the 
> logs: "parse error in config: The consensus timeout parameter (1200 ms) 
> must be atleast 1.2 * token (1200 ms)."
> The defaults for these should probably be changed so they don't conflict 
> like this.
> 
> 2) /etc/init.d/openais needs a change too. I had to change
> export COROSYNC_DEFAULT_CONFIG_IFACE="openaisserviceenable:openaisparser"
> into:
> export 
> COROSYNC_DEFAULT_CONFIG_IFACE="openaisserviceenableexperimental:corosync_parser"
> to make things work.
> 

note these problems are fixed upstream and awaiting a new stable
release.

Regards
-steve

> Regards,
>Dennis
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] High load issues

2010-02-04 Thread Steven Dake
On Thu, 2010-02-04 at 16:09 +0100, Dominik Klein wrote:
> Hi people,
> 
> I'll take the risk of annoying you, but I really think this should not
> be forgotten.
> 
> If there is high load on a node, the cluster seems to have problems
> recovering from that. I'd expect the cluster to recognize that a node is
> unresponsive, stonith it and start services elsewhere.
> 
> By unresponsive I mean not being able to use the cluster's service, not
> being able to ssh into the node.
> 
> I am not sure whether this is an issue of pacemaker (iiuc, beekhof seems
> to think it is not) or corosync (iiuc, sdake seems to think it is not)
> or maybe a configuration/thinking thing on my side (which might just be).
> 
> Anyway, attached you will find a hb_report which covers the startup of
> the cluster nodes, then what it does when there is high load and no
> memory left. Then I killed the load producing things and almost
> immediately, the cluster cleaned up things.
> 
> I had at least expected that after I saw "FAILED" status in crm_mon,
> that after the configured timeouts for stop (120s max in my case), the
> failover should happen, but it did not.
> 
> What I did to produce load:
> * run several "md5sum $file" on 1gig files
> * run several heavy sql statements on large tables
> * saturate(?) the nic using netcat -l on the busy node and netcat -w fed
> by /dev/urandom on another node
> * start a forkbomb script which does "while (true); do bash $0; done;"
> 
> Used versions:
> corosync 1.2.0
> pacemaker 1.0.7
> 64 bit packages from clusterlabs for opensuse 11.1
> 

The forkbomb triggers an OOM situation.  In Linux, when OOM happens
really all bets are off as to what will occur.  I expect that the system
would work properly without the forkbomb.  Could you try that?

Corosync actually works quite well in OOM situations and usually doesn't
detect this as a failure unless the oom killer blows away the corosync
process.  To corosync, the node is fully operational (because it is
designed to work in an OOM situation).

Detecting memory overcommit and doing something about it may be
something we should do with Corosync.

But generally I believe this test case is invalid.  A system should be
properly sized memory wise to handle the applications that are intended
to run on it.  Really sounds like a deployment issue if the systems
don't contain the appropriate ram to run the applications.

I believe there is a way of setting affinity in the OOM killer but it's
been 4 years since I've worked on the kernel fulltime so I don't know
the details.  One option is to set the affinity to always try to blow
away the corosync process.  Then you would get fencing in this
condition.
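
Roughly (untested here, and the exact knob depends on the kernel version): newer kernels expose the bias as /proc/<pid>/oom_score_adj, so making corosync the preferred victim would look something like

echo 1000 > /proc/<corosync pid>/oom_score_adj

while older kernels use /proc/<pid>/oom_adj with a -17..15 range (15 = kill first, -17 = never kill).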

Regards
-steve

> If you need more information, want me to try patches, whatever, please
> let me know.
> 
> Regards
> Dominik
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


[Pacemaker] thread safety problem with pacemaker and corosync integration

2010-02-03 Thread Steven Dake
For some time people have reported segfaults on startup when using
pacemaker as a plugin to corosync related to tzset in the stack trace.
I believe we had fixed this by removing the thread-unsafe usage of
localtime and strftime calls in the code base of corosync in 1.2.0.

Via further investigation by H.J. Lee, he mostly identified a problem
with localtime_r calling tzset calling getenv().  If at about the same
time, another thread calls setenv(), the other thread's getenv could
segfault.  syslog() also calls localtime_r in glibc.  On some rare
occasions Pacemaker calls setenv() while corosync executes a syslog
operation resulting in a segfault.

Posix is clear on this issue - tzset should be thread safe, localtime_r
should be thread safe, syslog should be thread safe.  Some C libraries
implementations of these functions unfortunately are not thread safe for
these functions when used in conjunction with setenv because they use
getenv internally (which is not required to be thread safe by posix).

Our short term plan is to workaround these problems in glibc by doing
the following:
1) providing a getenv/setenv api inside coroapi.h so that corosync
internal code and third party plugins such as pacemaker can use a mutex
protected getenv/setenv
2) porting our syslog-direct-communication code from whitetank and avoid
using the syslog C library api (which again uses localtime_r) call
entirely
3) implementing a localtime_r replacement which does not call tzset on
each execution so that timestamp:on operational mode does not suffer
from this same problem

If your suffering from this issue, please be aware we have a root cause
and will get it resolved.

Regards
-steve


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] mcast vs broadcast

2010-01-18 Thread Steven Dake
On Mon, 2010-01-18 at 11:25 -0500, Shravan Mishra wrote:
> Hi all,
> 
> 
> 
> Following is my corosync.conf.
> 
> Even though broadcast is enabled I see "mcasted" messages like these
> in corosync.log.
> 
> Is it ok?  even when the broadcast is on and not mcast.
> 

Yes, you are using broadcast; the debug output just doesn't print a special
case for "broadcast" (but it really is broadcasting).

This output is debug output meant for developer consumption.  It is
really not all that useful for end users.  
> ==
> Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
> Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
> Jan 18 09:50:40 corosync [TOTEM ] Delivering 171 to 173
> Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq
> 172 to pending delivery queue
> Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq
> 173 to pending delivery queue
> Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
> Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
> Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
> Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
> Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 172
> Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 173
> 
> 
> =
> 
> ===
> 
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
> 
> totem {
> version: 2
> token: 3000
> token_retransmits_before_loss_const: 10
> join: 60
> consensus: 1500
> vsftype: none
> max_messages: 20
> clear_node_high_bit: yes
> secauth: on
> threads: 0
> rrp_mode: passive
> 
> interface {
> ringnumber: 0
> bindnetaddr: 192.168.2.0
> #   mcastaddr: 226.94.1.1
> broadcast: yes
> mcastport: 5405
> }
> interface {
> ringnumber: 1
> bindnetaddr: 172.20.20.0
> #mcastaddr: 226.94.2.1
> broadcast: yes
> mcastport: 5405
> }
> }
> logging {
> fileline: off
> to_stderr: yes
> to_logfile: yes
> to_syslog: yes
> logfile: /tmp/corosync.log
> debug: on
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
> 
> service {
> name: pacemaker
> ver: 0
> }
> 
> aisexec {
> user:root
> group: root
> }
> 
> amf {
> mode: disabled
> }
> =
> 
> 
> 
> Thanks
> Shravan
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] errors in corosync.log

2010-01-18 Thread Steven Dake
One possibility is you have a different cluster in your network on the
same multicast address and port.
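
If that turns out to be the case, giving each cluster its own port in the interface section keeps them apart, e.g. (port value illustrative):

        interface {
                ringnumber: 0
                bindnetaddr: 192.168.2.0
                broadcast: yes
                mcastport: 5415
        }

If I remember right, corosync also uses mcastport - 1, so leave a gap of at least two between clusters.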

Regards
-steve

On Sat, 2010-01-16 at 15:20 -0500, Shravan Mishra wrote:
> Hi Guys,
> 
> I'm running the following version of pacemaker and corosync
> corosync=1.1.1-1-2
> pacemaker=1.0.9-2-1
> 
> Every thing had been running fine for quite some time now but then I
> started seeing following errors in the corosync logs,
> 
> 
> =
> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
> digest... ignoring.
> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
> digest... ignoring.
> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
> digest... ignoring.
> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
> 
> 
> I can perform all the crm shell commands and what not but it's
> troubling that the above is happening.
> 
> My crm_mon output looks good.
> 
> 
> I also checked the authkey and did md5sum on both it's same.
> 
> Then I stopped corosync and regenerated the authkey with
> corosync-keygen and copied it to the the other machine but I still get
> the above message in the corosync log.
> 
> Is there anything other authkey that I should look into ?
> 
> 
> corosync.conf
> 
> 
> 
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
> 
> totem {
> version: 2
> token: 3000
> token_retransmits_before_loss_const: 10
> join: 60
> consensus: 1500
> vsftype: none
> max_messages: 20
> clear_node_high_bit: yes
> secauth: on
> threads: 0
> rrp_mode: passive
> 
> interface {
> ringnumber: 0
> bindnetaddr: 192.168.2.0
> #mcastaddr: 226.94.1.1
> broadcast: yes
> mcastport: 5405
> }
> interface {
> ringnumber: 1
> bindnetaddr: 172.20.20.0
> #mcastaddr: 226.94.1.1
> broadcast: yes
> mcastport: 5405
> }
> }
> 
> 
> logging {
> fileline: off
> to_stderr: yes
> to_logfile: yes
> to_syslog: yes
> logfile: /tmp/corosync.log
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
> 
> service {
> name: pacemaker
> ver: 0
> }
> 
> aisexec {
> user:root
> group: root
> }
> 
> amf {
> mode: disabled
> }
> 
> 
> ===
> 
> 
> Thanks
> Shravan
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Pacemaker/OpenAIS Software for openSuSE 11.2

2010-01-12 Thread Steven Dake

> > d) If I would try to compile from source as described at
> > http://www.clusterlabs.org/wiki/Install#First_Steps
> > one step is to get openais. Why are all the relevant
> > prebuild library packages called corosync?
> > I don't understand the distinction between openais and corosync
> 

read this link:
http://www.corosync.org/doku.php?id=faq:why


> Corosync used to be part of Openais.
> Then they split it into two parts to make maintenance easier.
> From their home page "The OpenAIS software is built to operate on the
> Corosync Cluster Engine "
> 
> > and how this two pieces fit together. By the way: There homepage
> > doesn't enlight me either.
> >
> > Enough questions for a restart.
> >
> > Best regards
> > Andreas Mock
> >
> >
> >
> > ___
> > Pacemaker mailing list
> > Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] openais/corosync

2010-01-11 Thread Steven Dake
On Mon, 2010-01-11 at 21:00 +0100, Andreas Mock wrote:
> > -Ursprüngliche Nachricht-
> > Von: "Steven Dake" 
> > Gesendet: 11.01.10 20:13:39
> > An: pacema...@clusterlabs.org
> > Betreff: Re: [Pacemaker] openais/corosync
> 
> 
> > 
> > See reasoning here:
> > http://www.corosync.org/doku.php?id=faq:why
> 
> Hi Steve,
> 
> thank you for that link. A piece of documentation I didn't find.
> 
> They know why they do have "improved documentation" on
> their 2010 agenda.  ;-)
> 

Yeah, it's pretty clear Corosync documentation is weak.  We really focused
on developing a great quality implementation and a good release model at
the expense of all other activities such as documentation and project
marketing.  We hope developers can deal with the documentation warts in
the near term until we sort that out.  In most cases, users don't need
much documentation on Corosync at all except managing corosync.conf
which is very well documented in man pages.  Corosync's functionality
should mostly be hidden behind application's functionality.

That said, we do want to improve documentation.  Beyond man pages for
all tools and APIs, we would eventually like to produce a user guide and
a separate developer guide which may number 100-200 PDF pages combined.
These objectives will happen this year.

Regards
-steve


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] openais/corosync

2010-01-11 Thread Steven Dake
On Mon, 2010-01-11 at 19:59 +0100, Andreas Mock wrote:
> Hi all,
> 
> I don't understand the distinction between
> openais and corosync. The prebuild packages are
> named after corosync while the documentation
> always talk about openais.
> 

See reasoning here:
http://www.corosync.org/doku.php?id=faq:why

> The infos I get from the homepages of openais/corosync
> dont help either. There is one paper on corosync's homepage
> saying that pacemaker is using corosync while the installation
> guide at http://www.clusterlabs.org/wiki/Install#OpenAIS.2A
> says to download openais.
> 

the clusterlabs documentation is technically correct.  You can use
openais whitetank (see link above) but I recommend just using Corosync
instead.

> Can someone enlight me even this may be more related
> to openais/corosync. I'm sure that the users of this code
> can tell me how the parts fit together.  ;-)
> 
> Best regards
> Andreas Mock
> 
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] coroync not able to exec services properly

2010-01-02 Thread Steven Dake
If you're using corosync 1.2.0, we enforced a constraint on consensus and
token such that consensus must be at least 1.2 * token.  Your consensus is
half the token, which will cause corosync to exit at start.
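
With the values in your config below (token: 3000, consensus: 1500) that means raising consensus to anything above 3600, for example:

token: 3000
consensus: 4000

(4000 is just an illustrative choice with some headroom over 1.2 * token.)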

Regards
-steve

On Mon, 2009-12-28 at 12:58 +0100, Dejan Muhamedagic wrote:
> Hi,
> 
> On Thu, Dec 24, 2009 at 02:35:01PM -0500, Shravan Mishra wrote:
> > Hi Guys,
> > 
> > I had a perfectly running system for about 3 weeks now but now on reboot I
> > see problems.
> > 
> > Looks like the processes are being spawned and respawned but a proper exec
> > is not happening.
> 
> According to the logs, attrd can't start (exit code 100) for some
> reason (perhaps there are more logs elsewhere where it says
> what's wrong) and pengine segfaults. For the latter please
> enable coredumps (ulimit -c unlimited) and file a bugzilla.
> 
> > Am I missing some permissions on directories.
> > 
> > 
> > I have a script which does the following for directories:
> 
> Why do you need this script? It should be done by the package
> installation scripts.
> 
> > =
> > getent group haclient > /dev/null || groupadd -r haclient
> > getent passwd hacluster > /dev/null || useradd -r -g haclient -d
> > /var/lib/heartbeat/cores/hacluster -s /sbin/nologin -c "cluster user"
> > hacluster
> > 
> > if [ ! -d "/var/lib/pengine" ];then
> >  mkdir /var/lib/pengine
> > fi
> > chown -R hacluster:haclient /var/lib/pengine
> > 
> > if [ ! -d "/var/lib/heartbeat" ];then
> > mkdir /var/lib/heartbeat
> > fi
> > 
> > if [ ! -d "/var/lib/heartbeat/crm" ];then
> >  mkdir /var/lib/heartbeat/crm
> > fi
> > chown -R hacluster:haclient /var/lib/heartbeat/crm/
> > chmod 750 /var/lib/heartbeat/crm/
> > 
> > if [ ! -d "/var/lib/heartbeat/ccm" ];then
> >  mkdir /var/lib/heartbeat/ccm
> > fi
> > chown -R hacluster:haclient /var/lib/heartbeat/ccm/
> > chmod 750 /var/lib/heartbeat/ccm/
> > 
> > if [ ! -d "/var/run/heartbeat/" ];then
> >  mkdir /var/run/heartbeat/
> >  fi
> > 
> > if [ ! -d "/var/run/heartbeat/ccm" ];then
> >  mkdir /var/run/heartbeat/ccm/
> >  fi
> > chown -R hacluster:haclient /var/run/heartbeat/ccm/
> > chmod 750 /var/run/heartbeat/ccm/
> 
> You don't need ccm for corosync/openais clusters.
> 
> > if [ ! -d "/var/run/heartbeat/crm" ];then
> >  mkdir /var/run/heartbeat/crm/
> >  fi
> > chown -R hacluster:haclient /var/run/heartbeat/crm/
> > chmod 750 /var/run/heartbeat/crm/
> > 
> > if [ ! -d "/var/run/crm" ];then
> >  mkdir /var/run/crm
> > fi
> > 
> > if [ ! -d "/var/lib/corosync" ];then
> >  mkdir /var/lib/corosync
> > fi
> > =
> > 
> > 
> > I have a very simple active-passive configuration with just 2 nodes.
> > 
> > On starting Corosync , on doing
> > 
> > 
> > [r...@node2 ~]# ps -ef | grep coro
> > root  8242 1  0 11:33 ?00:00:00 /usr/sbin/corosync
> > root  8248  8242  0 11:33 ?00:00:00 /usr/sbin/corosync
> > root  8249  8242  0 11:33 ?00:00:00 /usr/sbin/corosync
> > root  8250  8242  0 11:33 ?00:00:00 /usr/sbin/corosync
> > root  8252  8242  0 11:33 ?00:00:00 /usr/sbin/corosync
> > root  8393  8242  0 11:35 ?00:00:00 /usr/sbin/corosync
> > [r...@node2 ~]# ps -ef | grep heart
> > 827924 1  0 11:28 ?00:00:00 /usr/lib64/heartbeat/pengine
> > 
> > I'm attaching the log file.
> > 
> > My config is:
> > 
> > 
> > # Please read the corosync.conf.5 manual page
> > compatibility: whitetank
> > 
> > totem {
> >  version: 2
> >   token: 3000
> >   token_retransmits_before_loss_const: 10
> >   join: 60
> >   consensus: 1500
> >   vsftype: none
> >   max_messages: 20
> >   clear_node_high_bit: yes
> >   secauth: on
> >   threads: 0
> >   rrp_mode: passive
> > interface {
> > ringnumber: 0
> > bindnetaddr: 192.168.1.0
> > # mcastaddr: 226.94.1.1
> > broadcast: yes
> > mcastport: 5405
> > }
> > interface {
> > ringnumber: 1
> > bindnetaddr: 172.20.20.0
> > # mcastaddr: 226.94.1.1
> > broadcast: yes
> > mcastport: 5405
> > }
> > }
> > 
> > logging {
> > fileline: off
> > to_stderr: yes
> > to_logfile: yes
> > to_syslog: yes
> > logfile: /tmp/corosync.log
> 
> Don't log to file. Can't recall exactly but there were some
> permission problems with that, probably because Pacemaker daemons
> don't run as root.
> 
> Thanks,
> 
> Dejan
> 
> > debug: on
> > timestamp: on
> > logger_subsys {
> > subsys: AMF
> > debug: off
> > }
> > }
> > 
> > service {
> > name: pacemaker
> > ver: 0
> > }
> > 
> > aisexec {
> > user:root
> > group: root
> > }
> > 
> > amf {
> > mode: disabled
> > }
> > 
> > 
> > Please help.
> > 
> > Sincerely
> > Shravan
> 
> 
> > ___
> > Pacemaker mailing list
> > Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemake

Re: [Pacemaker] corosync init script broken

2010-01-02 Thread Steven Dake
Hopefully all of these init script problems have been fixed in 1.2.0 by
Fabio and Andrew and should be in a repo available for you soon.
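
Until then, one possible workaround (untested sketch) besides renaming the init script is to have the status check match only the real daemon, not a shell running a script of the same name, e.g.:

pidof corosync >/dev/null && echo running

since pidof, unlike killall, ignores scripts unless given -x.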

Regards
-steve

On Mon, 2009-12-28 at 13:22 +0100, Dominik Klein wrote:
> Hi cluster people
> 
> been a while, couldn't really follow things. Today I was tasked to
> install a new cluster, went for 1.0.6 and corosync as described on the
> wiki and hit this:
> 
> New cluster with pacemaker 106 and latest available corosync from the
> clusterlabs.org/rpm opensuse 11.1 repo.
> 
> This installs /etc/init.d/corosync
> 
> "start" says "OK", but does not start corosync.
> 
> Manually starting it, then
> 
> "stop" never returns.
> 
> This is because the internal status in the script calls "killall -0
> corosync". This finds /etc/init.d/corosync, therefore start returns
> early and stop never returns.
> 
> Workaround: Rename /etc/init.d/corosync
> 
> I can't believe I am the first one to hit this. Am I?
> 
> Regards
> Dominik
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] parse error in config: The consensus timeout parameter (1200 ms) must be atleast 1.2 * token (1200 ms)

2010-01-02 Thread Steven Dake
On Mon, 2009-12-28 at 19:05 -0500, Daniel Qian wrote:
> I am using Corosync 1.2.0 that comes with Fedora 12 and have this error when 
> trying to start corosync service. If I set it to anything above 1200 the 
> error goes away. Is this a bug or something intended for?
> 
> Thanks,
> Daniel 
> 

We require consensus to be at least 1.2 * token, or the membership protocol
can enter a degraded state of operation.  This issue was recently found,
and we now enforce the correct policy for this config option to avoid
erroneous membership protocol behavior.

There is a bug in 1.2.0 (fixed in trunk) where a consensus of exactly
1.2 * token is rejected with an error.  For example, 1200 returns an error
while 1201 works (1200 is exactly 1.2 * the token).
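
Purely as an illustration of that boundary (values made up):

totem {
        version: 2
        token: 1000
        # consensus: 1200 would be rejected by 1.2.0 (exactly 1.2 * token);
        # anything strictly greater is accepted:
        consensus: 1201
}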

Regards
-steve
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Fedora 12 repository

2009-12-20 Thread Steven Dake
Pacemaker is integrated directly in the fedora repo instead of
externally.  You can grab it using yum install pacemaker.

Regards
-steve

On Sun, 2009-12-20 at 11:46 -0500, E-Blokos wrote:
> Hi,
> 
> is there any yum repository for Fedora 12 ?
> I checked http://download.opensuse.org/repositories/server%3A/ha-clustering
> but there are only folder for 10 and 11
> 
> Thanks
> 
> Franck
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Openais: Corosync Executive couldn't openconfiguration component 'openaisserviceenable'

2009-12-03 Thread Steven Dake
Please note, we are aware of this bug upstream.  A fix is in the
upstream repo, and we will be making a new release which resolves the init
script problem, hopefully this week.

What Frank suggests is essentially what we have done in the repo.

Regards
-steve

On Thu, 2009-12-03 at 12:31 -0500, Frank DiMeo wrote:
> Try replacing the existing exported symbol in your /etc/init.d/openais
> file with:
> 
>  
> 
> export
> COROSYNC_DEFAULT_CONFIG_IFACE="openaisserviceenableexperimental:corosync_parser"
> 
>  
> 
> -Frank
> 
>  
> 
> From: Shravan Mishra [mailto:shravan.mis...@gmail.com] 
> Sent: Thursday, December 03, 2009 10:18 AM
> To: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] Openais: Corosync Executive couldn't
> openconfiguration component 'openaisserviceenable'
> 
> 
>  
> 
> check if /usr/libexec/lcrso/openaisserviceenable.lcrso exisits on your
> machine.
> 
>  
> 
> 
> Regards
> 
> 
> Shravan
> 
> On Thu, Dec 3, 2009 at 7:00 AM, Martin Gombac(  wrote:
> 
> Hi guys,
> 
> i just installed openais from clusterlabs repo. Configured it, but am
> unable to start it. The error doesn't really tell me anything. Can
> someone point me in the right direction?
> 
> 
> /etc/init.d/openais start
> Starting OpenAIS (corosync): corosync [MAIN  ] Corosync Cluster Engine
> ('1.1.2'): started and ready to provide service.
> corosync [MAIN  ] Corosync built-in features: nss rdma
> corosync [MAIN  ] Corosync Executive couldn't open configuration
> component 'openaisserviceenable'
> corosync [MAIN  ] Corosync Cluster Engine exiting with status -9 at
> main.c:900.
>   [FAILED]
> 
> 
> Config:
> 
> totem {
>version:2
># Disable encryption
>secauth:off
># How many threads to use for encryption/decryption
>threads:0
># How long before declaring a token lost (ms)
>token:  1
># How many token retransmits before forming a new configuration
>token_retransmits_before_loss_const: 20
># How long to wait for join messages in the membership protocol
> (ms)
>join:   60
># How long to wait for consensus to be achieved before starting
> a new round of membership configuration (ms)
>consensus:  4800
># Turn off the virtual synchrony filter
>vsftype:none
># Number of messages that may be sent by one processor on
> receipt of the token
>max_messages:   20
># Limit generated nodeids to 31-bits (positive signed integers)
>clear_node_high_bit: yes
># Optionally assign a fixed node id (integer)
>#nodeid:1234
>interface {
>ringnumber: 0
>bindnetaddr: 192.168.0.1
>mcastaddr: 226.94.1.1
>mcastport: 5405
>}
>interface {
>ringnumber: 1
>bindnetaddr: 189.9.21.16
>mcastaddr: 226.94.1.1
>mcastport: 5405
>}
> }
> 
> logging {
>debug: on
>timestamp: on
>fileline: off
>to_syslog: yes
>to_stderr: no
>syslog_facility: daemon
> }
> 
> amf {
>mode: disabled
> }
> 
>  service {
># Load the Pacemaker Cluster Resource Manager
>name: pacemaker
>ver:  0
>  }
> 
>  aisexec {
>user:   root
>group:  root
> 
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> 
>  
> 
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Node crash when 'ifdown eth0'

2009-11-30 Thread Steven Dake
On Mon, 2009-11-30 at 17:05 -0700, hj lee wrote:
> 
> 
> On Fri, Nov 27, 2009 at 3:05 PM, Steven Dake  wrote:
> On Fri, 2009-11-27 at 11:32 -0200, Mark Horton wrote:
> > I'm using pacemaker 1.0.6 and corosync 1.1.2 (not using
> openais) with
> > centos 5.4.  The packages are from here:
> > http://www.clusterlabs.org/rpm/epel-5/
> >
> > Mark
> >
> > On Fri, Nov 27, 2009 at 9:01 AM, Oscar Remí­rez de Ganuza
> Satrústegui
> >  wrote:
> > > Good morning,
> > >
> > > We are testing a cluster configuration on RHEL5 (x86_64)
> with pacemaker
> > > 1.0.5 and openais (0.80.5).
> > > Two node cluster, active-passive, with the following
> resources:
> > > Mysql service resource and a NFS filesystem resource
> (shared storage in a
> > > SAN).
> > >
> > > In our tests, when we bring down the network interface
> (ifdown eth0), the
> 
> 
> What is the use case for ifdown eth0 (ie what are you trying
> to verify)?
> 
> I have the same test case. In my case, when two nodes cluster is
> disconnect, I want to see split-brain. And then I want to see the
> split-brain handler resets one of nodes. What I want to verify is that
> the cluster will recover network disconnection and split-brain
> situation.
> 

ifconfig eth0 down is totally different from testing a node
disconnection.  When corosync detects eth0 being taken down, it
binds to the interface 127.0.0.1.  This is probably not what you had in
mind when you wanted to test split brain.  Keep in mind that, from a POSIX
API perspective, an interface taken out of service is different from an
interface failing.

What you really want to test is pulling the network cable between the
machines.

Regards
-steve

> Thanks
> hj
> 
> 


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Node crash when 'ifdown eth0'

2009-11-27 Thread Steven Dake
On Fri, 2009-11-27 at 11:32 -0200, Mark Horton wrote:
> I'm using pacemaker 1.0.6 and corosync 1.1.2 (not using openais) with
> centos 5.4.  The packages are from here:
> http://www.clusterlabs.org/rpm/epel-5/
> 
> Mark
> 
> On Fri, Nov 27, 2009 at 9:01 AM, Oscar Remí­rez de Ganuza Satrústegui
>  wrote:
> > Good morning,
> >
> > We are testing a cluster configuration on RHEL5 (x86_64) with pacemaker
> > 1.0.5 and openais (0.80.5).
> > Two node cluster, active-passive, with the following resources:
> > Mysql service resource and a NFS filesystem resource (shared storage in a
> > SAN).
> >
> > In our tests, when we bring down the network interface (ifdown eth0), the

What is the use case for ifdown eth0 (ie what are you trying to verify)?

I recommend using the latest pacemaker and corosync as well if you're doing
a new deployment.

Regards
-steve


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] pacemaker-1.0.6 + corosync-1.1.2 crashing - SOLVED

2009-11-21 Thread Steven Dake
On Sat, 2009-11-21 at 20:00 +0100, Nikola Ciprich wrote:
> Hi Guys,
> Finally I've found where the problem was! On my testing machines,
> the system was lacking separate /dev/shm tmpfs mount. While the /dev
> directory is also mounted as tmpfs, so it seemingly doesn't make any
> difference, there IS one: /dev is mounted with mode=755 parameter,
> while /dev/shm should be mounted without it. So after I mounted it,
> everything starts working like a charm! :)
> While it's certainly unusual that systems lack separate /dev/shm
> mount,
> it still might be mentioned in documentation.
> anyways, thanks to all of You for Your time.
> have a nice day.
> nik

thanks for the bug report.

corosync should use /var/lib/corosync if /dev/shm is unavailable.  This
is also what the code does. Probably needs a bit of testing around that
case.
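
For anyone else hitting this, the missing mount is a one-liner (options are the usual defaults):

mount -t tmpfs tmpfs /dev/shm

with a matching /etc/fstab entry so it persists across reboots:

tmpfs   /dev/shm   tmpfs   defaults   0 0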

Regards
-steve


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] **** SPAM **** Re: pacemaker-1.0.6 + corosync 1.1.2 crashing

2009-11-20 Thread Steven Dake
Nik,

Any chance you have a backtrace of the core files?  That might be
helpful in pinpointing the issue.

To do this, run:
gdb <binaryname> <corefilename>
(gdb) bt
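
If no core file shows up, core dumps are probably disabled; running, for example,

ulimit -c unlimited

in the shell that starts corosync before reproducing the crash should take care of that.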

Regards
-steve

On Thu, 2009-11-19 at 17:50 +0100, Nikola Ciprich wrote:
> Hi Andrew,
> sorry to bother again, do You have some idea what else might be wrong?
> Does it make sense to CC openais or cluster maillist?
> Is there some other debugging You would recommend?
> with best regards
> nik
> 
> On Wed, Nov 18, 2009 at 03:26:28PM +0100, Nikola Ciprich wrote:
> > I've packaged those myself, all are based on clean sources without any
> > additional patches.


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Resource capacity limit

2009-11-12 Thread Steven Dake
On Thu, 2009-11-12 at 14:53 +0100, Andrew Beekhof wrote:
> On Wed, Nov 11, 2009 at 1:36 PM, Lars Marowsky-Bree  wrote:
> > On 2009-11-05T14:45:36, Andrew Beekhof  wrote:
> >
> >> Lastly, I would really like to defer this for 1.2
> >> I know I've bent the rules a bit for 1.0 in the past, but its really
> >> late in the game now.
> >
> > Personally, I think the Linux kernel model works really well. ie, no
> > "major releases" any more, but bugfixes and features alike get merged
> > over time and constantly.
> 
> Thats a great model if you've got hoards of developers and testers.
> Of which we have neither.
> 
> At this point in time, I can't see us going back to the way heartbeat
> releases were done.
> If there was a single thing that I'd credit Pacemaker's current
> reliability to, it would be our release strategy.

Maintaining corosync and openais, I'd surely like to only have one tree
where all work is done and never have a "stable" branch.  Andrew is
right though, this model only works if there is large downstream
adoption and support and distros take on the work of stabilizing the
efforts of the trunk development.

Talking with distros, I know this is generally not the case with any
package other than kernel.org and maybe some related bits like xen/kvm
(which has forced this model upon them).

Regards
-steve




___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] pacemaker-1.0.6 + corosync 1.1.2 crashing

2009-11-10 Thread Steven Dake
Nikola,

Yet another possibility is that your box doesn't have any/enough shared
memory available.  Usually this is in the directory /dev/shm.
Unfortunately bad things happen when it runs out, and error handling around
this condition needs some work.  It's hard to tell because the signal
delivered to the application on failure is not shown in your backtrace.

For example I have plenty of shared memory available (command is from
df).
tmpfs  1027020  3560   1023460   1% /dev/shm

Regards
-steve

On Tue, 2009-11-10 at 10:28 +0100, Nikola Ciprich wrote:
> Hello Andrew et al,
> few days ago, I asked about pacemaker + corosync + clvmd etc. With Your 
> advice, I got this working well.
> It was in testing virtual machines, I'm now trying to install similar setup 
> on raw hardware but for some
> reasong attrd and cib seem to be crashing.
> 
> here's snippet from corosync log:
> Nov 10 14:12:21 vbox3 corosync[4299]:   [MAIN  ] Corosync Cluster Engine 
> ('1.1.2'): started and ready to provide service.
> Nov 10 14:12:21 vbox3 corosync[4299]:   [MAIN  ] Corosync built-in features: 
> nss rdma
> Nov 10 14:12:21 vbox3 corosync[4299]:   [MAIN  ] Successfully read main 
> configuration file '/etc/corosync/corosync.conf'.
> Nov 10 14:12:21 vbox3 corosync[4299]:   [TOTEM ] Initializing transport 
> (UDP/IP).
> Nov 10 14:12:21 vbox3 corosync[4299]:   [TOTEM ] Initializing 
> transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Nov 10 14:12:21 vbox3 corosync[4299]:   [MAIN  ] Compatibility mode set to 
> whitetank.  Using V1 and V2 of the synchronization engine.
> Nov 10 14:12:21 vbox3 corosync[4299]:   [TOTEM ] The network interface 
> [10.58.0.1] is now up.
> Nov 10 14:12:21 vbox3 corosync[4299]:   [pcmk  ] info: process_ais_conf: 
> Reading configure
> Nov 10 14:13:16 vbox3 corosync[4348]:   [MAIN  ] Corosync Cluster Engine 
> ('1.1.2'): started and ready to provide service.
> Nov 10 14:13:16 vbox3 corosync[4348]:   [MAIN  ] Corosync built-in features: 
> nss rdma
> Nov 10 14:13:16 vbox3 corosync[4348]:   [MAIN  ] Successfully read main 
> configuration file '/etc/corosync/corosync.conf'.
> Nov 10 14:13:16 vbox3 corosync[4348]:   [TOTEM ] Initializing transport 
> (UDP/IP).
> Nov 10 14:13:16 vbox3 corosync[4348]:   [TOTEM ] Initializing 
> transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Nov 10 14:13:16 vbox3 corosync[4348]:   [MAIN  ] Compatibility mode set to 
> whitetank.  Using V1 and V2 of the synchronization engine.
> Nov 10 14:13:16 vbox3 corosync[4348]:   [TOTEM ] The network interface 
> [10.58.0.1] is now up.
> Nov 10 14:13:16 vbox3 corosync[4348]:   [pcmk  ] info: process_ais_conf: 
> Reading configure
> Nov 10 14:13:24 vbox3 corosync[4357]:   [MAIN  ] Corosync Cluster Engine 
> ('1.1.2'): started and ready to provide service.
> Nov 10 14:13:24 vbox3 corosync[4357]:   [MAIN  ] Corosync built-in features: 
> nss rdma
> Nov 10 14:13:24 vbox3 corosync[4357]:   [MAIN  ] Successfully read main 
> configuration file '/etc/corosync/corosync.conf'.
> Nov 10 14:13:24 vbox3 corosync[4357]:   [TOTEM ] Initializing transport 
> (UDP/IP).
> Nov 10 14:13:24 vbox3 corosync[4357]:   [TOTEM ] Initializing 
> transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Nov 10 14:13:24 vbox3 corosync[4357]:   [MAIN  ] Compatibility mode set to 
> whitetank.  Using V1 and V2 of the synchronization engine.
> Nov 10 14:13:24 vbox3 corosync[4357]:   [TOTEM ] The network interface 
> [10.58.0.1] is now up.
> Nov 10 14:13:24 vbox3 corosync[4357]:   [pcmk  ] info: process_ais_conf: 
> Reading configure
> Nov 10 14:13:57 vbox3 corosync[4380]:   [MAIN  ] Corosync Cluster Engine 
> ('1.1.2'): started and ready to provide service.
> Nov 10 14:13:57 vbox3 corosync[4380]:   [MAIN  ] Corosync built-in features: 
> nss rdma
> Nov 10 14:13:57 vbox3 corosync[4380]:   [MAIN  ] Successfully read main 
> configuration file '/etc/corosync/corosync.conf'.
> Nov 10 14:13:57 vbox3 corosync[4380]:   [TOTEM ] Initializing transport 
> (UDP/IP).
> Nov 10 14:13:57 vbox3 corosync[4380]:   [TOTEM ] Initializing 
> transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Nov 10 14:13:57 vbox3 corosync[4380]:   [MAIN  ] Compatibility mode set to 
> whitetank.  Using V1 and V2 of the synchronization engine.
> Nov 10 14:13:58 vbox3 corosync[4380]:   [TOTEM ] The network interface 
> [10.58.0.1] is now up.
> Nov 10 14:13:58 vbox3 corosync[4380]:   [pcmk  ] info: process_ais_conf: 
> Reading configure
> Nov 10 14:13:58 vbox3 corosync[4380]:   [pcmk  ] info: config_find_init: 
> Local handle: 9213452461992312833 for logging
> Nov 10 14:13:58 vbox3 corosync[4380]:   [pcmk  ] info: config_find_next: 
> Processing additional logging options...
> Nov 10 14:13:58 vbox3 corosync[4380]:   [pcmk  ] info: get_config_opt: Found 
> 'off' for option: debug
> Nov 10 14:13:58 vbox3 corosync[4380]:   [pcmk  ] info: get_config_opt: 
> Defaulting to 'off' for option: to_file
> Nov 10 14:13:58 vbox3 corosync[4380]:   [pcmk  ] info:

Re: [Pacemaker] pacemaker-1.0.6 + corosync 1.1.2 crashing

2009-11-10 Thread Steven Dake
One possibility is that selinux is enabled and your selinux policies are
outdated.

Another possibility is you have improper coroipcc libraries (duplicates)
installed on your system.

Check your installed lib dir for coroipcc.so.4 and 4.0.0 and
coroipcc.so.  They should all link to the same file.
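
A quick way to check (paths illustrative; adjust for your lib dir):

ls -l /usr/lib64/libcoroipcc.so*
df /dev/shm

All the libcoroipcc symlinks should resolve to the same versioned file, and /dev/shm should show free space.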

Another possibility is that you're compiling on a libc which does not
support POSIX semaphores.

Could you tell us more about your platform?

regards
-steve

On Tue, 2009-11-10 at 21:48 -0200, Mark Horton wrote:
> Nikola,
> Sorry, I don't have a solution, but I'm curious about your setup.
> Which version of DLM are you using?  Did you have to compile it
> yourself?
> 
> Regards,
> Mark
> 
> On Tue, Nov 10, 2009 at 7:28 AM, Nikola Ciprich  
> wrote:
> > Hello Andrew et al,
> > few days ago, I asked about pacemaker + corosync + clvmd etc. With Your 
> > advice, I got this working well.
> > It was in testing virtual machines, I'm now trying to install similar setup 
> > on raw hardware but for some
> > reasong attrd and cib seem to be crashing.
> >
> > here's snippet from corosync log:
> > Nov 10 14:12:21 vbox3 corosync[4299]:   [MAIN  ] Corosync Cluster Engine 
> > ('1.1.2'): started and ready to provide service.
> > Nov 10 14:12:21 vbox3 corosync[4299]:   [MAIN  ] Corosync built-in 
> > features: nss rdma
> > Nov 10 14:12:21 vbox3 corosync[4299]:   [MAIN  ] Successfully read main 
> > configuration file '/etc/corosync/corosync.conf'.
> > Nov 10 14:12:21 vbox3 corosync[4299]:   [TOTEM ] Initializing transport 
> > (UDP/IP).
> > Nov 10 14:12:21 vbox3 corosync[4299]:   [TOTEM ] Initializing 
> > transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> > Nov 10 14:12:21 vbox3 corosync[4299]:   [MAIN  ] Compatibility mode set to 
> > whitetank.  Using V1 and V2 of the synchronization engine.
> > Nov 10 14:12:21 vbox3 corosync[4299]:   [TOTEM ] The network interface 
> > [10.58.0.1] is now up.
> > Nov 10 14:12:21 vbox3 corosync[4299]:   [pcmk  ] info: process_ais_conf: 
> > Reading configure
> > Nov 10 14:13:16 vbox3 corosync[4348]:   [MAIN  ] Corosync Cluster Engine 
> > ('1.1.2'): started and ready to provide service.
> > Nov 10 14:13:16 vbox3 corosync[4348]:   [MAIN  ] Corosync built-in 
> > features: nss rdma
> > Nov 10 14:13:16 vbox3 corosync[4348]:   [MAIN  ] Successfully read main 
> > configuration file '/etc/corosync/corosync.conf'.
> > Nov 10 14:13:16 vbox3 corosync[4348]:   [TOTEM ] Initializing transport 
> > (UDP/IP).
> > Nov 10 14:13:16 vbox3 corosync[4348]:   [TOTEM ] Initializing 
> > transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> > Nov 10 14:13:16 vbox3 corosync[4348]:   [MAIN  ] Compatibility mode set to 
> > whitetank.  Using V1 and V2 of the synchronization engine.
> > Nov 10 14:13:16 vbox3 corosync[4348]:   [TOTEM ] The network interface 
> > [10.58.0.1] is now up.
> > Nov 10 14:13:16 vbox3 corosync[4348]:   [pcmk  ] info: process_ais_conf: 
> > Reading configure
> > Nov 10 14:13:24 vbox3 corosync[4357]:   [MAIN  ] Corosync Cluster Engine 
> > ('1.1.2'): started and ready to provide service.
> > Nov 10 14:13:24 vbox3 corosync[4357]:   [MAIN  ] Corosync built-in 
> > features: nss rdma
> > Nov 10 14:13:24 vbox3 corosync[4357]:   [MAIN  ] Successfully read main 
> > configuration file '/etc/corosync/corosync.conf'.
> > Nov 10 14:13:24 vbox3 corosync[4357]:   [TOTEM ] Initializing transport 
> > (UDP/IP).
> > Nov 10 14:13:24 vbox3 corosync[4357]:   [TOTEM ] Initializing 
> > transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> > Nov 10 14:13:24 vbox3 corosync[4357]:   [MAIN  ] Compatibility mode set to 
> > whitetank.  Using V1 and V2 of the synchronization engine.
> > Nov 10 14:13:24 vbox3 corosync[4357]:   [TOTEM ] The network interface 
> > [10.58.0.1] is now up.
> > Nov 10 14:13:24 vbox3 corosync[4357]:   [pcmk  ] info: process_ais_conf: 
> > Reading configure
> > Nov 10 14:13:57 vbox3 corosync[4380]:   [MAIN  ] Corosync Cluster Engine 
> > ('1.1.2'): started and ready to provide service.
> > Nov 10 14:13:57 vbox3 corosync[4380]:   [MAIN  ] Corosync built-in 
> > features: nss rdma
> > Nov 10 14:13:57 vbox3 corosync[4380]:   [MAIN  ] Successfully read main 
> > configuration file '/etc/corosync/corosync.conf'.
> > Nov 10 14:13:57 vbox3 corosync[4380]:   [TOTEM ] Initializing transport 
> > (UDP/IP).
> > Nov 10 14:13:57 vbox3 corosync[4380]:   [TOTEM ] Initializing 
> > transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> > Nov 10 14:13:57 vbox3 corosync[4380]:   [MAIN  ] Compatibility mode set to 
> > whitetank.  Using V1 and V2 of the synchronization engine.
> > Nov 10 14:13:58 vbox3 corosync[4380]:   [TOTEM ] The network interface 
> > [10.58.0.1] is now up.
> > Nov 10 14:13:58 vbox3 corosync[4380]:   [pcmk  ] info: process_ais_conf: 
> > Reading configure
> > Nov 10 14:13:58 vbox3 corosync[4380]:   [pcmk  ] info: config_find_init: 
> > Local handle: 9213452461992312833 for logging
> > Nov 10 14:13:58 vbox3 corosync[4380]:   [pcmk  ] info: confi

Re: [Pacemaker] [ANNOUNCEMENT] Debian Packages for Pacemaker 1.0.6, completely revamped

2009-11-04 Thread Steven Dake
On Thu, 2009-11-05 at 00:06 +0100, Colin wrote:
> On Wed, Nov 4, 2009 at 5:47 PM, Andrew Beekhof  wrote:
> >
> > Hopelessly out of date?
> > Corosync has been supported for all of 3 days now.
> 
> Sorry, it seems that I jumped to a wrong conclusion (namely that with
> Corosync being a part of OpenAIS, and Pacemaker having run on OpenAIS
> for a while, that there wasn't much difference to supporting Corosync
> instea of OpenAIS -- shows that I'm still quite ignorant about some of
> the internals.)
> 
> Actually, I set up Pacemaker with Corosync from the new packages, just
> to see what it looks like, and it was so easy that we'll stick to it
> for the next round of tests, i.o.w., the details of the cluster
> underneath Pacemaker are so well hidden that (a) it doesn't make much
> difference, and (b) my ignorance in that area never was a problem: It
> just works.
> 
> -Colin

The intent with Corosync was that the migration path for users be mostly
seamless, and we have more or less nailed that, with the exception of a
few configuration file and CLI binary renames (and of course a new ABI for
Pacemaker to program to, which was not painless for Andrew :)).

Regards
-steve
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] [ANNOUNCEMENT] Debian Packages for Pacemaker 1.0.6, completely revamped

2009-11-03 Thread Steven Dake
On Wed, 2009-11-04 at 09:35 +0800, Romain CHANU wrote:
> Hi Martin,
> 
> Could you tell us what's the rationale to remove openais and include
> corosync?
> 
> Would it mean that people should use corosync from now on for any HA
> development?
> 
> Best Regards,
> 
> Romain Chanu
> 

Just a short note: I would also recommend making available the latest
openais packages, which complement both corosync and pacemaker with
SA Forum compliant APIs.

Regards
-steve

> 
> 2009/11/3 Martin Gerhard Loschwitz 
> Ladies and Gentleman,
> 
> i am happy to announce the availability of Pacemaker 1.0.6
> packages
> for Debian GNU/Linux 5.0 alias Lenny (i386 and amd64).
> 
> These packages are a remarkable break, as they have totally
> and
> ruthlessly been revamped. The whole layout has actually
> changed;
> here are the most important things to keep in mind when using
> them:
> 
> * pacemaker-openais and pacemaker-heartbeat are gone;
> pacemaker now
> only comes in one flavour, having support for corosync and
> heartbeat
> built it. This is based on pacemaker's capability to detect by
> which
> messaging framework it has been started and act accordingly.
> 
> * openais is gone. pacemaker 1.0.6 uses corosync.
> 
> * the new layout allows flawless updates. if you have
> heartbeat
> 2.1.4 and do a dist-upgrade, you will automatically get
> pacemaker.
> all you need to do afterwards is converting the xml-file to
> work
> with pacemaker -- you can then start heartbeat, and things are
> going to be fine (more on this can be found in the
> Clusterlabs-
> Wiki)
> 
> * Now that we finally have a decent layout for pacemaker, we
> can
> easily provide gui packages: welcome pacemaker-mgmt, being in
> good
> condition and shape now, allowing you do administer your
> cluster
> via a GTK tool.
> 
> The new packages can as always be found on:
> 
> deb http://people.debian.org/~madkiss/ha lenny main
> deb-src http://people.debian.org/~madkiss/ha lenny main
> 
> --
> : Martin G. Loschwitz   Tel +43-1-8178292-63
>  :
> : LINBIT Information Technologies GmbH  Fax +43-1-8178292-82
>  :
> : Vivenotgasse 48, 1120 Vienna, Austria
> http://www.linbit.com :
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] [ANNOUNCEMENT] Debian Packages for Pacemaker 1.0.6, completely revamped

2009-11-03 Thread Steven Dake
On Wed, 2009-11-04 at 11:41 +1000, Luke Bigum wrote:
> The OpenAIS project has split into Corosync and OpenAIS. Someone else
> might be able to explain it better, but Corosync now contains the core
> clustering components the openais package used to have (aisexec, etc),
> while the OpenAIS project just contains the SA Forum API stuff.
> 
>  

From the maintainer of both openais and corosync, this is an accurate
description.

All of the features of openais that pacemaker used previously are now
integrated into Corosync.  Those features have been removed from openais,
and now openais contains only the SA Forum APIs.

Regards
-steve

> 
> So, what most people once thought of as "OpenAIS" is now Corosync.
> 
>  
> 
> Luke Bigum
> 
> Systems Administrator
> 
>  (p) 1300 661 668
> 
>  (f)  1300 661 540
> 
> (e)  lbi...@iseek.com.au
> 
> http://www.iseek.com.au
> 
> Level 1, 100 Ipswich Road Woolloongabba QLD 4102
> 
>  
> 
> iseekbar.jpg
> 
>  
> 
> This e-mail and any files transmitted with it may contain confidential
> and privileged material for the sole use of the intended recipient.
> Any review, use, distribution or disclosure by others is strictly
> prohibited. If you are not the intended recipient (or authorised to
> receive for the recipient), please contact the sender by reply e-mail
> and delete all copies of this message.
> 
>  
> 
>  
> 
> From: Romain CHANU [mailto:romainch...@gmail.com] 
> Sent: Wednesday 4 November 2009 11:36 AM
> To: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] [ANNOUNCEMENT] Debian Packages for Pacemaker
> 1.0.6, completely revamped
> 
> 
>  
> 
> Hi Martin,
> 
> Could you tell us what's the rationale to remove openais and include
> corosync?
> 
> Would it mean that people should use corosync from now on for any HA
> development?
> 
> Best Regards,
> 
> Romain Chanu
> 
> 
> 
> 2009/11/3 Martin Gerhard Loschwitz 
> 
> Ladies and Gentleman,
> 
> i am happy to announce the availability of Pacemaker 1.0.6 packages
> for Debian GNU/Linux 5.0 alias Lenny (i386 and amd64).
> 
> These packages are a remarkable break, as they have totally and
> ruthlessly been revamped. The whole layout has actually changed;
> here are the most important things to keep in mind when using them:
> 
> * pacemaker-openais and pacemaker-heartbeat are gone; pacemaker now
> only comes in one flavour, having support for corosync and heartbeat
> built it. This is based on pacemaker's capability to detect by which
> messaging framework it has been started and act accordingly.
> 
> * openais is gone. pacemaker 1.0.6 uses corosync.
> 
> * the new layout allows flawless updates. if you have heartbeat
> 2.1.4 and do a dist-upgrade, you will automatically get pacemaker.
> all you need to do afterwards is converting the xml-file to work
> with pacemaker -- you can then start heartbeat, and things are
> going to be fine (more on this can be found in the Clusterlabs-
> Wiki)
> 
> * Now that we finally have a decent layout for pacemaker, we can
> easily provide gui packages: welcome pacemaker-mgmt, being in good
> condition and shape now, allowing you do administer your cluster
> via a GTK tool.
> 
> The new packages can as always be found on:
> 
> deb http://people.debian.org/~madkiss/ha lenny main
> deb-src http://people.debian.org/~madkiss/ha lenny main
> 
> --
> : Martin G. Loschwitz   Tel +43-1-8178292-63  :
> : LINBIT Information Technologies GmbH  Fax +43-1-8178292-82  :
> : Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :
> 


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] How to configure the openais.conf

2009-11-03 Thread Steven Dake
We have found through testing that the practical limit on cluster size with
the protocol used in openais is currently ~30 nodes.  The default
parameters should work well at these sizes.
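
If you do want to experiment beyond that size, the knobs that usually need
relaxing are the totem timers.  A sketch only - these values are
illustrative and have not been validated at 60 nodes:

totem {
        version: 2
        # How long before declaring a token lost (ms)
        token: 5000
        # How many token retransmits before forming a new configuration
        token_retransmits_before_loss_const: 10
        # How long to wait for consensus before a new membership round (ms)
        consensus: 7500
        # How long to wait for join messages in the membership protocol (ms)
        join: 1000
}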

Regards
-steve

On Wed, 2009-11-04 at 08:57 +0800, lepace wrote:
> 
> Hi,all
> I want to configure an HA cluster with more than 60 nodes, and I
> want to use N-to-N mode, so every node can potentially be used for
> failover.  What I want to know is how to configure the values of the
> parameters in openais.conf, or how to find the right values for those
> parameters.
> Thanks
> 
> 
> 


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Pacemaker cluster: OpenAis communication channels

2009-10-22 Thread Steven Dake
On Thu, 2009-10-22 at 08:18 +0200, Florian Haas wrote:
> Steve,
> 
> what has repeatedly come up is that RRP links don't auto-heal (see
> thread:
> http://oss.clusterlabs.org/pipermail/pacemaker/2009-May/001784.html),
> and that passive mode RRP seems to not work at all (see thread:
> https://lists.linux-foundation.org/pipermail/openais/2009-October/013095.html
> -- this was also heavily discussed on IRC; the only approach that fixed
> the issue was to change rrp_mode to active). Can you fill us in on the
> progress on these issues? Thanks!
> 
> Cheers,
> Florian

Passive worked the last time I tested it, but it's been a while.  Hardening
redundant ring support and making it more generally useful is on our
community-derived roadmap (targeted for 1.2.0).  Please reference:
ftp://ftp%
40corosync.org:downlo...@corosync.org/presentations/corosync-roadmap.pdf
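
For reference, the configuration that was reported to work in that thread
is two interface stanzas with rrp_mode set to active.  A sketch only - the
addresses below are placeholders:

totem {
        rrp_mode: active
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.1.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 172.16.0.0
                mcastaddr: 226.95.1.1
                mcastport: 5505
        }
}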

Regards
-steve

> 
> On 10/22/2009 06:14 AM, Steven Dake wrote:
> > You can run with one NIC (and switch) but then your NIC and switch
> > become a SPOF (single point of failure).  Vehicles have a spare tire for
> > a reason :)  If a NIC fails it may be ok to switch a service to a
> > different node.  If a switch fails, The entire cluster becomes disabled
> > until the switch returns to operation.
> > 
> > Availability is a mathematical equation: 
> > 
> > A = MTTF / (MTTF+MTTR)
> > 
> > Pacemaker improves availability (A) by reducing mean time to repair
> > (MTTR) using failover while keeping the mean time to failure (MTTF)
> > essentially the same (although it is generally a bit lower because of
> > other components in the system required to introduce redundancy).
> > Instead of a typical 1 machine MTTR of 4 hours under a typical SLA, MTTR
> > may be 5-10 seconds or less (the time to failover the application and
> > restart it).  If MTTR is several days to service a switch, your
> > availability may not meet your customer SLA obligations.  When
> > determining whether to use a redundant switch the risks vs cost have to
> > be evaluated based upon your availability requirements.
> > 
> > Regards
> > -steve
> 


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Pacemaker cluster: OpenAis communication channels

2009-10-21 Thread Steven Dake
You can run with one NIC (and switch) but then your NIC and switch
become a SPOF (single point of failure).  Vehicles have a spare tire for
a reason :)  If a NIC fails, it may be ok to switch a service to a
different node.  If a switch fails, the entire cluster becomes disabled
until the switch returns to operation.

Availability is a mathematical equation: 

A = MTTF / (MTTF+MTTR)

Pacemaker improves availability (A) by reducing mean time to repair
(MTTR) using failover while keeping the mean time to failure (MTTF)
essentially the same (although it is generally a bit lower because of
other components in the system required to introduce redundancy).
Instead of a typical single-machine MTTR of 4 hours under a standard SLA,
MTTR may be 5-10 seconds or less (the time to fail over and restart the
application).  If MTTR is several days to service a switch, your
availability may not meet your customer SLA obligations.  When determining
whether to use a redundant switch, the risks vs. cost have to be evaluated
based upon your availability requirements.
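
To put illustrative numbers on that (assumed values, not measurements):

MTTF = 2000 hours, MTTR = 4 hours (manual repair):
        A = 2000 / 2004      ~= 0.9980    (roughly 17 hours of downtime per year)

MTTF = 2000 hours, MTTR = 10 seconds (~0.003 hours, automatic failover):
        A = 2000 / 2000.003  ~= 0.999999  (well under a minute of downtime per year)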

Regards
-steve

On Thu, 2009-10-22 at 09:56 +0800, Romain CHANU wrote:
> Hi,
> 
> 
> 
> I am reading the user's guide for DRBD and there is something I want
> to clarify about Pacemaker.
> 
> 
> 
> In Chapter 8 ("Integrating DRBD with Pacemaker clusters"), section
> 8.1.4 "OpenAis communication channels", it is said that "the absolute
> minimum requirement for stable cluster operation is two independent
> communication channels in a redundant ring"
> 
> 
> Does it mean that if I have a Pacemaker cluster composed of two nodes,
> I need two NIC on each node?
> 
> 
> Thank you.
> 
> 
> Romain Chanu


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] pacemaker unable to start

2009-10-21 Thread Steven Dake
Yeah, you're missing the pacemaker lcrso file.  Either you didn't build
pacemaker with corosync support, or pacemaker didn't install that binary
in the proper place.

try:

updatedb
locate lcrso
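
If locate turns up a pacemaker.lcrso outside the lcrso directory, copying
(or symlinking) it next to the other service engines is usually enough.  A
sketch only - the source and destination paths depend on your build and
distribution:

find / -name 'pacemaker.lcrso' 2>/dev/null
cp /path/to/pacemaker.lcrso /usr/libexec/lcrso/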

Regards
-steve

On Wed, 2009-10-21 at 12:28 -0400, Shravan Mishra wrote:
> Steve, this is what my installation shows--
> 
> ls -l /usr/libexec/lcrso
> 
> -rwxr-xr-x  1 root root  101243 Jul 29 11:21 coroparse.lcrso
> -rwxr-xr-x  1 root root  117688 Jul 29 11:21 objdb.lcrso
> -rwxr-xr-x  1 root root   92702 Jul 29 11:54 openaisserviceenable.lcrso
> -rwxr-xr-x  1 root root  110808 Jul 29 11:21 quorum_testquorum.lcrso
> -rwxr-xr-x  1 root root  159057 Jul 29 11:21 quorum_votequorum.lcrso
> -rwxr-xr-x  1 root root 1175430 Jul 29 11:54 service_amf.lcrso
> -rwxr-xr-x  1 root root  133976 Jul 29 11:21 service_cfg.lcrso
> -rwxr-xr-x  1 root root  218374 Jul 29 11:54 service_ckpt.lcrso
> -rwxr-xr-x  1 root root  139029 Jul 29 11:54 service_clm.lcrso
> -rwxr-xr-x  1 root root  122668 Jul 29 11:21 service_confdb.lcrso
> -rwxr-xr-x  1 root root  138412 Jul 29 11:21 service_cpg.lcrso
> -rwxr-xr-x  1 root root  125638 Jul 29 11:21 service_evs.lcrso
> -rwxr-xr-x  1 root root  196443 Jul 29 11:54 service_evt.lcrso
> -rwxr-xr-x  1 root root  194885 Jul 29 11:54 service_lck.lcrso
> -rwxr-xr-x  1 root root  235168 Jul 29 11:54 service_msg.lcrso
> -rwxr-xr-x  1 root root  120445 Jul 29 11:21 service_pload.lcrso
> -rwxr-xr-x  1 root root  135340 Jul 29 11:54 service_tmr.lcrso
> -rwxr-xr-x  1 root root  124092 Jul 29 11:21 vsf_quorum.lcrso
> -rwxr-xr-x  1 root root  121298 Jul 29 11:21 vsf_ykd.lcrso
> 
> I also did
> 
> export COROSYNC_DEFAULT_CONFIG_IFACE="openaisserviceenable:openaisparser"
> 
> In place of openaisparser I also tried corosyncparse and
> corosyncparser but to no avail.
> 
> -sincerely
> Shravan
> 
> On Wed, Oct 21, 2009 at 11:49 AM, Steven Dake  wrote:
> > I recommend using corosync 1.1.1 - several bug fixes one critical for
> > proper pacemaker operation.  It won't fix this particular problem
> > however.
> >
> > Corosync loads pacemaker by searching for a pacemaker lcrso file.  These
> > files are default installed in /usr/libexec/lcrso but may be in a
> > different location depending on your distribution.
> >
> > Regards
> > -steve
> >
> > On Wed, 2009-10-21 at 11:13 -0400, Shravan Mishra wrote:
> >> Hello guys,
> >>
> >> We are running
> >>
> >> corosync-1.0.0
> >> heartbeat-2.99.1
> >> pacemaker-1.0.4
> >>
> >> the corosync.conf  under /etc/corosync/ is
> >>
> >> 
> >> # Please read the corosync.conf.5 manual page
> >> compatibility: whitetank
> >>
> >> aisexec {
> >>user: root
> >>group: root
> >> }
> >> totem {
> >>version: 2
> >>secauth: off
> >>threads: 0
> >>interface {
> >>ringnumber: 0
> >>bindnetaddr: 172.30.0.0
> >>mcastaddr:226.94.1.1
> >>mcastport: 5406
> >>}
> >> }
> >>
> >> logging {
> >>fileline: off
> >>to_stderr: yes
> >>to_logfile: yes
> >>to_syslog: yes
> >>logfile: /tmp/corosync.log
> >>debug: on
> >>timestamp: on
> >>logger_subsys {
> >>subsys: pacemaker
> >>debug: on
> >>tags: enter|leave|trace1|trace2| trace3|trace4|trace6
> >>}
> >> }
> >>
> >>
> >> service {
> >>name: pacemaker
> >>ver: 0
> >> #   use_mgmtd: yes
> >>  #  use_logd:yes
> >> }
> >>
> >>
> >> corosync {
> >>user: root
> >>group: root
> >> }
> >>
> >>
> >> amf {
> >>mode: disabled
> >> }
> >> 
> >>
> >>
> >> #service corosync start
> >>
> >> starts the messaging but fails to load pacemaker,
> >>
> >> /tmp/corosync.log  ---
> >>
> >> ==
> >>
> >> Oct 21 11:05:43 corosync [MAIN  ] Corosync Cluster Engine ('trunk'):
> >> started and ready to provide service.
> >> Oct 21 11:05:43 corosync [MAIN  ] Successfully read main configuration
> >> file '/etc/corosync/corosync.conf'.
> >> Oct 21 11:05:43 cor

Re: [Pacemaker] pacemaker unable to start

2009-10-21 Thread Steven Dake
I recommend using corosync 1.1.1 - several bug fixes, one of them critical
for proper pacemaker operation.  It won't fix this particular problem,
however.

Corosync loads pacemaker by searching for a pacemaker lcrso file.  These
files are default installed in /usr/libexec/lcrso but may be in a
different location depending on your distribution.

Regards
-steve

On Wed, 2009-10-21 at 11:13 -0400, Shravan Mishra wrote:
> Hello guys,
> 
> We are running 
> 
> corosync-1.0.0
> heartbeat-2.99.1
> pacemaker-1.0.4
> 
> the corosync.conf  under /etc/corosync/ is 
> 
> 
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
> 
> aisexec {
>user: root
>group: root
> }
> totem {
>version: 2
>secauth: off
>threads: 0
>interface {
>ringnumber: 0
>bindnetaddr: 172.30.0.0
>mcastaddr:226.94.1.1
>mcastport: 5406
>}
> }
> 
> logging {
>fileline: off
>to_stderr: yes
>to_logfile: yes
>to_syslog: yes
>logfile: /tmp/corosync.log
>debug: on
>timestamp: on
>logger_subsys {
>subsys: pacemaker
>debug: on
>tags: enter|leave|trace1|trace2| trace3|trace4|trace6
>}
> }
> 
> 
> service {
>name: pacemaker
>ver: 0
> #   use_mgmtd: yes
>  #  use_logd:yes
> }
> 
> 
> corosync {
>user: root
>group: root
> }
> 
> 
> amf {
>mode: disabled
> }
> 
> 
> 
> #service corosync start   
> 
> starts the messaging but fails to load pacemaker,
> 
> /tmp/corosync.log  ---   
> 
> ==
> 
> Oct 21 11:05:43 corosync [MAIN  ] Corosync Cluster Engine ('trunk'):
> started and ready to provide service.
> Oct 21 11:05:43 corosync [MAIN  ] Successfully read main configuration
> file '/etc/corosync/corosync.conf'.
> Oct 21 11:05:43 corosync [TOTEM ] Token Timeout (1000 ms) retransmit
> timeout (238 ms)
> Oct 21 11:05:43 corosync [TOTEM ] token hold (180 ms) retransmits
> before loss (4 retrans)
> Oct 21 11:05:43 corosync [TOTEM ] join (50 ms) send_join (0 ms)
> consensus (800 ms) merge (200 ms)
> Oct 21 11:05:43 corosync [TOTEM ] downcheck (1000 ms) fail to recv
> const (50 msgs)
> Oct 21 11:05:43 corosync [TOTEM ] seqno unchanged const (30 rotations)
> Maximum network MTU 1500
> Oct 21 11:05:43 corosync [TOTEM ] window size per rotation (50
> messages) maximum messages per rotation (17 messages)
> Oct 21 11:05:43 corosync [TOTEM ] send threads (0 threads)
> Oct 21 11:05:43 corosync [TOTEM ] RRP token expired timeout (238 ms)
> Oct 21 11:05:43 corosync [TOTEM ] RRP token problem counter (2000 ms)
> Oct 21 11:05:43 corosync [TOTEM ] RRP threshold (10 problem count)
> Oct 21 11:05:43 corosync [TOTEM ] RRP mode set to none.
> Oct 21 11:05:43 corosync [TOTEM ] heartbeat_failures_allowed (0)
> Oct 21 11:05:43 corosync [TOTEM ] max_network_delay (50 ms)
> Oct 21 11:05:43 corosync [TOTEM ] HeartBeat is Disabled. To enable set
> heartbeat_failures_allowed > 0
> Oct 21 11:05:43 corosync [TOTEM ] Initializing transmit/receive
> security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Oct 21 11:05:43 corosync [TOTEM ] Receive multicast socket recv buffer
> size (262142 bytes).
> Oct 21 11:05:43 corosync [TOTEM ] Transmit multicast socket send
> buffer size (262142 bytes).
> Oct 21 11:05:43 corosync [TOTEM ] The network interface [172.30.0.145]
> is now up.
> Oct 21 11:05:43 corosync [TOTEM ] Created or loaded sequence id
> 184.172.30.0.145 for this ring.
> Oct 21 11:05:43 corosync [TOTEM ] entering GATHER state from 15.
> Oct 21 11:05:43 corosync [SERV  ] Service failed to load 'pacemaker'.
> Oct 21 11:05:43 corosync [SERV  ] Service initialized 'corosync
> extended virtual synchrony service'
> Oct 21 11:05:43 corosync [SERV  ] Service initialized 'corosync
> configuration service'
> Oct 21 11:05:43 corosync [SERV  ] Service initialized 'corosync
> cluster closed process group service v1.01'
> Oct 21 11:05:43 corosync [SERV  ] Service initialized 'corosync
> cluster config database access v1.01'
> Oct 21 11:05:43 corosync [SERV  ] Service initialized 'corosync
> profile loading service'
> Oct 21 11:05:43 corosync [MAIN  ] Compatibility mode set to
> whitetank.  Using V1 and V2 of the synchronization engine.
> Oct 21 11:05:43 corosync [TOTEM ] Creating commit token because I am
> the rep.
> Oct 21 11:05:43 corosync [TOTEM ] Saving state aru 0 high seq received
> 0
> Oct 21 11:05:43 corosync [TOTEM ] Storing new sequence id for ring bc
> Oct 21 11:05:43 corosync [TOTEM ] entering COMMIT state.
> Oct 21 11:05:43 corosync [TOTEM ] got commit token
> Oct 21 11:05:43 corosync [TOTEM ] entering RECOVERY state.
> Oct 21 11:05:43 corosync [TOTEM ] position [0] member 172.30.0.145:
> Oct 21 11:05:43 corosync [TOTEM ] previous ring seq 184 rep
> 172.30.0.145
> Oct 21 11:05:43 corosync [TOTEM ] aru 0 high delivered 0 received flag
> 1
> Oct 21 11:05:43 c

Re: [Pacemaker] Why are fatal warnings enabled by default?

2009-10-21 Thread Steven Dake
IMO, enabling fatal warnings by default is problematic when you depend on a
bunch of external header files and also enable most warnings in the warning
list.

Header files are consistently broken upstream wrt const correctness,
typing, etc.  With fatal warnings enabled, it's really difficult to get a
clean compile on many platforms because the header files are broken.
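
If a broken system header is the only thing tripping -Werror for you, check
whether the configure script exposes a switch for it before patching
headers.  This is an assumption - verify the exact option name with
./configure --help:

./configure --help | grep fatal
./configure --enable-fatal-warnings=no && make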

Just look at the nss libs in rhel5/centos.  The header files are broken
with certain warnings enabled.

But do what you like :)

Regards
-steve

On Wed, 2009-10-21 at 15:08 +0200, Florian Haas wrote:
> On 2009-10-21 14:36, Dejan Muhamedagic wrote:
> >>> The warnings being?
> >> In agents, a simple "./configure && make" leads to:
> >>
> >> [...]
> >> gmake[1]: Entering directory `/home/rpmbuild/hg/cluster-agents/heartbeat'
> >> if gcc -DHAVE_CONFIG_H -I. -I. -I../include -I../include -I../include
> >> -I../linux-ha  -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include-g
> >> -O2 -ggdb3 -O0  -fgnu89-inline -fstack-protector-all -Wall
> >> -Waggregate-return -Wbad-function-cast -Wcast-qual -Wcast-align
> >> -Wdeclaration-after-statement -Wendif-labels -Wfloat-equal -Wformat=2
> >> -Wformat-security -Wformat-nonliteral -Winline -Wmissing-prototypes
> >> -Wmissing-declarations -Wmissing-format-attribute -Wnested-externs
> >> -Wno-long-long -Wno-strict-aliasing -Wpointer-arith -Wstrict-prototypes
> >> -Wwrite-strings -ansi -D_GNU_SOURCE -DANSI_ONLY -Werror -MT IPv6addr.o
> >> -MD -MP -MF ".deps/IPv6addr.Tpo" -c -o IPv6addr.o IPv6addr.c; \
> >>then mv -f ".deps/IPv6addr.Tpo" ".deps/IPv6addr.Po"; else rm -f
> >> ".deps/IPv6addr.Tpo"; exit 1; fi
> >> cc1: warnings being treated as errors
> >> IPv6addr.c: In function ‘send_ua’:
> >> IPv6addr.c:453: warning: passing argument 2 of
> >> ‘libnet_pblock_record_ip_offset’ makes pointer from integer without a cast
> > 
> > This doesn't happen here with libnet-1.1.2.1-140.75.i586. Which
> > libnet version do you have?
> 
> libnet-1.1.4-3.el5
> 
> Cheers,
> Florian
> 


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] corosync doesn't stop all services

2009-10-21 Thread Steven Dake
We had to change both pacemaker and corosync for this problem.  I
suspect you don't have the updated pacemaker.
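
Until the updated packages are in place, the leftovers can be cleaned up by
hand before restarting corosync.  A sketch only, using the process names
from the ps output quoted below:

/etc/init.d/corosync stop
pkill -f /usr/lib/heartbeat/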

Regards
-steve

On Wed, 2009-10-21 at 15:11 +0200, Michael Schwartzkopff wrote:
> Hi,
> 
> perhaps this is the wrong list but anyway:
> 
> I have corosync-1.1.1 and pacemaker-1.0.5 on debian lenny.
> 
> When I start corosync everything looks fine. But when I stop corosync I still
> see a lot of heartbeat processes. I thought this was fixed in
> corosync-1.1.1, so what might be the problem?
> 
> # ps uax | grep heart
> root  2083  0.0  0.4   4884  1220 pts/1S<   17:04   0:00 
> /usr/lib/heartbeat/ha_logd -d
> root  2084  0.0  0.3   4884   820 pts/1S<   17:04   0:00 
> /usr/lib/heartbeat/ha_logd -d
> root  2099  0.0  4.1  10712 10712 ?S<   17:04   0:00 
> /usr/lib/heartbeat/stonithd
> 104   2100  0.1  1.4  12768  3748 ?S<   17:04   0:00 
> /usr/lib/heartbeat/cib
> root  2101  0.0  0.7   5352  1800 ?S<   17:04   0:00 
> /usr/lib/heartbeat/lrmd
> 104   2102  0.0  1.0  12260  2596 ?S<   17:04   0:00 
> /usr/lib/heartbeat/attrd
> 104   2103  0.0  1.1   8880  3024 ?S<   17:04   0:00 
> /usr/lib/heartbeat/pengine
> 104   2104  0.0  1.2  12404  3176 ?S<   17:04   0:00 
> /usr/lib/heartbeat/crmd
> root  2140  0.0  0.2   3116   720 pts/1R<+  17:08   0:00 grep heart
> 


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Failed in restart of Corosync.

2009-10-18 Thread Steven Dake
This bug has been reported, and we are working on a solution.

Regards
-steve

On Mon, 2009-10-19 at 11:05 +0900, renayama19661...@ybb.ne.jp wrote:
> Hi,
> 
> I understand that this combination of Corosync and Pacemaker is not official.
> However, I am posting it because I thought it was important to report
> the problem.
> 
> I started next combination Corosync.(on Redhat5.4(x86))
> 
> * corosync trunk 2530
> * Cluster-Resource-Agents-6d652f7cf9d8
> * Reusable-Cluster-Components-4edc8f99701c
> * Pacemaker-1-0-de2a3778ace7
> 
> Next I stopped the service (corosync).
> But I had to KILL the processes because the Pacemaker processes did not stop cleanly.
> 
> 
> [r...@rh54-1 ~]# service Corosync stop
> Stopping Corosync Cluster Engine (corosync):   [  OK  ]
> Waiting for services to unload:[  OK  ]
> [r...@rh54-1 ~]# ps -ef |grep coro
> root  5263  4617  0 10:54 pts/000:00:00 grep coro
> [r...@rh54-1 ~]# ps -ef |grep heartbeat 
> root  4882 1  0 10:52 ?00:00:00 /usr/lib/heartbeat/stonithd
> 500   4883 1  0 10:52 ?00:00:00 /usr/lib/heartbeat/cib
> root  4884 1  0 10:52 ?00:00:00 /usr/lib/heartbeat/lrmd
> 500   4885 1  0 10:52 ?00:00:00 /usr/lib/heartbeat/attrd
> 500   4886 1  0 10:52 ?00:00:00 /usr/lib/heartbeat/pengine
> 500   4887 1  0 10:52 ?00:00:00 /usr/lib/heartbeat/crmd
> root  5278  4617  0 10:54 pts/000:00:00 grep heartbeat
> [r...@rh54-1 ~]# kill -9 4882 4883 4884 4885 4886 4887
> [r...@rh54-1 ~]# ps -ef |grep heartbeat 
> root  5310  4617  0 10:54 pts/000:00:00 grep heartbeat
> 
> 
> 
> I started Corosync again.
> But the Pacemaker cib process does not seem to be able to communicate with
> Corosync.
> 
> 
> 
> Oct 19 10:55:29 rh54-1 cib: [5354]: info: startCib: CIB Initialization 
> completed successfully
> Oct 19 10:55:29 rh54-1 cib: [5354]: info: crm_cluster_connect: Connecting to 
> OpenAIS
> Oct 19 10:55:29 rh54-1 cib: [5354]: info: init_ais_connection: Creating 
> connection to our AIS plugin
> Oct 19 10:55:30 rh54-1 mgmtd: [5359]: info: login to cib live: 1, ret:-10
> Oct 19 10:55:30 rh54-1 crmd: [5358]: info: do_cib_control: Could not connect 
> to the CIB service:
> connection failed
> Oct 19 10:55:30 rh54-1 crmd: [5358]: WARN: do_cib_control: Couldn't complete 
> CIB registration 1
> times... pause and retry
> Oct 19 10:55:30 rh54-1 crmd: [5358]: info: crmd_init: Starting crmd's mainloop
> Oct 19 10:55:31 rh54-1 mgmtd: [5359]: info: login to cib live: 2, ret:-10
> Oct 19 10:55:32 rh54-1 mgmtd: [5359]: info: login to cib live: 3, ret:-10
> Oct 19 10:55:32 rh54-1 crmd: [5358]: info: crm_timer_popped: Wait Timer 
> (I_NULL) just popped!
> Oct 19 10:55:33 rh54-1 mgmtd: [5359]: info: login to cib live: 4, ret:-10
> Oct 19 10:55:33 rh54-1 crmd: [5358]: info: do_cib_control: Could not connect 
> to the CIB service:
> connection failed
> Oct 19 10:55:33 rh54-1 crmd: [5358]: WARN: do_cib_control: Couldn't complete 
> CIB registration 2
> times... pause and retry
> 
> 
> 
> Because of this, Pacemaker never finishes starting no matter how long it waits.
> 
> As for the problem, Corosync seems to be stuck in poll(?) somehow.
> However, the cause may possibly be related to the failure of the first stop.
> 
> 
> [r...@rh54-1 ~]# ps -ef |grep coro
> root  5348 1  0 10:55 ?00:00:00 /usr/sbin/corosync
> root  5400  4617  0 10:56 pts/000:00:00 grep coro
> [r...@rh54-1 ~]# strace -p 5348
> Process 5348 attached - interrupt to quit
> futex(0x805c8c0, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> 
> Is there a way to avoid this phenomenon?
> Can I work around the problem by deleting some file?
> 
> * I hope the combination of Corosync and Pacemaker becomes ready for
> practical use soon.
> 
> Best Regards,
> Hideo Yamauchi.


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] fedora11: openais fails to start

2009-10-09 Thread Steven Dake
You could try the f12 rpms - we have tested these.  We are in the
process of making them available in f11/f10, but there is a bit of a
lag because of the Fedora process.

The f12 rpms are at koji.fedoraproject.org.

From looking at your logs, it appears iptables is enabled and not
configured properly.  Try: service iptables stop.
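
If you would rather keep the firewall up, opening the totem ports should be
enough.  A sketch assuming the default mcastport 5405 (the totem protocol
also uses mcastport - 1):

iptables -I INPUT -p udp --dport 5404:5405 -j ACCEPT
service iptables save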

Regards
-steve

On Fri, 2009-10-09 at 14:31 +0200, Michael Schwartzkopff wrote:
> Hi,
> 
> I wanted to try pacemaker/openais on a fedora11. Packages from OSBS:
> # rpm -qa | grep "ais\|pace"
> pacemaker-1.0.5-4.1.i386
> libopenais2-0.80.5-15.1.i386
> pacemaker-libs-1.0.5-4.1.i386
> openais-0.80.5-15.1.i386
> pacemaker-mgmt-1.99.2-6.1.i386
> 
> When I start /etc/init.d/openais start
> - There are some entries in the log. Nothing what I could identify as an 
> error. See: http://www.pastebin.org/41120
> 
> - openais-cfgtool -s stops at
> Printing ring status.
> Need to CTRL-C to stop.
> 
> - No pacemaker processes are actually started:
> ps uax | grep crm
> is empty.
> 
> Any ideas?
> 
> Using corosync-1.0.0 from fedora11 is not an option; it results in another error.
> 


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Openais log and cpu occupation

2009-10-04 Thread Steven Dake
The openais logging service will use the syslog LOG_NOTICE level.  To
filter in syslog, syslog must be appropriately configured.  The to_file
directive is really meant for debugging, not deployment.  There is no way
to set filter levels (other than filtering out debug) on to_file
directives.  I recommend against using it.
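
For completeness, the syslog side of that filtering would look something
like the following in /etc/syslog.conf - a sketch, assuming the daemon
facility set in your openais.conf and a classic syslogd:

daemon.*                                /var/log/openais.log
*.info;mail.none;daemon.none            /var/log/messages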

This madness is hopefully addressed in corosync more appropriately.

Regards
-steve

On Sun, 2009-10-04 at 19:19 +0200, Fausto Lombardi wrote:
> The permissions of what?
> 
> The log file that I set in openais.conf is created and written to
> correctly, but the logs are still written to the messages file.
> I would like only error/warning logs to be written to the messages file.
> 
> This is my openais.conf:
> 
> # Please read the openais.conf.5 manual page
> 
> aisexec {
> # Run as root - this is necessary to be able to manage resources
> with Pacemaker
> user:root
> group:root
> }
> 
> service {
> # Load the Pacemaker Cluster Resource Manager
> ver:   0
> name:  pacemaker
> use_mgmtd: yes
> use_logd:  no
> }
> 
> totem {
> version: 2
> 
> # How long before declaring a token lost (ms)
> token:  5000
> 
> # How many token retransmits before forming a new configuration
> token_retransmits_before_loss_const: 10
> 
> # How long to wait for join messages in the membership protocol
> (ms)
> join:   1000
> 
> # How long to wait for consensus to be achieved before starting a
> new round of membership configuration (ms)
> consensus:  2500
> 
> # Turn off the virtual synchrony filter
> vsftype:none
> 
> # Number of messages that may be sent by one processor on receipt
> of the token
> max_messages:   20
> 
> # Stagger sending the node join messages by 1..send_join ms
> send_join: 45
> 
> # Limit generated nodeids to 31-bits (positive signed integers)
> clear_node_high_bit: yes
> 
> # Disable encryption
> secauth:off
> 
> # How many threads to use for encryption/decryption
> threads:   0
> 
> #rrp_mode:passive
> 
> # Optionally assign a fixed node id (integer)
> # nodeid: 1234
> 
> interface {
> ringnumber: 0
> 
> # The following values need to be set based on your
> environment
> bindnetaddr: 192.168.1.0
> mcastaddr: 226.94.1.1
> mcastport: 5405
> }
> #interface {
> #ringnumber: 1
> #
> ## The following values need to be set based on your
> environment
> #bindnetaddr: 172.16.0.0
> #mcastaddr: 226.95.1.1
> #mcastport: 5505
> #}
> }
> 
> logging {
> debug: off
> fileline: off
> to_file: on
> to_syslog: off
> to_stderr: off
> logfile: /var/log/openais.log
> syslog_facility: daemon
> timestamp: on
> }
> 
> amf {
> mode: disabled
> }
> 
> 
> 
> 2009/10/4 E-Blokos 
> did you check permissions ?
> 
> - Original Message - 
> From: Fausto Lombardi 
> To: pacema...@clusterlabs.org 
> Sent: Sunday, October 04, 2009 1:09 PM
> Subject: Re: [Pacemaker] Openais log and cpu
> occupation
> 
> 
> 
> 
> The problem fo the log??
> 
> Thanks.
> 
> 
> 
> 
> 
> 
> 


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] A problem to fail in a stop of Pacemaker.

2009-09-29 Thread Steven Dake
On Wed, 2009-09-30 at 09:51 +0900, renayama19661...@ybb.ne.jp wrote:
> Hi Remi,
> 
> > It appears that this is a similar problem to the one that I reported, 
> > yes.  It appears to not be a bug in Corosync, but rather one in 
> > Pacemaker.  This bug has been filed in Red Hat Bugzilla, see it at:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=525589
> > 
> > Perhaps you could add any additional details that you have found 
> > (affected packages, etc.) to the bug; it may help the developers fix it.
> 
> All right.
> Thank you.
> 
> Best Regards,
> Hideo Yamauchi.
> 

Please note this could still be a bz in corosync related to service
engine integration.  It is just too early to tell.  Andrew should be
able to tell us for certain when he has an opportunity to take a look at
it.

Regards
-steve

> --- Remi Broemeling  wrote:
> 
> > Hello Hideo,
> > 
> > It appears that this is a similar problem to the one that I reported, 
> > yes.  It appears to not be a bug in Corosync, but rather one in 
> > Pacemaker.  This bug has been filed in Red Hat Bugzilla, see it at:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=525589
> > 
> > Perhaps you could add any additional details that you have found 
> > (affected packages, etc.) to the bug; it may help the developers fix it.
> > 
> > Thanks.
> > 
> > 
> > renayama19661...@ybb.ne.jp wrote:
> > > Hi,
> > >
> > > I started a Dummy resource in one node by the next combination.
> > >  * corosync 1.1.0
> > >  * Pacemaker-1-0-05c8b63cbca7
> > >  * Reusable-Cluster-Components-6ef02517ee57
> > >  * Cluster-Resource-Agents-88a9cfd9e8b5
> > >
> > > The Dummy resource started in a node.
> > >
> > > I was going to stop a node(service Corosync stop), but did not stop.
> > >
> > > --log--
> > > (snip)
> > >
> > > Sep 29 13:52:01 rh53-1 crmd: [11193]: info: crm_signal_dispatch: Invoking 
> > > handler for signal
> > 15:
> > > Terminated
> > > Sep 29 13:52:01 rh53-1 crmd: [11193]: info: crm_shutdown: Requesting 
> > > shutdown
> > > Sep 29 13:52:01 rh53-1 crmd: [11193]: info: do_state_transition: State 
> > > transition S_IDLE ->
> > > S_POLICY_ENGINE [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
> > > Sep 29 13:52:01 rh53-1 crmd: [11193]: info: do_state_transition: All 1 
> > > cluster nodes are
> > eligible to
> > > run resources.
> > > Sep 29 13:52:01 rh53-1 crmd: [11193]: info: do_shutdown_req: Sending 
> > > shutdown request to DC:
> > rh53-1
> > > Sep 29 13:52:30 rh53-1 corosync[11183]:   [pcmk  ] notice: pcmk_shutdown: 
> > > Still waiting for
> > crmd
> > > (pid=11193) to terminate...
> > > Sep 29 13:53:30 rh53-1 last message repeated 2 times
> > > Sep 29 13:55:00 rh53-1 last message repeated 3 times
> > > Sep 29 13:56:30 rh53-1 last message repeated 3 times
> > > Sep 29 13:58:01 rh53-1 last message repeated 3 times
> > > Sep 29 13:59:31 rh53-1 last message repeated 3 times
> > > Sep 29 14:00:31 rh53-1 last message repeated 2 times
> > > Sep 29 14:00:46 rh53-1 cib: [11189]: info: cib_stats: Processed 94 
> > > operations (11489.00us
> > average, 0%
> > > utilization) in the last 10min
> > > Sep 29 14:01:01 rh53-1 corosync[11183]:   [pcmk  ] notice: pcmk_shutdown: 
> > > Still waiting for
> > crmd
> > > (pid=11193) to terminate...
> > >
> > > (snip)
> > > --log--
> > >
> > >
> > > Possibly is the cause same as the next email?
> > >  * http://www.gossamer-threads.com/lists/linuxha/pacemaker/58127
> > >
> > > And, the same problem was taking place by the next combination.
> > >  * corosync 1.0.1
> > >  * Pacemaker-1-0-595cca870aff
> > >  * Reusable-Cluster-Components-6ef02517ee57
> > >  * Cluster-Resource-Agents-88a9cfd9e8b5
> > >
> > > I attach a file of hb_report.
> > >
> > > Best Regards,
> > > Hideo Yamauchi.
> > >   
> > 
> > -- 
> > 
> > Remi Broemeling
> > Sr System Administrator
> > 
> > Nexopia.com Inc.
> > direct: 780 444 1250 ext 435
> > email: r...@nexopia.com 
> > fax: 780 487 0376
> > 
> > www.nexopia.com 
> > 
> > You are only young once, but you can stay immature indefinitely.
> > www.siglets.com


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Can OpenAIS's components work with Pacemaker? What's the relationship between Pacemaker and the SA Forum?

2009-09-27 Thread Steven Dake
On Sun, 2009-09-27 at 11:11 +0800, xin.li...@cs2c.com.cn wrote:
>   HI,everyone ;-)
> 
> I'm pretty new to the HA world, and I'm from China.
> 
> When I downloaded the code of OpenAIS (Whitetank), I found some
> components in it, such as AMF, CKPT, CLM, and MSG.
> 
> Can these components all work well simultaneously with Pacemaker, based
> on the Totem protocol?  If yes, how should the openais.conf file be
> written?
> 

Yes, although AMF is experimental.
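
As far as openais.conf goes, whitetank loads its own services without any
extra configuration; Pacemaker is the only one you add explicitly.  A
minimal sketch:

service {
        # Load the Pacemaker Cluster Resource Manager
        name: pacemaker
        ver: 0
}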

> And, What's the relationship between Pacemaker and SA forum
> (http://www.saforum.org/) ?
> 

There is no relationship.  Pacemaker simply plugs into the
infrastructure provided by openais.

Upstream, we have broken apart the SA Forum APIs and the infrastructure
into openais and corosync respectively.  Even with corosync it is
possible to use pacemaker and openais 1.1.0 (wilson) at the same time.

Regards
-steve

> I'm pretty confused.
> 
> Many thanks


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

