Re: [Pacemaker] glassfish resource agent

2012-12-10 Thread Dan Frincu
Hi,

On Mon, Dec 10, 2012 at 6:53 AM, Soni Maula Harriz
 wrote:
> dear forum,
> is there any ready-to-use glassfish resource agent ? because i don't find any
> on google.
Not that I know of.
> or do i have to make it by myself ?
I think you do.
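If you do write one, the OCF RA developer's guide
(http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html) is the place
to start. As a rough, untested sketch only (the asadmin path, domain
name and parameter names are my assumptions; a real agent also needs
meta-data XML, validate-all and proper error handling):

#!/bin/sh
# Hypothetical skeleton for a GlassFish OCF RA; adjust to your install.
: ${OCF_ROOT=/usr/lib/ocf}
. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs

ASADMIN="${OCF_RESKEY_asadmin:-/opt/glassfish/bin/asadmin}"
DOMAIN="${OCF_RESKEY_domain:-domain1}"

glassfish_monitor() {
    # "asadmin list-domains" prints lines like "domain1 running"
    $ASADMIN list-domains 2>/dev/null | grep -q "^$DOMAIN running" \
        && return $OCF_SUCCESS || return $OCF_NOT_RUNNING
}

glassfish_start() {
    glassfish_monitor && return $OCF_SUCCESS
    $ASADMIN start-domain "$DOMAIN" || return $OCF_ERR_GENERIC
    glassfish_monitor
}

glassfish_stop() {
    glassfish_monitor || return $OCF_SUCCESS
    $ASADMIN stop-domain "$DOMAIN" || return $OCF_ERR_GENERIC
}

case "$1" in
    start)   glassfish_start ;;
    stop)    glassfish_stop ;;
    monitor) glassfish_monitor ;;
    *)       exit $OCF_ERR_UNIMPLEMENTED ;;
esac
exit $?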
> thanks
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] crm shell binaries

2013-01-09 Thread Dan Frincu
@Dejan, @LMB

Could you guys post binaries of crmsh for RedHat, Debian?

Regards,
Dan

-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Linux-HA] Which language Pacemaker is written?

2013-01-11 Thread Dan Frincu
Hi,

From https://www.ohloh.net/p/pacemaker?ref=sample

C 76%
Python 8%
shell script 6%
Other 10%

HTH,
Dan

On Fri, Jan 11, 2013 at 1:15 PM, Felipe Gutierrez
 wrote:
> Hi everyone,
>
> I am writing a school work about program languages and I want to research
> about Pacemaker and its program language.
>
> Which language Pacemaker is written?
> I search at internet and this person said about Erlang
> http://manavar.blogspot.com.br/2012/02/cluster-software-pacemaker-erlang-and.html
>
> Is that right?
>
>
> Thanks,
> Felipe
>
>
> --
> *--
> -- Felipe Oliveira Gutierrez
> -- felipe.o.gutier...@gmail.com
> -- https://sites.google.com/site/lipe82/Home/diaadia*
> ___
> Linux-HA mailing list
> linux...@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] best/proper way to shut down a node for service

2013-01-23 Thread Dan Frincu
Hi,

On Wed, Jan 23, 2013 at 5:21 AM, Brian J. Murrell  wrote:
> OK.  So you have a corosync cluster of nodes with pacemaker managing
> resources on them, including (of course) STONITH.
>
> What's the best/proper way to shut down a node, say, for maintenance
> such that pacemaker doesn't go trying to "fix" that situation and
> STONITHing it to try to bring it back up, etc.?
>
> Currently my practice for STONITH is to have it reboot.  Maybe it's a
> better practice to have STONITH configured to just power a node down and
> not try to power it back up for this exact reason?
>
> Any other suggestions welcome.

I usually put the node in standby, which means it can no longer run
any resources. Both Pacemaker and Corosync continue to run, and the
node still provides quorum.

For global cluster maintenance, such as when upgrading to a major
software version, use maintenance-mode instead.
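With the crm shell that would be something like (node name is a
placeholder):

# per-node maintenance: resources move off, node keeps providing quorum
crm node standby node1
crm node online node1    # when done

# cluster-wide maintenance: resources keep running but become unmanaged
crm configure property maintenance-mode=true
crm configure property maintenance-mode=false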

HTH,
Dan

>
> Cheers,
> b.
>
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] best/proper way to shut down a node for service

2013-01-24 Thread Dan Frincu
Hi,

On Wed, Jan 23, 2013 at 11:28 PM, Brian J. Murrell
 wrote:
> On 13-01-23 03:32 AM, Dan Frincu wrote:
>> Hi,
>
> Hi,
>
>> I usually put the node in standby, which means it can no longer run
>> any resources on it. Both Pacemaker and Corosync continue to run, node
>> provides quorum.
>
> But a node in standby will still be STONITHed if it goes AWOL.  I put a
> node in standby and then yanked it's power and it's peer started STONITH
> operations on it.  That's the part I want to avoid.

You'd have to explain what AWOL means in this context. Even in a
2-node cluster, putting one node in standby without changing
no-quorum-policy to ignore or setting stonith-enabled=false will just
move the resources off that node.

A failure to stop a resource on a node that is shutting down will
lead to STONITH. Shutting down Pacemaker and putting the node in
standby have the same effect on the resources: both tell them to
stop.

So, to emphasize this again: if there is a stop failure, regardless
of how you turn off the resource (Pacemaker shutdown, putting the node
in standby, telling the resource to move to another node, etc.), the
node will be fenced.

Now, going back to no-quorum-policy: the default action is stop, so
in a 2-node cluster, if you shut down Pacemaker without setting
no-quorum-policy to ignore, the resources on the remaining node stop
when quorum is lost. If you put the node in standby instead, quorum is
still met and this does not happen.
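If you do want the remaining node to keep its resources when its
peer's Pacemaker is shut down, the property is set like this (crm
shell):

crm configure property no-quorum-policy=ignore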

Once a node is in standby, stopping pacemaker and corosync on it
won't lead to the "node gone AWOL" situation you mentioned earlier.

With more than 2 nodes in the cluster, shutting down pacemaker and
corosync (or putting a node in standby) won't affect quorum, as the
other nodes still provide it.

Either way, choose whatever fits your requirements best; I've just
added some comments on how this would work and what the possible
problems in a 2-node cluster are.

HTH,
Dan

>
> b.
>
>
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] No communication between nodes (setup problem)

2013-01-30 Thread Dan Frincu
 egrep "warning|error"
> Jan 30 10:25:59 [1608] server1   crmd:  warning: do_log:FSA: Input
> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> Jan 30 10:25:59 [1607] server1pengine:  warning: cluster_status:We
> do not have quorum - fencing and resource management disabled
> Jan 30 10:28:25 [1525] server1 corosync debug   [QUORUM] getinfo response
> error: 1
> Jan 30 10:40:59 [1607] server1pengine:  warning: cluster_status:We
> do not have quorum - fencing and resource management disabled
>
>
> root@server2 corosync]# cat /var/log/cluster/corosync.log | egrep
> "warning|error"
> Jan 30 10:27:18 [1458] server2   crmd:  warning: do_log:FSA: Input
> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> Jan 30 10:27:18 [1457] server2pengine:  warning: cluster_status:We
> do not have quorum - fencing and resource management disabled
> Jan 30 10:29:19 [1349] server2 corosync debug   [QUORUM] getinfo response
> error: 1
> Jan 30 10:42:18 [1457] server2pengine:  warning: cluster_status:We
> do not have quorum - fencing and resource management disabled
> Jan 30 10:44:36 [1349] server2 corosync debug   [QUORUM] getinfo response
> error: 1
>
>
>
>
> We have installed the following packages:
>
> corosync-2.2.0-1.fc18.i686
> corosynclib-2.2.0-1.fc18.i686
> drbd-bash-completion-8.3.13-1.fc18.i686
> drbd-pacemaker-8.3.13-1.fc18.i686
> drbd-utils-8.3.13-1.fc18.i686
> pacemaker-1.1.8-3.fc18.i686
> pacemaker-cli-1.1.8-3.fc18.i686
> pacemaker-cluster-libs-1.1.8-3.fc18.i686
> pacemaker-libs-1.1.8-3.fc18.i686
> pcs-0.9.27-3.fc18.i686
>
>
>
> Firewalls are disabled, Pinging and SSH communication is working without any
> problems.
>
> With best regards
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crmsh on fedora 18

2013-02-04 Thread Dan Frincu
Hi,

On Mon, Feb 4, 2013 at 9:38 AM, emmanuel segura  wrote:
> Hello List
>
> Sorry for this stupid question, but i would like to know if i can install
> crmsh on fedora 18, i know fedora 18 use pcs, but i don't like pcs

Maybe this helps.

http://www.gossamer-threads.com/lists/linuxha/pacemaker/83637

>
> Thanks
>
> --
> esta es mi vida e me la vivo hasta que dios quiera



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Corosync over DHCP IP

2013-02-11 Thread Dan Frincu
cemakerd:   notice: pcmk_shutdown_worker:
> Shutdown complete
> Feb 10 07:56:27 [5242] host1 pacemakerd: info: main:   Exiting
> pacemakerd
>
>
> corosync.conf:
>
> compatibility: whitetank
>
> totem {
> version: 2
> secauth: off
> nodeid: 104
> interface {
> member {
> memberaddr: 172.17.0.104
> }
> member {
> memberaddr: 172.17.0.105
> }
> ringnumber: 0
> bindnetaddr: 172.17.0.0
> mcastport: 5426
> ttl: 1
> }
> transport: udpu
> }
>
> logging {
> fileline: off
> to_logfile: yes
> to_syslog: yes
> debug: on
> logfile: /var/log/cluster/corosync.log
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
> service {
># Load the Pacemaker Cluster Resource Manager
>ver:   1
>name:  pacemaker
> }
>
> aisexec {
>user:   root
>group:  root
> }
>
>
>
> Thank you!
>
> --
> Viacheslav Biriukov
> BR
> http://biriukov.me
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] RES: Reboot of cluster members with heavy load on filesystem.

2013-02-11 Thread Dan Frincu
mcastaddr: 226.94.1.1
>> mcastport: 5406
>> ttl: 1
>> }
>> }
>>
>> Can you kindly point what timer/counter should I play with?
>
> I would start by making these higher, perhaps double them and see what
> effect it has.
>
> token:  5000
> token_retransmits_before_loss_const: 10
>
>> What are the reasonable values for them? I got scared with this warning "It 
>> is not recommended to alter this value without guidance
>> from the corosync community."
>> Is there any benefits of changing the rrp_mode from active to passive?

rrp_mode: passive is better tested than active. That's the only real benefit.
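As a sketch, the totem section with the doubled values and passive
rrp_mode would look like this (an illustration, not tuning advice):

totem {
    token: 10000
    token_retransmits_before_loss_const: 20
    rrp_mode: passive
}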

>
> Not something I've played with, sorry.
>
>> Should it be done on both hosts?
>
> It should be the same I would imagine.
>
>>
>>> > 
>>> >
>>> > Feb  6 04:30:32 apolo lrmd: [2855]: info: RA output:
>>> > (httpd:0:monitor:stderr) redirecting to systemctl Feb  6 04:31:32
>>> > apolo lrmd: [2855]: info: RA output: (httpd:0:monitor:stderr) redirecting 
>>> > to systemctl Feb  6
>>> 04:31:41 apolo corosync[2848]:  [TOTEM ] A processor failed, forming new 
>>> configuration.
>>> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] CLM CONFIGURATION CHANGE
>>> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] New Configuration:
>>> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] #011r(0) ip(10.10.1.1) 
>>> > r(1) ip(10.10.10.8)
>>> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Left:
>>> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] #011r(0) ip(10.10.1.2) 
>>> > r(1) ip(10.10.10.9)
>>> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Joined:
>>> > Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] notice:
>>> > pcmk_peer_update: Transitional membership event on ring 304: memb=1,
>>> > new=0,
>>> > lost=1
>>
>> [snip]
>>
>>> >
>>> > After lots of log apolo asks diana to reboot and sometime after that it 
>>> > got rebooted too.
>>> > We had an old cluster with heartbeat and DRBD used to cause it on that 
>>> > system but now looks like
>>> Pacemaker is the guilt.
>>> >
>>> > Here is my Pacemaker and DRBD configuration
>>> > http://www2.connection.com.br/cbastos/pacemaker/crm_config
>>> > http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/global_commo
>>> > n.setup
>>> > http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/backup.res
>>> > http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/export.res
>>> >
>>> > And more detailed logs
>>> > http://www2.connection.com.br/cbastos/pacemaker/reboot_apolo
>>> > http://www2.connection.com.br/cbastos/pacemaker/reboot_diana
>>> >
>>
>> Best regards,
>> Carlos.
>>
>>
>>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Online add a new node to cluster communicating by UDPU

2013-02-12 Thread Dan Frincu
Hi,

On Tue, Feb 12, 2013 at 11:10 AM, Michal Fiala  wrote:
> Hello,
>
> is there a way how to online add a new node to corosync/pacemaker
> cluster, where nodes communicate by unicast UDP?

I don't think this is possible, as you need to update corosync.conf on
all nodes to add the new member, and changes to corosync.conf only
take effect after corosync has been restarted.
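With udpu, each node's corosync.conf carries the full member list, so
the new node has to appear on every node, e.g. (addresses
hypothetical):

totem {
    version: 2
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0
        member {
            memberaddr: 10.0.0.1
        }
        member {
            memberaddr: 10.0.0.2
        }
        member {
            # new node: add on every node, then restart corosync
            memberaddr: 10.0.0.3
        }
    }
}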

>
> Thanks
>
> Michal
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Online add a new node to cluster communicating by UDPU

2013-02-12 Thread Dan Frincu
On Tue, Feb 12, 2013 at 1:28 PM, Vladislav Bogdanov
 wrote:
> 12.02.2013 14:11, Viacheslav Biriukov wrote:
>> Why don't you use it? Do you know any issues with this method?
>
> I just did not need it yet.
>
> And, one also needs to check that it is possible to cleanly delete nodes

Since the hostnames don't change, there shouldn't be a requirement to
delete the node. If the IPs are dynamically allocated but stay the
same, and you don't hit bugs such as the dnsmasq one mentioned
earlier, then the DHCP renewal process won't take the interface down
(IIRC the client asks the server for a new lease once half of the
current lease time has expired, and retries several times before the
lease expires).

Dynamically adding nodes to the cluster shouldn't be a problem;
removing them should be done manually (that's how I see it, see the
sketch below), as you can't differentiate between a node which has
been down for a prolonged period of time due to maintenance and one
which is no longer part of the cluster and should be removed
automatically.
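A manual removal would be something like this with the crm shell (node
name hypothetical):

crm node delete node3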

My 2 cents.

> from a CIB (both configuration and status sections) and they do not
> reappear there anymore (I recall related issues in the past when node
> reappears in CIB after membership change). Hopefully that was fixed, but
> I'm not sure. Also, as I do not play with cluster size changes right
> now, I don't know exactly how does pacemaker currently deals with
> dynamic change of number of clone instances.
>
> May be Andrew or David can comment on this?
>
> I must admit that it would be very nice to have dynamic membership
> support polished in 2.0 :)
>
> Vladislav
>
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Listing all resources of a specific type

2013-02-20 Thread Dan Frincu
Hi,

On Tue, Feb 19, 2013 at 10:42 PM, Donald Stahl  wrote:
> Is there some way of listing all resources that use a specific
> resource agent using crm shell?
>
> For example:
>
> # crm resource list
>  stonith-sbd(stonith:external/sbd) Started
>  IP1 (ocf::heartbeat:IPaddr2) Started
>  IP2 (ocf::heartbeat:IPaddr2) Started
>  FS1 (ocf::heartbeat:Filesystem) Started
>  FS2 (ocf::heartbeat:Filesystem) Started
>
> I'd like to be able to filter by the resource agents- for example the
> ocf::heartbeat:Filesystem agent.
>
> Much like:
> # crm resource list | grep ocf::heartbeat:Filesystem
>  FS1 (ocf::heartbeat:Filesystem) Started
>  FS2 (ocf::heartbeat:Filesystem) Started
>
> Obviously I can use grep but I'd love to know if there were a native
> way of doing this.

There's crm ra list class:provider:type, but that command returns all
matching RAs installed on the system, whereas I guess you want to find
RAs that are actively used in the configuration.
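For example (there's no native filter for configured resources, so
grep it is):

# all installed RAs for a given class and provider
crm ra list ocf heartbeat

# configured resources, filtered by agent
crm configure show | grep 'ocf:heartbeat:Filesystem'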

>
> Thanks,
> -Don
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Split-brain after

2011-08-15 Thread Dan Frincu
On Thu, Aug 11, 2011 at 8:12 PM, Digimer  wrote:
> On 08/11/2011 12:58 PM, Alex Forster wrote:
>> I have a two node Pacemaker/Corosync cluster with no resources configured 
>> yet.
>> I'm running RHEL 6.1 with the official 1.1.5-5.el6 package.
>>
>> While doing various network configuration, I happened to notice that if I 
>> issue
>> a "service network restart" on one node, then approx. four seconds later 
>> issue
>> "service network restart" on the second node, the two nodes become split 
>> brain,
>> each thinking the other is offline.
>>
>> Obviously, issuing 'service network restarts' four seconds apart will not be 
>> a
>> common occurrence in production, but it concerns me that I can 'trick' the 
>> nodes
>> into becoming split-brain so easily. Is there some way I can configure 
>> Corosync
>> to quickly recover from this scenario?

man corosync.conf
You can increase the value for rrp_problem_count_timeout for this.

rrp_problem_count_timeout
    This specifies the time in milliseconds to wait before
    decrementing the problem count by 1 for a particular ring, to
    ensure a link is not marked faulty for transient network failures.

    The default is 2000 milliseconds.

Raising it, however, has effects further along the way: you need to
take into consideration the timeouts of your resources and their
monitor operations, so that they accommodate the added time from
modifying this value.
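As an illustration, doubling the default would look like:

totem {
    # other totem settings unchanged
    rrp_problem_count_timeout: 4000
}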

Regards,
Dan

p.s.: don't mess with rrp_problem_count_threshold unless you also
consider that (rrp_problem_count_threshold * rrp_token_expired_timeout)
must stay below (token - 50ms); with the defaults, (10 * 47) <
(1000 - 50), i.e. 470 < 950. Changing rrp_problem_count_threshold to a
higher value would also mean changing the token timeout and/or other
parameters, so it's best to plan ahead.

>>
>> Alex
>
> Configuring fence (stonith) will protect against split-brain by causing
> the remote node to be forced offline (rough, but better than split-brain).
>
> --
> Digimer
> E-Mail:              digi...@alteeve.com
> Freenode handle:     digimer
> Papers and Projects: http://alteeve.com
> Node Assassin:       http://nodeassassin.org
> "At what point did we forget that the Space Shuttle was, essentially,
> a program that strapped human beings to an explosion and tried to stab
> through the sky with fire and math?"
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Question about Pacemaker master/slave and mysql replication

2011-08-15 Thread Dan Frincu
Hi,

On Sat, Aug 13, 2011 at 2:53 AM, Michael Szilagyi  wrote:
> I'm new to Pacemaker and trying to understand exactly what it can and can't
> do.
> I currently have a small, mysql master/slave cluster setup that is getting
> monitored within Heartbeat/Pacemaker:  What I'd like to be able to do (and
> am hoping Pacemaker will do) is to have 1 node designated as Master and in
> the event of a failure, automatically promote a slave to master and realign
> all of the existing slaves to be slaves of the newly promoted master.
>  Currently what seems to be happening, however, is heartbeat correctly sees
> that a node goes down and pacemaker promotes it up to master but the
> replication is not adjusted so that it is now feeding everyone else.  It
> seems like this should be possible to do from within Pacemaker but I feel
> like I'm missing a part of the puzzle.  Any suggestions would be
> appreciated.

You could try the mysql RA from
https://github.com/fghaas/resource-agents/blob/master/heartbeat/mysql
Last I heard, it had replication support.

HTH.

>
> Here's an output of my crm configure show:
> node $id="7deca2cd-9a64-476c-8ea2-372bca859a4f" four \
> attributes 172.17.0.130-log-file-p_sql="mysql-bin.13"
> 172.17.0.130-log-pos-p_sql="632"
> node $id="9b355ab7-8c81-485c-8dcd-1facedde5d03" three \
> attributes 172.17.0.131-log-file-p_sql="mysql-bin.20"
> 172.17.0.131-log-pos-p_sql="106"
> primitive p_sql ocf:heartbeat:mysql \
> params config="/etc/mysql/my.cnf" binary="/usr/bin/mysqld_safe"
> datadir="/var/lib/mysql" \
> params pid="/var/lib/mysql/novaSQL.pid" socket="/var/run/mysqld/mysqld.sock"
> \
> params max_slave_lag="120" \
> params replication_user="novaSlave" replication_passwd="nova" \
> params additional_parameters="--skip-external-locking
> --relay-log=novaSQL-relay-bin --relay-log-index=relay-bin.index
> --relay-log-info-file=relay-bin.info" \
> op start interval="0" timeout="120" \
> op stop interval="0" timeout="120" \
> op promote interval="0" timeout="120" \
> op demote interval="0" timeout="120" \
> op monitor interval="10" role="Master" timeout="30" \
> op monitor interval="30" role="Slave" timeout="30"
> primitive p_sqlIP ocf:heartbeat:IPaddr2 \
> params ip="172.17.0.96" \
> op monitor interval="10s"
> ms ms_sql p_sql \
> meta target-role="Started" is-managed="true"
> location l_sqlMaster p_sqlIP 10: three
> location l_sqlSlave1 p_sqlIP 5: four
> property $id="cib-bootstrap-options" \
> dc-version="1.0.9-unknown" \
> cluster-infrastructure="Heartbeat" \
> stonith-enabled="false" \
> no-quorum-policy="ignore" \
> last-lrm-refresh="1313187103"
>
> Thanks!
> -Mike.



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Ordering and Colocation

2011-08-15 Thread Dan Frincu
Hi,

On Mon, Aug 15, 2011 at 3:33 AM, Curtis  wrote:
> On 15/08/11 10:23, Curtis wrote:
>>
>> Greetings,
>> I've been wrestling with this configuration for a few days now as I
>> slowly climb the learning curve of Pacemaker.
>
>
> Further details [sorry]-- versions.  All are from debian squeeze
>
> Pacemaker: 1.0.9
> Corosync: 1.2.1
> Cluster Glue: 1.0.6
>
>> My situation is as follows:
>>
>> I have 2 nodes, with 3 layers of resources:
>> drbd->lvm->publish
>>
>> They must run on both nodes, but each service is dependant only on those
>> on the same node.
>>
>> If drbd is not Master, lvm can't start.
>> If lvm isn't started, publish can't start.
>>
>> Now, from talking with beekhof on IRC, All I need is ordering and
>> colocation. This has worked for bringing it up, but when I, say, stop
>> LVM... the publishing doesn't stop.

How do you stop LVM? On what node?
Are you running DRBD dual-primary by any chance?
STONITH configured? And enabled? And tested?

Regards,
Dan

>>
>> Config [sorry if XML is preferred]:
>>
>> primitive drbd_prim ocf:linbit:drbd \
>> params drbd_resource="raid"
>> primitive lvm_prim ocf:heartbeat:LVM \
>> params volgroupname="raid"
>> primitive publish_prim ocf:iomax:scst \
>> prams 
>> ms drbd drbd_prim \
>> meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="2"
>> notify="true"
>> clone lvm lvm_prim \
>> meta globally-unique="true" clone-max="2" clone-node-max="1"
>> clone publish publish_prim \
>> meta globally-unique="true" clone-max="2" clone-node-max="1"
>> colocation lvm_with_drbd inf: drbd:Master lvm
>> colocation publish_with_lvm inf: lvm publish
>> order drbd_then_lvm inf: drbd:promote lvm symmetrical=true
>> order lvm_then_publish inf: lvm publish symmetrical=true
>>
>> I'd really appreciate any information on how my understanding is
>> deficient, and how to get this working.
>>
>> --
>> Curtis
>>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] ocf:heartbeat:Filesystem doesn't work via corosync

2011-08-18 Thread Dan Frincu
output: (fs_mysql:start:stdout) 
> Disk write-protected; use the -n option to do a read-only#012check of the 
> device.
> Aug 17 12:35:20 gila lrmd: [24754]: info: RA output: (fs_mysql:start:stderr) 
> fsck.ext4: Read-only file system while trying to open /dev/drbd0#015
>
>
> Any help would be greatly appreciated.
>
> Thanks,
> Cotton Tenney
> Systems Administrator
> Rogers Software Development
>
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Syntax highlighting in vim for crm configure edit

2011-08-19 Thread Dan Frincu
Hi,

On Thu, Aug 18, 2011 at 5:53 PM, Digimer  wrote:
> On 08/18/2011 10:39 AM, Trevor Hemsley wrote:
>> Hi all
>>
>> I have attached a first stab at a vim syntax highlighting file for 'crm
>> configure edit'
>>
>> To activate this, I have added 'filetype plugin on' to my /root/.vimrc
>> then created /root/.vim/{ftdetect,ftplugin}/pcmk.vim
>>
>> In /root/.vim/ftdetect/pcmk.vim I have the following content
>>
>> au BufNewFile,BufRead /tmp/tmp* set filetype=pcmk
>>
>> but there may be a better way to make this happen. /root/.vim/pcmk.vim
>> is the attached file.
>>
>> Comments (not too nasty please!) welcome.

I've added a couple of extra keywords to the file, to cover a couple
more use cases. Other than that, great job.

Regards,
Dan

>
> I would love to see proper support added for CRM syntax highlighting
> added to vim. I will give this is a test and write back in a bit.
>
> --
> Digimer
> E-Mail:              digi...@alteeve.com
> Freenode handle:     digimer
> Papers and Projects: http://alteeve.com
> Node Assassin:       http://nodeassassin.org
> "At what point did we forget that the Space Shuttle was, essentially,
> a program that strapped human beings to an explosion and tried to stab
> through the sky with fire and math?"
>



-- 
Dan Frincu
CCNA, RHCE


pcmk.vim
Description: Binary data
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Compile Error on Debian

2011-09-16 Thread Dan Frincu
Hi,

On Fri, Sep 16, 2011 at 10:56 AM, Dejan Muhamedagic  wrote:
> Hi,
>
> On Thu, Sep 15, 2011 at 06:06:31PM -0400, Nick Khamis wrote:
>> Hello Everyone,
>>
>> Using tip 1.0.7 I get:
>>
>> pes -Wwrite-strings -ansi -D_GNU_SOURCE -DANSI_ONLY -Werror -MT
>> pils.lo -MD -MP -MF .deps/pils.Tpo -c pils.c  -fPIC -DPIC -o
>> .libs/pils.o
>> cc1: warnings being treated as errors
>> In file included from /usr/include/glib-2.0/glib/gasyncqueue.h:34,
>>                  from /usr/include/glib-2.0/glib.h:34,
>>                  from pils.c:34:
>> /usr/include/glib-2.0/glib/gthread.h: In function 'g_once_init_enter':
>> /usr/include/glib-2.0/glib/gthread.h:348: error: cast discards
>> qualifiers from pointer target type
>
Applying the following patch fixes it:
http://bugzilla-attachments.gnome.org/attachment.cgi?id=158740

Regards,
Dan

> This seems to be an issue in glib-2.0. You can also
> configure with enable_fatal_warnings=no.
>
> Thanks,
>
> Dejan
>
>> make[2]: *** [pils.lo] Error 1
>> make[2]: Leaving directory
>> `/usr/local/src/Reusable-Cluster-Components-glue--glue-1.0.7/lib/pils'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory
>> `/usr/local/src/Reusable-Cluster-Components-glue--glue-1.0.7/lib'
>> make: *** [all-recursive] Error 1
>> root@pace1:/usr/local/src/Reusable-Cluster-Components-glue--glue-1.0.7#
>> apt-get install pils
>> Reading package lists... Done
>> Building dependency tree
>> Reading state information... Done
>> E: Unable to locate package pils
>> root@pace1:/usr/local/src/Reusable-Cluster-Components-glue--glue-1.0.7#
>> apt-get install libpils
>> Reading package lists... Done
>> Building dependency tree
>> Reading state information... Done
>> E: Unable to locate package libpils
>>
>>
>> Thanks in Advance,
>>
>> Nick
>>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Errors When Loading OCF

2011-09-20 Thread Dan Frincu
Hi,

On Mon, Sep 19, 2011 at 7:21 PM, Nick Khamis  wrote:
> Hello Everyone,
>
> I have been experiencing some problems getting pacemaker going with
> DRBD and MySQL
>
> The Config:
>
> primitive drbd_mysql ocf:linbit:drbd \
>                    params drbd_resource="mysql" \
>                    op monitor interval="15s"
> ms ms_drbd_mysql drbd_mysql \
>                    meta master-max="1" master-node-max="1" \
>                         clone-max="2" clone-node-max="1" \
>                         notify="true"
> primitive fs_mysql ocf:heartbeat:Filesystem \
>                    params device="/dev/drbd/by-res/mysql" \
>                      directory="/var/lib/mysql" fstype="ext3"
> primitive ip_mysql ocf:heartbeat:IPaddr2 \
>                    params ip="192.168.2.100" nic="eth1"
> primitive mysqld lsb:mysqld

I strongly recommend an OCF-compliant RA for this (such as
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/mysql),
not the LSB script.
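A sketch of such a primitive (paths are guesses, adjust to your
layout):

primitive mysqld ocf:heartbeat:mysql \
        params binary="/usr/bin/mysqld_safe" config="/etc/mysql/my.cnf" \
          datadir="/var/lib/mysql" pid="/var/run/mysqld/mysqld.pid" \
          socket="/var/run/mysqld/mysqld.sock" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="120" \
        op monitor interval="30" timeout="30"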

> group mysql fs_mysql ip_mysql mysqld
> colocation mysql_on_drbd \
>                      inf: mysql ms_drbd_mysql:Master
> order mysql_after_drbd \
>                      inf: ms_drbd_mysql:promote mysql:start
> property $id="cib-bootstrap-options" \
>        no-quorum-policy="ignore" \
>        stonith-enabled="false" \
>        expected-quorum-votes="2" \

I'm assuming you're upgrading (or have upgraded) the cluster stack
from a previous version; the dc-version is not the one provided by the
1.1.6 you've mentioned below.

>        dc-version="1.0.4-2ec1d189f9c23093bf9239a980534b661baf782d" \
>        cluster-infrastructure="openais"
>
> The Errors:
>
> lrmadmin[2302]: 2011/09/19_11:41:26 ERROR:
> lrm_get_rsc_type_metadata(578): got a return code HA_FAIL from a reply
> message of rmetadata with function get_ret_from_msg.
> ERROR: ocf:linbit:drbd: could not parse meta-data:
> ERROR: ocf:linbit:drbd: no such resource agent

Check the /usr/lib/ocf/resource.d/linbit directory for the presence of
the drbd RA. If it isn't there, you might have done something wrong
while compiling DRBD.

> lrmadmin[2333]: 2011/09/19_11:41:26 ERROR:
> lrm_get_rsc_type_metadata(578): got a return code HA_FAIL from a reply
> message of rmetadata with function get_ret_from_msg.
> ERROR: lsb:mysqld: could not parse meta-data:
> ERROR: lsb:mysqld: no such resource agent

It could be mysql, not mysqld. Either way, use an OCF RA instead (see above).

> ERROR: object mysqld does not exist
> ERROR: object drbd_mysql does not exist
> ERROR: syntax in primitive: master-max=1 master-node-max=1 clone-max=2
> clone-node-max=1 notify=true
>
>
> The "ERROR: syntax in primitive: master-max=1 master-node-max=1
> clone-max=2 clone-node-max=1 notify=true" could be resolved by adding
> a trailing backslash to:
>

You're missing the RA for DRBD, which means there can be no ms
resource, which in turn you can't reference in a group. The trailing
backslash allowing this seems more of a bug than a feature.

> group mysql fs_mysql ip_mysql mysqld
>
> The examples found both miss the slash:
>
> http://www.drbd.org/docs/about/    "Adding a DRBD-backed service to
> the cluster configuration"
> http://www.clusterlabs.org/wiki/DRBD_MySQL_HowTo
>

As they should; a trailing backslash within the crm shell means that
it's expecting input on the next line (not having input on the next
line should result in an error, hence my mention of this possibly
being a bug).

> Environemnt:
> DRBD and Cluster Stack are all the latest versions downloaded and
> built from source.
> DRBD: version: 8.3.7
> CRM: 1.1.6

You mean Pacemaker 1.1.6.

>
> DRBD Meta Data: /dev/drbd0/by-res/r0.res
> OCF RA: /usr/lib/ocf/resource.d/linbit/drbd
> MySQL RA: /usr/lib/ocf/resource.d/heartbeat/mysql?
> /etc/init.d/mysql starts fine...

No doubt there.

>
> I  just noticed "dc-version", should this match "Version:
> 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" returned by crm?
>

Yes

> Finally, where is the best source of up-to-date documenation for
> Cluster Glue and Resource Agents.

Documentation regarding ... installation, configuration, etc? Here's a
couple of useful links.

Resource agents
http://www.linux-ha.org/wiki/Resource_Agents
http://linux-ha.org/wiki/OCF_Resource_Agents

Dev guide for OCF RA's
http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html

Resource agents repo
https://github.com/ClusterLabs/resource-agents

Cluster glue repo
http://hg.linux-ha.org/glue/

HTH,
Dan

>
> Thanks in Advnace,
>
> Nick.
>
> 

Re: [Pacemaker] Resource starts on wrong node ?

2011-09-21 Thread Dan Frincu
e"
>
> ms ms_drbd2 drbd2 \
>
>     meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
>
> colocation fs2_on_drbd inf: wwwfs ms_drbd1:Master
>
> colocation fs3_on_drbd inf: zarafafs ms_drbd2:Master
>
> colocation fs_on_drbd inf: mysqlfs ms_drbd0:Master
>
> order fs2_after_drbd inf: ms_drbd1:promote wwwfs:start
>
> order fs3_after_drbd inf: ms_drbd2:promote zarafafs:start
>
> order fs_after_drbd inf: ms_drbd0:promote mysqlfs:start
>

You either set a location constraint for mysqlip or use colocation
and ordering constraints for it, e.g.:

colocation mysqlip_on_drbd inf: mysqlip ms_drbd0:Master
order mysqlip_after_drbd inf: ms_drbd0:promote mysqlip:start

> property $id="cib-bootstrap-options" \
>
>     dc-version="1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe" \
>
>     cluster-infrastructure="openais" \
>
>     expected-quorum-votes="2" \
>
>     no-quorum-policy="ignore" \
>
>     stonith-enabled="false"
>
> rsc_defaults $id="rsc-options" \
>
>     resource_stickyness="INFINITY" \

I wouldn't set INFINITY, as it will cause problems; I'd give it a
value of 500 or 1000.
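For example (note the spelling of the property name,
resource-stickiness):

rsc_defaults $id="rsc-options" \
    resource-stickiness="1000" \
    migration-threshold="1"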

Regards,
Dan

>
>     migration-threshold="1"
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Resource starts on wrong node ?

2011-09-21 Thread Dan Frincu
Hi,

On Wed, Sep 21, 2011 at 3:03 PM, Hans Lammerts  wrote:
>  Dan,
>
>
>
> Thanks for the swift reply.
>
> I didn't know pacemaker was sort of loadbalancing across nodes.
>
> Maybe I should read the documentation in more detail.
>
>
>
> Regarding the versions:
>
> I would like to have the newest versions, but what I've done until now is
> just install what's available
>
> from the Centos repositories.
>
> Indeed I would like to upgrade since I also sometimes experience the issue
> that several heartbeat daemons
>
> start looping when I change something in the config. Something that's
> supposed to be fixed in a higher level
>
> of corosync/heartbeat/pacemaker
>

Have a look at http://clusterlabs.org/wiki/RHEL for how to add the
repos for EL6. Unfortunately, AFAICS, only Pacemaker is available as a
newer version (1.1.5); corosync is still at 1.2.3.

I'd also recommend building corosync RPMs from the tarball
(http://www.corosync.org/), but that's just my personal preference;
some prefer pre-built binaries.

>
>
> About what you said: Is there a limited number of resources that can run on
> one node, before pacemaker decides it is going to run a subsequent resource
> on another node ?

The algorithm is basically round robin. By default it doesn't make
any assumptions about the "importance" of the resources: the first
resource goes to the first node, the second to the second node, the
third to the first node, the fourth to the second node, and so on.

>
> Wouldn't it be best to always use the colocation and order directives to
> prevent this from happening ?
>

It all depends on the purpose of the cluster; if it fits the needs of
your setup, then yes, use colocation and ordering. There really isn't
a "one size fits all" scenario.

Regards,
Dan

>
>
> Thanks again,
>
>
>
> Hans
>
>
> -Original message-
> To: The Pacemaker cluster resource manager ;
> From: Dan Frincu 
> Sent: Wed 21-09-2011 12:44
> Subject: Re: [Pacemaker] Resource starts on wrong node ?
> Hi,
>
> On Wed, Sep 21, 2011 at 1:02 PM, Hans Lammerts  wrote:
>> Hi all,
>>
>>
>>
>> Just started to configure a two node cluster (Centos 6) with drbd
>> 8.4.0-31.el6,
>>
>> corosync 1.2.3 and pacemaker 1.1.2.
>
> Strange choice of versions, if it's a new setup, why don't you go for
> corosync 1.4.1 and pacemaker 1.1.5?
>
>>
>> I created three DRBD filesystems, and started to add them in the crm
>> config
>> one by one.
>>
>> Everything went OK. After adding these resources they start on node1, and
>> when I set node1
>>
>> in standby, these three DRBD resources failover nicely to the second node.
>> And vice versa.
>>
>> So far so good.
>>
>>
>>
>> Next, I added one extra resource, that is supposed to put an IP alias on
>> eth0.
>>
>> This also works, but strangely enough the alias is set on eth0 of the
>> second
>> node, where I would have
>>
>> expected it to start on the first node (just as the three drbd resources
>> did).
>>
>> Why then does Pacemaker decide that this resource is to be started on
>> the
>> second node ? I cannot grasp
>>
>> the reason why.
>
> Because it tries to load balance resources on available nodes. You
> have several resources running on one node, and didn't specify any
> restrictions on the mysqlip, therefore it chose the second node as it
> had less resources on it. You override the behavior with constraints.
> See below.
>
>>
>> Hope anyone can tell me what I'm doing wrong.
>>
>>
>>
>> Thanks,
>>
>> Hans
>>
>>
>>
>> Just to be sure, I'll show my config below:
>>
>>
>>
>> node cl1 \
>>
>>     attributes standby="off"
>>
>> node cl2 \
>>
>>     attributes standby="off"
>>
>> primitive drbd0 ocf:linbit:drbd \
>>
>>     params drbd_resource="mysql" drbdconf="/etc/drbd.conf" \
>>
>>     op start interval="0" timeout="240s" \
>>
>>     op monitor interval="20s" timeout="20s" \
>>
>>     op stop interval="0" timeout="100s"
>>
>> primitive drbd1 ocf:linbit:drbd \
>>
>>     params drbd_resource="www" drbdconf="/etc/drbd.conf" \
>>
>>     op start interval="0" timeout="240s" \
>>
>>     op monitor interval=&

Re: [Pacemaker] pacemaker compatibility

2011-10-19 Thread Dan Frincu
Hi,

On Wed, Oct 19, 2011 at 8:12 AM,   wrote:
> thank you
> Andreas
>
> Now I am facing core dump issue with
> corosync-1.4.2
> cluster-glue-1.0.7
> pacemaker-1.0.11
>

To report a corosync crash, please follow this guide:
http://corosync.org/doku.php?id=faq:crash

Regards,
Dan

> In many scenario I getting core dump during corosync start operation for 2
> rings like.
>
> 1. Configure 2 rings ring0 10.16.16.0,  ring1 192.168.1.0
>   ring1 network(ifconfig  eth1 192.168.1.14 down) is down before corosync
>  startup
>
> 2. Configure 2 rings ring0 10.16.16.0,  ring1 192.168.1.0(invalid network)
>   ring1 network(ifconfig  eth1 193.167.1.14 up) is different. means
> network is not present for ring1. start corosync
>
> All the core dump are generated from the same files.
>
> core was generated by 'corosync' programme terminate with signal 6
> C File where problem encountered
> totemsrp.c.2526
> totemsrp.c.3545
> totemrrp.c.1036
> totemrrp.c.1736
> totemudp.c.1252
> coropoll.c.513
> main.c.1846
>
> With corosync1.2 also we are facing core dump issue.
>
> Is there any way, to avoid only corosync core dump.
>
> On Tue, October 18, 2011 2:05 pm, Andreas Kurz wrote:
>> Hello,
>>
>>
>> On 10/18/2011 08:11 AM, manish.gu...@ionidea.com wrote:
>>
>>> Hi,
>>>
>>>
>>> I am using corosync.1.2.1. I want  to upgrade  corosync  from 1.2 to
>>> 1.4.2.
>>>
>>>
>>>
>>> please can you let me know which version of cluster-glue and pacemekr
>>> are compatiable with corosync1.4.2
>>>
>>> Currentely with corosync1.4.2 I am using pacemaker 1.0.10 and
>>> cluster-glue1.0.3 and I am getting error ..
>>
>> You should also upgrade Pacemaker to 1.0.11 and especially cluster-glue
>> to latest version 1.0.7 ... though this old versions might not be the cause
>> for your problems here.
>>
>>>
>>> service failed to load pacemaker ...
>>
>> Hard to say without having a look at your corosync configuration.
>>
>>
>> Regards,
>> Andreas
>>
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>
>>>
>>>
>>> Regards
>>> Manish
>>>
>>>
>>>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] killing corosync leaves crmd, stonithd, lrmd, cib and attrd to hog up the cpu

2011-11-14 Thread Dan Frincu
Hi,

On Mon, Nov 14, 2011 at 1:32 PM, ihjaz Mohamed  wrote:
> Hi All,
> As part of some robustness test for my cluster, I tried killing the corosync
> process using kill -9 . After this I see that the pacemakerd service is
> stopped but the processes crmd, stonithd, lrmd, cib and attrd are still
> running and are hogging up the cpu.

I have seen this kind of testing before and I have to say I don't
consider it the recommended way of testing the cluster stack's
"robustness". The Pacemaker processes rely on corosync for proper
functioning; you kill corosync and then want to "clean up" the
processes? You'll have to go through a lot more of the literature to
understand how this cluster stack works.

For the Master Control Process, how it works and other related
information (which is related to what you are experiencing), see
http://theclusterguy.clusterlabs.org/post/907043024/introducing-the-pacemaker-master-control-process-for

The essential guide you need is
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/

HTH,
Dan

>
> top - 06:26:51 up  2:01,  4 users,  load average: 12.04, 12.01, 11.98
> Tasks: 330 total,  13 running, 317 sleeping,   0 stopped,   0 zombie
> Cpu(s):  7.1%us, 17.1%sy,  0.0%ni, 75.6%id,  0.1%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Mem:   8015444k total,  4804412k used,  3211032k free,    54800k buffers
> Swap: 10256376k total,    0k used, 10256376k free,  1604464k cached
>
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2053 hacluste  RT   0 90492 3324 2476 R 100.0  0.0 113:40.61 crmd
>  2047 root  RT   0 81480 2108 1712 R 99.8  0.0 113:40.43 stonithd
>  2048 hacluste  RT   0 83404 5260 2992 R 99.8  0.1 113:40.90 cib
>  2050 hacluste  RT   0 85896 2388 1952 R 99.8  0.0 113:40.43 attrd
>  5018 root  20   0 8787m 345m  56m S  2.0  4.4   0:56.95 java
> 19017 root  20   0 15068 1252  796 R  2.0  0.0   0:00.01 top
>     1 root  20   0 19232 1444 1156 S  0.0  0.0   0:01.71 init
>     2 root  20   0 0    0    0 S  0.0  0.0   0:00.00 kthreadd
>     3 root  RT   0 0    0    0 S  0.0  0.0   0:00.00 migration/0
>     4 root  20   0 0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
>
>
> Is there a way to cleanup these processes ? OR Do I need to kill them one by
> one before respawning the corosync?
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Syntax highlighting in vim for crm configure edit

2011-11-22 Thread Dan Frincu
Hi,

On Tue, Nov 15, 2011 at 11:47 AM, Raoul Bhatia [IPAX]  wrote:
> hi!
>
> On 2011-08-19 16:28, Dan Frincu wrote:
>>
>> Hi,
>>
>> On Thu, Aug 18, 2011 at 5:53 PM, Digimer  wrote:
>>>
>>> On 08/18/2011 10:39 AM, Trevor Hemsley wrote:
>>>>
>>>> Hi all
>>>>
>>>> I have attached a first stab at a vim syntax highlighting file for 'crm
>>>> configure edit'
>>>>
>>>> To activate this, I have added 'filetype plugin on' to my /root/.vimrc
>>>> then created /root/.vim/{ftdetect,ftplugin}/pcmk.vim
>>>>
>>>> In /root/.vim/ftdetect/pcmk.vim I have the following content
>>>>
>>>> au BufNewFile,BufRead /tmp/tmp* set filetype=pcmk
>>>>
>>>> but there may be a better way to make this happen. /root/.vim/pcmk.vim
>>>> is the attached file.
>>>>
>>>> Comments (not too nasty please!) welcome.
>>
>> I've added a couple of extra keywords to the file, to cover a couple
>> more use cases. Other than that, great job.
>
> will this addition make it into some package(s)?
> would it be right to ship this vim syntax file with crm?

In the hope it will be a part of crm, I've written a patch for this.
Applying the patch over cibconfig.py and utils.py on Pacemaker 1.1.5
and adding the pcmk.vim file to the vim syntax folder (for Debian
Squeeze it's /usr/share/vim/vim72/syntax) gives access to syntax
highlighting in crm configure edit, if using vi/vim as editor.

Original work on pcmk.vim by Trevor Hemsley ,
a couple of additions by me.

Please review it and add a Signed-off-by line if it's ok.

Regards,
Dan

p.s.: many thanks to everyone for the input received on IRC.

>
> thanks,
> raoul
> --
> 
> DI (FH) Raoul Bhatia M.Sc.          email.          r.bha...@ipax.at
> Technischer Leiter
>
> IPAX - Aloy Bhatia Hava OG          web.          http://www.ipax.at
> Barawitzkagasse 10/2/2/11           email.            off...@ipax.at
> 1190 Wien                           tel.               +43 1 3670030
> FN 277995t HG Wien                  fax.            +43 1 3670030 15
> 
>



-- 
Dan Frincu
CCNA, RHCE
From d3ab2ab159137b271382db8d0edeef6d69325894 Mon Sep 17 00:00:00 2001
From: Dan Frîncu
Date: Tue, 22 Nov 2011 18:50:10 +0200
Subject: [PATCH][BUILD] Low: extra: Add syntax highlighting for crm configure edit
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit


Signed-off-by: Dan Frîncu 
---
 shell/modules/cibconfig.py |1 +
 shell/modules/utils.py |7 +--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/shell/modules/cibconfig.py b/shell/modules/cibconfig.py
index 9cc9751..49b4b51 100644
--- a/shell/modules/cibconfig.py
+++ b/shell/modules/cibconfig.py
@@ -128,6 +128,7 @@ class CibObjectSet(object):
             except IOError, msg:
                 common_err(msg)
                 break
+            s += "\n# vim: set filetype=.pcmk :\n"
             s = ''.join(f)
             f.close()
             if hash(s) == filehash: # file unchanged
diff --git a/shell/modules/utils.py b/shell/modules/utils.py
index b57aa54..00013c6 100644
--- a/shell/modules/utils.py
+++ b/shell/modules/utils.py
@@ -158,7 +158,7 @@ def str2tmp(s):
     Write the given string to a temporary file. Return the name
     of the file.
     '''
-    fd,tmp = mkstemp()
+    fd,tmp = mkstemp(suffix=".pcmk")
     try: f = os.fdopen(fd,"w")
     except IOError, msg:
         common_err(msg)
@@ -317,7 +317,10 @@ def edit_file(fname):
         return
     if not user_prefs.editor:
         return
-    return ext_cmd("%s %s" % (user_prefs.editor,fname))
+    if user_prefs.editor == "vim" or user_prefs.editor == "vi":
+        return ext_cmd("%s %s -u /usr/share/vim/vim72/syntax/pcmk.vim" % (user_prefs.editor,fname))
+    else:
+        return ext_cmd("%s %s" % (user_prefs.editor,fname))
 
 def page_string(s):
     'Write string through a pager.'
-- 
1.7.0.4



pcmk.vim
Description: Binary data
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Syntax highlighting in vim for crm configure edit

2011-12-09 Thread Dan Frincu
Hi,

On Mon, Dec 5, 2011 at 3:54 PM, Dejan Muhamedagic  wrote:
> Hi,
>
> On Tue, Nov 22, 2011 at 07:14:24PM +0200, Dan Frincu wrote:
>> Hi,
>>
>> On Tue, Nov 15, 2011 at 11:47 AM, Raoul Bhatia [IPAX]  
>> wrote:
>> > hi!
>> >
>> > On 2011-08-19 16:28, Dan Frincu wrote:
>> >>
>> >> Hi,
>> >>
>> >> On Thu, Aug 18, 2011 at 5:53 PM, Digimer  wrote:
>> >>>
>> >>> On 08/18/2011 10:39 AM, Trevor Hemsley wrote:
>> >>>>
>> >>>> Hi all
>> >>>>
>> >>>> I have attached a first stab at a vim syntax highlighting file for 'crm
>> >>>> configure edit'
>> >>>>
>> >>>> To activate this, I have added 'filetype plugin on' to my /root/.vimrc
>> >>>> then created /root/.vim/{ftdetect,ftplugin}/pcmk.vim
>> >>>>
>> >>>> In /root/.vim/ftdetect/pcmk.vim I have the following content
>> >>>>
>> >>>> au BufNewFile,BufRead /tmp/tmp* set filetype=pcmk
>> >>>>
>> >>>> but there may be a better way to make this happen. /root/.vim/pcmk.vim
>> >>>> is the attached file.
>> >>>>
>> >>>> Comments (not too nasty please!) welcome.
>> >>
>> >> I've added a couple of extra keywords to the file, to cover a couple
>> >> more use cases. Other than that, great job.
>> >
>> > will this addition make it into some package(s)?
>> > would it be right to ship this vim syntax file with crm?
>>
>> In the hope it will be a part of crm, I've written a patch for this.
>> Applying the patch over cibconfig.py and utils.py on Pacemaker 1.1.5
>> and adding the pcmk.vim file to the vim syntax folder (for Debian
>> Squeeze it's /usr/share/vim/vim72/syntax) gives access to syntax
>> highlighting in crm configure edit, if using vi/vim as editor.
>>
>> Original work on pcmk.vim by Trevor Hemsley ,
>> a couple of additions by me.
>>
>> Please review it and add a Signed-Off line if it's ok.
>
> Just tried it out, and when I do :set filetype=pcmk, vim spews at
> me this:
>
> Error detected while processing /usr/share/vim/vim72/syntax/synload.vim:
> line   58:
> E127: Cannot redefine function 3_SynSet: It is in use
> E127: Cannot redefine function 3_SynSet: It is in use
> E127: Cannot redefine function 3_SynSet: It is in use
> E127: Cannot redefine function 3_SynSet: It is in use
> Error detected while processing /usr/share/vim/vim72/syntax/nosyntax.vim:
> line   21:
> E218: autocommand nesting too deep
> Error detected while processing /usr/share/vim/vim72/syntax/synload.vim:
> line   58:
> E127: Cannot redefine function 3_SynSet: It is in use
> Error detected while processing /usr/share/vim/vim72/syntax/syntax.vim:
> line   40:
> E218: autocommand nesting too deep
>
> BTW, I just copied the pcmk.vim file to ~/.vim/syntax.
>

Well, first of all, the patch was meant to be applied to the source;
I did not mention this before. To apply it on the running system, use
the patch from http://pastebin.com/PWpuzQ4m

The patch also assumes the pcmk.vim file is copied to
/usr/share/vim/vim72/syntax/pcmk.vim
If not, the path must be adjusted to match the location of pcmk.vim.
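
For example, on Debian Squeeze the whole procedure would look roughly
like this (file names and the module path are assumptions -- adjust
them to your system):

    # apply the runtime patch over the installed shell modules
    cd /usr/share/pyshared/crm       # wherever cibconfig.py and utils.py live
    patch -p0 < /tmp/crm-vim-syntax.patch
    # install the syntax file where the patched edit_file() expects it
    cp /tmp/pcmk.vim /usr/share/vim/vim72/syntax/pcmk.vim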

Then when opening crm configure edit the syntax highlighting is
applied. Your test and the resulting errors come from not applying
the patch.

> Otherwise, the output looks fine. There are a few differences to
> the configure show output:
>
> - quotes are red along with the value
> - ids are green whereas in configure show they are normal
> - id references are light blue and in configure show they are green
> - scores are red and in configure show violet
> - roles/actions in constraints red and in configure show normal
>
> There are probably a few more differences.
>

Indeed, it's not perfect, but it's better than nothing and can be
improved over time.

Regards,
Dan

> Cheers,
>
> Dejan
>
>
>> Regards,
>> Dan
>>
>> p.s.: many thanks to everyone for the input received on IRC.
>>
>> >
>> > thanks,
>> > raoul
>> > --
>> > 
>> > DI (FH) Raoul Bhatia M.Sc.          email.          r.bha...@ipax.at
>> > Technischer Leiter
>> >
>> > IPAX - Aloy Bhatia Hava OG          web.          http://www.ipax.at
>> > Barawitzkagasse 10/2/2/11           email.            off...@ipax.at
>> > 1190 Wien 

Re: [Pacemaker] How to live migrate the kvm vm

2011-12-13 Thread Dan Frincu
Hi,

On Tue, Dec 13, 2011 at 6:11 AM, Qiu Zhigang  wrote:
> Hi,
>
> Thank you, you are right, I correct the 'allow-migrate="true"', but now I 
> found another problem when migrate, migrate failed.
> The following is the log.
>
> Dec 13 12:10:03 h10_151 kernel: type=1400 audit(1323749403.251:623): avc:  
> denied  { search } for  pid=27201 comm="virsh" name="libvirt" dev=dm-0 
> ino=2098071 scontext=unconfined_u:system_r:corosync_t:s0 
> tcontext=system_u:object_r:virt_var_run_t:s0 tclass=dir
> Dec 13 12:10:04 h10_151 kernel: type=1400 audit(1323749404.067:624): avc:  
> denied  { search } for  pid=27218 comm="VirtualDomain" name="" dev=0:1c 
> ino=13825028 scontext=unconfined_u:system_r:corosync_t:s0 
> tcontext=system_u:object_r:nfs_t:s0 tclass=dir
> Dec 13 12:10:04 h10_151 kernel: type=1400 audit(1323749404.252:625): avc:  
> denied  { read } for  pid=27242 comm="virsh" name="random" dev=devtmpfs 
> ino=3585 scontext=unconfined_u:system_r:corosync_t:s0 
> tcontext=system_u:object_r:random_device_t:s0 tclass=chr_file

You need to take a look at the SELinux context.

Regards,
Dan

>
> [root@h10_145 ~]# crm
> crm(live)# status
> 
> Last updated: Tue Dec 13 12:09:06 2011
> Stack: openais
> Current DC: h10_145 - partition with quorum
> Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> 
>
> Online: [ h10_151 h10_145 ]
>
>  test2  (ocf::heartbeat:VirtualDomain): Started h10_151 (unmanaged) FAILED
>  test1  (ocf::heartbeat:VirtualDomain): Started h10_145 (unmanaged) FAILED
>
> Failed actions:
>    test1_stop_0 (node=h10_145, call=19, rc=1, status=complete): unknown error
>    test2_stop_0 (node=h10_151, call=14, rc=1, status=complete): unknown error
>
> Best Regards,
>
>> -Original Message-
>> From: Arnold Krille [mailto:arn...@arnoldarts.de]
>> Sent: Monday, December 12, 2011 7:52 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] How to live migrate the kvm vm
>>
>> Hi,
>>
>> On Monday 12 December 2011 11:22:51 邱志刚 wrote:
>> > I have 2-node cluster of pacemaker,I want to migrate the kvm vm with
>> > command "migrate", but I found the vm isn't migrated, actually it is
>> > shutdown and then start on other node. I checked the log and found the
>> > vm is stopped but not migrated.
>>
>> > How could I live migrate the vm ? The configuration :
>> > crm(live)configure# show
>> > primitive test1 ocf:heartbeat:VirtualDomain \
>> >     params config="/etc/libvirt/qemu/test1.xml"
>> > hypervisor="qemu:///system" \
>> >     meta allow-migrate="ture" priority="100" target-role="Started"
>> > is-managed="true" \
>> >     op start interval="0" timeout="120s" \
>> >     op stop interval="0" timeout="120s" \
>> >     op monitor interval="10s" timeout="30s" depth="0" \
>> >     op migrate_from interval="0" timeout="120s" \
>> >     op migrate_to interval="0" timeout="120"
>>
>> I hope that "ture" is only a typo when writing the email. Otherwise its 
>> probably
>> the reason why your machine stop-start instead of a nice migration.
>> Try with 'allow-migrate="true"' and see if that helps.
>>
>> Have fun,
>>
>> Arnold
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] How to live migrate the kvm vm

2011-12-13 Thread Dan Frincu
Hi,

On Tue, Dec 13, 2011 at 11:13 AM, Qiu Zhigang  wrote:
> Hi,
>
>> -Original Message-----
>> From: Dan Frincu [mailto:df.clus...@gmail.com]
>> Sent: Tuesday, December 13, 2011 4:43 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] How to live migrate the kvm vm
>>
>> Hi,
>>
>> On Tue, Dec 13, 2011 at 6:11 AM, Qiu Zhigang 
>> wrote:
>> > Hi,
>> >
>> > Thank you, you are right, I correct the 'allow-migrate="true"', but now I 
>> > found
>> another problem when migrate, migrate failed.
>> > The following is the log.
>> >
>> > Dec 13 12:10:03 h10_151 kernel: type=1400 audit(1323749403.251:623):
>> > avc:  denied  { search } for  pid=27201 comm="virsh" name="libvirt"
>> > dev=dm-0 ino=2098071 scontext=unconfined_u:system_r:corosync_t:s0
>> > tcontext=system_u:object_r:virt_var_run_t:s0 tclass=dir Dec 13
>> > 12:10:04 h10_151 kernel: type=1400 audit(1323749404.067:624): avc:
>> > denied  { search } for  pid=27218 comm="VirtualDomain" name=""
>> > dev=0:1c ino=13825028 scontext=unconfined_u:system_r:corosync_t:s0
>> > tcontext=system_u:object_r:nfs_t:s0 tclass=dir Dec 13 12:10:04 h10_151
>> > kernel: type=1400 audit(1323749404.252:625): avc:  denied  { read }
>> > for  pid=27242 comm="virsh" name="random" dev=devtmpfs ino=3585
>> > scontext=unconfined_u:system_r:corosync_t:s0
>> > tcontext=system_u:object_r:random_device_t:s0 tclass=chr_file
>>
>> You need to take a look at the SELinux context.
>>
>> Regards,
>> Dan
>>
>
> I'm not familiar with SElinux context, but I have disabled selinux .
>
> [root@h10_151 ~]# cat /etc/sysconfig/selinux
>
> # This file controls the state of SELinux on the system.
> # SELINUX= can take one of these three values:
> #     enforcing - SELinux security policy is enforced.
> #     permissive - SELinux prints warnings instead of enforcing.
> #     disabled - No SELinux policy is loaded.
> SELINUX=disable
> # SELINUXTYPE= can take one of these two values:
> #     targeted - Targeted processes are protected,
> #     mls - Multi Level Security protection.
> SELINUXTYPE=targeted
>
> How can I solve this issue, or any other information you need to help me ?

Try getenforce on both nodes; it should return Disabled. If it
doesn't, make sure SELinux is really disabled on both nodes and then
reboot them. Note that your file says SELINUX=disable, while the valid
value (per the comments in that same file) is SELINUX=disabled.
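
In short, a quick check to run on each node:

    getenforce                                # should print: Disabled
    grep '^SELINUX=' /etc/sysconfig/selinux   # should print: SELINUX=disabled
    # anything else (e.g. SELINUX=disable) is not a valid value --
    # correct it and reboot the node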

HTH,
Dan

>
>
> Best Regards,
>
>> >
>> > [root@h10_145 ~]# crm
>> > crm(live)# status
>> > 
>> > Last updated: Tue Dec 13 12:09:06 2011
>> > Stack: openais
>> > Current DC: h10_145 - partition with quorum
>> > Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
>> > 2 Nodes configured, 2 expected votes
>> > 2 Resources configured.
>> > 
>> >
>> > Online: [ h10_151 h10_145 ]
>> >
>> >  test2  (ocf::heartbeat:VirtualDomain): Started h10_151 (unmanaged)
>> > FAILED
>> >  test1  (ocf::heartbeat:VirtualDomain): Started h10_145 (unmanaged)
>> > FAILED
>> >
>> > Failed actions:
>> >    test1_stop_0 (node=h10_145, call=19, rc=1, status=complete):
>> > unknown error
>> >    test2_stop_0 (node=h10_151, call=14, rc=1, status=complete):
>> > unknown error
>> >
>> > Best Regards,
>> >
>> >> -Original Message-
>> >> From: Arnold Krille [mailto:arn...@arnoldarts.de]
>> >> Sent: Monday, December 12, 2011 7:52 PM
>> >> To: The Pacemaker cluster resource manager
>> >> Subject: Re: [Pacemaker] How to live migrate the kvm vm
>> >>
>> >> Hi,
>> >>
>> >> On Monday 12 December 2011 11:22:51 邱志刚 wrote:
>> >> > I have 2-node cluster of pacemaker,I want to migrate the kvm vm
>> >> > with command "migrate", but I found the vm isn't migrated, actually
>> >> > it is shutdown and then start on other node. I checked the log and
>> >> > found the vm is stopped but not migrated.
>> >>
>> >> > How could I live migrate the vm ? The configuration :
>> >> > crm(live)configure# show
>> >> > primitive test1 ocf:heartbeat:VirtualDomain \
>> >> >     params config="/etc/libvirt/qemu/test1.xml"
>> >> > hypervisor="qemu:///system" \
>> >> >     meta allow-migrate="ture" priority="10

Re: [Pacemaker] Large cluster

2012-01-06 Thread Dan Frincu
Hi,

On Thu, Jan 5, 2012 at 6:43 PM, Graantik  wrote:
> Hi all,
>
> I have a task that I think can logically be implemented using a
> pacemaker/corosync cluster with many nodes (e.g. 15) and maybe thousand or
> more resources. Most of the resources are parametrized processes controlled
> by a custom resource agent. The resources are added and removed dynamically,
> typically many (e.g. 100) at one time.
>
> My first tests in a VM environment show that - even after some tuning of
> lrmd max-children and custom-batch-limit, optimizing the RA and having the
> processes idle - adding so many resources in one step (xml based) appears to
> bring the cluster to its knees, i.e. nodes become unresponsive, DC and other
> nodes have very high load, and the operation takes an hour or longer.
>
> Does this mean that the design limit of this software/hardware is reached or
> are there ways like tuning or best practices to make such a scenario work?

In terms of performance testing on large clusters there is an article
that may be interesting to read
http://theclusterguy.clusterlabs.org/post/1241986422/large-cluster-performance

In the article it talks about using 10,000 resources, which is higher
than your use case; you can compare the timings you measured with the
ones presented there and extrapolate.

Bear in mind that when dealing with so many resources and nodes it
might help to tweak certain things, such as the maximum message size
for corosync (the article mentions using 256k). Timeouts such as the
corosync token might have to be increased, as high load on the systems
may delay replies in network traffic. Also, having to sync the CIB
onto ~15 nodes as you mentioned means that you _should_ use multicast,
switches must support IGMP snooping and have it enabled and properly
configured, and the entire cluster should be in a separate VLAN, or
have some form of dedicated network, to ensure not only throughput but
also low latency and to prevent interference from other network
traffic, etc.

>
> Are there known implementations of comparable size?

In terms of nodes, the largest I know of are clusters of ~10-12
nodes; in terms of resources, none that I know of.

HTH,
Dan

>
> Thanks
> Gerhard
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] large cluster design questions

2012-01-06 Thread Dan Frincu
I suggested splitting the cluster by purpose; this way, on the MySQL
nodes you install and configure what MySQL needs, but you don't do
the same on the rest of the nodes.

One other thing, as I see it, you want an N-to-N cluster, with any one
service being able to run on any node and to failover to any node.
Consider all of the services that need coordinated access to data, now
consider any node in the cluster can possibly run that service, which
further along means that you need all the nodes to have access to the
same shared data, so you're talking about a GFS2/OCFS2 cluster
spanning 45 nodes. I know I have an knack on stating the obvious, but
people most of the time say one thing and think another, so when you
reply with what they say, then all of a sudden when someone else other
than you says it, it sheds a different light on the matter.

Bottom line, split the nodes into clusters that match a common purpose.

There's bound to be more input on the matter; this is just my opinion.

HTH,
Dan

[1] http://oss.clusterlabs.org/pipermail/pacemaker/2012-January/012639.html

>
> Many thanks for your thoughts on this,
> Christian.
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] syslog full of redundand link messages

2012-01-09 Thread Dan Frincu
Hi,

On Sun, Jan 8, 2012 at 1:59 AM, Attila Megyeri
 wrote:
> Hi All,
>
>
>
> My syslogs are full of messages like this:
>
>
>
> Jan  7 23:55:47 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
> Jan  7 23:55:48 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
> Jan  7 23:55:48 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
> Jan  7 23:55:48 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
> Jan  7 23:55:49 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
> Jan  7 23:55:49 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
> Jan  7 23:55:49 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
> Jan  7 23:55:50 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
> Jan  7 23:55:50 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
> Jan  7 23:55:50 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
> Jan  7 23:55:51 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
> Jan  7 23:55:51 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
> Jan  7 23:55:51 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
> Jan  7 23:55:52 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
> Jan  7 23:55:52 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
> Jan  7 23:55:52 oa2 corosync[362]:   [TOTEM ] received message requesting
> test of ring now active
>
>
>
>
>
> What could be the reason for this?
>
>
>
>
>
> Pacemaker 1.1.6, Corosync 1.4.2
>
>
>
>
>
> The relevant part of the config:
>
>
>
> Eth0 is ont he 10.100.1.X subnet, eth1 is 192.168.100.X
>
>
>
>
>
>
>
>
>
> totem {
>
>     version: 2
>
>     secauth: off
>
>     threads: 0
>
>     rrp_mode: passive
>
>     interface {
>
>     ringnumber: 0
>
>     bindnetaddr: 10.100.1.255
>
>     mcastaddr: 226.100.40.1
>
>     mcastport: 4000
>
>     }
>
>     interface {
>
>     ringnumber: 1
>
>     bindnetaddr: 192.168.100.255
>
>     mcastaddr: 226.101.40.1
>
>     mcastport: 4000
>
>     }
>

Are the subnets /24 or larger (/23, /22, etc.)? From what I can see
you're using what would be the broadcast address on a /24 subnet as
bindnetaddr, and that may cause issues.
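
For a /24, bindnetaddr should normally be the network address
instead, i.e.:

    interface {
        ringnumber: 0
        bindnetaddr: 10.100.1.0      # network address of 10.100.1.0/24
        ...
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.100.0   # network address of 192.168.100.0/24
        ...
    }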

>
>
>
>
> }
>
>
>
>
>
> Thanks,
>
>
>
> Attila
>
>
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Cannot Create Primitive in CRM Shell

2012-01-09 Thread Dan Frincu
Hi,

On Fri, Jan 6, 2012 at 11:24 PM, Andrew Martin  wrote:
> Hello,
>
> I am working with DRBD + Heartbeat + Pacemaker to create a 2-node
> highly-available cluster. I have been following this official guide on
> DRBD's website for configuring all of the components:
> http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf
>
> However, once I go to configure the primitives in pacemaker's CRM shell
> (section 4.1 in the PDF above) I am unable to create the primitive. For
> example, I enter the following configuration for a DRBD device called
> "drive":
> primitive p_drbd_drive \
>
>   ocf:linbit:drbd \
>
>   params drbd_resource="drive" \
>
>   op monitor interval="15" role="Master" \
>
>   op monitor interval="30" role="Slave"
>
> After entering all of these lines I hit enter and nothing is returned - it
> appears frozen and I am never returned to the "crm(live)configure# " shell.
> An strace of the process does not reveal any obvious blocks. I have also
> tried entering the entire configuration on a single line with the same
> result.

I would recommend going through this guide first
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/

>
> What can I try to debug this and move forward with configuring pacemaker? Is
> there a command I can use to completely clear out pacemaker to perhaps start
> fresh?

crm configure erase

It will however do what it says, so use it with caution, you have been warned.
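
If you do go that route, save a copy of the current configuration
first, e.g.:

    crm configure save /tmp/cib-backup.txt
    # or, for the raw XML: cibadmin --query > /tmp/cib-backup.xml
    crm configure erase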

>
> Thanks,
>
> Andrew
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Cannot Create Primitive in CRM Shell

2012-01-09 Thread Dan Frincu
Hi,

On Mon, Jan 9, 2012 at 1:44 PM, Florian Haas  wrote:
> On Mon, Jan 9, 2012 at 11:42 AM, Dan Frincu  wrote:
>> Hi,
>>
>> On Fri, Jan 6, 2012 at 11:24 PM, Andrew Martin  wrote:
>>> Hello,
>>>
>>> I am working with DRBD + Heartbeat + Pacemaker to create a 2-node
>>> highly-available cluster. I have been following this official guide on
>>> DRBD's website for configuring all of the components:
>>> http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf
>>>
>>> However, once I go to configure the primitives in pacemaker's CRM shell
>>> (section 4.1 in the PDF above) I am unable to create the primitive. For
>>> example, I enter the following configuration for a DRBD device called
>>> "drive":
>>> primitive p_drbd_drive \
>>>
>>>   ocf:linbit:drbd \
>>>
>>>   params drbd_resource="drive" \
>>>
>>>   op monitor interval="15" role="Master" \
>>>
>>>   op monitor interval="30" role="Slave"
>>>
>>> After entering all of these lines I hit enter and nothing is returned - it
>>> appears frozen and I am never returned to the "crm(live)configure# " shell.
>>> An strace of the process does not reveal any obvious blocks. I have also
>>> tried entering the entire configuration on a single line with the same
>>> result.
>>
>> I would recommend going through this guide first
>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/
>
> That's a bit of a knee-jerk response if I may say so, and when I wrote
> those guides[1] the intention was specifically that people could
> peruse them _without_ first having to check the documentation that
> covers the configuration internals.

I apologize if it came through as a "knee-jerk response" on my part;
when I don't understand the technology I work with, I look at the
docs, which is why I always point others to the documentation as well.

I have followed the referenced tech guides many times and I'm not in
any way implying that they shouldn't be followed to the letter; I've
explained in my previous statement why I recommend the docs.

Sorry for the noise.

>
> At any rate, Andrew, if your crm shell is freezing up when you're
> simply trying to add a primitive, something must be seriously awry in
> your setup -- it's something that I've not run into personally, unless
> the cluster was already responding to an error state on one of the
> nodes. Are you sure your cluster is behaving OK otherwise? Are you
> getting meaningful output from "crm_mon -1"? Does your cluster report
> it has successfully elected a DC?
>
> Cheers,
> Florian
>
> [1] Which I did while employed by Linbit, which is no longer the case,
> as they have asked I point out. http://wp.me/p4XzQ-bN
>
> --
> Need help with High Availability?
> http://www.hastexo.com/now
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] rsc_ticket and 1.2 rng

2012-01-16 Thread Dan Frincu
Hi,

On Mon, Jan 16, 2012 at 5:58 PM, Vladislav Bogdanov
 wrote:
> Hi Andrew,
>
> is it intentional that 1.2 schema which is now default misses rsc_ticket
> which is now not only works but even well documented by suse?

Sorry to barge in, but there is a pull request related to this issue.

https://github.com/ClusterLabs/pacemaker/pull/6

HTH,
Dan

>
> Vladislav
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Best setup for lots and lots of IPs

2012-01-20 Thread Dan Frincu
Hi,

On Thu, Jan 19, 2012 at 9:49 PM, Anton Melser  wrote:
> Hi,
> I want to set up a very simple NAT device for natting around 2000
> internal /24 networks to around 2000 external IPs (1 /24 = 1 public
> IP). That part works fine (and is *extremely* efficient, I have it on
> a pretty powerful machine but cpu is 0% with 2gbps going through!)
> with iproute2 and iptables. I want it to have some failover though...
> I am discovering everything here (including iproute2 and iptables),
> and someone suggested I look at corosync + pacemaker. I did the
> tutorial (btw if I end up using this I'll translate it into French if
> you would like) and things seemed to work fine for a few IPs...
> However, my
>
> crm configure primitive ClusterIP.ABC ocf:heartbeat:IPaddr2 params
> ip=10.A.B.C cidr_netmask=32 op monitor interval=120s
>
> commands started to slow down around 200 IPs and then to a crawl at
> 500-600 or so. It got to around 1000 before I stopped the VMs I was
> testing on to move them onto a much more powerful VM host. It is
> taking an absolute age to get back up again. This may be normal, and
> there may be no way around it with any decent solution - I simply have
> no idea.
> Am I trying to achieve something with the wrong tools here? I don't
> need any sort of connection tracking or anything - we can handle up to
> even maybe 5 minutes of downtime (as long as it's not regularly
> happening). The need is relatively simple but the numbers of
> networks/IPs may make this unwieldy using these tools.
> Any pointers?

There are a couple of performance related topics that you can look at
for further reference.

http://www.gossamer-threads.com/lists/linuxha/pacemaker/77382?do=post_view_threaded
http://www.gossamer-threads.com/lists/linuxha/pacemaker/77384?do=post_view_threaded

However, the way I see it, in your scenario I would take another
approach. Mind you, this is just an opinion on the matter, nothing
else, but I would either update the IPaddr2 script or create a new one
based on it, which would either:

a) take 1000 parameters (and internally do a for loop, because I'd
rather have 1 script with 1000 parameters than 1000 scripts with 1
parameter)

b) (based on the use case of 2000 IP's I'd guess you have at least a
/21 public subnet available - or even larger - and based on good
practice I'd also guess these IP's are given from a continuous range,
in which case the script would) take a start IP and end IP as
parameters, and perform a for loop for the resulting range (thus using
only 2 parameters for the IP definition, and the other parameters I've
seen in the example were netmask and monitoring interval, a grand
total of 4).

From my point of view, such a high number of resources in a Pacemaker
cluster for the sole purpose of adding/removing IP addresses is
overkill, and another solution, such as the one I suggested, makes more
sense. Of course, I went on the assumption that all of these IP's are
either needed all together or not at all, but even if this is not the
case, I doubt you need individual rules per IP, more along the line of
needing to control a large range + some corner cases with individual
assignments, the latter being possible with IPaddr2 just as usual
whilst keeping the total number of resources significantly lower.
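
To illustrate option b), the start action of such an agent could boil
down to something like this (a minimal sketch assuming the range fits
in one /24; start_ip and end_ip are hypothetical new parameters, and
error handling plus the matching stop/monitor logic are omitted):

    # derive the last-octet range from the assumed start_ip/end_ip parameters
    prefix=$(echo "$OCF_RESKEY_start_ip" | cut -d. -f1-3)
    first=$(echo "$OCF_RESKEY_start_ip" | cut -d. -f4)
    last=$(echo "$OCF_RESKEY_end_ip" | cut -d. -f4)
    for i in $(seq "$first" "$last"); do
        # nic and cidr_netmask reused from the IPaddr2 parameter set
        ip addr add "$prefix.$i/$OCF_RESKEY_cidr_netmask" dev "$OCF_RESKEY_nic"
    done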

The problem with 1000 resources is that when going into the monitoring
part, you can only monitor $LRMD_MAX_CHILDREN resources at a time
(which by default is 4), so you can increase this number and have n
monitor operations run in parallel. You'll have to see how the
timeouts hold up with the increased concurrency and whether the
larger number of parallel monitor operations has a negative effect
on performance.
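
If you do end up with many resources anyway, the lrmd limit can be
raised at runtime with cluster-glue's lrmadmin (30 is just an example
value):

    lrmadmin -p max-children 30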

HTH,
Dan

> Thanks heaps,
> Anton
>
> --
> echo '16i[q]sa[ln0=aln100%Pln100/snlbx]sbA0D4D465452snlbxq' | dc
> This will help you for 99.9% of your problems ...
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] corosync vs. pacemaker 1.1

2012-01-26 Thread Dan Frincu
Hi,

On Wed, Jan 25, 2012 at 5:08 PM, Kiss Bence  wrote:
> Hi,
>
> I am newbie to the clustering and I am trying to build a two node
> active/passive cluster based upon the documentation:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>
> My systems are Fedora 14, uptodate. After forming the cluster as wrote, I
> started to test it. (resources: drbd-> lvm-> fs ->group of services)
> Resources moved around, nodes rebooted and killed (first I tried it in
> virtual environment then also on real machines).
>
> After some events the two nodes ended up in a kind of state of split-brain.
> The crm_mon showed me that the other node is offline at both nodes although
> the drbd subsystem showed everything in sync and working. The network was
> not the issue (ping, tcp and udp communications were fine). Nothing changed
> from the network view.
>
> At first the rejoining took place quite well, but some more events after it
> took longer and after more event it didn't. The network dump showed me the
> multicast packets still coming and going. At corosync (crm_node -l) the
> other node didn't appeared both on them. After trying configuring the cib
> logs was full of messages like ": not in our membership".
>
> I tried to erase the config (crm configure erase, cibadmin -E -f) but it
> worked only locally. I noticed that the pacemaker process didn't started up
> normally on the node that was booting after the other. I also tried to
> remove files from /var/lib/pengine/ and /var/lib/hearbeat/crm/ but only the
> resources are gone. It didn't help on forming a cluster without resources.
> The pacemaker process exited some 20 minutes after it started. Manual
> starting was the same.
>
> After digging into google for answers I found nothing helpful. From running
> tips I changed in the /etc/corosync/service.d/pcmk file the version to 1.1
> (this is the version of the pacemaker in this distro). I realized that the
> cluster processes were startup from corosync itself not by pacemaker. Which
> could be omitted. The cluster forming is stable after this change even after
> many many events.
>
> Now I reread the document mentioned above, and I wonder why it wrote the
> "Important notice" on page 37. What is wrong theoretically with my scenario?
> Why does it work? Why didn't the config suggested by the document work?
>
> Tests were done firsth on virtual machines of a Fedora 14 (1 CPU core, 512Mb
> ram, 10G disk, 1G drbd on logical volume, physical  volume on drbd forming
> volgroup named cluster.)/node.
>
> Then on real machines. They have more cpu cores (4), more RAM (4G) and more
> disk (mirrored 750G), 180G drbd, and 100M garanteed routed link between the
> nodes 5 hops away.
>
> By the way how should one configure the corosync to work on multicast routed
> network? I had to create an openvpn tap link between the real nodes for
> it to work. The original config with public IPs didn't work. Is corosync
> equipped to cope with multicast PIM messages? Or was it a firewall
> issue.

First question, what versions of software are on each of the nodes?

When using multicast, corosync doesn't care about "routing" the
messages AFAIK; it relies on the network layer to do its job. Now the
"split-brain" you mention can take place due to network interruption,
or due to missing or untested fencing as well.

Second question, do you have fencing configured?

You've mentioned 2(?) nodes "5 hops away", I'm guessing they're not in
the same datacenter. If so, did you also test the latency on the
network between endpoints? Also can you make sure PIM routing is
enabled on all of the "hops" along the way?

Your scenario seems to be a split-site, so you may be interested in
https://github.com/jjzhang/booth as well.
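
Also, note that the ver: field in /etc/corosync/service.d/pcmk only
takes 0 or 1 -- it is not the Pacemaker version number. The block
normally looks like this:

    service {
        name: pacemaker
        # 0: corosync spawns the pacemaker daemons itself
        # 1: pacemaker is started separately, after corosync
        ver: 1
    }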

Regards,
Dan

>
> Thanks in advance,
> Bence
>
> ___________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] MySQL Master-Master replication with Corosync and Pacemaker

2012-01-26 Thread Dan Frincu
Hi,

On Thu, Jan 26, 2012 at 1:43 AM, Peter Scott  wrote:
> Hello.  Our problem is that a Corosync restart on the idle machine in a
> 2-node cluster shutds down the mysqld process there and we need it to stay
> up for replication.  We are very new to Corosync and Pacemaker and have been
> slogging through every tutorial and document we can find.
>
> Here's the detail: We have two MySQL comasters (each is a master and a slave
> of the other).  Traffic needs to arrive at only one machine at a time
> because otherwise conflicting simultaneous updates at each machine would
> cause a problem.  There is a single IP for clients (192.168.185.50, see
> below).
>
> After much sweating, we came up with the configuration below.  It works: if
> we kill the machine that's in use we see it switch to the other one.  MySQL
> connections are seamlessly rerouted.
>
> The problem is this: Say that dev-mysql01 is the active node.  If we restart
> Corosync on dev-mysql02, it stops mysqld there and does not restart it.  We
> can of course restart it manually but we want to understand why this is
> happening because it surprises us and maybe there are other circumstances
> under which it would either stop mysqld or fail to reatart it.

Corosync is the first layer in the cluster stack (membership and
messaging), Pacemaker is the second layer (cluster resource
management), your services are on the third layer.

If you take down the bottom layer, which provides the communication,
the upper layers have no way to talk to the rest of the cluster.

Bottom line, when services are controlled by the cluster and through
manual intervention the processes that control them are stopped,
everything under their control stops as well.

If this is intended for administrative purposes, follow Florian's advice.
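
For completeness, one way to do that from the crm shell is to take the
resource out of cluster control for the duration of the restart (a
sketch only, not necessarily what Florian suggested):

    crm resource unmanage mysql    # cluster stops starting/stopping mysqld
    # ... restart corosync on the idle node ...
    crm resource manage mysql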

HTH,
Dan

>
> mysqld has to run on the inactive machine so that the active one can
> replicate all the transactions there, so that if the active one goes down
> the inactive one can come up in the current state.
>
> Why is a Corosync restart stopping mysqld?
>
> Here's our configuration:
>
> node dev-mysql01
> node dev-mysql02
> primitive DBIP ocf:heartbeat:IPaddr2 \
>        params ip="192.168.185.50" cidr_netmask="24" \
>        op monitor interval="30s"
> primitive mysql ocf:heartbeat:mysql \
>        params binary="/usr/bin/mysqld_safe" config="/etc/my.cnf"
> datadir="/var/lib/mysql" user="mysql" pid="/var/run/mysqld/mysqld.pid"
> socket="/var/lib/mysql/mysql.sock" test_passwd="secret"
> test_table="lbcheck.lbcheck" test_user="lbcheck" \
>        op monitor interval="20s" timeout="10s" \
>        meta migration-threshold="10"
> group mysql_group DBIP mysql
> location master-prefer-node1 mysql_group 50: dev-mysql01
> property $id="cib-bootstrap-options" \
>        dc-version="1.1.2-f059ec7ced7a86ff4a0b963bccfe" \
>        cluster-infrastructure="openais" \
>        expected-quorum-votes="2" \
>        stonith-enabled="false" \
>        no-quorum-policy="ignore"
> rsc_defaults $id="rsc-options" \
>        resource-stickiness="100"
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Doc: Utilization and Placement Strategy

2012-02-08 Thread Dan Frincu
Hi,

On Wed, Feb 8, 2012 at 1:11 PM, Gao,Yan  wrote:
> Hi,
>
> The feature "Utilization and Placement Strategy" has been provided for
> quite some time. But it still missing a documentation. (Florian reminded
> us, thanks a lot!).
>
> The attached documentation are based on a blog by Andrew and the
> material from SUSE HAE guide written by Tanja Roth and Thomas Schraitle.
> I added the details about the resource allocation strategy.
>
> One is crm shell syntax version, the other is XML syntax version for
> "Pacemaker_Explained".
>
> If you are interested, please help review it. Any comments or revisions
> are welcome and appreciated!

I've reviewed both files and made some minor additions and fixed a
couple of typos; other than that it looks great.

One question though, shouldn't these have been in Docbook format?

Regards,
Dan

>
> Regards,
>  Gao,Yan
> --
> Gao,Yan 
> Software Engineer
> China Server Team, SUSE.
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Dan Frincu
CCNA, RHCE


utilization-and-placement-strategy
Description: Binary data


utilization-and-placement-strategy-crm-shell
Description: Binary data
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Problem during Cluster Upgrade

2012-04-12 Thread Dan Frincu
Hi,

On Wed, Apr 11, 2012 at 5:26 PM, Karl Rößmann  wrote:
> Hi all,
>
> I'm upgrading a three node cluster from SLES 11 SP1 to SLES SP2 node by
> node.
> the upgrade includes:
>         corosync-1.3.3-0.3.1     to corosync-1.4.1-0.13.1
>         pacemaker-1.1.5-5.9.11.1 to pacemaker-1.1.6-1.27.26
>         kernel 2.6.32.54-0.3-xen to 3.0.13-0.27-xen
>
> After Upgrading the first node and restarting the cluster I get these
> never ending messages on the DC (which is not the updated node)
>
> Apr 11 14:19:26 orion14 corosync[6865]:   [TOTEM ] Type of received message
> is wrong...  ignoring 6.
> Apr 11 14:19:27 orion14 corosync[6865]:   [TOTEM ] Type of received message
> is wrong...  ignoring 6.
> Apr 11 14:19:28 orion14 corosync[6865]:   [TOTEM ] Type of received message
> is wrong...  ignoring 6.
> Apr 11 14:19:29 orion14 corosync[6865]:   [TOTEM ] Type of received message
> is wrong...  ignoring 6.

I think the question relates more to corosync, added the proper group in CC.

>
> the updated node is still in STANDBY mode.
> Should I ignore the message and put the mode to ONLINE ?
> I don't want the cluster to crash, there are running services
> on the other two nodes.
> So now I stopped the openais on the updated node: no more messages.
> the other two nodes are still up and working.
>
> Any ideas ?

I don't know exactly if a rolling upgrade is possible (I may be wrong
on this one) but putting the cluster in maintenance-mode, upgrading
corosync and pacemaker on all 3 nodes and then re-probing for the
resources is a more common upgrade path. If there are no issues on the
reprobe, then you could take the cluster out of maintenance-mode.
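
In crm shell terms, roughly:

    crm configure property maintenance-mode=true
    # upgrade corosync + pacemaker on each node, restart the stack
    crm resource reprobe
    # if crm_mon shows all resources clean afterwards:
    crm configure property maintenance-mode=false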

Also, do you have a support contract with Suse? I think their support
can help out more on this.

HTH,
Dan

>
> Karl
>
>
>
> --
> Karl Rößmann                            Tel. +49-711-689-1657
> Max-Planck-Institut FKF                 Fax. +49-711-689-1632
> Postfach 800 665
> 70506 Stuttgart                         email k.roessm...@fkf.mpg.de
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] start/stop operations fail to happen in parallel on resources

2012-04-19 Thread Dan Frincu
Hi,

On Thu, Apr 19, 2012 at 2:22 PM, Parshvi  wrote:
> Observations:
> max-children=30
> total no. of resources=18
>
> 1) At a default value 4 of max-children, following logs were observed
> that led to monitor op’s timeout for some resources (a total of 18 rscs):
>  a. “max_child_count (4) reached, postponing execution of operation monitor”
>  b. “WARN: perform_ra_op: the operation operation monitor[18] on
> ocf::IPaddr2::ClusterIP for client 3754, stayed in operation list for
> 14100 ms (longer than 10000 ms)”
>  c. SOLUTION: the max-children of lrmd was raised to 30.
>  d. ISSUES STILL OBSERVED: while 2-3 resources are stuck in start operation,
> if a rsc is issued an explicit start command `crm resource start rcs1`, then 
> the
> start op on this rsc is delayed until any one of the previous resources exit
> from their start operation.

What version of Pacemaker?

>
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Corosync service taking 100% cpu and is unable to stop gracefully

2012-04-19 Thread Dan Frincu
Hi,

On Thu, Apr 19, 2012 at 2:11 PM, Parshvi  wrote:
> Major issues:
> 1) Corosync reaching over 100% cpu usage.
> 2) Corosync unable to stop gracefully.
> 3) Virtual IP of a resources being assigned as the primary IP on a interface,
> after a cable disconnect/reconnect on that interface. The static IP on the
> interface shown as global secondary IP.
>
> Use case:
> 1) Two nodes in a cluster.
> 2) Two communication paths exists between the two nodes, with “rrp_mode” set 
> to
> active in corosync.conf

Are both links of the same speed?

>  a. One path is a back-to-back connection between the nodes.
>  b. Second is  via the LAN network  switch.
> 3) The network cable was unplugged on one of the nodes for a while (on both 
> the
> interfaces). It was reconnected after a short while.
>
> Observations:
> 1) Corosync service was taking 100% cpu on the node whose link was down:

What version of Corosync? What OS?

>  a. In the above scenario Corosync service could not be stopped gracefully. A
> SIGKILL had to be issued to stop the service.
>  b. On this node, of the two interfaces configured in corosync.conf, one was
> being used for the Virtual IP’s preferred eth.
>    i. It was observed that when the link was up after a disconnection, the
> primary global IP on that interface was the Virtual IP configured for a
> resource.
>    ii. The static IP assigned to the interface was listed as “scope global
> secondary” in the output of `ip addr show`.
>    iii. Also the Virtual IP of the resources configured in pacemaker were
> active on both the nodes.

Can you pastebin.com your crm configure show?

>    iv. `service network restart` also did not work.
>  c. Coroysnc service was stopped (Killed since it could not be stopped), the
> network service was re-started and then corosync was re-started. All good 
> after
> this.
>
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crm_mon on Node-2 shows both Node-1 & Node-2 as online but crm_mon on Node-1 shows Node-2 as offline

2012-04-19 Thread Dan Frincu
Hi,

On Thu, Apr 19, 2012 at 3:56 PM, Parshvi  wrote:
> 1) What is the use of ssh without pass key between cluster nodes in pacemaker 
> ?
>  a. Use case:
>    i. Two nodes in a cluster (Call them Node-1 and Node-2)
>    ii. One interface configured in corosync.conf for its heartbeat or
> messaging. Eg. Bind net addr : 192.168.10.0
>    iii. Another interface configured in /etc/hosts for hostname resolution.
>    Eg. IP: 192.168.129.10 Hostname: Node-1
>    Eg. IP: 192.168.129.11 Hostname: Node-2
>    iv. Hence for all ssh communication between the two nodes, hostname 
> resolves
> to subnet 129 address.
>    v. 12 services configured in active/passive mode
>    vi. 1 service configured in master/slave mode
>    vii. 8 services are non-sticky (they failback) in active/passive
>    viii. 4 services are sticky (do not failback) in active/passive
>    ix. Distribution: Node-1 is primary for 8 services (of which 4 are non-
> sticky), Node-2 is preferred for 4 services of a total 12 (non-sticky)
>
>  b. Observations:
>    i. On Node-2, the interface was down over which IP: 192.168.129.11 
> Hostname:
> Node-2 was configured.
>    ii. On Node-1 all interfaces were up.
>    iii. Interface used by corosync for hearbeat/messaging was up at all times
> (Bind net addr : 192.168.10.0)
>    iv. In crm_mon: Node-1 sees Node-2 as offline
>        cibadmin --query fails to work (remote node did not respond)
>    v. In crm_mon: Node-2 sees Node-1 as online
>    vi. All the services were seen active on Node-1 (including those that were
> preferred for Node-2). Observed in crm_mon output.
>    vii. 4 services for which Node-2 was preferred were seen active Node-2 also
> (hence 4 services active on both the nodes).
>    Observed in crm_mon output: Only 4 services were shown active, the status 
> of
> the rest of the services active on Node-1 did not reflect in crm_mon
>    Even though crm_mon on Node-2 sees Node-1 as “online”.
>  c. Errors in log file:
>    i. On Node-2:
>      1. Resource ocf::RscRA:rsc appears to be active on 2 nodes
>      2. The above error appears for all the resources configured in pacemaker.
>
>
> Query:
> 1) For what purpose does Pacemaker require “ssh without a pass key” to be
> enabled between the nodes in a cluster ?

scp

> 2) For what purpose does Pacemaker use Node “hostname” for ? how Node 
> “hostname”
> come into picture ?

When choosing where to allocate resources not explicitly tied to a node. See

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#node-score-equal

and

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#_background

> 3) Let’s say in a two node cluster two communication paths are available 
> between
> the two nodes.
>  a. Eth1 and eth2.
>  b. The hostname of the node resolves to IP Address on eth1.
>  c. Consider, eth1 (network cable disconnected) goes down.
>  d. Eth2 is up, but hostname does not resolve to the IP on eth2 (resolves to
> eth1 addr).

Inter-node communication is usually specified by IP address, and
redundant connections (as in your case) are recommended.

>  e. Will this (hostname) have any issue ?
>
>
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Corosync service taking 100% cpu and is unable to stop gracefully

2012-04-19 Thread Dan Frincu
On Thu, Apr 19, 2012 at 4:14 PM, Parshvi  wrote:
> Dan Frincu  writes:
>
>>
>> Hi,
>>
>> On Thu, Apr 19, 2012 at 2:11 PM, Parshvi  gmail.com> wrote:
>> > Major issues:
>> > 1) Corosync reaching over 100% cpu usage.
>> > 2) Corosync unable to stop gracefully.
>> > 3) Virtual IP of a resources being assigned as the primary IP on a
> interface,
>> > after a cable disconnect/reconnect on that interface. The static IP on the
>> > interface shown as global secondary IP.
>> >
>> > Use case:
>> > 1) Two nodes in a cluster.
>> > 2) Two communication paths exists between the two nodes, with “rrp_mode” 
>> > set
> to
>> > active in corosync.conf
>>
>> Are both links of the same speed?
> yes. speed of each: 1000Mb/s
>>
>> >  a. One path is a back-to-back connection between the nodes.
>> >  b. Second is  via the LAN network  switch.
>> > 3) The network cable was unplugged on one of the nodes for a while (on both
> the
>> > interfaces). It was reconnected after a short while.
>> >
>> > Observations:
>> > 1) Corosync service was taking 100% cpu on the node whose link was down:
>>
>> What version of Corosync? What OS?
> Corosync Cluster Engine, version '1.2.7' SVN revision '3008'
> OEL (Oracle Enterprise Linux release 5.6)

You need a newer version of Corosync: for redundant rings to work at
all, 1.3.x or higher; for self-healing redundant rings, 1.4.x.

>>
>
>> Can you pastebin.com your crm configure show?
> would do that in a followup mail.
>
> Thanks for a quick response Dan.
>
> Here is a snapshot of top:
>
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  4726 root      RT   0  201m 5576 2004 R 100.4  0.1  36:35.31 corosync
>
> Logs and core file have been saved and can be posted if required.
> My response inline.
>
>
>
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crm_mon on Node-2 shows both Node-1 & Node-2 as online but crm_mon on Node-1 shows Node-2 as offline

2012-04-20 Thread Dan Frincu
On Fri, Apr 20, 2012 at 3:09 AM, Andrew Beekhof  wrote:
> On Thu, Apr 19, 2012 at 11:51 PM, Dan Frincu  wrote:
>> Hi,
>>
>> On Thu, Apr 19, 2012 at 3:56 PM, Parshvi  wrote:
>>> 1) What is the use of ssh without pass key between cluster nodes in 
>>> pacemaker ?
>>>  a. Use case:
>>>    i. Two nodes in a cluster (Call them Node-1 and Node-2)
>>>    ii. One interface configured in corosync.conf for its heartbeat or
>>> messaging. Eg. Bind net addr : 192.168.10.0
>>>    iii. Another interface configured in /etc/hosts for hostname resolution.
>>>    Eg. IP: 192.168.129.10 Hostname: Node-1
>>>    Eg. IP: 192.168.129.11 Hostname: Node-2
>>>    iv. Hence for all ssh communication between the two nodes, hostname 
>>> resolves
>>> to subnet 129 address.
>>>    v. 12 services configured in active/passive mode
>>>    vi. 1 service configured in master/slave mode
>>>    vii. 8 services are non-sticky (they failback) in active/passive
>>>    viii. 4 services are sticky (do not failback) in active/passive
>>>    ix. Distribution: Node-1 is primary for 8 services (of which 4 are non-
>>> sticky), Node-2 is preferred for 4 services of a total 12 (non-sticky)
>>>
>>>  b. Observations:
>>>    i. On Node-2, the interface was down over which IP: 192.168.129.11 
>>> Hostname:
>>> Node-2 was configured.
>>>    ii. On Node-1 all interfaces were up.
>>>    iii. Interface used by corosync for hearbeat/messaging was up at all 
>>> times
>>> (Bind net addr : 192.168.10.0)
>>>    iv. In crm_mon: Node-1 sees Node-2 as offline
>>>        cibadmin --query fails to work (remote node did not respond)
>>>    v. In crm_mon: Node-2 sees Node-1 as online
>>>    vi. All the services were seen active on Node-1 (including those that 
>>> were
>>> preferred for Node-2). Observed in crm_mon output.
>>>    vii. 4 services for which Node-2 was preferred were seen active Node-2 
>>> also
>>> (hence 4 services active on both the nodes).
>>>    Observed in crm_mon output: Only 4 services were shown active, the 
>>> status of
>>> the rest of the services active on Node-1 did not reflect in crm_mon
>>>    Even though crm_mon on Node-2 sees Node-1 as “online”.
>>>  c. Errors in log file:
>>>    i. On Node-2:
>>>      1. Resource ocf::RscRA:rsc appears to be active on 2 nodes
>>>      2. The above error appears for all the resources configured in 
>>> pacemaker.
>>>
>>>
>>> Query:
>>> 1) For what purpose does Pacemaker require “ssh without a pass key” to be
>>> enabled between the nodes in a cluster ?
>>
>> scp
>
> But pacemaker doesn't use scp... or is this in relation to the
> clusters from scratch document?

It's in relation to the Clusters from Scratch document.

> -ECONFUSED

Sorry about that ;)

>
>>
>>> 2) What does Pacemaker use the node "hostname" for? How does the node
>>> "hostname" come into the picture?
>>
>> When choosing where to allocate resources not explicitly tied to a node. See
>>
>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#node-score-equal
>>
>> and
>>
>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#_background
>>
>>> 3) Let’s say in a two node cluster two communication paths are available 
>>> between
>>> the two nodes.
>>>  a. Eth1 and eth2.
>>>  b. The hostname of the node resolves to IP Address on eth1.
>>>  c. Consider, eth1 (network cable disconnected) goes down.
>>>  d. Eth2 is up, but hostname does not resolve to the IP on eth2 (resolves to
>>> eth1 addr).
>>
>> Inter-node communication is usually specified by IP address, and
>> redundant connections (as in your case) are recommended.
>>
>>>  e. Will this (hostname) have any issue ?
>>>
>>>
>>>
>>>
>>> ___
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>> --
>> Dan Frincu
>> CCNA, RHCE
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org

Re: [Pacemaker] Corosync / Pacemaker Cluster crashing

2012-04-20 Thread Dan Frincu
be_complete=true: cib not
> connected
> Apr 20 10:54:38 lxdcv01nd01.bauer-uk.bauermedia.group crmd: [22450]: info:
> do_lrm_rsc_op: Performing key=9:5:0:e6a3b9c7-c24d-497a-9c07-d6082ee231a9
> op=lcdcv01_stop_0 )
> Apr 20 10:54:38 lxdcv01nd01.bauer-uk.bauermedia.group lrmd: [22447]: info:
> rsc:lcdcv01:4: stop
> Apr 20 10:54:38 lxdcv01nd01.bauer-uk.bauermedia.group attrd: [22448]: info:
> attrd_trigger_update: Sending flush op to all hosts for: probe_complete
> (true)
> Apr 20 10:54:38 lxdcv01nd01.bauer-uk.bauermedia.group attrd: [22448]: info:
> attrd_perform_update: Delaying operation probe_complete=true: cib not
> connected
> Apr 20 10:54:38 lxdcv01nd01.bauer-uk.bauermedia.group lrmd: [22447]: info:
> RA output: (lcdcv01:stop:stderr) logd is not running
> Apr 20 10:54:38 lxdcv01nd01.bauer-uk.bauermedia.group crmd: [22450]: info:
> process_lrm_event: LRM operation lcdcv01_stop_0 (call=4, rc=0, cib-update=9,
> confirmed=true) ok
> Apr 20 10:54:38 corosync [TOTEM ] ring 1 active with no faults
> Apr 20 10:54:41 lxdcv01nd01.bauer-uk.bauermedia.group attrd: [22448]: info:
> cib_connect: Connected to the CIB after 1 signon attempts
> Apr 20 10:54:41 lxdcv01nd01.bauer-uk.bauermedia.group attrd: [22448]: info:
> cib_connect: Sending full refresh
> Apr 20 10:54:41 lxdcv01nd01.bauer-uk.bauermedia.group attrd: [22448]: info:
> attrd_trigger_update: Sending flush op to all hosts for: probe_complete
> (true)
> Apr 20 10:54:41 lxdcv01nd01.bauer-uk.bauermedia.group attrd: [22448]: info:
> attrd_perform_update: Sent update 4: probe_complete=true
>
>
> Bauer Corporate Services UK LP (BCS) is a division of the Bauer Media Group
> the
> largest consumer publisher in the UK, and second largest commercial radio
> broadcaster. BCS provides financial services and manages and develops IT
> systems
> on which our UK publishing, broadcast, digital and partner businesses
> depend.
>
> The information in this email is intended only for the addressee(s) named
> above.
> Access to this email by anyone else is unauthorised. If you are not the
> intended
> recipient of this message any disclosure, copying, distribution or any
> action
> taken in reliance on it is prohibited and may be unlawful. Bauer Corporate
> Services do not warrant that any attachments are free from viruses or other
> defects and accept no liability for any losses resulting from infected email
> transmissions.
>
> Please note that any views expressed in this email may be those of the
> originator and do not necessarily reflect those of this organisation.
>
> Bauer Corporate Services UK LP is registered in England; Registered address
> is
> 1 Lincoln Court, Lincoln Road, Peterborough, PE1 2RF.
>
> Registration number LP13195
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] DRBD < LVM < EXT4 < NFS performance

2012-05-24 Thread Dan Frincu
Hi,

On Mon, May 21, 2012 at 4:24 PM, Christoph Bartoschek  wrote:
> Florian Haas wrote:
>
>>> Thus I would expect to have a write performance of about 100 MByte/s. But
>>> dd gives me only 20 MByte/s.
>>>
>>> dd if=/dev/zero of=bigfile.10G bs=8192  count=1310720
>>> 1310720+0 records in
>>> 1310720+0 records out
>>> 10737418240 bytes (11 GB) copied, 498.26 s, 21.5 MB/s
>>
>> If you used that same dd invocation for your local test that allegedly
>> produced 450 MB/s, you've probably been testing only your page cache.
>> Add oflag=dsync or oflag=direct (the latter will only work locally, as
>> NFS doesn't support O_DIRECT).
>>
>> If your RAID is one of reasonably contemporary SAS or SATA drives,
>> then a sustained to-disk throughput of 450 MB/s would require about
>> 7-9 stripes in a RAID-0 or RAID-10 configuration. Is that what you've
>> got? Or are you writing to SSDs?
>
> I used the same invocation with different filenames each time. To which page
> cache do you refer? To the one on the client or on the server side?
>
> We are using RAID-1 with 6 x 2 disks. I have repeated the local test 10
> times with different files in a row:
>
> for i in `seq 10`; do time dd if=/dev/zero of=bigfile.10G.$i bs=8192
> count=1310720; done
>
> The resulting values on a system that is also used by other programs as
> reported by dd are:
>
> 515 MB/s, 480 MB/s, 340 MB/s, 338 MB/s, 360 MB/s, 284 MB/s, 311 MB/s, 320
> MB/s, 242 MB/s,  289 MB/s
>
> So I think that the system is capable of more than 200 MB/s which is way
> more what can arrive over the network.

A bit off-topic maybe.

Whenever you run these kinds of local disk performance tests, to measure
actual speed and not some caching effect, as Florian said, you should pass
the oflag=direct option to dd and also drop the page cache beforehand with
echo 3 > /proc/sys/vm/drop_caches, followed by sync.

I usually use:

echo 3 > /proc/sys/vm/drop_caches && sync && date && \
time dd if=/dev/zero of=whatever bs=1G count=x oflag=direct && \
sync && date

You can tell whether data is still being flushed when the throughput
reported by dd differs from the one you get by dividing the amount of
data written by the time between the two date calls. It also helps to
push more data than the controller's cache can hold.
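
Put together as a small script, a rough sketch of the whole sequence
(the file path and size below are made up, and it assumes bc is installed):

echo 3 > /proc/sys/vm/drop_caches && sync
t0=$(date +%s)
dd if=/dev/zero of=/srv/testfile bs=1G count=10 oflag=direct
sync
t1=$(date +%s)
# 10 GiB written; MB/s including the final flush
echo "scale=1; 10 * 1024 / ($t1 - $t0)" | bc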

Regards,
Dan

>
> I've done the measurements on the filesystem that sits on top of LVM and
> DRBD. Thus I think that DRBD is not a problem.
>
> However the strange thing is that I get 108 MB/s on the clients as soon as I
> disable the secondary node for DRBD. Maybe there is strange interaction
> between DRBD and NFS.
>
> After reenabling the secondary node the DRBD synchronization is quite slow.
>
>
>>>
>>> Has anyone an idea what could cause such problems? I have no idea for
>>> further analysis.
>>
>> As a knee-jerk response, that might be the classic issue of NFS
>> filling up the page cache until it hits the vm.dirty_ratio and then
>> having a ton of stuff to write to disk, which the local I/O subsystem
>> can't cope with.
>
> Sounds reasonable, but shouldn't the I/O subsystem be able to write
> away anything that arrives?
>
> Christoph
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Different Corosync Rings for Different Nodes in Same Cluster?

2012-06-27 Thread Dan Frincu
Hi,

On Tue, Jun 26, 2012 at 9:53 PM, Andrew Martin  wrote:
> Hello,
>
> I am setting up a 3 node cluster with Corosync + Pacemaker on Ubuntu 12.04
> server. Two of the nodes are "real" nodes, while the 3rd is in standby mode
> as a quorum node. The two "real" nodes each have two NICs, one that is
> connected to a shared LAN and the other that is directly connected between
> the two nodes (for DRBD replication). The quorum node is only connected to
> the shared LAN. I would like to have multiple Corosync rings for redundancy,
> however I do not know if this would cause problems for the quorum node. Is
> it possible for me to configure the shared LAN as ring 0 (which all 3 nodes
> are connected to) and set the rrp_mode to passive so that it will use ring 0
> unless there is a failure, but to also configure the direct link between the
> two "real" nodes as ring 1?

Short answer, yes.

Longer answer. I have a setup with two nodes with two interfaces, one
is connected via a switch to the other node and one is a back-to-back
link for DRBD replication. In Corosync I have two rings, one that goes
via the switch and one via the back-to-back link (rrp_mode: active).
With rrp_mode: passive it should work the way you mentioned.
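
For reference, a rough sketch of the relevant totem section in
corosync.conf (the addresses are made up, adjust them to your networks):

totem {
    ...
    rrp_mode: passive
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.0.0
        mcastaddr: 226.94.1.2
        mcastport: 5405
    }
}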

HTH,
Dan

>
> Thanks,
>
> Andrew
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Different Corosync Rings for Different Nodes in Same Cluster?

2012-06-29 Thread Dan Frincu
Hi,

On Thu, Jun 28, 2012 at 6:13 PM, Andrew Martin  wrote:
> Hi Dan,
>
> Thanks for the help. If I configure the network as I described - ring 0 as
> the network all 3 nodes are on, ring 1 as the network only 2 of the nodes
> are on, and using "passive" - and the ring 0 network goes down, corosync
> will start using ring 1. Does this mean that the quorum node will appear to
> be offline to the cluster? Will the cluster attempt to STONITH it? Once the
> ring 0 network is available again, will corosync transition back to using it
> as the communication ring, or will it continue to use ring 1 until it fails?
>
> The ideal behavior would be when ring 0 fails it then communicates over ring
> 1, but keeps periodically checking to see if ring 0 is working again. Once
> it is, it returns to using ring 0. Is this possible?

Added the corosync ML in CC, as I think this is better asked there as well.

Regards,
Dan

>
> Thanks,
>
> Andrew
>
> 
> From: "Dan Frincu" 
> To: "The Pacemaker cluster resource manager" 
> Sent: Wednesday, June 27, 2012 3:42:42 AM
> Subject: Re: [Pacemaker] Different Corosync Rings for Different Nodes
> inSame Cluster?
>
>
> Hi,
>
> On Tue, Jun 26, 2012 at 9:53 PM, Andrew Martin  wrote:
>> Hello,
>>
>> I am setting up a 3 node cluster with Corosync + Pacemaker on Ubuntu 12.04
>> server. Two of the nodes are "real" nodes, while the 3rd is in standby
>> mode
>> as a quorum node. The two "real" nodes each have two NICs, one that is
>> connected to a shared LAN and the other that is directly connected between
>> the two nodes (for DRBD replication). The quorum node is only connected to
>> the shared LAN. I would like to have multiple Corosync rings for
>> redundancy,
>> however I do not know if this would cause problems for the quorum node. Is
>> it possible for me to configure the shared LAN as ring 0 (which all 3
>> nodes
>> are connected to) and set the rrp_mode to passive so that it will use ring
>> 0
>> unless there is a failure, but to also configure the direct link between
>> the
>> two "real" nodes as ring 1?
>
> Short answer, yes.
>
> Longer answer. I have a setup with two nodes with two interfaces, one
> is connected via a switch to the other node and one is a back-to-back
> link for DRBD replication. In Corosync I have two rings, one that goes
> via the switch and one via the back-to-back link (rrp_mode: active).
> With rrp_mode: passive it should work the way you mentioned.
>
> HTH,
> Dan
>
>>
>> Thanks,
>>
>> Andrew
>>
>> ___________
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>
>
>
> --
> Dan Frincu
> CCNA, RHCE
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Centos 6.2 corosync errors after reboot prevent joining

2012-07-03 Thread Dan Frincu
Hi,

On Mon, Jul 2, 2012 at 7:47 PM, Martin de Koning  wrote:
> Hi all,
>
> Reasonably new to pacemaker and having some issues with corosync loading the
> pacemaker plugin after a reboot of the node. It looks like similar issues
> have been posted before but I haven't found a relavent fix.
>
> The Centos 6.2 node was online before the reboot and restarting the corosync
> and pacemaker services caused no issues. Since the reboot and subsequent
> reboots, I am unable to get pacemaker to join the cluster.
>
> After the reboot corosync now reports the following:
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.cib failed: ipc delivery failed
> (rc=-2)
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.cib failed: ipc delivery failed
> (rc=-2)
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.cib failed: ipc delivery failed
> (rc=-2)
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.cib failed: ipc delivery failed
> (rc=-2)
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.cib failed: ipc delivery failed
> (rc=-2)
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.crmd failed: ipc delivery failed
> (rc=-2)
>
> The full syslog is here:
> http://pastebin.com/raw.php?i=f9eBuqUh
>
> corosync-1.4.1-4.el6_2.3.x86_64
> pacemaker-1.1.6-3.el6.x86_64
>
> I have checked the the obvious such as inter-cluster communication and
> firewall rules. It appears to me that there may be an issue with the with
> Pacemaker cluster information base and not corosync. Any ideas? Can I clear
> the CIB manually somehow to resolve this?

What does "corosync-objctl | grep member" return? Can you see the same
multicast groups on all of the nodes when you run "netstat -ng"?

To clear the CIB manually do a "rm -rfi /var/lib/heartbeat/crm/*" on
the faulty node (with corosync and pacemaker stopped), then start
corosync and pacemaker.
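
In other words, something along these lines (assuming the stock init
scripts on CentOS 6):

service pacemaker stop
service corosync stop
rm -rfi /var/lib/heartbeat/crm/*
service corosync start
service pacemaker start

The node should then rejoin and pull a fresh copy of the CIB from the
current DC.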

HTH,
Dan

>
> Cheers
> Martin
>
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Tool to query Corosync multicast configuration?

2012-08-13 Thread Dan Frincu
Hi,

On Mon, Aug 13, 2012 at 4:45 PM, Andreas Ntaflos
 wrote:
> Hi,
>
> is it possible to somehow query the multicast address(es) and port(s)
> used by Corosync? I mean other than using grep and awk:
>
> egrep "mcastaddr:" /etc/corosync/corosync.conf| awk '{print $2}'
>
> Is there a commandline tool that displays such information? I have
> looked at corosync-cfgtool, but neither the "-a" or "-s" switches make
> it output any multicast information.

netstat -ng
netstat -tupan | grep corosync

It uses both multicast and unicast.
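
If you want to script this for the Facter fact, a rough sketch (assuming
corosync 1.x, where corosync-objctl prints key=value pairs, with the
config file as a fallback) would be:

corosync-objctl totem.interface.mcastaddr 2>/dev/null \
    || awk '/mcastaddr:/ {print $2}' /etc/corosync/corosync.conf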

HTH,
Dan

>
> The reason I am asking is that I want to write a Puppet/Facter fact so
> that we get some overview over our many two-node clusters and their
> multicast configurations.
>
> Thanks,
>
> Andreas
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Change Hostname

2012-09-06 Thread Dan Frincu
Hi,

On Thu, Sep 6, 2012 at 1:35 PM, Thorsten Rehm  wrote:
> Hi everyone,
>
> nobody has an idea?
> Have I missed something in the documentation?

Put the cluster in maintenance-mode.
Stop Pacemaker, stop Corosync.
Change the hostname.
Check if the change actually worked.
Start Corosync, start Pacemaker.
Perform a reprobe and refresh from crm. Remove maintenance-mode.
Delete old node names from cluster configuration (crm node delete
$old-hostname).

Dance.
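
In crm shell terms, a rough sketch of the above (old and new hostnames
are made up, and the init script names may differ on your distribution):

crm configure property maintenance-mode=true
/etc/init.d/pacemaker stop && /etc/init.d/corosync stop
hostname node-new    # and persist it in /etc/sysconfig/network or /etc/hostname
/etc/init.d/corosync start && /etc/init.d/pacemaker start
crm resource reprobe
crm resource refresh
crm configure property maintenance-mode=false
crm node delete node-old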

HTH,
Dan

>
> Regards,
> Thorsten
>
>
> On Tue, Sep 4, 2012 at 10:55 AM, Thorsten Rehm  
> wrote:
>> Hi,
>>
>> ohh, thanks, but I have heartbeat in use.
>> "Legacy cluster stack based on heartbeat"
>> http://www.clusterlabs.org/wiki/File:Stack-lha.png
>>
>> So, there is no corosync.conf ;)
>>
>> Regards,
>> Thorsten
>>
>> On Tue, Sep 4, 2012 at 10:38 AM, Vit Pelcak  wrote:
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA1
>>>
>>>> On 4.9.2012 10:28, Thorsten Rehm wrote:
>>>> Hi everyone,
>>>>
>>>> I have a cluster with three nodes (stack: heartbeat) and I need to
>>>> change the hostname of all systems (only the hostname, not the ip
>>>> address or other network configuration). I have already made
>>>> several attempts, but so far I have not managed that resources are
>>>> available without interruption, after I changed the hostname. Is
>>>> there a procedure that allows me to change the hostname, without
>>>> loss of resources? If so, how would this look like? Is there a best
>>>> case?
>>>
>>>
>>> Hm. What about modifying corosync.conf to reflect hostname change on
>>> all nodes, restarting corosync on all one after another (so you always
>>> have at least 2 nodes running corosync and resources) and then
>>> changing that hostname on desired machine and restarting corosync on it?
>>>
>>> In general, do not stop corosync on more than 1 node at the time and
>>> you should be safe.
>>>
>>>> Cheers, Thorsten
>>>>
>>>> ___ Pacemaker mailing
>>>> list: Pacemaker@oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org Getting started:
>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
>>>> http://bugs.clusterlabs.org
>>>
>>> -BEGIN PGP SIGNATURE-
>>> Version: GnuPG v2.0.19 (GNU/Linux)
>>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>>>
>>> iQEcBAEBAgAGBQJQRb4AAAoJEG+ytY6bjOob0AUH+gKl8OXHnGUUkXe4rFNc1qqr
>>> W1hkafkjDOl2k475kiXiJ9CbgvP4mJSZJ+naMvyh53BJDuWiZH4i3kl1KZVSCvQ6
>>> DNrZhHG90BmTLXiE6tCeVWP6K5tKamvLCRGBehiu83lW2kdH0X3uF9KqZlPnBFhy
>>> AeEYvCsJKfM+u7WndNDFeQVdV//FQaHAB8JZBkgSyHmlvN+bnjUzRTOE1qLyv3/b
>>> nPYVBOYCJgBjmENRRMoP1xWZgAAMeRCzRrpXo2ZSJ8945E/pmc1+9fPDJCqBXqvr
>>> CFzI7iZcyidfpKq6h1S9dlDDMdRidj9P8kfEokThtHXpy45/LhdzYrMg6LmvuIc=
>>> =tZ+G
>>> -END PGP SIGNATURE-
>>>
>>> ___
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>> --
>> Mit freundlichen Gruessen / Kind regards
>> Thorsten Rehm
>
>
>
> --
> Mit freundlichen Gruessen / Kind regards
> Thorsten Rehm
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Change Hostname

2012-09-06 Thread Dan Frincu
Hi,

On Thu, Sep 6, 2012 at 1:58 PM, Dan Frincu  wrote:
> Hi,
>
> On Thu, Sep 6, 2012 at 1:35 PM, Thorsten Rehm  wrote:
>> Hi everyone,
>>
>> nobody has an idea?
>> Have I missed something in the documentation?
>
> Put the cluster in maintenance-mode.
> Stop Pacemaker, stop Corosync.
> Change the hostname.
> Check if the change actually worked.
> Start Corosync, start Pacemaker.
> Perform a reprobe and refresh from crm. Remove maintenance-mode.
> Delete old node names from cluster configuration (crm node delete
> $old-hostname).

My bad, you're running on Heartbeat.

>
> Dance.
>
> HTH,
> Dan
>
>>
>> Regards,
>> Thorsten
>>
>>
>> On Tue, Sep 4, 2012 at 10:55 AM, Thorsten Rehm  
>> wrote:
>>> Hi,
>>>
>>> ohh, thanks, but I have heartbeat in use.
>>> "Legacy cluster stack based on heartbeat"
>>> http://www.clusterlabs.org/wiki/File:Stack-lha.png
>>>
>>> So, there is no corosync.conf ;)
>>>
>>> Regards,
>>> Thorsten
>>>
>>> On Tue, Sep 4, 2012 at 10:38 AM, Vit Pelcak  wrote:
>>>> -BEGIN PGP SIGNED MESSAGE-
>>>> Hash: SHA1
>>>>
>>>>> On 4.9.2012 10:28, Thorsten Rehm wrote:
>>>>> Hi everyone,
>>>>>
>>>>> I have a cluster with three nodes (stack: heartbeat) and I need to
>>>>> change the hostname of all systems (only the hostname, not the ip
>>>>> address or other network configuration). I have already made
>>>>> several attempts, but so far I have not managed that resources are
>>>>> available without interruption, after I changed the hostname. Is
>>>>> there a procedure that allows me to change the hostname, without
>>>>> loss of resources? If so, how would this look like? Is there a best
>>>>> case?
>>>>
>>>>
>>>> Hm. What about modifying corosync.conf to reflect hostname change on
>>>> all nodes, restarting corosync on all one after another (so you always
>>>> have at least 2 nodes running corosync and resources) and then
>>>> changing that hostname on desired machine and restarting corosync on it?
>>>>
>>>> In general, do not stop corosync on more than 1 node at the time and
>>>> you should be safe.
>>>>
>>>>> Cheers, Thorsten
>>>>>
>>>>> ___ Pacemaker mailing
>>>>> list: Pacemaker@oss.clusterlabs.org
>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>
>>>>> Project Home: http://www.clusterlabs.org Getting started:
>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
>>>>> http://bugs.clusterlabs.org
>>>>
>>>> -BEGIN PGP SIGNATURE-
>>>> Version: GnuPG v2.0.19 (GNU/Linux)
>>>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>>>>
>>>> iQEcBAEBAgAGBQJQRb4AAAoJEG+ytY6bjOob0AUH+gKl8OXHnGUUkXe4rFNc1qqr
>>>> W1hkafkjDOl2k475kiXiJ9CbgvP4mJSZJ+naMvyh53BJDuWiZH4i3kl1KZVSCvQ6
>>>> DNrZhHG90BmTLXiE6tCeVWP6K5tKamvLCRGBehiu83lW2kdH0X3uF9KqZlPnBFhy
>>>> AeEYvCsJKfM+u7WndNDFeQVdV//FQaHAB8JZBkgSyHmlvN+bnjUzRTOE1qLyv3/b
>>>> nPYVBOYCJgBjmENRRMoP1xWZgAAMeRCzRrpXo2ZSJ8945E/pmc1+9fPDJCqBXqvr
>>>> CFzI7iZcyidfpKq6h1S9dlDDMdRidj9P8kfEokThtHXpy45/LhdzYrMg6LmvuIc=
>>>> =tZ+G
>>>> -END PGP SIGNATURE-
>>>>
>>>> ___
>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>
>>>
>>>
>>> --
>>> Mit freundlichen Gruessen / Kind regards
>>> Thorsten Rehm
>>
>>
>>
>> --
>> Mit freundlichen Gruessen / Kind regards
>> Thorsten Rehm
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
>
> --
> Dan Frincu
> CCNA, RHCE



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] How to add primitive resource to an already existing Group using crm

2012-09-12 Thread Dan Frincu
Hi,

On Wed, Sep 12, 2012 at 11:56 AM, Kashif Jawed Siddiqui
 wrote:
> Hi,
>
>
>
> I would like to know if there is a way to add a new primitive resource to an
> already existing group.
>
>
>
> I know crm configure edit requires manual editing.
>
>
>
> But is there a direct command?
>
>
>
> Like,
>
> crm configure group Grp1 Res1 Res2 Res3  ## This is used to create group
>
>
>
> How to add new resource to existing group using command ?

Assuming the primitive is already added, you could create a new file
(say it's called group-update) and put in it the following:

group Grp1 Res1 Res2 Res3 this-is-the-new-res-name

Then you could do:

crm configure load update /path/to/group-update

Do test it before, I have only tried this on a shadow cib.
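
To try it on a shadow CIB first, a rough sketch (the shadow name is made
up) from an interactive crm session:

crm
cib new test-update
configure load update /path/to/group-update
configure show
cib commit test-update

cib commit only pushes the shadow copy to the live cluster once you're
happy with what configure show displays.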

HTH,
Dan

>
>
>
> Regards,
> Kashif Jawed Siddiqui
>
>
> ***
> This e-mail and attachments contain confidential information from HUAWEI,
> which is intended only for the person or entity whose address is listed
> above. Any use of the information contained herein in any way (including,
> but not limited to, total or partial disclosure, reproduction, or
> dissemination) by persons other than the intended recipient's) is
> prohibited. If you receive this e-mail in error, please notify the sender by
> phone or email immediately and delete it!
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] OpenAIS priorities

2010-04-29 Thread Dan Frincu

Greetings all,

In the case of two servers in a cluster with OpenAIS, take the following 
example:


location Failover_Alert_1 Failover_Alert 100: abc.localdomain
location Failover_Alert_2 Failover_Alert 200: def.localdomain

This will set up the resource's preference for def.localdomain because
it has the higher priority assigned to it, but what happens when the
priorities match? Is there a tiebreaker, some sort of election process,
to choose which node will be the one handling the resource?


Thank you in advance,
Best regards.

--
Dan FRINCU
Internal Support Engineer



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


Re: [Pacemaker] Pacemaker Digest, Vol 29, Issue 82

2010-05-04 Thread Dan Frincu



pacemaker-requ...@oss.clusterlabs.org wrote:

Send Pacemaker mailing list submissions to
pacemaker@oss.clusterlabs.org

To subscribe or unsubscribe via the World Wide Web, visit
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
or, via email, send a message with subject or body 'help' to
pacemaker-requ...@oss.clusterlabs.org

You can reach the person managing the list at
pacemaker-ow...@oss.clusterlabs.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Pacemaker digest..."


Today's Topics:

   1. Re: OpenAIS priorities (Vadym Chepkov)
   2. Restart Stonith-resource (Ingmar)
   3. Erro compiling PaceMaker for CoroSync and OpenAIS (Ruiyuan Jiang)
   4. Re: Restart Stonith-resource (Andreas Kurz)
   5. Re: Erro compiling PaceMaker for CoroSync and OpenAIS
  (Andrew Beekhof)
   6. Problem setting up active/passive cluster (Francesco Petretti)
   7. Re: Restart Stonith-resource (Ingmar)
   8. Re: Restart Stonith-resource (Andreas Kurz)


--

Message: 1
Date: Thu, 29 Apr 2010 11:40:46 -0400
From: Vadym Chepkov 
To: The Pacemaker cluster resource manager

Subject: Re: [Pacemaker] OpenAIS priorities
Message-ID: <5ee13497-5ee9-4374-9801-057d209b9...@gmail.com>
Content-Type: text/plain; charset=us-ascii

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/node-score-equal.html
  

Thank you.

On Apr 29, 2010, at 10:20 AM, Dan Frincu wrote:

  

Greetings all,

In the case of two servers in a cluster with OpenAIS, take the following 
example:

location Failover_Alert_1 Failover_Alert 100: abc.localdomain
location Failover_Alert_2 Failover_Alert 200: def.localdomain

This will setup the preference of a resource to def.localdomain because it has 
the higher priority assigned to it, but what happens when the priorities match, 
is there a tiebreaker, some sort of election process to choose which node will 
be the one handling the resource?

Thank you in advance,
Best regards.

--
Dan FRINCU
Internal Support Engineer



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf




  


--
Dan FRINCU
Internal Support Engineer

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

[Pacemaker] Pacemaker resource management

2010-06-06 Thread Dan Frincu

Hello all,

I have a couple of questions and I haven't found any relevant 
documentation about it so I would appreciate any answers on the matter.


I'm using drbd 8.3.2-6 with pacemaker 1.0.5-4.2, openais 0.80.5-15.2 and 
heartbeat 3.0.0-33.3 for a high availability 2 node cluster for mysql 
and apache with drbd partitions.


What I want to know is: when a resource fails, such as apache, pacemaker
tries to restart the service, which has to do with
"common_apply_stickiness", from what I can see in the logs.


1. How many times does pacemaker try to restart a resource before 
declaring it "down" and migrating the resource (and dependencies) to the 
other node?
2. How can I alter this behavior, to be able to set the number of 
retries a resource is attempted to be restarted before migrating it to 
the other available node?


I've noticed that sometimes, if there is a problem with the block device
(drbd), the cluster will go into a stage where it migrates all resources
in a group from A to B; however, when trying to start resources on B,
there is a synchronization issue: one block device is still in the
process of being updated from node A drbd0 to node B drbd0. In this case
the group resources don't start until the synchronization is complete.


3. Can I "force" a group of resources to migrate to another node if any 
of the resources fails to be brought up within a number of retries or 
after a timeout (including if the group is just being migrated from A to 
B, but one resource fails to start on B, to be migrated back to A)? How?
4. Is there a Resource Agent out there that can be configured to send 
SNMP traps?


Thank you in advance for your replies.

Best regards.








___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Active/passive cluster with Apache and drbd on rhel 5

2010-07-05 Thread Dan Frincu

Hello all,
I'm currently studying how to set up a 2 node cluster running apache 2.2 
as a reverse proxy using corosync, pacemaker and drbd on RHEL 5.5.

I've downloaded:
- Clusters from scratch PDF (which is perfect... but for fedora 13 which 
includes DRBD in the kernel)

- the yum repo: http://www.clusterlabs.org/rpm/epel-5/clusterlabs.repo

As DRBD is not included in RHEL 5, I've launched the following command:
yum install drbd

It works perfectly... but when installing drbd, the package drbd-xen is 
marked as a dependency, resulting in the XEN kernel being installed on the 
RHEL 5 box. This is not a big issue, but I would prefer not to install XEN 
on the box.

Is it possible?

Thanks in advance, and excuse my english mistakes,

Pierre


Hello Pierre,

I've had the same issue and resorted to rebuilding the RPMs from SRC RPMs. 
The process can be complex, but I will show the steps I've taken to make it 
easier.
- first set up mock for building RPMs chrooted (thus you use one system to build 
for any arch, x86_64, i686, etc.) => http://fedoraproject.org/wiki/Projects/Mock
- install yum-utils, this provides yumdownloader command
- the yum repo is already configured so I'll skip this step
- using "yumdownloader --disablerepo=\* --enablerepo=clusterlabs drbd --source" 
you get the SRC RPM for drbd. Then it's a simple matter of installing the RPM and editing 
the .spec file in /usr/src/redhat/SPECS, removing the XEN dependencies, then building the 
RPM according to the Mock How-To
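
A rough sketch of that sequence (the spec file edits themselves are
elided, and the mock config name is an assumption):

yumdownloader --disablerepo=\* --enablerepo=clusterlabs --source drbd
rpm -ivh drbd-*.src.rpm
vi /usr/src/redhat/SPECS/drbd.spec    # remove the drbd-xen subpackage/dependencies
rpmbuild -bs /usr/src/redhat/SPECS/drbd.spec
mock -r epel-5-x86_64 --rebuild /usr/src/redhat/SRPMS/drbd-*.src.rpm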

Hope this helps.

Cheers.

--
Dan FRINCU
Internal Support Engineer
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Problem with failover/failback under Ubuntu 10.04 for Active/Passive OpenNMS

2010-07-05 Thread Dan Frincu
Filesystem):Started
monitoring-node-01
 fs-opennms-data(ocf::heartbeat:Filesystem):Started
monitoring-node-01
 postgres   (lsb:postgresql-8.4):   Started monitoring-node-01
 opennms(lsb:opennms):  Started monitoring-node-01


There are some entries in my daemon.log[2] which look as
if they have something to do with my problem...

 monitoring-node-01 lrmd: [994]: info: rsc:drbd-opennms-data:1:30: promote
 monitoring-node-01 crmd: [998]: info: do_lrm_rsc_op: Performing
key=93:25:0:d49b62be-1e33-48ca-a8c3-cb128676d444
op=fs-opennms-config_start_0 )
 monitoring-node-01 lrmd: [994]: info: rsc:fs-opennms-config:31: start
 monitoring-node-01 Filesystem[2464]: INFO: Running start for
/dev/drbd/by-res/config on /etc/opennms
 monitoring-node-01 lrmd: [994]: info: RA output:
(fs-opennms-config:start:stderr) FATAL: Module scsi_hostadapter not found.
 monitoring-node-01 lrmd: [994]: info: RA output:
(drbd-opennms-data:1:promote:stdout) 
 monitoring-node-01 crmd: [998]: info: process_lrm_event: LRM operation

drbd-opennms-data:1_promote_0 (call=30, rc=0, cib-update=35,
confirmed=true) ok
 monitoring-node-01 lrmd: [994]: info: RA output:
(fs-opennms-config:start:stderr) /dev/drbd/by-res/config: Wrong medium
type
 monitoring-node-01 lrmd: [994]: info: RA output:
(fs-opennms-config:start:stderr) mount: block device /dev/drbd0 is
write-protected, mounting read-only
 monitoring-node-01 lrmd: [994]: info: RA output:
(fs-opennms-config:start:stderr) mount: Wrong medium type
 monitoring-node-01 Filesystem[2464]: ERROR: Couldn't mount filesystem
/dev/drbd/by-res/config on /etc/opennms
 monitoring-node-01 crmd: [998]: info: process_lrm_event: LRM operation
fs-opennms-config_start_0 (call=31, rc=1, cib-update=36, confirmed=true)
unknown error

...but I don't know how to troubleshoot it.



[1] http://www.corosync.org/doku.php?id=faq:cisco_switches
[2] http://pastebin.com/DKLjXtx8

  

Hi,

First you might want to look at the following error, see if the module 
is available on both servers.


(fs-opennms-config:start:stderr) FATAL: Module scsi_hostadapter not found.

Then try to run the resource manually:
- go to /usr/lib/ocf/resource.d/heartbeat
- export OCF_ROOT=/usr/lib/ocf
- export OCF_RESKEY_device="/dev/drbd/by-res/config"
- export OCF_RESKEY_options=rw
- export OCF_RESKEY_fstype=xfs
- export OCF_RESKEY_directory="/etc/opennms"
- ./Filesystem start

See if you encounter any errors here. Run the steps on both servers. 
Make sure to move the drbd resource from server to server so that the 
mount works. You do that via

- go to server where drbd device is currently mounted and in a primary state
- umount /etc/opennms
- drbdadm secondary config
- move to other server
- drbdadm primary config

Also, make sure that pacemaker doesn't interfere with these operations :)

Cheers.

--
Dan FRINCU
Systems Engineer
CCNA, RHCE


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Drbd/Nfs MS don't failover on slave node

2010-07-05 Thread Dan Frincu
I see you have {symmetric-cluster="true"} in your config. But you 
haven't set up any location constraints.


Check the log (usually /var/log/messages unless specified otherwise) 
with tailf while you're setting one node offline. It should give all the 
relevant information about why the resources aren't "moving".
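
For example, location constraints along these lines (resource and node
names are made up, adjust them to your config):

location nfs_on_node1 nfs-group 200: node1.localdomain
location nfs_on_node2 nfs-group 100: node2.localdomain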


--
Dan FRINCU
Systems Engineer


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Problem with failover/failback under Ubuntu 10.04 for Active/Passive OpenNMS

2010-07-05 Thread Dan Frincu

> Hi,
>
> First you might want to look at the following error, see if the module
> is available on both servers.
>
> (fs-opennms-config:start:stderr) FATAL: Module scsi_hostadapter not found.
>
> Then try to run the resource manually:
> - go to /usr/lib/ocf/resource.d/heartbeat
> - export OCF_ROOT=/usr/lib/ocf
> - export OCF_RESKEY_device="/dev/drbd/by-res/config"
> - export OCF_RESKEY_options=rw
> - export OCF_RESKEY_fstype=xfs
> - export OCF_RESKEY_directory="/etc/opennms"
> - ./Filesystem start
>
> See if you encounter any errors here. Run the steps on both servers.
> Make sure to move the drbd resource from server to server so that the
> mount works. You do that via
> - go to server where drbd device is currently mounted and in a primary state
> - umount /etc/opennms
> - drbdadm secondary config
> - move to other server
> - drbdadm primary config
>
> Also, make sure that pacemaker doesn't interfere with these operations :)
>
> Cheers.
I get the error message about the scsi_hostadapter on both nodes
but I can mount the DRBD Device just fine.

__



>  monitoring-node-01 lrmd: [994]: info: RA output:
> (fs-opennms-config:start:stderr) /dev/drbd/by-res/config: Wrong medium
> type
>  monitoring-node-01 lrmd: [994]: info: RA output:
> (fs-opennms-config:start:stderr) mount: block device /dev/drbd0 is
> write-protected, mounting read-only
>  monitoring-node-01 lrmd: [994]: info: RA output:
> (fs-opennms-config:start:stderr) mount: Wrong medium type
>  monitoring-node-01 Filesystem[2464]: ERROR: Couldn't mount filesystem
> /dev/drbd/by-res/config on /etc/opennms

The errors from the log file are DRBD-specific: they occur when you're trying
to mount a resource that is in a Secondary state. Increase the "op start
interval" for both the DRBD and Filesystem primitives to ~15 seconds. With a
start interval of 0 (zero) seconds, the change of the DRBD resource from
Primary to Secondary on node2 and then its promotion to Primary on node1 is
not instantaneous, so Pacemaker attempts to mount the filesystem without
having the DRBD resource in a Primary state. It then goes into that huuuge
300 second timeout, but while waiting for one resource (DRBD) to time out,
it executes the next one, the mount, which fails with the given errors, for
the aforementioned reasons.


I'd also suggest adding an "op monitor" for each resource, with a reasonable 
interval and timeout, and also a mail alert.
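
For example, something along these lines for the master/slave resource
(the resource name matches your logs, but the drbd_resource value and
intervals are assumptions; note that the two monitor operations on a
master/slave resource must use different intervals):

primitive drbd-opennms-data ocf:linbit:drbd \
    params drbd_resource="data" \
    op monitor interval="15s" role="Master" timeout="20s" \
    op monitor interval="30s" role="Slave" timeout="20s"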

Regards,
Dan


--
Dan FRINCU
Systems Engineer

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Problem with failover/failback under Ubuntu 10.04 for Active/Passive OpenNMS

2010-07-05 Thread Dan Frincu
I'm using Thunderbird 2, but it's not the email client's fault; I had a 
bad filter and was not receiving the email digest, so I copy-pasted from 
the web. I changed the filter and set digest mode to off in the mailing 
list options, so I'll be receiving the emails and replying to them properly.


Sorry about the trouble.

Regards.
Dan

--
Dan FRINCU
Systems Engineer


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Lighty doesn't come up always

2010-07-06 Thread Dan Frincu
No, there is no OCF resource agent for that. You can check 
/usr/lib/ocf/resource.d/{provider}/{resource_agents} for available RA's.


Regards.

Torsten Bronger wrote:

Hello!

We have a two-node cluster with a virtual IP and Lighty running on
that node which has this IP currently.  Thus, our configuration
says:

node $id="xxx" mandy
node $id="yyy" olga
primitive Public-IP ocf:heartbeat:IPaddr2 \
params ip="134.94.252.127" broadcast="134.94.253.255" nic="eth1" 
cidr_netmask="23" \
op monitor interval="60s"
primitive lighty lsb:lighttpd \
op monitor interval="60s" timeout="30s" on-fail="restart" \
op start interval="0" timeout="60s" \
meta migration-threshold="3" failure-timeout="30s" 
target-role="Started"
primitive pingd ocf:pacemaker:pingd \
params host_list="134.94.111.186" multiplier="100" \
op monitor interval="15s" timeout="20s"
group lighty_group Public-IP lighty
clone pingclone pingd \
meta globally-unique="false"
location lighty-on-connected-node lighty_group \
rule $id="lighty-on-connected-node-rule" -inf: not_defined pingd or 
pingd lte 0
colocation ip-with-lighty inf: Public-IP lighty
property $id="cib-bootstrap-options" \
dc-version="1.0.8-zzz" \
cluster-infrastructure="Heartbeat" \
stonith-enabled="false"

(By the way, is there an ocf:lighttpd somewhere?)

The problem is that under some circumstances, Lighty is not
started.  Instead, crm_mon shows at the bottom:

Failed actions:
lighty_monitor_0 (node=olga, call=3, rc=1, status=complete): unknown 
error
lighty_monitor_0 (node=mandy, call=3, rc=1, status=complete): unknown 
error

What does this "unknown error" mean?  I is not further explained in
the log files, neither in Heartbeat's nor in Lighty's.  Well, in
Lighty's, there's nothing about errors at all.

If I restart heartbeat on one of the nodes, then it works.  But how
do I get it up reliably?

Bye,
Torsten.

  


--
Dan FRINCU
Systems Engineer


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Lighty doesn't come up always

2010-07-06 Thread Dan Frincu
You should define location constraints for all resources, including 
those that are part of a group.


Jul  6 11:27:08 olga pengine: [7362]: info: native_color: Resource Public-IP 
cannot run anywhere
Jul  6 11:27:08 olga pengine: [7362]: info: native_color: Resource lighty 
cannot run anywhere

Also, you mentioned that if you start the resources with a two minute 
delay, they work, so try to increase the start interval of lighty to 10 
seconds, then see if the resources start as they should.


Regards.

Torsten Bronger wrote:

Hello!

Dr. Michael Schwartzkopff writes:

  

On Tuesday, 06.07.2010, at 10:28 +0200, Torsten Bronger wrote:



We have a two-node cluster with a virtual IP and Lighty running
on that node which has this IP currently.  Thus, our
configuration says:

[...]

The problem is that under some circumstances, Lighty is not
started.  Instead, crm_mon shows at the bottom:
  

Remove the collocation constraint because it is implicitly given
in the group. Then it shoud work.



Thank you, I removed the superfluous line.  However, the problem is
still there.

If I start Heartbeat on both nodes simultaneously, Lighty is not
started, and the log on one of the nodes says

Jul  6 11:27:08 olga pengine: [7362]: notice: group_print:  Resource Group: 
lighty_group
Jul  6 11:27:08 olga pengine: [7362]: notice: native_print:  
Public-IP#011(ocf::heartbeat:IPaddr2):#011Stopped
Jul  6 11:27:08 olga pengine: [7362]: notice: native_print:  
lighty#011(lsb:lighttpd):#011Stopped
Jul  6 11:27:08 olga pengine: [7362]: notice: clone_print:  Clone Set: pingclone
Jul  6 11:27:08 olga pengine: [7362]: notice: short_print:  Stopped: [ 
pingd:0 pingd:1 ]
Jul  6 11:27:08 olga attrd: [7356]: info: attrd_trigger_update: Sending flush op to 
all hosts for: terminate ()
Jul  6 11:27:08 olga pengine: [7362]: info: native_merge_weights: Public-IP: 
Rolling back scores from lighty
Jul  6 11:27:08 olga pengine: [7362]: info: native_color: Resource Public-IP 
cannot run anywhere
Jul  6 11:27:08 olga pengine: [7362]: info: native_color: Resource lighty 
cannot run anywhere
Jul  6 11:27:08 olga pengine: [7362]: notice: RecurringOp:  Start recurring 
monitor (15s) for pingd:0 on mandy
Jul  6 11:27:08 olga pengine: [7362]: notice: RecurringOp:  Start recurring 
monitor (15s) for pingd:1 on olga
Jul  6 11:27:08 olga pengine: [7362]: notice: LogActions: Leave resource 
Public-IP#011(Stopped)
Jul  6 11:27:08 olga pengine: [7362]: notice: LogActions: Leave resource 
lighty#011(Stopped)

If I start both nodes one after the other with two minutes delay,
everything works fine.  Why is this?

Bye,
Torsten.

  


--
Dan FRINCU
Systems Engineer

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Problem with failover/failback under Ubuntu

2010-07-06 Thread Dan Frincu

As I can see from your config, you haven't specified whether it's a symmetric 
or an asymmetric cluster; the default is symmetric, which means you need to add 
some location constraints for each resource plus the group. You also need to 
specify a colocation constraint, on top of what is already configured.
Choose one of the ms-opennms-* resources, say ms-opennms-data, and add the 
dependencies group as a colocation constraint on it. Something like:
- colocation all-dependencies inf: dependencies ms-opennms-data:Master
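
and, if they're not already there, an order constraint to go with it, e.g.:

- order data-before-deps inf: ms-opennms-data:promote dependencies:start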

Regards,
Dan

Ok, that almost solved the problem.
But now the Filesystem primitives run in an endless loop.
They get unmounted and mounted again.


> therefore Pacemaker attempts to
> mount the filesystem without having the DRBD 
> resource in a Primary state
  


Hm, until now I thought this is handled by
the 3 "order" restrictions.

I see I have to find out which intervalls and timeouts I need to adjust.
Thanks for giving me a hint to the right direction so quickly.

If you have some other ideas to improve the config, just let me now.

Cheers, Sven

--
Dan FRINCU
Systems Engineer

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Upgraded mysql from 5.0 to 5.1

2010-07-07 Thread Dan Frincu

Have you copied this line twice?

socket="/var/lib/mysql/mysql.sock" binary="/usr/sbin/mysqld"
socket="/var/lib/mysql/mysql.sock" binary="/usr/sbin/mysqld"

I think so. Regardless, testing a resource agent manually requires that 
you define some variables and then call the script by hand. Also, check 
all the actions (start, stop, restart, promote, etc.) and their exit codes, 
to see if they match the OCF RA specification. Most of the problems that 
you will have with a resource agent and its resource can be found by 
manually testing the RA script.


Go to /usr/lib/ocf/resource.d/heartbeat/
Open the mysql RA script. Go to line 63 and starting from that line 
update the values in the script to match the contents of /etc/my.cnf. 
Then update the crm configure for the primitive mysql-server to match as 
well.
From what I remember, the values in 
OCF_RESKEY_{binary_default,pid_default,socket_default} are wrong in the 
RA script vs what's actually installed.


Then "export OCF_ROOT=/usr/lib/ocf/" and all OCF_RESKEY_* with their 
defined values, then call the script with no parameters. It should 
provide the usage of the script. Then take step by step each action and 
check it's exit code, see if it matches the OCF RA specification, and 
also check to see if it actually starts the resource or not. The thing 
is, once the script works as it should, all the issues have been 
resolved, the cluster will work with the mysql-server resource.
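
As a rough sketch, with the parameter values taken from your primitive
definition quoted below:

export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_config="/etc/my.cnf"
export OCF_RESKEY_datadir="/drbd/mysql/data/"
export OCF_RESKEY_binary="/usr/sbin/mysqld"
export OCF_RESKEY_socket="/var/lib/mysql/mysql.sock"
export OCF_RESKEY_pid="/drbd/mysql/data/mysql.pid"
cd /usr/lib/ocf/resource.d/heartbeat
./mysql start;   echo $?    # 0 = success per the OCF spec
./mysql monitor; echo $?    # 0 = running, 7 = not running
./mysql stop;    echo $?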


Regards,
Dan

Jake Bogie wrote:

So I took Raoul's advice and ditched the lsb:mysql check and went for
the ocf:heartbeat version however...

I'm getting this now...

What am I missing? I'm having a hard time finding a document on how to
setup this resource agent.


Last updated: Tue Jul  6 12:44:07 2010
Stack: openais
Current DC: qad02 - partition with quorum
Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
2 Nodes configured, 2 expected votes
3 Resources configured.


Online: [ qad02 qad01 ]

 Resource Group: mysql
 fs_mysql   (ocf::heartbeat:Filesystem):Started qad02
 ip_mysql   (ocf::heartbeat:IPaddr2):   Started qad02
 Master/Slave Set: ms_drbd_mysql
 Masters: [ qad02 ]
 Slaves: [ qad01 ]

Failed actions:
mysql-server_start_0 (node=qad01, call=6, rc=6, status=complete):
not configured
mysql-server_start_0 (node=qad02, call=33, rc=5, status=complete):
not installed

###

primitive mysql-server ocf:heartbeat:mysql \
op monitor interval="30s" timeout="30s" \
op start interval="0" timeout="120" \
op stop interval="0" timeout="120" \
params config="/etc/my.cnf" datadir="/drbd/mysql/data/"
socket="/var/lib/mysql/mysql.sock" binary="/usr/sbin/mysqld"
socket="/var/lib/mysql/mysql.sock" binary="/usr/sbin/mysqld"
pid="/drbd/mysql/data/mysql.pid" test_passwd="isitup"
test_table="cluster_check.connectioncheck" test_user="qaclus" \

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania
E-mail: dfri...@streamwide.ro
Phone: +40 (0) 21 320 41 24


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Upgraded mysql from 5.0 to 5.1 - And changed to OCF RA

2010-07-08 Thread Dan Frincu
 qad01 after 100 failures
(max=100)
Jul  7 11:48:01 qad01 pengine: [4359]: info: native_color: Resource
mysql-server cannot run anywhere
Jul  7 11:48:01 qad01 pengine: [4359]: notice: LogActions: Leave
resource mysql-server  (Stopped)
Jul  7 11:48:10 qad01 pengine: [4359]: ERROR: unpack_rsc_op: Hard error
- mysql-server_start_0 failed with rc=6: Preventing mysql-server from
re-starting anywhere in the cluster
Jul  7 11:48:10 qad01 pengine: [4359]: WARN: unpack_rsc_op: Processing
failed op mysql-server_start_0 on qad01: not configured (6)
Jul  7 11:48:10 qad01 pengine: [4359]: notice: native_print:
mysql-server   (ocf::heartbeat:mysql): Stopped
Jul  7 11:48:10 qad01 pengine: [4359]: info: get_failcount: mysql-server
has failed INFINITY times on qad01
Jul  7 11:48:10 qad01 pengine: [4359]: WARN: common_apply_stickiness:
Forcing mysql-server away from qad01 after 100 failures
(max=100)
Jul  7 11:48:10 qad01 pengine: [4359]: info: native_color: Resource
mysql-server cannot run anywhere
Jul  7 11:48:10 qad01 pengine: [4359]: notice: LogActions: Leave
resource mysql-server  (Stopped)
Jul  7 11:48:11 qad01 pengine: [4359]: ERROR: unpack_rsc_op: Hard error
- mysql-server_start_0 failed with rc=6: Preventing mysql-server from
re-starting anywhere in the cluster
Jul  7 11:48:11 qad01 pengine: [4359]: WARN: unpack_rsc_op: Processing
failed op mysql-server_start_0 on qad01: not configured (6)
Jul  7 11:48:11 qad01 pengine: [4359]: notice: native_print:
mysql-server   (ocf::heartbeat:mysql): Stopped
Jul  7 11:48:11 qad01 pengine: [4359]: info: get_failcount: mysql-server
has failed INFINITY times on qad01
Jul  7 11:48:11 qad01 pengine: [4359]: WARN: common_apply_stickiness:
Forcing mysql-server away from qad01 after 100 failures
(max=100)
Jul  7 11:48:11 qad01 pengine: [4359]: info: native_color: Resource
mysql-server cannot run anywhere
Jul  7 11:48:11 qad01 pengine: [4359]: notice: LogActions: Leave
resource mysql-server  (Stopped)
Jul  7 11:48:26 qad01 pengine: [4359]: ERROR: unpack_rsc_op: Hard error
- mysql-server_start_0 failed with rc=6: Preventing mysql-server from
re-starting anywhere in the cluster
Jul  7 11:48:26 qad01 pengine: [4359]: WARN: unpack_rsc_op: Processing
failed op mysql-server_start_0 on qad01: not configured (6)
Jul  7 11:48:26 qad01 pengine: [4359]: notice: native_print:
mysql-server   (ocf::heartbeat:mysql): Stopped
Jul  7 11:48:26 qad01 pengine: [4359]: info: get_failcount: mysql-server
has failed INFINITY times on qad01
Jul  7 11:48:26 qad01 pengine: [4359]: WARN: common_apply_stickiness:
Forcing mysql-server away from qad01 after 100 failures
(max=100)
Jul  7 11:48:26 qad01 pengine: [4359]: info: native_color: Resource
mysql-server cannot run anywhere
Jul  7 11:48:26 qad01 pengine: [4359]: notice: LogActions: Leave
resource mysql-server  (Stopped)
___

Message: 7
Date: Wed, 07 Jul 2010 12:55:51 +0300
From: Dan Frincu 
To: The Pacemaker cluster resource manager

Subject: Re: [Pacemaker] Upgraded mysql from 5.0 to 5.1
Message-ID: <4c344f27.1060...@streamwide.ro>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Have you copied this line twice?

socket="/var/lib/mysql/mysql.sock" binary="/usr/sbin/mysqld"
socket="/var/lib/mysql/mysql.sock" binary="/usr/sbin/mysqld"

I think so. Regardless, testing a resource agent manually requires that 
you define some variables and then call the script by hand. Also, check 
all the actions (start, stop, restart, promote, etc.) and their exit 
codes, to see if they match the OCF RA specification. Most of the 
problems that you will have with a resource agent and its resource can 
be found by manually testing the RA script.


Go to /usr/lib/ocf/resource.d/heartbeat/ and open the mysql RA script. 
Go to line 63 and, starting from that line, update the values in the 
script to match the contents of /etc/my.cnf. Then update the crm 
configuration for the mysql-server primitive to match as well.
From what I remember, the values in 
OCF_RESKEY_{binary_default,pid_default,socket_default} are wrong in the 
RA script vs what's actually installed.


Then "export OCF_ROOT=/usr/lib/ocf/" and all OCF_RESKEY_* with their 
defined values, then call the script with no parameters. It should 
provide the usage of the script. Then take step by step each action and 
check it's exit code, see if it matches the OCF RA specification, and 
also check to see if it actually starts the resource or not. The thing 
is, once the script works as it should, all the issues have been 
resolved, the cluster will work with the mysql-server resource.
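
As a minimal sketch, using parameter values like those in the primitive 
quoted earlier in this thread (adjust the paths to your installation):

   export OCF_ROOT=/usr/lib/ocf/
   export OCF_RESKEY_binary=/usr/sbin/mysqld
   export OCF_RESKEY_config=/etc/my.cnf
   export OCF_RESKEY_datadir=/drbd/mysql/data/
   export OCF_RESKEY_pid=/drbd/mysql/data/mysql.pid
   export OCF_RESKEY_socket=/var/lib/mysql/mysql.sock
   cd /usr/lib/ocf/resource.d/heartbeat
   ./mysql start;   echo "start: $?"
   ./mysql monitor; echo "monitor: $?"
   ./mysql stop;    echo "stop: $?"

Per the OCF RA specification, start and stop should return 0, and 
monitor should return 0 when the resource is running and 7 when it is 
cleanly stopped.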


Regards,
Dan

Jake Bogie wrote:
  

So I took Raoul's advice and ditched the lsb:mysql check and went for
the ocf:heartbeat version however...

I'm getting this now...

What am I missing? I'm having a hard time finding a document on how to
set up this resource agent.


Last updated: Tue 

Re: [Pacemaker] mysql RA constantly restarting db

2010-07-23 Thread Dan Frincu
heartbeat-3.0.3-2.3.el5
pacemaker-libs-1.0.9.1-1.11.el5
heartbeat-libs-3.0.3-2.3.el5
pacemaker-libs-1.0.9.1-1.11.el5

rpm -qa | grep resource
resource-agents-1.0.3-2.6.el5

[r...@sipl-mysql-109 rc0.d]# cat /etc/redhat-release 
CentOS release 5.5 (Final)


[r...@sipl-mysql-109 rc0.d]# uname -r
2.6.18-194.8.1.el5

[r...@sipl-mysql-109 rc0.d]# mysql -V
mysql  Ver 14.14 Distrib 5.1.48, for unknown-linux-gnu (x86_64) using 
readline 5.1


My ha.cf looks like:

autojoin none
mcast eth0 227.0.0.10 694 1 0
warntime 5
deadtime 15
initdead 60
keepalive 5
auto_failback off
node sipl-mysql-109
node sipl-mysql-209
crm on 



MySQL shows the following in its error log:

100722 15:33:57 [Note] Plugin 'FEDERATED' is disabled.
100722 15:33:57  InnoDB: Started; log sequence number 0 44233
100722 15:33:57 [Note] Event Scheduler: Loaded 0 events
100722 15:33:57 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.1.48-community-log'  socket: '/var/lib/mysql/mysql.sock' 
 port: 3306  MySQL Community Server (GPL)

100722 15:34:01 [Note] /usr/sbin/mysqld: Normal shutdown

100722 15:34:01 [Note] Event Scheduler: Purging the queue. 0 events
100722 15:34:01  InnoDB: Starting shutdown...
100722 15:34:02  InnoDB: Shutdown completed; log sequence number 0 44233
100722 15:34:02 [Note] /usr/sbin/mysqld: Shutdown complete

100722 15:34:02 mysqld_safe mysqld from pid file 
/var/run/mysql/mysqld.pid ended
100722 15:34:03 mysqld_safe Starting mysqld daemon with databases from 
/var/lib/mysql
100722 15:34:03 [Warning] '--skip-locking' is deprecated and will be 
removed in a future release. Please use '--skip-external-locking' instead.

100722 15:34:03 [Note] Plugin 'FEDERATED' is disabled.
100722 15:34:03  InnoDB: Started; log sequence number 0 44233
100722 15:34:03 [Note] Event Scheduler: Loaded 0 events
100722 15:34:03 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.1.48-community-log'  socket: '/var/lib/mysql/mysql.sock' 
 port: 3306  MySQL Community Server (GPL)


Any help would be greatly appreciated. Thanks in advance.
F.

----

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Master Slave

2010-07-23 Thread Dan Frincu
First take a look at this: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
It contains all you need for this kind of setup. I'm not aware whether 
the M/S relationship extends to resources other than DRBD, but in this 
case you don't actually need an M/S relationship (from my point of view).

1. Read in the document above about 'symmetric-cluster'.
2. Based on the results from point 1, set the resources as being allowed 
either to run anywhere or not, and add location constraints for each 
primitive (see the sketch after this list).
3. Define location constraints for the mysql primitive, with a higher 
score on the node you wish to use as the Primary node and a lower score 
on the Secondary node (if the Primary node fails, the resource will fail 
over to the Secondary node).
4. Define colocation so that the VIP will always run where the mysql 
resource runs.
5. (optional) Define mail alerts for when a resource failure occurs, and 
define what the mysql resource should do when the Primary node recovers: 
should it remain on the Secondary node and be moved back manually, or 
should it go back to the Primary node by itself? These are things you 
might want to consider, but they are not mandatory for the question 
asked, hence the "optional" at the beginning.


Regards,
Dan

Freddie Sessler wrote:
I have a quick question: is the Master/Slave setting in Pacemaker only 
allowed for a DRBD device? Can you use it to create other Master/Slave 
relationships? Do all resource agents potentially involved in this need 
to be aware of the Master/Slave relationship? I am trying to set up a 
pair of mysql servers; one is replicating from the other (handled within 
mysql's my.cnf). I basically want to fail over the VIP of the primary 
node to the secondary node (which also happens to be the mysql slave) in 
the event that the primary has its mysql server stopped. I am not using 
DRBD at all. My config looks like the following.


node $id="0cd2bb09-00b6-4ce4-bdd1-629767ae0739" sipl-mysql-109
node $id="119fc082-7046-4b8d-a9a3-7e777b9ddf60" sipl-mysql-209
primitive p_clusterip ocf:heartbeat:IPaddr2 \
params ip="10.200.131.9" cidr_netmask="32" \
op monitor interval="30s"
primitive p_mysql ocf:heartbeat:mysql \
op start interval="0" timeout="120" \
op stop interval="0" timeout="120" \
op monitor interval="10" timeout="120" depth="0"
ms ms_mysql p_mysql \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
location l_master ms_mysql \
rule $id="l_master-rule" $role="Master" 100: #uname eq sipl-mysql-109
colocation mysql_master_on_ip inf: p_clusterip ms_mysql:Master
property $id="cib-bootstrap-options" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
start-failure-is-fatal="false" \
expected-quorum-votes="2" \
symmetric-cluster="false" \
dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
cluster-infrastructure="Heartbeat"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"


What's happening is that mysql is never brought up due to the 
following errors:


Jul 22 16:15:07 sipl-mysql-109 pengine: [22890]: info: native_color: 
Resource p_mysql:0 cannot run anywhere
Jul 22 16:15:07 sipl-mysql-109 pengine: [22890]: info: native_color: 
Resource p_mysql:1 cannot run anywhere
Jul 22 16:15:07 sipl-mysql-109 pengine: [22890]: info: 
native_merge_weights: ms_mysql: Rolling back scores from p_clusterip
Jul 22 16:15:07 sipl-mysql-109 pengine: [22890]: info: master_color: 
ms_mysql: Promoted 0 instances of a possible 1 to master
Jul 22 16:15:07 sipl-mysql-109 pengine: [22890]: info: native_color: 
Resource p_clusterip cannot run anywhere
Jul 22 16:15:07 sipl-mysql-109 pengine: [22890]: info: master_color: 
ms_mysql: Promoted 0 instances of a possible 1 to master
Jul 22 16:15:07 sipl-mysql-109 pengine: [22890]: notice: LogActions: 
Leave resource p_clusterip (Stopped)
Jul 22 16:15:07 sipl-mysql-109 pengine: [22890]: notice: LogActions: 
Leave resource p_mysql:0 (Stopped)
Jul 22 16:15:07 sipl-mysql-109 pengine: [22890]: notice: LogActions: 
Leave resource p_mysql:1 (Stopped)



I thought I may have overcome this with my location and colocation 
directive but it failed. Could someone give me some feedback on what I 
am trying to do, my config and the resulting errors?


Thanks
F.


_______
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems E

Re: [Pacemaker] communication channels howto

2010-08-26 Thread Dan Frincu
In OpenAIS, for example, in /etc/ais/openais.conf you have a directive 
called interface. In this directive you specify a ringnumber, 
bindnetaddr, mcastaddr and mcastport. Configuring 2 communication 
channels means adding rrp_mode: passive and two interface directives 
(ringnumber 0 and ringnumber 1) in which you use different bindnetaddr 
IP addresses (ex: 10.0.0.1 and 10.0.0.2, or whatever IPs you use, but 
preferably each IP address should be assigned to a different network 
interface) and two different mcastaddr multicast groups (or mcastports, 
but I'd recommend different mcastaddrs in the range of 239.x.x.x).
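
A minimal sketch of the relevant openais.conf fragment (using the 
example addresses above; substitute your own):

   rrp_mode: passive

   interface {
       ringnumber: 0
       bindnetaddr: 10.0.0.1
       mcastaddr: 239.0.0.1
       mcastport: 5405
   }
   interface {
       ringnumber: 1
       bindnetaddr: 10.0.0.2
       mcastaddr: 239.0.0.2
       mcastport: 5405
   }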


The principle behind it is if one communication channel (network card, 
IP address connectivity, multicast group/port) fails, the other one is 
still there for redundancy. I see this similar to the way ring network 
topologies work, without the token.


Regards,
Dan.

p.s.: Andrew, nice to have you back, to paraphrase a famous quote "Mr. 
Anderson. Welcome back, we missed you."


lxnf9...@comcast.net wrote:

The DRBD manual says

It is absolutely vital to configure at least two independent OpenAIS 
communication channels for this functionality to work correctly.


My Google'n has not yielded any results in the how to do this department
I have DRBD configured and working properly with one channel
Where can I find info to add a second channel

Richard

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: 
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] communication channels howto

2010-08-26 Thread Dan Frincu



lxnf9...@comcast.net wrote:

On Thu, 26 Aug 2010, lxnf9...@comcast.net wrote:


On Thu, 26 Aug 2010, Dan Frincu wrote:


 In OpenAIS for example, in /etc/ais/openais.conf you have a directive
 called interface. In this directive you specify a ringnumber, 
bindnetaddr,
 mcastaddr and mcastport. Configuring 2 communication channels means 
using

 adding rrp_mode: passive, ringnumber 0 and ringnumber 1, two interface
 directives in which you use different bindnetaddr IP addresses (ex:
 10.0.0.1 and 10.0.0.2, or whatever IP's you use, but preferably 
each IP

 address should be assigned to a different network interface), two
 different mcastaddr multicast groups (or mcastport, but I'd recommed
 different mcastaddr's in the range of 239.x.x.x).

 The principle behind it is if one communication channel (network 
card, IP
 address connectivity, multicast group/port) fails, the other one is 
still

 there for redundancy. I see this similar to the way ring network
 topologies work, without the token.

 Regards,
 Dan.

 p.s.: Andrew, nice to have you back, to paraphrase a famous quote "Mr.
 Anderson. Welcome back, we missed you."



Thanks that helps a lot
Just one more thing
If two interfaces are configured are they both used equally

Richard



I believe I can answer my own question
http://www.novell.com/documentation/sle_ha/book_sleha/?page=/documentation/sle_ha/book_sleha/data/sec_ha_installation_setup.html 



Use the Redundant Ring Protocol (RRP) to tell the cluster how to use 
these interfaces. RRP can have three modes (rrp_mode): if set to 
active, Corosync uses all interfaces actively. If set to passive, 
Corosync uses the second interface only if the first ring fails.


Richard


Also see http://linux.die.net/man/5/openais.conf

rrp_mode
   This specifies the mode of redundant ring, which may be none,
   active, or passive. Active replication offers slightly lower latency
   from transmit to delivery in faulty network environments but with
   less performance. Passive replication may nearly double the speed of
   the totem protocol if the protocol doesn't become cpu bound. The
   final option is none, in which case only one network interface will
   be used to operate the totem protocol.

   If only one interface directive is specified, none is automatically
   chosen. If multiple interface directives are specified, only active
   or passive may be chosen.



 lxnf9...@comcast.net wrote:
>   The DRBD manual says
>
>   It is absolutely vital to configure at least two independent OpenAIS
>   communication channels for this functionality to work correctly.
>
>   My Google'n has not yielded any results in the how to do this department
>   I have DRBD configured and working properly with one channel
>   Where can I find info to add a second channel
>
>   Richard
>
>   ___
>   Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>   http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
>   Project Home: http://www.clusterlabs.org
>   Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>   Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Adding a STONITH module to the distribution

2010-08-28 Thread Dan Frincu

Hi,

Could you attach the script as well? I'm also interested in STONITH via NUT.

Thanks.

William Seligman wrote:

Please forgive the n00b question:

I've written a STONITH device script for systems that monitor their UPSes using
NUT. I think it might be of sufficient interest to include in the standard
Pacemaker distribution. What is the procedure for submitting such scripts?

I don't particularly want credit or anything like that. It's just a simple
script that I think could be a time-saver for sysadmins like me.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] cluster-dlm: set_fs_notified: set_fs_notified no nodeid 1812048064#012

2010-08-30 Thread Dan Frincu
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Add a resource with commandline to an existing group?

2010-08-31 Thread Dan Frincu

You can update the config by typing: crm configure
This puts you in the crm shell's configure mode. Then you type edit, 
which opens a vi session with the config; you edit the group entry by 
adding the necessary information, exit vi via Esc and :wq, and then run 
verify and commit.
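
A minimal sketch of such a session (group and resource IDs are 
hypothetical):

   # crm configure
   crm(live)configure# edit
   ... in vi, change "group mygroup resA resB" to
   ... "group mygroup resA resB resC", then :wq
   crm(live)configure# verify
   crm(live)configure# commit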


Regards,

Dan

Rainer wrote:

Hi all,

is it possible to add a resource with the command line to an existing group?
I always get the error: ID already in use.

With the GUI this is possible...

Kind regards,

Rainer


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Resync too slow! Cat /proc/drbd shows 240k/s

2010-09-06 Thread Dan Frincu

Alisson Landim wrote:

Hi.

After setting up a 2-node cluster from the Clusters from Scratch guide 
using Fedora 13, I saw that the resync of data is too slow.

cat /proc/drbd shows 240k/s
If you look at the cluster from scratch guide here:

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch07s02s03.html

you can see that the speed of this example is 240k too.

How to increase this speed?
Check /etc/drbd.conf for the rate parameter. On a Gigabit Ethernet I 
use 40M.

   syncer {
   rate 40M;
   }

See more here: http://www.drbd.org/users-guide/s-configure-syncer-rate.html
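
If I remember correctly, after editing /etc/drbd.conf the new rate can 
be applied on the fly with:

   drbdadm adjust all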

Regards,

Dan.



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] pygui error to install on Ubuntu 10.04

2010-09-13 Thread Dan Frincu

Hi,

Why don't you use DRBD MC? http://www.drbd.org/mc/management-console/

Regards,

Dan

Luana C. Rocha wrote:

 Hi,

I'm trying to install pygui on my ubuntu server 10.04
I have done these steps:

wget  http://hg.clusterlabs.org/pacemaker/pygui/archive/tip.tar.bz2
tar -jxvf tip.tar.bz2
cd Pacemaker-Python-GUI-6318ced8e29b/
./ConfigureMe make
Configure flags for Debian GNU/Linux: --prefix=/usr --sysconfdir=/etc 
--localstatedir=/var --mandir=/usr/share/man --disable-rpath
Running ./configure --prefix=/usr --sysconfdir=/etc 
--localstatedir=/var --mandir=/usr/share/man --disable-rpath

checking build system type... i686-pc-linux-gnu
checking host system type... i686-pc-linux-gnu
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
./configure: line 3432: syntax error near unexpected token `0.35.2'
./configure: line 3432: `AC_PROG_INTLTOOL(0.35.2)'

Does anyone know how I can make it work?

Tks.

L.C.R.




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: 
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] migration-threshold and failure-timeout

2010-09-21 Thread Dan Frincu

Hi,

This => 
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-failure-migration.html 
explains it pretty well. Notice the INFINITY score and what sets it.
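
For reference, the threshold from that page is a resource meta 
attribute; a minimal sketch (hypothetical resource):

   crm configure primitive p_app ocf:pacemaker:Dummy \
       meta migration-threshold=3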


However I don't know of any automatic method to clear the failcount.

Regards,
Dan

Pavlos Parissis wrote:

Hi,

I am trying to figure out a way to do the following:
if the monitor of resource x fails N times in a period of Z, then fail 
over to the other node and clear the fail-count.


Regards,
Pavlos



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Timeout after nodejoin

2010-09-22 Thread Dan Frincu

Hi all,

I have the following packages:

# rpm -qa | grep -i "(openais|cluster|heartbeat|pacemaker|resource)"
openais-0.80.5-15.2
cluster-glue-1.0-12.2
pacemaker-1.0.5-4.2
cluster-glue-libs-1.0-12.2
resource-agents-1.0-31.5
pacemaker-libs-1.0.5-4.2
pacemaker-mgmt-1.99.2-7.2
libopenais2-0.80.5-15.2
heartbeat-3.0.0-33.3
pacemaker-mgmt-client-1.99.2-7.2

When I start openais, I get nodejoin immediately, as seen in the logs 
below. However, it takes some time before the nodes are visible in 
crm_mon output. Any idea how to minimize this delay?


Sep 22 15:27:24 bench1 openais[12935]: [crm  ] info: 
send_member_notification: Sending membership update 8 to 1 children
Sep 22 15:27:24 bench1 openais[12935]: [CLM  ] got nodejoin message 
192.168.165.33
Sep 22 15:27:24 bench1 openais[12935]: [CLM  ] got nodejoin message 
192.168.165.35

Sep 22 15:27:24 bench1 mgmtd: [12947]: info: Started.
Sep 22 15:27:24 bench1 openais[12935]: [crm  ] WARN: route_ais_message: 
Sending message to local.crmd failed: unknown (rc=-2)
Sep 22 15:27:24 bench1 openais[12935]: [crm  ] WARN: route_ais_message: 
Sending message to local.crmd failed: unknown (rc=-2)
Sep 22 15:27:24 bench1 openais[12935]: [crm  ] info: pcmk_ipc: Recorded 
connection 0x174840d0 for crmd/12946
Sep 22 15:27:24 bench1 openais[12935]: [crm  ] info: pcmk_ipc: Sending 
membership update 8 to crmd
Sep 22 15:27:24 bench1 openais[12935]: [crm  ] info: 
update_expected_votes: Expected quorum votes 1024 -> 2
Sep 22 15:27:25 bench1 crmd: [12946]: notice: ais_dispatch: Membership 
8: quorum aquired
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_election_count_vote: 
Election 2 (owner: bench2) pass: vote from bench2 (Host name)
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_state_transition: State 
transition S_PENDING -> S_ELECTION [ input=I_ELECTION 
cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_state_transition: State 
transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC 
cause=C_FSA_INTERNAL origin=do_election_check ]
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_te_control: Registering 
TE UUID: 87c28ab8-ba93-4111-a26a-67e88dd927fb
Sep 22 15:28:15 bench1 crmd: [12946]: WARN: 
cib_client_add_notify_callback: Callback already present
Sep 22 15:28:15 bench1 crmd: [12946]: info: set_graph_functions: Setting 
custom graph functions
Sep 22 15:28:15 bench1 crmd: [12946]: info: unpack_graph: Unpacked 
transition -1: 0 actions in 0 synapses
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_dc_takeover: Taking over 
DC status for this partition
Sep 22 15:28:15 bench1 cib: [12942]: info: cib_process_readwrite: We are 
now in R/W 
mode



Regards,

Dan

--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Timeout after nodejoin

2010-09-22 Thread Dan Frincu

Hi,

Raoul Bhatia [IPAX] wrote:

hi,

On 09/22/2010 02:43 PM, Dan Frincu wrote:
  

When I start openais, I get nodejoin immediately, as seen in the logs
below. However, it takes some time before the nodes are visible in
crm_mon output. Any idea how to minimize this delay?

Sep 22 15:27:24 bench1 openais[12935]: [crm  ] info:
send_member_notification: Sending membership update 8 to 1 children
Sep 22 15:27:24 bench1 openais[12935]: [CLM  ] got nodejoin message
192.168.165.33
Sep 22 15:27:24 bench1 openais[12935]: [CLM  ] got nodejoin message
192.168.165.35
Sep 22 15:27:24 bench1 mgmtd: [12947]: info: Started.
Sep 22 15:27:24 bench1 openais[12935]: [crm  ] WARN: route_ais_message:
Sending message to local.crmd failed: unknown (rc=-2)
Sep 22 15:27:24 bench1 openais[12935]: [crm  ] WARN: route_ais_message:
Sending message to local.crmd failed: unknown (rc=-2)
Sep 22 15:27:24 bench1 openais[12935]: [crm  ] info: pcmk_ipc: Recorded
connection 0x174840d0 for crmd/12946
Sep 22 15:27:24 bench1 openais[12935]: [crm  ] info: pcmk_ipc: Sending
membership update 8 to crmd
Sep 22 15:27:24 bench1 openais[12935]: [crm  ] info:
update_expected_votes: Expected quorum votes 1024 -> 2
Sep 22 15:27:25 bench1 crmd: [12946]: notice: ais_dispatch: Membership
8: quorum aquired
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_election_count_vote:
Election 2 (owner: bench2) pass: vote from bench2 (Host name)
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_state_transition: State
transition S_PENDING -> S_ELECTION [ input=I_ELECTION
cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_state_transition: State
transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_election_check ]
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_te_control: Registering
TE UUID: 87c28ab8-ba93-4111-a26a-67e88dd927fb
Sep 22 15:28:15 bench1 crmd: [12946]: WARN:
cib_client_add_notify_callback: Callback already present
Sep 22 15:28:15 bench1 crmd: [12946]: info: set_graph_functions: Setting
custom graph functions
Sep 22 15:28:15 bench1 crmd: [12946]: info: unpack_graph: Unpacked
transition -1: 0 actions in 0 synapses
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_dc_takeover: Taking over
DC status for this partition
Sep 22 15:28:15 bench1 cib: [12942]: info: cib_process_readwrite: We are
now in R/W
mode   



is the cluster up and running and you're only (re-)starting one node?
or is this after you start openais on both nodes?

thanks,
raoul
  

Second case, just after openais start on both nodes.

Regards,
Dan

--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Timeout after nodejoin

2010-09-23 Thread Dan Frincu

Hi,

Steven Dake wrote:

On 09/22/2010 05:43 AM, Dan Frincu wrote:

Hi all,

I have the following packages:

# rpm -qa | grep -i "(openais|cluster|heartbeat|pacemaker|resource)"
openais-0.80.5-15.2
cluster-glue-1.0-12.2
pacemaker-1.0.5-4.2
cluster-glue-libs-1.0-12.2
resource-agents-1.0-31.5
pacemaker-libs-1.0.5-4.2
pacemaker-mgmt-1.99.2-7.2
libopenais2-0.80.5-15.2
heartbeat-3.0.0-33.3
pacemaker-mgmt-client-1.99.2-7.2

When I start openais, I get nodejoin immediately, as seen in the logs
below. However, it takes some time before the nodes are visible in
crm_mon output. Any idea how to minimize this delay?

Sep 22 15:27:24 bench1 openais[12935]: [crm ] info:
send_member_notification: Sending membership update 8 to 1 children
Sep 22 15:27:24 bench1 openais[12935]: [CLM ] got nodejoin message
192.168.165.33
Sep 22 15:27:24 bench1 openais[12935]: [CLM ] got nodejoin message
192.168.165.35
Sep 22 15:27:24 bench1 mgmtd: [12947]: info: Started.
Sep 22 15:27:24 bench1 openais[12935]: [crm ] WARN: route_ais_message:
Sending message to local.crmd failed: unknown (rc=-2)
Sep 22 15:27:24 bench1 openais[12935]: [crm ] WARN: route_ais_message:
Sending message to local.crmd failed: unknown (rc=-2)
Sep 22 15:27:24 bench1 openais[12935]: [crm ] info: pcmk_ipc: Recorded
connection 0x174840d0 for crmd/12946
Sep 22 15:27:24 bench1 openais[12935]: [crm ] info: pcmk_ipc: Sending
membership update 8 to crmd
Sep 22 15:27:24 bench1 openais[12935]: [crm ] info:
update_expected_votes: Expected quorum votes 1024 -> 2
Sep 22 15:27:25 bench1 crmd: [12946]: notice: ais_dispatch: Membership
8: quorum aquired
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_election_count_vote:
Election 2 (owner: bench2) pass: vote from bench2 (Host name)
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_state_transition: State
transition S_PENDING -> S_ELECTION [ input=I_ELECTION
cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_state_transition: State
transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_election_check ]
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_te_control: Registering
TE UUID: 87c28ab8-ba93-4111-a26a-67e88dd927fb
Sep 22 15:28:15 bench1 crmd: [12946]: WARN:
cib_client_add_notify_callback: Callback already present
Sep 22 15:28:15 bench1 crmd: [12946]: info: set_graph_functions: Setting
custom graph functions
Sep 22 15:28:15 bench1 crmd: [12946]: info: unpack_graph: Unpacked
transition -1: 0 actions in 0 synapses
Sep 22 15:28:15 bench1 crmd: [12946]: info: do_dc_takeover: Taking over
DC status for this partition
Sep 22 15:28:15 bench1 cib: [12942]: info: cib_process_readwrite: We are
now in R/W mode

Regards,

Dan



Where did you get that version of openais?  openais 0.80.x is 
deprecated in the community (and hence, no support).  We recommend 
using corosync instead which has improved testing with pacemaker.


From the SUSE repositories for Redhat, last year, when we began working 
with this cluster stack. I also pushed for corosync, for obvious 
reasons; however, for existing installations an upgrade will require 
some testing, because the platforms cannot be taken offline.

Anyway, thank you all for your input. I've done some research, and 
fiddling with the dc-deadtime parameter did the trick.
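
For reference, that is a cluster property, so it would be set with 
something like (value illustrative):

   crm configure property dc-deadtime=10s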


Regards,

Dan  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Migrate resources based on connectivity

2010-10-10 Thread Dan Frincu

Hi,

I have the following setup:
- order drbd0:promote drbd1:promote
- order drbd1:promote drbd2:promote
- order drbd2:promote all:start
- collocation all drbd2:Master
- all is a group of resources, drbd{0..3} are drbd ms resources.

I want to migrate the resources based on ping connectivity to a default 
gateway. Based on 
http://www.clusterlabs.org/wiki/Pingd_with_resources_on_different_networks 
and http://www.clusterlabs.org/wiki/Example_configurations I've tried 
the following:
- primitive ping ocf:pacemaker:ping params host_list=1.2.3.4 
multiplier=100 op monitor interval=5s timeout=5s

- clone ping_clone ping meta globally-unique=false
- location ping_nok all \
   rule $id="ping_nok-rule" -inf: not_defined ping_clone or ping_clone 
number:lte 0


I've also tried ping instead of ping_clone, with the same result: 
regardless of the node where ping is prohibited, the "all" group gets 
negative infinite metrics and "cannot run anywhere".


Also based on the "crm configure help location" output, I've tried:
- location ping_nok all \
   rule $id="ping_nok-rule" -inf: ping_clone number:lte 0 and #uname 
string:eq hostname


With this one, nothing happens to the group when ping is prohibited. So 
far, no luck. I'm using:


# rpm -qa | grep -i "(openais|cluster|heartbeat|pacemaker|resource)"
cluster-glue-1.0.6-1.6.el5
pacemaker-libs-1.0.9.1-1.el5
pacemaker-1.0.9.1-1.el5
heartbeat-libs-3.0.3-2.el5
heartbeat-3.0.3-2.el5
openaislib-1.1.3-1.6.el5
resource-agents-1.0.3-2.el5
cluster-glue-libs-1.0.6-1.6.el5
openais-1.1.3-1.6.el5

My goal is to move all the resources, drbds and group, from one node to 
the other if the gateway is unreachable via ping. I can say that the 
ocf:pacemaker:ping RA works (I've read some complaints about the pingd 
RA on the mailing lists). If this were just a group of resources, 
without the ordering and collocation constraints, it would have worked; 
however, I need to specify these as well.


Any help would be appreciated, it's been a long weekend and I still 
haven't figured it out. I hope it's not a bug ...


Best regards,

Dan

--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Infinite fail-count and migration-threshold after node fail-back

2010-10-11 Thread Dan Frincu

Hi all,

I've managed to make this setup work. Basically, the issue is that with 
symmetric-cluster="false" and the resources' locations specified 
manually, the resources will always obey the location constraints and 
(as far as I could see) disregard the rsc_defaults resource-stickiness 
values. This is not the expected behavior: in theory, 
symmetric-cluster="false" should only affect whether resources are 
allowed to run anywhere by default, and resource-stickiness should lock 
the resources in place so they don't bounce from node to node. Again, 
this didn't happen, but by setting symmetric-cluster="true", using the 
same ordering and collocation constraints and the same 
resource-stickiness, the behavior is the expected one.


I don't remember seeing it mentioned anywhere in the clusterlabs.org 
docs that resource-stickiness only works with symmetric-cluster="true", 
so for anyone who also stumbles upon this issue, I hope this helps.
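
The two settings in question, as a sketch (stickiness value illustrative):

   crm configure property symmetric-cluster=true
   crm configure rsc_defaults resource-stickiness=100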


Regards,

Dan

Dan Frincu wrote:

Hi,

Since it was brought to my attention that I should upgrade from 
openais-0.80 to the more recent corosync, I've done just that; 
however, I'm experiencing strange behavior on the cluster.


The same setup was used with the below packages:

# rpm -qa | grep -i "(openais|cluster|heartbeat|pacemaker|resource)"
openais-0.80.5-15.2
cluster-glue-1.0-12.2
pacemaker-1.0.5-4.2
cluster-glue-libs-1.0-12.2
resource-agents-1.0-31.5
pacemaker-libs-1.0.5-4.2
pacemaker-mgmt-1.99.2-7.2
libopenais2-0.80.5-15.2
heartbeat-3.0.0-33.3
pacemaker-mgmt-client-1.99.2-7.2

Now I've migrated to the most recent stable packages I could find (on 
the clusterlabs.org website) for RHEL5:


# rpm -qa | grep -i "(openais|cluster|heartbeat|pacemaker|resource)"
cluster-glue-1.0.6-1.6.el5
pacemaker-libs-1.0.9.1-1.el5
pacemaker-1.0.9.1-1.el5
heartbeat-libs-3.0.3-2.el5
heartbeat-3.0.3-2.el5
openaislib-1.1.3-1.6.el5
resource-agents-1.0.3-2.el5
cluster-glue-libs-1.0.6-1.6.el5
openais-1.1.3-1.6.el5

Expected behavior:
- all the resources the in group should go (based on location 
preference) to bench1

- if bench1 goes down, resources migrate to bench2
- if bench1 comes back up, resources stay on bench2, unless manually 
told otherwise.


With the previous incantation this worked; with the new packages, not 
so much. Now if bench1 goes down (crm node standby `uname -n`), 
failover occurs, but when bench1 comes back up, resources migrate back 
even though default-resource-stickiness is set. More than that, 2 drbd 
block devices reach infinite metrics, most notably because the cluster 
tries to promote the resources to a Master state on bench1 but fails 
to do so due to the resource being held open (by some process I could 
not identify).


Strangely enough, the resources (drbd) fail to be promoted to a Master 
status on bench1, so they fail back to bench2, where they are mounted 
(functional), but crm_mon shows:


Migration summary:
* Node bench2.streamwide.ro:
  drbd_mysql:1: migration-threshold=100 fail-count=100
  drbd_home:1: migration-threshold=100 fail-count=100
* Node bench1.streamwide.ro:

 infinite metrics on bench2, while the drbd resources are available

version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by 
mockbu...@v20z-x86-64.home.local, 2009-08-29 14:07:55

0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r
   ns:1632 nr:1864 dw:3512 dr:3933 al:11 bm:19 lo:0 pe:0 ua:0 ap:0 
ep:1 wo:b oos:0

1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r
   ns:4 nr:24 dw:28 dr:25 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r
   ns:4 nr:24 dw:28 dr:85 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

and mounted

/dev/drbd1 on /home type ext3 (rw,noatime,nodiratime)
/dev/drbd0 on /mysql type ext3 (rw,noatime,nodiratime)
/dev/drbd2 on /storage type ext3 (rw,noatime,nodiratime)

Attached is the hb_report.

Thank you in advance.

Best regards



--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Migrate resources based on connectivity

2010-10-11 Thread Dan Frincu

Hi,

Dejan Muhamedagic wrote:

Hi,

On Sun, Oct 10, 2010 at 10:27:13PM +0300, Dan Frincu wrote:
  

Hi,

I have the following setup:
- order drbd0:promote drbd1:promote
- order drbd1:promote drbd2:promote
- order drbd2:promote all:start
- collocation all drbd2:Master
- all is a group of resources, drbd{0..3} are drbd ms resources.

I want to migrate the resources based on ping connectivity to a
default gateway. Based on 
http://www.clusterlabs.org/wiki/Pingd_with_resources_on_different_networks
and http://www.clusterlabs.org/wiki/Example_configurations I've
tried the following:
- primitive ping ocf:pacemaker:ping params host_list=1.2.3.4
multiplier=100 op monitor interval=5s timeout=5s
- clone ping_clone ping meta globally-unique=false
- location ping_nok all \
   rule $id="ping_nok-rule" -inf: not_defined ping_clone or
ping_clone number:lte 0



Use pingd to reference the attribute in the location constraint.
  
Not to be disrespectful, but after 3 days of being stuck on this issue, I 
don't exactly understand how to do that. Could you please provide an 
example?


Thank you in advance.

Regards,

Dan

Thanks,

Dejan

  

I've also tried ping instead of ping_clone with the same result,
regardless of the node where ping is prohibited, the "all" group
gets negative infinite metrics and "cannot run anywhere".

Also based on the "crm configure help location" output, I've tried:
- location ping_nok all \
   rule $id="ping_nok-rule" -inf: ping_clone number:lte 0 and #uname
string:eq hostname

At this one, nothing happens to the group when ping is prohibited.
So far, no luck. I'm using:

# rpm -qa | grep -i "(openais|cluster|heartbeat|pacemaker|resource)"
cluster-glue-1.0.6-1.6.el5
pacemaker-libs-1.0.9.1-1.el5
pacemaker-1.0.9.1-1.el5
heartbeat-libs-3.0.3-2.el5
heartbeat-3.0.3-2.el5
openaislib-1.1.3-1.6.el5
resource-agents-1.0.3-2.el5
cluster-glue-libs-1.0.6-1.6.el5
openais-1.1.3-1.6.el5

My goal is to move all the resources, drbd's and group from one node
to the other if the gateway is unreachable via ping. I can say that
the ocf:pacemaker:ping RA works (I've read some complaints about the
pingd RA in the mailing lists), if this was just a group of
resources, without the ordering and collocation constraints, it
would have worked, however I need to specify these as well.

Any help would be appreciated, it's been a long weekend and I still
haven't figured it out. I hope it's not a bug ...

Best regards,

Dan

--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania




  

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Migrate resources based on connectivity

2010-10-12 Thread Dan Frincu

Hi,

Lars Ellenberg wrote:

On Mon, Oct 11, 2010 at 03:50:01PM +0300, Dan Frincu wrote:
  

Hi,

Dejan Muhamedagic wrote:


Hi,

On Sun, Oct 10, 2010 at 10:27:13PM +0300, Dan Frincu wrote:
  

Hi,

I have the following setup:
- order drbd0:promote drbd1:promote
- order drbd1:promote drbd2:promote
- order drbd2:promote all:start
- collocation all drbd2:Master
- all is a group of resources, drbd{0..3} are drbd ms resources.

I want to migrate the resources based on ping connectivity to a
default gateway. Based on 
http://www.clusterlabs.org/wiki/Pingd_with_resources_on_different_networks
and http://www.clusterlabs.org/wiki/Example_configurations I've
tried the following:
- primitive ping ocf:pacemaker:ping params host_list=1.2.3.4
multiplier=100 op monitor interval=5s timeout=5s
- clone ping_clone ping meta globally-unique=false
- location ping_nok all \
  rule $id="ping_nok-rule" -inf: not_defined ping_clone or
ping_clone number:lte 0


Use pingd to reference the attribute in the location constraint.
  

Not to be disrespectful, but after 3 days being stuck on this issue,
I don't exactly understand how to do that. Could you please provide
an example.

Thank you in advance.



The example you reference lists:

primitive pingdnet1 ocf:pacemaker:pingd \
params host_list=192.168.23.1 \
name=pingdnet1
^^

clone cl-pingdnet1 pingdnet1
   ^

param name default is pingd,
and is the attribute name to be used in the location constraints.

You will need to reference pingd in your location constraint, or set an
explicit name in the primitive definition, and reference that.

Your ping primitive sets the default 'pingd' attribute,
but you reference some 'ping_clone' attribute,
which apparently no-one really references.

  
I've finally managed to finish the setup with the guidance received 
above; the behavior is the expected one. Also, I've tried 
ocf:pacemaker:pingd, and even though it does the reachability tests 
properly, it fails to update the CIB upon restoring connectivity; I had 
to manually run attrd_updater -R to get the resources to start again, 
therefore I'm going with ocf:pacemaker:ping.


Anyways, Dejan, Lars, Andrew, thank you all very much for your help.

Best regards,

Dan

--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] problem about move node from one cluster to another cluster

2010-10-12 Thread Dan Frincu

Hi,

Depending on the openais version (please mention it) this behavior can 
happen; I've seen it as well, on openais-0.80. What I did to fix it was 
to restart the openais process via /etc/init.d/openais restart, and 
then it worked. However, this was one of the reasons I updated the 
packages to the latest versions of corosync, pacemaker, etc. The tricky 
part was doing the migration procedure for upgrading production servers 
without service downtime, but that's another story.


Regards,

Dan

jiaju liu wrote:




Message: 2
Date: Tue, 12 Oct 2010 10:40:18 +0800 (CST)
From: jiaju liu <liujiaj...@yahoo.com.cn>
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] problem about move node from one cluster to
another cluster
Message-ID: <765547.4759...@web15704.mail.cnb.yahoo.com>
Content-Type: text/plain; charset="iso-8859-1"

hi everybody
I use the command service openais stop first to stop the openais
service, and then use rm -rf /var/lib/heartbeat/crm/* to clear all
information. Then I change the multicast address and use service
openais start in another cluster.
The problem is that sometimes it works well and I can use the crm_mon
command, and sometimes it doesn't work. I use service openais status
to check; it shows Running, but I cannot use crm_mon to connect to
the cluster.
I found the reason may be that the directory /var/lib/heartbeat/crm/
is empty. Why? If I reboot, it works again. WHY?

Now, even when the directory is not empty, it sometimes does not
work.
When I use crm_mon it shows
Attempting connection to the cluster..

When I use crm node list it shows
Signon to CIB failed: connection failed
Init failed, could not perform requested operations
ERROR: cannot parse output of cibadmin -Ql -o nodes: no element
found: line 1, column 0


 



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] 1st monitor is too fast after the start

2010-10-13 Thread Dan Frincu

Hi,

I've noticed the same type of behavior, however in a different context, 
my setup includes 3 drbd devices and a group of resources, all have to 
run on the same node and move together to other nodes. My issue was with 
the first resource that required access to a drbd device, which was the 
ocf:heartbeat:Filesystem RA trying to do a mount and failing.


The reason: it was trying to mount the drbd device before the device 
had finished being promoted to the Primary state. Same as you, I 
introduced a start-delay, but on the start action. This proved to be of 
no use, as the behavior persisted even with an increased start-delay. 
However, it only happened when performing a fail-back operation: during 
fail-over everything was OK, during fail-back, error.

The fix I made was to remove any start-delay and to add group 
collocation constraints for all ms_drbd resources; before that, I had 
only one collocation constraint, for the drbd device promoted last. A 
sketch follows.
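
A sketch of those constraints (the group here is called "all" and the 
ms resources ms_drbd_*, as in my configuration posted elsewhere on the 
list):

   colocation all_on_drbd_home inf: all ms_drbd_home:Master
   colocation all_on_drbd_mysql inf: all ms_drbd_mysql:Master
   colocation all_on_drbd_storage inf: all ms_drbd_storage:Master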


I hope this helps.

Regards,

Dan

Pavlos Parissis wrote:

Hi,

I noticed a race condition while I was integrating an application with
Pacemaker and thought to share it with you.

The init script of the application is LSB-compliant and passes the
tests mentioned in the Pacemaker documentation. Moreover, the init
script uses the functions supplied by the system[1] for starting,
stopping and checking the application.

I observed a few times that the monitor action was failing after the
startup of the cluster or the movement of the resource group.
Because it was not always happening and a manual start/status was
always working, it was quite tricky and difficult to find the root
cause of the failure.
After a few hours of troubleshooting, I found out that the 1st monitor
action after the start action was executed too fast for the
application to create the pid file. As a result, the monitor action
was receiving an error.

I know it sounds a bit strange, but it happened on my systems. The fact
that my systems are basically vmware images on a laptop could have a
relation to the issue.

Nevertheless, I would like to ask if you are thinking of implementing
an "init_wait" on the 1st monitor action. It could be useful.

To solve my issue I put a sleep after the start of the application in
the init script. This gives the application enough time to create
its pid file, and the 1st monitor doesn't fail.


Cheers,
Pavlos


[1] Cent0S 5.4

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] 1st monitor is too fast after the start

2010-10-13 Thread Dan Frincu

Pavlos Parissis wrote:

On 13 October 2010 09:48, Dan Frincu  wrote:
  

Hi,

I've noticed the same type of behavior, however in a different context, my
setup includes 3 drbd devices and a group of resources, all have to run on
the same node and move together to other nodes. My issue was with the first
resource that required access to a drbd device, which was the
ocf:heartbeat:Filesystem RA trying to do a mount and failing.

The reason, it was trying to do the mount of the drbd device before the drbd
device had finished migrating to primary state. Same as you, I introduced a
start-delay, but on the start action. This proved to be of no use as the
behavior persisted, even with an increased start-delay. However, it only
happened when performing a fail-back operation, during fail-over, everything
was ok, during fail-back, error.

The fix I've made was to remove any start-delay and to add group collocation
constraints to all ms_drbd resources. Before that I only had one collocation
constraint for the drbd device being promoted last.

I hope this helps.




I am glad that somebody else experienced the same issue :)

In my mail I was talking about the monitor action which was failing,
but the behavior you described happened on my system with the same
setup, drbd and fs resource. It also happened on the application
resource: the start was too fast and the FS was not mounted (yet) when
the start action fired for the application resource. A delay in the
start function of the application's resource agent fixed my issue.

In my setup I have all the necessary constraints to avoid this, at
least that is what I believe :-)

Cheers,
Pavlos
  
From what I see you have a dual-primary setup with failover on the 
third node. Basically, if you have a drbd resource for which you have 
both ordering and collocation, I don't think you need to "improve" it; 
if it ain't broke, don't fix it :)


Regards,

Dan


[r...@node-01 sysconfig]# crm configure show
node $id="059313ce-c6aa-4bd5-a4fb-4b781de6d98f" node-03
node $id="d791b1f5-9522-4c84-a66f-cd3d4e476b38" node-02
node $id="e388e797-21f4-4bbe-a588-93d12964b4d7" node-01
primitive drbd_01 ocf:linbit:drbd \
params drbd_resource="drbd_pbx_service_1" \
op monitor interval="30s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="120s"
primitive drbd_02 ocf:linbit:drbd \
params drbd_resource="drbd_pbx_service_2" \
op monitor interval="30s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="120s"
primitive fs_01 ocf:heartbeat:Filesystem \
params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
meta migration-threshold="3" failure-timeout="60" \
op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
primitive fs_02 ocf:heartbeat:Filesystem \
params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
meta migration-threshold="3" failure-timeout="60" \
op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
primitive ip_01 ocf:heartbeat:IPaddr2 \
params ip="192.168.78.10" cidr_netmask="24" broadcast="192.168.78.255" \
meta failure-timeout="120" migration-threshold="3" \
op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
meta failure-timeout="120" migration-threshold="3" \
params ip="192.168.78.20" cidr_netmask="24" broadcast="192.168.78.255" \
op monitor interval="5s"
primitive pbx_01 lsb:znd-pbx_01 \
meta migration-threshold="3" failure-timeout="60"
target-role="Started" \
op monitor interval="20s" timeout="20s" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
primitive pbx_02 lsb:znd-pbx_02 \
meta migration-threshold="3" failure-timeout="60" \
op monitor interval="20s" timeout="20s" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
primitive sshd_01 lsb:znd-sshd-pbx_01 \
meta target-role="Started" is-managed="true&

Re: [Pacemaker] Migrate resources based on connectivity

2010-10-13 Thread Dan Frincu

Hi,

Pavlos Parissis wrote:

On 12 October 2010 20:00, Dan Frincu  wrote:
  

Hi,

Lars Ellenberg wrote:

On Mon, Oct 11, 2010 at 03:50:01PM +0300, Dan Frincu wrote:


Hi,

Dejan Muhamedagic wrote:


Hi,

On Sun, Oct 10, 2010 at 10:27:13PM +0300, Dan Frincu wrote:


Hi,

I have the following setup:
- order drbd0:promote drbd1:promote
- order drbd1:promote drbd2:promote
- order drbd2:promote all:start
- collocation all drbd2:Master
- all is a group of resources, drbd{0..3} are drbd ms resources.

I want to migrate the resources based on ping connectivity to a
default gateway. Based on
http://www.clusterlabs.org/wiki/Pingd_with_resources_on_different_networks
and http://www.clusterlabs.org/wiki/Example_configurations I've
tried the following:
- primitive ping ocf:pacemaker:ping params host_list=1.2.3.4
multiplier=100 op monitor interval=5s timeout=5s
- clone ping_clone ping meta globally-unique=false
- location ping_nok all \
  rule $id="ping_nok-rule" -inf: not_defined ping_clone or
ping_clone number:lte 0


Use pingd to reference the attribute in the location constraint.


Not to be disrespectful, but after being stuck on this issue for 3 days,
I don't exactly understand how to do that. Could you please provide
an example.

Thank you in advance.


The example you reference lists:

primitive pingdnet1 ocf:pacemaker:pingd \
params host_list=192.168.23.1 \
name=pingdnet1
^^

clone cl-pingdnet1 pingdnet1
   ^

param name default is pingd,
and is the attribute name to be used in the location constraints.

You will need to reference pingd in you location constraint, or set an
explicit name in the primitive definition, and reference that.

Your ping primitive sets the default 'pingd' attribute,
but you reference some 'ping_clone' attribute,
which apparently no-one really references.
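
Spelled out against the config above (a sketch of the two options):

# option 1: reference the default attribute name in the constraint
location ping_nok all \
   rule $id="ping_nok-rule" -inf: not_defined pingd or pingd number:lte 0

# option 2: set an explicit name in the primitive and reference that
primitive ping ocf:pacemaker:ping \
   params host_list=1.2.3.4 multiplier=100 name=ping_attr \
   op monitor interval=5s timeout=5s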



I've finally managed to finish the setup with the guidance received
above; the behavior is the expected one. I've also tried
ocf:pacemaker:pingd, and even though it does the reachability tests properly,
it fails to update the CIB when connectivity is restored. I had to
manually run attrd_updater -R to get the resources to start again, therefore
I'm going with ocf:pacemaker:ping.
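
(The manual refresh boiled down to running

attrd_updater -R

on the affected node, which, as far as I can tell, makes attrd push its
attribute values back out to the CIB.)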



it would be quite useful for the rest of us if you posted your final
and working configuration.
Cheers,
Pavlos
  

The relevant stuff is related to the group and ping location constraint.

primitive ping_gw ocf:pacemaker:ping \
   params host_list="1.1.1.99" multiplier="100" name="ping_gw_name" \
   op monitor interval="5s" timeout="60s" \
   op start interval="0s" timeout="60s"
group all virtual_ip_1 virtual_ip_2 Failover_Alert fs_home fs_mysql 
fs_storage httpd mysqld \

   meta target-role="Started"
ms ms_drbd_home drbd_home \
   meta notify="true" globally-unique="false" target-role="Started" 
master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"

ms ms_drbd_mysql drbd_mysql \
   meta notify="true" globally-unique="false" target-role="Started" 
master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"

ms ms_drbd_storage drbd_storage \
   meta notify="true" globally-unique="false" target-role="Started" 
master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"

clone ping_gw_clone ping_gw \
   meta globally-unique="false" target-role="Started"
location nok_ping_to_gw all \
   rule $id="nok_ping_to_gw-rule" -inf: not_defined *ping_gw_name* 
or *ping_gw_name* lte 0

colocation all_on_home inf: all ms_drbd_home:Master
colocation all_on_mysql inf: all ms_drbd_mysql:Master
colocation all_on_storage inf: all ms_drbd_storage:Master
order all_after_storage inf: ms_drbd_storage:promote all:start
order ms_drbd_home_after_ms_drbd_mysql inf: ms_drbd_mysql:promote 
ms_drbd_home:promote
order ms_drbd_storage_after_ms_drbd_home inf: ms_drbd_home:promote 
ms_drbd_storage:promote

property $id="cib-bootstrap-options" \
   expected-quorum-votes="2" \
   stonith-enabled="false" \
   symmetric-cluster="true" \
   dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
   no-quorum-policy="ignore" \
   cluster-infrastructure="openais" \
   last-lrm-refresh="1286905225"
rsc_defaults $id="rsc-options" \
   multiple-active="block" \
   resource-stickiness="1000"

I hope this helps.

Regards,

Dan

___

Re: [Pacemaker] 1st monitor is too fast after the start

2010-10-13 Thread Dan Frincu

Pavlos Parissis wrote:

On 13 October 2010 10:50, Dan Frincu  wrote:
  

From what I see you have a dual primary setup with failover on the third
node, basically if you have one drbd resource for which you have both
ordering and collocation, I don't think you need to "improve" it, if it
ain't broke, don't fix it :)

Regards,




No, I don't have dual-primary. My DRBD is in single-primary mode for
both DRBD resources.
I use an N+1 setup: I have 2 resource groups, each with a unique primary
and a shared secondary.
pbx_service_01 resource group has primary node-01 and secondary node-03
pbx_service_02 resource group has primary node-02 and secondary node-03

I use an asymmetric cluster with specific location constraints in order
to implement the above.
The DRBD resources will never be in primary mode on 2 nodes at the same time.
I have set specific collocation and order constraints in order to
"bond" each DRBD ms resource to the appropriate resource group.

I hope it is clear now.

Cheers and thanks for looking at my conf,
Pavlos

  
True, my bad, dual-primary does not apply to your setup; I phrased it 
wrong, I meant what you said :)


Regards,

Dan

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] problem about move node from one cluster to another cluster

2010-10-13 Thread Dan Frincu

Hi,

Yes, it sometimes needs to be killed manually because the process hangs 
and the restart operation never seems to end. Yet another reason to upgrade.


All,

Question: given the fact that this type of software usually gets 
installed on a platform once and then goes into service for many 
years, on servers where downtime should be kept to a minimum (gee, 
that's why you use a cluster :)), how does this fit the release schedule?


I mean, there are plenty of users out there with questions related to 
Heartbeat 2, openais-0.8.0, and so on and so forth. Some environments 
cannot be changed lightly, others not at all, so what is the response 
to "this feature doesn't work on that version of the software"? Upgrade? If 
so, at what interval (keeping in mind that you probably want the stable 
packages on your system)?


I'm asking this because when I started working with openais, the latest 
version available was 0.8.0 on some SUSE repos that aren't available 
anymore.


Regards,

Dan

jiaju liu wrote:

Hi,

Depending on the openais version (please mention it)
 
Hi

Thank you for your reply; my openais version is openais-0.80.5-15.1
and my pacemaker version is pacemaker-1.0.5-4.1.
I use restart but it does not work. I found it could not stop.
 
 
this behavior could

happen, I've seen it as well, on openais-0.8.0. What I've done to fix it
was to restart the openais process via /etc/init.d/openais restart. And
then it worked, however, this was one of the reasons I updated the
packages to the latest versions of corosync, pacemaker, etc. The tricky
part was doing the migration procedure for upgrading production servers
without service downtime, but that's another story.


  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] problem about move node from one cluster to another

2010-10-14 Thread Dan Frincu

Hi,

Just go to http://clusterlabs.org/rpm/ and search for the RPMs for your 
OS version. If you can shut the nodes down to perform the upgrade, it 
would be a lot easier. If it has to be done without downtime, that is 
also possible, but it requires careful planning of what to change and 
when, so that you don't mess anything up, and it depends closely on how 
the cluster is set up.


How I've done it was as follows:
- 2 node setup, primary - secondary.
- went on secondary node, put it in standby
- shutdown openais
- removed all cluster related RPMs (openais, pacemaker, heartbeat, etc.) 
but not the drbd RPMs
- installed the new RPMs, used corosync instead of openais, configured 
corosync's /etc/corosync/corosync.conf based on the settings of the old 
openais (actually did a meld between the old openais.conf and the new 
corosync.conf and saved a unified corosync.conf)

- removed everything from /var/lib/heartbeat/crm/
- started corosync on secondary node
- waited for the node to join, then issued a standby on the primary, 
everything migrated to the secondary
- went on the primary, removed the RPMs, removed 
/var/lib/heartbeat/crm/* configured corosync, started it


Et voila, it works. BTW, awesome job done to maintain compatibility 
between versions, the upgrade was truly seamless.
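
Condensed into commands, the per-node procedure looks roughly like this
(a sketch; node and package names are only illustrative):

crm node standby node2
/etc/init.d/openais stop
rpm -e openais pacemaker heartbeat ...        # old stack, keep the drbd RPMs
rpm -ivh corosync-*.rpm pacemaker-*.rpm cluster-glue-*.rpm resource-agents-*.rpm
vi /etc/corosync/corosync.conf                # merge settings from openais.conf
rm -f /var/lib/heartbeat/crm/*
/etc/init.d/corosync start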


Regards,

Dan

jiaju liu wrote:



Hi
 
Thank you for your help. I want to upgrade my openais. Do I need
to reinstall Linux and download the latest version of openais, or is
there any other simple way? Thanks :-)



Hi,

Depending on the openais version (please mention it)

Hi. Thank you for your reply; my openais version is openais-0.80.5-15.1
and my pacemaker version is pacemaker-1.0.5-4.1.
I use restart but it does not work. I found it could not stop.

this behavior could
happen, I've seen it as well, on openais-0.8.0. What I've done to
fix it
was to restart the openais process via /etc/init.d/openais
restart. And
then it worked, however, this was one of the reasons I updated the
packages to the latest versions of corosync, pacemaker, etc. The
tricky
part was doing the migration procedure for upgrading production
servers
without service downtime, but that's another story.

Regards,

Dan



 

Re: [Pacemaker] update openais

2010-10-15 Thread Dan Frincu

Hi,

It's not mandatory to install ldirectord, I know it's not a dependency 
anymore. As for libesmtp see http://tinyurl.com/2uhdpzw


jiaju liu wrote:

Hi I have already installed rpm as follow:
cluster-glue-1.0.5-1.el5.x86_64.rpm
cluster-glue-libs-1.0.5-1.el5.x86_64.rpm
cluster-glue-libs-devel-1.0.5-1.el5.x86_64.rpm
corosync-1.2.2-1.1.el5.x86_64.rpm
corosynclib-1.2.2-1.1.el5.x86_64.rpm
corosynclib-devel-1.2.2-1.1.el5.x86_64.rpm
heartbeat-3.0.3-2.el5.x86_64.rpm
heartbeat-devel-3.0.3-2.el5.x86_64.rpm
heartbeat-libs-3.0.3-2.el5.x86_64.rpm
openais-1.1.0-1.el5.x86_64.rpm
openaislib-1.1.0-1.el5.x86_64.rpm
openaislib-devel-1.1.0-1.el5.x86_64.rpm
pacemaker-libs-devel-1.0.8-6.1.el5.x86_64.rpm
pacemaker-libs-1.0.8-6.1.el5.x86_64.rpm
resource-agents-1.0.3-2.el5.x86_64.rpm
 
 
at last I install pacemaker-1.0.8-6.1.el5.x86_64.rpm

it shows
error: Failed dependencies:
 libesmtp is needed by pacemaker-1.0.8-6.1.el5.x86_64
 libesmtp.so.5()(64bit) is needed by pacemaker-1.0.8-6.1.el5.x86_64
 
and must the ldirectord-1.0.3-2.el5.x86_64.rpm be installed?

thanks a lot:-)


  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] update openais

2010-10-15 Thread Dan Frincu

Hi,

Just a wild guess, but in corosync.conf you have defined logfile: 
/var/log/cluster/corosync.log. Is the directory /var/log/cluster 
created? If it isn't, create it and try again.
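
Something along these lines (the path matching what I assume is in your
corosync.conf):

mkdir -p /var/log/cluster
service corosync start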


Regards,

Dan

jiaju liu wrote:




Thank you for your help.
I have installed all the RPM packages.
Now I run the command
service corosync start
and the result is
 
Starting Corosync Cluster Engine (corosync):   [FAILED]
 
I have changed corosync.conf several times and it doesn't work. Would
you please send me a document as a reference? Thanks a lot
 
 
Hi,


It's not mandatory to install ldirectord, I know it's not a
dependency
anymore. As for libesmtp see http://tinyurl.com/2uhdpzw

jiaju liu wrote:
> Hi I have already installed rpm as follow:
> cluster-glue-1.0.5-1.el5.x86_64.rpm
> cluster-glue-libs-1.0.5-1.el5.x86_64.rpm
> cluster-glue-libs-devel-1.0.5-1.el5.x86_64.rpm
> corosync-1.2.2-1.1.el5.x86_64.rpm
> corosynclib-1.2.2-1.1.el5.x86_64.rpm
> corosynclib-devel-1.2.2-1.1.el5.x86_64.rpm
> heartbeat-3.0.3-2.el5.x86_64.rpm
> heartbeat-devel-3.0.3-2.el5.x86_64.rpm
> heartbeat-libs-3.0.3-2.el5.x86_64.rpm
> openais-1.1.0-1.el5.x86_64.rpm
> openaislib-1.1.0-1.el5.x86_64.rpm
> openaislib-devel-1.1.0-1.el5.x86_64.rpm
> pacemaker-libs-devel-1.0.8-6.1.el5.x86_64.rpm
> pacemaker-libs-1.0.8-6.1.el5.x86_64.rpm
> resource-agents-1.0.3-2.el5.x86_64.rpm
> 
> 
> at last I install pacemaker-1.0.8-6.1.el5.x86_64.rpm

> it shows
> error: Failed dependencies:
>  libesmtp is needed by pacemaker-1.0.8-6.1.el5.x86_64
>  libesmtp.so.5()(64bit) is needed by pacemaker-1.0.8-6.1.el5.x86_64
> 
> and the ldirectord-1.0.3-2.el5.x86_64.rpm is must be installed?

> thanks a lot:-)
>
>
>   

-- 
Dan FRINCU

Systems Engineer
CCNA, RHCE
Streamwide Romania

-- next part --
An HTML attachment was scrubbed...
URL:

<http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20101015/a1ea4614/attachment-0001.htm>


 



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] ocf:heartbeat:mysql RA update

2010-10-16 Thread Dan Frincu

Hi,

I've had some issues with the ocf:heartbeat:mysql RA, mainly because 
things that normally shouldn't have happened did happen, such as 
someone deleting the database the RA was monitoring, or starting MySQL 
with a different PID file. Like I said, things that shouldn't happen, 
but did, so I created a workaround for those specific cases and added 
email notifications to the RA. I'm not sure how many people will run into 
the same issues, but I thought I'd share my work; maybe it'll be useful to 
someone.


Attached is a diff between my changes and the default RA that comes with 
resource-agents-1.0.3-2.el5.x86_64.rpm.
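
If you want to try it out, note that the diff appears to have been taken
as "diff modified original" (the new code shows up on the '<' lines), so
it should be reverse-applied to the stock RA, e.g. (the paths and the
diff file name are assumptions):

cd /usr/lib/ocf/resource.d/heartbeat
cp mysql mysql.orig
patch -R mysql < mysql-ra.diff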


Regards,

Dan

--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

76d75
< OCF_RESKEY_database_default="cluster"
99,100d97
< : ${OCF_RESKEY_database=${OCF_RESKEY_database_default}}
< 
289,356d277
< db_check() {
< 	db_up=`echo "USE $OCF_RESKEY_database;" | mysql --user=$OCF_RESKEY_test_user --password=$OCF_RESKEY_test_passwd --socket=$OCF_RESKEY_socket -O connect_timeout=1 2>&1`
< rc=$?
< if [ ! $rc -eq 0 ]; then
< 		ocf_log err "MySQL ((USE $OCF_RESKEY_database)) monitor failed:";
< 		if [ ! -z "$db_up" ]; then 
< 			ocf_log err $db_up; 
< 		fi
< 		# but maybe someone forgot to add the cluster test database?
< 		all_db=`echo "SHOW DATABASES;" | mysql --user=$OCF_RESKEY_test_user --password=$OCF_RESKEY_test_passwd --socket=$OCF_RESKEY_socket -O connect_timeout=1 2>&1`
< 		ex=$?
< 		if [ ! $ex -eq 0 ]; then
< 			ocf_log err "MySQL ((SHOW DATABASES)) monitor failed:";
< 			if [ ! -z "$all_db" ]; then
< ocf_log err $all_db;
< 			fi
< 			send_mail "DBD" "MySQL ((SHOW DATABASES)) monitor failed: $all_db";
< 			return $OCF_ERR_GENERIC;
< 		else
< 			ocf_log info "MySQL ((SHOW DATABASES)) monitor succeeded, this means that the cluster database was not configured properly, e.g. it's missing";
< 			send_mail "NCD" "MySQL ((SHOW DATABASES)) monitor succeeded, this means that the cluster database was not configured properly, e.g. it's missing: 
< $all_db";
< 			return $OCF_ERR_INSTALLED;
< 		fi
<else
< 	ocf_log info "MySQL monitor succeeded";
< 	return $OCF_SUCCESS;
< fi
< }
< 
< send_mail() {
< email=`crm configure show | awk -F'"' '/email/ {print $2}'`
< if [ -z $email ]; then
< 	email="m...@domain.tld"
< fi
< 
< case "$1" in
<   NCP) echo "$2" | $MAILCMD -s "MySQL monitor PID not configured properly on `uname -n`" "$email" ;;
<   NCT) echo "$2" | $MAILCMD -s "MySQL monitor database table not configured properly on `uname -n`" "$email"	;;
<   NCD) echo "$2" | $MAILCMD -s "MySQL monitor database not configured properly on `uname -n`" "$email" ;;
<   DBD) echo "$2" | $MAILCMD -s "MySQL monitor failed on `uname -n`" "$email"	;;
<  *)	echo ""	;;
< esac
< }
< 
< double_check() {
< 	ip_addr="`netstat -tupan | awk '/:3306/ && /mysql/ {sub(/:[[:digit:]]+/, "");count++;line[count]=$4;print line[1];exit}'`"
< 	if [ -z "$ip_addr" ]; then
< 		ocf_log debug "MySQL not running on any network socket"
< 		return $OCF_NOT_RUNNING;
< 	else
< 		return $OCF_SUCCESS;
< 	fi
< }
< 
< pid_check() {
< 	pidcheck=`pgrep -fl mysql | awk '$2 ~ /mysql/ {sub(/--pid-file=/, ""); print $7}'`
< 	if [ -z "$pidcheck" ]; then
< 		return $OCF_NOT_RUNNING;
< 	else 
< 		if [ "$pidcheck" != "$OCF_RESKEY_pid" ]; then
< 			ocf_log err "MySQL is running with --pid-file=$pidcheck which is different from the PID file used to call this RA ($OCF_RESKEY_pid)";
< 			send_mail "NCP" "MySQL is running with --pid-file=$pidcheck which is different from the PID file used to call this RA ($OCF_RESKEY_pid)";
< 			return $OCF_ERR_INSTALLED;
< 		fi
< 	fi
< }
< 
359,366c280,281
< 		ocf_log debug "MySQL PID not found, MySQL could be down, or it's PID could be defined somewhere else, begin double-check"
< 		double_check
< 		rc=$?
< 		if [ $OCF_CHECK_LEVEL = 0 -o $rc != 0 ]; then
< 			return $rc
< 		else
< 			pid_check
< 		fi
---
> 		ocf_log debug "MySQL is not running"
> 		return $OCF_NOT_RUNNING;
377,397c292
< 	# Do a detailed status check
< 	buf=`echo "SELECT * FROM $OCF_RESKEY_test_table" | mysql --user=$OCF_RESKEY_test_user --password=$OCF_RESKEY_test_passwd --socket=$OCF_RESKEY_socket -O connect_timeout=1 2>&1`
< 	rcx=$?
< 	if [ ! $rcx -e

Re: [Pacemaker] is recovery from link failure now automatic?

2010-10-17 Thread Dan Frincu

Hi,

You can take a look at the following and see if it helps

http://lists.linux-ha.org/pipermail/linux-ha/2010-October/041473.html

http://oss.clusterlabs.org/pipermail/pacemaker/2010-October/007904.html

Regards,

Dan

Juha Heinanen wrote:

i have been away for more than a year and would like to check if
auto-recovery from link failure, the issue discussed on this thread

http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg00737.html

has been fixed in the meantime, or do i still need to keep on using
heartbeat if i don't want to lose the self-healing capability?

-- juha

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Question: How many nodes can join a cluster?

2010-10-18 Thread Dan Frincu

Pavlos Parissis wrote:



On 18 October 2010 10:52, Florian Haas <mailto:florian.h...@linbit.com>> wrote:


- Original Message -
> From: "Andreas Vogelsang" mailto:a.vogels...@uni-muenster.de>>
> To: pacemaker@oss.clusterlabs.org
<mailto:pacemaker@oss.clusterlabs.org>
> Sent: Monday, October 18, 2010 9:46:12 AM
> Subject: [Pacemaker] Question: How many nodes can join a cluster?
> Hello,
>
>
>
> I’m creating a presentation about a virtual Linux-HA cluster. I just
> asked myself how many nodes Pacemaker can handle. Mr. Schwartzkopff
> wrote in his book that Linux-HA version 2 can handle up to 16 nodes.
> Is this also true for Pacemaker?


I have been asked the same question and I said to them: let's say it 
is 126, what is the use of having 126 nodes in the cluster? Can someone 
imagine himself going through the logs to find out why resource-XXX 
failed while there are 200 resources?!!


The only use for having 126 nodes is if you want HPC, but HPC is a 
totally different story than high-availability clusters.

Even in an N+N setup I wouldn't go with more than 4 or 6 nodes.


My 2 cents,
Pavlos


Actually, the syslog_facility in corosync.conf allows you to specify 
either a log file on each node in the cluster (locally) or a remote 
syslog server. Either way, identifying the node by hostname or 
some other identifier should point out what is going on where. Granted, 
it's a large amount of data to process, therefore (as is the case with 
any large deployment) SNMP is a much better alternative for tracking 
issues, or (if you have _126_ instances of the same resource) adding some 
notification options to the RA might be a choice, such as an SNMP trap or 
even email.


BTW, I'm also interested in this, I remember reading something about 64 
nodes, but I'd appreciate an official response.


Regards,

Dan



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Move DRBD master

2010-10-19 Thread Dan Frincu

Vadym Chepkov wrote:

Hi,

What is the crm shell command to move drbd master to a different node?
  

# crm resource help migrate

Migrate a resource to a different node. If node is left out, the
resource is migrated by creating a constraint which prevents it from
running on the current node. Additionally, you may specify a
lifetime for the constraint---once it expires, the location
constraint will no longer be active.

Usage:
...
   migrate <rsc> [<node>] [<lifetime>]

crm resource migrate ms_drbd_storage

WARNING: Creating rsc_location constraint 'cli-standby-ms_drbd_storage' 
with a score of -INFINITY for resource ms_drbd_storage on cluster1.
   This will prevent ms_drbd_storage from running on cluster1 until 
the constraint is removed using the 'crm_resource -U' command or 
manually with cibadmin
   This will be the case even if cluster1 is the last node in the 
cluster

   This message can be disabled with -Q


This also reminded me of something I was wondering: is there a way to
demote one instance of a multi-master ms resource away from a particular
node (forcibly switch it to the slave state on that node)? I didn't find
the answer to that either. Is it possible with the crm shell?

Same as above, just specify the node in the command.
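
For example (the node name is just a placeholder):

crm resource migrate ms_drbd_storage node2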

Regards,

Dan

Thank you,
Vadym

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Move DRBD master

2010-10-19 Thread Dan Frincu

Vadym Chepkov wrote:

On Oct 19, 2010, at 3:13 AM, Dan Frincu wrote:

  

Vadym Chepkov wrote:


Hi,

What is the crm shell command to move drbd master to a different node?
 
  

# crm resource help migrate

Migrate a resource to a different node. If node is left out, the
resource is migrated by creating a constraint which prevents it from
running on the current node. Additionally, you may specify a
lifetime for the constraint---once it expires, the location
constraint will no longer be active.

Usage:
...
  migrate <rsc> [<node>] [<lifetime>]

crm resource migrate ms_drbd_storage

WARNING: Creating rsc_location constraint 'cli-standby-ms_drbd_storage' with a 
score of -INFINITY for resource ms_drbd_storage on cluster1.
  This will prevent ms_drbd_storage from running on cluster1 until the 
constraint is removed using the 'crm_resource -U' command or manually with 
cibadmin
  This will be the case even if cluster1 is the last node in the cluster
  This message can be disabled with -Q


This also reminded me of something I was wondering: is there a way to
demote one instance of a multi-master ms resource away from a particular
node (forcibly switch it to the slave state on that node)? I didn't find
the answer to that either. Is it possible with the crm shell?

Same as above, just specify the node in the command.




Have you actually tried it? This command doesn't work for this case.

  
To be honest, no, and I actually don't think it's viable, since in a 
multi-master architecture you have a one-to-one relationship on the DRBD 
side. There's a thread related to this over on the DRBD mailing list, 
with no reply yet; you might want to check it out over the next couple of 
days for progress.


Regards,

Dan



Thank you,
Vadym

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
 
  

--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Move DRBD master

2010-10-19 Thread Dan Frincu



Vadym Chepkov wrote:


On Oct 19, 2010, at 3:42 AM, Pavlos Parissis wrote:




On 19 October 2010 01:18, Vadym Chepkov <mailto:vchep...@gmail.com>> wrote:


Hi,

What is the crm shell command to move drbd master to a different
node?


take a look at this
http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg06300.html
___


Wow, not the friendliest command, I would say. Maybe the "move" command 
can be enhanced to provide something similar?


Thanks,
Vadym
The crm resource move/migrate command provides/creates the location 
constraint from within the crm shell.

In the example I gave:

crm resource migrate ms_drbd_storage

WARNING: Creating rsc_location constraint 'cli-standby-ms_drbd_storage' 
with a score of -INFINITY for resource ms_drbd_storage on cluster1.
  This will prevent ms_drbd_storage from running on cluster1 until 
the constraint is removed using the 'crm_resource -U' command or 
manually with cibadmin
  This will be the case even if cluster1 is the last node in the cluster
  This message can be disabled with -Q

The warning message appears in the console after executing the command 
in the crm shell; running crm configure show reveals the following:

location cli-standby-ms_drbd_storage ms_drbd_storage \
   rule $id="cli-standby-rule-ms_drbd_storage" -inf: #uname eq cluster1

The example at the URL above has been done manually to (probably) 
provide an advisory placement constraint; the one set by crm resource 
move is a mandatory placement constraint.
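
For comparison, an advisory version of such a constraint would use a
finite score instead of -inf, along these lines (a sketch; the id and
score are made up):

location prefer-cluster2 ms_drbd_storage \
   rule $id="prefer-cluster2-rule" 100: #uname eq cluster2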


Regards,

Dan




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] ocf:heartbeat:mysql RA update

2010-10-21 Thread Dan Frincu

Hi Florian,

Apologies in advance, this will be a long email. The modifications 
I've made cover cases that aren't really encountered in a regular setup; 
that's why I said I'm posting my work, _maybe_ it benefits somebody.

The mysql RA checks if mysql is up by looking at the pid: if [ ! -e 
$OCF_RESKEY_pid ]; then => not running. I've had to go through a case where:

- mysql is installed (at the end of the installation process it starts)
- the hostname of the server is changed without shutting down mysql 
first (thus the pid file is still there, but named after the old hostname)
- trying to start mysql after the hostname change, without checking for 
running instances => the server quit without updating the PID file.


For this case I've added the double_check() function which verifies if 
there is a network socket. If ! -e $OCF_RESKEY_pid && no network socket, 
mysql is down => $OCF_NOT_RUNNING.

If ! -e $OCF_RESKEY_pid && network socket exists => pid_check().

pid_check() finds if running pid != $OCF_RESKEY_pid and returns 
$OCF_ERR_INSTALLED + email alert.


If pid is ok and mysql is running, begin a detailed database check. 
First query contents of a table, if $? -eq 0 => $OCF_SUCCESS, else run 
db_check().


db_check() queries the defined database (cluster, in this example), if 
it works => $OCF_SUCCESS, else check "show databases". If $? -eq 0, 
"cluster" database not installed, return $OCF_ERR_INSTALLED + email, 
else $OCF_ERR_GENERIC + email => no rights for the current user to do a 
"show database" or no database available.


If db_check() returns 0, it means that the table query is not done 
properly, but the database exists => $OCF_ERR_INSTALLED + email.


Again, I stress this isn't a normal setup, meaning on a standard setup 
you wouldn't require anything else than the mysql_monitor() query (ok, 
maybe the double_check and pid_check functions would be required, for 
any _paranoid_ setups). In my setup, these were required, especially the 
email sending part and having different checks for the mysql database, 
table, etc.


Attached are only the changes, no diffs this time; basically only the 
mysql_status() function changed, the rest has been added, so I think 
it can be read better this way.


Regards,

Dan

Florian Haas wrote:

Hi Dan,

Thanks for the contribution -- but unfortunately that patch is pretty
much impossible to review as it is. Can you please break this down into
logical chunks, use "diff -u" (or "hg diff") format, and most
importantly explain _why_ you made the changes you made?

Cheers,
Florian

  



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

OCF_RESKEY_database_default="cluster"

: ${OCF_RESKEY_database=${OCF_RESKEY_database_default}}

db_check() {
    db_up=`echo "USE $OCF_RESKEY_database;" | mysql --user=$OCF_RESKEY_test_user --password=$OCF_RESKEY_test_passwd --socket=$OCF_RESKEY_socket -O connect_timeout=1 2>&1`
    rc=$?
    if [ ! $rc -eq 0 ]; then
        ocf_log err "MySQL ((USE $OCF_RESKEY_database)) monitor failed:";
        if [ ! -z "$db_up" ]; then
            ocf_log err $db_up;
        fi
        # but maybe someone forgot to add the cluster database?
        all_db=`echo "SHOW DATABASES;" | mysql --user=$OCF_RESKEY_test_user --password=$OCF_RESKEY_test_passwd --socket=$OCF_RESKEY_socket -O connect_timeout=1 2>&1`
        ex=$?
        if [ ! $ex -eq 0 ]; then
            ocf_log err "MySQL ((SHOW DATABASES)) monitor failed:";
            if [ ! -z "$all_db" ]; then
                ocf_log err $all_db;
            fi
            send_mail "DBD" "MySQL ((SHOW DATABASES)) monitor failed: $all_db";
            return $OCF_ERR_GENERIC;
        else
            ocf_log info "MySQL ((SHOW DATABASES)) monitor succeeded, this means that the cluster database was not configured";
            send_mail "NCD" "MySQL ((SHOW DATABASES)) monitor succeeded, this means that the cluster database was not configured:
$all_db";
            return $OCF_ERR_INSTALLED;
        fi
    else
        ocf_log info "MySQL monitor succeeded";
        return $OCF_SUCCESS;
    fi
}

send_mail() {
    email=`crm configure show | awk -F'"' '/email/ {print $2}'`
    if [ -z $email ]; then
        email="em...@domain.tld"
    fi

    case "$1" in
      NCP) echo "$2" | $MAILCM

[Pacemaker] Add a color scheme to the editor used in crm shell

2010-10-25 Thread Dan Frincu

Hi,

As a person who spends quite a lot of time in the crm shell, I have seen 
that there is a colorscheme option that can be applied when issuing crm 
configure show. I'm interested in whether there's a way to have a colorscheme 
within the editor that is used by crm. It's usually vim; I've noticed 
that I could add my .vimrc, and when using crm configure edit the 
shortcuts and everything else worked, except for the colorscheme.


Does anyone know how to add a colorscheme to the crm editor (vim for 
example) that can also do syntax highlighting while inside the crm 
configure edit?


Regards,

Dan

--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Multiple independent two-node clusters side-by-side?

2010-10-27 Thread Dan Frincu

Hi,

Andreas Ntaflos wrote:
Hi, 

first time poster, short time Pacemaker user. I don't think this is a 
very difficult question to answer but I seem to be feeding Google the 
wrong search terms. I am using Pacemaker 1.0.8 and Corosync 1.2.0 on 
Ubuntu 10.04.1 Server.


Short version: How do I configure multiple independent two-node clusters 
where the nodes are all on the same subnet? Only the two nodes that form 
the cluster should see that cluster's resources and not any other. 


Is this possible? Where should I look for more and detailed information?
  
You need to specify different multicast sockets for this to work. In 
/etc/corosync/corosync.conf you have the interface statements. Even 
if all servers are in the same subnet, you can "split them apart" by 
defining unique multicast sockets.
An example should be useful. Let's say that you have only one interface 
statement in the corosync file.

   interface {
   ringnumber: 0
   bindnetaddr: 192.168.1.0
   mcastaddr: 239.192.168.1
   mcastport: 5405
   }
The multicast socket in this case is 239.192.168.1:5405. All nodes that 
should be in the same cluster should use the same multicast socket. In 
your case, the first two nodes should use the same multicast socket. How 
about the other two nodes? Use another unique multicast socket.

   interface {
   ringnumber: 0
   bindnetaddr: 192.168.1.0
   mcastaddr: 239.192.168.112
   mcastport: 5405
   }
Now the multicast socket is 239.192.168.112:5405. It's unique, the 
network address is the same, but you add this config (edit according to 
your environment, this is just an example) to your other two nodes. So 
you have cluster1 formed out of node1 and node2 linked to 
239.192.168.1:5405 and cluster2 formed out of node3 and node4 linked to 
239.192.168.112:5405.


This way, the clusters don't _see_ each other, so you can reuse the 
resource IDs and see only two nodes per cluster.
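
A quick way to verify is a one-shot crm_mon on any node; it should list
only that cluster's two members:

crm_mon -1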


Regards,

Dan



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

