Re: [Pacemaker] chicken-egg-problem with libvirtd and a VM within cluster

2012-10-12 Thread Florian Haas
On Fri, Oct 12, 2012 at 3:18 AM, Andrew Beekhof and...@beekhof.net wrote:
 This has been a topic that has popped up occasionally over the years.
 Unfortunately we still don't have a good answer for you.

 The least worst practice has been to have the RA return OCF_STOPPED
 for non-recurring monitor operations (aka. startup probes) IFF its
 pre-requisites (i.e. binaries, or things that might be on a cluster
 file system) are not available.

 Possibly we need to begin using the ordering constraints (normally
 used for ordering start operations) for the startup probes too.
 Ie. order(A, B) == A.start before B.(monitor_0, start)

 I had been resisting that move, but perhaps it's time.

 (It would also help avoid slamming the cluster with a bazillion
 operations in parallel when several nodes start up together)

 Lars? Florian? Comments?

Sure. As Tom correctly observes, the problem (as I know it) occurs
when manually stopping Pacemaker services and then restarting them. As
it shuts down, Pacemaker kills libvirtd (after migrating off or
stopping all VMs), and then as you bring it back up, the probe runs
into an error. The same, btw, applies if you only send the node into
standby mode.

For manual intervention, the workaround is simply this (sketched as
commands below):

- Stop Pacemaker services, or put node in standby (libvirtd stops in
the process as the local clone instance shuts down).
- Do whatever you need to do on that box.
- Start libvirtd.
- Start Pacemaker services, or take node online.
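
A rough command-level sketch of that sequence, assuming the crm shell
and an LSB-style libvirtd init script (adjust names to your
distribution):

crm node standby node1          # libvirtd clone instance stops with the rest
# ... do whatever maintenance is needed on the box ...
/etc/init.d/libvirtd start      # bring libvirtd back first, so probes have something to talk to
crm node online node1           # rejoin; Pacemaker probes and starts resources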

For most people, this issue doesn't occur on system boot, as libvirtd
would normally start before corosync, or corosync/pacemaker isn't part
of the system bootup sequence at all (the latter is preferred for
two-node clusters to prevent fencing shootouts in case of cluster
split brain).

On that ha-kvm.pdf guide, I will add that I'm guessing this is not the
only piece of information missing or outdated in it. However, I have
no rights to that document other than to be named as an original
author and to use it under CC-NC-ND terms like anyone else, and I have
no access to the sources anymore, so there's no way for me to update
it. Maybe the Linbit folks are willing/able to do that.

Back on the probe issue, we're in a bit of a catch-22, as libvirtd can
be freely restarted and stopped while leaving domains (VMs) running.
So the assumption "if libvirtd doesn't run, then the domain can't be
running" simply doesn't hold up. In fact, it's outright dangerous, as
a domain may well run _and have read/write access to shared resources_
while libvirtd isn't running. So doing the naive thing and bailing out
of monitor if we can't detect a libvirtd pid -- that doesn't fly.

What would fly is to check for libvirtd on _every_ invocation of the
RA (well, maybe all except validate and usage), and to restart it on
the sole condition that we can't detect its pid. That, however, breaks
the contract that a probe should be non-invasive and really shouldn't
be touching any system services. Also, a running libvirtd is not
needed, to the best of my knowledge, when the hypervisor in use is Xen
rather than KVM. We could mitigate that by making it configurable, but
the only sane default would be to have this enabled, which again
breaks said contract.
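
Just to illustrate, a minimal sketch of what such a check might look
like inside the RA -- the helper name is made up, and whether a probe
may legitimately restart libvirtd is exactly the contract question
raised above:

ensure_libvirtd() {
    # (re)start libvirtd only if no running instance can be found
    if ! pidof libvirtd >/dev/null 2>&1; then
        ocf_log warn "libvirtd not running, attempting to start it"
        service libvirtd start || return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}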

When virsh is invoked with a qemu:///session URI it will actually
start up a user-specific libvirtd by itself, but as far as I know
there is no way to do that for qemu:///system which most people will
be using.

Andrew, your suggestion would fix that issue, but it would obviously
make the config more convoluted. In effect, we'd need one order and
one colo constraint more than we already do. For a silly idea, how
about thinking about being able to define a list of op types in a
constraint, rather than a single op? As in:

order libvirtd_before_virtdom inf: libvirtd:start virtdom_foo:monitor,start
colocation virtdom_on_libvirtd inf: virtdom_foo:Started,Probed libvirtd:Started

(Of course no such thing as a Probed role currently exists, so here
we go down the rabbit hole...)

I hope this is useful. Thoughts are much appreciated.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] FYI/RFC: Name for 'system service' alias

2012-06-26 Thread Florian Haas
On Mon, Jun 25, 2012 at 1:40 PM, Andrew Beekhof and...@beekhof.net wrote:
 I've added the concept of a 'system service' that expands to whatever 
 standard the local machine supports.
 So you could say, in XML, <primitive id="Magic" class="system" type="mysql"/>, 
 and the cluster would use 'lsb' on RHEL, 'upstart' on Ubuntu and 'systemd' on 
 newer fedora releases.
 Handy if you have a mixed cluster.

 My question is, what to call it?
 'system', 'service', something else?

I think Red Hat Cluster has similar functionality named "service", so
in the interest of continuity that would be my preference.

One thought though: what's supposed to happen on platforms that
support several system service interfaces, such as Ubuntu which
supports both Upstart and LSB? IOW: If I define a service as
service:foobar, and there is no upstart job named foobar, but
/etc/init.d/foobar exists, would that be an OCF_ERR_INSTALLED?
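
For illustration, the kind of definition I have in mind, in crm shell
syntax (the resource name is made up):

primitive foobar service:foobar \
        op monitor interval=30s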

 In other news, the next pacemaker release will support systemd and both it 
 and upstart will use a persistent connection to the DBus API (no more 
 forking!).

Sweet!

Cheers,
Florian



Re: [Pacemaker] How to write a master/slave resource-script

2012-06-16 Thread Florian Haas
On 06/15/12 16:37, Andrew Beekhof wrote:
 On Fri, Jun 15, 2012 at 12:19 AM, Stallmann, Andreas
 astallm...@conet.de wrote:
 Hi!



 Excuse my blindness; I found the „Stateful“ script, which is obviously the
 template / skeleton I was looking for. Unfortunately it comes without
 explanation. Does anyone know where I'd find this?

 
 This would be a good place to start:
 http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch10s03s09.html

Or this:
http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] General question about pacemaker

2012-06-10 Thread Florian Haas
On Sun, Jun 10, 2012 at 3:07 PM, Stefan Günther smguent...@web.de wrote:

 Hello,

 I have a general question about the features of pacemaker.

 We are planning to setup a HA solution with pacemaker, corosync and drbd.

 After a failure of the master and later its recovery, drbd will sync the
 data from the slave to the master.

 Is it now possible to configure pacemaker and/or corosync to perform a
 failback, AFTER drbd has finished syncing?

Yes.

 And if yes, which component is responsible for waiting for the signal
 from drbd that syncing has finished?

The ocf:linbit:drbd resource agent (the Pacemaker resource agent that
ships with DRBD) influences the resource master score, which Pacemaker
evaluates for the placement of the DRBD Master role among cluster
nodes. You can combine this with a location constraint that sets a
preference for one of your nodes as the DRBD Master (Primary). If you
set your location constraint score correctly, you would get the
behavior you want.
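
As a sketch in crm shell syntax -- the master/slave set and node names
here are made up, adjust the score to taste:

location l_drbd_master_pref ms_drbd_r0 \
        rule $role="Master" 100: #uname eq alice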

However, why do you want automatic failback? If your cluster nodes are
interchangeable in terms of performance, you shouldn't need to care
which node is the master. In other words the concept of having a
preferred master is normally moot in well-designed clusters.

Hope this is useful.

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] Problem with state: UNCLEAN (OFFLINE)

2012-06-08 Thread Florian Haas
On Fri, Jun 8, 2012 at 1:01 PM, Juan M. Sierra jmsie...@cica.es wrote:
 Problem with state: UNCLEAN (OFFLINE)

 Hello,

 I'm trying to set up an ldirectord service with pacemaker.

 But, I found a problem with the unclean (offline) state. The initial state
 of my cluster was this:

 Online: [ node2 node1 ]

 node1-STONITH    (stonith:external/ipmi):    Started node2
 node2-STONITH    (stonith:external/ipmi):    Started node1
  Clone Set: Connected
  Started: [ node2 node1 ]
  Clone Set: ldirector-activo-activo
  Started: [ node2 node1 ]
 ftp-vip (ocf::heartbeat:IPaddr):    Started node1
 web-vip (ocf::heartbeat:IPaddr):    Started node2

 Migration summary:
 * Node node1:  pingd=2000
 * Node node2:  pingd=2000
    node2-STONITH: migration-threshold=100 fail-count=100

 and then, I removed the electric connection of node1, the state was the
 next:

 Node node1 (8b2aede9-61bb-4a5a-aef6-25fbdefdddfd): UNCLEAN (offline)
 Online: [ node2 ]

 node1-STONITH    (stonith:external/ipmi):    Started node2 FAILED
  Clone Set: Connected
  Started: [ node2 ]
  Stopped: [ ping:1 ]
  Clone Set: ldirector-activo-activo
  Started: [ node2 ]
  Stopped: [ ldirectord:1 ]
 web-vip (ocf::heartbeat:IPaddr):    Started node2

 Migration summary:
 * Node node2:  pingd=2000
    node2-STONITH: migration-threshold=100 fail-count=100
    node1-STONITH: migration-threshold=100 fail-count=100

 Failed actions:
     node2-STONITH_start_0 (node=node2, call=22, rc=2, status=complete):
 invalid parameter
     node1-STONITH_monitor_6 (node=node2, call=11, rc=14,
 status=complete): status: unknown
     node1-STONITH_start_0 (node=node2, call=34, rc=1, status=complete):
 unknown error

 I was hoping that node2 would take over the ftp-vip resource, but it
 didn't happen that way. node1 remained in an unclean state and node2 didn't
 take over its resources. When I put back the electric connection of
 node1 and it had recovered, node2 then took over the ftp-vip
 resource.

 I've seen some similar conversations here. Please, could you show me some
 idea about this subject or some thread where this is discussed?

Well, your healthy node failed to fence the offending node. So fix
your STONITH device configuration, and as soon as it is able to
fence, your failover should work fine.

Of course, if your IPMI BMC fails immediately after you remove power
from the machine (i.e. it has no backup battery that would let it at
least report the power status), then you might have to fix your issue
by switching to a different STONITH device altogether.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] KVM DRBD and Pacemaker

2012-06-05 Thread Florian Haas
On Tue, Jun 5, 2012 at 1:55 AM, Cliff Massey cliffm...@cliffmassey.com wrote:
 My config is:

 http://pastebin.com/5qYiHe56

Yep, you completely forgot your order and colo constraints. You need
those to tie your foo-kvm primitive to its corresponding ms-foo
master/slave set.

http://www.drbd.org/users-guide-8.3/s-pacemaker-crm-drbd-backed-service.html

Take a look at where it says order and colocation.
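
Roughly along these lines, sticking with the placeholder names above
(adjust to your actual resource IDs):

colocation c_kvm_on_drbd inf: foo-kvm ms-foo:Master
order o_drbd_before_kvm inf: ms-foo:promote foo-kvm:start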

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] Announce: pcs / pcs-gui (Pacemaker/Corosync Configuration System)

2012-06-05 Thread Florian Haas
On Mon, Jun 4, 2012 at 3:21 AM, Andrew Beekhof and...@beekhof.net wrote:
 On Sat, Jun 2, 2012 at 12:56 AM, Florian Haas flor...@hastexo.com wrote:
 On Fri, Jun 1, 2012 at 1:40 AM, Chris Feist cfe...@redhat.com wrote:
 I'd like to announce the existence of the Pacemaker/Corosync configuration
 system, PCS.

 Be warned, I will surely catch flak for what I'm about to say.

Andrew, thanks for confirming. :)

 The emphasis in PCS differs somewhat from the existing shell:

 Before you get into where it differs in emphasis, can you explain why
 we need another shell?

 Uh, because the world isn't black & white and people find different
 things important?
 Like, perhaps, some of the things Chris listed.

I don't disagree with the importance of some of those, but none of
them look like a compelling reason to write a new one from scratch.

 PCS will continue the tradition of having a regression test suite and
 discoverable 'ip'-like hierarchical menu structure, however unlike the
 shell we may end up not adding interactivity.

 Strangely enough, if I were to name one feature as the most useful in
 the existing shell, it's its interactivity.

 Personally I disagree.
 Mostly what I see people using is tab completion, which is not
 interactivity and even if considered crucial, doesn't need to be baked
 into the tool itself.

That is true, but having done a bash completion thingy myself before,
I can tell you it's quite a bit of effort. Unless, that is, the tool
has a generic hook that completion systems can tie into, like what
Mercurial does (iirc).

Note that something taking a lot of effort doesn't disqualify it, but
expending a lot of effort just to match functionality that something
else already has -- that's questionable.

The crm shell is actually not just about simple tab completion, it's
about tab completion with the added benefits of providing
documentation interactively, and to the best of my knowledge that's
something you can't do in bash completion. Other completion systems I
don't know.

 How do you envision people configuring, say, an IPaddr2 resource when
 they don't remember the parameter names, or whether a specific
 parameter is optional or required? Or even the resource agent name?

 Now you're just being silly.

Oh, am I?

 Are you seriously claiming interactivity is the only way to discover
 information about a program?

Yeah, we all know how attentively people read man pages.

 Quick, someone tell the iproute developers that no-one can add an IP
 address because 'ip help' and 'ip addr help' aren't interactive!

Remind me how _that_ comment isn't silly?

 Both projects are far from complete, but so far PCS can:
 - Create corosync/pacemaker clusters from scratch
 - Add simple resources and add constraints

 If I were a new user, I'd probably be unable to create even a simple
 resource with this, for the reason given above. But I will concede
 that at its current state it's probably unfair to expect that new
 users are able to use this. (The existing shell is actually usable for
 newcomers, even though it's not perfect. Why to we need a new shell
 again?)

 To see how many straw men you could construct.

See below on that comment.

 - Create/Remove resource groups

 Why is it resource create, but resource group add?

 I /think/ its because you're adding a resource to an existing group.

Well, you add and create one in one fell swoop (which is OK -- it makes
no sense to have an empty group), but it might still be a good idea in
terms of POLA to add "create", even if all it does is check that the
group doesn't already exist, and then hand off to "add".

 - Set most pacemaker configuration options

 How do you enumerate which ones are available?

 Valid question

You'll hate me again for saying this, but by having this discussion
we're already smack in the middle of duplicating effort. For something
that's solved in an existing tool.

 - Start/Stop pacemaker/corosync
 - Get basic cluster status

 I'm currently working on getting PCS fully functional with Fedora 17 (and it
 should work with other distributions based on corosync 2.0, pacemaker 1.1
 and systemd).

 I'm hoping to have a fairly complete version of PCS for the Fedora 17
 release (or very shortly thereafter) and a functioning version of pcs-gui
 (which includes the ability to remotely start/stop nodes and set corosync
 config) by the Fedora 18 release.

 The code for both projects is currently hosted on github
 (https://github.com/feist/pcs  https://github.com/feist/pcs-gui)

 You can view a sample pcs session to get a preliminary view of how pcs will
 work  - https://gist.github.com/2697640

 Any reason why the gist doesn't use pcs cluster sync, which as per
 pcs cluster --help would sync the Corosync config across nodes?

 Comments and contributions are welcome.

 I'm sorry, and I really don't mean this personally, but I just don't
 get the point.

 Plenty of people didn't see the point of Pacemaker either.
 And I don't recall anyone saying they hate

Re: [Pacemaker] Announce: pcs / pcs-gui (Pacemaker/Corosync Configuration System)

2012-06-05 Thread Florian Haas
On Mon, Jun 4, 2012 at 1:02 PM, Lars Marowsky-Bree l...@suse.com wrote:
 I am getting a slightly defensive-to-aggressive vibe from your response
 to Florian. Can we tune that down? I much prefer to do the shouting at
 each other in person, because then the gestures come across much more
 vividly and the food is better. Thank you ;-)

In that case I suggest you come to Canberra for next year's
linux.conf.au, where the opportunity is likely to present itself. :)

 Open source has a long and glorious history of people saying I'm
 going to try and do it this way and Chris has every right to try
 something different.
 Personally I'm hoping a little friendly competition will result in
 both projects finding new ways to improve usability.

 Of course. Still, people will ask which one should I choose, and we
 need to be able to answer that.

And if the answer to that were "whatever your distro recommends", and
everyone upstream would hence be leaving that decision to product
managers or distro subsystem maintainers, then people should know
about that too. I will add that I'd find that undesirable; we've been
down that road before.

 And as a community, yes, I think we also should think about the cost of
 choice to users - as well as the benefits.

 Even developers will ask questions like I want to do X; where do I
 contribute that?

 I like things that make it easier for users to use our stuff, and still
 I need to understand how to advise them what to do when, and how the
 various toys in the playground relate ;-)

I'd like to add that documentation in the vein of "First do A. Then if
you're on X do B, on Y do C and on Z do E. Then do F, unless you're on
X, in which case you skip straight to G" just doesn't work.

Cheers,
Florian



Re: [Pacemaker] Announce: pcs / pcs-gui (Pacemaker/Corosync Configuration System)

2012-06-05 Thread Florian Haas
On Tue, Jun 5, 2012 at 1:43 AM, Andrew Beekhof and...@beekhof.net wrote:
 On Mon, Jun 4, 2012 at 9:02 PM, Lars Marowsky-Bree l...@suse.com wrote:
 On 2012-06-04T11:21:57, Andrew Beekhof and...@beekhof.net wrote:

 Hi Andrew,

 I am getting a slightly defensive-to-aggressive vibe from your response
 to Florian. Can we tune that down? I much prefer to do the shouting at
 each other in person, because then the gestures come across much more
 vividly and the food is better. Thank you ;-)

 Now you're just being silly.
 Are you seriously claiming interactivity is the only way to discover
 information about a program?
 Quick, someone tell the iproute developers that no-one can add an IP
 address because 'ip help' and 'ip addr help' aren't interactive!

 I think the interactive tab completion is indeed cool. No, of course
 it's not the only way, but it does make things easier. You are of course
 right it doesn't need to be baked in; one can also dump the syntax tree
 and have bash/zsh/emacs do the completion. That does make dynamic
 completion a bit less efficient, though.

 True.  But less efficient is a LONG way from sensationalist words
 like impossible.
 It's this kind of tired hyperbole that tends to generate a
 defensive-to-aggressive vibe on my part.

Who said impossible? Looks to me like you're the first person in this
thread to use that term.

 Plenty of people didn't see the point of Pacemaker either.
 And I don't recall anyone saying they hate the existing [resource
 manager] and this effort solves all their problems about the first
 few years Pacemaker development.

 I don't quite see this is a valid comparison, sorry. The crm was
 developed because the existing resource manager that heartbeat
 implemented was way too limited; the CRM was something radically
 different. That was a huge effort that couldn't possibly have been
 implemented in an incremental fashion.

 My point would be that despite the above, there /still/ wasn't the
 level of public outcry that Florian apparently deems necessary for new
 work.

Nonsense.

 And if Pacemaker couldn't generate it, it makes an unfair criteria to
 require of pcs.


 (When we're talking about Pacemaker (versus the crm), it is obvious that
 that wasn't really a technology-driven move.)

 With the implication being that technology-driven moves are bad?

Who made that implication?

 How do you explain HAWK then? Shouldn't Tim have written a patch to
 py-gui instead?

I think a UI that runs in a browser, as opposed to requiring a
graphics library and rendering engine that is only ubiquitous on Linux
and practically non-existent on other platforms, is a significant
usability improvement. Of course, Tim could also have written a
server-side library that translates GTK2 into HTML5 and would allow
the pygui to run on a server unmodified, but that's a bit much to ask.

 Open source has a long and glorious history of people saying I'm
 going to try and do it this way and Chris has every right to try
 something different.
 Personally I'm hoping a little friendly competition will result in
 both projects finding new ways to improve usability.

 Of course. Still, people will ask which one should I choose, and we
 need to be able to answer that.

 The same way the Linux community has answers for:
 - sh/bash/tsch/zsh/dash...
 - gnome/kde/enlightnment/twm/fvwm...
 - fedora/opensuse/debian/ubuntu/leaf...
 - mysql/postgres/oracle/sybase
 - ext2,3,4/reiserfs/btrfs...
 - GFS2/OCFS2
 - dm_replicator/drbd
 - selinux/apparmor
 - iscsi clients
 - chat/irc/email clients
 - programming languages
 - editors
 - pacemaker GUIs

 Linux is hardly a bastion of there can be only one, so I find the
 level of doom people are expressing over a new cli to be disingenuous.

Who expressed doom?

 Every argument made so far applies equally to HAWK and the Linbit GUI,
 yet there was no outcry when they were announced.

This is likely to be an irrelevant tangent, but the pygui (afaik) had
two problems: it only ran on Linux (for all practical purposes), and
it was unmaintained (for all practical purposes). Neither of the two
are true for the shell.

 It seems duplication is only bad to those that aren't responsible for it.

 And as a community, yes, I think we also should think about the cost of
 choice to users - as well as the benefits.

 Even developers will ask questions like I want to do X; where do I
 contribute that?

 I like things that make it easier for users to use our stuff, and still
 I need to understand how to advise them what to do when, and how the
 various toys in the playground relate ;-)

 Presumably you'll continue to advise SLES customers to use whatever
 you ship there.
 Doesn't seem too complex to me.

Yep, that's what I referred to as leaving recommendations to distro
maintainers and product managers. Not desirable, but if that's the
case, then people at least have a right to know. I will add that this
probably invalidates efforts to unify documentation, and it probably

Re: [Pacemaker] [Help] Pacemaker + Oracle Listener

2012-06-05 Thread Florian Haas
On Wed, Jun 6, 2012 at 12:44 AM, Paul Damken zen.su...@gmail.com wrote:
 I'm facing issues with my cluster setup: N+1
 Pacemaker Hosting Oracle 11g Instances. Node name azteca

 I cannot get oralsnr to start my DB listener, it refuses on both nodes.
 Oracle RA is starting first, after all File systems and VIP starts.
 But no way to get Listener UP.

 When I do a manual start from /oracle/11.2.0/db_1/bin/lsnrctl start it
 works
 just fine. (Using oracle user shell prompt)

 CRM Config Oracle RA

 primitive p_oracle1 ocf:heartbeat:oracle \
         params sid=xib11 home=/oracle/11.2.0/db_1 user=oracle ipcrm=orauser \
         op start interval=0 timeout=120s \
         op stop interval=0 timeout=120s \
         op monitor interval=15s
 primitive p_oralsnr ocf:heartbeat:oralsnr \
         params sid=xib11 listener=LISTENER user=oracle home=/oracle/11.2.0/db_1 \
         op start interval=0 timeout=30s \
         op stop interval=0 timeout=30s \
         op monitor interval=15s
 group oracle_grp p_oracle1 p_oralsnr \
         meta target-role=Started
 order o_fs_before_listener inf: oracle_fs oracle_grp
 colocation ora_on_fs inf: oracle_grp oracle_fs



 ERROR LOG:
 azteca:/var/log # cat messages | grep p_oralsnr
 Jun  5 17:02:24 azteca crmd: [24262]: info: do_lrm_rsc_op: Performing
 key=20:900:7:8bf8ffb9-cc40-42c5-9dfa-cdb84ec20d97 op=p_oralsnr_monitor_0 )
 Jun  5 17:02:24 azteca lrmd: [24259]: info: rsc:p_oralsnr probe[401] (pid
 9369)
 Jun  5 17:02:24 azteca lrmd: [24259]: info: operation monitor[401] on
 p_oralsnr
 for client 24262: pid 9369 exited with return code 7
 Jun  5 17:02:24 azteca crmd: [24262]: info: process_lrm_event: LRM
 operation
 p_oralsnr_monitor_0 (call=401, rc=7, cib-update=812, confirmed=true) not
 running
 Jun  5 17:02:34 azteca crmd: [24262]: info: do_lrm_rsc_op: Performing
 key=64:900:0:8bf8ffb9-cc40-42c5-9dfa-cdb84ec20d97 op=p_oralsnr_start_0 )
 Jun  5 17:02:34 azteca lrmd: [24259]: info: rsc:p_oralsnr start[404] (pid
 11102)
 Jun  5 17:02:34 azteca lrmd: [24259]: info: operation start[404] on
 p_oralsnr
 for client 24262: pid 11102 exited with return code 1

This is just a generic error, so it could theoretically be anything,
but often this is due to an incorrect listener.ora configuration,
where the listener is attempting to bind to an IP address that doesn't
exist on the node where the listener is about to start.

Find that listener.ora file under your ORACLE_HOME (typically in
network/admin), fix it up so the listener binds to the virtual IP, and
you should hopefully be good to go.
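
Roughly speaking, the relevant entry should end up looking like this --
host and port here are just placeholders, substitute your virtual IP:

LISTENER =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.100)(PORT = 1521))
    )
  )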

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] KVM DRBD and Pacemaker

2012-06-04 Thread Florian Haas
On Mon, Jun 4, 2012 at 9:51 PM, Cliff Massey cliffm...@cliffmassey.com wrote:

  I am trying to setup a cluster consisting of KVM DRBD and pacemaker. Without 
 pacemaker DRBD and KVM are working. I can even stop everything on one node, 
 promote the other to drbd primary and start the KVM machine on the other.

 However, when trying to start the resource with pacemaker I receive the error:

  lrmd error: unable to open disk path /dev/drbd0: Wrong medium type

Pacemaker config would indeed be helpful, but this sounds like a
missing order and colocation constraint between your DRBD master/slave
set and whatever should use that DRBD device -- probably your
VirtualDomain resource.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [RFC] [Patch] DC node preferences (dc-priority)

2012-05-25 Thread Florian Haas
On Fri, May 25, 2012 at 10:45 AM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 Sorry, sent too early.

 That would not catch the case of cluster partitions joining,
 only the pacemaker startup with fully connected cluster communication
 already up.

 I thought about a dc-priority default of 100,
 and only triggering a re-election if I am DC,
 my dc-priority is  50, and I see a node joining.

Hardcoded arbitrary defaults aren't that much fun. "You can use any
number, but 100 is the magic threshold" is something I wouldn't want
to explain to people over and over again.

We actually discussed node defaults a while back. Those would be
similar to resource and op defaults which Pacemaker already has, and
set defaults for node attributes for newly joined nodes. At the time
the idea was to support putting new joiners in standby mode by
default, so when you added a node in a symmetric cluster, you wouldn't
need to be afraid that Pacemaker would shuffle resources around.[1]
This dc-priority would be another possibly useful use case for this.

Just my two cents.
Florian

[1] Yes, semi-doable with putting the cluster into maintenance mode
before firing up the new node, setting that node into standby, and
then unsetting maintenance mode. But that's just an additional step
that users can easily forget about.

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] [RFC] [Patch] DC node preferences (dc-priority)

2012-05-25 Thread Florian Haas
On Fri, May 25, 2012 at 11:38 AM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 On Fri, May 25, 2012 at 11:15:32AM +0200, Florian Haas wrote:
 On Fri, May 25, 2012 at 10:45 AM, Lars Ellenberg
 lars.ellenb...@linbit.com wrote:
  Sorry, sent too early.
 
  That would not catch the case of cluster partitions joining,
  only the pacemaker startup with fully connected cluster communication
  already up.
 
  I thought about a dc-priority default of 100,
  and only triggering a re-election if I am DC,
  my dc-priority is  50, and I see a node joining.

 Hardcoded arbitrary defaults aren't that much fun. You can use any
 number, but 100 is the magic threshold is something I wouldn't want
 to explain to people over and over again.

 Then don't ;-)

 Not helping, and irrelevant to this case.

 Besides that was an example.
 Easily possible: move the "I want to lose" vs. "I want to win"
 magic number to 0, and allow both positive and negative priorities.
 You get to decide whether positive or negative is the "I'd rather lose"
 side. Want to make that configurable as well? Right.

Nope, 0 is used as a threshold value in Pacemaker all over the place.
So allowing both positive and negative priorities and making 0 the
default sounds perfectly sane to me.

 I don't think this can be made part of the cib configuration,
 DC election takes place before cibs are resynced, so if you have
 diverging cibs, you possibly end up with a never ending election?

 Then maybe the election is stable enough,
 even after this change to the algorithm.

Andrew?

 But you'd need to add an other trigger on dc-priority in configuration
 changed, complicating this stuff for no reason.

 We actually discussed node defaults a while back. Those would be
 similar to resource and op defaults which Pacemaker already has, and
 set defaults for node attributes for newly joined nodes. At the time
 the idea was to support putting new joiners in standby mode by
 default, so when you added a node in a symmetric cluster, you wouldn't
 need to be afraid that Pacemaker would shuffle resources around.[1]
 This dc-priority would be another possibly useful use case for this.

 Not so sure about that.

 [1] Yes, semi-doable with putting the cluster into maintenance mode
 before firing up the new node, setting that node into standby, and
 then unsetting maintenance mode. But that's just an additional step
 that users can easily forget about.

 Why not simply add the node to the cib, and set it to standby,
 before it even joins for the first time?

Haha, good one.

Wait, you weren't joking?

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] DRBD LVM EXT4 NFS performance

2012-05-21 Thread Florian Haas
On Sun, May 20, 2012 at 12:05 PM, Christoph Bartoschek
po...@pontohonk.de wrote:
 Hi,

 we have a two node setup with drbd below LVM and an Ext4 filesystem that is
 shared via NFS. The system shows low performance and lots of timeouts
 resulting in unnecessary failovers from pacemaker.

 The connection between both nodes is capable of 1 GByte/s as shown by iperf.
 The network between the clients and the nodes is capable of 110 MByte/s. The
 RAID can be filled with 450 MByte/s.

No it can't (most likely); see below.

 Thus I would expect to have a write performance of about 100 MByte/s. But dd
 gives me only 20 MByte/s.

 dd if=/dev/zero of=bigfile.10G bs=8192  count=1310720
 1310720+0 records in
 1310720+0 records out
 10737418240 bytes (11 GB) copied, 498.26 s, 21.5 MB/s

If you used that same dd invocation for your local test that allegedly
produced 450 MB/s, you've probably been testing only your page cache.
Add oflag=dsync or oflag=direct (the latter will only work locally, as
NFS doesn't support O_DIRECT).
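
For example, same file and block size as your test, only the flag added:

dd if=/dev/zero of=bigfile.10G bs=8192 count=1310720 oflag=dsync   # synchronous writes, also valid over NFS
dd if=/dev/zero of=bigfile.10G bs=8192 count=1310720 oflag=direct  # direct I/O, local filesystems only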

If your RAID is one of reasonably contemporary SAS or SATA drives,
then a sustained to-disk throughput of 450 MB/s would require about
7-9 stripes in a RAID-0 or RAID-10 configuration. Is that what you've
got? Or are you writing to SSDs?

 While the slow dd runs there are timeouts on the server resulting in a
 restart of some resources. In the logfile I also see:

 [329014.592452] INFO: task nfsd:2252 blocked for more than 120 seconds.
 [329014.592820] echo 0  /proc/sys/kernel/hung_task_timeout_secs disables
 this message.
 [329014.593273] nfsd            D 0007     0  2252      2
 0x
 [329014.593278]  88060a847c40 0046 88060a847bf8
 00030001
 [329014.593284]  88060a847fd8 88060a847fd8 88060a847fd8
 00013780
 [329014.593290]  8806091416f0 8806085bc4d0 88060a847c50
 88061870c800
 [329014.593295] Call Trace:
 [329014.593303]  [8165a55f] schedule+0x3f/0x60
 [329014.593309]  [81265085] jbd2_log_wait_commit+0xb5/0x130
 [329014.593315]  [8108aec0] ? add_wait_queue+0x60/0x60
 [329014.593321]  [812111b8] ext4_sync_file+0x208/0x2d0
 [329014.593328]  [811a62dd] vfs_fsync_range+0x1d/0x40
 [329014.593339]  [a0227e51] nfsd_commit+0xb1/0xd0 [nfsd]
 [329014.593349]  [a022f28d] nfsd3_proc_commit+0x9d/0x100 [nfsd]
 [329014.593356]  [a0222a4b] nfsd_dispatch+0xeb/0x230 [nfsd]
 [329014.593373]  [a00e9d95] svc_process_common+0x345/0x690
 [sunrpc]
 [329014.593379]  [8105f990] ? try_to_wake_up+0x200/0x200
 [329014.593391]  [a00ea1e2] svc_process+0x102/0x150 [sunrpc]
 [329014.593397]  [a02221ad] nfsd+0xbd/0x160 [nfsd]
 [329014.593403]  [a02220f0] ? nfsd_startup+0xf0/0xf0 [nfsd]
 [329014.593407]  [8108a42c] kthread+0x8c/0xa0
 [329014.593412]  [81666bf4] kernel_thread_helper+0x4/0x10
 [329014.593416]  [8108a3a0] ? flush_kthread_worker+0xa0/0xa0
 [329014.593420]  [81666bf0] ? gs_change+0x13/0x13


 Has anyone an idea what could cause such problems? I have no idea for
 further analysis.

As a knee-jerk response, that might be the classic issue of NFS
filling up the page cache until it hits the vm.dirty_ratio and then
having a ton of stuff to write to disk, which the local I/O subsystem
can't cope with.
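
If that turns out to be the culprit, the usual knobs to experiment with
are the dirty-page thresholds; the values below are merely
illustrative, not a recommendation:

sysctl -w vm.dirty_background_ratio=5   # start background writeback earlier
sysctl -w vm.dirty_ratio=10             # cap how much dirty data a writer may accumulate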

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] Is synchronizing rmtab needed?

2012-05-21 Thread Florian Haas
On Mon, May 21, 2012 at 1:36 AM, Christoph Bartoschek
po...@pontohonk.de wrote:
 Hi,

 we currently have the problem that when the NFS server is under heavy load, the
 heartbeat:exportfs monitor script fails with a timeout because it cannot
 write the rmtab to the exported filesystem within the given time.

So, how about increasing the timeout?
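
Something along these lines in crm shell syntax -- resource name and
parameters are made up, the point is merely the more generous monitor
timeout:

primitive p_exportfs ocf:heartbeat:exportfs \
        params clientspec="10.0.0.0/24" directory="/srv/nfs" fsid="1" \
        op monitor interval="30s" timeout="90s"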

 My question is now. Is it necessary to synchronize rmtab? Shouldn't the
 clients just reconnect after a timeout?

Synchronizing the rmtab is meant for the clients being able to
reconnect correctly after NFS _failover_, not just a brief network
hiccup between the NFS client and server.

Hope this helps.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] question about stonith:external/libvirt

2012-05-21 Thread Florian Haas
On Sun, May 20, 2012 at 6:40 AM, Matthew O'Connor m...@ecsorl.com wrote:
 After using the tutorial on the Hastexo site for setting up stonith via
 libvirt, I believe I have it working correctly...but...some strange things
 are happening.  I have two nodes, with shared storage provided by a
 dual-primary DRBD resource and OCFS2.  Here is one of my stonith primitives:

 primitive p_fence-l2 stonith:external/libvirt \
        params hostlist=l2:l2.sandbox \
               hypervisor_uri=qemu+ssh://matt@hv01/system \
               stonith-timeout=30 pcmk_host_check=none \
        op start interval=0 timeout=15 \
        op stop interval=0 timeout=15 \
        op monitor interval=60 \
        meta target-role=Started

 This cluster has stonith-enabled=true in the cluster options, plus the
 necessary location statements in the cib.

Does it have fencing resource-and-stonith in the DRBD configuration,
and stonith_admin-fence-peer.sh as its fence-peer handler?
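
For reference, the relevant drbd.conf fragment would look roughly like
this for DRBD 8.3/8.4 (the resource name is a placeholder, and the
handler path may differ between distributions):

resource <resource> {
  disk {
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer "/usr/lib/drbd/stonith_admin-fence-peer.sh";
  }
  ...
}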

 To watch the DLM, I run dbench on the shared storage on the node I let live.
  While it's running, I creatively nuke the other node.  If I just killall
 pacemakerd on l2 for instance, the DLM seems unaffected and the fence takes
 place, rebooting the now failed node l2.  No real interruption of service
 on the surviving node, l3.  Yet, if I halt -f -n on l2, the fence still
 takes place but the surviving node's (l3's) DLM hangs and won't come back
 until I bring the failed node back online.

A hanging DLM is OK, and DLM recovery after the failed node comes back
is OK too, but of course the DLM should also recover once it's
satisfied that the offending node has been properly fenced. Any logs
from stonith-ng on l3?

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] question about stonith:external/libvirt

2012-05-21 Thread Florian Haas
On Mon, May 21, 2012 at 8:14 PM, Matthew O'Connor m...@ecsorl.com wrote:
 On 05/21/2012 05:43 AM, Florian Haas wrote:
 Does it have fencing resource-and-stonith in the DRBD configuration,
 and stonith_admin-fence-peer.sh as its fence-peer handler?
 That was the problem.  Totally forgot to update my DRBD configuration.

I actually wasn't saying that that was the root cause of your problem.
:) But it's worth looking into, anyhow.

 For sake of testing, I used the crm-fence-peer.sh script - it seemed
 to do the trick, although I strongly suspect this is the wrong script
 for the job.

It is. No good for dual-Primary, really, as it doesn't prevent split
brain in that sort of configuration.

 Do I need to write my own script to call stonith_admin?

No, stonith_admin-fence-peer.sh ships with recent DRBD releases.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] Can Corosync bind to two networks

2012-05-12 Thread Florian Haas
On Sat, May 12, 2012 at 2:49 AM, Steve Davidson
steve.david...@pearl.com wrote:
 We want to run the Corosync heartbeat on the private net and, as a backup
 heartbeat, allow Corosync heartbeat on our public net as well.

 Thus in /etc/corosync/corosync.conf we need something like:

 bindaddr_primary: 192.168.57.0
 bindaddr_secondary: 125.125.125.0

 Our thinking is: if the private net connection fails but everything else is
 okay then we don't need to disrupt services since the private net failure
 won't affect our users.

 Is there any way to do this?

Up to here the question makes sense, and Arnold already answered it.
Use a redundant ring mode, and define two rings. man corosync.conf;
look for redundant.
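
To sketch it with your example networks (corosync 1.x syntax; the
multicast addresses and ports are made up):

totem {
        version: 2
        rrp_mode: passive
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.57.0
                mcastaddr: 239.255.1.1
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 125.125.125.0
                mcastaddr: 239.255.2.1
                mcastport: 5407
        }
}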

 Otherwise we need two interfaces connected to
 separate switches just for an (HA) heartbeat.

This part doesn't make sense. Are you thinking that because you're
using redundant rings, you _don't_ need to connect each of your nodes
to two switches? Well, you do. Plugging all NICs in a redundant ring
configuration into the same physical switch makes that switch a single
point of failure. You can combine RRP with bonding, but regardless,
one switch alone won't help.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] pacemaker+ocfs2 +RAC

2012-04-05 Thread Florian Haas
On Mon, Apr 2, 2012 at 7:00 AM, Ruwan Fernando ruwanm...@gmail.com wrote:
 Hi,

 I was required to build oracle cluster so I configured pacemaker+
 corosync+drbd+ocfs2 and built Active-active cluster.

Why?

pacemaker+corosync+drbd+xfs+oracle works just fine and is fully
integrated with Pacemaker.

RAC is primarily for scaleout, not for HA. And as Lars said, RAC won't
accept any cluster manager other than its own.

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] socket is incremented after running crm shell

2012-04-03 Thread Florian Haas
On Tue, Apr 3, 2012 at 5:53 PM, David Vossel dvos...@redhat.com wrote:
 I see the same thing.  I'm using the latest pacemaker source from the master 
 branch, so this definitely still exists.  For me the file leak occurs every 
 time I issue a cibadmin --replace --xml-file command.  The shell is doing 
 the same command internally for adding and removing resources, so I see it 
 there as well.

 I opened a bug report for this.
 http://bugs.clusterlabs.org/show_bug.cgi?id=5051

What version of glib is this?

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] Pacemaker 1.1.7 now available

2012-04-02 Thread Florian Haas
On Mon, Apr 2, 2012 at 11:33 AM, Andrew Beekhof and...@beekhof.net wrote:
 On Fri, Mar 30, 2012 at 8:33 PM, Florian Haas flor...@hastexo.com wrote:
 On Fri, Mar 30, 2012 at 10:37 AM, Andrew Beekhof and...@beekhof.net wrote:
 I blogged about it, which automatically got sent to twitter, and I
 updated the IRC channel topic, but alas I forgot to mention it here
 :-)

 So in case you missed it, 1.1.7 is finally out.
 Special mention is due to David and Yan for the nifty features they've
 been writing lately.
 Thanks guys!

 Quick question: the blog post doesn't mention libqb specifically, the
 changelog says core: *Support* libqb for logging (as opposed to
 require) but the RPM spec file introduces a hard BuildRequires on
 libqb-devel.

 Is there such a thing as a soft BuildRequires?

Nope. I was repeating myself redundantly. I apologetically apologize.

 Is this a hard dependency?

 Not yet, but IPC will likely be libqb-based for 1.1.8 which will make
 it a hard requirement.

 IOW does libqb have to be
 packaged on distros where it's not currently available, or can people
 build without libqb support and still be able to use 1.1.7?

 For 1.1.7 you can build without.

Thanks.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] Corosync with puppet

2012-04-02 Thread Florian Haas
On Mon, Apr 2, 2012 at 11:34 AM, Hugo Deprez hugo.dep...@gmail.com wrote:
 Dear community,

 I am using a puppet mode in order to manage my cluster.
 I get a weird thing with the start & stop of the corosync daemon.

 When I modify the corosync.conf file, puppet is asked to restart / reload
 corosync, but it fails on the command:

 start-stop-daemon --stop --quiet --retry forever/QUIT/1 --pidfile
 /var/run/corosync.pid

 this command doesn't seem to work when corosync is running.

I would think that your local Corosync doesn't like the fact that it's
the only Corosync instance configured to run with parameters different
from the other Corosync instances in the cluster.

What I always do when I need to make changes to the Corosync config is
enable Pacemaker maintenance mode, shut down pacemakerd & corosync on
all nodes, make the change, fire corosync & pacemakerd back up, and
disable maintenance mode. I don't know how you would duplicate this in
puppet, or if that's even possible, but that would be my generally
recommended approach.
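
For reference, roughly what that looks like on the command line,
assuming the crm shell and SysV-style init scripts (adjust to your
distribution; the stop/edit/start steps run on every node):

crm configure property maintenance-mode=true
/etc/init.d/pacemaker stop && /etc/init.d/corosync stop
# edit /etc/corosync/corosync.conf
/etc/init.d/corosync start && /etc/init.d/pacemaker start
crm configure property maintenance-mode=false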

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] OCF_RESKEY_CRM_meta_{ordered,notify,interleave}

2012-04-02 Thread Florian Haas
On Mon, Apr 2, 2012 at 11:54 AM, Andrew Beekhof and...@beekhof.net wrote:
 On Fri, Mar 30, 2012 at 7:34 PM, Florian Haas flor...@hastexo.com wrote:
 On Fri, Mar 30, 2012 at 1:12 AM, Andrew Beekhof and...@beekhof.net wrote:
 Because it was felt that RAs shouldn't need to know.
 Those options change pacemaker's behaviour, not the RAs.

 But subsequently, in lf#2391, you convinced us to add notify since it
 allowed the drbd agent to error out if they were not turned on.

 Yes, and for ordered the motivation is exactly the same. Let me give a
 bit of background info.

 I'm currently working on an RA for GlusterFS volumes (the server-side
 stuff, everything client side is already covered in
 ocf:heartbeat:Filesystem). GlusterFS volumes are composed of bricks,
 and for every brick there's a separate process to be managed on each
 cluster node. When these brick processes fail, GlusterFS has no
 built-in way to recover, and that's where Pacemaker can be helpful.

 Obviously, you would run that RA as a clone, on however many nodes
 constitute your GlusterFS storage cluster.

 Now, while brick daemons can be _monitored_ individually, they can
 only be _started_ as part of the volume, with the gluster volume
 start command. And if we start a volume simultaneously on multiple
 nodes, GlusterFS just produces an error on all but one of them, and
 that error is also a generic one and not discernible from other errors
 by exit code (yes, you may rant).

 So, whenever we need to start >1 clone instance, we run into this problem:

 1. Check whether brick is already running.
 2. No, it's not. Start volume (this leaves other bricks untouched, but
 fires up the brick daemons expected to run locally).
 3. Grumble. A different node just did the same thing.
 4. All but one fail on start.

 Yes, all this isn't necessarily wonderful design (the start volume
 command could block until volume operations have completed on other
 servers, or it could error out with a try again error, or it could
 sleep randomly before retrying, or something else), but as it happens
 configuring the clone as ordered makes all of this evaporate.

 And it simply would be nice to be able to check whether clone ordering
 is enabled, during validate.

 I'd need more information.  The RA shouldn't need to care I would have
 thought. The ordering happens in the PE/crmd, the RA should just do
 what its told.

 Quite frankly, I don't quite get this segregation of meta attributes
 we expect to be relevant to the RA

 The number of which is supposed to be zero.
 I'm not sure cutting down on questions to the mailing list is a good
 enough reason for adding additional exceptions.

Well, but you did read the technical reason I presented here?

 The one truly valid exception in my mind is globally-unique, since the
 monitor operation has to work quite differently.

Why are we not supposed to check for things like notify, ordered, allow-migrate?

 My concern with providing them all to RAs is that someone will
 probably start abusing them.

_Everything_ about an RA can be abused. Why is that any concern of
yours? You can't possibly enforce, from Pacemaker, that an RA actually
does what it's supposed to do.

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] OCF_RESKEY_CRM_meta_{ordered,notify,interleave}

2012-04-02 Thread Florian Haas
On Mon, Apr 2, 2012 at 12:32 PM, Andrew Beekhof and...@beekhof.net wrote:
 Well, but you did read the technical reason I presented here?

 Yes, and it boiled down to don't let the user hang themselves.
 Which is a noble goal, I just don't like the way we're achieving it.

 Why not advertise the requirements in the metadata somehow?

The only way to do that is in the longdesc. There is nothing in the
schema that would allow us to do this in a machine-readable way so the
shell, HAWK, LCMC or anything else could warn the user by themselves.

 Why are we not supposed to check for things like notify, ordered, 
 allow-migrate?

 My concern with providing them all to RAs is that someone will
 probably start abusing them.

 _Everything_ about an RA can be abused. Why is that any concern of
 yours? You can't possibly enforce, from Pacemaker, that an RA actually
 does what it's supposed to do.

 No, but I can take away the extra ammo :)

You can count on there always being one round in the gun pointed at your foot.

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] Migration of lower resource causes dependent resources to restart

2012-03-30 Thread Florian Haas
On Thu, Mar 29, 2012 at 8:35 AM, Andrew Beekhof and...@beekhof.net wrote:
 On Thu, Mar 29, 2012 at 5:28 PM, Vladislav Bogdanov
 bub...@hoster-ok.com wrote:
 Hi Andrew, all,

 Pacemaker restarts resources when resource they depend on (ordering
 only, no colocation) is migrated.

 I mean that when I do crm resource migrate lustre, I get

 LogActions: Migrate lustre#011(Started lustre03-left - lustre04-left)
 LogActions: Restart mgs#011(Started lustre01-left)

 I only have one ordering constraint for these two resources:

 order mgs-after-lustre inf: lustre:start mgs:start

 This reminds me what have been with reload in a past (dependent resource
 restart when lower resource is reloaded).

 Shouldn't this be changed? Migration usually means that service is not
 interrupted...

 Is that strictly true?  Always?

No. Few things are always true. :) However, see below.

 My understanding was although A thinks the migration happens
 instantaneously, it is in fact more likely to be pause+migrate+resume
 and during that time anyone trying to talk to A during that time is
 going to be disappointed.

I tend to be with Vladislav on this one. The thing that most people
would expect from a live migration is that it's interruption free.
And what allow-migrate was first implemented for (iirc), live
migrations for Xen, does fulfill that expectation. Same thing is true
for live migrations in libvirt/KVM, and I think anyone would expect
essentially the same thing from checkpoint/restore migrations where
they're available.

So I guess it's reasonable to assume that if one resource migrates,
dependent resources need not be restarted. But since Pacemaker now
does restart them, you might need to figure out a way to preserve the
existing functionality for users who rely on that. Not sure if any do,
though.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] OCF_RESKEY_CRM_meta_{ordered,notify,interleave}

2012-03-30 Thread Florian Haas
On Fri, Mar 30, 2012 at 1:12 AM, Andrew Beekhof and...@beekhof.net wrote:
 Because it was felt that RAs shouldn't need to know.
 Those options change pacemaker's behaviour, not the RAs.

 But subsequently, in lf#2391, you convinced us to add notify since it
 allowed the drbd agent to error out if they were not turned on.

Yes, and for ordered the motivation is exactly the same. Let me give a
bit of background info.

I'm currently working on an RA for GlusterFS volumes (the server-side
stuff, everything client side is already covered in
ocf:heartbeat:Filesystem). GlusterFS volumes are composed of bricks,
and for every brick there's a separate process to be managed on each
cluster node. When these brick processes fail, GlusterFS has no
built-in way to recover, and that's where Pacemaker can be helpful.

Obviously, you would run that RA as a clone, on however many nodes
constitute your GlusterFS storage cluster.

Now, while brick daemons can be _monitored_ individually, they can
only be _started_ as part of the volume, with the "gluster volume
start" command. And if we start a volume simultaneously on multiple
nodes, GlusterFS just produces an error on all but one of them, and
that error is also a generic one and not discernible from other errors
by exit code (yes, you may rant).

So, whenever we need to start >1 clone instance, we run into this problem:

1. Check whether brick is already running.
2. No, it's not. Start volume (this leaves other bricks untouched, but
fires up the brick daemons expected to run locally).
3. Grumble. A different node just did the same thing.
4. All but one fail on start.

Yes, all this isn't necessarily wonderful design (the start volume
command could block until volume operations have completed on other
servers, or it could error out with a try again error, or it could
sleep randomly before retrying, or something else), but as it happens
configuring the clone as ordered makes all of this evaporate.

And it simply would be nice to be able to check whether clone ordering
is enabled, during validate.
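
To make that concrete, what I'd like to put into validate is roughly
this (a sketch only, and it assumes the meta attribute were actually
exposed to the RA):

if ! ocf_is_true "${OCF_RESKEY_CRM_meta_ordered}"; then
    ocf_log err "This RA requires an ordered clone (clone meta attribute ordered=true)."
    exit $OCF_ERR_CONFIGURED
fi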

 I'd need more information.  The RA shouldn't need to care I would have
 thought. The ordering happens in the PE/crmd, the RA should just do
 what its told.

Quite frankly, I don't quite get this segregation of meta attributes
we expect to be relevant to the RA and meta attributes the RA
shouldn't care about. Can't we just have a rule that _all_ meta
attributes, like parameters, are just always available in the RA
environment with the OCF_RESKEY_CRM_meta_ prefix?

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker 1.1.7 now available

2012-03-30 Thread Florian Haas
On Fri, Mar 30, 2012 at 10:37 AM, Andrew Beekhof and...@beekhof.net wrote:
 I blogged about it, which automatically got sent to twitter, and I
 updated the IRC channel topic, but alas I forgot to mention it here
 :-)

 So in case you missed it, 1.1.7 is finally out.
 Special mention is due to David and Yan for the nifty features they've
 been writing lately.
 Thanks guys!

Quick question: the blog post doesn't mention libqb specifically, the
changelog says core: *Support* libqb for logging (as opposed to
require) but the RPM spec file introduces a hard BuildRequires on
libqb-devel. Is this a hard dependency? IOW does libqb have to be
packaged on distros where it's not currently available, or can people
build without libqb support and still be able to use 1.1.7?

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Nodes not rejoining cluster

2012-03-30 Thread Florian Haas
On Fri, Mar 30, 2012 at 5:38 PM, Gregg Stock gr...@damagecontrolusa.com wrote:
 I took the last 200 lines of each.

Can you check the health of the Corosync membership, as per this URL?

http://www.hastexo.com/resources/hints-and-kinks/checking-corosync-cluster-membership

Do _all_ nodes agree on the health of the rings, and on the cluster member list?
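
In case that page is unreachable, the short version is to run something
like the following on every node and compare the output (corosync 1.x
syntax):

corosync-cfgtool -s
corosync-objctl runtime.totem.pg.mrp.srp.members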

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Nodes not rejoining cluster

2012-03-30 Thread Florian Haas
On Fri, Mar 30, 2012 at 6:09 PM, Gregg Stock gr...@damagecontrolusa.com wrote:
 That looks good. They were all the same and had the correct ip addresses.

So you've got both healthy rings, and all 5 nodes have 5 members in
the membership list?

Then this would make it a Pacemaker problem. IIUC the code causing
Pacemaker to discard the update from a node that is not in our
membership has actually been removed from 1.1.7[1] so an upgrade may
not be a bad idea, but you'll probably have to wait for a few more
days until packages become available.

Still, out of curiosity, and since you're saying this is a test
cluster: what happens if you shut down corosync and Pacemaker on *all*
the nodes, and bring it back up?

We've had a few people report these "not in our membership" issues on
the list before, and they seem to appear in a very sporadic and
transient fashion, so the root cause (which may well be totally
trivial) hasn't really been found out -- as far as I can tell, at
least. Hence, my question of whether the issue persists after a full
cluster shutdown.

Florian

[1] 
https://github.com/ClusterLabs/pacemaker/commit/03f6105592281901cc10550b8ad19af4beb5f72f
-- note Andrew will rightfully flame me to a crisp if I've
misinterpreted that commit, so caveat lector. :)

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] manually failing back resources when set sticky

2012-03-30 Thread Florian Haas
On Fri, Mar 30, 2012 at 8:26 PM, Brian J. Murrell br...@interlinx.bc.ca wrote:
 In my cluster configuration, each resource can be run on one of two nodes
 and I designate a primary and a secondary using location constraints
 such as:

 location FOO-primary FOO 20: bar1
 location FOO-secondary FOO 10: bar2

 And I also set a default stickiness to prevent auto-fail-back (i.e. to
 prevent flapping):

 rsc_defaults $id=rsc-options resource-stickiness=1000

 This all works as I expect.  Resources run where I expect them to while
 everything is operating normally and when a node fails the resource
 migrates to the secondary and stays there even when the primary node
 comes back.

 The question is, what is the proper administrative command(s) to move
 the resource back to its primary after I have manually determined
 that that node is OK after coming back from a failure?

 I figure I could just create a new resource constraint, wait for the
 migration and then remove it, but I just wonder if there is a more
 atomic "move back to your preferred node" command I can issue.

crm configure rsc_defaults resource-stickiness=0

... and then when resources have moved back, set it to 1000 again.
It's really that simple. :)

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Issue with ordering

2012-03-29 Thread Florian Haas
On Thu, Mar 29, 2012 at 10:07 AM, Vladislav Bogdanov
bub...@hoster-ok.com wrote:
 Hi Andrew, all,

 I'm continuing experiments with lustre on stacked drbd, and see
 following problem:

At the risk of going off topic, can you explain *why* you want to do
this? If you need a distributed, replicated filesystem with
asynchronous replication capability (the latter presumably for DR),
why not use a Distributed-Replicated GlusterFS volume with
geo-replication?

Note that I know next to nothing about your actual detailed
requirements, so GlusterFS may well be non-ideal for you and my
suggestion may thus be moot, but it would be nice if you could explain
why you're doing this.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Issue with ordering

2012-03-29 Thread Florian Haas
On Thu, Mar 29, 2012 at 11:40 AM, Vladislav Bogdanov
bub...@hoster-ok.com wrote:
 Hi Florian,

 29.03.2012 11:54, Florian Haas wrote:
 On Thu, Mar 29, 2012 at 10:07 AM, Vladislav Bogdanov
 bub...@hoster-ok.com wrote:
 Hi Andrew, all,

 I'm continuing experiments with lustre on stacked drbd, and see
 following problem:

 At the risk of going off topic, can you explain *why* you want to do
 this? If you need a distributed, replicated filesystem with
 asynchronous replication capability (the latter presumably for DR),
 why not use a Distributed-Replicated GlusterFS volume with
 geo-replication?

 I need fast POSIX fs scalable to tens of petabytes with support for
 fallocate() and friends to prevent fragmentation.

 I generally agree with Linus about FUSE and userspace filesystems in
 general, so that is not an option.

I generally agree with Linus and just about everyone else that
filesystems shouldn't require invasive core kernel patches. But I
digress. :)

 Using any API except what VFS provides via syscalls+glibc is not an
 option too because I need access to files from various scripted
 languages including shell and directly from a web server written in C.
 Having bindings for them all is a real overkill. And it all is in
 userspace again.

 So I generally have choice of CEPH, Lustre, GPFS and PVFS.

 CEPH is still very alpha, so I can't rely on it, although I keep my eye
 on it.

 GPFS is not an option because it is not free and produced by IBM (can't
 say which of these two is more important ;) )

 Can't remember why exactly PVFS is a no-go, their site is down right
 now. Probably userspace server implementation (although some examples
 like nfs server discredit idea of in-kernel servers, I still believe
 this is a way to go).

Ceph is 100% userspace server side, jftr. :) And it has no async
replication capability at this point, which you seem to be after.

 Lustre is widely deployed, predictable and stable. It fully runs in
 kernel space. Although Oracle did its best to bury Lustre development,
 it is actively developed by whamcloud and company. They have builds for
 EL6, so I'm pretty happy with this. Lustre doesn't have any replication
 built-in so I need to add it on a lower layer (no rsync, no rsync, no
 rsync ;) ). DRBD suits my needs for a simple HA.

 But I also need datacenter-level HA, that's why I evaluate stacked DRBD
 and tickets with booth.

 So, frankly speaking, I decided to go with Lustre not because it is so
 cool (it has many-many niceties), but because all others I know do not
 suit my needs at all due to various reasons.

 Hope this clarifies my point,

It does. Doesn't necessarily mean I agree, but the point you're making is fine.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] resources show as running on all nodes right after adding them

2012-03-28 Thread Florian Haas
On Wed, Mar 28, 2012 at 4:26 PM, Brian J. Murrell br...@interlinx.bc.ca wrote:
 We seem to have occasion where we find crm_resource reporting that a
 resource is running on more (usually all!) nodes when we query right
 after adding it:

 # crm_resource --resource chalkfs-OST_3 --locate
 resource chalkfs-OST_3 is running on: chalk02
 resource chalkfs-OST_3 is running on: chalk03
 resource chalkfs-OST_3 is running on: chalk04
 resource chalkfs-OST_3 is running on: chalk01

 Further checking reveals:

 # crm status
 
 Last updated: Mon Dec 19 11:30:31 2011
 Stack: openais
 Current DC: chalk01 - partition with quorum
 Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
 4 Nodes configured, 4 expected votes
 3 Resources configured.
 

 Online: [ chalk01 chalk02 chalk03 chalk04 ]

 MGS_1   (ocf::hydra:Target):    Started chalk01
 chalkfs-OST_3       (ocf::hydra:Target) Started [   chalk02 chalk03 
 chalk04 chalk01 ]
 resource chalkfs-OST_3 is running on: chalk02
 resource chalkfs-OST_3 is running on: chalk03
 resource chalkfs-OST_3 is running on: chalk04
 resource chalkfs-OST_3 is running on: chalk01

 Clearly this resource is not running on all nodes, so why is it
 being reported as such?

Probably because your resource agent reports OCF_SUCCESS on a probe
operation when it ought to be returning OCF_NOT_RUNNING. Pastebin the
source of ocf:hydra:Target and someone will be able to point you to
the exact part of the RA that's causing the problem.
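
For reference, the probe path of a well-behaved monitor action
conceptually boils down to something like this (simplified sketch; the
pidfile parameter is just a placeholder):

Target_monitor() {
    # A probe must be able to say "not running" without that being an error
    if [ ! -e "${OCF_RESKEY_pidfile}" ]; then
        return $OCF_NOT_RUNNING
    fi
    if kill -0 "$(cat "${OCF_RESKEY_pidfile}")" 2>/dev/null; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}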

Cheers,
Florian
-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] resources show as running on all nodes right after adding them

2012-03-28 Thread Florian Haas
On Wed, Mar 28, 2012 at 5:07 PM, Brian J. Murrell br...@interlinx.bc.ca wrote:
 On 12-03-28 10:39 AM, Florian Haas wrote:

 Probably because your resource agent reports OCF_SUCCESS on a probe
 operation

 To be clear, is this the status $OP in the agent?

Nope, monitor. Of course, in your implementation monitor may be just a
wrapper around status -- no way to tell without knowing any details
about the agent.

That being said, if there's really an upstream supported resource
agent as Bernd is suggesting, why not use that?

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] High Performance High Availability Guide: new community documentation project

2012-03-23 Thread Florian Haas
Hi everyone,

for those interested in contributing to a community documentation
project focusing on performance optimization in high availability
clusters, please take a look at the following URLs:

https://github.com/fghaas/hp-ha-guide (GitHub repo)
http://www.hastexo.com/node/173 (blog post -- feel free to skip the
Past and Present part; those are unimportant compared to Future)

This is a fledgling project and not complete by any stretch of the
imagination. Comments and feedback are much, much appreciated. Let's
see if we can get this done.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Resource-level fencing without stonith?

2012-03-23 Thread Florian Haas
On Fri, Mar 23, 2012 at 6:07 PM, Lajos Pajtek lajospaj...@yahoo.com wrote:


 Hi,

 I am building a two-node, active-standby cluster with shared storage. I think 
 I got the basic primitives right, but fencing, implemented using SCSI 
 persistent reservations, gives me some headache. First, I am unable to get 
 stonith:fence_scsi to work on RH/CentOS 6. (Using the sg_persist utility I am 
 able to register keys, etc so that's not the problem.)

Any specific reason for not using IPMI? That's practically ubiquitous,
and pretty much always works.

 This made me think about the fact that conceptually SCSI fencing should be 
 resource-level fencing, not node-level fencing. The other node is not powered 
 down or rebooted so perhaps I shouldn't be using stonith at all. Currently I 
 think about having stonith-enabled=false and I am writing a master-slave 
 resource agent script to manage the SCSI persistent reservations in case of 
 fail-over.

The idea of such a resource agent is fine, but please don't write one
from scratch. Instead, expand on this one:

https://github.com/nif/ClusterLabs__resource-agents/blob/master/heartbeat/sg_persist
That one's off to a good start, but the original author never had time
to finish it.
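
Once finished and installed, a master/slave wrapper around it would look
roughly like this (parameter names from memory, so check the agent's
meta-data before use):

primitive p_scsi_reserv ocf:heartbeat:sg_persist \
        params devs="/dev/sdb"
ms ms_scsi_reserv p_scsi_reserv \
        meta master-max=1 notify=true interleave=true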

Mind you; you'll likely still want STONITH, even if you use the sg_persist RA.

Hope this helps.
Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Resource Agent ethmonitor

2012-03-21 Thread Florian Haas
On Tue, Mar 20, 2012 at 4:18 PM, Fiorenza Meini fme...@esseweb.eu wrote:
 Hi there,
 has anybody successfully configured the RA named in the subject of this
 message?

 I got this error: if_eth0_monitor_0 (node=fw1, call=2297, rc=-2,
 status=Timed Out): unknown exec error

Your ethmonitor RA missed its 50-second timeout on the probe (that is,
the initial monitor operation). You should be seeing "Monitoring of
if_eth0 failed, X retries left" warnings in your logs. Grepping your
syslog for "ethmonitor" will probably turn up some useful results.
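
If the interface check genuinely needs longer than that on your
hardware, you can also just raise the operation timeouts, along these
lines (values picked arbitrarily):

primitive if_eth0 ocf:heartbeat:ethmonitor \
        params interface="eth0" \
        op start timeout="60s" \
        op monitor interval="10s" timeout="60s"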

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Openstack] Howto Nova setup with HA?

2012-03-21 Thread Florian Haas
Hi everyone,

apologies for the cross-post; I believe this might be interesting to
people on both the openstack and the pacemaker lists. Please see
below.

On Tue, Feb 14, 2012 at 9:07 AM, i3D.net - Tristan van Bokkem
tristanvanbok...@i3d.nl wrote:
 Hi Stackers,

 It seems running Openstack components in High Availability hasn't been
 really a focus point lately, am I right?

 The general docs don't really mention HA except for nova-network. So I did
 some resource on how to run Nova in a High Availability and have some
 questions about it:

 The docs guide you on how to set up one cloud controller (running MySQL,
 nova-api, RabbitMQ etc.) and 2+n nodes for nova compute/network. But it does
 not mention how to make the cloud controller redundant. If the cloud
 controller breaks we have a serious problem!

 So, we can run MySQL in master-master mode on multiple hosts, we can run
 nova-api on several hosts and load balance those, and RabbitMQ has a cluster
 ha setup as well but is this the way to go? I can't find a clear answer to
 this. I am hoping one can shine some light on this!

 Best regards,

 Tristan van Bokkem
 Datacenter Operations

I've taken the liberty to put together a bit of a summary of the
discussion we've had here,[1] roll it into a design summit brainstorm
proposal, and also post it on my blog, here:

http://www.hastexo.com/blogs/florian/2012/03/21/high-availability-openstack

I hope it's not a violation of list etiquette to say that instead of
cross-posting all replies to both lists, everyone's welcome to make
comments on that blog post, too (use your Launchpad OpenID). Please
feel free to flame me to a crisp or call me an idiot; as some of you
are aware I'm quite firmly an HA guy getting into OpenStack, rather
than the other way around.

Even if the design summit proposal doesn't make it through, perhaps a
few interested people (Monty? Jay? Adam? Major?) would like to sit
down over beverages to discuss this in person.

All comments and feedback much appreciated. Thanks!

Cheers,
Florian

[1] Pacemaker subscribers, for context the full thread is at
http://www.mail-archive.com/openstack@lists.launchpad.net/msg07495.html

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] How can I preview the shadow configuration?

2012-03-20 Thread Florian Haas
On Tue, Mar 20, 2012 at 11:15 AM, Rasto Levrinc rasto.levr...@gmail.com wrote:
 2012/3/20 Mars gu gukaicoros...@163.com:
 Hi,
     I want to execute the command, but this problem occurred:

 [root@h10_148 ~]# ptest
 -bash: ptest: command not found

 How can I preview the shadow configuration?

 ptest has been replaced by crm_simulate.

I thought I recalled that ptest was kicked out of the RHEL/CentOS
packages in 1.1.6, and that 1.1.5 still shipped with it. At any rate,
crm_simulate should be in both, and it would be the preferred utility
to use.
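
Typical invocations would be something like:

crm_simulate -S -L                    # simulate against the live CIB
crm_simulate -S -x /path/to/cib.xml   # simulate against a CIB saved to a file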

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Using shadow configurations noninteractively

2012-03-19 Thread Florian Haas
On Mon, Mar 19, 2012 at 8:00 PM, Phil Frost p...@macprofessionals.com wrote:
 I'm attempting to automate my cluster configuration with Puppet. I'm already 
 using Puppet to manage the configuration of my Xen domains. I'd like to 
 instruct puppet to apply the configuration (via cibadmin) to a shadow config, 
 but I can't find any sure way to do this. The issue is that running 
 crm_shadow --create ... starts a subshell, but there's no easy way I can 
 tell puppet to run a command, then run another command in the subshell it 
 creates.

 Normally I'd expect some command-line option, but I can't find any. It does 
 look like it sets the environment variable CIB_shadow. Is that all there is 
 to it? Is it safe to rely on that behavior?

I've never tried this specific use case, so bear with me while I go
out on a limb, but the crm shell is fully scriptable. Thus you
*should* be able to generate a full-blown crm script, with cib foo
commands and whathaveyou, in a temporary file, and then just do
"crm < /path/to/temp/file". Does that work for you?
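
For illustration, such a temporary file might contain something like
this (resource made up):

cib new staging
configure primitive p_test_ip ocf:heartbeat:IPaddr2 params ip="192.168.122.10"
configure show
cib commit staging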

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] How to setup STONITH in a 2-node active/passive linux HA pacemaker cluster?

2012-03-19 Thread Florian Haas
On Mon, Mar 19, 2012 at 8:14 PM, Mathias Nestler
mathias.nest...@barzahlen.de wrote:
 Hi everyone,

 I am trying to setup an active/passive (2 nodes) Linux-HA cluster with 
 corosync and pacemaker to hold a PostgreSQL-Database up and running. It works 
 via DRBD and a service-ip. If node1 fails, node2 should take over. The same 
 if PG runs on node2 and it fails. Everything works fine except the STONITH 
 thing.

 Between the nodes is an dedicated HA-connection (10.10.10.X), so I have the 
 following interface configuration:

 eth0                        eth1                   host
 10.10.10.251    172.10.10.1     node1
 10.10.10.252    172.10.10.2     node2

 Stonith is enabled and I am testing with a ssh-agent to kill nodes.

 crm configure property stonith-enabled=true
 crm configure property stonith-action=poweroff
 crm configure rsc_defaults resource-stickiness=100
 crm configure property no-quorum-policy=ignore

 crm configure primitive stonith_postgres stonith:external/ssh \
               params hostlist=node1 node2
 crm configure clone fencing_postgres stonith_postgres

You're missing location constraints, and doing this with 2 primitives
rather than 1 clone is usually cleaner. The example below is for
external/libvirt rather than external/ssh, but you ought to be able to
apply the concept anyhow:

http://www.hastexo.com/resources/hints-and-kinks/fencing-virtual-cluster-nodes
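
Translated to your external/ssh setup, the gist is something like this
(untested):

primitive st_node1 stonith:external/ssh params hostlist="node1"
primitive st_node2 stonith:external/ssh params hostlist="node2"
location l_st_node1 st_node1 -inf: node1
location l_st_node2 st_node2 -inf: node2

... that is, each STONITH resource is kept off the node it is meant to
fence.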

Hope this helps.
Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Using shadow configurations noninteractively

2012-03-19 Thread Florian Haas
On Mon, Mar 19, 2012 at 9:00 PM, Phil Frost p...@macprofessionals.com wrote:
 On Mar 19, 2012, at 15:22 , Florian Haas wrote:
 On Mon, Mar 19, 2012 at 8:00 PM, Phil Frost p...@macprofessionals.com 
 wrote:
 I'm attempting to automate my cluster configuration with Puppet. I'm 
 already using Puppet to manage the configuration of my Xen domains. I'd 
 like to instruct puppet to apply the configuration (via cibadmin) to a 
 shadow config, but I can't find any sure way to do this. The issue is that 
 running crm_shadow --create ... starts a subshell, but there's no easy 
 way I can tell puppet to run a command, then run another command in the 
 subshell it creates.

 Normally I'd expect some command-line option, but I can't find any. It does 
 look like it sets the environment variable CIB_shadow. Is that all there 
 is to it? Is it safe to rely on that behavior?

 I've never tried this specific use case, so bear with me while I go
 out on a limb, but the crm shell is fully scriptable. Thus you
 *should* be able to generate a full-blown crm script, with cib foo
 commands and whathaveyou, in a temporary file, and then just do
 "crm < /path/to/temp/file". Does that work for you?


 I don't think so, because the crm shell, unlike cibadmin, has no idempotent 
 method of configuration I've found. With cibadmin, I can generate the 
 configuration for the primitive and associated location constraints for each 
 Xen domain in one XML file, and feed it cibadmin -M as many times as I want 
 without error. I know that by running that command, the resulting 
 configuration is what I had in the file, regardless if the configuration 
 already existed, did not exist, or existed but some parameters were different.

 To do this with crm, I'd have to also write code which checks if things 
 are configured as I want them, then take different actions if it doesn't 
 exist, already exists, or already exists but has the incorrect value. That's 
 not impossible, but it's far harder to develop and quite likely I'll make an 
 error in all that logic that will automate the destruction of my cluster.

Huh? What's wrong with "crm configure load replace somefile"?

Anyhow, I think you haven't really stated what you are trying to
achieve, in detail. So: what is it that you want to do exactly?

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] offtopic scalable block-device

2012-03-16 Thread Florian Haas
On Fri, Mar 16, 2012 at 10:13 AM, ruslan usifov ruslan.usi...@gmail.com wrote:
 Hello

 I'm searching for a solution for a scalable block device (a disk that can extend if we
 add some machines to the cluster). The only thing I've found that fits my task is ceph
 + RDB, but in my tests ceph is very unstable (regular crashes of all its daemons)
 and has poor integration with pacemaker. So, can anybody recommend some
 solution?

Which Ceph version are you using? Both the Ceph daemons and RBD are
fully integrated into Pacemaker in upstream git.

https://github.com/ceph/ceph/tree/master/src/ocf

You may want to look at http://www.hastexo.com/category/tags/ceph for
upcoming updates on this (RSS feed icon at the bottom).

Hope this helps.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] offtopic scalable block-device

2012-03-16 Thread Florian Haas
On Fri, Mar 16, 2012 at 11:06 AM, Vladislav Bogdanov
bub...@hoster-ok.com wrote:
 16.03.2012 12:13, ruslan usifov wrote:
 Hello

 I'm searching for a solution for a scalable block device (a disk that can extend if
 we add some machines to the cluster). The only thing I've found that fits my task
 is ceph + RDB, but in my tests ceph is very unstable (regular crashes of all
 its daemons) and has poor integration with pacemaker. So, can anybody
 recommend some solution?


 I'm now investigating possibilities of using Lustre+DRBD+pacemaker.
 Lustre is now available for EL6 thanks whamcloud and others.

That's an option for a scalable _filesystem_, but the OP's question
was about a block device, and to the best of my knowledge Lustre
doesn't offer that. Unless you want to use loop devices in Lustre,
which sounds awkward to say the least.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] offtopic scalable block-device

2012-03-16 Thread Florian Haas
On Fri, Mar 16, 2012 at 11:14 AM, Lars Marowsky-Bree l...@suse.com wrote:
 On 2012-03-16T11:13:17, Florian Haas flor...@hastexo.com wrote:

 Which Ceph version are you using? Both the Ceph daemons and RBD are
 fully integrated into Pacemaker in upstream git.

 https://github.com/ceph/ceph/tree/master/src/ocf

 You may want to look at http://www.hastexo.com/category/tags/ceph for
 upcoming updates on this (RSS feed icon at the bottom).

 is there a reason for integrating ceph with pacemaker? ceph does
 internal monitoring of OSTs etc anyway, doesn't it?

Assuming you're referring to OSDs, yes it does. It does automatic
failover for MDSs (if you use them) and MONs too. But it currently has
no means of recovering an osd/mds/mon daemon in place when it crashes,
and that's what those RAs do. Really trivial.

Clearly, and the ceph devs and I agree on this, this is a stop-gap
until upstart or systemd jobs for the ceph daemons (with respawn
capability, of course) become widely available.

The ocf:ceph:rbd RA by contrast serves an entirely different purpose,
and I currently don't see how _that_ would be replaced by upstart or
systemd. Unless either of those becomes so powerful (and
cluster-aware) that we don't need Pacemaker at all anymore, but I
don't see that happen anytime soon.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] offtopic scalable block-device

2012-03-16 Thread Florian Haas
On Fri, Mar 16, 2012 at 12:50 PM, ruslan usifov ruslan.usi...@gmail.com wrote:
 When it crashes I get the following stack trace

How about taking that to the ceph-devel list?

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] offtopic scalable block-device

2012-03-16 Thread Florian Haas
On Fri, Mar 16, 2012 at 12:42 PM, Lars Marowsky-Bree l...@suse.com wrote:
 On 2012-03-16T11:28:36, Florian Haas flor...@hastexo.com wrote:

  is there a reason for integrating ceph with pacemaker? ceph does
  internal monitoring of OSTs etc anyway, doesn't it?
 Assuming you're referring to OSDs, yes it does. It does automatic
 failover for MDSs (if you use them) and MONs too. But it currently has
 no means of recovering an osd/mds/mon daemon in place when it crashes,
 and that's what those RAs do. Really trivial.

 Yes, I need to stop calling them OSTs, but that's what object storage
 targets were called before ceph came along ;-) Sorry. Yes, of course, I
 mean OSDs.

 Would this not be more readily served by a simple while loop doing the
 monitoring, even if systemd/upstart aren't around? Pacemaker is kind of
 a heavy-weight here.

If you prefer to suggest a self-hacked while loop to your customers
I'm not stopping you.

 The ocf:ceph:rbd RA by contrast serves an entirely different purpose,
 and I currently don't see how _that_ would be replaced by upstart or
 systemd. Unless either of those becomes so powerful (and
 cluster-aware) that we don't need Pacemaker at all anymore, but I
 don't see that happen anytime soon.

 Agreed. I was mostly curious about the server-side. Thanks for the
 clarification.

I forgot to add, if you actually want to use a ceph _filesystem_ as a
cloned Pacemaker resource, ocf:heartbeat:Filesystem now has support
for that too. But that was just a trivial three-line patch, so nothing
new there.
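
In other words, something along these lines should now work (monitor
address and mount point made up):

primitive p_cephfs ocf:heartbeat:Filesystem \
        params device="192.168.0.1:6789:/" directory="/mnt/cephfs" fstype="ceph"
clone cl_cephfs p_cephfs meta interleave="true"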

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] offtopic scalable block-device

2012-03-16 Thread Florian Haas
On Fri, Mar 16, 2012 at 12:24 PM, ruslan usifov ruslan.usi...@gmail.com wrote:
 Lustre looks very cool and stable, but it doesn't provide a scalable block
 device (Ceph allows it through RDB), and it requires a patched kernel (I couldn't
 find more modern patched kernels for Ubuntu Lucid), so I think it isn't
 acceptable for my use case

Warning, pet peeve here. Everybody, it's RBD. OK? RBD. Not totally
unlike in naming to DRBD. Although they're completely different, the
one thing they do share is that they're block devices, not databases.

End of pet peeve. Thanks for putting up with me. :)

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] offtopic scalable block-device

2012-03-16 Thread Florian Haas
On Fri, Mar 16, 2012 at 4:55 PM, Lars Marowsky-Bree l...@suse.com wrote:
 On 2012-03-16T13:36:34, Florian Haas flor...@hastexo.com wrote:

  Would this not be more readily served by a simple while loop doing the
  monitoring, even if systemd/upstart aren't around? Pacemaker is kind of
  a heavy-weight here.
 If you prefer to suggest a self-hacked while loop to your customers
 I'm not stopping you.

 I didn't say self-hacked, this could be a wrapper officially included.

Surely you've submitted one?

Actually, better still, you could submit a systemd job; as the Ceph
guys themselves seem to be focused more on upstart at this time.

 It just seems that pacemaker+corosync+... is overkill for watching the
 health of a single service on one node.

Up to 3 services, really, but that's a technicality.

 (And no, I think I wouldn't want to run pacemaker on my OSD cluster,
 because that doesn't scale.)

If *your* Ceph cluster needs to be 100 nodes plus, then you're right.
Mine don't.

 And, anyway, at this point in time, I'd tell my customers to skip
 ceph/RADOS for the next 6-12 months still, but to contact us off-list if
 they're interested in PoCs ;-)

Right. Feel free to point them to
http://www.hastexo.com/blogs/florian/2012/03/08/ceph-tickling-my-geek-genes
if they want a quick overview.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] getting started - crm hangs when adding resources, even crm ra classes hangs

2012-03-14 Thread Florian Haas
On Wed, Mar 14, 2012 at 2:16 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:
 Hi,

 On Tue, Mar 13, 2012 at 05:59:35PM -0400, Phillip Frost wrote:
 On Mar 13, 2012, at 2:21 PM, Jake Smith wrote:

  From: Phillip Frost p...@macprofessionals.com
  Subject: [Pacemaker] getting started - crm hangs when adding resources,   
   even crm ra classes hangs
 
  more interestingly, even crm ra classes never terminates, again
  with no output, and nothing appended to syslog.
 
  In Ubuntu 10.04 there is a bug in glib causing hanging on shutdown as well 
  as hanging on some crm commands - there are patches out to fix it for 
  Ubuntu specifically 
  (https://bugs.launchpad.net/ubuntu/oneiric/+source/cluster-glue/+bug/821732).
    Not sure if they affect Debian too.

 Seems to be the same issue, somewhat. I noticed sometimes I'd get lrmadmin 
 -C to work once, but the 2nd time it would deadlock. That behavior was 
 described in the launchpad link you gave.

 It seems what's happened is the glib bug has been patched in debian 
 unstable, and this raexecupstart patch is disabled in the cluster-glue 
 package as described in launchpad. squeeze-backports took the package from 
 unstable, but glib is not patched in squeeze, so raexecupstart.patch is 
 still needed. Not re-enabled in squeeze-backports, however.

 So, I built cluster-glue from the debian source package after manually 
 applying that patch, and now I can run lrmadmin -C all day. Now it's also 
 leaking sockets, but I guess I can live with that.

 Do you have upstart at all? In that case, the debian package
 shouldn't have the upstart enabled when building cluster-glue.

The current cluster-glue package in squeeze-backports,
cluster-glue_1.0.9+hg2665-1~bpo60+2, has upstart disabled.
Double-check that you're running that version. If you do, and the
issue persists, please let us know.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] getting started - crm hangs when adding resources, even crm ra classes hangs

2012-03-14 Thread Florian Haas
On Wed, Mar 14, 2012 at 2:37 PM, Phillip Frost
p...@macprofessionals.com wrote:
 On Mar 14, 2012, at 9:25 AM, Florian Haas wrote:

 Do you have upstart at all? In that case, the debian package
 shouldn't have the upstart enabled when building cluster-glue.

 The current cluster-glue package in squeeze-backports,
 cluster-glue_1.0.9+hg2665-1~bpo60+2, has upstart disabled.
 Double-check that you're running that version. If you do, and the
 issue persists, please let us know.

 Indeed, that's the version that hit the repo last night when I decided to 
 quit. This morning, I tried that version and concluded I was experiencing the 
 same issue.

Are you absolutely certain?

Can you confirm that you're running the ~bpo60+2 (note trailing 2)
build, that you're actually running an lrmd binary from that version
(meaning: that you properly killed your lrmd prior to installing that
package), _and_ that lrmadmin -C does *not* list upstart?

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] getting started - crm hangs when adding resources, even crm ra classes hangs

2012-03-14 Thread Florian Haas
On Wed, Mar 14, 2012 at 4:58 PM, Phillip Frost
p...@macprofessionals.com wrote:
 Can you confirm that you're running the ~bpo60+2 (note trailing 2)
 build, that you're actually running an lrmd binary from that version
 (meaning: that you properly killed your lrmd prior to installing that
 package), _and_ that lrmadmin -
 C does *not* list upstart?

 Let's discard all of my previous conclusions. Apparently I was confused.

 Now, I'm sure I'm running +2 on all three nodes. And, I restarted pacemaker 
 and corosync on all the nodes. I'm basing my knowledge of what versions I'm 
 running on apt-cache policy, output copied below.

dpkg -l package would also tell you what versions you have
installed, in a more concise fashion.

 I can confirm that lrmadmin -C does not list upstart (also below). Nor does 
 it leak sockets, as reported by lsof -f | grep lrm_callback_sock.

Yep, no surprise here.

 However, sometimes pacemakerd will not stop cleanly.

OK. Whether this is related to your original problem or not is a completely
open question, jftr.

 I thought it might happen when stopping pacemaker on the current DC, but 
 after successfully reproducing this failure twice, I couldn't do it again. 
 Pacemakerd seems to exit, but fails to notify the other nodes of its shutdown. 
 Syslog is flooded with "Retransmit List" messages (log attached). These 
 persist until I stop corosync. Asked immediately after stopping pacemaker and 
 corosync on one node, "crm status" on other nodes will report that node as still 
 online. After a while, the stopped node switches to offline; I assume some 
 timeout is expiring and they are assuming it crashed.

You didn't give much other information, so I'm asking this on a hunch:
does your pacemaker service configuration stanza for corosync (either
in /etc/corosync/corosync.conf or in
/etc/corosync/service.d/pacemaker) say ver: 0 or ver: 1?
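
For reference, the stanza I mean is the one that looks like this:

service {
        name: pacemaker
        ver: 1
}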

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] 1.1.6 rpm build for RHEL5

2012-03-10 Thread Florian Haas
On Sat, Mar 10, 2012 at 12:39 AM, Larry Brigman larry.brig...@gmail.com wrote:
 I have looked and cannot seem to find the pre-built 1.1.6 rpm set in the
 clusterlabs repo.

It ships with RHEL/CentOS 6.2.

On RHEL 5 however, 1.1.6 doesn't build. If you don't want to wait for
1.1.7, you'll either need to apply this post-1.1.6 patch:

https://github.com/ClusterLabs/pacemaker/commit/eade0edee5605dcab96522eef779ccc041eddb21

... or just use 1.1.5, which should be fine for most practical
purposes. For RHEL 5, you'll probably also want to build Corosync
1.4.2. Or run on Heartbeat.

Hope this helps.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] DRBD M/S Promotion

2012-03-10 Thread Florian Haas
On Fri, Mar 9, 2012 at 11:24 PM, Scott Piazza
scott.pia...@bespokess.com wrote:
 I have a two-node active/passive pacemaker cluster running with a single
 DRBD resource set up as master-slave.  Today, we restarted both servers
 in the cluster, and when they came back up, both started pacemaker and
 corosync correctly, but the DRBD resource didn't promote.  I manually
 promoted the DRBD resource and all of the child services were able to
 start up without issue.  There were no error counts showing in
 crm_mon.

 The only error I noted in the /var/log/messages log referenced not being
 able to mount /dev/drbd0 because the device wasn't present.

 CIB is available at http://pastebin.com/4b6Fi87w.

 I'm trying to figure out what is wrong with my configuration.

This, most probably:

location cli-prefer-ms_drbd_exports ms_drbd_exports \
rule $id=cli-prefer-rule-ms_drbd_exports inf: #uname eq
pawhsrv01.libertydistribution.com
location cli-prefer-pawhsrv pawhsrv \
rule $id=cli-prefer-rule-pawhsrv inf: #uname eq
pawhsrv01.libertydistribution.com

Remove those. The first is completely wrong, the second is likely a
leftover from when you did crm resource migrate pawhsrv
pawhsrv01.libertydistribution.com and forgot to clear the constraints
when you were done.
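
From memory, something like this should do it (double-check the
constraint IDs against your configuration):

crm resource unmigrate pawhsrv
crm configure delete cli-prefer-ms_drbd_exports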

Hope this helps.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Surprisingly fast start of resources on cluster failover.

2012-03-07 Thread Florian Haas
On Tue, Mar 6, 2012 at 1:49 PM, Florian Crouzat
gen...@floriancrouzat.net wrote:
 I have Florian's rsyslog config:
 https://github.com/fghaas/pacemaker/blob/syslog/extra/rsyslog/pacemaker.conf.in

I should mention that that rsyslog configuration is no longer being
considered for upstream inclusion. See the discussion on the pull
request, here:

https://github.com/ClusterLabs/pacemaker/pull/17

As I understand it, Andrew's plan is to reduce excessively verbose
logging with the switch to libqb.

But thanks for trying it out. :)

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] What's the exact booth revision that ships in SLES 11 SP2?

2012-03-06 Thread Florian Haas
Jiaju,

would you mind pushing your git tags to your GitHub booth repo?
Currently, as far as I can see, there are no tags in that repo at all.
It would be nice to be able to find out what exactly is the git
revision that you guys ship in SP2. Thanks!

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] stonith in a virtual cluster

2012-02-29 Thread Florian Haas
Jean-François,

I realize I'm late to this discussion, however allow me to chime in here anyhow:

On Mon, Feb 27, 2012 at 11:45 PM, Jean-Francois Malouin
jean-francois.malo...@bic.mni.mcgill.ca wrote:
 Have you looked at fence_virt? http://www.clusterlabs.org/wiki/Guest_Fencing

 Yes I did.

 I had a quick go last week at compiling it on Debian/Squeeze with
 backports but with no luck.

Seeing as you're on Debian, there really is no need to use fence_virt.
Instead, you should just be able to use the external/libvirt STONITH
plugin that ships with cluster-glue (in squeeze-backports). That
plugin works like a charm and I've used it in testing many times. No
need to compile anything.

http://www.hastexo.com/resources/hints-and-kinks/fencing-virtual-cluster-nodes
may be a helpful resource.
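
A typical configuration would look something like this (URI and host
names made up):

primitive st_virt stonith:external/libvirt \
        params hostlist="node1,node2" \
               hypervisor_uri="qemu+ssh://virthost/system"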

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] OCFS2 in Pacemaker, post Corosync 2.0

2012-02-29 Thread Florian Haas
Andrew,

just a quick question out of curiosity: the ocf:pacemaker:o2cb resource
and ocfs2_controld.pcmk require the OpenAIS CKPT service which is
currently deprecated (as all of OpenAIS) and going away completely
(IIUC) with Corosync 2.0. Does that mean that OCFS2 will be unsupported
from Corosync 2.0 forward, as far as Pacemaker is concerned? Or has that
CKPT dependency been removed, or will there be another supported way to
run it?

All insight much appreciated. Thanks!

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Upstart resources

2012-02-28 Thread Florian Haas
2012/2/27 Ante Karamatić ante.karama...@canonical.com:
 On 27.02.2012 12:27, Florian Haas wrote:

 Alas, to the best of my knowledge the only way to change a specific
 job's respawn policy is by modifying its job definition. Likewise,
 that's the only way to enable or disable starting on system boot. So
 there is a way to overrule the package maintainer's default -- hacking
 the job definition.

 I've explained '(no)respawn' in the other mail. Manual starting/stopping
 is done by:

 echo 'manual' > /etc/init/${service}.override

 That's all you need to forbid automatic starting or stopping the service.

Oh thanks! I didn't know that, much to my dismay.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Question about master/slave resource promotion

2012-02-25 Thread Florian Haas
On Sat, Feb 25, 2012 at 12:31 AM, David Vossel dvos...@redhat.com wrote:
 Hey,

 I have a 2 node cluster with a multi-state master/slave resource. When the 
 multi-state resources start up on each node they enter the Slave role.  At 
 that point I can't figure out how to promote the resource to activate the 
 Master role on one of the nodes. Is there anything special I need to do to 
 get an instance of my multi-state resource to promote to the Master role?

Yeah, actually using a resource type that is capable of running in
master/slave mode would be a good start. :) Use ocf:pacemaker:Stateful
instead of ocf:pacemaker:Dummy in your test setup.
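
A minimal test setup would be something like:

primitive p_stateful ocf:pacemaker:Stateful \
        op monitor interval="10s" role="Master" \
        op monitor interval="11s" role="Slave"
ms ms_stateful p_stateful \
        meta master-max="1" clone-max="2"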

Hope this helps.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Last chance to object to the syntax for cluster tickets (multi-site clusters)

2012-02-24 Thread Florian Haas
On 02/24/12 02:53, Andrew Beekhof wrote:
 We're about to lock in the syntax for cluster tickets (used for
  multi-site clusters).
 
 The syntax rules are at:
   
 https://github.com/gao-yan/pacemaker/commit/9e492f6231df2d8dd548f111a2490f02822b29ea
 
 And its use, along with some examples, can be found here:

 https://github.com/gao-yan/pacemaker/commit/5f75da8d99171cc100e87935c8c3fd2f83243f93
 
 If there are any comments/concerns, now is the time to raise them.

For naming, I must confess I find it a bit strange that while all other
constraint types use ordinary English names (order, location,
colocation), this one uses a rather strange looking abbreviation.
However, I'll also concede that the only alternative that currently
comes to my mind would be to rename the constraint type to ticket, but
that obviously creates ambiguity between ticket the constraint and
ticket the thing that booth manages, so it would probably be worse.
Perhaps others have a better idea.

About the documentation, I generally find it very useful; I only have
one suggestion for an addition: it's not immediately clear from the
existing docs that multiple resources can depend on the same ticket. It
does mention resource sets (which, still, could use an additional
sentence à la "thus, multiple resources can depend on the same ticket"
as a courtesy to the novice reader), but it doesn't say whether it's OK
to have multiple constraints referring to the same ticket.
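
For example, if I'm reading the proposed syntax correctly, both of the
following should be legal and reference the same ticket:

<rsc_ticket id="rsc1-req-ticketA" rsc="rsc1" ticket="ticketA" loss-policy="stop"/>
<rsc_ticket id="rsc2-req-ticketA" rsc="rsc2" ticket="ticketA" loss-policy="stop"/>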

If I can spare some time in the next few weeks I might also
prepare a "die, passive voice, die" patch for that documentation page,
but that's just a pet peeve of mine. :)

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker will not mount ocfs2

2012-02-24 Thread Florian Haas
On 02/24/12 08:50, Johan Rosing Bergkvist wrote:
 Hi
 Just an update.
 So I upgraded to pacemaker 1.1.6 and tried to configure it all again,
 without dlm.
 It didn't work, I still got the OCF_ERR_INSTALLED  so I started looking
 through the setup and found that I didn't specify the drbd.conf path.
 When  I added that meta and boom, it mounted like a dream.

Huh? drbdconf in ocf:linbit:drbd is a regular param, not a meta
attribute. It sounds like you're mixing something up, can you clarify
please?

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker will not mount ocfs2

2012-02-24 Thread Florian Haas
On 02/24/12 09:21, Johan Rosing Bergkvist wrote:
 Sorry, parameter, you're right.
 But still, it didn't mount until I added the drbdconf parameter.
 
 primitive clusterDRBD ocf:linbit:drbd \
 params drbd_resource=cluster-ocfs drbdconf=/etc/drbd.conf \  <-- this is what I added
 op monitor interval=20 role=Master timeout=20 \
 op monitor interval=30 role=Slave timeout=20
 
 I was just wondering if this parameter was required and if so, since I
 used the default path shouldn't it be preconfigured?

It is the default path, it is preconfigured, and you shouldn't need to
add this. This isn't some in-place upgrade from an age-old DRBD version
like 0.7, is it? (*shudder*)

Also, what does ls /etc/drbd* say?

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] DRBD, Fedora, and systemd (tangent off of Re: Upstart resources)

2012-02-24 Thread Florian Haas
On 02/23/12 23:48, Andrew Beekhof wrote:
 On Thu, Feb 23, 2012 at 6:31 PM, Ante Karamatic iv...@ubuntu.com wrote:
 On 23.02.2012 00:10, Andrew Beekhof wrote:

 Do you still have LSB scripts on a machine thats using upstart?

 Yes, some LSB scripts can't be easily converted to upstart jobs. Or,
 let's rephrase that - can't be converted to upstart jobs without losing
 some of the functionality.

 On fedora they purged them all.

 All? Even the stuff like drbd? I have to take a look at that.
 
 I think any package that doesn't have a unit file is going to be
 blacklisted from F-17.
 That was the threat at least.

In a Pacemaker cluster, nothing needs to touch DRBD during the system
boot sequence. Nothing should, really. So the absence of any bootup
script in a DRBD package should hardly be a reason to zap it from the
distro.

I'm CC'ing the Fedora DRBD package maintainer so he's at least informed
of this thread, as I'm unsure if he follows the Pacemaker list on a
regular basis. Hi Major. :)

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker will not mount ocfs2

2012-02-21 Thread Florian Haas
On 02/21/12 13:39, Johan wrote:
 I've been following this 
 http://publications.jbfavre.org/virtualisation/cluster-
 xen-corosync-pacemaker-drbd-ocfs2.en tutorial on how to setup a pacemaker xen 
 cluster. I'm all new to this so please bear with me.
 The big problem is that when I get to the point where the filesystem should 
 automagically mount, it doesn't. 
 
 Here's my config:
 
 node cluster01
 node cluster02
 primitive Cluster-FS-DLM ocf:pacemaker:controld \
 op monitor interval=15 \
 meta target-role=Stopped
 primitive Cluster-FS-DRBD ocf:linbit:drbd \
 params drbd_resource=cluster-ocfs \
 operations $id=Cluster-FS-DRBD-ops \
 op monitor interval=20 role=Master timeout=20 \
 op monitor interval=30 role=Slave timeout=20
 primitive Cluster-FS-Mount ocf:heartbeat:Filesystem \
 params device=/dev/drbd/by-res/cluster-ocfs directory=/cluster 
 fstype=ocfs2
 ms Cluster-FS-DRBD-Master Cluster-FS-DRBD \
 meta resource-stickines=100 master-max=2 notify=true 
 interleave=true target-role=Stopped
 clone Cluster-FS-Mount-Clone Cluster-FS-Mount \
 meta interleave=true ordered=true target-role=Stopped
 order Cluster-FS-After-DRBD inf: Cluster-FS-DRBD-Master:promote Cluster-FS-
 Mount-Clone:start
 order Cluster-FS-DLM-Order inf: Cluster-FS-DRBD-Master:promote 
 Cluster-FS-Mount-
 Clone:start
 property $id=cib-bootstrap-options \
 dc-version=1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b \
 cluster-infrastructure=openais \
 expected-quorum-votes=2 \
 no-quorum-policy=ignore \
 default-resource-stickiness=1000 \
 stonith-enabled=false \
 last-lrm-refresh=1329823386
 
 I keep getting the:
  info: RA output: (Cluster-FS-Mount:1:start:stderr) FATAL: Module 
 scsi_hostadapter not found.

That's a red herring. Why the Filesystem RA is still trying to modprobe
scsi_hostadapter, and is even logging any failure to do so with a FATAL
priority, don't ask. :)

However, with all those target-role=Stopped attributes in there
nothing of interest is really expected to start.

 in the /var/log/syslog
 
 I've been googling around for a solution but all of them seem to fail for me.
 
 any help is much appreciated

That tutorial is wrong in several places. Specifically,

"One word about OCFS2. In a perfect world, we should manage OCFS2 with
pacemaker. In this particular case, this won't be the case (I had issues
with lock managment which is mandatory for pacemaker)."

... is just nonsense. You can (and should) put the DLM and O2CB under
Pacemaker management. See the ocf:pacemaker:controld and
ocf:pacemaker:o2cb resource agents for details.
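
The usual pattern is a cloned group, roughly like this (resource names
are made up, adjust the monitor intervals to taste):

primitive p_controld ocf:pacemaker:controld \
  op monitor interval=10
primitive p_o2cb ocf:pacemaker:o2cb \
  op monitor interval=10
group g_ocfs2_base p_controld p_o2cb
clone cl_ocfs2_base g_ocfs2_base \
  meta interleave=true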

Also, you'll probably need to update your OCFS2 with tunefs.ocfs2
--update-cluster-stack before you can mount it.
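
With the device from your config, that would be something along these
lines (run it while the filesystem is unmounted and DRBD is Primary on
that node):

tunefs.ocfs2 --update-cluster-stack /dev/drbd/by-res/cluster-ocfs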

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Requesting re-sync

2012-02-21 Thread Florian Haas
On Tue, Feb 21, 2012 at 3:57 PM, Pieter Baele pieter.ba...@gmail.com wrote:
 After upgrading a node (RHEL 6.1 to 6.2), my /var/log/messages grows
 really really fast
 because of this error, what can be wrong?

So you upgraded just one node, and the other is still unchanged? Can
you give the Pacemaker and Corosync version for both?

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Upstart resources

2012-02-21 Thread Florian Haas
Jake,

sorry, I missed your original post due to travel; let me toss in one
more thing here:

On Tue, Feb 21, 2012 at 3:32 PM, Jake Smith jsm...@argotec.com wrote:
  Are upstart jobs expected to conform to the LSB spec with regards
  to exit codes, etc?
  Is there any reference documentation using upstart resources in
  Pacemaker?
  Or any good advice :-)

 Newer versions of pacemaker and lrmd are able to deal with upstart
 resources via dbus.

Only if the LRM is compiled with --enable-upstart, of course. Which,
to the best of my knowledge, is only set on the Ubuntu builds (and
Ubuntu builds are currently the only ones for which this makes sense
to set, obviously).

This, however, requires that you run with an updated libglib2 package
(again, only on Ubuntu). All of that should be available either in the
upstream Ubuntu repos or, for the current LTS, in the
ubuntu-ha-maintainers PPA.[1]

Hope this helps.

Cheers,
Florian

[1] Why do I need to use a PPA if this release is ostensibly on
long-term support? Don't ask me, ask someone from Canonical. :)

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker will not mount ocfs2

2012-02-21 Thread Florian Haas
On Tue, Feb 21, 2012 at 4:22 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:
 Hi,

 On Tue, Feb 21, 2012 at 02:26:31PM +0100, Florian Haas wrote:
 On 02/21/12 13:39, Johan wrote:
 
  I keep getting the:
   info: RA output: (Cluster-FS-Mount:1:start:stderr) FATAL: Module
  scsi_hostadapter not found.

 That's a red herring. Why the Filesystem RA is still trying to modprobe
 scsi_hostadapter, and is even logging any failure to do so with a FATAL
 priority, don't ask. :)

 Removed. Let's see who'll complain, then perhaps we'll know why
 it was there ;-)

Could you zap that from the Raid1 RA too, please?

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] problems with cman + corosync + pacemaker in debian

2012-02-19 Thread Florian Haas
On 02/18/12 10:59, diego fanesi wrote:
 are you saying I can install drbd + gfs2 + pacemaker without using cman?
 It seems that gfs2 depends on cman...

Only on RHEL/CentOS/Fedora. Not on Debian.

 I want to realize active/active cluster and I'm following the document
 cluster from scratch that you can found on this website.
 
 I don't know if there are other ways to realize it.

Here's a reference config; we use this in classes we teach (where we run
the Pacemaker stack on Debian because that's the only distro that
supports all of Pacemaker, OCFS2, GFS2, GlusterFS and Ceph). This makes
no claims at being perfect, but it works rather well.

primitive p_dlm_controld ocf:pacemaker:controld \
  params daemon=dlm_controld.pcmk \
  op start interval=0 timeout=90 \
  op stop interval=0 timeout=100 \
  op monitor interval=10
primitive p_gfs_controld ocf:pacemaker:controld \
  params daemon=gfs_controld.pcmk \
  op start interval=0 timeout=90 \
  op stop interval=0 timeout=100 \
  op monitor interval=10
group g_gfs2 p_dlm_controld p_gfs_controld
clone cl_gfs2 g_gfs2 \
meta interleave=true

Here's the corresponding DRBD/Pacemaker configuration.

primitive p_drbd_gfs2 ocf:linbit:drbd \
  params drbd_resource=gfs2 \
  op monitor interval=10 role=Master \
  op monitor interval=30 role=Slave
ms ms_drbd_gfs2 p_drbd_gfs2 \
  meta notify=true master-max=2 \
  interleave=true
colocation c_gfs2_on_drbd inf: cl_gfs2 ms_drbd_gfs2:Master
order o_drbd_before_gfs2 inf: ms_drbd_gfs2:promote cl_gfs2:start

Of course, you'll have to add proper fencing, and there are several DRBD
configuration options that you must remember to set. And, obviously, you
need the actual Filesystem resources to manage the GFS2 filesystems themselves.
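
A cloned Filesystem resource for the GFS2 mount would look roughly like
this (device path and mount point are made up, adjust to your setup):

primitive p_fs_gfs2 ocf:heartbeat:Filesystem \
  params device=/dev/drbd/by-res/gfs2 directory=/mnt/gfs2 \
    fstype=gfs2 \
  op monitor interval=20
clone cl_fs_gfs2 p_fs_gfs2 \
  meta interleave=true
colocation c_fs_on_gfs2 inf: cl_fs_gfs2 cl_gfs2
order o_gfs2_before_fs inf: cl_gfs2 cl_fs_gfs2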

That being said, it's entirely possible that a GlusterFS based solution
would solve your issue as well, and be easier to set up. Or even
something NFS based, backed by a single-Primary DRBD config for HA. You
didn't give many details of your setup, however, so it's impossible to
tell for certain.

Hope this helps.
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Resource inter-dependency without being a 'group'

2012-02-18 Thread Florian Haas
On Sat, Feb 18, 2012 at 7:19 PM, David Coulson da...@davidcoulson.net wrote:
 I have an active/active LVS cluster, which uses pacemaker for managing IP
 resources. Currently I have one environment running on it which utilizes ~30
 IP addresses, so a group was created so all resources could be
 stopped/started together. Downside of that is that all resources have to run
 on the same node.
 [...]
 Is there a recommendation or best practice for this type of configuration?
 Is there something similar to 'group', which allows all the resources to be
 referenced as a single 'parent' resource without requiring them all to run
 on the same node?

Is setting meta collocated=false not working for your group?
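
A minimal sketch of what I mean (group and resource names made up):

group g_lvs_ips p_ip_1 p_ip_2 p_ip_3 \
  meta collocated=false

That keeps the single parent handle for starting and stopping the lot,
but drops the requirement that all members run on the same node.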

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] problems with cman + corosync + pacemaker in debian

2012-02-17 Thread Florian Haas
On Sun, Feb 12, 2012 at 10:01 PM, diego fanesi diego.fan...@gmail.com wrote:
 Hi,

 I'm trying to install corosync with pacemaker using drbd + gfs2 with cman
 support.

Why?

GFS2 with dual-Primary DRBD with Pacemaker 1.1.6 is working very well
in squeeze-backports with the dlm_controld.pcmk and gfs_controld.pcmk
daemons. No need to run on cman.

Just install dlm-pcmk and gfs-pcmk and configure appropriately.
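
That is, something along these lines, assuming squeeze-backports is
already in your sources.list:

apt-get -t squeeze-backports install pacemaker
apt-get install dlm-pcmk gfs-pcmk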

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Percona Replication Manager

2012-02-10 Thread Florian Haas
On Fri, Feb 10, 2012 at 1:38 PM, Nick Khamis sym...@gmail.com wrote:
 May I ask where the original blog resides? The one
 with the bizerk blog comments

http://www.lmgtfy.com/?q=percona+replication+manager&l=1

SCNR. :)

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] LVM Setup

2012-01-26 Thread Florian Haas
On Wed, Jan 25, 2012 at 6:49 PM, Gregg Stock gr...@damagecontrolusa.com wrote:
 Hi,
 I'm trying to setup a 5 node cluster, the same topology as described in
 Roll Your Own Cloud: Enterprise Virtualization with KVM, DRBD, iSCSI and
 Pacemaker
 http://blip.tv/linuxconfau/roll-your-own-cloud-enterprise-virtualization-with-kvm-drbd-iscsi-and-pacemaker-4738148

 I'm stuck creating the pacemaker LVM resource. I'm not sure if pacemaker
 doesn't see the volume group or something else is wrong.

 Here is the basic setup on CentOS 5.7:
 1. Six disks setup with a raid 0 array.
 2. The raid device md0 is a physical volume with a volume group vg_cluster
 and logical volume lv_iscsi0 on top.
 3. A DRDB resource r0 that uses the logical volume lv_iscsi0 as the disk -
 the device is /dev/drbd1
 4. A physical volume that uses /dev/drbd1
 5. A volume group iscsivg0 that uses /dev/drbd1

 All of this seems to work fine but when I try and create a the LVM primitive
 with iscsivg0, it is not able to start.

 I've tried different filtering schemes in the lvm.conf file but no luck. I'm
 not sure if pacemaker is not able to see the volume group or there is some
 fundamental problem with what I'm trying to do.

Please pastebin your lvm.conf and a screendump of vgscan -vvv,
taken on a node where DRBD is primary.

Thanks,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] MySQL Master-Master replication with Corosync and Pacemaker

2012-01-25 Thread Florian Haas
On Thu, Jan 26, 2012 at 12:43 AM, Peter Scott pe...@psdt.com wrote:
 Hello.  Our problem is that a Corosync restart on the idle machine in a
 2-node cluster shuts down the mysqld process there and we need it to stay
 up for replication.

Well if you just want to restart Corosync by administrative
intervention (i.e. in a planned, controlled fashion), then why not put
the cluster in maintenance mode before you restart Corosync?
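
Roughly:

crm configure property maintenance-mode=true
# restart Corosync on the node, do whatever else is needed
crm configure property maintenance-mode=false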

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [PATCH 0/2] rsyslog/logrotate configuration snippets

2012-01-15 Thread Florian Haas
On Sun, Jan 15, 2012 at 9:27 PM, Andrew Beekhof and...@beekhof.net wrote:
 On Thu, Jan 12, 2012 at 11:01 PM, Florian Haas flor...@hastexo.com wrote:
 On Thu, Jan 5, 2012 at 10:15 PM, Florian Haas flor...@hastexo.com wrote:
 Florian Haas (2):
      extra: add rsyslog configuration snippet
      extra: add logrotate configuration snippet

  configure.ac                      |    4 +++
  extra/Makefile.am                 |    2 +-
  extra/logrotate/Makefile.am       |    5 
  extra/logrotate/pacemaker.conf.in |    7 ++
  extra/rsyslog/Makefile.am         |    5 
  extra/rsyslog/pacemaker.conf.in   |   39 
 +
  6 files changed, 61 insertions(+), 1 deletions(-)
  create mode 100644 extra/logrotate/Makefile.am
  create mode 100644 extra/logrotate/pacemaker.conf.in
  create mode 100644 extra/rsyslog/Makefile.am
  create mode 100644 extra/rsyslog/pacemaker.conf.in

 Any takers on these?

 Sorry, I was off working on the new fencing logic and then corosync
 2.0 support (when cman and all the plugins, including ours, go away).

 So a couple of comments...

 I fully agree that the state of our logging needs work and I can
 understand people wanting to keep the vast majority of our logs out of
 syslog.
 I'm less thrilled about one-file-per-subsystem, the cluster will often
 do a lot within a single second and splitting everything up really
 hurts the ability to correlate messages.
 I'd also suggest that /some/ information not coming directly from the
 RAs is still appropriate for syslog (such as "I'm going to move A from
 B to C" or "I'm about to turn off node D"), so the nuclear option isn't
 really thrilling me.

So everything that is logged by the RAs with ocf_log, as I wrote in
the original post, _is_ still going to whatever the default syslog
destination may be. The rsyslog config doesn't change that at all.
(Stuff that the RAs simply barf out to stdout/err would go to the lrmd
log.) I maintain that this is the stuff that is also most useful to
people. And with just that information in the syslog, you usually get
a pretty clear idea of what the heck the cluster is doing on a node,
and in what order, in about 20 lines of logs close together -- rather
than intermingled with potentially hundreds of lines of other
cluster-related log output.

And disabling the nuclear option is a simple means of adding a #
before "& ~" in the config file. You can ship it that way by default
if you think that's more appropriate. That way, people would get the
split-out logs _plus_ everything in one file, which IMHO is sometimes
very useful for pengine or lrmd troubleshooting/debugging. I,
personally, just don't want Pacemaker to flood my /var/log/messages,
so I'd definitely leave the "& ~" in there, but that may be personal
preference. I wonder what others think.

 In addition to the above distractions, I've been coming up to speed on
 libqb's logging which is opening up a lot of new doors and should
 hopefully help solve the underlying log issues.
 For starters it lets syslog/stderr/logfile all log at different levels
 of verbosity (and formats), it also supports blackboxes of which a
 dump can be triggered in response to an error condition or manually by
 the admin.

 The plan is something along the lines of: syslog gets NOTICE and
 above, anything else (depending on debug level and trace options) goes
 to /var/log/(cluster/?)pacemaker or whatever was configured in
 corosync.
 However, before I can enact that there will need to be an audit of the
 messages currently going to INFO (674 entries) and NOTICE(160 entries)
 with some getting bumped up, others down (possibly even to debug).
 I'd certainly be interested in feedback as to which logs should and
 should not make it.

Yes, even so, I (again, this is personal preference) would definitely
not want pengine logging (which even if half its INFO messages get
demoted to DEBUG, would still be pretty verbose) in my default
messages file.

 If you want to get analytical about it, there is also an awk script
 that I use when looking at what we log.
 I'd be interested in some numbers from the field.

Thanks; I can look at that after LCA.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [PATCH 0/2] rsyslog/logrotate configuration snippets

2012-01-15 Thread Florian Haas
On Mon, Jan 16, 2012 at 10:59 AM, Andrew Beekhof and...@beekhof.net wrote:

 By Nuclear, I meant nothing at all from Pacemaker.

Which is not what it does.

 If thats what you want, there's a far easier way to achieve this and
 keep usable logs around for debugging, set facility to none and add a
 logfile.

No, I don't want that.

 (Stuff that the RAs simply barf out to stdout/err would go to the lrmd
 log.) I maintain that this is the stuff that is also most useful to
 people. And with just that information in the syslog, you usually get
 a pretty clear idea of what the heck the cluster is doing on a node,
 and in what order, in about 20 lines of logs close together -- rather
 than intermingled with potentially hundreds of lines of other
 cluster-related log output.

 Did I not just finish agreeing that hundreds of lines of other
 cluster-related log[s] was a problem?

What in my statement above indicates that I assumed otherwise?

 I just don't think your knee-jerk everything must go approach is the answer.

That is not my approach.

 And disabling the nuclear option is a simple means of adding a #
 before "& ~" in the config file. You can ship it that way by default
 if you think that's more appropriate. That way, people would get the
 split-out logs _plus_ everything in one file, which IMHO is sometimes
 very useful for pengine or lrmd troubleshooting/debugging. I,
 personally, just don't want Pacemaker to flood my /var/log/messages,

 Did you see me arguing against that?

No. What makes you think I did?

 so I'd definitely leave the "& ~" in there, but that may be personal
 preference. I wonder what others think.

 In addition to the above distractions, I've been coming up to speed on
 libqb's logging which is opening up a lot of new doors and should
 hopefully help solve the underlying log issues.
 For starters it lets syslog/stderr/logfile all log at different levels
 of verbosity (and formats), it also supports blackboxes of which a
 dump can be triggered in response to an error condition or manually by
 the admin.

 The plan is something along the lines of: syslog gets NOTICE and
 above, anything else (depending on debug level and trace options) goes
 to /var/log/(cluster/?)pacemaker or whatever was configured in
 corosync.
 However, before I can enact that there will need to be an audit of the
 messages currently going to INFO (674 entries) and NOTICE(160 entries)
 with some getting bumped up, others down (possibly even to debug).
 I'd certainly be interested in feedback as to which logs should and
 should not make it.

 Yes, even so, I (again, this is personal preference) would definitely
 not want pengine logging (which even if half its INFO messages get
 demoted to DEBUG, would still be pretty verbose) in my default
 messages file.

 Sigh, please take time out from preaching to actually read the
 replies.  You might learn something.

This is getting frustrating. Not this logging discussion, but pretty
much any discussion the two of us have been having lately. (And no,
this is not an assignment of guilt or responsibility -- it takes two
to tango.) Let's try and sort this out in person on Thursday.

Florian

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] need cluster-wide variables

2012-01-12 Thread Florian Haas
On Tue, Jan 10, 2012 at 10:24 PM, Arnold Krille arn...@arnoldarts.de wrote:
 Is it possible for slaves to modify their score for promotion? I think that
 would be an interesting feature.

 Probably something like that could already be achieved with dependency-rules
 and variables. But I think a function for resource agents to increase or
 decrease the score would be more clean.

http://www.linux-ha.org/doc/dev-guides/_specifying_a_master_preference.html

crm_master has been around for as long as I can remember.
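
Inside a resource agent it is typically just something like this (the
score value here is arbitrary):

# raise this instance's promotion preference
crm_master -Q -l reboot -v 100
# or withdraw it again
crm_master -Q -l reboot -D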

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [PATCH 0/2] rsyslog/logrotate configuration snippets

2012-01-12 Thread Florian Haas
On Thu, Jan 5, 2012 at 10:15 PM, Florian Haas flor...@hastexo.com wrote:
 Florian Haas (2):
      extra: add rsyslog configuration snippet
      extra: add logrotate configuration snippet

  configure.ac                      |    4 +++
  extra/Makefile.am                 |    2 +-
  extra/logrotate/Makefile.am       |    5 
  extra/logrotate/pacemaker.conf.in |    7 ++
  extra/rsyslog/Makefile.am         |    5 
  extra/rsyslog/pacemaker.conf.in   |   39 
 +
  6 files changed, 61 insertions(+), 1 deletions(-)
  create mode 100644 extra/logrotate/Makefile.am
  create mode 100644 extra/logrotate/pacemaker.conf.in
  create mode 100644 extra/rsyslog/Makefile.am
  create mode 100644 extra/rsyslog/pacemaker.conf.in

Any takers on these?

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [PATCH 0/2] rsyslog/logrotate configuration snippets

2012-01-12 Thread Florian Haas
On Thu, Jan 12, 2012 at 2:15 PM, Vladislav Bogdanov
bub...@hoster-ok.com wrote:
 I marked that message as Important and will include into my builds
 even if it does not go upstream.

 One question - does it break default hb_report and crm_report behavior?

Good point. I presume it would make sense to include anything in
/var/log/pacemaker in hb_report/crm_report. In the meantime, you can
of course use the -E option to include these files manually.
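
For example (the file list is just an illustration, pick whichever of
the split-out logs you care about):

hb_report -f "2012-01-12 00:00" \
  -E /var/log/pacemaker/pengine.log \
  -E /var/log/pacemaker/lrmd.log \
  /tmp/report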

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Configuring 3rd Node as Quorum Node in 2 Node Cluster

2012-01-10 Thread Florian Haas
On Wed, Jan 11, 2012 at 1:44 AM, Andrew Beekhof and...@beekhof.net wrote:
 On Wed, Jan 11, 2012 at 3:30 AM, Andrew Martin amar...@xes-inc.com wrote:
 3. Limit the DRBD, nfs, and smbd resources to only node1 and node2 by adding
 a location rule for the g_nfs group (which includes p_fs_drbd0
 p_lsb_nfsserver p_exportfs_drbd0 p_ip_nfs):
 # crm configure location ms-drbd0-placement ms-drbd0 rule -inf: uname ne
 node1 and uname ne node2

 Right.

Another option would be to permanently run the 3rd node in standby mode.
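
For example (node name made up):

crm node standby quorumnode

The standby attribute persists, so the node stays a pure quorum
provider until you explicitly bring it back online.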

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Cannot Create Primitive in CRM Shell

2012-01-09 Thread Florian Haas
On Mon, Jan 9, 2012 at 11:42 AM, Dan Frincu df.clus...@gmail.com wrote:
 Hi,

 On Fri, Jan 6, 2012 at 11:24 PM, Andrew Martin amar...@xes-inc.com wrote:
 Hello,

 I am working with DRBD + Heartbeat + Pacemaker to create a 2-node
 highly-available cluster. I have been following this official guide on
 DRBD's website for configuring all of the components:
 http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf

 However, once I go to configure the primitives in pacemaker's CRM shell
 (section 4.1 in the PDF above) I am unable to create the primitive. For
 example, I enter the following configuration for a DRBD device called
 drive:
 primitive p_drbd_drive \

   ocf:linbit:drbd \

   params drbd_resource=drive \

   op monitor interval=15 role=Master \

   op monitor interval=30 role=Slave

 After entering all of these lines I hit enter and nothing is returned - it
 appears frozen and I am never returned to the crm(live)configure#  shell.
 An strace of the process does not reveal any obvious blocks. I have also
 tried entering the entire configuration on a single line with the same
 result.

 I would recommend going through this guide first
 http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/

That's a bit of a knee-jerk response if I may say so, and when I wrote
those guides[1] the intention was specifically that people could
peruse them _without_ first having to check the documentation that
covers the configuration internals.

At any rate, Andrew, if your crm shell is freezing up when you're
simply trying to add a primitive, something must be seriously awry in
your setup -- it's something that I've not run into personally, unless
the cluster was already responding to an error state on one of the
nodes. Are you sure your cluster is behaving OK otherwise? Are you
getting meaningful output from crm_mon -1? Does your cluster report
it has successfully elected a DC?

Cheers,
Florian

[1] Which I did while employed by Linbit, which is no longer the case,
as they have asked I point out. http://wp.me/p4XzQ-bN

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Resource ping fails on passive node after upgrading to second nic

2012-01-09 Thread Florian Haas
Stefan,

sorry, your report triggers a complete -EPARSE in my brain.

On Mon, Jan 9, 2012 at 10:38 AM, Senftleben, Stefan (itsc)
stefan.senftle...@itsc.de wrote:
 Hello everybody,

 last week I installed and configured a second network interface in each 
 cluster node.
 After configuring corosync.cfg, the passive node stops the primitive ping 
 (three ping targets).

The Corosync config shouldn't affect the ping resource at all.

 Such errors are in the corosync.log:

 Jan 09 10:12:28 corosync [TOTEM ] A processor joined or left the membership 
 and a new membership was formed.
 Jan 09 10:12:28 corosync [MAIN  ] Completed service synchronization, ready to 
 provide service.
 Jan 09 10:12:30 corosync [TOTEM ] ring 1 active with no faults
 Jan 09 10:12:37 lxds05 crmd: [1347]: info: process_lrm_event: LRM operation 
 pri_ping:1_start_0 (call=11, rc=0, cib-update=17, confirmed=true) ok
 Jan 09 10:12:42 lxds05 attrd: [1345]: info: attrd_trigger_update: Sending 
 flush op to all hosts for: pingd (3000)
 Jan 09 10:13:37 lxds05 crmd: [1347]: WARN: cib_rsc_callback: Resource update 
 17 failed: (rc=-41) Remote node did not respond
 Jan 09 10:17:25 lxds05 attrd: [1345]: info: attrd_trigger_update: Sending 
 flush op to all hosts for: master-pri_drbd_omd:0 (1)
 Jan 09 10:17:25 lxds05 attrd: [1345]: info: attrd_perform_update: Sent update 
 22: master-pri_drbd_omd:0=1
 Jan 09 10:19:25 lxds05 attrd: [1345]: WARN: attrd_cib_callback: Update 22 for 
 master-pri_drbd_omd:0=1 failed: Remote node did not respond
 Jan 09 10:22:08 lxds05 cib: [1343]: info: cib_stats: Processed 67 operations 
 (1044.00us average, 0% utilization) in the last 10min
 Jan 09 10:22:25 lxds05 attrd: [1345]: info: attrd_trigger_update: Sending 
 flush op to all hosts for: master-pri_drbd_omd:0 (1)
 Jan 09 10:22:25 lxds05 attrd: [1345]: info: attrd_perform_update: Sent update 
 24: master-pri_drbd_omd:0=1
 Jan 09 10:24:25 lxds05 attrd: [1345]: WARN: attrd_cib_callback: Update 24 for 
 master-pri_drbd_omd:0=1 failed: Remote node did not respond
 Jan 09 10:27:25 lxds05 attrd: [1345]: info: attrd_trigger_update: Sending 
 flush op to all hosts for: master-pri_drbd_omd:0 (1)
 Jan 09 10:27:25 lxds05 attrd: [1345]: info: attrd_perform_update: Sent update 
 26: master-pri_drbd_omd:0=1
 Jan 09 10:29:25 lxds05 attrd: [1345]: WARN: attrd_cib_callback: Update 26 for 
 master-pri_drbd_omd:0=1 failed: Remote node did not respond
 Jan 09 10:32:08 lxds05 cib: [1343]: info: cib_stats: Processed 6 operations 
 (1666.00us average, 0% utilization) in the last 10min
 Jan 09 10:32:25 lxds05 attrd: [1345]: info: attrd_trigger_update: Sending 
 flush op to all hosts for: master-pri_drbd_omd:0 (1)
 Jan 09 10:32:25 lxds05 attrd: [1345]: info: attrd_perform_update: Sent update 
 28: master-pri_drbd_omd:0=1
 Jan 09 10:34:25 lxds05 attrd: [1345]: WARN: attrd_cib_callback: Update 28 for 
 master-pri_drbd_omd:0=1 failed: Remote node did not respond

Not a single message from any ping resource here.

 The check with corosync-cfg -s runs without errors on both nodes.

Does corosync-objctl | grep member yield two members or one?

 I do not know, what is wrong, because the targets used in the crm config can 
 be pinged successfully.
 Can someone help me, please? Thanks in advance.

Unlikely, you didn't give an awful lot of useful information, even
your resource config is missing. A "cibadmin -Q" dump posted to
pastebin, and the URL shared here, might help.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Resource ping fails on passive node after upgrading to second nic

2012-01-09 Thread Florian Haas
On Mon, Jan 9, 2012 at 2:01 PM, Senftleben, Stefan (itsc)
stefan.senftle...@itsc.de wrote:
 This is the cibadmin dump of the active one:
 http://pastebin.com/Yg4Jsaxy

You would see this in a crm_mon -rf:

Failed actions:
pri_ping:1_start_0 (node=lxds05, call=-1, rc=1, status=Timed Out):
unknown error

"Timed Out" should be pretty self-explanatory.

However:

 corosync-objctl | grep member brings no output on the nodes

combined with

 root@lxds05:~# cibadmin -Q
 Call cib_query failed (-41): Remote node did not respond

combined with

 Online: [ lxds05 lxds07 ]

... in other words, the totem member list being empty plus one node
saying it can't talk to the DC plus the DC listing both nodes as
healthy, looks positively odd. I'm afraid I wouldn't be able to help a
lot more without being able to actually look at the box though; please
see the link in my sig block if interested.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/services/remote

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] syslog full of redundand link messages

2012-01-09 Thread Florian Haas
On Mon, Jan 9, 2012 at 3:15 PM, Attila Megyeri
amegy...@minerva-soft.com wrote:
 Hi,

 I might be taking something wrong, but,

 bindnetaddr: 10.100.1.255

 does not mean it will listen on this address, but will listen on every 
 interface where this mask matches.
 This is just to make the config file simpler and common for all nodes in the 
 same subnet.

 Or am I taking something terribly wrong?

As Dan states, what you configured looks more like a broadcast
address, not a network address. Assuming your boxes have IP addresses
of 10.100.1.x in a /24 subnet, the correct network address would be
10.100.1.0.

ipcalc is your friend, btw.
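
Assuming your nodes live in 10.100.1.0/24, the totem interface section
would then look roughly like this (multicast address and port are just
placeholders):

interface {
  ringnumber: 0
  bindnetaddr: 10.100.1.0
  mcastaddr: 239.255.1.1
  mcastport: 5405
}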

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] [PATCH 1/2] extra: add rsyslog configuration snippet

2012-01-05 Thread Florian Haas
---
 configure.ac|4 
 extra/Makefile.am   |2 +-
 extra/rsyslog/Makefile.am   |5 +
 extra/rsyslog/pacemaker.conf.in |   39 +++
 4 files changed, 49 insertions(+), 1 deletions(-)
 create mode 100644 extra/rsyslog/Makefile.am
 create mode 100644 extra/rsyslog/pacemaker.conf.in

diff --git a/configure.ac b/configure.ac
index ecae986..ec81938 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1714,6 +1714,10 @@ fencing/Makefile    \
 extra/Makefile \
extra/resources/Makefile\
extra/rgmanager/Makefile\
+   extra/rsyslog/Makefile  \
+   extra/rsyslog/pacemaker.conf\
+   extra/logrotate/Makefile\
+   extra/logrotate/pacemaker.conf  \
 tools/Makefile \
tools/crm_report\
tools/coverage.sh   \
diff --git a/extra/Makefile.am b/extra/Makefile.am
index 5ad7dc7..d9e3360 100644
--- a/extra/Makefile.am
+++ b/extra/Makefile.am
@@ -18,7 +18,7 @@
 
 MAINTAINERCLEANFILES= Makefile.in
 
-SUBDIRS =  resources rgmanager
+SUBDIRS =  resources rgmanager rsyslog
 
 mibdir = $(datadir)/snmp/mibs
 mib_DATA = PCMK-MIB.txt
diff --git a/extra/rsyslog/Makefile.am b/extra/rsyslog/Makefile.am
new file mode 100644
index 000..dbde43c
--- /dev/null
+++ b/extra/rsyslog/Makefile.am
@@ -0,0 +1,5 @@
+MAINTAINERCLEANFILES = Makefile.in
+
+rsyslogdir = $(sysconfdir)/rsyslog.d
+
+rsyslog_DATA = pacemaker.conf
diff --git a/extra/rsyslog/pacemaker.conf.in b/extra/rsyslog/pacemaker.conf.in
new file mode 100644
index 000..4c52698
--- /dev/null
+++ b/extra/rsyslog/pacemaker.conf.in
@@ -0,0 +1,39 @@
+# rsyslog configuration snippet for Pacemaker daemons
+#
+# Include this file in your rsyslog configuration file,
+# _before_ your default log processing rules.
+#
+# If you want Pacemaker log entries in individual log
+# files _and_ your catch-all syslog file, remove the
+# "& ~" lines.
+
+$template PacemakerDaemonLog,"@localstatedir@/log/@PACKAGE_TARNAME@/%programname%.log"
+
+# Entries from the crm_attribute binary and attrd go
+# to one log file.
+:programname,isequal,"crm_attribute" @localstatedir@/log/@PACKAGE_TARNAME@/attrd.log
+& ~
+:programname,isequal,"attrd" ?PacemakerDaemonLog
+& ~
+
+# CIB status messages
+:programname,isequal,"cib" ?PacemakerDaemonLog
+& ~
+
+# Messages from crmd
+:programname,isequal,"crmd" ?PacemakerDaemonLog
+& ~
+
+# Messages from lrmd, including stdout and stderr
+# from poorly-written resource agents that don't
+# use ocf_log and/or ocf_run
+:programname,isequal,"lrmd" ?PacemakerDaemonLog
+& ~
+
+# Policy Engine messages
+:programname,isequal,"pengine" ?PacemakerDaemonLog
+& ~
+
+# Messages from the fencing daemons
+:programname,startswith,"stonith" ?PacemakerDaemonLog
+& ~
-- 
1.7.5.4


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] [PATCH 0/2] rsyslog/logrotate configuration snippets

2012-01-05 Thread Florian Haas
Hi everyone,

apologies for sending patches to the user list -- my subscription
request is still pending on pcmk-devel, so here goes.

One of the most commonly voiced criticisms against Pacemaker is that
it floods people's logs. And while I contend that the rather verbose
logging that Pacemaker offers is a good thing, and that it shouldn't
necessarily have to tune down the amount of messages it emits, we
should offer users a facility where these log messages don't interfere
with more critical logging info.

So the patches contain a simple rsyslog configuration snippet which,
when included in the rsyslog configuration, will log Pacemaker logging
output to files named /var/log/pacemaker/daemon.log, where daemon
can be attrd, cib, crmd, lrmd, pengine, and stonith. (Output from the
crm_attribute binary also goes to the attrd log.)

So what remains in the default system log (/var/log/messages,
/var/log/syslog)? The stuff that you're most likely to care about,
namely the log messages from the resource agents -- i.e. stuff that's
actually relevant to the health of your application, rather than the
health of your cluster infrastructure. I find this makes issues _much_
easier to troubleshoot (but then of course, that may be my personal
preference).

What's also included is a simple logrotate configuration snippet that
makes sure these log files are compressed and rotated once a week.

These changes, since commit d35d6f96daa04d9a2c3c54a0c60a3ff5db5fc293:

  High: Core: Rempove stray character from qb_ipc_response_header definition 
(2012-01-03 11:38:46 +1100)

are also available in my git repository at:
  git://github.com/fghaas/pacemaker syslog

Florian Haas (2):
  extra: add rsyslog configuration snippet
  extra: add logrotate configuration snippet

 configure.ac  |4 +++
 extra/Makefile.am |2 +-
 extra/logrotate/Makefile.am   |5 
 extra/logrotate/pacemaker.conf.in |7 ++
 extra/rsyslog/Makefile.am |5 
 extra/rsyslog/pacemaker.conf.in   |   39 +
 6 files changed, 61 insertions(+), 1 deletions(-)
 create mode 100644 extra/logrotate/Makefile.am
 create mode 100644 extra/logrotate/pacemaker.conf.in
 create mode 100644 extra/rsyslog/Makefile.am
 create mode 100644 extra/rsyslog/pacemaker.conf.in

Hope this is useful.

Cheers,
Florian

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] [PATCH 2/2] extra: add logrotate configuration snippet

2012-01-05 Thread Florian Haas
---
 extra/Makefile.am |2 +-
 extra/logrotate/Makefile.am   |5 +
 extra/logrotate/pacemaker.conf.in |7 +++
 3 files changed, 13 insertions(+), 1 deletions(-)
 create mode 100644 extra/logrotate/Makefile.am
 create mode 100644 extra/logrotate/pacemaker.conf.in

diff --git a/extra/Makefile.am b/extra/Makefile.am
index d9e3360..6fe0a28 100644
--- a/extra/Makefile.am
+++ b/extra/Makefile.am
@@ -18,7 +18,7 @@
 
 MAINTAINERCLEANFILES= Makefile.in
 
-SUBDIRS =  resources rgmanager rsyslog
+SUBDIRS =  resources rgmanager rsyslog logrotate
 
 mibdir = $(datadir)/snmp/mibs
 mib_DATA = PCMK-MIB.txt
diff --git a/extra/logrotate/Makefile.am b/extra/logrotate/Makefile.am
new file mode 100644
index 000..8e400a4
--- /dev/null
+++ b/extra/logrotate/Makefile.am
@@ -0,0 +1,5 @@
+MAINTAINERCLEANFILES = Makefile.in
+
+logrotatedir = $(sysconfdir)/logrotate.d
+
+logrotate_DATA = pacemaker.conf
diff --git a/extra/logrotate/pacemaker.conf.in b/extra/logrotate/pacemaker.conf.in
new file mode 100644
index 000..3edd17e
--- /dev/null
+++ b/extra/logrotate/pacemaker.conf.in
@@ -0,0 +1,7 @@
+@localstatedir@/log/@PACKAGE_TARNAME@/*.log {
+  rotate 4
+  weekly
+  compress
+  missingok
+  notifempty
+}
-- 
1.7.5.4


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Patch: use NFSv4 with RA nfsserver

2011-12-27 Thread Florian Haas
On Tue, Dec 27, 2011 at 12:05 PM, Vogt Josef josef.v...@telecom.li wrote:
 Hi all,

 I wrote a patch to the ressource agent nfsserver which deals with NFSv4 
 (see attachment). It's now possible to use either NFSv3 or NFSv4 with this 
 ressource agent.

Any specific reason for not using exportfs?

http://linux-ha.org/doc/man-pages/re-ra-exportfs.html

It looks to me that your patch largely reimplements what the
wait_for_leasetime_on_stop exportfs parameter already does.
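
For comparison, a rough sketch of an NFSv4-friendly export with that RA
(directory, fsid and clientspec are made up):

primitive p_exportfs_data ocf:heartbeat:exportfs \
  params directory=/srv/nfs/data fsid=1 \
    clientspec=10.0.0.0/24 options=rw,no_root_squash \
    wait_for_leasetime_on_stop=true \
  op stop interval=0 timeout=120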

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Patch: use NFSv4 with RA nfsserver

2011-12-27 Thread Florian Haas
On Tue, Dec 27, 2011 at 3:30 PM, Vogt Josef josef.v...@telecom.li wrote:
 Just a question here: I could't get it to work without setting the gracetime 
 - which isn't set in the exportfs RA. Are you sure this works as expected?

Thanks, good input. I'd be happy to add that (as in,
wait_for_gracetime_on_start or similar). However, can you do me a
favor please? Take a look at the discussion archived at
http://www.spinics.net/lists/linux-nfs/msg22670.html and let me know
if nlm_grace_period (as mentioned in
http://www.spinics.net/lists/linux-nfs/msg22737.html) made any
difference?

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] OCFS2 problems when connectivity lost

2011-12-21 Thread Florian Haas
2011/12/21 Ivan Savčić | Epix ivan.sav...@epix.rs:
 Hello,


 We are having a problem with a 3-node cluster based on Pacemaker/Corosync
 with 2 primary DRBD+OCFS2 nodes and a quorum node.

 Nodes run on Debian Squeeze, all packages are from the stable branch except
 for Corosync (which is from backports for udpu functionality). Each node has
 a single network card.

Strongly suggest to also use pacemaker and resource-agents from
squeeze-backports.

 When the network is up, everything works without any problems, graceful
 shutdown of resources on any node works as intended and doesn't reflect on
 the remaining cluster partition.

 When the network is down on one OCFS2 node, Pacemaker
 (no-quorum-policy=stop) tries to shut the resources down on that node, but
 fails to stop the OCFS2 filesystem resource stating that it is in use.

Are you sure you have fencing configured correctly? Normally the
remaining nodes should attempt to fence the misbehaving node.

 *Both* OCFS2 nodes (ie. the one with the network down and the one which is
 still up in the partition with quorum) hang with dmesg reporting that
 events, ocfs2rec and ocfs2_wq are blocked for more than 120 seconds.

That, again, would be an expected side effect if your fencing
malfunctioned: I/O on the device has to freeze until those nodes that
are scheduled for fencing, are in fact fenced. If that fencing
operation never succeeds, then I/O on the remaining nodes freezes
indefinitely.

 When the network is operational, umount by hand works without any problems,
 because for the testing scenario there are no services running which are
 keeping the mountpoint busy.

 Configuration we used is pretty much from ClusterStack/LucidTesting
 document [1], with clone-max=2 added where needed because of the
 additional quorum node in comparison to that document.

From that document:

property $id=cib-bootstrap-options \
dc-version=1.0.7-54d7869bfe3691eb723b1d47810e5585d8246b58 \
cluster-infrastructure=openais \
stonith-enabled=false \
no-quorum-policy=ignore

stonith-enabled=false in an OCFS2 cluster with dual-Primary DRBD. I
just don't think so.
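
Just to sketch the direction (IPMI-based example; device type, address
and credentials are made up, use whatever fencing hardware you actually
have):

property stonith-enabled=true
primitive p_stonith_node1 stonith:external/ipmi \
  params hostname=node1 ipaddr=192.168.100.1 \
    userid=admin passwd=secret interface=lan \
  op monitor interval=3600
location l_stonith_node1 p_stonith_node1 -inf: node1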

Hope this helps.

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] More then one stonith-resource on one node

2011-12-20 Thread Florian Haas
On Tue, Dec 20, 2011 at 3:42 PM, Marc K. marcus.k...@stuttgart.de wrote:
 Hello together,

 I found an older Posting from September this year, with the same problem:
 - a two node cluster
 - every node has two power supplies
 - power supply one is connected to wti-powerswitch 1
 - power supply two is connected to wti-powerswitch 2
 - wit-powerswitch 1 is connected to datacenter-ups 1
 - wit-powerswitch 2 is connected to datacenter-ups 2

 Problem:
 I need two stonith resources for each node. Only one is working; the second 
 is ignored. (On the command line, both work fine.)

 Google found an older post from September this year with the same problem. 
 Are there any new solutions in the meantime? (That post had no real solution. :-( )

Two STONITH devices for one host, _both_ of which you expect to
trigger, is nothing I remember as ever having been supported, up to
this point. What you can do is to run staggered fencing, that is, a
higher-priority fencing device fires first, and then _if that fails_,
another lower-priority one does.

However, in Pacemaker 1.1 this fallback to secondary STONITH devices
(with staggered priorities) simply hasn't yet been implemented in
stonith-ng. It's currently on the list for Fedora 17, and if I
understood Andrew correctly that release should also cover your both
A and B must succeed for fencing to be considered successful
scenario.

So perhaps you can be patient until then and settle for IPMI in the meantime?

Cheers,
Florian

-- 
Need help with Pacemaker?
http://www.hastexo.com/knowledge/pacemaker

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Doc: Resource templates

2011-12-12 Thread Florian Haas
On Mon, Dec 12, 2011 at 10:04 AM, Gao,Yan y...@suse.com wrote:
 On 12/12/11 15:55, Gao,Yan wrote:
 Hi,
 As some people have noticed, we've provided a new feature Resource
 templates since pacemaker-1.1.6. I made a document about it which is
 meant to be included into Pacemaker_Explained. I borrowed the
 materials from Tanja Roth , Thomas Schraitle, (-- the documentation
 specialists from SUSE) and Dejan Muhamedagic. Thanks to them!

 Attaching it here first. If you are interested, please help review it.
 And if anyone would like to help convert it into DocBook and made a
 patch, I would be much appreciate. :-)

 I can tell people would like to see a crm shell version of it as well.
 I'll sort it out and post it here soon.
 Attached the crm shell version of the document.

As much as I appreciate the new feature, was it really necessary that
you re-used a term that already has a defined meaning in the shell?

http://www.clusterlabs.org/doc/crm_cli.html#_templates

Couldn't you have called them resource prototypes instead? We've
already confused users enough in the past.

Florian

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Doc: Resource templates

2011-12-12 Thread Florian Haas
On Mon, Dec 12, 2011 at 11:20 AM, Gao,Yan y...@suse.com wrote:
 On 12/12/11 17:52, Florian Haas wrote:
 On Mon, Dec 12, 2011 at 10:36 AM, Gao,Yan y...@suse.com wrote:
 On 12/12/11 17:16, Florian Haas wrote:
 On Mon, Dec 12, 2011 at 10:04 AM, Gao,Yan y...@suse.com wrote:
 On 12/12/11 15:55, Gao,Yan wrote:
 Hi,
 As some people have noticed, we've provided a new feature Resource
 templates since pacemaker-1.1.6. I made a document about it which is
 meant to be included into Pacemaker_Explained. I borrowed the
 materials from Tanja Roth , Thomas Schraitle, (-- the documentation
 specialists from SUSE) and Dejan Muhamedagic. Thanks to them!

 Attaching it here first. If you are interested, please help review it.
 And if anyone would like to help convert it into DocBook and made a
 patch, I would be much appreciate. :-)

 I can tell people would like to see a crm shell version of it as well.
 I'll sort it out and post it here soon.
 Attached the crm shell version of the document.

 As much as I appreciate the new feature, was it really necessary that
 you re-used a term that already has a defined meaning in the shell?

 http://www.clusterlabs.org/doc/crm_cli.html#_templates

 Couldn't you have called them resource prototypes instead? We've
 already confused users enough in the past.
 Since Dejan adopted the object name rsc_template in crm shell, and
 call it Resource template in the help. I'm not inclined to use another
 term in the document. Opinion, Dejan?

 I didn't mean to suggest to use a term in the documentation that's
 different from the one the shell uses. I am suggesting to rename the
 feature altogether. Granted, it may be a bit late to have a naming
 discussion now, but I haven't seen this feature discussed on the list
 at all, so there wasn't really a chance to voice these concerns
 sooner.
 Actually there were discussions in pcmk-devel mailing list. Given that
 it has been included into pacemaker-1.2 schema and released with
 pacemaker-1.1.6, it seems too late for us to change it from cib side
 now. Unless Dejan would like to rename it from crm shell...

From http://oss.clusterlabs.org/mailman/listinfo/pcmk-devel: The
current archive is only available to the list members. Seriously?

And that's supposedly the list to discuss issues like 'last commit
broke the build' (paraphrasing Andrew, from earlier this year), not
feature additions. When did this change?

Florian

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] right way to update resource configuration on a live cluster?

2011-12-09 Thread Florian Haas
On Fri, Dec 9, 2011 at 10:25 PM, MA Martin Andrews (5542)
mandr...@ag.com wrote:
 I have several heartbeat clusters running Centos 5 and heartbeat 2.1.4.

Argll. Please:

http://www.linux-ha.org/doc/users-guide/_upgrading_from_crm_enabled_heartbeat_2_1_clusters.html

 Is this procedure correct? I was surprised I couldn't find any
 discussion of this process online. If I was using a newer pacemaker
 would the process be simpler?

Proper answer is expletive yeah. (Please don't put this on your
greeting cards.)

Please, by all means, follow the upgrade process.

Cheers,
Florian

-- 
Need help with Pacemaker?
http://www.hastexo.com/knowledge/Pacemaker

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Fw: Unable to start pacemaker due to WARN: do_cib_control: Couldn't complete CIB registration [In reply to]

2011-12-06 Thread Florian Haas
Hi Graham,

On Tue, Dec 6, 2011 at 8:06 AM, Graham Rawolle
rawol...@daintreesystems.com wrote:
 I too am having all sorts of dramas getting pacemaker to start.

 Andrew you mentioned the new way “ver:1” to start the pacemaker daemons.

 The problem is that the two packaged versions of pacemaker that I can find
 for openSUSE 11.4 do not have an /etc/init.d/pacemaker script or even a
 pacemakerd executable – so how can pacemaker be started?

 The versions of pacemaker I have tried are 1.1.5-3.2-x86_64 from
 OpenSUSE-11.4-Oss repository and 1.0.12-1-x86_64 from Cluster Labs
 repository for openSUSE-11.4.

1.0.12 (as any of the 1.0.x releases) did not ship pacemakerd. On
those systems, configuring the Pacemaker service with ver: 1 is not
supported. One would not expect a pacemaker init script there.
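
In other words, on those builds you have to stay with the plugin-based
startup in corosync.conf, roughly:

service {
  # load Pacemaker as a Corosync plugin; the only mode 1.0.x supports
  name: pacemaker
  ver: 0
}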

However, 1.1.5 would support it. But OpenSUSE doesn't ship it (from
the OpenSUSE 11.4 spec file):

# Don't want to ship this just yet:
rm $RPM_BUILD_ROOT/etc/init.d/pacemaker || true
rm $RPM_BUILD_ROOT/usr/sbin/pacemaker{d,} || true

This is unchanged in the spec file for 12.1, which ships Pacemaker
1.1.6. Tim Serong would be the best person to explain the reasoning
behind this (or correct me if my observation is wrong, always a
possibility). But IIUC Tim is currently traveling back home from
Europe, so please give him a day or two to respond. Thanks!

Cheers,
Florian

-- 
Need help with Pacemaker?
http://www.hastexo.com/knowledge/pacemaker

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CMAN - Pacemaker - Porftpd setup

2011-12-06 Thread Florian Haas
Hello,

On Tue, Dec 6, 2011 at 2:36 PM, Bensch, Kobus
kobus.ben...@bauerservices.co.uk wrote:
 colocation ftpsite-with-webip inf: ActiveFTPSite WebIP
 colocation website-with-ip inf: ActiveFTPSite WebIP
 order apache-after-ip inf: WebIP ActiveFTPSite
 order propftpd-after-webip inf: WebIP ActiveFTPSite

Any specific reasons for these double colo and order constraints?

Also, does crm_mon -rf yield any failcounts?

Cheers,
Florian

-- 
Need help with Pacemaker?
http://www.hastexo.com/knowledge/pacemaker

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CMAN - Pacemaker - Porftpd setup

2011-12-06 Thread Florian Haas
On Tue, Dec 6, 2011 at 3:16 PM, Bensch, Kobus
kobus.ben...@bauerservices.co.uk wrote:
 Hi Florian

 Thanks for the reply.

 1.) No reason. I can get rid of one of each

Did you, and if so has it changed the situation?

 2.) The result of crm_mon -rf

OK, no failcounts. Can you create a CIB dump with cibadmin -Q >
/tmp/cib.xml, upload that _unchanged_ to pastebin or whatever similar
service is your favorite, and share the link here?
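
In other words, roughly:

# dump the complete CIB, unmodified
cibadmin -Q > /tmp/cib.xml
# one-shot status snapshot, including inactive resources and failcounts
crm_mon -rf -1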

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Where to install applications

2011-12-02 Thread Florian Haas
On Fri, Dec 2, 2011 at 5:35 PM, Charles DeVoe scarecrow...@yahoo.com wrote:

 We are building a 4 node active/active cluster, which I believe is the same 
 as High Performance.

Not quite. That's still an HA cluster with some scale-out capability.
HPC is a slightly different ballgame.

  The Cluster has a SAN formatted with GFS2.  The discussion is whether to 
install the applications on the shared drive and point each machine to that 
install point or install the applications locally.

Your call, really.

Slapping all applications onto the shared storage means that every
time you update that piece of software, you essentially have to
restart everything all at once -- but only once. So, for updates
you'll normally have downtime. If everything goes nicely, you're back
up very quickly. If something breaks, you're down for some time.

Putting just the data on shared storage, and the applications on
individual nodes, means you're capable of rolling upgrades where you
update your software node by node -- but then again, you have to do it
on every node. If everything works on the first try, you'll normally
take a bit longer than with the approach explained above. If something
breaks during the upgrade of your first node, you shut that node down,
go back to square one, and find and fix the root cause while the other
three continue to hum along.

I for one much prefer the second approach.

Cheers,
Florian

--
Want to know how we've helped others?
http://www.hastexo.com/shoutbox

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] managing config files as resources

2011-12-01 Thread Florian Haas
Hi Larry,

On Thu, Dec 1, 2011 at 6:59 PM, Larry Brigman larry.brig...@gmail.com wrote:
 Is there a method to manage individual files as resources?
 Which RA would be used and any pointer as to how to configure it would be
 great.

 Specifically we need to sync some files between nodes that have
 configuration data
 for our applications like which IP addresses are assigned to each node and
 what
 is the virtual IP of the cluster.

So these files change, dynamically, and _all_ cluster nodes need to
know about it? Or are the files just expected to move along with the
resources?

If the former, one possible approach is to put all files on central
storage (say, an NFS mount point), and then use ocf:heartbeat:symlink
to manage symlinks where your services expect to find the config
files.
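
A minimal sketch of such a symlink resource (the mount point, file
path and resource name are made up, adjust to your setup):

primitive p_symlink_myconf ocf:heartbeat:symlink \
  params link="/etc/myapp/myapp.conf" \
         target="/srv/nfs/config/myapp.conf"

You would then colocate/order that symlink with whatever resource
actually consumes the config file.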

If the latter, you can slap everything on DRBD, and mount the
DRBD-backed filesystem wherever your resource is active. You may, of
course, combine this with managed symlinks.
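
Again, only a rough sketch (the resource, device and mount point names
are hypothetical):

primitive p_drbd_cfg ocf:linbit:drbd \
  params drbd_resource="config" \
  op monitor interval="30s"
ms ms_drbd_cfg p_drbd_cfg \
  meta master-max="1" clone-max="2" notify="true"
primitive p_fs_cfg ocf:heartbeat:Filesystem \
  params device="/dev/drbd0" directory="/cluster/config" fstype="ext3"
colocation c_fs_on_drbd inf: p_fs_cfg ms_drbd_cfg:Master
order o_drbd_before_fs inf: ms_drbd_cfg:promote p_fs_cfg:start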

Cheers,
Florian


-- 
Need help with Pacemaker?
http://www.hastexo.com/knowledge/pacemaker

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] managing config files as resources

2011-12-01 Thread Florian Haas
On Thu, Dec 1, 2011 at 10:45 PM, Larry Brigman larry.brig...@gmail.com wrote:


 On Thu, Dec 1, 2011 at 1:42 PM, Florian Haas flor...@hastexo.com wrote:

 On Thu, Dec 1, 2011 at 10:35 PM, Larry Brigman larry.brig...@gmail.com
 wrote:
  Yes, the files can be changed dynamically - mostly by a user doing a
  configuration change.

 Is a user someone with shell access to the box, or a visitor on your
 web site (or whatever the service is)?

 Normally shell access but we also have an external service that pushes a
 config file
 into place also via sftp.

Use csync2 then.
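
A minimal /etc/csync2.cfg sketch (host names, key path and included
directory are placeholders, and both nodes need the same pre-shared
key):

group g_appconfig
{
        host node1 node2;
        key /etc/csync2.key_appconfig;
        include /etc/myapp;
        auto younger;
}

Then have csync2 -xv run periodically (or triggered by whatever pushes
the file) to propagate changes; "auto younger" resolves conflicts in
favor of the newer copy.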

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

