[Linux-HA] Backing out of HA
I'm about to write a transition plan for getting rid of high-availability on our lab's cluster. Before I do that, I thought I'd put my reasons before this group so that: a) people can exclaim "You fool!" and point out all the stupid things I did wrong; b) sysadmins who are contemplating the switch to HA have additional points to add to the pros and cons.

The basic reason why we want to back out of HA is that, in the three years since I implemented an HA cluster at the lab, we have not had a single hardware problem for which HA would have been useful. However, we've had many instances of lab-wide downtime due to the HA configuration.

Description: two-node cluster, Scientific Linux 6.2 (=RHEL6.2), cman+clvmd+pacemaker, dedicated Ethernet ports for DRBD traffic. I've had both primary/secondary and dual-primary configurations. The resources pacemaker manages include DRBD and VMs with configuration files and virtual disks on the DRBD partition. Detailed package versions and configurations are at the end of this post.

Here are some examples of our difficulties. This is not an exhaustive list.

Mystery crashes

I'll mention this one first because it's the most recent, and it was the straw that broke the camel's back as far as the users were concerned. Last week, cman crashed, and the cluster stopped working. There was no clear message in the logs indicating why. I had no time for archeology, since the crash happened in the middle of our working day; I rebooted everything and cman started up again just fine.

Problems under heavy server load

Let's call the two nodes on the cluster A and B. Node A starts running a process that does heavy disk writes to the shared DRBD volume. The load on A starts to rise. The load on B rises too, more slowly, because the same blocks must be written to node B's disk. Eventually the load on A grows so great that cman+clvmd+pacemaker does not respond promptly, and node B stoniths node A.
The problem is that the DRBD partition on node B is then marked Inconsistent. All the other resources in the pacemaker configuration depend on DRBD, so none of them are allowed to run. The cluster stays in this non-working state (node A powered off, node B not running any resources) until I manually intervene.

Poisoned resource

This is the one you can directly attribute to my stupidity. I add a new resource to the pacemaker configuration. Even though the pacemaker configuration is syntactically correct, and even though I think I've tested it, in fact the resource cannot run on either node. The most recent example: I created a new virtual domain and tested it. It worked fine. I created the ocf:heartbeat:VirtualDomain resource, verified that crm could parse it, and activated the configuration. However, I had not actually created the domain for the virtual machine; I had typed "virsh create ..." but not "virsh define". So I had a resource that could not run.

What I'd want to happen is for the poisoned resource to fail, for me to see lots of error messages, and for the remaining resources to continue to run. What actually happens is that the resource tries to run on both nodes alternately an effectively infinite number of times (10^6 times, or whatever the value is). Then one of the nodes stoniths the other. The poisoned resource still won't run on the remaining node, so that node tries restarting all the other resources in the pacemaker configuration. That still won't work. By this time, usually one of the other resources has failed (possibly because it's not designed to be restarted so frequently), and the cluster is in a non-working state until I manually intervene.

In this particular case, had we not been running HA, the only problem would have been that the incorrectly-initialized domain would not have come up after a system reboot. With HA, my error crashed the cluster.

Let me be clear: I do not claim that HA is without value.
My only point is that for our particular combination of hardware, software, and available sysadmin support (me), high-availability has not been a good investment. I also acknowledge that I haven't provided logs for these problems to corroborate any of the statements I've made. I'm sharing the problems I've had, but at this point I'm not asking for fixes.

Turgid details:

# rpm -q kernel drbd pacemaker cman lvm2 lvm2-cluster resource-agents
kernel-2.6.32-220.4.1.el6.x86_64
drbd-8.4.1-1.el6.x86_64
pacemaker-1.1.6-3.el6.x86_64
cman-3.0.12.1-23.el6.x86_64
lvm2-2.02.87-7.el6.x86_64
lvm2-cluster-2.02.87-7.el6.x86_64
resource-agents-3.9.2-7.el6.x86_64

/etc/cluster/cluster.conf: http://pastebin.com/qRAxLpkx
/etc/lvm/lvm.conf: http://pastebin.com/tLyZd09i
/etc/drbd.d/global_common.conf: http://pastebin.com/H8Kfi2tM
/etc/drbd.d/admin.res: http://pastebin.com/1GWupJz8
output of "crm configure show": http://pastebin.com/wJaX3Msn
output of "crm configure show xml": http://pastebin.com/gyUUb2hi

--
William Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | PO Box 137
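As an aside on the "poisoned resource" failure mode above: one common mitigation (a sketch only; the resource name and VM path here are hypothetical, not the poster's configuration) is to cap the retry count with resource meta-attributes, so a resource that can never start is eventually left stopped on a node instead of bouncing until STONITH fires:

```shell
# Illustrative crm shell fragment: limit how often Pacemaker retries a
# failing resource on a node before banning it from that node.
crm configure primitive TestVM ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/testvm.xml" \
    op monitor interval="30s" \
    meta migration-threshold="3" failure-timeout="600s"
# migration-threshold=3: after three failures the resource may no longer
# run on that node. failure-timeout lets the failure count expire, so
# the ban is not permanent once the underlying problem is fixed.
```

With a finite migration-threshold on every resource, a mis-defined resource fails a few times on each node and then stops, leaving the rest of the cluster running.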
Re: [Linux-HA] exportfs problems
On 1/4/13 7:10 PM, Matthew Spah wrote: Hey everyone, I've just recently built up a pacemaker cluster and have begun testing it. Everything has been going great until after Christmas break. I fired up the cluster to find this going on:

Last updated: Fri Jan 4 16:06:41 2013
Last change: Fri Jan 4 16:02:13 2013 via crmd on emserver1
Stack: openais
Current DC: emserver1 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
9 Resources configured.

Online: [ emserver1 emserver2 ]

Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
    Masters: [ emserver2 ]
    Slaves: [ emserver1 ]
Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
    Started: [ emserver1 emserver2 ]
Resource Group: g_nfs
    p_fs_nfs (ocf::heartbeat:Filesystem): Started emserver2
    p_exportfs_nfs (ocf::heartbeat:exportfs): Started emserver2 (unmanaged) FAILED
    p_ip_nfs (ocf::heartbeat:IPaddr2): Stopped
Clone Set: cl_exportfs_root [p_exportfs_root]
    Started: [ emserver2 ]
    Stopped: [ p_exportfs_root:1 ]

Failed actions:
    p_exportfs_root:0_start_0 (node=emserver1, call=10, rc=-2, status=Timed Out): unknown exec error
    p_exportfs_root:1_monitor_3 (node=emserver2, call=11, rc=7, status=complete): not running
    p_exportfs_nfs_stop_0 (node=emserver2, call=39, rc=-2, status=Timed Out): unknown exec error

I've been reading through documentation to figure out what is going on. If you guys could point me in the right direction that would be a huge help. :) Here is my configuration...
node emserver1
node emserver2
primitive p_drbd_nfs ocf:linbit:drbd \
    params drbd_resource=r0 \
    op monitor interval=15 role=Master \
    op monitor interval=30 role=Slave
primitive p_exportfs_nfs ocf:heartbeat:exportfs \
    params fsid=1 directory=/srv/nfs options=rw,crossmnt clientspec=10.1.10.0/255.255.255.0 \
    op monitor interval=30s
primitive p_exportfs_root ocf:heartbeat:exportfs \
    params fsid=0 directory=/srv options=rw,crossmnt clientspec=10.1.10.0/255.255.255.0 \
    op monitor interval=30s
primitive p_fs_nfs ocf:heartbeat:Filesystem \
    params device=/dev/drbd1 directory=/srv/nfs fstype=ext3 \
    op monitor interval=10s
primitive p_ip_nfs ocf:heartbeat:IPaddr2 \
    params ip=10.1.10.10 cidr_netmask=24 iflabel=NFSV_IP \
    op monitor interval=30s
primitive p_lsb_nfsserver lsb:nfs-kernel-server \
    op monitor interval=30s
group g_nfs p_fs_nfs p_exportfs_nfs p_ip_nfs
ms ms_drbd_nfs p_drbd_nfs \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
clone cl_exportfs_root p_exportfs_root
clone cl_lsb_nfsserver p_lsb_nfsserver
colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master
colocation c_nfs_on_root inf: g_nfs cl_exportfs_root
order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start
order o_root_before_nfs inf: cl_exportfs_root g_nfs:start
property $id=cib-bootstrap-options \
    dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \
    cluster-infrastructure=openais \
    expected-quorum-votes=2 \
    stonith-enabled=false \
    no-quorum-policy=ignore \
    maintenance-mode=false \
    last-lrm-refresh=1357344133
rsc_defaults $id=rsc-options \
    resource-stickiness=200

I've had problems like this with the exportfs resource. Here are some things to check:

- You didn't list the software versions. In particular, look at the version of your resource-agents package. There have been some recent changes to the ocf:heartbeat:exportfs script that improve the pattern-matching in its monitor action.
- The ocf:heartbeat:exportfs monitor works by comparing the clientspec parameter with the output of the exportfs command. Check, when you export to 10.1.10.0, that the output of exportfs returns exactly that string, instead of a resolved name.

It may help to give a concrete example: I exported a partition via ocf:heartbeat:exportfs with clientspec=mail.nevis.columbia.edu. The monitor action always failed, until I realized that mail.nevis.columbia.edu was an alias for franklin.nevis.columbia.edu; that was the name that appeared in the output of /usr/sbin/exportfs.

Hope this helps.

--
William Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | PO Box 137 | Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
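To make the alias pitfall concrete, here is a toy reproduction (the strings are illustrative; the real agent's matching is more elaborate) of the literal string comparison the monitor action performs:

```shell
# The exportfs monitor effectively greps the `exportfs` output for the
# configured clientspec. If the kernel canonicalizes an alias to its
# real hostname, the literal match fails and the monitor reports
# "not running" even though the export is live.
clientspec="mail.nevis.columbia.edu"                 # what the resource says
actual="/exports/mail  franklin.nevis.columbia.edu"  # what exportfs prints
if printf '%s\n' "$actual" | grep -qF "$clientspec"; then
    echo "monitor: running"
else
    echo "monitor: not running"
fi
```

With the alias in clientspec this prints "monitor: not running", which Pacemaker treats as a failed monitor; switching clientspec to the canonical name makes the match succeed.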
Re: [Linux-HA] IP Clone
On 8/20/12 6:54 PM, Yount, William D wrote: No, no complaining. Just glad to get a definitive answer on it. Active/Active made me think something that I guess isn't true. No worries. Honestly, thanks for the reply. Without you, I would have kept trying and trying and trying.

-Original Message-
From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Dimitri Maziuk
Sent: Monday, August 20, 2012 5:50 PM
To: linux-ha@lists.linux-ha.org
Subject: Re: [Linux-HA] IP Clone

On 08/20/2012 05:01 PM, Yount, William D wrote: I am trying to set up an Active/Active cluster. I have an Active/Passive cluster up and running.

I don't remember seeing a clear explanation of when, where, and why you'd actually want an active/active cluster. I never needed one myself, so can't really help you there.

I don't understand how it could be called an Active/Active cluster if you aren't allowed to run the IP address on two servers at once.

You are not allowed to run the IP address on two servers at once, full stop. Complain to Rob Kahn and Vint Cerf.

For what it's worth, I run an Active/Active cluster (probably for all the wrong reasons). IP cloning works fine for me. Here's my setup:

primitive IP_cluster ocf:heartbeat:IPaddr2 \
    params ip=129.236.252.11 cidr_netmask=32 nic=eth0 \
    op monitor interval=30s \
    meta resource-stickiness=0
clone IPClone IP_cluster \
    meta globally-unique=true clone-max=2 clone-node-max=2 \
    interleave=false target-role=Started

Pretty much the canonical version from "Clusters From Scratch". Here's what I've noticed:

- I needed iptables running to make this work.
- This gave me a consistent MAC address for the cluster IP address of 129.236.252.11, improving the availability of the connection.
- I didn't see much load balancing after the first time I set it up. Mostly both clone instances run on a single node of my two-node cluster. For my needs, that's OK, since for me load-balancing is a much lower priority than availability.
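For context on the "I needed iptables running" observation above: a globally-unique IPaddr2 clone is implemented with the iptables CLUSTERIP target, which is also what pins the single (multicast) MAC address for the shared IP. A rule along these lines, roughly what the agent installs (the clustermac and node numbers here are illustrative, not taken from the poster's setup), shows why iptables has to be available:

```shell
# Illustrative CLUSTERIP rule of the kind ocf:heartbeat:IPaddr2 creates
# for a globally-unique clone; the agent generates the actual values.
# Both nodes answer ARP with the same multicast MAC, and CLUSTERIP's
# source-IP hash decides which node processes each client's packets.
iptables -I INPUT -d 129.236.252.11 -i eth0 -j CLUSTERIP --new \
    --hashmode sourceip \
    --clustermac 11:22:33:44:55:66 \
    --total-nodes 2 --local-node 1
```

The sourceip hash also explains the weak load balancing observed: clients are partitioned by source address, not by load, so a small client population can easily land entirely on one node.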
Re: [Linux-HA] exportfs with multiple client ACLs
On 5/25/12 5:07 PM, Seth Galitzer wrote: We have been using NIS netgroups to specify export options based on host membership as specified in the /etc/netgroup file. Some exports may have multiple export specs based on their netgroups, e.g. one group should have root-squashing enabled whereas another should not. If I'm using /etc/exports, I just add another line onto the spec. With pacemaker, this is not possible, so the suggestion I received was to simply add multiple exportfs resources to accomplish this.

What I am finding is that I am getting erratic behavior, in that export options seem to be randomly overridden. So hosts that should not be getting root-squashed still are. From my testing, it does not seem to be a matter of "last one wins": if the root-squashed resource is running at all, whether started before or after the non-root-squashed resource, then all hosts are root-squashed.

Is anybody else trying to do something like this? If so, how do you specify multiple export rules for different hosts or host groups? I'm using the ocf:heartbeat:exportfs service. Is this ignoring netgroup specs for some reason, or is there something else going on here? My /etc/nsswitch.conf looks correct, as far as NIS goes. I'm running pacemaker 1.1.7 from official packages on debian wheezy. Kernel version is 3.2.0 and nfsd is 1.2.5, also from official packages. Any advice is appreciated. I can provide crm dumps and other configs if needed. Thanks. Seth

I haven't had that problem, though I don't think I use as many multiple host specs as you do. Here's a quick check: after all your exportfs resources are running, look at the output of the exportfs command on the node running the resources. Is the result what you expect?
Re: [Linux-HA] Can /var/lib/pengine files be deleted at boot?
On 5/16/12 6:09 AM, Lars Marowsky-Bree wrote:

On 2012-05-15T13:17:11, William Seligman selig...@nevis.columbia.edu wrote: I can post details and logs and whatnot, but I don't think I need to do detailed debugging. My question is:

I don't think your rationale holds true, though. Like Andrew said, this is only ever just written, not read.

So what I really need to learn is how to understand the pengine state enough to issue some sort of correction. In my case, I think "crm resource cleanup resource-name" was sufficient. So much to learn! So little time!

If I were to set up a procedure to delete the contents of /var/lib/pengine at system boot, would that cause any problems for Pacemaker? Is that state information necessary for the successful startup of the pacemaker service at system start, or can I remove them before pacemaker starts to prevent problems like this in the future?

It won't affect pacemaker, but you're hurting debuggability.
[Linux-HA] Can /var/lib/pengine files be deleted at boot?
I've had some problems with my Linux pacemaker cluster recently. I traced the problem to what I believe is incorrect state information that was saved in directory /var/lib/pengine. I can post details and logs and whatnot, but I don't think I need to do detailed debugging. My question is: If I were to set up a procedure to delete the contents of /var/lib/pengine at system boot, would that cause any problems for Pacemaker? Is that state information necessary for the successful startup of the pacemaker service at system start, or can I remove them before pacemaker starts to prevent problems like this in the future?
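If one did go ahead with this (bearing in mind the warning in the reply that it costs debuggability), the boot-time cleanup could be as small as the following init-script fragment. The path is the RHEL6-era default, and the process-name guard is illustrative:

```shell
# Sketch of a boot-time cleanup, run before the pacemaker service
# starts. The pe-*.bz2 files are the policy engine's saved inputs;
# per this thread, they are only ever written for post-mortem
# debugging, never read back at startup.
if ! pgrep -x pengine >/dev/null 2>&1; then
    rm -f /var/lib/pengine/pe-*.bz2
fi
```

A gentler alternative is to leave the files alone and cap how many are kept (pacemaker has cluster options for the pe-*-series file limits), which preserves some history for debugging.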
Re: [Linux-HA] HA samba?
On 4/25/12 4:53 PM, Seth Galitzer wrote: Can anybody point me to recent docs on how to go about setting this up? I've found several much older posts, but not much current with any kind of helpful detail. This one has a couple of good tips, but doesn't have much depth: http://linux-ha.org/wiki/Samba This one has a lot of detail, but do I really need GFS and CTDB if both nodes just get their locking data from a common shared FS?: http://techwithjim.blogspot.com/2012/04/high-availability-windows-share-using.html I should note that I'm using DRBD+LVM for my node shared storage and also exporting FS shares via NFS (I run heterogeneous systems here with both Linux and Windows clients, so need both available).

Are you running DRBD+LVM primary-secondary or primary-primary?

If it's the former, I suggest using the configuration described in "Clusters From Scratch": http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ the only difference being that instead of running Apache you'd run Samba and NFS. If you're exporting your filesystems read/write, I think that's the recommended configuration.

I'm running primary-primary and exporting filesystems via NFS (I'm running Samba too, but inside a KVM virtual machine exporting its internal filesystem). However, I'm exporting them read-only.
Re: [Linux-HA] problem with nfs and exportfs failover
On 4/14/12 5:55 AM, emmanuel segura wrote: Maybe the problem is the primitive "nfsserver lsb:nfs-kernel-server"; I think this primitive was stopped before "exportfs-admin ocf:heartbeat:exportfs". And if I remember, lsb:nfs-kernel-server and the exportfs agent do the same thing: the first uses the OS scripts and the second the cluster agents.

Now that Emmanuel has reminded me, I'll offer two more tips based on advice he's given me in the past:

- You can deal with the issue he raises directly by putting additional constraints in your setup, something like:

colocation fs-homes-nfsserver inf: group-homes clone-nfsserver
order nfsserver-before-homes inf: clone-nfsserver group-homes

That will make sure that all the group-homes resources (including exportfs-admin) will not be run unless an instance of nfsserver is already running on that node.

- There's a more fundamental question: Why are you placing the start/stop of your NFS server on both nodes under pacemaker control? Why not have the NFS server start at system startup on each node? The only reason I see for putting NFS under Pacemaker control is if there are entries in your /etc/exports file (or the Debian equivalent) that won't work unless other Pacemaker-controlled resources are running, such as DRBD. If that's the case, you're better off controlling them with Pacemaker exportfs resources, the same as you're doing with exportfs-admin, instead of /etc/exports entries.

On 14 April 2012 01:50, William Seligman selig...@nevis.columbia.edu wrote: On 4/13/12 7:18 PM, William Seligman wrote: On 4/13/12 6:42 PM, Seth Galitzer wrote: In attempting to build a nice clean config, I'm now in a state where exportfs never starts. It always times out and errors. "crm config show" is pasted here: http://pastebin.com/cKFFL0Xf syslog after an attempted restart here: http://pastebin.com/CHdF21M4 Only IPs have been edited.

It's clear that your exportfs resource is timing out for the admin resource.
I'm no expert, but here are some stupid exportfs tricks to try:

- Check your /etc/exports file (or whatever the equivalent is in Debian; "man exportfs" will tell you) on both nodes. Make sure you're not already exporting the directory when the NFS server starts.

- Take out the exportfs-admin resource. Then try doing things manually:

# exportfs x.x.x.0/24:/exports/admin

Assuming that works, then look at the output of just

# exportfs

The clientspec reported by exportfs has to match the clientspec you put into the resource exactly. If exportfs is canonicalizing or reporting the clientspec differently, the exportfs monitor won't work. If this is the case, change the clientspec parameter in exportfs-admin to match. If the output of exportfs has any results that span more than one line, then you've got the problem that the patch I referred you to (quoted below) is supposed to fix. You'll have to apply the patch to your exportfs resource.

Wait a second; I completely forgot about this thread that I started: http://www.gossamer-threads.com/lists/linuxha/users/78585 The solution turned out to be removing the .rmtab files from the directories I was exporting, deleting and then re-creating (touching) /var/lib/nfs/rmtab (you'll have to look up the Debian location), and adding rmtab_backup=none to all my exportfs resources. Hopefully there's a solution for you in there somewhere!

On 04/13/2012 01:51 PM, William Seligman wrote: On 4/13/12 12:38 PM, Seth Galitzer wrote: I'm working through this howto doc: http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf and am stuck at section 4.4. When I put the primary node in standby, it seems that NFS never releases the export, so it can't shut down, and thus can't get started on the secondary node. Everything up to that point in the doc works fine and fails over correctly. But once I add the exportfs resource, it fails. I'm running this on debian wheezy with the included standard packages, not custom. Any suggestions?
I'd be happy to post configs and logs if requested.

Yes, please post the output of "crm configure show", the output of exportfs while the resource is running properly, and the relevant sections of your log file. I suggest using pastebin.com, to keep mailboxes from filling up with walls of text.

In case you haven't seen this thread already, you might want to take a look: http://www.gossamer-threads.com/lists/linuxha/dev/77166 And the resulting commit: https://github.com/ClusterLabs/resource-agents/commit/5b0bf96e77ed3c4e179c8b4c6a5ffd4709f8fdae (Links courtesy of Lars Ellenberg.)

The problem and patch discussed in those links don't quite match what you describe. I mention it because I had to patch my exportfs resource (in /usr/lib/ocf/resource.d/heartbeat/exportfs on my RHEL systems) to get it to work properly in my setup.

--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137
Re: [Linux-HA] problem with nfs and exportfs failover
On 4/16/12 1:47 PM, Seth Galitzer wrote: Just a quick update. I set the wait_for_leasetime_on_stop parameter on the exportfs resource to false; now it no longer sleeps for 92 sec and the switchover is instantaneous. Now I just need to figure out how to disable nfsv4 on the server side and I should be home-free.

As you're testing this, a couple of reminders/observations:

- You're exporting /exports/admin with option rw. If your clients are actually writing to that directory, and you want to have true failover, you may need NFSv4. I suggest running a test in which you have a client do an extended write (with dd, for example) then pull the plug on coronado. Is your file or filesystem trashed when you do this?

- If you don't need your clients to be able to write to /exports/admin, you can figure out how to turn off NFSv4 (on RHEL6, this is done by passing -N 4 to nfsd, and is typically done in /etc/sysconfig/nfs). I have the following exportfs definitions on my primary-primary cluster, and my failover tests work just fine:

primitive ExportUsrNevis ocf:heartbeat:exportfs \
    description="Site-wide applications installed in /usr/nevis" \
    op start interval=0 timeout=40 \
    op stop interval=0 timeout=120 \
    params clientspec=*.nevis.columbia.edu directory=/usr/nevis fsid=20 options=ro,no_root_squash,async rmtab_backup=none

Note that I'm exporting this directory ro. If I wanted to support writes with failover (especially in a primary-primary setup!) I'd have tons more work to do.

I notice in the configuration you've posted, you haven't included fencing yet. Don't forget this! And test it as well.

On 04/16/2012 12:42 PM, Seth Galitzer wrote: I've been poking at this more over the weekend and this morning. And while your tip about rmtab was useful, it still didn't resolve the problem. I also made sure that my exports were only being handled/defined by pacemaker and not by /etc/exports.
Though for the cloned nfsserver resource to work, it seems you need an /etc/exports file to exist on the server, even if it's empty. It seems the clue as to what's going on is in this line from the log:

coronado exportfs[20325]: INFO: Sleeping 92 seconds to accommodate for NFSv4 lease expiry

If I bump up the timeout for the exportfs resource to 95 sec, then after the very long timeout, it switches over correctly. So while this is a working solution to the problem, a 95 sec timeout is a little long for my personal comfort on a live and active fileserver. Any idea what is instigating this timeout? Is it exportfs (looks that way from the log entry), nfsd, or pacemaker? If pacemaker, then where can I reduce or remove this?

I've been looking at disabling nfsv4 entirely on this server, as I don't really need it, but haven't found a solution that works yet. Tried the suggestion in this thread, but it seems to be for mounts, not nfsd, and still doesn't help: http://lists.debian.org/debian-user/2011/11/msg01585.html Though I have found that v4 is being loaded on one host but not the other. So if I can find what's different, I may be able to make that work.

coronado:~# rpcinfo -u localhost nfs
program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting
program 100003 version 4 ready and waiting

cascadia:~# rpcinfo -u localhost nfs
program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting

Any further suggestions are welcome. I'll keep poking until I find a solution. Thanks.
Seth

On 04/16/2012 11:49 AM, William Seligman wrote: On 4/14/12 5:55 AM, emmanuel segura wrote: Maybe the problem is the primitive "nfsserver lsb:nfs-kernel-server"; I think this primitive was stopped before "exportfs-admin ocf:heartbeat:exportfs". And if I remember, lsb:nfs-kernel-server and the exportfs agent do the same thing: the first uses the OS scripts and the second the cluster agents.

Now that Emmanuel has reminded me, I'll offer two more tips based on advice he's given me in the past:

- You can deal with the issue he raises directly by putting additional constraints in your setup, something like:

colocation fs-homes-nfsserver inf: group-homes clone-nfsserver
order nfsserver-before-homes inf: clone-nfsserver group-homes

That will make sure that all the group-homes resources (including exportfs-admin) will not be run unless an instance of nfsserver is already running on that node.

- There's a more fundamental question: Why are you placing the start/stop of your NFS server on both nodes under pacemaker control? Why not have the NFS server start at system startup on each node? The only reason I see for putting NFS under Pacemaker control is if there are entries in your /etc/exports file (or the Debian equivalent) that won't work unless other Pacemaker-controlled resources are running, such as DRBD. If that's the case, you're better off controlling them with Pacemaker exportfs
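For reference, the "-N 4" change discussed in this thread is a one-line edit on RHEL6/SL6 (Debian keeps the equivalent setting elsewhere; the file shown here is the RHEL6 location mentioned above):

```shell
# /etc/sysconfig/nfs (RHEL6/SL6): start nfsd without NFSv4, so exportfs
# no longer waits out the ~90 s v4 lease on stop/failover.
RPCNFSDARGS="-N 4"

# After "service nfs restart", confirm version 4 is no longer offered:
rpcinfo -u localhost nfs
```

Once v4 is gone from the rpcinfo output on both nodes, the exportfs stop action has no lease to wait for, which is the same effect wait_for_leasetime_on_stop=false achieves, but enforced at the server rather than in the resource agent.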
Re: [Linux-HA] problem with nfs and exportfs failover
On 4/13/12 12:38 PM, Seth Galitzer wrote: I'm working through this howto doc: http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf and am stuck at section 4.4. When I put the primary node in standby, it seems that NFS never releases the export, so it can't shut down, and thus can't get started on the secondary node. Everything up to that point in the doc works fine and fails over correctly. But once I add the exportfs resource, it fails. I'm running this on debian wheezy with the included standard packages, not custom. Any suggestions? I'd be happy to post configs and logs if requested.

Yes, please post the output of "crm configure show", the output of exportfs while the resource is running properly, and the relevant sections of your log file. I suggest using pastebin.com, to keep mailboxes from filling up with walls of text.

In case you haven't seen this thread already, you might want to take a look: http://www.gossamer-threads.com/lists/linuxha/dev/77166 And the resulting commit: https://github.com/ClusterLabs/resource-agents/commit/5b0bf96e77ed3c4e179c8b4c6a5ffd4709f8fdae (Links courtesy of Lars Ellenberg.)

The problem and patch discussed in those links don't quite match what you describe. I mention it because I had to patch my exportfs resource (in /usr/lib/ocf/resource.d/heartbeat/exportfs on my RHEL systems) to get it to work properly in my setup.
Re: [Linux-HA] fence_nut fencing agent - use NUT to fence via UPS
On 3/1/12 5:37 PM, William Seligman wrote: After days spent debugging a fencing issue with my cluster, I know for certain that this fencing agent works, at least for me. I'd like to contribute it to the Linux HA community.

In my cluster, the fencing mechanism is to use NUT (Network UPS Tools; http://www.networkupstools.org/) to turn off power to a node. About 1.5 years ago, I contributed a NUT-based fencing agent for Pacemaker 1.0: http://oss.clusterlabs.org/pipermail/pacemaker/2010-August/007347.html That script doesn't work for stonith-ng. So here's a new agent, written in perl, and tested under pacemaker-1.1.6 and nut-2.4.3.

I know there's a fence_apc_snmp agent that's already in resource-agents. However, that agent only works with APC devices with multiple-outlet control; it displays an error message when used with my UPSes. This script is for those who'd rather use NUT than play with SNMP MIBs.

I've made some improvements to the NUT-based fencing agent I contributed before. The changes are:

- A more rigorous approach to the error codes returned by the agent.
- Added options to delay the times between issuing a poweron/poweroff command and verifying that the UPS responds.

The revised fence_nut agent is at http://pastebin.com/sQdqWKQq.
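For readers unfamiliar with the interface such agents implement: stonith-ng hands the agent its job as key=value pairs on standard input, one per line, and the agent acts on the requested action. A minimal sketch of the parsing side (parameter names illustrative; the real fence_nut, with its NUT calls and error handling, is in the pastebin above):

```shell
# Toy parser for the stdin protocol a fencing agent speaks. A real
# agent would go on to drive NUT's upscmd and verify the UPS state.
parse_fence_args() {
    action="" port=""
    while IFS= read -r line; do
        case "$line" in
            action=*) action="${line#action=}" ;;
            port=*)   port="${line#port=}" ;;
        esac
    done
    echo "action=$action port=$port"
}

# Example invocation, shaped like what stonith-ng supplies:
printf 'action=off\nport=nevis-ups\n' | parse_fence_args
```

The "verify that the UPS responds" delay options mentioned in the post matter precisely because the off command returns before the outlet actually loses power; the agent must not report success until it has confirmed the state change.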
Re: [Linux-HA] problem with nfs and exportfs failover
On 4/13/12 6:42 PM, Seth Galitzer wrote: In attempting to build a nice clean config, I'm now in a state where exportfs never starts. It always times out and errors. "crm config show" is pasted here: http://pastebin.com/cKFFL0Xf syslog after an attempted restart here: http://pastebin.com/CHdF21M4 Only IPs have been edited.

It's clear that your exportfs resource is timing out for the admin resource. I'm no expert, but here are some stupid exportfs tricks to try:

- Check your /etc/exports file (or whatever the equivalent is in Debian; "man exportfs" will tell you) on both nodes. Make sure you're not already exporting the directory when the NFS server starts.

- Take out the exportfs-admin resource. Then try doing things manually:

# exportfs x.x.x.0/24:/exports/admin

Assuming that works, then look at the output of just

# exportfs

The clientspec reported by exportfs has to match the clientspec you put into the resource exactly. If exportfs is canonicalizing or reporting the clientspec differently, the exportfs monitor won't work. If this is the case, change the clientspec parameter in exportfs-admin to match. If the output of exportfs has any results that span more than one line, then you've got the problem that the patch I referred you to (quoted below) is supposed to fix. You'll have to apply the patch to your exportfs resource.

On 04/13/2012 01:51 PM, William Seligman wrote: On 4/13/12 12:38 PM, Seth Galitzer wrote: I'm working through this howto doc: http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf and am stuck at section 4.4. When I put the primary node in standby, it seems that NFS never releases the export, so it can't shut down, and thus can't get started on the secondary node. Everything up to that point in the doc works fine and fails over correctly. But once I add the exportfs resource, it fails. I'm running this on debian wheezy with the included standard packages, not custom. Any suggestions? I'd be happy to post configs and logs if requested.
Yes, please post the output of crm configure show, the output of exportfs while the resource is running properly, and the relevant sections of your log file. I suggest using pastebin.com, to keep mailboxes from filling up with walls of text.

In case you haven't seen this thread already, you might want to take a look: http://www.gossamer-threads.com/lists/linuxha/dev/77166 And the resulting commit: https://github.com/ClusterLabs/resource-agents/commit/5b0bf96e77ed3c4e179c8b4c6a5ffd4709f8fdae (Links courtesy of Lars Ellenberg.)

The problem and patch discussed in those links don't quite match what you describe. I mention it because I had to patch my exportfs resource (in /usr/lib/ocf/resource.d/heartbeat/exportfs on my RHEL systems) to get it to work properly in my setup.

--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137 | Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
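[Editor's note] The clientspec-matching trick above can be scripted. This is a hedged sketch, not taken from any real node: the sample exportfs output and the clientspec value are fabricated for illustration; on a live server you would capture the real output with EXPORTS="$(exportfs)".

```shell
# Fabricated sample of `exportfs` output. Note the wrapped second line,
# which is how exportfs prints a long path + clientspec pair.
EXPORTS='/exports/admin
        192.168.1.0/24'

# Join continuation lines so each export occupies one line again.
NORMALIZED=$(printf '%s\n' "$EXPORTS" | sed ':a;N;$!ba;s/\n[[:space:]]\{1,\}/ /g')

# The clientspec configured in the exportfs-admin resource (illustrative).
CLIENTSPEC='192.168.1.0/24'

# The resource monitor effectively does a comparison like this one.
if printf '%s\n' "$NORMALIZED" | grep -qF "$CLIENTSPEC"; then
    echo "clientspec matches exportfs output"
else
    echo "clientspec mismatch: change the resource's clientspec parameter"
fi
```

If the real exportfs output canonicalizes the network differently (say, a netmask instead of a prefix), the mismatch branch fires, and the fix is to copy exportfs's spelling into the resource.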
Re: [Linux-HA] problem with nfs and exportfs failover
On 4/13/12 7:18 PM, William Seligman wrote:
On 4/13/12 6:42 PM, Seth Galitzer wrote:

In attempting to build a nice clean config, I'm now in a state where exportfs never starts. It always times out and errors. crm config show is pasted here: http://pastebin.com/cKFFL0Xf syslog after an attempted restart here: http://pastebin.com/CHdF21M4 Only IPs have been edited.

It's clear that your exportfs resource is timing out for the admin resource. I'm no expert, but here are some stupid exportfs tricks to try:

- Check your /etc/exports file (or whatever the equivalent is in Debian; man exportfs will tell you) on both nodes. Make sure you're not already exporting the directory when the NFS server starts.

- Take out the exportfs-admin resource. Then try doing things manually:

  # exportfs x.x.x.0/24:/exports/admin

  Assuming that works, then look at the output of just

  # exportfs

  The clientspec reported by exportfs has to match the clientspec you put into the resource exactly. If exportfs is canonicalizing or reporting the clientspec differently, the exportfs monitor won't work. If this is the case, change the clientspec parameter in exportfs-admin to match.

- If the output of exportfs has any results that span more than one line, then you've got the problem that the patch I referred you to (quoted below) is supposed to fix. You'll have to apply the patch to your exportfs resource.

Wait a second; I completely forgot about this thread that I started: http://www.gossamer-threads.com/lists/linuxha/users/78585

The solution turned out to be to remove the .rmtab files from the directories I was exporting, delete and re-touch /var/lib/nfs/rmtab (you'll have to look up the Debian location), and add rmtab_backup=none to all my exportfs resources. Hopefully there's a solution for you in there somewhere!
On 04/13/2012 01:51 PM, William Seligman wrote:
On 4/13/12 12:38 PM, Seth Galitzer wrote:

I'm working through this howto doc: http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf and am stuck at section 4.4. When I put the primary node in standby, it seems that NFS never releases the export, so it can't shut down, and thus can't get started on the secondary node. Everything up to that point in the doc works fine and fails over correctly. But once I add the exportfs resource, it fails. I'm running this on debian wheezy with the included standard packages, not custom. Any suggestions? I'd be happy to post configs and logs if requested.

Yes, please post the output of crm configure show, the output of exportfs while the resource is running properly, and the relevant sections of your log file. I suggest using pastebin.com, to keep mailboxes from filling up with walls of text.

In case you haven't seen this thread already, you might want to take a look: http://www.gossamer-threads.com/lists/linuxha/dev/77166 And the resulting commit: https://github.com/ClusterLabs/resource-agents/commit/5b0bf96e77ed3c4e179c8b4c6a5ffd4709f8fdae (Links courtesy of Lars Ellenberg.)

The problem and patch discussed in those links don't quite match what you describe. I mention it because I had to patch my exportfs resource (in /usr/lib/ocf/resource.d/heartbeat/exportfs on my RHEL systems) to get it to work properly in my setup.
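[Editor's note] The rmtab fix described in this thread can be sketched as below. All paths are placeholders pointed at a scratch directory so the sketch is safe to run anywhere; on a live server you would operate on your real export directories and on /var/lib/nfs/rmtab (or the Debian equivalent), and add rmtab_backup=none to each exportfs resource in the cluster configuration.

```shell
# Scratch directory standing in for the real export root (placeholder).
ROOT=$(mktemp -d)
mkdir -p "$ROOT/exports/admin"
touch "$ROOT/exports/admin/.rmtab"   # stale backup left behind in an export

# 1. Remove the stale .rmtab backup files from every exported directory.
find "$ROOT/exports" -name '.rmtab' -type f -delete

# 2. Recreate an empty rmtab (real location: /var/lib/nfs/rmtab).
RMTAB="$ROOT/rmtab"
rm -f "$RMTAB" && touch "$RMTAB"

# 3. In the crm configuration, each exportfs resource additionally gets
#    the parameter rmtab_backup=none so the agent stops making backups.
ls -A "$ROOT/exports/admin"   # empty: the stale .rmtab is gone
```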
Re: [Linux-HA] question regarding KVM in HA
On 4/10/12 11:43 AM, Cristina Bulfon wrote:

We have a RH Cluster Suite to manage virtual machines with CLVM. A single virtual machine is on a logical volume, and all machines that belong to the cluster can see it. I am wondering if it is possible to have the same with pacemaker? If yes, what kind of software do I have to use other than pacemaker?

The references I used to set this up are Clusters From Scratch, especially the chapter on Active/Active: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08.html with some assistance from: https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial

Basically, to use pacemaker with clvm you'll have to use cman, so you don't entirely give up RHCS. I'm in the process of validating a cluster like this now. It may help to look at some of the threads I started in this forum to see the challenges I faced, mainly because I'm a slow learner:

http://www.gossamer-threads.com/lists/linuxha/users/78691
http://www.gossamer-threads.com/lists/linuxha/users/78469
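[Editor's note] As a sketch of how the pieces fit together in pacemaker's crm shell: the resource names, clone name, and domain config path below are invented for illustration, not taken from any real cluster. The shape follows the Clusters From Scratch pattern the post refers to: the guest is colocated with, and ordered after, the clvmd clone.

```
# Hypothetical crm fragment: a KVM guest whose disk is on a clustered LV,
# started only where the clvmd clone (ClvmdClone, assumed name) is running.
primitive Guest1 ocf:heartbeat:VirtualDomain \
        params config="/etc/libvirt/qemu/guest1.xml" \
        op monitor interval=30s timeout=30s \
        op start interval=0 timeout=120 \
        op stop interval=0 timeout=120
colocation Guest1_With_Clvmd inf: Guest1 ClvmdClone
order Clvmd_Before_Guest1 inf: ClvmdClone Guest1
```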
Re: [Linux-HA] pacemaker+drbd promotion delay
On 3/30/12 1:13 AM, Andrew Beekhof wrote:
On Fri, Mar 30, 2012 at 2:57 AM, William Seligman selig...@nevis.columbia.edu wrote:
On 3/29/12 3:19 AM, Andrew Beekhof wrote:
On Wed, Mar 28, 2012 at 9:12 AM, William Seligman selig...@nevis.columbia.edu wrote:

The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec files and versions below.

Problem: If I restart both nodes at the same time, or even just start pacemaker on both nodes at the same time, the drbd ms resource starts, but both nodes stay in slave mode. They'll both stay in slave mode until one of the following occurs:

- I manually type crm resource cleanup ms-resource-name
- 15 minutes elapse. Then the PEngine Recheck Timer is fired, and the ms resources are promoted.

The key resource definitions:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=59s role=Master timeout=30s \
        op monitor interval=60s role=Slave timeout=30s \
        op stop interval=0 timeout=100 \
        op start interval=0 timeout=240 \
        meta target-role=Master
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 \
        clone-node-max=1 notify=true interleave=true
# The lengthy definition of FilesystemGroup is in the crm pastebin below
clone FilesystemClone FilesystemGroup \
        meta interleave=true target-role=Started
colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start

Note that I stuck in target-role options to try to solve the problem; no effect.

When I look in /var/log/messages, I see no error messages or indications why the promotion should be delayed. The 'admin' drbd resource is reported as UpToDate on both nodes. There are no error messages when I force the issue with: crm resource cleanup AdminClone

It's as if pacemaker, at start, needs some kind of kick after the drbd resource is ready to be promoted.
This is not just an abstract case for me. At my site, it's not uncommon for there to be lengthy power outages that will bring down the cluster. Both systems will come up when power is restored, and I need cluster services to be available shortly afterward, not 15 minutes later. Any ideas?

Not without any logs

Sure! Here's an extract from the log: http://pastebin.com/L1ZnsQ0R Before you click on the link (it's a big wall of text), I'm used to trawling the logs. Grep is a wonderful thing :-)

At this stage it is apparent that I need to see /var/lib/pengine/pe-input-4.bz2 from hypatia-corosync. Do you have this file still?

No, so I re-ran the test. Here's the log extract from the test I did today: http://pastebin.com/6QYH2jkf Based on what you asked for from the previous extract, I think what you want from this test is pe-input-5. Just to play it safe, I copied and bunzip2'ed all three pe-input files mentioned in the log messages:

pe-input-4: http://pastebin.com/Txx50BJp
pe-input-5: http://pastebin.com/zzppL6DF
pe-input-6: http://pastebin.com/1dRgURK5

I pray to the gods of Grep that you find a clue in all of that! Here are what I think are the landmarks:

- The extract starts just after the node boots, at the start of syslog at time 10:49:21.
- I've highlighted when pacemakerd starts, at 10:49:46.
- I've highlighted when drbd reports that the 'admin' resource is UpToDate, at 10:50:10.
- One last highlight: when pacemaker finally promotes the drbd resource to Primary on both nodes, at 11:05:11.
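[Editor's note] The landmark-hunting above reduces to a grep. The log lines below are fabricated stand-ins for the pastebin extract (the real messages differ); the point is only the pattern: filter for drbd disk-state changes and promote actions to bracket the 15-minute gap.

```shell
# Fabricated syslog lines standing in for the real extract.
LOG='Mar 30 10:49:46 hypatia pacemakerd: info: main: Starting Pacemaker
Mar 30 10:50:10 hypatia kernel: block drbd0: disk( Inconsistent -> UpToDate )
Mar 30 11:05:11 hypatia crmd: info: te_rsc_command: Initiating action: promote AdminDrbd:0'

# Pull out the two landmarks that bracket the delay: DRBD becoming
# UpToDate, and pacemaker finally issuing the promote.
printf '%s\n' "$LOG" | grep -E 'UpToDate|promote'
```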
Details:

# rpm -q kernel cman pacemaker drbd
kernel-2.6.32-220.4.1.el6.x86_64
cman-3.0.12.1-23.el6.x86_64
pacemaker-1.1.6-3.el6.x86_64
drbd-8.4.1-1.el6.x86_64

Output of crm_mon after two-node reboot or pacemaker restart: http://pastebin.com/jzrpCk3i
cluster.conf: http://pastebin.com/sJw4KBws
crm configure show: http://pastebin.com/MgYCQ2JH
drbdadm dump all: http://pastebin.com/NrY6bskk
Re: [Linux-HA] pacemaker+drbd promotion delay
On 3/29/12 3:19 AM, Andrew Beekhof wrote:
On Wed, Mar 28, 2012 at 9:12 AM, William Seligman selig...@nevis.columbia.edu wrote:

The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec files and versions below.

Problem: If I restart both nodes at the same time, or even just start pacemaker on both nodes at the same time, the drbd ms resource starts, but both nodes stay in slave mode. They'll both stay in slave mode until one of the following occurs:

- I manually type crm resource cleanup ms-resource-name
- 15 minutes elapse. Then the PEngine Recheck Timer is fired, and the ms resources are promoted.

The key resource definitions:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=59s role=Master timeout=30s \
        op monitor interval=60s role=Slave timeout=30s \
        op stop interval=0 timeout=100 \
        op start interval=0 timeout=240 \
        meta target-role=Master
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 \
        clone-node-max=1 notify=true interleave=true
# The lengthy definition of FilesystemGroup is in the crm pastebin below
clone FilesystemClone FilesystemGroup \
        meta interleave=true target-role=Started
colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start

Note that I stuck in target-role options to try to solve the problem; no effect.

When I look in /var/log/messages, I see no error messages or indications why the promotion should be delayed. The 'admin' drbd resource is reported as UpToDate on both nodes. There are no error messages when I force the issue with: crm resource cleanup AdminClone

It's as if pacemaker, at start, needs some kind of kick after the drbd resource is ready to be promoted. This is not just an abstract case for me. At my site, it's not uncommon for there to be lengthy power outages that will bring down the cluster.
Both systems will come up when power is restored, and I need cluster services to be available shortly afterward, not 15 minutes later. Any ideas?

Not without any logs

Sure! Here's an extract from the log: http://pastebin.com/L1ZnsQ0R Before you click on the link (it's a big wall of text), here are what I think are the landmarks:

- The extract starts just after the node boots, at the start of syslog at time 10:49:21.
- I've highlighted when pacemakerd starts, at 10:49:46.
- I've highlighted when drbd reports that the 'admin' resource is UpToDate, at 10:50:10.
- One last highlight: when pacemaker finally promotes the drbd resource to Primary on both nodes, at 11:05:11.

Details:

# rpm -q kernel cman pacemaker drbd
kernel-2.6.32-220.4.1.el6.x86_64
cman-3.0.12.1-23.el6.x86_64
pacemaker-1.1.6-3.el6.x86_64
drbd-8.4.1-1.el6.x86_64

Output of crm_mon after two-node reboot or pacemaker restart: http://pastebin.com/jzrpCk3i
cluster.conf: http://pastebin.com/sJw4KBws
crm configure show: http://pastebin.com/MgYCQ2JH
drbdadm dump all: http://pastebin.com/NrY6bskk
Re: [Linux-HA] pacemaker+drbd promotion delay
On 3/27/12 6:12 PM, William Seligman wrote:

The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec files and versions below.

Problem: If I restart both nodes at the same time, or even just start pacemaker on both nodes at the same time, the drbd ms resource starts, but both nodes stay in slave mode. They'll both stay in slave mode until one of the following occurs:

- I manually type crm resource cleanup ms-resource-name
- 15 minutes elapse. Then the PEngine Recheck Timer is fired, and the ms resources are promoted.

The key resource definitions:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=59s role=Master timeout=30s \
        op monitor interval=60s role=Slave timeout=30s \
        op stop interval=0 timeout=100 \
        op start interval=0 timeout=240 \
        meta target-role=Master
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 \
        clone-node-max=1 notify=true interleave=true
# The lengthy definition of FilesystemGroup is in the crm pastebin below
clone FilesystemClone FilesystemGroup \
        meta interleave=true target-role=Started
colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start

Note that I stuck in target-role options to try to solve the problem; no effect.

When I look in /var/log/messages, I see no error messages or indications why the promotion should be delayed. The 'admin' drbd resource is reported as UpToDate on both nodes. There are no error messages when I force the issue with: crm resource cleanup AdminClone

It's as if pacemaker, at start, needs some kind of kick after the drbd resource is ready to be promoted. This is not just an abstract case for me. At my site, it's not uncommon for there to be lengthy power outages that will bring down the cluster. Both systems will come up when power is restored, and I need cluster services to be available shortly afterward, not 15 minutes later. Any ideas?
Details:

# rpm -q kernel cman pacemaker drbd
kernel-2.6.32-220.4.1.el6.x86_64
cman-3.0.12.1-23.el6.x86_64
pacemaker-1.1.6-3.el6.x86_64
drbd-8.4.1-1.el6.x86_64

Output of crm_mon after two-node reboot or pacemaker restart: http://pastebin.com/jzrpCk3i
cluster.conf: http://pastebin.com/sJw4KBws
crm configure show: http://pastebin.com/MgYCQ2JH
drbdadm dump all: http://pastebin.com/NrY6bskk

Well, I can't say that I've solved this one, but I have a solution: If I turn on both machines at once, there's a 15-minute delay. But if I turn on one machine, wait a couple of minutes, then turn on the other, at least the resources start promptly on the first machine. The second machine joins the cluster, but there's still a 15-minute delay until its DRBD partition is promoted by pacemaker.

The reason why DRBD is promoted on the first machine has to do with the previous issue I posted to this list: http://www.gossamer-threads.com/lists/linuxha/users/78691?do=post_view_threaded When doing the initial resource probe of the AdminLvm resource, it times out due to the one-node LVM issue I discuss in that thread. This error causes the pengine on the node to start re-probing resources and promote the DRBD partition, which in turn leads to all the other resources starting on that node.

So I have a work-around, but not a solution. I'll take what I can get!
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes - SOLVED
On 3/27/12 4:52 AM, emmanuel segura wrote:

So now your cluster is OK?

*Laughs* No! There's another problem I have to solve. But it's completely unrelated to this one. I'll work on it some more, and if I can't solve it I'll start a new thread. Thanks for asking, Emmanuel. (I want to prove I can spell your name correctly!)

On 27 March 2012 00:33, William Seligman selig...@nevis.columbia.edu wrote:
On 3/26/12 5:31 PM, William Seligman wrote:
On 3/26/12 5:17 PM, William Seligman wrote:
On 3/26/12 4:28 PM, emmanuel segura wrote:

And I suggest you start clvmd at boot time: chkconfig clvmd on

I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get:

Mounting GFS2 filesystem (/usr/nevis): invalid device path /dev/mapper/ADMIN-usr [FAILED]

... and so on, because the ADMIN volume group was never loaded by clvmd. Without a vgscan in there somewhere, the system can't see the volume groups on the drbd resource.

Wait a second... there's an ocf:heartbeat:LVM resource! Testing...

Emmanuel, you did it!

For the sake of future searches, and possibly future documentation, let me start with my original description of the problem: I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now), all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running.

I've traced why the lsb::clvmd fails: The monitor/status command includes vgdisplay, which hangs indefinitely. Therefore the monitor will always time out.
So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas?

Details: versions:
Redhat Linux 6.2 (kernel 2.6.32)
cman-3.0.12.1
corosync-1.4.1
pacemaker-1.1.6
lvm2-2.02.87
lvm2-cluster-2.02.87

The problem is that clvmd on the main node will hang if there's a substantive period of time during which the other node returns running cman but not clvmd. I never tracked down why this happens, but there's a practical solution: minimize any interval for which that would be true. To ensure this, take clvmd outside the resource manager's control:

chkconfig cman on
chkconfig clvmd on
chkconfig pacemaker on

On RHEL6.2, these services will be started in the above order; clvmd will start within a few seconds after cman.

Here's my cluster.conf http://pastebin.com/GUr0CEgZ and the output of crm configure show http://pastebin.com/f9D4Ui5Z. The key lines from the latter are:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin
primitive AdminLvm ocf:heartbeat:LVM \
        params volgrpname=ADMIN \
        op monitor interval=30 timeout=100 depth=0
primitive Gfs2 lsb:gfs2
group VolumeGroup AdminLvm Gfs2
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 \
        clone-max=2 clone-node-max=1 \
        notify=true interleave=true
clone VolumeClone VolumeGroup \
        meta interleave=true
colocation Volume_With_Admin inf: VolumeClone AdminClone:Master
order Admin_Before_Volume inf: AdminClone:promote VolumeClone:start

What I learned: If one is going to extend the example in Clusters From Scratch to include logical volumes, one must start clvmd at boot time, and include any volume groups in ocf:heartbeat:LVM resources that start before gfs2.

Note the long timeout on the ocf:heartbeat:LVM resource. This is a good idea because, during the boot of the crashed node, there'll still be an interval of a few seconds when cman will be running but clvmd won't be.
During my tests, the LVM monitor would fail if it checked during that interval with a timeout that was shorter than it took clvmd to start on the crashed node. This was annoying; all resources dependent on AdminLvm would be stopped until AdminLvm recovered (a few more seconds). Increasing the timeout avoids this.

It also means that during any recovery procedure on the crashed node for which I turn off all the services, I have to minimize the interval between the start of cman and clvmd if I've turned off services at boot; e.g.,

service drbd start
# ... and fix any split-brain problems or whatever
service cman start; service clvmd start   # put on one line
service pacemaker start

I thank everyone on this list who was patient with me as I pounded on this problem for two weeks!
[Linux-HA] pacemaker+drbd promotion delay
The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec files and versions below.

Problem: If I restart both nodes at the same time, or even just start pacemaker on both nodes at the same time, the drbd ms resource starts, but both nodes stay in slave mode. They'll both stay in slave mode until one of the following occurs:

- I manually type crm resource cleanup ms-resource-name
- 15 minutes elapse. Then the PEngine Recheck Timer is fired, and the ms resources are promoted.

The key resource definitions:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=59s role=Master timeout=30s \
        op monitor interval=60s role=Slave timeout=30s \
        op stop interval=0 timeout=100 \
        op start interval=0 timeout=240 \
        meta target-role=Master
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 \
        clone-node-max=1 notify=true interleave=true
# The lengthy definition of FilesystemGroup is in the crm pastebin below
clone FilesystemClone FilesystemGroup \
        meta interleave=true target-role=Started
colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start

Note that I stuck in target-role options to try to solve the problem; no effect.

When I look in /var/log/messages, I see no error messages or indications why the promotion should be delayed. The 'admin' drbd resource is reported as UpToDate on both nodes. There are no error messages when I force the issue with: crm resource cleanup AdminClone

It's as if pacemaker, at start, needs some kind of kick after the drbd resource is ready to be promoted. This is not just an abstract case for me. At my site, it's not uncommon for there to be lengthy power outages that will bring down the cluster. Both systems will come up when power is restored, and I need cluster services to be available shortly afterward, not 15 minutes later. Any ideas?
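[Editor's note] The 15-minute figure matches pacemaker's cluster-recheck-interval property, which defaults to 15 minutes and drives the PEngine Recheck Timer mentioned in the post. Shortening it doesn't fix the underlying promotion problem, but it does bound the delay; the value below is illustrative, not a recommendation from the thread:

```
# Sketch: make the PEngine re-evaluate the cluster state every 2 minutes
# instead of the 15-minute default (value is illustrative).
crm configure property cluster-recheck-interval="2min"
```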
Details:

# rpm -q kernel cman pacemaker drbd
kernel-2.6.32-220.4.1.el6.x86_64
cman-3.0.12.1-23.el6.x86_64
pacemaker-1.1.6-3.el6.x86_64
drbd-8.4.1-1.el6.x86_64

Output of crm_mon after two-node reboot or pacemaker restart: http://pastebin.com/jzrpCk3i
cluster.conf: http://pastebin.com/sJw4KBws
crm configure show: http://pastebin.com/MgYCQ2JH
drbdadm dump all: http://pastebin.com/NrY6bskk
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/26/12 4:28 PM, emmanuel segura wrote:

Sorry William, I can't post my config now because I'm at home, not at my job. I think it's not a problem if clvmd starts before drbd, because clvmd doesn't need any devices to start; that's the point, I hope that's clear. The introduction of pacemaker into the Red Hat cluster was intended to replace rgmanager, not the whole cluster stack. And I suggest you start clvmd at boot time: chkconfig clvmd on

I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get:

Mounting GFS2 filesystem (/usr/nevis): invalid device path /dev/mapper/ADMIN-usr [FAILED]

... and so on, because the ADMIN volume group was never loaded by clvmd. Without a vgscan in there somewhere, the system can't see the volume groups on the drbd resource.

Sorry for my bad English :-) I come from a Spanish-speaking country, and every day I speak Italian.

I'm sorry that I don't speak more languages! You're the one who's helping me; it's my task to learn and understand. Certainly your English is better than my French or Russian.

On 26 March 2012 22:04, William Seligman selig...@nevis.columbia.edu wrote:
On 3/26/12 3:48 PM, emmanuel segura wrote:

I know it's normal that fence_node doesn't work, because the fence request must be redirected to the pacemaker stonith. I think calling the cluster agents with rgmanager is a really ugly thing; I've never seen a cluster like this.

== If I understand Pacemaker Explained http://bit.ly/GR5WEY and how I'd invoke clvmd from cman http://bit.ly/H6ZbKg, the clvmd script that would be invoked by either HA resource manager is exactly the same: /etc/init.d/clvmd. ==

clvmd doesn't need to be called from rgmanager in the cluster configuration. This is the boot sequence of the Red Hat daemons: 1: cman, 2: clvmd, 3: rgmanager. If you don't want to use rgmanager, you just replace rgmanager.

I'm sorry, but I don't think I understand what you're suggesting. Do you suggest that I start clvmd at boot?
That won't work; clvmd won't see the volume groups on drbd until drbd is started and promoted to primary. May I ask you to post your own cluster.conf on pastebin.com so I can see how you do it? Along with crm configure show, if that's relevant for your cluster?

On 26 March 2012 19:21, William Seligman selig...@nevis.columbia.edu wrote:
On 3/24/12 5:40 PM, emmanuel segura wrote:

I think it's better to use clvmd with cman; I don't know why you use the lsb script of clvmd. On Red Hat, clvmd needs cman, and you are trying to run it with pacemaker. I'm not sure this is the problem, but this type of configuration is strange. I made a virtual cluster with KVM and I did not find any problems.

While I appreciate the advice, it's not immediately clear that trying to eliminate pacemaker would do me any good. Perhaps someone can demonstrate the error in my reasoning: If I understand Pacemaker Explained http://bit.ly/GR5WEY and how I'd invoke clvmd from cman http://bit.ly/H6ZbKg, the clvmd script that would be invoked by either HA resource manager is exactly the same: /etc/init.d/clvmd. If I tried to use cman instead of pacemaker, I'd be cutting myself off from the pacemaker features that cman/rgmanager does not yet have available, such as pacemaker's symlink, exportfs, and clonable IPaddr2 resources.

I recognize I've got a strange problem. Given that fence_node doesn't work but stonith_admin does, I strongly suspect that the problem is caused by the behavior of my fencing agent, not the use of pacemaker versus rgmanager, nor by how clvmd is being started.

On 24 March 2012 13:09, William Seligman selig...@nevis.columbia.edu wrote:
On 3/24/12 4:47 AM, emmanuel segura wrote:

How do you configure clvmd? With cman or with pacemaker?

Pacemaker.
Here's the output of 'crm configure show': http://pastebin.com/426CdVwN

On 23 March 2012 22:14, William Seligman selig...@nevis.columbia.edu wrote:
On 3/23/12 5:03 PM, emmanuel segura wrote:

Sorry, but I would like to know if you can show me your /etc/cluster/cluster.conf

Here it is: http://pastebin.com/GUr0CEgZ

On 23 March 2012 21:50, William Seligman selig...@nevis.columbia.edu wrote:
On 3/22/12 2:43 PM, William Seligman wrote:
On 3/20/12 4:55 PM, Lars Ellenberg wrote:
On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote:
On 3/16/12 12:12 PM, William Seligman wrote:
On 3/16/12 7:02 AM, Andreas Kurz wrote:

s- ... DRBD suspended io, most likely because of it's fencing-policy. For valid dual-primary setups you have to use resource-and-stonith policy and a working fence-peer handler. In this mode I/O is suspended until fencing of peer was succesful. Question is, why the peer does _not_ also suspend its I/O because obviously fencing
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/26/12 5:17 PM, William Seligman wrote:
On 3/26/12 4:28 PM, emmanuel segura wrote:

Sorry William, I can't post my config now because I'm at home, not at my job. I think it's not a problem if clvmd starts before drbd, because clvmd doesn't need any devices to start; that's the point, I hope that's clear. The introduction of pacemaker into the Red Hat cluster was intended to replace rgmanager, not the whole cluster stack. And I suggest you start clvmd at boot time: chkconfig clvmd on

I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get:

Mounting GFS2 filesystem (/usr/nevis): invalid device path /dev/mapper/ADMIN-usr [FAILED]

... and so on, because the ADMIN volume group was never loaded by clvmd. Without a vgscan in there somewhere, the system can't see the volume groups on the drbd resource.

Wait a second... there's an ocf:heartbeat:LVM resource! Testing...
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes - SOLVED
On 3/26/12 5:31 PM, William Seligman wrote:
On 3/26/12 5:17 PM, William Seligman wrote:
On 3/26/12 4:28 PM, emmanuel segura wrote:

And I suggest you start clvmd at boot time: chkconfig clvmd on

I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get:

Mounting GFS2 filesystem (/usr/nevis): invalid device path /dev/mapper/ADMIN-usr [FAILED]

... and so on, because the ADMIN volume group was never loaded by clvmd. Without a vgscan in there somewhere, the system can't see the volume groups on the drbd resource.

Wait a second... there's an ocf:heartbeat:LVM resource! Testing...

Emmanuel, you did it!

For the sake of future searches, and possibly future documentation, let me start with my original description of the problem: I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now), all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running.

I've traced why the lsb::clvmd fails: The monitor/status command includes vgdisplay, which hangs indefinitely. Therefore the monitor will always time out. So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas?
Details: versions: Red Hat Linux 6.2 (kernel 2.6.32), cman-3.0.12.1, corosync-1.4.1, pacemaker-1.1.6, lvm2-2.02.87, lvm2-cluster-2.02.87

The problem is that clvmd on the main node will hang if there's a substantive period of time during which the other node reports running cman but not clvmd. I never tracked down why this happens, but there's a practical solution: minimize any interval for which that would be true. To ensure this, take clvmd outside the resource manager's control:

chkconfig cman on
chkconfig clvmd on
chkconfig pacemaker on

On RHEL6.2, these services will be started in the above order; clvmd will start within a few seconds after cman. Here's my cluster.conf http://pastebin.com/GUr0CEgZ and the output of crm configure show http://pastebin.com/f9D4Ui5Z. The key lines from the latter are:

primitive AdminDrbd ocf:linbit:drbd \
    params drbd_resource=admin
primitive AdminLvm ocf:heartbeat:LVM \
    params volgrpname=ADMIN \
    op monitor interval=30 timeout=100 depth=0
primitive Gfs2 lsb:gfs2
group VolumeGroup AdminLvm Gfs2
ms AdminClone AdminDrbd \
    meta master-max=2 master-node-max=1 \
    clone-max=2 clone-node-max=1 \
    notify=true interleave=true
clone VolumeClone VolumeGroup \
    meta interleave=true
colocation Volume_With_Admin inf: VolumeClone AdminClone:Master
order Admin_Before_Volume inf: AdminClone:promote VolumeClone:start

What I learned: if one is going to extend the example in "Clusters From Scratch" to include logical volumes, one must start clvmd at boot time, and include any volume groups in ocf:heartbeat:LVM resources that start before gfs2.

Note the long timeout on the ocf:heartbeat:LVM resource. This is a good idea because, during the boot of the crashed node, there'll still be an interval of a few seconds when cman will be running but clvmd won't be. During my tests, the LVM monitor would fail if it checked during that interval with a timeout shorter than the time it took clvmd to start on the crashed node.
This was annoying; all resources dependent on AdminLvm would be stopped until AdminLvm recovered (a few more seconds). Increasing the timeout avoids this.

It also means that during any recovery procedure on the crashed node for which I've turned off the services at boot, I have to minimize the interval between the start of cman and clvmd; e.g.:

service drbd start
# ... and fix any split-brain problems or whatever
service cman start; service clvmd start   # put on one line
service pacemaker start

I thank everyone on this list who was patient with me as I pounded on this problem for two weeks!
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/24/12 4:47 AM, emmanuel segura wrote: How do you configure clvmd? With cman or with pacemaker?

Pacemaker. Here's the output of 'crm configure show': http://pastebin.com/426CdVwN

On 23 March 2012 at 22:14, William Seligman selig...@nevis.columbia.edu wrote: On 3/23/12 5:03 PM, emmanuel segura wrote: Sorry, but I would like to know if you can show me your /etc/cluster/cluster.conf.

Here it is: http://pastebin.com/GUr0CEgZ

On 23 March 2012 at 21:50, William Seligman selig...@nevis.columbia.edu wrote: On 3/22/12 2:43 PM, William Seligman wrote: On 3/20/12 4:55 PM, Lars Ellenberg wrote: On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote: On 3/16/12 12:12 PM, William Seligman wrote: On 3/16/12 7:02 AM, Andreas Kurz wrote:

"s-" ... DRBD suspended I/O, most likely because of its fencing policy. For valid dual-primary setups you have to use the resource-and-stonith policy and a working fence-peer handler. In this mode I/O is suspended until fencing of the peer is successful. The question is why the peer does _not_ also suspend its I/O, because obviously fencing was not successful. So with a correct DRBD configuration, one of your nodes should already have been fenced because of the connection loss between nodes (on the drbd replication link). You can use e.g. that nice fencing script: http://goo.gl/O4N8f

This is the output of drbdadm dump admin: http://pastebin.com/kTxvHCtx So I've got resource-and-stonith. I gather from an earlier thread that obliterate-peer.sh is more-or-less equivalent in functionality to stonith_admin_fence_peer.sh: http://www.gossamer-threads.com/lists/linuxha/users/78504#78504

At the moment I'm pursuing the possibility that I'm returning the wrong return codes from my fencing agent: http://www.gossamer-threads.com/lists/linuxha/users/78572 I cleaned up my fencing agent, making sure its return codes matched those returned by the other agents in /usr/sbin/fence_, and allowing for some delay issues in reading the UPS status. But...
After that, I'll look at another suggestion with lvm.conf: http://www.gossamer-threads.com/lists/linuxha/users/78796#78796 Then I'll try DRBD 8.4.1. Hopefully one of these is the source of the issue.

Failure on all three counts.

May I suggest you double-check the permissions on your fence-peer script? I suspect you may simply have forgotten the chmod +x. Test with drbdadm fence-peer minor-0 from the command line.

I still haven't solved the problem, but this advice has gotten me further than before. First, Lars was correct: I did not have execute permissions set on my fence-peer scripts. (D'oh!) I turned them on, but that did not change anything: cman+clvmd still hung on the vgdisplay command if I crashed the peer node.

I started up both nodes again (cman+pacemaker+drbd+clvmd) and tried Lars' suggested command. I didn't save the response for this message (d'oh again!), but it said that the fence-peer script had failed. Hmm. The peer was definitely shutting down, so my fencing script was working. I went over it, comparing its return codes to those of the existing scripts, and made some changes. Here's my current script: http://pastebin.com/nUnYVcBK

Up until now my fence-peer scripts had been either Lon Hohberger's obliterate-peer.sh or Digimer's rhcs_fence. I decided to try the stonith_admin-fence-peer.sh script that Andreas Kurz recommended; unlike the first two scripts, which fence using fence_node, the latter script just calls stonith_admin. When I tried the stonith_admin-fence-peer.sh script, it worked:

# drbdadm fence-peer minor-0
stonith_admin-fence-peer.sh[10886]: stonith_admin successfully fenced peer orestes-corosync.nevis.columbia.edu.

Power was cut on the peer, and the remaining node stayed up. Then I brought up the peer with:

stonith_admin -U orestes-corosync.nevis.columbia.edu

BUT: when the restored peer came up and started to run cman, clvmd hung on the main node again.
After cycling through some more tests, I found that if I brought down the peer with drbdadm, then brought the peer up with no HA services, then started drbd and then cman, the cluster remained intact. If I crashed the peer, the scheme in the previous paragraph didn't work: I bring up drbd, check that the disks are both UpToDate, then bring up cman. At that point vgdisplay on the main node takes so long to run that clvmd will time out:

# vgdisplay
Error locking on node orestes-corosync.nevis.columbia.edu: Command timed out

I timed how long it took vgdisplay to run. I might be able to work around this by setting the timeout on my clvmd resource to 300s, but that seems to be a band-aid for an underlying problem. Any suggestions on what else I could check?

I've done some more tests. Still no solution, just an observation. The death mode appears to be:

- Two nodes running cman+pacemaker+drbd+clvmd
- Take one node down = one
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/22/12 2:43 PM, William Seligman wrote: On 3/20/12 4:55 PM, Lars Ellenberg wrote: On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote: On 3/16/12 12:12 PM, William Seligman wrote: On 3/16/12 7:02 AM, Andreas Kurz wrote: On 03/15/2012 11:50 PM, William Seligman wrote: On 3/15/12 6:07 PM, William Seligman wrote: On 3/15/12 6:05 PM, William Seligman wrote: On 3/15/12 4:57 PM, emmanuel segura wrote:

We can try to understand what happens when clvm hangs. Edit /etc/lvm/lvm.conf: in the log section, change to "level = 7" and uncomment the line "file = /var/log/lvm2.log".

Here's the tail end of the file (the original is 1.6M). Because there are no timestamps in the log, it's hard for me to point you to the moment when I crashed the other system. I think (though I'm not sure) that the crash happened after the last occurrence of:

cache/lvmcache.c:1484  Wiping internal VG cache

Honestly, it looks like a wall of text to me. Does it suggest anything to you? Maybe it would help if I included the link to the pastebin where I put the output: http://pastebin.com/8pgW3Muw Could the problem be with lvm+drbd?
In lvm2.log, I see this sequence of lines pre-crash:

device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:271   /dev/md0: size is 1027968 sectors
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
device/dev-io.c:588   Closed /dev/md0
device/dev-io.c:271   /dev/md0: size is 1027968 sectors
device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
device/dev-io.c:588   Closed /dev/md0
filters/filter-composite.c:31   Using /dev/md0
device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
label/label.c:186     /dev/md0: No label detected
device/dev-io.c:588   Closed /dev/md0
device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
device/dev-io.c:588   Closed /dev/drbd0
device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
device/dev-io.c:588   Closed /dev/drbd0

I interpret this as: look at /dev/md0, get some info, close; look at /dev/drbd0, get some info, close. Post-crash, I see:

device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:271   /dev/md0: size is 1027968 sectors
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
device/dev-io.c:588   Closed /dev/md0
device/dev-io.c:271   /dev/md0: size is 1027968 sectors
device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
device/dev-io.c:588   Closed /dev/md0
filters/filter-composite.c:31   Using /dev/md0
device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
label/label.c:186     /dev/md0: No label detected
device/dev-io.c:588   Closed /dev/md0
device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
...
and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0. If I look at /proc/drbd when I crash one node, I see this:

# cat /proc/drbd
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
    ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

"s-" ... DRBD suspended I/O, most likely because of its fencing policy. For valid dual-primary setups you have to use the resource-and-stonith policy and a working fence-peer handler. In this mode I/O is suspended until fencing of the peer is successful. The question is why the peer does _not_ also suspend its I/O, because obviously fencing was not successful. So with a correct DRBD configuration, one of your nodes should already have been fenced because of the connection loss between nodes (on the drbd replication link). You can use e.g. that nice fencing script: http://goo.gl/O4N8f

This is the output of drbdadm dump admin: http://pastebin.com/kTxvHCtx So I've got resource-and-stonith. I gather from an earlier thread that obliterate-peer.sh is more-or-less equivalent in functionality to stonith_admin_fence_peer.sh: http://www.gossamer-threads.com/lists/linuxha/users/78504#78504 At the moment I'm pursuing the possibility that I'm returning the wrong return codes from my fencing agent: http://www.gossamer-threads.com/lists/linuxha/users/78572 I cleaned up my fencing agent, making sure its return codes matched those returned by the other agents in /usr/sbin/fence_, and allowing for some delay issues in reading the UPS status.
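Since the suspended-I/O flag keeps coming up in this thread, here is a small sketch of pulling the connection state, roles, disk states, and the leading "s" (suspended) flag out of 8.3-style /proc/drbd output like the one above. This is my own illustration, not part of DRBD or any cluster tool; the field layout is taken from the status lines quoted in this message.

```python
import re

def parse_drbd_status(proc_drbd_text):
    """Parse an 8.3-style /proc/drbd device line, e.g.
       0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
    Returns None if no device line is found."""
    m = re.search(
        r"^\s*(\d+):\s+cs:(\S+)\s+ro:(\S+)\s+ds:(\S+)\s+(\S+)\s+(\S+)",
        proc_drbd_text, re.MULTILINE)
    if not m:
        return None
    minor, cs, ro, ds, proto, flags = m.groups()
    return {
        "minor": int(minor),
        "connection": cs,               # e.g. WFConnection
        "roles": ro.split("/"),         # local/peer, e.g. Primary/Unknown
        "disks": ds.split("/"),         # e.g. UpToDate/DUnknown
        "protocol": proto,              # e.g. C
        # A leading 's' in the flags field means I/O is suspended
        # (an 'r' means I/O is running normally).
        "io_suspended": flags.startswith("s"),
    }

# The /proc/drbd text quoted in this thread:
sample = """\
version: 8.3.12 (api:88/proto:86-96)
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
    ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
"""
status = parse_drbd_status(sample)
```

A one-line check like this could be dropped into a monitoring script to distinguish "peer fenced, I/O suspended" from a healthy state, instead of eyeballing /proc/drbd during an outage.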
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/22/12 2:49 PM, David Coulson wrote: On 3/22/12 2:43 PM, William Seligman wrote:

I still haven't solved the problem, but this advice has gotten me further than before. First, Lars was correct: I did not have execute permissions set on my fence-peer scripts. (D'oh!) I turned them on, but that did not change anything: cman+clvmd still hung on the vgdisplay command if I crashed the peer node.

Does cman think the node is fenced? clvmd will block I/O until the node is fenced properly.

Let's see. On the main node, before crashing the peer node:

# corosync-objctl | grep member
runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(192.168.100.207)
runtime.totem.pg.mrp.srp.members.1.join_count=1
runtime.totem.pg.mrp.srp.members.1.status=joined
runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(192.168.100.206)
runtime.totem.pg.mrp.srp.members.2.join_count=2
runtime.totem.pg.mrp.srp.members.2.status=joined

Then on the peer node:

echo c > /proc/sysrq-trigger

The UPS for the peer node shuts down, which tells me the main node ran the fencing agent. Now:

# corosync-objctl | grep member
runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(192.168.100.207)
runtime.totem.pg.mrp.srp.members.1.join_count=1
runtime.totem.pg.mrp.srp.members.1.status=joined
runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(192.168.100.206)
runtime.totem.pg.mrp.srp.members.2.join_count=2
runtime.totem.pg.mrp.srp.members.2.status=left

Looks like cman knows. Is there any other way to check a node's fenced status as far as cman is concerned?
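For what it's worth, the membership check above is easy to script. The following is an illustrative sketch that scrapes the corosync-objctl key/value lines quoted in this message; the key names are taken verbatim from that output, and the parsing itself is my own, not a corosync API.

```python
def member_statuses(objctl_output):
    """Map corosync totem member IDs to their status, given lines like
    runtime.totem.pg.mrp.srp.members.2.status=left"""
    statuses = {}
    for line in objctl_output.splitlines():
        key, _, value = line.partition("=")
        parts = key.split(".")
        # Match only runtime.totem.pg.mrp.srp.members.<id>.status keys;
        # the .ip and .join_count keys are skipped.
        if len(parts) == 8 and parts[5] == "members" and parts[7] == "status":
            statuses[int(parts[6])] = value
    return statuses

# The output quoted above, after the peer node was crashed:
sample = """\
runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(192.168.100.207)
runtime.totem.pg.mrp.srp.members.1.join_count=1
runtime.totem.pg.mrp.srp.members.1.status=joined
runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(192.168.100.206)
runtime.totem.pg.mrp.srp.members.2.join_count=2
runtime.totem.pg.mrp.srp.members.2.status=left"""
result = member_statuses(sample)
```

Something like this could feed a watchdog that alerts when a member's status flips to "left" without a corresponding fence confirmation.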
Re: [Linux-HA] order transitivity (was Re: order troubles)
On 3/22/12 10:06 AM, Florian Haas wrote: On Thu, Mar 22, 2012 at 10:34 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote:

order o_nfs_before_vz 0: cl_fs_nfs cl_vz
order o_vz_before_ve992 0: cl_vz ve992

A score of 0 is roughly equivalent to: "if you happen to plan to do both operations in the same transition, would you please consider doing them in this order, pretty please, if you see fit."

Lars beat me to this, as the post turned out to be a little more elaborate than expected, but here's a bit of background info for additional clarification: http://www.hastexo.com/resources/hints-and-kinks/mandatory-and-advisory-ordering-pacemaker

I have a related question raised by this web page. Suppose I have a chain of ordering constraints:

order Gfs2_Before_Libvirtd inf: Gfs2 Libvirtd
order Libvirtd_Before_VirtualMachine 0: Libvirtd VirtualMachine

On startup, Gfs2 will be started before Libvirtd, and Libvirtd before VirtualMachine. What happens on shutdown? Will Gfs2 necessarily wait until VirtualMachine stops? Or is it better to add the additional constraint:

order Gfs2_Before_VirtualMachine inf: Gfs2 VirtualMachine

if that's the behavior I want?
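As general Pacemaker background (not something confirmed in this thread): ordering constraints are symmetrical by default (symmetrical=true), so for each individual constraint the stop sequence is the reverse of the start sequence. The subtlety is that an advisory (score 0) link in the chain is only honored when both actions happen in the same transition, which is exactly what the hastexo article discusses. An explicit end-to-end constraint removes any reliance on transitivity through the advisory link; a sketch in crm shell syntax:

```
# Hypothetical addition: tie Gfs2 and VirtualMachine together directly,
# instead of relying on the advisory constraint through Libvirtd.
# symmetrical=true is the default, spelled out here for clarity:
# VirtualMachine must stop before Gfs2 stops.
order Gfs2_Before_VirtualMachine inf: Gfs2 VirtualMachine symmetrical=true
```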
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/16/12 4:53 AM, emmanuel segura wrote: For the lvm hang, you can use this in your /etc/lvm/lvm.conf:

ignore_suspended_devices = 1

because I saw this in the lvm log: "and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0."

No, this does not prevent the hang. I tried with both DRBD 8.3.12 and 8.4.1.

On 15 March 2012 at 23:50, William Seligman selig...@nevis.columbia.edu wrote: On 3/15/12 6:07 PM, William Seligman wrote: On 3/15/12 6:05 PM, William Seligman wrote: On 3/15/12 4:57 PM, emmanuel segura wrote:

We can try to understand what happens when clvm hangs. Edit /etc/lvm/lvm.conf: in the log section, change to "level = 7" and uncomment the line "file = /var/log/lvm2.log".

Here's the tail end of the file (the original is 1.6M). Because there are no timestamps in the log, it's hard for me to point you to the moment when I crashed the other system. I think (though I'm not sure) that the crash happened after the last occurrence of:

cache/lvmcache.c:1484  Wiping internal VG cache

Honestly, it looks like a wall of text to me. Does it suggest anything to you? Maybe it would help if I included the link to the pastebin where I put the output: http://pastebin.com/8pgW3Muw Could the problem be with lvm+drbd?
<snip: same lvm2.log pre-crash and post-crash excerpts as quoted earlier in the thread>
and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0.

If I look at /proc/drbd when I crash one node, I see this:

# cat /proc/drbd
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
    ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

If I look at /proc/drbd when I bring down one node gracefully (crm node standby), I get this:

# cat /proc/drbd
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-
    ns:764 nr:40 dw:40 dr:7036496 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Could it be that drbd can't respond to certain requests from lvm if the state of the peer is DUnknown instead of Outdated?

On 15 March 2012 at 20:50, William Seligman selig...@nevis.columbia.edu wrote:

On 3/15/12 12:55 PM, emmanuel segura wrote:

I don't see any errors, and the answer to your question is yes. Can you show me your /etc/cluster/cluster.conf and your "crm configure show"? That way I can try later to see if I can find a fix.

Thanks for taking a look. My cluster.conf: http://pastebin.com/w5XNYyAX crm configure show: http://pastebin.com/atVkXjkn Before you spend a lot of time on the second file, remember that clvmd will hang whether or not I'm running pacemaker.
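The DUnknown-versus-Outdated difference above can be checked mechanically. A small sketch; the status line is copied from the /proc/drbd output above, and on a live node you would read /proc/drbd itself rather than a hard-coded string:

```shell
# Extract the peer's disk state from a DRBD status line, to distinguish
# the DUnknown (crashed peer) case from Outdated (graceful standby).
# Sample line copied from the thread; replace with: grep '^ 0:' /proc/drbd
line='0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-'
peer_ds=$(printf '%s\n' "$line" | sed -n 's/.*ds:[^/]*\/\([A-Za-z]*\).*/\1/p')
echo "peer disk state: $peer_ds"
# prints: peer disk state: DUnknown
```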
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/16/12 12:12 PM, William Seligman wrote:

On 3/16/12 7:02 AM, Andreas Kurz wrote:

On 03/15/2012 11:50 PM, William Seligman wrote:

[snip]

 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-

"s-" ... DRBD suspended I/O, most likely because of its fencing policy. For valid dual-primary setups you have to use the resource-and-stonith policy and a working fence-peer handler. In this mode, I/O is suspended until fencing of the peer is successful. The question is why the peer does _not_ also suspend its I/O, because obviously fencing was not successful. So with a correct DRBD configuration, one of your nodes should already have been fenced because of the connection loss between nodes (on the drbd replication link). You can use e.g.
that nice fencing script: http://goo.gl/O4N8f

This is the output of drbdadm dump admin: http://pastebin.com/kTxvHCtx So I've got resource-and-stonith. I gather from an earlier thread that obliterate-peer.sh is more-or-less equivalent in functionality to stonith_admin_fence_peer.sh: http://www.gossamer-threads.com/lists/linuxha/users/78504#78504

At the moment I'm pursuing the possibility that I'm returning the wrong return codes from my fencing agent: http://www.gossamer-threads.com/lists/linuxha/users/78572 I cleaned up my fencing agent, making sure its return codes matched those returned by the other agents in /usr/sbin/fence_, and allowing for some delay issues in reading the UPS status. But...

After that, I'll look at another suggestion for lvm.conf: http://www.gossamer-threads.com/lists/linuxha/users/78796#78796 Then I'll try DRBD 8.4.1.
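For readers following along, Andreas's advice corresponds to a drbd.conf stanza roughly like the following. This is a sketch, not the poster's actual configuration (his real dump is in the pastebin above); the resource name and the handler path are illustrative assumptions, and the handler should be whichever fence-peer script you actually install:

```
resource admin {
  net {
    allow-two-primaries;            # dual-primary setup
  }
  disk {
    # suspend I/O until the peer has been fenced
    fencing resource-and-stonith;
  }
  handlers {
    # hypothetical path; point this at your installed fence-peer script,
    # e.g. obliterate-peer.sh or stonith_admin_fence_peer.sh
    fence-peer      "/usr/lib/drbd/stonith_admin_fence_peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

With resource-and-stonith but a fence-peer handler that returns the wrong exit code, DRBD keeps I/O suspended, which matches the "s-" state seen above.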
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 5:18 AM, emmanuel segura wrote:

The first thing I see in your clvmd log is this: === WARNING: Locking disabled. Be careful! This could corrupt your metadata. ===

I saw that too, and thought the same as you did. I did some checks (see below), but some web searches suggest that this message is a normal consequence of clvmd initialization; e.g., http://markmail.org/message/vmy53pcv52wu7ghx

Use this command: lvmconf --enable-cluster. And remember, for cman+pacemaker you don't need qdisk.

Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf: http://pastebin.com/841VZRzW and the output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf. Then I did as you suggested, but with a check to see if anything changed:

# cd /etc/lvm/
# cp lvm.conf lvm.conf.cluster
# lvmconf --enable-cluster
# diff lvm.conf lvm.conf.cluster
#

So the key lines have been there all along: locking_type = 3 and fallback_to_local_locking = 0.

On 14 March 2012 at 23:17, William Seligman selig...@nevis.columbia.edu wrote:

On 3/14/12 9:20 AM, emmanuel segura wrote:

Hello William. I didn't know you were using drbd, and I don't know what type of configuration you're using. But it's better if you start clvmd with "clvmd -d"; that way we can see what the problem is.

For what it's worth, here's the output of running clvmd -d on the node that stays up: http://pastebin.com/sWjaxAEF What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: http://pastebin.com/w5XNYyAX At the time of the next-to-last line, I cut power to the other node. At the time of the last line, I ran vgdisplay on the remaining node, which hangs forever.

After a lot of web searching, I found that I'm not the only one with this problem. Here's one case that doesn't seem relevant to me, since I don't use qdisk: http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html. Here's one with the same problem on the same OS: http://bugs.centos.org/view.php?id=5229, but with no resolution. Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them?

On 14 March 2012 at 14:02, William Seligman selig...@nevis.columbia.edu wrote:

On 3/14/12 6:02 AM, emmanuel segura wrote:

I think it's better if you make clvmd start at boot: chkconfig cman on; chkconfig clvmd on

I've already tried that. It doesn't work. The problem is that my LVM information is on the drbd. If I start up clvmd before drbd, it won't find the logical volumes. I also don't see why that would make a difference (although this could be part of the confusion): a service is a service. I've tried starting clvmd both inside and outside pacemaker control, with the same problem. Why would starting clvmd at boot make a difference?

On 13 March 2012 at 23:29, William Seligman selig...@nevis.columbia.edu wrote:

On 3/13/12 5:50 PM, emmanuel segura wrote:

So if you're using cman, why do you use lsb::clvmd? I think you are very confused.

I don't dispute that I may be very confused! However, from what I can tell, I still need to run clvmd even if I'm running cman (I'm not using rgmanager). If I just run cman, gfs2 and any other form of mount fails. If I run cman, then clvmd, then gfs2, everything behaves normally. Going by these instructions: https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial the resources the author puts under cluster control (rgmanager), I have to put under pacemaker control. Those include drbd, clvmd, and gfs2. The difference between what I've got and what's in Clusters From Scratch is that in CFS they assign one DRBD volume to a single filesystem. I create an LVM physical volume on my DRBD resource, as in the above tutorial, and so I have to start clvmd or the logical volumes in the DRBD partition won't be recognized. Is there some way to get logical volumes recognized automatically by cman, without rgmanager, that I've missed?

On 13 March 2012 at 22:42, William Seligman selig...@nevis.columbia.edu wrote:

On 3/13/12 12:29 PM, William Seligman wrote:

I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not. I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now), all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 3:43 AM, Vladislav Bogdanov wrote:

14.03.2012 00:42, William Seligman wrote: [snip]

These were the log messages, which show that stonith_admin did its job and CMAN was notified of the fencing: http://pastebin.com/jaH820Bv

Could you please look at the output of 'dlm_tool ls' and 'dlm_tool dump'? You probably have 'kern_stop' and 'fencing' flags there. That would mean that dlm is unaware that the node is fenced.

Here's 'dlm_tool ls' with both nodes running cman+clvmd+gfs2: http://pastebin.com/QrZtm1Ue 'dlm_tool dump': http://pastebin.com/UKWxx9Y4 For comparison, I crashed one node and looked at the same output on the remaining node: dlm_tool ls: http://pastebin.com/cKVAGxsd dlm_tool dump: http://pastebin.com/c0h0p22Q (the post-crash lines begin at 1331824940). I don't see the kern_stop or fencing flags.

There's another thing I don't see: at the top of 'dlm_tool dump' it displays most of the contents of my cluster.conf file, except for the fencing sections. Here's my cluster.conf for comparison: http://pastebin.com/w5XNYyAX cman doesn't see anything wrong in my cluster.conf file:

# ccs_config_validate
Configuration validates

But could there be something that's causing the fencing sections to be ignored? Unfortunately, I still got the gfs2 freeze, so this is not the complete story.

Both clvmd and gfs2 use dlm. If the dlm layer thinks fencing is not completed, both of them freeze.

I did grep -E '(dlm|clvm|fenc)' /var/log/messages and looked at the time I crashed the node: http://pastebin.com/dvBtdLUs. I see lines that indicate that pacemaker and drbd are fencing the node, but nothing from dlm or clvmd. Does this indicate what you suggest? Could dlm somehow be ignoring or overlooking the fencing I put in? Is there any other way to check this?
-- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/ smime.p7s Description: S/MIME Cryptographic Signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
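Vladislav's check can be scripted. A sketch that scans 'dlm_tool ls' output for the flags he names; the sample text below is fabricated for illustration (it is not output from the poster's cluster), and on a real node you would pipe in the actual command output instead:

```shell
# Report whether 'dlm_tool ls' output contains the kern_stop / fencing
# flags that mean dlm is still waiting for a node to be fenced.
check_dlm() {
  if printf '%s\n' "$1" | grep -qE 'kern_stop|fencing'; then
    echo "dlm is waiting on fencing"
  else
    echo "no pending fence actions"
  fi
}

# Fabricated sample; on a real node: check_dlm "$(dlm_tool ls)"
sample='name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop'
check_dlm "$sample"
# prints: dlm is waiting on fencing
```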
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 11:50 AM, emmanuel segura wrote:

Yes, William. Now try clvmd -d and see what happens. locking_type = 3 is the lvm cluster lock type.

Since you asked for confirmation, here it is: the output of 'clvmd -d' just now: http://pastebin.com/bne8piEw I crashed the other node at Mar 15 12:02:35, which is when you see the only additional line of output. I don't see any particular difference between this and the previous result (http://pastebin.com/sWjaxAEF), which suggests that I had cluster locking enabled before, and still do now.

On 15 March 2012 at 16:15, William Seligman selig...@nevis.columbia.edu wrote:

[snip]
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 12:15 PM, emmanuel segura wrote:

How did you create your volume group?

pvcreate /dev/drbd0
vgcreate -c y ADMIN /dev/drbd0
lvcreate -L 200G -n usr ADMIN
# ... and so on
# Nevis-HA is the cluster name I used in cluster.conf
mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr
# ... and so on

Give me the output of the vgs command when the cluster is up.

Here it is:

Logging initialised at Thu Mar 15 12:40:39 2012
Set umask from 0022 to 0077
Finding all volume groups
Finding volume group ROOT
Finding volume group ADMIN
  VG    #PV #LV #SN Attr   VSize   VFree
  ADMIN   1   5   0 wz--nc   2.61t 765.79g
  ROOT    1   2   0 wz--n- 117.16g      0
Wiping internal VG cache

I assume the "c" in the ADMIN attributes means that clustering is turned on?

On 15 March 2012 at 17:06, William Seligman selig...@nevis.columbia.edu wrote:

[snip]
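That guess is right as far as I know: in the vgs attribute string, the sixth character is the clustered bit, which is what vgcreate -c y sets. A quick sketch, with the attr string copied from the output above:

```shell
# Decode the clustered bit of a VG attribute string from `vgs`.
# Position 6 is 'c' for a clustered volume group (set by `vgcreate -c y`).
attr="wz--nc"   # ADMIN's attributes, copied from the vgs output above
if [ "$(printf '%s' "$attr" | cut -c6)" = "c" ]; then
  echo "clustered VG"
else
  echo "local VG"
fi
# prints: clustered VG
```

ROOT's string, "wz--n-", has "-" in that position, so it is an ordinary local VG, as expected.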
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 12:55 PM, emmanuel segura wrote:

I don't see any errors, and the answer to your question is yes. Can you show me your /etc/cluster/cluster.conf and your "crm configure show"? That way I can try later to see if I can find a fix.

Thanks for taking a look. My cluster.conf: http://pastebin.com/w5XNYyAX crm configure show: http://pastebin.com/atVkXjkn Before you spend a lot of time on the second file, remember that clvmd will hang whether or not I'm running pacemaker.

On 15 March 2012 at 17:42, William Seligman selig...@nevis.columbia.edu wrote:

[snip]
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 3:45 PM, Vladislav Bogdanov wrote:

15.03.2012 18:43, William Seligman wrote:

[snip]

For comparison, I crashed one node and looked at the same output on the remaining node: dlm_tool ls: http://pastebin.com/cKVAGxsd dlm_tool dump: http://pastebin.com/c0h0p22Q (the post-crash lines begin at 1331824940)

Everything is fine there; dlm correctly understands that the node is fenced and returns to a normal state. The only minor issue I see is that fencing took a long time: 21 seconds.

Hmm. My fencing agent works by toggling the power on a UPS. If all the agent does is action=off, it will cut power immediately. But if you tell it action=reboot, it will cut the load, wait 10 seconds, then turn the load back on again; I found I needed that delay because otherwise the UPS might confuse, overlap, or ignore sequential commands. Could this be the issue?

I've noticed that my fencing agent always seems to be called with action=reboot when a node is fenced. Why is it using 'reboot' and not 'off'? Is this the standard, or am I missing a definition somewhere?

I don't see the kern_stop or fencing flags. There's another thing I don't see: at the top of 'dlm_tool dump' it displays most of the contents of my cluster.conf file, except for the fencing sections. Here's my cluster.conf for comparison: http://pastebin.com/w5XNYyAX

That also looks correct (I mean fence_pcmk), but I can be wrong here; I do not use cman.

cman doesn't see anything wrong in my cluster.conf file:

# ccs_config_validate
Configuration validates

But could there be something that's causing the fencing sections to be ignored? Unfortunately, I still got the gfs2 freeze, so this is not the complete story.

Both clvmd and gfs2 use dlm. If the dlm layer thinks fencing is not completed, both of them freeze.

I did grep -E '(dlm|clvm|fenc)' /var/log/messages and looked at the time I crashed the node: http://pastebin.com/dvBtdLUs. I see lines that indicate that pacemaker and drbd are fencing the node, but nothing from dlm or clvmd. Does this indicate what you suggest? Could dlm somehow be ignoring or overlooking the fencing I put in? Is there any other way to check this?

No, dlm_controld (and friends) mostly uses a different logging method; that is what you see in 'dlm_tool dump'.
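The reboot-with-delay behaviour described above, which would account for most of the 21 seconds, can be sketched as follows. This is an illustration, not the poster's actual agent: ups_power is a hypothetical stand-in for whatever command really toggles the UPS load, and the delay is shortened here from the 10 seconds he describes:

```shell
# Sketch of a UPS fencing agent's action dispatch, assuming (as in the
# agent described above) that "reboot" is implemented as off, a pause,
# then on. ups_power is a hypothetical placeholder, not a real command.
ups_power() { echo "ups load $1"; }

fence_action() {
  case "$1" in
    off)    ups_power off ;;
    on)     ups_power on ;;
    reboot) ups_power off
            sleep 1        # the real agent waits ~10 s so the UPS keeps up
            ups_power on ;;
    *)      echo "unknown action: $1" >&2; return 1 ;;
  esac
}

fence_action reboot
```

If the caller only needs the node dead (which is all dlm and DRBD require), implementing reboot this way means the fence is not reported complete until the load is back on, stretching the fencing time.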
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 4:57 PM, emmanuel segura wrote: we can try to understand what happen when clvm hang edit the /etc/lvm/lvm.conf and change level = 7 in the log session and uncomment this line file = /var/log/lvm2.log Here's the tail end of the file (the original is 1.6M). Because there no times in the log, it's hard for me to point you to the point where I crashed the other system. I think (though I'm not sure) that the crash happened after the last occurrence of cache/lvmcache.c:1484 Wiping internal VG cache Honestly, it looks like a wall of text to me. Does it suggest anything to you? Il giorno 15 marzo 2012 20:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 12:55 PM, emmanuel segura wrote: I don't see any error and the answer for your question it's yes can you show me your /etc/cluster/cluster.conf and your crm configure show like that more later i can try to look if i found some fix Thanks for taking a look. My cluster.conf: http://pastebin.com/w5XNYyAX crm configure show: http://pastebin.com/atVkXjkn Before you spend a lot of time on the second file, remember that clvmd will hang whether or not I'm running pacemaker. Il giorno 15 marzo 2012 17:42, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 12:15 PM, emmanuel segura wrote: Ho did you created your volume group pvcreate /dev/drbd0 vgcreate -c y ADMIN /dev/drbd0 lvcreate -L 200G -n usr ADMIN # ... and so on # Nevis-HA is the cluster name I used in cluster.conf mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr # ... and so on give me the output of vgs command when the cluster it's up Here it is: Logging initialised at Thu Mar 15 12:40:39 2012 Set umask from 0022 to 0077 Finding all volume groups Finding volume group ROOT Finding volume group ADMIN VG#PV #LV #SN Attr VSize VFree ADMIN 1 5 0 wz--nc 2.61t 765.79g ROOT1 2 0 wz--n- 117.16g 0 Wiping internal VG cache I assume the c in the ADMIN attributes means that clustering is turned on? 
On 15 March 2012 at 17:06, William Seligman selig...@nevis.columbia.edu wrote: On 3/15/12 11:50 AM, emmanuel segura wrote: Yes, William. Now try clvmd -d and see what happens. locking_type = 3 is the LVM cluster lock type.

Since you asked for confirmation, here it is: the output of 'clvmd -d' just now: http://pastebin.com/bne8piEw. I crashed the other node at Mar 15 12:02:35, when you see the only additional line of output. I don't see any particular difference between this and the previous result http://pastebin.com/sWjaxAEF, which suggests that I had cluster locking enabled before, and still do now.

On 15 March 2012 at 16:15, William Seligman selig...@nevis.columbia.edu wrote: On 3/15/12 5:18 AM, emmanuel segura wrote: The first thing I saw in your clvmd log is this: = WARNING: Locking disabled. Be careful! This could corrupt your metadata. =

I saw that too, and thought the same as you did. I did some checks (see below), but some web searches suggest that this message is a normal consequence of clvmd initialization; e.g., http://markmail.org/message/vmy53pcv52wu7ghx

Use this command: lvmconf --enable-cluster. And remember, for cman+pacemaker you don't need qdisk.

Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf: http://pastebin.com/841VZRzW and the output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf.
Then I did as you suggested, but with a check to see if anything changed:

# cd /etc/lvm/
# cp lvm.conf lvm.conf.cluster
# lvmconf --enable-cluster
# diff lvm.conf lvm.conf.cluster
#

So the key lines have been there all along: locking_type = 3 and fallback_to_local_locking = 0.
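The same check can be scripted rather than eyeballing a diff. A minimal sketch that greps for the two settings; it runs here against an inline snippet mirroring the values above, while a real check would read /etc/lvm/lvm.conf:

```shell
# Check the two lvm.conf settings that matter for clvmd cluster locking.
# The snippet mirrors the values reported above; on a real node, replace
# the echo with: cat /etc/lvm/lvm.conf
snippet='locking_type = 3
fallback_to_local_locking = 0'
echo "$snippet" | grep -q '^locking_type = 3' && echo "cluster locking: on"
echo "$snippet" | grep -q '^fallback_to_local_locking = 0' && echo "local fallback: off"
```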
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 6:05 PM, William Seligman wrote: On 3/15/12 4:57 PM, emmanuel segura wrote: We can try to understand what happens when clvmd hangs ... Here's the tail end of the file (the original is 1.6M). ... Honestly, it looks like a wall of text to me. Does it suggest anything to you?

Maybe it would help if I included the link to the pastebin where I put the output: http://pastebin.com/8pgW3Muw
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 6:07 PM, William Seligman wrote: Maybe it would help if I included the link to the pastebin where I put the output: http://pastebin.com/8pgW3Muw

Could the problem be with lvm+drbd? In lvm2.log, I see this sequence of lines pre-crash:

device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
device/dev-io.c:271  /dev/md0: size is 1027968 sectors
device/dev-io.c:137  /dev/md0: block size is 1024 bytes
device/dev-io.c:588  Closed /dev/md0
device/dev-io.c:271  /dev/md0: size is 1027968 sectors
device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137  /dev/md0: block size is 1024 bytes
device/dev-io.c:588  Closed /dev/md0
filters/filter-composite.c:31  Using /dev/md0
device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137  /dev/md0: block size is 1024 bytes
label/label.c:186  /dev/md0: No label detected
device/dev-io.c:588  Closed /dev/md0
device/dev-io.c:535  Opened /dev/drbd0 RO O_DIRECT
device/dev-io.c:271  /dev/drbd0: size is 5611549368 sectors
device/dev-io.c:137  /dev/drbd0: block size is 4096 bytes
device/dev-io.c:588  Closed /dev/drbd0
device/dev-io.c:271  /dev/drbd0: size is 5611549368 sectors
device/dev-io.c:535  Opened /dev/drbd0 RO O_DIRECT
device/dev-io.c:137  /dev/drbd0: block size is 4096 bytes
device/dev-io.c:588  Closed /dev/drbd0

I interpret this as: look at /dev/md0, get some info, close; look at /dev/drbd0, get some info, close. Post-crash, I see:

device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
device/dev-io.c:271  /dev/md0: size is 1027968 sectors
device/dev-io.c:137  /dev/md0: block size is 1024 bytes
device/dev-io.c:588  Closed /dev/md0
device/dev-io.c:271  /dev/md0: size is 1027968 sectors
device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137  /dev/md0: block size is 1024 bytes
device/dev-io.c:588  Closed /dev/md0
filters/filter-composite.c:31  Using /dev/md0
device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137  /dev/md0: block size is 1024 bytes
label/label.c:186  /dev/md0: No label detected
device/dev-io.c:588  Closed /dev/md0
device/dev-io.c:535  Opened /dev/drbd0 RO O_DIRECT
device/dev-io.c:271  /dev/drbd0: size is 5611549368 sectors
device/dev-io.c:137  /dev/drbd0: block size is 4096 bytes

... and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0. If I look at /proc/drbd when I crash one node, I see this:

# cat /proc/drbd
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
    ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

If instead I bring down one node gracefully (crm node standby), I get this:

# cat /proc/drbd
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-
    ns:764 nr:40 dw:40 dr:7036496 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Could it be that drbd can't respond to certain requests from lvm if the state of the peer is DUnknown instead of Outdated?
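The two states being compared can be pulled out of /proc/drbd mechanically. A small sketch, run here against the status line from the crashed-peer case above (on a live node you would read /proc/drbd instead of a hard-coded sample):

```shell
# Parse the cs: (connection) and peer half of ds: (disk) states out of a
# /proc/drbd resource line. Sample line is the crashed-peer case above.
line=' 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-'
cs=$(echo "$line" | sed -n 's/.*cs:\([A-Za-z]*\).*/\1/p')
peer_ds=$(echo "$line" | sed -n 's/.*ds:[A-Za-z]*\/\([A-Za-z]*\).*/\1/p')
echo "connection=$cs peer_disk=$peer_ds"
```

In the graceful-shutdown case the same parse would report peer_disk=Outdated, which is the difference being asked about.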
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/14/12 6:02 AM, emmanuel segura wrote: I think it's better if you make clvmd start at boot: chkconfig cman on ; chkconfig clvmd on

I've already tried it. It doesn't work. The problem is that my LVM information is on the drbd device. If I start up clvmd before drbd, it won't find the logical volumes. I also don't see why that would make a difference (although this could be part of the confusion): a service is a service. I've tried starting clvmd both inside and outside pacemaker control, with the same problem. Why would starting clvmd at boot make a difference?

On 13 March 2012 at 23:29, William Seligman selig...@nevis.columbia.edu wrote: On 3/13/12 5:50 PM, emmanuel segura wrote: So if you're using cman, why do you use lsb::clvmd? I think you are very confused.

I don't dispute that I may be very confused! However, from what I can tell, I still need to run clvmd even if I'm running cman (I'm not using rgmanager). If I just run cman, gfs2 and any other form of mount fails. If I run cman, then clvmd, then gfs2, everything behaves normally. Going by these instructions: https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial the resources he puts under cluster control (rgmanager) I have to put under pacemaker control. Those include drbd, clvmd, and gfs2. The difference between what I've got and what's in Clusters From Scratch is that in CFS they assign one DRBD volume to a single filesystem. I create an LVM physical volume on my DRBD resource, as in the above tutorial, and so I have to start clvmd or the logical volumes in the DRBD partition won't be recognized. Is there some way to get logical volumes recognized automatically by cman, without rgmanager, that I've missed?

On 13 March 2012 at 22:42, William Seligman selig...@nevis.columbia.edu wrote: On 3/13/12 12:29 PM, William Seligman wrote: I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not.
I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now), all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running. I've traced why the lsb::clvmd resource fails: the monitor/status command includes vgdisplay, which hangs indefinitely. Therefore the monitor will always time out. So this isn't a problem with pacemaker, but with clvmd/dlm: if a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas? Details: versions: Redhat Linux 6.2 (kernel 2.6.32) cman-3.0.12.1 corosync-1.4.1 pacemaker-1.1.6 lvm2-2.02.87 lvm2-cluster-2.02.87

This may be a Linux-HA question after all! I ran a few more tests. Here's the output from a typical test of grep -E '(dlm|gfs2|clvmd|fenc|syslogd)' /var/log/messages: http://pastebin.com/uqC6bc1b It looks like what's happening is that the fence agent (one I wrote) is not returning the proper error code when a node crashes. According to this page, if a fencing agent fails, GFS2 will freeze to protect the data: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html As a test, I tried to fence my test node via standard means: stonith_admin -F orestes-corosync.nevis.columbia.edu These were the log messages, which show that stonith_admin did its job and CMAN was notified of the fencing: http://pastebin.com/jaH820Bv. Unfortunately, I still got the gfs2 freeze, so this is not the complete story. First things first.
I vaguely recall a web page that went over the STONITH return codes, but I can't locate it again. Is there any reference for the return codes expected from a fencing agent, perhaps as a function of the state of the fencing device?

--
Bill Seligman | mailto://selig...@nevis.columbia.edu
Nevis Labs, Columbia Univ | http://www.nevis.columbia.edu/~seligman/
PO Box 137 | Irvington NY 10533 USA | Phone: (914) 591-2823

smime.p7s Description: S/MIME Cryptographic Signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/14/12 9:26 AM, Lars Marowsky-Bree wrote: On 2012-03-14T09:02:59, William Seligman selig...@nevis.columbia.edu wrote: To ask a slightly different question - why? Does your workload require or benefit from a dual-primary architecture? Most don't.

http://www.gossamer-threads.com/lists/linuxha/users/78497#78497 I'm mindful of the issues involved, such as those Lars Ellenberg brought up in his response. I need something that will fail over with a minimum of fuss. Although I'm encountering one problem after another, I think I'm closing in on my goal. And if not, at least I'm leaving some interesting threads in Linux-HA for future sysadmins to search for.
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/14/12 9:20 AM, emmanuel segura wrote: Hello William, I didn't know you are using drbd, and I don't know what type of configuration you're using. But it's better if you start clvmd with clvmd -d, so that we can see what the problem is.

For what it's worth, here's the output of running clvmd -d on the node that stays up: http://pastebin.com/sWjaxAEF What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: http://pastebin.com/w5XNYyAX At the time of the next-to-last line, I cut power to the other node. At the time of the last line, I ran vgdisplay on the remaining node, which hangs forever. After a lot of web searching, I found that I'm not the only one with this problem. Here's one case that doesn't seem relevant to me, since I don't use qdisk: http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html Here's one with the same problem on the same OS: http://bugs.centos.org/view.php?id=5229, but with no resolution. Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them?
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/14/12 12:43 PM, Dimitri Maziuk wrote: On 03/14/2012 11:08 AM, Lars Marowsky-Bree wrote: On 2012-03-14T11:41:53, William Seligman selig...@nevis.columbia.edu wrote: I'm mindful of the issues involved, such as those Lars Ellenberg brought up in his response. I need something that will fail over with a minimum of fuss. Although I'm encountering one problem after another, I think I'm closing in on my goal.

I doubt this is what you're getting. An active/passive failover configuration would likely save you tons of trouble and not perform worse; it would probably be faster for most workloads. Or if you look at it from another angle: if you can't configure your resources to start properly at failover, what makes you think you can configure a dual-primary any better?

I'll repeat the answer I gave in that other thread, for what it's worth: Consider two nodes in a primary-secondary cluster. The primary is running a resource. It fails, so the resource has to fail over to the secondary. Now consider a primary-primary cluster. Both run the same resource. One fails. There's no failover here; the other box still runs the resource. In my case, the only thing that has to work is the cloned cluster IP address, and that I've verified to my satisfaction.
[Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not. I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now), all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running. I've traced why the lsb::clvmd resource fails: the monitor/status command includes vgdisplay, which hangs indefinitely. Therefore the monitor will always time out. So this isn't a problem with pacemaker, but with clvmd/dlm: if a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas? Details: versions: Redhat Linux 6.2 (kernel 2.6.32) cman-3.0.12.1 corosync-1.4.1 pacemaker-1.1.6 lvm2-2.02.87 lvm2-cluster-2.02.87 cluster.conf: http://pastebin.com/w5XNYyAX output of crm configure show: http://pastebin.com/atVkXjkn output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf /var/log/cluster/dlm_controld.log and /var/log/cluster/gfs_controld.log show nothing. When I shut down power to one node (orestes-tb), the output of grep -E '(dlm|gfs2|clvmd)' /var/log/messages is http://pastebin.com/vjpvCFeN.
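As a side note, the log filter used above can be exercised on a few synthetic syslog lines to show what it keeps. The sample lines below are invented for illustration; they are not from the actual pastebin:

```shell
# Run the log filter from the post over sample syslog lines.
# Sample lines are illustrative only -- not from the real /var/log/messages.
log='Mar 13 12:01:00 orestes-tb dlm_controld[2495]: fencing deferred to pacemaker
Mar 13 12:01:01 orestes-tb ntpd[1234]: synchronized to time server
Mar 13 12:01:02 orestes-tb clvmd: Unable to obtain cluster lock'
echo "$log" | grep -E '(dlm|gfs2|clvmd)'   # keeps lines 1 and 3, drops the ntpd line
```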
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/13/12 2:49 PM, emmanuel segura wrote: Sorry William, but I think clvmd must be used with ocf:lvm2:clvmd. Example:

crm configure primitive clvmd ocf:lvm2:clvmd params daemon_timeout=30
crm configure clone cln_clvmd clvmd

And remember, clvmd depends on dlm, so you should do the same for dlm.

I don't have an ocf:lvm2:clvmd resource agent on my system. When I do a web search, it looks like a resource agent found on SUSE systems, but not on RHEL distros. Based on Clusters From Scratch, I think that if I'm using cman, dlm is started automatically. I see dlm_controld is running without my explicitly starting it:

# ps aux | grep dlm_controld
root 2495 0.0 0.0 234688 7564 ? Ssl 12:32 0:00 dlm_controld

I should have also mentioned that I can duplicate this problem outside pacemaker. That is, I can start cman, clvmd, and gfs2 manually on both nodes, cut power on one node, and clustering fails on the other node. So I suspect it's not a pacemaker resource problem. For a moment I thought I might not have used -p lock_dlm when I created my GFS2 filesystems, but I think the output of gfs2_edit -p sb ... shows that I did it correctly: http://pastebin.com/ALQYpKAy. When I looked more carefully at my lvm.conf, I saw that I had a typo: fallback_to_local_locking=4 I changed it to the correct value (according to https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial): fallback_to_local_locking=0 Unfortunately this doesn't solve the problem. So... any ideas?
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/13/12 12:29 PM, William Seligman wrote: I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not. ...

This may be a Linux-HA question after all! I ran a few more tests. Here's the output from a typical test of grep -E '(dlm|gfs2|clvmd|fenc|syslogd)' /var/log/messages: http://pastebin.com/uqC6bc1b It looks like what's happening is that the fence agent (one I wrote) is not returning the proper error code when a node crashes.
According to this page, if a fencing agent fails, GFS2 will freeze to protect the data: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html As a test, I tried to fence my test node via standard means: stonith_admin -F orestes-corosync.nevis.columbia.edu These were the log messages, which show that stonith_admin did its job and CMAN was notified of the fencing: http://pastebin.com/jaH820Bv. Unfortunately, I still got the gfs2 freeze, so this is not the complete story. First things first. I vaguely recall a web page that went over the STONITH return codes, but I can't locate it again. Is there any reference for the return codes expected from a fencing agent, perhaps as a function of the state of the fencing device?
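For what it's worth, the convention I believe Red Hat's fence agents follow — this is from memory of the FenceAgentAPI documentation, so treat it as an assumption to verify, not a definitive reference — is exit 0 for a confirmed on/off/reboot, non-zero for anything unverified, with the status action returning 0 for "device on" and 2 for "device off". A toy sketch of that contract, with a hypothetical fence_result helper:

```shell
# Hedged sketch of the fence-agent exit-code convention (assumed from the
# FenceAgentAPI docs -- verify against your agent's documentation):
#   off/on/reboot actions: exit 0 only on confirmed success, 1 on failure
#   status action:         exit 0 = device on, 2 = device off, 1 = error
fence_result() {
  case "$1" in
    success) return 0 ;;  # power action confirmed
    off)     return 2 ;;  # status query: port is off
    *)       return 1 ;;  # unverified outcome must report failure
  esac
}
for probe in success off error; do
  rc=0; fence_result "$probe" || rc=$?
  echo "$probe -> $rc"
done
```

The key point matching the thread: an agent that exits 0 without having confirmed the node is down lets GFS2 carry on against a possibly-alive peer, while an agent that wrongly exits non-zero leaves GFS2 frozen.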
Re: [Linux-HA] How do I clear the Failed actions section?
On 3/8/12 6:53 AM, Helmut Wollmersdorfer wrote: Am 07.03.2012 um 18:01 schrieb Florian Haas: On Wed, Mar 7, 2012 at 5:51 PM, William Seligman selig...@nevis.columbia.edu wrote: Again, a disclaimer: I am not an expert. Your advice was spot on. :) But what to do, if cleanup is not working? And everything is running:

# crm status
Last updated: Thu Mar 8 12:27:00 2012
Stack: Heartbeat
Current DC: xen10 (5ab5ba3d-3be5-4763-83e7-90aaa49361a6) - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, unknown expected votes
12 Resources configured.

Online: [ xen10 xen11 ]

xen_www (ocf::heartbeat:Xen): Started xen11
Master/Slave Set: DrbdClone1
    Masters: [ xen11 ]
    Slaves: [ xen10 ]
xen_typo3 (ocf::heartbeat:Xen): Started xen11
xen_shopdb (ocf::heartbeat:Xen): Started xen10
xen_admintool (ocf::heartbeat:Xen): Started xen11
xen_cmsdb (ocf::heartbeat:Xen): Started xen11
Master/Slave Set: DrbdClone2
    Resource Group: group_drbd2:0
        xen_drbd2_1:0 (ocf::linbit:drbd): Slave xen10 (unmanaged) FAILED
        xen_drbd2_2:0 (ocf::linbit:drbd): Stopped
    Masters: [ xen11 ]
Master/Slave Set: DrbdClone3
    Masters: [ xen10 ]
    Slaves: [ xen11 ]
Master/Slave Set: DrbdClone5
    Masters: [ xen11 ]
    Slaves: [ xen10 ]
Master/Slave Set: DrbdClone6
    Slaves: [ xen11 xen10 ]
Master/Slave Set: DrbdClone4
    Masters: [ xen11 ]
    Slaves: [ xen10 ]

Failed actions:
    xen_cmsdb_monitor_3000 (node=xen10, call=571, rc=7, status=complete): not running
    xen_drbd1_2:1_promote_0 (node=xen10, call=5205, rc=1, status=complete): unknown error
    xen_drbd2_1:1_promote_0 (node=xen10, call=790, rc=1, status=complete): unknown error
    xen_ns2_monitor_3000 (node=xen10, call=601, rc=7, status=complete): not running
    xen_drbd3_1:1_promote_0 (node=xen10, call=383, rc=-2, status=Timed Out): unknown exec error
    xen_drbd2_1:0_promote_0 (node=xen10, call=1326, rc=-2, status=Timed Out): unknown exec error
    xen_drbd2_1:0_stop_0 (node=xen10, call=1348, rc=-2, status=Timed Out): unknown exec error

xen11:# crm resource cleanup 
xen_drbd2_1
Error performing operation: The object/attribute does not exist
Error performing operation: The object/attribute does not exist

Given the list of resources displayed by crm_mon, the command you need is: crm resource cleanup DrbdClone2 I can't say whether that will fix your problems, but you won't get the does not exist message. Somewhere in either Pacemaker Explained or Clusters From Scratch, it says that once you clone or ms a resource, you can't refer to that resource as an individual anymore; you have to use the clone/ms name. What I did when faced with a problem like yours is cat /proc/drbd, look at the lines for the failed drbd, and fix it on my own. Then I'd type the cleanup command for pacemaker to pick up the current state of the resource.

# xm list
Name          ID     Mem  VCPUs  State  Time(s)
Domain-0       0  100516         r-     40648.5
admintool      5    4096      2  -b      7455.4
cmsdb          3    2048      2  -b      2106.5
typo3          2    1024      2  -b      2890.9
www            1    1024      1  -b       855.0

xen11:# drbdadm status
drbd-status version=8.3.7 api=88
resources config_file=/etc/drbd.conf
resource minor=1 name=drbd1_1 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=2 name=drbd1_2 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=3 name=drbd2_1 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=4 name=drbd2_2 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=5 name=drbd3_1 cs=Connected ro1=Secondary ro2=Primary ds1=UpToDate ds2=UpToDate /
resource minor=6 name=drbd3_2 cs=Connected ro1=Secondary ro2=Primary ds1=UpToDate ds2=UpToDate /
resource minor=7 name=drbd4_1 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=8 name=drbd4_2 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=9 name=drbd5_1 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=10 name=drbd5_2 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=11 name=drbd6_1 cs=StandAlone ro1=Secondary ro2=Unknown ds1=Outdated ds2=DUnknown /
resource minor=12 name=drbd6_2 cs=StandAlone ro1=Secondary ro2=Unknown ds1=Outdated ds2=DUnknown /
!-- resource minor=13 name=drbd7_1 not available or not yet created --
!-- resource minor=14 name=drbd7_2 not available or not yet created --
!-- resource minor=15 name=drbd8_1 not available
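The StandAlone/Outdated pairs for drbd6_1 and drbd6_2 above are the classic signature of a DRBD split-brain that automatic recovery didn't resolve. For reference, a sketch of the usual manual recovery on the DRBD 8.3 series, taking drbd6_1 as the example — verify which node's changes you intend to discard before running anything like this:

```
# On the node whose changes are to be thrown away:
drbdadm secondary drbd6_1
drbdadm -- --discard-my-data connect drbd6_1

# On the surviving node, only if it is also StandAlone:
drbdadm connect drbd6_1
```

Once the resync finishes, a crm resource cleanup on the corresponding ms resource lets pacemaker pick up the repaired state.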
Re: [Linux-HA] How do I clear the Failed actions section?
On 3/7/12 10:50 AM, Jerome Yanga wrote: I just want to share that the command recommended did NOT move the resource to another node. It basically clears the Failed Actions section. This is why I was conditional in my response. Suppose you had something like the following:

primitive MyResource ocf:heartbeat:Dummy
location MyResourcePreferredNode MyResource 10: my-node-a.example.com

with no resource-stickiness set. Assume MyResource fails on my-node-a, and is moved to my-node-b. Then if you were to do: crm resource cleanup MyResource pacemaker might move MyResource back to my-node-a. It might even move it back without that example MyResourcePreferredNode constraint. If you want to avoid that, consider per-resource or global resource-stickiness: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch05s03s02.html http://www.gossamer-threads.com/lists/linuxha/pacemaker/64076 Again, a disclaimer: I am not an expert. On Tue, Mar 6, 2012 at 11:46 AM, William Seligman selig...@nevis.columbia.edu wrote: On 3/6/12 2:38 PM, Jerome Yanga wrote: Do you know by chance if that command you have provided bounces the resource? I don't know what you mean by bounce the resource. According to: http://www.clusterlabs.org/doc/crm_cli.html the command refreshes the resource status. Depending on your configuration, it might shift a resource to another node. But I am not an expert! I merely knew how to clear up the error message. On Tue, Mar 6, 2012 at 10:28 AM, William Seligman selig...@nevis.columbia.edu wrote: On 3/6/12 1:04 PM, Jerome Yanga wrote: crm_mon shows the error below. Failed actions: drbd0:1_monitor_59000 (node=testserver1.example.com, call=132, rc=-2, status=Timed Out): unknown exec error I have checked DRBD and the mirror is connected and UpToDate on both nodes. The error above caused the resources to fail over and it seems to be working OK. However, the failed actions section has not disappeared. How do I clear this error? 
crm resource cleanup drbd0
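To make the stickiness suggestion above concrete, a minimal crm shell sketch — the value 100 is arbitrary and illustrative; it needs to outweigh whatever location scores would otherwise pull the resource back:

```
# cluster-wide default for all resources
crm configure rsc_defaults resource-stickiness=100

# or per resource, via meta attributes (hypothetical resource)
crm configure primitive MyResource ocf:heartbeat:Dummy \
    meta resource-stickiness=100
```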
Re: [Linux-HA] Apparent problem in pacemaker ordering
On 3/5/12 11:55 AM, William Seligman wrote: On 3/3/12 3:30 PM, William Seligman wrote: On 3/3/12 2:14 PM, Florian Haas wrote: On Sat, Mar 3, 2012 at 6:55 PM, William Seligman selig...@nevis.columbia.edu wrote: On 3/3/12 12:03 PM, emmanuel segura wrote: are you sure the exportfs agent can be use it with clone active/active? a) I've been through the script. If there's some problem associated with it being cloned, I haven't seen it. (It can't handle globally-unique=true, but I didn't turn that on.) It shouldn't have a problem with being cloned. Obviously, cloning that RA _really_ makes sense only with the export that manages an NFSv4 virtual root (fsid=0). Otherwise, the export clone has to be hosted on a clustered filesystem, and you'd have to have a pNFS implementation that doesn't suck (tough to come by on Linux), and if you want that sort of replicate, parallel-access NFS you might as well use Gluster. The downside of the latter, though, is it's currently NFSv3-only, without sideband locking. I'll look this over when I have a chance. I think I can get away without a NFSv4 virtual root because I'm exporting everything to my cluster either read-only, or only one system at a time will do any writing. Now that you've warned me, I'll do some more checking. b) I had similar problems using the exportfs resource in a primary-secondary setup without clones. Why would a resource being cloned create an ordering problem? I haven't set the interleave parameter (even with the documentation I'm not sure what it does) but A before B before C seems pretty clear, even for cloned resources. As far as what interleave does. Suppose you have two clones, A and B. And they're linked with an order constraint, like this: order A_before_B inf: A B ... then if interleave is false, _all_ instances of A must be started before _any_ instance of B gets to start anywhere in the cluster. 
However if interleave is true, then for any node only the _local_ instance of A needs to be started before it can start the corresponding _local_ instance of B. In other words, interleave=true is actually the reasonable thing to set on all clone instances by default, and I believe the pengine actually does use a default of interleave=true on defined clone sets since some 1.1.x release (I don't recall which). Thanks, Florian. That's a great explanation. I'll probably stick interleave=true on most of my clones just to make sure. It explains an error message I've seen in the logs: Mar 2 18:15:19 hypatia-tb pengine: [4414]: ERROR: clone_rsc_colocation_rh: Cannot interleave clone ClusterIPClone and Gfs2Clone because they do not support the same number of resources per node Because ClusterIPClone has globally-unique=true and clone-max=2, it's possible for both instances to be running on a single node; I've seen this a few times in my testing when cycling power on one of the nodes. Interleaving doesn't make sense in such a case. Bill, seeing as you've already pastebinned your config and crm_mon output, could you also pastebin your whole CIB as per cibadmin -Q output? Thanks. Sure: http://pastebin.com/pjSJ79H6. It doesn't have the exportfs resources in it; I took them out before leaving for the weekend. If it helps, I'll put them back in and try to get the cibadmin -Q output before any nodes crash. For a test, I stuck in a exportfs resource with all the ordering constraints. Here's the cibadmin -Q output from that: http://pastebin.com/nugdufJc The output of crm_mon just after doing that, showing resource failure: http://pastebin.com/cyCFGUSD Then all the resources are stopped: http://pastebin.com/D62sGSrj A few seconds later one of the nodes is fenced, but this does not bring up anything: http://pastebin.com/wzbmfVas I believe I have the solution to my stability problem. It doesn't solve the issue of ordering, but I think I have a configuration that will survive failover. 
Here's the problem. I had exportfs resources such as:

primitive ExportUsrNevis ocf:heartbeat:exportfs \
    op start interval=0 timeout=40 \
    op stop interval=0 timeout=45 \
    params clientspec=*.nevis.columbia.edu directory=/usr/nevis \
        fsid=20 options=ro,no_root_squash,async

I did detailed traces of the execution of exportfs (putting in logger commands) and found that the problem was in the backup_rmtab function in exportfs:

backup_rmtab() {
    local rmtab_backup
    if [ "${OCF_RESKEY_rmtab_backup}" != "none" ]; then
        rmtab_backup="${OCF_RESKEY_directory}/${OCF_RESKEY_rmtab_backup}"
        grep ":${OCF_RESKEY_directory}:" /var/lib/nfs/rmtab > ${rmtab_backup}
    fi
}

The problem was that the grep command was taking a long time, longer than any timeout I'd assigned to the resource. I looked at /var/lib/nfs/rmtab, and saw it was 60GB on one of my nodes and 16GB on the other. Since backup_rmtab() is called during the stop action, the resource
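For illustration, the usual way rmtab gets that large is by accumulating duplicate host:directory:count entries, and collapsing the duplicates keeps backup_rmtab's grep fast. A self-contained demo with made-up hosts and a temp file — do not point this at the live /var/lib/nfs/rmtab without thinking through the NFS client-state implications first:

```shell
# Demo: /var/lib/nfs/rmtab-style file with duplicate entries (all values
# below are invented for the demo), deduplicated in place with sort -u.
rmtab=$(mktemp)
printf '%s\n' \
    'client1.example.com:/usr/nevis:0x00000001' \
    'client1.example.com:/usr/nevis:0x00000001' \
    'client2.example.com:/usr/nevis:0x00000001' > "$rmtab"

sort -u "$rmtab" -o "$rmtab"   # collapses the repeated line; 2 entries remain
wc -l < "$rmtab"
```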
Re: [Linux-HA] How do I clear the Failed actions section?
On 3/6/12 1:04 PM, Jerome Yanga wrote: crm_mon shows the error below. Failed actions: drbd0:1_monitor_59000 (node=testserver1.example.com, call=132, rc=-2, status=Timed Out): unknown exec error I have checked DRBD and the mirror is connected and UpToDate on both nodes. The error above caused the resources to fail over and it seems to be working OK. However, the failed actions section has not disappeared. How do I clear this error? crm resource cleanup drbd0
Re: [Linux-HA] Apparent problem in pacemaker ordering
On 3/3/12 3:30 PM, William Seligman wrote: On 3/3/12 2:14 PM, Florian Haas wrote: On Sat, Mar 3, 2012 at 6:55 PM, William Seligman selig...@nevis.columbia.edu wrote: On 3/3/12 12:03 PM, emmanuel segura wrote: are you sure the exportfs agent can be use it with clone active/active? a) I've been through the script. If there's some problem associated with it being cloned, I haven't seen it. (It can't handle globally-unique=true, but I didn't turn that on.) It shouldn't have a problem with being cloned. Obviously, cloning that RA _really_ makes sense only with the export that manages an NFSv4 virtual root (fsid=0). Otherwise, the export clone has to be hosted on a clustered filesystem, and you'd have to have a pNFS implementation that doesn't suck (tough to come by on Linux), and if you want that sort of replicate, parallel-access NFS you might as well use Gluster. The downside of the latter, though, is it's currently NFSv3-only, without sideband locking. I'll look this over when I have a chance. I think I can get away without a NFSv4 virtual root because I'm exporting everything to my cluster either read-only, or only one system at a time will do any writing. Now that you've warned me, I'll do some more checking. b) I had similar problems using the exportfs resource in a primary-secondary setup without clones. Why would a resource being cloned create an ordering problem? I haven't set the interleave parameter (even with the documentation I'm not sure what it does) but A before B before C seems pretty clear, even for cloned resources. As far as what interleave does. Suppose you have two clones, A and B. And they're linked with an order constraint, like this: order A_before_B inf: A B ... then if interleave is false, _all_ instances of A must be started before _any_ instance of B gets to start anywhere in the cluster. 
However if interleave is true, then for any node only the _local_ instance of A needs to be started before it can start the corresponding _local_ instance of B. In other words, interleave=true is actually the reasonable thing to set on all clone instances by default, and I believe the pengine actually does use a default of interleave=true on defined clone sets since some 1.1.x release (I don't recall which). Thanks, Florian. That's a great explanation. I'll probably stick interleave=true on most of my clones just to make sure. It explains an error message I've seen in the logs: Mar 2 18:15:19 hypatia-tb pengine: [4414]: ERROR: clone_rsc_colocation_rh: Cannot interleave clone ClusterIPClone and Gfs2Clone because they do not support the same number of resources per node Because ClusterIPClone has globally-unique=true and clone-max=2, it's possible for both instances to be running on a single node; I've seen this a few times in my testing when cycling power on one of the nodes. Interleaving doesn't make sense in such a case. Bill, seeing as you've already pastebinned your config and crm_mon output, could you also pastebin your whole CIB as per cibadmin -Q output? Thanks. Sure: http://pastebin.com/pjSJ79H6. It doesn't have the exportfs resources in it; I took them out before leaving for the weekend. If it helps, I'll put them back in and try to get the cibadmin -Q output before any nodes crash. For a test, I stuck in a exportfs resource with all the ordering constraints. 
Here's the cibadmin -Q output from that: http://pastebin.com/nugdufJc The output of crm_mon just after doing that, showing resource failure: http://pastebin.com/cyCFGUSD Then all the resources are stopped: http://pastebin.com/D62sGSrj A few seconds later one of the nodes is fenced, but this does not bring up anything: http://pastebin.com/wzbmfVas
Re: [Linux-HA] Apparent problem in pacemaker ordering
On 3/3/12 12:03 PM, emmanuel segura wrote: are you sure the exportfs agent can be used with an active/active clone? a) I've been through the script. If there's some problem associated with it being cloned, I haven't seen it. (It can't handle globally-unique=true, but I didn't turn that on.) b) I had similar problems using the exportfs resource in a primary-secondary setup without clones. Why would a resource being cloned create an ordering problem? I haven't set the interleave parameter (even with the documentation I'm not sure what it does) but A before B before C seems pretty clear, even for cloned resources. On 3 March 2012 at 00:12, William Seligman selig...@nevis.columbia.edu wrote: One step forward, two steps back. I'm working on a two-node primary-primary cluster. I'm debugging problems I have with the ocf:heartbeat:exportfs resource. For some reason, pacemaker sometimes appears to ignore ordering I put on the resources. Florian Haas recommended pastebin in another thread, so let's give it a try. Here's my complete current output of crm configure show: http://pastebin.com/bbSsqyeu Here's a quick sketch: The sequence of events is supposed to be DRBD (ms) - clvmd (clone) - gfs2 (clone) - exportfs (clone). But that's not what happens. What happens is that pacemaker tries to start up the exportfs resource immediately. This fails, because what it's exporting doesn't exist until after gfs2 runs. Because the cloned resource can't run on either node, the cluster goes into a state in which one node is fenced, the other node refuses to run anything. Here's a quick snapshot I was able to take of the output of crm_mon that shows the problem: http://pastebin.com/CiZvS4Fh This shows that pacemaker is still trying to start the exportfs resources, before it has run the chain drbd-clvmd-gfs2. 
Just to confirm the obvious, I have the ordering constraints in the full configuration linked above (Admin is my DRBD resource):

order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone

This is not the only time I've observed this behavior in pacemaker. Here's a lengthy log file excerpt from the same time I took the crm_mon snapshot: http://pastebin.com/HwMUCmcX I can see that other resources, the symlink ones in particular, are being probed and started before the drbd Admin resource has a chance to be promoted. In looking at the log file, it may help to know that /mail and /var/nevis are gfs2 partitions that aren't mounted until the Gfs2 resource starts. So this isn't the first time I've seen this happen. This is just the first time I've been able to reproduce this reliably and capture a snapshot. Any ideas?
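Two caveats worth stating next to those constraints. First, ordering governs start/promote sequencing, but pacemaker also runs a one-shot probe (the monitor_0 operation) of every resource on every node it joins, regardless of ordering, so an exportfs agent whose probe errors out (rather than cleanly reporting "not running") while /usr/nevis doesn't exist yet will register as a failure before the chain ever runs. Second, ordering alone doesn't tie the clones to the same node; if that matters, colocation is needed as well. A sketch using the names from the post, untested against this configuration:

```
colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
```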
Re: [Linux-HA] Apparent problem in pacemaker ordering
On 3/3/12 2:14 PM, Florian Haas wrote: On Sat, Mar 3, 2012 at 6:55 PM, William Seligman selig...@nevis.columbia.edu wrote: On 3/3/12 12:03 PM, emmanuel segura wrote: are you sure the exportfs agent can be use it with clone active/active? a) I've been through the script. If there's some problem associated with it being cloned, I haven't seen it. (It can't handle globally-unique=true, but I didn't turn that on.) It shouldn't have a problem with being cloned. Obviously, cloning that RA _really_ makes sense only with the export that manages an NFSv4 virtual root (fsid=0). Otherwise, the export clone has to be hosted on a clustered filesystem, and you'd have to have a pNFS implementation that doesn't suck (tough to come by on Linux), and if you want that sort of replicate, parallel-access NFS you might as well use Gluster. The downside of the latter, though, is it's currently NFSv3-only, without sideband locking. I'll look this over when I have a chance. I think I can get away without a NFSv4 virtual root because I'm exporting everything to my cluster either read-only, or only one system at a time will do any writing. Now that you've warned me, I'll do some more checking. b) I had similar problems using the exportfs resource in a primary-secondary setup without clones. Why would a resource being cloned create an ordering problem? I haven't set the interleave parameter (even with the documentation I'm not sure what it does) but A before B before C seems pretty clear, even for cloned resources. As far as what interleave does. Suppose you have two clones, A and B. And they're linked with an order constraint, like this: order A_before_B inf: A B ... then if interleave is false, _all_ instances of A must be started before _any_ instance of B gets to start anywhere in the cluster. However if interleave is true, then for any node only the _local_ instance of A needs to be started before it can start the corresponding _local_ instance of B. 
In other words, interleave=true is actually the reasonable thing to set on all clone instances by default, and I believe the pengine actually does use a default of interleave=true on defined clone sets since some 1.1.x release (I don't recall which). Thanks, Florian. That's a great explanation. I'll probably stick interleave=true on most of my clones just to make sure. It explains an error message I've seen in the logs: Mar 2 18:15:19 hypatia-tb pengine: [4414]: ERROR: clone_rsc_colocation_rh: Cannot interleave clone ClusterIPClone and Gfs2Clone because they do not support the same number of resources per node Because ClusterIPClone has globally-unique=true and clone-max=2, it's possible for both instances to be running on a single node; I've seen this a few times in my testing when cycling power on one of the nodes. Interleaving doesn't make sense in such a case. Bill, seeing as you've already pastebinned your config and crm_mon output, could you also pastebin your whole CIB as per cibadmin -Q output? Thanks. Sure: http://pastebin.com/pjSJ79H6. It doesn't have the exportfs resources in it; I took them out before leaving for the weekend. If it helps, I'll put them back in and try to get the cibadmin -Q output before any nodes crash.
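For reference, making the interleave setting explicit on a clone is a one-line meta attribute; a crm sketch with names taken from the thread:

```
clone Gfs2Clone Gfs2 \
    meta interleave=true
```

(On globally-unique clones like the ClusterIP example above, interleaving doesn't apply, which is exactly what the pengine error message is complaining about.)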
[Linux-HA] Apparent problem in pacemaker ordering
One step forward, two steps back. I'm working on a two-node primary-primary cluster. I'm debugging problems I have with the ocf:heartbeat:exportfs resource. For some reason, pacemaker sometimes appears to ignore ordering I put on the resources. Florian Haas recommended pastebin in another thread, so let's give it a try. Here's my complete current output of crm configure show: http://pastebin.com/bbSsqyeu Here's a quick sketch: The sequence of events is supposed to be DRBD (ms) - clvmd (clone) - gfs2 (clone) - exportfs (clone). But that's not what happens. What happens is that pacemaker tries to start up the exportfs resource immediately. This fails, because what it's exporting doesn't exist until after gfs2 runs. Because the cloned resource can't run on either node, the cluster goes into a state in which one node is fenced, the other node refuses to run anything. Here's a quick snapshot I was able to take of the output of crm_mon that shows the problem: http://pastebin.com/CiZvS4Fh This shows that pacemaker is still trying to start the exportfs resources, before it has run the chain drbd-clvmd-gfs2. Just to confirm the obvious, I have the ordering constraints in the full configuration linked above (Admin is my DRBD resource):

order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone

This is not the only time I've observed this behavior in pacemaker. Here's a lengthy log file excerpt from the same time I took the crm_mon snapshot: http://pastebin.com/HwMUCmcX I can see that other resources, the symlink ones in particular, are being probed and started before the drbd Admin resource has a chance to be promoted. In looking at the log file, it may help to know that /mail and /var/nevis are gfs2 partitions that aren't mounted until the Gfs2 resource starts. So this isn't the first time I've seen this happen. 
This is just the first time I've been able to reproduce this reliably and capture a snapshot. Any ideas?
Re: [Linux-HA] Apparent problem in pacemaker ordering
Darn it, forgot versions: Red Hat Linux 6.2 (kernel 2.6.32) cman-3.0.12.1 corosync-1.4.1 pacemaker-1.1.6 On 3/2/12 6:12 PM, William Seligman wrote: One step forward, two steps back. I'm working on a two-node primary-primary cluster. I'm debugging problems I have with the ocf:heartbeat:exportfs resource. For some reason, pacemaker sometimes appears to ignore ordering I put on the resources. Florian Haas recommended pastebin in another thread, so let's give it a try. Here's my complete current output of crm configure show: http://pastebin.com/bbSsqyeu Here's a quick sketch: The sequence of events is supposed to be DRBD (ms) - clvmd (clone) - gfs2 (clone) - exportfs (clone). But that's not what happens. What happens is that pacemaker tries to start up the exportfs resource immediately. This fails, because what it's exporting doesn't exist until after gfs2 runs. Because the cloned resource can't run on either node, the cluster goes into a state in which one node is fenced, the other node refuses to run anything. Here's a quick snapshot I was able to take of the output of crm_mon that shows the problem: http://pastebin.com/CiZvS4Fh This shows that pacemaker is still trying to start the exportfs resources, before it has run the chain drbd-clvmd-gfs2. Just to confirm the obvious, I have the ordering constraints in the full configuration linked above (Admin is my DRBD resource):

order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone

This is not the only time I've observed this behavior in pacemaker. Here's a lengthy log file excerpt from the same time I took the crm_mon snapshot: http://pastebin.com/HwMUCmcX I can see that other resources, the symlink ones in particular, are being probed and started before the drbd Admin resource has a chance to be promoted. 
In looking at the log file, it may help to know that /mail and /var/nevis are gfs2 partitions that aren't mounted until the Gfs2 resource starts. So this isn't the first time I've seen this happen. This is just the first time I've been able to reproduce this reliably and capture a snapshot. Any ideas?
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
On 3/1/12 4:15 AM, emmanuel segura wrote: Can you show me your /etc/cluster/cluster.conf? Because I think your problem is a fencing loop. Here it is, /etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster config_version="17" name="Nevis_HA">
  <logging debug="off"/>
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
      <altname name="hypatia-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
      <altname name="orestes-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
  <fence_daemon post_join_delay="30"/>
  <rm disabled="1"/>
</cluster>

On 1 March 2012 at 01:03, William Seligman selig...@nevis.columbia.edu wrote: On 2/28/12 7:26 PM, Lars Ellenberg wrote: On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote: off-topic Sigh. I wish that were the reason. The reason why I'm doing dual-primary is that I've got a single-primary two-node cluster in production that simply doesn't work. One node runs resources; the other sits and twiddles its fingers; fine. But when the primary goes down, the secondary has trouble starting up all the resources; when we've actually had primary failures (UPS goes haywire, hard drive failure) the secondary often winds up in a state in which it runs none of the significant resources. With the dual-primary setup I have now, both machines are running the resources that typically cause problems in my single-primary configuration. If one box goes down, the other doesn't have to fail over anything; it's already running them. (I needed IPaddr2 cloning to work properly for this to work, which is why I started that thread... 
and all the stupider of me for missing that crucial page in Clusters From Scratch.) My only remaining problem with the configuration is restoring a fenced node to the cluster. Hence my tests, and the reason why I started this thread. /off-topic Uhm, I do think that is exactly on topic. Rather fix your resources to be able to successfully take over, than add even more complexity. What resources would that be, and why are they not taking over? I can't tell you in detail, because the major snafu happened on a production system after a power outage a few months ago. My goal was to get the thing stable as quickly as possible. In the end, that turned out to be a non-HA configuration: One node runs corosync+pacemaker+drbd, while the other just runs drbd. It works, in the sense that the users get their e-mail. If there's a power outage, I have to bring things up manually. So my only reference is the test-bench dual-primary setup I've got now, which is exhibiting the same kinds of problems even though the OS versions, software versions, and layout are different. This suggests that the problem lies in the way I'm setting up the configuration. The problems I have seem to be in the general category of the 'good guy' gets fenced when the 'bad guy' gets into trouble. Examples:

- Assuming I start out with two crashed nodes: if I just start up DRBD and nothing else, the partitions sync quickly with no problems.
- If the system starts with cman running, and I start drbd, it's likely that the node which is _not_ Outdated will be fenced (rebooted). Same thing if cman+pacemaker is running.
- Cloned ocf:heartbeat:exportfs resources are giving me problems as well (which is why I tried making changes to that resource script). Assume I start with one node running cman+pacemaker, and the other stopped. I turn on the stopped node. This will typically result in the running node being fenced, because it times out when stopping the exportfs resource. 
Falling back to DRBD 8.3.12 didn't change this behavior. My pacemaker configuration is long, so I'll excerpt what I think are the relevant pieces in the hope that it will be enough for someone to say "You fool! This is covered in Pacemaker Explained page 56!" When bringing up a stopped node, in order to restart AdminClone, pacemaker wants to stop ExportsClone, then Gfs2Clone, then ClvmdClone. As I said, it's the failure to stop ExportMail on the running node that causes it to be fenced.

primitive AdminDrbd ocf:linbit:drbd \
    params drbd_resource="admin" \
    op monitor interval="60s" role="Master" \
    op monitor interval="59s" role="Slave" \
    op stop interval="0" timeout="320" \
    op start interval="0" timeout="240"
ms AdminClone AdminDrbd \
    meta master-max="2" master-node-max="1" \
    clone-max="2" clone-node-max="1" notify="true"
primitive Clvmd lsb:clvmd op monitor
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
On 3/1/12 6:34 AM, emmanuel segura wrote: try to change the fence daemon tag like this: <fence_daemon clean_start="1" post_join_delay="30"/> (change your cluster config version and then reboot the cluster). This did not change the behavior of the cluster. In particular, I'm still dealing with this: - If the system starts with cman running, and I start drbd, it's likely that the system that is _not_ Outdated will be fenced (rebooted). On 01 March 2012 12:28, William Seligman selig...@nevis.columbia.edu wrote: [...]
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
On 3/1/12 12:10 PM, William Seligman wrote: On 3/1/12 6:34 AM, emmanuel segura wrote: try to change the fence daemon tag like this: <fence_daemon clean_start="1" post_join_delay="30"/> (change your cluster config version and then reboot the cluster). This did not change the behavior of the cluster. In particular, I'm still dealing with this: - If the system starts with cman running, and I start drbd, it's likely that the system that is _not_ Outdated will be fenced (rebooted). This just happened again. Here's the log from the bad node, the one I stopped and then restarted. cman is running (not pacemaker). I start drbd:

Mar 1 12:03:49 orestes-tb kernel: drbd: initialized. Version: 8.3.12 (api:88/proto:86-96)
Mar 1 12:03:49 orestes-tb kernel: drbd: GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
Mar 1 12:03:49 orestes-tb kernel: drbd: registered as block device major 147
Mar 1 12:03:49 orestes-tb kernel: drbd: minor_table @ 0x88041dbc4b80
Mar 1 12:03:49 orestes-tb kernel: block drbd0: Starting worker thread (from cqueue [2942])
Mar 1 12:03:49 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
Mar 1 12:03:50 orestes-tb kernel: block drbd0: Found 57 transactions (57 active extents) in activity log.
Mar 1 12:03:50 orestes-tb kernel: block drbd0: Method to ensure write ordering: barrier
Mar 1 12:03:50 orestes-tb kernel: block drbd0: max BIO size = 130560
Mar 1 12:03:50 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing device's (32 -> 768)
Mar 1 12:03:50 orestes-tb kernel: block drbd0: drbd_bm_resize called with capacity == 5611549368
Mar 1 12:03:50 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671 words=10960058 pages=21407
Mar 1 12:03:50 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
Mar 1 12:03:50 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took 625 jiffies
Mar 1 12:03:50 orestes-tb kernel: block drbd0: recounting of set bits took additional 86 jiffies
Mar 1 12:03:50 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Mar 1 12:03:50 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
Mar 1 12:03:50 orestes-tb kernel: block drbd0: attached to UUIDs 878999EFCFBE8E08::494B48826E41A2C2:494A48826E41A2C3
Mar 1 12:03:50 orestes-tb kernel: block drbd0: conn( StandAlone -> Unconnected )
Mar 1 12:03:50 orestes-tb kernel: block drbd0: Starting receiver thread (from drbd0_worker [2951])
Mar 1 12:03:50 orestes-tb kernel: block drbd0: receiver (re)started
Mar 1 12:03:50 orestes-tb kernel: block drbd0: conn( Unconnected -> WFConnection )
Mar 1 12:03:51 orestes-tb kernel: block drbd0: Handshake successful: Agreed network protocol version 96
Mar 1 12:03:51 orestes-tb kernel: block drbd0: conn( WFConnection -> WFReportParams )
Mar 1 12:03:51 orestes-tb kernel: block drbd0: Starting asender thread (from drbd0_receiver [2965])
Mar 1 12:03:51 orestes-tb kernel: block drbd0: data-integrity-alg: not-used
Mar 1 12:03:51 orestes-tb kernel: block drbd0: drbd_sync_handshake:
Mar 1 12:03:51 orestes-tb kernel: block drbd0: self 878999EFCFBE8E08::494B48826E41A2C2:494A48826E41A2C3 bits:0 flags:0
Mar 1 12:03:51 orestes-tb kernel: block drbd0: peer D40A1613FAE8F5E9:878999EFCFBE8E09:878899EFCFBE8E09:494B48826E41A2C3 bits:0 flags:0
Mar 1 12:03:51 orestes-tb kernel: block drbd0: uuid_compare()=-1 by rule 50
Mar 1 12:03:51 orestes-tb kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Mar 1 12:03:53 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
Mar 1 12:04:01 orestes-tb corosync[2296]: [TOTEM ] A processor failed, forming new configuration.
Mar 1 12:04:03 orestes-tb corosync[2296]: [QUORUM] Members[1]: 2
Mar 1 12:04:03 orestes-tb corosync[2296]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 1 12:04:03 orestes-tb kernel: dlm: closing connection to node 1
Mar 1 12:04:03 orestes-tb corosync[2296]: [CPG ] chosen downlist: sender r(0) ip(129.236.252.14) r(1) ip(192.168.100.6) ; members(old:2 left:1)
Mar 1 12:04:03 orestes-tb corosync[2296]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 1 12:04:03 orestes-tb fenced[2350]: fencing node hypatia-tb.nevis.columbia.edu

As near as I can tell, the bad node sees that the good node is Primary and UpToDate, goes into WFSyncUUID... and then corosync/cman cheerfully fences the good node. On 01 March 2012 12:28, William Seligman selig...@nevis.columbia.edu wrote: [...]
Re: [Linux-HA] cman+pacemaker+drbd fencing problem - SOLVED
On 3/1/12 12:56 PM, Lars Ellenberg wrote: On Thu, Mar 01, 2012 at 12:16:17PM -0500, William Seligman wrote: [...]
Mar 1 12:04:01 orestes-tb corosync[2296]: [TOTEM ] A processor failed, forming new configuration.
some random thoughts... DRBD bitmap exchange causes congestion on network, packet storm, irq storm, whatever, and UDP cluster comm packets falling on the floor? Can you change your cluster comm to use an (additional?) dedicated link? Or play with (increase) totem timeouts? Or play with some sysctls to make it less likely for UDP to fall on the floor, if that is what is happening. Maybe if you tcpdump the traffic while you start things up, that could give you some hints as to why corosync thinks that "A processor failed", and it has to fence that failed processor...
Mar 1 12:04:03 orestes-tb corosync[2296]: [QUORUM] Members[1]: 2
Mar 1 12:04:03 orestes-tb corosync[2296]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 1 12:04:03 orestes-tb kernel: dlm: closing connection to node 1
Mar 1 12:04:03 orestes-tb corosync[2296]: [CPG ] chosen downlist: sender r
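Lars's suggestion to increase the totem timeouts could be tried from cluster.conf, since in a cman-based stack the corosync totem section is generated from it rather than from corosync.conf. The values below are illustrative guesses only, not recommendations from this thread; remember to bump config_version when changing the file.

```
<totem token="10000" consensus="12000"/>
```

A longer token timeout gives corosync more slack before it declares "A processor failed" during a bitmap-exchange load spike, at the cost of slower detection of a genuinely dead node.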
[Linux-HA] fence_nut fencing agent - use NUT to fence via UPS
After days spent debugging a fencing issue with my cluster, I know for certain that this fencing agent works, at least for me. I'd like to contribute it to the Linux HA community. In my cluster, the fencing mechanism is to use NUT (Network UPS Tools; http://www.networkupstools.org/) to turn off power to a node. About 1.5 years ago, I contributed a NUT-based fencing agent for Pacemaker 1.0: http://oss.clusterlabs.org/pipermail/pacemaker/2010-August/007347.html That script doesn't work with stonith-ng. So here's a new agent, written in perl, and tested under pacemaker-1.1.6 and nut-2.4.3. I know there's a fence_apc_snmp agent already in resource-agents. However, that agent only works with APC devices with multiple outlet control; it displays an error message when used with my UPSes. This script is for those who'd rather use NUT than play with SNMP MIBs. Enjoy! -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137 | Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/

#!/usr/bin/perl
# External fencing agent that uses the NUT daemon to control an external UPS.
# See the comments below, and the various NUT man pages, for how this
# script works. It should work unchanged with most modern smart APC UPSes in
# a Redhat/Fedora/RHEL-style distribution with the nut package installed.
# Author: William Seligman selig...@nevis.columbia.edu
# License: GPLv2
# The Following Agent Has Been Tested With:
#   pacemaker-1.1.6
#   nut-2.4.3
#
# As you're designing your UPS and fencing set-up, consider that there may be
# three computers involved:
#   1) the machine running this fencing agent;
#   2) the machine being controlled by this agent;
#   3) the machine that can send commands to the UPS.
# On my cluster, all the UPSes have SNMP smartcards, so every host can communicate
# with every UPS; in other words, machines (1) and (3) are the same. If your UPSes
# are controlled via serial or USB connections, then you might have a
# situation in which host (2) is plugged into a UPS that has a serial connection
# to some master power-control computer, and can potentially be fenced
# by any other machine in your cluster.
# You'll probably need the nut daemon running on both the hosts (1) and
# (3). Strictly speaking, there's no reason for NUT to run on (2).
# From a practical standpoint you'll probably want NUT to be running on all the
# systems in your cluster.
#
# For this agent to work, the following conditions have to be met:
# - NUT has to be installed; on RHEL systems, this requires packages nut and
#   nut-client.
# - The nut daemon (the ups or upsd service on RHEL) must be running on hosts
#   (1) and (3). This agent does not start/stop the nut daemons for you.
# - The name of the UPS that affects host (2) has to be defined in ups.conf on
#   host (3). The format for the --ups option is upsname[@controlhost[:port]]. The
#   default controlhost is 'localhost'. If you use SNMP management cards, you want
#   to make sure you issue commands to a community with read/write privileges; the
#   default is the 'private' community. An example ups.conf:
#     [myhost-ups]
#       driver = snmp-ups
#       port = myhost-ups.example.com
#       community = private
#       mibs = apcc
# - The --username and --password options to access the UPS must be defined in
#   upsd.users on host (3), with the instcmds for poweron, poweroff, and reset allowed.
#   An example upsd.users:
#     [myuser]
#       password = mypassword
#       actions = SET
#       instcmds = ALL
# - Host (1) must be allowed access via upsd.conf and upsd.users on host (3).
#   On RHEL systems, these files are in /etc/ups. In nut-2.4 and greater, there are
#   no per-host access restrictions, but you'll need to grant access in
#   nut-2.2 or lower.
# - If you want to be able to unfence host (2) via stonith_admin, you might want
#   to set its BIOS to boot up on AC power restore, as opposed to last state or off.
#   Otherwise the machine might not come back on even if the UPS restores power.
#
# This agent doesn't keep track of which host it controls. Use the
# Pacemaker parameters for that (man stonithd); e.g.:
#   primitive StonithMyHost stonith:fence_nut \
#     op monitor interval=60 timeout=30 on-fail=stop \
#     params pcmk_host_list=myhost.example.com pcmk_host_check=static-list \
#       ups=myhost-ups username=myuser password=mypassword
# Note the use of on-fail=stop. The main way this resource's monitor can fail
# is if we lose communication with the UPS. That's not great if it happens, but
# consider what happens if we allow the default on-fail=fence, especially in a
# two-node cluster; do you want host (1) to be fenced solely because it can
# no longer fence host (2)? If you have more than two nodes, on-fail=restart
# is an alternative, but if someone's pulled the communications cable from the
# UPS then the resource will just shift from
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
On 2/27/12 8:40 PM, Andrew Beekhof wrote: Oh, what does the fence_pcmk file look like? This is a standard part of the pacemaker-1.1.6 package. According to http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_configuring_cman_fencing.html it causes any fencing requests from cman to be redirected to pacemaker. Since you asked, I've attached a copy of the file. I note that if this script is used to fence a system it writes to /var/log/messages using logger, and there is no such log message in my logs. So I guess cman is off the hook. On Tue, Feb 28, 2012 at 11:49 AM, William Seligman selig...@nevis.columbia.edu wrote: I'm trying to set up an active/active HA cluster as explained in Clusters From Scratch (which I just re-read after my last problem). I'll give versions and config files below, but I'll start with what happens. I start with an active/active cman+pacemaker+drbd+gfs2 cluster, with fencing enabled. My fencing mechanism cuts power to a node by turning the load off in its UPS. The two nodes are hypatia-tb and orestes-tb. I want to test fencing and recovery. I start with both nodes running, and resources properly running on both nodes. Then I simulate failure on one node, e.g., orestes-tb. I've done this with crm node standby, service pacemaker off, or by pulling the plug. As expected, all the resources move to hypatia-tb, with the drbd resource as Primary. When I try to bring orestes-tb back into the cluster with crm node online or service pacemaker on (the inverse of how I removed it), orestes-tb is fenced. OK, that makes sense, I guess; there's a potential split-brain situation. Not really, that should only happen if the two nodes can't see each other. Which should not be the case. Only when you pull the plug should orestes-tb be fenced. Or if you're using a fencing device that requires the node to have power, then I can imagine that turning it on again might result in fencing. But not for the other cases. 
I ran a test: I turned off pacemaker (and so DRBD) on orestes-tb. I touched a file on the hypatia-tb DRBD partition, to make it the newer one. Then I turned off pacemaker on hypatia-tb. Finally I turned on just drbd on hypatia-tb, then on orestes-tb. From /var/log/messages on hypatia-tb:

Feb 28 11:39:19 hypatia-tb kernel: d-con admin: Starting worker thread (from drbdsetup [21822])
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: disk( Diskless -> Attaching )
Feb 28 11:39:19 hypatia-tb kernel: d-con admin: Method to ensure write ordering: barrier
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: max BIO size = 130560
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: Adjusting my ra_pages to backing device's (32 -> 768)
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: drbd_bm_resize called with capacity == 5611549368
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: resync bitmap: bits=701443671 words=10960058 pages=21407
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: bitmap READ of 21407 pages took 576 jiffies
Feb 28 11:39:20 hypatia-tb kernel: block drbd0: recounting of set bits took additional 87 jiffies
Feb 28 11:39:20 hypatia-tb kernel: block drbd0: 55 MB (14114 bits) marked out-of-sync by on disk bit-map.
Feb 28 11:39:20 hypatia-tb kernel: block drbd0: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated )
Feb 28 11:39:20 hypatia-tb kernel: block drbd0: attached to UUIDs 862A336609FD27CD:BFFB722D5E3E15D7:6E63EC4258C86AF2:6E62EC4258C86AF2
Feb 28 11:39:20 hypatia-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
Feb 28 11:39:20 hypatia-tb kernel: d-con admin: Starting receiver thread (from drbd_w_admin [21824])
Feb 28 11:39:20 hypatia-tb kernel: d-con admin: receiver (re)started
Feb 28 11:39:20 hypatia-tb kernel: d-con admin: conn( Unconnected -> WFConnection )

From /var/log/messages on orestes-tb:

Feb 28 11:39:51 orestes-tb kernel: d-con admin: Starting worker thread (from drbdsetup [17827])
Feb 28 11:39:51 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
Feb 28 11:39:51 orestes-tb kernel: d-con admin: Method to ensure write ordering: barrier
Feb 28 11:39:51 orestes-tb kernel: block drbd0: max BIO size = 130560
Feb 28 11:39:51 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing device's (32 -> 768)
Feb 28 11:39:51 orestes-tb kernel: block drbd0: drbd_bm_resize called with capacity == 5611549368
Feb 28 11:39:51 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671 words=10960058 pages=21407
Feb 28 11:39:51 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
Feb 28 11:39:52 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took 735 jiffies
Feb 28 11:39:52 orestes-tb kernel: block drbd0: recounting of set bits took additional 93 jiffies
Feb 28 11:39:52 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Feb 28 11:39:52 orestes-tb kernel: block drbd0
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
On 2/28/12 2:09 PM, Lars Ellenberg wrote: On Tue, Feb 28, 2012 at 01:21:51PM -0500, William Seligman wrote: On 2/27/12 8:40 PM, Andrew Beekhof wrote: Oh, what does the fence_pcmk file look like? This is a standard part of the pacemaker-1.1.6 package. According to http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_configuring_cman_fencing.html it causes any fencing requests from cman to be redirected to pacemaker. Since you asked, I've attached a copy of the file. I note that if this script is used to fence a system it writes to /var/log/messages using logger, and there is no such log message in my logs. So I guess cman is off the hook. You say "fencing resource-only;" in drbd.conf. But you did not show the fencing handler used? Did you specify one at all? It looks like I over-edited when I got rid of the comments before I posted my configuration. The relevant sections are:

disk {
    fencing resource-only;
}
handlers {
    pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    split-brain "/usr/lib/drbd/notify-split-brain.sh sysad...@nevis.columbia.edu";
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}

Besides, for a dual-primary DRBD setup, you must have "fencing resource-and-stonith;", and you should use a DRBD fencing handler that really fences off the peer. It may additionally set constraints. Do crm-fence-peer.sh or Lon Hohberger's obliterate-peer.sh really fence off a peer? I suspect your answer will be no, since from what I can tell in a cman+pacemaker configuration they both wind up calling stonith_admin.
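For reference, the dual-primary arrangement Lars calls for ("fencing resource-and-stonith;" plus a handler that really fences the peer) would look something like the following drbd.conf fragment. This is a sketch of his suggestion, not the configuration actually in use in this thread, which has resource-only; the handler choice is exactly the open question being discussed.

```
disk {
    # Freeze I/O on loss of the peer and require the peer to be
    # fenced before resuming writes.
    fencing resource-and-stonith;
}
handlers {
    # crm-fence-peer.sh sets a Pacemaker constraint; whether that
    # counts as "really fencing" is the question raised above.
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
```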
Also, maybe that post helps to realize some of the problems involved: http://www.gossamer-threads.com/lists/linuxha/pacemaker/62927#62927 Especially the part about "But just because you can shoot someone does not mean you have the bi^Wbetter data." Because of the increased complexity, I strongly recommend against dual-primary DRBD, unless you have a very good reason to want it. "Because it can be done" does not count as a good reason in that context. off-topic Sigh. I wish that were the reason. The reason why I'm doing dual-primary is that I've got a single-primary two-node cluster in production that simply doesn't work. One node runs resources; the other sits and twiddles its fingers; fine. But when primary goes down, secondary has trouble starting up all the resources; when we've actually had primary failures (UPS goes haywire, hard drive failure) the secondary often winds up in a state in which it runs none of the significant resources. With the dual-primary setup I have now, both machines are running the resources that typically cause problems in my single-primary configuration. If one box goes down, the other doesn't have to fail over anything; it's already running them. (I needed IPaddr2 cloning to work properly for this to work, which is why I started that thread... and all the stupider of me for missing that crucial page in Clusters From Scratch.) My only remaining problem with the configuration is restoring a fenced node to the cluster. Hence my tests, and the reason why I started this thread. /off-topic More comments below. On Tue, Feb 28, 2012 at 11:49 AM, William Seligman selig...@nevis.columbia.edu wrote: I'm trying to set up an active/active HA cluster as explained in Clusters From Scratch (which I just re-read after my last problem). I'll give versions and config files below, but I'll start with what happens. I start with an active/active cman+pacemaker+drbd+gfs2 cluster, with fencing enabled.
My fencing mechanism cuts power to a node by turning the load off in its UPS. The two nodes are hypatia-tb and orestes-tb. I want to test fencing and recovery. I start with both nodes running, and resources properly running on both nodes. Then I simulate failure on one node, e.g., orestes-tb. I've done this with crm node standby, service pacemaker off, or by pulling the plug. As expected, all the resources move to hypatia-tb, with the drbd resource as Primary. When I try to bring orestes-tb back into the cluster with crm node online or service pacemaker on (the inverse of how I removed it), orestes-tb is fenced. OK, that makes sense, I guess; there's a potential split-brain situation. Not really, that should only happen if the two nodes can't see each other. Which should not be the case. Only when you pull the plug should orestes-tb
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
On 2/28/12 5:27 PM, Andrew Beekhof wrote: On Wed, Feb 29, 2012 at 5:21 AM, William Seligman selig...@nevis.columbia.edu wrote: While I was setting up the test for the previous paragraph, there was a problem with another resource (ocf:heartbeat:exportfs) that couldn't be properly monitored on either node. This led to a cycle of fencing where each node would successively fence the other because the exportfs resource couldn't run on either node. I had to quickly change my configuration to turn off monitoring on the resource. Not being able to run is fine, but not being able to stop would definitely cause fencing. Make sure the RA can always stop ;-) I'm not the one who wrote ocf:heartbeat:exportfs. I've already had my fling with trying to revise it. I can only hope that the folks who wrote it knew what they were doing; they certainly know more than I do!
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
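Beekhof's "make sure the RA can always stop" can be illustrated with a sketch. This is not the real ocf:heartbeat:exportfs agent; it is a hypothetical stop action (the export directory and messages are invented) showing the key idea: "already stopped" is reported as success, so stop only fails on a genuine error rather than on absent state, and never escalates to fencing just because the resource was never running.

```shell
#!/bin/sh
# Hedged sketch of a "never fails needlessly" OCF-style stop action.
OCF_SUCCESS=0
EXPORT_DIR="${1:-/srv/export}"   # illustrative directory, not from the thread

ra_stop() {
    # If the exportfs tool is missing or the directory is not exported,
    # the end state stop is supposed to reach ("not running") already
    # holds, so report success instead of an error.
    if ! command -v exportfs >/dev/null 2>&1 || \
       ! exportfs 2>/dev/null | grep -q "^${EXPORT_DIR}"; then
        echo "stop: ${EXPORT_DIR} already unexported"
        return $OCF_SUCCESS
    fi
    # Unexport, but do not let a race (someone else unexported it
    # first) turn into a stop failure that would get the node fenced.
    exportfs -u "*:${EXPORT_DIR}" 2>/dev/null || true
    echo "stop: ${EXPORT_DIR} unexported"
    return $OCF_SUCCESS
}

ra_stop
```

The real agent is more involved (it must also handle rmtab and monitor semantics), but the stop-is-idempotent property is what prevents the stop-timeout-to-fencing cycle described above.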
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/24/12 3:36 PM, William Seligman wrote: On 2/17/12 7:30 AM, Dejan Muhamedagic wrote: OK, I guess that'd also be doable by checking the following variables: OCF_RESKEY_CRM_meta_notify_inactive_resource (set of currently inactive instances) OCF_RESKEY_CRM_meta_notify_stop_resource (set of instances which were just stopped) Any volunteers for a patch? :) a) I have a test cluster that I can bring up and down at will; b) I'm a glutton for punishment. So I'll volunteer, since I offered to try to do something in the first place. I think I've got a handle on what to look for; e.g., one has to look for notify_type=pre and notify_operation=stop in the 'node_down' test. Here's my patch, in my usual overly-commented style. Notes: - To make this work, you need to turn on notify in the clone resources; e.g., clone ipaddr2_clone ipaddr2_resource meta notify=true None of the clone examples I saw in the documentation (Clusters From Scratch, Pacemaker Explained) show the notify option; only the ms examples do. You may want to revise the documentation with an IPaddr2 example. - I tested this with my two-node cluster, and it works. I wrote it for a multi-node cluster, but I can't be sure it will work for more than two nodes. Would some nice person test this? - I wrote my code assuming that the clone number assigned to a node would remain constant. If the clone numbers were to change by deleting/adding a node to the cluster, I don't know what would happen. Enjoy! -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/ --- IPaddr2.ori 2012-02-16 11:51:04.942688344 -0500 +++ /usr/lib/ocf/resource.d/heartbeat/IPaddr2 2012-02-27 15:23:46.856510474 -0500 @@ -13,6 +13,7 @@ # Copyright (c) 2003 Tuomo Soini # Copyright (c) 2004-2006 SUSE LINUX AG, Lars Marowsky-Brée #All Rights Reserved. 
+# Additions for high availability 2012 by William Seligman
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of version 2 of the GNU General Public License as
@@ -86,7 +87,7 @@
 This Linux-specific resource manages IP alias IP addresses.
 It can add an IP alias, or remove one.
 In addition, it can implement Cluster Alias IP functionality
-if invoked as a clone resource.
+if invoked as a clone resource with 'meta notify="true"'.
 </longdesc>
 <shortdesc lang="en">Manages virtual IPv4 addresses (Linux specific version)</shortdesc>
@@ -254,6 +255,7 @@
 <actions>
 <action name="start"   timeout="20s" />
 <action name="stop"    timeout="20s" />
+<action name="notify"  timeout="20s" />
 <action name="status" depth="0"  timeout="20s" interval="10s" />
 <action name="monitor" depth="0"  timeout="20s" interval="10s" />
 <action name="meta-data"  timeout="5s" />
@@ -849,6 +851,101 @@
 fi
 }
 
+# Make the IPaddr2 resource highly-available by adjusting the iptables
+# information if nodes drop out of the cluster.
+handle_notify() {
+	# If this is not a cloned IPaddr2 resource, do nothing.
+	# (But if it's not cloned, how did the user set 'meta notify="true"'?)
+	if [ $IP_INC_GLOBAL -eq 0 ]; then
+		ocf_log info "notify action on non-cloned resource; remove meta notify='true'"
+		exit $OCF_SUCCESS
+	fi
+
+	# To test if nodes are dropped, the best flags are when notify_type=pre and
+	# notify_operation=stop. You might not get post/stop if a node is fenced.
+	if [ "x$OCF_RESKEY_CRM_meta_notify_type" = "xpre" ] && [ "x$OCF_RESKEY_CRM_meta_notify_operation" = "xstop" ]; then
+
+		# The stopping nodes will still be included in the
+		# active_resource list, so we have to remove them.
+		local active="$OCF_RESKEY_CRM_meta_notify_active_resource"
+		for stopping in $OCF_RESKEY_CRM_meta_notify_stop_resource
+		do
+			# Sanity check: If the user has done a "crm node standby", then
+			# this method can be called by the node that's stopping.
+			local stopping_clone=`echo ${stopping} | sed "s/[^[:space:]]\+://"`
+			if [ ${stopping_clone} -eq $OCF_RESKEY_CRM_meta_clone ]; then
+				exit $OCF_SUCCESS
+			fi
+
+			# We're sane, so remove the stopping node from the active list.
+			active=`echo ${active} | sed "s/${stopping}//"`
+		done
+
+		# One of the remaining nodes has to take over the job of the dropped
+		# node(s). I'm doing the simplest thing, and choose the last
+		# node in the list of active resources. active_resource is a list like
+		# "name:0 name:1 name:2".
+		local selected_node=`echo ${active} | sed s
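For readers following the patch: the sed call strips everything up to the colon in a clone instance name like ipaddr2_resource:1, leaving the bare clone number, and the active list is pruned with a second substitution. A standalone sketch of just that string handling (resource names here are illustrative, not from any real cluster):

```shell
#!/bin/sh
# Extract the clone number from a Pacemaker instance name of the form "name:N".
clone_number() {
    echo "$1" | sed "s/[^[:space:]]\+://"
}

# Prune a stopping instance from the space-separated active list, as the
# patch does before choosing which surviving node takes over.
active="ipaddr2_resource:0 ipaddr2_resource:1"
stopping="ipaddr2_resource:0"
active=`echo ${active} | sed "s/${stopping}//"`

echo "clone number: $(clone_number ipaddr2_resource:1)"   # prints "clone number: 1"
echo "still active:${active}"
```

Note that \+ is a GNU sed extension; on a strictly POSIX sed you would write \{1,\} instead.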
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/27/12 4:10 PM, Lars Ellenberg wrote: On Mon, Feb 27, 2012 at 03:39:04PM -0500, William Seligman wrote: On 2/24/12 3:36 PM, William Seligman wrote: On 2/17/12 7:30 AM, Dejan Muhamedagic wrote: OK, I guess that'd also be doable by checking the following variables: OCF_RESKEY_CRM_meta_notify_inactive_resource (set of currently inactive instances) OCF_RESKEY_CRM_meta_notify_stop_resource (set of instances which were just stopped) Any volunteers for a patch? :) a) I have a test cluster that I can bring up and down at will; b) I'm a glutton for punishment. So I'll volunteer, since I offered to try to do something in the first place. I think I've got a handle on what to look for; e.g., one has to look for notify_type=pre and notify_operation=stop in the 'node_down' test. Here's my patch, in my usual overly-commented style. Sorry, I may be missing something obvious, but... is this not *the* use case of globally-unique=true? I did not know about globally-unique. I just tested it, replacing (with name substitutions): clone ipaddr2_clone ipaddr2_resource meta notify=true with clone ipaddr2_clone ipaddr2_resource meta globally-unique=true This fell back to the old behavior I described in the first message in this thread: iptables did not update when I took down one of my nodes. I expected this, since according to Pacemaker Explained, globally-unique=true is the default. If this had worked, I never would have reported the problem in the first place. Is there something else that could suppress the behavior you described for globally-unique=true? Which makes it possible to set clone-node-max = clone-max = number of nodes? Or even 7 times (or whatever) number of nodes. And all the iptables magic is in the start operation. If one of the nodes fails, it's bucket(s) will be re-allocated to the surviving nodes. And that is all fully implemented already (at least that's how I read the script). 
What is not implemented is changing the number of buckets (aka clone-max) without restarting clones. No need for fancy stuff in *pre* notifications, which are only statements of intent; the actual action may still fail, and all will be different than you anticipated. Notes: - To make this work, you need to turn on notify in the clone resources; e.g., clone ipaddr2_clone ipaddr2_resource meta notify="true". None of the clone examples I saw in the documentation (Clusters From Scratch, Pacemaker Explained) show the notify option; only the ms examples do. You may want to revise the documentation with an IPaddr2 example. - I tested this with my two-node cluster, and it works. I wrote it for a multi-node cluster, but I can't be sure it will work for more than two nodes. Would some nice person test this? - I wrote my code assuming that the clone number assigned to a node would remain constant. If the clone numbers were to change by deleting/adding a node to the cluster, I don't know what would happen. For anonymous clones, it can be relabeled. In fact, there are plans to remove the clone number from anonymous clones completely. However, for globally unique clones, the clone number is part of its identifier.
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/27/12 5:33 PM, Lars Ellenberg wrote: On Mon, Feb 27, 2012 at 05:23:36PM -0500, William Seligman wrote: On 2/27/12 4:10 PM, Lars Ellenberg wrote: On Mon, Feb 27, 2012 at 03:39:04PM -0500, William Seligman wrote: On 2/24/12 3:36 PM, William Seligman wrote: On 2/17/12 7:30 AM, Dejan Muhamedagic wrote: OK, I guess that'd also be doable by checking the following variables: OCF_RESKEY_CRM_meta_notify_inactive_resource (set of currently inactive instances) OCF_RESKEY_CRM_meta_notify_stop_resource (set of instances which were just stopped) Any volunteers for a patch? :) a) I have a test cluster that I can bring up and down at will; b) I'm a glutton for punishment. So I'll volunteer, since I offered to try to do something in the first place. I think I've got a handle on what to look for; e.g., one has to look for notify_type=pre and notify_operation=stop in the 'node_down' test. Here's my patch, in my usual overly-commented style. Sorry, I may be missing something obvious, but... is this not *the* use case of globally-unique=true? I did not know about globally-unique. I just tested it, replacing (with name substitutions): clone ipaddr2_clone ipaddr2_resource meta notify=true with clone ipaddr2_clone ipaddr2_resource meta globally-unique=true This fell back to the old behavior I described in the first message in this thread: iptables did not update when I took down one of my nodes. I expected this, since according to Pacemaker Explained, globally-unique=true is the default. If this had worked, I never would have reported the problem in the first place. Is there something else that could suppress the behavior you described for globally-unique=true? You need clone-node-max == clone-max. It defaults to 1. Which obviously prevents nodes already running one instance from taking over an other... I tried it, and it works. So there's no need for my patch. 
The magic invocation for a highly-available IPaddr2 resource is:

clone ip_clone ip_resource meta clone-max="2" clone-node-max="2"

Could this please be documented more clearly somewhere?
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/27/12 5:41 PM, William Seligman wrote: On 2/27/12 5:33 PM, Lars Ellenberg wrote: On Mon, Feb 27, 2012 at 05:23:36PM -0500, William Seligman wrote: On 2/27/12 4:10 PM, Lars Ellenberg wrote: On Mon, Feb 27, 2012 at 03:39:04PM -0500, William Seligman wrote: On 2/24/12 3:36 PM, William Seligman wrote: On 2/17/12 7:30 AM, Dejan Muhamedagic wrote: OK, I guess that'd also be doable by checking the following variables: OCF_RESKEY_CRM_meta_notify_inactive_resource (set of currently inactive instances) OCF_RESKEY_CRM_meta_notify_stop_resource (set of instances which were just stopped) Any volunteers for a patch? :) a) I have a test cluster that I can bring up and down at will; b) I'm a glutton for punishment. So I'll volunteer, since I offered to try to do something in the first place. I think I've got a handle on what to look for; e.g., one has to look for notify_type=pre and notify_operation=stop in the 'node_down' test. Here's my patch, in my usual overly-commented style. Sorry, I may be missing something obvious, but... is this not *the* use case of globally-unique=true? I did not know about globally-unique. I just tested it, replacing (with name substitutions): clone ipaddr2_clone ipaddr2_resource meta notify=true with clone ipaddr2_clone ipaddr2_resource meta globally-unique=true This fell back to the old behavior I described in the first message in this thread: iptables did not update when I took down one of my nodes. I expected this, since according to Pacemaker Explained, globally-unique=true is the default. If this had worked, I never would have reported the problem in the first place. Is there something else that could suppress the behavior you described for globally-unique=true? You need clone-node-max == clone-max. It defaults to 1. Which obviously prevents nodes already running one instance from taking over an other... I tried it, and it works. So there's no need for my patch. 
The magic invocation for a highly-available IPaddr2 resource is:

clone ip_clone ip_resource meta clone-max="2" clone-node-max="2"

Could this please be documented more clearly somewhere? Umm... it turns out to be:

clone ip_clone ip_resource meta globally-unique="true" clone-max="2" clone-node-max="2"

and for a two-node cluster, of course. So I guess globally-unique=true is not the default after all.
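Pulling the whole exchange together, a working configuration for a load-sharing, highly-available cluster IP on a two-node cluster would look roughly like this (a sketch only; the address is the one used earlier in the thread, and the resource names are illustrative):

```
primitive ip_resource ocf:heartbeat:IPaddr2 \
    params ip="129.236.252.13" cidr_netmask="32" \
    op monitor interval="30s"
clone ip_clone ip_resource \
    meta globally-unique="true" clone-max="2" clone-node-max="2"
```

The point of clone-node-max equal to clone-max is that when one node fails, the survivor is allowed to run both instances, so it picks up the failed node's CLUSTERIP bucket.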
[Linux-HA] cman+pacemaker+drbd fencing problem
I'm trying to set up an active/active HA cluster as explained in Clusters From Scratch (which I just re-read after my last problem). I'll give versions and config files below, but I'll start with what happens. I start with an active/active cman+pacemaker+drbd+gfs2 cluster, with fencing enabled. My fencing mechanism cuts power to a node by turning the load off in its UPS. The two nodes are hypatia-tb and orestes-tb. I want to test fencing and recovery. I start with both nodes running, and resources properly running on both nodes. Then I simulate failure on one node, e.g., orestes-tb. I've done this with crm node standby, service pacemaker off, or by pulling the plug. As expected, all the resources move to hypatia-tb, with the drbd resource as Primary. When I try to bring orestes-tb back into the cluster with crm node online or service pacemaker on (the inverse of how I removed it), orestes-tb is fenced. OK, that makes sense, I guess; there's a potential split-brain situation. I bring orestes-tb back up, with the intent of adding it back into the cluster. I make sure the cman, pacemaker, and drbd services are off at system start. On orestes-tb, I type service drbd start. What I expect to happen is that the drbd resource on orestes-tb is marked Outdated or something like that. Then I'd fix it with drbdadm --discard-my-data connect admin or whatever is appropriate. What actually happens is that hypatia-tb is fenced. Since this is the node running all the resources, this is bad behavior. It's even more puzzling when I consider that, at the time, there isn't any fencing resource actually running on orestes-tb; my guess is that DRBD on hypatia-tb is fencing itself. Eventually hypatia-tb reboots, and the cluster goes back to normal. But as a fencing/stability/HA test, this is a failure. I've repeated this with a number of variations. In the end, both systems have to be fenced/rebooted before the cluster is working again. Any ideas?
Versions: Scientific Linux 6.2, kernel 2.6.32, cman-3.0.12, corosync-1.4.1, pacemaker-1.1.6, drbd-8.4.1

/etc/drbd.d/global-common.conf:

global { usage-count yes; }
common {
	startup {
		wfc-timeout          60;
		degr-wfc-timeout     60;
		outdated-wfc-timeout 60;
	}
}

/etc/drbd.d/admin.res:

resource admin {
	protocol C;
	on hypatia-tb.nevis.columbia.edu {
		volume 0 {
			device /dev/drbd0;
			disk /dev/md2;
			flexible-meta-disk internal;
		}
		address 192.168.100.7:7788;
	}
	on orestes-tb.nevis.columbia.edu {
		volume 0 {
			device /dev/drbd0;
			disk /dev/md2;
			flexible-meta-disk internal;
		}
		address 192.168.100.6:7788;
	}
	startup { }
	net {
		allow-two-primaries yes;
		after-sb-0pri discard-zero-changes;
		after-sb-1pri discard-secondary;
		after-sb-2pri disconnect;
		sndbuf-size 0;
	}
	disk {
		resync-rate 100M;
		c-max-rate 100M;
		al-extents 3389;
		fencing resource-only;
	}
}

An edited output of crm configure show:

node hypatia-tb.nevis.columbia.edu
node orestes-tb.nevis.columbia.edu
primitive StonithHypatia stonith:fence_nut \
	params pcmk_host_check="static-list" \
	pcmk_host_list="hypatia-tb.nevis.columbia.edu" \
	ups="sofia-ups" username="admin" password="XXX"
primitive StonithOrestes stonith:fence_nut \
	params pcmk_host_check="static-list" \
	pcmk_host_list="orestes-tb.nevis.columbia.edu" \
	ups="dc-test-stand-ups" username="admin" password="XXX"
location StonithHypatiaLocation StonithHypatia \
	-inf: hypatia-tb.nevis.columbia.edu
location StonithOrestesLocation StonithOrestes \
	-inf: orestes-tb.nevis.columbia.edu

/etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster config_version="17" name="Nevis_HA">
  <logging debug="off"/>
  <cman expected_votes="1" two_node="1" />
  <clusternodes>
    <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
      <altname name="hypatia-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
      <altname name="orestes-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
  <fence_daemon post_join_delay="30" />
  <rm>
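For comparison, the recovery sequence I expected to need on the fenced node is sketched below. This is an assumption, not a tested procedure: the drbdadm option spelling differs between DRBD 8.3 and 8.4, and the resource name admin comes from the config above.

```
# On the rejoining node (e.g. orestes-tb), after reboot:
service drbd start                         # expect the resource to come up Outdated/Inconsistent
drbdadm secondary admin                    # make sure this side is not Primary
drbdadm connect --discard-my-data admin    # resync, discarding local changes
```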
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/16/12 11:14 PM, William Seligman wrote: On 2/16/12 8:13 PM, Andrew Beekhof wrote: On Fri, Feb 17, 2012 at 5:05 AM, Dejan Muhamedagicdeja...@fastmail.fm wrote: On Wed, Feb 15, 2012 at 04:24:15PM -0500, William Seligman wrote: On 2/10/12 4:53 PM, William Seligman wrote: I'm trying to set up an Active/Active cluster (yes, I hear the sounds of kittens dying). Versions: Scientific Linux 6.2 pacemaker-1.1.6 resource-agents-3.9.2 I'm using cloned IPaddr2 resources: primitive ClusterIP ocf:heartbeat:IPaddr2 \ params ip=129.236.252.13 cidr_netmask=32 \ op monitor interval=30s primitive ClusterIPLocal ocf:heartbeat:IPaddr2 \ params ip=10.44.7.13 cidr_netmask=32 \ op monitor interval=31s primitive ClusterIPSandbox ocf:heartbeat:IPaddr2 \ params ip=10.43.7.13 cidr_netmask=32 \ op monitor interval=32s group ClusterIPGroup ClusterIP ClusterIPLocal ClusterIPSandbox clone ClusterIPClone ClusterIPGroup When both nodes of my two-node cluster are running, everything looks and functions OK. From service iptables status on node 1 (hypatia-tb): 5CLUSTERIP all -- 0.0.0.0/010.43.7.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0 6CLUSTERIP all -- 0.0.0.0/010.44.7.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0 7CLUSTERIP all -- 0.0.0.0/0129.236.252.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0 On node 2 (orestes-tb): 5CLUSTERIP all -- 0.0.0.0/010.43.7.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=2 hash_init=0 6CLUSTERIP all -- 0.0.0.0/010.44.7.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=2 hash_init=0 7CLUSTERIP all -- 0.0.0.0/0129.236.252.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=2 hash_init=0 If I do a simple test of ssh'ing into 129.236.252.13, 
I see that I alternately login into hypatia-tb and orestes-tb. All is good. Now take orestes-tb offline. The iptables rules on hypatia-tb are unchanged: 5CLUSTERIP all -- 0.0.0.0/010.43.7.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0 6CLUSTERIP all -- 0.0.0.0/010.44.7.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0 7CLUSTERIP all -- 0.0.0.0/0129.236.252.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0 If I attempt to ssh to 129.236.252.13, whether or not I get in seems to be machine-dependent. On one machine I get in, from another I get a time-out. Both machines show the same MAC address for 129.236.252.13: arp 129.236.252.13 Address HWtype HWaddress Flags Mask Iface hamilton-tb.nevis.colum ether B1:95:5A:B5:16:79 C eth0 Is this the way the cloned IPaddr2 resource is supposed to behave in the event of a node failure, or have I set things up incorrectly? I spent some time looking over the IPaddr2 script. As far as I can tell, the script has no mechanism for reconfiguring iptables in the event of a change of state in the number of clones. I might be stupid -- er -- dedicated enough to make this change on my own, then share the code with the appropriate group. The change seems to be relatively simple. It would be in the monitor operation. In pseudo-code: if (IPaddr2 resource is already started ) then if ( OCF_RESKEY_CRM_meta_clone_max != OCF_RESKEY_CRM_meta_clone_max last time || OCF_RESKEY_CRM_meta_clone != OCF_RESKEY_CRM_meta_clone last time ) ip_stop ip_start Just changing the iptables entries should suffice, right? Besides, doing stop/start in the monitor is sort of unexpected. Another option is to add the missing node to one of the nodes which are still running (echo +n /proc/net/ipt_CLUSTERIP/ip). 
But any of that would be extremely tricky to implement properly (if not impossible). fi fi If this would work, then I'd have two questions for the experts: - Would the values of OCF_RESKEY_CRM_meta_clone_max and/or OCF_RESKEY_CRM_meta_clone change if the number of cloned copies of a resource changed? OCF_RESKEY_CRM_meta_clone_max definitely not. OCF_RESKEY_CRM_meta_clone may change but also probably not; it's just a clone sequence number. In short, there's no way to figure out the total number of clones by examining the environment
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/17/12 7:30 AM, Dejan Muhamedagic wrote: On Fri, Feb 17, 2012 at 01:15:04PM +0100, Dejan Muhamedagic wrote: On Fri, Feb 17, 2012 at 12:13:49PM +1100, Andrew Beekhof wrote: [...] What about notifications? That would be the right point to re-configure things, I'd have thought. Sounds like the right way. Still, it may be hard to coordinate between different instances. Unless we figure out how to map nodes to numbers used by the CLUSTERIP. For instance, the notify operation gets: OCF_RESKEY_CRM_meta_notify_stop_resource=ip_lb:2 OCF_RESKEY_CRM_meta_notify_stop_uname=xen-f But the instance number may not match the node number. Scratch that.

IP_CIP_FILE=/proc/net/ipt_CLUSTERIP/$OCF_RESKEY_ip
IP_INC_NO=`expr ${OCF_RESKEY_CRM_meta_clone:-0} + 1`
...
echo +$IP_INC_NO >$IP_CIP_FILE

/proc/net/ipt_CLUSTERIP/ip is where we should add the node. It should be something like:

notify() {
	if node_down; then
		echo +node_num >/proc/net/ipt_CLUSTERIP/ip
	elif node_up; then
		echo -node_num >/proc/net/ipt_CLUSTERIP/ip
	fi
}

Another issue is that the above code should be executed on _exactly_ one node. OK, I guess that'd also be doable by checking the following variables: OCF_RESKEY_CRM_meta_notify_inactive_resource (set of currently inactive instances) OCF_RESKEY_CRM_meta_notify_stop_resource (set of instances which were just stopped) Any volunteers for a patch? :) a) I have a test cluster that I can bring up and down at will; b) I'm a glutton for punishment. So I'll volunteer, since I offered to try to do something in the first place. I think I've got a handle on what to look for; e.g., one has to look for notify_type=pre and notify_operation=stop in the 'node_down' test.
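The notify() pseudo-code above can be sketched as a real shell function. This is illustrative only: instead of writing to /proc/net/ipt_CLUSTERIP/<ip>, which requires a live CLUSTERIP rule, it prints the string that would be written, and the node-up/node-down tests are reduced to the notify_type/notify_operation check discussed in the thread.

```shell
#!/bin/sh
# Illustrative sketch of the notify() logic discussed above.
# Pacemaker exports OCF_RESKEY_CRM_meta_notify_* for notify operations.
notify_sketch() {
    node_num="$1"
    type="$OCF_RESKEY_CRM_meta_notify_type"        # pre / post
    op="$OCF_RESKEY_CRM_meta_notify_operation"     # start / stop

    if [ "$type" = "pre" ] && [ "$op" = "stop" ]; then
        # A node is going away: a survivor claims its CLUSTERIP bucket.
        echo "+$node_num"
    elif [ "$type" = "post" ] && [ "$op" = "start" ]; then
        # A node came back: hand the bucket back.
        echo "-$node_num"
    fi
}

OCF_RESKEY_CRM_meta_notify_type=pre
OCF_RESKEY_CRM_meta_notify_operation=stop
notify_sketch 2    # prints "+2"
```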
Re: [Linux-HA] Writing a stonith-ng fencing agent in perl
On 2/23/12 5:42 PM, Andrew Beekhof wrote: I'll note that none of the authors of the perl-scripted fencing agents knew that arguments are passed via stdin either. I suspect you'll find they also have some magic for reading them from stdin. Both methods are supported by the agents, although when called from the cluster, by convention, we only use stdin. Yeah, you're right. I didn't see it before, because I wasn't looking for stdin. The real reason the perl-scripted fencing agents don't give the correct response to stonith-admin is that they're looking for an action=XXX parameter from stdin, when the actual parameter being passed is option=XXX. In my fence_nut script (which I'll post after I've finished my fencing tests) I allow for both. Or perhaps I'm assuming too much; they may have been written for some other package than Pacemaker 1.1. Right, they were written for cman/rgmanager originally.
Re: [Linux-HA] Writing a stonith-ng fencing agent in perl
On 2/22/12 6:20 PM, Andrew Beekhof wrote: On Thu, Feb 23, 2012 at 8:21 AM, William Seligman selig...@nevis.columbia.edu wrote: About a 1.5 years ago, I wrote a fencing agent for Pacemaker 1.0.x; it used NUT to shut down power on a UPS: http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg05942.html I'm building a new HA cluster using: Scientific Linux 6.2 (kernel 2.6.32) cman-3.0.12.1 corosync-1.4.1 pacemaker-1.1.6 cluster-glue-1.0.5 clusterlib-3.0.12.1 fence-agents-3.1.5 My old fencing agent, written in bash, won't work with stonith-ng, What makes you think that? A number of things: My old script didn't recognize -o metadata, the XML description that Pacemaker expects has changed a bit, and the big one which I'll get to just after your response... so I wrote a replacement in perl. After much debugging, the problem appears to be that stonith-admin (or whatever library it's calling) doesn't pass any arguments to the perl script. You know they are passed via stdin? No, I did not know this! That was the key; thanks Andrew. I've revised my script, and it appears to work. I'll have to run some more tests before I post it. I'd post the script, but it's not necessary, since I see the same problem in any of the perl-scripted fencing agents in /usr/sbin/fence_* from the regular fence_agent package. If I do: stonith_admin -M -a fence_scsi stonith_admin -M -a fence_vmware_helper ... I don't see metadata, but a response equivalent to no argument. I'll note that none of the authors of the perl-scripted fencing agents knew that arguments are passed via stdin either. Or perhaps I'm assuming too much; they may have been written for some other package than Pacemaker 1.1. 
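To make the stdin convention concrete: the cluster feeds the agent key=value pairs, one per line, on standard input. A minimal, hypothetical parser that accepts both the action= and the older option= spelling (the function and variable names are mine, not taken from any real fence agent):

```shell
#!/bin/sh
# Hypothetical sketch: read fencing-agent parameters from stdin as key=value
# lines, accepting both "action" and the older "option" spelling.
parse_stdin() {
    action=""
    port=""
    while IFS='=' read -r key value; do
        case "$key" in
            action|option) action="$value" ;;
            port)          port="$value" ;;
        esac
    done
    echo "action=$action port=$port"
}

printf 'option=off\nport=hypatia-tb\n' | parse_stdin
# prints "action=off port=hypatia-tb"
```

A real agent would also fall back to command-line arguments when stdin is empty, since both methods are supported.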
[Linux-HA] Suggestion for exportfs resource
I had some problems with the monitor operation of the ocf:heartbeat:exportfs resource. I have a solution for one of them that I'd like to share with the community. The first comes from using regex-like expressions for clientspec; e.g.,

primitive ExportUsrNevis ocf:heartbeat:exportfs \
	op monitor interval="30" timeout="20" \
	params clientspec="*.nevis.columbia.edu" \
	directory="/usr/nevis" fsid="20"

For my version of nfs-utils (1.2.3), expressions like *.nevis.columbia.edu are allowed. The problem is that the monitor operation will fail, since the exportfs resource uses grep to test the result of the exportfs command:

exportfs | grep -zqs "${OCF_RESKEY_directory}[[:space:]]*${OCF_RESKEY_clientspec}"

I've attached a text file with my proposed change. It escapes any regex characters in clientspec. I had another problem for which I don't think there's a simple overall solution: if clientspec refers to a host alias. For example:

# host mail
mail.nevis.columbia.edu is an alias for franklin.nevis.columbia.edu.
franklin.nevis.columbia.edu has address 129.236.252.8

crm configure primitive ExportMail ocf:heartbeat:exportfs \
	params clientspec="mail" directory="/mail" fsid="30"

# exportfs
/mail	franklin.nevis.columbia.edu

The exportfs command canonicalizes the clientspec, so once again the monitor operation will always fail. I either have to use the canonical name in the clientspec, or omit the monitor operation. I tried to come up with simple code to get the canonical name in bash, but it gets tricky to determine both when canonicalization is needed, and how to extract it from the output of the host command in an OS-independent way.
# diff -u exportfs.ori /usr/lib/ocf/resource.d/heartbeat/exportfs
--- exportfs.ori	2012-02-17 18:31:59.518848166 -0500
+++ /usr/lib/ocf/resource.d/heartbeat/exportfs	2012-02-20 18:14:56.254199732 -0500
@@ -181,9 +181,11 @@
 exportfs_monitor ()
 {
+	# Just in case the clientspec contains regexp characters
+	CLIENTSPEC=`echo ${OCF_RESKEY_clientspec} | sed -e 's/[\*\?\[]/\\\\&/g'`
 	# grep -z matches across newlines, which is necessary as
 	# exportfs output wraps lines for long export directory names
-	exportfs | grep -zqs "${OCF_RESKEY_directory}[[:space:]]*${OCF_RESKEY_clientspec}"
+	exportfs | grep -zqs "${OCF_RESKEY_directory}[[:space:]]*${CLIENTSPEC}"
 	# Adapt grep status code to OCF return code
 	case $? in
@@ -224,7 +226,7 @@
 	fi
 	OPTIONS="-o ${OPTIONS}"
-	ocf_run exportfs -v ${OPTIONS} ${OCF_RESKEY_clientspec}:${OCF_RESKEY_directory} || exit $OCF_ERR_GENERIC
+	ocf_run exportfs -v ${OPTIONS} "${OCF_RESKEY_clientspec}:${OCF_RESKEY_directory}" || exit $OCF_ERR_GENERIC
 	# Restore the rmtab to ensure smooth NFS-over-TCP failover
 	restore_rmtab
@@ -246,7 +248,7 @@
 	# Backup the rmtab to ensure smooth NFS-over-TCP failover
 	backup_rmtab
-	ocf_run exportfs -v -u ${OCF_RESKEY_clientspec}:${OCF_RESKEY_directory}
+	ocf_run exportfs -v -u "${OCF_RESKEY_clientspec}:${OCF_RESKEY_directory}"
 	rc=$?
 	if ocf_is_true ${OCF_RESKEY_unlock_on_stop}; then
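Independent of exportfs, the idea of the patch is to backslash-escape glob/regex metacharacters before handing a string to grep, so a clientspec like *.nevis.columbia.edu matches literally. The exact sed expression in the posted diff was mangled by the archive, so this standalone sketch uses my own character class:

```shell
#!/bin/sh
# Escape characters that are special to grep's basic regular expressions,
# so a pattern like "*.nevis.columbia.edu" is matched literally.
escape_bre() {
    echo "$1" | sed 's/[][\.*^$]/\\&/g'
}

pattern=$(escape_bre "*.nevis.columbia.edu")
echo "escaped: $pattern"
echo "/usr/nevis  *.nevis.columbia.edu" | grep -qs "$pattern" && echo matched
```

In the sed replacement, \\ emits a literal backslash and & re-inserts the matched character, which is the standard way to prefix each metacharacter with a backslash.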
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/16/12 8:13 PM, Andrew Beekhof wrote:
On Fri, Feb 17, 2012 at 5:05 AM, Dejan Muhamedagic deja...@fastmail.fm wrote:
Hi,
On Wed, Feb 15, 2012 at 04:24:15PM -0500, William Seligman wrote:
On 2/10/12 4:53 PM, William Seligman wrote:

I'm trying to set up an Active/Active cluster (yes, I hear the sounds of kittens dying).

Versions:
Scientific Linux 6.2
pacemaker-1.1.6
resource-agents-3.9.2

I'm using cloned IPaddr2 resources:

primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip=129.236.252.13 cidr_netmask=32 \
        op monitor interval=30s
primitive ClusterIPLocal ocf:heartbeat:IPaddr2 \
        params ip=10.44.7.13 cidr_netmask=32 \
        op monitor interval=31s
primitive ClusterIPSandbox ocf:heartbeat:IPaddr2 \
        params ip=10.43.7.13 cidr_netmask=32 \
        op monitor interval=32s
group ClusterIPGroup ClusterIP ClusterIPLocal ClusterIPSandbox
clone ClusterIPClone ClusterIPGroup

When both nodes of my two-node cluster are running, everything looks and functions OK. From "service iptables status" on node 1 (hypatia-tb):

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0

On node 2 (orestes-tb):

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=2 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=2 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=2 hash_init=0

If I do a simple test of ssh'ing into 129.236.252.13, I see that I alternately log into hypatia-tb and orestes-tb. All is good. Now take orestes-tb offline. The iptables rules on hypatia-tb are unchanged:

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0

If I attempt to ssh to 129.236.252.13, whether or not I get in seems to be machine-dependent. On one machine I get in, from another I get a time-out. Both machines show the same MAC address for 129.236.252.13:

# arp 129.236.252.13
Address                  HWtype  HWaddress          Flags Mask  Iface
hamilton-tb.nevis.colum  ether   B1:95:5A:B5:16:79  C           eth0

Is this the way the cloned IPaddr2 resource is supposed to behave in the event of a node failure, or have I set things up incorrectly?

I spent some time looking over the IPaddr2 script. As far as I can tell, the script has no mechanism for reconfiguring iptables in the event of a change of state in the number of clones. I might be stupid -- er -- dedicated enough to make this change on my own, then share the code with the appropriate group. The change seems to be relatively simple. It would be in the monitor operation. In pseudo-code:

if ( IPaddr2 resource is already started ) then
    if ( OCF_RESKEY_CRM_meta_clone_max != OCF_RESKEY_CRM_meta_clone_max last time ||
         OCF_RESKEY_CRM_meta_clone != OCF_RESKEY_CRM_meta_clone last time )
        ip_stop
        ip_start
    fi
fi

Just changing the iptables entries should suffice, right? Besides, doing stop/start in the monitor is sort of unexpected. Another option is to add the missing node to one of the nodes which are still running (echo +n > /proc/net/ipt_CLUSTERIP/ip). But any of that would be extremely tricky to implement properly (if not impossible).

If this would work, then I'd have two questions for the experts:

- Would the values of OCF_RESKEY_CRM_meta_clone_max and/or OCF_RESKEY_CRM_meta_clone change if the number of cloned copies of a resource changed?

OCF_RESKEY_CRM_meta_clone_max definitely not. OCF_RESKEY_CRM_meta_clone may change, but also probably not; it's just a clone sequence number. In short, there's no way to figure out the total number of clones by examining the environment. Information such as membership changes doesn't trickle down to the resource instances.

What about notifications? That would be the right point to re-configure things.
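To make the "last time" comparison in the pseudo-code above concrete, here is a minimal shell sketch. It is not part of the stock IPaddr2 agent: the state-file location and the function name are my own inventions for illustration, and a real agent would keep its state under the cluster's resource tmp directory rather than /tmp.

```shell
# Sketch of the "did the clone configuration change?" test proposed above.
# STATEFILE location and function name are hypothetical.
STATEFILE="${STATEFILE:-/tmp/IPaddr2-clone-state}"

clone_config_changed() {
    # The CRM exports these two variables into the agent's environment
    # on every operation; record them and compare with the previous call.
    current="${OCF_RESKEY_CRM_meta_clone_max}:${OCF_RESKEY_CRM_meta_clone}"
    previous=""
    [ -f "$STATEFILE" ] && previous=$(cat "$STATEFILE")
    printf '%s' "$current" > "$STATEFILE"
    # Succeed (return 0) only when a previously recorded value differs.
    [ -n "$previous" ] && [ "$previous" != "$current" ]
}

# The monitor op would then do, in effect:
#   if resource_is_started && clone_config_changed; then ip_stop; ip_start; fi
```

As Dejan notes above, the catch is that these variables do not actually change on membership events, so the check would never fire; the sketch only shows where the hook would sit.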
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/10/12 4:53 PM, William Seligman wrote:

I'm trying to set up an Active/Active cluster (yes, I hear the sounds of kittens dying).

Versions:
Scientific Linux 6.2
pacemaker-1.1.6
resource-agents-3.9.2

I'm using cloned IPaddr2 resources:

primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip=129.236.252.13 cidr_netmask=32 \
        op monitor interval=30s
primitive ClusterIPLocal ocf:heartbeat:IPaddr2 \
        params ip=10.44.7.13 cidr_netmask=32 \
        op monitor interval=31s
primitive ClusterIPSandbox ocf:heartbeat:IPaddr2 \
        params ip=10.43.7.13 cidr_netmask=32 \
        op monitor interval=32s
group ClusterIPGroup ClusterIP ClusterIPLocal ClusterIPSandbox
clone ClusterIPClone ClusterIPGroup

When both nodes of my two-node cluster are running, everything looks and functions OK. From "service iptables status" on node 1 (hypatia-tb):

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0

On node 2 (orestes-tb):

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=2 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=2 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=2 hash_init=0

If I do a simple test of ssh'ing into 129.236.252.13, I see that I alternately log into hypatia-tb and orestes-tb. All is good. Now take orestes-tb offline. The iptables rules on hypatia-tb are unchanged:

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0

If I attempt to ssh to 129.236.252.13, whether or not I get in seems to be machine-dependent. On one machine I get in, from another I get a time-out. Both machines show the same MAC address for 129.236.252.13:

# arp 129.236.252.13
Address                  HWtype  HWaddress          Flags Mask  Iface
hamilton-tb.nevis.colum  ether   B1:95:5A:B5:16:79  C           eth0

Is this the way the cloned IPaddr2 resource is supposed to behave in the event of a node failure, or have I set things up incorrectly?

I spent some time looking over the IPaddr2 script. As far as I can tell, the script has no mechanism for reconfiguring iptables in the event of a change of state in the number of clones. I might be stupid -- er -- dedicated enough to make this change on my own, then share the code with the appropriate group. The change seems to be relatively simple. It would be in the monitor operation. In pseudo-code:

if ( IPaddr2 resource is already started ) then
    if ( OCF_RESKEY_CRM_meta_clone_max != OCF_RESKEY_CRM_meta_clone_max last time ||
         OCF_RESKEY_CRM_meta_clone != OCF_RESKEY_CRM_meta_clone last time )
        ip_stop
        ip_start
    fi
fi

If this would work, then I'd have two questions for the experts:

- Would the values of OCF_RESKEY_CRM_meta_clone_max and/or OCF_RESKEY_CRM_meta_clone change if the number of cloned copies of a resource changed?
- Is there some standard mechanism by which RA scripts can maintain persistent information between successive calls?

I realize there's a flaw in the logic: it risks breaking an ongoing IP connection.
But as it stands, IPaddr2 is a clonable resource but not a highly-available one. If one of N cloned copies goes down, then one out of N new network connections to the IP address will fail.

--
Bill Seligman              | Phone: (914) 591-2823
Nevis Labs, Columbia Univ  | mailto://selig...@nevis.columbia.edu
PO Box 137                 | Irvington NY 10533 USA
http://www.nevis.columbia.edu/~seligman/

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
[Linux-HA] Understanding the behavior of IPaddr2 clone
I'm trying to set up an Active/Active cluster (yes, I hear the sounds of kittens dying).

Versions:
Scientific Linux 6.2
pacemaker-1.1.6
resource-agents-3.9.2

I'm using cloned IPaddr2 resources:

primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip=129.236.252.13 cidr_netmask=32 \
        op monitor interval=30s
primitive ClusterIPLocal ocf:heartbeat:IPaddr2 \
        params ip=10.44.7.13 cidr_netmask=32 \
        op monitor interval=31s
primitive ClusterIPSandbox ocf:heartbeat:IPaddr2 \
        params ip=10.43.7.13 cidr_netmask=32 \
        op monitor interval=32s
group ClusterIPGroup ClusterIP ClusterIPLocal ClusterIPSandbox
clone ClusterIPClone ClusterIPGroup

When both nodes of my two-node cluster are running, everything looks and functions OK. From "service iptables status" on node 1 (hypatia-tb):

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0

On node 2 (orestes-tb):

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=2 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=2 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=2 hash_init=0

If I do a simple test of ssh'ing into 129.236.252.13, I see that I alternately log into hypatia-tb and orestes-tb. All is good. Now take orestes-tb offline. The iptables rules on hypatia-tb are unchanged:

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0

If I attempt to ssh to 129.236.252.13, whether or not I get in seems to be machine-dependent. On one machine I get in, from another I get a time-out. Both machines show the same MAC address for 129.236.252.13:

# arp 129.236.252.13
Address                  HWtype  HWaddress          Flags Mask  Iface
hamilton-tb.nevis.colum  ether   B1:95:5A:B5:16:79  C           eth0

Is this the way the cloned IPaddr2 resource is supposed to behave in the event of a node failure, or have I set things up incorrectly?
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
On Tue, 31 Jan 2012 00:36:23 Arnold Krille wrote:
On Tuesday 31 January 2012 00:12:52 emmanuel segura wrote:

But if you wanna implement dual primary i think you don't need promote for your drbd. Try to use clone without master/slave.

At least when you use the linbit-ra, using it without a master-clone will give you one(!) slave only. When you use a normal clone with two clones, you will get two slaves. The RA only goes primary on promote, that is when it's in master-state. => You need a master-clone of two clones with 1-2 masters to use drbd in the cluster.

If I understand Emmanuel's suggestion: the only way I know how to implement this is to create a simple clone group with lsb::drbd instead of Linbit's drbd resource, and put become-primary-on for both my nodes in drbd.conf. This might work in the short term, but I think it's risky in the long term. For example: something goes wrong and node A stoniths node B. I bring node B back up, disabling cman+pacemaker before I do so, and want to re-sync node B's DRBD partition with A. If I'm stupid (occupational hazard), I won't remember to edit drbd.conf before I do this, node B will automatically try to become primary, and probably get stonith'ed again.

Arnold: I thought that was what I was doing with these statements:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=60s role=Master \
        op stop interval=0 timeout=320 \
        op start interval=0 timeout=240
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1

That is, master-max=2 means to promote two instances to master. Did I get it wrong?
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
On Tue Jan 31 03:47:11 MST 2012 Lars Ellenberg wrote:
On Mon, Jan 30, 2012 at 05:42:34PM -0500, William Seligman wrote:

I'm trying to follow the directions for setting up a dual-primary DRBD setup with CMAN and Pacemaker. I'm stuck at an annoying spot: Pacemaker won't promote the DRBD resources to primary at either node. Here's the result of crm_mon:

Last updated: Mon Jan 30 17:07:03 2012
Stack: cman
Current DC: hypatia-tb - partition with quorum
Version: 1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, unknown expected votes
2 Resources configured.

Online: [ orestes-tb hypatia-tb ]
Master/Slave Set: AdminClone [AdminDrbd]
     Slaves: [ hypatia-tb orestes-tb ]

crm configure show:

node hypatia-tb
node orestes-tb
primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=60s role=Master \

You are missing an additional monitor op for role=Slave; make sure it has a different interval than the one for role=Master, e.g.

        op monitor interval=59s role=Slave \

        op stop interval=0 timeout=320 \
        op start interval=0 timeout=240

I put that in, but it didn't change my basic problem: neither instance of AdminDrbd is promoted on either node.

primitive Clvmd lsb:clvmd
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
clone ClvmdClone Clvmd
colocation ClvmdWithAdmin inf: ClvmdClone AdminClone:Master
order AdminBeforeClvmd inf: AdminClone:promote ClvmdClone:start
property $id=cib-bootstrap-options \
        dc-version=1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \
        cluster-infrastructure=cman \
        stonith-enabled=false

Also remember that, for dual-primary DRBD, working and tested fencing both on cluster level (stonith) and on drbd level (fence-peer) is mandatory. Unless you don't care for data integrity.

I'll get to that. I'm just starting out on this configuration. I don't want to put in STONITH just yet, otherwise I'll have to do recovery after every typo. I'll put in STONITH and test it when I get to installing the KVM resources. But until I solve this problem, I can't get to that stage.

DRBD looks OK:

# cat /proc/drbd
version: 8.4.0 (api:1/proto:86-100)
GIT-hash: 28753f559ab51b549d16bcf487fe625d5919c49c build by gardner@, 2012-01-25 19:10:28
 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Any clues as to what I can look at to track the source of the problem?
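For reference, Lars's point about DRBD-level fencing usually translates into a drbd.conf fragment along these lines. This is a sketch, not the poster's actual configuration: the resource name `admin` is taken from the configs above, and the handler scripts shown are the ones normally shipped with DRBD 8.4's pacemaker integration.

```
resource admin {
  disk {
    # Suspend I/O and call the fence-peer handler when the peer is lost.
    fencing resource-and-stonith;
  }
  handlers {
    # These scripts place/remove a location constraint in the CIB so that
    # Pacemaker will not promote a node with potentially outdated data.
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```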
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
On 1/31/12 3:47 PM, emmanuel segura wrote:

William try to follow the suggestion of Arnold. In my case it's different because we don't use drbd, we are using SAN with ocfs2. But i think for drbd in dual primary you need the attribute master-max=2.

I did, or thought I did. Have I missed something? Again, from crm configure show:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=60s role=Master \
        op monitor interval=59s role=Slave \
        op stop interval=0 timeout=320 \
        op start interval=0 timeout=240
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1

Still no promotion to primary on either node.
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
On 1/31/12 4:11 PM, emmanuel segura wrote:

William can you try like this:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=60s role=Master
clone Adming AdminDrbd

Both Arnold and Lars said this wouldn't work. I just tried it. They were right.

Is there anything at all to the log message:

Jan 31 16:20:54 orestes-tb lrmd: [12231]: info: RA output: (AdminDrbd:1:monitor:stderr) Could not map uname=orestes-tb.nevis.columbia.edu to a UUID: The object/attribute does not exist

That's been syslog'ed every 59 seconds since I updated AdminDrbd as Lars suggested:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=60s role=Master \
        op monitor interval=59s role=Slave \
        op stop interval=0 timeout=320 \
        op start interval=0 timeout=240
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
On 1/31/12 4:42 PM, Lars Ellenberg wrote:
On Tue, Jan 31, 2012 at 04:26:44PM -0500, William Seligman wrote:
On 1/31/12 4:11 PM, emmanuel segura wrote:

William can you try like this:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=60s role=Master
clone Adming AdminDrbd

Both Arnold and Lars said this wouldn't work. I just tried it. They were right. Is there anything at all to the log message:

Jan 31 16:20:54 orestes-tb lrmd: [12231]: info: RA output: (AdminDrbd:1:monitor:stderr) Could not map uname=orestes-tb.nevis.columbia.edu to a UUID: The object/attribute does not exist

Hmmm. That message comes from cib_utils.c, probably via crm_master, which is a wrapper around crm_attribute. It should not happen. Looks like parts of the system do not agree whether to use orestes-tb only, or orestes-tb.nevis.columbia.edu... And if the resource agent is unable to set a master score, pacemaker will not even try to promote. What does uname -n say? Does it list the node name only, or the FQDN?

# uname -n
orestes-tb.nevis.columbia.edu

Aha! I went to /etc/cluster/cluster.conf, and changed all the host names to the FQDN. It works!

Master/Slave Set: AdminClone [AdminDrbd]
     Masters: [ hypatia-tb.nevis.columbia.edu orestes-tb.nevis.columbia.edu ]

Lars is the man! And I am a fool for not reading this web page closely enough:

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08s02s02.html

In the example, they just use the node name, but it clearly states to use the output from 'uname -n' in cluster.conf. I guess on their Linux distro uname -n returns just the node name. Thanks!
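Since the root cause was a mismatch between `uname -n` and the names in cluster.conf, a tiny pre-flight check along these lines would have caught it. This is my own sketch: `node_name_in_conf` is a made-up helper, and the grep is deliberately crude, not an XML parser.

```shell
# Hypothetical sanity check: is the local node name, exactly as uname -n
# reports it, present as a clusternode name in cluster.conf?
node_name_in_conf() {
    # $1 = node name, $2 = path to a cluster.conf file
    # Matches the full attribute value, so a short name will not
    # falsely match an FQDN entry (or vice versa).
    grep -q "clusternode name=\"$1\"" "$2"
}

# Usage on a cluster node might look like:
#   node_name_in_conf "$(uname -n)" /etc/cluster/cluster.conf \
#       || echo "WARNING: $(uname -n) not listed in cluster.conf"
```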
[Linux-HA] Trying to compile pacemaker-pygui
I'm trying to figure out which versions of which files I need to compile pacemaker-pygui. I didn't have (too) much trouble with Scientific Linux 5.5 (=RHEL5.5), but I'm having problems with Scientific Linux 6.1 (=RHEL6.1).

Versions:

corosync.x86_64                 1.2.3-36.el6
corosynclib.x86_64              1.2.3-36.el6
corosynclib-devel.x86_64        1.2.3-36.el6
cluster-glue.x86_64             1.0.5-2.el6
cluster-glue-libs.x86_64        1.0.5-2.el6
cluster-glue-libs-devel.x86_64  1.0.5-2.el6
clusterlib.x86_64               3.0.12-41.el6
pacemaker.x86_64                1.1.5-5.el6
pacemaker-libs.x86_64           1.1.5-5.el6
pacemaker-libs-devel.x86_64     1.1.5-5.el6

Kernel: 2.6.32-220.4.1.el6.x86_64

When I try tip.tar.bz2 (or 4186ac0c02b5.tar.bz2; the same file), the compilation fails with:

mgmt_crm.c: In function 'on_get_rsc_status':
mgmt_crm.c:1512: error: 'pe_rsc_failure_ignored' undeclared (first use in this function)

When I try efff2a4588e5.tar.bz2, which I found by Googling on this error, I get:

mgmt_crm.c: In function 'on_get_crm_metadata':
mgmt_crm.c:1019: error: 'CRM_DAEMON_DIR' undeclared (first use in this function)

What files/versions do I need to compile the GUI?
[Linux-HA] cman+pacemaker+dual-primary drbd does not promote
I'm trying to follow the directions for setting up a dual-primary DRBD setup with CMAN and Pacemaker. I'm stuck at an annoying spot: Pacemaker won't promote the DRBD resources to primary at either node. Here's the result of crm_mon:

Last updated: Mon Jan 30 17:07:03 2012
Stack: cman
Current DC: hypatia-tb - partition with quorum
Version: 1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, unknown expected votes
2 Resources configured.

Online: [ orestes-tb hypatia-tb ]
Master/Slave Set: AdminClone [AdminDrbd]
     Slaves: [ hypatia-tb orestes-tb ]

/etc/cluster/cluster.conf:

<cluster config_version="6" name="Nevis_HA">
  <logging debug="off"/>
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="hypatia-tb" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="hypatia-tb"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="orestes-tb" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="orestes-tb"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
  <!-- <fence_daemon post_join_delay="30"/> -->
</cluster>

crm configure show:

node hypatia-tb
node orestes-tb
primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=60s role=Master \
        op stop interval=0 timeout=320 \
        op start interval=0 timeout=240
primitive Clvmd lsb:clvmd
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
clone ClvmdClone Clvmd
colocation ClvmdWithAdmin inf: ClvmdClone AdminClone:Master
order AdminBeforeClvmd inf: AdminClone:promote ClvmdClone:start
property $id=cib-bootstrap-options \
        dc-version=1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \
        cluster-infrastructure=cman \
        stonith-enabled=false

DRBD looks OK:

# cat /proc/drbd
version: 8.4.0 (api:1/proto:86-100)
GIT-hash: 28753f559ab51b549d16bcf487fe625d5919c49c build by gardner@, 2012-01-25 19:10:28
 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

I can manually do "drbdadm primary admin" on both nodes and get a Primary/Primary state. That still does not get Pacemaker to promote the resource. The only vaguely relevant lines in /var/log/messages seem to be:

Jan 30 17:38:13 hypatia-tb lrmd: [11260]: info: RA output (AdminDrbd:0:start:stdout)
Jan 30 17:38:13 hypatia-tb lrmd: [11260]: info: RA output: (AdminDrbd:0:start:stderr) Could not map uname=hypatia-tb.nevis.columbia.edu to a UUID: The object/attribute does not exist
Jan 30 17:38:13 hypatia-tb lrmd: [11260]: info: RA output (AdminDrbd:0:start:stdout)

I've tried running with iptables both on and off, and the results are the same. Any clues?