[Linux-HA] Backing out of HA
I'm about to write a transition plan for getting rid of high-availability on our lab's cluster. Before I do that, I thought I'd put my reasons before this group so that: a) people can exclaim "You fool!" and point out all the stupid things I did wrong; b) sysadmins who are contemplating the switch to HA have additional points to add to the pros and cons.

The basic reason why we want to back out of HA is that, in the three years since I implemented an HA cluster at the lab, we have not had a single hardware problem for which HA would have been useful. However, we've had many instances of lab-wide downtime due to the HA configuration.

Description: two-node cluster, Scientific Linux 6.2 (=RHEL6.2), cman+clvmd+pacemaker, dedicated Ethernet ports for DRBD traffic. I've had both primary/secondary and dual-primary configurations. The resources pacemaker manages include DRBD and VMs with configuration files and virtual disks on the DRBD partition. Detailed package versions and configurations are at the end of this post.

Here are some examples of our difficulties. This is not an exhaustive list.

Mystery crashes

I'll mention this one first because it's the most recent, and it was the straw that broke the camel's back as far as the users were concerned. Last week, cman crashed, and the cluster stopped working. There was no clear message in the logs indicating why. I had no time for archeology, since the crash happened in the middle of our working day; I rebooted everything and cman started up again just fine.

Problems under heavy server load

Let's call the two nodes on the cluster A and B. Node A starts running a process that does heavy disk writes to the shared DRBD volume. The load on A starts to rise. The load on B rises too, more slowly, because the same blocks must be written to node B's disk. Eventually the load on A grows so great that cman+clvmd+pacemaker does not respond promptly, and node B stoniths node A.
The problem is that the DRBD partition on node B is then marked Inconsistent. All the other resources in the pacemaker configuration depend on DRBD, so none of them are allowed to run. The cluster stays in this non-working state (node A powered off, node B not running any resources) until I manually intervene.

Poisoned resource

This is the one you can directly attribute to my stupidity. I add a new resource to the pacemaker configuration. Even though the pacemaker configuration is syntactically correct, and even though I think I've tested it, in fact the resource cannot run on either node. The most recent example: I created a new virtual domain and tested it. It worked fine. I created the ocf:heartbeat:VirtualDomain resource, verified that crm could parse it, and activated the configuration. However, I had not actually created the domain for the virtual machine; I had typed "virsh create ..." but not "virsh define". So I had a resource that could not run.

What I'd want to happen is for the poisoned resource to fail, for me to see lots of error messages, and for the remaining resources to continue to run. What actually happens is that the resource tries to run on both nodes alternately an effectively infinite number of times (10^6 times, or whatever the value is). Then one of the nodes stoniths the other. The poisoned resource still won't run on the remaining node, so that node tries restarting all the other resources in the pacemaker configuration. That still won't work. By this time, usually one of the other resources has failed (possibly because it's not designed to be restarted so frequently), and the cluster is in a non-working state until I manually intervene.

In this particular case, had we not been running HA, the only problem would have been that the incorrectly-initialized domain would not have come up after a system reboot. With HA, my error crashed the cluster.

Let me be clear: I do not claim that HA is without value.
My only point is that for our particular combination of hardware, software, and available sysadmin support (me), high-availability has not been a good investment. I also acknowledge that I haven't provided logs for these problems to corroborate any of the statements I've made. I'm sharing the problems I've had, but at this point I'm not asking for fixes.

Turgid details:

# rpm -q kernel drbd pacemaker cman lvm2 lvm2-cluster resource-agents
kernel-2.6.32-220.4.1.el6.x86_64
drbd-8.4.1-1.el6.x86_64
pacemaker-1.1.6-3.el6.x86_64
cman-3.0.12.1-23.el6.x86_64
lvm2-2.02.87-7.el6.x86_64
lvm2-cluster-2.02.87-7.el6.x86_64
resource-agents-3.9.2-7.el6.x86_64

/etc/cluster/cluster.conf: http://pastebin.com/qRAxLpkx
/etc/lvm/lvm.conf: http://pastebin.com/tLyZd09i
/etc/drbd.d/global_common.conf: http://pastebin.com/H8Kfi2tM
/etc/drbd.d/admin.res: http://pastebin.com/1GWupJz8
output of "crm configure show": http://pastebin.com/wJaX3Msn
output of "crm configure show xml": http://pastebin.com/gyUUb2hi

--
William Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | PO Box 137
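As an aside on the "poisoned resource" failure mode above: one common mitigation (a sketch only; the resource name and VM path here are hypothetical, not the poster's configuration) is to cap the retry count with resource meta-attributes, so a resource that can never start is eventually left stopped on a node instead of bouncing until STONITH fires:

```shell
# Illustrative crm shell fragment: limit how often Pacemaker retries a
# failing resource on a node before banning it from that node.
crm configure primitive TestVM ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/testvm.xml" \
    op monitor interval="30s" \
    meta migration-threshold="3" failure-timeout="600s"
# migration-threshold=3: after three failures the resource may no longer
# run on that node. failure-timeout lets the failure count expire, so
# the ban is not permanent once the underlying problem is fixed.
```

With a finite migration-threshold on every resource, a mis-defined resource fails a few times on each node and then stops, leaving the rest of the cluster running.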
Re: [Linux-HA] exportfs problems
On 1/4/13 7:10 PM, Matthew Spah wrote: Hey everyone, I've just recently built up a pacemaker cluster and have begun testing it. Everything has been going great until after Christmas break. I fired up the cluster to find this going on:

Last updated: Fri Jan 4 16:06:41 2013
Last change: Fri Jan 4 16:02:13 2013 via crmd on emserver1
Stack: openais
Current DC: emserver1 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
9 Resources configured.

Online: [ emserver1 emserver2 ]

Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
    Masters: [ emserver2 ]
    Slaves: [ emserver1 ]
Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
    Started: [ emserver1 emserver2 ]
Resource Group: g_nfs
    p_fs_nfs (ocf::heartbeat:Filesystem): Started emserver2
    p_exportfs_nfs (ocf::heartbeat:exportfs): Started emserver2 (unmanaged) FAILED
    p_ip_nfs (ocf::heartbeat:IPaddr2): Stopped
Clone Set: cl_exportfs_root [p_exportfs_root]
    Started: [ emserver2 ]
    Stopped: [ p_exportfs_root:1 ]

Failed actions:
    p_exportfs_root:0_start_0 (node=emserver1, call=10, rc=-2, status=Timed Out): unknown exec error
    p_exportfs_root:1_monitor_3 (node=emserver2, call=11, rc=7, status=complete): not running
    p_exportfs_nfs_stop_0 (node=emserver2, call=39, rc=-2, status=Timed Out): unknown exec error

I've been reading through documentation to figure out what is going on. If you guys could point me in the right direction that would be a huge help. :) Here is my configuration...
node emserver1
node emserver2
primitive p_drbd_nfs ocf:linbit:drbd \
    params drbd_resource=r0 \
    op monitor interval=15 role=Master \
    op monitor interval=30 role=Slave
primitive p_exportfs_nfs ocf:heartbeat:exportfs \
    params fsid=1 directory=/srv/nfs options=rw,crossmnt clientspec=10.1.10.0/255.255.255.0 \
    op monitor interval=30s
primitive p_exportfs_root ocf:heartbeat:exportfs \
    params fsid=0 directory=/srv options=rw,crossmnt clientspec=10.1.10.0/255.255.255.0 \
    op monitor interval=30s
primitive p_fs_nfs ocf:heartbeat:Filesystem \
    params device=/dev/drbd1 directory=/srv/nfs fstype=ext3 \
    op monitor interval=10s
primitive p_ip_nfs ocf:heartbeat:IPaddr2 \
    params ip=10.1.10.10 cidr_netmask=24 iflabel=NFSV_IP \
    op monitor interval=30s
primitive p_lsb_nfsserver lsb:nfs-kernel-server \
    op monitor interval=30s
group g_nfs p_fs_nfs p_exportfs_nfs p_ip_nfs
ms ms_drbd_nfs p_drbd_nfs \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
clone cl_exportfs_root p_exportfs_root
clone cl_lsb_nfsserver p_lsb_nfsserver
colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master
colocation c_nfs_on_root inf: g_nfs cl_exportfs_root
order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start
order o_root_before_nfs inf: cl_exportfs_root g_nfs:start
property $id=cib-bootstrap-options \
    dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \
    cluster-infrastructure=openais \
    expected-quorum-votes=2 \
    stonith-enabled=false \
    no-quorum-policy=ignore \
    maintenance-mode=false \
    last-lrm-refresh=1357344133
rsc_defaults $id=rsc-options \
    resource-stickiness=200

I've had problems like this with the exportfs resource. Here are some things to check:

- You didn't list the software versions. In particular, look at the version of your resource-agents package. There have been some recent changes to the ocf:heartbeat:exportfs script that improve the pattern-matching in its monitor action.
- The ocf:heartbeat:exportfs monitor works by comparing the clientspec parameter with the output of the exportfs command. Check, when you export to 10.1.10.0, that the output of exportfs returns exactly that string, instead of a resolved name.

It may help to give a concrete example: I exported a partition via ocf:heartbeat:exportfs with clientspec=mail.nevis.columbia.edu. The monitor action always failed, until I realized that mail.nevis.columbia.edu was an alias for franklin.nevis.columbia.edu; that was the name that appeared in the output of /usr/sbin/exportfs.

Hope this helps.

--
William Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | PO Box 137 | Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
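To make the alias pitfall concrete, here is a toy reproduction (the strings are illustrative; the real agent's matching is more elaborate) of the literal string comparison the monitor action performs:

```shell
# The exportfs monitor effectively greps the `exportfs` output for the
# configured clientspec. If the kernel canonicalizes an alias to its
# real hostname, the literal match fails and the monitor reports
# "not running" even though the export is live.
clientspec="mail.nevis.columbia.edu"                 # what the resource says
actual="/exports/mail  franklin.nevis.columbia.edu"  # what exportfs prints
if printf '%s\n' "$actual" | grep -qF "$clientspec"; then
    echo "monitor: running"
else
    echo "monitor: not running"
fi
```

With the alias in clientspec this prints "monitor: not running", which Pacemaker treats as a failed monitor; switching clientspec to the canonical name makes the match succeed.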
Re: [Linux-HA] IP Clone
On 8/20/12 6:54 PM, Yount, William D wrote: No, no complaining. Just glad to get a definitive answer on it. Active/Active made me think something that I guess isn't true. No worries. Honestly, thanks for the reply. Without you, I would have kept trying and trying and trying.

-Original Message-
From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Dimitri Maziuk
Sent: Monday, August 20, 2012 5:50 PM
To: linux-ha@lists.linux-ha.org
Subject: Re: [Linux-HA] IP Clone

On 08/20/2012 05:01 PM, Yount, William D wrote: I am trying to set up an Active/Active cluster. I have an Active/Passive cluster up and running.

I don't remember seeing a clear explanation of when, where, and why you'd actually want an active/active cluster. I never needed one myself, so can't really help you there.

I don't understand how it could be called an Active/Active cluster if you aren't allowed to run the IP address on two servers at once.

You are not allowed to run the IP address on two servers at once, full stop. Complain to Rob Kahn and Vint Cerf.

For what it's worth, I run an Active/Active cluster (probably for all the wrong reasons). IP cloning works fine for me. Here's my setup:

primitive IP_cluster ocf:heartbeat:IPaddr2 \
    params ip=129.236.252.11 cidr_netmask=32 nic=eth0 \
    op monitor interval=30s \
    meta resource-stickiness=0
clone IPClone IP_cluster \
    meta globally-unique=true clone-max=2 clone-node-max=2 \
    interleave=false target-role=Started

Pretty much the canonical version from "Clusters From Scratch". Here's what I've noticed:

- I needed iptables running to make this work.
- This gave me a consistent MAC address for the cluster IP address of 129.236.252.11, improving the availability of the connection.
- I didn't see much load balancing after the first time I set it up. Mostly both clone instances run on a single node of my two-node cluster. For my needs, that's OK, since for me load-balancing is a much lower priority than availability.
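For context on the "I needed iptables running" observation above: a globally-unique IPaddr2 clone is implemented with the iptables CLUSTERIP target, which is also what pins the single (multicast) MAC address for the shared IP. A rule along these lines, roughly what the agent installs (the clustermac and node numbers here are illustrative, not taken from the poster's setup), shows why iptables has to be available:

```shell
# Illustrative CLUSTERIP rule of the kind ocf:heartbeat:IPaddr2 creates
# for a globally-unique clone; the agent generates the actual values.
# Both nodes answer ARP with the same multicast MAC, and CLUSTERIP's
# source-IP hash decides which node processes each client's packets.
iptables -I INPUT -d 129.236.252.11 -i eth0 -j CLUSTERIP --new \
    --hashmode sourceip \
    --clustermac 11:22:33:44:55:66 \
    --total-nodes 2 --local-node 1
```

The sourceip hash also explains the weak load balancing observed: clients are partitioned by source address, not by load, so a small client population can easily land entirely on one node.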
Re: [Linux-HA] exportfs with multiple client ACLs
On 5/25/12 5:07 PM, Seth Galitzer wrote: We have been using NIS netgroups to specify export options based on host membership as specified in the /etc/netgroup file. Some exports may have multiple export specs based on their netgroups, e.g. one group should have root-squashing enabled whereas another should not. If I'm using /etc/exports, I just add another line onto the spec. With pacemaker, this is not possible, so the suggestion I received was to simply add multiple exportfs resources to accomplish this.

What I am finding is that I am getting erratic behavior, in that export options seem to be randomly overridden. So hosts that should not be getting root-squashed still are. From my testing, it does not seem to be a matter of "last one wins": if the root-squashed resource is running at all, whether started before or after the non-root-squashed resource, then all hosts are root-squashed.

Is anybody else trying to do something like this? If so, how do you specify multiple export rules for different hosts or host groups? I'm using the ocf:heartbeat:exportfs service. Is this ignoring netgroup specs for some reason, or is there something else going on here? My /etc/nsswitch.conf looks correct, as far as NIS goes. I'm running pacemaker 1.1.7 from official packages on debian wheezy. Kernel version is 3.2.0 and nfsd is 1.2.5, also from official packages. Any advice is appreciated. I can provide crm dumps and other configs if needed. Thanks. Seth

I haven't had that problem, though I don't think I use as many multiple host specs as you do. Here's a quick check: after all your exportfs resources are running, look at the output of the exportfs command on the node running the resources. Is the result what you expect?
Re: [Linux-HA] Can /var/lib/pengine files be deleted at boot?
On 5/16/12 6:09 AM, Lars Marowsky-Bree wrote:

On 2012-05-15T13:17:11, William Seligman selig...@nevis.columbia.edu wrote: I can post details and logs and whatnot, but I don't think I need to do detailed debugging. My question is:

I don't think your rationale holds true, though. Like Andrew said, this is only ever just written, not read.

So what I really need to learn is how to understand the pengine state enough to issue some sort of correction. In my case, I think "crm resource cleanup resource-name" was sufficient. So much to learn! So little time!

If I were to set up a procedure to delete the contents of /var/lib/pengine at system boot, would that cause any problems for Pacemaker? Is that state information necessary for the successful startup of the pacemaker service at system start, or can I remove them before pacemaker starts to prevent problems like this in the future?

It won't affect pacemaker, but you're hurting debuggability.
[Linux-HA] Can /var/lib/pengine files be deleted at boot?
I've had some problems with my Linux pacemaker cluster recently. I traced the problem to what I believe is incorrect state information that was saved in directory /var/lib/pengine. I can post details and logs and whatnot, but I don't think I need to do detailed debugging. My question is: If I were to set up a procedure to delete the contents of /var/lib/pengine at system boot, would that cause any problems for Pacemaker? Is that state information necessary for the successful startup of the pacemaker service at system start, or can I remove them before pacemaker starts to prevent problems like this in the future?
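If one did go ahead with this (bearing in mind the warning in the reply that it costs debuggability), the boot-time cleanup could be as small as the following init-script fragment. The path is the RHEL6-era default, and the process-name guard is illustrative:

```shell
# Sketch of a boot-time cleanup, run before the pacemaker service
# starts. The pe-*.bz2 files are the policy engine's saved inputs;
# per this thread, they are only ever written for post-mortem
# debugging, never read back at startup.
if ! pgrep -x pengine >/dev/null 2>&1; then
    rm -f /var/lib/pengine/pe-*.bz2
fi
```

A gentler alternative is to leave the files alone and cap how many are kept (pacemaker has cluster options for the pe-*-series file limits), which preserves some history for debugging.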
Re: [Linux-HA] HA samba?
On 4/25/12 4:53 PM, Seth Galitzer wrote: Can anybody point me to recent docs on how to go about setting this up? I've found several much older posts, but not much current with any kind of helpful detail. This one has a couple of good tips, but doesn't have much depth: http://linux-ha.org/wiki/Samba This one has a lot of detail, but do I really need GFS and CTDB if both nodes just get their locking data from a common shared FS?: http://techwithjim.blogspot.com/2012/04/high-availability-windows-share-using.html I should note that I'm using DRBD+LVM for my node shared storage and also exporting FS shares via NFS (I run heterogeneous systems here with both Linux and Windows clients, so need both available).

Are you running DRBD+LVM primary-secondary or primary-primary?

If it's the former, I suggest using the configuration described in "Clusters From Scratch": http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ the only difference being that instead of running Apache you'd run Samba and NFS. If you're exporting your filesystems read/write, I think that's the recommended configuration.

I'm running primary-primary and exporting filesystems via NFS (I'm running Samba too, but inside a KVM virtual machine exporting its internal filesystem). However, I'm exporting them read-only.
Re: [Linux-HA] problem with nfs and exportfs failover
On 4/14/12 5:55 AM, emmanuel segura wrote: Maybe the problem is the primitive "nfsserver lsb:nfs-kernel-server"; I think this primitive was stopped before "exportfs-admin ocf:heartbeat:exportfs". And if I remember, lsb:nfs-kernel-server and the exportfs agent do the same thing: the first uses the OS scripts and the second the cluster agents.

Now that Emmanuel has reminded me, I'll offer two more tips based on advice he's given me in the past:

- You can deal with the issue he raises directly by putting additional constraints in your setup, something like:

colocation fs-homes-nfsserver inf: group-homes clone-nfsserver
order nfsserver-before-homes inf: clone-nfsserver group-homes

That will make sure that all the group-homes resources (including exportfs-admin) will not be run unless an instance of nfsserver is already running on that node.

- There's a more fundamental question: Why are you placing the start/stop of your NFS server on both nodes under pacemaker control? Why not have the NFS server start at system startup on each node? The only reason I see for putting NFS under Pacemaker control is if there are entries in your /etc/exports file (or the Debian equivalent) that won't work unless other Pacemaker-controlled resources are running, such as DRBD. If that's the case, you're better off controlling them with Pacemaker exportfs resources, the same as you're doing with exportfs-admin, instead of /etc/exports entries.

On 14 April 2012 01:50, William Seligman selig...@nevis.columbia.edu wrote: On 4/13/12 7:18 PM, William Seligman wrote: On 4/13/12 6:42 PM, Seth Galitzer wrote: In attempting to build a nice clean config, I'm now in a state where exportfs never starts. It always times out and errors. "crm config show" is pasted here: http://pastebin.com/cKFFL0Xf syslog after an attempted restart here: http://pastebin.com/CHdF21M4 Only IPs have been edited.

It's clear that your exportfs resource is timing out for the admin resource.
I'm no expert, but here are some stupid exportfs tricks to try:

- Check your /etc/exports file (or whatever the equivalent is in Debian; "man exportfs" will tell you) on both nodes. Make sure you're not already exporting the directory when the NFS server starts.

- Take out the exportfs-admin resource. Then try doing things manually:

# exportfs x.x.x.0/24:/exports/admin

Assuming that works, then look at the output of just

# exportfs

The clientspec reported by exportfs has to match the clientspec you put into the resource exactly. If exportfs is canonicalizing or reporting the clientspec differently, the exportfs monitor won't work. If this is the case, change the clientspec parameter in exportfs-admin to match. If the output of exportfs has any results that span more than one line, then you've got the problem that the patch I referred you to (quoted below) is supposed to fix. You'll have to apply the patch to your exportfs resource.

Wait a second; I completely forgot about this thread that I started: http://www.gossamer-threads.com/lists/linuxha/users/78585 The solution turned out to be removing the .rmtab files from the directories I was exporting, deleting and then re-creating (touching) /var/lib/nfs/rmtab (you'll have to look up the Debian location), and adding rmtab_backup=none to all my exportfs resources. Hopefully there's a solution for you in there somewhere!

On 04/13/2012 01:51 PM, William Seligman wrote: On 4/13/12 12:38 PM, Seth Galitzer wrote: I'm working through this howto doc: http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf and am stuck at section 4.4. When I put the primary node in standby, it seems that NFS never releases the export, so it can't shut down, and thus can't get started on the secondary node. Everything up to that point in the doc works fine and fails over correctly. But once I add the exportfs resource, it fails. I'm running this on debian wheezy with the included standard packages, not custom. Any suggestions?
I'd be happy to post configs and logs if requested.

Yes, please post the output of "crm configure show", the output of exportfs while the resource is running properly, and the relevant sections of your log file. I suggest using pastebin.com, to keep mailboxes from filling up with walls of text.

In case you haven't seen this thread already, you might want to take a look: http://www.gossamer-threads.com/lists/linuxha/dev/77166 And the resulting commit: https://github.com/ClusterLabs/resource-agents/commit/5b0bf96e77ed3c4e179c8b4c6a5ffd4709f8fdae (Links courtesy of Lars Ellenberg.)

The problem and patch discussed in those links don't quite match what you describe. I mention it because I had to patch my exportfs resource (in /usr/lib/ocf/resource.d/heartbeat/exportfs on my RHEL systems) to get it to work properly in my setup.

--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137
Re: [Linux-HA] problem with nfs and exportfs failover
On 4/16/12 1:47 PM, Seth Galitzer wrote: Just a quick update. I set the wait_for_leasetime_on_stop parameter on the exportfs resource to false; now it no longer sleeps for 92 sec and the switchover is instantaneous. Now I just need to figure out how to disable nfsv4 on the server side and I should be home-free.

As you're testing this, a couple of reminders/observations:

- You're exporting /exports/admin with option rw. If your clients are actually writing to that directory, and you want to have true failover, you may need NFSv4. I suggest running a test in which you have a client do an extended write (with dd, for example) then pull the plug on coronado. Is your file or filesystem trashed when you do this?

- If you don't need your clients to be able to write to /exports/admin, you can figure out how to turn off NFSv4 (on RHEL6, this is done by passing -N 4 to nfsd, and is typically done in /etc/sysconfig/nfs). I have the following exportfs definitions on my primary-primary cluster, and my failover tests work just fine:

primitive ExportUsrNevis ocf:heartbeat:exportfs \
    description="Site-wide applications installed in /usr/nevis" \
    op start interval=0 timeout=40 \
    op stop interval=0 timeout=120 \
    params clientspec=*.nevis.columbia.edu directory=/usr/nevis fsid=20 options=ro,no_root_squash,async rmtab_backup=none

Note that I'm exporting this directory ro. If I wanted to support writes with failover (especially in a primary-primary setup!) I'd have tons more work to do.

I notice in the configuration you've posted, you haven't included fencing yet. Don't forget this! And test it as well.

On 04/16/2012 12:42 PM, Seth Galitzer wrote: I've been poking at this more over the weekend and this morning. And while your tip about rmtab was useful, it still didn't resolve the problem. I also made sure that my exports were only being handled/defined by pacemaker and not by /etc/exports.
Though for the cloned nfsserver resource to work, it seems you need an /etc/exports file to exist on the server, even if it's empty. It seems the clue as to what's going on is in this line from the log:

coronado exportfs[20325]: INFO: Sleeping 92 seconds to accommodate for NFSv4 lease expiry

If I bump up the timeout for the exportfs resource to 95 sec, then after the very long timeout, it switches over correctly. So while this is a working solution to the problem, a 95 sec timeout is a little long for my personal comfort on a live and active fileserver. Any idea what is instigating this timeout? Is it exportfs (looks that way from the log entry), nfsd, or pacemaker? If pacemaker, then where can I reduce or remove this?

I've been looking at disabling nfsv4 entirely on this server, as I don't really need it, but haven't found a solution that works yet. Tried the suggestion in this thread, but it seems to be for mounts, not nfsd, and still doesn't help: http://lists.debian.org/debian-user/2011/11/msg01585.html Though I have found that v4 is being loaded on one host but not the other. So if I can find what's different, I may be able to make that work.

coronado:~# rpcinfo -u localhost nfs
program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting
program 100003 version 4 ready and waiting

cascadia:~# rpcinfo -u localhost nfs
program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting

Any further suggestions are welcome. I'll keep poking until I find a solution. Thanks.
Seth

On 04/16/2012 11:49 AM, William Seligman wrote: On 4/14/12 5:55 AM, emmanuel segura wrote: Maybe the problem is the primitive "nfsserver lsb:nfs-kernel-server"; I think this primitive was stopped before "exportfs-admin ocf:heartbeat:exportfs". And if I remember, lsb:nfs-kernel-server and the exportfs agent do the same thing: the first uses the OS scripts and the second the cluster agents.

Now that Emmanuel has reminded me, I'll offer two more tips based on advice he's given me in the past:

- You can deal with the issue he raises directly by putting additional constraints in your setup, something like:

colocation fs-homes-nfsserver inf: group-homes clone-nfsserver
order nfsserver-before-homes inf: clone-nfsserver group-homes

That will make sure that all the group-homes resources (including exportfs-admin) will not be run unless an instance of nfsserver is already running on that node.

- There's a more fundamental question: Why are you placing the start/stop of your NFS server on both nodes under pacemaker control? Why not have the NFS server start at system startup on each node? The only reason I see for putting NFS under Pacemaker control is if there are entries in your /etc/exports file (or the Debian equivalent) that won't work unless other Pacemaker-controlled resources are running, such as DRBD. If that's the case, you're better off controlling them with Pacemaker exportfs
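For reference, the "-N 4" change discussed in this thread is a one-line edit on RHEL6/SL6 (Debian keeps the equivalent setting elsewhere; the file shown here is the RHEL6 location mentioned above):

```shell
# /etc/sysconfig/nfs (RHEL6/SL6): start nfsd without NFSv4, so exportfs
# no longer waits out the ~90 s v4 lease on stop/failover.
RPCNFSDARGS="-N 4"

# After "service nfs restart", confirm version 4 is no longer offered:
rpcinfo -u localhost nfs
```

Once v4 is gone from the rpcinfo output on both nodes, the exportfs stop action has no lease to wait for, which is the same effect wait_for_leasetime_on_stop=false achieves, but enforced at the server rather than in the resource agent.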
Re: [Linux-HA] problem with nfs and exportfs failover
On 4/13/12 12:38 PM, Seth Galitzer wrote: I'm working through this howto doc: http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf and am stuck at section 4.4. When I put the primary node in standby, it seems that NFS never releases the export, so it can't shut down, and thus can't get started on the secondary node. Everything up to that point in the doc works fine and fails over correctly. But once I add the exportfs resource, it fails. I'm running this on debian wheezy with the included standard packages, not custom. Any suggestions? I'd be happy to post configs and logs if requested.

Yes, please post the output of "crm configure show", the output of exportfs while the resource is running properly, and the relevant sections of your log file. I suggest using pastebin.com, to keep mailboxes from filling up with walls of text.

In case you haven't seen this thread already, you might want to take a look: http://www.gossamer-threads.com/lists/linuxha/dev/77166 And the resulting commit: https://github.com/ClusterLabs/resource-agents/commit/5b0bf96e77ed3c4e179c8b4c6a5ffd4709f8fdae (Links courtesy of Lars Ellenberg.)

The problem and patch discussed in those links don't quite match what you describe. I mention it because I had to patch my exportfs resource (in /usr/lib/ocf/resource.d/heartbeat/exportfs on my RHEL systems) to get it to work properly in my setup.
Re: [Linux-HA] fence_nut fencing agent - use NUT to fence via UPS
On 3/1/12 5:37 PM, William Seligman wrote: After days spent debugging a fencing issue with my cluster, I know for certain that this fencing agent works, at least for me. I'd like to contribute it to the Linux HA community.

In my cluster, the fencing mechanism is to use NUT (Network UPS Tools; http://www.networkupstools.org/) to turn off power to a node. About 1.5 years ago, I contributed a NUT-based fencing agent for Pacemaker 1.0: http://oss.clusterlabs.org/pipermail/pacemaker/2010-August/007347.html That script doesn't work for stonith-ng. So here's a new agent, written in perl, and tested under pacemaker-1.1.6 and nut-2.4.3.

I know there's a fence_apc_snmp agent that's already in resource-agents. However, that agent only works with APC devices with multiple-outlet control; it displays an error message when used with my UPSes. This script is for those who'd rather use NUT than play with SNMP MIBs.

I've made some improvements to the NUT-based fencing agent I contributed before. The changes are:

- A more rigorous approach to the error codes returned by the agent.
- Added options to delay the times between issuing a poweron/poweroff command and verifying that the UPS responds.

The revised fence_nut agent is at http://pastebin.com/sQdqWKQq.
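For readers unfamiliar with the interface such agents implement: stonith-ng hands the agent its job as key=value pairs on standard input, one per line, and the agent acts on the requested action. A minimal sketch of the parsing side (parameter names illustrative; the real fence_nut, with its NUT calls and error handling, is in the pastebin above):

```shell
# Toy parser for the stdin protocol a fencing agent speaks. A real
# agent would go on to drive NUT's upscmd and verify the UPS state.
parse_fence_args() {
    action="" port=""
    while IFS= read -r line; do
        case "$line" in
            action=*) action="${line#action=}" ;;
            port=*)   port="${line#port=}" ;;
        esac
    done
    echo "action=$action port=$port"
}

# Example invocation, shaped like what stonith-ng supplies:
printf 'action=off\nport=nevis-ups\n' | parse_fence_args
```

The "verify that the UPS responds" delay options mentioned in the post matter precisely because the off command returns before the outlet actually loses power; the agent must not report success until it has confirmed the state change.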
Re: [Linux-HA] problem with nfs and exportfs failover
On 4/13/12 6:42 PM, Seth Galitzer wrote: In attempting to build a nice clean config, I'm now in a state where exportfs never starts. It always times out and errors. "crm config show" is pasted here: http://pastebin.com/cKFFL0Xf syslog after an attempted restart here: http://pastebin.com/CHdF21M4 Only IPs have been edited.

It's clear that your exportfs resource is timing out for the admin resource. I'm no expert, but here are some stupid exportfs tricks to try:

- Check your /etc/exports file (or whatever the equivalent is in Debian; "man exportfs" will tell you) on both nodes. Make sure you're not already exporting the directory when the NFS server starts.

- Take out the exportfs-admin resource. Then try doing things manually:

# exportfs x.x.x.0/24:/exports/admin

Assuming that works, then look at the output of just

# exportfs

The clientspec reported by exportfs has to match the clientspec you put into the resource exactly. If exportfs is canonicalizing or reporting the clientspec differently, the exportfs monitor won't work. If this is the case, change the clientspec parameter in exportfs-admin to match. If the output of exportfs has any results that span more than one line, then you've got the problem that the patch I referred you to (quoted below) is supposed to fix. You'll have to apply the patch to your exportfs resource.

On 04/13/2012 01:51 PM, William Seligman wrote: On 4/13/12 12:38 PM, Seth Galitzer wrote: I'm working through this howto doc: http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf and am stuck at section 4.4. When I put the primary node in standby, it seems that NFS never releases the export, so it can't shut down, and thus can't get started on the secondary node. Everything up to that point in the doc works fine and fails over correctly. But once I add the exportfs resource, it fails. I'm running this on debian wheezy with the included standard packages, not custom. Any suggestions? I'd be happy to post configs and logs if requested.
Yes, please post the output of crm configure show, the output of exportfs while the resource is running properly, and the relevant sections of your log file. I suggest using pastebin.com, to keep mailboxes from filling up with walls of text.

In case you haven't seen this thread already, you might want to take a look: http://www.gossamer-threads.com/lists/linuxha/dev/77166 And the resulting commit: https://github.com/ClusterLabs/resource-agents/commit/5b0bf96e77ed3c4e179c8b4c6a5ffd4709f8fdae (Links courtesy of Lars Ellenberg.)

The problem and patch discussed in those links don't quite match what you describe. I mention it because I had to patch my exportfs resource (in /usr/lib/ocf/resource.d/heartbeat/exportfs on my RHEL systems) to get it to work properly in my setup.

--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137 | Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
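[Editor's note] The clientspec-matching trick above can be scripted. This is a hedged sketch, not taken from any real node: the sample exportfs output and the clientspec value are fabricated for illustration; on a live server you would capture the real output with EXPORTS="$(exportfs)".

```shell
# Fabricated sample of `exportfs` output. Note the wrapped second line,
# which is how exportfs prints a long path + clientspec pair.
EXPORTS='/exports/admin
        192.168.1.0/24'

# Join continuation lines so each export occupies one line again.
NORMALIZED=$(printf '%s\n' "$EXPORTS" | sed ':a;N;$!ba;s/\n[[:space:]]\{1,\}/ /g')

# The clientspec configured in the exportfs-admin resource (illustrative).
CLIENTSPEC='192.168.1.0/24'

# The resource monitor effectively does a comparison like this one.
if printf '%s\n' "$NORMALIZED" | grep -qF "$CLIENTSPEC"; then
    echo "clientspec matches exportfs output"
else
    echo "clientspec mismatch: change the resource's clientspec parameter"
fi
```

If the real exportfs output canonicalizes the network differently (say, a netmask instead of a prefix), the mismatch branch fires, and the fix is to copy exportfs's spelling into the resource.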
Re: [Linux-HA] problem with nfs and exportfs failover
On 4/13/12 7:18 PM, William Seligman wrote:
On 4/13/12 6:42 PM, Seth Galitzer wrote:

In attempting to build a nice clean config, I'm now in a state where exportfs never starts. It always times out and errors. crm config show is pasted here: http://pastebin.com/cKFFL0Xf syslog after an attempted restart here: http://pastebin.com/CHdF21M4 Only IPs have been edited.

It's clear that your exportfs resource is timing out for the admin resource. I'm no expert, but here are some stupid exportfs tricks to try:

- Check your /etc/exports file (or whatever the equivalent is in Debian; man exportfs will tell you) on both nodes. Make sure you're not already exporting the directory when the NFS server starts.

- Take out the exportfs-admin resource. Then try doing things manually:

  # exportfs x.x.x.0/24:/exports/admin

  Assuming that works, then look at the output of just

  # exportfs

  The clientspec reported by exportfs has to match the clientspec you put into the resource exactly. If exportfs is canonicalizing or reporting the clientspec differently, the exportfs monitor won't work. If this is the case, change the clientspec parameter in exportfs-admin to match.

- If the output of exportfs has any results that span more than one line, then you've got the problem that the patch I referred you to (quoted below) is supposed to fix. You'll have to apply the patch to your exportfs resource.

Wait a second; I completely forgot about this thread that I started: http://www.gossamer-threads.com/lists/linuxha/users/78585

The solution turned out to be to remove the .rmtab files from the directories I was exporting, delete and re-touch /var/lib/nfs/rmtab (you'll have to look up the Debian location), and add rmtab_backup=none to all my exportfs resources. Hopefully there's a solution for you in there somewhere!
On 04/13/2012 01:51 PM, William Seligman wrote:
On 4/13/12 12:38 PM, Seth Galitzer wrote:

I'm working through this howto doc: http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf and am stuck at section 4.4. When I put the primary node in standby, it seems that NFS never releases the export, so it can't shut down, and thus can't get started on the secondary node. Everything up to that point in the doc works fine and fails over correctly. But once I add the exportfs resource, it fails. I'm running this on debian wheezy with the included standard packages, not custom. Any suggestions? I'd be happy to post configs and logs if requested.

Yes, please post the output of crm configure show, the output of exportfs while the resource is running properly, and the relevant sections of your log file. I suggest using pastebin.com, to keep mailboxes from filling up with walls of text.

In case you haven't seen this thread already, you might want to take a look: http://www.gossamer-threads.com/lists/linuxha/dev/77166 And the resulting commit: https://github.com/ClusterLabs/resource-agents/commit/5b0bf96e77ed3c4e179c8b4c6a5ffd4709f8fdae (Links courtesy of Lars Ellenberg.)

The problem and patch discussed in those links don't quite match what you describe. I mention it because I had to patch my exportfs resource (in /usr/lib/ocf/resource.d/heartbeat/exportfs on my RHEL systems) to get it to work properly in my setup.
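[Editor's note] The rmtab fix described in this thread can be sketched as below. All paths are placeholders pointed at a scratch directory so the sketch is safe to run anywhere; on a live server you would operate on your real export directories and on /var/lib/nfs/rmtab (or the Debian equivalent), and add rmtab_backup=none to each exportfs resource in the cluster configuration.

```shell
# Scratch directory standing in for the real export root (placeholder).
ROOT=$(mktemp -d)
mkdir -p "$ROOT/exports/admin"
touch "$ROOT/exports/admin/.rmtab"   # stale backup left behind in an export

# 1. Remove the stale .rmtab backup files from every exported directory.
find "$ROOT/exports" -name '.rmtab' -type f -delete

# 2. Recreate an empty rmtab (real location: /var/lib/nfs/rmtab).
RMTAB="$ROOT/rmtab"
rm -f "$RMTAB" && touch "$RMTAB"

# 3. In the crm configuration, each exportfs resource additionally gets
#    the parameter rmtab_backup=none so the agent stops making backups.
ls -A "$ROOT/exports/admin"   # empty: the stale .rmtab is gone
```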
Re: [Linux-HA] question regarding KVM in HA
On 4/10/12 11:43 AM, Cristina Bulfon wrote:

We have a RH Cluster Suite to manage virtual machines with CLVM. A single virtual machine is on a logical volume, and all machines that belong to the cluster can see it. I am wondering if it is possible to have the same with pacemaker? If yes, what kind of software do I have to use other than pacemaker?

The references I used to set this up are Clusters From Scratch, especially the chapter on Active/Active: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08.html with some assistance from: https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial

Basically, to use pacemaker with clvm you'll have to use cman, so you don't entirely give up RHCS. I'm in the process of validating a cluster like this now. It may help to look at some of the threads I started in this forum to see the challenges I faced, mainly because I'm a slow learner:

http://www.gossamer-threads.com/lists/linuxha/users/78691
http://www.gossamer-threads.com/lists/linuxha/users/78469
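[Editor's note] As a sketch of how the pieces fit together in pacemaker's crm shell: the resource names, clone name, and domain config path below are invented for illustration, not taken from any real cluster. The shape follows the Clusters From Scratch pattern the post refers to: the guest is colocated with, and ordered after, the clvmd clone.

```
# Hypothetical crm fragment: a KVM guest whose disk is on a clustered LV,
# started only where the clvmd clone (ClvmdClone, assumed name) is running.
primitive Guest1 ocf:heartbeat:VirtualDomain \
        params config="/etc/libvirt/qemu/guest1.xml" \
        op monitor interval=30s timeout=30s \
        op start interval=0 timeout=120 \
        op stop interval=0 timeout=120
colocation Guest1_With_Clvmd inf: Guest1 ClvmdClone
order Clvmd_Before_Guest1 inf: ClvmdClone Guest1
```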
Re: [Linux-HA] pacemaker+drbd promotion delay
On 3/30/12 1:13 AM, Andrew Beekhof wrote:
On Fri, Mar 30, 2012 at 2:57 AM, William Seligman selig...@nevis.columbia.edu wrote:
On 3/29/12 3:19 AM, Andrew Beekhof wrote:
On Wed, Mar 28, 2012 at 9:12 AM, William Seligman selig...@nevis.columbia.edu wrote:

The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec files and versions below.

Problem: If I restart both nodes at the same time, or even just start pacemaker on both nodes at the same time, the drbd ms resource starts, but both nodes stay in slave mode. They'll both stay in slave mode until one of the following occurs:

- I manually type crm resource cleanup ms-resource-name
- 15 minutes elapse. Then the PEngine Recheck Timer is fired, and the ms resources are promoted.

The key resource definitions:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=59s role=Master timeout=30s \
        op monitor interval=60s role=Slave timeout=30s \
        op stop interval=0 timeout=100 \
        op start interval=0 timeout=240 \
        meta target-role=Master
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 \
        clone-node-max=1 notify=true interleave=true
# The lengthy definition of FilesystemGroup is in the crm pastebin below
clone FilesystemClone FilesystemGroup \
        meta interleave=true target-role=Started
colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start

Note that I stuck in target-role options to try to solve the problem; no effect.

When I look in /var/log/messages, I see no error messages or indications why the promotion should be delayed. The 'admin' drbd resource is reported as UpToDate on both nodes. There are no error messages when I force the issue with: crm resource cleanup AdminClone

It's as if pacemaker, at start, needs some kind of kick after the drbd resource is ready to be promoted.
This is not just an abstract case for me. At my site, it's not uncommon for there to be lengthy power outages that will bring down the cluster. Both systems will come up when power is restored, and I need cluster services to be available shortly afterward, not 15 minutes later. Any ideas?

Not without any logs

Sure! Here's an extract from the log: http://pastebin.com/L1ZnsQ0R Before you click on the link (it's a big wall of text), I'm used to trawling the logs. Grep is a wonderful thing :-)

At this stage it is apparent that I need to see /var/lib/pengine/pe-input-4.bz2 from hypatia-corosync. Do you have this file still?

No, so I re-ran the test. Here's the log extract from the test I did today: http://pastebin.com/6QYH2jkf Based on what you asked for from the previous extract, I think what you want from this test is pe-input-5. Just to play it safe, I copied and bunzip2'ed all three pe-input files mentioned in the log messages:

pe-input-4: http://pastebin.com/Txx50BJp
pe-input-5: http://pastebin.com/zzppL6DF
pe-input-6: http://pastebin.com/1dRgURK5

I pray to the gods of Grep that you find a clue in all of that! Here are what I think are the landmarks:

- The extract starts just after the node boots, at the start of syslog at time 10:49:21.
- I've highlighted when pacemakerd starts, at 10:49:46.
- I've highlighted when drbd reports that the 'admin' resource is UpToDate, at 10:50:10.
- One last highlight: when pacemaker finally promotes the drbd resource to Primary on both nodes, at 11:05:11.
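[Editor's note] The landmark-hunting above reduces to a grep. The log lines below are fabricated stand-ins for the pastebin extract (the real messages differ); the point is only the pattern: filter for drbd disk-state changes and promote actions to bracket the 15-minute gap.

```shell
# Fabricated syslog lines standing in for the real extract.
LOG='Mar 30 10:49:46 hypatia pacemakerd: info: main: Starting Pacemaker
Mar 30 10:50:10 hypatia kernel: block drbd0: disk( Inconsistent -> UpToDate )
Mar 30 11:05:11 hypatia crmd: info: te_rsc_command: Initiating action: promote AdminDrbd:0'

# Pull out the two landmarks that bracket the delay: DRBD becoming
# UpToDate, and pacemaker finally issuing the promote.
printf '%s\n' "$LOG" | grep -E 'UpToDate|promote'
```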
Details:

# rpm -q kernel cman pacemaker drbd
kernel-2.6.32-220.4.1.el6.x86_64
cman-3.0.12.1-23.el6.x86_64
pacemaker-1.1.6-3.el6.x86_64
drbd-8.4.1-1.el6.x86_64

Output of crm_mon after two-node reboot or pacemaker restart: http://pastebin.com/jzrpCk3i
cluster.conf: http://pastebin.com/sJw4KBws
crm configure show: http://pastebin.com/MgYCQ2JH
drbdadm dump all: http://pastebin.com/NrY6bskk
Re: [Linux-HA] pacemaker+drbd promotion delay
On 3/29/12 3:19 AM, Andrew Beekhof wrote:
On Wed, Mar 28, 2012 at 9:12 AM, William Seligman selig...@nevis.columbia.edu wrote:

The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec files and versions below.

Problem: If I restart both nodes at the same time, or even just start pacemaker on both nodes at the same time, the drbd ms resource starts, but both nodes stay in slave mode. They'll both stay in slave mode until one of the following occurs:

- I manually type crm resource cleanup ms-resource-name
- 15 minutes elapse. Then the PEngine Recheck Timer is fired, and the ms resources are promoted.

The key resource definitions:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=59s role=Master timeout=30s \
        op monitor interval=60s role=Slave timeout=30s \
        op stop interval=0 timeout=100 \
        op start interval=0 timeout=240 \
        meta target-role=Master
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 \
        clone-node-max=1 notify=true interleave=true
# The lengthy definition of FilesystemGroup is in the crm pastebin below
clone FilesystemClone FilesystemGroup \
        meta interleave=true target-role=Started
colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start

Note that I stuck in target-role options to try to solve the problem; no effect.

When I look in /var/log/messages, I see no error messages or indications why the promotion should be delayed. The 'admin' drbd resource is reported as UpToDate on both nodes. There are no error messages when I force the issue with: crm resource cleanup AdminClone

It's as if pacemaker, at start, needs some kind of kick after the drbd resource is ready to be promoted. This is not just an abstract case for me. At my site, it's not uncommon for there to be lengthy power outages that will bring down the cluster.
Both systems will come up when power is restored, and I need cluster services to be available shortly afterward, not 15 minutes later. Any ideas?

Not without any logs

Sure! Here's an extract from the log: http://pastebin.com/L1ZnsQ0R Before you click on the link (it's a big wall of text), here are what I think are the landmarks:

- The extract starts just after the node boots, at the start of syslog at time 10:49:21.
- I've highlighted when pacemakerd starts, at 10:49:46.
- I've highlighted when drbd reports that the 'admin' resource is UpToDate, at 10:50:10.
- One last highlight: when pacemaker finally promotes the drbd resource to Primary on both nodes, at 11:05:11.

Details:

# rpm -q kernel cman pacemaker drbd
kernel-2.6.32-220.4.1.el6.x86_64
cman-3.0.12.1-23.el6.x86_64
pacemaker-1.1.6-3.el6.x86_64
drbd-8.4.1-1.el6.x86_64

Output of crm_mon after two-node reboot or pacemaker restart: http://pastebin.com/jzrpCk3i
cluster.conf: http://pastebin.com/sJw4KBws
crm configure show: http://pastebin.com/MgYCQ2JH
drbdadm dump all: http://pastebin.com/NrY6bskk
Re: [Linux-HA] pacemaker+drbd promotion delay
On 3/27/12 6:12 PM, William Seligman wrote:

The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec files and versions below.

Problem: If I restart both nodes at the same time, or even just start pacemaker on both nodes at the same time, the drbd ms resource starts, but both nodes stay in slave mode. They'll both stay in slave mode until one of the following occurs:

- I manually type crm resource cleanup ms-resource-name
- 15 minutes elapse. Then the PEngine Recheck Timer is fired, and the ms resources are promoted.

The key resource definitions:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=59s role=Master timeout=30s \
        op monitor interval=60s role=Slave timeout=30s \
        op stop interval=0 timeout=100 \
        op start interval=0 timeout=240 \
        meta target-role=Master
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 \
        clone-node-max=1 notify=true interleave=true
# The lengthy definition of FilesystemGroup is in the crm pastebin below
clone FilesystemClone FilesystemGroup \
        meta interleave=true target-role=Started
colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start

Note that I stuck in target-role options to try to solve the problem; no effect.

When I look in /var/log/messages, I see no error messages or indications why the promotion should be delayed. The 'admin' drbd resource is reported as UpToDate on both nodes. There are no error messages when I force the issue with: crm resource cleanup AdminClone

It's as if pacemaker, at start, needs some kind of kick after the drbd resource is ready to be promoted. This is not just an abstract case for me. At my site, it's not uncommon for there to be lengthy power outages that will bring down the cluster. Both systems will come up when power is restored, and I need cluster services to be available shortly afterward, not 15 minutes later. Any ideas?
Details:

# rpm -q kernel cman pacemaker drbd
kernel-2.6.32-220.4.1.el6.x86_64
cman-3.0.12.1-23.el6.x86_64
pacemaker-1.1.6-3.el6.x86_64
drbd-8.4.1-1.el6.x86_64

Output of crm_mon after two-node reboot or pacemaker restart: http://pastebin.com/jzrpCk3i
cluster.conf: http://pastebin.com/sJw4KBws
crm configure show: http://pastebin.com/MgYCQ2JH
drbdadm dump all: http://pastebin.com/NrY6bskk

Well, I can't say that I've solved this one, but I have a solution: If I turn on both machines at once, there's a 15-minute delay. But if I turn on one machine, wait a couple of minutes, then turn on the other, at least the resources start promptly on the first machine. The second machine joins the cluster, but there's still a 15-minute delay until its DRBD partition is promoted by pacemaker.

The reason why DRBD is promoted on the first machine has to do with the previous issue I posted to this list: http://www.gossamer-threads.com/lists/linuxha/users/78691?do=post_view_threaded When doing the initial resource probe of the AdminLvm resource, it times out due to the one-node LVM issue I discuss in that thread. This error causes the pengine on the node to start re-probing resources and promote the DRBD partition, which in turn leads to all the other resources starting on that node.

So I have a work-around, but not a solution. I'll take what I can get!
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes - SOLVED
On 3/27/12 4:52 AM, emmanuel segura wrote:

So now your cluster is OK?

*Laughs* No! There's another problem I have to solve. But it's completely unrelated to this one. I'll work on it some more, and if I can't solve it I'll start a new thread. Thanks for asking, Emmanuel. (I want to prove I can spell your name correctly!)

On 27 March 2012 00:33, William Seligman selig...@nevis.columbia.edu wrote:
On 3/26/12 5:31 PM, William Seligman wrote:
On 3/26/12 5:17 PM, William Seligman wrote:
On 3/26/12 4:28 PM, emmanuel segura wrote:

And I suggest you start clvmd at boot time: chkconfig clvmd on

I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get:

Mounting GFS2 filesystem (/usr/nevis): invalid device path /dev/mapper/ADMIN-usr [FAILED]

... and so on, because the ADMIN volume group was never loaded by clvmd. Without a vgscan in there somewhere, the system can't see the volume groups on the drbd resource.

Wait a second... there's an ocf:heartbeat:LVM resource! Testing...

Emmanuel, you did it!

For the sake of future searches, and possibly future documentation, let me start with my original description of the problem: I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now), all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running.

I've traced why the lsb::clvmd fails: The monitor/status command includes vgdisplay, which hangs indefinitely. Therefore the monitor will always time out.
So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas?

Details: versions:
Redhat Linux 6.2 (kernel 2.6.32)
cman-3.0.12.1
corosync-1.4.1
pacemaker-1.1.6
lvm2-2.02.87
lvm2-cluster-2.02.87

The problem is that clvmd on the main node will hang if there's a substantive period of time during which the other node returns running cman but not clvmd. I never tracked down why this happens, but there's a practical solution: minimize any interval for which that would be true. To ensure this, take clvmd outside the resource manager's control:

chkconfig cman on
chkconfig clvmd on
chkconfig pacemaker on

On RHEL6.2, these services will be started in the above order; clvmd will start within a few seconds after cman.

Here's my cluster.conf http://pastebin.com/GUr0CEgZ and the output of crm configure show http://pastebin.com/f9D4Ui5Z. The key lines from the latter are:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin
primitive AdminLvm ocf:heartbeat:LVM \
        params volgrpname=ADMIN \
        op monitor interval=30 timeout=100 depth=0
primitive Gfs2 lsb:gfs2
group VolumeGroup AdminLvm Gfs2
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 \
        clone-max=2 clone-node-max=1 \
        notify=true interleave=true
clone VolumeClone VolumeGroup \
        meta interleave=true
colocation Volume_With_Admin inf: VolumeClone AdminClone:Master
order Admin_Before_Volume inf: AdminClone:promote VolumeClone:start

What I learned: If one is going to extend the example in Clusters From Scratch to include logical volumes, one must start clvmd at boot time, and include any volume groups in ocf:heartbeat:LVM resources that start before gfs2.

Note the long timeout on the ocf:heartbeat:LVM resource. This is a good idea because, during the boot of the crashed node, there'll still be an interval of a few seconds when cman will be running but clvmd won't be.
During my tests, the LVM monitor would fail if it checked during that interval with a timeout that was shorter than it took clvmd to start on the crashed node. This was annoying; all resources dependent on AdminLvm would be stopped until AdminLvm recovered (a few more seconds). Increasing the timeout avoids this.

It also means that during any recovery procedure on the crashed node for which I turn off all the services, I have to minimize the interval between the start of cman and clvmd if I've turned off services at boot; e.g.,

service drbd start
# ... and fix any split-brain problems or whatever
service cman start; service clvmd start   # put on one line
service pacemaker start

I thank everyone on this list who was patient with me as I pounded on this problem for two weeks!
[Linux-HA] pacemaker+drbd promotion delay
The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec files and versions below.

Problem: If I restart both nodes at the same time, or even just start pacemaker on both nodes at the same time, the drbd ms resource starts, but both nodes stay in slave mode. They'll both stay in slave mode until one of the following occurs:

- I manually type crm resource cleanup ms-resource-name
- 15 minutes elapse. Then the PEngine Recheck Timer is fired, and the ms resources are promoted.

The key resource definitions:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=59s role=Master timeout=30s \
        op monitor interval=60s role=Slave timeout=30s \
        op stop interval=0 timeout=100 \
        op start interval=0 timeout=240 \
        meta target-role=Master
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 \
        clone-node-max=1 notify=true interleave=true
# The lengthy definition of FilesystemGroup is in the crm pastebin below
clone FilesystemClone FilesystemGroup \
        meta interleave=true target-role=Started
colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start

Note that I stuck in target-role options to try to solve the problem; no effect.

When I look in /var/log/messages, I see no error messages or indications why the promotion should be delayed. The 'admin' drbd resource is reported as UpToDate on both nodes. There are no error messages when I force the issue with: crm resource cleanup AdminClone

It's as if pacemaker, at start, needs some kind of kick after the drbd resource is ready to be promoted. This is not just an abstract case for me. At my site, it's not uncommon for there to be lengthy power outages that will bring down the cluster. Both systems will come up when power is restored, and I need cluster services to be available shortly afterward, not 15 minutes later. Any ideas?
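[Editor's note] The 15-minute figure matches pacemaker's cluster-recheck-interval property, which defaults to 15 minutes and drives the PEngine Recheck Timer mentioned in the post. Shortening it doesn't fix the underlying promotion problem, but it does bound the delay; the value below is illustrative, not a recommendation from the thread:

```
# Sketch: make the PEngine re-evaluate the cluster state every 2 minutes
# instead of the 15-minute default (value is illustrative).
crm configure property cluster-recheck-interval="2min"
```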
Details:

# rpm -q kernel cman pacemaker drbd
kernel-2.6.32-220.4.1.el6.x86_64
cman-3.0.12.1-23.el6.x86_64
pacemaker-1.1.6-3.el6.x86_64
drbd-8.4.1-1.el6.x86_64

Output of crm_mon after two-node reboot or pacemaker restart: http://pastebin.com/jzrpCk3i
cluster.conf: http://pastebin.com/sJw4KBws
crm configure show: http://pastebin.com/MgYCQ2JH
drbdadm dump all: http://pastebin.com/NrY6bskk
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/26/12 4:28 PM, emmanuel segura wrote:

Sorry William, I can't post my config now because I'm at home, not at my job. I think it's not a problem if clvmd starts before drbd, because clvmd doesn't need any devices to start; that's the point, I hope that's clear. The introduction of pacemaker into the Red Hat cluster was intended to replace rgmanager, not the whole cluster stack. And I suggest you start clvmd at boot time: chkconfig clvmd on

I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get:

Mounting GFS2 filesystem (/usr/nevis): invalid device path /dev/mapper/ADMIN-usr [FAILED]

... and so on, because the ADMIN volume group was never loaded by clvmd. Without a vgscan in there somewhere, the system can't see the volume groups on the drbd resource.

Sorry for my bad English :-) I come from a Spanish-speaking country, and every day I speak Italian.

I'm sorry that I don't speak more languages! You're the one who's helping me; it's my task to learn and understand. Certainly your English is better than my French or Russian.

On 26 March 2012 22:04, William Seligman selig...@nevis.columbia.edu wrote:
On 3/26/12 3:48 PM, emmanuel segura wrote:

I know it's normal that fence_node doesn't work, because the fence request must be redirected to the pacemaker stonith. I think calling the cluster agents with rgmanager is a really ugly thing; I've never seen a cluster like this.

== If I understand Pacemaker Explained http://bit.ly/GR5WEY and how I'd invoke clvmd from cman http://bit.ly/H6ZbKg, the clvmd script that would be invoked by either HA resource manager is exactly the same: /etc/init.d/clvmd. ==

clvmd doesn't need to be called from rgmanager in the cluster configuration. This is the boot sequence of the Red Hat daemons: 1: cman, 2: clvmd, 3: rgmanager. If you don't want to use rgmanager, you just replace rgmanager.

I'm sorry, but I don't think I understand what you're suggesting. Do you suggest that I start clvmd at boot?
That won't work; clvmd won't see the volume groups on drbd until drbd is started and promoted to primary. May I ask you to post your own cluster.conf on pastebin.com so I can see how you do it? Along with crm configure show, if that's relevant for your cluster?

On 26 March 2012 19:21, William Seligman selig...@nevis.columbia.edu wrote:
On 3/24/12 5:40 PM, emmanuel segura wrote:

I think it's better to use clvmd with cman; I don't know why you use the lsb script of clvmd. On Red Hat, clvmd needs cman, and you are trying to run it with pacemaker. I'm not sure this is the problem, but this type of configuration is strange. I made a virtual cluster with KVM and I did not find any problems.

While I appreciate the advice, it's not immediately clear that trying to eliminate pacemaker would do me any good. Perhaps someone can demonstrate the error in my reasoning: If I understand Pacemaker Explained http://bit.ly/GR5WEY and how I'd invoke clvmd from cman http://bit.ly/H6ZbKg, the clvmd script that would be invoked by either HA resource manager is exactly the same: /etc/init.d/clvmd. If I tried to use cman instead of pacemaker, I'd be cutting myself off from the pacemaker features that cman/rgmanager does not yet have available, such as pacemaker's symlink, exportfs, and clonable IPaddr2 resources.

I recognize I've got a strange problem. Given that fence_node doesn't work but stonith_admin does, I strongly suspect that the problem is caused by the behavior of my fencing agent, not the use of pacemaker versus rgmanager, nor by how clvmd is being started.

On 24 March 2012 13:09, William Seligman selig...@nevis.columbia.edu wrote:
On 3/24/12 4:47 AM, emmanuel segura wrote:

How do you configure clvmd? With cman or with pacemaker?

Pacemaker.
Here's the output of 'crm configure show': http://pastebin.com/426CdVwN

On 23 March 2012 22:14, William Seligman selig...@nevis.columbia.edu wrote:
On 3/23/12 5:03 PM, emmanuel segura wrote:

Sorry, but I would like to know if you can show me your /etc/cluster/cluster.conf

Here it is: http://pastebin.com/GUr0CEgZ

On 23 March 2012 21:50, William Seligman selig...@nevis.columbia.edu wrote:
On 3/22/12 2:43 PM, William Seligman wrote:
On 3/20/12 4:55 PM, Lars Ellenberg wrote:
On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote:
On 3/16/12 12:12 PM, William Seligman wrote:
On 3/16/12 7:02 AM, Andreas Kurz wrote:

s- ... DRBD suspended io, most likely because of it's fencing-policy. For valid dual-primary setups you have to use resource-and-stonith policy and a working fence-peer handler. In this mode I/O is suspended until fencing of peer was succesful. Question is, why the peer does _not_ also suspend its I/O because obviously fencing
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/26/12 5:17 PM, William Seligman wrote:
On 3/26/12 4:28 PM, emmanuel segura wrote:

Sorry William, I can't post my config now because I'm at home, not at my job. I think it's not a problem if clvmd starts before drbd, because clvmd doesn't need any devices to start; that's the point, I hope that's clear. The introduction of pacemaker into the Red Hat cluster was intended to replace rgmanager, not the whole cluster stack. And I suggest you start clvmd at boot time: chkconfig clvmd on

I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get:

Mounting GFS2 filesystem (/usr/nevis): invalid device path /dev/mapper/ADMIN-usr [FAILED]

... and so on, because the ADMIN volume group was never loaded by clvmd. Without a vgscan in there somewhere, the system can't see the volume groups on the drbd resource.

Wait a second... there's an ocf:heartbeat:LVM resource! Testing...
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes - SOLVED
On 3/26/12 5:31 PM, William Seligman wrote:
On 3/26/12 5:17 PM, William Seligman wrote:
On 3/26/12 4:28 PM, emmanuel segura wrote:

And I suggest you start clvmd at boot time: chkconfig clvmd on

I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get:

Mounting GFS2 filesystem (/usr/nevis): invalid device path /dev/mapper/ADMIN-usr [FAILED]

... and so on, because the ADMIN volume group was never loaded by clvmd. Without a vgscan in there somewhere, the system can't see the volume groups on the drbd resource.

Wait a second... there's an ocf:heartbeat:LVM resource! Testing...

Emmanuel, you did it!

For the sake of future searches, and possibly future documentation, let me start with my original description of the problem: I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now), all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running.

I've traced why the lsb::clvmd fails: The monitor/status command includes vgdisplay, which hangs indefinitely. Therefore the monitor will always time out. So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas?
Details: versions: Red Hat Linux 6.2 (kernel 2.6.32), cman-3.0.12.1, corosync-1.4.1, pacemaker-1.1.6, lvm2-2.02.87, lvm2-cluster-2.02.87

The problem is that clvmd on the main node will hang if there's a substantive period of time during which the other node reports running cman but not clvmd. I never tracked down why this happens, but there's a practical solution: minimize any interval for which that would be true. To ensure this, take clvmd outside the resource manager's control:

chkconfig cman on
chkconfig clvmd on
chkconfig pacemaker on

On RHEL6.2, these services will be started in the above order; clvmd will start within a few seconds after cman. Here's my cluster.conf http://pastebin.com/GUr0CEgZ and the output of crm configure show http://pastebin.com/f9D4Ui5Z. The key lines from the latter are:

primitive AdminDrbd ocf:linbit:drbd \
    params drbd_resource=admin
primitive AdminLvm ocf:heartbeat:LVM \
    params volgrpname=ADMIN \
    op monitor interval=30 timeout=100 depth=0
primitive Gfs2 lsb:gfs2
group VolumeGroup AdminLvm Gfs2
ms AdminClone AdminDrbd \
    meta master-max=2 master-node-max=1 \
    clone-max=2 clone-node-max=1 \
    notify=true interleave=true
clone VolumeClone VolumeGroup \
    meta interleave=true
colocation Volume_With_Admin inf: VolumeClone AdminClone:Master
order Admin_Before_Volume inf: AdminClone:promote VolumeClone:start

What I learned: if one is going to extend the example in "Clusters From Scratch" to include logical volumes, one must start clvmd at boot time, and include any volume groups in ocf:heartbeat:LVM resources that start before gfs2.

Note the long timeout on the ocf:heartbeat:LVM resource. This is a good idea because, during the boot of the crashed node, there'll still be an interval of a few seconds when cman will be running but clvmd won't be. During my tests, the LVM monitor would fail if it checked during that interval with a timeout shorter than the time it took clvmd to start on the crashed node.
This was annoying; all resources dependent on AdminLvm would be stopped until AdminLvm recovered (a few more seconds). Increasing the timeout avoids this.

It also means that during any recovery procedure on the crashed node for which I've turned off the services at boot, I have to minimize the interval between the start of cman and clvmd; e.g.:

service drbd start
# ... and fix any split-brain problems or whatever
service cman start; service clvmd start   # put on one line
service pacemaker start

I thank everyone on this list who was patient with me as I pounded on this problem for two weeks!
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/24/12 4:47 AM, emmanuel segura wrote: How do you configure clvmd? With cman or with pacemaker?

Pacemaker. Here's the output of 'crm configure show': http://pastebin.com/426CdVwN

On 23 March 2012 at 22:14, William Seligman selig...@nevis.columbia.edu wrote: On 3/23/12 5:03 PM, emmanuel segura wrote: Sorry, but I would like to know if you can show me your /etc/cluster/cluster.conf.

Here it is: http://pastebin.com/GUr0CEgZ

On 23 March 2012 at 21:50, William Seligman selig...@nevis.columbia.edu wrote: On 3/22/12 2:43 PM, William Seligman wrote: On 3/20/12 4:55 PM, Lars Ellenberg wrote: On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote: On 3/16/12 12:12 PM, William Seligman wrote: On 3/16/12 7:02 AM, Andreas Kurz wrote:

"s-" ... DRBD suspended I/O, most likely because of its fencing policy. For valid dual-primary setups you have to use the resource-and-stonith policy and a working fence-peer handler. In this mode I/O is suspended until fencing of the peer is successful. The question is why the peer does _not_ also suspend its I/O, because obviously fencing was not successful. So with a correct DRBD configuration, one of your nodes should already have been fenced because of the connection loss between nodes (on the drbd replication link). You can use e.g. that nice fencing script: http://goo.gl/O4N8f

This is the output of drbdadm dump admin: http://pastebin.com/kTxvHCtx So I've got resource-and-stonith. I gather from an earlier thread that obliterate-peer.sh is more-or-less equivalent in functionality to stonith_admin_fence_peer.sh: http://www.gossamer-threads.com/lists/linuxha/users/78504#78504

At the moment I'm pursuing the possibility that I'm returning the wrong return codes from my fencing agent: http://www.gossamer-threads.com/lists/linuxha/users/78572 I cleaned up my fencing agent, making sure its return codes matched those returned by the other agents in /usr/sbin/fence_, and allowing for some delay issues in reading the UPS status. But...
After that, I'll look at another suggestion with lvm.conf: http://www.gossamer-threads.com/lists/linuxha/users/78796#78796 Then I'll try DRBD 8.4.1. Hopefully one of these is the source of the issue.

Failure on all three counts.

May I suggest you double-check the permissions on your fence-peer script? I suspect you may simply have forgotten the chmod +x. Test with drbdadm fence-peer minor-0 from the command line.

I still haven't solved the problem, but this advice has gotten me further than before. First, Lars was correct: I did not have execute permissions set on my fence-peer scripts. (D'oh!) I turned them on, but that did not change anything: cman+clvmd still hung on the vgdisplay command if I crashed the peer node.

I started up both nodes again (cman+pacemaker+drbd+clvmd) and tried Lars' suggested command. I didn't save the response for this message (d'oh again!), but it said that the fence-peer script had failed. Hmm. The peer was definitely shutting down, so my fencing script was working. I went over it, comparing its return codes to those of the existing scripts, and made some changes. Here's my current script: http://pastebin.com/nUnYVcBK

Up until now my fence-peer scripts had been either Lon Hohberger's obliterate-peer.sh or Digimer's rhcs_fence. I decided to try the stonith_admin-fence-peer.sh script that Andreas Kurz recommended; unlike the first two scripts, which fence using fence_node, the latter script just calls stonith_admin. When I tried the stonith_admin-fence-peer.sh script, it worked:

# drbdadm fence-peer minor-0
stonith_admin-fence-peer.sh[10886]: stonith_admin successfully fenced peer orestes-corosync.nevis.columbia.edu.

Power was cut on the peer, and the remaining node stayed up. Then I brought up the peer with:

stonith_admin -U orestes-corosync.nevis.columbia.edu

BUT: when the restored peer came up and started to run cman, clvmd hung on the main node again.
After cycling through some more tests, I found that if I brought down the peer with drbdadm, then brought the peer up with no HA services, then started drbd and then cman, the cluster remained intact. If I crashed the peer, the scheme in the previous paragraph didn't work: I bring up drbd, check that the disks are both UpToDate, then bring up cman. At that point vgdisplay on the main node takes so long to run that clvmd will time out:

# vgdisplay
Error locking on node orestes-corosync.nevis.columbia.edu: Command timed out

I timed how long it took vgdisplay to run. I might be able to work around this by setting the timeout on my clvmd resource to 300s, but that seems to be a band-aid for an underlying problem. Any suggestions on what else I could check?

I've done some more tests. Still no solution, just an observation. The death mode appears to be:

- Two nodes running cman+pacemaker+drbd+clvmd
- Take one node down = one
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/22/12 2:43 PM, William Seligman wrote: On 3/20/12 4:55 PM, Lars Ellenberg wrote: On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote: On 3/16/12 12:12 PM, William Seligman wrote: On 3/16/12 7:02 AM, Andreas Kurz wrote: On 03/15/2012 11:50 PM, William Seligman wrote: On 3/15/12 6:07 PM, William Seligman wrote: On 3/15/12 6:05 PM, William Seligman wrote: On 3/15/12 4:57 PM, emmanuel segura wrote:

We can try to understand what happens when clvm hangs. Edit /etc/lvm/lvm.conf: in the log section, change to "level = 7" and uncomment the line "file = /var/log/lvm2.log".

Here's the tail end of the file (the original is 1.6M). Because there are no timestamps in the log, it's hard for me to point you to the moment when I crashed the other system. I think (though I'm not sure) that the crash happened after the last occurrence of:

cache/lvmcache.c:1484  Wiping internal VG cache

Honestly, it looks like a wall of text to me. Does it suggest anything to you? Maybe it would help if I included the link to the pastebin where I put the output: http://pastebin.com/8pgW3Muw Could the problem be with lvm+drbd?
In lvm2.log, I see this sequence of lines pre-crash:

device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:271   /dev/md0: size is 1027968 sectors
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
device/dev-io.c:588   Closed /dev/md0
device/dev-io.c:271   /dev/md0: size is 1027968 sectors
device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
device/dev-io.c:588   Closed /dev/md0
filters/filter-composite.c:31   Using /dev/md0
device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
label/label.c:186     /dev/md0: No label detected
device/dev-io.c:588   Closed /dev/md0
device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
device/dev-io.c:588   Closed /dev/drbd0
device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
device/dev-io.c:588   Closed /dev/drbd0

I interpret this as: look at /dev/md0, get some info, close; look at /dev/drbd0, get some info, close. Post-crash, I see:

device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:271   /dev/md0: size is 1027968 sectors
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
device/dev-io.c:588   Closed /dev/md0
device/dev-io.c:271   /dev/md0: size is 1027968 sectors
device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
device/dev-io.c:588   Closed /dev/md0
filters/filter-composite.c:31   Using /dev/md0
device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137   /dev/md0: block size is 1024 bytes
label/label.c:186     /dev/md0: No label detected
device/dev-io.c:588   Closed /dev/md0
device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
...
and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0. If I look at /proc/drbd when I crash one node, I see this:

# cat /proc/drbd
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
    ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

"s-" ... DRBD suspended I/O, most likely because of its fencing policy. For valid dual-primary setups you have to use the resource-and-stonith policy and a working fence-peer handler. In this mode I/O is suspended until fencing of the peer is successful. The question is why the peer does _not_ also suspend its I/O, because obviously fencing was not successful. So with a correct DRBD configuration, one of your nodes should already have been fenced because of the connection loss between nodes (on the drbd replication link). You can use e.g. that nice fencing script: http://goo.gl/O4N8f

This is the output of drbdadm dump admin: http://pastebin.com/kTxvHCtx So I've got resource-and-stonith. I gather from an earlier thread that obliterate-peer.sh is more-or-less equivalent in functionality to stonith_admin_fence_peer.sh: http://www.gossamer-threads.com/lists/linuxha/users/78504#78504 At the moment I'm pursuing the possibility that I'm returning the wrong return codes from my fencing agent: http://www.gossamer-threads.com/lists/linuxha/users/78572 I cleaned up my fencing agent, making sure its return codes matched those returned by the other agents in /usr/sbin/fence_, and allowing for some delay issues in reading the UPS status.
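Since the suspended-I/O flag keeps coming up in this thread, here is a small sketch of pulling the connection state, roles, disk states, and the leading "s" (suspended) flag out of 8.3-style /proc/drbd output like the one above. This is my own illustration, not part of DRBD or any cluster tool; the field layout is taken from the status lines quoted in this message.

```python
import re

def parse_drbd_status(proc_drbd_text):
    """Parse an 8.3-style /proc/drbd device line, e.g.
       0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
    Returns None if no device line is found."""
    m = re.search(
        r"^\s*(\d+):\s+cs:(\S+)\s+ro:(\S+)\s+ds:(\S+)\s+(\S+)\s+(\S+)",
        proc_drbd_text, re.MULTILINE)
    if not m:
        return None
    minor, cs, ro, ds, proto, flags = m.groups()
    return {
        "minor": int(minor),
        "connection": cs,               # e.g. WFConnection
        "roles": ro.split("/"),         # local/peer, e.g. Primary/Unknown
        "disks": ds.split("/"),         # e.g. UpToDate/DUnknown
        "protocol": proto,              # e.g. C
        # A leading 's' in the flags field means I/O is suspended
        # (an 'r' means I/O is running normally).
        "io_suspended": flags.startswith("s"),
    }

# The /proc/drbd text quoted in this thread:
sample = """\
version: 8.3.12 (api:88/proto:86-96)
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
    ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
"""
status = parse_drbd_status(sample)
```

A one-line check like this could be dropped into a monitoring script to distinguish "peer fenced, I/O suspended" from a healthy state, instead of eyeballing /proc/drbd during an outage.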
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/22/12 2:49 PM, David Coulson wrote: On 3/22/12 2:43 PM, William Seligman wrote:

I still haven't solved the problem, but this advice has gotten me further than before. First, Lars was correct: I did not have execute permissions set on my fence-peer scripts. (D'oh!) I turned them on, but that did not change anything: cman+clvmd still hung on the vgdisplay command if I crashed the peer node.

Does cman think the node is fenced? clvmd will block I/O until the node is fenced properly.

Let's see. On the main node, before crashing the peer node:

# corosync-objctl | grep member
runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(192.168.100.207)
runtime.totem.pg.mrp.srp.members.1.join_count=1
runtime.totem.pg.mrp.srp.members.1.status=joined
runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(192.168.100.206)
runtime.totem.pg.mrp.srp.members.2.join_count=2
runtime.totem.pg.mrp.srp.members.2.status=joined

Then on the peer node:

echo c > /proc/sysrq-trigger

The UPS for the peer node shuts down, which tells me the main node ran the fencing agent. Now:

# corosync-objctl | grep member
runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(192.168.100.207)
runtime.totem.pg.mrp.srp.members.1.join_count=1
runtime.totem.pg.mrp.srp.members.1.status=joined
runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(192.168.100.206)
runtime.totem.pg.mrp.srp.members.2.join_count=2
runtime.totem.pg.mrp.srp.members.2.status=left

Looks like cman knows. Is there any other way to check a node's fenced status as far as cman is concerned?
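For what it's worth, the membership check above is easy to script. The following is an illustrative sketch that scrapes the corosync-objctl key/value lines quoted in this message; the key names are taken verbatim from that output, and the parsing itself is my own, not a corosync API.

```python
def member_statuses(objctl_output):
    """Map corosync totem member IDs to their status, given lines like
    runtime.totem.pg.mrp.srp.members.2.status=left"""
    statuses = {}
    for line in objctl_output.splitlines():
        key, _, value = line.partition("=")
        parts = key.split(".")
        # Match only runtime.totem.pg.mrp.srp.members.<id>.status keys;
        # the .ip and .join_count keys are skipped.
        if len(parts) == 8 and parts[5] == "members" and parts[7] == "status":
            statuses[int(parts[6])] = value
    return statuses

# The output quoted above, after the peer node was crashed:
sample = """\
runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(192.168.100.207)
runtime.totem.pg.mrp.srp.members.1.join_count=1
runtime.totem.pg.mrp.srp.members.1.status=joined
runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(192.168.100.206)
runtime.totem.pg.mrp.srp.members.2.join_count=2
runtime.totem.pg.mrp.srp.members.2.status=left"""
result = member_statuses(sample)
```

Something like this could feed a watchdog that alerts when a member's status flips to "left" without a corresponding fence confirmation.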
Re: [Linux-HA] order transitivity (was Re: order troubles)
On 3/22/12 10:06 AM, Florian Haas wrote: On Thu, Mar 22, 2012 at 10:34 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote:

order o_nfs_before_vz 0: cl_fs_nfs cl_vz
order o_vz_before_ve992 0: cl_vz ve992

A score of 0 is roughly equivalent to: "if you happen to plan to do both operations in the same transition, would you please consider doing them in this order, pretty please, if you see fit."

Lars beat me to this, as the post turned out to be a little more elaborate than expected, but here's a bit of background info for additional clarification: http://www.hastexo.com/resources/hints-and-kinks/mandatory-and-advisory-ordering-pacemaker

I have a related question raised by this web page. Suppose I have a chain of ordering constraints:

order Gfs2_Before_Libvirtd inf: Gfs2 Libvirtd
order Libvirtd_Before_VirtualMachine 0: Libvirtd VirtualMachine

On startup, Gfs2 will be started before Libvirtd, and Libvirtd before VirtualMachine. What happens on shutdown? Will Gfs2 necessarily wait until VirtualMachine stops? Or is it better to add the additional constraint:

order Gfs2_Before_VirtualMachine inf: Gfs2 VirtualMachine

if that's the behavior I want?
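As general Pacemaker background (not something confirmed in this thread): ordering constraints are symmetrical by default (symmetrical=true), so for each individual constraint the stop sequence is the reverse of the start sequence. The subtlety is that an advisory (score 0) link in the chain is only honored when both actions happen in the same transition, which is exactly what the hastexo article discusses. An explicit end-to-end constraint removes any reliance on transitivity through the advisory link; a sketch in crm shell syntax:

```
# Hypothetical addition: tie Gfs2 and VirtualMachine together directly,
# instead of relying on the advisory constraint through Libvirtd.
# symmetrical=true is the default, spelled out here for clarity:
# VirtualMachine must stop before Gfs2 stops.
order Gfs2_Before_VirtualMachine inf: Gfs2 VirtualMachine symmetrical=true
```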
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/16/12 4:53 AM, emmanuel segura wrote: For the lvm hang, you can use this in your /etc/lvm/lvm.conf:

ignore_suspended_devices = 1

because I saw this in the lvm log: "and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0."

No, this does not prevent the hang. I tried with both DRBD 8.3.12 and 8.4.1.

On 15 March 2012 at 23:50, William Seligman selig...@nevis.columbia.edu wrote: On 3/15/12 6:07 PM, William Seligman wrote: On 3/15/12 6:05 PM, William Seligman wrote: On 3/15/12 4:57 PM, emmanuel segura wrote:

We can try to understand what happens when clvm hangs. Edit /etc/lvm/lvm.conf: in the log section, change to "level = 7" and uncomment the line "file = /var/log/lvm2.log".

Here's the tail end of the file (the original is 1.6M). Because there are no timestamps in the log, it's hard for me to point you to the moment when I crashed the other system. I think (though I'm not sure) that the crash happened after the last occurrence of:

cache/lvmcache.c:1484  Wiping internal VG cache

Honestly, it looks like a wall of text to me. Does it suggest anything to you? Maybe it would help if I included the link to the pastebin where I put the output: http://pastebin.com/8pgW3Muw Could the problem be with lvm+drbd?
<snip: same lvm2.log pre-crash and post-crash excerpts as quoted earlier in the thread>
and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0.

If I look at /proc/drbd when I crash one node, I see this:

# cat /proc/drbd
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
    ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

If I look at /proc/drbd when I bring down one node gracefully (crm node standby), I get this:

# cat /proc/drbd
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-
    ns:764 nr:40 dw:40 dr:7036496 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Could it be that drbd can't respond to certain requests from lvm if the state of the peer is DUnknown instead of Outdated?

On 15 March 2012 at 20:50, William Seligman selig...@nevis.columbia.edu wrote:

On 3/15/12 12:55 PM, emmanuel segura wrote:

I don't see any errors, and the answer to your question is yes. Can you show me your /etc/cluster/cluster.conf and your "crm configure show"? That way I can try later to see if I can find a fix.

Thanks for taking a look. My cluster.conf: http://pastebin.com/w5XNYyAX crm configure show: http://pastebin.com/atVkXjkn Before you spend a lot of time on the second file, remember that clvmd will hang whether or not I'm running pacemaker.
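The DUnknown-versus-Outdated difference above can be checked mechanically. A small sketch; the status line is copied from the /proc/drbd output above, and on a live node you would read /proc/drbd itself rather than a hard-coded string:

```shell
# Extract the peer's disk state from a DRBD status line, to distinguish
# the DUnknown (crashed peer) case from Outdated (graceful standby).
# Sample line copied from the thread; replace with: grep '^ 0:' /proc/drbd
line='0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-'
peer_ds=$(printf '%s\n' "$line" | sed -n 's/.*ds:[^/]*\/\([A-Za-z]*\).*/\1/p')
echo "peer disk state: $peer_ds"
# prints: peer disk state: DUnknown
```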
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/16/12 12:12 PM, William Seligman wrote:

On 3/16/12 7:02 AM, Andreas Kurz wrote:

On 03/15/2012 11:50 PM, William Seligman wrote:

[snip]

 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-

"s-" ... DRBD suspended I/O, most likely because of its fencing policy. For valid dual-primary setups you have to use the resource-and-stonith policy and a working fence-peer handler. In this mode, I/O is suspended until fencing of the peer is successful. The question is why the peer does _not_ also suspend its I/O, because obviously fencing was not successful. So with a correct DRBD configuration, one of your nodes should already have been fenced because of the connection loss between nodes (on the drbd replication link). You can use e.g.
that nice fencing script: http://goo.gl/O4N8f

This is the output of drbdadm dump admin: http://pastebin.com/kTxvHCtx So I've got resource-and-stonith. I gather from an earlier thread that obliterate-peer.sh is more-or-less equivalent in functionality to stonith_admin_fence_peer.sh: http://www.gossamer-threads.com/lists/linuxha/users/78504#78504

At the moment I'm pursuing the possibility that I'm returning the wrong return codes from my fencing agent: http://www.gossamer-threads.com/lists/linuxha/users/78572 I cleaned up my fencing agent, making sure its return codes matched those returned by the other agents in /usr/sbin/fence_, and allowing for some delay issues in reading the UPS status. But...

After that, I'll look at another suggestion for lvm.conf: http://www.gossamer-threads.com/lists/linuxha/users/78796#78796 Then I'll try DRBD 8.4.1.
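For readers following along, Andreas's advice corresponds to a drbd.conf stanza roughly like the following. This is a sketch, not the poster's actual configuration (his real dump is in the pastebin above); the resource name and the handler path are illustrative assumptions, and the handler should be whichever fence-peer script you actually install:

```
resource admin {
  net {
    allow-two-primaries;            # dual-primary setup
  }
  disk {
    # suspend I/O until the peer has been fenced
    fencing resource-and-stonith;
  }
  handlers {
    # hypothetical path; point this at your installed fence-peer script,
    # e.g. obliterate-peer.sh or stonith_admin_fence_peer.sh
    fence-peer      "/usr/lib/drbd/stonith_admin_fence_peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

With resource-and-stonith but a fence-peer handler that returns the wrong exit code, DRBD keeps I/O suspended, which matches the "s-" state seen above.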
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 5:18 AM, emmanuel segura wrote:

The first thing I see in your clvmd log is this: === WARNING: Locking disabled. Be careful! This could corrupt your metadata. ===

I saw that too, and thought the same as you did. I did some checks (see below), but some web searches suggest that this message is a normal consequence of clvmd initialization; e.g., http://markmail.org/message/vmy53pcv52wu7ghx

Use this command: lvmconf --enable-cluster. And remember, for cman+pacemaker you don't need qdisk.

Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf: http://pastebin.com/841VZRzW and the output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf. Then I did as you suggested, but with a check to see if anything changed:

# cd /etc/lvm/
# cp lvm.conf lvm.conf.cluster
# lvmconf --enable-cluster
# diff lvm.conf lvm.conf.cluster
#

So the key lines have been there all along: locking_type = 3 and fallback_to_local_locking = 0.

On 14 March 2012 at 23:17, William Seligman selig...@nevis.columbia.edu wrote:

On 3/14/12 9:20 AM, emmanuel segura wrote:

Hello William. I didn't know you were using drbd, and I don't know what type of configuration you're using. But it's better if you start clvmd with "clvmd -d"; that way we can see what the problem is.

For what it's worth, here's the output of running clvmd -d on the node that stays up: http://pastebin.com/sWjaxAEF What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: http://pastebin.com/w5XNYyAX At the time of the next-to-last line, I cut power to the other node. At the time of the last line, I ran vgdisplay on the remaining node, which hangs forever.

After a lot of web searching, I found that I'm not the only one with this problem. Here's one case that doesn't seem relevant to me, since I don't use qdisk: http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html. Here's one with the same problem on the same OS: http://bugs.centos.org/view.php?id=5229, but with no resolution. Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them?

On 14 March 2012 at 14:02, William Seligman selig...@nevis.columbia.edu wrote:

On 3/14/12 6:02 AM, emmanuel segura wrote:

I think it's better if you make clvmd start at boot: chkconfig cman on; chkconfig clvmd on

I've already tried that. It doesn't work. The problem is that my LVM information is on the drbd. If I start up clvmd before drbd, it won't find the logical volumes. I also don't see why that would make a difference (although this could be part of the confusion): a service is a service. I've tried starting clvmd both inside and outside pacemaker control, with the same problem. Why would starting clvmd at boot make a difference?

On 13 March 2012 at 23:29, William Seligman selig...@nevis.columbia.edu wrote:

On 3/13/12 5:50 PM, emmanuel segura wrote:

So if you're using cman, why do you use lsb::clvmd? I think you are very confused.

I don't dispute that I may be very confused! However, from what I can tell, I still need to run clvmd even if I'm running cman (I'm not using rgmanager). If I just run cman, gfs2 and any other form of mount fails. If I run cman, then clvmd, then gfs2, everything behaves normally. Going by these instructions: https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial the resources the author puts under cluster control (rgmanager), I have to put under pacemaker control. Those include drbd, clvmd, and gfs2. The difference between what I've got and what's in Clusters From Scratch is that in CFS they assign one DRBD volume to a single filesystem. I create an LVM physical volume on my DRBD resource, as in the above tutorial, and so I have to start clvmd or the logical volumes in the DRBD partition won't be recognized. Is there some way to get logical volumes recognized automatically by cman, without rgmanager, that I've missed?

On 13 March 2012 at 22:42, William Seligman selig...@nevis.columbia.edu wrote:

On 3/13/12 12:29 PM, William Seligman wrote:

I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not. I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now), all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 3:43 AM, Vladislav Bogdanov wrote:

14.03.2012 00:42, William Seligman wrote: [snip]

These were the log messages, which show that stonith_admin did its job and CMAN was notified of the fencing: http://pastebin.com/jaH820Bv

Could you please look at the output of 'dlm_tool ls' and 'dlm_tool dump'? You probably have 'kern_stop' and 'fencing' flags there. That would mean that dlm is unaware that the node is fenced.

Here's 'dlm_tool ls' with both nodes running cman+clvmd+gfs2: http://pastebin.com/QrZtm1Ue 'dlm_tool dump': http://pastebin.com/UKWxx9Y4 For comparison, I crashed one node and looked at the same output on the remaining node: dlm_tool ls: http://pastebin.com/cKVAGxsd dlm_tool dump: http://pastebin.com/c0h0p22Q (the post-crash lines begin at 1331824940). I don't see the kern_stop or fencing flags.

There's another thing I don't see: at the top of 'dlm_tool dump' it displays most of the contents of my cluster.conf file, except for the fencing sections. Here's my cluster.conf for comparison: http://pastebin.com/w5XNYyAX cman doesn't see anything wrong in my cluster.conf file:

# ccs_config_validate
Configuration validates

But could there be something that's causing the fencing sections to be ignored? Unfortunately, I still got the gfs2 freeze, so this is not the complete story.

Both clvmd and gfs2 use dlm. If the dlm layer thinks fencing is not completed, both of them freeze.

I did grep -E '(dlm|clvm|fenc)' /var/log/messages and looked at the time I crashed the node: http://pastebin.com/dvBtdLUs. I see lines that indicate that pacemaker and drbd are fencing the node, but nothing from dlm or clvmd. Does this indicate what you suggest? Could dlm somehow be ignoring or overlooking the fencing I put in? Is there any other way to check this?
-- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/ smime.p7s Description: S/MIME Cryptographic Signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
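Vladislav's check can be scripted. A sketch that scans 'dlm_tool ls' output for the flags he names; the sample text below is fabricated for illustration (it is not output from the poster's cluster), and on a real node you would pipe in the actual command output instead:

```shell
# Report whether 'dlm_tool ls' output contains the kern_stop / fencing
# flags that mean dlm is still waiting for a node to be fenced.
check_dlm() {
  if printf '%s\n' "$1" | grep -qE 'kern_stop|fencing'; then
    echo "dlm is waiting on fencing"
  else
    echo "no pending fence actions"
  fi
}

# Fabricated sample; on a real node: check_dlm "$(dlm_tool ls)"
sample='name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop'
check_dlm "$sample"
# prints: dlm is waiting on fencing
```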
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 11:50 AM, emmanuel segura wrote:

Yes, William. Now try clvmd -d and see what happens. locking_type = 3 is the lvm cluster lock type.

Since you asked for confirmation, here it is: the output of 'clvmd -d' just now: http://pastebin.com/bne8piEw I crashed the other node at Mar 15 12:02:35, which is when you see the only additional line of output. I don't see any particular difference between this and the previous result (http://pastebin.com/sWjaxAEF), which suggests that I had cluster locking enabled before, and still do now.

On 15 March 2012 at 16:15, William Seligman selig...@nevis.columbia.edu wrote:

[snip]
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 12:15 PM, emmanuel segura wrote:

How did you create your volume group?

pvcreate /dev/drbd0
vgcreate -c y ADMIN /dev/drbd0
lvcreate -L 200G -n usr ADMIN
# ... and so on
# Nevis-HA is the cluster name I used in cluster.conf
mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr
# ... and so on

Give me the output of the vgs command when the cluster is up.

Here it is:

Logging initialised at Thu Mar 15 12:40:39 2012
Set umask from 0022 to 0077
Finding all volume groups
Finding volume group ROOT
Finding volume group ADMIN
  VG    #PV #LV #SN Attr   VSize   VFree
  ADMIN   1   5   0 wz--nc   2.61t 765.79g
  ROOT    1   2   0 wz--n- 117.16g      0
Wiping internal VG cache

I assume the "c" in the ADMIN attributes means that clustering is turned on?

On 15 March 2012 at 17:06, William Seligman selig...@nevis.columbia.edu wrote:

[snip]
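That guess is right as far as I know: in the vgs attribute string, the sixth character is the clustered bit, which is what vgcreate -c y sets. A quick sketch, with the attr string copied from the output above:

```shell
# Decode the clustered bit of a VG attribute string from `vgs`.
# Position 6 is 'c' for a clustered volume group (set by `vgcreate -c y`).
attr="wz--nc"   # ADMIN's attributes, copied from the vgs output above
if [ "$(printf '%s' "$attr" | cut -c6)" = "c" ]; then
  echo "clustered VG"
else
  echo "local VG"
fi
# prints: clustered VG
```

ROOT's string, "wz--n-", has "-" in that position, so it is an ordinary local VG, as expected.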
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 12:55 PM, emmanuel segura wrote:

I don't see any errors, and the answer to your question is yes. Can you show me your /etc/cluster/cluster.conf and your "crm configure show"? That way I can try later to see if I can find a fix.

Thanks for taking a look. My cluster.conf: http://pastebin.com/w5XNYyAX crm configure show: http://pastebin.com/atVkXjkn Before you spend a lot of time on the second file, remember that clvmd will hang whether or not I'm running pacemaker.

On 15 March 2012 at 17:42, William Seligman selig...@nevis.columbia.edu wrote:

[snip]
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 3:45 PM, Vladislav Bogdanov wrote:

15.03.2012 18:43, William Seligman wrote:

[snip]

For comparison, I crashed one node and looked at the same output on the remaining node: dlm_tool ls: http://pastebin.com/cKVAGxsd dlm_tool dump: http://pastebin.com/c0h0p22Q (the post-crash lines begin at 1331824940)

Everything is fine there; dlm correctly understands that the node is fenced and returns to a normal state. The only minor issue I see is that fencing took a long time: 21 seconds.

Hmm. My fencing agent works by toggling the power on a UPS. If all the agent does is action=off, it will cut power immediately. But if you tell it action=reboot, it will cut the load, wait 10 seconds, then turn the load back on again; I found I needed that delay because otherwise the UPS might confuse, overlap, or ignore sequential commands. Could this be the issue?

I've noticed that my fencing agent always seems to be called with action=reboot when a node is fenced. Why is it using 'reboot' and not 'off'? Is this the standard, or am I missing a definition somewhere?

I don't see the kern_stop or fencing flags. There's another thing I don't see: at the top of 'dlm_tool dump' it displays most of the contents of my cluster.conf file, except for the fencing sections. Here's my cluster.conf for comparison: http://pastebin.com/w5XNYyAX

That also looks correct (I mean fence_pcmk), but I can be wrong here; I do not use cman.

cman doesn't see anything wrong in my cluster.conf file:

# ccs_config_validate
Configuration validates

But could there be something that's causing the fencing sections to be ignored? Unfortunately, I still got the gfs2 freeze, so this is not the complete story.

Both clvmd and gfs2 use dlm. If the dlm layer thinks fencing is not completed, both of them freeze.

I did grep -E '(dlm|clvm|fenc)' /var/log/messages and looked at the time I crashed the node: http://pastebin.com/dvBtdLUs. I see lines that indicate that pacemaker and drbd are fencing the node, but nothing from dlm or clvmd. Does this indicate what you suggest? Could dlm somehow be ignoring or overlooking the fencing I put in? Is there any other way to check this?

No, dlm_controld (and friends) mostly uses a different logging method; that is what you see in 'dlm_tool dump'.
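The reboot-with-delay behaviour described above, which would account for most of the 21 seconds, can be sketched as follows. This is an illustration, not the poster's actual agent: ups_power is a hypothetical stand-in for whatever command really toggles the UPS load, and the delay is shortened here from the 10 seconds he describes:

```shell
# Sketch of a UPS fencing agent's action dispatch, assuming (as in the
# agent described above) that "reboot" is implemented as off, a pause,
# then on. ups_power is a hypothetical placeholder, not a real command.
ups_power() { echo "ups load $1"; }

fence_action() {
  case "$1" in
    off)    ups_power off ;;
    on)     ups_power on ;;
    reboot) ups_power off
            sleep 1        # the real agent waits ~10 s so the UPS keeps up
            ups_power on ;;
    *)      echo "unknown action: $1" >&2; return 1 ;;
  esac
}

fence_action reboot
```

If the caller only needs the node dead (which is all dlm and DRBD require), implementing reboot this way means the fence is not reported complete until the load is back on, stretching the fencing time.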
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 4:57 PM, emmanuel segura wrote: we can try to understand what happen when clvm hang edit the /etc/lvm/lvm.conf and change level = 7 in the log session and uncomment this line file = /var/log/lvm2.log Here's the tail end of the file (the original is 1.6M). Because there no times in the log, it's hard for me to point you to the point where I crashed the other system. I think (though I'm not sure) that the crash happened after the last occurrence of cache/lvmcache.c:1484 Wiping internal VG cache Honestly, it looks like a wall of text to me. Does it suggest anything to you? Il giorno 15 marzo 2012 20:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 12:55 PM, emmanuel segura wrote: I don't see any error and the answer for your question it's yes can you show me your /etc/cluster/cluster.conf and your crm configure show like that more later i can try to look if i found some fix Thanks for taking a look. My cluster.conf: http://pastebin.com/w5XNYyAX crm configure show: http://pastebin.com/atVkXjkn Before you spend a lot of time on the second file, remember that clvmd will hang whether or not I'm running pacemaker. Il giorno 15 marzo 2012 17:42, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 12:15 PM, emmanuel segura wrote: Ho did you created your volume group pvcreate /dev/drbd0 vgcreate -c y ADMIN /dev/drbd0 lvcreate -L 200G -n usr ADMIN # ... and so on # Nevis-HA is the cluster name I used in cluster.conf mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr # ... and so on give me the output of vgs command when the cluster it's up Here it is: Logging initialised at Thu Mar 15 12:40:39 2012 Set umask from 0022 to 0077 Finding all volume groups Finding volume group ROOT Finding volume group ADMIN VG#PV #LV #SN Attr VSize VFree ADMIN 1 5 0 wz--nc 2.61t 765.79g ROOT1 2 0 wz--n- 117.16g 0 Wiping internal VG cache I assume the c in the ADMIN attributes means that clustering is turned on? 
On 15 March 2012 at 17:06, William Seligman selig...@nevis.columbia.edu wrote: On 3/15/12 11:50 AM, emmanuel segura wrote: Yes, William. Now try clvmd -d and see what happens. locking_type = 3 is the LVM cluster lock type.

Since you asked for confirmation, here it is: the output of 'clvmd -d' just now: http://pastebin.com/bne8piEw. I crashed the other node at Mar 15 12:02:35, when you see the only additional line of output. I don't see any particular difference between this and the previous result http://pastebin.com/sWjaxAEF, which suggests that I had cluster locking enabled before, and still do now.

On 15 March 2012 at 16:15, William Seligman selig...@nevis.columbia.edu wrote: On 3/15/12 5:18 AM, emmanuel segura wrote: The first thing I saw in your clvmd log is this: = WARNING: Locking disabled. Be careful! This could corrupt your metadata. =

I saw that too, and thought the same as you did. I did some checks (see below), but some web searches suggest that this message is a normal consequence of clvmd initialization; e.g., http://markmail.org/message/vmy53pcv52wu7ghx

Use this command: lvmconf --enable-cluster. And remember, for cman+pacemaker you don't need qdisk.

Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf: http://pastebin.com/841VZRzW and the output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf.
Then I did as you suggested, but with a check to see if anything changed:

# cd /etc/lvm/
# cp lvm.conf lvm.conf.cluster
# lvmconf --enable-cluster
# diff lvm.conf lvm.conf.cluster
#

So the key lines have been there all along: locking_type = 3 and fallback_to_local_locking = 0.
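The same check can be scripted rather than eyeballing a diff. A minimal sketch that greps for the two settings; it runs here against an inline snippet mirroring the values above, while a real check would read /etc/lvm/lvm.conf:

```shell
# Check the two lvm.conf settings that matter for clvmd cluster locking.
# The snippet mirrors the values reported above; on a real node, replace
# the echo with: cat /etc/lvm/lvm.conf
snippet='locking_type = 3
fallback_to_local_locking = 0'
echo "$snippet" | grep -q '^locking_type = 3' && echo "cluster locking: on"
echo "$snippet" | grep -q '^fallback_to_local_locking = 0' && echo "local fallback: off"
```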
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 6:05 PM, William Seligman wrote: On 3/15/12 4:57 PM, emmanuel segura wrote: We can try to understand what happens when clvmd hangs ... Here's the tail end of the file (the original is 1.6M). ... Honestly, it looks like a wall of text to me. Does it suggest anything to you?

Maybe it would help if I included the link to the pastebin where I put the output: http://pastebin.com/8pgW3Muw
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/15/12 6:07 PM, William Seligman wrote: Maybe it would help if I included the link to the pastebin where I put the output: http://pastebin.com/8pgW3Muw

Could the problem be with lvm+drbd? In lvm2.log, I see this sequence of lines pre-crash:

device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
device/dev-io.c:271  /dev/md0: size is 1027968 sectors
device/dev-io.c:137  /dev/md0: block size is 1024 bytes
device/dev-io.c:588  Closed /dev/md0
device/dev-io.c:271  /dev/md0: size is 1027968 sectors
device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137  /dev/md0: block size is 1024 bytes
device/dev-io.c:588  Closed /dev/md0
filters/filter-composite.c:31  Using /dev/md0
device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137  /dev/md0: block size is 1024 bytes
label/label.c:186  /dev/md0: No label detected
device/dev-io.c:588  Closed /dev/md0
device/dev-io.c:535  Opened /dev/drbd0 RO O_DIRECT
device/dev-io.c:271  /dev/drbd0: size is 5611549368 sectors
device/dev-io.c:137  /dev/drbd0: block size is 4096 bytes
device/dev-io.c:588  Closed /dev/drbd0
device/dev-io.c:271  /dev/drbd0: size is 5611549368 sectors
device/dev-io.c:535  Opened /dev/drbd0 RO O_DIRECT
device/dev-io.c:137  /dev/drbd0: block size is 4096 bytes
device/dev-io.c:588  Closed /dev/drbd0

I interpret this as: look at /dev/md0, get some info, close; look at /dev/drbd0, get some info, close. Post-crash, I see:

device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
device/dev-io.c:271  /dev/md0: size is 1027968 sectors
device/dev-io.c:137  /dev/md0: block size is 1024 bytes
device/dev-io.c:588  Closed /dev/md0
device/dev-io.c:271  /dev/md0: size is 1027968 sectors
device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137  /dev/md0: block size is 1024 bytes
device/dev-io.c:588  Closed /dev/md0
filters/filter-composite.c:31  Using /dev/md0
device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
device/dev-io.c:137  /dev/md0: block size is 1024 bytes
label/label.c:186  /dev/md0: No label detected
device/dev-io.c:588  Closed /dev/md0
device/dev-io.c:535  Opened /dev/drbd0 RO O_DIRECT
device/dev-io.c:271  /dev/drbd0: size is 5611549368 sectors
device/dev-io.c:137  /dev/drbd0: block size is 4096 bytes

... and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0. If I look at /proc/drbd when I crash one node, I see this:

# cat /proc/drbd
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
    ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

If instead I bring down one node gracefully (crm node standby), I get this:

# cat /proc/drbd
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-
    ns:764 nr:40 dw:40 dr:7036496 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Could it be that drbd can't respond to certain requests from lvm if the state of the peer is DUnknown instead of Outdated?
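The two states being compared can be pulled out of /proc/drbd mechanically. A small sketch, run here against the status line from the crashed-peer case above (on a live node you would read /proc/drbd instead of a hard-coded sample):

```shell
# Parse the cs: (connection) and peer half of ds: (disk) states out of a
# /proc/drbd resource line. Sample line is the crashed-peer case above.
line=' 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-'
cs=$(echo "$line" | sed -n 's/.*cs:\([A-Za-z]*\).*/\1/p')
peer_ds=$(echo "$line" | sed -n 's/.*ds:[A-Za-z]*\/\([A-Za-z]*\).*/\1/p')
echo "connection=$cs peer_disk=$peer_ds"
```

In the graceful-shutdown case the same parse would report peer_disk=Outdated, which is the difference being asked about.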
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/14/12 6:02 AM, emmanuel segura wrote: I think it's better if you make clvmd start at boot: chkconfig cman on ; chkconfig clvmd on

I've already tried it. It doesn't work. The problem is that my LVM information is on the drbd device. If I start up clvmd before drbd, it won't find the logical volumes. I also don't see why that would make a difference (although this could be part of the confusion): a service is a service. I've tried starting clvmd both inside and outside pacemaker control, with the same problem. Why would starting clvmd at boot make a difference?

On 13 March 2012 at 23:29, William Seligman selig...@nevis.columbia.edu wrote: On 3/13/12 5:50 PM, emmanuel segura wrote: So if you're using cman, why do you use lsb::clvmd? I think you are very confused.

I don't dispute that I may be very confused! However, from what I can tell, I still need to run clvmd even if I'm running cman (I'm not using rgmanager). If I just run cman, gfs2 and any other form of mount fails. If I run cman, then clvmd, then gfs2, everything behaves normally. Going by these instructions: https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial the resources he puts under cluster control (rgmanager) I have to put under pacemaker control. Those include drbd, clvmd, and gfs2. The difference between what I've got and what's in Clusters From Scratch is that in CFS they assign one DRBD volume to a single filesystem. I create an LVM physical volume on my DRBD resource, as in the above tutorial, and so I have to start clvmd or the logical volumes in the DRBD partition won't be recognized. Is there some way to get logical volumes recognized automatically by cman, without rgmanager, that I've missed?

On 13 March 2012 at 22:42, William Seligman selig...@nevis.columbia.edu wrote: On 3/13/12 12:29 PM, William Seligman wrote: I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not.
I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now), all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running. I've traced why the lsb::clvmd resource fails: the monitor/status command includes vgdisplay, which hangs indefinitely. Therefore the monitor will always time out. So this isn't a problem with pacemaker, but with clvmd/dlm: if a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas? Details: versions: Redhat Linux 6.2 (kernel 2.6.32) cman-3.0.12.1 corosync-1.4.1 pacemaker-1.1.6 lvm2-2.02.87 lvm2-cluster-2.02.87

This may be a Linux-HA question after all! I ran a few more tests. Here's the output from a typical test of grep -E '(dlm|gfs2|clvmd|fenc|syslogd)' /var/log/messages: http://pastebin.com/uqC6bc1b It looks like what's happening is that the fence agent (one I wrote) is not returning the proper error code when a node crashes. According to this page, if a fencing agent fails, GFS2 will freeze to protect the data: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html As a test, I tried to fence my test node via standard means: stonith_admin -F orestes-corosync.nevis.columbia.edu These were the log messages, which show that stonith_admin did its job and CMAN was notified of the fencing: http://pastebin.com/jaH820Bv. Unfortunately, I still got the gfs2 freeze, so this is not the complete story. First things first.
I vaguely recall a web page that went over the STONITH return codes, but I can't locate it again. Is there any reference for the return codes expected from a fencing agent, perhaps as a function of the state of the fencing device?

--
Bill Seligman | mailto://selig...@nevis.columbia.edu
Nevis Labs, Columbia Univ | http://www.nevis.columbia.edu/~seligman/
PO Box 137 | Irvington NY 10533 USA | Phone: (914) 591-2823

smime.p7s Description: S/MIME Cryptographic Signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/14/12 9:26 AM, Lars Marowsky-Bree wrote: On 2012-03-14T09:02:59, William Seligman selig...@nevis.columbia.edu wrote: To ask a slightly different question - why? Does your workload require or benefit from a dual-primary architecture? Most don't.

http://www.gossamer-threads.com/lists/linuxha/users/78497#78497 I'm mindful of the issues involved, such as those Lars Ellenberg brought up in his response. I need something that will fail over with a minimum of fuss. Although I'm encountering one problem after another, I think I'm closing in on my goal. And if not, at least I'm leaving some interesting threads in Linux-HA for future sysadmins to search for.
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/14/12 9:20 AM, emmanuel segura wrote: Hello William, I didn't know you are using drbd, and I don't know what type of configuration you're using. But it's better if you start clvmd with clvmd -d, so that we can see what the problem is.

For what it's worth, here's the output of running clvmd -d on the node that stays up: http://pastebin.com/sWjaxAEF What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: http://pastebin.com/w5XNYyAX At the time of the next-to-last line, I cut power to the other node. At the time of the last line, I ran vgdisplay on the remaining node, which hangs forever. After a lot of web searching, I found that I'm not the only one with this problem. Here's one case that doesn't seem relevant to me, since I don't use qdisk: http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html Here's one with the same problem on the same OS: http://bugs.centos.org/view.php?id=5229, but with no resolution. Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them?
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/14/12 12:43 PM, Dimitri Maziuk wrote: On 03/14/2012 11:08 AM, Lars Marowsky-Bree wrote: On 2012-03-14T11:41:53, William Seligman selig...@nevis.columbia.edu wrote: I'm mindful of the issues involved, such as those Lars Ellenberg brought up in his response. I need something that will fail over with a minimum of fuss. Although I'm encountering one problem after another, I think I'm closing in on my goal.

I doubt this is what you're getting. An active/passive failover configuration would likely save you tons of trouble and not perform worse; it would probably be faster for most workloads. Or if you look at it from another angle: if you can't configure your resources to start properly at failover, what makes you think you can configure a dual-primary any better?

I'll repeat the answer I gave in that other thread, for what it's worth: Consider two nodes in a primary-secondary cluster. The primary is running a resource. It fails, so the resource has to fail over to the secondary. Now consider a primary-primary cluster. Both run the same resource. One fails. There's no failover here; the other box still runs the resource. In my case, the only thing that has to work is the cloned cluster IP address, and that I've verified to my satisfaction.
[Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not. I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now), all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running. I've traced why the lsb::clvmd resource fails: the monitor/status command includes vgdisplay, which hangs indefinitely. Therefore the monitor will always time out. So this isn't a problem with pacemaker, but with clvmd/dlm: if a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas? Details: versions: Redhat Linux 6.2 (kernel 2.6.32) cman-3.0.12.1 corosync-1.4.1 pacemaker-1.1.6 lvm2-2.02.87 lvm2-cluster-2.02.87 cluster.conf: http://pastebin.com/w5XNYyAX output of crm configure show: http://pastebin.com/atVkXjkn output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf /var/log/cluster/dlm_controld.log and /var/log/cluster/gfs_controld.log show nothing. When I shut down power to one node (orestes-tb), the output of grep -E '(dlm|gfs2|clvmd)' /var/log/messages is http://pastebin.com/vjpvCFeN.
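As a side note, the log filter used above can be exercised on a few synthetic syslog lines to show what it keeps. The sample lines below are invented for illustration; they are not from the actual pastebin:

```shell
# Run the log filter from the post over sample syslog lines.
# Sample lines are illustrative only -- not from the real /var/log/messages.
log='Mar 13 12:01:00 orestes-tb dlm_controld[2495]: fencing deferred to pacemaker
Mar 13 12:01:01 orestes-tb ntpd[1234]: synchronized to time server
Mar 13 12:01:02 orestes-tb clvmd: Unable to obtain cluster lock'
echo "$log" | grep -E '(dlm|gfs2|clvmd)'   # keeps lines 1 and 3, drops the ntpd line
```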
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/13/12 2:49 PM, emmanuel segura wrote: Sorry William, but I think clvmd must be used with ocf:lvm2:clvmd. Example:

crm configure primitive clvmd ocf:lvm2:clvmd params daemon_timeout=30
crm configure clone cln_clvmd clvmd

And remember, clvmd depends on dlm, so you should do the same for dlm.

I don't have an ocf:lvm2:clvmd resource agent on my system. When I do a web search, it looks like a resource agent found on SUSE systems, but not on RHEL distros. Based on Clusters From Scratch, I think that if I'm using cman, dlm is started automatically. I see dlm_controld is running without my explicitly starting it:

# ps aux | grep dlm_controld
root 2495 0.0 0.0 234688 7564 ? Ssl 12:32 0:00 dlm_controld

I should have also mentioned that I can duplicate this problem outside pacemaker. That is, I can start cman, clvmd, and gfs2 manually on both nodes, cut power on one node, and clustering fails on the other node. So I suspect it's not a pacemaker resource problem. For a moment I thought I might not have used -p lock_dlm when I created my GFS2 filesystems, but I think the output of gfs2_edit -p sb ... shows that I did it correctly: http://pastebin.com/ALQYpKAy. When I looked more carefully at my lvm.conf, I saw that I had a typo: fallback_to_local_locking=4 I changed it to the correct value (according to https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial): fallback_to_local_locking=0 Unfortunately this doesn't solve the problem. So... any ideas?
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 3/13/12 12:29 PM, William Seligman wrote: I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not. ...

This may be a Linux-HA question after all! I ran a few more tests. Here's the output from a typical test of grep -E '(dlm|gfs2|clvmd|fenc|syslogd)' /var/log/messages: http://pastebin.com/uqC6bc1b It looks like what's happening is that the fence agent (one I wrote) is not returning the proper error code when a node crashes.
According to this page, if a fencing agent fails, GFS2 will freeze to protect the data: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html As a test, I tried to fence my test node via standard means: stonith_admin -F orestes-corosync.nevis.columbia.edu These were the log messages, which show that stonith_admin did its job and CMAN was notified of the fencing: http://pastebin.com/jaH820Bv. Unfortunately, I still got the gfs2 freeze, so this is not the complete story. First things first. I vaguely recall a web page that went over the STONITH return codes, but I can't locate it again. Is there any reference for the return codes expected from a fencing agent, perhaps as a function of the state of the fencing device?
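For what it's worth, the convention I believe Red Hat's fence agents follow — this is from memory of the FenceAgentAPI documentation, so treat it as an assumption to verify, not a definitive reference — is exit 0 for a confirmed on/off/reboot, non-zero for anything unverified, with the status action returning 0 for "device on" and 2 for "device off". A toy sketch of that contract, with a hypothetical fence_result helper:

```shell
# Hedged sketch of the fence-agent exit-code convention (assumed from the
# FenceAgentAPI docs -- verify against your agent's documentation):
#   off/on/reboot actions: exit 0 only on confirmed success, 1 on failure
#   status action:         exit 0 = device on, 2 = device off, 1 = error
fence_result() {
  case "$1" in
    success) return 0 ;;  # power action confirmed
    off)     return 2 ;;  # status query: port is off
    *)       return 1 ;;  # unverified outcome must report failure
  esac
}
for probe in success off error; do
  rc=0; fence_result "$probe" || rc=$?
  echo "$probe -> $rc"
done
```

The key point matching the thread: an agent that exits 0 without having confirmed the node is down lets GFS2 carry on against a possibly-alive peer, while an agent that wrongly exits non-zero leaves GFS2 frozen.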
Re: [Linux-HA] How do I clear the Failed actions section?
On 3/8/12 6:53 AM, Helmut Wollmersdorfer wrote: Am 07.03.2012 um 18:01 schrieb Florian Haas: On Wed, Mar 7, 2012 at 5:51 PM, William Seligman selig...@nevis.columbia.edu wrote: Again, a disclaimer: I am not an expert. Your advice was spot on. :) But what to do, if cleanup is not working? And everything is running:

# crm status
Last updated: Thu Mar 8 12:27:00 2012
Stack: Heartbeat
Current DC: xen10 (5ab5ba3d-3be5-4763-83e7-90aaa49361a6) - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, unknown expected votes
12 Resources configured.

Online: [ xen10 xen11 ]

xen_www (ocf::heartbeat:Xen): Started xen11
Master/Slave Set: DrbdClone1
    Masters: [ xen11 ]
    Slaves: [ xen10 ]
xen_typo3 (ocf::heartbeat:Xen): Started xen11
xen_shopdb (ocf::heartbeat:Xen): Started xen10
xen_admintool (ocf::heartbeat:Xen): Started xen11
xen_cmsdb (ocf::heartbeat:Xen): Started xen11
Master/Slave Set: DrbdClone2
    Resource Group: group_drbd2:0
        xen_drbd2_1:0 (ocf::linbit:drbd): Slave xen10 (unmanaged) FAILED
        xen_drbd2_2:0 (ocf::linbit:drbd): Stopped
    Masters: [ xen11 ]
Master/Slave Set: DrbdClone3
    Masters: [ xen10 ]
    Slaves: [ xen11 ]
Master/Slave Set: DrbdClone5
    Masters: [ xen11 ]
    Slaves: [ xen10 ]
Master/Slave Set: DrbdClone6
    Slaves: [ xen11 xen10 ]
Master/Slave Set: DrbdClone4
    Masters: [ xen11 ]
    Slaves: [ xen10 ]

Failed actions:
    xen_cmsdb_monitor_3000 (node=xen10, call=571, rc=7, status=complete): not running
    xen_drbd1_2:1_promote_0 (node=xen10, call=5205, rc=1, status=complete): unknown error
    xen_drbd2_1:1_promote_0 (node=xen10, call=790, rc=1, status=complete): unknown error
    xen_ns2_monitor_3000 (node=xen10, call=601, rc=7, status=complete): not running
    xen_drbd3_1:1_promote_0 (node=xen10, call=383, rc=-2, status=Timed Out): unknown exec error
    xen_drbd2_1:0_promote_0 (node=xen10, call=1326, rc=-2, status=Timed Out): unknown exec error
    xen_drbd2_1:0_stop_0 (node=xen10, call=1348, rc=-2, status=Timed Out): unknown exec error

xen11:# crm resource cleanup 
xen_drbd2_1
Error performing operation: The object/attribute does not exist
Error performing operation: The object/attribute does not exist

Given the list of resources displayed by crm_mon, the command you need is: crm resource cleanup DrbdClone2 I can't say whether that will fix your problems, but you won't get the does not exist message. Somewhere in either Pacemaker Explained or Clusters From Scratch, it says that once you clone or ms a resource, you can't refer to that resource as an individual anymore; you have to use the clone/ms name. What I did when faced with a problem like yours is cat /proc/drbd, look at the lines for the failed drbd, and fix it on my own. Then I'd type the cleanup command for pacemaker to pick up the current state of the resource.

# xm list
Name          ID     Mem  VCPUs  State  Time(s)
Domain-0       0  100516         r-     40648.5
admintool      5    4096      2  -b      7455.4
cmsdb          3    2048      2  -b      2106.5
typo3          2    1024      2  -b      2890.9
www            1    1024      1  -b       855.0

xen11:# drbdadm status
drbd-status version=8.3.7 api=88
resources config_file=/etc/drbd.conf
resource minor=1 name=drbd1_1 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=2 name=drbd1_2 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=3 name=drbd2_1 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=4 name=drbd2_2 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=5 name=drbd3_1 cs=Connected ro1=Secondary ro2=Primary ds1=UpToDate ds2=UpToDate /
resource minor=6 name=drbd3_2 cs=Connected ro1=Secondary ro2=Primary ds1=UpToDate ds2=UpToDate /
resource minor=7 name=drbd4_1 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=8 name=drbd4_2 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=9 name=drbd5_1 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=10 name=drbd5_2 cs=Connected ro1=Primary ro2=Secondary ds1=UpToDate ds2=UpToDate /
resource minor=11 name=drbd6_1 cs=StandAlone ro1=Secondary ro2=Unknown ds1=Outdated ds2=DUnknown /
resource minor=12 name=drbd6_2 cs=StandAlone ro1=Secondary ro2=Unknown ds1=Outdated ds2=DUnknown /
!-- resource minor=13 name=drbd7_1 not available or not yet created --
!-- resource minor=14 name=drbd7_2 not available or not yet created --
!-- resource minor=15 name=drbd8_1 not available
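The StandAlone/Outdated pairs for drbd6_1 and drbd6_2 above are the classic signature of a DRBD split-brain that automatic recovery didn't resolve. For reference, a sketch of the usual manual recovery on the DRBD 8.3 series, taking drbd6_1 as the example — verify which node's changes you intend to discard before running anything like this:

```
# On the node whose changes are to be thrown away:
drbdadm secondary drbd6_1
drbdadm -- --discard-my-data connect drbd6_1

# On the surviving node, only if it is also StandAlone:
drbdadm connect drbd6_1
```

Once the resync finishes, a crm resource cleanup on the corresponding ms resource lets pacemaker pick up the repaired state.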
Re: [Linux-HA] How do I clear the Failed actions section?
On 3/7/12 10:50 AM, Jerome Yanga wrote: I just want to share that the command recommended did NOT move the resource to another node. It basically clears the Failed Actions section. This is why I was conditional in my response. Suppose you had something like the following:

primitive MyResource ocf:heartbeat:Dummy
location MyResourcePreferredNode MyResource 10: my-node-a.example.com

with no resource-stickiness set. Assume MyResource fails on my-node-a, and is moved to my-node-b. Then if you were to do: crm resource cleanup MyResource pacemaker might move MyResource back to my-node-a. It might even move it back without that example MyResourcePreferredNode constraint. If you want to avoid that, consider per-resource or global resource-stickiness: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch05s03s02.html http://www.gossamer-threads.com/lists/linuxha/pacemaker/64076 Again, a disclaimer: I am not an expert. On Tue, Mar 6, 2012 at 11:46 AM, William Seligman selig...@nevis.columbia.edu wrote: On 3/6/12 2:38 PM, Jerome Yanga wrote: Do you know by chance if that command you have provided bounces the resource? I don't know what you mean by bounce the resource. According to: http://www.clusterlabs.org/doc/crm_cli.html the command refreshes the resource status. Depending on your configuration, it might shift a resource to another node. But I am not an expert! I merely knew how to clear up the error message. On Tue, Mar 6, 2012 at 10:28 AM, William Seligman selig...@nevis.columbia.edu wrote: On 3/6/12 1:04 PM, Jerome Yanga wrote: crm_mon shows the error below. Failed actions: drbd0:1_monitor_59000 (node=testserver1.example.com, call=132, rc=-2, status=Timed Out): unknown exec error I have checked DRBD and the mirror is connected and UpToDate on both nodes. The error above caused the resources to fail over and it seems to be working OK. However, the failed actions section has not disappeared. How do I clear this error? 
crm resource cleanup drbd0
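To make the stickiness suggestion above concrete, a minimal crm shell sketch — the value 100 is arbitrary and illustrative; it needs to outweigh whatever location scores would otherwise pull the resource back:

```
# cluster-wide default for all resources
crm configure rsc_defaults resource-stickiness=100

# or per resource, via meta attributes (hypothetical resource)
crm configure primitive MyResource ocf:heartbeat:Dummy \
    meta resource-stickiness=100
```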
Re: [Linux-HA] Apparent problem in pacemaker ordering
On 3/5/12 11:55 AM, William Seligman wrote: On 3/3/12 3:30 PM, William Seligman wrote: On 3/3/12 2:14 PM, Florian Haas wrote: On Sat, Mar 3, 2012 at 6:55 PM, William Seligman selig...@nevis.columbia.edu wrote: On 3/3/12 12:03 PM, emmanuel segura wrote: are you sure the exportfs agent can be use it with clone active/active? a) I've been through the script. If there's some problem associated with it being cloned, I haven't seen it. (It can't handle globally-unique=true, but I didn't turn that on.) It shouldn't have a problem with being cloned. Obviously, cloning that RA _really_ makes sense only with the export that manages an NFSv4 virtual root (fsid=0). Otherwise, the export clone has to be hosted on a clustered filesystem, and you'd have to have a pNFS implementation that doesn't suck (tough to come by on Linux), and if you want that sort of replicate, parallel-access NFS you might as well use Gluster. The downside of the latter, though, is it's currently NFSv3-only, without sideband locking. I'll look this over when I have a chance. I think I can get away without a NFSv4 virtual root because I'm exporting everything to my cluster either read-only, or only one system at a time will do any writing. Now that you've warned me, I'll do some more checking. b) I had similar problems using the exportfs resource in a primary-secondary setup without clones. Why would a resource being cloned create an ordering problem? I haven't set the interleave parameter (even with the documentation I'm not sure what it does) but A before B before C seems pretty clear, even for cloned resources. As far as what interleave does. Suppose you have two clones, A and B. And they're linked with an order constraint, like this: order A_before_B inf: A B ... then if interleave is false, _all_ instances of A must be started before _any_ instance of B gets to start anywhere in the cluster. 
However if interleave is true, then for any node only the _local_ instance of A needs to be started before it can start the corresponding _local_ instance of B. In other words, interleave=true is actually the reasonable thing to set on all clone instances by default, and I believe the pengine actually does use a default of interleave=true on defined clone sets since some 1.1.x release (I don't recall which). Thanks, Florian. That's a great explanation. I'll probably stick interleave=true on most of my clones just to make sure. It explains an error message I've seen in the logs: Mar 2 18:15:19 hypatia-tb pengine: [4414]: ERROR: clone_rsc_colocation_rh: Cannot interleave clone ClusterIPClone and Gfs2Clone because they do not support the same number of resources per node Because ClusterIPClone has globally-unique=true and clone-max=2, it's possible for both instances to be running on a single node; I've seen this a few times in my testing when cycling power on one of the nodes. Interleaving doesn't make sense in such a case. Bill, seeing as you've already pastebinned your config and crm_mon output, could you also pastebin your whole CIB as per cibadmin -Q output? Thanks. Sure: http://pastebin.com/pjSJ79H6. It doesn't have the exportfs resources in it; I took them out before leaving for the weekend. If it helps, I'll put them back in and try to get the cibadmin -Q output before any nodes crash. For a test, I stuck in a exportfs resource with all the ordering constraints. Here's the cibadmin -Q output from that: http://pastebin.com/nugdufJc The output of crm_mon just after doing that, showing resource failure: http://pastebin.com/cyCFGUSD Then all the resources are stopped: http://pastebin.com/D62sGSrj A few seconds later one of the nodes is fenced, but this does not bring up anything: http://pastebin.com/wzbmfVas I believe I have the solution to my stability problem. It doesn't solve the issue of ordering, but I think I have a configuration that will survive failover. 
Here's the problem. I had exportfs resources such as:

primitive ExportUsrNevis ocf:heartbeat:exportfs \
    op start interval=0 timeout=40 \
    op stop interval=0 timeout=45 \
    params clientspec=*.nevis.columbia.edu directory=/usr/nevis \
        fsid=20 options=ro,no_root_squash,async

I did detailed traces of the execution of exportfs (putting in logger commands) and found that the problem was in the backup_rmtab function in exportfs:

backup_rmtab() {
    local rmtab_backup
    if [ "${OCF_RESKEY_rmtab_backup}" != "none" ]; then
        rmtab_backup="${OCF_RESKEY_directory}/${OCF_RESKEY_rmtab_backup}"
        grep ":${OCF_RESKEY_directory}:" /var/lib/nfs/rmtab > ${rmtab_backup}
    fi
}

The problem was that the grep command was taking a long time, longer than any timeout I'd assigned to the resource. I looked at /var/lib/nfs/rmtab, and saw it was 60GB on one of my nodes and 16GB on the other. Since backup_rmtab() is called during the stop action, the resource
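For illustration, the usual way rmtab gets that large is by accumulating duplicate host:directory:count entries, and collapsing the duplicates keeps backup_rmtab's grep fast. A self-contained demo with made-up hosts and a temp file — do not point this at the live /var/lib/nfs/rmtab without thinking through the NFS client-state implications first:

```shell
# Demo: /var/lib/nfs/rmtab-style file with duplicate entries (all values
# below are invented for the demo), deduplicated in place with sort -u.
rmtab=$(mktemp)
printf '%s\n' \
    'client1.example.com:/usr/nevis:0x00000001' \
    'client1.example.com:/usr/nevis:0x00000001' \
    'client2.example.com:/usr/nevis:0x00000001' > "$rmtab"

sort -u "$rmtab" -o "$rmtab"   # collapses the repeated line; 2 entries remain
wc -l < "$rmtab"
```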
Re: [Linux-HA] How do I clear the Failed actions section?
On 3/6/12 1:04 PM, Jerome Yanga wrote: crm_mon shows the error below. Failed actions: drbd0:1_monitor_59000 (node=testserver1.example.com, call=132, rc=-2, status=Timed Out): unknown exec error I have checked DRBD and the mirror is connected and UpToDate on both nodes. The error above caused the resources to fail over and it seems to be working OK. However, the failed actions section has not disappeared. How do I clear this error? crm resource cleanup drbd0
Re: [Linux-HA] Apparent problem in pacemaker ordering
On 3/3/12 3:30 PM, William Seligman wrote: On 3/3/12 2:14 PM, Florian Haas wrote: On Sat, Mar 3, 2012 at 6:55 PM, William Seligman selig...@nevis.columbia.edu wrote: On 3/3/12 12:03 PM, emmanuel segura wrote: are you sure the exportfs agent can be use it with clone active/active? a) I've been through the script. If there's some problem associated with it being cloned, I haven't seen it. (It can't handle globally-unique=true, but I didn't turn that on.) It shouldn't have a problem with being cloned. Obviously, cloning that RA _really_ makes sense only with the export that manages an NFSv4 virtual root (fsid=0). Otherwise, the export clone has to be hosted on a clustered filesystem, and you'd have to have a pNFS implementation that doesn't suck (tough to come by on Linux), and if you want that sort of replicate, parallel-access NFS you might as well use Gluster. The downside of the latter, though, is it's currently NFSv3-only, without sideband locking. I'll look this over when I have a chance. I think I can get away without a NFSv4 virtual root because I'm exporting everything to my cluster either read-only, or only one system at a time will do any writing. Now that you've warned me, I'll do some more checking. b) I had similar problems using the exportfs resource in a primary-secondary setup without clones. Why would a resource being cloned create an ordering problem? I haven't set the interleave parameter (even with the documentation I'm not sure what it does) but A before B before C seems pretty clear, even for cloned resources. As far as what interleave does. Suppose you have two clones, A and B. And they're linked with an order constraint, like this: order A_before_B inf: A B ... then if interleave is false, _all_ instances of A must be started before _any_ instance of B gets to start anywhere in the cluster. 
However if interleave is true, then for any node only the _local_ instance of A needs to be started before it can start the corresponding _local_ instance of B. In other words, interleave=true is actually the reasonable thing to set on all clone instances by default, and I believe the pengine actually does use a default of interleave=true on defined clone sets since some 1.1.x release (I don't recall which). Thanks, Florian. That's a great explanation. I'll probably stick interleave=true on most of my clones just to make sure. It explains an error message I've seen in the logs: Mar 2 18:15:19 hypatia-tb pengine: [4414]: ERROR: clone_rsc_colocation_rh: Cannot interleave clone ClusterIPClone and Gfs2Clone because they do not support the same number of resources per node Because ClusterIPClone has globally-unique=true and clone-max=2, it's possible for both instances to be running on a single node; I've seen this a few times in my testing when cycling power on one of the nodes. Interleaving doesn't make sense in such a case. Bill, seeing as you've already pastebinned your config and crm_mon output, could you also pastebin your whole CIB as per cibadmin -Q output? Thanks. Sure: http://pastebin.com/pjSJ79H6. It doesn't have the exportfs resources in it; I took them out before leaving for the weekend. If it helps, I'll put them back in and try to get the cibadmin -Q output before any nodes crash. For a test, I stuck in a exportfs resource with all the ordering constraints. 
Here's the cibadmin -Q output from that: http://pastebin.com/nugdufJc The output of crm_mon just after doing that, showing resource failure: http://pastebin.com/cyCFGUSD Then all the resources are stopped: http://pastebin.com/D62sGSrj A few seconds later one of the nodes is fenced, but this does not bring up anything: http://pastebin.com/wzbmfVas
Re: [Linux-HA] Apparent problem in pacemaker ordering
On 3/3/12 12:03 PM, emmanuel segura wrote: are you sure the exportfs agent can be used with an active/active clone? a) I've been through the script. If there's some problem associated with it being cloned, I haven't seen it. (It can't handle globally-unique=true, but I didn't turn that on.) b) I had similar problems using the exportfs resource in a primary-secondary setup without clones. Why would a resource being cloned create an ordering problem? I haven't set the interleave parameter (even with the documentation I'm not sure what it does) but A before B before C seems pretty clear, even for cloned resources. On 3 March 2012 at 00:12, William Seligman selig...@nevis.columbia.edu wrote: One step forward, two steps back. I'm working on a two-node primary-primary cluster. I'm debugging problems I have with the ocf:heartbeat:exportfs resource. For some reason, pacemaker sometimes appears to ignore ordering I put on the resources. Florian Haas recommended pastebin in another thread, so let's give it a try. Here's my complete current output of crm configure show: http://pastebin.com/bbSsqyeu Here's a quick sketch: The sequence of events is supposed to be DRBD (ms) - clvmd (clone) - gfs2 (clone) - exportfs (clone). But that's not what happens. What happens is that pacemaker tries to start up the exportfs resource immediately. This fails, because what it's exporting doesn't exist until after gfs2 runs. Because the cloned resource can't run on either node, the cluster goes into a state in which one node is fenced, the other node refuses to run anything. Here's a quick snapshot I was able to take of the output of crm_mon that shows the problem: http://pastebin.com/CiZvS4Fh This shows that pacemaker is still trying to start the exportfs resources, before it has run the chain drbd-clvmd-gfs2. 
Just to confirm the obvious, I have the ordering constraints in the full configuration linked above (Admin is my DRBD resource):

order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone

This is not the only time I've observed this behavior in pacemaker. Here's a lengthy log file excerpt from the same time I took the crm_mon snapshot: http://pastebin.com/HwMUCmcX I can see that other resources, the symlink ones in particular, are being probed and started before the drbd Admin resource has a chance to be promoted. In looking at the log file, it may help to know that /mail and /var/nevis are gfs2 partitions that aren't mounted until the Gfs2 resource starts. So this isn't the first time I've seen this happen. This is just the first time I've been able to reproduce this reliably and capture a snapshot. Any ideas?
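Two caveats worth stating next to those constraints. First, ordering governs start/promote sequencing, but pacemaker also runs a one-shot probe (the monitor_0 operation) of every resource on every node it joins, regardless of ordering, so an exportfs agent whose probe errors out (rather than cleanly reporting "not running") while /usr/nevis doesn't exist yet will register as a failure before the chain ever runs. Second, ordering alone doesn't tie the clones to the same node; if that matters, colocation is needed as well. A sketch using the names from the post, untested against this configuration:

```
colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
```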
Re: [Linux-HA] Apparent problem in pacemaker ordering
On 3/3/12 2:14 PM, Florian Haas wrote: On Sat, Mar 3, 2012 at 6:55 PM, William Seligman selig...@nevis.columbia.edu wrote: On 3/3/12 12:03 PM, emmanuel segura wrote: are you sure the exportfs agent can be use it with clone active/active? a) I've been through the script. If there's some problem associated with it being cloned, I haven't seen it. (It can't handle globally-unique=true, but I didn't turn that on.) It shouldn't have a problem with being cloned. Obviously, cloning that RA _really_ makes sense only with the export that manages an NFSv4 virtual root (fsid=0). Otherwise, the export clone has to be hosted on a clustered filesystem, and you'd have to have a pNFS implementation that doesn't suck (tough to come by on Linux), and if you want that sort of replicate, parallel-access NFS you might as well use Gluster. The downside of the latter, though, is it's currently NFSv3-only, without sideband locking. I'll look this over when I have a chance. I think I can get away without a NFSv4 virtual root because I'm exporting everything to my cluster either read-only, or only one system at a time will do any writing. Now that you've warned me, I'll do some more checking. b) I had similar problems using the exportfs resource in a primary-secondary setup without clones. Why would a resource being cloned create an ordering problem? I haven't set the interleave parameter (even with the documentation I'm not sure what it does) but A before B before C seems pretty clear, even for cloned resources. As far as what interleave does. Suppose you have two clones, A and B. And they're linked with an order constraint, like this: order A_before_B inf: A B ... then if interleave is false, _all_ instances of A must be started before _any_ instance of B gets to start anywhere in the cluster. However if interleave is true, then for any node only the _local_ instance of A needs to be started before it can start the corresponding _local_ instance of B. 
In other words, interleave=true is actually the reasonable thing to set on all clone instances by default, and I believe the pengine actually does use a default of interleave=true on defined clone sets since some 1.1.x release (I don't recall which). Thanks, Florian. That's a great explanation. I'll probably stick interleave=true on most of my clones just to make sure. It explains an error message I've seen in the logs: Mar 2 18:15:19 hypatia-tb pengine: [4414]: ERROR: clone_rsc_colocation_rh: Cannot interleave clone ClusterIPClone and Gfs2Clone because they do not support the same number of resources per node Because ClusterIPClone has globally-unique=true and clone-max=2, it's possible for both instances to be running on a single node; I've seen this a few times in my testing when cycling power on one of the nodes. Interleaving doesn't make sense in such a case. Bill, seeing as you've already pastebinned your config and crm_mon output, could you also pastebin your whole CIB as per cibadmin -Q output? Thanks. Sure: http://pastebin.com/pjSJ79H6. It doesn't have the exportfs resources in it; I took them out before leaving for the weekend. If it helps, I'll put them back in and try to get the cibadmin -Q output before any nodes crash.
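For reference, making the interleave setting explicit on a clone is a one-line meta attribute; a crm sketch with names taken from the thread:

```
clone Gfs2Clone Gfs2 \
    meta interleave=true
```

(On globally-unique clones like the ClusterIP example above, interleaving doesn't apply, which is exactly what the pengine error message is complaining about.)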
[Linux-HA] Apparent problem in pacemaker ordering
One step forward, two steps back. I'm working on a two-node primary-primary cluster. I'm debugging problems I have with the ocf:heartbeat:exportfs resource. For some reason, pacemaker sometimes appears to ignore ordering I put on the resources. Florian Haas recommended pastebin in another thread, so let's give it a try. Here's my complete current output of crm configure show: http://pastebin.com/bbSsqyeu Here's a quick sketch: The sequence of events is supposed to be DRBD (ms) - clvmd (clone) - gfs2 (clone) - exportfs (clone). But that's not what happens. What happens is that pacemaker tries to start up the exportfs resource immediately. This fails, because what it's exporting doesn't exist until after gfs2 runs. Because the cloned resource can't run on either node, the cluster goes into a state in which one node is fenced, the other node refuses to run anything. Here's a quick snapshot I was able to take of the output of crm_mon that shows the problem: http://pastebin.com/CiZvS4Fh This shows that pacemaker is still trying to start the exportfs resources, before it has run the chain drbd-clvmd-gfs2. Just to confirm the obvious, I have the ordering constraints in the full configuration linked above (Admin is my DRBD resource):

order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone

This is not the only time I've observed this behavior in pacemaker. Here's a lengthy log file excerpt from the same time I took the crm_mon snapshot: http://pastebin.com/HwMUCmcX I can see that other resources, the symlink ones in particular, are being probed and started before the drbd Admin resource has a chance to be promoted. In looking at the log file, it may help to know that /mail and /var/nevis are gfs2 partitions that aren't mounted until the Gfs2 resource starts. So this isn't the first time I've seen this happen. 
This is just the first time I've been able to reproduce this reliably and capture a snapshot. Any ideas?
Re: [Linux-HA] Apparent problem in pacemaker ordering
Darn it, forgot versions: Red Hat Linux 6.2 (kernel 2.6.32) cman-3.0.12.1 corosync-1.4.1 pacemaker-1.1.6 On 3/2/12 6:12 PM, William Seligman wrote: One step forward, two steps back. I'm working on a two-node primary-primary cluster. I'm debugging problems I have with the ocf:heartbeat:exportfs resource. For some reason, pacemaker sometimes appears to ignore ordering I put on the resources. Florian Haas recommended pastebin in another thread, so let's give it a try. Here's my complete current output of crm configure show: http://pastebin.com/bbSsqyeu Here's a quick sketch: The sequence of events is supposed to be DRBD (ms) - clvmd (clone) - gfs2 (clone) - exportfs (clone). But that's not what happens. What happens is that pacemaker tries to start up the exportfs resource immediately. This fails, because what it's exporting doesn't exist until after gfs2 runs. Because the cloned resource can't run on either node, the cluster goes into a state in which one node is fenced, the other node refuses to run anything. Here's a quick snapshot I was able to take of the output of crm_mon that shows the problem: http://pastebin.com/CiZvS4Fh This shows that pacemaker is still trying to start the exportfs resources, before it has run the chain drbd-clvmd-gfs2. Just to confirm the obvious, I have the ordering constraints in the full configuration linked above (Admin is my DRBD resource):

order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone

This is not the only time I've observed this behavior in pacemaker. Here's a lengthy log file excerpt from the same time I took the crm_mon snapshot: http://pastebin.com/HwMUCmcX I can see that other resources, the symlink ones in particular, are being probed and started before the drbd Admin resource has a chance to be promoted. 
In looking at the log file, it may help to know that /mail and /var/nevis are gfs2 partitions that aren't mounted until the Gfs2 resource starts. So this isn't the first time I've seen this happen. This is just the first time I've been able to reproduce this reliably and capture a snapshot. Any ideas?
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
On 3/1/12 4:15 AM, emmanuel segura wrote: Can you show me your /etc/cluster/cluster.conf? Because I think your problem is a fencing loop. Here it is, /etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster config_version="17" name="Nevis_HA">
  <logging debug="off"/>
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
      <altname name="hypatia-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
      <altname name="orestes-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
  <fence_daemon post_join_delay="30"/>
  <rm disabled="1"/>
</cluster>

On 1 March 2012 at 01:03, William Seligman selig...@nevis.columbia.edu wrote: On 2/28/12 7:26 PM, Lars Ellenberg wrote: On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote: off-topic Sigh. I wish that were the reason. The reason why I'm doing dual-primary is that I've got a single-primary two-node cluster in production that simply doesn't work. One node runs resources; the other sits and twiddles its fingers; fine. But when the primary goes down, the secondary has trouble starting up all the resources; when we've actually had primary failures (UPS goes haywire, hard drive failure) the secondary often winds up in a state in which it runs none of the significant resources. With the dual-primary setup I have now, both machines are running the resources that typically cause problems in my single-primary configuration. If one box goes down, the other doesn't have to fail over anything; it's already running them. (I needed IPaddr2 cloning to work properly for this to work, which is why I started that thread... 
and all the stupider of me for missing that crucial page in Clusters From Scratch.) My only remaining problem with the configuration is restoring a fenced node to the cluster. Hence my tests, and the reason why I started this thread. /off-topic Uhm, I do think that is exactly on topic. Rather fix your resources to be able to successfully take over, than add even more complexity. What resources would that be, and why are they not taking over? I can't tell you in detail, because the major snafu happened on a production system after a power outage a few months ago. My goal was to get the thing stable as quickly as possible. In the end, that turned out to be a non-HA configuration: One node runs corosync+pacemaker+drbd, while the other just runs drbd. It works, in the sense that the users get their e-mail. If there's a power outage, I have to bring things up manually. So my only reference is the test-bench dual-primary setup I've got now, which is exhibiting the same kinds of problems even though the OS versions, software versions, and layout are different. This suggests that the problem lies in the way I'm setting up the configuration. The problems I have seem to be in the general category of the 'good guy' gets fenced when the 'bad guy' gets into trouble. Examples:

- Assuming I start out with two crashed nodes: if I just start up DRBD and nothing else, the partitions sync quickly with no problems.
- If the system starts with cman running, and I start drbd, it's likely that the node which is _not_ Outdated will be fenced (rebooted). Same thing if cman+pacemaker is running.
- Cloned ocf:heartbeat:exportfs resources are giving me problems as well (which is why I tried making changes to that resource script). Assume I start with one node running cman+pacemaker, and the other stopped. I turn on the stopped node. This will typically result in the running node being fenced, because it times out when stopping the exportfs resource. 
Falling back to DRBD 8.3.12 didn't change this behavior. My pacemaker configuration is long, so I'll excerpt what I think are the relevant pieces in the hope that it will be enough for someone to say "You fool! This is covered in Pacemaker Explained page 56!" When bringing up a stopped node, in order to restart AdminClone, pacemaker wants to stop ExportsClone, then Gfs2Clone, then ClvmdClone. As I said, it's the failure to stop ExportMail on the running node that causes it to be fenced.

primitive AdminDrbd ocf:linbit:drbd \
    params drbd_resource="admin" \
    op monitor interval="60s" role="Master" \
    op monitor interval="59s" role="Slave" \
    op stop interval="0" timeout="320" \
    op start interval="0" timeout="240"
ms AdminClone AdminDrbd \
    meta master-max="2" master-node-max="1" \
    clone-max="2" clone-node-max="1" notify="true"
primitive Clvmd lsb:clvmd op monitor
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
On 3/1/12 6:34 AM, emmanuel segura wrote: try to change the fence daemon tag like this: <fence_daemon clean_start="1" post_join_delay="30"/> (change your cluster config version and then reboot the cluster). This did not change the behavior of the cluster. In particular, I'm still dealing with this: - If the system starts with cman running, and I start drbd, it's likely that the system that is _not_ Outdated will be fenced (rebooted). On 01 March 2012 12:28, William Seligman selig...@nevis.columbia.edu wrote: [...]
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
On 3/1/12 12:10 PM, William Seligman wrote: On 3/1/12 6:34 AM, emmanuel segura wrote: try to change the fence daemon tag like this: <fence_daemon clean_start="1" post_join_delay="30"/> (change your cluster config version and then reboot the cluster). This did not change the behavior of the cluster. In particular, I'm still dealing with this: - If the system starts with cman running, and I start drbd, it's likely that the system that is _not_ Outdated will be fenced (rebooted). This just happened again. Here's the log from the bad node, the one I stopped and then restarted. cman is running (not pacemaker). I start drbd:

Mar 1 12:03:49 orestes-tb kernel: drbd: initialized. Version: 8.3.12 (api:88/proto:86-96)
Mar 1 12:03:49 orestes-tb kernel: drbd: GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
Mar 1 12:03:49 orestes-tb kernel: drbd: registered as block device major 147
Mar 1 12:03:49 orestes-tb kernel: drbd: minor_table @ 0x88041dbc4b80
Mar 1 12:03:49 orestes-tb kernel: block drbd0: Starting worker thread (from cqueue [2942])
Mar 1 12:03:49 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
Mar 1 12:03:50 orestes-tb kernel: block drbd0: Found 57 transactions (57 active extents) in activity log.
Mar 1 12:03:50 orestes-tb kernel: block drbd0: Method to ensure write ordering: barrier
Mar 1 12:03:50 orestes-tb kernel: block drbd0: max BIO size = 130560
Mar 1 12:03:50 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing device's (32 -> 768)
Mar 1 12:03:50 orestes-tb kernel: block drbd0: drbd_bm_resize called with capacity == 5611549368
Mar 1 12:03:50 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671 words=10960058 pages=21407
Mar 1 12:03:50 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
Mar 1 12:03:50 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took 625 jiffies
Mar 1 12:03:50 orestes-tb kernel: block drbd0: recounting of set bits took additional 86 jiffies
Mar 1 12:03:50 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Mar 1 12:03:50 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
Mar 1 12:03:50 orestes-tb kernel: block drbd0: attached to UUIDs 878999EFCFBE8E08::494B48826E41A2C2:494A48826E41A2C3
Mar 1 12:03:50 orestes-tb kernel: block drbd0: conn( StandAlone -> Unconnected )
Mar 1 12:03:50 orestes-tb kernel: block drbd0: Starting receiver thread (from drbd0_worker [2951])
Mar 1 12:03:50 orestes-tb kernel: block drbd0: receiver (re)started
Mar 1 12:03:50 orestes-tb kernel: block drbd0: conn( Unconnected -> WFConnection )
Mar 1 12:03:51 orestes-tb kernel: block drbd0: Handshake successful: Agreed network protocol version 96
Mar 1 12:03:51 orestes-tb kernel: block drbd0: conn( WFConnection -> WFReportParams )
Mar 1 12:03:51 orestes-tb kernel: block drbd0: Starting asender thread (from drbd0_receiver [2965])
Mar 1 12:03:51 orestes-tb kernel: block drbd0: data-integrity-alg: not-used
Mar 1 12:03:51 orestes-tb kernel: block drbd0: drbd_sync_handshake:
Mar 1 12:03:51 orestes-tb kernel: block drbd0: self 878999EFCFBE8E08::494B48826E41A2C2:494A48826E41A2C3 bits:0 flags:0
Mar 1 12:03:51 orestes-tb kernel: block drbd0: peer D40A1613FAE8F5E9:878999EFCFBE8E09:878899EFCFBE8E09:494B48826E41A2C3 bits:0 flags:0
Mar 1 12:03:51 orestes-tb kernel: block drbd0: uuid_compare()=-1 by rule 50
Mar 1 12:03:51 orestes-tb kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Mar 1 12:03:53 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
Mar 1 12:04:01 orestes-tb corosync[2296]: [TOTEM ] A processor failed, forming new configuration.
Mar 1 12:04:03 orestes-tb corosync[2296]: [QUORUM] Members[1]: 2
Mar 1 12:04:03 orestes-tb corosync[2296]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 1 12:04:03 orestes-tb kernel: dlm: closing connection to node 1
Mar 1 12:04:03 orestes-tb corosync[2296]: [CPG ] chosen downlist: sender r(0) ip(129.236.252.14) r(1) ip(192.168.100.6) ; members(old:2 left:1)
Mar 1 12:04:03 orestes-tb corosync[2296]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 1 12:04:03 orestes-tb fenced[2350]: fencing node hypatia-tb.nevis.columbia.edu

As near as I can tell, the bad node sees that the good node is Primary and UpToDate, goes into WFSyncUUID... and then corosync/cman cheerfully fences the good node. On 01 March 2012 12:28, William Seligman selig...@nevis.columbia.edu wrote: [...]
Re: [Linux-HA] cman+pacemaker+drbd fencing problem - SOLVED
On 3/1/12 12:56 PM, Lars Ellenberg wrote: On Thu, Mar 01, 2012 at 12:16:17PM -0500, William Seligman wrote: [...]
Mar 1 12:04:01 orestes-tb corosync[2296]: [TOTEM ] A processor failed, forming new configuration.
some random thoughts... DRBD bitmap exchange causes congestion on network, packet storm, irq storm, whatever, and UDP cluster comm packets falling on the floor? Can you change your cluster comm to use an (additional?) dedicated link? Or play with (increase) totem timeouts? Or play with some sysctls to make it less likely for UDP to fall on the floor, if that is what is happening. Maybe if you tcpdump the traffic while you start things up, that could give you some hints as to why corosync thinks that "A processor failed", and it has to fence that failed processor...
Mar 1 12:04:03 orestes-tb corosync[2296]: [QUORUM] Members[1]: 2
Mar 1 12:04:03 orestes-tb corosync[2296]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 1 12:04:03 orestes-tb kernel: dlm: closing connection to node 1
Mar 1 12:04:03 orestes-tb corosync[2296]: [CPG ] chosen downlist: sender r
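Lars's suggestion to increase the totem timeouts could be tried from cluster.conf, since in a cman-based stack the corosync totem section is generated from it rather than from corosync.conf. The values below are illustrative guesses only, not recommendations from this thread; remember to bump config_version when changing the file.

```
<totem token="10000" consensus="12000"/>
```

A longer token timeout gives corosync more slack before it declares "A processor failed" during a bitmap-exchange load spike, at the cost of slower detection of a genuinely dead node.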
[Linux-HA] fence_nut fencing agent - use NUT to fence via UPS
After days spent debugging a fencing issue with my cluster, I know for certain that this fencing agent works, at least for me. I'd like to contribute it to the Linux HA community. In my cluster, the fencing mechanism is to use NUT (Network UPS Tools; http://www.networkupstools.org/) to turn off power to a node. About 1.5 years ago, I contributed a NUT-based fencing agent for Pacemaker 1.0: http://oss.clusterlabs.org/pipermail/pacemaker/2010-August/007347.html That script doesn't work with stonith-ng. So here's a new agent, written in perl, and tested under pacemaker-1.1.6 and nut-2.4.3. I know there's a fence_apc_snmp agent already in resource-agents. However, that agent only works with APC devices with multiple outlet control; it displays an error message when used with my UPSes. This script is for those who'd rather use NUT than play with SNMP MIBs. Enjoy! -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137 | Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/

#!/usr/bin/perl
# External fencing agent that uses the NUT daemon to control an external UPS.
# See the comments below, and the various NUT man pages, for how this
# script works. It should work unchanged with most modern smart APC UPSes in
# a Redhat/Fedora/RHEL-style distribution with the nut package installed.
# Author: William Seligman selig...@nevis.columbia.edu
# License: GPLv2
# The Following Agent Has Been Tested With:
#   pacemaker-1.1.6
#   nut-2.4.3
#
# As you're designing your UPS and fencing set-up, consider that there may be
# three computers involved:
#   1) the machine running this fencing agent;
#   2) the machine being controlled by this agent;
#   3) the machine that can send commands to the UPS.
# On my cluster, all the UPSes have SNMP smartcards, so every host can communicate
# with every UPS; in other words, machines (1) and (3) are the same. If your UPSes
# are controlled via serial or USB connections, then you might have a
# situation in which host (2) is plugged into a UPS that has a serial connection
# to some master power-control computer, and can potentially be fenced
# by any other machine in your cluster.
# You'll probably need the nut daemon running on both the hosts (1) and
# (3). Strictly speaking, there's no reason for NUT to run on (2).
# From a practical standpoint you'll probably want NUT to be running on all the
# systems in your cluster.
#
# For this agent to work, the following conditions have to be met:
# - NUT has to be installed; on RHEL systems, this requires packages nut and
#   nut-client.
# - The nut daemon (the ups or upsd service on RHEL) must be running on hosts
#   (1) and (3). This agent does not start/stop the nut daemons for you.
# - The name of the UPS that affects host (2) has to be defined in ups.conf on
#   host (3). The format for the --ups option is upsname[@controlhost[:port]]. The
#   default controlhost is 'localhost'. If you use SNMP management cards, you want
#   to make sure you issue commands to a community with read/write privileges; the
#   default is the 'private' community. An example ups.conf:
#     [myhost-ups]
#       driver = snmp-ups
#       port = myhost-ups.example.com
#       community = private
#       mibs = apcc
# - The --username and --password options to access the UPS must be defined in
#   upsd.users on host (3), with the instcmds for poweron, poweroff, and reset allowed.
#   An example upsd.users:
#     [myuser]
#       password = mypassword
#       actions = SET
#       instcmds = ALL
# - Host (1) must be allowed access via upsd.conf and upsd.users on host (3).
#   On RHEL systems, these files are in /etc/ups. In nut-2.4 and greater, there are
#   no per-host access restrictions, but you'll need to grant access in
#   nut-2.2 or lower.
# - If you want to be able to unfence host (2) via stonith_admin, you might want
#   to set its BIOS to boot up on AC power restore, as opposed to last state or off.
#   Otherwise the machine might not come back on even if the UPS restores power.
#
# This agent doesn't keep track of which host it controls. Use the
# Pacemaker parameters for that (man stonithd); e.g.:
#   primitive StonithMyHost stonith:fence_nut \
#     op monitor interval=60 timeout=30 on-fail=stop \
#     params pcmk_host_list=myhost.example.com pcmk_host_check=static-list \
#       ups=myhost-ups username=myuser password=mypassword
# Note the use of on-fail=stop. The main way this resource's monitor can fail
# is if we lose communication with the UPS. That's not great if it happens, but
# consider what happens if we allow the default on-fail=fence, especially in a
# two-node cluster; do you want host (1) to be fenced solely because it can
# no longer fence host (2)? If you have more than two nodes, on-fail=restart
# is an alternative, but if someone's pulled the communications cable from the
# UPS then the resource will just shift from
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
On 2/27/12 8:40 PM, Andrew Beekhof wrote: Oh, what does the fence_pcmk file look like? This is a standard part of the pacemaker-1.1.6 package. According to http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_configuring_cman_fencing.html it causes any fencing requests from cman to be redirected to pacemaker. Since you asked, I've attached a copy of the file. I note that if this script is used to fence a system it writes to /var/log/messages using logger, and there is no such log message in my logs. So I guess cman is off the hook. On Tue, Feb 28, 2012 at 11:49 AM, William Seligman selig...@nevis.columbia.edu wrote: I'm trying to set up an active/active HA cluster as explained in Clusters From Scratch (which I just re-read after my last problem). I'll give versions and config files below, but I'll start with what happens. I start with an active/active cman+pacemaker+drbd+gfs2 cluster, with fencing enabled. My fencing mechanism cuts power to a node by turning the load off in its UPS. The two nodes are hypatia-tb and orestes-tb. I want to test fencing and recovery. I start with both nodes running, and resources properly running on both nodes. Then I simulate failure on one node, e.g., orestes-tb. I've done this with crm node standby, service pacemaker off, or by pulling the plug. As expected, all the resources move to hypatia-tb, with the drbd resource as Primary. When I try to bring orestes-tb back into the cluster with crm node online or service pacemaker on (the inverse of how I removed it), orestes-tb is fenced. OK, that makes sense, I guess; there's a potential split-brain situation. Not really, that should only happen if the two nodes can't see each other. Which should not be the case. Only when you pull the plug should orestes-tb be fenced. Or if you're using a fencing device that requires the node to have power, then I can imagine that turning it on again might result in fencing. But not for the other cases. 
I ran a test: I turned off pacemaker (and so DRBD) on orestes-tb. I touched a file on the hypatia-tb DRBD partition, to make it the newer one. Then I turned off pacemaker on hypatia-tb. Finally I turned on just drbd on hypatia-tb, then on orestes-tb. From /var/log/messages on hypatia-tb:

Feb 28 11:39:19 hypatia-tb kernel: d-con admin: Starting worker thread (from drbdsetup [21822])
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: disk( Diskless -> Attaching )
Feb 28 11:39:19 hypatia-tb kernel: d-con admin: Method to ensure write ordering: barrier
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: max BIO size = 130560
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: Adjusting my ra_pages to backing device's (32 -> 768)
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: drbd_bm_resize called with capacity == 5611549368
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: resync bitmap: bits=701443671 words=10960058 pages=21407
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: bitmap READ of 21407 pages took 576 jiffies
Feb 28 11:39:20 hypatia-tb kernel: block drbd0: recounting of set bits took additional 87 jiffies
Feb 28 11:39:20 hypatia-tb kernel: block drbd0: 55 MB (14114 bits) marked out-of-sync by on disk bit-map.
Feb 28 11:39:20 hypatia-tb kernel: block drbd0: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated )
Feb 28 11:39:20 hypatia-tb kernel: block drbd0: attached to UUIDs 862A336609FD27CD:BFFB722D5E3E15D7:6E63EC4258C86AF2:6E62EC4258C86AF2
Feb 28 11:39:20 hypatia-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
Feb 28 11:39:20 hypatia-tb kernel: d-con admin: Starting receiver thread (from drbd_w_admin [21824])
Feb 28 11:39:20 hypatia-tb kernel: d-con admin: receiver (re)started
Feb 28 11:39:20 hypatia-tb kernel: d-con admin: conn( Unconnected -> WFConnection )

From /var/log/messages on orestes-tb:

Feb 28 11:39:51 orestes-tb kernel: d-con admin: Starting worker thread (from drbdsetup [17827])
Feb 28 11:39:51 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
Feb 28 11:39:51 orestes-tb kernel: d-con admin: Method to ensure write ordering: barrier
Feb 28 11:39:51 orestes-tb kernel: block drbd0: max BIO size = 130560
Feb 28 11:39:51 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing device's (32 -> 768)
Feb 28 11:39:51 orestes-tb kernel: block drbd0: drbd_bm_resize called with capacity == 5611549368
Feb 28 11:39:51 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671 words=10960058 pages=21407
Feb 28 11:39:51 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
Feb 28 11:39:52 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took 735 jiffies
Feb 28 11:39:52 orestes-tb kernel: block drbd0: recounting of set bits took additional 93 jiffies
Feb 28 11:39:52 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Feb 28 11:39:52 orestes-tb kernel: block drbd0
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
On 2/28/12 2:09 PM, Lars Ellenberg wrote: On Tue, Feb 28, 2012 at 01:21:51PM -0500, William Seligman wrote: On 2/27/12 8:40 PM, Andrew Beekhof wrote: Oh, what does the fence_pcmk file look like? This is a standard part of the pacemaker-1.1.6 package. According to http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_configuring_cman_fencing.html it causes any fencing requests from cman to be redirected to pacemaker. Since you asked, I've attached a copy of the file. I note that if this script is used to fence a system it writes to /var/log/messages using logger, and there is no such log message in my logs. So I guess cman is off the hook. You say "fencing resource-only;" in drbd.conf. But you did not show the fencing handler used? Did you specify one at all? It looks like I over-edited when I got rid of the comments before I posted my configuration. The relevant sections are:

disk {
    fencing resource-only;
}
handlers {
    pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    split-brain "/usr/lib/drbd/notify-split-brain.sh sysad...@nevis.columbia.edu";
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}

Besides, for a dual-primary DRBD setup, you must have "fencing resource-and-stonith;", and you should use a DRBD fencing handler that really fences off the peer. It may additionally set constraints. Do crm-fence-peer.sh or Lon Hohberger's obliterate-peer.sh really fence off a peer? I suspect your answer will be no, since from what I can tell in a cman+pacemaker configuration they both wind up calling stonith_admin.
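For reference, the dual-primary arrangement Lars calls for ("fencing resource-and-stonith;" plus a handler that really fences the peer) would look something like the following drbd.conf fragment. This is a sketch of his suggestion, not the configuration actually in use in this thread, which has resource-only; the handler choice is exactly the open question being discussed.

```
disk {
    # Freeze I/O on loss of the peer and require the peer to be
    # fenced before resuming writes.
    fencing resource-and-stonith;
}
handlers {
    # crm-fence-peer.sh sets a Pacemaker constraint; whether that
    # counts as "really fencing" is the question raised above.
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
```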
Also, maybe that post helps to realize some of the problems involved: http://www.gossamer-threads.com/lists/linuxha/pacemaker/62927#62927 Especially the part about "But just because you can shoot someone does not mean you have the bi^Wbetter data." Because of the increased complexity, I strongly recommend against dual-primary DRBD, unless you have a very good reason to want it. "Because it can be done" does not count as a good reason in that context. off-topic Sigh. I wish that were the reason. The reason why I'm doing dual-primary is that I've got a single-primary two-node cluster in production that simply doesn't work. One node runs resources; the other sits and twiddles its fingers; fine. But when primary goes down, secondary has trouble starting up all the resources; when we've actually had primary failures (UPS goes haywire, hard drive failure) the secondary often winds up in a state in which it runs none of the significant resources. With the dual-primary setup I have now, both machines are running the resources that typically cause problems in my single-primary configuration. If one box goes down, the other doesn't have to fail over anything; it's already running them. (I needed IPaddr2 cloning to work properly for this to work, which is why I started that thread... and all the stupider of me for missing that crucial page in Clusters From Scratch.) My only remaining problem with the configuration is restoring a fenced node to the cluster. Hence my tests, and the reason why I started this thread. /off-topic More comments below. On Tue, Feb 28, 2012 at 11:49 AM, William Seligman selig...@nevis.columbia.edu wrote: I'm trying to set up an active/active HA cluster as explained in Clusters From Scratch (which I just re-read after my last problem). I'll give versions and config files below, but I'll start with what happens. I start with an active/active cman+pacemaker+drbd+gfs2 cluster, with fencing enabled.
My fencing mechanism cuts power to a node by turning the load off in its UPS. The two nodes are hypatia-tb and orestes-tb. I want to test fencing and recovery. I start with both nodes running, and resources properly running on both nodes. Then I simulate failure on one node, e.g., orestes-tb. I've done this with crm node standby, service pacemaker off, or by pulling the plug. As expected, all the resources move to hypatia-tb, with the drbd resource as Primary. When I try to bring orestes-tb back into the cluster with crm node online or service pacemaker on (the inverse of how I removed it), orestes-tb is fenced. OK, that makes sense, I guess; there's a potential split-brain situation. Not really, that should only happen if the two nodes can't see each other. Which should not be the case. Only when you pull the plug should orestes-tb
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
On 2/28/12 5:27 PM, Andrew Beekhof wrote: On Wed, Feb 29, 2012 at 5:21 AM, William Seligman selig...@nevis.columbia.edu wrote: While I was setting up the test for the previous paragraph, there was a problem with another resource (ocf:heartbeat:exportfs) that couldn't be properly monitored on either node. This led to a cycle of fencing where each node would successively fence the other because the exportfs resource couldn't run on either node. I had to quickly change my configuration to turn off monitoring on the resource. Not being able to run is fine, but not being able to stop would definitely cause fencing. Make sure the RA can always stop ;-) I'm not the one who wrote ocf:heartbeat:exportfs. I've already had my fling with trying to revise it. I can only hope that the folks who wrote it knew what they were doing; they certainly know more than I do!
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
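Beekhof's "make sure the RA can always stop" can be illustrated with a sketch. This is not the real ocf:heartbeat:exportfs agent; it is a hypothetical stop action (the export directory and messages are invented) showing the key idea: "already stopped" is reported as success, so stop only fails on a genuine error rather than on absent state, and never escalates to fencing just because the resource was never running.

```shell
#!/bin/sh
# Hedged sketch of a "never fails needlessly" OCF-style stop action.
OCF_SUCCESS=0
EXPORT_DIR="${1:-/srv/export}"   # illustrative directory, not from the thread

ra_stop() {
    # If the exportfs tool is missing or the directory is not exported,
    # the end state stop is supposed to reach ("not running") already
    # holds, so report success instead of an error.
    if ! command -v exportfs >/dev/null 2>&1 || \
       ! exportfs 2>/dev/null | grep -q "^${EXPORT_DIR}"; then
        echo "stop: ${EXPORT_DIR} already unexported"
        return $OCF_SUCCESS
    fi
    # Unexport, but do not let a race (someone else unexported it
    # first) turn into a stop failure that would get the node fenced.
    exportfs -u "*:${EXPORT_DIR}" 2>/dev/null || true
    echo "stop: ${EXPORT_DIR} unexported"
    return $OCF_SUCCESS
}

ra_stop
```

The real agent is more involved (it must also handle rmtab and monitor semantics), but the stop-is-idempotent property is what prevents the stop-timeout-to-fencing cycle described above.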
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/24/12 3:36 PM, William Seligman wrote: On 2/17/12 7:30 AM, Dejan Muhamedagic wrote: OK, I guess that'd also be doable by checking the following variables: OCF_RESKEY_CRM_meta_notify_inactive_resource (set of currently inactive instances) OCF_RESKEY_CRM_meta_notify_stop_resource (set of instances which were just stopped) Any volunteers for a patch? :) a) I have a test cluster that I can bring up and down at will; b) I'm a glutton for punishment. So I'll volunteer, since I offered to try to do something in the first place. I think I've got a handle on what to look for; e.g., one has to look for notify_type=pre and notify_operation=stop in the 'node_down' test. Here's my patch, in my usual overly-commented style. Notes: - To make this work, you need to turn on notify in the clone resources; e.g., clone ipaddr2_clone ipaddr2_resource meta notify=true None of the clone examples I saw in the documentation (Clusters From Scratch, Pacemaker Explained) show the notify option; only the ms examples do. You may want to revise the documentation with an IPaddr2 example. - I tested this with my two-node cluster, and it works. I wrote it for a multi-node cluster, but I can't be sure it will work for more than two nodes. Would some nice person test this? - I wrote my code assuming that the clone number assigned to a node would remain constant. If the clone numbers were to change by deleting/adding a node to the cluster, I don't know what would happen. Enjoy! -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/ --- IPaddr2.ori 2012-02-16 11:51:04.942688344 -0500 +++ /usr/lib/ocf/resource.d/heartbeat/IPaddr2 2012-02-27 15:23:46.856510474 -0500 @@ -13,6 +13,7 @@ # Copyright (c) 2003 Tuomo Soini # Copyright (c) 2004-2006 SUSE LINUX AG, Lars Marowsky-Brée #All Rights Reserved. 
+# Additions for high availability 2012 by William Seligman
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of version 2 of the GNU General Public License as
@@ -86,7 +87,7 @@
 This Linux-specific resource manages IP alias IP addresses.
 It can add an IP alias, or remove one.
 In addition, it can implement Cluster Alias IP functionality
-if invoked as a clone resource.
+if invoked as a clone resource with 'meta notify="true"'.
 </longdesc>
 <shortdesc lang="en">Manages virtual IPv4 addresses (Linux specific version)</shortdesc>
@@ -254,6 +255,7 @@
 <actions>
 <action name="start"   timeout="20s" />
 <action name="stop"    timeout="20s" />
+<action name="notify"  timeout="20s" />
 <action name="status" depth="0"  timeout="20s" interval="10s" />
 <action name="monitor" depth="0"  timeout="20s" interval="10s" />
 <action name="meta-data"  timeout="5s" />
@@ -849,6 +851,101 @@
 fi
 }
 
+# Make the IPaddr2 resource highly-available by adjusting the iptables
+# information if nodes drop out of the cluster.
+handle_notify() {
+	# If this is not a cloned IPaddr2 resource, do nothing.
+	# (But if it's not cloned, how did the user set 'meta notify="true"'?)
+	if [ $IP_INC_GLOBAL -eq 0 ]; then
+		ocf_log info "notify action on non-cloned resource; remove meta notify='true'"
+		exit $OCF_SUCCESS
+	fi
+
+	# To test if nodes are dropped, the best flags are when notify_type=pre and
+	# notify_operation=stop. You might not get post/stop if a node is fenced.
+	if [ "x$OCF_RESKEY_CRM_meta_notify_type" = "xpre" ] && [ "x$OCF_RESKEY_CRM_meta_notify_operation" = "xstop" ]; then
+
+		# The stopping nodes will still be included in the
+		# active_resource list, so we have to remove them.
+		local active="$OCF_RESKEY_CRM_meta_notify_active_resource"
+		for stopping in $OCF_RESKEY_CRM_meta_notify_stop_resource
+		do
+			# Sanity check: If the user has done a "crm node standby", then
+			# this method can be called by the node that's stopping.
+			local stopping_clone=`echo ${stopping} | sed "s/[^[:space:]]\+://"`
+			if [ ${stopping_clone} -eq $OCF_RESKEY_CRM_meta_clone ]; then
+				exit $OCF_SUCCESS
+			fi
+
+			# We're sane, so remove the stopping node from the active list.
+			active=`echo ${active} | sed "s/${stopping}//"`
+		done
+
+		# One of the remaining nodes has to take over the job of the dropped
+		# node(s). I'm doing the simplest thing, and choose the last
+		# node in the list of active resources. active_resource is a list like
+		# "name:0 name:1 name:2".
+		local selected_node=`echo ${active} | sed s
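For readers following the patch: the sed call strips everything up to the colon in a clone instance name like ipaddr2_resource:1, leaving the bare clone number, and the active list is pruned with a second substitution. A standalone sketch of just that string handling (resource names here are illustrative, not from any real cluster):

```shell
#!/bin/sh
# Extract the clone number from a Pacemaker instance name of the form "name:N".
clone_number() {
    echo "$1" | sed "s/[^[:space:]]\+://"
}

# Prune a stopping instance from the space-separated active list, as the
# patch does before choosing which surviving node takes over.
active="ipaddr2_resource:0 ipaddr2_resource:1"
stopping="ipaddr2_resource:0"
active=`echo ${active} | sed "s/${stopping}//"`

echo "clone number: $(clone_number ipaddr2_resource:1)"   # prints "clone number: 1"
echo "still active:${active}"
```

Note that \+ is a GNU sed extension; on a strictly POSIX sed you would write \{1,\} instead.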
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/27/12 4:10 PM, Lars Ellenberg wrote: On Mon, Feb 27, 2012 at 03:39:04PM -0500, William Seligman wrote: On 2/24/12 3:36 PM, William Seligman wrote: On 2/17/12 7:30 AM, Dejan Muhamedagic wrote: OK, I guess that'd also be doable by checking the following variables: OCF_RESKEY_CRM_meta_notify_inactive_resource (set of currently inactive instances) OCF_RESKEY_CRM_meta_notify_stop_resource (set of instances which were just stopped) Any volunteers for a patch? :) a) I have a test cluster that I can bring up and down at will; b) I'm a glutton for punishment. So I'll volunteer, since I offered to try to do something in the first place. I think I've got a handle on what to look for; e.g., one has to look for notify_type=pre and notify_operation=stop in the 'node_down' test. Here's my patch, in my usual overly-commented style. Sorry, I may be missing something obvious, but... is this not *the* use case of globally-unique=true? I did not know about globally-unique. I just tested it, replacing (with name substitutions): clone ipaddr2_clone ipaddr2_resource meta notify=true with clone ipaddr2_clone ipaddr2_resource meta globally-unique=true This fell back to the old behavior I described in the first message in this thread: iptables did not update when I took down one of my nodes. I expected this, since according to Pacemaker Explained, globally-unique=true is the default. If this had worked, I never would have reported the problem in the first place. Is there something else that could suppress the behavior you described for globally-unique=true? Which makes it possible to set clone-node-max = clone-max = number of nodes? Or even 7 times (or whatever) number of nodes. And all the iptables magic is in the start operation. If one of the nodes fails, it's bucket(s) will be re-allocated to the surviving nodes. And that is all fully implemented already (at least that's how I read the script). 
What is not implemented is changing the number of buckets (aka clone-max) without restarting clones. No need for fancy stuff in *pre* notifications, which are only statements of intent; the actual action may still fail, and all will be different than you anticipated. Notes: - To make this work, you need to turn on notify in the clone resources; e.g., clone ipaddr2_clone ipaddr2_resource meta notify="true". None of the clone examples I saw in the documentation (Clusters From Scratch, Pacemaker Explained) show the notify option; only the ms examples do. You may want to revise the documentation with an IPaddr2 example. - I tested this with my two-node cluster, and it works. I wrote it for a multi-node cluster, but I can't be sure it will work for more than two nodes. Would some nice person test this? - I wrote my code assuming that the clone number assigned to a node would remain constant. If the clone numbers were to change by deleting/adding a node to the cluster, I don't know what would happen. For anonymous clones, it can be relabeled. In fact, there are plans to remove the clone number from anonymous clones completely. However, for globally unique clones, the clone number is part of its identifier.
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/27/12 5:33 PM, Lars Ellenberg wrote: On Mon, Feb 27, 2012 at 05:23:36PM -0500, William Seligman wrote: On 2/27/12 4:10 PM, Lars Ellenberg wrote: On Mon, Feb 27, 2012 at 03:39:04PM -0500, William Seligman wrote: On 2/24/12 3:36 PM, William Seligman wrote: On 2/17/12 7:30 AM, Dejan Muhamedagic wrote: OK, I guess that'd also be doable by checking the following variables: OCF_RESKEY_CRM_meta_notify_inactive_resource (set of currently inactive instances) OCF_RESKEY_CRM_meta_notify_stop_resource (set of instances which were just stopped) Any volunteers for a patch? :) a) I have a test cluster that I can bring up and down at will; b) I'm a glutton for punishment. So I'll volunteer, since I offered to try to do something in the first place. I think I've got a handle on what to look for; e.g., one has to look for notify_type=pre and notify_operation=stop in the 'node_down' test. Here's my patch, in my usual overly-commented style. Sorry, I may be missing something obvious, but... is this not *the* use case of globally-unique=true? I did not know about globally-unique. I just tested it, replacing (with name substitutions): clone ipaddr2_clone ipaddr2_resource meta notify=true with clone ipaddr2_clone ipaddr2_resource meta globally-unique=true This fell back to the old behavior I described in the first message in this thread: iptables did not update when I took down one of my nodes. I expected this, since according to Pacemaker Explained, globally-unique=true is the default. If this had worked, I never would have reported the problem in the first place. Is there something else that could suppress the behavior you described for globally-unique=true? You need clone-node-max == clone-max. It defaults to 1. Which obviously prevents nodes already running one instance from taking over an other... I tried it, and it works. So there's no need for my patch. 
The magic invocation for a highly-available IPaddr2 resource is:

clone ip_clone ip_resource meta clone-max="2" clone-node-max="2"

Could this please be documented more clearly somewhere?
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/27/12 5:41 PM, William Seligman wrote: On 2/27/12 5:33 PM, Lars Ellenberg wrote: On Mon, Feb 27, 2012 at 05:23:36PM -0500, William Seligman wrote: On 2/27/12 4:10 PM, Lars Ellenberg wrote: On Mon, Feb 27, 2012 at 03:39:04PM -0500, William Seligman wrote: On 2/24/12 3:36 PM, William Seligman wrote: On 2/17/12 7:30 AM, Dejan Muhamedagic wrote: OK, I guess that'd also be doable by checking the following variables: OCF_RESKEY_CRM_meta_notify_inactive_resource (set of currently inactive instances) OCF_RESKEY_CRM_meta_notify_stop_resource (set of instances which were just stopped) Any volunteers for a patch? :) a) I have a test cluster that I can bring up and down at will; b) I'm a glutton for punishment. So I'll volunteer, since I offered to try to do something in the first place. I think I've got a handle on what to look for; e.g., one has to look for notify_type=pre and notify_operation=stop in the 'node_down' test. Here's my patch, in my usual overly-commented style. Sorry, I may be missing something obvious, but... is this not *the* use case of globally-unique=true? I did not know about globally-unique. I just tested it, replacing (with name substitutions): clone ipaddr2_clone ipaddr2_resource meta notify=true with clone ipaddr2_clone ipaddr2_resource meta globally-unique=true This fell back to the old behavior I described in the first message in this thread: iptables did not update when I took down one of my nodes. I expected this, since according to Pacemaker Explained, globally-unique=true is the default. If this had worked, I never would have reported the problem in the first place. Is there something else that could suppress the behavior you described for globally-unique=true? You need clone-node-max == clone-max. It defaults to 1. Which obviously prevents nodes already running one instance from taking over an other... I tried it, and it works. So there's no need for my patch. 
The magic invocation for a highly-available IPaddr2 resource is:

clone ip_clone ip_resource meta clone-max="2" clone-node-max="2"

Could this please be documented more clearly somewhere? Umm... it turns out to be:

clone ip_clone ip_resource meta globally-unique="true" clone-max="2" clone-node-max="2"

and for a two-node cluster, of course. So I guess globally-unique=true is not the default after all.
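Pulling the whole exchange together, a working configuration for a load-sharing, highly-available cluster IP on a two-node cluster would look roughly like this (a sketch only; the address is the one used earlier in the thread, and the resource names are illustrative):

```
primitive ip_resource ocf:heartbeat:IPaddr2 \
    params ip="129.236.252.13" cidr_netmask="32" \
    op monitor interval="30s"
clone ip_clone ip_resource \
    meta globally-unique="true" clone-max="2" clone-node-max="2"
```

The point of clone-node-max equal to clone-max is that when one node fails, the survivor is allowed to run both instances, so it picks up the failed node's CLUSTERIP bucket.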
[Linux-HA] cman+pacemaker+drbd fencing problem
I'm trying to set up an active/active HA cluster as explained in Clusters From Scratch (which I just re-read after my last problem). I'll give versions and config files below, but I'll start with what happens. I start with an active/active cman+pacemaker+drbd+gfs2 cluster, with fencing enabled. My fencing mechanism cuts power to a node by turning the load off in its UPS. The two nodes are hypatia-tb and orestes-tb. I want to test fencing and recovery. I start with both nodes running, and resources properly running on both nodes. Then I simulate failure on one node, e.g., orestes-tb. I've done this with crm node standby, service pacemaker off, or by pulling the plug. As expected, all the resources move to hypatia-tb, with the drbd resource as Primary. When I try to bring orestes-tb back into the cluster with crm node online or service pacemaker on (the inverse of how I removed it), orestes-tb is fenced. OK, that makes sense, I guess; there's a potential split-brain situation. I bring orestes-tb back up, with the intent of adding it back into the cluster. I make sure the cman, pacemaker, and drbd services are off at system start. On orestes-tb, I type service drbd start. What I expect to happen is that the drbd resource on orestes-tb is marked Outdated or something like that. Then I'd fix it with drbdadm --discard-my-data connect admin or whatever is appropriate. What actually happens is that hypatia-tb is fenced. Since this is the node running all the resources, this is bad behavior. It's even more puzzling when I consider that, at the time, there isn't any fencing resource actually running on orestes-tb; my guess is that DRBD on hypatia-tb is fencing itself. Eventually hypatia-tb reboots, and the cluster goes back to normal. But as a fencing/stability/HA test, this is a failure. I've repeated this with a number of variations. In the end, both systems have to be fenced/rebooted before the cluster is working again. Any ideas?
Versions: Scientific Linux 6.2, kernel 2.6.32, cman-3.0.12, corosync-1.4.1, pacemaker-1.1.6, drbd-8.4.1

/etc/drbd.d/global-common.conf:

global { usage-count yes; }
common {
	startup {
		wfc-timeout          60;
		degr-wfc-timeout     60;
		outdated-wfc-timeout 60;
	}
}

/etc/drbd.d/admin.res:

resource admin {
	protocol C;
	on hypatia-tb.nevis.columbia.edu {
		volume 0 {
			device /dev/drbd0;
			disk /dev/md2;
			flexible-meta-disk internal;
		}
		address 192.168.100.7:7788;
	}
	on orestes-tb.nevis.columbia.edu {
		volume 0 {
			device /dev/drbd0;
			disk /dev/md2;
			flexible-meta-disk internal;
		}
		address 192.168.100.6:7788;
	}
	startup { }
	net {
		allow-two-primaries yes;
		after-sb-0pri discard-zero-changes;
		after-sb-1pri discard-secondary;
		after-sb-2pri disconnect;
		sndbuf-size 0;
	}
	disk {
		resync-rate 100M;
		c-max-rate 100M;
		al-extents 3389;
		fencing resource-only;
	}
}

An edited output of crm configure show:

node hypatia-tb.nevis.columbia.edu
node orestes-tb.nevis.columbia.edu
primitive StonithHypatia stonith:fence_nut \
	params pcmk_host_check="static-list" \
	pcmk_host_list="hypatia-tb.nevis.columbia.edu" \
	ups="sofia-ups" username="admin" password="XXX"
primitive StonithOrestes stonith:fence_nut \
	params pcmk_host_check="static-list" \
	pcmk_host_list="orestes-tb.nevis.columbia.edu" \
	ups="dc-test-stand-ups" username="admin" password="XXX"
location StonithHypatiaLocation StonithHypatia \
	-inf: hypatia-tb.nevis.columbia.edu
location StonithOrestesLocation StonithOrestes \
	-inf: orestes-tb.nevis.columbia.edu

/etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster config_version="17" name="Nevis_HA">
  <logging debug="off"/>
  <cman expected_votes="1" two_node="1" />
  <clusternodes>
    <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
      <altname name="hypatia-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
      <altname name="orestes-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
  <fence_daemon post_join_delay="30" />
  <rm>
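For comparison, the recovery sequence I expected to need on the fenced node is sketched below. This is an assumption, not a tested procedure: the drbdadm option spelling differs between DRBD 8.3 and 8.4, and the resource name admin comes from the config above.

```
# On the rejoining node (e.g. orestes-tb), after reboot:
service drbd start                         # expect the resource to come up Outdated/Inconsistent
drbdadm secondary admin                    # make sure this side is not Primary
drbdadm connect --discard-my-data admin    # resync, discarding local changes
```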
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/16/12 11:14 PM, William Seligman wrote: On 2/16/12 8:13 PM, Andrew Beekhof wrote: On Fri, Feb 17, 2012 at 5:05 AM, Dejan Muhamedagicdeja...@fastmail.fm wrote: On Wed, Feb 15, 2012 at 04:24:15PM -0500, William Seligman wrote: On 2/10/12 4:53 PM, William Seligman wrote: I'm trying to set up an Active/Active cluster (yes, I hear the sounds of kittens dying). Versions: Scientific Linux 6.2 pacemaker-1.1.6 resource-agents-3.9.2 I'm using cloned IPaddr2 resources: primitive ClusterIP ocf:heartbeat:IPaddr2 \ params ip=129.236.252.13 cidr_netmask=32 \ op monitor interval=30s primitive ClusterIPLocal ocf:heartbeat:IPaddr2 \ params ip=10.44.7.13 cidr_netmask=32 \ op monitor interval=31s primitive ClusterIPSandbox ocf:heartbeat:IPaddr2 \ params ip=10.43.7.13 cidr_netmask=32 \ op monitor interval=32s group ClusterIPGroup ClusterIP ClusterIPLocal ClusterIPSandbox clone ClusterIPClone ClusterIPGroup When both nodes of my two-node cluster are running, everything looks and functions OK. From service iptables status on node 1 (hypatia-tb): 5CLUSTERIP all -- 0.0.0.0/010.43.7.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0 6CLUSTERIP all -- 0.0.0.0/010.44.7.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0 7CLUSTERIP all -- 0.0.0.0/0129.236.252.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0 On node 2 (orestes-tb): 5CLUSTERIP all -- 0.0.0.0/010.43.7.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=2 hash_init=0 6CLUSTERIP all -- 0.0.0.0/010.44.7.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=2 hash_init=0 7CLUSTERIP all -- 0.0.0.0/0129.236.252.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=2 hash_init=0 If I do a simple test of ssh'ing into 129.236.252.13, 
I see that I alternately login into hypatia-tb and orestes-tb. All is good. Now take orestes-tb offline. The iptables rules on hypatia-tb are unchanged: 5CLUSTERIP all -- 0.0.0.0/010.43.7.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0 6CLUSTERIP all -- 0.0.0.0/010.44.7.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0 7CLUSTERIP all -- 0.0.0.0/0129.236.252.13 CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0 If I attempt to ssh to 129.236.252.13, whether or not I get in seems to be machine-dependent. On one machine I get in, from another I get a time-out. Both machines show the same MAC address for 129.236.252.13: arp 129.236.252.13 Address HWtype HWaddress Flags Mask Iface hamilton-tb.nevis.colum ether B1:95:5A:B5:16:79 C eth0 Is this the way the cloned IPaddr2 resource is supposed to behave in the event of a node failure, or have I set things up incorrectly? I spent some time looking over the IPaddr2 script. As far as I can tell, the script has no mechanism for reconfiguring iptables in the event of a change of state in the number of clones. I might be stupid -- er -- dedicated enough to make this change on my own, then share the code with the appropriate group. The change seems to be relatively simple. It would be in the monitor operation. In pseudo-code: if (IPaddr2 resource is already started ) then if ( OCF_RESKEY_CRM_meta_clone_max != OCF_RESKEY_CRM_meta_clone_max last time || OCF_RESKEY_CRM_meta_clone != OCF_RESKEY_CRM_meta_clone last time ) ip_stop ip_start Just changing the iptables entries should suffice, right? Besides, doing stop/start in the monitor is sort of unexpected. Another option is to add the missing node to one of the nodes which are still running (echo +n /proc/net/ipt_CLUSTERIP/ip). 
But any of that would be extremely tricky to implement properly (if not impossible). fi fi If this would work, then I'd have two questions for the experts: - Would the values of OCF_RESKEY_CRM_meta_clone_max and/or OCF_RESKEY_CRM_meta_clone change if the number of cloned copies of a resource changed? OCF_RESKEY_CRM_meta_clone_max definitely not. OCF_RESKEY_CRM_meta_clone may change but also probably not; it's just a clone sequence number. In short, there's no way to figure out the total number of clones by examining the environment
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/17/12 7:30 AM, Dejan Muhamedagic wrote: On Fri, Feb 17, 2012 at 01:15:04PM +0100, Dejan Muhamedagic wrote: On Fri, Feb 17, 2012 at 12:13:49PM +1100, Andrew Beekhof wrote: [...] What about notifications? That would be the right point to re-configure things, I'd have thought. Sounds like the right way. Still, it may be hard to coordinate between different instances. Unless we figure out how to map nodes to numbers used by the CLUSTERIP. For instance, the notify operation gets: OCF_RESKEY_CRM_meta_notify_stop_resource=ip_lb:2 OCF_RESKEY_CRM_meta_notify_stop_uname=xen-f But the instance number may not match the node number. Scratch that.

IP_CIP_FILE=/proc/net/ipt_CLUSTERIP/$OCF_RESKEY_ip
IP_INC_NO=`expr ${OCF_RESKEY_CRM_meta_clone:-0} + 1`
...
echo +$IP_INC_NO >$IP_CIP_FILE

/proc/net/ipt_CLUSTERIP/ip is where we should add the node. It should be something like:

notify() {
	if node_down; then
		echo +node_num >/proc/net/ipt_CLUSTERIP/ip
	elif node_up; then
		echo -node_num >/proc/net/ipt_CLUSTERIP/ip
	fi
}

Another issue is that the above code should be executed on _exactly_ one node. OK, I guess that'd also be doable by checking the following variables: OCF_RESKEY_CRM_meta_notify_inactive_resource (set of currently inactive instances) OCF_RESKEY_CRM_meta_notify_stop_resource (set of instances which were just stopped) Any volunteers for a patch? :) a) I have a test cluster that I can bring up and down at will; b) I'm a glutton for punishment. So I'll volunteer, since I offered to try to do something in the first place. I think I've got a handle on what to look for; e.g., one has to look for notify_type=pre and notify_operation=stop in the 'node_down' test.
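The notify() pseudo-code above can be sketched as a real shell function. This is illustrative only: instead of writing to /proc/net/ipt_CLUSTERIP/<ip>, which requires a live CLUSTERIP rule, it prints the string that would be written, and the node-up/node-down tests are reduced to the notify_type/notify_operation check discussed in the thread.

```shell
#!/bin/sh
# Illustrative sketch of the notify() logic discussed above.
# Pacemaker exports OCF_RESKEY_CRM_meta_notify_* for notify operations.
notify_sketch() {
    node_num="$1"
    type="$OCF_RESKEY_CRM_meta_notify_type"        # pre / post
    op="$OCF_RESKEY_CRM_meta_notify_operation"     # start / stop

    if [ "$type" = "pre" ] && [ "$op" = "stop" ]; then
        # A node is going away: a survivor claims its CLUSTERIP bucket.
        echo "+$node_num"
    elif [ "$type" = "post" ] && [ "$op" = "start" ]; then
        # A node came back: hand the bucket back.
        echo "-$node_num"
    fi
}

OCF_RESKEY_CRM_meta_notify_type=pre
OCF_RESKEY_CRM_meta_notify_operation=stop
notify_sketch 2    # prints "+2"
```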
Re: [Linux-HA] Writing a stonith-ng fencing agent in perl
On 2/23/12 5:42 PM, Andrew Beekhof wrote: I'll note that none of the authors of the perl-scripted fencing agents knew that arguments are passed via stdin either. I suspect you'll find they also have some magic for reading them from stdin. Both methods are supported by the agents, although when called from the cluster, by convention, we only use stdin. Yeah, you're right. I didn't see it before, because I wasn't looking for stdin. The real reason the perl-scripted fencing agents don't give the correct response to stonith-admin is that they're looking for an action=XXX parameter from stdin, when the actual parameter being passed is option=XXX. In my fence_nut script (which I'll post after I've finished my fencing tests) I allow for both. Or perhaps I'm assuming too much; they may have been written for some other package than Pacemaker 1.1. Right, they were written for cman/rgmanager originally.
Re: [Linux-HA] Writing a stonith-ng fencing agent in perl
On 2/22/12 6:20 PM, Andrew Beekhof wrote: On Thu, Feb 23, 2012 at 8:21 AM, William Seligman selig...@nevis.columbia.edu wrote: About a 1.5 years ago, I wrote a fencing agent for Pacemaker 1.0.x; it used NUT to shut down power on a UPS: http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg05942.html I'm building a new HA cluster using: Scientific Linux 6.2 (kernel 2.6.32) cman-3.0.12.1 corosync-1.4.1 pacemaker-1.1.6 cluster-glue-1.0.5 clusterlib-3.0.12.1 fence-agents-3.1.5 My old fencing agent, written in bash, won't work with stonith-ng, What makes you think that? A number of things: My old script didn't recognize -o metadata, the XML description that Pacemaker expects has changed a bit, and the big one which I'll get to just after your response... so I wrote a replacement in perl. After much debugging, the problem appears to be that stonith-admin (or whatever library it's calling) doesn't pass any arguments to the perl script. You know they are passed via stdin? No, I did not know this! That was the key; thanks Andrew. I've revised my script, and it appears to work. I'll have to run some more tests before I post it. I'd post the script, but it's not necessary, since I see the same problem in any of the perl-scripted fencing agents in /usr/sbin/fence_* from the regular fence_agent package. If I do: stonith_admin -M -a fence_scsi stonith_admin -M -a fence_vmware_helper ... I don't see metadata, but a response equivalent to no argument. I'll note that none of the authors of the perl-scripted fencing agents knew that arguments are passed via stdin either. Or perhaps I'm assuming too much; they may have been written for some other package than Pacemaker 1.1. 
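To make the stdin convention concrete: the cluster feeds the agent key=value pairs, one per line, on standard input. A minimal, hypothetical parser that accepts both the action= and the older option= spelling (the function and variable names are mine, not taken from any real fence agent):

```shell
#!/bin/sh
# Hypothetical sketch: read fencing-agent parameters from stdin as key=value
# lines, accepting both "action" and the older "option" spelling.
parse_stdin() {
    action=""
    port=""
    while IFS='=' read -r key value; do
        case "$key" in
            action|option) action="$value" ;;
            port)          port="$value" ;;
        esac
    done
    echo "action=$action port=$port"
}

printf 'option=off\nport=hypatia-tb\n' | parse_stdin
# prints "action=off port=hypatia-tb"
```

A real agent would also fall back to command-line arguments when stdin is empty, since both methods are supported.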
[Linux-HA] Suggestion for exportfs resource
I had some problems with the monitor operation of the ocf:heartbeat:exportfs resource. I have a solution for one of them that I'd like to share with the community. The first comes from using regex-like expressions for clientspec; e.g.,

primitive ExportUsrNevis ocf:heartbeat:exportfs \
	op monitor interval="30" timeout="20" \
	params clientspec="*.nevis.columbia.edu" \
	directory="/usr/nevis" fsid="20"

For my version of nfs-utils (1.2.3), expressions like *.nevis.columbia.edu are allowed. The problem is that the monitor operation will fail, since the exportfs resource uses grep to test the result of the exportfs command:

exportfs | grep -zqs "${OCF_RESKEY_directory}[[:space:]]*${OCF_RESKEY_clientspec}"

I've attached a text file with my proposed change. It escapes any regex characters in clientspec. I had another problem for which I don't think there's a simple overall solution: if clientspec refers to a host alias. For example:

# host mail
mail.nevis.columbia.edu is an alias for franklin.nevis.columbia.edu.
franklin.nevis.columbia.edu has address 129.236.252.8

crm configure primitive ExportMail ocf:heartbeat:exportfs \
	params clientspec="mail" directory="/mail" fsid="30"

# exportfs
/mail	franklin.nevis.columbia.edu

The exportfs command canonicalizes the clientspec, so once again the monitor operation will always fail. I either have to use the canonical name in the clientspec, or omit the monitor operation. I tried to come up with simple code to get the canonical name in bash, but it gets tricky to determine both when canonicalization is needed, and how to extract it from the output of the host command in an OS-independent way.
# diff -u exportfs.ori /usr/lib/ocf/resource.d/heartbeat/exportfs
--- exportfs.ori	2012-02-17 18:31:59.518848166 -0500
+++ /usr/lib/ocf/resource.d/heartbeat/exportfs	2012-02-20 18:14:56.254199732 -0500
@@ -181,9 +181,11 @@
 exportfs_monitor ()
 {
+	# Just in case the clientspec contains regexp characters
+	CLIENTSPEC=`echo ${OCF_RESKEY_clientspec} | sed -e 's/[\*\?\[]/\\\\&/g'`
 	# grep -z matches across newlines, which is necessary as
 	# exportfs output wraps lines for long export directory names
-	exportfs | grep -zqs "${OCF_RESKEY_directory}[[:space:]]*${OCF_RESKEY_clientspec}"
+	exportfs | grep -zqs "${OCF_RESKEY_directory}[[:space:]]*${CLIENTSPEC}"
 	# Adapt grep status code to OCF return code
 	case $? in
@@ -224,7 +226,7 @@
 	fi
 	OPTIONS="-o ${OPTIONS}"
-	ocf_run exportfs -v ${OPTIONS} ${OCF_RESKEY_clientspec}:${OCF_RESKEY_directory} || exit $OCF_ERR_GENERIC
+	ocf_run exportfs -v ${OPTIONS} "${OCF_RESKEY_clientspec}:${OCF_RESKEY_directory}" || exit $OCF_ERR_GENERIC
 	# Restore the rmtab to ensure smooth NFS-over-TCP failover
 	restore_rmtab
@@ -246,7 +248,7 @@
 	# Backup the rmtab to ensure smooth NFS-over-TCP failover
 	backup_rmtab
-	ocf_run exportfs -v -u ${OCF_RESKEY_clientspec}:${OCF_RESKEY_directory}
+	ocf_run exportfs -v -u "${OCF_RESKEY_clientspec}:${OCF_RESKEY_directory}"
 	rc=$?
 	if ocf_is_true ${OCF_RESKEY_unlock_on_stop}; then
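Independent of exportfs, the idea of the patch is to backslash-escape glob/regex metacharacters before handing a string to grep, so a clientspec like *.nevis.columbia.edu matches literally. The exact sed expression in the posted diff was mangled by the archive, so this standalone sketch uses my own character class:

```shell
#!/bin/sh
# Escape characters that are special to grep's basic regular expressions,
# so a pattern like "*.nevis.columbia.edu" is matched literally.
escape_bre() {
    echo "$1" | sed 's/[][\.*^$]/\\&/g'
}

pattern=$(escape_bre "*.nevis.columbia.edu")
echo "escaped: $pattern"
echo "/usr/nevis  *.nevis.columbia.edu" | grep -qs "$pattern" && echo matched
```

In the sed replacement, \\ emits a literal backslash and & re-inserts the matched character, which is the standard way to prefix each metacharacter with a backslash.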
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/16/12 8:13 PM, Andrew Beekhof wrote:
On Fri, Feb 17, 2012 at 5:05 AM, Dejan Muhamedagic deja...@fastmail.fm wrote:
Hi,
On Wed, Feb 15, 2012 at 04:24:15PM -0500, William Seligman wrote:
On 2/10/12 4:53 PM, William Seligman wrote:

I'm trying to set up an Active/Active cluster (yes, I hear the sounds of kittens dying).

Versions:
Scientific Linux 6.2
pacemaker-1.1.6
resource-agents-3.9.2

I'm using cloned IPaddr2 resources:

primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip=129.236.252.13 cidr_netmask=32 \
        op monitor interval=30s
primitive ClusterIPLocal ocf:heartbeat:IPaddr2 \
        params ip=10.44.7.13 cidr_netmask=32 \
        op monitor interval=31s
primitive ClusterIPSandbox ocf:heartbeat:IPaddr2 \
        params ip=10.43.7.13 cidr_netmask=32 \
        op monitor interval=32s
group ClusterIPGroup ClusterIP ClusterIPLocal ClusterIPSandbox
clone ClusterIPClone ClusterIPGroup

When both nodes of my two-node cluster are running, everything looks and functions OK. From "service iptables status" on node 1 (hypatia-tb):

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0

On node 2 (orestes-tb):

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=2 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=2 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=2 hash_init=0

If I do a simple test of ssh'ing into 129.236.252.13, I see that I alternately log into hypatia-tb and orestes-tb. All is good. Now take orestes-tb offline. The iptables rules on hypatia-tb are unchanged:

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0

If I attempt to ssh to 129.236.252.13, whether or not I get in seems to be machine-dependent. On one machine I get in, from another I get a time-out. Both machines show the same MAC address for 129.236.252.13:

# arp 129.236.252.13
Address                  HWtype  HWaddress          Flags Mask  Iface
hamilton-tb.nevis.colum  ether   B1:95:5A:B5:16:79  C           eth0

Is this the way the cloned IPaddr2 resource is supposed to behave in the event of a node failure, or have I set things up incorrectly?

I spent some time looking over the IPaddr2 script. As far as I can tell, the script has no mechanism for reconfiguring iptables in the event of a change of state in the number of clones. I might be stupid -- er -- dedicated enough to make this change on my own, then share the code with the appropriate group. The change seems to be relatively simple. It would be in the monitor operation. In pseudo-code:

if ( IPaddr2 resource is already started ) then
    if ( OCF_RESKEY_CRM_meta_clone_max != OCF_RESKEY_CRM_meta_clone_max last time ||
         OCF_RESKEY_CRM_meta_clone != OCF_RESKEY_CRM_meta_clone last time )
        ip_stop
        ip_start
    fi
fi

Just changing the iptables entries should suffice, right? Besides, doing stop/start in the monitor is sort of unexpected. Another option is to add the missing node to one of the nodes which are still running (echo +n > /proc/net/ipt_CLUSTERIP/ip). But any of that would be extremely tricky to implement properly (if not impossible).

If this would work, then I'd have two questions for the experts:

- Would the values of OCF_RESKEY_CRM_meta_clone_max and/or OCF_RESKEY_CRM_meta_clone change if the number of cloned copies of a resource changed?

OCF_RESKEY_CRM_meta_clone_max definitely not. OCF_RESKEY_CRM_meta_clone may change, but also probably not; it's just a clone sequence number. In short, there's no way to figure out the total number of clones by examining the environment. Information such as membership changes doesn't trickle down to the resource instances.

What about notifications? That would be the right point to re-configure things.
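To make the "last time" comparison in the pseudo-code above concrete, here is a minimal shell sketch. It is not part of the stock IPaddr2 agent: the state-file location and the function name are my own inventions for illustration, and a real agent would keep its state under the cluster's resource tmp directory rather than /tmp.

```shell
# Sketch of the "did the clone configuration change?" test proposed above.
# STATEFILE location and function name are hypothetical.
STATEFILE="${STATEFILE:-/tmp/IPaddr2-clone-state}"

clone_config_changed() {
    # The CRM exports these two variables into the agent's environment
    # on every operation; record them and compare with the previous call.
    current="${OCF_RESKEY_CRM_meta_clone_max}:${OCF_RESKEY_CRM_meta_clone}"
    previous=""
    [ -f "$STATEFILE" ] && previous=$(cat "$STATEFILE")
    printf '%s' "$current" > "$STATEFILE"
    # Succeed (return 0) only when a previously recorded value differs.
    [ -n "$previous" ] && [ "$previous" != "$current" ]
}

# The monitor op would then do, in effect:
#   if resource_is_started && clone_config_changed; then ip_stop; ip_start; fi
```

As Dejan notes above, the catch is that these variables do not actually change on membership events, so the check would never fire; the sketch only shows where the hook would sit.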
Re: [Linux-HA] Understanding the behavior of IPaddr2 clone
On 2/10/12 4:53 PM, William Seligman wrote:

I'm trying to set up an Active/Active cluster (yes, I hear the sounds of kittens dying).

Versions:
Scientific Linux 6.2
pacemaker-1.1.6
resource-agents-3.9.2

I'm using cloned IPaddr2 resources:

primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip=129.236.252.13 cidr_netmask=32 \
        op monitor interval=30s
primitive ClusterIPLocal ocf:heartbeat:IPaddr2 \
        params ip=10.44.7.13 cidr_netmask=32 \
        op monitor interval=31s
primitive ClusterIPSandbox ocf:heartbeat:IPaddr2 \
        params ip=10.43.7.13 cidr_netmask=32 \
        op monitor interval=32s
group ClusterIPGroup ClusterIP ClusterIPLocal ClusterIPSandbox
clone ClusterIPClone ClusterIPGroup

When both nodes of my two-node cluster are running, everything looks and functions OK. From "service iptables status" on node 1 (hypatia-tb):

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0

On node 2 (orestes-tb):

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=2 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=2 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=2 hash_init=0

If I do a simple test of ssh'ing into 129.236.252.13, I see that I alternately log into hypatia-tb and orestes-tb. All is good. Now take orestes-tb offline. The iptables rules on hypatia-tb are unchanged:

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0

If I attempt to ssh to 129.236.252.13, whether or not I get in seems to be machine-dependent. On one machine I get in, from another I get a time-out. Both machines show the same MAC address for 129.236.252.13:

# arp 129.236.252.13
Address                  HWtype  HWaddress          Flags Mask  Iface
hamilton-tb.nevis.colum  ether   B1:95:5A:B5:16:79  C           eth0

Is this the way the cloned IPaddr2 resource is supposed to behave in the event of a node failure, or have I set things up incorrectly?

I spent some time looking over the IPaddr2 script. As far as I can tell, the script has no mechanism for reconfiguring iptables in the event of a change of state in the number of clones. I might be stupid -- er -- dedicated enough to make this change on my own, then share the code with the appropriate group. The change seems to be relatively simple. It would be in the monitor operation. In pseudo-code:

if ( IPaddr2 resource is already started ) then
    if ( OCF_RESKEY_CRM_meta_clone_max != OCF_RESKEY_CRM_meta_clone_max last time ||
         OCF_RESKEY_CRM_meta_clone != OCF_RESKEY_CRM_meta_clone last time )
        ip_stop
        ip_start
    fi
fi

If this would work, then I'd have two questions for the experts:

- Would the values of OCF_RESKEY_CRM_meta_clone_max and/or OCF_RESKEY_CRM_meta_clone change if the number of cloned copies of a resource changed?
- Is there some standard mechanism by which RA scripts can maintain persistent information between successive calls?

I realize there's a flaw in the logic: it risks breaking an ongoing IP connection.
But as it stands, IPaddr2 is a clonable resource but not a highly-available one. If one of N cloned copies goes down, then one out of N new network connections to the IP address will fail.

--
Bill Seligman              | Phone: (914) 591-2823
Nevis Labs, Columbia Univ  | mailto://selig...@nevis.columbia.edu
PO Box 137                 | Irvington NY 10533 USA
http://www.nevis.columbia.edu/~seligman/

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
[Linux-HA] Understanding the behavior of IPaddr2 clone
I'm trying to set up an Active/Active cluster (yes, I hear the sounds of kittens dying).

Versions:
Scientific Linux 6.2
pacemaker-1.1.6
resource-agents-3.9.2

I'm using cloned IPaddr2 resources:

primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip=129.236.252.13 cidr_netmask=32 \
        op monitor interval=30s
primitive ClusterIPLocal ocf:heartbeat:IPaddr2 \
        params ip=10.44.7.13 cidr_netmask=32 \
        op monitor interval=31s
primitive ClusterIPSandbox ocf:heartbeat:IPaddr2 \
        params ip=10.43.7.13 cidr_netmask=32 \
        op monitor interval=32s
group ClusterIPGroup ClusterIP ClusterIPLocal ClusterIPSandbox
clone ClusterIPClone ClusterIPGroup

When both nodes of my two-node cluster are running, everything looks and functions OK. From "service iptables status" on node 1 (hypatia-tb):

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0

On node 2 (orestes-tb):

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=2 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=2 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=2 hash_init=0

If I do a simple test of ssh'ing into 129.236.252.13, I see that I alternately log into hypatia-tb and orestes-tb. All is good. Now take orestes-tb offline. The iptables rules on hypatia-tb are unchanged:

5    CLUSTERIP  all  --  0.0.0.0/0  10.43.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=F1:87:E1:64:60:A5 total_nodes=2 local_node=1 hash_init=0
6    CLUSTERIP  all  --  0.0.0.0/0  10.44.7.13      CLUSTERIP hashmode=sourceip-sourceport clustermac=11:8F:23:B9:CA:09 total_nodes=2 local_node=1 hash_init=0
7    CLUSTERIP  all  --  0.0.0.0/0  129.236.252.13  CLUSTERIP hashmode=sourceip-sourceport clustermac=B1:95:5A:B5:16:79 total_nodes=2 local_node=1 hash_init=0

If I attempt to ssh to 129.236.252.13, whether or not I get in seems to be machine-dependent. On one machine I get in, from another I get a time-out. Both machines show the same MAC address for 129.236.252.13:

# arp 129.236.252.13
Address                  HWtype  HWaddress          Flags Mask  Iface
hamilton-tb.nevis.colum  ether   B1:95:5A:B5:16:79  C           eth0

Is this the way the cloned IPaddr2 resource is supposed to behave in the event of a node failure, or have I set things up incorrectly?
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
On Tue, 31 Jan 2012 00:36:23 Arnold Krille wrote:
On Tuesday 31 January 2012 00:12:52 emmanuel segura wrote:

But if you wanna implement dual primary i think you don't need promote for your drbd. Try to use clone without master/slave.

At least when you use the linbit-ra, using it without a master-clone will give you one(!) slave only. When you use a normal clone with two clones, you will get two slaves. The RA only goes primary on promote, that is when it's in master-state. => You need a master-clone of two clones with 1-2 masters to use drbd in the cluster.

If I understand Emmanuel's suggestion: the only way I know how to implement this is to create a simple clone group with lsb::drbd instead of Linbit's drbd resource, and put become-primary-on for both my nodes in drbd.conf. This might work in the short term, but I think it's risky in the long term. For example: something goes wrong and node A stoniths node B. I bring node B back up, disabling cman+pacemaker before I do so, and want to re-sync node B's DRBD partition with A. If I'm stupid (occupational hazard), I won't remember to edit drbd.conf before I do this, node B will automatically try to become primary, and probably get stonith'ed again.

Arnold: I thought that was what I was doing with these statements:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=60s role=Master \
        op stop interval=0 timeout=320 \
        op start interval=0 timeout=240
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1

That is, master-max=2 means to promote two instances to master. Did I get it wrong?
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
On Tue Jan 31 03:47:11 MST 2012 Lars Ellenberg wrote:
On Mon, Jan 30, 2012 at 05:42:34PM -0500, William Seligman wrote:

I'm trying to follow the directions for setting up a dual-primary DRBD setup with CMAN and Pacemaker. I'm stuck at an annoying spot: Pacemaker won't promote the DRBD resources to primary at either node. Here's the result of crm_mon:

Last updated: Mon Jan 30 17:07:03 2012
Stack: cman
Current DC: hypatia-tb - partition with quorum
Version: 1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, unknown expected votes
2 Resources configured.

Online: [ orestes-tb hypatia-tb ]
Master/Slave Set: AdminClone [AdminDrbd]
     Slaves: [ hypatia-tb orestes-tb ]

crm configure show:

node hypatia-tb
node orestes-tb
primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=60s role=Master \

You are missing an additional monitor op for role=Slave; make sure it has a different interval than the one for role=Master, e.g.

        op monitor interval=59s role=Slave \

        op stop interval=0 timeout=320 \
        op start interval=0 timeout=240

I put that in, but it didn't change my basic problem: neither instance of AdminDrbd is promoted on either node.

primitive Clvmd lsb:clvmd
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
clone ClvmdClone Clvmd
colocation ClvmdWithAdmin inf: ClvmdClone AdminClone:Master
order AdminBeforeClvmd inf: AdminClone:promote ClvmdClone:start
property $id=cib-bootstrap-options \
        dc-version=1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \
        cluster-infrastructure=cman \
        stonith-enabled=false

Also remember that, for dual-primary DRBD, working and tested fencing both on cluster level (stonith) and on drbd level (fence-peer) is mandatory. Unless you don't care for data integrity.

I'll get to that. I'm just starting out on this configuration. I don't want to put in STONITH just yet, otherwise I'll have to do recovery after every typo. I'll put in STONITH and test it when I get to installing the KVM resources. But until I solve this problem, I can't get to that stage.

DRBD looks OK:

# cat /proc/drbd
version: 8.4.0 (api:1/proto:86-100)
GIT-hash: 28753f559ab51b549d16bcf487fe625d5919c49c build by gardner@, 2012-01-25 19:10:28
 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Any clues as to what I can look at to track the source of the problem?
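For reference, Lars's point about DRBD-level fencing usually translates into a drbd.conf fragment along these lines. This is a sketch, not the poster's actual configuration: the resource name `admin` is taken from the configs above, and the handler scripts shown are the ones normally shipped with DRBD 8.4's pacemaker integration.

```
resource admin {
  disk {
    # Suspend I/O and call the fence-peer handler when the peer is lost.
    fencing resource-and-stonith;
  }
  handlers {
    # These scripts place/remove a location constraint in the CIB so that
    # Pacemaker will not promote a node with potentially outdated data.
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```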
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
On 1/31/12 3:47 PM, emmanuel segura wrote:

William try to follow the suggestion of Arnold. In my case it's different because we don't use drbd, we are using SAN with ocfs2. But i think for drbd in dual primary you need the attribute master-max=2.

I did, or thought I did. Have I missed something? Again, from crm configure show:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=60s role=Master \
        op monitor interval=59s role=Slave \
        op stop interval=0 timeout=320 \
        op start interval=0 timeout=240
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1

Still no promotion to primary on either node.
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
On 1/31/12 4:11 PM, emmanuel segura wrote:

William can you try like this:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=60s role=Master
clone Adming AdminDrbd

Both Arnold and Lars said this wouldn't work. I just tried it. They were right.

Is there anything at all to the log message:

Jan 31 16:20:54 orestes-tb lrmd: [12231]: info: RA output: (AdminDrbd:1:monitor:stderr) Could not map uname=orestes-tb.nevis.columbia.edu to a UUID: The object/attribute does not exist

That's been syslog'ed every 59 seconds since I updated AdminDrbd as Lars suggested:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=60s role=Master \
        op monitor interval=59s role=Slave \
        op stop interval=0 timeout=320 \
        op start interval=0 timeout=240
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
On 1/31/12 4:42 PM, Lars Ellenberg wrote:
On Tue, Jan 31, 2012 at 04:26:44PM -0500, William Seligman wrote:
On 1/31/12 4:11 PM, emmanuel segura wrote:

William can you try like this:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=60s role=Master
clone Adming AdminDrbd

Both Arnold and Lars said this wouldn't work. I just tried it. They were right. Is there anything at all to the log message:

Jan 31 16:20:54 orestes-tb lrmd: [12231]: info: RA output: (AdminDrbd:1:monitor:stderr) Could not map uname=orestes-tb.nevis.columbia.edu to a UUID: The object/attribute does not exist

Hmmm. That message comes from cib_utils.c, probably via crm_master, which is a wrapper around crm_attribute. It should not happen. Looks like parts of the system do not agree whether to use orestes-tb only, or orestes-tb.nevis.columbia.edu... And if the resource agent is unable to set a master score, pacemaker will not even try to promote. What does uname -n say? Does it list the node name only, or the FQDN?

# uname -n
orestes-tb.nevis.columbia.edu

Aha! I went to /etc/cluster/cluster.conf, and changed all the host names to the FQDN. It works!

Master/Slave Set: AdminClone [AdminDrbd]
     Masters: [ hypatia-tb.nevis.columbia.edu orestes-tb.nevis.columbia.edu ]

Lars is the man! And I am a fool for not reading this web page closely enough:

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08s02s02.html

In the example, they just use the node name, but it clearly states to use the output from 'uname -n' in cluster.conf. I guess on their Linux distro uname -n returns just the node name. Thanks!
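Since the root cause was a mismatch between `uname -n` and the names in cluster.conf, a tiny pre-flight check along these lines would have caught it. This is my own sketch: `node_name_in_conf` is a made-up helper, and the grep is deliberately crude, not an XML parser.

```shell
# Hypothetical sanity check: is the local node name, exactly as uname -n
# reports it, present as a clusternode name in cluster.conf?
node_name_in_conf() {
    # $1 = node name, $2 = path to a cluster.conf file
    # Matches the full attribute value, so a short name will not
    # falsely match an FQDN entry (or vice versa).
    grep -q "clusternode name=\"$1\"" "$2"
}

# Usage on a cluster node might look like:
#   node_name_in_conf "$(uname -n)" /etc/cluster/cluster.conf \
#       || echo "WARNING: $(uname -n) not listed in cluster.conf"
```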
[Linux-HA] Trying to compile pacemaker-pygui
I'm trying to figure out which versions of which files I need to compile pacemaker-pygui. I didn't have (too) much trouble with Scientific Linux 5.5 (=RHEL5.5), but I'm having problems with Scientific Linux 6.1 (=RHEL6.1).

Versions:

corosync.x86_64                 1.2.3-36.el6
corosynclib.x86_64              1.2.3-36.el6
corosynclib-devel.x86_64        1.2.3-36.el6
cluster-glue.x86_64             1.0.5-2.el6
cluster-glue-libs.x86_64        1.0.5-2.el6
cluster-glue-libs-devel.x86_64  1.0.5-2.el6
clusterlib.x86_64               3.0.12-41.el6
pacemaker.x86_64                1.1.5-5.el6
pacemaker-libs.x86_64           1.1.5-5.el6
pacemaker-libs-devel.x86_64     1.1.5-5.el6

Kernel: 2.6.32-220.4.1.el6.x86_64

When I try tip.tar.bz2 (or 4186ac0c02b5.tar.bz2; the same file), the compilation fails with:

mgmt_crm.c: In function 'on_get_rsc_status':
mgmt_crm.c:1512: error: 'pe_rsc_failure_ignored' undeclared (first use in this function)

When I try efff2a4588e5.tar.bz2, which I found by Googling on this error, I get:

mgmt_crm.c: In function 'on_get_crm_metadata':
mgmt_crm.c:1019: error: 'CRM_DAEMON_DIR' undeclared (first use in this function)

What files/versions do I need to compile the GUI?
[Linux-HA] cman+pacemaker+dual-primary drbd does not promote
I'm trying to follow the directions for setting up a dual-primary DRBD setup with CMAN and Pacemaker. I'm stuck at an annoying spot: Pacemaker won't promote the DRBD resources to primary at either node. Here's the result of crm_mon:

Last updated: Mon Jan 30 17:07:03 2012
Stack: cman
Current DC: hypatia-tb - partition with quorum
Version: 1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, unknown expected votes
2 Resources configured.

Online: [ orestes-tb hypatia-tb ]
Master/Slave Set: AdminClone [AdminDrbd]
     Slaves: [ hypatia-tb orestes-tb ]

/etc/cluster/cluster.conf:

<cluster config_version="6" name="Nevis_HA">
  <logging debug="off"/>
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="hypatia-tb" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="hypatia-tb"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="orestes-tb" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="orestes-tb"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
  <!-- <fence_daemon post_join_delay="30"/> -->
</cluster>

crm configure show:

node hypatia-tb
node orestes-tb
primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource=admin \
        op monitor interval=60s role=Master \
        op stop interval=0 timeout=320 \
        op start interval=0 timeout=240
primitive Clvmd lsb:clvmd
ms AdminClone AdminDrbd \
        meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
clone ClvmdClone Clvmd
colocation ClvmdWithAdmin inf: ClvmdClone AdminClone:Master
order AdminBeforeClvmd inf: AdminClone:promote ClvmdClone:start
property $id=cib-bootstrap-options \
        dc-version=1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \
        cluster-infrastructure=cman \
        stonith-enabled=false

DRBD looks OK:

# cat /proc/drbd
version: 8.4.0 (api:1/proto:86-100)
GIT-hash: 28753f559ab51b549d16bcf487fe625d5919c49c build by gardner@, 2012-01-25 19:10:28
 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

I can manually do "drbdadm primary admin" on both nodes and get a Primary/Primary state. That still does not get Pacemaker to promote the resource. The only vaguely relevant lines in /var/log/messages seem to be:

Jan 30 17:38:13 hypatia-tb lrmd: [11260]: info: RA output (AdminDrbd:0:start:stdout)
Jan 30 17:38:13 hypatia-tb lrmd: [11260]: info: RA output: (AdminDrbd:0:start:stderr) Could not map uname=hypatia-tb.nevis.columbia.edu to a UUID: The object/attribute does not exist
Jan 30 17:38:13 hypatia-tb lrmd: [11260]: info: RA output (AdminDrbd:0:start:stdout)

I've tried running with iptables both on and off, and the results are the same. Any clues?