Re: [ClusterLabs] Antw: After reboot each node thinks the other is offline.
On 2017-08-01 03:05, Stephen Carville (HA List) wrote:
> Can clustering even be done reliably on CentOS 6? I have no objection to
> moving to 7 but I was hoping I could get this up quicker than building out
> a bunch of new balancers.

I have a number of CentOS 6 active/passive pairs running heartbeat R1. However, I've been doing it for some time and have a collection of mon scripts for them -- you'd have to roll your own.

> the duplicate IP to its own eth0. I probably do not need to tell you the
> mischief that can cause if these were production servers.

Really? 'Cause over here it starts with "checking if ip already exists on the network" and one of them is supposed to fail there.

Dima
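For illustration, a duplicate-address check along those lines can be done with arping's duplicate address detection mode. This is only a hedged sketch, not the actual RA code; the interface and address are placeholders:

    # Hypothetical pre-flight check before bringing up a floating IP.
    # arping -D (duplicate address detection) exits 0 only if nobody
    # else answers for the address. Interface and IP are examples.
    if arping -D -c 2 -I eth0 192.168.1.100 >/dev/null 2>&1; then
        ip addr add 192.168.1.100/24 dev eth0   # address is free, take it
    else
        echo "192.168.1.100 already active elsewhere, refusing to add" >&2
        exit 1
    fi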
Re: [ClusterLabs] Two nodes cluster issue
On 2017-07-24 07:51, Tomer Azran wrote:
> We don't have the ability to use it. Is that the only solution?

No, but I'd recommend thinking about it first. Are you sure you will care about your cluster working when your server room is on fire? 'Cause unless you have halon suppression, your server room is a complete write-off anyway. (Think water from sprinklers hitting rich chunky volts in the servers.)

Dima
Re: [ClusterLabs] epic fail
On 2017-07-23 07:40, Valentin Vidic wrote:
> It seems you did not put the node into standby before the upgrade as it
> still had resources running. What was the old/new pacemaker version there?

Versions: whatever's in the CentOS repos. Any attempt to migrate the services -- standby, reboot, etc. -- results in DRBD stuck on unmount. According to lsof it's the kernel keeping the filesystem busy. The only difference this time is that it happened during a yum run and scrambled the RPM database on top of everything else.

I've been doing this stuff for quite some time now, so please don't bother telling me what I did wrong. I wasn't asking.

Dima
[ClusterLabs] epic fail
So yesterday I ran a yum update that pulled in the new pacemaker and tried to restart it. The node went into its usual "can't unmount drbd because the kernel is using it" state and got stonith'ed in the middle of the yum transaction. The end result: DRBD reports split brain, the HA daemons don't start on boot, and the RPM database is FUBAR.

I've had enough. I'm rebuilding this cluster as CentOS 6 + heartbeat R1. CentOS 7 + DRBD 8.4 + pacemaker + NFS server: FAIL. You have been warned.

Dima
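For anyone following along, the usual one-node-at-a-time patching sequence looks roughly like the below. A hedged sketch using pcs; the node name is a placeholder, and it assumes resources actually move off cleanly, which is exactly the step that failed here:

    # On the node about to be patched (node1 is a placeholder name):
    pcs cluster standby node1        # migrate resources away
    pcs status                       # wait until nothing is running here
    yum update                       # patch the OS and cluster packages
    reboot                           # pick up the new kernel/kmod if needed
    pcs cluster unstandby node1      # let resources move back if desired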
Re: [ClusterLabs] Antw: Re: DRBD or SAN ?
On 7/19/2017 1:29 AM, Ulrich Windl wrote:
> Maybe it's like with the cluster: Once you have set it up correctly, it
> runs quite well, but the way to get there may be painful. I quit my
> experiments with dual-primary DRBD in some early SLES11 (SP1), because it
> fenced a lot and refused to come up automatically after fencing. That may
> have been a configuration problem, but with the docs at hand at that time,
> I preferred to quit and try something else.

I'm losing the point of this thread very fast, it seems. How is a DRBD cluster that exports images on a floating IP *not* a NAS? And why would you need active-active storage if you can't run the same VM on two hosts at once?

Anyway, +1: if I were building a VM infrastructure, I'd take a closer look at ceph myself. And openstack: it was a PITA when I played with it a couple of years ago, but it sounds like a dual-primary DRBD on top of LVM with corosync on the side is going to be at least as bad.

Dima
Re: [ClusterLabs] DRBD or SAN ?
On 7/17/2017 2:07 PM, Chris Adams wrote:
> However, just like RAID is not a replacement for backups, DRBD is IMHO not
> a replacement for database replication. DRBD just replicates database
> files, so, for example, file corruption would be copied from host to host.
> When something provides a native replication system, it is probably better
> to use that (or at least use it at one level).

Since DRBD is RAID-1, you need double the drives either way, so there's no advantage over two independent copies -- only the potential for replicating errors. You probably need a 10G pipe, with associated costs, for "no performance penalty" DRBD, while native replication tends to work OK over slower links. At this point a 2U SuperMicro chassis gives you 2 SSD slots for system and ZIL/L2ARC plus 12 spinning-rust slots for a pretty large database...

That won't work for VM images; for those you'll need a NAS or DRBD, but IMO NAS wins. Realistically, a hard drive failure is the most likely kind of failure you're looking at, and initiating a full storage cluster failover for that is probably not a good idea. So you might want drive-level redundancy on at least the primary node, at which point dual-ported SAS drives in external shelves become economical, even with a couple of dual-ported SAS SSDs for caches. So the ZFS setup I linked to above actually comes with fewer moving parts and all the handy features absent from previous-gen filesystems.

Dima
Re: [ClusterLabs] DRBD or SAN ?
On 7/17/2017 4:51 AM, Lentes, Bernd wrote:
> I'm asking myself if a DRBD configuration wouldn't be more redundant and
> highly available. ... Is DRBD in conjunction with a database (MySQL or
> Postgres) possible?

Have you seen https://github.com/ewwhite/zfs-ha/wiki ? -- I recently deployed one and so far it's working better than one CentOS 7 drbd + pacemaker + nfs cluster I have. Although in 20-20 hindsight I wonder if I should've gone BSD instead.

If your database is Postgres, streaming replication in 9.latest is something to consider. I haven't had any problems running it on top of DRBD, but there are advantages to having two completely independent copies of everything, especially on top of ZFS's other advantages. We already "upgraded" our webserver Postgres to streaming replication; in the next month or so I plan to bump it to 9.latest from the Postgres repo, run two independent instances, ditch DRBD altogether, and use pacemaker only for the floating IP. (And ZFS incremental snapshots to replicate static content.)

In that scenario FreeBSD's CARP looks like a much leaner and cleaner alternative to the whole Linux HA stack, and with ZFS in the kernel and the absence of systemd... the grass on the other side looks greener and greener every day.

Dima
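A rough sketch of the incremental-snapshot replication mentioned above; dataset, host, and snapshot names are made up, and it assumes the previous snapshot already exists on both sides and ssh keys are in place:

    # Take a new snapshot and send only the delta since the previous one
    # to the standby box. tank/static, snap1/snap2 and web2 are placeholders.
    zfs snapshot tank/static@snap2
    zfs send -i tank/static@snap1 tank/static@snap2 \
        | ssh web2 zfs receive -F tank/static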
Re: [ClusterLabs] DRBD split brain after Cluster node recovery
On 7/14/2017 3:57 AM, ArekW wrote:
> Hi, I have stonith running and tested. The problem was that there is a
> mistake in the drbd documentation: the 'fencing' option belongs in net
> (not disk).

If you are running NFS on top of a dual-primary DRBD with some sort of a cluster filesystem, I'd think *that* is your problem. Not fencing/stonith.

Dima
Re: [ClusterLabs] DRBD split brain after Cluster node recovery
On 7/12/2017 4:33 AM, ArekW wrote:
> Hi, Can it be fixed that drbd is entering split brain after cluster node
> recovery?

I always configure the "after-sb-*" handlers and DRBD-level fencing, but I never ran it with allow-two-primaries. You'll have to read the fine manual on how that works in a dual-primary config.

Dima
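For reference, the single-primary settings being referred to look roughly like this drbd.conf fragment (DRBD 8.4-style syntax; the resource name and the policy choices are only examples, and a dual-primary setup needs different, much more careful settings):

    resource r0 {
      net {
        # automatic split-brain recovery policies for single-primary setups
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
      }
      disk {
        fencing resource-only;    # escalate to the cluster on replication loss
      }
      handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
    }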
Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?
On 2017-06-16 10:16, Digimer wrote:
> On 16/06/17 11:07 AM, Eric Robinson wrote:
>>> Step over to the *bsd side. They have cookies. Also zfs. And no
>>> lennartware, that alone's worth $700/year.
>>>
>>> Dima
>>
>> I left BSD for Linux back in 2000 or so. I have often been wistful for
>> those days. ;-)
>>
>> --Eric
>
> Jokes (?) aside; Red Hat and SUSE both have paid teams that make sure the
> HA software works well. So if you're new to HA, I strongly recommend
> sticking with one of those two, and SUSE is what you mentioned. If you
> really want to go to BSD or something else, I would recommend learning HA
> on SUSE/RHEL and then, after you know what config works for you, migrate
> to the target OS. That way you have only one set of variables at a time.
> Also, use fencing. Seriously, just do it.

Yeah. Fencing is the only bit that's missing from this picture.

Dima
Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?
On 2017-06-16 02:21, Eric Robinson wrote:
> Someone talk me off the ledge here.

Step over to the *bsd side. They have cookies. Also zfs. And no lennartware, that alone's worth $700/year.

Dima
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
On 2017-05-17 06:24, Lentes, Bernd wrote:
> ... I'd like to know what the software I use is doing. Am I the only one
> having that opinion?

No.

> How do you solve the problem of a deathmatch or killing the wrong node?

*I* live dangerously with fencing disabled. But then my clusters only really go down for maintenance reboots, and I usually do those when I'm at work and can walk into the server room and push the power button when it comes to that. (More accurately, the one cluster that goes down; the others fail over without any problems.)

Dima
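Since the thread subject is a dedicated fence delay: one common way to avoid a two-node deathmatch is to give one node's fence device a head start. A hedged example with pcs and fence_ipmilan; the device names, addresses and credentials are placeholders, and option names can vary between fence-agent and pacemaker versions (newer pacemaker also has pcmk_delay_base/pcmk_delay_max):

    # Give node1's fence device a 15s head start so both nodes don't
    # shoot each other simultaneously after a network split.
    pcs stonith create fence-node1 fence_ipmilan \
        pcmk_host_list=node1 ipaddr=10.0.0.101 login=admin passwd=secret \
        delay=15 op monitor interval=60s
    pcs stonith create fence-node2 fence_ipmilan \
        pcmk_host_list=node2 ipaddr=10.0.0.102 login=admin passwd=secret \
        op monitor interval=60s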
Re: [ClusterLabs] Antw: Re: 2-Node Cluster Pointless?
On 4/22/2017 11:51 PM, Andrei Borzenkov wrote:
> As a real life example (not Linux/pacemaker) - a panicking node flushed its
> disk buffers, so it was not safe to access the shared filesystem until this
> was complete. This could take quite a lot of time, so without an agent on
> the *surviving* node(s) that monitors and acknowledges this process, this
> resulted in data corruption.

If your syncs take that long, pay an extra nickel and buy a disk shelf with dual-ported SAS drives and a pair of SSDs for the log device. Otherwise what you're looking at is effectively downtime during the failover, and having "quite a lot" of it kinda defeats the purpose, I should think.

Dima
Re: [ClusterLabs] Antw: Re: 2-Node Cluster Pointless?
On 4/22/2017 12:02 PM, Digimer wrote:
> Having SBD properly configured is *massively* safer than no fencing at all.
> So for people where other fence methods are not available for whatever
> reason, SBD is the way to go.

Now you're talking. IMO in a 2-node cluster, a node that kills itself in response to, say, losing link on eth0 is infinitely preferable to a node that tries to shoot the other node when it can't ping it. This way you can also start with a one-node cluster and do the induction thing. Now all you need is to separate the monitors from the services, so you can easily monitor things the cluster didn't start (like that eth0 above), and it'll all start making sense.

Dima
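For anyone curious what "properly configured SBD" looks like in practice, a very rough sketch follows. The device path and names are placeholders, the exact stonith agent differs between distributions (external/sbd on SUSE, fence_sbd on newer RHEL/CentOS), and a watchdog-only (diskless) variant also exists, so treat this as a pointer to the docs rather than a recipe:

    # Initialize a small shared-disk slot device for SBD (placeholder path):
    sbd -d /dev/disk/by-id/example-shared-lun create
    # Point the sbd daemon at it (and make sure a hardware watchdog module
    # is loaded on both nodes):
    echo 'SBD_DEVICE=/dev/disk/by-id/example-shared-lun' >> /etc/sysconfig/sbd
    # Define a stonith resource that uses it:
    pcs stonith create fence-sbd fence_sbd devices=/dev/disk/by-id/example-shared-lun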
Re: [ClusterLabs] 2-Node Cluster Pointless?
On 2017-04-16 15:04, Eric Robinson wrote:
> On 16/04/17 01:53 PM, Eric Robinson wrote:
>> I was reading in "Clusters from Scratch" where Beekhof states, "Some would
>> argue that two-node clusters are always pointless, but that is an argument
>> for another time."

What you want to know is whether the customer can access the service. Adding more nodes does not answer that question, but since Andrew is writing cluster software, not providing services, that's not his problem.

Dima
Re: [ClusterLabs] Fraud Detection Check?
On 2017-04-13 01:39, Jan Pokorný wrote:
> After a bit of a search, the best practice at the list server seems to be:
> [...] if you change the message (eg, by adding a list signature or by
> adding the list name to the Subject field), you *should* DKIM sign.

This is of course going entirely off-topic for the list, but DKIM's stated purpose is to sign mail coming from *.clusterlabs.org with a key from clusterlabs.org's DNS zone file. DKIM is not mandatory, so you strip all existing DKIM headers and then either sign or not; it's up to you.

None of this is new. The SourceForge list manager, for example, adds SF footers *inside* the PGP-signed MIME part, resulting in the exact same "invalid signature" problem.

Dima
Re: [ClusterLabs] OS Patching Process
On 2016-11-24 10:41, Toni Tschampke wrote:
> We recently did an upgrade for our cluster nodes from Wheezy to Jessie.

IIRC it's the MIT CS joke that they have clusters whose uptime goes way back past the manufacturing date of any/every piece of hardware they're running on. They aren't linux-ha clusters, but there's no reason why that shouldn't be doable with linux-ha.

Dima
Re: [ClusterLabs] Antw: OS Patching Process
On 2016-11-23 02:23, Ulrich Windl wrote:
> I'd recommend making a backup of the DRBD data (you always should, anyway),
> then shut down the cluster, upgrade all the needed components, then start
> the cluster again. Do your basic tests. If you corrupted your data,
> re-create DRBD from scratch. Then test again. If your data is corrupted
> again, the new version has a problem. Then you should go back to your old
> version (you have disaster-suitable backups, right?).

Very funny: run a full backup of a small 100TB live mail spool for a kernel upgrade. When the cluster fscks it up, run a full restore. While the mail spool is highly unavailable.

How about an easier solution: never install any patches. Or better yet, don't use pacemaker. Hilarious.

Dima
Re: [ClusterLabs] OS Patching Process
On 2016-11-22 10:35, Jason A Ramsey wrote:
> Can anyone recommend a bulletproof process for OS patching a pacemaker
> cluster that manages a drbd mirror (with LVM on top of the drbd and luns
> defined for an iscsi target cluster if that matters)? Any time I’ve tried
> to mess with the cluster, it seems like I manage to corrupt my drbd
> filesystem, and now that I have actual data on the thing, that’s kind of a
> scary proposition. Thanks in advance!

+1. I managed to cleanly standby/unstandby mine a few times initially -- otherwise I wouldn't have put it in production -- but on the last several reboots the DRBD filesystem just wouldn't unmount. It never corrupted anything, but it's still a serious PITA, especially with a couple of haresources pairs right next to it switching over perfectly every time.

Insult to injury, the RA starts spewing "can't unmount, somebody's holding it open" messages to the console at such a rate that it is impossible to log in and try lsof, fuser, or anything.

Dima
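For the record, when a filesystem refuses to unmount, the usual way to see what is holding it (when you can get a shell at all) is something like the below. The mountpoint is a placeholder; if nothing shows up, the holder is typically in-kernel (e.g. the NFS server) rather than a process:

    # Placeholder mountpoint; run on the node that refuses to unmount.
    fuser -vm /mnt/drbd0          # processes with files open under the mount
    lsof +f -- /mnt/drbd0         # alternative view of open files on that filesystem
    cat /proc/fs/nfsd/versions    # quick hint whether kernel nfsd is active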
Re: [ClusterLabs] Antw: Re: Antw: Re: Can't do anything right; how do I start over?
On 2016-10-18 01:18, Ulrich Windl wrote:
> Ah, that rings a bell: Sometimes when kernel modules are updated, some
> scripts think they must unload modules, then reload them. With a new kernel
> not having been booted yet, the modules on disk don't fit the running
> kernel. Maybe your problem is like this?

I think it has to be something along those lines; otherwise "mountpoint busy" with no processes using it is an error I haven't seen in linux since redhat (as in not RHEL) 7. In this case, however, the other node gets an error failing over too, and that one has been bounced after the update. There's no mismatch there... so it's something non-obvious going on in there.

Dima
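A quick way to check for that kind of module/kernel mismatch (the module name is just an example):

    uname -r                      # kernel actually running
    modinfo -F vermagic drbd      # kernel the on-disk drbd module was built for
    head -1 /proc/drbd            # version of the module currently loaded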
Re: [ClusterLabs] Antw: Re: Can't do anything right; how do I start over?
On 2016-10-17 02:12, Ulrich Windl wrote:
> Have you tried a proper variant of "lsof" before? So maybe you know which
> process might block the device. I also think if you have LVM on top of
> DRBD, you must deactivate the VG before trying to unmount.

No LVM here: AFAIMC these days it's another solution in search of a problem. Next time I'll try to remember to do it from a console root login on the node; maybe I'll see what lsof has to say then. I have a feeling it's not happening on a plain kernel upgrade, only when kmod-drbd is updated, but I don't have enough data points yet for anything more than a feeling. I have a couple of haresources drbd+nfs clusters that don't do this, but neither of them is running dovecot... it could be the systemd "RA" not stopping dovecot properly, too.

Dima
Re: [ClusterLabs] Can't do anything right; how do I start over?
On 2016-10-15 01:56, Jay Scott wrote:
> So, what's wrong? (I'm a newbie, of course.)

Here's what worked for me on CentOS 7: http://octopus.bmrb.wisc.edu/dokuwiki/doku.php?id=sysadmin:pacemaker

YMMV and all that.

cheers,
Dima
Re: [ClusterLabs] Antw: Re: Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)
On 2016-10-07 01:18, Ulrich Windl wrote:
> Any hardware may fail at any time. We even had an onboard NIC that stopped
> operating correctly some day; we had CPU cache errors, RAM parity errors,
> PCI bus errors, and everything you can imagine.

:) http://dilbert.com/strip/1995-06-24

Our vendor's been good to us: over the last dozen or so years we've only had about 4 dead mobos, 3 PSUs (same batch), a few DIMMs, and one SATA backplane. But we mostly run storage, so my perception is heavily biased towards disks.

Dima
Re: [ClusterLabs] Virtual ip resource restarted on node with down network device
On 2016-09-20 09:53, Ken Gaillot wrote:
> I do think ifdown is not quite the best failure simulation, since there
> aren't that many real-world situations that merely take an interface down.
> To simulate network loss (without pulling the cable), I think maybe using
> the firewall to block all traffic to and from the interface might be
> better. Or unloading the driver module to simulate NIC hardware failure.

Depending on how closely you look at the interface, it may or may not matter that pulling the cable (or the other side going down) will result in NO CARRIER, whereas firewalling it off will not.

Dima
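A hedged example of the firewall approach mentioned above (the interface name is a placeholder; remember to remove the rules afterwards):

    # Simulate loss of all traffic on eth1 without touching link state.
    iptables -I INPUT  -i eth1 -j DROP
    iptables -I OUTPUT -o eth1 -j DROP
    # ...observe how corosync/pacemaker react, then undo:
    iptables -D INPUT  -i eth1 -j DROP
    iptables -D OUTPUT -o eth1 -j DROP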
Re: [ClusterLabs] Change disk
On 2016-09-14 09:30, NetLink wrote:
> 1. Put node 2 in standby
> 2. Change and configure the new bigger disk on node 2
> 3. Put node 2 back online and wait for syncing
> 4. Put node 1 in standby and repeat the procedure
>
> Would this approach work?

I wonder if rsync'ing the spool to e.g. a USB drive, recreating DRBD on the larger disks, and rsync'ing it all back will be faster and less painful.

Dima
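If you do go the swap-one-disk-at-a-time route, the step the list above leaves implicit is growing DRBD and the filesystem once both backing devices are bigger. Roughly, and only as a hedged sketch (resource and device names are placeholders; assumes both nodes are connected and UpToDate):

    # After both nodes have the larger backing disks and are fully synced:
    drbdadm resize r0        # grow the DRBD device to the new backing size
    # Then grow the filesystem on the current Primary, e.g. for ext4:
    resize2fs /dev/drbd0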
Re: [ClusterLabs] DRBD failover in Pacemaker
On 2016-09-08 02:03, Digimer wrote:
> You need to solve the problem with fencing in DRBD. Leaving it off WILL
> result in a split-brain eventually, full stop. With working fencing, you
> will NOT get a split-brain, full stop.

"Split brain is a situation where, due to temporary failure of all network links between cluster nodes, and possibly due to intervention by a cluster management software or human error, both nodes switched to the primary role while disconnected." -- DRBD Users Guide 8.4, section 2.9, "Split brain notification".

About the only practical problem with a *DRBD* split brain under pacemaker is that pacemaker won't let you run "drbdadm secondary && drbdadm connect --discard-my-data" as easily as the busted ancient code did.

Dima
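For completeness, the manual recovery being referred to goes roughly like this (the resource name is a placeholder; the "victim" is the node whose changes you are throwing away, and under pacemaker you'd typically put it in standby or unmanage the DRBD resource first):

    # On the victim node (discarding its changes):
    drbdadm disconnect r0
    drbdadm secondary r0
    drbdadm connect --discard-my-data r0
    # On the survivor node, if it also shows StandAlone:
    drbdadm connect r0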
Re: [ClusterLabs] DRBD failover in Pacemaker
On 2016-09-06 14:04, Devin Ortner wrote:
> I have a 2-node cluster running CentOS 6.8 and Pacemaker with DRBD. I have
> been using the "Clusters from Scratch" documentation to create my cluster
> and I am running into a problem where DRBD is not failing over to the
> other node when one goes down.

I forget whether Clusters From Scratch spells this out: you have to create the DRBD volume and let it finish the initial sync before you let pacemaker near it. Was 'cat /proc/drbd' showing UpToDate/UpToDate Primary/Secondary when you tried the failover?

Ignore the "stonith is optional; you *must* use stonith" mantra du jour.

Dima
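The initial bring-up being referred to, roughly (DRBD 8.4-era commands; the resource name is a placeholder, and this is done once, before the cluster is told about DRBD):

    # On both nodes:
    drbdadm create-md r0
    drbdadm up r0
    # On one node only, to kick off the initial full sync:
    drbdadm primary --force r0
    # Then wait until /proc/drbd shows UpToDate/UpToDate:
    watch -n5 cat /proc/drbd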
Re: [ClusterLabs] ocf scripts shell and local variables
On 2016-08-31 03:59, Dejan Muhamedagic wrote:
> On Tue, Aug 30, 2016 at 12:32:36PM -0500, Dimitri Maziuk wrote:
>> I expect you're being deliberately obtuse.
> Not sure why you think that

Because the point I was trying to make was that having the shebang line say #!/opt/swf/bin/bash does not guarantee the script will actually be interpreted by /opt/swf/bin/bash. For example,

> When a file is sourced, the "#!" line has no special meaning (apart from
> documenting purposes).

(sic) Or when

> I haven't read the code either, but it must be some of the exec(2) system
> calls.

it's execl("/bin/sh", "/bin/sh", "/script/file") instead of execl("/script/file", ...) directly.

(As an aside, I suspect the feature where exec(2) will run the loader which will read the magic and load an appropriate binfmt* kernel module may well also be portable between "most" systems, just like "local" is portable to "most" shells. I don't think POSIX specifies anything more than "executable image", and that on a strictly POSIX-compliant system execl("/my/script.sh", ...) will fail. I am so old that I have a vague recollection it *had to be* execl("/bin/sh", "/bin/sh", "/script/file") back when I learned it. But this is going even further OT.)

My point, again, was that solutions involving shebang lines are great as long as you can guarantee those shebang lines are being used on all supported platforms at all times. Sourcing findif.sh from IPaddr2 is proof by counter-example that they aren't and you can't.

Dima
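A tiny demonstration of the sourcing point (file names and the inner shebang are made up; /proc/$$/exe is Linux-specific, and the output shown is only illustrative):

    $ cat inner.sh
    #!/opt/swf/bin/bash
    echo "interpreted by: $(readlink /proc/$$/exe)"
    $ cat outer.sh
    #!/bin/sh
    . ./inner.sh     # sourced: the inner shebang is just a comment here
    $ ./outer.sh
    interpreted by: /usr/bin/dash    # whatever /bin/sh is, not /opt/swf/bin/bash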
Re: [ClusterLabs] ocf scripts shell and local variables
On 2016-08-30 03:44, Dejan Muhamedagic wrote:
> The kernel reads the shebang line and it is what defines the interpreter
> which is to be invoked to run the script.

Yes, and does the kernel read it when the script is source'd, or executed via any of the mechanisms that have the executable specified in the call, explicitly or implicitly?

> None of the /bin/sh RAs requires bash.

Yeah, only "local".

Dima
Re: [ClusterLabs] ocf scripts shell and local variables
On 2016-08-29 04:06, Gabriele Bulfon wrote:
> Thanks, though this does not work :)

Uhm... right. Too many languages, sorry: perl's system() will call the login shell, the C library's system() uses /bin/sh, and the exec()s will run whatever the programmer tells them to. The point is that none of them cares what shell's in the shebang line AFAIK.

But anyway, you're correct; a lot of linux "shell" scripts are bash-only and pacemaker RAs are no exception.

Dima
Re: [ClusterLabs] ocf scripts shell and local variables
On 2016-08-26 08:56, Ken Gaillot wrote:
> On 08/26/2016 08:11 AM, Gabriele Bulfon wrote:
>> I tried adding some debug in ocf-shellfuncs, showing env and ps -ef in
>> the corosync.log. I suspect it's always using ksh, because in the env
>> output I produced I find this: KSH_VERSION=.sh.version
>> This is normally not present in the environment, unless ksh is the
>> running shell.
> The RAs typically start with #!/bin/sh, so whatever that points to on your
> system is what will be used.

ISTR the exec() family will ignore the shebang and run whatever shell's in the user's /etc/passwd. Or something. Try changing that one.

Dima
Re: [ClusterLabs] Unable to Build fence-agents from Source on RHEL6
On 2016-08-10 10:04, Jason A Ramsey wrote:
> Traceback (most recent call last):
>   File "eps/fence_eps", line 14, in <module>
>     if sys.version_info.major > 2:
> AttributeError: 'tuple' object has no attribute 'major'

Replace it with sys.version_info[0]. (On the python 2.6 that ships with RHEL 6, sys.version_info is a plain tuple without the named attributes.)

Dima
Re: [ClusterLabs] Recovering after split-brain
On 2016-06-20 17:19, Digimer wrote:
> Nikhil indicated that they could switch where traffic went up-stream
> without issue, if I understood properly.

They have some interesting setup, but that notwithstanding: if split brain happens, some clients will connect to the "old master" and some to the "new master", depending on ARP updates. If there's a shared resource unavailable on one node, clients going there will error out; the other ones will not. It will work for some clients. Cf. both nodes going into a stonith deathmatch and killing each other: the service is then unavailable for all clients.

What I don't get is the blanket assertion that this is "more highly" available than option #1.

Dimitri
Re: [ClusterLabs] Recovering after split-brain
On 2016-06-20 09:13, Jehan-Guillaume de Rorthais wrote:
> I've heard multiple times this kind of argument in the field, but sooner or
> later these clusters actually had a split brain scenario with clients
> connected on both sides, some very bad corruptions, data lost, etc.

I'm sure it's a very helpful answer, but the question was about suspending pacemaker while I manually fix a problem with the resource. I too would very much like to know how to get pacemaker to "unmonitor" my resources and not get in the way while I'm updating and/or fixing them. In heartbeat, mon was a completely separate component that could be moved out of the way when needed.

In pacemaker I have now had to power-cycle the nodes several times, because in a 2-node active/passive cluster without quorum and fencing, set up like

- drbd master-slave
- drbd filesystem (colocated and ordered after the master)
- symlink (colocated and ordered after the fs)
- service (colocated and ordered after the symlink)

when the service fails to start due to user error, pacemaker fscks up everything up to and including the master-slave drbd, and "clearing" errors on the service does not fix the symlink and the rest of it. (So far I've been unable to reliably reproduce it in testing environments; Murphy made sure it only happens on production clusters.)

Right now it seems to me that for a DRBD split brain I'll have to stop the cluster on the victim node, do the manual split brain recovery, and restart the cluster after the sync is complete. Is that correct?

Dimitri
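On the "unmonitor" question, the usual knobs are per-resource unmanage or cluster-wide maintenance mode. A hedged sketch with pcs; the resource name is a placeholder:

    # Stop pacemaker from acting on a single resource while you work on it
    # (monitors still run, but failures are not acted on):
    pcs resource unmanage my_drbd_fs
    # ...fix things by hand...
    pcs resource manage my_drbd_fs

    # Or freeze the whole cluster: nothing is started, stopped or monitored:
    pcs property set maintenance-mode=true
    # ...do the surgery...
    pcs property set maintenance-mode=false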
Re: [ClusterLabs] restarting pacemakerd
On 2016-06-18 05:15, Ferenc Wágner wrote:
> ... On the other hand, one could argue that restarting failed services
> should be the default behavior of systemd (or any init system). Still, it
> is not.

As an off-topic snide comment, I never understood the thinking behind that: restarting without removing the cause of the failure will just make it fail again. If at first you don't succeed, then try, try, try again?

Dimitri
Re: [ClusterLabs] dovecot RA
On 2016-06-08 09:11, Ken Gaillot wrote:
> On 06/08/2016 03:26 AM, Jan Pokorný wrote:
>> Pacemaker can drive systemd-managed services for quite some time.
> This is as easy as changing lsb:dovecot to systemd:dovecot.

Great! Any chance that could be mentioned on http://www.linux-ha.org/wiki/Resource_agents -- hint, hint ;)

Thanks guys,
Dima
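For the archives, a hedged example of what that looks like with pcs (the resource name and monitor interval are placeholders; pcs can't change a resource's class in place, hence the delete/recreate):

    # Replace an lsb:dovecot resource with the systemd unit-backed class:
    pcs resource delete dovecot
    pcs resource create dovecot systemd:dovecot op monitor interval=30s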
Re: [ClusterLabs] mail server (postfix)
On 2016-06-04 01:10, Digimer wrote:
> We're running postfix/dovecot/postgres for our mail on an HA cluster, but
> we put it all in a set of VMs and made the VMs HA on DRBD.

Hmm. I deliver to ~/Maildir and /home is NFS-mounted all over the place, so my primary goal is an HA NFS server. I'd hesitate to add a VM layer there. An NFS-mounted /home on the mail gateway is what I have now and that works fine... having that in an HA setup is more or less icing on the cake.

Thanks,
Dimitri
Re: [ClusterLabs] start a resource
On 2016-05-17 09:21, Ken Gaillot wrote:
> What happens after "pcs resource cleanup"? "pcs status" reports the time
> associated with each failure, so you can check whether you are seeing the
> same failure or a new one. The system log is usually the best starting
> point, as it will have messages from pacemaker, corosync and the resource
> agents.

Yes, it'd be much easier if I saw anything useful from pcs and/or in the logs. I have another active/passive pair to set up; if I get a round tuit -- hopefully in the next few weeks -- I'll see if I can reproduce this.

Dimitri
Re: [ClusterLabs] start a resource
On 2016-05-05 23:50, Moiz Arif wrote:
> Hi Dimitri, Try a cleanup of the fail count for the resource with any of
> the below commands: via pcs: pcs resource cleanup rsyncd

Tried it, didn't work. Tried pcs resource debug-start rsyncd -- got no errors, but the resource didn't start. Tried disable/enable. So far the only way I've been able to do this is pcs cluster stop; pcs cluster start, which is ridiculous on a production cluster with drbd and a database etc. (And it killed my ssh connection to the other node, again.)

Any other suggestions?

Thanks,
Dima
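One thing worth checking in this situation (a hedged suggestion, not from the thread): a resource that stays stopped after a clean cleanup often has a leftover ban/location constraint or has hit its migration threshold. Something like:

    pcs constraint --full                  # look for cli-ban/cli-prefer entries left by 'pcs resource move/ban'
    pcs resource failcount show rsyncd     # per-node failure counts vs. migration-threshold
    crm_simulate -sL | grep -i rsyncd      # ask the policy engine how it scores placement of the resource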
Re: [ClusterLabs] Monitoring action of Pacemaker resources fail because of high load on the nodes
On 2016-04-26 00:58, Klaus Wenninger wrote:
> But what you are attempting doesn't sound entirely proprietary. So once you
> have something that looks like it might be useful for others as well, let
> the community participate and free yourself from having to always take
> care of your private copy ;-)

Presumably you could try a pull request, but last time I failed to convince Andrew that wget'ing http://localhost/server-status/ is the wrong thing to do in the first place (apache RA), so your pull request may never get merged. Which I suppose is still better than my mon scripts: those are private-copy-only, with no place in the heartbeat packages to try and share them.

Dimitri
Re: [ClusterLabs] Monitoring action of Pacemaker resources fail because of high load on the nodes
On 2016-04-24 16:20, Ken Gaillot wrote:
> Correct, you would need to customize the RA.

Well, you wouldn't, because your custom RA will be overwritten by the next RPM update.

Dimitri
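One way around the overwrite problem (a suggestion not in the original thread): OCF agents are looked up by provider directory, so a locally modified copy can live under its own provider name and survive package updates. Roughly, with placeholder names:

    # Keep the modified agent under a site-local provider directory:
    mkdir -p /usr/lib/ocf/resource.d/local
    cp /usr/lib/ocf/resource.d/heartbeat/apache /usr/lib/ocf/resource.d/local/apache
    # ...edit the copy, then reference it as ocf:local:apache, e.g.:
    pcs resource create website ocf:local:apache configfile=/etc/httpd/conf/httpd.conf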
Re: [ClusterLabs] dropping ssh connection on failover
On 2016-04-15 07:46, Klaus Wenninger wrote:
> Which IP address did you use to ssh to that box? One controlled by
> pacemaker and possibly being migrated, or a fixed one assigned to that box?

Good try, but no: the "sunken" (as opposed to floating ;) address, of course. If what digimer says is true, it is ridiculous. I might end up building my own heartbeat rpm for centos 7 after all... (my haresources-based clusters have no problem keeping ssh connections up across a failover).

Dimitri