Re: [Pacemaker] command to dump cluster configuration in pcs format?
On 16 Jan 2014, at 10:59 pm, Lars Marowsky-Bree l...@suse.com wrote:
> On 2014-01-15T20:25:30, Bob Haxo bh...@sgi.com wrote:
>> Unfortunately, it has taken me weeks to develop (what now seems to be) a working configuration (including mods to the VirtualDomain agent to avoid spurious restarts of the VM).
> Curious if you can push these upstream too ;-) (Or already have.)
>> The problem is that this goes into a product that gets shipped to mfg, and then to customers, and then needs to be supported by other engineers (and then often back to me). Easy to create configurations with crm and then load (crm -f file) the Pacemaker configuration using the crm commands created with "crm configure show", with some scripted substitutions for hostnames, IP addresses, and other site customizations. The SLES HAE uses crm, and I'm trying to make the SLES and RHEL versions as identical as possible. That makes it easier for me to maintain, and for others to support.
> Well, unless RHT states that installing crmsh on top of their distribution invalidates support for the pacemaker back-end, you could just ship crmsh as part of your product on that platform.

That's not how RHT operates, I'm afraid. If something isn't at least planned to be supported, they don't ship it. However, some interested party could get it into Fedora, and from there into EPEL, if they chose to.

> It should be easy to install on RHEL, and you're already installing your own product anyway, so it shouldn't be a huge problem to add one more package?
>
> Regards,
>     Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] command to dump cluster configuration in pcs format?
On 17 Jan 2014, at 9:05 am, Lars Marowsky-Bree l...@suse.com wrote:
> On 2014-01-17T07:40:34, Andrew Beekhof and...@beekhof.net wrote:
>>> Well, unless RHT states that installing crmsh on top of their distribution invalidates support for the pacemaker back-end, you could just ship crmsh as part of your product on that platform.
>> That's not how RHT operates, I'm afraid. If something isn't at least planned to be supported, they don't ship it.
> Right, and I very much understand that. The question was if installing crmsh on top of RHEL/RHCS (as he's doing now) would invalidate support for the rest of the system.

And I can't answer that. If some behaviour specific to crmsh (or the mixing of the two) was triggering a bug, that /may/ result in the fix being given lower priority for inclusion. I can also imagine support wanting to ensure issues are reproducible with pcs prior to accepting them. But I'd not think using crmsh would completely invalidate support.
Re: [Pacemaker] Question about new migration
On 15 Jan 2014, at 7:12 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote:
> Hi David,
>
> With the new migration logic, when the VM was migrated by 'node standby', a start was performed on the migrate_target (migrate_from was not performed). Is this the designed behavior?
>
> # crm_mon -rf1
> Stack: corosync
> Current DC: bl460g1n6 (3232261592) - partition with quorum
> Version: 1.1.11-0.27.b48276b.git.el6-b48276b
> 2 Nodes configured
> 3 Resources configured
>
> Online: [ bl460g1n6 bl460g1n7 ]
>
> Full list of resources:
> prmVM2 (ocf::heartbeat:VirtualDomain): Started bl460g1n6
> Clone Set: clnPing [prmPing]
>     Started: [ bl460g1n6 bl460g1n7 ]
>
> Node Attributes:
> * Node bl460g1n6:
>     + default_ping_set : 100
> * Node bl460g1n7:
>     + default_ping_set : 100
>
> # crm node standby bl460g1n6
> # egrep "do_lrm_rsc_op:|process_lrm_event:" ha-log | grep prmVM2
> Jan 15 15:39:22 bl460g1n6 crmd[30795]: info: do_lrm_rsc_op: Performing key=11:5:0:be72ea63-75a9-4de4-a591-e716f960743b op=prmVM2_migrate_to_0
> Jan 15 15:39:28 bl460g1n6 crmd[30795]: notice: process_lrm_event: LRM operation prmVM2_migrate_to_0 (call=16, rc=0, cib-update=66, confirmed=true) ok
> Jan 15 15:39:30 bl460g1n6 crmd[30795]: info: do_lrm_rsc_op: Performing key=7:6:0:be72ea63-75a9-4de4-a591-e716f960743b op=prmVM2_stop_0

Looks like the transition was aborted (5) and another (6) was calculated. Compare the action:transition:expected_rc:uuid fields of key=11:5:0:be72ea63-75a9-4de4-a591-e716f960743b and key=7:6:0:be72ea63-75a9-4de4-a591-e716f960743b.

> Jan 15 15:39:30 bl460g1n6 crmd[30795]: notice: process_lrm_event: LRM operation prmVM2_stop_0 (call=19, rc=0, cib-update=68, confirmed=true) ok
> Jan 15 15:39:30 bl460g1n7 crmd[29923]: info: do_lrm_rsc_op: Performing key=8:6:0:be72ea63-75a9-4de4-a591-e716f960743b op=prmVM2_start_0
> Jan 15 15:39:30 bl460g1n7 crmd[29923]: notice: process_lrm_event: LRM operation prmVM2_start_0 (call=13, rc=0, cib-update=17, confirmed=true) ok
>
> Best Regards,
> Kazunori INOUE
>
> [attachment: pcmk-Wed-15-Jan-2014.tar.bz2]
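The key format mentioned above (action:transition:expected_rc:uuid) can be pulled apart mechanically. A minimal Python sketch, doing plain string handling on the log fields quoted above (this is not a Pacemaker API):

```python
# Sketch: parse the crmd op key format action:transition:expected_rc:uuid
# from log lines like "key=11:5:0:be72ea63-...". Plain string handling,
# not a Pacemaker API.
from typing import NamedTuple

class OpKey(NamedTuple):
    action: int        # action number within the transition graph
    transition: int    # transition number
    expected_rc: int   # expected agent return code
    uuid: str          # transition (DC) UUID

def parse_op_key(key: str) -> OpKey:
    # The UUID may not contain ':' in this format, so a 3-split is safe.
    action, transition, rc, uuid = key.split(":", 3)
    return OpKey(int(action), int(transition), int(rc), uuid)

migrate = parse_op_key("11:5:0:be72ea63-75a9-4de4-a591-e716f960743b")
stop = parse_op_key("7:6:0:be72ea63-75a9-4de4-a591-e716f960743b")

# The migrate_to belongs to transition 5, the stop to transition 6,
# confirming that a new transition was calculated in between:
print(migrate.transition, stop.transition)  # 5 6
```

Comparing the second field of two keys is exactly the "transition was aborted and another calculated" check described in the reply.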
Re: [Pacemaker] hangs pending
On 16 Jan 2014, at 12:41 am, Andrey Groshev gre...@yandex.ru wrote:
> 15.01.2014, 02:53, Andrew Beekhof and...@beekhof.net:
>> On 15 Jan 2014, at 12:15 am, Andrey Groshev gre...@yandex.ru wrote:
>>> 14.01.2014, 10:00, Andrey Groshev gre...@yandex.ru:
>>>> 14.01.2014, 07:47, Andrew Beekhof and...@beekhof.net:
>>>>> Ok, here's what happens:
>>>>> 1. node2 is lost
>>>>> 2. fencing of node2 starts
>>>>> 3. node2 reboots (and the cluster starts)
>>>>> 4. node2 returns to the membership
>>>>> 5. node2 is marked as a cluster member
>>>>> 6. the DC tries to bring it into the cluster, but needs to cancel the active transition first - which is a problem, since the node2 fencing operation is part of that transition
>>>>> 7. node2 stays in a transition (pending) state until fencing passes or fails
>>>>> 8a. fencing fails: the transition completes and the node joins the cluster. That's the theory - except we automatically try again, which isn't appropriate. This should be relatively easy to fix.
>>>>> 8b. fencing passes: the node is incorrectly marked as offline. This I have no idea how to fix yet.
>>>>>
>>>>> On another note, it doesn't look like this agent works at all. The node has been back online for a long time and the agent is still timing out after 10 minutes. So "Once the script makes sure that the victim has rebooted and is again available via ssh - it exits with 0" does not seem to be true.
>>> Damn. Looks like you're right. At some point I broke my agent and hadn't noticed it. I repaired my agent - after sending the reboot it now waits on STDIN. The normal behaviour is back - it hangs in pending until I manually reboot. :)
>> Right. Now you're in case 8b. Can you try this patch: http://paste.fedoraproject.org/68450/38973966
> I spent the whole day on experiments. It turns out like this:
> 1. Built the cluster.
> 2. On node-2, sent signal (-4) - killed corosync.
> 3. From node-1 (the DC) - stonith sent a reboot.
> 4. The node rebooted and resources started.
> 5. Again: on node-2, sent signal (-4) - killed corosync.
> 6. Again: from node-1 (the DC) - stonith sent a reboot.
> 7. Node-2 rebooted and hangs in pending.
> 8. Waiting, waiting... then manually reboot.
> 9. Node-2 reboots and resources start.
> 10. GOTO step 2.
>> Logs?
> New logs: http://send2me.ru/crmrep1.tar.bz2

On 14 Jan 2014, at 1:19 pm, Andrew Beekhof and...@beekhof.net wrote:
> Apart from anything else, your timeout needs to be bigger:
> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)

On 14 Jan 2014, at 7:18 am, Andrew Beekhof and...@beekhof.net wrote:
> On 13 Jan 2014, at 8:31 pm, Andrey Groshev gre...@yandex.ru wrote:
>> 13.01.2014, 02:51, Andrew Beekhof and...@beekhof.net:
>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru wrote:
>>>> 10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru:
>>>>> 10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net:
>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote:
>>>>>>> 10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net:
>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru wrote:
>>>>>>>>> 08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net:
>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev gre...@yandex.ru wrote:
>>>>>>>>>>> Hi, ALL. I'm still trying to cope with the fact that after the fence the node hangs in pending.
>>>>>>>>>> Please define pending. Where did you see this?
>>>>>>>>> In crm_mon:
>>>>>>>>>     Node dev-cluster2-node2 (172793105): pending
>>>>>>>>> The experiment was like this: four nodes in the cluster. On one of them, kill corosync or pacemakerd (signal 4 or 6 or 11). Thereafter the remaining nodes constantly reboot it, under various pretexts ("softly whistling", "flying low", "not a cluster member!"...). Then "Too many failures" fell out in the log. All this time the status in crm_mon is "pending", changing to UNCLEAN depending on the wind direction. Much time has passed and I cannot accurately describe the behavior... Now I am in the following state: I tried to locate the problem, and came here with this.
>>>>>>> I set a big value in the property: stonith-timeout=600s. And got the following behavior:
>>>>>>> 1. pkill -4 corosync
>>>>>>> 2. from the node with the DC, my fence agent sshbykey is called
>>>>>>> 3. it sends a reboot to the victim and waits until it comes back to life
>>>>>> Hmmm, what version of pacemaker? This sounds like a timing issue that we fixed a while back.
>>>>> It was a version 1.1.11 from December 3. Now trying a full update and retest.
>>>> That should be recent enough. Can you create a crm_report the next time you reproduce it?
>>> Of course, yes. A little delay though :) ..
>>> cc1: warnings being treated as errors
>>> upstart.c: In function ‘upstart_job_property
Re: [Pacemaker] crm_resource -L not trustable right after restart
On 16 Jan 2014, at 6:53 am, Brian J. Murrell (brian) br...@interlinx.bc.ca wrote:
> On Wed, 2014-01-15 at 17:11 +1100, Andrew Beekhof wrote:
>> Consider any long running action, such as starting a database. We do not update the CIB until after actions have completed, so there can and will be times when the status section is out of date to one degree or another.
> But that is the opposite of what I am reporting

I know, I was giving you another example of when the cib is not completely up-to-date with reality.

> and is acceptable. It's acceptable for a resource that is in the process of starting to be reported as stopped, because it's not yet started.

It may very well be partially started. It's almost certainly not "stopped", which is what is being reported.

> What I am seeing is resources being reported as stopped when they are in fact started/running and have been for a long time.
>> At node startup is another point at which the status could potentially be behind.
> Right. Which is the case I am talking about.
>> It sounds to me like you're trying to second guess the cluster, which is a dangerous path.
> No, not trying to second guess at all.

You're not using the output to decide whether to perform some logic? Because crm_mon is the more usual command to run right after startup (which would give you enough context to know things are still syncing).

> I'm just trying to ask the cluster what the state is and not getting the truth. I am willing to believe whatever state the cluster says it's in, as long as what I am getting is the truth.
>> What if it's the first node to start up?
> I'd think a timeout comes into play here.

There'd be no fresh copy to arrive in that case.

> I can't say that I know how the CIB works internally/entirely, but I'd imagine that when a cluster node starts up it tries to see if there is a fresher CIB out there in the cluster.

Nope.

> Maybe this is part of the process of choosing/discovering a DC.

DC election happens at the crmd. The cib is a dumb repository of name/value pairs. It doesn't even understand new vs. old - only different.

> But ultimately, if the node is the first one up, it will eventually figure that out so that it can nominate itself as the DC. Or it finds out that there is a DC already (and gets a fresh CIB from it?). It's during that window that I propose that crm_resource should not be asserting anything, and should just admit that it does not (yet) know.

If it had enough information to know it was out of date, it wouldn't be out of date.

> But surely it understands whether it is in the process of joining a cluster or not, and therefore knows enough to know that it doesn't know whether it's out of date or not.

And if it has a newer config compared to the existing nodes?

> But it knows that it could be.

As above, there are situations when you'd never get an answer.

> I should have added to my proposal: "or has determined that there is nothing to refresh its CIB from and that its local copy is authoritative for the whole cluster."
>
> b.
Re: [Pacemaker] command to dump cluster configuration in pcs format?
On 16 Jan 2014, at 11:49 am, Bob Haxo bh...@sgi.com wrote:
> On 01/15/2014 05:02 PM, Bob Haxo wrote:
>> Greetings,
>> The command "crm configure show" dumps the cluster configuration in a format that is suitable for use in configuring a cluster. The command "pcs config" generates nice human-readable information, but this is not directly suitable for use in configuring a cluster. Is there a pcs command analogous to the crm command that dumps the cluster configuration in pcs format?
> On Wed, 2014-01-15 at 17:55 -0600, Chris Feist wrote:
>> Currently there is not. We may at some point look into this, but it isn't on my short term list of things to do. Thanks, Chris
> Oh, well, bummer ... but at least I hadn't missed the command in the docs or in the installed code.

The list of commands you used to build the cluster is in the history too, remember. You could just save that to a file instead and restore with "bash ./bit-of-history-i-care-about".

> I'll probably use "crm configure show" to capture the pcs-created configuration for installations ... and save the pcs for when I correspond with RedHat. Dumping and loading the xml really is not an option.

You dump and load xml a lot? Even assuming yes, it's in a file that you don't have to read... so where is the problem?

> Regards,
> Bob Haxo
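The save-your-history suggestion above can be sketched generically. The file names here are examples only; on a real system the command list would come from ~/.bash_history rather than the stand-in file created below:

```shell
# Stand-in for a shell history containing the commands used to build the
# cluster (on a real system you would grep ~/.bash_history instead).
cat > cluster-history.txt <<'EOF'
pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.1.10
ls -l /etc
pcs constraint order start vip then dummy
EOF

# Keep only the cluster-building commands and save them as a replay script.
grep '^pcs ' cluster-history.txt > bit-of-history-i-care-about

# On a freshly installed node, you would then restore with:
#   bash ./bit-of-history-i-care-about
wc -l < bit-of-history-i-care-about   # the two pcs commands survive the filter
```

Site-specific values (hostnames, IPs) can then be templated with sed, much like the scripted substitutions Bob describes for the crm-based flow.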
Re: [Pacemaker] crm_resource -L not trustable right after restart
On 16 Jan 2014, at 1:13 pm, Brian J. Murrell (brian) br...@interlinx.bc.ca wrote:
> On Thu, 2014-01-16 at 08:35 +1100, Andrew Beekhof wrote:
>> I know, I was giving you another example of when the cib is not completely up-to-date with reality.
> Yeah, I understood that. I was just countering with why that example is actually more acceptable.
>> It may very well be partially started.
> Sure.
>> It's almost certainly not "stopped", which is what is being reported.
> Right. But until it is completely started (and ready to do whatever it's supposed to do), it might as well be considered stopped. If you have to make a binary state out of stopped/starting/started, I think most people will agree that the two states are "stopped" and "started", with "stopped" covering anything still starting, since most things are not useful until they are fully started.
>> You're not using the output to decide whether to perform some logic?
> Nope. Just reporting the state.

But that's difficult when you have two participants making positive assertions about state, and one is not really in a position to do so.

>> Because crm_mon is the more usual command to run right after startup
> The problem with crm_mon is that it doesn't tell you where a resource is running.

What crm_mon are you looking at? I see stuff like:

 virt-fencing (stonith:fence_xvm): Started rhos4-node3
 Resource Group: mysql-group
     mysql-vip (ocf::heartbeat:IPaddr2): Started rhos4-node3
     mysql-fs (ocf::heartbeat:Filesystem): Started rhos4-node3
     mysql-db (ocf::heartbeat:mysql): Started rhos4-node3

>> (which would give you enough context to know things are still syncing).
> That's interesting. Would polling crm_mon be more efficient than polling the remote CIB with "cibadmin -Q"?

crm_mon in interactive mode subscribes to updates from the cib, which would be more efficient than repeatedly calling cibadmin or crm_mon.

>> DC election happens at the crmd.
> So would it be fair to say that I should not trust the local CIB until the DC election has finished? Or could there be latency between that completing and the CIB being refreshed?

After the join completes (which happens after the election, or when a new node is found), then it is safe. You can tell this by running "crmadmin -S -H `uname -n`" and looking for S_IDLE, S_POLICY_ENGINE or S_TRANSITION_ENGINE, iirc.

> If DC election completion is accurate, what's the best way to determine that it has completed?

Ideally it doesn't happen when a node joins an existing cluster.
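A tiny sketch of acting on that advice: checking the crmd state before trusting the local CIB. Only the S_* state names come from the message above; the exact output line format of `crmadmin -S` shown here is an assumption for illustration:

```python
# Sketch: decide whether the local CIB is trustworthy by checking the
# crmd FSA state reported by `crmadmin -S -H <node>`. The sample output
# strings below are hypothetical; the S_* names are from the thread.
SETTLED_STATES = {"S_IDLE", "S_POLICY_ENGINE", "S_TRANSITION_ENGINE"}

def cib_is_trustworthy(crmadmin_output: str) -> bool:
    """True once the crmd reports one of the post-join 'settled' states."""
    return any(state in crmadmin_output.split() for state in SETTLED_STATES)

print(cib_is_trustworthy("Status of crmd@node1: S_IDLE (ok)"))     # True
print(cib_is_trustworthy("Status of crmd@node1: S_PENDING (ok)"))  # False
```

In practice the string would come from something like `subprocess.run(["crmadmin", "-S", "-H", node], capture_output=True, text=True).stdout`, polled in a loop until it returns True.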
Re: [Pacemaker] command to dump cluster configuration in pcs format?
On 16 Jan 2014, at 3:25 pm, Bob Haxo bh...@sgi.com wrote:
> On Thu, 2014-01-16 at 12:32 +1100, Andrew Beekhof wrote:
>> On 16 Jan 2014, at 11:49 am, Bob Haxo bh...@sgi.com wrote:
>>> On 01/15/2014 05:02 PM, Bob Haxo wrote:
>>>> Greetings,
>>>> The command "crm configure show" dumps the cluster configuration in a format that is suitable for use in configuring a cluster. The command "pcs config" generates nice human-readable information, but this is not directly suitable for use in configuring a cluster. Is there a pcs command analogous to the crm command that dumps the cluster configuration in pcs format?
>>> On Wed, 2014-01-15 at 17:55 -0600, Chris Feist wrote:
>>>> Currently there is not. We may at some point look into this, but it isn't on my short term list of things to do. Thanks, Chris
>>> Oh, well, bummer ... but at least I hadn't missed the command in the docs or in the installed code.
>> The list of commands you used to build the cluster is in the history too, remember. You could just save that to a file instead and restore with "bash ./bit-of-history-i-care-about".
> How very nice it would have been had I been able to just enter the correct commands at a single time in a single shell instance. Unfortunately, it has taken me weeks to develop (what now seems to be) a working configuration (including mods to the VirtualDomain agent to avoid spurious restarts of the VM).

Now that it's done though, I'd not have thought reverse engineering the pcs commands was /that/ hard (from the xml or shell history). And once you have that, it's not so far from "crm show". The challenge with forcing your input and output formats to match is that you're limited in how smart you can make the input side of things. For example, having one command create both an ordering and a colocation constraint is challenging... how do you determine whether to merge them or leave them separate when representing it in the output? (Hint: no matter which option you pick, someone will complain it should be the other :-) It's not a completely black and white discussion...

>>> I'll probably use "crm configure show" to capture the pcs-created configuration for installations ... and save the pcs for when I correspond with RedHat. Dumping and loading the xml really is not an option.
>> You dump and load xml a lot?
> No, but I did read that pcs can dump/load the cib xml.

Sorry, I meant the config (regardless of format).

>> Even assuming yes, it's in a file that you don't have to read... so where is the problem?
> The problem is that this goes into a product that gets shipped to mfg, and then to customers, and then needs to be supported by other engineers (and then often back to me). Easy to create configurations with crm and then load (crm -f file) the Pacemaker configuration using the crm commands created with "crm configure show", with some scripted substitutions for hostnames, IP addresses, and other site customizations. The SLES HAE uses crm, and I'm trying to make the SLES and RHEL versions as identical as possible. That makes it easier for me to maintain, and for others to support.

Fair enough.

> Regards,
> Bob Haxo
Re: [Pacemaker] Time to get ready for 1.1.11
On 16 Jan 2014, at 3:00 pm, Keisuke MORI keisuke.mori...@gmail.com wrote:
> Hi,
> Just curious: I noticed that RC4 was branched off master after RC3. What will happen to the fixes that are only in the master branch? Are they going to be merged into 1.1.12 someday, just skipping 1.1.11?

Yes. Unlike last time, we're trying to be better about not merging new features and other risky changes during the RC phase :-) The branching should have happened earlier, but I forgot.

> Or are they separated out for the next major version, such as v1.2 or v2.0?

I think our plans for 1.2/2.0 are on hold indefinitely. It's all 1.1.x releases for the foreseeable future. For a small dev team, the benefits didn't outweigh the costs.

> Thanks,
>
> 2014/1/16 David Vossel dvos...@redhat.com:
>> ----- Original Message -----
>> From: David Vossel dvos...@redhat.com
>> To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
>> Sent: Tuesday, January 7, 2014 4:50:11 PM
>> Subject: Re: [Pacemaker] Time to get ready for 1.1.11
>>
>>> From: Andrew Beekhof and...@beekhof.net
>>> Sent: Thursday, December 19, 2013 2:25:00 PM
>>>
>>> On 20 Dec 2013, at 2:11 am, Andrew Martin amar...@xes-inc.com wrote:
>>>> David/Andrew,
>>>> Once 1.1.11 final is released, is it considered the new stable series of Pacemaker,
>>> yes
>>>> or should 1.1.10 still be used in very stable/critical production environments?
>>>> Thanks,
>>>> Andrew
>>>
>>>> From: Andrew Beekhof and...@beekhof.net
>>>> Sent: Wednesday, November 20, 2013 9:02:40 PM
>>>> Subject: [Pacemaker] Time to get ready for 1.1.11
>>>>
>>>> With over 400 updates since the release of 1.1.10, it's time to start thinking about a new release. Today I have tagged release candidate 1 [1]. The most notable fixes include:
>>>>
>>>> + attrd: Implementation of a truly atomic attrd for use with corosync 2.x
>>>> + cib: Allow values to be added/updated and removed in a single update
>>>> + cib: Support XML comments in diffs
>>>> + Core: Allow blackbox logging to be disabled with SIGUSR2
>>>> + crmd: Do not block on proxied calls from pacemaker_remoted
>>>> + crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load
>>>> + crmd: Use the load on our peers to know how many jobs to send them
>>>> + crm_mon: add --hide-headers option to hide all headers
>>>> + crm_report: Collect logs directly from journald if available
>>>> + Fencing: On timeout, clean up the agent's entire process group
>>>> + Fencing: Support agents that need the host to be unfenced at startup
>>>> + ipc: Raise the default buffer size to 128k
>>>> + PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules
>>>> + PE: Allow location constraints to take a regex pattern to match against resource IDs
>>>> + pengine: Distinguish between the agent being missing and something the agent needs being missing
>>>> + remote: Properly version the remote connection protocol
>>>> + services: Detect missing agents and permission errors before forking
>>>> + Bug cl#5171 - pengine: Don't prevent clones from running due to dependent resources
>>>> + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known
>>>> + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers
>>>>
>>>> If you are a user of `pacemaker_remoted`, you should take the time to read about changes to the remote wire protocol [2] that are present in this release.
>>>>
>>>> [1] https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1
>>>> [2] http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/
>>>>
>>>> To build `rpm` packages for testing:
>>>> 1. Clone the current sources:
>>>>    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
>>>>    # cd pacemaker
>>>> 2. If you haven't already, install Pacemaker's dependencies:
>>>>    [Fedora] # sudo yum install -y yum-utils
>>>>    [ALL] # make rpm-dep
>>>> 3. Build Pacemaker:
>>>>    # make rc
>>>> 4. Copy the rpms and deploy as needed
>>>
>>> A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing: https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2
>>> Assuming no major regressions are encountered during testing, this tag will become the final Pacemaker-1.1.11 release a week from today.
>>> -- Vossel
>>
>> Alright, new RC time: Pacemaker-1.1.11-rc3. If no regressions are encountered, rc3 will become the 1.1.11 final release a week from today. https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc3
>>
>> CHANGES RC2 vs RC3
Re: [Pacemaker] [Linux-HA] Better way to change master in 3 node pgsql cluster
On 14 Jan 2014, at 10:32 pm, Andrey Rogovsky a.rogov...@gmail.com wrote: I understand it. So, no way change master better without cluster software update? crm_resource is just creating (and removing) normal location constraints. no reason you couldn't write a script to create them instead. 2014/1/14 Andrey Groshev gre...@yandex.ru 14.01.2014, 12:39, Andrey Rogovsky a.rogov...@gmail.com: I use Debian 7 and got: Reconnecting...root@a:~# crm_resource --resource msPostgresql --ban --master --host a.geocluster.e-autopay.com crm_resource: unrecognized option '--ban' No other way to move master? 2014/1/13 Andrew Beekhof and...@beekhof.net On 13 Jan 2014, at 8:32 pm, Andrey Rogovsky a.rogov...@gmail.com wrote: Hi I have 3 node postgresql cluster. It work well. But I have some trobule with change master. For now, if I need change master, I must: 1) Stop PGSQL on each node and cluster service 2) Start Setup new manual PGSQL replication 3) Change attributes on each node for point to new master 4) Stop PGSQL on each node 5) Celanup resource and start cluster service It take a lot of time. Is it exist better way to change master? Newer versions support: crm_resource --resource msPostgresql --ban --master --host a.geocluster.e-autopay.com This is my cluster service status: Node Attributes: * Node a.geocluster.e-autopay.com: + master-pgsql:0 : 1000 + pgsql-data-status : LATEST + pgsql-master-baseline : 2F90 + pgsql-status : PRI * Node c.geocluster.e-autopay.com: + master-pgsql:0 : 1000 + pgsql-data-status : SYNC + pgsql-status : STOP * Node b.geocluster.e-autopay.com: + master-pgsql:0 : 1000 + pgsql-data-status : SYNC + pgsql-status : STOP I was use http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster for my 3 nodes cluster without hard stik. 
Now I got a strange situation: all nodes stay slave:

Last updated: Sat Dec 7 04:33:47 2013
Last change: Sat Dec 7 12:56:23 2013 via crmd on a
Stack: openais
Current DC: c - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
5 Nodes configured, 3 expected votes
4 Resources configured.
Online: [ a c b ]
Master/Slave Set: msPostgresql [pgsql]
Slaves: [ a c b ]

You are using version 1.1.7. The --ban option was added in 1.1.9. See: https://github.com/ClusterLabs/pacemaker/blob/master/ChangeLog

My config is:

node a \
    attributes pgsql-data-status=DISCONNECT
node b \
    attributes pgsql-data-status=DISCONNECT
node c \
    attributes pgsql-data-status=DISCONNECT
primitive pgsql ocf:heartbeat:pgsql \
    params pgctl=/usr/lib/postgresql/9.3/bin/pg_ctl psql=/usr/bin/psql pgdata=/var/lib/postgresql/9.3/main start_opt="-p 5432" rep_mode=sync node_list="a b c" restore_command="cp /var/lib/postgresql/9.3/pg_archive/%f %p" master_ip=192.168.10.200 restart_on_promote=true config=/etc/postgresql/9.3/main/postgresql.conf \
    op start interval=0s timeout=60s on-fail=restart \
    op monitor interval=4s timeout=60s on-fail=restart \
    op monitor interval=3s role=Master timeout=60s on-fail=restart \
    op promote interval=0s timeout=60s on-fail=restart \
    op demote interval=0s timeout=60s on-fail=stop \
    op stop interval=0s timeout=60s on-fail=block \
    op notify interval=0s timeout=60s
primitive pgsql-master-ip ocf:heartbeat:IPaddr2 \
    params ip=192.168.10.200 nic=peervpn0 \
    op start interval=0s timeout=60s on-fail=restart \
    op monitor interval=10s timeout=60s on-fail=restart \
    op stop interval=0s timeout=60s on-fail=block
group master pgsql-master-ip
ms msPostgresql pgsql \
    meta master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 notify=true
colocation set_ip inf: master msPostgresql:Master
order ip_down 0: msPostgresql:demote master:stop symmetrical=false
order ip_up 0: msPostgresql:promote master:start symmetrical=false
property $id=cib-bootstrap-options \
    dc-version=1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff \
    cluster-infrastructure=openais \
    expected-quorum-votes=3 \
    no-quorum-policy=ignore \
    stonith-enabled=false \
    crmd-transition-delay=0 \
    last-lrm-refresh=1386404222
rsc_defaults $id=rsc-options \
    resource-stickiness=100 \
    migration-threshold=1

___ Linux-HA mailing list linux...@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc
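Andrew's point above - that crm_resource --ban just creates an ordinary location constraint - suggests a workaround for 1.1.7, which predates the --ban option. A minimal sketch under that assumption (resource and node names are taken from the thread; the constraint id is illustrative): the function only prints the crm snippet, and on a live cluster you would save it to a file and load it with crm -f, as described elsewhere in this thread.

```shell
# Sketch of what "crm_resource --ban --master" amounts to on newer versions:
# a -INFINITY location constraint for the Master role on the target node.
# The constraint id "ban-msPostgresql-<node>" is illustrative.
ban_master() {
    node="$1"
    printf 'location ban-msPostgresql-%s msPostgresql rule $role=Master -inf: #uname eq %s\n' "$node" "$node"
}

# In a live cluster: ban_master a > ban.crm && crm -f ban.crm
# Undo later with:   crm configure delete ban-msPostgresql-a
ban_master a
```

Removing the constraint again lets the master be promoted on that node once more, which is exactly what --clear/--un-move does in releases that have --ban.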
Re: [Pacemaker] hangs pending
On 15 Jan 2014, at 12:15 am, Andrey Groshev gre...@yandex.ru wrote:
14.01.2014, 10:00, Andrey Groshev gre...@yandex.ru:
14.01.2014, 07:47, Andrew Beekhof and...@beekhof.net:

Ok, here's what happens:
1. node2 is lost
2. fencing of node2 starts
3. node2 reboots (and the cluster starts)
4. node2 returns to the membership
5. node2 is marked as a cluster member
6. The DC tries to bring it into the cluster, but needs to cancel the active transition first. Which is a problem, since the node2 fencing operation is part of that transition.
7. node2 stays in a transition (pending) state until fencing passes or fails
8a. fencing fails: the transition completes and the node joins the cluster

That's the theory, except we automatically try again, which isn't appropriate. This should be relatively easy to fix.

8b. fencing passes: the node is incorrectly marked as offline

This I have no idea how to fix yet.

On another note, it doesn't look like this agent works at all. The node has been back online for a long time and the agent is still timing out after 10 minutes. So "Once the script makes sure that the victim has rebooted and is again available via ssh - it exits with 0" does not seem to be true.

Damn. Looks like you're right. At some point I broke my agent and had not noticed it. I will figure it out. I repaired my agent - after sending the reboot it was waiting on STDIN. The old behavior has returned - it hangs pending until I manually send a reboot. :)

Right. Now you're in case 8b.
Can you try this patch: http://paste.fedoraproject.org/68450/38973966 New logs: http://send2me.ru/crmrep1.tar.bz2 On 14 Jan 2014, at 1:19 pm, Andrew Beekhof and...@beekhof.net wrote: Apart from anything else, your timeout needs to be bigger: Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired) On 14 Jan 2014, at 7:18 am, Andrew Beekhof and...@beekhof.net wrote: On 13 Jan 2014, at 8:31 pm, Andrey Groshev gre...@yandex.ru wrote: 13.01.2014, 02:51, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru: 10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net: On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru wrote: 08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net: On 29 Nov 2013, at 7:17 pm, Andrey Groshev gre...@yandex.ru wrote: Hi, ALL. I'm still trying to cope with the fact that after the fence - node hangs in pending. Please define pending. Where did you see this? In crm_mon: .. Node dev-cluster2-node2 (172793105): pending .. The experiment was like this: Four nodes in cluster. On one of them kill corosync or pacemakerd (signal 4 or 6 oк 11). Thereafter, the remaining start it constantly reboot, under various pretexts, softly whistling, fly low, not a cluster member! ... Then in the log fell out Too many failures All this time in the status in crm_mon is pending. Depending on the wind direction changed to UNCLEAN Much time has passed and I can not accurately describe the behavior... Now I am in the following state: I tried locate the problem. Came here with this. I set big value in property stonith-timeout=600s. 
And got the following behavior:
1. pkill -4 corosync
2. the node with the DC calls my fence agent sshbykey
3. It sends the victim a reboot and waits until it comes back to life.

Hmmm, what version of pacemaker? This sounds like a timing issue that we fixed a while back.

It was version 1.1.11 from December 3. I will now try a full update and retest.

That should be recent enough. Can you create a crm_report the next time you reproduce it?

Of course, yes. A little delay though :)

..
cc1: warnings being treated as errors
upstart.c: In function ‘upstart_job_property’:
upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
upstart.c:264: error: assignment makes pointer from integer without a cast
gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/ha/pacemaker/lib'
make: *** [core] Error 1

I'm trying to solve this problem. It will not be solved quickly...

https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
g_variant_lookup_value () - Since 2.28
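The failing symbol explains the build break: g_variant_lookup_value() first appeared in glib 2.28, while CentOS 6.5 ships glib2 2.26.x (per the yum output later in the thread). A small sketch of the version comparison involved, using GNU sort -V for version ordering (the hard-coded versions are just the ones from this thread):

```shell
# glib_at_least <installed> <required>: succeed if installed >= required.
glib_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# CentOS 6.5 ships glib2 2.26.1, so building this code there needs a
# GLIB_CHECK_VERSION(2,28,0) guard around the upstart support.
if glib_at_least "2.26.1" "2.28.0"; then
    echo "g_variant_lookup_value available"
else
    echo "glib too old: guard upstart support with GLIB_CHECK_VERSION(2,28,0)"
fi
```

This is exactly the gate the patch later in the thread adds at compile time with the GLIB_CHECK_VERSION preprocessor macro.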
Re: [Pacemaker] [Enhancement] Change of the globally-unique attribute of the resource.
On 14 Jan 2014, at 7:26 pm, renayama19661...@ybb.ne.jp wrote:

Hi All,

When a user changes the globally-unique attribute of a resource, a problem occurs. It occurs when the resource is managed with a PID file, because the PID file name changes with the globally-unique attribute.

(snip)
if [ "${OCF_RESKEY_CRM_meta_globally_unique}" = "false" ]; then
    : ${OCF_RESKEY_pidfile:=$HA_VARRUN/ping-${OCF_RESKEY_name}}
else
    : ${OCF_RESKEY_pidfile:=$HA_VARRUN/ping-${OCF_RESOURCE_INSTANCE}}
fi
(snip)

This is correct. The pid file cannot include the instance number when globally-unique is false, and must do so when it is true.

The problem can be reproduced with the following procedure.

* Step 1: Start a resource.
(snip)
primitive prmPingd ocf:pacemaker:pingd \
    params name=default_ping_set host_list=192.168.0.1 multiplier=200 \
    op start interval=0s timeout=60s on-fail=restart \
    op monitor interval=10s timeout=60s on-fail=restart \
    op stop interval=0s timeout=60s on-fail=ignore
clone clnPingd prmPingd
(snip)

* Step 2: Change the globally-unique attribute.
[root]# crm configure edit
(snip)
clone clnPingd prmPingd \
    meta clone-max=2 clone-node-max=2 globally-unique=true
(snip)

* Step 3: Stop Pacemaker. But the resource does not stop, because the PID file name changed when the globally-unique attribute changed.

I'd have expected the stop action to be performed with the old attributes. crm_report tarball?

I think that this is a known problem.

It wasn't until now.

I hope this problem will be solved in the future.

Best Regards,
Hideo Yamauchi.
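The effect of the pid-file logic quoted above can be demonstrated offline by evaluating it for both attribute values (the instance name prmPingd:0 and the /var/run prefix are illustrative; the variable names mirror the pingd agent):

```shell
# Reproduce the ping agent's pid-file naming for both globally-unique settings.
HA_VARRUN=/var/run
OCF_RESKEY_name=default_ping_set
OCF_RESOURCE_INSTANCE=prmPingd:0   # clone instance name, illustrative

for OCF_RESKEY_CRM_meta_globally_unique in false true; do
    if [ "${OCF_RESKEY_CRM_meta_globally_unique}" = "false" ]; then
        pidfile="${HA_VARRUN}/ping-${OCF_RESKEY_name}"
    else
        pidfile="${HA_VARRUN}/ping-${OCF_RESOURCE_INSTANCE}"
    fi
    echo "globally-unique=${OCF_RESKEY_CRM_meta_globally_unique}: ${pidfile}"
done
# -> globally-unique=false: /var/run/ping-default_ping_set
# -> globally-unique=true: /var/run/ping-prmPingd:0
```

Flipping globally-unique while the daemon is running therefore changes the path the stop action looks at, which is why the already-running process is no longer found.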
Re: [Pacemaker] [Enhancement] Change of the globally-unique attribute of the resource.
On 15 Jan 2014, at 12:06 pm, renayama19661...@ybb.ne.jp wrote:

Hi Andrew,

Sorry - this problem is specific to Pacemaker 1.0. On Pacemaker 1.1.11 the resource stopped correctly. When the globally-unique attribute is changed in Pacemaker 1.1, Pacemaker seems to restart the resource.

Makes sense, since the definition changed.

(snip)
Jan 15 18:29:40 rh64-2744 pengine[3369]: warning: process_rsc_state: Detected active orphan prmClusterMon running on rh64-2744
Jan 15 18:29:40 rh64-2744 pengine[3369]: info: clone_print: Clone Set: clnClusterMon [prmClusterMon] (unique)
Jan 15 18:29:40 rh64-2744 pengine[3369]: info: native_print: prmClusterMon:0 (ocf::pacemaker:ClusterMon): Stopped
Jan 15 18:29:40 rh64-2744 pengine[3369]: info: native_print: prmClusterMon:1 (ocf::pacemaker:ClusterMon): Stopped
Jan 15 18:29:40 rh64-2744 pengine[3369]: info: native_print: prmClusterMon (ocf::pacemaker:ClusterMon): ORPHANED Started rh64-2744
Jan 15 18:29:40 rh64-2744 pengine[3369]: notice: DeleteRsc: Removing prmClusterMon from rh64-2744
Jan 15 18:29:40 rh64-2744 pengine[3369]: info: native_color: Stopping orphan resource prmClusterMon
Jan 15 18:29:40 rh64-2744 pengine[3369]: info: RecurringOp: Start recurring monitor (10s) for prmClusterMon:0 on rh64-2744
Jan 15 18:29:40 rh64-2744 pengine[3369]: info: RecurringOp: Start recurring monitor (10s) for prmClusterMon:1 on rh64-2744
Jan 15 18:29:40 rh64-2744 pengine[3369]: notice: LogActions: Start prmClusterMon:0 (rh64-2744)
Jan 15 18:29:40 rh64-2744 pengine[3369]: notice: LogActions: Start prmClusterMon:1 (rh64-2744)
Jan 15 18:29:40 rh64-2744 pengine[3369]: notice: LogActions: Stop prmClusterMon (rh64-2744)
(snip)

Best Regards,
Hideo Yamauchi.

--- On Wed, 2014/1/15, renayama19661...@ybb.ne.jp renayama19661...@ybb.ne.jp wrote:

Hi Andrew,

Thank you for your comment.

But the resource does not stop, because the PID file name changed when the globally-unique attribute changed.
I'd have expected the stop action to be performed with the old attributes. crm_report tarball? Okay. I register this topic with Bugzilla. I attach the log to Bugzilla. Best Regards, Hideo Yamauchi. --- On Wed, 2014/1/15, Andrew Beekhof and...@beekhof.net wrote: On 14 Jan 2014, at 7:26 pm, renayama19661...@ybb.ne.jp wrote: Hi All, When a user changes the globally-unique attribute of the resource, a problem occurs. When it manages the resource with PID file, this occurs, but this is because PID file name changes by globally-unique attribute. (snip) if [ ${OCF_RESKEY_CRM_meta_globally_unique} = false ]; then : ${OCF_RESKEY_pidfile:=$HA_VARRUN/ping-${OCF_RESKEY_name}} else : ${OCF_RESKEY_pidfile:=$HA_VARRUN/ping-${OCF_RESOURCE_INSTANCE}} fi (snip) This is correct. The pid file cannot include the instance number when globally-unique is false and must do so when it is true. The problem can reappear in the following procedure. * Step1: Started a resource. (snip) primitive prmPingd ocf:pacemaker:pingd \ params name=default_ping_set host_list=192.168.0.1 multiplier=200 \ op start interval=0s timeout=60s on-fail=restart \ op monitor interval=10s timeout=60s on-fail=restart \ op stop interval=0s timeout=60s on-fail=ignore clone clnPingd prmPingd (snip) * Step2: Change globally-unique attribute. [root]# crm configure edit (snip) clone clnPingd prmPingd \ meta clone-max=2 clone-node-max=2 globally-unique=true (snip) * Step3: Stop Pacemaker But, the resource does not stop because PID file was changed as for the changed resource of the globally-unique attribute. I'd have expected the stop action to be performed with the old attributes. crm_report tarball? I think that this is a known problem. It wasn't until now. I wish this problem is solved in the future Best Regards, Hideo Yamauchi. 
Re: [Pacemaker] Consider extra slave node resource when calculating actions for failover
On 14 Jan 2014, at 11:25 pm, Juraj Fabo juraj.f...@gmail.com wrote:

Hi

I have a master-slave cluster with the configuration attached below, based on the documented postgresql master-slave cluster configuration. The colocation constraints work such that if one of the master-group resources fails, a failover to the slave node is done. This basically works ok.

I would like to integrate an additional condition. The resource SERVICE-res-mon-s1 runs on, and only on, the HotStandby. If the SERVICE-res-mon-s1 resource on the slave reports a negative score, then failover should not be done, because that indicates the slave node is not ready to run the services from master-group. However, even if SERVICE-res-mon-s1 fails, the postgresql slave (HotStandby) should still run, because SERVICE-res-mon-s1 monitors application-related functionality that does not block postgres itself.

The requested feature is very close to the "prefer nodes with the most connectivity" example described in http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ch09s03s03s02.html with the difference that the resource agent runs only on the standby node. The reason is that SERVICE-res-mon-s1 is in reality a minimalistic implementation of SERVICE-service, used to know whether the slave would be able to run SERVICE-service.

The simplest approach might be to run:
crm_attribute --name postgres --value 100 --node serv1 --lifetime forever
And then have SERVICE-res-mon-s1 run:
crm_attribute --name postgres --value 10 --node serv2 --lifetime reboot
whenever it starts, and:
crm_attribute --name postgres --delete --node serv2 --lifetime reboot
whenever it stops. Then you can use the 'postgres' attribute in the same way as you did with 'ping'.

If needed, I could change this design to use a clone of SERVICE-res-mon-s1 to run it on both nodes; however, I did not succeed even with this configuration.
Next step would be to have multiple instances of this resource agent running on both nodes (with a different spanID parameter) and prefer the node where more spans are ok. The OCF agent service_res_check reports the resource availability via its monitor function, where it updates its own score attribute via crm_attribute. I thought that using this custom score attribute in a location or colocation constraint could do the job, but it did not affect the failover logic.

Please, what should be done in order to have the cluster also consider the SERVICE-res-mon-s1 results when calculating resource scores and willingness to move?

note: my pacemaker also contains the patch from https://github.com/beekhof/pacemaker/commit/58962338

Thank you in advance

node $id=1 serv1 \
    attributes SERVICE-pgsql-data-status=STREAMING|ASYNC
node $id=2 serv2 \
    attributes SERVICE-pgsql-data-status=LATEST
primitive SERVICE-MIP1 ocf:heartbeat:IPaddr2 \
    params ip=10.40.0.70 cidr_netmask=24 iflabel=ma1 \
    op monitor interval=10s
primitive SERVICE-MIP2 ocf:heartbeat:IPaddr2 \
    params ip=10.40.0.71 cidr_netmask=26 iflabel=ma2 \
    op monitor interval=10s
primitive SERVICE-VIP ocf:heartbeat:IPaddr2 \
    params ip=10.40.0.72 cidr_netmask=24 iflabel=sla \
    meta resource-stickiness=1 \
    op monitor interval=10s timeout=60s on-fail=restart
primitive SERVICE-res-mon-s1 ocf:heartbeat:service_res_check \
    params spanID=1 \
    meta resource-stickiness=1 \
    op monitor interval=9s timeout=4s on-fail=restart
primitive SERVICE-pgsql ocf:heartbeat:pgsql \
    params master_ip=10.40.0.70 slave_ip=10.40.0.72 node_list="serv1 serv2" pgctl=/usr/bin/pg_ctl psql=/usr/bin/psql pgdata=/var/lib/pgsql/data/ start_opt="-p 5432" rep_mode=async logfile=/var/log/service_ra_pgsql.log primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" stop_escalate=0 \
    op start interval=0s timeout=120s on-fail=restart \
    op monitor interval=7s timeout=30s on-fail=restart \
    op monitor interval=2s role=Master timeout=30s on-fail=restart \
    op promote interval=0s timeout=120s on-fail=restart \
    op demote interval=0s timeout=30s on-fail=stop \
    op stop interval=0s timeout=30s on-fail=block \
    op notify interval=0s timeout=30s
primitive SERVICE-pingCheck ocf:pacemaker:ping \
    params host_list=10.40.0.99 name=default_ping_set multiplier=100 \
    op start interval=0s timeout=60s on-fail=restart \
    op monitor interval=2s timeout=60s on-fail=restart \
    op stop interval=0s timeout=60s on-fail=ignore
primitive SERVICE-service ocf:heartbeat:service_service_ocf \
    op monitor interval=7s timeout=30s on-fail=restart
primitive SERVICE-tomcat ocf:heartbeat:tomcat \
    params java_home=/usr/java/default catalina_home=/usr/share/tomcat6 statusurl=http://127.0.0.1:9081/admin catalina_pid=/var/run/tomcat6.pid
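Andrew's crm_attribute scheme above can be wired into placement the same way as the pingd example he links. A hypothetical crm fragment under that assumption (the constraint id is illustrative, and ms-SERVICE-pgsql stands in for the actual master/slave resource name, which is cut off in the configuration above): the rule scores nodes by the value of the 'postgres' node attribute, so a standby whose monitor agent has deleted its attribute accumulates no promotion score.

```
location prefer-ready-nodes ms-SERVICE-pgsql \
    rule $role=Master postgres: defined postgres
```

This mirrors the "prefer nodes with the most connectivity" pattern from Pacemaker Explained, with the custom 'postgres' attribute in place of the pingd attribute.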
Re: [Pacemaker] crm_resource -L not trustable right after restart
On 14 Jan 2014, at 11:50 pm, Brian J. Murrell (brian) br...@interlinx.bc.ca wrote:
On Tue, 2014-01-14 at 16:01 +1100, Andrew Beekhof wrote:
On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote:

The local cib hasn't caught up yet by the looks of it.

I should have asked in my previous message: is this entirely an artifact of having just restarted, or are there other times when the local CIB can in fact be out of date (and thus crm_resource inaccurate), even for a brief period of time? I just want to completely understand the nature of this situation.

Consider any long-running action, such as starting a database. We do not update the CIB until after actions have completed, so there can and will be times when the status section is out of date to one degree or another. Node startup is another point at which the status could potentially be behind.

It sounds to me like you're trying to second-guess the cluster, which is a dangerous path.

It doesn't know that it doesn't know.

But it (pacemaker at least) does know that it's just started up, and should also know whether it's gotten a fresh copy of the CIB since starting up, right?

What if it's the first node to start up? There'd be no fresh copy to arrive in that case. Many things are obvious to external observers that are not at all obvious to the cluster. If it had enough information to know it was out of date, it wouldn't be out of date.

I think I'd consider it required behaviour that pacemaker not consider itself authoritative enough to provide answers like location until it has gotten a fresh copy of the CIB.

Does it show anything as running? Any nodes as online? I'd not expect that it stays in that situation for more than a second or two...

You are probably right about that. But unfortunately that second or two provides a large enough window to provide mis-information.

We could add an option to force crm_resource to use the master instance instead of the local one, I guess.
Or, depending on the answers to the above (like whether this local-is-not-true situation can ever manifest itself at times other than just after a start), perhaps just don't allow crm_resource (or any other tool) to provide information from the local CIB until it's been refreshed at least once since startup.

As above, there are situations where you'd never get an answer.

I would much rather crm_resource experience some latency in being able to provide answers than provide wrong ones. Perhaps there needs to be a switch to indicate whether it should block waiting for the local CIB to be up to date, or return immediately with an "unknown" type response if the local CIB has not yet been updated since a start.

Cheers,
b.
Re: [Pacemaker] hangs pending
On 13 Jan 2014, at 8:31 pm, Andrey Groshev gre...@yandex.ru wrote: 13.01.2014, 02:51, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru: 10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net: On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru wrote: 08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net: On 29 Nov 2013, at 7:17 pm, Andrey Groshev gre...@yandex.ru wrote: Hi, ALL. I'm still trying to cope with the fact that after the fence - node hangs in pending. Please define pending. Where did you see this? In crm_mon: .. Node dev-cluster2-node2 (172793105): pending .. The experiment was like this: Four nodes in cluster. On one of them kill corosync or pacemakerd (signal 4 or 6 oк 11). Thereafter, the remaining start it constantly reboot, under various pretexts, softly whistling, fly low, not a cluster member! ... Then in the log fell out Too many failures All this time in the status in crm_mon is pending. Depending on the wind direction changed to UNCLEAN Much time has passed and I can not accurately describe the behavior... Now I am in the following state: I tried locate the problem. Came here with this. I set big value in property stonith-timeout=600s. And got the following behavior: 1. pkill -4 corosync 2. from node with DC call my fence agent sshbykey 3. It sends reboot victim and waits until she comes to life again. Hmmm what version of pacemaker? This sounds like a timing issue that we fixed a while back Was a version 1.1.11 from December 3. Now try full update and retest. That should be recent enough. Can you create a crm_report the next time you reproduce? Of course yes. Little delay :) .. 
cc1: warnings being treated as errors upstart.c: In function ‘upstart_job_property’: upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’ upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’ upstart.c:264: error: assignment makes pointer from integer without a cast gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1 gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/root/ha/pacemaker/lib' make: *** [core] Error 1 I'm trying to solve this a problem. Do not get solved quickly... https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value g_variant_lookup_value () Since 2.28 # yum list installed glib2 Loaded plugins: fastestmirror, rhnplugin, security This system is receiving updates from RHN Classic or Red Hat Satellite. Loading mirror speeds from cached hostfile Installed Packages glib2.x86_64 2.26.1-3.el6 installed # cat /etc/issue CentOS release 6.5 (Final) Kernel \r on an \m Can you try this patch? 
Upstart jobs won't work, but the code will compile:

diff --git a/lib/services/upstart.c b/lib/services/upstart.c
index 831e7cf..195c3a4 100644
--- a/lib/services/upstart.c
+++ b/lib/services/upstart.c
@@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
 static char *
 upstart_job_property(const char *obj, const gchar * iface, const char *name)
 {
+    char *output = NULL;
+
+#if !GLIB_CHECK_VERSION(2,28,0)
+    static bool err = TRUE;
+
+    if(err) {
+        crm_err("This version of glib is too old to support upstart jobs");
+        err = FALSE;
+    }
+#else
     GError *error = NULL;
     GDBusProxy *proxy;
     GVariant *asv = NULL;
     GVariant *value = NULL;
     GVariant *_ret = NULL;
-    char *output = NULL;

     crm_info("Calling GetAll on %s", obj);
     proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
@@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
     g_object_unref(proxy);
     g_variant_unref(_ret);
+#endif
     return output;
 }

Ok :) I patched the source. Typed make rc - the same error.

Because it's not building your local changes.

Made a new copy via fetch - the same error. It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, it is downloaded; otherwise the existing archive is used.

Cut log:
...
# make rc
make TAG=Pacemaker-1.1.11-rc3 rpm
make[1]: Entering directory `/root/ha/pacemaker'
rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
Re: [Pacemaker] crm_resource -L not trustable right after restart
On 14 Jan 2014, at 5:13 am, Brian J. Murrell (brian) br...@interlinx.bc.ca wrote:

Hi,

I found a situation using pacemaker 1.1.10 on RHEL6.5 where the output of crm_resource -L is not trustable shortly after a node is booted.

Here is the output from crm_resource -L on one of the nodes in a two node cluster (the one that was not rebooted):

st-fencing (stonith:fence_foo): Started
res1 (ocf::foo:Target): Started
res2 (ocf::foo:Target): Started

Here is the output from the same command on the other node in the two node cluster right after it was rebooted:

st-fencing (stonith:fence_foo): Stopped
res1 (ocf::foo:Target): Stopped
res2 (ocf::foo:Target): Stopped

These were collected at the same time (within the same second) on the two nodes. Clearly the rebooted node is not telling the truth. Perhaps the truth for it is "I don't know", which would be fair enough, but that's not what pacemaker is asserting there.

So, how do I know (i.e. programmatically -- what command can I issue to know) if and when crm_resource can be trusted to be truthful?

The local cib hasn't caught up yet by the looks of it. You could compare 'cibadmin -Ql' with 'cibadmin -Q'.

b.
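Andrew's cibadmin comparison can be scripted. A sketch, not an authoritative recipe: the helper pulls the CIB version fields (admin_epoch, epoch, num_updates) out of a <cib ...> header line and normalises their order, since attribute order in the header is not fixed. On a live cluster you would feed it the first line of cibadmin -Ql (local copy) and cibadmin -Q (master copy) and compare; the demo below runs against a sample header string so it works offline.

```shell
# Extract and sort the CIB version fields from a <cib ...> header on stdin.
cib_version() {
    tr -d '<>/' | tr ' ' '\n' |
        grep -E '^(admin_epoch|epoch|num_updates)=' | sort | tr '\n' ' '
}

# On a live cluster (illustrative):
#   [ "$(cibadmin -Ql | head -n1 | cib_version)" = "$(cibadmin -Q | head -n1 | cib_version)" ] \
#       && echo "local CIB is current" || echo "local CIB is behind; don't trust crm_resource -L yet"

# Offline demonstration with a sample header:
echo '<cib epoch="42" num_updates="7" admin_epoch="0">' | cib_version
# -> admin_epoch="0" epoch="42" num_updates="7"
```

This only tells you the local copy matches the master copy at that instant; as discussed above, the status section can still lag behind long-running actions.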
Re: [Pacemaker] [Linux-HA] Better way to change master in 3 node pgsql cluster
On 13 Jan 2014, at 8:32 pm, Andrey Rogovsky a.rogov...@gmail.com wrote:

Hi
I have a 3 node postgresql cluster. It works well, but I have some trouble changing the master. For now, if I need to change the master, I must:
1) Stop PGSQL on each node and the cluster service
2) Set up new manual PGSQL replication
3) Change attributes on each node to point to the new master
4) Stop PGSQL on each node
5) Clean up the resource and start the cluster service
It takes a lot of time. Is there a better way to change the master?

Newer versions support:
crm_resource --resource msPostgresql --ban --master --host a.geocluster.e-autopay.com

This is my cluster service status:
Node Attributes:
* Node a.geocluster.e-autopay.com:
    + master-pgsql:0 : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 2F90
    + pgsql-status : PRI
* Node c.geocluster.e-autopay.com:
    + master-pgsql:0 : 1000
    + pgsql-data-status : SYNC
    + pgsql-status : STOP
* Node b.geocluster.e-autopay.com:
    + master-pgsql:0 : 1000
    + pgsql-data-status : SYNC
    + pgsql-status : STOP

I used http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster for my 3 node cluster, without hard stickiness.

Now I got a strange situation: all nodes stay slave:

Last updated: Sat Dec 7 04:33:47 2013
Last change: Sat Dec 7 12:56:23 2013 via crmd on a
Stack: openais
Current DC: c - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
5 Nodes configured, 3 expected votes
4 Resources configured.
Online: [ a c b ] Master/Slave Set: msPostgresql [pgsql] Slaves: [ a c b ] My config is: node a \ attributes pgsql-data-status=DISCONNECT node b \ attributes pgsql-data-status=DISCONNECT node c \ attributes pgsql-data-status=DISCONNECT primitive pgsql ocf:heartbeat:pgsql \ params pgctl=/usr/lib/postgresql/9.3/bin/pg_ctl psql=/usr/bin/psql pgdata=/var/lib/postgresql/9.3/main start_opt=-p 5432 rep_mode=sync node_list=a b c restore_command=cp /var/lib/postgresql/9.3/pg_archive/%f %p master_ip=192.168.10.200 restart_on_promote=true config=/etc/postgresql/9.3/main/postgresql.conf \ op start interval=0s timeout=60s on-fail=restart \ op monitor interval=4s timeout=60s on-fail=restart \ op monitor interval=3s role=Master timeout=60s on-fail=restart \ op promote interval=0s timeout=60s on-fail=restart \ op demote interval=0s timeout=60s on-fail=stop \ op stop interval=0s timeout=60s on-fail=block \ op notify interval=0s timeout=60s primitive pgsql-master-ip ocf:heartbeat:IPaddr2 \ params ip=192.168.10.200 nic=peervpn0 \ op start interval=0s timeout=60s on-fail=restart \ op monitor interval=10s timeout=60s on-fail=restart \ op stop interval=0s timeout=60s on-fail=block group master pgsql-master-ip ms msPostgresql pgsql \ meta master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 notify=true colocation set_ip inf: master msPostgresql:Master order ip_down 0: msPostgresql:demote master:stop symmetrical=false order ip_up 0: msPostgresql:promote master:start symmetrical=false property $id=cib-bootstrap-options \ dc-version=1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff \ cluster-infrastructure=openais \ expected-quorum-votes=3 \ no-quorum-policy=ignore \ stonith-enabled=false \ crmd-transition-delay=0 \ last-lrm-refresh=1386404222 rsc_defaults $id=rsc-options \ resource-stickiness=100 \ migration-threshold=1 ___ Linux-HA mailing list linux...@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc 
Re: [Pacemaker] Location / Colocation constraints issue
On 19 Dec 2013, at 1:08 am, Gaëtan Slongo gslo...@it-optics.com wrote:

Hi! I'm currently building a 2-node cluster for firewalling. I would like to run shorewall on both the master and the slave node. I tried many things but nothing works as expected. The Shorewall configurations are good. What I want to do is to start shorewall-standby on the other node as soon as my drbd resources are Slave or Stopped. Could you please give me a bit of help with this problem?

It will be something like:

colocation XXX -inf: shorewall-standby drbd_master_slave_ServicesConfigs1:Master
colocation YYY -inf: shorewall-standby drbd_master_slave_ServicesLogs1:Master

Here is my current config. Thanks

node keskonrix1 \
  attributes standby=off
node keskonrix2 \
  attributes standby=off
primitive VIPDMZ ocf:heartbeat:IPaddr2 \
  params ip=10.0.1.1 nic=eth2 cidr_netmask=24 iflabel=VIPDMZ \
  op monitor interval=30s timeout=30s
primitive VIPEXPL ocf:heartbeat:IPaddr2 \
  params ip=10.0.2.2 nic=eth3 cidr_netmask=28 iflabel=VIPEXPL \
  op monitor interval=30s timeout=30s
primitive VIPLAN ocf:heartbeat:IPaddr2 \
  params ip=192.168.1.248 nic=br0 cidr_netmask=16 iflabel=VIPLAN \
  op monitor interval=30s timeout=30s
primitive VIPNET ocf:heartbeat:IPaddr2 \
  params ip=XX.XX.XX.XX nic=eth1 cidr_netmask=29 iflabel=VIPDMZ \
  op monitor interval=30s timeout=30s
primitive VIPPDA ocf:heartbeat:IPaddr2 \
  params ip=XX.XX.XX.XX nic=eth1 cidr_netmask=29 iflabel=VIPPDA \
  op monitor interval=30s timeout=30s
primitive apache2 lsb:apache2 \
  op start interval=0 timeout=15s
primitive bind9 lsb:bind9 \
  op start interval=0 timeout=15s
primitive dansguardian lsb:dansguardian \
  op start interval=0 timeout=30s on-fail=ignore
primitive drbd-ServicesConfigs1 ocf:linbit:drbd \
  params drbd_resource=services-configs1 \
  op monitor interval=29s role=Master \
  op monitor interval=31s role=Slave
primitive drbd-ServicesLogs1 ocf:linbit:drbd \
  params drbd_resource=services-logs1 \
  op monitor interval=29s role=Master \
  op monitor interval=31s role=Slave
primitive fs_ServicesConfigs1 ocf:heartbeat:Filesystem \
  params device=/dev/drbd/by-res/services-configs1 directory=/drbd/services-configs1/ fstype=ext4 options=noatime,nodiratime \
  meta target-role=Started
primitive fs_ServicesLogs1 ocf:heartbeat:Filesystem \
  params device=/dev/drbd/by-res/services-logs1 directory=/drbd/services-logs1/ fstype=ext4 options=noatime,nodiratime \
  meta target-role=Started
primitive ipsec-setkey lsb:setkey \
  op start interval=0 timeout=30s
primitive links_ServicesConfigs1 heartbeat:drbdlinks \
  meta target-role=Started
primitive openvpn lsb:openvpn \
  op monitor interval=10 timeout=30s \
  meta target-role=Started
primitive racoon lsb:racoon \
  op start interval=0 timeout=30s
primitive shorewall lsb:shorewall \
  op start interval=0 timeout=30s \
  meta target-role=Started
primitive shorewall-standby lsb:shorewall \
  op start interval=0 timeout=30s
primitive squid lsb:squid \
  op start interval=0 timeout=15s \
  op stop interval=0 timeout=120s
group IPS-Services1 VIPLAN VIPDMZ VIPPDA VIPEXPL VIPNET \
  meta target-role=Started
group IPSec ipsec-setkey racoon
group Services1 bind9 squid dansguardian apache2 openvpn shorewall
group ServicesData1 fs_ServicesConfigs1 fs_ServicesLogs1 links_ServicesConfigs1
ms drbd_master_slave_ServicesConfigs1 drbd-ServicesConfigs1 \
  meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 globally-unique=false notify=true target-role=Master
ms drbd_master_slave_ServicesLogs1 drbd-ServicesLogs1 \
  meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 globally-unique=false notify=true target-role=Master
colocation Services1_on_drbd inf: drbd_master_slave_ServicesConfigs1:Master drbd_master_slave_ServicesLogs1:Master ServicesData1 IPS-Services1 Services1 IPSec
colocation start-shorewall_standby-on-passive-node -inf: shorewall-standby shorewall
order all_drbd inf: shorewall-standby:stop drbd_master_slave_ServicesConfigs1:promote drbd_master_slave_ServicesLogs1:promote ServicesData1:start IPS-Services1:start IPSec:start Services1:start
property $id=cib-bootstrap-options \
  dc-version=1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff \
  cluster-infrastructure=openais \
  expected-quorum-votes=2 \
  stonith-enabled=false \
  no-quorum-policy=ignore
rsc_defaults $id=rsc-options \
  resource-stickiness=100

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started:
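The reply's suggestion maps directly onto the resource names in the configuration above. A minimal sketch (the constraint names are made up for illustration, and this is untested):

```text
# Keep shorewall-standby away from whichever node holds the DRBD masters,
# i.e. run it only where the ms resources are Slave or Stopped.
colocation standby-not-with-configs-master -inf: shorewall-standby drbd_master_slave_ServicesConfigs1:Master
colocation standby-not-with-logs-master -inf: shorewall-standby drbd_master_slave_ServicesLogs1:Master
```

Whether the second rule is needed in addition to the first depends on whether the two ms resources can ever be promoted on different nodes; in this configuration Services1_on_drbd keeps them together.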
Re: [Pacemaker] hangs pending
Apart from anything else, your timeout needs to be bigger: Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired) On 14 Jan 2014, at 7:18 am, Andrew Beekhof and...@beekhof.net wrote: On 13 Jan 2014, at 8:31 pm, Andrey Groshev gre...@yandex.ru wrote: 13.01.2014, 02:51, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru: 10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net: On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru wrote: 08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net: On 29 Nov 2013, at 7:17 pm, Andrey Groshev gre...@yandex.ru wrote: Hi, ALL. I'm still trying to cope with the fact that after the fence - node hangs in pending. Please define pending. Where did you see this? In crm_mon: .. Node dev-cluster2-node2 (172793105): pending .. The experiment was like this: Four nodes in cluster. On one of them kill corosync or pacemakerd (signal 4 or 6 oк 11). Thereafter, the remaining start it constantly reboot, under various pretexts, softly whistling, fly low, not a cluster member! ... Then in the log fell out Too many failures All this time in the status in crm_mon is pending. Depending on the wind direction changed to UNCLEAN Much time has passed and I can not accurately describe the behavior... Now I am in the following state: I tried locate the problem. Came here with this. I set big value in property stonith-timeout=600s. And got the following behavior: 1. pkill -4 corosync 2. from node with DC call my fence agent sshbykey 3. It sends reboot victim and waits until she comes to life again. Hmmm what version of pacemaker? 
This sounds like a timing issue that we fixed a while back. Was a version 1.1.11 from December 3. Now try full update and retest. That should be recent enough. Can you create a crm_report the next time you reproduce? Of course yes. A little delay :) ..

cc1: warnings being treated as errors
upstart.c: In function ‘upstart_job_property’:
upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
upstart.c:264: error: assignment makes pointer from integer without a cast
gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/ha/pacemaker/lib'
make: *** [core] Error 1

I'm trying to solve this problem. It will not get solved quickly...

https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
g_variant_lookup_value () — Since 2.28

# yum list installed glib2
Loaded plugins: fastestmirror, rhnplugin, security
This system is receiving updates from RHN Classic or Red Hat Satellite.
Loading mirror speeds from cached hostfile
Installed Packages
glib2.x86_64  2.26.1-3.el6  installed

# cat /etc/issue
CentOS release 6.5 (Final)
Kernel \r on an \m

Can you try this patch?
Upstart jobs won't work, but the code will compile:

diff --git a/lib/services/upstart.c b/lib/services/upstart.c
index 831e7cf..195c3a4 100644
--- a/lib/services/upstart.c
+++ b/lib/services/upstart.c
@@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
 static char *
 upstart_job_property(const char *obj, const gchar * iface, const char *name)
 {
+    char *output = NULL;
+
+#if !GLIB_CHECK_VERSION(2,28,0)
+    static bool err = TRUE;
+
+    if(err) {
+        crm_err("This version of glib is too old to support upstart jobs");
+        err = FALSE;
+    }
+#else
     GError *error = NULL;
     GDBusProxy *proxy;
     GVariant *asv = NULL;
     GVariant *value = NULL;
     GVariant *_ret = NULL;
-    char *output = NULL;

     crm_info("Calling GetAll on %s", obj);
     proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
@@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)

     g_object_unref(proxy);
     g_variant_unref(_ret);
+#endif
     return output;
 }

Ok :) I patched the source. Typed "make rc" - the same error. Because it's not building your local changes. Made a new copy via fetch - the same error. It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, then it is downloaded. Otherwise use
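The root cause above is glib 2.26 lacking g_variant_lookup_value(), which appeared in 2.28. A small pre-build check along these lines could catch that before make does (a sketch; the pkg-config module name glib-2.0 and the 2.28.0 floor are taken from the thread, the rest is illustrative):

```shell
#!/bin/sh
# Sketch: warn before building if the installed glib2 is older than 2.28,
# the first release with g_variant_lookup_value().

# version_lt A B: succeed if version A sorts strictly before version B
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

required="2.28.0"
installed=$(pkg-config --modversion glib-2.0 2>/dev/null || echo "0")

if version_lt "$installed" "$required"; then
    echo "glib2 $installed < $required: upstart job support will not compile"
else
    echo "glib2 $installed is new enough"
fi
```

On the CentOS 6.5 box above this would report 2.26.1-3.el6 as too old, matching the compile failure.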
Re: [Pacemaker] hangs pending
On 14 Jan 2014, at 1:19 pm, Andrew Beekhof and...@beekhof.net wrote: Apart from anything else, your timeout needs to be bigger: Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired) also: Jan 13 12:04:54 [17226] dev-cluster2-node1.unix.tensor.rupengine: ( utils.c:723 ) error: unpack_operation: Specifying on_fail=fence and stonith-enabled=false makes no sense On 14 Jan 2014, at 7:18 am, Andrew Beekhof and...@beekhof.net wrote: On 13 Jan 2014, at 8:31 pm, Andrey Groshev gre...@yandex.ru wrote: 13.01.2014, 02:51, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru: 10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net: On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru wrote: 08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net: On 29 Nov 2013, at 7:17 pm, Andrey Groshev gre...@yandex.ru wrote: Hi, ALL. I'm still trying to cope with the fact that after the fence - node hangs in pending. Please define pending. Where did you see this? In crm_mon: .. Node dev-cluster2-node2 (172793105): pending .. The experiment was like this: Four nodes in cluster. On one of them kill corosync or pacemakerd (signal 4 or 6 oк 11). Thereafter, the remaining start it constantly reboot, under various pretexts, softly whistling, fly low, not a cluster member! ... Then in the log fell out Too many failures All this time in the status in crm_mon is pending. Depending on the wind direction changed to UNCLEAN Much time has passed and I can not accurately describe the behavior... Now I am in the following state: I tried locate the problem. Came here with this. 
I set big value in property stonith-timeout=600s. And got the following behavior: 1. pkill -4 corosync 2. from node with DC call my fence agent sshbykey 3. It sends reboot victim and waits until she comes to life again. Hmmm what version of pacemaker? This sounds like a timing issue that we fixed a while back Was a version 1.1.11 from December 3. Now try full update and retest. That should be recent enough. Can you create a crm_report the next time you reproduce? Of course yes. Little delay :) .. cc1: warnings being treated as errors upstart.c: In function ‘upstart_job_property’: upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’ upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’ upstart.c:264: error: assignment makes pointer from integer without a cast gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1 gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/root/ha/pacemaker/lib' make: *** [core] Error 1 I'm trying to solve this a problem. Do not get solved quickly... https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value g_variant_lookup_value () Since 2.28 # yum list installed glib2 Loaded plugins: fastestmirror, rhnplugin, security This system is receiving updates from RHN Classic or Red Hat Satellite. Loading mirror speeds from cached hostfile Installed Packages glib2.x86_64 2.26.1-3.el6 installed # cat /etc/issue CentOS release 6.5 (Final) Kernel \r on an \m Can you try this patch? 
Upstart jobs won't work, but the code will compile:

diff --git a/lib/services/upstart.c b/lib/services/upstart.c
index 831e7cf..195c3a4 100644
--- a/lib/services/upstart.c
+++ b/lib/services/upstart.c
@@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
 static char *
 upstart_job_property(const char *obj, const gchar * iface, const char *name)
 {
+    char *output = NULL;
+
+#if !GLIB_CHECK_VERSION(2,28,0)
+    static bool err = TRUE;
+
+    if(err) {
+        crm_err("This version of glib is too old to support upstart jobs");
+        err = FALSE;
+    }
+#else
     GError *error = NULL;
     GDBusProxy *proxy;
     GVariant *asv = NULL;
     GVariant *value = NULL;
     GVariant *_ret = NULL;
-    char *output = NULL;

     crm_info("Calling GetAll on %s", obj);
     proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
@@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)

     g_object_unref(proxy);
     g_variant_unref(_ret);
+#endif
     return output;
 }

Ok :) I patched the source
Re: [Pacemaker] hangs pending
Ok, here's what happens:

1. node2 is lost
2. fencing of node2 starts
3. node2 reboots (and the cluster starts)
4. node2 returns to the membership
5. node2 is marked as a cluster member
6. The DC tries to bring it into the cluster, but needs to cancel the active transition first. Which is a problem, since the node2 fencing operation is part of that transition.
7. node2 is in a transition (pending) state until fencing passes or fails
8a. fencing fails: the transition completes and the node joins the cluster. That's the theory, except we automatically try again, which isn't appropriate. This should be relatively easy to fix.
8b. fencing passes: the node is incorrectly marked as offline. This I have no idea how to fix yet.

On another note, it doesn't look like this agent works at all. The node has been back online for a long time and the agent is still timing out after 10 minutes. So "Once the script makes sure that the victim will rebooted and again available via ssh - it exit with 0." does not seem true.

On 14 Jan 2014, at 1:19 pm, Andrew Beekhof and...@beekhof.net wrote: Apart from anything else, your timeout needs to be bigger: Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired) On 14 Jan 2014, at 7:18 am, Andrew Beekhof and...@beekhof.net wrote: On 13 Jan 2014, at 8:31 pm, Andrey Groshev gre...@yandex.ru wrote: 13.01.2014, 02:51, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru: 10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net: On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru wrote: 08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net: On 29 Nov 2013, at
7:17 pm, Andrey Groshev gre...@yandex.ru wrote: Hi, ALL. I'm still trying to cope with the fact that after the fence - node hangs in pending. Please define pending. Where did you see this? In crm_mon: .. Node dev-cluster2-node2 (172793105): pending .. The experiment was like this: Four nodes in cluster. On one of them kill corosync or pacemakerd (signal 4 or 6 oк 11). Thereafter, the remaining start it constantly reboot, under various pretexts, softly whistling, fly low, not a cluster member! ... Then in the log fell out Too many failures All this time in the status in crm_mon is pending. Depending on the wind direction changed to UNCLEAN Much time has passed and I can not accurately describe the behavior... Now I am in the following state: I tried locate the problem. Came here with this. I set big value in property stonith-timeout=600s. And got the following behavior: 1. pkill -4 corosync 2. from node with DC call my fence agent sshbykey 3. It sends reboot victim and waits until she comes to life again. Hmmm what version of pacemaker? This sounds like a timing issue that we fixed a while back Was a version 1.1.11 from December 3. Now try full update and retest. That should be recent enough. Can you create a crm_report the next time you reproduce? Of course yes. Little delay :) .. cc1: warnings being treated as errors upstart.c: In function ‘upstart_job_property’: upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’ upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’ upstart.c:264: error: assignment makes pointer from integer without a cast gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1 gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/root/ha/pacemaker/lib' make: *** [core] Error 1 I'm trying to solve this a problem. Do not get solved quickly... 
https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value g_variant_lookup_value () Since 2.28 # yum list installed glib2 Loaded plugins: fastestmirror, rhnplugin, security This system is receiving updates from RHN Classic or Red Hat Satellite. Loading mirror speeds from cached hostfile Installed Packages glib2.x86_64 2.26.1-3.el6 installed # cat /etc/issue CentOS release 6.5 (Final) Kernel \r on an \m Can you try this patch? Upstart jobs wont work, but the code will compile diff --git a/lib/services/upstart.c b/lib/services/upstart.c index 831e7cf..195c3a4 100644 --- a/lib/services/upstart.c +++ b/lib/services/upstart.c @@ -231,12 +231,21 @@ upstart_job_exists(const
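For what it's worth, the "waits until she comes to life again" step is where the agent in this thread appears to get stuck. A minimal sketch of that step (the probe command, 5s poll interval and 300s default budget are all assumptions; the real agent's internals are unknown) — the key point is to return 0 the moment the victim answers, well inside stonith-timeout:

```shell
#!/bin/sh
# Sketch of the "wait until the victim is back" step of an ssh-based fence
# agent: exit 0 as soon as the node answers, and fail cleanly (not hang)
# when it never does.

# probe HOST: succeed when HOST accepts an ssh login again (hypothetical)
probe() {
    ssh -o ConnectTimeout=5 "root@$1" true 2>/dev/null
}

# wait_until_back HOST [BUDGET]: poll until probe succeeds or BUDGET expires
wait_until_back() {
    host=$1
    deadline=$(( $(date +%s) + ${2:-300} ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        if probe "$host"; then
            return 0    # victim is reachable again: report success promptly
        fi
        sleep 5
    done
    return 1            # let stonith see a real failure instead of a timeout
}
```

With stonith-timeout=600s, the agent's own budget should be comfortably smaller, so stonith-ng sees either a prompt success or an explicit failure rather than -62 (Timer expired).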
Re: [Pacemaker] hangs pending
On 14 Jan 2014, at 3:34 pm, Andrey Groshev gre...@yandex.ru wrote: 14.01.2014, 06:25, Andrew Beekhof and...@beekhof.net: Apart from anything else, your timeout needs to be bigger: Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired) Bigger than that? See my other email, the agent is broken. By :21 node2 had long since booted and was working (almost). Exactly, so why didn't the agent return?

# cat /var/log/cluster/mystonith.log
...
Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(): getinfo-devdescr
Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(): getinfo-devid
Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(): getinfo-xml
Mon Jan 13 11:48:46 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): getconfignames
Mon Jan 13 11:48:46 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): status
Mon Jan 13 12:11:37 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): getconfignames
Mon Jan 13 12:11:37 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): reset dev-cluster2-node2.unix.tensor.ru
Mon Jan 13 12:11:37 MSK 2014 Now boot time 1389256739, send reboot
...
On 14 Jan 2014, at 7:18 am, Andrew Beekhof and...@beekhof.net wrote: On 13 Jan 2014, at 8:31 pm, Andrey Groshev gre...@yandex.ru wrote: 13.01.2014, 02:51, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru: 10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net: On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru wrote: 08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net: On 29 Nov 2013, at 7:17 pm, Andrey Groshev gre...@yandex.ru wrote: Hi, ALL. I'm still trying to cope with the fact that after the fence - node hangs in pending. Please define pending. Where did you see this? In crm_mon: .. Node dev-cluster2-node2 (172793105): pending .. The experiment was like this: Four nodes in cluster. On one of them kill corosync or pacemakerd (signal 4 or 6 oк 11). Thereafter, the remaining start it constantly reboot, under various pretexts, softly whistling, fly low, not a cluster member! ... Then in the log fell out Too many failures All this time in the status in crm_mon is pending. Depending on the wind direction changed to UNCLEAN Much time has passed and I can not accurately describe the behavior... Now I am in the following state: I tried locate the problem. Came here with this. I set big value in property stonith-timeout=600s. And got the following behavior: 1. pkill -4 corosync 2. from node with DC call my fence agent sshbykey 3. It sends reboot victim and waits until she comes to life again. Hmmm what version of pacemaker? This sounds like a timing issue that we fixed a while back Was a version 1.1.11 from December 3. Now try full update and retest. That should be recent enough. Can you create a crm_report the next time you reproduce? Of course yes. Little delay :) .. 
cc1: warnings being treated as errors upstart.c: In function ‘upstart_job_property’: upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’ upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’ upstart.c:264: error: assignment makes pointer from integer without a cast gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1 gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/root/ha/pacemaker/lib' make: *** [core] Error 1 I'm trying to solve this a problem. Do not get solved quickly... https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value g_variant_lookup_value () Since 2.28 # yum list installed glib2 Loaded plugins: fastestmirror, rhnplugin, security This system is receiving updates from RHN Classic or Red Hat Satellite. Loading mirror speeds from cached hostfile Installed Packages glib2.x86_64
Re: [Pacemaker] crm_resource -L not trustable right after restart
On 14 Jan 2014, at 3:41 pm, Brian J. Murrell (brian) br...@interlinx.bc.ca wrote: On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote: The local cib hasn't caught up yet by the looks of it. Should crm_resource actually be [mis-]reporting as if it were knowledgeable when it's not, though? IOW, is this expected behaviour or should it be considered a bug? Should I open a ticket? It doesn't know that it doesn't know. Does it show anything as running? Any nodes as online? I'd not expect that it stays in that situation for more than a second or two... You could compare 'cibadmin -Ql' with 'cibadmin -Q'. Is there no other way to force crm_resource to be truthful/accurate, or silent if it cannot be truthful/accurate? Having to run this kind of pre-check before every crm_resource --locate seems like it's going to drive overhead up quite a bit. True. Maybe I am using the wrong tool for the job. Is there a better tool than crm_resource to ascertain, with full truthfulness (or silence if truthfulness is not possible), where resources are running? We could add an option to force crm_resource to use the master instance instead of the local one, I guess.
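Until such an option exists, the suggested 'cibadmin -Ql' vs 'cibadmin -Q' comparison could be wrapped once rather than typed before every query. A hedged sketch — reducing "in sync" to equal admin_epoch/epoch/num_updates on the <cib> element is an assumption for illustration, not an official interface, and the attribute order in real output may differ:

```shell
#!/bin/sh
# Sketch: refuse to trust crm_resource --locate until the local CIB matches
# the master copy. "cibadmin -Q" asks the master instance, "-Ql" the local one.

# cib_version FLAG: print admin_epoch.epoch.num_updates from the <cib> element
cib_version() {
    cibadmin "$1" 2>/dev/null | head -n1 |
        sed -n 's/.*admin_epoch="\([0-9]*\)" epoch="\([0-9]*\)" num_updates="\([0-9]*\)".*/\1.\2.\3/p'
}

# locate_resource_safely RESOURCE: answer only when local == master
locate_resource_safely() {
    if [ -n "$(cib_version -Q)" ] && [ "$(cib_version -Q)" = "$(cib_version -Ql)" ]; then
        crm_resource --locate --resource "$1"
    else
        echo "local CIB not yet in sync; refusing to answer" >&2
        return 1
    fi
}
```

This still pays the pre-check overhead the poster objects to, but at least makes the "silent if it cannot be truthful" behaviour explicit.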
Re: [Pacemaker] A resource starts with a standby node.(Latest attrd does not serve as the crmd-transition-delay parameter)
On 14 Jan 2014, at 3:52 pm, renayama19661...@ybb.ne.jp wrote: Hi All, I filed the following bugzilla for a problem that occurs because of a difference in the timing of the attribute updates by attrd: * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528 We can avoid this problem now by using the crmd-transition-delay parameter. I checked whether I could avoid this problem with the recently renewed attrd. * In the latest attrd, one instance becomes a leader and comes to update the attributes. However, the latest attrd does not seem to substitute for crmd-transition-delay. * I will contribute a detailed log later. We are dissatisfied with continuing to use crmd-transition-delay. Is there a plan for attrd to handle this problem well in the future? Are you using the new attrd code or the legacy stuff? If you're not using corosync 2.x, or you see: crm_notice("Starting mainloop..."); then it's the old code. The new code could also be used with CMAN but isn't configured to build in that situation. Only the new code makes (or at least should make) crmd-transition-delay redundant.
Re: [Pacemaker] A resource starts with a standby node.(Latest attrd does not serve as the crmd-transition-delay parameter)
On 14 Jan 2014, at 4:13 pm, renayama19661...@ybb.ne.jp wrote: Hi Andrew, Thank you for comments. Are you using the new attrd code or the legacy stuff? I use new attrd. And the values are not being sent to the cib at the same time? If you're not using corosync 2.x or see: crm_notice(Starting mainloop...); then its the old code. The new code could also be used with CMAN but isn't configured to build for in that situation. Only the new code makes (or at least should do) crmd-transition-delay redundant. It did not seem to work so that new attrd dispensed with crmd-transition-delay to me. I report the details again. # Probably it will be Bugzilla. . . Sounds good Best Regards, Hideo Yamauchi. --- On Tue, 2014/1/14, Andrew Beekhof and...@beekhof.net wrote: On 14 Jan 2014, at 3:52 pm, renayama19661...@ybb.ne.jp wrote: Hi All, I contributed next bugzilla by a problem to occur for the difference of the timing of the attribute update by attrd before. * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528 We can evade this problem now by using crmd-transition-delay parameter. I confirmed whether I could evade this problem by renewed attrd recently. * In latest attrd, one became a leader and seemed to come to update an attribute. However, latest attrd does not seem to substitute for crmd-transition-delay. * I contribute detailed log later. We are dissatisfied with continuing using crmd-transition-delay. Is there the plan when attrd handles this problem well in the future? Are you using the new attrd code or the legacy stuff? If you're not using corosync 2.x or see: crm_notice(Starting mainloop...); then its the old code. The new code could also be used with CMAN but isn't configured to build for in that situation. Only the new code makes (or at least should do) crmd-transition-delay redundant. 
Re: [Pacemaker] A resource starts with a standby node.(Latest attrd does not serve as the crmd-transition-delay parameter)
On 14 Jan 2014, at 4:33 pm, renayama19661...@ybb.ne.jp wrote: Hi Andrew, Are you using the new attrd code or the legacy stuff? I use new attrd. And the values are not being sent to the cib at the same time? As far as I looked. . . When the transmission of the attribute of attrd of the node was late, a leader of attrd seemed to send an attribute to cib without waiting for it. And you have a delay configured? And this value was set prior to that delay expiring? Only the new code makes (or at least should do) crmd-transition-delay redundant. It did not seem to work so that new attrd dispensed with crmd-transition-delay to me. I report the details again. # Probably it will be Bugzilla. . . Sounds good All right! Many Thanks! Hideo Yamauch. --- On Tue, 2014/1/14, Andrew Beekhof and...@beekhof.net wrote: On 14 Jan 2014, at 4:13 pm, renayama19661...@ybb.ne.jp wrote: Hi Andrew, Thank you for comments. Are you using the new attrd code or the legacy stuff? I use new attrd. And the values are not being sent to the cib at the same time? If you're not using corosync 2.x or see: crm_notice(Starting mainloop...); then its the old code. The new code could also be used with CMAN but isn't configured to build for in that situation. Only the new code makes (or at least should do) crmd-transition-delay redundant. It did not seem to work so that new attrd dispensed with crmd-transition-delay to me. I report the details again. # Probably it will be Bugzilla. . . Sounds good Best Regards, Hideo Yamauchi. --- On Tue, 2014/1/14, Andrew Beekhof and...@beekhof.net wrote: On 14 Jan 2014, at 3:52 pm, renayama19661...@ybb.ne.jp wrote: Hi All, I contributed next bugzilla by a problem to occur for the difference of the timing of the attribute update by attrd before. * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528 We can evade this problem now by using crmd-transition-delay parameter. I confirmed whether I could evade this problem by renewed attrd recently. 
* In the latest attrd, one instance becomes a leader and comes to update the attributes. However, the latest attrd does not seem to substitute for crmd-transition-delay. * I will contribute a detailed log later. We are dissatisfied with continuing to use crmd-transition-delay. Is there a plan for attrd to handle this problem well in the future? Are you using the new attrd code or the legacy stuff? If you're not using corosync 2.x, or you see: crm_notice("Starting mainloop..."); then it's the old code. The new code could also be used with CMAN but isn't configured to build in that situation. Only the new code makes (or at least should make) crmd-transition-delay redundant.
Re: [Pacemaker] [PATCH] Downgrade probe log message for promoted ms resources
Fair enough. Pull request? On 12 Jan 2014, at 8:29 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote: Hi, This is the only message I see in the logs of an otherwise static cluster (with rechecks enabled); it is probably a good idea to downgrade it to info.

diff --git a/lib/pengine/unpack.c b/lib/pengine/unpack.c
index 97e114f..6dbcf19 100644
--- a/lib/pengine/unpack.c
+++ b/lib/pengine/unpack.c
@@ -2515,7 +2515,7 @@ determine_op_status(
         case PCMK_OCF_RUNNING_MASTER:
             if (is_probe) {
                 result = PCMK_LRM_OP_DONE;
-                crm_notice("Operation %s found resource %s active in master mode on %s",
+                pe_rsc_info(rsc, "Operation %s found resource %s active in master mode on %s",
                            task, rsc->id, node->details->uname);

             } else if (target_rc == rc) {
Re: [Pacemaker] hangs pending
On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru: 10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net: On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru wrote: 08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net: On 29 Nov 2013, at 7:17 pm, Andrey Groshev gre...@yandex.ru wrote: Hi, ALL. I'm still trying to cope with the fact that after the fence - node hangs in pending. Please define pending. Where did you see this? In crm_mon: .. Node dev-cluster2-node2 (172793105): pending .. The experiment was like this: Four nodes in cluster. On one of them kill corosync or pacemakerd (signal 4 or 6 oк 11). Thereafter, the remaining start it constantly reboot, under various pretexts, softly whistling, fly low, not a cluster member! ... Then in the log fell out Too many failures All this time in the status in crm_mon is pending. Depending on the wind direction changed to UNCLEAN Much time has passed and I can not accurately describe the behavior... Now I am in the following state: I tried locate the problem. Came here with this. I set big value in property stonith-timeout=600s. And got the following behavior: 1. pkill -4 corosync 2. from node with DC call my fence agent sshbykey 3. It sends reboot victim and waits until she comes to life again. Hmmm what version of pacemaker? This sounds like a timing issue that we fixed a while back Was a version 1.1.11 from December 3. Now try full update and retest. That should be recent enough. Can you create a crm_report the next time you reproduce? Of course yes. Little delay :) .. 
cc1: warnings being treated as errors
upstart.c: In function ‘upstart_job_property’:
upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
upstart.c:264: error: assignment makes pointer from integer without a cast
gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/ha/pacemaker/lib'
make: *** [core] Error 1

I'm trying to solve this problem. It will not be solved quickly...
https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
g_variant_lookup_value () Since 2.28

# yum list installed glib2
Loaded plugins: fastestmirror, rhnplugin, security
This system is receiving updates from RHN Classic or Red Hat Satellite.
Loading mirror speeds from cached hostfile
Installed Packages
glib2.x86_64 2.26.1-3.el6 installed
# cat /etc/issue
CentOS release 6.5 (Final)
Kernel \r on an \m

Can you try this patch?
Upstart jobs won't work, but the code will compile.

diff --git a/lib/services/upstart.c b/lib/services/upstart.c
index 831e7cf..195c3a4 100644
--- a/lib/services/upstart.c
+++ b/lib/services/upstart.c
@@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
 static char *
 upstart_job_property(const char *obj, const gchar * iface, const char *name)
 {
+    char *output = NULL;
+
+#if !GLIB_CHECK_VERSION(2,28,0)
+    static bool err = TRUE;
+
+    if(err) {
+        crm_err("This version of glib is too old to support upstart jobs");
+        err = FALSE;
+    }
+#else
     GError *error = NULL;
     GDBusProxy *proxy;
     GVariant *asv = NULL;
     GVariant *value = NULL;
     GVariant *_ret = NULL;
-    char *output = NULL;

     crm_info("Calling GetAll on %s", obj);
     proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
@@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)

     g_object_unref(proxy);
     g_variant_unref(_ret);
+#endif
     return output;
 }

Once the script makes sure that the victim has been rebooted and is again available via ssh, it exits with 0. All commands are logged on both the victim and the killer - all right. 4. A little later, the status of the victim node in crm_mon changes to online. 5. BUT... not one resource starts! Despite the fact that crm_simulate -sL shows the correct resource to start: * Start pingCheck:3 (dev-cluster2-node2) 6. In this state, we spend the next 600 seconds. After this timeout expires, another node (not the DC) decides to kill our victim again. All commands are again logged on both the victim and the killer - all documented :) 7. NOW all resources start in the right sequence. I am almost happy, but I do not like: two reboots and 10 minutes
Re: [Pacemaker] again return code, now in crm_attribute
On 10 Jan 2014, at 6:18 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 10:15, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 4:38 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 09:06, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 3:51 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 03:28, Andrew Beekhof and...@beekhof.net: On 9 Jan 2014, at 4:44 pm, Andrey Groshev gre...@yandex.ru wrote: 09.01.2014, 02:39, Andrew Beekhof and...@beekhof.net: On 18 Dec 2013, at 11:55 pm, Andrey Groshev gre...@yandex.ru wrote: Hi, Andrew and ALL. I'm sorry, but I have again found an error. :) Crux of the problem:

# crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $?
scope=crm_config name=stonith-enabled value=true
0
# crm_attribute --type crm_config --attr-name stonith-enabled --update firstval ; echo $?
0
# crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $?
scope=crm_config name=stonith-enabled value=firstval
0
# crm_attribute --type crm_config --attr-name stonith-enabled --update secondval --lifetime=reboot ; echo $?
0
# crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $?
scope=crm_config name=stonith-enabled value=firstval
0
# crm_attribute --type crm_config --attr-name stonith-enabled --update thirdval --lifetime=forever ; echo $?
0
# crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $?
scope=crm_config name=stonith-enabled value=firstval
0

I.e., if you specify the lifetime of an attribute, the attribute is not updated. If it is impossible to set the lifetime of the attribute when setting it, an error must be returned. Agreed. I'll reproduce and get back to you. However, I was able to review the code; the problem comes when both the --type and --lifetime options are used. One variant in a case without a break;. Unfortunately, I did not have time to dive into the logic. Actually, the logic is correct.
The command: # crm_attribute --type crm_config --attr-name stonith-enabled --update secondval --lifetime=reboot ; echo $? is invalid. You only get to specify --type OR --lifetime, not both. By specifying --lifetime, you're creating a node attribute, not a cluster property. With this, I do not argue. I think the exit code should be NON-ZERO, i.e. it's an error! No, it's setting a value, just not where you thought (or where you're looking for it in the next command). It's the same as writing: crm_attribute --type crm_config --type status --attr-name stonith-enabled --update secondval; echo $? Only the last value for --type wins. Because of this, confusion results. Here is an example from the old cluster: # crm_attribute --type crm_config --attr-name test1 --update val1 --lifetime=reboot ; echo $? 0 # cibadmin -Q|grep test1 <nvpair id="status-test-ins-db2-test1" name="test1" value="val1"/> Did --lifetime win? Yes. Because it was specified last. Is it not easier to produce an error when trying to use incompatible options? They're not incompatible. They're aliases for each other in a different context. Ok, I understood you. Let's say you're right. :) In the end, if you change the order of words in a human language, the meaning of a sentence can change. But suppose I have a strange desire, and I write: crm_attribute --attr-name attr1 --update val1 --lifetime=reboot --type crm_config. I mean that I want to set some attribute on the cluster, and that this attribute should disappear if the cluster is restarted. This functionality does not and cannot exist for anything other than node attributes. If I change the order of the arguments, the meaning of the sentence still does not change. But what do you say to that? There is a 100% chance you will say that the sentence is not correct. Why? Because the lifetime is not quite a lifetime and cannot be used in the context of the properties of the cluster?
I.e., you gave me back 1 and crm_attribute returned 0 :) Hence this uncertainty of "it was meant...", "it was meant...", "it was meant..." And if possible then the value should be established. In general, something is wrong. Unfortunately I have not yet looked deeper, because I struggle with STONITH :) P.S. Andrew! Late congratulations on the new addition to the family. This is a fine time - now you will have toys which you did not have in your childhood.
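The "last value for --type wins" behavior discussed above can be modeled with a tiny last-wins option parser. This is a sketch under the assumption (consistent with the cibadmin output in the thread) that --lifetime=reboot acts as an alias for the status section; the mapping table and function name are illustrative, not crm_attribute's actual code:

```python
LIFETIME_SCOPE = {"reboot": "status", "forever": "nodes"}  # assumed alias mapping

def effective_scope(argv):
    """Both --type and --lifetime write the same 'scope' variable, so
    whichever option appears last silently decides where the attribute
    lands; no error is raised and the exit code stays 0."""
    scope = None
    i = 0
    while i < len(argv):
        if argv[i] == "--type":
            scope = argv[i + 1]
            i += 2
        elif argv[i].startswith("--lifetime="):
            scope = LIFETIME_SCOPE[argv[i].split("=", 1)[1]]
            i += 1
        else:
            i += 1
    return scope

# --lifetime comes last, so the intended cluster property silently
# becomes a node attribute in the status section:
print(effective_scope(["--type", "crm_config", "--attr-name", "test1",
                       "--update", "val1", "--lifetime=reboot"]))  # -> status
```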
Re: [Pacemaker] Manual fence confirmation by stonith_admin doesn't work again.
On 10 Jan 2014, at 3:54 pm, Nikita Staroverov nsfo...@gmail.com wrote: There is no-one to tell yet. We have to wait for cman to decide something needs fencing before pacemaker can perform the notification. If I get you right, I need my own fencing agent that does a manually confirmed fence action with cman+pacemaker configurations, do I? No. Manual fencing confirmations can only work after CMAN asks for the node to be fenced. You can't tell it in advance. I'll try to explain my problem again. :) In my test setup a cluster member was manually powered off by me, and the fence device was powered off too. crm_mon shows the host as offline unclean. cman shows that the host needs fencing and tries to do that via fence_pcmk. The fence agent in pacemaker can't do that because the fence device is unreachable. I do stonith_admin -C for the node. After that crm_mon shows the host as offline. cman tries to do fencing indefinitely. I think pacemaker doesn't notify cman that fencing was confirmed by the sysadmin. The code seems to be trying to... Do you see any logs from tengine_stonith_notify after you run stonith_admin? crm_report?
Re: [Pacemaker] hangs pending
On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net: On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru wrote: 08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net: On 29 Nov 2013, at 7:17 pm, Andrey Groshev gre...@yandex.ru wrote: Hi, ALL. I'm still trying to cope with the fact that after the fence the node hangs in pending. Please define pending. Where did you see this? In crm_mon: .. Node dev-cluster2-node2 (172793105): pending .. The experiment was like this: four nodes in the cluster. On one of them, kill corosync or pacemakerd (signal 4, 6 or 11). Thereafter the remaining nodes constantly reboot it, under various pretexts ("softly whistling, flying low, not a cluster member!"...). Then "Too many failures" fell out in the log. All this time the status in crm_mon is pending, sometimes changing to UNCLEAN depending on the wind direction. Much time has passed and I cannot accurately describe the behavior... Now I am in the following state: I tried to locate the problem and came here with this. I set a big value in the property stonith-timeout=600s and got the following behavior: 1. pkill -4 corosync 2. The node with the DC calls my fence agent, sshbykey. 3. It sends a reboot to the victim and waits until it comes back to life. Hmmm, what version of pacemaker? This sounds like a timing issue that we fixed a while back. It was a version 1.1.11 from December 3. Now I will try a full update and retest. That should be recent enough. Can you create a crm_report the next time you reproduce? Once the script makes sure that the victim has been rebooted and is again available via ssh, it exits with 0. All commands are logged on both the victim and the killer - all right. 4. A little later, the status of the victim node in crm_mon changes to online. 5. BUT... not one resource starts! Despite the fact that crm_simulate -sL shows the correct resource to start: * Start pingCheck:3 (dev-cluster2-node2) 6. In this state, we spend the next 600 seconds.
After this timeout expires, another node (not the DC) decides to kill our victim again. All commands are again logged on both the victim and the killer - all documented :) 7. NOW all resources start in the right sequence. I am almost happy, but I do not like: two reboots and 10 minutes of waiting ;) And if something happens on another node, this behavior is superimposed on the old one, and no resources start until the last node has rebooted twice. I tried to understand this behavior. As I understand it: 1. Ultimately, ./lib/fencing/st_client.c calls internal_stonith_action_execute(). 2. It forks and creates a pipe to the child. 3. It asynchronously calls mainloop_child_add with a callback to stonith_action_async_done. 4. It adds timeouts via g_timeout_add for the TERM and KILL signals. If all goes well, it must call stonith_action_async_done and remove the timeout. For some reason this does not happen. I sit and think. At this time, there are constant re-elections. Also, I noticed a difference when pacemaker starts. At normal startup: * corosync * pacemakerd * attrd * pengine * lrmd * crmd * cib When it hangs at start: * corosync * pacemakerd * attrd * pengine * crmd * lrmd * cib. Are you referring to the order of the daemons here? The cib should not be at the bottom in either case. Who knows who runs lrmd? Pacemakerd.
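The sequence described in steps 1-4 (fork a child, register a completion callback, arm a kill timer, then cancel the timer once the child finishes) can be sketched as follows. This is a toy model in Python, not the real st_client.c; the reported hang corresponds to the cancel/callback pair at the end never running:

```python
class Timer:
    """Stand-in for a g_timeout_add() handle."""
    def __init__(self, seconds):
        self.seconds = seconds
        self.cancelled = False

    def cancel(self):
        self.cancelled = True

def run_fence_action(action, on_done, timeout=600):
    """Run a fence 'child' synchronously for illustration, then cancel
    its kill timer and fire the completion callback. If the callback
    path is never reached, the timer expires after `timeout` seconds
    and the cluster fences the node a second time."""
    timer = Timer(timeout)   # armed before the child runs
    result = action()        # stands in for the forked agent finishing
    timer.cancel()           # must happen, or we wait out the timeout
    on_done(result)          # stands in for stonith_action_async_done
    return timer

events = []
t = run_fence_action(lambda: "reboot ok", events.append)
print(t.cancelled, events)  # -> True ['reboot ok']
```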
Re: [Pacemaker] starting resources with failed stonith resource
On 9 Jan 2014, at 8:29 pm, Frank Van Damme frank.vanda...@gmail.com wrote: 2014/1/8 Andrew Beekhof and...@beekhof.net: I don't understand it: if this means that the stonith devices have failed a million times, We also set it to 100 when the start action fails. why is it trying to start the mysql resource? It depends on whether any nodes need fencing. It's against Pacemaker policies to start resources on a cluster without working stonith devices, isn't it? Not if all nodes are present and healthy. But if they fail or disappear, they can't be killed and might have resources still running on them? Yes, and the cluster won't be able to do anything about it except wait. -- Frank Van Damme Make everything as simple as possible, but not simpler. - Albert Einstein
Re: [Pacemaker] Breaking dependency loop stonith
On 9 Jan 2014, at 5:05 pm, Andrey Groshev gre...@yandex.ru wrote: 08.01.2014, 06:15, Andrew Beekhof and...@beekhof.net: On 27 Nov 2013, at 12:26 am, Andrey Groshev gre...@yandex.ru wrote: Hi, ALL. I want to clarify two more questions. After a stonith reboot, the node hangs with status pending. In the logs I found the strings: info: rsc_merge_weights: pgsql:1: Breaking dependency loop at msPostgresql info: rsc_merge_weights: pgsql:2: Breaking dependency loop at msPostgresql Does this mean the dependency search was stopped because there were no more dependencies, or that an infinite dependency-search loop was interrupted? The second one, but it has nothing to do with a node being in the pending state. Where did you see this? Ok, I've already understood this problem. I have location constraints to promote|demote the resource on the right node, and the same logic through colocation/order constraints. As I thought, they do the same thing. No, colocation and ordering are orthogonal concepts and do not at all do the same thing. See the docs. ...and collisions should not happen. At least on the old cluster it works :) Now I have removed everything unnecessary. And two: do I need to clone the stonith resource now (in PCMK 1.1.11)? No. On the one hand, I see this resource on all nodes through this command: # cibadmin -Q|grep stonith <nvpair name="stonith-enabled" value="true" id="cib-bootstrap-options-stonith-enabled"/> <primitive id="st1" class="stonith" type="external/sshbykey"> <lrm_resource id="st1" type="external/sshbykey" class="stonith"> <lrm_resource id="st1" type="external/sshbykey" class="stonith"> <lrm_resource id="st1" type="external/sshbykey" class="stonith"> (without the pending node) Like all resources, we check all nodes at startup to see if it is already active. On the other hand, with another command I see only one instance on a particular node.
# crm_verify -L
info: main: =#=#=#=#= Getting XML =#=#=#=#=
info: main: Reading XML from: live cluster
info: validate_with_relaxng: Creating RNG parser context
info: determine_online_status_fencing: Node dev-cluster2-node4 is active
info: determine_online_status: Node dev-cluster2-node4 is online
info: determine_online_status_fencing: - Node dev-cluster2-node1 is not ready to run resources
info: determine_online_status_fencing: Node dev-cluster2-node2 is active
info: determine_online_status: Node dev-cluster2-node2 is online
info: determine_online_status_fencing: Node dev-cluster2-node3 is active
info: determine_online_status: Node dev-cluster2-node3 is online
info: determine_op_status: Operation monitor found resource pingCheck:0 active on dev-cluster2-node4
info: native_print: VirtualIP (ocf::heartbeat:IPaddr2): Started dev-cluster2-node4
info: clone_print: Master/Slave Set: msPostgresql [pgsql]
info: short_print: Masters: [ dev-cluster2-node4 ]
info: short_print: Slaves: [ dev-cluster2-node2 dev-cluster2-node3 ]
info: short_print: Stopped: [ dev-cluster2-node1 ]
info: clone_print: Clone Set: clnPingCheck [pingCheck]
info: short_print: Started: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]
info: short_print: Stopped: [ dev-cluster2-node1 ]
info: native_print: st1 (stonith:external/sshbykey): Started dev-cluster2-node4
info: native_color: Resource pingCheck:3 cannot run anywhere
info: native_color: Resource pgsql:3 cannot run anywhere
info: rsc_merge_weights: pgsql:1: Breaking dependency loop at msPostgresql
info: rsc_merge_weights: pgsql:2: Breaking dependency loop at msPostgresql
info: master_color: Promoting pgsql:0 (Master dev-cluster2-node4)
info: master_color: msPostgresql: Promoted 1 instances of a possible 1 to master
info: LogActions: Leave VirtualIP (Started dev-cluster2-node4)
info: LogActions: Leave pgsql:0 (Master dev-cluster2-node4)
info: LogActions: Leave pgsql:1 (Slave dev-cluster2-node2)
info: LogActions: Leave pgsql:2 (Slave dev-cluster2-node3)
info: LogActions: Leave pgsql:3 (Stopped)
info: LogActions: Leave pingCheck:0 (Started dev-cluster2-node4)
info: LogActions: Leave pingCheck:1 (Started dev-cluster2-node2)
info: LogActions: Leave pingCheck:2 (Started dev-cluster2-node3)
info: LogActions: Leave pingCheck:3 (Stopped)
info: LogActions: Leave st1 (Started dev-cluster2-node4)

However, if I do a clone - it turns out the same garbage.
Re: [Pacemaker] again return code, now in crm_attribute
On 9 Jan 2014, at 4:44 pm, Andrey Groshev gre...@yandex.ru wrote: 09.01.2014, 02:39, Andrew Beekhof and...@beekhof.net: On 18 Dec 2013, at 11:55 pm, Andrey Groshev gre...@yandex.ru wrote: Hi, Andrew and ALL. I'm sorry, but I again found an error. :) Crux of the problem: # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=true 0 # crm_attribute --type crm_config --attr-name stonith-enabled --update firstval ; echo $? 0 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=firstval 0 # crm_attribute --type crm_config --attr-name stonith-enabled --update secondval --lifetime=reboot ; echo $? 0 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=firstval 0 # crm_attribute --type crm_config --attr-name stonith-enabled --update thirdval --lifetime=forever ; echo $? 0 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=firstval 0 Ie if specify the lifetime of an attribute, then a attribure is not updated. If impossible setup the lifetime of the attribute when it is installing, it must be return an error. Agreed. I'll reproduce and get back to you. How, I was able to review code, problem comes when used both options --type and options --lifetime. One variant in case without break; Unfortunately, I did not have time to dive into the logic. Actually, the logic is correct. The command: # crm_attribute --type crm_config --attr-name stonith-enabled --update secondval --lifetime=reboot ; echo $? is invalid. You only get to specify --type OR --lifetime, not both. By specifying --lifetime, you're creating a node attribute, not a cluster proprerty. And if possible then the value should be established. In general, something is wrong. 
Unfortunately I have not yet looked deeper, because I struggle with STONITH :) P.S. Andrew! Late congratulations on the new addition to the family. This is a fine time - now you will have toys which you did not have in your childhood.
Re: [Pacemaker] again return code, now in crm_attribute
On 10 Jan 2014, at 3:51 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 03:28, Andrew Beekhof and...@beekhof.net: On 9 Jan 2014, at 4:44 pm, Andrey Groshev gre...@yandex.ru wrote: 09.01.2014, 02:39, Andrew Beekhof and...@beekhof.net: On 18 Dec 2013, at 11:55 pm, Andrey Groshev gre...@yandex.ru wrote: Hi, Andrew and ALL. I'm sorry, but I again found an error. :) Crux of the problem: # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=true 0 # crm_attribute --type crm_config --attr-name stonith-enabled --update firstval ; echo $? 0 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=firstval 0 # crm_attribute --type crm_config --attr-name stonith-enabled --update secondval --lifetime=reboot ; echo $? 0 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=firstval 0 # crm_attribute --type crm_config --attr-name stonith-enabled --update thirdval --lifetime=forever ; echo $? 0 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=firstval 0 Ie if specify the lifetime of an attribute, then a attribure is not updated. If impossible setup the lifetime of the attribute when it is installing, it must be return an error. Agreed. I'll reproduce and get back to you. How, I was able to review code, problem comes when used both options --type and options --lifetime. One variant in case without break; Unfortunately, I did not have time to dive into the logic. Actually, the logic is correct. The command: # crm_attribute --type crm_config --attr-name stonith-enabled --update secondval --lifetime=reboot ; echo $? is invalid. You only get to specify --type OR --lifetime, not both. By specifying --lifetime, you're creating a node attribute, not a cluster proprerty. With this, I do not argue. 
I think the exit code should be NON-ZERO, i.e. it's an error! No, it's setting a value, just not where you thought (or where you're looking for it in the next command). It's the same as writing: crm_attribute --type crm_config --type status --attr-name stonith-enabled --update secondval; echo $? Only the last value for --type wins. And if possible then the value should be established. In general, something is wrong. Unfortunately I have not yet looked deeper, because I struggle with STONITH :) P.S. Andrew! Late congratulations on the new addition to the family. This is a fine time - now you will have toys which you did not have in your childhood.
Re: [Pacemaker] again return code, now in crm_attribute
On 10 Jan 2014, at 4:38 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 09:06, Andrew Beekhof and...@beekhof.net: On 10 Jan 2014, at 3:51 pm, Andrey Groshev gre...@yandex.ru wrote: 10.01.2014, 03:28, Andrew Beekhof and...@beekhof.net: On 9 Jan 2014, at 4:44 pm, Andrey Groshev gre...@yandex.ru wrote: 09.01.2014, 02:39, Andrew Beekhof and...@beekhof.net: On 18 Dec 2013, at 11:55 pm, Andrey Groshev gre...@yandex.ru wrote: Hi, Andrew and ALL. I'm sorry, but I again found an error. :) Crux of the problem: # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=true 0 # crm_attribute --type crm_config --attr-name stonith-enabled --update firstval ; echo $? 0 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=firstval 0 # crm_attribute --type crm_config --attr-name stonith-enabled --update secondval --lifetime=reboot ; echo $? 0 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=firstval 0 # crm_attribute --type crm_config --attr-name stonith-enabled --update thirdval --lifetime=forever ; echo $? 0 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=firstval 0 Ie if specify the lifetime of an attribute, then a attribure is not updated. If impossible setup the lifetime of the attribute when it is installing, it must be return an error. Agreed. I'll reproduce and get back to you. How, I was able to review code, problem comes when used both options --type and options --lifetime. One variant in case without break; Unfortunately, I did not have time to dive into the logic. Actually, the logic is correct. The command: # crm_attribute --type crm_config --attr-name stonith-enabled --update secondval --lifetime=reboot ; echo $? is invalid. You only get to specify --type OR --lifetime, not both. 
By specifying --lifetime, you're creating a node attribute, not a cluster proprerty. With this, I do not argue. I think that should be the exit code is NOT ZERO, ie it's error! No, its setting a value, just not where you thought (or where you're looking for it in the next command). Its the same as writing: crm_attribute --type crm_config --type status --attr-name stonith-enabled --update secondval; echo $? Only the last value for --type wins Because of this confusion is obtained. Here is an example of the old cluster: #crm_attribute --type crm_config --attr-name test1 --update val1 --lifetime=reboot ; echo $? 0 # cibadmin -Q|grep test1 nvpair id=status-test-ins-db2-test1 name=test1 value=val1/ Win --lifetime ? Yes. Because it was specified last. Is not it easier to produce an error when trying to use incompatible options? They're not incompatible. They're aliases for each other in a different context. Then there is this uncertainty and was meant..., was meant..., was meant And if possible then the value should be established. In general, something is wrong. Denser unfortunately not yet looked, because I struggle with STONITH :) P.S. Andrew! Late to congratulate you on your new addition to the family. This fine time - now you will have toys which was not in your childhood. 
Re: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1
On 8 Jan 2014, at 9:15 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote: 2014/1/8 Andrew Beekhof and...@beekhof.net: On 18 Dec 2013, at 9:50 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote: Hi David, 2013/12/18 David Vossel dvos...@redhat.com: That's a really weird one... I don't see how it is possible for op->id to be NULL there. You might need to give valgrind a shot to detect whatever is really going on here. -- Vossel Thank you for the advice. I will try it. Any update on this? We are still investigating the cause. It was not reproduced when I ran it under valgrind.. And it was reproduced in RC3. So it happened with RC3 without valgrind, but not with RC3 under valgrind? That's concerning. Nothing in the valgrind output?
Re: [Pacemaker] starting resources with failed stonith resource
On 8 Jan 2014, at 2:41 am, Frank Van Damme frank.vanda...@gmail.com wrote: Hi list, I recently had some trouble with a dual-node mysql cluster, which runs in master-slave mode with the Percona resource manager. While analyzing what happened to the cluster, I found this in syslog (network trouble, the cluster lost disk/iscsi access on both nodes; this is a piece from the former master trying to start up again when recovering connectivity): Jan 6 07:26:49 infante pengine: [3839]: notice: get_failcount: Failcount for MasterSlave_mysql on infante has expired (limit was 60s) Jan 6 07:26:49 infante pengine: [3839]: notice: get_failcount: Failcount for MasterSlave_mysql on infante has expired (limit was 60s) Jan 6 07:26:49 infante pengine: [3839]: WARN: common_apply_stickiness: Forcing p-stonith-ingstad away from infante after 100 failures (max=100) Jan 6 07:26:49 infante pengine: [3839]: notice: LogActions: Start prim_mysql:0#011(infante) Jan 6 07:26:49 infante pengine: [3839]: notice: LogActions: Start prim_mysql:1#011(ingstad) I don't understand it: if this means that the stonith devices have failed a million times, We also set it to 100 when the start action fails. why is it trying to start the mysql resource? It depends if any nodes need fencing. It's against Pacemaker policies to start resources on a cluster without working stonith devices, isn't it? Not if all nodes are present and healthy. -- Frank Van Damme Make everything as simple as possible, but not simpler. 
- Albert Einstein ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
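An aside that may help here: even after the underlying stonith-device problem is fixed, the accumulated failcount keeps the device forced away until it is cleared. A hedged sketch, reusing the resource and node names from the log above (option spellings per the 1.1-era tools; verify against your man pages):

# show the accumulated failcount for the stonith resource on this node
crm_failcount --resource p-stonith-ingstad --node infante --query
# clear the failcount and failed-op history so the device may run here again
crm_resource --cleanup --resource p-stonith-ingstad --node infante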
Re: [Pacemaker] Error: node does not appear to exist in configuration
On 6 Jan 2014, at 8:09 pm, Jerald B. Darow jbda...@ace-host.net wrote: Where am I going wrong here? Good question... Chris? [root@zero mysql]# pcs cluster standby zero.acenet.us Error: node 'zero.acenet.us' does not appear to exist in configuration [root@zero mysql]# pcs cluster cib | grep "node id" <node id="diet.acenet.us" uname="diet.acenet.us"/> <node id="zero.acenet.us" uname="zero.acenet.us"/> --- standby node | --all Put specified node into standby mode (the node specified will no longer be able to host resources), if --all is specified all nodes will be put into standby mode. --- ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
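One possible cause (an assumption on my part, not confirmed in this thread): pcs matches the name against the membership layer rather than the CIB, so a short-vs-fully-qualified hostname mismatch between the two views would produce exactly this error. A quick way to compare them:

# node names as the membership layer reports them
crm_node -l
# node entries as recorded in the CIB
cibadmin -Q -o nodes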
Re: [Pacemaker] monitoring redis in master-slave mode
On 13 Dec 2013, at 11:06 pm, ESWAR RAO eswar7...@gmail.com wrote: Hi All, I have a 3 node setup with HB+pacemaker. I wanted to run redis in master-slave mode using an ocf script. https://groups.google.com/forum/#!msg/redis-db/eY3zCKnl0G0/lW5fObHrjwQJ But with the below configuration, I am able to start in master-slave mode but pacemaker is not monitoring redis. I killed the redis-server on node-1 (slave node/master node) but pacemaker is not restarting it. In the crm status I could see it as: Masters: [ oc-vm ] Slaves: [ oc-vm1 oc-vm2 ] even though it's not running on oc-vm1. # crm configure primitive cluster-ip ocf:IPaddr2 params ip=192.168.101.205 cidr_netmask=32 nic=eth1 op monitor interval=30s # crm configure primitive oc_redis ocf:redis op monitor role=Master interval=3s timeout=5s op monitor role=Slave interval=3s timeout=3s use a different interval for the two recurring operations. # crm configure ms redis_clone oc_redis meta notify=true master-max=1 master-node-max=1 clone-node-max=1 interleave=false globally-unique=false # crm configure colocation ip-on-redis inf: cluster-ip redis_clone:Master Can someone help me fix this? Thanks Eswar ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
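To expand on that inline advice: Pacemaker identifies recurring operations by their interval, so the Master and Slave monitors must not both use interval=3s - one of the two definitions is effectively lost, which would explain the missed failure. A minimal corrected sketch keeping the names from the post (the 4s value is an arbitrary choice, not from the thread):

# crm configure primitive oc_redis ocf:redis \
    op monitor role=Master interval=3s timeout=5s \
    op monitor role=Slave interval=4s timeout=3s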
Re: [Pacemaker] catch-22: can't fence node A because node A has the fencing resource
On 4 Dec 2013, at 11:47 am, Brian J. Murrell br...@interlinx.bc.ca wrote: On Tue, 2013-12-03 at 18:26 -0500, David Vossel wrote: We did away with all of the policy engine logic involved with trying to move fencing devices off of the target node before executing the fencing action. Behind the scenes all fencing devices are now essentially clones. If the target node to be fenced has a fencing device running on it, that device can execute anywhere in the cluster to avoid the suicide situation. OK. When you are looking at crm_mon output and see a fencing device is running on a specific node, all that really means is that we are going to attempt to execute fencing actions for that device from that node first. If that node is unavailable, Would it be better to not even try to use a node and ask it to commit suicide but always try to use another node? IIRC the only time we ask a node to fence itself is when it is (or thinks it is) the last node standing. we'll try that same device anywhere in the cluster we can get it to work OK. (unless you've specifically built some location constraint that prevents the fencing device from ever running on a specific node) While I do have constraints on the more service-oriented resources to give them preferred nodes, I don't have any constraints on the fencing resources. So given all of the above, and given the log I supplied showing that the fencing was just not being attempted anywhere other than the node to be fenced (which was down during that log) any clues as to where to look for why? Hope that helps. It explains the differences, but unfortunately I'm still not sure why it wouldn't get run somewhere else, eventually, rather than continually being attempted on the node to be killed (which as I mentioned, was shut down at the time the log was made). Yes, this is surprising. Can you enable the blackbox for stonith-ng, reproduce and generate a crm_report for us please? It will contain all the information we need. 
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
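For reference, one way to enable and collect the blackbox - a sketch assuming a sysconfig-based RHEL/CentOS install of pacemaker 1.1; variable placement and the fencing daemon's process name can differ between builds:

# in /etc/sysconfig/pacemaker, then restart pacemaker:
PCMK_blackbox=yes
# after reproducing the failure, ask the fencing daemon to dump its blackbox
# (the process may be named stonithd rather than stonith-ng on some builds):
kill -TRAP $(pidof stonithd)
# then gather logs, blackbox dumps and CIB history for the report:
crm_report --from "2013-12-03 00:00"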
Re: [Pacemaker] pacemaker + cman - node names and bind address
On 5 Dec 2013, at 8:51 pm, Nikola Ciprich nikola.cipr...@linuxbox.cz wrote: Hello Digimer, and thanks for Your reply. I understand your points, but my question is about something a bit different.. example: I have two nodes, node1 (lan address resolves to 192.168.1.1) and node2 (lan address resolves to 192.168.1.2). connected using crosslink (10.0.0.1, 10.0.0.2). I'd like to use node1/node2 as cluster node names, but I'd like to have node1/node2 resolving to 192.168.1.0 addresses but cluster to communicate over 10.0.0.0 link. In corosync, I was able to force this by setting bindnetaddr, but how can I set this in cman cluster? (I (roughly) know that corosync is used even in cman-based cluster, but this unimportant now, I think) what is the clear way to achieve this? There is some information on altname and altmulticast at the bottom of: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/s1-config-rrp-cli-CA.html that might be of some use. thanks a lot in advance BR nik On Thu, Dec 05, 2013 at 02:36:36AM -0500, Digimer wrote: On 05/12/13 02:28, Nikola Ciprich wrote: Hello, not sure whether this shouldn't be asked in some different conference, if so, I'm sorry in advance.. Since this seems to be recommended solution for RHEL6 and since I need to use CLVMD, I switched my cluster from corosync + pacemaker to cman+pacemaker. I usually use two-nodes clusters, where hostnames resolve to lan uplink addresses, and nodes are interconnected using crosslink (to be exact, multiple crosslinks with bonding). I'd like to use hostnames as cluster node names, but I'd like cluster communication to go over crosslinks. Is there a way I could force this in cluster.conf? cluster.conf manpage is quite brief on this.. Could somebody please give me an advice? thanks a lot in advance BR nik Hello, This list is fine to ask. I speak mainly as a cman user, but I will make an educated guess on the pacemaker side of things. 
First, cman does not replace corosync. Both pacemaker and cman use corosync for cluster communications and membership. The difference is that cman configures and starts corosync for you, behind the scenes. The node name you set in cluster.conf's <clusternode name="an-c05n01.alteeve.ca" nodeid="1" ...>...</clusternode> (an-c05n01.alteeve.ca in this case) is resolved to an IP address. This IP address/subnet is then looked for on your network and, once found, the associated network card is used. Generally, it's advised that you set the cluster node names to match each node's 'uname -n' hostname. If you did that, then the following bit of bash will tell you which interface will be used for the cluster (on RHEL/CentOS 6, anyway): ifconfig | grep -B 1 $(gethostip -d $(uname -n)) | grep HWaddr | awk '{ print $1 }' Once you've identified the network, you will want to make sure that, if it is bonded, it is a supported bond mode. Prior to 6.3 (iirc), only mode=1 (active/passive) bonding was supported. With 6.4, mode=0 and mode=2 support was added. All other bond modes are not supported under corosync. As for pacemaker integration: be sure to set up stonith in pacemaker and make sure it is working. Then use the 'fence_pcmk' fence agent in cluster.conf itself. This way, should cman detect a node failure, it will call pacemaker and ask it to fence the target node, saving you from having to keep the real fencing configured in two places at once. hth digimer -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 
28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax: +420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
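A small aside on Digimer's one-liner: on systems where ifconfig output differs (or the tool is absent), roughly the same lookup can be done with iproute2. An untested sketch - gethostip comes from the syslinux package, and the IP is matched as a regex, so the dots match loosely:

# print the interface holding the IP address that this host's name resolves to
ip -o -4 addr show | awk -v ip="$(gethostip -d "$(uname -n)")" '$4 ~ "^"ip"/" { print $2 }'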
Re: [Pacemaker] How to permanently delete ghostly nodes?
On 7 Dec 2013, at 8:19 pm, Andrey Rogovsky a.rogov...@gmail.com wrote: I renamed several nodes and restarted the cluster. Now the old nodes show up as offline in the status. I tried to delete them, but every time the cluster configuration changes they show up as offline again. It depends a bit on the version of pacemaker and whether you're using heartbeat or corosync or the corosync plugin or cman. Details? ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
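For a corosync/cman based 1.1.x cluster, the usual recipe is roughly the following - a sketch only, since as noted above the exact steps depend on the stack and version; old-node is a placeholder for each stale name:

# forget the stale name at the membership layer
crm_node --remove old-node --force
# then remove the leftover node and status entries from the CIB
cibadmin --delete -o nodes -X '<node uname="old-node"/>'
cibadmin --delete -o status -X '<node_state uname="old-node"/>'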
Re: [Pacemaker] again return code, now in crm_attribute
On 18 Dec 2013, at 11:55 pm, Andrey Groshev gre...@yandex.ru wrote: Hi, Andrew and ALL. I'm sorry, but I have again found an error. :) The crux of the problem: # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=true 0 # crm_attribute --type crm_config --attr-name stonith-enabled --update firstval ; echo $? 0 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=firstval 0 # crm_attribute --type crm_config --attr-name stonith-enabled --update secondval --lifetime=reboot ; echo $? 0 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=firstval 0 # crm_attribute --type crm_config --attr-name stonith-enabled --update thirdval --lifetime=forever ; echo $? 0 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $? scope=crm_config name=stonith-enabled value=firstval 0 I.e., if you specify a lifetime for the attribute, the attribute is not updated. If it is impossible to set a lifetime when the attribute is being set, an error should be returned. Agreed. I'll reproduce and get back to you. And if it is possible, then the value should be set. In general, something is wrong. Unfortunately I have not yet looked deeper, because I am struggling with STONITH :) P.S. Andrew! Belated congratulations on the new addition to the family. This is a fine time - now you will have toys which you did not have in your childhood. 
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
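Until that is fixed, the safe pattern is to treat --lifetime as meaningful only for node attributes (forever lands in the nodes section, reboot in the status section) and to omit it for cluster properties. A sketch in the same option style as above; my-test-attr is a made-up name:

# cluster property: no lifetime concept, just update it
crm_attribute --type crm_config --attr-name stonith-enabled --update true
# node attribute: --lifetime is honoured here
crm_attribute --node $(uname -n) --attr-name my-test-attr --update 1 --lifetime reboot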
Re: [Pacemaker] CentOS 6.5 Pacemaker Oracle Active/Failover cluster setup on SAN
On 6 Jan 2014, at 4:15 pm, Pui Edylie em...@edylie.net wrote: Good Day members, I am wondering if anyone has set this up successfully? I noticed that there is a lack of an Oracle script to initiate this. I would be willing to pay someone for this effort and hopefully we could create a howto to benefit subsequent readers. Red Hat could presumably help you, but you'd have to use RHEL instead of CentOS. Please contact me! Thanks Edy ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Manual fence confirmation by stonith_admin doesn't work again.
On 19 Dec 2013, at 6:54 pm, Nikita Staroverov nsfo...@gmail.com wrote: Please see: https://access.redhat.com/site/articles/36302 If you don't have an account, the relevant part is: Usage of fence_manual is not supported in any production cluster. You may use this fence agent for development or debugging purposes only. I wrote about fence_ack_manual, not about fence_manual. fence_ack_manual isn't a fence agent, it's a confirmation tool like stonith_admin -C. 2. Pacemaker notifies CMAN about real fencing, why not about manual confirmations? It's a bug, I'm sure. There is no-one to tell yet. We have to wait for cman to decide something needs fencing before pacemaker can perform the notification. Perhaps, I can't speak to pacemaker's behaviour. Ok. I'll treat it not as a bug, but as a feature request. 3. I use real fencing. Manual fencing is needed in rare situations, like a problem with the IPMI controller, lost power, or so. What should an administrator do without manual confirmations? :) Fencing should loop until it succeeds. Fix the problem, fence_ack_manual if that's not possible or, ideally, use multiple fence methods (I use IPMI on one switch and switched PDUs on another switch for redundancy). I have two datacenters joined with fast links; I have synchronous replication, so server groups in the two separate datacenters form clusters. If one campus burns down as a whole, together with its own fencing devices, how do I recover from that? In some situations the administrator needs a simple and fast manual confirmation, or it's not high availability. 
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1
On 18 Dec 2013, at 9:50 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote: Hi David, 2013/12/18 David Vossel dvos...@redhat.com: That's a really weird one... I don't see how it is possible for op-id to be NULL there. You might need to give valgrind a shot to detect whatever is really going on here. -- Vossel Thank you for the advice. I will try it. Any update on this? ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] reboot of non-vm host results in VM restart -- of chickens and eggs and VMs
On 20 Dec 2013, at 5:30 am, Bob Haxo bh...@sgi.com wrote: Hello, Earlier emails related to this topic: [pacemaker] chicken-egg-problem with libvirtd and a VM within cluster [pacemaker] VirtualDomain problem after reboot of one node My configuration: RHEL6.5/CMAN/gfs2/Pacemaker/crmsh pacemaker-libs-1.1.10-14.el6_5.1.x86_64 pacemaker-cli-1.1.10-14.el6_5.1.x86_64 pacemaker-1.1.10-14.el6_5.1.x86_64 pacemaker-cluster-libs-1.1.10-14.el6_5.1.x86_64 Two node HA VM cluster using real shared drive, not drbd. Resources (relevant to this discussion): primitive p_fs_images ocf:heartbeat:Filesystem \ primitive p_libvirtd lsb:libvirtd \ primitive virt ocf:heartbeat:VirtualDomain \ services chkconfig on: cman, clvmd, pacemaker services chkconfig off: corosync, gfs2, libvirtd Observation: Rebooting the NON-host system results in the restart of the VM merrily running on the host system. I'm still bootstrapping after the break, but I'm not following this. Can you rephrase? Apparent cause: Upon startup, Pacemaker apparently checks the status of configured resources. However, the status request for the virt (ocf:heartbeat:VirtualDomain) resource fails with: Dec 18 12:19:30 [4147] mici-admin2 lrmd: warning: child_timeout_callback:virt_monitor_0 process (PID 4158) timed out Dec 18 12:19:30 [4147] mici-admin2 lrmd: warning: operation_finished: virt_monitor_0:4158 - timed out after 20ms Dec 18 12:19:30 [4147] mici-admin2 lrmd: notice: operation_finished: virt_monitor_0:4158:stderr [ error: Failed to reconnect to the hypervisor ] Dec 18 12:19:30 [4147] mici-admin2 lrmd: notice: operation_finished: virt_monitor_0:4158:stderr [ error: no valid connection ] Dec 18 12:19:30 [4147] mici-admin2 lrmd: notice: operation_finished: virt_monitor_0:4158:stderr [ error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory ] Sounds like the agent should perhaps be returning OCF_NOT_RUNNING in this case. 
This failure then snowballs into an orphan situation in which the running VM is restarted. There was the suggestion of chkconfig on libvirtd (and presumably deleting the resource) so that the /var/run/libvirt/libvirt-sock has been created by service libvirtd. With libvirtd started by the system, there is no un-needed reboot of the VM. However, it may be that removing libvirtd from Pacemaker control leaves the VM vdisk filesystem susceptible to corruption during a reboot induced failover. Question: Is there an accepted Pacemaker configuration such that the un-needed restart of the VM does not occur with the reboot of the non-host system? Regards, Bob Haxo ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Minor buffer overflow..
On 5 Dec 2013, at 3:20 pm, Rob Thomas xro...@gmail.com wrote: I was idly wondering why the SMTP and SNMP modules were disabled by default on the RHEL builds, and was in the middle of writing a shell script to duplicate them when I noticed there was a tiny buffer overflow in crm_mon. This may be why it's disabled by default? Not really. It was more of a dependency issue. Plus the run script option makes them redundant. Patch: https://github.com/xrobau/pacemaker/commit/b1515e3f83fceeac951de8823d718bdf13e4a093 Can you make a pull request for that? --Rob ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Starting Pacemaker Cluster Manager [FAILED]
On 21 Nov 2013, at 9:56 pm, Miha m...@softnet.si wrote: Hi, how can I delete/reset all config, so that I could do again: pcs cluster destroy on all nodes looks about right 'pcs cluster setup mycluster pcmk-1 pcmk-2' and begin again at the beginning? tnx! p.s.: below is a log Nov 21 11:51:21 [10578] sip1cib:error: xml_log: Expecting element status, got node_state Nov 21 11:51:21 [10578] sip1cib:error: xml_log: Element cib failed to validate content Nov 21 11:51:21 [10578] sip1cib:error: readCibXmlFile: CIB does not validate with pacemaker-1.2 Nov 21 11:51:22 [10578] sip1cib:error: cib_process_request: Operation ignored, cluster configuration is invalid. Please repair and restart: Update does not conform to the configured schema Nov 21 11:51:22 [10578] sip1cib:error: cib_process_request: Operation ignored, cluster configuration is invalid. Please repair and restart: Update does not conform to the configured schema Nov 21 11:51:22 [10579] sip1 stonith-ng:error: cluster_option: Value 'tru' for cluster option 'stop-all-resources' is invalid. Defaulting to false Nov 21 11:51:22 [10578] sip1cib:error: cib_process_request: Operation ignored, cluster configuration is invalid. Please repair and restart: Update does not conform to the configured schema Nov 21 11:51:22 [10578] sip1cib:error: cib_process_request: Operation ignored, cluster configuration is invalid. 
Please repair and restart: Update does not conform to the configured schema Nov 21 11:51:22 [10583] sip1 crmd: info: register_fsa_error_adv: Resetting the current action list Nov 21 11:51:22 [10583] sip1 crmd:error: node_list_update_callback: Node update 3 failed: Update does not conform to the configured schema (-203) Nov 21 11:51:22 [10583] sip1 crmd: info: register_fsa_error_adv: Resetting the current action list Nov 21 11:51:22 [10583] sip1 crmd:error: config_query_callback: Local CIB query resulted in an error: Update does not conform to the configured schema Nov 21 11:51:22 [10583] sip1 crmd: info: register_fsa_error_adv: Resetting the current action list Nov 21 11:51:22 [10583] sip1 crmd:error: config_query_callback: The cluster is mis-configured - shutting down and staying down Nov 21 11:51:22 [10583] sip1 crmd:error: do_log: FSA: Input I_ERROR from config_query_callback() received in state S_STARTING Nov 21 11:51:22 [10583] sip1 crmd: warning: do_recover: Fast-tracking shutdown in response to errors Nov 21 11:51:22 [10583] sip1 crmd:error: do_log: FSA: Input I_ERROR from node_list_update_callback() received in state S_RECOVERY Nov 21 11:51:22 [10583] sip1 crmd:error: do_log: FSA: Input I_ERROR from revision_check_callback() received in state S_RECOVERY Nov 21 11:51:22 [10583] sip1 crmd:error: do_started: Start cancelled... S_RECOVERY Nov 21 11:51:22 [10583] sip1 crmd:error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY Nov 21 11:51:22 [10578] sip1cib:error: cib_process_request: Operation ignored, cluster configuration is invalid. 
Please repair and restart: Update does not conform to the configured schema Nov 21 11:51:22 [10572] sip1 pacemakerd:error: pcmk_child_exit: Child process crmd (10583) exited: Network is down (100) Nov 21 11:51:22 [10572] sip1 pacemakerd: notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
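To answer the question at the top of the thread, a sketch of the destroy-and-recreate cycle using the pcs syntax quoted above (pcmk-1/pcmk-2 are the placeholder node names from the reply; the immediate 'tru' typo in stop-all-resources could be fixed in place instead, but since the CIB also fails schema validation, starting clean is simpler):

# on every node - wipes the corosync/pacemaker configuration and local CIB
pcs cluster destroy
# then, from one node, recreate and start the cluster
pcs cluster setup mycluster pcmk-1 pcmk-2
pcs cluster start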
Re: [Pacemaker] some questions about STONITH
On 26 Nov 2013, at 12:39 am, Andrey Groshev gre...@yandex.ru wrote: ...snip... Make the next test: #stonith_admin --reboot=dev-cluster2-node2 The node reboots, but the resource doesn't start. In crm_mon status - Node dev-cluster2-node2 (172793105): pending. And it will be hung. That is *probably* a race - the node reboots too fast, or still communicates for a bit after the fence has supposedly completed (if it's not a reboot -nf, but a mere reboot). We have had problems here in the past. You may want to file a proper bug report with crm_report included, and preferably corosync/pacemaker debugging enabled. It turned out that it does not hang forever: a timeout triggered after 20 minutes. crm_report archive - http://send2me.ru/pen2.tar.bz2 Of course the logs contain many entries of the type: pgsql:1: Breaking dependency loop at msPostgresql But where this dependency goes after the timeout, I do not understand. Can you rephrase your question? I'm not 100% sure I understand what you're asking. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Breaking dependency loop stonith
On 27 Nov 2013, at 12:26 am, Andrey Groshev gre...@yandex.ru wrote: Hi, ALL. I want to clarify two more questions. After a stonith reboot, the node hangs with status pending. In the logs I found the strings: info: rsc_merge_weights: pgsql:1: Breaking dependency loop at msPostgresql info: rsc_merge_weights: pgsql:2: Breaking dependency loop at msPostgresql Does this mean the dependency search stops because there are no more dependencies, or is an infinite dependency-search loop being interrupted? The second one, but it has nothing to do with a node being in the pending state. Where did you see this? And two: do I need to clone the stonith resource now (in PCMK 1.1.11)? No. On the one hand, I see this resource on all nodes through this command: # cibadmin -Q|grep stonith <nvpair name="stonith-enabled" value="true" id="cib-bootstrap-options-stonith-enabled"/> <primitive id="st1" class="stonith" type="external/sshbykey"> <lrm_resource id="st1" type="external/sshbykey" class="stonith"> <lrm_resource id="st1" type="external/sshbykey" class="stonith"> <lrm_resource id="st1" type="external/sshbykey" class="stonith"> (without the pending node) Like all resources, we check all nodes at startup to see if it is already active. On the other hand, with another command I see only one instance, on a particular node. 
# crm_verify -L info: main: =#=#=#=#= Getting XML =#=#=#=#= info: main: Reading XML from: live cluster info: validate_with_relaxng:Creating RNG parser context info: determine_online_status_fencing: Node dev-cluster2-node4 is active info: determine_online_status: Node dev-cluster2-node4 is online info: determine_online_status_fencing: - Node dev-cluster2-node1 is not ready to run resources info: determine_online_status_fencing: Node dev-cluster2-node2 is active info: determine_online_status: Node dev-cluster2-node2 is online info: determine_online_status_fencing: Node dev-cluster2-node3 is active info: determine_online_status: Node dev-cluster2-node3 is online info: determine_op_status: Operation monitor found resource pingCheck:0 active on dev-cluster2-node4 info: native_print: VirtualIP (ocf::heartbeat:IPaddr2): Started dev-cluster2-node4 info: clone_print: Master/Slave Set: msPostgresql [pgsql] info: short_print: Masters: [ dev-cluster2-node4 ] info: short_print: Slaves: [ dev-cluster2-node2 dev-cluster2-node3 ] info: short_print: Stopped: [ dev-cluster2-node1 ] info: clone_print: Clone Set: clnPingCheck [pingCheck] info: short_print: Started: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ] info: short_print: Stopped: [ dev-cluster2-node1 ] info: native_print: st1 (stonith:external/sshbykey): Started dev-cluster2-node4 info: native_color: Resource pingCheck:3 cannot run anywhere info: native_color: Resource pgsql:3 cannot run anywhere info: rsc_merge_weights:pgsql:1: Breaking dependency loop at msPostgresql info: rsc_merge_weights:pgsql:2: Breaking dependency loop at msPostgresql info: master_color: Promoting pgsql:0 (Master dev-cluster2-node4) info: master_color: msPostgresql: Promoted 1 instances of a possible 1 to master info: LogActions: Leave VirtualIP (Started dev-cluster2-node4) info: LogActions: Leave pgsql:0 (Master dev-cluster2-node4) info: LogActions: Leave pgsql:1 (Slave dev-cluster2-node2) info: LogActions: Leave pgsql:2 (Slave 
dev-cluster2-node3) info: LogActions: Leave pgsql:3 (Stopped) info: LogActions: Leave pingCheck:0 (Started dev-cluster2-node4) info: LogActions: Leave pingCheck:1 (Started dev-cluster2-node2) info: LogActions: Leave pingCheck:2 (Started dev-cluster2-node3) info: LogActions: Leave pingCheck:3 (Stopped) info: LogActions: Leave st1 (Started dev-cluster2-node4) However, if I do clone it, I end up with the same garbage. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
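Regarding the three lrm_resource entries for st1 above: grepping the whole CIB mixes the configuration section (where the primitive is defined once) with the status section (where every node that has probed the device records its own lrm_resource entry). Splitting the query by section makes that visible; a sketch, to be run against the live cluster:

```
# Configuration section: the stonith primitive appears once
cibadmin --query --scope resources | grep stonith

# Status section: one lrm_resource record per node that probed st1
cibadmin --query --scope status | grep stonith
```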
Re: [Pacemaker] Weird behavior of PCS command while defining DRBD resources
On 27 Nov 2013, at 10:21 pm, Muhammad Kamran Azeem kamranaz...@gmail.com wrote: Apologies for the double post; in my initial post I forgot to set the subject properly. Hello list, I am new here. I worked with Linux HA during 2006-2008, went in the HPC direction, and came back to HA a month ago, and realized that a lot has changed. My setup: two KVM machines, vdb1 (192.168.122.11) and vdb2 (192.168.122.12); ClusterIP 192.168.122.10; Fedora 19 (64-bit); pcs, Corosync, Pacemaker, DRBD. Note: I use the names node1 and node2 for vdb1 and vdb2 in explanations. I am trying to set up a test cluster using http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_configure_the_cluster_for_drbd.html First, the status: [root@vdb1 drbd.d]# pcs status Cluster name: MySQLCluster Last updated: Tue Nov 26 14:05:33 2013 Last change: Mon Nov 25 17:25:59 2013 via crm_resource on vdb2.example.com Stack: corosync Current DC: vdb1.example.com (1) - partition with quorum Version: 1.1.9-3.fc19-781a388 2 Nodes configured, unknown expected votes 2 Resources configured. Online: [ vdb1.example.com vdb2.example.com ] Full list of resources: ClusterIP(ocf::heartbeat:IPaddr2): Started vdb1.example.com Apache (ocf::heartbeat:apache):Started vdb1.example.com [root@vdb1 drbd.d]# My DRBD disks are: [root@vdb1 drbd.d]# drbd-overview 1:MySQLDisk/0 Connected Secondary/Secondary UpToDate/UpToDate C r- 2:ApacheDisk/0 Connected Secondary/Secondary UpToDate/UpToDate C r- [root@vdb1 drbd.d]# Now, the guide suggests creating a small config file, defining the new resources in it, and then pushing it into the CIB.
Extract from the guide: # pcs cluster cib drbd_cfg # pcs -f drbd_cfg resource create WebData ocf:linbit:drbd \ drbd_resource=wwwdata op monitor interval=60s # pcs -f drbd_cfg resource master WebDataClone WebData \ master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \ notify=true I decided to execute the commands (manually), without using the config file method, as: # pcs resource create p_ApacheDisk ocf:linbit:drbd \ drbd_resource=ApacheDisk op monitor interval=60s # pcs resource master MasterApacheDisk p_ApacheDisk \ master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \ notify=true (I changed the names of resources a bit) I get the following errors: [root@vdb2 ~]# pcs resource create p_ApacheDisk ocf:linbit:drbd \ drbd_resource=ApacheDisk op monitor interval=60s [root@vdb2 ~]# pcs status Cluster name: MySQLCluster Last updated: Wed Nov 27 11:50:35 2013 Last change: Wed Nov 27 11:49:36 2013 via cibadmin on vdb2.example.com Stack: corosync Current DC: vdb1.example.com (1) - partition with quorum Version: 1.1.9-3.fc19-781a388 2 Nodes configured, unknown expected votes 3 Resources configured. 
Online: [ vdb1.example.com vdb2.example.com ] Full list of resources: ClusterIP(ocf::heartbeat:IPaddr2): Started vdb1.example.com Apache (ocf::heartbeat:apache):Started vdb1.example.com p_ApacheDisk (ocf::linbit:drbd): Stopped Failed actions: p_ApacheDisk_monitor_0 (node=vdb1.example.com, call=27, rc=6, status=complete, last-rc-change=Wed Nov 27 11:49:36 2013 , queued=23ms, exec=0ms ): not configured p_ApacheDisk_monitor_0 (node=vdb2.example.com, call=15, rc=6, status=complete, last-rc-change=Wed Nov 27 11:49:36 2013 , queued=22ms, exec=1ms ): not configured Got the following in /var/log/messages on DC (node 1): Nov 27 11:49:36 vdb1 cib[538]: notice: cib:diff: Diff: --- 0.43.13 Nov 27 11:49:36 vdb1 cib[538]: notice: cib:diff: Diff: +++ 0.44.1 f4b87d9dee145747f86583cb5eb8276b Nov 27 11:49:36 vdb1 stonith-ng[539]: notice: unpack_config: On loss of CCM Quorum: Ignore Nov 27 11:49:36 vdb1 crmd[543]: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] Nov 27 11:49:36 vdb1 pengine[542]: notice: unpack_config: On loss of CCM Quorum: Ignore Nov 27 11:49:36 vdb1 pengine[542]: notice: LogActions: Start p_ApacheDisk#011(vdb2.example.com) Nov 27 11:49:36 vdb1 pengine[542]: notice: process_pe_message: Calculated Transition 92: /var/lib/pacemaker/pengine/pe-input-74.bz2 Nov 27 11:49:36 vdb1 crmd[543]: notice: te_rsc_command: Initiating action 8: monitor p_ApacheDisk_monitor_0 on vdb2.example.com Nov 27 11:49:36 vdb1 crmd[543]: notice: te_rsc_command: Initiating action 6: monitor p_ApacheDisk_monitor_0 on vdb1.example.com (local) Nov 27 11:49:36 vdb1 drbd(p_ApacheDisk)[9807]: ERROR: meta parameter misconfigured, expected clone-max -le 2, but found unset. Nov 27 11:49:36 vdb1 crmd[543]: notice: process_lrm_event: LRM operation p_ApacheDisk_monitor_0 (call=27, rc=6, cib-update=124, confirmed=true) not configured
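The "not configured" failures come from the drbd agent itself: its validation requires the clone-max meta attribute (the ERROR in the log says "expected clone-max -le 2, but found unset"), and that attribute only exists once the master resource wrapping p_ApacheDisk is defined. Creating the primitive directly against the live CIB lets the initial probes run before the master exists. The file-based method from the guide avoids this by pushing both changes in one update; a sketch using the poster's resource names (the push command's spelling varies between pcs versions):

```
# Take a local copy of the CIB; subsequent -f commands edit the copy
pcs cluster cib drbd_cfg

# Define the DRBD primitive and its master/slave wrapper in the copy
pcs -f drbd_cfg resource create p_ApacheDisk ocf:linbit:drbd \
    drbd_resource=ApacheDisk op monitor interval=60s
pcs -f drbd_cfg resource master MasterApacheDisk p_ApacheDisk \
    master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \
    notify=true

# Push both changes to the live CIB in a single update
# (older pcs releases spell this "pcs cluster push cib drbd_cfg")
pcs cluster cib-push drbd_cfg
```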
Re: [Pacemaker] prevent starting resources on failed node
On 7 Dec 2013, at 2:17 am, Brian J. Murrell (brian) br...@interlinx.bc.ca wrote: [ Hopefully this doesn't cause a duplicate post but my first attempt returned an error. ] Using pacemaker 1.1.10 (but I think this issue is more general than that release), I want to enforce a policy that once a node fails, no resources can be started/run on it until the user permits it. Node fails? Or resource on a node fails? If you really mean the node, just don't configure it to start pacemaker when it boots. I have been successful in achieving this using resource stickiness. Mostly. It seems that once the resource has been successfully started on another node, it stays put, even once the failed node comes back up. So this is all good. Where it does seem to be falling down though is that if the failed node comes back up before the resource can be successfully started on another node, pacemaker seems to include the just-failed-and-restarted node in the candidate list of nodes it tries to start the resource on. So in this manner, it seems that resource stickiness only applies once the resource has been started (which is not surprising; it seems a reasonable behaviour). The question then is, anyone have any ideas on how to implement such a policy? That is, once a node fails, no resources are allowed to start on it, even if it means not starting the resource (i.e. all other nodes are unable to start it for whatever reason)? Simply not starting the node would be one way to achieve it, yes, but we cannot rely on the node not being started. It seems perhaps the installation of a constraint when a node is stonithed might do the trick, but the question is how to couple/trigger the installation of a constraint with a stonith action? Or is there a better/different way to achieve this? Cheers, b. 
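One way to approximate "no starts after a failure until an operator intervenes" is to put the node into standby as soon as it is fenced, since standby is persistent and must be cleared by hand. The hook below is a sketch, assuming a ClusterMon-style external agent that receives CRM_notify_task/CRM_notify_node environment variables; whether fencing events are delivered to such an agent depends on the Pacemaker version, so verify that before relying on it. The crm_standby command is echoed rather than executed so the logic can be dry-run safely.

```shell
#!/bin/sh
# Hypothetical fence-notification hook: when the notification looks
# like a stonith event, emit the command that would flag the fenced
# node as standby (an operator later clears it with
# "crm_standby -N <node> -v off").  Echoed, not executed, on purpose.
handle_event() {
    case "$CRM_notify_task" in
        stonith|st_notify_fence)
            echo "crm_standby -N $CRM_notify_node -v on" ;;
        *)
            echo "ignoring task: ${CRM_notify_task:-none}" ;;
    esac
}
handle_event
```

Replacing the echo with the real crm_standby call turns the sketch into an active hook; the standby attribute then blocks all resource placement on that node until it is cleared.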
Re: [Pacemaker] error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
What version of pacemaker? There were some improvements to how we handle sending messages via CPG recently. On 10 Dec 2013, at 4:40 am, Brian J. Murrell br...@interlinx.bc.ca wrote: On Mon, 2013-12-09 at 09:28 +0100, Jan Friesse wrote: Error 6 means "try again". This happens either when corosync is overloaded or when it is creating a new membership. Please take a look at /var/log/cluster/corosync.log and see if there is anything strange there (and make sure you have the newest corosync). Would that same information be available in /var/log/messages if I have configured corosync like this: logging { fileline: off to_stderr: no to_logfile: no to_syslog: yes logfile: /var/log/cluster/corosync.log debug: off timestamp: on logger_subsys { subsys: AMF debug: off } } If so, then the log snippet I posted in the prior message includes all that corosync had to report. Should I increase the amount of logging? Any suggestions on an appropriate amount/flags, etc.? (+ make sure you have the newest corosync.) corosync-1.4.1-15.el6_4.1.x86_64, as shipped by RH in EL6. Is this new enough? I know 2.x is also available but I don't think RH is shipping that yet. Hopefully their 1.4.1 is still supported. Cheers, b.
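To capture more detail from corosync for the next occurrence, the logging section can send debug output to a dedicated file as well as syslog. A sketch for corosync 1.4.x based on the config quoted above (debug: on is very verbose, so turn it back off once the problem has been reproduced):

```
logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    debug: on
    timestamp: on
}
```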
Re: [Pacemaker] Reg. trigger when node failure occurs
On 11 Dec 2013, at 3:45 pm, ESWAR RAO eswar7...@gmail.com wrote: Hi Michael, I am configuring ClusterMon as below on the 3-node setup. I am following http://floriancrouzat.net/2013/01/monitor-a-pacemaker-cluster-with-ocfpacemakerclustermon-andor-external-agent/ # crm configure primitive ClusterMon ocf:pacemaker:ClusterMon params user=root update=10 extra_options=-E /root/monitor.sh -e 192.168.100.188 op monitor on-fail=restart interval=60 Entity: line 24: element nvpair: Relax-NG validity error : Type ID doesn't allow value 'ClusterMon-instance_attributes-/root/monitor.sh' Looks like crmsh is having trouble parsing your command line. Entity: line 24: element nvpair: Relax-NG validity error : Element nvpair failed to validate attributes Relax-NG validity error : Extra element nvpair in interleave Entity: line 21: element nvpair: Relax-NG validity error : Element instance_attributes failed to validate content Relax-NG validity error : Extra element instance_attributes in interleave Entity: line 2: element cib: Relax-NG validity error : Element cib failed to validate content crm_verify[15283]: 2013/12/10_20:35:45 ERROR: main: CIB did not pass DTD/schema validation Errors found during check: config not valid ERROR: ClusterMon: parameter -e does not exist ERROR: ClusterMon: parameter 192.168.100.188 does not exist ERROR: ClusterMon: parameter /root/monitor.sh does not exist Can I write my external agent as a simple dummy, like crm_mon -1 | grep 'Online\|OFFLINE'? My intention is that when heartbeat is stopped on any node, or a node failure occurs, the script should be triggered so that I can identify the OFFLINE nodes. Thanks Eswar On Tue, Dec 10, 2013 at 1:06 PM, Michael Schwartzkopff m...@sys4.de wrote: On Tuesday, 10 December 2013, 12:19:25, ESWAR RAO wrote: Hi All, Can someone please let me know if there is a clean way to trigger a script from Pacemaker when heartbeat on a node has stopped or a node failure has occurred, on a 3-node HB+Pacemaker setup?
Thanks Eswar On Mon, Dec 9, 2013 at 5:16 PM, ESWAR RAO eswar7...@gmail.com wrote: Hi All, I have a 3-node (node1, node2, node3) setup on which HB+Pacemaker runs. I have resources running in clone mode on node1 and node2. Is there any way to get a trigger when a node failure occurs, i.e., can I trigger a script if node3 fails (on which no resource runs)? Yes: run an ocf:pacemaker:ClusterMon resource and read man crm_mon for the additional options to call a script. -- Mit freundlichen Grüßen, Michael Schwartzkopff -- [*] sys4 AG http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044 Franziskanerstraße 15, 81669 München Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263 Vorstand: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer Aufsichtsratsvorsitzender: Florian Kirstein
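Two things are going on above. First, the unquoted extra_options makes crmsh treat -E, /root/monitor.sh, -e and the IP address as separate (nonexistent) ClusterMon parameters; quoting the whole value, e.g. extra_options="-E /root/monitor.sh -e 192.168.100.188", should get the command past validation. Second, the external agent can be a plain shell script: ClusterMon exports event details as CRM_notify_* environment variables. A minimal sketch of such an agent (the log path is an assumption, made overridable here so the function can be exercised outside a cluster):

```shell
#!/bin/sh
# Hypothetical /root/monitor.sh for ocf:pacemaker:ClusterMon.
# ClusterMon invokes the external agent with CRM_notify_* variables
# describing the event; this sketch appends them to a log file.
log_event() {
    : "${CLUSTERMON_LOG:=/tmp/clustermon.log}"   # assumed path, override via env
    {
        echo "$(date '+%F %T') node=${CRM_notify_node} rsc=${CRM_notify_rsc}"
        echo "  task=${CRM_notify_task} desc=${CRM_notify_desc} rc=${CRM_notify_rc}"
    } >> "$CLUSTERMON_LOG"
}
log_event
```

Watching the log then shows which node an event came from, including failures of nodes that run no resources.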
Re: [Pacemaker] host came online but it was ignored
On 18 Dec 2013, at 4:23 pm, ESWAR RAO eswar7...@gmail.com wrote: Hi All, Can someone help me how to narrow down the problem?? I'd probably start with an upgrade. There were some membership issues around about the time of 1.1.7, but they may have been corosync specific (I don't really test pacemaker with heartbeat anymore). If you can reproduce with something more recent, I'd be happy to take a look at the logs. Thanks Eswar On Wed, Dec 11, 2013 at 9:35 AM, ESWAR RAO eswar7...@gmail.com wrote: Hi Andrew, # pacemakerd --version Pacemaker 1.1.7 Written by Andrew Beekhof # ps -aef|grep heart root 8926 1 0 20:02 ?00:00:00 heartbeat: master control process root 8930 8926 0 20:02 ?00:00:00 heartbeat: FIFO reader root 8931 8926 0 20:02 ?00:00:00 heartbeat: write: bcast eth0 root 8932 8926 0 20:02 ?00:00:00 heartbeat: read: bcast eth0 108 8936 8926 0 20:02 ?00:00:00 /usr/lib/heartbeat/ccm 108 8937 8926 0 20:02 ?00:00:00 /usr/lib/heartbeat/cib root 8938 8926 0 20:02 ?00:00:00 /usr/lib/heartbeat/lrmd -r root 8939 8926 0 20:02 ?00:00:00 /usr/lib/heartbeat/stonithd 108 8940 8926 0 20:02 ?00:00:00 /usr/lib/heartbeat/attrd 108 8941 8926 0 20:02 ?00:00:00 /usr/lib/heartbeat/crmd In /etc/ha.d/ha.cf i am using crm respawn tag. I am installing through apt-get install Thanks Eswar On Wed, Dec 11, 2013 at 6:18 AM, Andrew Beekhof and...@beekhof.net wrote: version of pacemaker? On 10 Dec 2013, at 10:41 pm, ESWAR RAO eswar7...@gmail.com wrote: Hi Micheal, There are no firewall rules. I could only see below messages in logs: Dec 10 14:13:48 nvp-common crmd: [9220]: WARN: crmd_ha_msg_callback: Ignoring HA message (op=join_announce) from nvsd-1: not in our membership list (size=1) On Tue, Dec 10, 2013 at 2:46 PM, Michael Schwartzkopff m...@sys4.de wrote: Am Dienstag, 10. Dezember 2013, 14:30:32 schrieb ESWAR RAO: Hi All, I had a 3 node HB+pacemaker setup. When I restarted the nodes, all the nodes have HB restarted but they are not joining ONLINE group in crm status. #vim /etc/ha.d/ha.cf .. 
node nvsd-1 nvsd-2 nvp-common crm respawn Dec 10 14:13:48 nvp-common crmd: [9220]: WARN: crmd_ha_msg_callback: Ignoring HA message (op=join_announce) from nvsd-1: not in our membership list (size=1) Dec 10 14:14:07 nvp-common crmd: [9220]: WARN: crmd_ha_msg_callback: Ignoring HA message (op=vote) from nvsd-1: not in our membership list (size=1) Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.0 - 0.3.1 not applied to 0.4.4: current epoch is greater than required Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.1 - 0.3.2 not applied to 0.4.4: current epoch is greater than required Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.2 - 0.3.3 not applied to 0.4.4: current epoch is greater than required Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.3 - 0.3.4 not applied to 0.4.4: current epoch is greater than required Dec 10 14:14:13 nvp-common cib: [9216]: WARN: cib_process_replace: Replacement 0.3.4 not applied to 0.4.4: current epoch is greater than the replacement Dec 10 14:14:13 nvp-common cib: [9216]: WARN: cib_diff_notify: Update (client: crmd, call:13): -1.-1.-1 - 0.3.4 (Update was older than existing configuration) I am unable to understand this behaviour. Has someone already seen this issue??? Thanks Eswar Firewall? 
Mit freundlichen Grüßen, Michael Schwartzkopff -- [*] sys4 AG http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044 Franziskanerstraße 15, 81669 München Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263 Vorstand: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer Aufsichtsratsvorsitzender: Florian Kirstein
Re: [Pacemaker] Question about node-action-limit and migration-limit
On 18 Dec 2013, at 9:51 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote: Hi, When I set only migration-limit, without setting node-action-limit, in pacemaker-1.1, the number of concurrent operations other than migrate_to/migrate_from was limited to the value of migration-limit. (The node that I used has 8 cores.) [cib] property \ no-quorum-policy=freeze \ stonith-enabled=true \ startup-fencing=false \ migration-limit=3 ...snip... [log] $ egrep "warning: cluster_option|debug: throttle_update:" /var/log/ha-debug Dec 12 16:35:23 [7416] bl460g1n7 crmd: debug: throttle_update: Host bl460g1n6 supports a maximum of 16 jobs and throttle mode . New job limit is 16 Dec 12 16:35:25 [7416] bl460g1n7 crmd: warning: cluster_option: Using deprecated name 'migration-limit' for cluster option 'node-action-limit' Dec 12 16:35:25 [7416] bl460g1n7 crmd: warning: cluster_option: Using deprecated name 'migration-limit' for cluster option 'node-action-limit' Dec 12 16:35:25 [7416] bl460g1n7 crmd: warning: cluster_option: Using deprecated name 'migration-limit' for cluster option 'node-action-limit' Dec 12 16:35:26 [7416] bl460g1n7 crmd: debug: throttle_update: Host bl460g1n7 supports a maximum of 3 jobs and throttle mode . New job limit is 3 Dec 12 16:35:28 [7416] bl460g1n7 crmd: debug: throttle_update: Host bl460g1n8 supports a maximum of 3 jobs and throttle mode .
New job limit is 3 $ egrep do_lrm_rsc_op: Performing .* op=prmVM|process_lrm_event: LRM operation prmVM /var/log/ha-log|grep -v monitor Dec 12 16:35:28 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op: Performing key=24:1:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12 op=prmVM1_start_0 Dec 12 16:35:28 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op: Performing key=26:1:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12 op=prmVM2_start_0 Dec 12 16:35:28 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op: Performing key=28:1:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12 op=prmVM3_start_0 Dec 12 16:35:30 bl460g1n7 crmd[7416]: notice: process_lrm_event: LRM operation prmVM2_start_0 (call=27, rc=0, cib-update=23, confirmed=true) ok Dec 12 16:35:30 bl460g1n7 crmd[7416]: notice: process_lrm_event: LRM operation prmVM1_start_0 (call=26, rc=0, cib-update=24, confirmed=true) ok Dec 12 16:35:30 bl460g1n7 crmd[7416]: notice: process_lrm_event: LRM operation prmVM3_start_0 (call=28, rc=0, cib-update=25, confirmed=true) ok Dec 12 16:35:32 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op: Performing key=15:2:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12 op=prmVM4_start_0 Dec 12 16:35:32 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op: Performing key=17:2:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12 op=prmVM5_start_0 Dec 12 16:35:32 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op: Performing key=19:2:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12 op=prmVM6_start_0 Dec 12 16:35:34 bl460g1n7 crmd[7416]: notice: process_lrm_event: LRM operation prmVM4_start_0 (call=32, rc=0, cib-update=29, confirmed=true) ok Dec 12 16:35:34 bl460g1n7 crmd[7416]: notice: process_lrm_event: LRM operation prmVM5_start_0 (call=33, rc=0, cib-update=30, confirmed=true) ok Dec 12 16:35:34 bl460g1n7 crmd[7416]: notice: process_lrm_event: LRM operation prmVM6_start_0 (call=34, rc=0, cib-update=31, confirmed=true) ok Dec 12 16:37:26 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op: Performing key=12:4:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12 op=prmVM1_stop_0 Dec 12 16:37:26 bl460g1n7 crmd[7416]: info: 
do_lrm_rsc_op: Performing key=13:4:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12 op=prmVM2_stop_0 Dec 12 16:37:26 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op: Performing key=14:4:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12 op=prmVM3_stop_0 Dec 12 16:37:39 bl460g1n7 crmd[7416]: notice: process_lrm_event: LRM operation prmVM1_stop_0 (call=39, rc=0, cib-update=35, confirmed=true) ok Dec 12 16:37:39 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op: Performing key=15:4:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12 op=prmVM4_stop_0 Dec 12 16:37:39 bl460g1n7 crmd[7416]: notice: process_lrm_event: LRM operation prmVM2_stop_0 (call=41, rc=0, cib-update=36, confirmed=true) ok Dec 12 16:37:39 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op: Performing key=16:4:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12 op=prmVM5_stop_0 Dec 12 16:37:40 bl460g1n7 crmd[7416]: notice: process_lrm_event: LRM operation prmVM3_stop_0 (call=43, rc=0, cib-update=37, confirmed=true) ok Dec 12 16:37:40 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op: Performing key=17:4:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12 op=prmVM6_stop_0 Dec 12 16:37:51 bl460g1n7 crmd[7416]: notice: process_lrm_event: LRM operation prmVM4_stop_0 (call=45, rc=0, cib-update=38, confirmed=true) ok Dec 12 16:37:52 bl460g1n7 crmd[7416]: notice: process_lrm_event: LRM operation prmVM5_stop_0 (call=47, rc=0, cib-update=39, confirmed=true) ok Dec 12 16:37:52 bl460g1n7
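The warnings in the logs above show this build treating migration-limit as a deprecated alias for node-action-limit, which is why ordinary start/stop jobs ended up capped at 3 rather than only migrations. As an experiment, the two options can be set explicitly side by side; whether an explicit node-action-limit overrides the alias in this build is exactly what would need verifying, so treat this as a test configuration rather than a fix (the value 16 is illustrative, roughly 2x the poster's 8 cores, matching the default limit the logs report for the unconfigured node):

```
property \
    no-quorum-policy=freeze \
    stonith-enabled=true \
    startup-fencing=false \
    migration-limit=3 \
    node-action-limit=16
```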
Re: [Pacemaker] Time to get ready for 1.1.11
On 20 Dec 2013, at 2:11 am, Andrew Martin amar...@xes-inc.com wrote: David/Andrew, Once 1.1.11 final is released, is it considered the new stable series of Pacemaker (yes), or should 1.1.10 still be used in very stable/critical production environments? Thanks, Andrew - Original Message - From: David Vossel dvos...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, December 11, 2013 3:33:46 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 - Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, November 20, 2013 9:02:40 PM Subject: [Pacemaker] Time to get ready for 1.1.11 With over 400 updates since the release of 1.1.10, it's time to start thinking about a new release. Today I have tagged release candidate 1 [1]. The most notable fixes include: + attrd: Implementation of a truly atomic attrd for use with corosync 2.x + cib: Allow values to be added/updated and removed in a single update + cib: Support XML comments in diffs + Core: Allow blackbox logging to be disabled with SIGUSR2 + crmd: Do not block on proxied calls from pacemaker_remoted + crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load + crmd: Use the load on our peers to know how many jobs to send them + crm_mon: add --hide-headers option to hide all headers + crm_report: Collect logs directly from journald if available + Fencing: On timeout, clean up the agent's entire process group + Fencing: Support agents that need the host to be unfenced at startup + ipc: Raise the default buffer size to 128k + PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules + PE: Allow location constraints to take a regex pattern to match against resource IDs + pengine: Distinguish between the agent being missing and something the agent needs being missing + remote: Properly version the remote connection
protocol + services: Detect missing agents and permission errors before forking + Bug cl#5171 - pengine: Don't prevent clones from running due to dependent resources + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers If you are a user of `pacemaker_remoted`, you should take the time to read about changes to the online wire protocol[2] that are present in this release. [1] https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1 [2] http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/ To build `rpm` packages for testing: 1. Clone the current sources: # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git # cd pacemaker 2. If you haven't already, install Pacemaker's dependencies: [Fedora] # sudo yum install -y yum-utils [ALL] # make rpm-dep 3. Build Pacemaker: # make rc 4. Copy the rpms and deploy as needed. A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing: https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2 Assuming no major regressions are encountered during testing, this tag will become the final Pacemaker-1.1.11 release a week from today.
-- Vossel
Re: [Pacemaker] Pacemaker and RHEL/CENTOS 5.x compatibility ?
On 20 Dec 2013, at 1:36 am, Stephane Robin sro...@kivasystems.com wrote: Hi, This is a follow-up on my previous post 'Trouble building Pacemaker from source on CentOS 5.10'. Andrew: thanks for your pointers. It turns out Pacemaker 1.1.10 needed more changes to build on CentOS 5.x: • revert of a81d222 • g_timeout_add_seconds not available in libc in lib/services/services_linux.c • qb_to_cs_error conflicting type definition in include/crm_internal.h • Configure with --disable-fatal-warnings This brings me to my question: Pacemaker 1.1.10 was already broken for this OS, and I'm assuming that 1.1.11 will diverge even further. What is the official position regarding RHEL/CentOS 5.x support and testing? There's no conscious effort to break RHEL5; it's just not a focus for the developers. So we rely on reports like yours to tell us when something breaks - and whether anyone cares. All of the above seem pretty easily resolvable and we'll happily include them for 1.1.11 (hint: test the latest .11 beta to make sure there are no others :) Is there anyone else who cannot yet afford to move to RHEL 6 (for whatever reason) and is interested in keeping RHEL/CentOS 5.x compatibility? If there are, they don't seem interested in upgrading. Also, for what it's worth, Pacemaker is now supported on RHEL 6. Perhaps that adds incentive to update :)
Re: [Pacemaker] question on on-fail=restart
On 19 Dec 2013, at 4:03 am, Brusq, Jerome jerome.br...@signalis.com wrote: Dear all, I have a custom LSB script that launches a custom process: primitive myscript lsb:ha_swift \ op start interval=0 timeout=30s \ op stop interval=0 timeout=30s \ op monitor interval=15s on-fail=restart When this one crashes, I have the feeling that Pacemaker does "/etc/init.d/myscript stop" and then "/etc/init.d/myscript start". Is there a way for Pacemaker to do "/etc/init.d/myscript restart" instead? (Of course, the "restart" option in my script does something a little different from a stop + start.) No, sorry. Thanks! Jerome
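Since Pacemaker only ever invokes start, stop, status, and monitor on an LSB resource, any restart-only logic has to be folded into those actions. One hedged sketch of a workaround (a hypothetical script, not the poster's ha_swift agent): have start detect that the previous run ended uncleanly and perform the restart-specific recovery itself.

```shell
#!/bin/sh
# Hypothetical LSB-style wrapper.  Pacemaker recovers a failed
# resource with "stop" followed by "start", never "restart", so the
# restart-only recovery is triggered from start via a state marker.
STATE_DIR="${STATE_DIR:-/var/run/myservice}"   # assumed location

start() {
    if [ -e "$STATE_DIR/unclean" ]; then
        # Previous instance died without a clean stop: do the
        # recovery work that the script's "restart" action used to do.
        echo "recovering from unclean shutdown"
    fi
    mkdir -p "$STATE_DIR"
    touch "$STATE_DIR/unclean"   # cleared only by a clean stop
    echo "started"
}

stop() {
    rm -f "$STATE_DIR/unclean"
    echo "stopped"
}

case "$1" in
    start) start ;;
    stop)  stop ;;
    *)     echo "usage: $0 {start|stop}" ;;
esac
```

After a crash the marker file survives, so the next start (the one Pacemaker issues during recovery) sees it and runs the restart-specific path; a clean stop removes the marker and a subsequent start behaves normally.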
Re: [Pacemaker] Trouble building Pacemaker from source on CentOS 5.10
On 14 Dec 2013, at 7:51 am, Stephane Robin sro...@kivasystems.com wrote:

> Hi,
>
> I'm trying to build Pacemaker 1.1.10 (from git), with corosync 2.3.2 and libqb 0.16.0 on a CentOS 5.10 64-bit system. I have the latest autotools (automake 1.14, autoconf 2.69, libtool 2.4, pkg-config 0.27.1).
>
> For Pacemaker, I'm doing:
>
>   ./autogen.sh
>   ./configure --with-corosync --with-cs-quorum --without-snmp --without-nagios --enable-upstart=no
>
> pacemaker configuration:
>   Version                = 1.1.10 (Build: 368c726)
>   Features               = libqb-logging libqb-ipc corosync-native
>   Prefix                 = /usr
>   Executables            = /usr/sbin
>   Man pages              = /usr/share/man
>   Libraries              = /usr/lib64
>   Header files           = /usr/include
>   Arch-independent files = /usr/share
>   State information      = /var
>   System configuration   = /etc
>   Corosync Plugins       = /usr/lib64
>   Use system LTDL        = yes
>   HA group name          = haclient
>   HA user name           = hacluster
>   CFLAGS                 = -g -O2 -I/usr/include -I/usr/include/heartbeat -ggdb -fgnu89-inline -fstack-protector-all -Wall -Waggregate-return -Wbad-function-cast -Wcast-align -Wdeclaration-after-statement -Wendif-labels -Wfloat-equal -Wformat=2 -Wformat-security -Wformat-nonliteral -Wmissing-prototypes -Wmissing-declarations -Wnested-externs -Wno-long-long -Wno-strict-aliasing -Wpointer-arith -Wstrict-prototypes -Wwrite-strings -Werror
>   Libraries              = -lcorosync_common -lqb -lbz2 -lxslt -lxml2 -lc -luuid -lrt -ldl -L/lib64 -lglib-2.0 -lltdl -L/usr/lib64 -lqb -ldl -lrt -lpthread
>   Stack Libraries        = -L/usr/lib64 -lqb -ldl -lrt -lpthread -L/usr/lib64 -lcpg -L/usr/lib64 -lcfg -L/usr/lib64 -lcmap -L/usr/lib64 -lquorum
>
> But I'm getting this error:
>
>   gmake[2]: Entering directory `/root/Cluster3/pacemaker/lib/common'
>   CC ipc.lo
>   cc1: warnings being treated as errors
>   In file included from ../../include/crm_internal.h:26, from ipc.c:19:
>   ../../include/portability.h: In function 'g_strcmp0':
>   ../../include/portability.h:165: warning: implicit declaration of function 'strcmp'
>   gmake[2]: *** [ipc.lo] Error 1
>   gmake[2]: Leaving directory `/root/Cluster3/pacemaker/lib/common'
>
> If I disable warnings-as-errors, I can get past this problem (and several similar ones), but then fail on:
>
>   services_linux.c:33:26: error: sys/signalfd.h: No such file or directory
>   services_linux.c: In function 'services_os_action_execute':
>   services_linux.c:436: warning: implicit declaration of function 'signalfd'
>   services_linux.c:436: warning: nested extern declaration of 'signalfd'
>   services_linux.c:468: error: storage size of 'fdsi' isn't known
>   services_linux.c:471: error: invalid application of 'sizeof' to incomplete type 'struct signalfd_siginfo'
>   services_linux.c:472: error: invalid application of 'sizeof' to incomplete type 'struct signalfd_siginfo'
>   services_linux.c:468: warning: unused variable 'fdsi'
>
> What am I missing here? Is there a page somewhere with RHEL-5/CentOS-5 prerequisites or special instructions for building from source?

RHEL5 is too old by the looks of it. In this case, you could revert https://github.com/beekhof/pacemaker/commit/a81d222e which introduced the use of signalfd.

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] host came online but it was ignored
Which version of pacemaker?

On 10 Dec 2013, at 10:41 pm, ESWAR RAO eswar7...@gmail.com wrote:

> Hi Michael,
>
> There are no firewall rules. I could only see the messages below in the logs:
>
>   Dec 10 14:13:48 nvp-common crmd: [9220]: WARN: crmd_ha_msg_callback: Ignoring HA message (op=join_announce) from nvsd-1: not in our membership list (size=1)
>
> On Tue, Dec 10, 2013 at 2:46 PM, Michael Schwartzkopff m...@sys4.de wrote:
>> On Tuesday, 10 December 2013, at 14:30:32, ESWAR RAO wrote:
>>> Hi All,
>>>
>>> I had a 3-node HB+pacemaker setup. When I restarted the nodes, heartbeat restarted on all of them, but they are not joining the ONLINE group in crm status.
>>>
>>>   #vim /etc/ha.d/ha.cf
>>>   ..
>>>   node nvsd-1 nvsd-2 nvp-common
>>>   crm respawn
>>>
>>>   Dec 10 14:13:48 nvp-common crmd: [9220]: WARN: crmd_ha_msg_callback: Ignoring HA message (op=join_announce) from nvsd-1: not in our membership list (size=1)
>>>   Dec 10 14:14:07 nvp-common crmd: [9220]: WARN: crmd_ha_msg_callback: Ignoring HA message (op=vote) from nvsd-1: not in our membership list (size=1)
>>>   Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.0 - 0.3.1 not applied to 0.4.4: current epoch is greater than required
>>>   Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.1 - 0.3.2 not applied to 0.4.4: current epoch is greater than required
>>>   Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.2 - 0.3.3 not applied to 0.4.4: current epoch is greater than required
>>>   Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.3 - 0.3.4 not applied to 0.4.4: current epoch is greater than required
>>>   Dec 10 14:14:13 nvp-common cib: [9216]: WARN: cib_process_replace: Replacement 0.3.4 not applied to 0.4.4: current epoch is greater than the replacement
>>>   Dec 10 14:14:13 nvp-common cib: [9216]: WARN: cib_diff_notify: Update (client: crmd, call:13): -1.-1.-1 - 0.3.4 (Update was older than existing configuration)
>>>
>>> I am unable to understand this behaviour. Has someone already seen this issue?
>>>
>>> Thanks,
>>> Eswar
>>
>> Firewall?
>>
>> With kind regards,
>> Michael Schwartzkopff
>> --
>> [*] sys4 AG
>> http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
>> Franziskanerstraße 15, 81669 München
>> Registered office: Munich, Amtsgericht München: HRB 199263
>> Board: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer
>> Chairman of the supervisory board: Florian Kirstein
Re: [Pacemaker] is ccs as racy as it feels?
On 10 Dec 2013, at 11:31 pm, Brian J. Murrell br...@interlinx.bc.ca wrote:

> On Tue, 2013-12-10 at 10:27 +0000, Christine Caulfield wrote:
>> Sadly you're not wrong.
>
> That's what I was afraid of.
>
>> But it's actually no worse than updating corosync.conf manually,
>
> I think it is...
>
>> in fact it's pretty much the same thing,
>
> Not really. Updating corosync.conf on any given node means only having to write that file on that node. There is no cluster-wide synchronization needed

Approximately speaking, cman takes cluster.conf and generates an in-memory corosync.conf equivalent to be passed to corosync. So anything that could be done by editing corosync.conf should be possible with 'ccs -f ...'; neither approach results in any synchronisation or automatic update of the running process.

> and therefore no last-write-wins race, so all nodes can do that in parallel. Plus, adding a new node means only having to update the corosync.conf on that new node (and starting corosync, of course), and corosync then does the job of telling its peers about the new node, rather than the administrator having to go out and touch every node to inform them of the new member.

It sounds like this thread is less about cluster.conf vs. corosync.conf and more about autodiscovery vs. fixed node lists. Chrissie: is there no way to use cman in autodiscovery mode (i.e. with multicast/broadcast, learning about peers as they appear)?

> It's this removal of node auto-discovery and changing it to an operator task that is really complicating the workflow. Granted, it's not so much complicating it for a human operator, who is naturally only single-threaded and mostly incapable of inducing last-write-wins races. But when you are writing tools that now have to take what used to be a very capable multithreaded task, free of races, and shove it down a single-threaded pipe/queue just to eliminate races, this is a huge step backwards in evolution.
>
>> so nothing is actually getting worse.
>
> It is though. See above.
>
>> All the CIB information is still properly replicated.
>
> Yeah. I understand/understood that. Pacemaker's actual operations go mostly unchanged. It's the cluster membership process that's gotten needlessly complicated and regressed in functionality.
>
>> The main difficulty is in safely replicating information that's needed to boot the system.
>
> Do you literally mean starting the system up? I guess the use-case you are describing here is booting nodes from a clustered filesystem? But what if you don't need that complication? This process is being made more complicated to satisfy only a subset of the use-cases.
>
>> In general use we've not found it to be a huge problem (though, I'm still not keen on it either TBH) because most management is done by one person from one node.
>
> Indeed. As I said above, WRT single-threaded operators. But when you are writing a management system on top of all of this, which naturally wants to be multi-threaded (because scalable systems avoid bottlenecking through single choke points) and was able to be multithreaded when it was just corosync.conf, having to choke everything back down into a single thread just sucks.
>
>> There is not really any concept of nodes trying to add themselves to a cluster; it needs to be done by a person - which may be what you're unhappy with.
>
> Yes, not so much adding themselves, but being allowed to be added, in parallel, without fear of racing. This ccs tool wouldn't be so bad if it operated more like the CIB, where modifications are replicated automatically and properly locked, so that modifications could be made anywhere on the cluster and all members got them automatically, rather than pushing the work of locking, replication and serialization off onto the caller.
>
> b.
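To make the autodiscovery vs. fixed-node-list distinction concrete, here are two illustrative configuration fragments (all names and addresses are hypothetical, not from the thread). With corosync's multicast configuration, membership is discovered at runtime, so a joining node touches no other node; with cman's cluster.conf, every member must be listed in a file that exists on every node:

```
# corosync.conf (multicast): no node list -- peers announce themselves,
# so a new node only needs its own copy of this file.
totem {
    version: 2
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}
```

```
<!-- cluster.conf (cman): fixed node list -- adding a node3 means editing
     this file, bumping config_version, and distributing it to every member,
     which is where the last-write-wins race discussed above comes from. -->
<cluster name="example" config_version="2">
  <clusternodes>
    <clusternode name="node1" nodeid="1"/>
    <clusternode name="node2" nodeid="2"/>
  </clusternodes>
</cluster>
```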
Re: [Pacemaker] Where the heck is Beekhof?
Thanks to everyone for the well wishes, and yes, this is our third, so we can get quorum now ;-)

On Sun, Dec 1, 2013, at 01:51 PM, Serge Dubrouski wrote:
> Nope, you need three to always have a quorum.
>
> On Dec 1, 2013 9:43 AM, Arnold Krille arn...@arnoldarts.de wrote:
>> On Thu, 28 Nov 2013 12:04:01 +1100 Andrew Beekhof and...@beekhof.net wrote:
>>> If you find yourself asking $subject at some point in the next couple of months, the answer is that I'm taking leave to look after our new son (Lawson Tiberius Beekhof), who was born on Tuesday.
>>
>> Congrats! And remember: if you want HA, you gotta have two :-P
>>
>> - Arnold
Re: [Pacemaker] no-quorum-policy=freeze
On Wed, Nov 27, 2013, at 04:50 AM, Olivier Nicaise wrote:

> Hello all,
>
> I have an issue with the no-quorum-policy "freeze" (stonith disabled). I'm using an old version of pacemaker (1.1.6), the one distributed by Ubuntu 12.04. I have a cluster with 3 nodes running various resources, including drbd. Last night, my hosting provider did maintenance on its network, and all the machines were disconnected. What happened is that every machine promoted the DRBD resources and started the resources on top of them. Using no-quorum-policy=freeze, I was not expecting that.

I wouldn't expect that either.

> The network came back 1 or 2 minutes later, but the damage was already done. All my drbd resources were in a split-brain situation. Here are my options:
>
>   property $id="cib-bootstrap-options" \
>     dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>     cluster-infrastructure="openais" \
>     expected-quorum-votes="3" \
>     stonith-enabled="false" \
>     last-lrm-refresh="1385395648" \
>     no-quorum-policy="freeze"
>   rsc_defaults $id="rsc-options" \
>     resource-stickiness="1000"
>
> Do you know if there is a bug with this 1.1.6 version

Almost certainly; we're up to 1.1.10 (nearly .11) now. If you attach this file:

  Nov 27 01:18:41 vs001 crmd: [1016]: info: do_te_invoke: Processing graph 0 (ref=pe_calc-dc-1385511521-83) derived from /var/lib/pengine/pe-input-297.bz2

we can see if a newer version would help.

> or am I missing something? Logs are available at http://pastebin.com/Jxgq3MRH
>
> I know there is an issue with the cinder-HDD1 resources. It is not yet correctly configured.
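One observation on the configuration above: no-quorum-policy=freeze only governs what a partition does with resources it already holds; without fencing, disconnected DRBD peers can still each promote and split-brain. An illustrative crm-shell sketch of the safer combination (the fencing device and its parameters are hypothetical examples, not from the thread):

```
property stonith-enabled="true" \
         no-quorum-policy="freeze" \
         expected-quorum-votes="3"
# plus an actual fencing device, e.g. something like:
primitive st-ipmi stonith:external/ipmi \
         params hostname="vs001" ipaddr="192.0.2.10" userid="admin" passwd="secret"
```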
Re: [Pacemaker] p_mysql operation monitor failed 'not installed'
On 22 Nov 2013, at 7:32 am, Miha m...@softnet.si wrote:

> Hi,
>
> what could be the reason for this error?
>
>   notice: unpack_rsc_op: Preventing p_mysql from re-starting on sip2: operation monitor failed 'not installed' (rc=5)

The agent, or something the agent needs, is not available. How did you configure p_mysql?

>   p_mysql_monitor_0 on sip2 'not installed' (5): call=22, status=complete, last-rc-change='Thu Nov 21 15:27:01 2013', queued=33ms, exec=0ms
>   p_mysql_monitor_0 on sip1 'not installed' (5): call=22, status=complete, last-rc-change='Thu Nov 21 15:26:52 2013', queued=36ms, exec=0ms
>
> Mysql is running,

How? You shouldn't be starting cluster resources outside of the cluster.

> the drive is mounted, and there are no errors in mysql.
>
> tnx!
> miha
Re: [Pacemaker] exit code of crm_attribute
On 22 Nov 2013, at 12:33 am, Andrey Groshev gre...@yandex.ru wrote:

> Hi, Andrew!
>
> I'm trying to find the source of my problems. This trouble exists only with --query. I studied crm_attribute.c. IMHO, the problem is when we call
>
>   rc = read_attr_delegate(the_cib, type, dest_node, set_type, set_name, attr_id, attr_name, &read_value, TRUE, NULL);
>
> I think that dest_node == NULL, since the following piece of code ignores the return value:
>
>   if (pcmk_ok != query_node_uuid(the_cib, dest_uname, &dest_node, &is_remote_node)) {
>       fprintf(stderr, "Could not map name=%s to a UUID\n", dest_uname);
>   }
>
> Maybe it should look like this?
>
>   rc = query_node_uuid(the_cib, dest_uname, &dest_node, &is_remote_node);
>   if (rc != pcmk_ok) {
>       fprintf(stderr, "Could not map name=%s to a UUID\n", dest_uname);
>       return crm_exit(rc);
>   }

Agreed. That does look better. https://github.com/beekhof/pacemaker/commit/a4bdc9a

> 19.11.2013, 16:12, Andrey Groshev gre...@yandex.ru:
>> Hello Andrew!
>>
>> I'm sorry, I forgot about this thread, and now I've come across the same problem again.
>>
>>   # crm_attribute --type nodes --node-uname fackename.node.org --attr-name notexistattibute --query > /dev/null; echo $?
>>   Could not map name=fackename.node.org to a UUID
>>   0
>>
>> Version PCMK 1.1.11
>>
>> 23.09.2013, 08:23, Andrew Beekhof and...@beekhof.net:
>>> On 20/09/2013, at 5:53 PM, Andrey Groshev gre...@yandex.ru wrote:
>>>> Hi again!
>>>>
>>>> Today I again met some strange behaviour. I asked for a non-existent attribute of an existing node:
>>>>
>>>>   # crm_attribute --type nodes --node-uname exist.node.domain.com --attr-name notexistattibute --query ; echo $?
>>>>   Could not map name=dev-cluster2-node2.unix.tensor.ru to a UUID
>>>>   0
>>>>
>>>> That is, it complains to STDERR, but the exit code is 0.
>>>
>>> That probably shouldn't happen. Version?
Re: [Pacemaker] CentOS 6.4 last update - Failed to create cluster resources with pcs command
On 22 Nov 2013, at 4:15 am, Dmitry Bron dmitr...@gmail.com wrote:

> Hi All,
>
> We have two freshly installed boxes with CentOS 6.4 and the latest updates, which we want to configure as an Active-Standby HA cluster. We copied all configuration files from another HA cluster that works well. We already have another pair of CentOS 6.4 boxes configured as an Active-Standby HA cluster; however, we didn't install the latest updates on those machines. Of course, we changed the IP addresses for the cluster nodes and for multicast in cluster.conf and restarted the pacemaker service. Then, when we tried to create cluster resources, it failed. We tried to create cluster resources many times, and every time it failed in a different place: once at creation of the ClusterIP resource, next at the colocation constraint, and so on. The problem doesn't happen on the other HA cluster where we haven't applied updates. We found a segmentation fault in /var/log/messages at the time we tried to create and configure resources:

Please install the debuginfo packages and run crm_report. We need the information contained in the core file.
> ===
>   Nov 21 13:15:27 ha-test1 cibadmin[6466]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin -Q --xpath //primitive[@id='TssAgent']
>   Nov 21 13:15:27 ha-test1 cibadmin[6467]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin -Q --xpath //primitive[@id='ClusterIP']
>   Nov 21 13:15:27 ha-test1 cibadmin[6468]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin -Q --xpath //constraints
>   Nov 21 13:15:27 ha-test1 cibadmin[6469]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin -c -R --xml-text <constraints><rsc_colocation id="colocation-TssAgent-ClusterIP-INFINITY" rsc="TssAgent" score="INFINITY" with-rsc="ClusterIP"/></constraints>
>   Nov 21 13:15:27 ha-test1 cib[6066]: notice: cib:diff: Diff: --- 0.328.3
>   Nov 21 13:15:27 ha-test1 cib[6066]: notice: cib:diff: Diff: +++ 0.329.1 09b3edecab4ee85ea2b04e70cefef472
>   Nov 21 13:15:27 ha-test1 cib[6066]: notice: cib:diff: -- <cib admin_epoch="0" epoch="328" num_updates="3"/>
>   Nov 21 13:15:27 ha-test1 cib[6066]: notice: cib:diff: ++ <rsc_colocation id="colocation-TssAgent-ClusterIP-INFINITY" rsc="TssAgent" score="INFINITY" with-rsc="ClusterIP"/>
>   Nov 21 13:15:27 ha-test1 crmd[6069]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>   Nov 21 13:15:27 ha-test1 crmd[6069]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
>   Nov 21 13:15:27 ha-test1 kernel: cib[6066]: segfault at 1 ip 7fa9a99f50ec sp 7fff875032d0 error 4 in libc-2.12.so[7fa9a99ad000+18a000]
>   Nov 21 13:15:27 ha-test1 pacemakerd[10460]: notice: pcmk_child_exit: Child process cib terminated with signal 11 (pid=6066, core=128)
>   Nov 21 13:15:27 ha-test1 pacemakerd[10460]: notice: pcmk_process_exit: Respawning failed child process: cib
> ===
>
> Please see in the attachment the call traces captured by strace of the 'crm_attribute --type op_defaults --attr-name timeout --attr-value 300s' command. That command also failed.
> The cman, pcs and clusterlib RPMs were updated; please see the list of all updated RPMs in the attachment. Operating system updates are very important for us, and we endeavour to always keep our systems up to date.
>
> Thanks for your help,
> Dima
>
> [attachments: crm_attribute-strace_debug_log.txt, CentOS_6.4-updated_rpms.txt]
Re: [Pacemaker] pacemaker update crash my config (cannot be represented in the CLI notation)
On 21 Nov 2013, at 6:08 am, Lars Marowsky-Bree l...@suse.com wrote:

> On 2013-11-20T16:43:51, Beo Banks beo.ba...@googlemail.com wrote:
>> INFO: object cli-prefer-mysql cannot be represented in the CLI notation
>>
>>   crm configure show | grep xml
>>   INFO: object cli-prefer-mysql cannot be represented in the CLI notation
>>   xml <rsc_location id="cli-prefer-mysql" node="hostname" role="Started" rsc="mysql" score="INFINITY"/>
>
> This does not mean your configuration is invalid, just that this crm shell version encountered an XML construct it can't render, hence it displays it as XML. You need to upgrade the crm shell too. The errors you show below are unrelated to this.

Yep. Beo: can you provide a crm_report starting from the time that both nodes were started?
[Pacemaker] Time to get ready for 1.1.11
With over 400 updates since the release of 1.1.10, it's time to start thinking about a new release. Today I have tagged release candidate 1 [1]. The most notable fixes include:

+ attrd: Implementation of a truly atomic attrd for use with corosync 2.x
+ cib: Allow values to be added/updated and removed in a single update
+ cib: Support XML comments in diffs
+ Core: Allow blackbox logging to be disabled with SIGUSR2
+ crmd: Do not block on proxied calls from pacemaker_remoted
+ crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load
+ crmd: Use the load on our peers to know how many jobs to send them
+ crm_mon: add --hide-headers option to hide all headers
+ crm_report: Collect logs directly from journald if available
+ Fencing: On timeout, clean up the agent's entire process group
+ Fencing: Support agents that need the host to be unfenced at startup
+ ipc: Raise the default buffer size to 128k
+ PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules
+ PE: Allow location constraints to take a regex pattern to match against resource IDs
+ pengine: Distinguish between the agent being missing and something the agent needs being missing
+ remote: Properly version the remote connection protocol
+ services: Detect missing agents and permission errors before forking
+ Bug cl#5171 - pengine: Don't prevent clones from running due to dependent resources
+ Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known
+ Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers

If you are a user of `pacemaker_remoted`, you should take the time to read about the changes to the remote wire protocol [2] that are present in this release.

[1] https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1
[2] http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/

To build `rpm` packages for testing:

1. Clone the current sources:

     # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
     # cd pacemaker

2. If you haven't already, install Pacemaker's dependencies:

     [Fedora] # sudo yum install -y yum-utils
     [ALL]    # make rpm-dep

3. Build Pacemaker:

     # make rc

4. Copy the rpms and deploy as needed.
Re: [Pacemaker] stonith ra class missing
On 19 Nov 2013, at 4:19 pm, Michael Schwartzkopff m...@sys4.de wrote:

> Andrew Beekhof and...@beekhof.net wrote:
>> On 19 Nov 2013, at 1:23 am, Michael Schwartzkopff m...@sys4.de wrote:
>>> Hi,
>>>
>>> I installed pacemaker on a RHEL 6.4 machine. Now crm tells me that there is no stonith RA class, only lsb, ocf and service. What did I miss? Thanks for any valuable comments.
>>
>> did you install the fencing-agents package?
>
> Yes, of course.

Not everyone does :-)

What does 'stonith_admin -I' say?
Re: [Pacemaker] Finally. A REAL question.
On 18 Nov 2013, at 3:30 pm, Rob Thomas xro...@gmail.com wrote:

>> I've been browsing through the cluster.log, and it's not even trying to move httpd. I'm almost certain that it used to work fine with resource sets. Hmm.
>
> OK. I went and -actually looked- at the CIB I was previously generating. This works:
>
>   <rsc_colocation id="freepbx" score="INFINITY">
>     <resource_set id="colo-freepbx-0"><resource_ref id="asterisk"/></resource_set>
>     <resource_set id="colo-freepbx-1"><resource_ref id="httpd"/></resource_set>
>     <resource_set id="colo-freepbx-2" role="Master"><resource_ref id="ms-asterisk"/></resource_set>
>     <resource_set id="colo-freepbx-3" role="Master"><resource_ref id="ms-httpd"/></resource_set>
>   </rsc_colocation>

my eyes! my eyes!

> It appears that pcs can't do that, or if it can, I can't figure out how. Which is why I've been fighting with it all day! Is this a feature request, or a PEBCAK issue?
>
> --Rob
Re: [Pacemaker] stonith ra class missing
On 19 Nov 2013, at 1:23 am, Michael Schwartzkopff m...@sys4.de wrote:

> Hi,
>
> I installed pacemaker on a RHEL 6.4 machine. Now crm tells me that there is no stonith RA class, only lsb, ocf and service. What did I miss? Thanks for any valuable comments.

did you install the fencing-agents package?
Re: [Pacemaker] Finally. A REAL question.
On 19 Nov 2013, at 6:00 am, Rob Thomas xro...@gmail.com wrote:

> On Mon, Nov 18, 2013 at 9:17 PM, Andrew Beekhof and...@beekhof.net wrote:
>> my eyes! my eyes!
>
> So... What's the -right- way to do it then? 8)

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-sets-collocation.html

>   <rsc_colocation id="pcs_rsc_colocation">
>     <resource_set id="pcs_rsc_set">
>       <resource_ref id="httpd"/>
>       <resource_ref id="asterisk"/>
>     </resource_set>
>   </rsc_colocation>

is almost right, but misses score="INFINITY" in the rsc_colocation tag. You can do that with:

  pcs constraint colocation set httpd asterisk setoptions score=INFINITY

Note that this is very different from the command I asked you about:

  pcs constraint colocation add asterisk with httpd

which creates something more like:

  <rsc_colocation id="pcs_rsc_colocation" rsc="asterisk" with-rsc="httpd" score="INFINITY"/>

See http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_mandatory_placement.html
Re: [Pacemaker] Finally. A REAL question.
On 19 Nov 2013, at 10:30 am, Rob Thomas xro...@gmail.com wrote: On Tue, Nov 19, 2013 at 8:55 AM, Andrew Beekhof and...@beekhof.net wrote: On 19 Nov 2013, at 6:00 am, Rob Thomas xro...@gmail.com wrote: So... What's the -right- way to do it then? 8)

<rsc_colocation id="pcs_rsc_colocation">
  <resource_set id="pcs_rsc_set">
    <resource_ref id="httpd"/>
    <resource_ref id="asterisk"/>
  </resource_set>
</rsc_colocation>

is almost right, but misses score="INFINITY" in the rsc_colocation tag. Ah. And that would have been why it didn't shut down httpd. When I tried that, I wasn't looking at the raw XML to see what pcs was actually doing. Note that this is very different to the command I asked you about: pcs constraint colocation add asterisk with httpd Which creates something more like:

<rsc_colocation id="pcs_rsc_colocation" rsc="asterisk" with-rsc="httpd" score="INFINITY"/>

Yep. I replied to that earlier. Oh, you mean you tried that way too? It wasn't clear because the xml only had the 'set' variant present. --snip-- Yep. It ends up with asterisk stopped, and httpd happily running on -a (and it won't start, because the colocation docs say 'if z is not running, y won't start', so that makes sense)

Resource Group: httpd
    httpd_fs         (ocf::heartbeat:Filesystem):  Started freepbx-a
    httpd_ip         (ocf::heartbeat:IPaddr2):     Started freepbx-a
    httpd_service    (ocf::heartbeat:apache):      Started freepbx-a
Resource Group: asterisk
    asterisk_fs      (ocf::heartbeat:Filesystem):  Stopped
    asterisk_ip      (ocf::heartbeat:IPaddr2):     Stopped
    asterisk_service (ocf::heartbeat:freepbx):     Stopped

Failed actions:
    asterisk_service_monitor_3 on freepbx-a 'not running' (7): call=2217, status=complete, last-rc-change='Mon Nov 18 14:05:08 2013', queued=0ms, exec=0ms

I've been browsing through the cluster.log, and it's not even trying to move httpd. I'm almost certain that it used to work fine with resource sets. Hmm.
--snip-- After I posted that, I then went and had an actual look at a working cluster, and realised exactly how I was doing it, and posted the next message. I'll have a try with the setoptions and see if that works. Thanks! --Rob
Re: [Pacemaker] The larger cluster is tested.
On 16 Nov 2013, at 12:22 am, yusuke iida yusk.i...@gmail.com wrote: Hi, Andrew Thanks for the various suggestions. To confirm what batch-limit is suitable, I fixed the value of batch-limit at 1, 2, 3, and 4 from the beginning and tested each. In my environment the results were: no timeouts with batch-limit=1 and 2; one timeout with batch-limit=3; five timeouts with batch-limit=4. From the above results, I think the limit of limit = QB_MAX(1, peers/4) is still too high. Remember these results are specific to your (virtual) hardware and configured timeouts. I would argue that 5 timeouts out of 2853 actions is actually quite impressive for a default value in this sort of situation.[1] Some tuning in a cluster of this kind is to be expected. [1] It took crm_simulate 4 minutes to even pretend to perform all those operations. So I have created a fix that pins the batch-limit to 2 when the cluster enters the extreme throttling state. https://github.com/yuusuke/pacemaker/commit/efe2d6ebc55be39b8be43de38e7662f039b61dec Testing it several times, it seems to work without problems. Below are the reports from testing with each fixed batch-limit: batch-limit=1 https://drive.google.com/file/d/0BwMFJItoO-fVNk8wTGlYNjNnSHc/edit?usp=sharing batch-limit=2 https://drive.google.com/file/d/0BwMFJItoO-fVTnc4bXY2YXF2M2M/edit?usp=sharing batch-limit=3 https://drive.google.com/file/d/0BwMFJItoO-fVYl9Gbks2VlJMR0k/edit?usp=sharing batch-limit=4 https://drive.google.com/file/d/0BwMFJItoO-fVZnJIazd5MFQ1aGs/edit?usp=sharing The report from running my test code is the following: https://drive.google.com/file/d/0BwMFJItoO-fVbzB0NjFLeVY3Zmc/edit?usp=sharing Regards, Yusuke 2013/11/13 Andrew Beekhof and...@beekhof.net: Did you look at the load numbers in the logs? The CPUs are being slammed for over 20 minutes. The automatic tuning can only help so much; you're simply asking the cluster to do more work than it is capable of.
Giving more priority to cib operations that come via IPC is one option, but as I explained earlier, it comes at the cost of correctness. Given the huge mismatch between the nodes' capacity and the tasks you're asking them to achieve, your best path forward is probably setting a load-threshold < 40% or a batch-limit <= 8. Or we could try a patch like the one below if we think that the defaults are not aggressive enough.

diff --git a/crmd/throttle.c b/crmd/throttle.c
index d77195a..7636d4a 100644
--- a/crmd/throttle.c
+++ b/crmd/throttle.c
@@ -611,14 +611,14 @@ throttle_get_total_job_limit(int l)
         switch(r->mode) {
             case throttle_extreme:
-                if(limit == 0 || limit > peers/2) {
-                    limit = peers/2;
+                if(limit == 0 || limit > peers/4) {
+                    limit = QB_MAX(1, peers/4);
                 }
                 break;
             case throttle_high:
-                if(limit == 0 || limit > peers) {
-                    limit = peers;
+                if(limit == 0 || limit > peers/2) {
+                    limit = QB_MAX(1, peers/2);
                 }
                 break;
             default:

This may also be worthwhile:

diff --git a/crmd/throttle.c b/crmd/throttle.c
index d77195a..586513a 100644
--- a/crmd/throttle.c
+++ b/crmd/throttle.c
@@ -387,22 +387,36 @@ static bool throttle_io_load(float *load, unsigned int *blocked)
 }
 
 static enum throttle_state_e
-throttle_handle_load(float load, const char *desc)
+throttle_handle_load(float load, const char *desc, int cores)
 {
-    if(load > THROTTLE_FACTOR_HIGH * throttle_load_target) {
+    float adjusted_load = load;
+
+    if(cores <= 0) {
+        /* No adjusting of the supplied load value */
+
+    } else if(cores == 1) {
+        /* On a single core machine, a load of 1.0 is already too high */
+        adjusted_load = load * THROTTLE_FACTOR_MEDIUM;
+
+    } else {
+        /* Normalize the load to be per-core */
+        adjusted_load = load / cores;
+    }
+
+    if(adjusted_load > THROTTLE_FACTOR_HIGH * throttle_load_target) {
         crm_notice("High %s detected: %f", desc, load);
         return throttle_high;
 
-    } else if(load > THROTTLE_FACTOR_MEDIUM * throttle_load_target) {
+    } else if(adjusted_load > THROTTLE_FACTOR_MEDIUM * throttle_load_target) {
         crm_info("Moderate %s detected: %f", desc, load);
         return throttle_med;
 
-    } else if(load > THROTTLE_FACTOR_LOW * throttle_load_target) {
+    } else if(adjusted_load > THROTTLE_FACTOR_LOW * throttle_load_target) {
         crm_debug("Noticable %s detected: %f", desc, load);
         return throttle_low;
     }
 
-    crm_trace("Negligable %s detected: %f", desc, load);
+    crm_trace("Negligable %s detected: %f", desc, adjusted_load);
     return throttle_none;
 }
 
@@ -464,22 +478,12 @@ throttle_mode(void)
     }
 
     if(throttle_load_avg(&load)) {
-        float
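[Editor's note] The first hunk of the patch discussed above changes the per-mode job limit to QB_MAX(1, peers/4) for extreme mode and QB_MAX(1, peers/2) for high mode (QB_MAX is libqb's max macro). A minimal sketch of that arithmetic in plain shell, for illustration only:

```shell
# Sketch of the proposed job-limit calculation; mirrors the patch's
# arithmetic, not the actual crmd code.
job_limit() {  # $1 = throttle mode (extreme|high), $2 = number of peers
  case "$1" in
    extreme) l=$(( $2 / 4 )) ;;
    high)    l=$(( $2 / 2 )) ;;
    *)       l=$2 ;;
  esac
  # QB_MAX(1, ...): never let the limit fall to zero
  [ "$l" -lt 1 ] && l=1
  echo "$l"
}
job_limit extreme 8   # -> 2
job_limit extreme 2   # -> 1 (floor of 1 even on tiny clusters)
job_limit high 8      # -> 4
```

The QB_MAX floor is the point of the change: with integer division, a 2- or 3-node cluster would otherwise end up with a limit of 0 in extreme mode.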
Re: [Pacemaker] No such device, problem with setting pacemaker
On 18 Nov 2013, at 11:59 pm, Miha m...@softnet.si wrote: Hi, I am setting up a cluster with pacemaker and corosync for the first time. Server A and server B can ping each other, and I have disabled selinux and iptables, but I can not get this going. I did it step by step as written in the tutorial. Have you configured a stonith device in pacemaker? What does your config look like? Here is an error that I am getting in messages:

Nov 18 21:36:39 sip2 crmd[13483]: notice: tengine_stonith_notify: Peer sip1.domain.com was not terminated (reboot) by sip2.domain.com for sip2.domain.com: No such device (ref=ee5d3db1-3c95-4230-8330-a279a03f5f90) by client stonith_admin.cman.18267
Nov 18 21:36:39 sip2 fence_pcmk[18266]: Call to fence sip1.domain.com (reset) failed with rc=237
Nov 18 21:36:42 sip2 fence_pcmk[18286]: Requesting Pacemaker fence sip1.domain.com (reset)
Nov 18 21:36:42 sip2 stonith_admin[18287]: notice: crm_log_args: Invoked: stonith_admin --reboot sip1.domain.com --tolerance 5s --tag cman
Nov 18 21:36:42 sip2 stonith-ng[13479]: notice: handle_request: Client stonith_admin.cman.18287.46738a96 wants to fence (reboot) 'sip1.domain.com' with device '(any)'
Nov 18 21:36:42 sip2 stonith-ng[13479]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for sip1.domain.com: 64a4294f-059a-416f-8e9d-944db759a0e6 (0)
Nov 18 21:36:42 sip2 stonith-ng[13479]: error: remote_op_done: Operation reboot of sip1.domain.com by sip2.domain.com for stonith_admin.cman.18...@sip2.domain.com.64a4294f: No such device
Nov 18 21:36:42 sip2 crmd[13483]: notice: tengine_stonith_notify: Peer sip1.domain.com was not terminated (reboot) by sip2.domain.com for sip2.domain.com: No such device (ref=64a4294f-059a-416f-8e9d-944db759a0e6) by client stonith_admin.cman.18287
Nov 18 21:36:42 sip2 fence_pcmk[18286]: Call to fence sip1.domain.com (reset) failed with rc=237

tnx for help!
miha
Re: [Pacemaker] Finally. A REAL question.
On 19 Nov 2013, at 2:50 pm, Rob Thomas xro...@gmail.com wrote: On 19 Nov 2013, at 6:00 am, Rob Thomas xro...@gmail.com wrote: So... What's the -right- way to do it then? 8)

<rsc_colocation id="pcs_rsc_colocation">
  <resource_set id="pcs_rsc_set">
    <resource_ref id="httpd"/>
    <resource_ref id="asterisk"/>
  </resource_set>
</rsc_colocation>

... I'll have a try with the setoptions and see if that works. Thanks! Without adding the ms resource, it won't fail the other service over completely. This works (which is a LITTLE bit more pleasing to the eyes, hopefully! I even set the font to monospace!)

<rsc_colocation id="pcs_rsc_colocation" score="INFINITY">
  <resource_set id="pcs_rsc_set-1">
    <resource_ref id="httpd"/>
    <resource_ref id="asterisk"/>
  </resource_set>
  <resource_set id="pcs_rsc_set-2" role="Master">
    <resource_ref id="ms-asterisk"/>
    <resource_ref id="ms-httpd"/>
  </resource_set>
</rsc_colocation>

I'm still pretty sure you can't do that through pcs. The docs suggest this should work, but it seems not to: pcs constraint colocation set httpd asterisk set ms-asterisk ms-httpd setoptions role=Master score=INFINITY Chris: the command 'pcs constraint colocation set httpd asterisk setoptions role=Master' creates:

<rsc_colocation id="pcs_rsc_colocation" role="Master">
  <resource_set id="pcs_rsc_set">
    <resource_ref id="httpd"/>
    <resource_ref id="asterisk"/>
  </resource_set>
</rsc_colocation>

However role belongs with the resource_set.
And the reason why (I believe) I need them, to go into a bit more depth - (sorry for everyone else who's getting bored with this incredibly arcane and in-depth discussion, that has degenerated into pasting XML snippets everywhere) here's the relevant associated constraints:

<rsc_colocation id="c-1" rsc="asterisk_fs" score="INFINITY" with-rsc="ms-asterisk" with-rsc-role="Master"/>
<rsc_order first="ms-asterisk" first-action="promote" id="o-1" score="INFINITY" then="asterisk_fs" then-action="start"/>
<rsc_colocation id="c-2" rsc="httpd_fs" score="INFINITY" with-rsc="ms-httpd" with-rsc-role="Master"/>
<rsc_order first="ms-httpd" first-action="promote" id="o-2" score="INFINITY" then="httpd_fs" then-action="start"/>

(id's changed to aid reading) Without pcs_rsc_set-2, before failing:

Master/Slave Set: ms-asterisk [drbd_asterisk]
    Masters: [ freepbx-a ]
    Slaves: [ freepbx-b ]
Master/Slave Set: ms-httpd [drbd_httpd]
    Masters: [ freepbx-a ]
    Slaves: [ freepbx-b ]
Resource Group: httpd
    httpd_fs          (ocf::heartbeat:Filesystem):  Started freepbx-a
    httpd_ip          (ocf::heartbeat:IPaddr2):     Started freepbx-a
    httpd_service     (ocf::heartbeat:apache):      Started freepbx-a
Resource Group: asterisk
    asterisk_fs       (ocf::heartbeat:Filesystem):  Started freepbx-a
    asterisk_ip       (ocf::heartbeat:IPaddr2):     Started freepbx-a
    asterisk_service  (ocf::heartbeat:freepbx):     Started freepbx-a
    isymphony_service (lsb:iSymphonyServer):        Started freepbx-a

AFTER failing:

Master/Slave Set: ms-asterisk [drbd_asterisk]
    Masters: [ freepbx-a ]   <-- THIS IS WRONG
    Slaves: [ freepbx-b ]
Master/Slave Set: ms-httpd [drbd_httpd]
    Masters: [ freepbx-b ]
    Slaves: [ freepbx-a ]
Resource Group: asterisk
    asterisk_fs       (ocf::heartbeat:Filesystem):  Stopped
    asterisk_ip       (ocf::heartbeat:IPaddr2):     Stopped
    asterisk_service  (ocf::heartbeat:freepbx):     Stopped
    isymphony_service (lsb:iSymphonyServer):        Stopped
Resource Group: httpd
    httpd_fs          (ocf::heartbeat:Filesystem):  Started freepbx-b
    httpd_ip          (ocf::heartbeat:IPaddr2):     Started freepbx-b
    httpd_service     (ocf::heartbeat:apache):      Started freepbx-b

The asterisk group isn't
starting because - obviously - it's not the master for ms-asterisk. So the constraint worked, BUT, because I don't have the resource in there, I can't tell it to shut down. My other idea was having the ms- ordering start the group, but that doesn't work either. --Rob
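[Editor's note] The working two-set constraint from this exchange puts role="Master" on the second resource_set, whereas pcs was emitting it on the rsc_colocation element. A sketch of the working form (taken from the thread; written to a file only so the structure can be checked, since applying it needs a live cluster):

```shell
# Sketch: role belongs on the resource_set, not on rsc_colocation.
cat <<'EOF' > /tmp/two-set-colocation.xml
<rsc_colocation id="pcs_rsc_colocation" score="INFINITY">
  <resource_set id="pcs_rsc_set-1">
    <resource_ref id="httpd"/>
    <resource_ref id="asterisk"/>
  </resource_set>
  <resource_set id="pcs_rsc_set-2" role="Master">
    <resource_ref id="ms-asterisk"/>
    <resource_ref id="ms-httpd"/>
  </resource_set>
</rsc_colocation>
EOF
# role="Master" should appear exactly once, on a resource_set element:
grep 'role="Master"' /tmp/two-set-colocation.xml
```

This is why the groups and the master/slave resources both end up constrained: the first set ties the groups together, the second ties the DRBD masters together, and the enclosing score="INFINITY" ties the two sets to each other.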
Re: [Pacemaker] Finally. A REAL question.
On 19 Nov 2013, at 3:09 pm, Andrew Beekhof and...@beekhof.net wrote: On 19 Nov 2013, at 2:50 pm, Rob Thomas xro...@gmail.com wrote: On 19 Nov 2013, at 6:00 am, Rob Thomas xro...@gmail.com wrote: So... What's the -right- way to do it then? 8)

<rsc_colocation id="pcs_rsc_colocation">
  <resource_set id="pcs_rsc_set">
    <resource_ref id="httpd"/>
    <resource_ref id="asterisk"/>
  </resource_set>
</rsc_colocation>

... I'll have a try with the setoptions and see if that works. Thanks! Without adding the ms resource, it won't fail the other service over completely. This works (which is a LITTLE bit more pleasing to the eyes, hopefully! I even set the font to monospace!)

<rsc_colocation id="pcs_rsc_colocation" score="INFINITY">
  <resource_set id="pcs_rsc_set-1">
    <resource_ref id="httpd"/>
    <resource_ref id="asterisk"/>
  </resource_set>
  <resource_set id="pcs_rsc_set-2" role="Master">
    <resource_ref id="ms-asterisk"/>
    <resource_ref id="ms-httpd"/>
  </resource_set>
</rsc_colocation>

Also, yes that is correct. The non-set equivalent would be:

pcs constraint colocation add asterisk with httpd
pcs constraint colocation add master ms-asterisk with asterisk
pcs constraint colocation add master ms-httpd with master ms-asterisk

If that doesn't work, send me cibadmin -Ql after the failure and I can investigate. I'm still pretty sure you can't do that through pcs. The docs suggest this should work, but it seems not to: pcs constraint colocation set httpd asterisk set ms-asterisk ms-httpd setoptions role=Master score=INFINITY Chris: the command 'pcs constraint colocation set httpd asterisk setoptions role=Master' creates:

<rsc_colocation id="pcs_rsc_colocation" role="Master">
  <resource_set id="pcs_rsc_set">
    <resource_ref id="httpd"/>
    <resource_ref id="asterisk"/>
  </resource_set>
</rsc_colocation>

However role belongs with the resource_set.
And the reason why (I believe) I need them, to go into a bit more depth - (sorry for everyone else who's getting bored with this incredibly arcane and in-depth discussion, that has degenerated into pasting XML snippets everywhere) here's the relevant associated constraints:

<rsc_colocation id="c-1" rsc="asterisk_fs" score="INFINITY" with-rsc="ms-asterisk" with-rsc-role="Master"/>
<rsc_order first="ms-asterisk" first-action="promote" id="o-1" score="INFINITY" then="asterisk_fs" then-action="start"/>
<rsc_colocation id="c-2" rsc="httpd_fs" score="INFINITY" with-rsc="ms-httpd" with-rsc-role="Master"/>
<rsc_order first="ms-httpd" first-action="promote" id="o-2" score="INFINITY" then="httpd_fs" then-action="start"/>

(id's changed to aid reading) Without pcs_rsc_set-2, before failing:

Master/Slave Set: ms-asterisk [drbd_asterisk]
    Masters: [ freepbx-a ]
    Slaves: [ freepbx-b ]
Master/Slave Set: ms-httpd [drbd_httpd]
    Masters: [ freepbx-a ]
    Slaves: [ freepbx-b ]
Resource Group: httpd
    httpd_fs          (ocf::heartbeat:Filesystem):  Started freepbx-a
    httpd_ip          (ocf::heartbeat:IPaddr2):     Started freepbx-a
    httpd_service     (ocf::heartbeat:apache):      Started freepbx-a
Resource Group: asterisk
    asterisk_fs       (ocf::heartbeat:Filesystem):  Started freepbx-a
    asterisk_ip       (ocf::heartbeat:IPaddr2):     Started freepbx-a
    asterisk_service  (ocf::heartbeat:freepbx):     Started freepbx-a
    isymphony_service (lsb:iSymphonyServer):        Started freepbx-a

AFTER failing:

Master/Slave Set: ms-asterisk [drbd_asterisk]
    Masters: [ freepbx-a ]   <-- THIS IS WRONG
    Slaves: [ freepbx-b ]
Master/Slave Set: ms-httpd [drbd_httpd]
    Masters: [ freepbx-b ]
    Slaves: [ freepbx-a ]
Resource Group: asterisk
    asterisk_fs       (ocf::heartbeat:Filesystem):  Stopped
    asterisk_ip       (ocf::heartbeat:IPaddr2):     Stopped
    asterisk_service  (ocf::heartbeat:freepbx):     Stopped
    isymphony_service (lsb:iSymphonyServer):        Stopped
Resource Group: httpd
    httpd_fs          (ocf::heartbeat:Filesystem):  Started freepbx-b
    httpd_ip          (ocf::heartbeat:IPaddr2):     Started freepbx-b
    httpd_service     (ocf::heartbeat:apache):      Started freepbx-b

The asterisk group isn't
starting because - obviously - it's not the master for ms-asterisk. So the constraint worked, BUT, because I don't have the resource in there, I can't tell it to shut down. My other idea was having the ms- ordering start the group, but that doesn't work either. --Rob
Re: [Pacemaker] Remove a ghost node
On 19 Nov 2013, at 3:21 am, Sean Lutner s...@rentul.net wrote: On Nov 17, 2013, at 7:40 PM, Andrew Beekhof and...@beekhof.net wrote: On 15 Nov 2013, at 2:28 pm, Sean Lutner s...@rentul.net wrote: Yes the varnish resources are in a group which is then cloned. -EDONTDOTHAT You can't refer to the things inside a clone. 1.1.8 will have just been ignoring those constraints. So the implicit order and colocation constraints in a group and clone will take care of those? Which means remove the constraints and retry the upgrade? No, it means rewrite them to refer to the clone - whatever is the outermost container. I see, thanks. Did I miss that in the docs or is it undocumented/implied? If I didn't miss it, it'd be nice if that were explicitly documented. Assuming:

<clone id="X">
  ...
  <primitive id="Y" .../>
</clone>

For most of pacemaker's existence, it simply wasn't possible because there is no resource named Y (they were actually called Y:0, Y:1 .. Y:N). Then pacemaker was made smarter and, as a side effect, started being able to find something matching Y. Then I closed the loophole :) So it was never legal. Is there a combination of constraints I can configure for a single IP resource and a cloned group such that if there is a failure only the IP resource will move?
These are implied simply because they're in a group:

<rsc_order first="Varnish" id="order-Varnish-Varnishlog-mandatory" then="Varnishlog"/>
<rsc_order first="Varnishlog" id="order-Varnishlog-Varnishncsa-mandatory" then="Varnishncsa"/>
<rsc_colocation id="colocation-Varnishlog-Varnish-INFINITY" rsc="Varnishlog" score="INFINITY" with-rsc="Varnish"/>
<rsc_colocation id="colocation-Varnishncsa-Varnishlog-INFINITY" rsc="Varnishncsa" score="INFINITY" with-rsc="Varnishlog"/>

This:

<rsc_order first="ClusterEIP_54.215.143.166" id="order-ClusterEIP_54.215.143.166-Varnish-mandatory" then="Varnish"/>

can be replaced with: pcs constraint order start ClusterEIP_54.215.143.166 then EIP-AND-VARNISH-clone and:

<rsc_colocation id="colocation-Varnish-ClusterEIP_54.215.143.166-INFINITY" rsc="Varnish" score="INFINITY" with-rsc="ClusterEIP_54.215.143.166"/>

is the same as: pcs constraint colocation EIP-AND-VARNISH-clone with ClusterEIP_54.215.143.166 but that makes no sense, because then the clone (which wants to run everywhere) can only run on the node the IP is on. If that's what you want, then there is no point having a clone. Better to reverse the colocation to be: pcs constraint colocation ClusterEIP_54.215.143.166 with EIP-AND-VARNISH-clone and possibly the ordering too: pcs constraint order start EIP-AND-VARNISH-clone then ClusterEIP_54.215.143.166 Or in the case where I previously had the constraints applied to the resources and not the clone, was that causing a problem? Thanks again. I was able to get the upgrade done. I also had to upgrade the libqb package. I know that's been mentioned in other threads, but I think that should either be a dependency of pacemaker or explicitly documented. libqb is a dependency, just not a versioned one. We should probably change that next time. I would say it's a requirement that that be changed. Well... pacemaker can build against older versions, but you need to run it with whatever version you built against.
possibly the libqb versioning is being done incorrectly, preventing rpm from figuring this all out Second order of business is that failover is no longer working as expected. Because the order and colocation constraints are gone, if one of the varnish resources fails, the EIP resource does not move to the other node like it used to. Is there a way I can create or re-create that behavior? See above :) The resource group EIP-AND_VARNISH has the three varnish services and is then cloned so running on both nodes. If any of them fail I want the EIP resource to move to the other node. Any advice for doing this? Thanks
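[Editor's note] The advice above boils down to two things: constraints must name the outermost container (the clone, not the primitive inside it), and the colocation direction should make the IP follow the clone, not vice versa. A sketch under those assumptions (the pcs commands are shown as comments since they need a live cluster; the constraint ids below are invented for illustration):

```shell
# pcs constraint colocation ClusterEIP_54.215.143.166 with EIP-AND-VARNISH-clone
# pcs constraint order start EIP-AND-VARNISH-clone then ClusterEIP_54.215.143.166
#
# Expected resulting constraints (hypothetical ids), written out for inspection:
cat <<'EOF' > /tmp/eip-constraints.xml
<rsc_colocation id="colocation-EIP-with-clone" rsc="ClusterEIP_54.215.143.166" score="INFINITY" with-rsc="EIP-AND-VARNISH-clone"/>
<rsc_order id="order-clone-then-EIP" first="EIP-AND-VARNISH-clone" then="ClusterEIP_54.215.143.166"/>
EOF
# In rsc_colocation, rsc= names the dependent resource (the one that moves
# on failure) -- here the EIP, so only the IP relocates:
grep 'rsc="ClusterEIP' /tmp/eip-constraints.xml
```

With the colocation reversed the other way (clone with IP), the clone could only run where the IP is, which defeats the purpose of cloning it.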
Re: [Pacemaker] CentOS 6.4 and CFS.
On 16 Nov 2013, at 9:42 am, Rob Thomas xro...@gmail.com wrote: Line 363 of /usr/lib/python2.6/site-packages/pcs/cluster.py has this: nodes = utils.getNodesFromCorosyncConf() Ahha. Look what I just spotted. https://github.com/feist/pcs/commit/8b888080c37ddea88b92dfd95aadd78b9db68b55 Are you building pcs yourself or using the packages supplied by CentOS? I'd be surprised if the supplied packages had not been patched to look in cluster.conf instead of corosync.conf
Re: [Pacemaker] Remove a ghost node
On 15 Nov 2013, at 2:28 pm, Sean Lutner s...@rentul.net wrote: Yes the varnish resources are in a group which is then cloned. -EDONTDOTHAT You can't refer to the things inside a clone. 1.1.8 will have just been ignoring those constraints. So the implicit order and colocation constraints in a group and clone will take care of those? Which means remove the constraints and retry the upgrade? No, it means rewrite them to refer to the clone - whatever is the outermost container. I was able to get the upgrade done. I also had to upgrade the libqb package. I know that's been mentioned in other threads, but I think that should either be a dependency of pacemaker or explicitly documented. libqb is a dependency, just not a versioned one. We should probably change that next time. Second order of business is that failover is no longer working as expected. Because the order and colocation constraints are gone, if one of the varnish resources fails, the EIP resource does not move to the other node like it used to. Is there a way I can create or re-create that behavior? See above :) The resource group EIP-AND_VARNISH has the three varnish services and is then cloned so running on both nodes. If any of them fail I want the EIP resource to move to the other node. Any advice for doing this? Thanks
Re: [Pacemaker] Finally. A REAL question.
On 18 Nov 2013, at 12:43 pm, Rob Thomas xro...@gmail.com wrote: Previously, using crm, it was reasonably painless to ensure that resource groups ran on the same node. I'm having difficulties figuring out what the 'right' way to do this is with pcs You tried: pcs constraint colocation add asterisk with httpd ? Specifically, I want the 'asterisk' group to run on the same node as the 'httpd' group. Basically, this should never happen:

pcs status
Cluster name: freepbx-ha
Last updated: Mon Nov 18 11:38:27 2013
Last change: Mon Nov 18 11:30:44 2013 via cibadmin on freepbx-a
Stack: cman
Current DC: freepbx-a - partition with quorum
Version: 1.1.10-1.el6_4.4-368c726
2 Nodes configured
16 Resources configured

Online: [ freepbx-a freepbx-b ]

Full list of resources:

floating_ip (ocf::heartbeat:IPaddr2): Started freepbx-b
Master/Slave Set: ms-asterisk [drbd_asterisk]
    Masters: [ freepbx-b ]
    Slaves: [ freepbx-a ]
Master/Slave Set: ms-mysql [drbd_mysql]
    Masters: [ freepbx-a ]
    Slaves: [ freepbx-b ]
Master/Slave Set: ms-httpd [drbd_httpd]
    Masters: [ freepbx-a ]
    Slaves: [ freepbx-b ]
Resource Group: mysql
    mysql_fs         (ocf::heartbeat:Filesystem):  Started freepbx-a
    mysql_ip         (ocf::heartbeat:IPaddr2):     Started freepbx-a
    mysql_service    (ocf::heartbeat:mysql):       Started freepbx-a
Resource Group: asterisk
    asterisk_fs      (ocf::heartbeat:Filesystem):  Started freepbx-b
    asterisk_ip      (ocf::heartbeat:IPaddr2):     Started freepbx-b
    asterisk_service (ocf::heartbeat:freepbx):     Started freepbx-b
Resource Group: httpd
    httpd_fs         (ocf::heartbeat:Filesystem):  Started freepbx-a
    httpd_ip         (ocf::heartbeat:IPaddr2):     Started freepbx-a
    httpd_service    (ocf::heartbeat:apache):      Started freepbx-a

Note that asterisk and httpd are running on different nodes, after I caused asterisk to fail across (by shutting it down). What I want to happen is that when asterisk fails (or httpd), the cluster should shut down the other non-failing resource, and move it. I can do this by making a single resource group that contains both the asterisk_ and
httpd_ resources, but, it seems untidy to me. I've tried a colocation set, which does let you add groups, but it doesn't seem to do what I'm after. Attached is the dump of my current config. Any advice or help would be appreciated! (Currently on CentOS 6.4, with pacemaker 1.1.10, and pcs 0.9.90) --Rob

Attachment: cibadmin.xml
Re: [Pacemaker] CentOS 6.4 and CFS.
On 15 Nov 2013, at 5:56 pm, Rob Thomas xro...@gmail.com wrote: So I'm a long time corosync fan, and I've recently come back into the fold to change everything I've previously written to pcs, because that's the new cool thing. Sadly, things seem to be a bit broken. Here's how things have gone today! I managed to get things kinda sorta working with the old 1.1.8 version of PCS on CentOS 6.4. I wasn't happy with it, but I went 'meh, that'll do, I'll fix it all with pacemaker 1.1.10'. No such luck. So, I create a few test resources and that all seems to work. Excellent. Now I want to start working on failing things over properly: 'pcs cluster standby node-a' Error: node 'node-a' does not appear to exist in configuration Looking through the pcs code, it's now checking that the node exists in /etc/corosync/corosync.conf No. Not on RHEL-6 anyway. Before I address the rest of your email, you need to be using pacemaker with cman (cluster.conf) as described at: http://clusterlabs.org/quickstart-redhat.html and: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6-Beta/html/Configuring_the_Red_Hat_High_Availability_Add-On_with_Pacemaker/ch-clusteradmin-HAAR.html If you're still having issues after reading one or more of those, we'll be only too happy to help :) Well that's cool, I can generate a conf with 'pcs cluster auth' and 'pcs cluster setup' according to CFS. No. No I can't. pcs cluster auth and pcs cluster setup require pcsd (I assume? Whatever's meant to be listening on port 2224) but that appears to be missing in RHEL based distros. (This is, apparently, by design according to https://github.com/feist/pcs/issues/3 which is - admittedly - quite old) OK, so I'll create the corosync config file based on the template I found in /usr/lib/python2.6/site-packages/pcs/corosync.conf.template -- except that one isn't used, and only the fedora.template is. The fedora template specifies that it's using a pacemaker service, but..
I thought that had been deprecated and removed? So. I'm now at the point where I'm confused. Question: Am I doing something basically and fundamentally wrong? Is there a step that I've missed that generates the corosync.conf file that's required now? If not, can I just use the .template one, or, should I be using the .fedora.template with 6.4? I normally would just derp around with it and add my old corosync.conf file and keep playing, but, I have to wander off to geek at a Roller Derby thing tonight, so I thought I may throw it to the crowds before I start going down too many blind alleys. Question 2: What else is going to bite me? 8) Sorry for the length! --Rob
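[Editor's note] The answer above is that on RHEL/CentOS 6, pacemaker runs under cman, so the file to generate is /etc/cluster/cluster.conf rather than corosync.conf. A hedged sketch of the general shape such a file takes for a two-node cluster with fence_pcmk redirection, per the quickstart-redhat pattern referenced above (node and cluster names are examples, not from the thread):

```shell
# Sketch only: a minimal cman cluster.conf of the kind the linked
# quickstart describes; written to /tmp for inspection, not installed.
cat <<'EOF' > /tmp/cluster.conf
<cluster config_version="1" name="example-ha">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="node-a" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="node-a"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node-b" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="node-b"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
</cluster>
EOF
# Both nodes are declared here -- which is exactly what the failing
# 'pcs cluster standby' check wants to find:
grep -c '<clusternode ' /tmp/cluster.conf
```

The fence_pcmk entries redirect cman's fencing requests to pacemaker's stonith devices, which is why the standalone corosync.conf templates don't apply on this platform.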
Re: [Pacemaker] Remove a ghost node
On 14 Nov 2013, at 2:55 pm, Sean Lutner s...@rentul.net wrote: On Nov 13, 2013, at 10:51 PM, Andrew Beekhof and...@beekhof.net wrote: On 14 Nov 2013, at 1:12 pm, Sean Lutner s...@rentul.net wrote: On Nov 10, 2013, at 8:03 PM, Sean Lutner s...@rentul.net wrote: On Nov 10, 2013, at 7:54 PM, Andrew Beekhof and...@beekhof.net wrote: On 11 Nov 2013, at 11:44 am, Sean Lutner s...@rentul.net wrote: On Nov 10, 2013, at 6:27 PM, Andrew Beekhof and...@beekhof.net wrote: On 8 Nov 2013, at 12:59 pm, Sean Lutner s...@rentul.net wrote: On Nov 7, 2013, at 8:34 PM, Andrew Beekhof and...@beekhof.net wrote: On 8 Nov 2013, at 4:45 am, Sean Lutner s...@rentul.net wrote: I have a confusing situation that I'm hoping to get help with. Last night after configuring STONITH on my two node cluster, I suddenly have a ghost node in my cluster. I'm looking to understand the best way to remove this node from the config. I'm using the fence_ec2 device for STONITH. I dropped the script on each node, registered the device with stonith_admin -R -a fence_ec2 and confirmed the registration with both # stonith_admin -I # pcs stonith list I then configured STONITH per the Clusters from Scratch doc http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html Here are my commands:

# pcs cluster cib stonith_cfg
# pcs -f stonith_cfg stonith create ec2-fencing fence_ec2 ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list="ip-10-50-3-122 ip-10-50-3-251" op monitor interval=300s timeout=150s op start start-delay=30s interval=0
# pcs -f stonith_cfg stonith
# pcs -f stonith_cfg property set stonith-enabled=true
# pcs -f stonith_cfg property
# pcs cluster push cib stonith_cfg

After that I saw that STONITH appears to be functioning but a new node listed in pcs status output: Do the EC2 instances have fixed IPs?
I didn't have much luck with EC2 because every time they came back up it was with a new name/address which confused corosync and created situations like this. The IPs persist across reboots as far as I can tell. I thought the problem was due to stonith being enabled but not working so I removed the stonith_id and disabled stonith. After that I restarted pacemaker and cman on both nodes and things started as expected but the ghost node is still there. Someone else working on the cluster exported the CIB, removed the node and then imported the CIB. They used this process http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html Even after that, the ghost node is still there? Would pcs cluster cib /tmp/cib-temp.xml and then pcs cluster push cib /tmp/cib-temp.xml after editing the node out of the config work? No. If it's coming back then pacemaker is holding it in one of its internal caches. The only way to clear it out in your version is to restart pacemaker on the DC. Actually... are you sure someone didn't just slip while editing cluster.conf? [...].1251 does not look like a valid IP :) In the end this fixed it # pcs cluster cib /tmp/cib-tmp.xml # vi /tmp/cib-tmp.xml # remove bad node # pcs cluster push cib /tmp/cib-tmp.xml Followed by restarting pacemaker and cman on both nodes. The ghost node disappeared, so it was cached as you mentioned. I also tracked the bad IP down to bad non-printing characters in the initial command line while configuring the fence_ec2 stonith device. I'd put the command together from the github README and some mailing list posts and laid it out in an external editor. Go me. :) Version: 1.1.8-7.el6-394e906 There is now an update to 1.1.10 available for 6.4, that _may_ help in the future. That's my next task. I believe I'm hitting the failure-timeout not clearing failcount bug and want to upgrade to 1.1.10. Is it safe to yum update pacemaker after stopping the cluster?
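The bad IP here traced back to invisible characters pasted into the original command line. As an aside, a quick way to spot such characters in a pasted command before running it — a plain-Python sketch, not a pcs feature — is:

```python
import unicodedata

def find_suspicious_chars(cmd):
    """Return (index, char, unicode name) for every character outside
    printable ASCII -- the kind of invisible junk that copy/paste from a
    web page or rich-text editor can smuggle into a shell command."""
    bad = []
    for i, ch in enumerate(cmd):
        if not (0x20 <= ord(ch) <= 0x7e):
            bad.append((i, ch, unicodedata.name(ch, "UNKNOWN")))
    return bad

# A pasted command with a non-breaking space (U+00A0) hiding in it:
cmd = "pcs -f stonith_cfg stonith create ec2-fencing\u00a0fence_ec2"
print(find_suspicious_chars(cmd))
```

An empty result means the command is plain ASCII and safe from this particular failure mode.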
I see there is also an updated pcs in CentOS 6.4, should I update that as well? yes and yes you might want to check if you're using any OCF resource agents that didn't make it into the first supported release though. http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/ Thanks, I'll give that a read. All the resource agents are custom so I'm thinking I'm okay (I'll back them up before upgrading). One last question related to the fence_ec2 script. Should crm_mon -VW show it running on both nodes or just one? I just went through the upgrade to pacemaker 1.1.10 and pcs. After running the yum update for those I ran a crm_verify and I'm seeing errors related to my order and colocation constraints. Did the behavior of these change from 1.1.8 to 1.1.10? # crm_verify -L -V error: unpack_order_template: Invalid constraint 'order-ClusterEIP_54.215.143.166-Varnish-mandatory': No resource or template named 'Varnish' Is that true? No, it's
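The fix earlier in this thread exports the CIB, edits the ghost node out by hand, and pushes the result back. That editing step can be scripted; below is a minimal sketch using Python's standard XML library. The CIB snippet is a cut-down illustration, not a complete CIB, and real deployments should prefer the cluster tools where they work:

```python
import xml.etree.ElementTree as ET

def remove_node(cib_xml, ghost_uname):
    """Drop every <node> entry whose uname matches, plus any lingering
    <node_state> for it, from a CIB XML string. Returns the edited XML."""
    root = ET.fromstring(cib_xml)
    for parent in list(root.iter()):          # snapshot before mutating
        for child in list(parent):
            if child.tag in ("node", "node_state") and \
               child.get("uname") == ghost_uname:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")

cib = """<cib><configuration><nodes>
  <node id="1" uname="ip-10-50-3-122"/>
  <node id="2" uname="ip-10-50-3-1251"/>
</nodes></configuration><status/></cib>"""
cleaned = remove_node(cib, "ip-10-50-3-1251")
print(cleaned)
```

As the thread notes, pacemaker's internal caches can still resurrect the node, so a daemon restart on the DC may be needed regardless of how the XML is edited.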
Re: [Pacemaker] why pacemaker does not control the resources
On 14 Nov 2013, at 5:06 pm, Andrey Groshev gre...@yandex.ru wrote: 14.11.2013, 02:22, Andrew Beekhof and...@beekhof.net: On 14 Nov 2013, at 6:13 am, Andrey Groshev gre...@yandex.ru wrote: 13.11.2013, 03:22, Andrew Beekhof and...@beekhof.net: On 12 Nov 2013, at 4:42 pm, Andrey Groshev gre...@yandex.ru wrote: 11.11.2013, 03:44, Andrew Beekhof and...@beekhof.net: On 8 Nov 2013, at 7:49 am, Andrey Groshev gre...@yandex.ru wrote: Hi, PPL! I need help. I do not understand why this has stopped working. This configuration worked on another cluster, but that was on corosync1. So... a postgres cluster with master/slave, the classic config as in the wiki. I built the cluster and started it, and it was working. Next I killed postgres on the master with signal 6, as if it had run out of disk space: # pkill -6 postgres # ps axuww|grep postgres root 9032 0.0 0.1 103236 860 pts/0S+ 00:37 0:00 grep postgres PostgreSQL died, but crm_mon shows that the master is still running. Last updated: Fri Nov 8 00:42:08 2013 Last change: Fri Nov 8 00:37:05 2013 via crm_attribute on dev-cluster2-node4 Stack: corosync Current DC: dev-cluster2-node4 (172793107) - partition with quorum Version: 1.1.10-1.el6-368c726 3 Nodes configured 7 Resources configured Node dev-cluster2-node2 (172793105): online pingCheck (ocf::pacemaker:ping): Started pgsql (ocf::heartbeat:pgsql): Started Node dev-cluster2-node3 (172793106): online pingCheck (ocf::pacemaker:ping): Started pgsql (ocf::heartbeat:pgsql): Started Node dev-cluster2-node4 (172793107): online pgsql (ocf::heartbeat:pgsql): Master pingCheck (ocf::pacemaker:ping): Started VirtualIP (ocf::heartbeat:IPaddr2): Started Node Attributes: * Node dev-cluster2-node2: + default_ping_set : 100 + master-pgsql : -INFINITY + pgsql-data-status : STREAMING|ASYNC + pgsql-status : HS:async * Node dev-cluster2-node3: + default_ping_set : 100 + master-pgsql : -INFINITY + pgsql-data-status : STREAMING|ASYNC + pgsql-status : HS:async * Node dev-cluster2-node4: + default_ping_set : 100 + master-pgsql : 1000 + pgsql-data-status : LATEST +
pgsql-master-baseline : 0278 + pgsql-status : PRI Migration summary: * Node dev-cluster2-node4: * Node dev-cluster2-node2: * Node dev-cluster2-node3: Tickets: CONFIG: node $id=172793105 dev-cluster2-node2. \ attributes pgsql-data-status=STREAMING|ASYNC standby=false node $id=172793106 dev-cluster2-node3. \ attributes pgsql-data-status=STREAMING|ASYNC standby=false node $id=172793107 dev-cluster2-node4. \ attributes pgsql-data-status=LATEST primitive VirtualIP ocf:heartbeat:IPaddr2 \ params ip=10.76.157.194 \ op start interval=0 timeout=60s on-fail=stop \ op monitor interval=10s timeout=60s on-fail=restart \ op stop interval=0 timeout=60s on-fail=block primitive pgsql ocf:heartbeat:pgsql \ params pgctl=/usr/pgsql-9.1/bin/pg_ctl psql=/usr/pgsql-9.1/bin/psql pgdata=/var/lib/pgsql/9.1/data tmpdir=/tmp/pg start_opt=-p 5432 logfile=/var/lib/pgsql/9.1//pgstartup.log rep_mode=async node_list= dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4. restore_command=gzip -cd /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz %p primary_conninfo_opt=keepalives_idle=60 keepalives_interval=5 keepalives_count=5 master_ip=10.76.157.194 \ op start interval=0 timeout=60s on-fail=restart \ op monitor interval=5s timeout=61s on-fail=restart \ op monitor interval=1s role=Master timeout=62s on-fail=restart \ op promote interval=0 timeout=63s on-fail=restart \ op demote interval=0 timeout=64s on-fail=stop \ op stop interval=0 timeout=65s on-fail=block \ op notify interval=0 timeout=66s primitive pingCheck ocf:pacemaker:ping \ params name=default_ping_set host_list=10.76.156.1 multiplier=100 \ op start interval=0 timeout=60s on-fail=restart \ op monitor interval=10s timeout=60s on-fail=restart \ op stop interval=0 timeout=60s on-fail=ignore ms msPostgresql pgsql \ meta master-max=1 master-node-max=1 clone-node-max=1 notify=true target-role=Master clone-max=3 clone clnPingCheck pingCheck \ meta clone-max=3 location l0_DontRunPgIfNotPingGW msPostgresql \ rule 
$id=l0_DontRunPgIfNotPingGW-rule -inf: not_defined default_ping_set
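For context, the l0_DontRunPgIfNotPingGW constraint bans msPostgresql from any node whose default_ping_set attribute is undefined or below 100 (the full rule text appears again further down the thread). A toy model of the rule's semantics, assuming integer-valued attributes — this is an illustration, not how the policy engine evaluates rules internally:

```python
def pg_allowed(attrs):
    """Mirror the location rule: score -INFINITY (banned) when
    default_ping_set is undefined or below 100, otherwise allowed."""
    v = attrs.get("default_ping_set")
    return v is not None and int(v) >= 100

print(pg_allowed({"default_ping_set": "100"}))  # node can reach the gateway
print(pg_allowed({}))                           # attribute not yet set: banned
```

This is why a node whose ping resource has not yet reported (or whose gateway is down) will never be chosen to run PostgreSQL.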
Re: [Pacemaker] Question about the resource to fence a node
On 14 Nov 2013, at 5:53 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote: Hi, Andrew 2013/11/13 Kazunori INOUE kazunori.ino...@gmail.com: 2013/11/13 Andrew Beekhof and...@beekhof.net: On 16 Oct 2013, at 8:51 am, Andrew Beekhof and...@beekhof.net wrote: On 15/10/2013, at 8:24 PM, Kazunori INOUE kazunori.ino...@gmail.com wrote: Hi, I'm using pacemaker-1.1 (the latest devel). I started the resources (f1 and f2), which fence vm3, on vm1. $ crm_mon -1 Last updated: Tue Oct 15 15:16:37 2013 Last change: Tue Oct 15 15:16:21 2013 via crmd on vm1 Stack: corosync Current DC: vm1 (3232261517) - partition with quorum Version: 1.1.11-0.284.6a5e863.git.el6-6a5e863 3 Nodes configured 3 Resources configured Online: [ vm1 vm2 vm3 ] pDummy (ocf::pacemaker:Dummy): Started vm3 Resource Group: gStonith3 f1 (stonith:external/libvirt): Started vm1 f2 (stonith:external/ssh): Started vm1 A reset by f1, which hasn't been started on vm2, was performed from vm2 when vm3 was fenced. $ ssh vm3 'rm -f /var/run/Dummy-pDummy.state' $ for i in vm1 vm2; do ssh $i 'hostname; egrep "reset|off" /var/log/ha-log'; done vm1 Oct 15 15:17:35 vm1 stonith-ng[14870]: warning: log_operation: f2:15076 [ Performing: stonith -t external/ssh -T reset vm3 ] Oct 15 15:18:06 vm1 stonith-ng[14870]: warning: log_operation: f2:15464 [ Performing: stonith -t external/ssh -T reset vm3 ] vm2 Oct 15 15:17:16 vm2 stonith-ng[9160]: warning: log_operation: f1:9273 [ Performing: stonith -t external/libvirt -T reset vm3 ] Oct 15 15:17:46 vm2 stonith-ng[9160]: warning: log_operation: f1:9588 [ Performing: stonith -t external/libvirt -T reset vm3 ] Is this the intended behavior? Yes, although the host on which the device is started usually gets priority. I will try to find some time to look through the report to see why this didn't happen. Reading through this again, it sounds like it should be fixed by your earlier pull request: https://github.com/beekhof/pacemaker/commit/6b4bfd6 Yes? No. How about this change? Thanks for this.
I tweaked it a bit further and pushed: https://github.com/beekhof/pacemaker/commit/4cbbeb0

diff --git a/fencing/remote.c b/fencing/remote.c
index 6c11ba9..68b31c5 100644
--- a/fencing/remote.c
+++ b/fencing/remote.c
@@ -778,6 +778,7 @@ stonith_choose_peer(remote_fencing_op_t * op)
 {
     st_query_result_t *peer = NULL;
     const char *device = NULL;
+    uint32_t active = fencing_active_peers();

     do {
         if (op->devices) {
@@ -790,7 +791,8 @@ stonith_choose_peer(remote_fencing_op_t * op)
         if ((peer = find_best_peer(device, op, FIND_PEER_SKIP_TARGET | FIND_PEER_VERIFIED_ONLY))) {
             return peer;
-        } else if ((peer = find_best_peer(device, op, FIND_PEER_SKIP_TARGET))) {
+        } else if ((op->query_timer == 0 || op->replies >= op->replies_expected || op->replies >= active)
+                   && (peer = find_best_peer(device, op, FIND_PEER_SKIP_TARGET))) {
             return peer;
         } else if ((peer = find_best_peer(device, op, FIND_PEER_TARGET_ONLY))) {
             return peer;
@@ -801,8 +803,13 @@ stonith_choose_peer(remote_fencing_op_t * op)
            stonith_topology_next(op) == pcmk_ok);

     if (op->devices) {
-        crm_notice("Couldn't find anyone to fence %s with %s", op->target,
-                   (char *)op->devices->data);
+        if (op->query_timer == 0 || op->replies >= op->replies_expected || op->replies >= active) {
+            crm_notice("Couldn't find anyone to fence %s with %s", op->target,
+                       (char *)op->devices->data);
+        } else {
+            crm_debug("Couldn't find verified device to fence %s with %s", op->target,
+                      (char *)op->devices->data);
+        }
     } else {
         crm_debug("Couldn't find anyone to fence %s", op->target);
     }

I'm kind of swamped at the moment though.
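In plain terms, the patch only lets stonithd fall back to an unverified peer once no further query replies can be expected. A simplified Python model of that selection order (the names here are illustrative, not the C API):

```python
def choose_peer(peers, replies, replies_expected, active_peers,
                query_timer_running=True):
    """peers: list of dicts with 'verified' and 'can_fence_target' flags.
    Tier 1: a peer with a verified device is always acceptable.
    Tier 2: an unverified peer is acceptable only when no more query
            replies are expected (timer already fired, all expected
            replies in, or replies from every active peer)."""
    for p in peers:                       # tier 1: verified devices only
        if p["can_fence_target"] and p["verified"]:
            return p
    no_more_replies = (not query_timer_running
                       or replies >= replies_expected
                       or replies >= active_peers)
    if no_more_replies:                   # tier 2: any capable peer
        for p in peers:
            if p["can_fence_target"]:
                return p
    return None

peers = [{"verified": False, "can_fence_target": True}]
print(choose_peer(peers, replies=1, replies_expected=2, active_peers=2))
print(choose_peer(peers, replies=2, replies_expected=2, active_peers=2))
```

The first call returns None (a reply is still outstanding, so keep waiting for a verified peer); the second accepts the unverified peer, matching the intent of the guard added in the diff.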
Best Regards, Kazunori INOUE [attachment: stopped_resource_performed_reset.tar.bz2]
Re: [Pacemaker] why pacemaker does not control the resources
On 14 Nov 2013, at 6:13 am, Andrey Groshev gre...@yandex.ru wrote: 13.11.2013, 03:22, Andrew Beekhof and...@beekhof.net: On 12 Nov 2013, at 4:42 pm, Andrey Groshev gre...@yandex.ru wrote: 11.11.2013, 03:44, Andrew Beekhof and...@beekhof.net: On 8 Nov 2013, at 7:49 am, Andrey Groshev gre...@yandex.ru wrote: Hi, PPL! I need help. I do not understand why this has stopped working. This configuration worked on another cluster, but that was on corosync1. So... a postgres cluster with master/slave, the classic config as in the wiki. I built the cluster and started it, and it was working. Next I killed postgres on the master with signal 6, as if it had run out of disk space: # pkill -6 postgres # ps axuww|grep postgres root 9032 0.0 0.1 103236 860 pts/0S+ 00:37 0:00 grep postgres PostgreSQL died, but crm_mon shows that the master is still running. Last updated: Fri Nov 8 00:42:08 2013 Last change: Fri Nov 8 00:37:05 2013 via crm_attribute on dev-cluster2-node4 Stack: corosync Current DC: dev-cluster2-node4 (172793107) - partition with quorum Version: 1.1.10-1.el6-368c726 3 Nodes configured 7 Resources configured Node dev-cluster2-node2 (172793105): online pingCheck (ocf::pacemaker:ping): Started pgsql (ocf::heartbeat:pgsql): Started Node dev-cluster2-node3 (172793106): online pingCheck (ocf::pacemaker:ping): Started pgsql (ocf::heartbeat:pgsql): Started Node dev-cluster2-node4 (172793107): online pgsql (ocf::heartbeat:pgsql): Master pingCheck (ocf::pacemaker:ping): Started VirtualIP (ocf::heartbeat:IPaddr2): Started Node Attributes: * Node dev-cluster2-node2: + default_ping_set : 100 + master-pgsql : -INFINITY + pgsql-data-status : STREAMING|ASYNC + pgsql-status : HS:async * Node dev-cluster2-node3: + default_ping_set : 100 + master-pgsql : -INFINITY + pgsql-data-status : STREAMING|ASYNC + pgsql-status : HS:async * Node dev-cluster2-node4: + default_ping_set : 100 + master-pgsql : 1000 + pgsql-data-status : LATEST + pgsql-master-baseline : 0278 + pgsql-status : PRI Migration summary: * Node dev-cluster2-node4: * Node
dev-cluster2-node2: * Node dev-cluster2-node3: Tickets: CONFIG: node $id=172793105 dev-cluster2-node2. \ attributes pgsql-data-status=STREAMING|ASYNC standby=false node $id=172793106 dev-cluster2-node3. \ attributes pgsql-data-status=STREAMING|ASYNC standby=false node $id=172793107 dev-cluster2-node4. \ attributes pgsql-data-status=LATEST primitive VirtualIP ocf:heartbeat:IPaddr2 \ params ip=10.76.157.194 \ op start interval=0 timeout=60s on-fail=stop \ op monitor interval=10s timeout=60s on-fail=restart \ op stop interval=0 timeout=60s on-fail=block primitive pgsql ocf:heartbeat:pgsql \ params pgctl=/usr/pgsql-9.1/bin/pg_ctl psql=/usr/pgsql-9.1/bin/psql pgdata=/var/lib/pgsql/9.1/data tmpdir=/tmp/pg start_opt=-p 5432 logfile=/var/lib/pgsql/9.1//pgstartup.log rep_mode=async node_list= dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4. restore_command=gzip -cd /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz %p primary_conninfo_opt=keepalives_idle=60 keepalives_interval=5 keepalives_count=5 master_ip=10.76.157.194 \ op start interval=0 timeout=60s on-fail=restart \ op monitor interval=5s timeout=61s on-fail=restart \ op monitor interval=1s role=Master timeout=62s on-fail=restart \ op promote interval=0 timeout=63s on-fail=restart \ op demote interval=0 timeout=64s on-fail=stop \ op stop interval=0 timeout=65s on-fail=block \ op notify interval=0 timeout=66s primitive pingCheck ocf:pacemaker:ping \ params name=default_ping_set host_list=10.76.156.1 multiplier=100 \ op start interval=0 timeout=60s on-fail=restart \ op monitor interval=10s timeout=60s on-fail=restart \ op stop interval=0 timeout=60s on-fail=ignore ms msPostgresql pgsql \ meta master-max=1 master-node-max=1 clone-node-max=1 notify=true target-role=Master clone-max=3 clone clnPingCheck pingCheck \ meta clone-max=3 location l0_DontRunPgIfNotPingGW msPostgresql \ rule $id=l0_DontRunPgIfNotPingGW-rule -inf: not_defined default_ping_set or default_ping_set lt 100 colocation r0_StartPgIfPingGW 
inf: msPostgresql clnPingCheck colocation r1_MastersGroup inf: VirtualIP msPostgresql:Master order rsc_order-1 0: clnPingCheck msPostgresql order
Re: [Pacemaker] stonith_admin does not work as expected
On 13 Nov 2013, at 11:33 pm, andreas graeper agrae...@googlemail.com wrote: hi, pacemaker version is 1.1.7 quite a bit of work has gone into fencing since then, any chance you could try something newer? the fence-agent (i thought it was one of the standards) calls snmpget -a ipaddr:udpport -c community oid snmpset -a ipaddr:udpport -c community oid i 0|1 and therefore needs/uses the command-line arguments -o action -n port (slot-index) -a ipaddr -c community (udpport is not necessary, because it is fixed at 161). or (as the logs tell me) the fence-agent gets its parameters from stdin: fence_ifmib <<EOF action= port= ipaddr= community= EOF additionally an invalid 'nodename=xyz' is given. the fence-agent was written for another device, and because our device does not support a function (OID_PORT, used to get the port-index from the port-name) we have to use port numbers. but except for other tiny limitations it works great:
<primitive class="stonith" id="fence_1" type="fence_ifmib_epc8212">
  <instance_attributes id="fence_1-instance_attributes">
    <nvpair id="fence_1-instance_attributes-ipaddr" name="ipaddr" value="172.27.51.33"/>
    <nvpair id="fence_1-instance_attributes-community" name="community" value="xxx"/>
    <nvpair id="fence_1-instance_attributes-port" name="port" value="1"/>
    <nvpair id="fence_1-instance_attributes-action" name="action" value="off"/>
    <nvpair id="fence_1-instance_attributes-pcmk_poweroff_action" name="pcmk_poweroff_action" value="off"/>
    <nvpair id="fence_1-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="lisel1"/>
    <nvpair id="fence_1-instance_attributes-pcmk_host_check" name="pcmk_host_check" value="static-list"/>
    <nvpair id="fence_1-instance_attributes-verbose" name="verbose" value="true"/>
  </instance_attributes>
</primitive>
<primitive class="stonith" id="fence_2" type="fence_ifmib_epc8212">
  <instance_attributes id="fence_2-instance_attributes">
    <nvpair id="fence_2-instance_attributes-ipaddr" name="ipaddr" value="172.27.51.33"/>
    <nvpair id="fence_2-instance_attributes-community" name="community" value="xxx"/>
    <nvpair id="fence_2-instance_attributes-port" name="port" value="2"/>
    <nvpair id="fence_2-instance_attributes-action" name="action" value="off"/>
    <nvpair id="fence_2-instance_attributes-pcmk_poweroff_action" name="pcmk_poweroff_action" value="off"/>
    <nvpair id="fence_2-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="lisel2"/>
    <nvpair id="fence_2-instance_attributes-pcmk_host_check" name="pcmk_host_check" value="static-list"/>
    <nvpair id="fence_2-instance_attributes-verbose" name="verbose" value="true"/>
  </instance_attributes>
</primitive>
<rsc_location id="location-fence_1-lisel1--INFINITY" node="lisel1" rsc="fence_1" score="-INFINITY"/>
<rsc_location id="location-fence_2-lisel2--INFINITY" node="lisel2" rsc="fence_2" score="-INFINITY"/>
the old master is back now as slave. now on the (new) master stonith_admin does not see the device/fence-agent (see last message). how can i repair this? thanks andreas 2013/11/11, Andrew Beekhof and...@beekhof.net: Impossible to comment without knowing the pacemaker version, full config, and how fence_ifmib works (I assume it's a custom agent?) On 12 Nov 2013, at 1:21 am, andreas graeper agrae...@googlemail.com wrote: hi, two nodes. n1 (slave) fence_2:stonith:fence_ifmib n2 (master) fence_1:stonith:fence_ifmib n1 was fenced because it was suddenly not reachable (reason still unknown). n2: stonith_admin -L -> 'fence_1' n2: stonith_admin -U fence_1 -> timed out n2: stonith_admin -L -> 'no devices found' crm_mon shows fence_1 is running. after manually unfencing n1 with snmpset, the slave n1 is up again, but stonith_admin -L still says 'no devices found' on n2. same on n1: 'fence_2 \n 1 devices found'. what went wrong with stonith_admin? when calling crm_mon -rA1, at the end the 'Node Attributes' are listed: * Node lisel1: + master-p_drbd_r0:0 : 5 * Node lisel2: + master-p_drbd_r0:0 : 5 + master-p_drbd_r0:1 : 5 looks strange? the resources are ms_drbd_r0, with p_drbd_r0 on primary and p_drbd_r0 on secondary?! or how is this to be interpreted?
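As the poster notes, fence agents accept their parameters either as command-line flags or as key=value lines on stdin. A minimal sketch of parsing that stdin convention (illustrative only — real agents use the shared fencing-library parser, and the parameter names below are the ones from the message):

```python
def parse_fence_stdin(text):
    """Parse the stdin protocol used by fence agents: one key=value per
    line; '#' comment lines and blank lines are ignored."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params

stdin_text = """action=off
port=1
ipaddr=172.27.51.33
community=xxx
"""
print(parse_fence_stdin(stdin_text))
```

This also shows why a stray key like nodename=xyz is usually harmless to a tolerant agent: unknown keys simply land in the dict and are ignored, though a strict agent may reject them.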
thanks in advance andreas
Re: [Pacemaker] crmd Segmentation fault at pacemaker 1.0.12
On 13 Nov 2013, at 7:36 pm, TAKATSUKA Haruka haru...@sraoss.co.jp wrote: Hello, pacemaker hackers. I am reporting a crmd crash in pacemaker 1.0.12. We are going to upgrade pacemaker 1.0.12 to 1.0.13, but I was not able to find a fix for this problem in the ChangeLog. tengine.c:do_te_invoke() does not seem to handle the transition_graph == NULL case, even in the 1.0.x head code. This should help: https://github.com/ClusterLabs/pacemaker-1.0/commit/20f169d9cccb6c889946c64ab09ab4fb7f572f7c regards, Haruka Takatsuka. - [log] Nov 07 00:00:08 srv1 crmd: [21843]: ERROR: crm_abort: abort_transition_graph: Triggered assert at te_utils.c:259 : transition_graph != NULL Nov 07 00:00:08 srv1 heartbeat: [21823]: WARN: Managed /usr/lib64/heartbeat/crmd process 21843 killed by signal 11 [SIGSEGV - Segmentation violation]. Nov 07 00:00:08 srv1 heartbeat: [21823]: ERROR: Managed /usr/lib64/heartbeat/crmd process 21843 dumped core Nov 07 00:00:08 srv1 heartbeat: [21823]: EMERG: Rebooting system. Reason: /usr/lib64/heartbeat/crmd [gdb] $ gdb -c core.21843 -s crmd.debug crmd --(snip)-- Program terminated with signal 11, Segmentation fault.
#0 0x004199c4 in do_te_invoke (action=140737488355328, cause=C_FSA_INTERNAL, cur_state=S_POLICY_ENGINE, current_input=I_FINALIZED, msg_data=0x1b28e20) at tengine.c:186
186         if(transition_graph->complete == FALSE) {
--(snip)--
(gdb) bt
#0 0x004199c4 in do_te_invoke (action=140737488355328, cause=C_FSA_INTERNAL, cur_state=S_POLICY_ENGINE, current_input=I_FINALIZED, msg_data=0x1b28e20) at tengine.c:186
#1 0x00405ca3 in do_fsa_action (fsa_data=0x1b28e20, an_action=140737488355328, function=0x419831 do_te_invoke) at fsa.c:154
#2 0x00406b22 in s_crmd_fsa_actions (fsa_data=0x1b28e20) at fsa.c:410
#3 0x004061a1 in s_crmd_fsa (cause=C_FSA_INTERNAL) at fsa.c:267
#4 0x0041208f in crm_fsa_trigger (user_data=0x0) at callbacks.c:631
#5 0x003777a26146 in crm_trigger_dispatch (source=0x1b1b590, callback=0x412026 crm_fsa_trigger, userdata=0x1b1b590) at mainloop.c:53
#6 0x0031d8a38f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#7 0x0031d8a3c938 in ?? () from /lib64/libglib-2.0.so.0
#8 0x0031d8a3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#9 0x004051bb in crmd_init () at main.c:139
#10 0x00405093 in main (argc=1, argv=0x7fff947d1388) at main.c:105
(gdb) list
181
182         if(action & A_TE_CANCEL) {
183             crm_debug("Cancelling the transition: %s",
184                       transition_graph->complete ? "inactive" : "active");
185             abort_transition(INFINITY, tg_restart, "Peer Cancelled", NULL);
186             if(transition_graph->complete == FALSE) {
187                 crmd_fsa_stall(NULL);
188             }
189
190         } else if(action & A_TE_HALT) {
(gdb) p transition_graph
$1 = (crm_graph_t *) 0x0
Re: [Pacemaker] Remove a ghost node
On 14 Nov 2013, at 1:12 pm, Sean Lutner s...@rentul.net wrote: On Nov 10, 2013, at 8:03 PM, Sean Lutner s...@rentul.net wrote: On Nov 10, 2013, at 7:54 PM, Andrew Beekhof and...@beekhof.net wrote: On 11 Nov 2013, at 11:44 am, Sean Lutner s...@rentul.net wrote: On Nov 10, 2013, at 6:27 PM, Andrew Beekhof and...@beekhof.net wrote: On 8 Nov 2013, at 12:59 pm, Sean Lutner s...@rentul.net wrote: On Nov 7, 2013, at 8:34 PM, Andrew Beekhof and...@beekhof.net wrote: On 8 Nov 2013, at 4:45 am, Sean Lutner s...@rentul.net wrote: I have a confusing situation that I'm hoping to get help with. Last night after configuring STONITH on my two node cluster, I suddenly have a ghost node in my cluster. I'm looking to understand the best way to remove this node from the config. I'm using the fence_ec2 device for STONITH. I dropped the script on each node, registered the device with stonith_admin -R -a fence_ec2 and confirmed the registration with both # stonith_admin -I # pcs stonith list I then configured STONITH per the Clusters from Scratch doc http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html Here are my commands: # pcs cluster cib stonith_cfg # pcs -f stonith_cfg stonith create ec2-fencing fence_ec2 ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=ip-10-50-3-122 ip-10-50-3-251 op monitor interval=300s timeout=150s op start start-delay=30s interval=0 # pcs -f stonith_cfg stonith # pcs -f stonith_cfg property set stonith-enabled=true # pcs -f stonith_cfg property # pcs cluster push cib stonith_cfg After that I saw that STONITH appears to be functioning but a new node listed in pcs status output: Do the EC2 instances have fixed IPs? I didn't have much luck with EC2 because every time they came back up it was with a new name/address which confused corosync and created situations like this. The IPs persist across reboots as far as I can tell.
I thought the problem was due to stonith being enabled but not working so I removed the stonith_id and disabled stonith. After that I restarted pacemaker and cman on both nodes and things started as expected but the ghost node is still there. Someone else working on the cluster exported the CIB, removed the node and then imported the CIB. They used this process http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html Even after that, the ghost node is still there? Would pcs cluster cib /tmp/cib-temp.xml and then pcs cluster push cib /tmp/cib-temp.xml after editing the node out of the config work? No. If it's coming back then pacemaker is holding it in one of its internal caches. The only way to clear it out in your version is to restart pacemaker on the DC. Actually... are you sure someone didn't just slip while editing cluster.conf? [...].1251 does not look like a valid IP :) In the end this fixed it # pcs cluster cib /tmp/cib-tmp.xml # vi /tmp/cib-tmp.xml # remove bad node # pcs cluster push cib /tmp/cib-tmp.xml Followed by restarting pacemaker and cman on both nodes. The ghost node disappeared, so it was cached as you mentioned. I also tracked the bad IP down to bad non-printing characters in the initial command line while configuring the fence_ec2 stonith device. I'd put the command together from the github README and some mailing list posts and laid it out in an external editor. Go me. :) Version: 1.1.8-7.el6-394e906 There is now an update to 1.1.10 available for 6.4, that _may_ help in the future. That's my next task. I believe I'm hitting the failure-timeout not clearing failcount bug and want to upgrade to 1.1.10. Is it safe to yum update pacemaker after stopping the cluster? I see there is also an updated pcs in CentOS 6.4, should I update that as well? yes and yes you might want to check if you're using any OCF resource agents that didn't make it into the first supported release though.
http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/ Thanks, I'll give that a read. All the resource agents are custom, so I'm thinking I'm okay (I'll back them up before upgrading). One last question related to the fence_ec2 script: should crm_mon -VW show it running on both nodes or just one? I just went through the upgrade to pacemaker 1.1.10 and pcs. After running the yum update for those I ran a crm_verify and I'm seeing errors related to my order and colocation constraints. Did the behavior of these change from 1.1.8 to 1.1.10?

# crm_verify -L -V
error: unpack_order_template: Invalid constraint 'order-ClusterEIP_54.215.143.166-Varnish-mandatory': No resource or template named 'Varnish'

Is that true?

error: unpack_order_template: Invalid constraint 'order-Varnish-Varnishlog-mandatory': No resource or template named 'Varnish'
error
Re: [Pacemaker] recover cib from raw file
I wouldn't be surprised to see a relevant pcs command in the future ;-) On 12 Nov 2013, at 8:51 pm, s.oreilly s.orei...@linnovations.co.uk wrote: Brilliant, thanks Andrew. I was looking for a pcs option. Should have thought about cibadmin. Hopefully I will never break things badly enough to have to use it :-) Regards Sean O'Reilly On Mon 11/11/13 10:03 PM , Andrew Beekhof and...@beekhof.net sent: On 11 Nov 2013, at 9:41 pm, s.oreilly s.orei...@linnovations.co.uk wrote: Hi, Is it possible to recover/replace cib.xml from one of the raw files in /var/lib/pacemaker/cib? I would like to reset the cib to the configuration referenced in cib.last, in this case cib-89.raw. I haven't been able to find a command to do this. You can reload it into a running cluster with:

cibadmin --replace --xml-file /path/to/cib-89.raw

Thanks Sean O'Reilly
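As a small addition to the answer above: the backups in /var/lib/pacemaker/cib are numbered, so when cib.last is unreadable it can help to pick the newest cib-N.raw by version sort. A sketch against a temporary directory (the directory and file names here are illustrative; sort -V is a GNU coreutils extension):

```shell
# Illustrative stand-in for /var/lib/pacemaker/cib
dir=/tmp/cib-backups
mkdir -p "$dir"
touch "$dir/cib-9.raw" "$dir/cib-88.raw" "$dir/cib-89.raw"

# Version sort orders cib-9 < cib-88 < cib-89, unlike plain lexical sort
latest=$(ls "$dir"/cib-*.raw | sort -V | tail -n 1)
echo "$latest"

# The actual restore, per the thread, would then be:
#   cibadmin --replace --xml-file "$latest"
```

The cibadmin invocation is left as a comment since it only makes sense against a running cluster.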
Re: [Pacemaker] Network outage debugging
On 13 Nov 2013, at 6:10 am, Sean Lutner s...@rentul.net wrote: The folks testing the cluster I've been building have run a script which blocks all traffic except SSH on one node of the cluster for 15 seconds to mimic a network failure. During this time, the network being down seems to cause some odd behavior from pacemaker, resulting in it dying. The cluster is two nodes and running four custom resources on EC2 instances. The OS is CentOS 6.4 with the config below. I've attached the /var/log/messages and /var/log/cluster/corosync.log from the time period during the test. I'm having some difficulty piecing together what happened and am hoping someone can shed some light on the problem. Any indications why pacemaker is dying on that node? Because corosync is dying underneath it:

Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: send_ais_text: Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: 2
Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: cib_ais_destroy: Corosync connection lost! Exiting.
Nov 09 14:51:49 [942] ip-10-50-3-251 cib: info: terminate_cib: cib_ais_destroy: Exiting fast...
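The corosync configuration isn't shown in the thread, so purely as a hedged illustration: a 15-second traffic block will exceed the default totem token timeout, after which corosync declares membership lost, which matches the CPG errors above. Under cman on CentOS 6 the timeout can be raised in /etc/cluster/cluster.conf (the cluster name and value below are made up; raising the token trades failover speed for tolerance of short blips, so it is a judgment call, not a recommendation from this thread):

```xml
<!-- Illustrative fragment only: tolerate up to 20s without totem
     traffic before declaring membership lost -->
<cluster name="example-cluster" config_version="2">
  <totem token="20000"/>
</cluster>
```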
[root@ip-10-50-3-122 ~]# pcs config
Corosync Nodes:

Pacemaker Nodes:
 ip-10-50-3-122 ip-10-50-3-251

Resources:
 Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
  Attributes: first_network_interface_id=eni-e4e0b68c second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s
  Operations: monitor interval=5s
 Clone: EIP-AND-VARNISH-clone
  Group: EIP-AND-VARNISH
   Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
    Operations: monitor interval=5s
   Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
    Operations: monitor interval=5s
   Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
    Operations: monitor interval=5s
 Resource: ec2-fencing (type=fence_ec2 class=stonith)
  Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=HA01 HA02
  Operations: monitor start-delay=30s interval=0 timeout=150s

Location Constraints:
Ordering Constraints:
 ClusterEIP_54.215.143.166 then Varnish
 Varnish then Varnishlog
 Varnishlog then Varnishncsa
Colocation Constraints:
 Varnish with ClusterEIP_54.215.143.166
 Varnishlog with Varnish
 Varnishncsa with Varnishlog

Cluster Properties:
 dc-version: 1.1.8-7.el6-394e906
 cluster-infrastructure: cman
 last-lrm-refresh: 1384196963
 no-quorum-policy: ignore
 stonith-enabled: true

Attachments: net-failure-messages-110913.out, net-failure-corosync-110913.out
Re: [Pacemaker] Follow up: Colocation constraint to External Managed Resource (cluster-recheck-interval=5m ignored after 1.1.10 update?)
On 13 Nov 2013, at 12:06 am, Robert H. pacema...@elconas.de wrote: Hello, for Pacemaker 1.1.8 (CentOS version) the thread http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg18048.html was solved by adding cluster-recheck-interval=5m, causing the LRM (It's the policy engine btw, not the lrmd.) to be executed every 5 minutes and detecting externally managed resources as started (in this case an externally managed Percona cluster). cluster-recheck-interval shouldn't have anything to do with it. It's completely handled by: op monitor enabled=true timeout=20s interval=11s role=Stopped So the questions to ask: 1. Is that recurring operation being executed? 2. Is it reporting accurate results? (For 1., this happens without the involvement of cluster-recheck-interval; the lrmd will re-run the command every 'interval' seconds.) Now the same cluster was updated to 1.1.10 (new upstream) and it seems that the problem is back again. It seems that cluster-recheck-interval=5m does not cause the LRM to be executed again after 5 minutes, detecting that external, unmanaged resources are started again. The CIB is unmodified. Has something changed in the upstream release? Not intentionally. Any hints? Have a read of the section "But wait there's still more" of http://blog.clusterlabs.org/blog/2013/pacemaker-logging/ and see if you can get the information from the lrmd process to answer questions 1 and 2 above.
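For context on where that op lives: a role=Stopped monitor is declared on the resource definition itself, so the lrmd keeps probing even while the cluster believes the resource is stopped. A minimal crm-shell sketch; the resource name and the Dummy agent are placeholders, only the op line comes from the thread:

```
primitive p_external ocf:pacemaker:Dummy \
    op monitor enabled=true timeout=20s interval=11s role=Stopped
```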
Re: [Pacemaker] why pacemaker does not control the resources
On 12 Nov 2013, at 4:42 pm, Andrey Groshev gre...@yandex.ru wrote: 11.11.2013, 03:44, Andrew Beekhof and...@beekhof.net: On 8 Nov 2013, at 7:49 am, Andrey Groshev gre...@yandex.ru wrote: Hi, PPL! I need help. I do not understand why this has stopped working. This configuration works on another cluster, but that one is on corosync1. So... a postgres cluster with master/slave, the classic config as in the wiki. I build the cluster and start it, and it is working. Next I kill postgres on the Master with signal 6, as if disk space had run out:

# pkill -6 postgres
# ps axuww | grep postgres
root  9032  0.0  0.1 103236  860 pts/0  S+  00:37  0:00 grep postgres

PostgreSQL dies, but crm_mon shows that the master is still running.

Last updated: Fri Nov 8 00:42:08 2013
Last change: Fri Nov 8 00:37:05 2013 via crm_attribute on dev-cluster2-node4
Stack: corosync
Current DC: dev-cluster2-node4 (172793107) - partition with quorum
Version: 1.1.10-1.el6-368c726
3 Nodes configured
7 Resources configured

Node dev-cluster2-node2 (172793105): online
  pingCheck (ocf::pacemaker:ping): Started
  pgsql (ocf::heartbeat:pgsql): Started
Node dev-cluster2-node3 (172793106): online
  pingCheck (ocf::pacemaker:ping): Started
  pgsql (ocf::heartbeat:pgsql): Started
Node dev-cluster2-node4 (172793107): online
  pgsql (ocf::heartbeat:pgsql): Master
  pingCheck (ocf::pacemaker:ping): Started
  VirtualIP (ocf::heartbeat:IPaddr2): Started

Node Attributes:
* Node dev-cluster2-node2:
  + default_ping_set : 100
  + master-pgsql : -INFINITY
  + pgsql-data-status : STREAMING|ASYNC
  + pgsql-status : HS:async
* Node dev-cluster2-node3:
  + default_ping_set : 100
  + master-pgsql : -INFINITY
  + pgsql-data-status : STREAMING|ASYNC
  + pgsql-status : HS:async
* Node dev-cluster2-node4:
  + default_ping_set : 100
  + master-pgsql : 1000
  + pgsql-data-status : LATEST
  + pgsql-master-baseline : 0278
  + pgsql-status : PRI

Migration summary:
* Node dev-cluster2-node4:
* Node dev-cluster2-node2:
* Node dev-cluster2-node3:

Tickets:

CONFIG:

node $id=172793105 dev-cluster2-node2. \
    attributes pgsql-data-status=STREAMING|ASYNC standby=false
node $id=172793106 dev-cluster2-node3. \
    attributes pgsql-data-status=STREAMING|ASYNC standby=false
node $id=172793107 dev-cluster2-node4. \
    attributes pgsql-data-status=LATEST
primitive VirtualIP ocf:heartbeat:IPaddr2 \
    params ip=10.76.157.194 \
    op start interval=0 timeout=60s on-fail=stop \
    op monitor interval=10s timeout=60s on-fail=restart \
    op stop interval=0 timeout=60s on-fail=block
primitive pgsql ocf:heartbeat:pgsql \
    params pgctl=/usr/pgsql-9.1/bin/pg_ctl psql=/usr/pgsql-9.1/bin/psql \
        pgdata=/var/lib/pgsql/9.1/data tmpdir=/tmp/pg start_opt="-p 5432" \
        logfile=/var/lib/pgsql/9.1//pgstartup.log rep_mode=async \
        node_list="dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4." \
        restore_command="gzip -cd /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz %p" \
        primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
        master_ip=10.76.157.194 \
    op start interval=0 timeout=60s on-fail=restart \
    op monitor interval=5s timeout=61s on-fail=restart \
    op monitor interval=1s role=Master timeout=62s on-fail=restart \
    op promote interval=0 timeout=63s on-fail=restart \
    op demote interval=0 timeout=64s on-fail=stop \
    op stop interval=0 timeout=65s on-fail=block \
    op notify interval=0 timeout=66s
primitive pingCheck ocf:pacemaker:ping \
    params name=default_ping_set host_list=10.76.156.1 multiplier=100 \
    op start interval=0 timeout=60s on-fail=restart \
    op monitor interval=10s timeout=60s on-fail=restart \
    op stop interval=0 timeout=60s on-fail=ignore
ms msPostgresql pgsql \
    meta master-max=1 master-node-max=1 clone-node-max=1 notify=true target-role=Master clone-max=3
clone clnPingCheck pingCheck \
    meta clone-max=3
location l0_DontRunPgIfNotPingGW msPostgresql \
    rule $id=l0_DontRunPgIfNotPingGW-rule -inf: not_defined default_ping_set or default_ping_set lt 100
colocation r0_StartPgIfPingGW inf: msPostgresql clnPingCheck
colocation r1_MastersGroup inf: VirtualIP msPostgresql:Master
order rsc_order-1 0: clnPingCheck msPostgresql
order rsc_order-2 0: msPostgresql:promote VirtualIP:start symmetrical=false
order rsc_order-3 0: msPostgresql:demote VirtualIP:stop symmetrical=false
property $id=cib-bootstrap-options \
    dc-version=1.1.10-1
Re: [Pacemaker] The larger cluster is tested.
throttle_get_total_job_limit pacemaker.log (snip)

Nov 08 11:08:31 [2387] vm13 crmd: ( throttle.c:629 ) trace: throttle_get_total_job_limit: No change to batch-limit=0
Nov 08 11:08:32 [2387] vm13 crmd: ( throttle.c:632 ) trace: throttle_get_total_job_limit: Using batch-limit=8
(snip)
Nov 08 11:10:32 [2387] vm13 crmd: ( throttle.c:632 ) trace: throttle_get_total_job_limit: Using batch-limit=16

The above shows that the problem is not solved even when the total number of jobs is restricted by batch-limit. Are there any other ways to reduce the synchronous messages? There are not that many internal IPC messages; could they not be given at least a little processing time while the synchronization messages are being handled? Regards, Yusuke 2013/11/12 Andrew Beekhof and...@beekhof.net: On 11 Nov 2013, at 11:48 pm, yusuke iida yusk.i...@gmail.com wrote: Execution of the graph was also checked. Since the number of pending operations is restricted to 16 from partway through, I judge that batch-limit is effective. Observing here, even when jobs are restricted by batch-limit, two or more jobs are always fired in one second. Those jobs return results, and the results generate synchronous CIB messages; the node which keeps receiving synchronous messages processes them preferentially and postpones internal IPC messages. I think that caused the timeout. What load-threshold were you running this with? I see this in the logs: Host vm10 supports a maximum of 4 jobs and throttle mode 0100. New job limit is 1 Have you set LRMD_MAX_CHILDREN=4 on these nodes? I wouldn't recommend that for a single-core VM. I'd let the default of 2*cores be used. Also, I'm not seeing "Extreme CIB load detected". Are these still single-core machines?
If so it would suggest that something about:

    if(cores == 1) {
        cib_max_cpu = 0.4;
    }
    if(throttle_load_target > 0.0 && throttle_load_target < cib_max_cpu) {
        cib_max_cpu = throttle_load_target;
    }

    if(load > 1.5 * cib_max_cpu) {
        /* Can only happen on machines with a low number of cores */
        crm_notice("Extreme %s detected: %f", desc, load);
        mode |= throttle_extreme;

is wrong. What was load-threshold configured as?

--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com