Re: [Pacemaker] command to dump cluster configuration in pcs format?

2014-01-16 Thread Andrew Beekhof

On 16 Jan 2014, at 10:59 pm, Lars Marowsky-Bree l...@suse.com wrote:

 On 2014-01-15T20:25:30, Bob Haxo bh...@sgi.com wrote:
 
 Unfortunately, it configuration has taken me weeks to develop (what now
 seems to be) a working configuration (including mods to the
 VirtualDomain agent to avoid spurious restarts of the VM).
 
 Curious if you can push these upstream too ;-) (Or already have.)
 
 The problem is that this goes into a product that gets shipped to mfg,
 and then to customers, and then needs to be supported by other engineers
 (and then often back to me).  
 
 Easy to create configurations with crm and then load
 (crm -f file) the Pacemaker configuration using the crm commands created
 with crm configure show, with some scripted substitutions for
 hostnames, IP addresses, and other site customizations.
 
 The SLES HAE uses crm, and I'm trying to make SLES and RHEL versions as
 identical as possible.  Makes it easier for me to maintain, and for
 others to support.
 
 Well, unless RHT states that installing crmsh on top of their
 distribution invalidates support for the pacemaker back-end, you could
 just ship crmsh as part of your product on that platform.

That's not how RHT operates, I'm afraid.  If something isn't at least planned to 
be supported, they don't ship it.
However, some interested party could get it into Fedora and from there into EPEL 
if they chose to.


 It should be
 easy to install on RHEL, and you're already installing your own
 product anyway, so it shouldn't be a huge problem to add one more
 package?
 
 
 
 Regards,
Lars
 
 -- 
 Architect Storage/HA
 SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, 
 HRB 21284 (AG Nürnberg)
 Experience is the name everyone gives to their mistakes. -- Oscar Wilde
 
 





Re: [Pacemaker] command to dump cluster configuration in pcs format?

2014-01-16 Thread Andrew Beekhof

On 17 Jan 2014, at 9:05 am, Lars Marowsky-Bree l...@suse.com wrote:

 On 2014-01-17T07:40:34, Andrew Beekhof and...@beekhof.net wrote:
 
 Well, unless RHT states that installing crmsh on top of their
 distribution invalidates support for the pacemaker back-end, you could
 just ship crmsh as part of your product on that platform.
 Thats not how RHT operates I'm afraid.  If something isn't at least planned 
 to be supported, they don't ship it.
 
 Right, and I very much understand that.
 
 The question was if installing crmsh on top of RHEL/RHCS (as he's doing
 now) would invalidate support for the rest of the system. And I can't
 answer that.

If some behaviour specific to crmsh (or the mixing of the two) was triggering a 
bug, that /may/ result in the fix being given lower priority for inclusion.
I can also imagine support wanting to ensure issues are reproducible with pcs 
prior to accepting them.

But I'd not think using crmsh would completely invalidate support.





Re: [Pacemaker] Question about new migration

2014-01-15 Thread Andrew Beekhof

On 15 Jan 2014, at 7:12 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote:

 Hi David,
 
 With new migration logic, when VM was migrated by 'node standby',
 start was performed in migrate_target. (migrate_from was not performed.)
 
 Is this the designed behavior?
 
 
 # crm_mon -rf1
 Stack: corosync
 Current DC: bl460g1n6 (3232261592) - partition with quorum
 Version: 1.1.11-0.27.b48276b.git.el6-b48276b
 2 Nodes configured
 3 Resources configured
 
 Online: [ bl460g1n6 bl460g1n7 ]
 
 Full list of resources:
 
 prmVM2  (ocf::heartbeat:VirtualDomain): Started bl460g1n6
 Clone Set: clnPing [prmPing]
 Started: [ bl460g1n6 bl460g1n7 ]
 
 Node Attributes:
 * Node bl460g1n6:
+ default_ping_set  : 100
 * Node bl460g1n7:
+ default_ping_set  : 100
 
 # crm node standby bl460g1n6
 # egrep do_lrm_rsc_op:|process_lrm_event: ha-log | grep prmVM2
 Jan 15 15:39:22 bl460g1n6 crmd[30795]: info: do_lrm_rsc_op:
 Performing key=11:5:0:be72ea63-75a9-4de4-a591-e716f960743b
 op=prmVM2_migrate_to_0
 Jan 15 15:39:28 bl460g1n6 crmd[30795]:   notice: process_lrm_event:
 LRM operation prmVM2_migrate_to_0 (call=16, rc=0, cib-update=66,
 confirmed=true) ok
 Jan 15 15:39:30 bl460g1n6 crmd[30795]: info: do_lrm_rsc_op:
 Performing key=7:6:0:be72ea63-75a9-4de4-a591-e716f960743b
 op=prmVM2_stop_0


Looks like the transition was aborted (5) and another (6) calculated.

Compare action:transition:expected_rc:uuid 

 key=11:5:0:be72ea63-75a9-4de4-a591-e716f960743b

and

 key=7:6:0:be72ea63-75a9-4de4-a591-e716f960743b
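
The transition number is the second field of the key.  As a rough sketch 
(assuming the same ha-log as above), something like this pulls out the 
transition number for each operation so the 5 -> 6 change is easy to spot:

  grep 'do_lrm_rsc_op:' ha-log | grep prmVM2 \
    | sed -n 's/.*key=\([0-9]*\):\([0-9]*\):.*op=\(.*\)/transition \2 action \1 \3/p'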



 Jan 15 15:39:30 bl460g1n6 crmd[30795]:   notice: process_lrm_event:
 LRM operation prmVM2_stop_0 (call=19, rc=0, cib-update=68,
 confirmed=true) ok
 
 Jan 15 15:39:30 bl460g1n7 crmd[29923]: info: do_lrm_rsc_op:
 Performing key=8:6:0:be72ea63-75a9-4de4-a591-e716f960743b
 op=prmVM2_start_0
 Jan 15 15:39:30 bl460g1n7 crmd[29923]:   notice: process_lrm_event:
 LRM operation prmVM2_start_0 (call=13, rc=0, cib-update=17,
 confirmed=true) ok
 
 
 Best Regards,
 Kazunori INOUE
 pcmk-Wed-15-Jan-2014.tar.bz2





Re: [Pacemaker] hangs pending

2014-01-15 Thread Andrew Beekhof

On 16 Jan 2014, at 12:41 am, Andrey Groshev gre...@yandex.ru wrote:

 
 
 15.01.2014, 02:53, Andrew Beekhof and...@beekhof.net:
 On 15 Jan 2014, at 12:15 am, Andrey Groshev gre...@yandex.ru wrote:
 
  14.01.2014, 10:00, Andrey Groshev gre...@yandex.ru:
  14.01.2014, 07:47, Andrew Beekhof and...@beekhof.net:
   Ok, here's what happens:
 
   1. node2 is lost
   2. fencing of node2 starts
   3. node2 reboots (and cluster starts)
   4. node2 returns to the membership
   5. node2 is marked as a cluster member
   6. DC tries to bring it into the cluster, but needs to cancel the 
 active transition first.
  Which is a problem since the node2 fencing operation is part of that
   7. node2 is in a transition (pending) state until fencing passes or 
 fails
   8a. fencing fails: transition completes and the node joins the cluster
 
   Thats in theory, except we automatically try again. Which isn't 
 appropriate.
   This should be relatively easy to fix.
 
   8b. fencing passes: the node is incorrectly marked as offline
 
   This I have no idea how to fix yet.
 
   On another note, it doesn't look like this agent works at all.
   The node has been back online for a long time and the agent is still 
 timing out after 10 minutes.
   So Once the script makes sure that the victim will rebooted and again 
 available via ssh - it exit with 0. does not seem true.
  Damn. Looks like you're right. At some time I broke my agent and had not 
 noticed it. Who will understand.
 I repaired my agent - after sending the reboot it now waits on STDIN.
 The earlier behaviour is back - hangs in pending until I manually send 
 reboot. :)
 
 Right. Now you're in case 8b.
 
 Can you try this patch:  http://paste.fedoraproject.org/68450/38973966
 
 
 I spent the whole day on experiments.
 It turns out like this:
 1. Built the cluster.
 2. On node-2, sent signal -4 (killed corosync).
 3. From node-1 (the DC), stonith sent a reboot.
 4. The node rebooted and the resources started.
 5. Again: on node-2, sent signal -4 (killed corosync).
 6. Again: from node-1 (the DC), stonith sent a reboot.
 7. Node-2 rebooted and hangs in pending.
 8. Waited and waited, then rebooted it manually.
 9. Node-2 rebooted and the resources started.
 10. GOTO step 2.

Logs?

 
 
 
  New logs: http://send2me.ru/crmrep1.tar.bz2
   On 14 Jan 2014, at 1:19 pm, Andrew Beekhof and...@beekhof.net wrote:
Apart from anything else, your timeout needs to be bigger:
 
Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: 
 (  commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] 
 (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' 
 with device 'st1' returned: -62 (Timer expired)
 
On 14 Jan 2014, at 7:18 am, Andrew Beekhof and...@beekhof.net wrote:
On 13 Jan 2014, at 8:31 pm, Andrey Groshev gre...@yandex.ru wrote:
13.01.2014, 02:51, Andrew Beekhof and...@beekhof.net:
On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru 
 wrote:
10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru:
10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net:
On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru 
 wrote:
 10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net:
  On 9 Jan 2014, at 11:11 pm, Andrey Groshev 
 gre...@yandex.ru wrote:
   08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net:
   On 29 Nov 2013, at 7:17 pm, Andrey Groshev 
 gre...@yandex.ru wrote:
Hi, ALL.
 
I'm still trying to cope with the fact that after the 
 fence - node hangs in pending.
   Please define pending.  Where did you see this?
   In crm_mon:
   ..
   Node dev-cluster2-node2 (172793105): pending
   ..
 
   The experiment was like this:
   Four nodes in cluster.
   On one of them kill corosync or pacemakerd (signal 4 or 6 
 oк 11).
   Thereafter, the remaining start it constantly reboot, 
 under various pretexts, softly whistling, fly low, not a 
 cluster member! ...
   Then in the log fell out Too many failures 
   All this time in the status in crm_mon is pending.
   Depending on the wind direction changed to UNCLEAN
   Much time has passed and I can not accurately describe 
 the behavior...
 
   Now I am in the following state:
   I tried locate the problem. Came here with this.
   I set big value in property stonith-timeout=600s.
   And got the following behavior:
   1. pkill -4 corosync
   2. from node with DC call my fence agent sshbykey
   3. It sends reboot victim and waits until she comes to 
 life again.
  Hmmm what version of pacemaker?
  This sounds like a timing issue that we fixed a while back
 Was a version 1.1.11 from December 3.
 Now try full update and retest.
That should be recent enough.  Can you create a crm_report the 
 next time you reproduce?
Of course yes. Little delay :)
 
..
cc1: warnings being treated as errors
upstart.c: In function ‘upstart_job_property

Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-15 Thread Andrew Beekhof

On 16 Jan 2014, at 6:53 am, Brian J. Murrell (brian) br...@interlinx.bc.ca 
wrote:

 On Wed, 2014-01-15 at 17:11 +1100, Andrew Beekhof wrote:
 
 Consider any long running action, such as starting a database.
 We do not update the CIB until after actions have completed, so there can 
 and will be times when the status section is out of date to one degree or 
 another.
 
 But that is the opposite of what I am reporting

I know, I was giving you another example of when the cib is not completely 
up-to-date with reality.

 and is acceptable.  It's
 acceptable for a resource that is in the process of starting being
 reported as stopped, because it's not yet started.

It may very well be partially started.  It's almost certainly not stopped, which 
is what is being reported.

 
 What I am seeing is resources being reported as stopped when they are in
 fact started/running and have been for a long time.
 
 At node startup is another point at which the status could potentially be 
 behind.
 
 Right.  Which is the case I am talking about.
 
 It sounds to me like you're trying to second guess the cluster, which is a 
 dangerous path.
 
 No, not trying to second guess at all.

You're not using the output to decide whether to perform some logic?
Because crm_mon is the more usual command to run right after startup (which 
would give you enough context to know things are still syncing).

  I'm just trying to ask the
 cluster what the state is and not getting the truth.  I am willing to
 believe whatever state the cluster says it's in as long as what I am
 getting is the truth.
 
 What if its the first node to start up?
 
 I'd think a timeout comes in to play here.
 
 There'd be no fresh copy to arrive in that case.
 
 I can't say that I know how the CIB works internally/entirely, but I'd
 imagine that when a cluster node starts up it tries to see if there is a
 more fresh CIB out there in the cluster.

Nope.

  Maybe this is part of the
 process of choosing/discovering a DC.

DC election happens at the crmd.  The cib is a dumb repository of name/value 
pairs.
It doesn't even understand new vs. old - only different. 

  But ultimately if the node is the
 first one up, it will eventually figure that out so that it can nominate
 itself as the DC.  Or it finds out that there is a DC already (and gets
 a fresh CIB from it?).  It's during that window that I propose that
 crm_resource should not be asserting anything and should just admit that
 it does not (yet) know.
 
 If it had enough information to know it was out of date, it wouldn't be out 
 of date.
 
 But surely it understands if it is in the process of joining a cluster
 or not, and therefore does know enough to know that it doesn't know if
 it's out of date or not.

And if it has a newer config compared to the existing nodes?

  But that it could be.
 
 As above, there are situations when you'd never get an answer.
 
 I should have added to my proposal or has determined that there is
 nothing to refresh it's CIB from and that it's local copy is
 authoritative for the whole cluster.
 
 b.
 
 
 
 





Re: [Pacemaker] command to dump cluster configuration in pcs format?

2014-01-15 Thread Andrew Beekhof

On 16 Jan 2014, at 11:49 am, Bob Haxo bh...@sgi.com wrote:

 On 01/15/2014 05:02 PM, Bob Haxo wrote:
  Greetings,
 
  The command  crm configure show dumps the cluster configuration in a 
  format
  that is suitable for use in configuring a cluster.
 
  The command pcs config generates nice human readable information, but 
  this is
  not directly suitable for use in configuring a cluster.
 
  Is there a pcs command analogous to the crm command that dumps the 
  cluster
  configuration in pcs format?
 
 
 
 On Wed, 2014-01-15 at 17:55 -0600, Chris Feist wrote:
 Currently there is not.  We may at some point look into this, but it isn't 
 on my 
 short term list of things to do.
 
 
 
 Thanks,
 Chris
 
 
 
 Oh, well, bummer ... but at least I hadn't missed the command in the docs or 
 in the installed code.

The list of commands you used to build the cluster is in your shell history too, 
remember.
You could just save that to a file instead and restore it with: bash 
./bit-of-history-i-care-about
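
For example, a minimal sketch (assuming bash, and that the cluster was built 
with pcs in the current shell; the file name is made up):

  # keep only the pcs commands from the shell history, as a replayable script
  history | sed 's/^ *[0-9]* *//' | grep '^pcs ' > rebuild-cluster.sh
  # review/edit, then on a fresh cluster:
  bash ./rebuild-cluster.sh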

 
 I'll probably use crm configure show to capture the pcs created 
 configuration for installations ... and save the pcs for when I correspond 
 with RedHat.
 
 Dumping and loading the xml really is not an option.

You dump and load xml a lot?
Even assuming yes, it's in a file that you don't have to read... so where is the 
problem?

 
 Regards,
 Bob Haxo
 
 
 
 
 
  Regards,
  Bob Haxo
 
 
 
 
 
 
 





Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-15 Thread Andrew Beekhof

On 16 Jan 2014, at 1:13 pm, Brian J. Murrell (brian) br...@interlinx.bc.ca 
wrote:

 On Thu, 2014-01-16 at 08:35 +1100, Andrew Beekhof wrote:
 
 I know, I was giving you another example of when the cib is not completely 
 up-to-date with reality.
 
 Yeah, I understood that.  I was just countering with why that example is
 actually more acceptable.
 
 It may very well be partially started.
 
 Sure.
 
 Its almost certainly not stopped which is what is being reported.
 
 Right.  But until it is completely started (and ready to do whatever
 it's supposed to do), it might as well be considered stopped.  If you
 have to make a binary state out of stopped, starting, started, I think
 most people will agree that the states are stopped and starting and
 stopped is anything  starting since most things are not useful until
 they are fully started.
 
 You're not using the output to decide whether to perform some logic?
 
 Nope.  Just reporting the state.  But that's difficult when you have two
 participants making positive assertions about state when one is not
 really in a position to do so.
 
 Because crm_mon is the more usual command to run right after startup
 
 The problem with crm_mon is that it doesn't tell you where a resource is
 running.

What crm_mon are you looking at?
I see stuff like:

 virt-fencing   (stonith:fence_xvm):Started rhos4-node3 
 Resource Group: mysql-group
 mysql-vip  (ocf::heartbeat:IPaddr2):   Started rhos4-node3 
 mysql-fs   (ocf::heartbeat:Filesystem):Started rhos4-node3 
 mysql-db   (ocf::heartbeat:mysql): Started rhos4-node3 


 
 (which would give you enough context to know things are still syncing).
 
 That's interesting.  Would polling crm_mon be more efficient than
 polling the remote CIB with cibadmin -Q?

crm_mon in interactive mode subscribes to updates from the cib, which would be 
more efficient than repeatedly calling cibadmin or crm_mon.
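
For reference (both forms exist in 1.1.x):

  crm_mon -1    # one-shot snapshot, what you'd otherwise poll for
  crm_mon       # interactive: stays connected and redraws on each cib update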

 
 DC election happens at the crmd.
 
 So would it be fair to say then that I should not trust the local CIB
 until DC election has finished or could there be latency between that
 completing and the CIB being refreshed?

Once the join completes (which happens after the election, or when a new node 
is found), it is safe.
You can tell this by running crmadmin -S -H `uname -n` and looking for S_IDLE, 
S_POLICY_ENGINE or S_TRANSITION_ENGINE, iirc.
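
A rough sketch of gating on that (the exact output format of crmadmin -S may 
differ between versions, so treat this as illustrative):

  # wait until the local crmd reports a stable state before querying the cib
  until crmadmin -S -H "$(uname -n)" 2>/dev/null \
        | grep -qE 'S_IDLE|S_POLICY_ENGINE|S_TRANSITION_ENGINE'; do
      sleep 2
  done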

 
 If DC election completion is accurate, what's the best way to determine
 that has completed?

Ideally it doesn't happen when a node joins an existing cluster.





Re: [Pacemaker] command to dump cluster configuration in pcs format?

2014-01-15 Thread Andrew Beekhof

On 16 Jan 2014, at 3:25 pm, Bob Haxo bh...@sgi.com wrote:

 
 On Thu, 2014-01-16 at 12:32 +1100, Andrew Beekhof wrote:
 On 16 Jan 2014, at 11:49 am, Bob Haxo bh...@sgi.com wrote:
 
 On 01/15/2014 05:02 PM, Bob Haxo wrote:
 Greetings,
 
 The command  crm configure show dumps the cluster configuration in a 
 format
 that is suitable for use in configuring a cluster.
 
 The command pcs config generates nice human readable information, but 
 this is
 not directly suitable for use in configuring a cluster.
 
 Is there a pcs command analogous to the crm command that dumps the 
 cluster
 configuration in pcs format?
 
 
 
 On Wed, 2014-01-15 at 17:55 -0600, Chris Feist wrote:
 Currently there is not.  We may at some point look into this, but it isn't 
 on my 
 short term list of things to do.
 
 Thanks,
 Chris
 
 Oh, well, bummer ... but at least I hadn't missed the command in the docs 
 or 
 in the installed code.
 
 There list of commands you used to build a cluster is in the history too 
 remember.
 You could just save that to a file instead and restore with bash 
 ./bit-of-history-i-care-about
 
 How very nice it would have been had I been able to have just entered
 the correct commands at single time in a single shell instance.
 Unfortunately, it configuration has taken me weeks to develop (what now
 seems to be) a working configuration (including mods to the
 VirtualDomain agent to avoid spurious restarts of the VM).

Now that it's done though, I'd not have thought reverse engineering the pcs 
commands was /that/ hard (from the xml or shell history).
And once you have that, it's not so far from crm show.

The challenge with forcing your input and output formats to match is that 
you're limited to how smart you can make the input side of things.
For example, having one command create both an ordering and colocation 
constraint is challenging... how do you determine whether to merge them or 
leave them separate when representing it in the output?
(Hint: no matter which option you pick someone will complain it should be the 
other :-)
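
For example (resource names are made up for illustration), the two halves stay 
separate on the pcs input side:

  pcs constraint order start my-vip then start my-db
  pcs constraint colocation add my-db with my-vip INFINITY

Whether a dump tool should emit those as two commands or merge them into one is 
exactly the judgement call above.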

It's not a completely black and white discussion...

 
 I'll probably use crm configure show to capture the pcs created 
 configuration for installations ... and save the pcs for when I
 correspond with RedHat.
 
 Dumping and loading the xml really is not an option.
 
 You dump and load xml a lot?
 No, but I did read that pcs can dump/load the cib xml.

Sorry, I meant the config (regardless of format).

 
 Even assuming yes, its in a file that you don't have 
 to read... so where is the problem?
 
 The problem is that this goes into a product that gets shipped to mfg,
 and then to customers, and then needs to be supported by other engineers
 (and then often back to me).  
 
 Easy to create configurations with crm and then load
 (crm -f file) the Pacemaker configuration using the crm commands created
 with crm configure show, with some scripted substitutions for
 hostnames, IP addresses, and other site customizations.
 
 The SLES HAE uses crm, and I'm trying to make SLES and RHEL versions as
 identical as possible.  Makes it easier for me to maintain, and for
 others to support.

Fair enough.

 
 Regards,
 Bob Haxo
 
 
 
 
 
 Regards,
 Bob Haxo
 
 





Re: [Pacemaker] Time to get ready for 1.1.11

2014-01-15 Thread Andrew Beekhof

On 16 Jan 2014, at 3:00 pm, Keisuke MORI keisuke.mori...@gmail.com wrote:

 Hi,
 
 Just curious,
 I found that RC4 has been branched out of the master after RC3.
 
 What would the fixes only in the master branch be in the future?
 Are they going to be merged into 1.1.12 someday and just skipping 1.1.11?

Yes.  Unlike last time we're trying to be better about not merging new features 
and other risky changes during the RC phase :-)
The branching should have happened earlier but I forgot.

 Or are they separated for the next major version such as v1.2 or v2.0?

I think our plans for 1.2/2.0 are on hold indefinitely.  It's all 1.1.x releases 
for the foreseeable future.
For a small dev team, the benefits didn't outweigh the costs.

 
 Thanks,
 
 
 2014/1/16 David Vossel dvos...@redhat.com:
 - Original Message -
 From: David Vossel dvos...@redhat.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Tuesday, January 7, 2014 4:50:11 PM
 Subject: Re: [Pacemaker] Time to get ready for 1.1.11
 
 - Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
 Sent: Thursday, December 19, 2013 2:25:00 PM
 Subject: Re: [Pacemaker] Time to get ready for 1.1.11
 
 
 On 20 Dec 2013, at 2:11 am, Andrew Martin amar...@xes-inc.com wrote:
 
 David/Andrew,
 
 Once 1.1.11 final is released, is it considered the new stable series of
 Pacemaker,
 
 yes
 
 or should 1.1.10 still be used in very stable/critical production
 environments?
 
 Thanks,
 
 Andrew
 
 - Original Message -
 From: David Vossel dvos...@redhat.com
 To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
 Sent: Wednesday, December 11, 2013 3:33:46 PM
 Subject: Re: [Pacemaker] Time to get ready for 1.1.11
 
 - Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
 Sent: Wednesday, November 20, 2013 9:02:40 PM
 Subject: [Pacemaker] Time to get ready for 1.1.11
 
 With over 400 updates since the release of 1.1.10, its time to start
 thinking about a new release.
 
 Today I have tagged release candidate 1[1].
 The most notable fixes include:
 
 + attrd: Implementation of a truely atomic attrd for use with corosync
 2.x
 + cib: Allow values to be added/updated and removed in a single update
 + cib: Support XML comments in diffs
 + Core: Allow blackbox logging to be disabled with SIGUSR2
 + crmd: Do not block on proxied calls from pacemaker_remoted
 + crmd: Enable cluster-wide throttling when the cib heavily exceeds
 its
 target load
 + crmd: Use the load on our peers to know how many jobs to send them
 + crm_mon: add --hide-headers option to hide all headers
 + crm_report: Collect logs directly from journald if available
 + Fencing: On timeout, clean up the agent's entire process group
 + Fencing: Support agents that need the host to be unfenced at startup
 + ipc: Raise the default buffer size to 128k
 + PE: Add a special attribute for distinguishing between real nodes
 and
 containers in constraint rules
 + PE: Allow location constraints to take a regex pattern to match
 against
 resource IDs
 + pengine: Distinguish between the agent being missing and something
 the
 agent needs being missing
 + remote: Properly version the remote connection protocol
 + services: Detect missing agents and permission errors before forking
 + Bug cl#5171 - pengine: Don't prevent clones from running due to
 dependant
 resources
 + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it
 is
 not already known
 + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as
 unsigned integers
 
 If you are a user of `pacemaker_remoted`, you should take the time to
 read
 about changes to the online wire protocol[2] that are present in this
 release.
 
 [1]
 https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1
 [2]
 http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/
 
 To build `rpm` packages for testing:
 
 1. Clone the current sources:
 
  # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
  # cd pacemaker
 
 1. If you haven't already, install Pacemaker's dependancies
 
  [Fedora] # sudo yum install -y yum-utils
  [ALL] # make rpm-dep
 
 1. Build Pacemaker
 
  # make rc
 
 1. Copy the rpms and deploy as needed
 
 
 A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing.
 https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2
 
 Assuming no major regressions are encountered during testing, this tag
 will
 become the final Pacemaker-1.1.11 release a week from today.
 
 -- Vossel
 
 Alright, New RC time. Pacemaker-1.1.11-rc3.
 
 If no regressions are encountered, rc3 will become the 1.1.11 final release 
 a
 week from today.
 
 https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc3
 
 CHANGES RC2 vs RC3

Re: [Pacemaker] [Linux-HA] Better way to change master in 3 node pgsql cluster

2014-01-14 Thread Andrew Beekhof

On 14 Jan 2014, at 10:32 pm, Andrey Rogovsky a.rogov...@gmail.com wrote:

 I understand it. So, no way change master better without cluster software 
 update?

crm_resource is just creating (and removing) normal location constraints.
There's no reason you couldn't write a script to create them instead.
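
On 1.1.7 the same effect can be had with a plain location constraint, e.g. from 
the crm shell (a sketch only -- the constraint name is made up, and the syntax 
is worth checking against 'crm configure help location'):

  # inside 'crm configure': keep the Master role off node a
  location ban-pgsql-master-on-a msPostgresql \
      rule $role=Master -inf: #uname eq a.geocluster.e-autopay.com
  commit
  # later, remove the constraint to let the master move back
  delete ban-pgsql-master-on-a
  commit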

 
 
 
 2014/1/14 Andrey Groshev gre...@yandex.ru
  
  
 14.01.2014, 12:39, Andrey Rogovsky a.rogov...@gmail.com:
 I use Debian 7 and got:
 Reconnecting...root@a:~# crm_resource --resource msPostgresql --ban --master 
 --host a.geocluster.e-autopay.com
 crm_resource: unrecognized option '--ban'
  
  
 No other way to move master?
  
 
 
 2014/1/13 Andrew Beekhof and...@beekhof.net
 
 On 13 Jan 2014, at 8:32 pm, Andrey Rogovsky a.rogov...@gmail.com wrote:
 
  Hi
 
  I have 3 node postgresql cluster.
  It work well. But I have some trobule with change master.
 
  For now, if I need change master, I must:
  1) Stop PGSQL on each node and cluster service
  2) Start Setup new manual PGSQL replication
  3) Change attributes on each node for point to new master
  4) Stop PGSQL on each node
  5) Celanup resource and start cluster service
 
  It take a lot of time. Is it exist better way to change master?
 Newer versions support:
 
crm_resource --resource msPostgresql --ban --master --host 
 a.geocluster.e-autopay.com
 
 
 
 
  This is my cluster service status:
  Node Attributes:
  * Node a.geocluster.e-autopay.com:
 + master-pgsql:0   : 1000
 + pgsql-data-status   : LATEST
 + pgsql-master-baseline   : 2F90
 + pgsql-status : PRI
  * Node c.geocluster.e-autopay.com:
 + master-pgsql:0   : 1000
 + pgsql-data-status   : SYNC
 + pgsql-status : STOP
  * Node b.geocluster.e-autopay.com:
 + master-pgsql:0   : 1000
 + pgsql-data-status   : SYNC
 + pgsql-status : STOP
 
  I was use http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster for my 3
  nodes cluster without hard stik.
  Now I got strange situation all nodes stay slave:
  
  Last updated: Sat Dec  7 04:33:47 2013
  Last change: Sat Dec  7 12:56:23 2013 via crmd on a
  Stack: openais
  Current DC: c - partition with quorum
  Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
  
 You use 1.1.7 version.
 Option --ban added in 1.1.9 
 See: https://github.com/ClusterLabs/pacemaker/blob/master/ChangeLog
  
 
  5 Nodes configured, 3 expected votes
  4 Resources configured.
  
 
  Online: [ a c b ]
 
  Master/Slave Set: msPostgresql [pgsql]
  Slaves: [ a c b ]
 
  My config is:
  node a \
  attributes pgsql-data-status=DISCONNECT
  node b \
  attributes pgsql-data-status=DISCONNECT
  node c \
  attributes pgsql-data-status=DISCONNECT
  primitive pgsql ocf:heartbeat:pgsql \
  params pgctl=/usr/lib/postgresql/9.3/bin/pg_ctl psql=/usr/bin/psql
  pgdata=/var/lib/postgresql/9.3/main start_opt=-p 5432 rep_mode=sync
  node_list=a b c restore_command=cp /var/lib/postgresql/9.3/pg_archive/%f
  %p master_ip=192.168.10.200 restart_on_promote=true
  config=/etc/postgresql/9.3/main/postgresql.conf \
  op start interval=0s timeout=60s on-fail=restart \
  op monitor interval=4s timeout=60s on-fail=restart \
  op monitor interval=3s role=Master timeout=60s on-fail=restart \
  op promote interval=0s timeout=60s on-fail=restart \
  op demote interval=0s timeout=60s on-fail=stop \
  op stop interval=0s timeout=60s on-fail=block \
  op notify interval=0s timeout=60s
  primitive pgsql-master-ip ocf:heartbeat:IPaddr2 \
  params ip=192.168.10.200 nic=peervpn0 \
  op start interval=0s timeout=60s on-fail=restart \
  op monitor interval=10s timeout=60s on-fail=restart \
  op stop interval=0s timeout=60s on-fail=block
  group master pgsql-master-ip
  ms msPostgresql pgsql \
  meta master-max=1 master-node-max=1 clone-max=3 clone-node-max=1
  notify=true
  colocation set_ip inf: master msPostgresql:Master
  order ip_down 0: msPostgresql:demote master:stop symmetrical=false
  order ip_up 0: msPostgresql:promote master:start symmetrical=false
  property $id=cib-bootstrap-options \
  dc-version=1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff \
  cluster-infrastructure=openais \
  expected-quorum-votes=3 \
  no-quorum-policy=ignore \
  stonith-enabled=false \
  crmd-transition-delay=0 \
  last-lrm-refresh=1386404222
  rsc_defaults $id=rsc-options \
  resource-stickiness=100 \
  migration-threshold=1

Re: [Pacemaker] hangs pending

2014-01-14 Thread Andrew Beekhof

On 15 Jan 2014, at 12:15 am, Andrey Groshev gre...@yandex.ru wrote:

 
 
 14.01.2014, 10:00, Andrey Groshev gre...@yandex.ru:
 14.01.2014, 07:47, Andrew Beekhof and...@beekhof.net:
 
  Ok, here's what happens:
 
  1. node2 is lost
  2. fencing of node2 starts
  3. node2 reboots (and cluster starts)
  4. node2 returns to the membership
  5. node2 is marked as a cluster member
  6. DC tries to bring it into the cluster, but needs to cancel the active 
 transition first.
 Which is a problem since the node2 fencing operation is part of that
  7. node2 is in a transition (pending) state until fencing passes or fails
  8a. fencing fails: transition completes and the node joins the cluster
 
  Thats in theory, except we automatically try again. Which isn't 
 appropriate.
  This should be relatively easy to fix.
 
  8b. fencing passes: the node is incorrectly marked as offline
 
  This I have no idea how to fix yet.
 
  On another note, it doesn't look like this agent works at all.
  The node has been back online for a long time and the agent is still 
 timing out after 10 minutes.
  So Once the script makes sure that the victim will rebooted and again 
 available via ssh - it exit with 0. does not seem true.
 
 Damn. Looks like you're right. At some time I broke my agent and had not 
 noticed it. Who will understand.
 
 I repaired my agent - after sending the reboot it now waits on STDIN.
 The earlier behaviour is back - hangs in pending until I manually send reboot. 
 :) 

Right. Now you're in case 8b.

Can you try this patch:  http://paste.fedoraproject.org/68450/38973966

 New logs: http://send2me.ru/crmrep1.tar.bz2
 
 
  On 14 Jan 2014, at 1:19 pm, Andrew Beekhof and...@beekhof.net wrote:
   Apart from anything else, your timeout needs to be bigger:
 
   Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
 commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] 
 (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with 
 device 'st1' returned: -62 (Timer expired)
 
   On 14 Jan 2014, at 7:18 am, Andrew Beekhof and...@beekhof.net wrote:
   On 13 Jan 2014, at 8:31 pm, Andrey Groshev gre...@yandex.ru wrote:
   13.01.2014, 02:51, Andrew Beekhof and...@beekhof.net:
   On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru wrote:
   10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru:
   10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net:
   On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru 
 wrote:
10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net:
 On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru 
 wrote:
  08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net:
  On 29 Nov 2013, at 7:17 pm, Andrey Groshev 
 gre...@yandex.ru wrote:
   Hi, ALL.
 
   I'm still trying to cope with the fact that after the 
 fence - node hangs in pending.
  Please define pending.  Where did you see this?
  In crm_mon:
  ..
  Node dev-cluster2-node2 (172793105): pending
  ..
 
  The experiment was like this:
  Four nodes in cluster.
  On one of them kill corosync or pacemakerd (signal 4 or 6 oк 
 11).
  Thereafter, the remaining start it constantly reboot, under 
 various pretexts, softly whistling, fly low, not a cluster 
 member! ...
  Then in the log fell out Too many failures 
  All this time in the status in crm_mon is pending.
  Depending on the wind direction changed to UNCLEAN
  Much time has passed and I can not accurately describe the 
 behavior...
 
  Now I am in the following state:
  I tried locate the problem. Came here with this.
  I set big value in property stonith-timeout=600s.
  And got the following behavior:
  1. pkill -4 corosync
  2. from node with DC call my fence agent sshbykey
  3. It sends reboot victim and waits until she comes to life 
 again.
 Hmmm what version of pacemaker?
 This sounds like a timing issue that we fixed a while back
Was a version 1.1.11 from December 3.
Now try full update and retest.
   That should be recent enough.  Can you create a crm_report the 
 next time you reproduce?
   Of course yes. Little delay :)
 
   ..
   cc1: warnings being treated as errors
   upstart.c: In function ‘upstart_job_property’:
   upstart.c:264: error: implicit declaration of function 
 ‘g_variant_lookup_value’
   upstart.c:264: error: nested extern declaration of 
 ‘g_variant_lookup_value’
   upstart.c:264: error: assignment makes pointer from integer without 
 a cast
   gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
   gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
   make[1]: *** [all-recursive] Error 1
   make[1]: Leaving directory `/root/ha/pacemaker/lib'
   make: *** [core] Error 1
 
   I'm trying to solve this a problem.
   Do not get solved quickly...
 
   
 https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
   g_variant_lookup_value () Since

Re: [Pacemaker] [Enhancement] Change of the globally-unique attribute of the resource.

2014-01-14 Thread Andrew Beekhof

On 14 Jan 2014, at 7:26 pm, renayama19661...@ybb.ne.jp wrote:

 Hi All,
 
 When a user changes the globally-unique attribute of the resource, a 
 problem occurs.
 
 When it manages the resource with PID file, this occurs, but this is because 
 PID file name changes by globally-unique attribute.
 
 (snip)
 if [ ${OCF_RESKEY_CRM_meta_globally_unique} = false ]; then
: ${OCF_RESKEY_pidfile:=$HA_VARRUN/ping-${OCF_RESKEY_name}}
 else
: ${OCF_RESKEY_pidfile:=$HA_VARRUN/ping-${OCF_RESOURCE_INSTANCE}}
 fi
 (snip)

This is correct.  The pid file cannot include the instance number when 
globally-unique is false and must do so when it is true.

 
 
 The problem can reappear in the following procedure.
 
 * Step1: Started a resource.
 (snip)
 primitive prmPingd ocf:pacemaker:pingd \
params name=default_ping_set host_list=192.168.0.1 
 multiplier=200 \
op start interval=0s timeout=60s on-fail=restart \
op monitor interval=10s timeout=60s on-fail=restart \
op stop interval=0s timeout=60s on-fail=ignore
 clone clnPingd prmPingd
 (snip)
 
 * Step2: Change globally-unique attribute.
 
 [root]# crm configure edit
 (snip)
 clone clnPingd prmPingd \
meta clone-max=2 clone-node-max=2 globally-unique=true
 (snip)
 
 * Step3: Stop Pacemaker
 
 But, the resource does not stop because PID file was changed as for the 
 changed resource of the globally-unique attribute.

I'd have expected the stop action to be performed with the old attributes.
crm_report tarball?
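
Something along these lines should produce it (the time window is only 
illustrative -- widen it to cover the stop attempt):

  crm_report -f "2014-01-14 00:00:00" -t "2014-01-14 23:59:59" /tmp/globally-unique-stop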


 
 I think that this is a known problem.

It wasn't until now.

 
 I wish this problem is solved in the future
 
 Best Regards,
 Hideo Yamauchi.
 





Re: [Pacemaker] [Enhancement] Change of the globally-unique attribute of the resource.

2014-01-14 Thread Andrew Beekhof

On 15 Jan 2014, at 12:06 pm, renayama19661...@ybb.ne.jp wrote:

 Hi Andrew,
 
 Sorry
 
 This problem is specific to Pacemaker 1.0.
 On Pacemaker 1.1.11, the resource stopped correctly.
 
 When the globally-unique attribute is changed in Pacemaker 1.1, Pacemaker 
 seems to restart the resource.

Makes sense, since the definition changed

 
 (snip)
 Jan 15 18:29:40 rh64-2744 pengine[3369]:  warning: process_rsc_state: 
 Detected active orphan prmClusterMon running on rh64-2744
 Jan 15 18:29:40 rh64-2744 pengine[3369]: info: clone_print:  Clone Set: 
 clnClusterMon [prmClusterMon] (unique)
 Jan 15 18:29:40 rh64-2744 pengine[3369]: info: native_print:  
 prmClusterMon:0#011(ocf::pacemaker:ClusterMon):#011Stopped
 Jan 15 18:29:40 rh64-2744 pengine[3369]: info: native_print:  
 prmClusterMon:1#011(ocf::pacemaker:ClusterMon):#011Stopped
 Jan 15 18:29:40 rh64-2744 pengine[3369]: info: native_print: 
 prmClusterMon#011(ocf::pacemaker:ClusterMon):#011 ORPHANED Started rh64-2744
 Jan 15 18:29:40 rh64-2744 pengine[3369]:   notice: DeleteRsc: Removing 
 prmClusterMon from rh64-2744
 Jan 15 18:29:40 rh64-2744 pengine[3369]: info: native_color: Stopping 
 orphan resource prmClusterMon
 Jan 15 18:29:40 rh64-2744 pengine[3369]: info: RecurringOp:  Start 
 recurring monitor (10s) for prmClusterMon:0 on rh64-2744
 Jan 15 18:29:40 rh64-2744 pengine[3369]: info: RecurringOp:  Start 
 recurring monitor (10s) for prmClusterMon:1 on rh64-2744
 Jan 15 18:29:40 rh64-2744 pengine[3369]:   notice: LogActions: Start   
 prmClusterMon:0#011(rh64-2744)
 Jan 15 18:29:40 rh64-2744 pengine[3369]:   notice: LogActions: Start   
 prmClusterMon:1#011(rh64-2744)
 Jan 15 18:29:40 rh64-2744 pengine[3369]:   notice: LogActions: Stop    
 prmClusterMon#011(rh64-2744)
 
 (snip)
 
 Best Regards,
 Hideo Yamauchi.
 
 --- On Wed, 2014/1/15, renayama19661...@ybb.ne.jp 
 renayama19661...@ybb.ne.jp wrote:
 
 Hi Andrew,
 
 Thank you for comment.
 
 But, the resource does not stop because PID file was changed as for the 
 changed resource of the globally-unique attribute.
 
 I'd have expected the stop action to be performed with the old attributes.
 crm_report tarball?
 
 Okay.
 
 I register this topic with Bugzilla.
 I attach the log to Bugzilla.
 
 Best Regards,
 Hideo Yamauchi.
 --- On Wed, 2014/1/15, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 14 Jan 2014, at 7:26 pm, renayama19661...@ybb.ne.jp wrote:
 
 Hi All,
 
 When a user changes the globally-unique attribute of the resource, a 
 problem occurs.
 
 When it manages the resource with PID file, this occurs, but this is 
 because PID file name changes by globally-unique attribute.
 
 (snip)
 if [ ${OCF_RESKEY_CRM_meta_globally_unique} = false ]; then
 : ${OCF_RESKEY_pidfile:=$HA_VARRUN/ping-${OCF_RESKEY_name}}
 else
 : ${OCF_RESKEY_pidfile:=$HA_VARRUN/ping-${OCF_RESOURCE_INSTANCE}}
 fi
 (snip)
 
 This is correct.  The pid file cannot include the instance number when 
 globally-unique is false and must do so when it is true.
 
 
 
 The problem can reappear in the following procedure.
 
 * Step1: Started a resource.
 (snip)
 primitive prmPingd ocf:pacemaker:pingd \
 params name=default_ping_set host_list=192.168.0.1 
 multiplier=200 \
 op start interval=0s timeout=60s on-fail=restart \
 op monitor interval=10s timeout=60s on-fail=restart \
 op stop interval=0s timeout=60s on-fail=ignore
 clone clnPingd prmPingd
 (snip)
 
 * Step2: Change globally-unique attribute.
 
 [root]# crm configure edit
 (snip)
 clone clnPingd prmPingd \
 meta clone-max=2 clone-node-max=2 globally-unique=true
 (snip)
 
 * Step3: Stop Pacemaker
 
 But, the resource does not stop because PID file was changed as for the 
 changed resource of the globally-unique attribute.
 
 I'd have expected the stop action to be performed with the old attributes.
 crm_report tarball?
 
 
 
 I think that this is a known problem.
 
 It wasn't until now.
 
 
 I wish this problem is solved in the future
 
 Best Regards,
 Hideo Yamauchi.
 
 
 
 
 
 




Re: [Pacemaker] Consider extra slave node resource when calculating actions for failover

2014-01-14 Thread Andrew Beekhof

On 14 Jan 2014, at 11:25 pm, Juraj Fabo juraj.f...@gmail.com wrote:

 Hi
 
 I have master-slave cluster with configuration attached below.
 It is based on documented postgresql master-slave cluster configuration.
 Colocation constraints should work that way that if some of  master-group 
 resources fails,
 failover to slave node will be done. This basically works ok.
 
 I would like to have an additional condition integrated.
 On and only on the HotStandby runs resource SERVICE-res-mon-s1.
 If the SERVICE-res-mon-s1 resource on slave reports negative score then 
 failover should not be done,
 because it indicates that the slave node is not ready to run services from 
 master-group.
 However, even if the SERVICE-res-mon-s1 fails, postgresql slave (HotStandby) 
 should still run, 
 because the SERVICE-res-mon-s1 monitors some application related 
 functionality which does not block the postgres itself.
 
 The requested feature is very close to the one described in the 
 http://clusterlabs.org/doc/en-
 US/Pacemaker/1.0/html/Pacemaker_Explained/ch09s03s03s02.html
 prefer nodes with the most connectivity
 
 with that difference, that the resource agent is running only on the standby 
 node.
 Reason is, that the SERVICE-res-mon-s1 is in reality minimalistic 
 implementation of the SERVICE-service
 in order to know whether the slave would be able to run the SERVICE-service.

The simplest approach might be to run:

crm_attribute --name postgres --value 100 --node serv1 --lifetime forever 

And then have SERVICE-res-mon-s1 run:

crm_attribute --name postgres --value 10 --node serv2 --lifetime reboot

whenever it starts and 

crm_attribute --name postgres --delete --node serv2 --lifetime reboot

whenever it stops.

Then you can use the 'postgres' attribute in the same way as you did with 
'ping'.
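
The failover gate is then a location rule on that attribute, analogous to the 
documented ping rule (a sketch in crm shell syntax; the ms resource name here 
is hypothetical):

  # keep the Master role off any node whose 'postgres' attribute is
  # missing or not positive
  location master-needs-postgres ms-SERVICE-pgsql \
      rule $role=Master -inf: not_defined postgres or postgres lte 0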

 
 If needed, I could change this design to use clone of SERVICE-res-mon-s1 to 
 run them on both nodes, however, I did not succeed even with this 
 configuration.
 Next step would be to have multiple instances of this resource agent running 
 on both nodes (with different parameter spanID) and preffer node where more 
 spans are ok.
 
 Ocf agent service_res_check reports the resource availability via monitor 
 function, where it updates its own score attribute via crm_attribute.
 I thought that using this custom score attribute in location or colocation 
 constraint could do the job, but it did not affected the failover logic.
 
 Please, what should be done in order to have the cluster consider also the 
 SERVICE-res-mon-s1 results when calculating resources score and willingnes 
 to move?
 
 note: my pacemaker contains also patch from 
 https://github.com/beekhof/pacemaker/commit/58962338
 
 Thank you in advance
 
 node $id=1 serv1 \
attributes SERVICE-pgsql-data-status=STREAMING|ASYNC
 node $id=2 serv2 \
attributes SERVICE-pgsql-data-status=LATEST
 primitive SERVICE-MIP1 ocf:heartbeat:IPaddr2 \
params ip=10.40.0.70 cidr_netmask=24 iflabel=ma1 \
op monitor interval=10s
 primitive SERVICE-MIP2 ocf:heartbeat:IPaddr2 \
params ip=10.40.0.71 cidr_netmask=26 iflabel=ma2 \
op monitor interval=10s
 primitive SERVICE-VIP ocf:heartbeat:IPaddr2 \
params ip=10.40.0.72 cidr_netmask=24 iflabel=sla \
meta resource-stickiness=1 \
op monitor interval=10s timeout=60s on-fail=restart
 primitive SERVICE-res-mon-s1 ocf:heartbeat:service_res_check \
params spanID=1 \
meta resource-stickiness=1 \
op monitor interval=9s timeout=4s on-fail=restart
 primitive SERVICE-pgsql ocf:heartbeat:pgsql \
params master_ip=10.40.0.70 slave_ip=10.40.0.72 node_list=serv1 
 serv2 pgctl=/usr/bin/pg_ctl psql=/usr/bin/psql 
 pgdata=/var/lib/pgsql/data/ start_opt=-p 5432 rep_mode=async 
 logfile=/var/log/service_ra_pgsql.log 
 primary_conninfo_opt=keepalives_idle=60 keepalives_interval=5 
 keepalives_count=5 stop_escalate=0 \
op start interval=0s timeout=120s on-fail=restart \
op monitor interval=7s timeout=30s on-fail=restart \
op monitor interval=2s role=Master timeout=30s on-
 fail=restart \
op promote interval=0s timeout=120s on-fail=restart \
op demote interval=0s timeout=30s on-fail=stop \
op stop interval=0s timeout=30s on-fail=block \
op notify interval=0s timeout=30s
 primitive SERVICE-pingCheck ocf:pacemaker:ping \
params host_list=10.40.0.99 name=default_ping_set 
 multiplier=100 \
op start interval=0s timeout=60s on-fail=restart \
op monitor interval=2s timeout=60s on-fail=restart \
op stop interval=0s timeout=60s on-fail=ignore
 primitive SERVICE-service ocf:heartbeat:service_service_ocf \
op monitor interval=7s timeout=30s on-fail=restart
 primitive SERVICE-tomcat ocf:heartbeat:tomcat \
params java_home=/usr/java/default 
 catalina_home=/usr/share/tomcat6 statusurl=http://127.0.0.1:9081/admin; 
 catalina_pid=/var/run/tomcat6.pid 

Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-14 Thread Andrew Beekhof

On 14 Jan 2014, at 11:50 pm, Brian J. Murrell (brian) br...@interlinx.bc.ca 
wrote:

 On Tue, 2014-01-14 at 16:01 +1100, Andrew Beekhof wrote:
 
 On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote:
 
 The local cib hasn't caught up yet by the looks of it.
 
 I should have asked in my previous message: is this entirely an artifact
 of having just restarted or are there any other times where the local
 CIB can in fact be out of date (and thus crm_resource is inaccurate), if
 even for a brief period of time?  I just want to completely understand
 the nature of this situation.

Consider any long running action, such as starting a database.
We do not update the CIB until after actions have completed, so there can and 
will be times when the status section is out of date to one degree or another.
At node startup is another point at which the status could potentially be 
behind.

It sounds to me like you're trying to second guess the cluster, which is a 
dangerous path.

 
 It doesn't know that it doesn't know.
 
 But it (pacemaker at least) does know that it's just started up, and
 should also know whether it's gotten a fresh copy of the CIB since
 starting up, right?  

What if it's the first node to start up?  There'd be no fresh copy to arrive in 
that case.
Many things are obvious to external observers that are not at all obvious to 
the cluster.

If it had enough information to know it was out of date, it wouldn't be out of 
date.

 I think I'd consider it required behaviour that
 pacemaker not consider itself authoritative enough to provide answers
 like location until it has gotten a fresh copy of the CIB.
 
 Does it show anything as running?  Any nodes as online?
 
 
 I'd not expect that it stays in that situation for more than a second or 
 two...
 
 You are probably right about that.  But unfortunately that second or two
 provides a large enough window to provide mis-information.
 
 We could add an option to force crm_resource to use the master instance 
 instead of the local one I guess.
 
 Or, depending on the answers to above (like can this local-is-not-true
 situation every manifest itself at times other than just started)
 perhaps just don't allow crm_resource (or any other tool) to provide
 information from the local CIB until it's been refreshed at least once
 since a startup.

As above, there are situations when you'd never get an answer.

 
 I would much rather crm_resource experience some latency in being able
 to provide answers than provide wrong ones.  Perhaps there needs to be a
 switch to indicate if it should block waiting for the local CIB to be
 up-to-date or should return immediately with an unknown type response
 if the local CIB has not yet been updated since a start.
 
 Cheers,
 b.
 
 
 
 





Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrew Beekhof

On 13 Jan 2014, at 8:31 pm, Andrey Groshev gre...@yandex.ru wrote:

 
 
 13.01.2014, 02:51, Andrew Beekhof and...@beekhof.net:
 On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru wrote:
 
  10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru:
  10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net:
   On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote:
10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net:
 On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru wrote:
  08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net:
  On 29 Nov 2013, at 7:17 pm, Andrey Groshev gre...@yandex.ru 
 wrote:
   Hi, ALL.
 
   I'm still trying to cope with the fact that after the fence - 
 node hangs in pending.
  Please define pending.  Where did you see this?
  In crm_mon:
  ..
  Node dev-cluster2-node2 (172793105): pending
  ..
 
  The experiment was like this:
  Four nodes in cluster.
  On one of them kill corosync or pacemakerd (signal 4 or 6 oк 11).
  Thereafter, the remaining start it constantly reboot, under 
 various pretexts, softly whistling, fly low, not a cluster 
 member! ...
  Then in the log fell out Too many failures 
  All this time in the status in crm_mon is pending.
  Depending on the wind direction changed to UNCLEAN
  Much time has passed and I can not accurately describe the 
 behavior...
 
  Now I am in the following state:
  I tried locate the problem. Came here with this.
  I set big value in property stonith-timeout=600s.
  And got the following behavior:
  1. pkill -4 corosync
  2. from node with DC call my fence agent sshbykey
  3. It sends reboot victim and waits until she comes to life again.
 Hmmm what version of pacemaker?
 This sounds like a timing issue that we fixed a while back
Was a version 1.1.11 from December 3.
Now try full update and retest.
   That should be recent enough.  Can you create a crm_report the next 
 time you reproduce?
  Of course yes. Little delay :)
 
  ..
  cc1: warnings being treated as errors
  upstart.c: In function ‘upstart_job_property’:
  upstart.c:264: error: implicit declaration of function 
 ‘g_variant_lookup_value’
  upstart.c:264: error: nested extern declaration of 
 ‘g_variant_lookup_value’
  upstart.c:264: error: assignment makes pointer from integer without a cast
  gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
  gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
  make[1]: *** [all-recursive] Error 1
  make[1]: Leaving directory `/root/ha/pacemaker/lib'
  make: *** [core] Error 1
 
  I'm trying to solve this a problem.
  Do not get solved quickly...
 
  
 https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
  g_variant_lookup_value () Since 2.28
 
  # yum list installed glib2
  Loaded plugins: fastestmirror, rhnplugin, security
  This system is receiving updates from RHN Classic or Red Hat Satellite.
  Loading mirror speeds from cached hostfile
  Installed Packages
  glib2.x86_64  
 2.26.1-3.el6   
 installed
 
  # cat /etc/issue
  CentOS release 6.5 (Final)
  Kernel \r on an \m
 
 Can you try this patch?
 Upstart jobs wont work, but the code will compile
 
 diff --git a/lib/services/upstart.c b/lib/services/upstart.c
 index 831e7cf..195c3a4 100644
 --- a/lib/services/upstart.c
 +++ b/lib/services/upstart.c
 @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
  static char *
  upstart_job_property(const char *obj, const gchar * iface, const char *name)
  {
 +char *output = NULL;
 +
 +#if !GLIB_CHECK_VERSION(2,28,0)
 +static bool err = TRUE;
 +
 +if(err) {
 +crm_err("This version of glib is too old to support upstart jobs");
 +err = FALSE;
 +}
 +#else
  GError *error = NULL;
  GDBusProxy *proxy;
  GVariant *asv = NULL;
  GVariant *value = NULL;
  GVariant *_ret = NULL;
 -char *output = NULL;
 
  crm_info("Calling GetAll on %s", obj);
  proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
 @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * 
 iface, const char *name)
 
  g_object_unref(proxy);
  g_variant_unref(_ret);
 +#endif
  return output;
  }
 
 
 OK :) I patched the source. 
 Typed make rc - the same error.

Because it's not building your local changes

 Made a new copy via fetch - the same error.
 It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, 
 it downloads it. 
 Otherwise it uses the existing archive.
 Trimmed log ...
 
 # make rc
 make TAG=Pacemaker-1.1.11-rc3 rpm
 make[1]: Entering directory `/root/ha/pacemaker'
 rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
 if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
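
As noted above, the rc target packages a tagged release tarball (and re-uses an
already-downloaded ClusterLabs-pacemaker-*.tar.gz), so a patched working tree is
not what ends up being compiled.  A hedged sketch of building the locally
patched tree directly instead, assuming a standard git checkout of the pacemaker
source (commands are illustrative, not from the original thread):

    cd /root/ha/pacemaker
    ./autogen.sh
    ./configure
    make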

Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 5:13 am, Brian J. Murrell (brian) br...@interlinx.bc.ca 
wrote:

 Hi,
 
 I found a situation using pacemaker 1.1.10 on RHEL6.5 where the output
 of crm_resource -L is not trust-able, shortly after a node is booted.
 
 Here is the output from crm_resource -L on one of the nodes in a two
 node cluster (the one that was not rebooted):
 
 st-fencing(stonith:fence_foo):Started 
 res1  (ocf::foo:Target):  Started 
 res2  (ocf::foo:Target):  Started 
 
 Here is the output from the same command on the other node in the two
 node cluster right after it was rebooted:
 
 st-fencing(stonith:fence_foo):Stopped 
 res1  (ocf::foo:Target):  Stopped 
 res2  (ocf::foo:Target):  Stopped 
 
 These were collected at the same time (within the same second) on the
 two nodes.
 
 Clearly the rebooted node is not telling the truth.  Perhaps the truth
 for it is "I don't know", which would be fair enough, but that's not what
 pacemaker is asserting there.
 
 So, how do I know (i.e. programmatically -- what command can I issue to
 know) if and when crm_resource can be trusted to be truthful?

The local cib hasn't caught up yet by the looks of it.
You could compare 'cibadmin -Ql' with 'cibadmin -Q'
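
For illustration, a minimal sketch of that comparison (assuming bash and that
both commands are run on the freshly booted node); it simply polls until the
locally cached CIB matches the cluster-wide copy before trusting the listing:

    #!/bin/bash
    # Wait (up to ~30s) for the local CIB (-Ql) to match the cluster CIB (-Q).
    for i in $(seq 1 30); do
        if diff <(cibadmin -Ql) <(cibadmin -Q) >/dev/null 2>&1; then
            exec crm_resource -L    # local view has caught up; safe to trust
        fi
        sleep 1
    done
    echo "local CIB still out of sync after 30s" >&2
    exit 1

Comparing just the epoch/num_updates attributes of the <cib> element would be a
lighter-weight version of the same check.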

 
 b.
 
 
 
 


Re: [Pacemaker] [Linux-HA] Better way to change master in 3 node pgsql cluster

2014-01-13 Thread Andrew Beekhof

On 13 Jan 2014, at 8:32 pm, Andrey Rogovsky a.rogov...@gmail.com wrote:

 Hi
 
 I have a 3 node postgresql cluster.
 It works well, but I have some trouble with changing the master.
 
 For now, if I need to change the master, I must:
 1) Stop PGSQL on each node and the cluster service
 2) Set up new manual PGSQL replication
 3) Change attributes on each node to point to the new master
 4) Stop PGSQL on each node
 5) Clean up the resource and start the cluster service
 
 It takes a lot of time. Is there a better way to change the master?

Newer versions support:

   crm_resource --resource msPostgresql --ban --master --host 
a.geocluster.e-autopay.com
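
Since --ban works by adding a constraint, it is usually paired with --clear once
the new master has been promoted, so the old node stays eligible in the future.
A hedged example using the same option names and host as above:

    # force the master role away from node a
    crm_resource --resource msPostgresql --ban --master --host a.geocluster.e-autopay.com
    # ... after promotion has happened elsewhere, drop the constraint again
    crm_resource --resource msPostgresql --clear --master --host a.geocluster.e-autopay.com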

 
 
 
 This is my cluster service status:
 Node Attributes:
 * Node a.geocluster.e-autopay.com:
+ master-pgsql:0   : 1000
+ pgsql-data-status   : LATEST
+ pgsql-master-baseline   : 2F90
+ pgsql-status : PRI
 * Node c.geocluster.e-autopay.com:
+ master-pgsql:0   : 1000
+ pgsql-data-status   : SYNC
+ pgsql-status : STOP
 * Node b.geocluster.e-autopay.com:
+ master-pgsql:0   : 1000
+ pgsql-data-status   : SYNC
+ pgsql-status : STOP
 
 I used http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster for my 3
 node cluster without hard stik.
 Now I have got a strange situation: all nodes stay slave:
 
 Last updated: Sat Dec  7 04:33:47 2013
 Last change: Sat Dec  7 12:56:23 2013 via crmd on a
 Stack: openais
 Current DC: c - partition with quorum
 Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
 5 Nodes configured, 3 expected votes
 4 Resources configured.
 
 
 Online: [ a c b ]
 
 Master/Slave Set: msPostgresql [pgsql]
 Slaves: [ a c b ]
 
 My config is:
 node a \
 attributes pgsql-data-status=DISCONNECT
 node b \
 attributes pgsql-data-status=DISCONNECT
 node c \
 attributes pgsql-data-status=DISCONNECT
 primitive pgsql ocf:heartbeat:pgsql \
 params pgctl=/usr/lib/postgresql/9.3/bin/pg_ctl psql=/usr/bin/psql
 pgdata=/var/lib/postgresql/9.3/main start_opt=-p 5432 rep_mode=sync
 node_list=a b c restore_command=cp /var/lib/postgresql/9.3/pg_archive/%f
 %p master_ip=192.168.10.200 restart_on_promote=true
 config=/etc/postgresql/9.3/main/postgresql.conf \
 op start interval=0s timeout=60s on-fail=restart \
 op monitor interval=4s timeout=60s on-fail=restart \
 op monitor interval=3s role=Master timeout=60s on-fail=restart \
 op promote interval=0s timeout=60s on-fail=restart \
 op demote interval=0s timeout=60s on-fail=stop \
 op stop interval=0s timeout=60s on-fail=block \
 op notify interval=0s timeout=60s
 primitive pgsql-master-ip ocf:heartbeat:IPaddr2 \
 params ip=192.168.10.200 nic=peervpn0 \
 op start interval=0s timeout=60s on-fail=restart \
 op monitor interval=10s timeout=60s on-fail=restart \
 op stop interval=0s timeout=60s on-fail=block
 group master pgsql-master-ip
 ms msPostgresql pgsql \
 meta master-max=1 master-node-max=1 clone-max=3 clone-node-max=1
 notify=true
 colocation set_ip inf: master msPostgresql:Master
 order ip_down 0: msPostgresql:demote master:stop symmetrical=false
 order ip_up 0: msPostgresql:promote master:start symmetrical=false
 property $id=cib-bootstrap-options \
 dc-version=1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff \
 cluster-infrastructure=openais \
 expected-quorum-votes=3 \
 no-quorum-policy=ignore \
 stonith-enabled=false \
 crmd-transition-delay=0 \
 last-lrm-refresh=1386404222
 rsc_defaults $id=rsc-options \
 resource-stickiness=100 \
 migration-threshold=1


Re: [Pacemaker] Location / Colocation constraints issue

2014-01-13 Thread Andrew Beekhof

On 19 Dec 2013, at 1:08 am, Gaëtan Slongo gslo...@it-optics.com wrote:

 Hi !
 
 I'm currently building a 2 node cluster for firewalling.
 I would like to run shorewall on both the master and the slave
 node. I tried many things but nothing works as expected. The shorewall
 configurations are good.
 What I want to do is to start shorewall-standby on the other node as
 soon as my drbd resources are Slave or Stopped.
 Could you please give me a bit of help with this problem?

It will be something like:

colocation XXX -inf: shorewall-standby drbd_master_slave_ServicesConfigs1:Master
colocation YYY -inf: shorewall-standby drbd_master_slave_ServicesLogs1:Master
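
In crm shell terms those could be loaded into the existing configuration roughly
like this (XXX/YYY above are placeholder constraint ids; pick any unused names):

    crm configure colocation standby-not-on-configs-master -inf: shorewall-standby drbd_master_slave_ServicesConfigs1:Master
    crm configure colocation standby-not-on-logs-master -inf: shorewall-standby drbd_master_slave_ServicesLogs1:Master

Together with the existing "-inf: shorewall-standby shorewall" colocation, that
keeps the standby copy of shorewall off whichever node currently holds the DRBD
masters.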

 
 Here is my current config
 
 Thanks
 
 
 node keskonrix1 \
attributes standby=off
 node keskonrix2 \
attributes standby=off
 primitive VIPDMZ ocf:heartbeat:IPaddr2 \
params ip=10.0.1.1 nic=eth2 cidr_netmask=24 iflabel=VIPDMZ \
op monitor interval=30s timeout=30s
 primitive VIPEXPL ocf:heartbeat:IPaddr2 \
params ip=10.0.2.2 nic=eth3 cidr_netmask=28
 iflabel=VIPEXPL \
op monitor interval=30s timeout=30s
 primitive VIPLAN ocf:heartbeat:IPaddr2 \
params ip=192.168.1.248 nic=br0 cidr_netmask=16
 iflabel=VIPLAN \
op monitor interval=30s timeout=30s
 primitive VIPNET ocf:heartbeat:IPaddr2 \
params ip=XX.XX.XX.XX nic=eth1 cidr_netmask=29
 iflabel=VIPDMZ \
op monitor interval=30s timeout=30s
 primitive VIPPDA ocf:heartbeat:IPaddr2 \
params ip=XX.XX.XX.XX nic=eth1 cidr_netmask=29
 iflabel=VIPPDA \
op monitor interval=30s timeout=30s
 primitive apache2 lsb:apache2 \
op start interval=0 timeout=15s
 primitive bind9 lsb:bind9 \
op start interval=0 timeout=15s
 primitive dansguardian lsb:dansguardian \
op start interval=0 timeout=30s on-fail=ignore
 primitive drbd-ServicesConfigs1 ocf:linbit:drbd \
params drbd_resource=services-configs1 \
op monitor interval=29s role=Master \
op monitor interval=31s role=Slave
 primitive drbd-ServicesLogs1 ocf:linbit:drbd \
params drbd_resource=services-logs1 \
op monitor interval=29s role=Master \
op monitor interval=31s role=Slave
 primitive fs_ServicesConfigs1 ocf:heartbeat:Filesystem \
params device=/dev/drbd/by-res/services-configs1
 directory=/drbd/services-configs1/ fstype=ext4
 options=noatime,nodiratime \
meta target-role=Started
 primitive fs_ServicesLogs1 ocf:heartbeat:Filesystem \
params device=/dev/drbd/by-res/services-logs1
 directory=/drbd/services-logs1/ fstype=ext4
 options=noatime,nodiratime \
meta target-role=Started
 primitive ipsec-setkey lsb:setkey \
op start interval=0 timeout=30s
 primitive links_ServicesConfigs1 heartbeat:drbdlinks \
meta target-role=Started
 primitive openvpn lsb:openvpn \
op monitor interval=10 timeout=30s \
meta target-role=Started
 primitive racoon lsb:racoon \
op start interval=0 timeout=30s
 primitive shorewall lsb:shorewall \
op start interval=0 timeout=30s \
meta target-role=Started
 primitive shorewall-standby lsb:shorewall \
op start interval=0 timeout=30s
 primitive squid lsb:squid \
op start interval=0 timeout=15s \
op stop interval=0 timeout=120s
 group IPS-Services1 VIPLAN VIPDMZ VIPPDA VIPEXPL VIPNET \
meta target-role=Started
 group IPSec ipsec-setkey racoon
 group Services1 bind9 squid dansguardian apache2 openvpn shorewall
 group ServicesData1 fs_ServicesConfigs1 fs_ServicesLogs1
 links_ServicesConfigs1
 ms drbd_master_slave_ServicesConfigs1 drbd-ServicesConfigs1 \
meta master-max=1 master-node-max=1 clone-max=2
 clone-node-max=1 globally-unique=false notify=true
 target-role=Master
 ms drbd_master_slave_ServicesLogs1 drbd-ServicesLogs1 \
meta master-max=1 master-node-max=1 clone-max=2
 clone-node-max=1 globally-unique=false notify=true
 target-role=Master
 colocation Services1_on_drbd inf:
 drbd_master_slave_ServicesConfigs1:Master
 drbd_master_slave_ServicesLogs1:Master ServicesData1 IPS-Services1
 Services1 IPSec
 colocation start-shorewall_standby-on-passive-node -inf:
 shorewall-standby shorewall
 order all_drbd inf: shorewall-standby:stop
 drbd_master_slave_ServicesConfigs1:promote
 drbd_master_slave_ServicesLogs1:promote ServicesData1:start
 IPS-Services1:start IPSec:start Services1:start
 property $id=cib-bootstrap-options \
dc-version=1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff \
cluster-infrastructure=openais \
expected-quorum-votes=2 \
stonith-enabled=false \
no-quorum-policy=ignore
 rsc_defaults $id=rsc-options \
resource-stickiness=100
 
 
 
 

Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrew Beekhof
Apart from anything else, your timeout needs to be bigger:

Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 
2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
'st1' returned: -62 (Timer expired)


On 14 Jan 2014, at 7:18 am, Andrew Beekhof and...@beekhof.net wrote:

 
 On 13 Jan 2014, at 8:31 pm, Andrey Groshev gre...@yandex.ru wrote:
 
 
 
 13.01.2014, 02:51, Andrew Beekhof and...@beekhof.net:
 On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru wrote:
 
 10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru:
 10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net:
  On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote:
   10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net:
On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru wrote:
 08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net:
 On 29 Nov 2013, at 7:17 pm, Andrey Groshev gre...@yandex.ru 
 wrote:
  Hi, ALL.
 
  I'm still trying to cope with the fact that after the fence - 
 node hangs in pending.
 Please define pending.  Where did you see this?
 In crm_mon:
 ..
 Node dev-cluster2-node2 (172793105): pending
 ..
 
 The experiment was like this:
 Four nodes in cluster.
  On one of them kill corosync or pacemakerd (signal 4 or 6 or 11).
 Thereafter, the remaining start it constantly reboot, under 
 various pretexts, softly whistling, fly low, not a cluster 
 member! ...
 Then in the log fell out Too many failures 
 All this time in the status in crm_mon is pending.
 Depending on the wind direction changed to UNCLEAN
 Much time has passed and I can not accurately describe the 
 behavior...
 
 Now I am in the following state:
 I tried locate the problem. Came here with this.
 I set big value in property stonith-timeout=600s.
 And got the following behavior:
 1. pkill -4 corosync
 2. from node with DC call my fence agent sshbykey
 3. It sends reboot victim and waits until she comes to life again.
Hmmm what version of pacemaker?
This sounds like a timing issue that we fixed a while back
   Was a version 1.1.11 from December 3.
   Now try full update and retest.
  That should be recent enough.  Can you create a crm_report the next 
 time you reproduce?
 Of course yes. Little delay :)
 
 ..
 cc1: warnings being treated as errors
 upstart.c: In function ‘upstart_job_property’:
 upstart.c:264: error: implicit declaration of function 
 ‘g_variant_lookup_value’
 upstart.c:264: error: nested extern declaration of 
 ‘g_variant_lookup_value’
 upstart.c:264: error: assignment makes pointer from integer without a cast
 gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
 gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
 make[1]: *** [all-recursive] Error 1
 make[1]: Leaving directory `/root/ha/pacemaker/lib'
 make: *** [core] Error 1
 
 I'm trying to solve this a problem.
 Do not get solved quickly...
 
 https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
 g_variant_lookup_value () Since 2.28
 
 # yum list installed glib2
 Loaded plugins: fastestmirror, rhnplugin, security
 This system is receiving updates from RHN Classic or Red Hat Satellite.
 Loading mirror speeds from cached hostfile
 Installed Packages
 glib2.x86_64  
 2.26.1-3.el6   
 installed
 
 # cat /etc/issue
 CentOS release 6.5 (Final)
 Kernel \r on an \m
 
 Can you try this patch?
 Upstart jobs wont work, but the code will compile
 
 diff --git a/lib/services/upstart.c b/lib/services/upstart.c
 index 831e7cf..195c3a4 100644
 --- a/lib/services/upstart.c
 +++ b/lib/services/upstart.c
 @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
 static char *
 upstart_job_property(const char *obj, const gchar * iface, const char *name)
 {
 +char *output = NULL;
 +
 +#if !GLIB_CHECK_VERSION(2,28,0)
 +static bool err = TRUE;
 +
 +if(err) {
 +crm_err("This version of glib is too old to support upstart jobs");
 +err = FALSE;
 +}
 +#else
 GError *error = NULL;
 GDBusProxy *proxy;
 GVariant *asv = NULL;
 GVariant *value = NULL;
 GVariant *_ret = NULL;
 -char *output = NULL;
 
 crm_info("Calling GetAll on %s", obj);
 proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
 @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * 
 iface, const char *name)
 
 g_object_unref(proxy);
 g_variant_unref(_ret);
 +#endif
 return output;
 }
 
 
 Ok :) I patch source. 
 Type make rc - the same error.
 
 Because its not building your local changes
 
 Make new copy via fetch - the same error.
 It seems that if not exist 
 ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz, then download it. 
 Otherwise use

Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 1:19 pm, Andrew Beekhof and...@beekhof.net wrote:

 Apart from anything else, your timeout needs to be bigger:
 
 Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
 commands.c:1321  )   error: log_operation:   Operation 'reboot' [11331] (call 
 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
 'st1' returned: -62 (Timer expired)
 

also:

Jan 13 12:04:54 [17226] dev-cluster2-node1.unix.tensor.rupengine: ( 
utils.c:723   )   error: unpack_operation:  Specifying on_fail=fence and 
stonith-enabled=false makes no sense
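
That second message points at the configuration itself: on-fail=fence is being
requested while fencing is globally disabled.  A hedged sketch of the two ways
to resolve it (crm shell syntax as used elsewhere in this thread):

    # either actually enable STONITH ...
    crm configure property stonith-enabled=true

    # ... or stop asking for fencing on operation failure, e.g. change the
    # affected op from on-fail=fence to on-fail=restart (or block)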


 
 On 14 Jan 2014, at 7:18 am, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 13 Jan 2014, at 8:31 pm, Andrey Groshev gre...@yandex.ru wrote:
 
 
 
 13.01.2014, 02:51, Andrew Beekhof and...@beekhof.net:
 On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru wrote:
 
 10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru:
 10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net:
 On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote:
  10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net:
   On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru wrote:
08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net:
On 29 Nov 2013, at 7:17 pm, Andrey Groshev gre...@yandex.ru 
 wrote:
 Hi, ALL.
 
 I'm still trying to cope with the fact that after the fence - 
 node hangs in pending.
Please define pending.  Where did you see this?
In crm_mon:
..
Node dev-cluster2-node2 (172793105): pending
..
 
The experiment was like this:
Four nodes in cluster.
 On one of them kill corosync or pacemakerd (signal 4 or 6 or 11).
Thereafter, the remaining start it constantly reboot, under 
 various pretexts, softly whistling, fly low, not a cluster 
 member! ...
Then in the log fell out Too many failures 
All this time in the status in crm_mon is pending.
Depending on the wind direction changed to UNCLEAN
Much time has passed and I can not accurately describe the 
 behavior...
 
Now I am in the following state:
I tried locate the problem. Came here with this.
I set big value in property stonith-timeout=600s.
And got the following behavior:
1. pkill -4 corosync
2. from node with DC call my fence agent sshbykey
3. It sends reboot victim and waits until she comes to life again.
   Hmmm what version of pacemaker?
   This sounds like a timing issue that we fixed a while back
  Was a version 1.1.11 from December 3.
  Now try full update and retest.
 That should be recent enough.  Can you create a crm_report the next 
 time you reproduce?
 Of course yes. Little delay :)
 
 ..
 cc1: warnings being treated as errors
 upstart.c: In function ‘upstart_job_property’:
 upstart.c:264: error: implicit declaration of function 
 ‘g_variant_lookup_value’
 upstart.c:264: error: nested extern declaration of 
 ‘g_variant_lookup_value’
 upstart.c:264: error: assignment makes pointer from integer without a 
 cast
 gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
 gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
 make[1]: *** [all-recursive] Error 1
 make[1]: Leaving directory `/root/ha/pacemaker/lib'
 make: *** [core] Error 1
 
 I'm trying to solve this a problem.
 Do not get solved quickly...
 
 https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
 g_variant_lookup_value () Since 2.28
 
 # yum list installed glib2
 Loaded plugins: fastestmirror, rhnplugin, security
 This system is receiving updates from RHN Classic or Red Hat Satellite.
 Loading mirror speeds from cached hostfile
 Installed Packages
 glib2.x86_64  
 2.26.1-3.el6  
  installed
 
 # cat /etc/issue
 CentOS release 6.5 (Final)
 Kernel \r on an \m
 
 Can you try this patch?
 Upstart jobs wont work, but the code will compile
 
 diff --git a/lib/services/upstart.c b/lib/services/upstart.c
 index 831e7cf..195c3a4 100644
 --- a/lib/services/upstart.c
 +++ b/lib/services/upstart.c
 @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
 static char *
 upstart_job_property(const char *obj, const gchar * iface, const char 
 *name)
 {
 +char *output = NULL;
 +
 +#if !GLIB_CHECK_VERSION(2,28,0)
 +static bool err = TRUE;
 +
 +if(err) {
 +crm_err("This version of glib is too old to support upstart jobs");
 +err = FALSE;
 +}
 +#else
GError *error = NULL;
GDBusProxy *proxy;
GVariant *asv = NULL;
GVariant *value = NULL;
GVariant *_ret = NULL;
 -char *output = NULL;
 
 crm_info("Calling GetAll on %s", obj);
proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
 @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * 
 iface, const char *name)
 
g_object_unref(proxy);
g_variant_unref(_ret);
 +#endif
return output;
 }
 
 
 Ok :) I patch source

Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrew Beekhof
Ok, here's what happens:

1. node2 is lost
2. fencing of node2 starts
3. node2 reboots (and cluster starts)
4. node2 returns to the membership
5. node2 is marked as a cluster member
6. DC tries to bring it into the cluster, but needs to cancel the active 
transition first.
   Which is a problem since the node2 fencing operation is part of that
7. node2 is in a transition (pending) state until fencing passes or fails
8a. fencing fails: transition completes and the node joins the cluster

That's the theory, except we automatically try again, which isn't appropriate.
This should be relatively easy to fix.

8b. fencing passes: the node is incorrectly marked as offline

This I have no idea how to fix yet.


On another note, it doesn't look like this agent works at all.
The node has been back online for a long time and the agent is still timing out 
after 10 minutes.
So "Once the script makes sure that the victim has rebooted and is again 
available via ssh - it exits with 0" does not seem to be true.
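
For comparison, a minimal sketch of the bounded wait such an agent needs so that
it actually returns within the stonith timeout (pure illustration - host name,
limits and ssh options are placeholders, not taken from the agent in question):

    # wait at most 300s for the victim to accept ssh again
    victim=dev-cluster2-node2.unix.tensor.ru
    deadline=$((SECONDS + 300))
    while (( SECONDS < deadline )); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$victim" true 2>/dev/null; then
            exit 0      # victim is back: report the reboot as complete
        fi
        sleep 5
    done
    exit 1              # give up so stonith-ng can retry or escalate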

On 14 Jan 2014, at 1:19 pm, Andrew Beekhof and...@beekhof.net wrote:

 Apart from anything else, your timeout needs to be bigger:
 
 Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
 commands.c:1321  )   error: log_operation:   Operation 'reboot' [11331] (call 
 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
 'st1' returned: -62 (Timer expired)
 
 
 On 14 Jan 2014, at 7:18 am, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 13 Jan 2014, at 8:31 pm, Andrey Groshev gre...@yandex.ru wrote:
 
 
 
 13.01.2014, 02:51, Andrew Beekhof and...@beekhof.net:
 On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru wrote:
 
 10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru:
 10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net:
 On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote:
  10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net:
   On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru wrote:
08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net:
On 29 Nov 2013, at 7:17 pm, Andrey Groshev gre...@yandex.ru 
 wrote:
 Hi, ALL.
 
 I'm still trying to cope with the fact that after the fence - 
 node hangs in pending.
Please define pending.  Where did you see this?
In crm_mon:
..
Node dev-cluster2-node2 (172793105): pending
..
 
The experiment was like this:
Four nodes in cluster.
 On one of them kill corosync or pacemakerd (signal 4 or 6 or 11).
Thereafter, the remaining start it constantly reboot, under 
 various pretexts, softly whistling, fly low, not a cluster 
 member! ...
Then in the log fell out Too many failures 
All this time in the status in crm_mon is pending.
Depending on the wind direction changed to UNCLEAN
Much time has passed and I can not accurately describe the 
 behavior...
 
Now I am in the following state:
I tried locate the problem. Came here with this.
I set big value in property stonith-timeout=600s.
And got the following behavior:
1. pkill -4 corosync
2. from node with DC call my fence agent sshbykey
3. It sends reboot victim and waits until she comes to life again.
   Hmmm what version of pacemaker?
   This sounds like a timing issue that we fixed a while back
  Was a version 1.1.11 from December 3.
  Now try full update and retest.
 That should be recent enough.  Can you create a crm_report the next 
 time you reproduce?
 Of course yes. Little delay :)
 
 ..
 cc1: warnings being treated as errors
 upstart.c: In function ‘upstart_job_property’:
 upstart.c:264: error: implicit declaration of function 
 ‘g_variant_lookup_value’
 upstart.c:264: error: nested extern declaration of 
 ‘g_variant_lookup_value’
 upstart.c:264: error: assignment makes pointer from integer without a 
 cast
 gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
 gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
 make[1]: *** [all-recursive] Error 1
 make[1]: Leaving directory `/root/ha/pacemaker/lib'
 make: *** [core] Error 1
 
 I'm trying to solve this a problem.
 Do not get solved quickly...
 
 https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
 g_variant_lookup_value () Since 2.28
 
 # yum list installed glib2
 Loaded plugins: fastestmirror, rhnplugin, security
 This system is receiving updates from RHN Classic or Red Hat Satellite.
 Loading mirror speeds from cached hostfile
 Installed Packages
 glib2.x86_64  
 2.26.1-3.el6  
  installed
 
 # cat /etc/issue
 CentOS release 6.5 (Final)
 Kernel \r on an \m
 
 Can you try this patch?
 Upstart jobs wont work, but the code will compile
 
 diff --git a/lib/services/upstart.c b/lib/services/upstart.c
 index 831e7cf..195c3a4 100644
 --- a/lib/services/upstart.c
 +++ b/lib/services/upstart.c
 @@ -231,12 +231,21 @@ upstart_job_exists(const

Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 3:34 pm, Andrey Groshev gre...@yandex.ru wrote:

 
 
 14.01.2014, 06:25, Andrew Beekhof and...@beekhof.net:
 Apart from anything else, your timeout needs to be bigger:
 
 Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
 commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 
 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
 'st1' returned: -62 (Timer expired)
 
 
 Bigger than that?

See my other email, the agent is broken.

 At :21, node2 had already booted a long time ago and was working (almost).

Exactly, so why didn't the agent return?

 #cat /var/log/cluster/mystonith.log
 .
 Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
 STONITH DEBUG(): getinfo-devdescr
 Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
 STONITH DEBUG(): getinfo-devid
 Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
 STONITH DEBUG(): getinfo-xml
 Mon Jan 13 11:48:46 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
 STONITH 
 DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): 
 getconfignames
 Mon Jan 13 11:48:46 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
 STONITH 
 DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): 
 status
 Mon Jan 13 12:11:37 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
 STONITH 
 DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): 
 getconfignames
 Mon Jan 13 12:11:37 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
 STONITH 
 DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): 
 reset dev-cluster2-node2.unix.tensor.ru
 Mon Jan 13 12:11:37 MSK 2014 Now boot time 1389256739, send reboot
 ...
 
 On 14 Jan 2014, at 7:18 am, Andrew Beekhof and...@beekhof.net wrote:
 
  On 13 Jan 2014, at 8:31 pm, Andrey Groshev gre...@yandex.ru wrote:
  13.01.2014, 02:51, Andrew Beekhof and...@beekhof.net:
  On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru wrote:
  10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru:
  10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net:
   On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote:
10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net:
 On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru 
 wrote:
  08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net:
  On 29 Nov 2013, at 7:17 pm, Andrey Groshev gre...@yandex.ru 
 wrote:
   Hi, ALL.
 
   I'm still trying to cope with the fact that after the fence 
 - node hangs in pending.
  Please define pending.  Where did you see this?
  In crm_mon:
  ..
  Node dev-cluster2-node2 (172793105): pending
  ..
 
  The experiment was like this:
  Four nodes in cluster.
  On one of them kill corosync or pacemakerd (signal 4 or 6 or 
 11).
  Thereafter, the remaining start it constantly reboot, under 
 various pretexts, softly whistling, fly low, not a cluster 
 member! ...
  Then in the log fell out Too many failures 
  All this time in the status in crm_mon is pending.
  Depending on the wind direction changed to UNCLEAN
  Much time has passed and I can not accurately describe the 
 behavior...
 
  Now I am in the following state:
  I tried locate the problem. Came here with this.
  I set big value in property stonith-timeout=600s.
  And got the following behavior:
  1. pkill -4 corosync
  2. from node with DC call my fence agent sshbykey
  3. It sends reboot victim and waits until she comes to life 
 again.
 Hmmm what version of pacemaker?
 This sounds like a timing issue that we fixed a while back
Was a version 1.1.11 from December 3.
Now try full update and retest.
   That should be recent enough.  Can you create a crm_report the next 
 time you reproduce?
  Of course yes. Little delay :)
 
  ..
  cc1: warnings being treated as errors
  upstart.c: In function ‘upstart_job_property’:
  upstart.c:264: error: implicit declaration of function 
 ‘g_variant_lookup_value’
  upstart.c:264: error: nested extern declaration of 
 ‘g_variant_lookup_value’
  upstart.c:264: error: assignment makes pointer from integer without a 
 cast
  gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
  gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
  make[1]: *** [all-recursive] Error 1
  make[1]: Leaving directory `/root/ha/pacemaker/lib'
  make: *** [core] Error 1
 
  I'm trying to solve this a problem.
  Do not get solved quickly...
 
  
 https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
  g_variant_lookup_value () Since 2.28
 
  # yum list installed glib2
  Loaded plugins: fastestmirror, rhnplugin, security
  This system is receiving updates from RHN Classic or Red Hat Satellite.
  Loading mirror speeds from cached hostfile
  Installed Packages
  glib2.x86_64

Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 3:41 pm, Brian J. Murrell (brian) br...@interlinx.bc.ca 
wrote:

 On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote:
 
 The local cib hasn't caught up yet by the looks of it.
 
 Should crm_resource actually be [mis-]reporting as if it were
 knowledgeable when it's not though?  IOW is this expected behaviour or
 should it be considered a bug?  Should I open a ticket?

It doesn't know that it doesn't know.
Does it show anything as running?  Any nodes as online?

I'd not expect that it stays in that situation for more than a second or two...

 
 You could compare 'cibadmin -Ql' with 'cibadmin -Q'
 
 Is there no other way to force crm_resource to be truthful/accurate or
 silent if it cannot be truthful/accurate?  Having to run this kind of
 pre-check before every crm_resource --locate seems like it's going to
 drive overhead up quite a bit.

True.

 
 Maybe I am using the wrong tool for the job.  Is there a better tool
 than crm_resource to ascertain, with full truthfullness (or silence if
 truthfullness is not possible), where resources are running?

We could add an option to force crm_resource to use the master instance instead 
of the local one I guess.




Re: [Pacemaker] A resource starts with a standby node.(Latest attrd does not serve as the crmd-transition-delay parameter)

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 3:52 pm, renayama19661...@ybb.ne.jp wrote:

 Hi All,
 
 I previously filed the bugzilla below about a problem caused by differences in the 
 timing of attribute updates by attrd.
 * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528
 
 We can work around this problem for now by using the crmd-transition-delay parameter.
 
 I recently checked whether the renewed attrd lets me avoid this problem.
 * In the latest attrd, one instance becomes a leader and appears to perform the 
 attribute updates.
 
 However, the latest attrd does not seem to be a substitute for crmd-transition-delay.
 * I will contribute a detailed log later.
 
 We are unhappy about having to keep using crmd-transition-delay.
 Is there a plan for attrd to handle this problem properly in the future?

Are you using the new attrd code or the legacy stuff?

If you're not using corosync 2.x, or you see:

crm_notice("Starting mainloop...");

then it's the old code.  The new code could also be used with CMAN but isn't 
configured to be built in that situation.

Only the new code makes (or at least should do) crmd-transition-delay redundant.
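
A quick, hedged way to check what a given node ended up with (the log path is an
example - use wherever corosync/pacemaker log on your systems):

    # the legacy attrd logs this line at startup; no match suggests the new code
    grep "Starting mainloop" /var/log/cluster/corosync.log

    # the corosync major version is the other half of the check
    corosync -v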




Re: [Pacemaker] A resource starts with a standby node.(Latest attrd does not serve as the crmd-transition-delay parameter)

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 4:13 pm, renayama19661...@ybb.ne.jp wrote:

 Hi Andrew,
 
 Thank you for comments.
 
 Are you using the new attrd code or the legacy stuff?
 
 I use new attrd.

And the values are not being sent to the cib at the same time? 

 
 
 If you're not using corosync 2.x or see:
 
 crm_notice(Starting mainloop...);
 
 then its the old code.  The new code could also be used with CMAN but isn't 
 configured to build for in that situation.
 
 Only the new code makes (or at least should do) crmd-transition-delay 
 redundant.
 
 It did not seem to me that the new attrd lets us dispense with 
 crmd-transition-delay.
 I will report the details again.
 # Probably it will be a Bugzilla entry. . .

Sounds good

 
 Best Regards,
 Hideo Yamauchi.
 
 --- On Tue, 2014/1/14, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 14 Jan 2014, at 3:52 pm, renayama19661...@ybb.ne.jp wrote:
 
 Hi All,
 
 I contributed next bugzilla by a problem to occur for the difference of the 
 timing of the attribute update by attrd before.
 * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528
 
 We can evade this problem now by using crmd-transition-delay parameter.
 
 I confirmed whether I could evade this problem by renewed attrd recently.
 * In latest attrd, one became a leader and seemed to come to update an 
 attribute.
 
 However, latest attrd does not seem to substitute for crmd-transition-delay.
 * I contribute detailed log later.
 
 We are dissatisfied with continuing using crmd-transition-delay.
 Is there the plan when attrd handles this problem well in the future?
 
 Are you using the new attrd code or the legacy stuff?
 
 If you're not using corosync 2.x or see:
 
 crm_notice(Starting mainloop...);
 
 then its the old code.  The new code could also be used with CMAN but isn't 
 configured to build for in that situation.
 
 Only the new code makes (or at least should do) crmd-transition-delay 
 redundant.
 





Re: [Pacemaker] A resource starts with a standby node.(Latest attrd does not serve as the crmd-transition-delay parameter)

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 4:33 pm, renayama19661...@ybb.ne.jp wrote:

 Hi Andrew,
 
 Are you using the new attrd code or the legacy stuff?
 
 I use new attrd.
 
 And the values are not being sent to the cib at the same time? 
 
 As far as I could see. . .
 When a node's attrd was late transmitting its attribute, the 
 attrd leader seemed to send the attribute to the cib without waiting for it.

And you have a delay configured?  And this value was set prior to that delay 
expiring?
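
(The "delay" here is attrd's dampening window.  As a hedged illustration, an
attribute written with dampening would look something like this - attribute
name, value and interval are placeholders:

    attrd_updater --name my-attr --update 1 --delay 5s

attrd then waits out that window for updates from the other nodes before
pushing the values to the cib in one go.)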

 
 Only the new code makes (or at least should do) crmd-transition-delay 
 redundant.
 
 It did not seem to work so that new attrd dispensed with 
 crmd-transition-delay to me.
 I report the details again.
 # Probably it will be Bugzilla. . .
 
 Sounds good
 
 All right!
 
 Many Thanks!
 Hideo Yamauch.
 
 --- On Tue, 2014/1/14, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 14 Jan 2014, at 4:13 pm, renayama19661...@ybb.ne.jp wrote:
 
 Hi Andrew,
 
 Thank you for comments.
 
 Are you using the new attrd code or the legacy stuff?
 
 I use new attrd.
 
 And the values are not being sent to the cib at the same time? 
 
 
 
 If you're not using corosync 2.x or see:
 
  crm_notice(Starting mainloop...);
 
 then its the old code.  The new code could also be used with CMAN but 
 isn't configured to build for in that situation.
 
 Only the new code makes (or at least should do) crmd-transition-delay 
 redundant.
 
 It did not seem to work so that new attrd dispensed with 
 crmd-transition-delay to me.
 I report the details again.
 # Probably it will be Bugzilla. . .
 
 Sounds good
 
 
 Best Regards,
 Hideo Yamauchi.
 
 --- On Tue, 2014/1/14, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 14 Jan 2014, at 3:52 pm, renayama19661...@ybb.ne.jp wrote:
 
 Hi All,
 
 I contributed next bugzilla by a problem to occur for the difference of 
 the timing of the attribute update by attrd before.
 * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528
 
 We can evade this problem now by using crmd-transition-delay parameter.
 
 I confirmed whether I could evade this problem by renewed attrd recently.
 * In latest attrd, one became a leader and seemed to come to update an 
 attribute.
 
 However, latest attrd does not seem to substitute for 
 crmd-transition-delay.
 * I contribute detailed log later.
 
 We are dissatisfied with continuing using crmd-transition-delay.
 Is there the plan when attrd handles this problem well in the future?
 
 Are you using the new attrd code or the legacy stuff?
 
 If you're not using corosync 2.x or see:
 
  crm_notice(Starting mainloop...);
 
 then its the old code.  The new code could also be used with CMAN but 
 isn't configured to build for in that situation.
 
 Only the new code makes (or at least should do) crmd-transition-delay 
 redundant.
 
 
 





Re: [Pacemaker] [PATCH] Downgrade probe log message for promoted ms resources

2014-01-12 Thread Andrew Beekhof
Fair enough.  Pull request?

On 12 Jan 2014, at 8:29 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:

 Hi,
 
 This is the only message I see in the logs in an otherwise static cluster
 (with rechecks enabled); it is probably a good idea to downgrade it to info.
 
 diff --git a/lib/pengine/unpack.c b/lib/pengine/unpack.c
 index 97e114f..6dbcf19 100644
 --- a/lib/pengine/unpack.c
 +++ b/lib/pengine/unpack.c
 @@ -2515,7 +2515,7 @@ determine_op_status(
 case PCMK_OCF_RUNNING_MASTER:
 if (is_probe) {
 result = PCMK_LRM_OP_DONE;
 -crm_notice("Operation %s found resource %s active in master mode on %s",
 +pe_rsc_info(rsc, "Operation %s found resource %s active in master mode on %s",
task, rsc->id, node->details->uname);
 
 } else if (target_rc == rc) {
 
 
 


Re: [Pacemaker] hangs pending

2014-01-12 Thread Andrew Beekhof

On 10 Jan 2014, at 9:55 pm, Andrey Groshev gre...@yandex.ru wrote:

 
 
 10.01.2014, 14:31, Andrey Groshev gre...@yandex.ru:
 10.01.2014, 14:01, Andrew Beekhof and...@beekhof.net:
 
  On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote:
   10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net:
On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru wrote:
 08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net:
 On 29 Nov 2013, at 7:17 pm, Andrey Groshev gre...@yandex.ru wrote:
  Hi, ALL.
 
  I'm still trying to cope with the fact that after the fence - 
 node hangs in pending.
 Please define pending.  Where did you see this?
 In crm_mon:
 ..
 Node dev-cluster2-node2 (172793105): pending
 ..
 
 The experiment was like this:
 Four nodes in cluster.
  On one of them kill corosync or pacemakerd (signal 4 or 6 or 11).
 Thereafter, the remaining start it constantly reboot, under various 
 pretexts, softly whistling, fly low, not a cluster member! ...
 Then in the log fell out Too many failures 
 All this time in the status in crm_mon is pending.
 Depending on the wind direction changed to UNCLEAN
 Much time has passed and I can not accurately describe the 
 behavior...
 
 Now I am in the following state:
 I tried locate the problem. Came here with this.
 I set big value in property stonith-timeout=600s.
 And got the following behavior:
 1. pkill -4 corosync
 2. from node with DC call my fence agent sshbykey
 3. It sends reboot victim and waits until she comes to life again.
Hmmm what version of pacemaker?
This sounds like a timing issue that we fixed a while back
   Was a version 1.1.11 from December 3.
   Now try full update and retest.
  That should be recent enough.  Can you create a crm_report the next time 
 you reproduce?
 
 Of course yes. Little delay :)
 
 ..
 cc1: warnings being treated as errors
 upstart.c: In function ‘upstart_job_property’:
 upstart.c:264: error: implicit declaration of function 
 ‘g_variant_lookup_value’
 upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
 upstart.c:264: error: assignment makes pointer from integer without a cast
 gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
 gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
 make[1]: *** [all-recursive] Error 1
 make[1]: Leaving directory `/root/ha/pacemaker/lib'
 make: *** [core] Error 1
 
 I'm trying to solve this a problem.
 
 
 Do not get solved quickly...
 
 https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
 g_variant_lookup_value () Since 2.28
 
 # yum list installed glib2
 Loaded plugins: fastestmirror, rhnplugin, security
 This system is receiving updates from RHN Classic or Red Hat Satellite.
 Loading mirror speeds from cached hostfile
 Installed Packages
 glib2.x86_64  
 2.26.1-3.el6   
 installed
 
 # cat /etc/issue
 CentOS release 6.5 (Final)
 Kernel \r on an \m

Can you try this patch?
Upstart jobs wont work, but the code will compile

diff --git a/lib/services/upstart.c b/lib/services/upstart.c
index 831e7cf..195c3a4 100644
--- a/lib/services/upstart.c
+++ b/lib/services/upstart.c
@@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
 static char *
 upstart_job_property(const char *obj, const gchar * iface, const char *name)
 {
+char *output = NULL;
+
+#if !GLIB_CHECK_VERSION(2,28,0)
+static bool err = TRUE;
+
+if(err) {
+crm_err("This version of glib is too old to support upstart jobs");
+err = FALSE;
+}
+#else
 GError *error = NULL;
 GDBusProxy *proxy;
 GVariant *asv = NULL;
 GVariant *value = NULL;
 GVariant *_ret = NULL;
-char *output = NULL;
 
 crm_info("Calling GetAll on %s", obj);
 proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
@@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, 
const char *name)
 
 g_object_unref(proxy);
 g_variant_unref(_ret);
+#endif
 return output;
 }
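
To try it, the diff above can be applied from the top of the source tree in the
usual way before rebuilding (the patch file name is a placeholder for wherever
the diff is saved):

    cd /root/ha/pacemaker
    patch -p1 < glib-upstart-compat.patch    # or: git apply glib-upstart-compat.patch
    make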


 
 
   Once the script makes sure that the victim will rebooted and again 
 available via ssh - it exit with 0.
   All command is logged both the victim and the killer - all right.
 4. A little later, the status of the (victim) nodes in crm_mon 
 changes to online.
  5. BUT... not one resource starts! Despite the fact that 
 crm_simulate -sL shows the correct resource to start:
   * Start   pingCheck:3  (dev-cluster2-node2)
 6. In this state, we spend the next 600 seconds.
   After completing this timeout causes another node (not DC) decides 
 to kill again our victim.
   All command again is logged both the victim and the killer - All 
 documented :)
 7. NOW all resource started in right sequence.
 
 I almost happy, but I do not like: two reboots and 10 minutes

Re: [Pacemaker] again return code, now in crm_attribute

2014-01-12 Thread Andrew Beekhof

On 10 Jan 2014, at 6:18 pm, Andrey Groshev gre...@yandex.ru wrote:

 
 
 10.01.2014, 10:15, Andrew Beekhof and...@beekhof.net:
 On 10 Jan 2014, at 4:38 pm, Andrey Groshev gre...@yandex.ru wrote:
 
  10.01.2014, 09:06, Andrew Beekhof and...@beekhof.net:
  On 10 Jan 2014, at 3:51 pm, Andrey Groshev gre...@yandex.ru wrote:
   10.01.2014, 03:28, Andrew Beekhof and...@beekhof.net:
   On 9 Jan 2014, at 4:44 pm, Andrey Groshev gre...@yandex.ru wrote:
09.01.2014, 02:39, Andrew Beekhof and...@beekhof.net:
 On 18 Dec 2013, at 11:55 pm, Andrey Groshev gre...@yandex.ru 
 wrote:
  Hi, Andrew and ALL.
 
  I'm sorry, but I again found an error. :)
  Crux of the problem:
 
  # crm_attribute --type crm_config --attr-name stonith-enabled 
 --query; echo $?
  scope=crm_config  name=stonith-enabled value=true
  0
 
  # crm_attribute --type crm_config --attr-name stonith-enabled 
 --update firstval ; echo $?
  0
 
  # crm_attribute --type crm_config --attr-name stonith-enabled 
 --query; echo $?
  scope=crm_config  name=stonith-enabled value=firstval
  0
 
  # crm_attribute --type crm_config --attr-name stonith-enabled  
 --update secondval --lifetime=reboot ; echo $?
  0
 
  # crm_attribute --type crm_config --attr-name stonith-enabled 
 --query; echo $?
  scope=crm_config  name=stonith-enabled value=firstval
  0
 
  # crm_attribute --type crm_config --attr-name stonith-enabled  
 --update thirdval --lifetime=forever ; echo $?
  0
 
  # crm_attribute --type crm_config --attr-name stonith-enabled 
 --query; echo $?
  scope=crm_config  name=stonith-enabled value=firstval
  0
 
  I.e. if the lifetime of an attribute is specified, then the attribute is 
 not updated.
 
  If it is impossible to set the lifetime of the attribute when it is 
 being set, an error must be returned.
 Agreed. I'll reproduce and get back to you.
From what I was able to see reviewing the code, the problem comes when both the 
 --type and --lifetime options are used.
One of the case variants has no break;
Unfortunately, I did not have time to dive into the logic.
   Actually, the logic is correct.  The command:
 
   # crm_attribute --type crm_config --attr-name stonith-enabled  
 --update secondval --lifetime=reboot ; echo $?
 
   is invalid.  You only get to specify --type OR --lifetime, not both.
   By specifying --lifetime, you're creating a node attribute, not a 
 cluster proprerty.
   With this I do not argue. I think that in that case the exit code should be 
 NON-ZERO, i.e. it's an error!
  No, it's setting a value, just not where you thought (or where you're 
 looking for it in the next command).
 
  It's the same as writing:
 
    crm_attribute --type crm_config --type status --attr-name stonith-enabled --update secondval; echo $?
 
  Only the last value for --type wins
  This is where the confusion comes from. Here is an example from an old 
 cluster:
  # crm_attribute --type crm_config --attr-name test1 --update val1 --lifetime=reboot ; echo $?
  0
  # cibadmin -Q | grep test1
   <nvpair id="status-test-ins-db2-test1" name="test1" value="val1"/>
  So --lifetime wins?
 
 Yes.  Because it was specified last.
 
  Isn't it easier to produce an error when trying to use incompatible 
 options?
 
 They're not incompatible. They're aliases for each other in a different 
 context.
 
 OK. I understood you. Let's say you're right. :)
 In the end, if you change the order of words in a human language, the 
 meaning of a sentence can change. 
 But suppose I have a strange desire; it can still be expressed in writing.
 
 I say: crm_attribute --attr-name attr1 --update val1 --lifetime=reboot --type crm_config
 
 I mean that ...
 I want to set some attribute on the cluster, and this attribute should 
 disappear if the cluster is restarted. 

This functionality does/can not exist for anything other than node attributes.
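
For reference, the two intents map onto two different commands; a hedged sketch
(attribute names and values are examples only):

    # a cluster property: lives in crm_config and persists across restarts
    crm_attribute --type crm_config --attr-name stonith-enabled --update true

    # a transient node attribute: lives in the status section and is discarded
    # when that node's cluster stack restarts
    crm_attribute --node $(uname -n) --attr-name attr1 --update val1 --lifetime reboot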

 
 If I change the order of the arguments, the meaning of the sentence still does not 
 change.
 But what do you say to that? There is a 100% chance you will say that the sentence is 
 not correct. Why? Because the lifetime is not really a lifetime and cannot 
 be used in the context of cluster properties? I.e. you gave me 
 back 1 and crm_attribute returned 0 :)
 
 
  Then there is this uncertainty about what was meant..., was meant..., was 
 meant.
  And if possible the value should then be set.
  In general, something is wrong.
  Unfortunately I have not looked any deeper yet, because I am struggling with 
 STONITH :)
 
  P.S. Andrew! Belated congratulations on the new addition to 
 the family.
  What a fine time - now you will have toys which you did not have in your 
 childhood.
 

Re: [Pacemaker] Manual fence confirmation by stonith_admin doesn't work again.

2014-01-12 Thread Andrew Beekhof

On 10 Jan 2014, at 3:54 pm, Nikita Staroverov nsfo...@gmail.com wrote:

 
 There is no-one to tell yet.
 We have to wait for cman to decide something needs fencing before 
 pacemaker can perform the notification.
 
 if I get you right i need own fencing agent that doing manual confirmed 
 fence action with cman+pacemaker configurations, do i?
 No.  Manual fencing confirmations can only work after CMAN asks for the node 
 to be fenced.  You can't tell it in advance.
 I'll try to explain my problem again. :)
 In my test setup a cluster member was manually powered off by me, and the fence 
 device was powered off too.
 crm_mon shows the host as offline/unclean.
 cman sees that the host needs fencing and tries to do that via fence_pcmk.
 The fence agent in pacemaker can't do that because the fence device is unreachable.
 I run stonith_admin -C for the node. After that crm_mon shows the host as offline. 
 cman keeps trying to fence indefinitely.
 I think pacemaker doesn't notify cman that fencing was confirmed by the sysadmin.

Code seems to be trying to...
Do you see any logs from tengine_stonith_notify after you run stonith_admin?

crm_report?
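
A hedged way to check (the log location is an example; use wherever crmd logs on
your nodes), plus a report command covering the test window:

    grep tengine_stonith_notify /var/log/cluster/corosync.log

    crm_report --from "2014-01-10 00:00" --to "2014-01-10 23:59" /tmp/manual-fence-confirm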




Re: [Pacemaker] hangs pending

2014-01-10 Thread Andrew Beekhof

On 10 Jan 2014, at 5:03 pm, Andrey Groshev gre...@yandex.ru wrote:

 10.01.2014, 05:29, Andrew Beekhof and...@beekhof.net:
 
  On 9 Jan 2014, at 11:11 pm, Andrey Groshev gre...@yandex.ru wrote:
   08.01.2014, 06:22, Andrew Beekhof and...@beekhof.net:
   On 29 Nov 2013, at 7:17 pm, Andrey Groshev gre...@yandex.ru wrote:
Hi, ALL.
 
I'm still trying to cope with the fact that after the fence - node 
 hangs in pending.
   Please define pending.  Where did you see this?
   In crm_mon:
   ..
   Node dev-cluster2-node2 (172793105): pending
   ..
 
   The experiment was like this:
   Four nodes in cluster.
    On one of them kill corosync or pacemakerd (signal 4 or 6 or 11).
   Thereafter, the remaining start it constantly reboot, under various 
 pretexts, softly whistling, fly low, not a cluster member! ...
   Then in the log fell out Too many failures 
   All this time in the status in crm_mon is pending.
   Depending on the wind direction changed to UNCLEAN
   Much time has passed and I can not accurately describe the behavior...
 
   Now I am in the following state:
   I tried locate the problem. Came here with this.
   I set big value in property stonith-timeout=600s.
   And got the following behavior:
   1. pkill -4 corosync
   2. from node with DC call my fence agent sshbykey
   3. It sends reboot victim and waits until she comes to life again.
  Hmmm what version of pacemaker?
  This sounds like a timing issue that we fixed a while back
 
 Was a version 1.1.11 from December 3.
 Now try full update and retest.

That should be recent enough.  Can you create a crm_report the next time you 
reproduce?

 
 Once the script makes sure that the victim will rebooted and again 
 available via ssh - it exit with 0.
 All command is logged both the victim and the killer - all right.
   4. A little later, the status of the (victim) nodes in crm_mon changes to 
 online.
    5. BUT... not one resource starts! Despite the fact that 
 crm_simulate -sL shows the correct resource to start:
 * Start   pingCheck:3  (dev-cluster2-node2)
   6. In this state, we spend the next 600 seconds.
 After completing this timeout causes another node (not DC) decides to 
 kill again our victim.
 All command again is logged both the victim and the killer - All 
 documented :)
   7. NOW all resource started in right sequence.
 
    I am almost happy, but I do not like it: two reboots and 10 minutes of waiting 
 ;)
    And if something happens on another node, this behavior is 
 superimposed on the old one and no resources start until the last node 
 has rebooted twice.
 
    I tried to understand this behavior.
    As I understand it:
    1. Ultimately, ./lib/fencing/st_client.c calls 
 internal_stonith_action_execute().
    2. It forks and sets up a pipe to the child.
    3. It asynchronously calls mainloop_child_add with a callback to  
 stonith_action_async_done.
    4. It adds timeouts via g_timeout_add for the TERM and KILL signals.
 
    If all goes well it should call stonith_action_async_done and remove the timeouts.
    For some reason this does not happen. I sit and think 
At this time, there are constant re-election.
Also, I noticed the difference when you start pacemaker.
At normal startup:
* corosync
* pacemakerd
* attrd
* pengine
* lrmd
* crmd
* cib
 
When hangs start:
* corosync
* pacemakerd
* attrd
* pengine
* crmd
* lrmd
* cib.
   Are you referring to the order of the daemons here?
   The cib should not be at the bottom in either case.
Who knows who runs lrmd?
   Pacemakerd.

Re: [Pacemaker] starting resources with failed stonith resource

2014-01-09 Thread Andrew Beekhof

On 9 Jan 2014, at 8:29 pm, Frank Van Damme frank.vanda...@gmail.com wrote:

 2014/1/8 Andrew Beekhof and...@beekhof.net:
 I don't understand it: if this means that the stonith devices have
 failed a million times,
 
 We also set it to 100 when the start action fails.
 
 why is it trying to start the mysql resource?
 
 It depends if any nodes need fencing.
 
  It's against Pacemaker policy to start resources on a cluster without
  working stonith devices, isn't it?
 
 Not if all nodes are present and healthy.
 
 But if they fail or disappear, they can't be killed and might have
 resources still running on them?

Yes, and the cluster won't be able to do anything about it except wait.

 
 
 -- 
 Frank Van Damme
 Make everything as simple as possible, but not simpler. - Albert Einstein
 


Re: [Pacemaker] Breaking dependency loop stonith

2014-01-09 Thread Andrew Beekhof

On 9 Jan 2014, at 5:05 pm, Andrey Groshev gre...@yandex.ru wrote:

 
 
 08.01.2014, 06:15, Andrew Beekhof and...@beekhof.net:
 On 27 Nov 2013, at 12:26 am, Andrey Groshev gre...@yandex.ru wrote:
 
  Hi, ALL.
 
  I want to clarify two more questions.
  After stonith reboot - this node hangs with status pending.
  The logs found string .
 
 info: rsc_merge_weights:pgsql:1: Breaking dependency loop at 
 msPostgresql
 info: rsc_merge_weights:pgsql:2: Breaking dependency loop at 
 msPostgresql
 
  This means that breaking search the depends, because they are no more.
  Or interrupted by an infinite loop for search the dependency?
 
 The second one, but it has nothing to do with a node being in the pending 
 state.
 Where did you see this?
 
  OK, I've already understood the problem.
  I have location constraints to promote|demote the resource on the right node,
  and the same logic again through colocation/order constraints.
  As I thought, they do the same thing

No, collocation and ordering are orthogonal concepts and do not at all do the 
same thing.
See the docs.
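As a sketch (reusing the resource names from your crm_verify output below), a colocation says "place this with that" while an order says "do this before that"; you typically need both:

  # crm configure colocation vip-with-pgsql-master inf: VirtualIP msPostgresql:Master
  # crm configure order promote-pgsql-before-vip inf: msPostgresql:promote VirtualIP:start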

 and collisions should not happen.
 At least on the old cluster it works :)
 Now I have removed all unnecessary.
 
 
 
  And two.
  Do I need to clone the stonith resource now (In PCMK 1.1.11)?
 
 No.
 
  On the one hand, I see this resource on all nodes through command.
  # cibadmin -Q|grep stonith
  <nvpair name="stonith-enabled" value="true" 
  id="cib-bootstrap-options-stonith-enabled"/>
    <primitive id="st1" class="stonith" type="external/sshbykey">
    <lrm_resource id="st1" type="external/sshbykey" class="stonith">
    <lrm_resource id="st1" type="external/sshbykey" class="stonith">
    <lrm_resource id="st1" type="external/sshbykey" class="stonith">
  (without pending node)
 
 Like all resources, we check all nodes at startup to see if it is already 
 active.
 
  On the other hand, another command I see only one instance on a particular 
 node.
  # crm_verify -L
 info: main: =#=#=#=#= Getting XML =#=#=#=#=
 info: main: Reading XML from: live cluster
 info: validate_with_relaxng:Creating RNG parser context
 info: determine_online_status_fencing:  Node dev-cluster2-node4 is 
 active
 info: determine_online_status:  Node dev-cluster2-node4 is online
 info: determine_online_status_fencing:  - Node dev-cluster2-node1 
 is not ready to run resources
 info: determine_online_status_fencing:  Node dev-cluster2-node2 is 
 active
 info: determine_online_status:  Node dev-cluster2-node2 is online
 info: determine_online_status_fencing:  Node dev-cluster2-node3 is 
 active
 info: determine_online_status:  Node dev-cluster2-node3 is online
 info: determine_op_status:  Operation monitor found resource 
 pingCheck:0 active on dev-cluster2-node4
 info: native_print: VirtualIP   (ocf::heartbeat:IPaddr2):   
 Started dev-cluster2-node4
 info: clone_print:   Master/Slave Set: msPostgresql [pgsql]
 info: short_print:   Masters: [ dev-cluster2-node4 ]
 info: short_print:   Slaves: [ dev-cluster2-node2 
 dev-cluster2-node3 ]
 info: short_print:   Stopped: [ dev-cluster2-node1 ]
 info: clone_print:   Clone Set: clnPingCheck [pingCheck]
 info: short_print:   Started: [ dev-cluster2-node2 
 dev-cluster2-node3 dev-cluster2-node4 ]
 info: short_print:   Stopped: [ dev-cluster2-node1 ]
 info: native_print: st1 (stonith:external/sshbykey):
 Started dev-cluster2-node4
 info: native_color: Resource pingCheck:3 cannot run anywhere
 info: native_color: Resource pgsql:3 cannot run anywhere
 info: rsc_merge_weights:pgsql:1: Breaking dependency loop at 
 msPostgresql
 info: rsc_merge_weights:pgsql:2: Breaking dependency loop at 
 msPostgresql
 info: master_color: Promoting pgsql:0 (Master 
 dev-cluster2-node4)
 info: master_color: msPostgresql: Promoted 1 instances of a 
 possible 1 to master
 info: LogActions:   Leave   VirtualIP   (Started dev-cluster2-node4)
 info: LogActions:   Leave   pgsql:0 (Master dev-cluster2-node4)
 info: LogActions:   Leave   pgsql:1 (Slave dev-cluster2-node2)
 info: LogActions:   Leave   pgsql:2 (Slave dev-cluster2-node3)
 info: LogActions:   Leave   pgsql:3 (Stopped)
 info: LogActions:   Leave   pingCheck:0 (Started dev-cluster2-node4)
 info: LogActions:   Leave   pingCheck:1 (Started dev-cluster2-node2)
 info: LogActions:   Leave   pingCheck:2 (Started dev-cluster2-node3)
 info: LogActions:   Leave   pingCheck:3 (Stopped)
 info: LogActions:   Leave   st1 (Started dev-cluster2-node4)
 
  However, if I do a clone - it turns out the same garbage.
 

Re: [Pacemaker] again return code, now in crm_attribute

2014-01-09 Thread Andrew Beekhof

On 9 Jan 2014, at 4:44 pm, Andrey Groshev gre...@yandex.ru wrote:

 09.01.2014, 02:39, Andrew Beekhof and...@beekhof.net:
 
  On 18 Dec 2013, at 11:55 pm, Andrey Groshev gre...@yandex.ru wrote:
   Hi, Andrew and ALL.
 
   I'm sorry, but I again found an error. :)
   Crux of the problem:
 
   # crm_attribute --type crm_config --attr-name stonith-enabled --query; 
 echo $?
   scope=crm_config  name=stonith-enabled value=true
   0
 
   # crm_attribute --type crm_config --attr-name stonith-enabled --update 
 firstval ; echo $?
   0
 
   # crm_attribute --type crm_config --attr-name stonith-enabled --query; 
 echo $?
   scope=crm_config  name=stonith-enabled value=firstval
   0
 
   # crm_attribute --type crm_config --attr-name stonith-enabled  --update 
 secondval --lifetime=reboot ; echo $?
   0
 
   # crm_attribute --type crm_config --attr-name stonith-enabled --query; 
 echo $?
   scope=crm_config  name=stonith-enabled value=firstval
   0
 
   # crm_attribute --type crm_config --attr-name stonith-enabled  --update 
 thirdval --lifetime=forever ; echo $?
   0
 
   # crm_attribute --type crm_config --attr-name stonith-enabled --query; 
 echo $?
   scope=crm_config  name=stonith-enabled value=firstval
   0
 
   Ie if specify the lifetime of an attribute, then a attribure is not 
 updated.
 
   If impossible setup the lifetime of the attribute when it is installing, 
 it must be return an error.
  Agreed. I'll reproduce and get back to you.
 
  Well, I was able to review the code; the problem comes when both the --type 
  and --lifetime options are used.
  One case in the switch has no break;
  Unfortunately, I did not have time to dive into the logic.

Actually, the logic is correct.  The command:

# crm_attribute --type crm_config --attr-name stonith-enabled  --update 
secondval --lifetime=reboot ; echo $?

is invalid.  You only get to specify --type OR --lifetime, not both.
By specifying --lifetime, you're creating a node attribute, not a cluster 
property.
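For instance, these two commands end up in completely different places (the node and attribute names below are only examples):

  # a cluster property, stored in crm_config:
  crm_attribute --type crm_config --name stonith-enabled --update true

  # a transient node attribute, stored in the status section and cleared on reboot:
  crm_attribute --node dev-cluster2-node1 --name my-test-attr --update somevalue --lifetime reboot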

 
   And if possible then the value should be established.
   In general, something is wrong.
   Denser unfortunately not yet looked, because I struggle with STONITH :)
 
   P.S. Andrew! Late to congratulate you on your new addition to the family.
   This fine time - now you will have toys which was not in your childhood.
 


Re: [Pacemaker] again return code, now in crm_attribute

2014-01-09 Thread Andrew Beekhof

On 10 Jan 2014, at 3:51 pm, Andrey Groshev gre...@yandex.ru wrote:

 
 
 10.01.2014, 03:28, Andrew Beekhof and...@beekhof.net:
 On 9 Jan 2014, at 4:44 pm, Andrey Groshev gre...@yandex.ru wrote:
 
  09.01.2014, 02:39, Andrew Beekhof and...@beekhof.net:
   On 18 Dec 2013, at 11:55 pm, Andrey Groshev gre...@yandex.ru wrote:
Hi, Andrew and ALL.
 
I'm sorry, but I again found an error. :)
Crux of the problem:
 
# crm_attribute --type crm_config --attr-name stonith-enabled --query; 
 echo $?
scope=crm_config  name=stonith-enabled value=true
0
 
# crm_attribute --type crm_config --attr-name stonith-enabled --update 
 firstval ; echo $?
0
 
# crm_attribute --type crm_config --attr-name stonith-enabled --query; 
 echo $?
scope=crm_config  name=stonith-enabled value=firstval
0
 
# crm_attribute --type crm_config --attr-name stonith-enabled  
 --update secondval --lifetime=reboot ; echo $?
0
 
# crm_attribute --type crm_config --attr-name stonith-enabled --query; 
 echo $?
scope=crm_config  name=stonith-enabled value=firstval
0
 
# crm_attribute --type crm_config --attr-name stonith-enabled  
 --update thirdval --lifetime=forever ; echo $?
0
 
# crm_attribute --type crm_config --attr-name stonith-enabled --query; 
 echo $?
scope=crm_config  name=stonith-enabled value=firstval
0
 
Ie if specify the lifetime of an attribute, then a attribure is not 
 updated.
 
If impossible setup the lifetime of the attribute when it is 
 installing, it must be return an error.
   Agreed. I'll reproduce and get back to you.
  How, I was able to review code, problem comes when used both options 
 --type and options --lifetime.
  One variant in case without break;
  Unfortunately, I did not have time to dive into the logic.
 
 Actually, the logic is correct.  The command:
 
 # crm_attribute --type crm_config --attr-name stonith-enabled  --update 
 secondval --lifetime=reboot ; echo $?
 
 is invalid.  You only get to specify --type OR --lifetime, not both.
  By specifying --lifetime, you're creating a node attribute, not a cluster 
  property.
 
  With this I do not argue. But I think the exit code should be NON-ZERO, 
  i.e. it's an error!

No, it's setting a value, just not where you thought (or where you're looking 
for it in the next command).

It's the same as writing:

  crm_attribute --type crm_config --type status --attr-name stonith-enabled  
--update secondval; echo $?

Only the last value for --type wins.

 
And if possible then the value should be established.
In general, something is wrong.
Denser unfortunately not yet looked, because I struggle with STONITH 
 :)
 
P.S. Andrew! Late to congratulate you on your new addition to the 
 family.
This fine time - now you will have toys which was not in your 
 childhood.
 


Re: [Pacemaker] again return code, now in crm_attribute

2014-01-09 Thread Andrew Beekhof

On 10 Jan 2014, at 4:38 pm, Andrey Groshev gre...@yandex.ru wrote:

 
 
 10.01.2014, 09:06, Andrew Beekhof and...@beekhof.net:
 On 10 Jan 2014, at 3:51 pm, Andrey Groshev gre...@yandex.ru wrote:
 
  10.01.2014, 03:28, Andrew Beekhof and...@beekhof.net:
  On 9 Jan 2014, at 4:44 pm, Andrey Groshev gre...@yandex.ru wrote:
   09.01.2014, 02:39, Andrew Beekhof and...@beekhof.net:
On 18 Dec 2013, at 11:55 pm, Andrey Groshev gre...@yandex.ru wrote:
 Hi, Andrew and ALL.
 
 I'm sorry, but I again found an error. :)
 Crux of the problem:
 
 # crm_attribute --type crm_config --attr-name stonith-enabled 
 --query; echo $?
 scope=crm_config  name=stonith-enabled value=true
 0
 
 # crm_attribute --type crm_config --attr-name stonith-enabled 
 --update firstval ; echo $?
 0
 
 # crm_attribute --type crm_config --attr-name stonith-enabled 
 --query; echo $?
 scope=crm_config  name=stonith-enabled value=firstval
 0
 
 # crm_attribute --type crm_config --attr-name stonith-enabled  
 --update secondval --lifetime=reboot ; echo $?
 0
 
 # crm_attribute --type crm_config --attr-name stonith-enabled 
 --query; echo $?
 scope=crm_config  name=stonith-enabled value=firstval
 0
 
 # crm_attribute --type crm_config --attr-name stonith-enabled  
 --update thirdval --lifetime=forever ; echo $?
 0
 
 # crm_attribute --type crm_config --attr-name stonith-enabled 
 --query; echo $?
 scope=crm_config  name=stonith-enabled value=firstval
 0
 
 Ie if specify the lifetime of an attribute, then a attribure is not 
 updated.
 
 If impossible setup the lifetime of the attribute when it is 
 installing, it must be return an error.
Agreed. I'll reproduce and get back to you.
   How, I was able to review code, problem comes when used both options 
 --type and options --lifetime.
   One variant in case without break;
   Unfortunately, I did not have time to dive into the logic.
  Actually, the logic is correct.  The command:
 
  # crm_attribute --type crm_config --attr-name stonith-enabled  --update 
 secondval --lifetime=reboot ; echo $?
 
  is invalid.  You only get to specify --type OR --lifetime, not both.
  By specifying --lifetime, you're creating a node attribute, not a cluster 
 proprerty.
  With this, I do not argue. I think that should be the exit code is NOT 
 ZERO, ie it's error!
 
 No, its setting a value, just not where you thought (or where you're looking 
 for it in the next command).
 
 Its the same as writing:
 
   crm_attribute --type crm_config --type status --attr-name stonith-enabled  
 --update secondval; echo $?
 
 Only the last value for --type wins
 
 
  This is where the confusion comes from. Here is an example from the old cluster:
  # crm_attribute --type crm_config --attr-name test1  --update val1 
  --lifetime=reboot ; echo $?
  0
  # cibadmin -Q|grep test1
   <nvpair id="status-test-ins-db2-test1" name="test1" value="val1"/>
  So --lifetime wins?

Yes.  Because it was specified last.

  Isn't it easier to produce an error when someone tries to use incompatible options?

They're not incompatible. They're aliases for each other in a different context.
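In other words, the following two commands should be equivalent ways of writing the same transient node attribute (node and attribute names here are only examples):

  crm_attribute --node dev-cluster2-node1 --name test1 --update val1 --lifetime reboot
  crm_attribute --node dev-cluster2-node1 --name test1 --update val1 --type status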

  Then there is this uncertainty about what was meant..., what was meant..., what was 
  meant
 
 
 And if possible then the value should be established.
 In general, something is wrong.
 Denser unfortunately not yet looked, because I struggle with 
 STONITH :)
 
 P.S. Andrew! Late to congratulate you on your new addition to the 
 family.
 This fine time - now you will have toys which was not in your 
 childhood.
 

Re: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1

2014-01-08 Thread Andrew Beekhof

On 8 Jan 2014, at 9:15 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote:

 2014/1/8 Andrew Beekhof and...@beekhof.net:
 
 On 18 Dec 2013, at 9:50 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote:
 
 Hi David,
 
 2013/12/18 David Vossel dvos...@redhat.com:
 
 That's a really weird one... I don't see how it is possible for op-id to 
 be NULL there.   You might need to give valgrind a shot to detect whatever 
 is really going on here.
 
 -- Vossel
 
 Thank you for advice. I try it.
 
 Any update on this?
 
 
 We are still investigating a cause. It was not reproduced when I gave 
 valgrind..
 And it was reproduced in RC3.

So it happened with RC3 - valgrind, but not with RC3 + valgrind?
That's concerning.

Nothing in the valgrind output?

 
 


Re: [Pacemaker] starting resources with failed stonith resource

2014-01-08 Thread Andrew Beekhof

On 8 Jan 2014, at 2:41 am, Frank Van Damme frank.vanda...@gmail.com wrote:

 Hi list,
 
 I recently had some trouble with a dual-node mysql cluster, which runs
 in master-slave mode with Percona resource manager. While analyzing
 what happened to the cluster, I found this in syslog (network trouble,
 the cluster lost disk/iscsi access on both nodes, this is a piece from
 the former master trying to start up again when recovering
 connectivity):
 
 Jan  6 07:26:49 infante pengine: [3839]: notice: get_failcount:
 Failcount for MasterSlave_mysql on infante has expired (limit was 60s)
 Jan  6 07:26:49 infante pengine: [3839]: notice: get_failcount:
 Failcount for MasterSlave_mysql on infante has expired (limit was 60s)
 Jan  6 07:26:49 infante pengine: [3839]: WARN:
 common_apply_stickiness: Forcing p-stonith-ingstad away from infante
 after 100 failures (max=100)
 Jan  6 07:26:49 infante pengine: [3839]: notice: LogActions: Start
 prim_mysql:0#011(infante)
 Jan  6 07:26:49 infante pengine: [3839]: notice: LogActions: Start
 prim_mysql:1#011(ingstad)
 
 I don't understand it: if this means that the stonith devices have
 failed a million times,

We also set it to 100 when the start action fails.

 why is it trying to start the mysql resource?

It depends if any nodes need fencing.

  It's against Pacemaker policy to start resources on a cluster without
  working stonith devices, isn't it?

Not if all nodes are present and healthy.
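(If the goal is to make that stonith device eligible to run again once the underlying start problem is fixed, clearing its failcount is usually enough. A sketch, with the resource and node names taken from the log above and assuming the crm shell is available:

  crm resource cleanup p-stonith-ingstad infante )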

 
 -- 
 Frank Van Damme
 Make everything as simple as possible, but not simpler. - Albert Einstein
 


Re: [Pacemaker] Error: node does not appear to exist in configuration

2014-01-08 Thread Andrew Beekhof

On 6 Jan 2014, at 8:09 pm, Jerald B. Darow jbda...@ace-host.net wrote:

 Where am I going wrong here?

Good question... Chris?

 
 [root@zero mysql]# pcs cluster standby zero.acenet.us
 Error: node 'zero.acenet.us' does not appear to exist in configuration
 [root@zero mysql]# pcs cluster cib | grep node id
  <node id="diet.acenet.us" uname="diet.acenet.us"/>
  <node id="zero.acenet.us" uname="zero.acenet.us"/>
 
 ---
 
standby node | --all
Put specified node into standby mode (the node specified will no
 longer
be able to host resources), if --all is specified all nodes will
 be put
into standby mode.
 
 ---
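 (Until that's resolved, standby is just a node attribute, so as a workaround sketch it can be set and cleared directly, e.g.:
 
   crm_attribute --node zero.acenet.us --name standby --update on
   crm_attribute --node zero.acenet.us --name standby --update off )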
 
 


Re: [Pacemaker] monitoring redis in master-slave mode

2014-01-08 Thread Andrew Beekhof

On 13 Dec 2013, at 11:06 pm, ESWAR RAO eswar7...@gmail.com wrote:

 
 Hi All,
 
 I have a 3 node setup with HB+pacemaker. I wanted to run redis in 
 master-slave mode using an ocf script.
 https://groups.google.com/forum/#!msg/redis-db/eY3zCKnl0G0/lW5fObHrjwQJ
 
 But with the below configuration , I am able to start in master-slave mode 
 but pacemaker is not monitoring the redis.
 I killed the redis-server on node-1 (slave node/master node) but pacemaker is 
 not re-starting it .
 In the crm status I could see it as:
  Masters: [ oc-vm ]
  Slaves: [ oc-vm1 oc-vm2 ]
 
 even though it's not running on oc-vm1.
 
 # crm configure primitive cluster-ip ocf:IPaddr2 params ip=192.168.101.205 
 cidr_netmask=32 nic=eth1 op monitor interval=30s
 # crm configure primitive oc_redis ocf:redis op monitor role=Master 
 interval=3s timeout=5s op monitor role=Slave interval=3s timeout=3s

use a different interval for the two recurring operations.
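For example, only the intervals differ from your command above:

  # crm configure primitive oc_redis ocf:redis \
      op monitor role=Master interval=2s timeout=5s \
      op monitor role=Slave interval=3s timeout=3s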

 # crm configure ms redis_clone oc_redis meta notify=true master-max=1 
 master-node-max=1 clone-node-max=1 interleave=false 
 globally-unique=false
 # crm configure colocation ip-on-redis inf: cluster-ip redis_clone:Master
 
 Can someone help me in fixing it???
 
 Thanks
 Eswar


Re: [Pacemaker] catch-22: can't fence node A because node A has the fencing resource

2014-01-08 Thread Andrew Beekhof

On 4 Dec 2013, at 11:47 am, Brian J. Murrell br...@interlinx.bc.ca wrote:

 
 On Tue, 2013-12-03 at 18:26 -0500, David Vossel wrote: 
 
 We did away with all of the policy engine logic involved with trying to move 
 fencing devices off of the target node before executing the fencing action. 
 Behind the scenes all fencing devices are now essentially clones.  If the 
 target node to be fenced has a fencing device running on it, that device can 
 execute anywhere in the cluster to avoid the suicide situation.
 
 OK.
 
 When you are looking at crm_mon output and see a fencing device is running 
 on a specific node, all that really means is that we are going to attempt to 
 execute fencing actions for that device from that node first. If that node 
 is unavailable,
 
 Would it be better to not even try to use a node and ask it to commit
 suicide but always try to use another node?

IIRC the only time we ask a node to fence itself is when it is (or thinks it 
is) the last node standing.

 
 we'll try that same device anywhere in the cluster we can get it to work
 
 OK.
 
 (unless you've specifically built some location constraint that prevents the 
 fencing device from ever running on a specific node)
 
 While I do have constraints on the more service-oriented resources to
 give them preferred nodes, I don't have any constraints on the fencing
 resources.
 
 So given all of the above, and given the log I supplied showing that the
 fencing was just not being attempted anywhere other than the node to be
 fenced (which was down during that log) any clues as to where to look
 for why?
 
 Hope that helps.
 
 It explains the differences, but unfortunately I'm still not sure why it
 wouldn't get run somewhere else, eventually, rather than continually
 being attempted on the node to be killed (which as I mentioned, was shut
 down at the time the log was made).

Yes, this is surprising.
Can you enable the blackbox for stonith-ng, reproduce and generate a crm_report 
for us please?  It will contain all the information we need.
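(If it helps, one way to do that on a RHEL-style install - the path, variable value and timestamps below are assumptions, so please check them against your version's docs - is roughly:

  # in /etc/sysconfig/pacemaker, then restart pacemaker before reproducing:
  PCMK_blackbox=yes

  # after reproducing:
  crm_report -f "2014-01-08 10:00" -t "2014-01-08 11:00" /tmp/fencing-issue )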




Re: [Pacemaker] pacemaker + cman - node names and bind address

2014-01-08 Thread Andrew Beekhof

On 5 Dec 2013, at 8:51 pm, Nikola Ciprich nikola.cipr...@linuxbox.cz wrote:

 Hello Digimer,
 
  and thanks for your reply. I understand your points, but my question
  is about something a bit different.
  
  Example: I have two nodes, node1 (LAN address resolves to 192.168.1.1)
  and node2 (LAN address resolves to 192.168.1.2),
  connected by a crosslink (10.0.0.1, 10.0.0.2).
  
  I'd like to use node1/node2 as the cluster node names, and I'd like
  node1/node2 to keep resolving to the 192.168.1.0 addresses, but the
  cluster to communicate over the 10.0.0.0 link.
  
  With corosync, I was able to force this by setting bindnetaddr, but
  how can I set this in a cman cluster? (I (roughly) know that corosync
  is used even in a cman-based cluster, but that's unimportant now, I think.)
  
  What is the clean way to achieve this?

There is some information on altname and altmulticast at the bottom of:

   
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/s1-config-rrp-cli-CA.html

that might be of some use.
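As a rough sketch of what the altname approach described there looks like in cluster.conf (the node and interface names here are invented):

   <clusternode name="node1" nodeid="1">
     <altname name="node1-cross"/>
   </clusternode>

Whether that actually gives you the same effect bindnetaddr did is worth verifying on a test pair first.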

 
 thanks a lot in advance
 
 BR
 
 nik
 
 
 
 
 
 On Thu, Dec 05, 2013 at 02:36:36AM -0500, Digimer wrote:
 On 05/12/13 02:28, Nikola Ciprich wrote:
 Hello,
 
 not sure whether this shouldn't be asked in some different
 conference, if so, I'm sorry in advance..
 
 Since this seems to be recommended solution for RHEL6 and since I
 need to use CLVMD, I switched my cluster from corosync + pacemaker
 to cman+pacemaker.
 
 I usually use two-nodes clusters, where hostnames resolve to lan
 uplink addresses, and nodes are interconnected using crosslink (to
 be exact, multiple crosslinks with bonding).
 
 I'd like to use hostnames as cluster node names, but I'd like
 cluster communication to go over crosslinks. Is there a way I could
 force this in cluster.conf?
 
 cluster.conf manpage is quite brief on this..
 
 Could somebody please give me an advice?
 
 thanks a lot in advance
 
 BR
 
 nik
 
 Hello,
 
  This list is fine to ask.
 
  I speak mainly as a cman user, but I will make an educated guess on
 the pacemaker side of things.
 
  First, cman does not replace corosync. Both pacemaker and cman use
 corosync for cluster communications and membership. The difference is
 that cman configures and starts corosync for you, behind the scenes.
 
  The node names you set in cluster.conf's clusternode
 name=an-c05n01.alteeve.ca nodeid=1 ... /clusternode
 (an-c05n01.alteeve.ca in this case) is resolved to an IP address. This
 IP address/subnet is then looked for of your network and, once found,
 the associated network card is used.
 
  Generally, it's advised that you set the cluster names to match each
 node's 'uname -n' hostname. If you did that, then the following bit of
 bash will tell you which interface will be used for the cluster (on
 RHEL/CentOS 6, anyway);
 
 ifconfig |grep -B 1 $(gethostip -d $(uname -n)) | grep HWaddr | awk '{
 print $1 }'
 
  Once you've identified the network, you will want to make sure that,
 if it is bonded, it is a supported bond mode. Prior to 6.3 (iirc),
 only mode=1 (active/passive) bonding was supported. With 6.4, mode=0
 and mode=2 support was added. All other bond modes are not supported
 under corosync.
 
  As for pacemaker integration; Be sure to setup stonith in pacemaker
 and make sure it is working. Then use the 'fence_pcmk' fence agent in
 cluster.conf itself. This way, should cman detect a node failure, it
 will call pacemaker and ask it to fence the target node, saving you
 from having to keep the real fencing configured in two places at once.
 
 hth
 
 digimer
 
 -- 
 Digimer
 Papers and Projects: https://alteeve.ca/w/
 What if the cure for cancer is trapped in the mind of a person without
 access to education?
 
 
 
 -- 
 -
 Ing. Nikola CIPRICH
 LinuxBox.cz, s.r.o.
 28.rijna 168, 709 00 Ostrava
 
 tel.:   +420 591 166 214
 fax:+420 596 621 273
 mobil:  +420 777 093 799
 www.linuxbox.cz
 
 mobil servis: +420 737 238 656
 email servis: ser...@linuxbox.cz
 -

Re: [Pacemaker] How to permanently delete ghostly nodes?

2014-01-08 Thread Andrew Beekhof

On 7 Dec 2013, at 8:19 pm, Andrey Rogovsky a.rogov...@gmail.com wrote:

  I renamed several nodes and restarted the cluster.
  Now the old nodes show up with status offline.
  I tried to delete them, but every time the cluster configuration changes 
  they show up as offline again.

It depends a bit on the version of pacemaker and whether you're using heartbeat 
or corosync or the corosync plugin or cman.
Details?





Re: [Pacemaker] again return code, now in crm_attribute

2014-01-08 Thread Andrew Beekhof

On 18 Dec 2013, at 11:55 pm, Andrey Groshev gre...@yandex.ru wrote:

 Hi, Andrew and ALL.
 
  I'm sorry, but I found another error. :)
  The crux of the problem:
 
 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $?
 scope=crm_config  name=stonith-enabled value=true
 0
 
 # crm_attribute --type crm_config --attr-name stonith-enabled --update 
 firstval ; echo $?
 0
 
 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $?
 scope=crm_config  name=stonith-enabled value=firstval
 0
 
 # crm_attribute --type crm_config --attr-name stonith-enabled  --update 
 secondval --lifetime=reboot ; echo $?
 0
 
 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $?
 scope=crm_config  name=stonith-enabled value=firstval
 0
 
 # crm_attribute --type crm_config --attr-name stonith-enabled  --update 
 thirdval --lifetime=forever ; echo $?
 0
 
 # crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $?
 scope=crm_config  name=stonith-enabled value=firstval
 0
 
  I.e. if I specify the lifetime of the attribute, then the attribute is not updated.
 
  If it is impossible to set the lifetime of the attribute while setting it, it 
  must return an error.

Agreed. I'll reproduce and get back to you.

 
  And if it is possible, then the value should actually be set.
  In general, something is wrong.
  Unfortunately I haven't dug in deeper yet, because I'm struggling with STONITH :)
 
  P.S. Andrew! Belated congratulations on the new addition to your family. 
  It's a fine time - now you will have the toys you didn't have in your childhood.
 


Re: [Pacemaker] CentOS 6.5 Pacemaker Oracle Active/Failover cluster setup on SAN

2014-01-07 Thread Andrew Beekhof

On 6 Jan 2014, at 4:15 pm, Pui Edylie em...@edylie.net wrote:

 Good Day members,
 
 I am wondering if anyone has set this up successfully?
 
  I noticed that there is a lack of an Oracle script to initiate this.
  
  I would be willing to pay someone for this effort, and hopefully we could create 
  a howto to benefit subsequent readers.

Red Hat could presumably help you, but you'd have to use RHEL instead of CentOS.

 
 Please contact me!
 
 Thanks
 
 Edy
 
 


Re: [Pacemaker] Manual fence confirmation by stonith_admin doesn't work again.

2014-01-07 Thread Andrew Beekhof

On 19 Dec 2013, at 6:54 pm, Nikita Staroverov nsfo...@gmail.com wrote:

 
 Please see:
 
 https://access.redhat.com/site/articles/36302
 
 If you don't have an account, the relevant part is:
 
 Usage of fence_manual is not supported in any production cluster. You may 
 use this fence agent for development or debugging purposes only.
 
  I wrote about fence_ack_manual, not about fence_manual. fence_ack_manual 
  isn't a fence agent, it's a confirmation tool like stonith_admin -C.
  2. Pacemaker notifies CMAN about real fencing, so why not about manual
  confirmations? It's a bug, I'm sure.

There is no-one to tell yet.
We have to wait for cman to decide something needs fencing before pacemaker can 
perform the notification.

 
 Perhaps, I can't speak to pacemaker's behaviour.
 Ok. I'll think that it's not a bug, but feature request.
 
 3. I use real fencing. Manual fencing needed in rare situations, like
 problem with IPMI controller, lost power or so. What should
 administrator do without manual confirmations? :)
 
 Fencing should loop until it succeeds. Fix the problem, fence_ack_manual if 
 that's not possible or, ideally, use multiple fence methods (I use IPMI on 
 one switch and switched PDUs on another switch for redundancy).
 
  I have two datacenters joined by fast links, with synchronous replication, 
  so server groups in two separate datacenters form clusters.
  If one campus burns down together with its own fencing devices, how am I supposed to fix this?
  In some situations the administrator needs a simple and fast manual confirmation, 
  or it's not high availability.
 
 


Re: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1

2014-01-07 Thread Andrew Beekhof

On 18 Dec 2013, at 9:50 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote:

 Hi David,
 
 2013/12/18 David Vossel dvos...@redhat.com:
 
 That's a really weird one... I don't see how it is possible for op-id to be 
 NULL there.   You might need to give valgrind a shot to detect whatever is 
 really going on here.
 
 -- Vossel
 
 Thank you for advice. I try it.

Any update on this?





Re: [Pacemaker] reboot of non-vm host results in VM restart -- of chickens and eggs and VMs

2014-01-07 Thread Andrew Beekhof

On 20 Dec 2013, at 5:30 am, Bob Haxo bh...@sgi.com wrote:

 Hello,
 
 Earlier emails related to this topic:
 [pacemaker] chicken-egg-problem with libvirtd and a VM within cluster
 [pacemaker] VirtualDomain problem after reboot of one node
 
 
 My configuration:
 
 RHEL6.5/CMAN/gfs2/Pacemaker/crmsh
 
 pacemaker-libs-1.1.10-14.el6_5.1.x86_64
 pacemaker-cli-1.1.10-14.el6_5.1.x86_64
 pacemaker-1.1.10-14.el6_5.1.x86_64
 pacemaker-cluster-libs-1.1.10-14.el6_5.1.x86_64
 
 Two node HA VM cluster using real shared drive, not drbd.
 
 Resources (relevant to this discussion):
 primitive p_fs_images ocf:heartbeat:Filesystem \
 primitive p_libvirtd lsb:libvirtd \
 primitive virt ocf:heartbeat:VirtualDomain \
 
 services chkconfig on: cman, clvmd, pacemaker
 services chkconfig off: corosync, gfs2, libvirtd
 
 Observation:
 
 Rebooting the NON-host system results in the restart of the VM merrily 
 running on the host system.

I'm still bootstrapping after the break, but I'm not following this.  Can you 
rephrase? 

 
 Apparent cause:
 
 Upon startup, Pacemaker apparently checks the status of configured resources. 
 However, the status request for the virt (ocf:heartbeat:VirtualDomain) 
 resource fails with:
 
 Dec 18 12:19:30 [4147] mici-admin2   lrmd:  warning: 
 child_timeout_callback:virt_monitor_0 process (PID 4158) timed out
 Dec 18 12:19:30 [4147] mici-admin2   lrmd:  warning: operation_finished:  
   virt_monitor_0:4158 - timed out after 20ms
 Dec 18 12:19:30 [4147] mici-admin2   lrmd:   notice: operation_finished:  
   virt_monitor_0:4158:stderr [ error: Failed to reconnect to the hypervisor ]
 Dec 18 12:19:30 [4147] mici-admin2   lrmd:   notice: operation_finished:  
   virt_monitor_0:4158:stderr [ error: no valid connection ]
 Dec 18 12:19:30 [4147] mici-admin2   lrmd:   notice: operation_finished:  
   virt_monitor_0:4158:stderr [ error: Failed to connect socket to 
 '/var/run/libvirt/libvirt-sock': No such file or directory ]

Sounds like the agent should perhaps be returning OCF_NOT_RUNNING in this case.
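Conceptually, something along these lines in the agent's monitor path (a sketch only - not the actual VirtualDomain code; OCF_SUCCESS/OCF_NOT_RUNNING come from the standard OCF shell functions, and DOMAIN_NAME is a placeholder for the configured domain name):

  monitor_sketch() {
      # If libvirtd's socket is gone, the domain cannot be running on this
      # node, so report "not running" instead of timing out on virsh.
      if [ ! -S /var/run/libvirt/libvirt-sock ]; then
          return $OCF_NOT_RUNNING
      fi
      if virsh domstate "$DOMAIN_NAME" 2>/dev/null | grep -q '^running'; then
          return $OCF_SUCCESS
      else
          return $OCF_NOT_RUNNING
      fi
  }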

 
 
 This failure then snowballs into an orphan situation in which the running 
 VM is restarted.
 
 There was the suggestion of chkconfig on libvirtd (and presumably deleting 
 the resource) so that the /var/run/libvirt/libvirt-sock has been created by 
 service libvirtd. With libvirtd started by the system, there is no un-needed 
 reboot of the VM.
 
 However, it may be that removing libvirtd from Pacemaker control leaves the 
 VM vdisk filesystem susceptible to corruption during a reboot induced 
 failover.
 
 Question:
 
 Is there an accepted Pacemaker configuration such that the un-needed restart 
 of the VM does not occur with the reboot of the non-host system?
 
 Regards,
 Bob Haxo
 
 
 
 
 
 
 


Re: [Pacemaker] Minor buffer overflow..

2014-01-07 Thread Andrew Beekhof

On 5 Dec 2013, at 3:20 pm, Rob Thomas xro...@gmail.com wrote:

 I was idly wondering why the SMTP and SNMP modules were disabled by
 default on the RHEL builds, and was in the middle of writing a shell
 script to duplicate them when I noticed there was a tiny buffer
 overflow in crm_mon.
 
 This may be why it's disabled by default?

Not really. It was more of a dependency issue.
Plus the run script option makes them redundant.

 
 Patch:
 
 https://github.com/xrobau/pacemaker/commit/b1515e3f83fceeac951de8823d718bdf13e4a093

Can you make a pull request for that?

 
 --Rob
 


Re: [Pacemaker] Starting Pacemaker Cluster Manager [FAILED]

2014-01-07 Thread Andrew Beekhof

On 21 Nov 2013, at 9:56 pm, Miha m...@softnet.si wrote:

 HI,
 
 how can i delete/reset all config, so that I could do again:


pcs cluster destroy on all nodes looks about right
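Roughly (run on every node - this wipes that node's cluster configuration - then start over with your original setup command):

  pcs cluster destroy
  pcs cluster setup mycluster pcmk-1 pcmk-2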

 
  'pcs cluster setup mycluster pcmk-1 pcmk-2' and begin again from the beginning?
  Thanks!
  
  P.S.: below is the log
 
 Nov 21 11:51:21 [10578] sip1cib:error: xml_log: Expecting 
 element status, got node_state
 Nov 21 11:51:21 [10578] sip1cib:error: xml_log: Element cib 
 failed to validate content
 Nov 21 11:51:21 [10578] sip1cib:error: readCibXmlFile:  CIB 
 does not validate with pacemaker-1.2
 Nov 21 11:51:22 [10578] sip1cib:error: cib_process_request:   
   Operation ignored, cluster configuration is invalid. Please repair and 
 restart: Update does not conform to the configured schema
 Nov 21 11:51:22 [10578] sip1cib:error: cib_process_request:   
   Operation ignored, cluster configuration is invalid. Please repair and 
 restart: Update does not conform to the configured schema
 Nov 21 11:51:22 [10579] sip1 stonith-ng:error: cluster_option:  Value 
 'tru' for cluster option 'stop-all-resources' is invalid.  Defaulting to false
 Nov 21 11:51:22 [10578] sip1cib:error: cib_process_request:   
   Operation ignored, cluster configuration is invalid. Please repair and 
 restart: Update does not conform to the configured schema
 Nov 21 11:51:22 [10578] sip1cib:error: cib_process_request:   
   Operation ignored, cluster configuration is invalid. Please repair and 
 restart: Update does not conform to the configured schema
 Nov 21 11:51:22 [10583] sip1   crmd: info: register_fsa_error_adv:
   Resetting the current action list
 Nov 21 11:51:22 [10583] sip1   crmd:error: node_list_update_callback: 
   Node update 3 failed: Update does not conform to the configured schema 
 (-203)
 Nov 21 11:51:22 [10583] sip1   crmd: info: register_fsa_error_adv:
   Resetting the current action list
 Nov 21 11:51:22 [10583] sip1   crmd:error: config_query_callback: 
   Local CIB query resulted in an error: Update does not conform to the 
 configured schema
 Nov 21 11:51:22 [10583] sip1   crmd: info: register_fsa_error_adv:
   Resetting the current action list
 Nov 21 11:51:22 [10583] sip1   crmd:error: config_query_callback: 
   The cluster is mis-configured - shutting down and staying down
 Nov 21 11:51:22 [10583] sip1   crmd:error: do_log:  FSA: Input 
 I_ERROR from config_query_callback() received in state S_STARTING
 Nov 21 11:51:22 [10583] sip1   crmd:  warning: do_recover:  Fast-tracking 
 shutdown in response to errors
 Nov 21 11:51:22 [10583] sip1   crmd:error: do_log:  FSA: Input 
 I_ERROR from node_list_update_callback() received in state S_RECOVERY
 Nov 21 11:51:22 [10583] sip1   crmd:error: do_log:  FSA: Input 
 I_ERROR from revision_check_callback() received in state S_RECOVERY
 Nov 21 11:51:22 [10583] sip1   crmd:error: do_started:  Start 
 cancelled... S_RECOVERY
 Nov 21 11:51:22 [10583] sip1   crmd:error: do_log:  FSA: Input 
 I_TERMINATE from do_recover() received in state S_RECOVERY
 Nov 21 11:51:22 [10578] sip1cib:error: cib_process_request:   
   Operation ignored, cluster configuration is invalid. Please repair and 
 restart: Update does not conform to the configured schema
 Nov 21 11:51:22 [10572] sip1 pacemakerd:error: pcmk_child_exit: Child 
 process crmd (10583) exited: Network is down (100)
 Nov 21 11:51:22 [10572] sip1 pacemakerd:   notice: pcmk_shutdown_worker:  
   Attempting to inhibit respawning after fatal error
 


Re: [Pacemaker] some questions about STONITH

2014-01-07 Thread Andrew Beekhof

On 26 Nov 2013, at 12:39 am, Andrey Groshev gre...@yandex.ru wrote:

 ...snip...
   I ran the next test:
   # stonith_admin --reboot=dev-cluster2-node2
   The node reboots, but the resources don't start.
   In the crm_mon status: Node dev-cluster2-node2 (172793105): pending.
   And it stays hung there.
 
 That is *probably* a race - the node reboots too fast, or still
 communicates for a bit after the fence has supposedly completed (if it's
 not a reboot -nf, but a mere reboot). We have had problems here in the
 past.
 
 You may want to file a proper bug report with crm_report included, and
 preferably corosync/pacemaker debugging enabled.
 
  It turns out it does not hang forever.
  A timeout triggers after 20 minutes.
  crm_report archive - http://send2me.ru/pen2.tar.bz2
  Of course the logs contain many entries of this type:
  
  pgsql:1: Breaking dependency loop at msPostgresql
  
  But where this dependency goes after the timeout, I do not understand.

Can you rephrase your question?
I'm not 100% sure I understand what you're asking.





Re: [Pacemaker] Breaking dependency loop stonith

2014-01-07 Thread Andrew Beekhof

On 27 Nov 2013, at 12:26 am, Andrey Groshev gre...@yandex.ru wrote:

 Hi, ALL.
 
 I want to clarify two more questions.
  After a stonith reboot, this node hangs with status pending.
  I found these strings in the logs:
 
info: rsc_merge_weights:pgsql:1: Breaking dependency loop at 
 msPostgresql
info: rsc_merge_weights:pgsql:2: Breaking dependency loop at 
 msPostgresql
 
  Does this mean the dependency search stops because there are no more dependencies?
  Or is an infinite dependency-search loop being interrupted?

The second one, but it has nothing to do with a node being in the pending 
state.
Where did you see this?

 
 And two.
 Do I need to clone the stonith resource now (In PCMK 1.1.11)?

No.

 On the one hand, I see this resource on all nodes through command.
 # cibadmin -Q|grep stonith
 <nvpair name="stonith-enabled" value="true" 
  id="cib-bootstrap-options-stonith-enabled"/>
  <primitive id="st1" class="stonith" type="external/sshbykey">
  <lrm_resource id="st1" type="external/sshbykey" class="stonith">
  <lrm_resource id="st1" type="external/sshbykey" class="stonith">
  <lrm_resource id="st1" type="external/sshbykey" class="stonith">
 (without pending node)

Like all resources, we check all nodes at startup to see if it is already 
active.

 
 On the other hand, another command I see only one instance on a particular 
 node.
 # crm_verify -L
info: main: =#=#=#=#= Getting XML =#=#=#=#=
info: main: Reading XML from: live cluster
info: validate_with_relaxng:Creating RNG parser context
info: determine_online_status_fencing:  Node dev-cluster2-node4 is 
 active
info: determine_online_status:  Node dev-cluster2-node4 is online
info: determine_online_status_fencing:  - Node dev-cluster2-node1 is 
 not ready to run resources
info: determine_online_status_fencing:  Node dev-cluster2-node2 is 
 active
info: determine_online_status:  Node dev-cluster2-node2 is online
info: determine_online_status_fencing:  Node dev-cluster2-node3 is 
 active
info: determine_online_status:  Node dev-cluster2-node3 is online
info: determine_op_status:  Operation monitor found resource pingCheck:0 
 active on dev-cluster2-node4
info: native_print: VirtualIP   (ocf::heartbeat:IPaddr2):  
  Started dev-cluster2-node4
info: clone_print:   Master/Slave Set: msPostgresql [pgsql]
info: short_print:   Masters: [ dev-cluster2-node4 ]
info: short_print:   Slaves: [ dev-cluster2-node2 dev-cluster2-node3 ]
info: short_print:   Stopped: [ dev-cluster2-node1 ]
info: clone_print:   Clone Set: clnPingCheck [pingCheck]
info: short_print:   Started: [ dev-cluster2-node2 dev-cluster2-node3 
 dev-cluster2-node4 ]
info: short_print:   Stopped: [ dev-cluster2-node1 ]
info: native_print: st1 (stonith:external/sshbykey):
 Started dev-cluster2-node4
info: native_color: Resource pingCheck:3 cannot run anywhere
info: native_color: Resource pgsql:3 cannot run anywhere
info: rsc_merge_weights:pgsql:1: Breaking dependency loop at 
 msPostgresql
info: rsc_merge_weights:pgsql:2: Breaking dependency loop at 
 msPostgresql
info: master_color: Promoting pgsql:0 (Master dev-cluster2-node4)
info: master_color: msPostgresql: Promoted 1 instances of a 
 possible 1 to master
info: LogActions:   Leave   VirtualIP   (Started dev-cluster2-node4)
info: LogActions:   Leave   pgsql:0 (Master dev-cluster2-node4)
info: LogActions:   Leave   pgsql:1 (Slave dev-cluster2-node2)
info: LogActions:   Leave   pgsql:2 (Slave dev-cluster2-node3)
info: LogActions:   Leave   pgsql:3 (Stopped)
info: LogActions:   Leave   pingCheck:0 (Started dev-cluster2-node4)
info: LogActions:   Leave   pingCheck:1 (Started dev-cluster2-node2)
info: LogActions:   Leave   pingCheck:2 (Started dev-cluster2-node3)
info: LogActions:   Leave   pingCheck:3 (Stopped)
info: LogActions:   Leave   st1 (Started dev-cluster2-node4)
 
 
 However, if I do a clone - it turns out the same garbage.
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Weird behavior of PCS command while defining DRBD resources

2014-01-07 Thread Andrew Beekhof

On 27 Nov 2013, at 10:21 pm, Muhammad Kamran Azeem kamranaz...@gmail.com 
wrote:

 Apologies for double post. In my initial post, I forgot to set the subject 
 properly.
 
 
 Hello List,
 
  I am new here. I worked with Linux HA during 2006-2008, went in the HPC 
  direction, and came back to HA a month ago. I realized that a lot has changed. 
 
 My setup:
 
 Two KVM machines vdb1 (192.168.122.11), vdb2 (192.168.122.12)
 ClusterIP: 192.168.122.10 
 Fedora 19 (64 bit). PCS, CoroSync, PaceMaker, DRBD
 
 Note: I use the names node1 and node2 for vdb1 and vdb2 for explanations.
 
  I am trying to set up a test cluster, using  
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_configure_the_cluster_for_drbd.html
 
 First, the status:
 
 [root@vdb1 drbd.d]# pcs status
 Cluster name: MySQLCluster
 Last updated: Tue Nov 26 14:05:33 2013
 Last change: Mon Nov 25 17:25:59 2013 via crm_resource on vdb2.example.com
 Stack: corosync
 Current DC: vdb1.example.com (1) - partition with quorum
 Version: 1.1.9-3.fc19-781a388
 2 Nodes configured, unknown expected votes
 2 Resources configured.
 
 Online: [ vdb1.example.com vdb2.example.com ]
 
 Full list of resources:
 
  ClusterIP(ocf::heartbeat:IPaddr2):   Started vdb1.example.com 
  Apache   (ocf::heartbeat:apache):Started vdb1.example.com 
 
 [root@vdb1 drbd.d]#
 
 
 My DRBD disks are: 
 
 [root@vdb1 drbd.d]# drbd-overview 
   1:MySQLDisk/0   Connected Secondary/Secondary UpToDate/UpToDate C r- 
   2:ApacheDisk/0  Connected Secondary/Secondary UpToDate/UpToDate C r- 
 [root@vdb1 drbd.d]# 
 
 
  Now, the guide suggests creating a small config file, defining the new 
  resources in it, and then pushing that into the CIB. Extract from the guide:
 # pcs cluster cib drbd_cfg
 # pcs -f drbd_cfg resource create WebData ocf:linbit:drbd \
  drbd_resource=wwwdata op monitor interval=60s
 # pcs -f drbd_cfg resource master WebDataClone WebData \
  master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \
  notify=true
 
 
 I decided to execute the commands (manually), without using the config file 
 method, as:
 
 # pcs resource create p_ApacheDisk ocf:linbit:drbd \
  drbd_resource=ApacheDisk op monitor interval=60s
 
 # pcs resource master MasterApacheDisk p_ApacheDisk \
   master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \
   notify=true
 
 (I changed the names of resources a bit)
 
 I get the following errors:
 
 [root@vdb2 ~]# pcs resource create p_ApacheDisk ocf:linbit:drbd \
  drbd_resource=ApacheDisk op monitor interval=60s
 
 
 [root@vdb2 ~]# pcs status
 Cluster name: MySQLCluster
 Last updated: Wed Nov 27 11:50:35 2013
 Last change: Wed Nov 27 11:49:36 2013 via cibadmin on vdb2.example.com
 Stack: corosync
 Current DC: vdb1.example.com (1) - partition with quorum
 Version: 1.1.9-3.fc19-781a388
 2 Nodes configured, unknown expected votes
 3 Resources configured.
 
 Online: [ vdb1.example.com vdb2.example.com ]
 
 Full list of resources:
 
  ClusterIP(ocf::heartbeat:IPaddr2):   Started vdb1.example.com 
  Apache   (ocf::heartbeat:apache):Started vdb1.example.com 
  p_ApacheDisk (ocf::linbit:drbd): Stopped 
 
 Failed actions:
 p_ApacheDisk_monitor_0 (node=vdb1.example.com, call=27, rc=6, 
 status=complete, last-rc-change=Wed Nov 27 11:49:36 2013
 , queued=23ms, exec=0ms
 ): not configured
 p_ApacheDisk_monitor_0 (node=vdb2.example.com, call=15, rc=6, 
 status=complete, last-rc-change=Wed Nov 27 11:49:36 2013
 , queued=22ms, exec=1ms
 ): not configured
 
 
 Got the following in /var/log/messages on DC (node 1):
 
 Nov 27 11:49:36 vdb1 cib[538]:   notice: cib:diff: Diff: --- 0.43.13
 Nov 27 11:49:36 vdb1 cib[538]:   notice: cib:diff: Diff: +++ 0.44.1 
 f4b87d9dee145747f86583cb5eb8276b
 Nov 27 11:49:36 vdb1 stonith-ng[539]:   notice: unpack_config: On loss of CCM 
 Quorum: Ignore
 Nov 27 11:49:36 vdb1 crmd[543]:   notice: do_state_transition: State 
 transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
 origin=abort_transition_graph ]
 Nov 27 11:49:36 vdb1 pengine[542]:   notice: unpack_config: On loss of CCM 
 Quorum: Ignore
 Nov 27 11:49:36 vdb1 pengine[542]:   notice: LogActions: Start   
 p_ApacheDisk#011(vdb2.example.com)
 Nov 27 11:49:36 vdb1 pengine[542]:   notice: process_pe_message: Calculated 
 Transition 92: /var/lib/pacemaker/pengine/pe-input-74.bz2
 Nov 27 11:49:36 vdb1 crmd[543]:   notice: te_rsc_command: Initiating action 
 8: monitor p_ApacheDisk_monitor_0 on vdb2.example.com
 Nov 27 11:49:36 vdb1 crmd[543]:   notice: te_rsc_command: Initiating action 
 6: monitor p_ApacheDisk_monitor_0 on vdb1.example.com (local)
 Nov 27 11:49:36 vdb1 drbd(p_ApacheDisk)[9807]: ERROR: meta parameter 
 misconfigured, expected clone-max -le 2, but found unset.
 Nov 27 11:49:36 vdb1 crmd[543]:   notice: process_lrm_event: LRM operation 
 p_ApacheDisk_monitor_0 (call=27, rc=6, cib-update=124, confirmed=true) not 
 

Re: [Pacemaker] prevent starting resources on failed node

2014-01-07 Thread Andrew Beekhof

On 7 Dec 2013, at 2:17 am, Brian J. Murrell (brian) br...@interlinx.bc.ca 
wrote:

 [ Hopefully this doesn't cause a duplicate post but my first attempt
 returned an error. ]
 
 Using pacemaker 1.1.10 (but I think this issue is more general than that
 release), I want to enforce a policy that once a node fails, no
 resources can be started/run on it until the user permits it.

Node fails? Or resource on a node fails?
If you really mean the node, just don't configure it to start pacemaker when 
it boots.
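
For example, on an EL6-style init system that is roughly (exact service names
depend on the stack in use):

    # chkconfig pacemaker off
    # chkconfig corosync off

or, on systemd-based distributions, something like "systemctl disable pacemaker
corosync", so that the node only rejoins when an operator starts the services by
hand.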

 
 I have been successful in achieving this using resource stickiness.
 Mostly.  It seems that once the resource has been successfully started
 on another node, it stays put, even once the failed node comes back up.
 So this is all good.
 
 Where it does seem to be falling down though is that if the failed node
 comes back up before the resource can be successfully started on another
 node, pacemaker seems to include the just-failed-and-restarted node in
 the candidate list of nodes it tries to start the resource on.  So in
 this manner, it seems that resource stickiness only applies once the
 resource has been started (which is not surprising; it seems a
 reasonable behaviour).
 
 The question then is, anyone have any ideas on how to implement such a
 policy?  That is, once a node fails, no resources are allowed to start
 on it, even if it means not starting the resource (i.e. all other nodes
 are unable to start it for whatever reason)?  Simply not starting the
 node would be one way to achieve it, yes, but we cannot rely on the node
 not being started.
 
 It seems perhaps the installation of a constraint when a node is
 stonithed might do the trick, but the question is how to couple/trigger
 the installation of a constraint with a stonith action?
 
 Or is there a better/different way to achieve this?
 
 Cheers,
 b.
 
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again

2014-01-07 Thread Andrew Beekhof
What version of pacemaker?
There were some improvements to how we handle sending messages via CPG recently.

On 10 Dec 2013, at 4:40 am, Brian J. Murrell br...@interlinx.bc.ca wrote:

 On Mon, 2013-12-09 at 09:28 +0100, Jan Friesse wrote:
 
  Error 6 means try again. This happens either when corosync is
  overloaded or when it is creating a new membership. Please take a look at
  /var/log/cluster/corosync.log to see if there is something strange there (+ make
  sure you have the newest corosync).
 
 Would that same information be available in /var/log/messages if I have
 configured corosync such as:
 
 logging {
fileline: off
to_stderr: no
to_logfile: no
to_syslog: yes
logfile: /var/log/cluster/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
 }
 
 If so, then the log snippet I posted in the prior message includes all
 that corosync had to report.  Should I increase the amount of logging?
 Any suggestions on an appropriate amount/flags, etc.?
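 
 (For what it's worth, a more verbose variant of that stanza, shown here only as a
 sketch and normally left enabled just while reproducing the problem, would look
 something like:
 
     logging {
             to_syslog: yes
             to_logfile: yes
             logfile: /var/log/cluster/corosync.log
             debug: on
             timestamp: on
     }
 
 i.e. enabling debug and writing to the logfile as well as syslog.)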
 
 (+ make
  sure you have the newest corosync).
 
 corosync-1.4.1-15.el6_4.1.x86_64 as shipped by RH in EL6.
 
 Is this new enough?  I know 2.x is also available but I don't think RH
 is shipping that yet.  Hopefully their 1.4.1 is still supported.
 
 Cheers,
 b.
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Reg. trigger when node failure occurs

2014-01-07 Thread Andrew Beekhof

On 11 Dec 2013, at 3:45 pm, ESWAR RAO eswar7...@gmail.com wrote:

 Hi Micheal,
 
 I am configuring the ClusterMon as below on the 3 node setup:
 I am following 
 http://floriancrouzat.net/2013/01/monitor-a-pacemaker-cluster-with-ocfpacemakerclustermon-andor-external-agent/
 
 # crm configure primitive ClusterMon ocf:pacemaker:ClusterMon params 
 user=root update=10 extra_options=-E /root/monitor.sh -e 
 192.168.100.188 op monitor on-fail=restart interval=60
 
 Entity: line 24: element nvpair: Relax-NG validity error : Type ID doesn't 
 allow value 'ClusterMon-instance_attributes-/root/monitor.sh'

Looks like crmsh is having trouble parsing your command line.
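
One hedged guess: the multi-word extra_options value has to reach crmsh as a single
parameter, i.e. quoted, roughly:

    # crm configure primitive ClusterMon ocf:pacemaker:ClusterMon \
        params user=root update=10 \
        extra_options="-E /root/monitor.sh -e 192.168.100.188" \
        op monitor on-fail=restart interval=60

(values copied from the command above; the quotes may simply have been stripped by
the mail archive). Without them crmsh treats -E, the script path and -e as
separate, unknown parameters, which matches the errors below.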

 Entity: line 24: element nvpair: Relax-NG validity error : Element nvpair 
 failed to validate attributes
 Relax-NG validity error : Extra element nvpair in interleave
 Entity: line 21: element nvpair: Relax-NG validity error : Element 
 instance_attributes failed to validate content
 Relax-NG validity error : Extra element instance_attributes in interleave
 Entity: line 2: element cib: Relax-NG validity error : Element cib failed to 
 validate content
 crm_verify[15283]: 2013/12/10_20:35:45 ERROR: main: CIB did not pass 
 DTD/schema validation
 Errors found during check: config not valid
 ERROR: ClusterMon: parameter -e does not exist
 ERROR: ClusterMon: parameter 192.168.100.188 does not exist
 ERROR: ClusterMon: parameter /root/monitor.sh does not exist
 
 
 Can I write my external agent as a simple dummy, something like: crm_mon -1|grep 
 Online\|OFFLINE ?
 My intention is that when HB is stopped on any node, or a node failure occurs, the script 
 should be triggered so that I can find out which nodes are OFFLINE.
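 
 (For illustration, a minimal external agent in that spirit, as an untested sketch
 using the CRM_notify_* environment variables documented in the crm_mon man page,
 could be:
 
     #!/bin/sh
     # /root/monitor.sh - append every event crm_mon reports to a log file
     echo "$(date) node=$CRM_notify_node rsc=$CRM_notify_rsc task=$CRM_notify_task desc=$CRM_notify_desc" >> /root/cluster-events.log
 
 Grepping crm_mon -1 for Online/OFFLINE also works, but it only reflects membership
 as the local partition currently sees it.)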
 
 Thanks
 Eswar
 
 
 
 On Tue, Dec 10, 2013 at 1:06 PM, Michael Schwartzkopff m...@sys4.de wrote:
 On Tuesday, 10 December 2013 at 12:19:25, ESWAR RAO wrote:
  Hi All,
 
  Can someone please let me know if there is a clean way to trigger a script from
  pacemaker when HB on a node has stopped or a node failure has occurred, if I run
  HB+pacemaker on a 3 node setup?
 
  Thanks
  Eswar
 
  On Mon, Dec 9, 2013 at 5:16 PM, ESWAR RAO eswar7...@gmail.com wrote:
   Hi All,
  
   I have a 3 node ( node1, node2, node3 ) setup on which HB+pacemaker runs.
   I have resources running on clone mode on node1 and node2.
  
   Is there any way to get a trigger when a node failure occurs, i.e., can I
   trigger a script if node3 fails (on which no resource runs)?
 
 Yes. Run an ocf:pacemaker:ClusterMon resource and read man crm_mon for the
 additional options to call a script.
 
 --
 Kind regards,
 
 Michael Schwartzkopff
 
 --
 [*] sys4 AG
 
 http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
 Franziskanerstraße 15, 81669 München
 
 Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
 Vorstand: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer
 Aufsichtsratsvorsitzender: Florian Kirstein
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] host came online but it was ignored

2014-01-07 Thread Andrew Beekhof

On 18 Dec 2013, at 4:23 pm, ESWAR RAO eswar7...@gmail.com wrote:

 Hi All,
 
 Can someone help me narrow down the problem?

I'd probably start with an upgrade.
There were some membership issues around about the time of 1.1.7, but they may 
have been corosync specific (I don't really test pacemaker with heartbeat 
anymore).

If you can reproduce with something more recent, I'd be happy to take a look at 
the logs.

 
 Thanks
 Eswar
 
 
 On Wed, Dec 11, 2013 at 9:35 AM, ESWAR RAO eswar7...@gmail.com wrote:
 Hi Andrew,
 
 # pacemakerd --version
 Pacemaker 1.1.7
 Written by Andrew Beekhof
 
 # ps -aef|grep heart
 root  8926 1  0 20:02 ?00:00:00 heartbeat: master control 
 process
 root  8930  8926  0 20:02 ?00:00:00 heartbeat: FIFO reader  
 root  8931  8926  0 20:02 ?00:00:00 heartbeat: write: bcast eth0
 root  8932  8926  0 20:02 ?00:00:00 heartbeat: read: bcast eth0 
 108   8936  8926  0 20:02 ?00:00:00 /usr/lib/heartbeat/ccm
 108   8937  8926  0 20:02 ?00:00:00 /usr/lib/heartbeat/cib
 root  8938  8926  0 20:02 ?00:00:00 /usr/lib/heartbeat/lrmd -r
 root  8939  8926  0 20:02 ?00:00:00 /usr/lib/heartbeat/stonithd
 108   8940  8926  0 20:02 ?00:00:00 /usr/lib/heartbeat/attrd
 108   8941  8926  0 20:02 ?00:00:00 /usr/lib/heartbeat/crmd
 
 In /etc/ha.d/ha.cf I am using the crm respawn directive.
 I am installing through apt-get install.
 
 Thanks
 Eswar
 
 
 
 
 On Wed, Dec 11, 2013 at 6:18 AM, Andrew Beekhof and...@beekhof.net wrote:
 version of pacemaker?
 
 On 10 Dec 2013, at 10:41 pm, ESWAR RAO eswar7...@gmail.com wrote:
 
  Hi Micheal,
 
  There are no firewall rules.
 
  I could only see below messages in logs:
 
  Dec 10 14:13:48 nvp-common crmd: [9220]: WARN: crmd_ha_msg_callback: 
  Ignoring HA message (op=join_announce) from nvsd-1: not in our membership 
  list (size=1)
 
 
 
  On Tue, Dec 10, 2013 at 2:46 PM, Michael Schwartzkopff m...@sys4.de wrote:
  On Tuesday, 10 December 2013 at 14:30:32, ESWAR RAO wrote:
   Hi All,
  
   I had a 3 node HB+pacemaker setup. When I restarted the nodes, all the
   nodes have HB restarted but they are not joining ONLINE group in crm
   status.
  
   #vim /etc/ha.d/ha.cf
   ..
   node nvsd-1 nvsd-2 nvp-common
   crm respawn
   
  
   Dec 10 14:13:48 nvp-common crmd: [9220]: WARN: crmd_ha_msg_callback:
   Ignoring HA message (op=join_announce) from nvsd-1: not in our membership
   list (size=1)
   Dec 10 14:14:07 nvp-common crmd: [9220]: WARN: crmd_ha_msg_callback:
   Ignoring HA message (op=vote) from nvsd-1: not in our membership list
   (size=1)
   Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.0
   - 0.3.1 not applied to 0.4.4: current epoch is greater than required
   Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.1
   - 0.3.2 not applied to 0.4.4: current epoch is greater than required
   Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.2
   - 0.3.3 not applied to 0.4.4: current epoch is greater than required
   Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.3
   - 0.3.4 not applied to 0.4.4: current epoch is greater than required
   Dec 10 14:14:13 nvp-common cib: [9216]: WARN: cib_process_replace:
   Replacement 0.3.4 not applied to 0.4.4: current epoch is greater than the
   replacement
   Dec 10 14:14:13 nvp-common cib: [9216]: WARN: cib_diff_notify: Update
   (client: crmd, call:13): -1.-1.-1 - 0.3.4 (Update was older than existing
   configuration)
  
   I am unable to understand this behaviour.
   Has someone already seen this issue???
  
   Thanks
   Eswar
 
  Firewall?
 
  Kind regards,
 
  Michael Schwartzkopff
 
  --
  [*] sys4 AG
 
  http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
  Franziskanerstraße 15, 81669 München
 
  Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
  Vorstand: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer
  Aufsichtsratsvorsitzender: Florian Kirstein
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http

Re: [Pacemaker] Question about node-action-limit and migration-limit

2014-01-07 Thread Andrew Beekhof

On 18 Dec 2013, at 9:51 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote:

 Hi,
 
 When I set only migration-limit, without setting node-action-limit, in
 pacemaker-1.1,
 the number of concurrent operations other than migrate_to/from was limited to
 the value of migration-limit.
 (The node that I used has 8 cores.)
 
 [cib]
 property \
  no-quorum-policy=freeze \
  stonith-enabled=true \
  startup-fencing=false \
  migration-limit=3
 
 ...snip...
 
 [log]
 $ egrep warning: cluster_option|debug: throttle_update: /var/log/ha-debug
 Dec 12 16:35:23 [7416] bl460g1n7   crmd:debug:
 throttle_update: Host bl460g1n6 supports a maximum of 16 jobs
 and throttle mode .  New job limit is 16
 Dec 12 16:35:25 [7416] bl460g1n7   crmd:  warning: cluster_option:
 Using deprecated name 'migration-limit' for cluster option
 'node-action-limit'
 Dec 12 16:35:25 [7416] bl460g1n7   crmd:  warning: cluster_option:
 Using deprecated name 'migration-limit' for cluster option
 'node-action-limit'
 Dec 12 16:35:25 [7416] bl460g1n7   crmd:  warning: cluster_option:
 Using deprecated name 'migration-limit' for cluster option
 'node-action-limit'
 Dec 12 16:35:26 [7416] bl460g1n7   crmd:debug:
 throttle_update: Host bl460g1n7 supports a maximum of 3 jobs
 and throttle mode .  New job limit is 3
 Dec 12 16:35:28 [7416] bl460g1n7   crmd:debug:
 throttle_update: Host bl460g1n8 supports a maximum of 3 jobs
 and throttle mode .  New job limit is 3
 
 $ egrep do_lrm_rsc_op: Performing .* op=prmVM|process_lrm_event: LRM
 operation prmVM /var/log/ha-log|grep -v monitor
 Dec 12 16:35:28 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op:
 Performing key=24:1:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12
 op=prmVM1_start_0
 Dec 12 16:35:28 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op:
 Performing key=26:1:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12
 op=prmVM2_start_0
 Dec 12 16:35:28 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op:
 Performing key=28:1:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12
 op=prmVM3_start_0
 Dec 12 16:35:30 bl460g1n7 crmd[7416]:   notice: process_lrm_event: LRM
 operation prmVM2_start_0 (call=27, rc=0, cib-update=23,
 confirmed=true) ok
 Dec 12 16:35:30 bl460g1n7 crmd[7416]:   notice: process_lrm_event: LRM
 operation prmVM1_start_0 (call=26, rc=0, cib-update=24,
 confirmed=true) ok
 Dec 12 16:35:30 bl460g1n7 crmd[7416]:   notice: process_lrm_event: LRM
 operation prmVM3_start_0 (call=28, rc=0, cib-update=25,
 confirmed=true) ok
 Dec 12 16:35:32 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op:
 Performing key=15:2:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12
 op=prmVM4_start_0
 Dec 12 16:35:32 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op:
 Performing key=17:2:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12
 op=prmVM5_start_0
 Dec 12 16:35:32 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op:
 Performing key=19:2:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12
 op=prmVM6_start_0
 Dec 12 16:35:34 bl460g1n7 crmd[7416]:   notice: process_lrm_event: LRM
 operation prmVM4_start_0 (call=32, rc=0, cib-update=29,
 confirmed=true) ok
 Dec 12 16:35:34 bl460g1n7 crmd[7416]:   notice: process_lrm_event: LRM
 operation prmVM5_start_0 (call=33, rc=0, cib-update=30,
 confirmed=true) ok
 Dec 12 16:35:34 bl460g1n7 crmd[7416]:   notice: process_lrm_event: LRM
 operation prmVM6_start_0 (call=34, rc=0, cib-update=31,
 confirmed=true) ok
 
 Dec 12 16:37:26 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op:
 Performing key=12:4:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12
 op=prmVM1_stop_0
 Dec 12 16:37:26 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op:
 Performing key=13:4:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12
 op=prmVM2_stop_0
 Dec 12 16:37:26 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op:
 Performing key=14:4:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12
 op=prmVM3_stop_0
 Dec 12 16:37:39 bl460g1n7 crmd[7416]:   notice: process_lrm_event: LRM
 operation prmVM1_stop_0 (call=39, rc=0, cib-update=35, confirmed=true)
 ok
 Dec 12 16:37:39 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op:
 Performing key=15:4:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12
 op=prmVM4_stop_0
 Dec 12 16:37:39 bl460g1n7 crmd[7416]:   notice: process_lrm_event: LRM
 operation prmVM2_stop_0 (call=41, rc=0, cib-update=36, confirmed=true)
 ok
 Dec 12 16:37:39 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op:
 Performing key=16:4:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12
 op=prmVM5_stop_0
 Dec 12 16:37:40 bl460g1n7 crmd[7416]:   notice: process_lrm_event: LRM
 operation prmVM3_stop_0 (call=43, rc=0, cib-update=37, confirmed=true)
 ok
 Dec 12 16:37:40 bl460g1n7 crmd[7416]: info: do_lrm_rsc_op:
 Performing key=17:4:0:12d6cc8e-9dd9-40b4-8a07-5e9639050d12
 op=prmVM6_stop_0
 Dec 12 16:37:51 bl460g1n7 crmd[7416]:   notice: process_lrm_event: LRM
 operation prmVM4_stop_0 (call=45, rc=0, cib-update=38, confirmed=true)
 ok
 Dec 12 16:37:52 bl460g1n7 crmd[7416]:   notice: process_lrm_event: LRM
 operation prmVM5_stop_0 (call=47, rc=0, cib-update=39, confirmed=true)
 ok
 Dec 12 16:37:52 bl460g1n7 

Re: [Pacemaker] Time to get ready for 1.1.11

2013-12-19 Thread Andrew Beekhof

On 20 Dec 2013, at 2:11 am, Andrew Martin amar...@xes-inc.com wrote:

 David/Andrew,
 
 Once 1.1.11 final is released, is it considered the new stable series of 
 Pacemaker,

yes

 or should 1.1.10 still be used in very stable/critical production 
 environments?
 
 Thanks,
 
 Andrew
 
 - Original Message -
 From: David Vossel dvos...@redhat.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Wednesday, December 11, 2013 3:33:46 PM
 Subject: Re: [Pacemaker] Time to get ready for 1.1.11
 
 - Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
 Sent: Wednesday, November 20, 2013 9:02:40 PM
 Subject: [Pacemaker] Time to get ready for 1.1.11
 
  With over 400 updates since the release of 1.1.10, it's time to start
 thinking about a new release.
 
 Today I have tagged release candidate 1[1].
 The most notable fixes include:
 
   + attrd: Implementation of a truly atomic attrd for use with corosync
  2.x
  + cib: Allow values to be added/updated and removed in a single update
  + cib: Support XML comments in diffs
  + Core: Allow blackbox logging to be disabled with SIGUSR2
  + crmd: Do not block on proxied calls from pacemaker_remoted
  + crmd: Enable cluster-wide throttling when the cib heavily exceeds its
  target load
  + crmd: Use the load on our peers to know how many jobs to send them
  + crm_mon: add --hide-headers option to hide all headers
  + crm_report: Collect logs directly from journald if available
  + Fencing: On timeout, clean up the agent's entire process group
  + Fencing: Support agents that need the host to be unfenced at startup
  + ipc: Raise the default buffer size to 128k
  + PE: Add a special attribute for distinguishing between real nodes and
  containers in constraint rules
  + PE: Allow location constraints to take a regex pattern to match against
  resource IDs
  + pengine: Distinguish between the agent being missing and something the
  agent needs being missing
  + remote: Properly version the remote connection protocol
  + services: Detect missing agents and permission errors before forking
  + Bug cl#5171 - pengine: Don't prevent clones from running due to
   dependent
  resources
  + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is
  not already known
  + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as
  unsigned integers
 
 If you are a user of `pacemaker_remoted`, you should take the time to read
 about changes to the online wire protocol[2] that are present in this
 release.
 
 [1] https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1
 [2]
 http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/
 
 To build `rpm` packages for testing:
 
 1. Clone the current sources:
 
   # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
   # cd pacemaker
 
  2. If you haven't already, install Pacemaker's dependencies
 
   [Fedora] # sudo yum install -y yum-utils
   [ALL] # make rpm-dep
 
  3. Build Pacemaker
 
   # make rc
 
  4. Copy the rpms and deploy as needed
 
 
 A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing.
 https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2
 
 Assuming no major regressions are encountered during testing, this tag will
 become the final Pacemaker-1.1.11 release a week from today.
 
 -- Vossel
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker and RHEL/CENTOS 5.x compatibility ?

2013-12-19 Thread Andrew Beekhof

On 20 Dec 2013, at 1:36 am, Stephane Robin sro...@kivasystems.com wrote:

 Hi, 
 
 This is a follow up on my previous post 'Trouble building Pacemaker from 
 source on CentOS 5.10'
 Andrew: Thanks for your pointers.
 
 It turns out Pacemaker 1.1.10 needed more changes to build on CentOS 5.x.
   • revert of a81d222
   • g_timeout_add_seconds not available in libc in 
 lib/services/services_linux.c
   • qb_to_cs_error conflicting type definition in include/crm_internal.h
   • Configure with --disable-fatal-warnings
 This brings to my question:
 Pacemaker 1.1.10 was already broken for this OS, and I'm assuming that 1.1.11 
 will diverge even further.
 
  What is the official position in regard to RHEL/CentOS 5.x support and testing?

There's no conscious effort to break RHEL5, it's just not a focus for the 
developers.
So we rely on reports like yours to tell us when something breaks - and if 
anyone cares.

All the above seem pretty easily resolvable and we'll happily include them for 
1.1.11 (hint, test the latest .11 beta to make sure there are no others :)

  Are there any other people who cannot yet afford to move to RHEL 6 (for 
  whatever reason) and are interested in keeping RHEL/CentOS 5.x compatibility?

If there are, they don't seem interested in upgrading.

Also, for what it's worth, pacemaker is now supported on RHEL6.  Perhaps that 
adds incentive to update :)


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] question on on-fail=restart

2013-12-18 Thread Andrew Beekhof

On 19 Dec 2013, at 4:03 am, Brusq, Jerome jerome.br...@signalis.com wrote:

 Dear all,
  
 I have a custom LSB script that launches a custom process.
 
 primitive myscript lsb:ha_swift \
op start interval=0 timeout=30s \
op stop interval=0 timeout=30s \
op monitor interval=15s on-fail=restart \
  
 When this one crashes, I have the feeling that pacemaker does:
 “/etc/init.d/myscript stop”
 Then,
 “/etc/init.d/myscript start”
  
 Is there a way for pacemaker to do “/etc/init.d/myscript restart”? (Of course 
 my “restart” option does something a little bit different than a stop + 
 start in my script ….)

No, sorry

  
 Thanks!
  
 Jerome
  
  
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Trouble building Pacemaker from source on CentOS 5.10

2013-12-18 Thread Andrew Beekhof

On 14 Dec 2013, at 7:51 am, Stephane Robin sro...@kivasystems.com wrote:

 Hi, 
 
 I'm trying to build Pacemaker-1.1.10 (from git), with corosync 2.3.2 and 
 libqb 0.16.0 on a CentOS 5.10 64b system.
 I have the latest autotools (automake 1.14, autoconf 2.69, libtool 2.4, 
 pkg-config 0.27.1)
 
 For Pacemaker, I'm doing:
 ./autogen.sh
 ./configure --with-corosync --with-cs-quorum  --without-snmp --without-nagios 
 --enable-upstart=no
 
 pacemaker configuration:
   Version  = 1.1.10 (Build: 368c726)
   Features = libqb-logging libqb-ipc  corosync-native
 
   Prefix   = /usr
   Executables  = /usr/sbin
   Man pages= /usr/share/man
   Libraries= /usr/lib64
   Header files = /usr/include
   Arch-independent files   = /usr/share
   State information= /var
   System configuration = /etc
   Corosync Plugins = /usr/lib64
 
   Use system LTDL  = yes
 
   HA group name= haclient
   HA user name = hacluster
 
   CFLAGS   = -g -O2 -I/usr/include -I/usr/include/heartbeat   
-ggdb  -fgnu89-inline -fstack-protector-all -Wall -Waggregate-return 
 -Wbad-function-cast -Wcast-align -Wdeclaration-after-statement -Wendif-labels 
 -Wfloat-equal -Wformat=2  -Wformat-security -Wformat-nonliteral 
 -Wmissing-prototypes -Wmissing-declarations -Wnested-externs -Wno-long-long 
 -Wno-strict-aliasing -Wpointer-arith -Wstrict-prototypes -Wwrite-strings 
 -Werror
   Libraries= -lcorosync_common -lqb -lbz2 -lxslt -lxml2 -lc 
 -luuid -lrt -ldl  -L/lib64 -lglib-2.0  -lltdl -L/usr/lib64 -lqb -ldl -lrt 
 -lpthread
   Stack Libraries  =   -L/usr/lib64 -lqb -ldl -lrt -lpthread  
 -L/usr/lib64 -lcpg  -L/usr/lib64 -lcfg  -L/usr/lib64 -lcmap  -L/usr/lib64 
 -lquorum
 
 
 But I'm getting this error:
 gmake[2]: Entering directory `/root/Cluster3/pacemaker/lib/common'
   CC   ipc.lo
 cc1: warnings being treated as errors
 In file included from ../../include/crm_internal.h:26,
  from ipc.c:19:
 ../../include/portability.h: In function 'g_strcmp0':
 ../../include/portability.h:165: warning: implicit declaration of function 
 'strcmp'
 gmake[2]: *** [ipc.lo] Error 1
 gmake[2]: Leaving directory `/root/Cluster3/pacemaker/lib/common'
 
 
 If I disable warnings-as-errors, I can get past this problem (and several 
 similar ones), but then the build fails on:
 
 services_linux.c:33:26: error: sys/signalfd.h: No such file or directory
 services_linux.c: In function 'services_os_action_execute':
 services_linux.c:436: warning: implicit declaration of function 'signalfd'
 services_linux.c:436: warning: nested extern declaration of 'signalfd'
 services_linux.c:468: error: storage size of 'fdsi' isn't known
 services_linux.c:471: error: invalid application of 'sizeof' to incomplete 
 type 'struct signalfd_siginfo'
 services_linux.c:472: error: invalid application of 'sizeof' to incomplete 
 type 'struct signalfd_siginfo'
 services_linux.c:468: warning: unused variable 'fdsi'
 
 
 What am I missing here?
 Is there a page somewhere with RHEL-5/CentOS-5 prerequisites or special 
 instructions for building from source?

RHEL5 is too old by the looks of it.
In this case, you could revert 
https://github.com/beekhof/pacemaker/commit/a81d222e which introduced the use 
of signalfd.
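
In practice that is roughly (assuming a git checkout of the 1.1.10 sources, and
resolving any small conflicts by hand):

    # git revert a81d222e
    # ./autogen.sh
    # ./configure --with-corosync   # plus whatever other options you normally use
    # make

after which the build falls back to the pre-signalfd child-handling code.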


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] host came online but it was ignored

2013-12-10 Thread Andrew Beekhof
version of pacemaker?

On 10 Dec 2013, at 10:41 pm, ESWAR RAO eswar7...@gmail.com wrote:

 Hi Micheal,
 
 There are no firewall rules.
 
 I could only see below messages in logs:
 
 Dec 10 14:13:48 nvp-common crmd: [9220]: WARN: crmd_ha_msg_callback: Ignoring 
 HA message (op=join_announce) from nvsd-1: not in our membership list (size=1)
 
 
 
 On Tue, Dec 10, 2013 at 2:46 PM, Michael Schwartzkopff m...@sys4.de wrote:
 On Tuesday, 10 December 2013 at 14:30:32, ESWAR RAO wrote:
  Hi All,
 
  I had a 3 node HB+pacemaker setup. When I restarted the nodes, all the
  nodes have HB restarted but they are not joining ONLINE group in crm
  status.
 
  #vim /etc/ha.d/ha.cf
  ..
  node nvsd-1 nvsd-2 nvp-common
  crm respawn
  
 
  Dec 10 14:13:48 nvp-common crmd: [9220]: WARN: crmd_ha_msg_callback:
  Ignoring HA message (op=join_announce) from nvsd-1: not in our membership
  list (size=1)
  Dec 10 14:14:07 nvp-common crmd: [9220]: WARN: crmd_ha_msg_callback:
  Ignoring HA message (op=vote) from nvsd-1: not in our membership list
  (size=1)
  Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.0
  - 0.3.1 not applied to 0.4.4: current epoch is greater than required
  Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.1
  - 0.3.2 not applied to 0.4.4: current epoch is greater than required
  Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.2
  - 0.3.3 not applied to 0.4.4: current epoch is greater than required
  Dec 10 14:14:12 nvp-common cib: [9216]: WARN: cib_process_diff: Diff 0.3.3
  - 0.3.4 not applied to 0.4.4: current epoch is greater than required
  Dec 10 14:14:13 nvp-common cib: [9216]: WARN: cib_process_replace:
  Replacement 0.3.4 not applied to 0.4.4: current epoch is greater than the
  replacement
  Dec 10 14:14:13 nvp-common cib: [9216]: WARN: cib_diff_notify: Update
  (client: crmd, call:13): -1.-1.-1 - 0.3.4 (Update was older than existing
  configuration)
 
  I am unable to understand this behaviour.
  Has someone already seen this issue???
 
  Thanks
  Eswar
 
 Firewall?
 
 Kind regards,
 
 Michael Schwartzkopff
 
 --
 [*] sys4 AG
 
 http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
 Franziskanerstraße 15, 81669 München
 
 Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
 Vorstand: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer
 Aufsichtsratsvorsitzender: Florian Kirstein
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] is ccs as racy as it feels?

2013-12-10 Thread Andrew Beekhof

On 10 Dec 2013, at 11:31 pm, Brian J. Murrell br...@interlinx.bc.ca wrote:

 On Tue, 2013-12-10 at 10:27 +, Christine Caulfield wrote: 
 
 Sadly you're not wrong.
 
 That's what I was afraid of.
 
 But it's actually no worse than updating 
 corosync.conf manually,
 
 I think it is...
 
 in fact it's pretty much the same thing,
 
 Not really.  Updating corosync.conf on any given node means only having
 to write that file on that node.  There is no cluster-wide
 synchronization needed

Approximately speaking, cman takes cluster.conf and generates an in-memory 
corosync.conf equivalent to be passed to corosync.
So anything that could be done by editing corosync.conf should be possible with 
'ccs -f ...'; neither command results in any synchronisation or automatic 
update into the running process.
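
For example, adding a node to the offline copy of the configuration is roughly:

    # ccs -f /etc/cluster/cluster.conf --addnode newnode.example.com

(the hostname is made up here), and the resulting file still has to be copied to,
or regenerated on, every member, which is exactly the synchronisation step being
discussed below.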

 and therefore no last-write-wins race so all
 nodes can do that in parallel.  Plus adding a new node means only having
 to update the corosync.conf on that new node (and starting up corosync
 of course) and corosync then does the job of telling its peers about
 the new node rather than having to have the administrator go out and
 touch every node to inform them of the new member.

It sounds like this thread is less about cluster.conf vs. corosync.conf and 
more about autodiscovery vs. fixed node lists.

Chrissie: is there no way to use cman in autodiscovery mode (ie. with 
multicast/broadcast and learning about peers as they appear)?

 
 It's this removal of node auto-discovery and changing it to an operator
 task that is really complicating the workflow.  Granted, it's not so
 much complicating it for a human operator who is naturally only
 single-threaded and mostly incapable of inducing the last-write-wins
 races.
 
 But when you are writing tools that now have to take what used to be a
 very capable multithreaded task, free of races and shove it down a
 single-threaded pipe/queue just to eliminate races, this is a huge step
 backwards in evolution.
 
 so 
 nothing is actually getting worse.
 
 It is though.  See above.
 
 All the CIB information is still 
 properly replicated.
 
 Yeah.  I understand/understood that.  Pacemaker's actual operations go
 mostly unchanged.  It's the cluster membership process that's gotten
 needlessly complicated and regressed in functionality.
 
 The main difficulty is in safely replicating information that's needed 
 to boot the system.
 
  Do you literally mean starting the system up?  I guess the use-case you
 are describing here is booting nodes from a clustered filesystem?  But
 what if you don't need that complication?  This process is being made
 more complicated to satisfy only a subset of the use-cases.
 
 In general use we've not found it to be a huge problem (though, I'm 
 still not keen on it either TBH) because most management is done by one 
 person from one node.
 
 Indeed.  As I said above, WRT to single-threaded operators.  But when
 you are writing a management system on top of all of this, which
 naturally wants to be multi-threaded (because scalable systems avoid
 bottlenecking through single choke points) and was able to be
 multithreaded when it was just corosync.conf, having to choke everything
 back down into a single thread just sucks.
 
 There is not really any concept of nodes trying to 
 add themselves to a cluster, it needs to be done by a person - which 
  may be what you're unhappy with.
 
  Yes, not so much adding themselves as being allowed to be added, in parallel,
  without fear of racing.
 
 This ccs tool wouldn't be so bad if it operated more like the CIB where
 modifications were replicated automatically and properly locked so that
 modifications could be made anywhere on the cluster and all members got
 those modifications automatically rather than pushing off the work of
 locking, replication and serialization off onto the caller.
 
 b.
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Where the heck is Beekhof?

2013-12-01 Thread Andrew Beekhof
Thanks to everyone for the well wishes, and yes this is our third so we
can get quorum now ;-)





On Sun, Dec 1, 2013, at 01:51 PM, Serge Dubrouski wrote:

  Nope, you need three to always have a quorum.

On Dec 1, 2013 9:43 AM, Arnold Krille [1]arn...@arnoldarts.de
wrote:

On Thu, 28 Nov 2013 12:04:01 +1100 Andrew Beekhof
[2]and...@beekhof.net

wrote:

 If you find yourself asking $subject at some point in the next couple

 of months, the answer is that I'm taking leave to look after our new

 son (Lawson Tiberius Beekhof) who was born on Tuesday.



Concrats!



And remember: If you want HA, you gotta have two :-P



- Arnold



___

Pacemaker mailing list: [3]Pacemaker@oss.clusterlabs.org

[4]http://oss.clusterlabs.org/mailman/listinfo/pacemaker



Project Home: [5]http://www.clusterlabs.org

Getting started:
[6]http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

Bugs: [7]http://bugs.clusterlabs.org



___

Pacemaker mailing list: [8]Pacemaker@oss.clusterlabs.org

[9]http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Project Home: [10]http://www.clusterlabs.org

Getting started:
[11]http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

Bugs: [12]http://bugs.clusterlabs.org

References

1. mailto:arn...@arnoldarts.de
2. mailto:and...@beekhof.net
3. mailto:Pacemaker@oss.clusterlabs.org
4. http://oss.clusterlabs.org/mailman/listinfo/pacemaker
5. http://www.clusterlabs.org/
6. http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
7. http://bugs.clusterlabs.org/
8. mailto:Pacemaker@oss.clusterlabs.org
9. http://oss.clusterlabs.org/mailman/listinfo/pacemaker
  10. http://www.clusterlabs.org/
  11. http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  12. http://bugs.clusterlabs.org/
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] no-quorum-policy=freeze

2013-12-01 Thread Andrew Beekhof



On Wed, Nov 27, 2013, at 04:50 AM, Olivier Nicaise wrote:

Hello all,

I have an issue with the no quorum policy freeze (stonith disabled).
I'm using an old version of pacemaker (1.1.6), the one distributed by
Ubuntu 12.04.

I have a cluster with 3 nodes running various resources, including
drbd. This night, my hosting provider did a maintenance on its network,
and all the machines were disconnected.

What happened is that every machine promoted the DRBD resources and
started the resources on top of them. Using no-quorum-policy=freeze,
I was not expecting that.


I wouldn't expect that either.


The network came back 1 or 2 minutes later but the damage was already
done. All my drbd resources were in a split brain situation.

Here are my options:
property $id=cib-bootstrap-options \
dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \
cluster-infrastructure=openais \
expected-quorum-votes=3 \
stonith-enabled=false \
last-lrm-refresh=1385395648 \
no-quorum-policy=freeze

rsc_defaults $id=rsc-options \
resource-stickiness=1000

Do you know if there is a bug with this 1.1.6 version


Almost certainly, we're up to 1.1.10 (nearly .11) now.

If you attach this file:

Nov 27 01:18:41 vs001 crmd: [1016]: info: do_te_invoke: Processing graph 0 (ref=
pe_calc-dc-1385511521-83) derived from /var/lib/pengine/pe-input-297.bz2



We can see if a newer version would help.
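
As an aside, that file can also be replayed locally against a newer Pacemaker with
something like:

    # crm_simulate -S -x /var/lib/pengine/pe-input-297.bz2

which prints the cluster state captured in that snapshot and the transition the
policy engine would compute from it (option spellings as in recent crm_simulate
builds; older ones may differ).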




 or am I missing something?

Logs are available at [1]http://pastebin.com/Jxgq3MRH
I know there is an issue with the cinder-HDD1 resources. It is not yet
correctly configured
___
Pacemaker mailing list: [2]Pacemaker@oss.clusterlabs.org

[3]http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Project Home: [4]http://www.clusterlabs.org

Getting started:
[5]http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

Bugs: [6]http://bugs.clusterlabs.org

References

1. http://pastebin.com/Jxgq3MRH
2. mailto:Pacemaker@oss.clusterlabs.org
3. http://oss.clusterlabs.org/mailman/listinfo/pacemaker
4. http://www.clusterlabs.org/
5. http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
6. http://bugs.clusterlabs.org/
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] p_mysql peration monitor failed 'not installed'

2013-11-21 Thread Andrew Beekhof

On 22 Nov 2013, at 7:32 am, Miha m...@softnet.si wrote:

 HI,
 
 what could be a reason for this error:
 
 notice: unpack_rsc_op: Preventing p_mysql from re-starting
 on sip2: operation monitor failed 'not installed' (rc=5)

The agent, or something the agent needs, is not available.
How did you configure p_mysql?
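
A quick way to narrow down an rc=5 (not installed) on each node is to confirm that
the agent and whatever it calls are actually present, for instance (paths assume
the stock heartbeat agents):

    # ls /usr/lib/ocf/resource.d/heartbeat/mysql
    # which mysqld mysqld_safe

and to check that any binary, config or datadir paths passed as resource parameters
exist on both sip1 and sip2.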

 
 
  p_mysql_monitor_0 on sip2 'not installed' (5): call=22,
 status=complete, last-rc-change='Thu Nov 21 15:27:01 2013',
 queued=33ms, exec=0ms
 
p_mysql_monitor_0 on sip1 'not installed' (5): call=22,
 status=complete, last-rc-change='Thu Nov 21 15:26:52 2013',
 queued=36ms, exec=0ms
 
 Mysql is running,

How?  You shouldn't be starting cluster resources outside of the cluster

 the drive is mounted, and there are
 no errors in mysql.
 
 tnx!
 miha
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] exit code crm_attibute

2013-11-21 Thread Andrew Beekhof

On 22 Nov 2013, at 12:33 am, Andrey Groshev gre...@yandex.ru wrote:

 Hi, Andrew!
 
 I'm trying to find the source of my problems. 
 This trouble exists only with --query.
 I studied crm_attribute.c.
 IMHO, when we call
 rc = read_attr_delegate(the_cib, type, dest_node, set_type, set_name,
 attr_id, attr_name, read_value, TRUE, NULL);
 
 I think that dest_node == NULL.
 
 Because the following piece of code ignores the return value:
 
 238    if (pcmk_ok != query_node_uuid(the_cib, dest_uname, dest_node, is_remote_node)) {
 239        fprintf(stderr, "Could not map name=%s to a UUID\n", dest_uname);
 240    }
 
 Maybe it should look like this?

Agreed. That does look better.

   https://github.com/beekhof/pacemaker/commit/a4bdc9a

 238    rc = query_node_uuid(the_cib, dest_uname, dest_node, is_remote_node);
 239    if (rc != pcmk_ok) {
 240        fprintf(stderr, "Could not map name=%s to a UUID\n", dest_uname);
 241        return crm_exit(rc);
 242    }
 
 
 19.11.2013, 16:12, Andrey Groshev gre...@yandex.ru:
 Hellow Andrew!
 
 I'm sorry, forgot about this thread, and now again came across the same 
 problem.
 # crm_attribute --type nodes --node-uname fackename.node.org --attr-name 
 notexistattibute --query  /dev/null; echo $?
 Could not map name=fackename.node.org to a UUID
 0
 
 Version PCMK 1.1.11
 
 23.09.2013, 08:23, Andrew Beekhof and...@beekhof.net:
 
  On 20/09/2013, at 5:53 PM, Andrey Groshev gre...@yandex.ru wrote:
   Hi again!
 
    Today I again ran into a strange behavior.
   I asked for a non-existent attribute of an existing node.
 
   # crm_attribute --type nodes --node-uname exist.node.domain.com 
 --attr-name notexistattibute --query  ; echo $?
   Could not map name=dev-cluster2-node2.unix.tensor.ru to a UUID
   0
 
    That is, it complained to STDERR, but the exit code was 0.
  That probably shouldn't happen.  Version?
   ___
   Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
   http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
   Project Home: http://www.clusterlabs.org
   Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
   Bugs: http://bugs.clusterlabs.org
  ,
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CentOS 6.4 last update - Failed to create cluster resources with pcs command

2013-11-21 Thread Andrew Beekhof

On 22 Nov 2013, at 4:15 am, Dmitry Bron dmitr...@gmail.com wrote:

 Hi All,
 
 We have two freshly installed boxes with CentOS 6.4 with the latest updates, which 
 we want to configure as Active - Standby in an HA cluster.
 We copied all configuration files from another HA cluster that works well. We 
 already have another pair of CentOS 6.4 boxes configured as an Active - 
 Standby HA cluster; however, we didn't install the latest updates on those machines. 
 Of course we changed the IP addresses for the cluster nodes and for multicast in 
 the cluster.conf file and restarted the pacemaker service. Then, when we tried to 
 create cluster resources, it failed. We tried to create the cluster resources 
 many times and every time it failed in a different place: once it failed 
 at creation of the ClusterIP resource, another time at the colocation constraint, 
 and so on. The problem doesn't happen on the other HA cluster where we haven't 
 applied the updates.
 
 We found a segmentation fault in /var/log/messages at the time when we tried to 
 create and configure resources:

Please install the debuginfo packages and run crm_report.  We need information 
contained in the core file.
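
On CentOS that is roughly:

    # debuginfo-install pacemaker corosync
    # crm_report --from "2013-11-21 13:00:00"

(the timestamp is only an example; pick a point shortly before the segfault), then
attach the resulting tarball.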

 ===
 Nov 21 13:15:27 ha-test1 cibadmin[6466]:   notice: crm_log_args: Invoked: 
 /usr/sbin/cibadmin -Q --xpath //primitive[@id='TssAgent']
 Nov 21 13:15:27 ha-test1 cibadmin[6467]:   notice: crm_log_args: Invoked: 
 /usr/sbin/cibadmin -Q --xpath //primitive[@id='ClusterIP']
 Nov 21 13:15:27 ha-test1 cibadmin[6468]:   notice: crm_log_args: Invoked: 
 /usr/sbin/cibadmin -Q --xpath //constraints
 Nov 21 13:15:27 ha-test1 cibadmin[6469]:   notice: crm_log_args: Invoked: 
 /usr/sbin/cibadmin -c -R --xml-text constraintsrsc_colocation 
 id=colocation-TssAgent-ClusterIP-INFINITY rsc=TssAgent score=INFINITY 
 with-rsc=ClusterIP/
 Nov 21 13:15:27 ha-test1 cib[6066]:   notice: cib:diff: Diff: --- 0.328.3
 Nov 21 13:15:27 ha-test1 cib[6066]:   notice: cib:diff: Diff: +++ 0.329.1 
 09b3edecab4ee85ea2b04e70cefef472
 Nov 21 13:15:27 ha-test1 cib[6066]:   notice: cib:diff: -- cib 
 admin_epoch=0 epoch=328 num_updates=3 /
 Nov 21 13:15:27 ha-test1 cib[6066]:   notice: cib:diff: ++   
 rsc_colocation id=colocation-TssAgent-ClusterIP-INFINITY rsc=TssAgent 
 score=INFINITY with-rsc=ClusterIP /
 Nov 21 13:15:27 ha-test1 crmd[6069]:   notice: do_state_transition: State 
 transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
 origin=abort_transition_graph ]
 Nov 21 13:15:27 ha-test1 crmd[6069]:   notice: do_state_transition: State 
 transition S_ELECTION - S_INTEGRATION [ input=I_ELECTION_DC 
 cause=C_FSA_INTERNAL origin=do_election_check ]
 Nov 21 13:15:27 ha-test1 kernel: cib[6066]: segfault at 1 ip 7fa9a99f50ec 
 sp 7fff875032d0 error 4 in libc-2.12.so[7fa9a99ad000+18a000]
 Nov 21 13:15:27 ha-test1 pacemakerd[10460]:   notice: pcmk_child_exit: Child 
 process cib terminated with signal 11 (pid=6066, core=128)
 Nov 21 13:15:27 ha-test1 pacemakerd[10460]:   notice: pcmk_process_exit: 
 Respawning failed child process: cib
 ===
 
 Please see in the attachment the call traces, captured with strace, of the 
 'crm_attribute --type op_defaults --attr-name timeout --attr-value 300s' 
 command. This command failed.
 The cman, pcs and clusterlib RPMs were updated; please see the list of all 
 updated RPMs in the attachment.
 
 The operating system updates are very important for us and we endeavor always 
 to keep our systems up to date.
 
 Thanks for your help,
 Dima
 crm_attribute-strace_debug_log.txt
 CentOS_6.4-updated_rpms.txt
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker update crash my config (cannot be represented in the CLI notation)

2013-11-20 Thread Andrew Beekhof

On 21 Nov 2013, at 6:08 am, Lars Marowsky-Bree l...@suse.com wrote:

 On 2013-11-20T16:43:51, Beo Banks beo.ba...@googlemail.com wrote:
 
 INFO: object cli-prefer-mysql cannot be represented in the CLI notation
 
 
 crm configure show | grep xml
 INFO: object cli-prefer-mysql cannot be represented in the CLI notation
 xml rsc_location id=cli-prefer-mysql node=hostname role=Started
 rsc=mysql score=INFINITY/
 
 This does not mean your configuration is invalid, just that the crm
 shell version encountered a XML countruct it can't render, hence it
 displays it as XML.
 
 You need to upgrade crm shell too.
 
 
 The errors you show below are unrelated to this.

Yep. 
Beo: can you provide a crm_report starting from the time that both nodes were 
started?




[Pacemaker] Time to get ready for 1.1.11

2013-11-20 Thread Andrew Beekhof
With over 400 updates since the release of 1.1.10, it's time to start
thinking about a new release.

Today I have tagged release candidate 1[1].
The most notable fixes include:

  + attrd: Implementation of a truly atomic attrd for use with corosync 2.x
  + cib: Allow values to be added/updated and removed in a single update
  + cib: Support XML comments in diffs
  + Core: Allow blackbox logging to be disabled with SIGUSR2
  + crmd: Do not block on proxied calls from pacemaker_remoted
  + crmd: Enable cluster-wide throttling when the cib heavily exceeds its 
target load
  + crmd: Use the load on our peers to know how many jobs to send them
  + crm_mon: add --hide-headers option to hide all headers
  + crm_report: Collect logs directly from journald if available
  + Fencing: On timeout, clean up the agent's entire process group
  + Fencing: Support agents that need the host to be unfenced at startup
  + ipc: Raise the default buffer size to 128k
  + PE: Add a special attribute for distinguishing between real nodes and 
containers in constraint rules
  + PE: Allow location constraints to take a regex pattern to match against 
resource IDs
  + pengine: Distinguish between the agent being missing and something the 
agent needs being missing
  + remote: Properly version the remote connection protocol
  + services: Detect missing agents and permission errors before forking
  + Bug cl#5171 - pengine: Don't prevent clones from running due to dependent 
resources
  + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not 
already known
  + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned 
integers

If you are a user of `pacemaker_remoted`, you should take the time to read 
about changes to the online wire protocol[2] that are present in this release.

[1] https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1
[2] http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/


To build `rpm` packages for testing:

1. Clone the current sources:

   # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
   # cd pacemaker

1. If you haven't already, install Pacemaker's dependencies

   [Fedora] # sudo yum install -y yum-utils
   [ALL]    # make rpm-dep

1. Build Pacemaker

   # make rc

1. Copy the rpms and deploy as needed
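
   For example, on a test node something like this should work (a sketch; 
   adjust the glob to wherever the build wrote the rpms):

   # sudo yum localinstall -y pacemaker-*.rpm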





Re: [Pacemaker] stonith ra class missing

2013-11-19 Thread Andrew Beekhof

On 19 Nov 2013, at 4:19 pm, Michael Schwartzkopff m...@sys4.de wrote:

 
 
 
 
 Andrew Beekhof and...@beekhof.net schrieb:
 
 On 19 Nov 2013, at 1:23 am, Michael Schwartzkopff m...@sys4.de wrote:
 
 Hi,
 
 I installed pacemaker on a RHEL 6.4 machine. Now crm tells me that
 there is no 
 stonith ra class, only lsb, ocf and service.
 
 What did I miss? thanks for any valuable comments.
 
 did you install the fencing-agents package?
 
 
 Yes, of course.

Not everyone does :-)
What does 'stonith_admin -I' say?
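
If that also comes up empty, a quick sanity check would be something like this 
(a sketch; the package and path below are the usual RHEL 6 ones):

   # rpm -q fence-agents
   # ls /usr/sbin/fence_* | head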

 
 
 
 -- 
 Mit freundlichen Grüßen,
 
 Michael Schwartzkopff
 
 -- 
 [*] sys4 AG
 
 http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
 Franziskanerstraße 15, 81669 München
 
 Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
 Vorstand: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer
 Aufsichtsratsvorsitzender: Florian
 Kirstein___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 -- 
 Diese Nachricht wurde von meinem Mobiltelefon mit Kaiten Mail gesendet.





Re: [Pacemaker] Finally. A REAL question.

2013-11-18 Thread Andrew Beekhof

On 18 Nov 2013, at 3:30 pm, Rob Thomas xro...@gmail.com wrote:

 I've been browsing through the cluster.log, and it's not even trying
 to move httpd.  I'm almost certain that it used to work fine with
 resource sets. Hmm.
 
 OK. I went and -actually looked- at the CIB I was previously generating.
 
 This works:
 
  <rsc_colocation id="freepbx" score="INFINITY">
    <resource_set id="colo-freepbx-0">
      <resource_ref id="asterisk"/>
    </resource_set>
    <resource_set id="colo-freepbx-1">
      <resource_ref id="httpd"/>
    </resource_set>
    <resource_set id="colo-freepbx-2" role="Master">
      <resource_ref id="ms-asterisk"/>
    </resource_set>
    <resource_set id="colo-freepbx-3" role="Master">
      <resource_ref id="ms-httpd"/>
    </resource_set>
  </rsc_colocation>

my eyes! my eyes!

 
 It appears that pcs can't do that, or if it's possible to, I can't
 figure out how.
 
 Which is why I've been fighting with it all day!
 
 Is this a feature request, or a PEBCAK issue?
 
 --Rob
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org





Re: [Pacemaker] stonith ra class missing

2013-11-18 Thread Andrew Beekhof

On 19 Nov 2013, at 1:23 am, Michael Schwartzkopff m...@sys4.de wrote:

 Hi,
 
 I installed pacemaker on a RHEL 6.4 machine. Now crm tells me that there is 
 no 
 stonith ra class, only lsb, ocf and service.
 
 What did I miss? thanks for any valuable comments.

did you install the fencing-agents package?

 
 
 -- 
 Mit freundlichen Grüßen,
 
 Michael Schwartzkopff
 
 -- 
 [*] sys4 AG
 
 http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
 Franziskanerstraße 15, 81669 München
 
 Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
 Vorstand: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer
 Aufsichtsratsvorsitzender: Florian 
 Kirstein___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org





Re: [Pacemaker] Finally. A REAL question.

2013-11-18 Thread Andrew Beekhof

On 19 Nov 2013, at 6:00 am, Rob Thomas xro...@gmail.com wrote:

 On Mon, Nov 18, 2013 at 9:17 PM, Andrew Beekhof and...@beekhof.net wrote:
 
 my eyes! my eyes!
 
 So... What's the -right- way to do it then? 8)


http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-sets-collocation.html

  <rsc_colocation id="pcs_rsc_colocation">
    <resource_set id="pcs_rsc_set">
      <resource_ref id="httpd"/>
      <resource_ref id="asterisk"/>
    </resource_set>
  </rsc_colocation>

is almost right, but misses score=INFINITY in the rsc_colocation tag.
You can do that with:

   pcs constraint colocation set httpd asterisk setoptions score=INFINITY

Note that this is very different to the command I asked you about:

   pcs constraint colocation add asterisk with httpd

Which creates something more like:

   <rsc_colocation id="pcs_rsc_colocation" rsc="asterisk" with-rsc="httpd" 
score="INFINITY"/>
   
See 
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_mandatory_placement.html
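
Either way, it's worth double-checking what actually landed in the CIB 
afterwards, e.g. (a rough sketch):

   # cibadmin -Q --xpath //constraints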

 
 --Rob
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org






Re: [Pacemaker] Finally. A REAL question.

2013-11-18 Thread Andrew Beekhof

On 19 Nov 2013, at 10:30 am, Rob Thomas xro...@gmail.com wrote:

 On Tue, Nov 19, 2013 at 8:55 AM, Andrew Beekhof and...@beekhof.net wrote:
 
 On 19 Nov 2013, at 6:00 am, Rob Thomas xro...@gmail.com wrote:
 So... What's the -right- way to do it then? 8)
 
  <rsc_colocation id="pcs_rsc_colocation">
    <resource_set id="pcs_rsc_set">
      <resource_ref id="httpd"/>
      <resource_ref id="asterisk"/>
    </resource_set>
  </rsc_colocation>
 
 is almost right, but misses score=INFINITY in the rsc_colocation tag.
 
 Ah. And that would have been why it didn't shut down httpd. When I
 tried that, I wasn't looking at the raw XML to see what pcs was
 actually doing.
 
 Note that this is very different to the command I asked you about:
 
   pcs constraint colocation add asterisk with httpd
 
 Which creates something more like:
 
   <rsc_colocation id="pcs_rsc_colocation" rsc="asterisk" with-rsc="httpd" 
 score="INFINITY"/>
 
 Yep. I replied to that earlier.

Oh, you mean you tried that way too?
It wasn't clear because the xml only had the 'set' variant present.

 
 --snip--
 Yep. It ends up with asterisk stopped, and httpd happily running on -a
 (and it won't start, because the colocation docs say 'if z is not
 running, y won't start', so that makes sense)
 
 Resource Group: httpd
 httpd_fs   (ocf::heartbeat:Filesystem):Started freepbx-a
 httpd_ip   (ocf::heartbeat:IPaddr2):   Started freepbx-a
 httpd_service  (ocf::heartbeat:apache):Started freepbx-a
 Resource Group: asterisk
 asterisk_fs(ocf::heartbeat:Filesystem):Stopped
 asterisk_ip(ocf::heartbeat:IPaddr2):   Stopped
 asterisk_service   (ocf::heartbeat:freepbx):   Stopped
 
 Failed actions:
asterisk_service_monitor_3 on freepbx-a 'not running' (7):
 call=2217, status=complete, last-rc-change='Mon Nov 18 14:05:08 2013',
 queued=0ms, exec=0ms
 
 I've been browsing through the cluster.log, and it's not even trying
 to move httpd.  I'm almost certain that it used to work fine with
 resource sets. Hmm.
 --snip--
 
 After I posted that, I then went and had an actual look at a working
 cluster, and realised exactly how I was doing it, and posted the next
 message.
 
 I'll have a try with the setoptions and see if that works. Thanks!
 
 --Rob
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org





Re: [Pacemaker] The larger cluster is tested.

2013-11-18 Thread Andrew Beekhof

On 16 Nov 2013, at 12:22 am, yusuke iida yusk.i...@gmail.com wrote:

 Hi, Andrew
 
 Thanks for the various suggestions.
 
 I fixed batch-limit at 1, 2, 3, and 4 and tested each from the
 beginning, in order to confirm which batch-limit is suitable.
 
 In my environment it came out something like the following:
 no timeouts occurred with batch-limit=1 and 2,
 batch-limit=3 produced 1 timeout,
 and batch-limit=4 produced 5 timeouts.
 
 From the above results, I think the current limit of
 QB_MAX(1, peers/4) is still too high.

Remember these results are specific to your (virtual) hardware and configured 
timeouts.
I would argue that 5 timeouts out of 2853 actions is actually quite impressive 
for a default value in this sort of situation.[1]

Some tuning in a cluster of this kind is to be expected.

[1] It took crm_simulate 4 minutes to even pretend to perform all those 
operations.

 
 So I have created a fix that pins batch-limit to 2 when the cluster
 enters the extreme throttling state.
 https://github.com/yuusuke/pacemaker/commit/efe2d6ebc55be39b8be43de38e7662f039b61dec
 
 Results of the test several times, it seems to work without problems.
 
 When batch-limit is fixed and tested, below has a report.
 batch-limit=1
 https://drive.google.com/file/d/0BwMFJItoO-fVNk8wTGlYNjNnSHc/edit?usp=sharing
 batch-limit=2
 https://drive.google.com/file/d/0BwMFJItoO-fVTnc4bXY2YXF2M2M/edit?usp=sharing
 batch-limit=3
 https://drive.google.com/file/d/0BwMFJItoO-fVYl9Gbks2VlJMR0k/edit?usp=sharing
 batch-limit=4
 https://drive.google.com/file/d/0BwMFJItoO-fVZnJIazd5MFQ1aGs/edit?usp=sharing
 
 The report from running it with my test code is the following.
 https://drive.google.com/file/d/0BwMFJItoO-fVbzB0NjFLeVY3Zmc/edit?usp=sharing
 
 Regards,
 Yusuke
 
 2013/11/13 Andrew Beekhof and...@beekhof.net:
 Did you look at the load numbers in the logs?
 The CPUs are being slammed for over 20 minutes.
 
 The automatic tuning can only help so much, you're simply asking the cluster 
 to do more work than it is capable of.
 Giving more priority to cib operations the come via IPC is one option, but 
 as I explained earlier, it comes at the cost of correctness.
 
 Given the huge mismatch between the nodes' capacity and the tasks you're 
 asking them to achieve, your best path forward is probably setting a 
 load-threshold  40% or a batch-limit = 8.
 Or we could try a patch like the one below if we think that the defaults are 
 not aggressive enough.
 
 diff --git a/crmd/throttle.c b/crmd/throttle.c
 index d77195a..7636d4a 100644
 --- a/crmd/throttle.c
 +++ b/crmd/throttle.c
 @@ -611,14 +611,14 @@ throttle_get_total_job_limit(int l)
 switch(r-mode) {
 
 case throttle_extreme:
 -if(limit == 0 || limit  peers/2) {
 -limit = peers/2;
 +if(limit == 0 || limit  peers/4) {
 +limit = QB_MAX(1, peers/4);
 }
 break;
 
 case throttle_high:
 -if(limit == 0 || limit  peers) {
 -limit = peers;
 +if(limit == 0 || limit  peers/2) {
 +limit = QB_MAX(1, peers/2);
 }
 break;
 default:
 
 This may also be worthwhile:
 
 diff --git a/crmd/throttle.c b/crmd/throttle.c
 index d77195a..586513a 100644
 --- a/crmd/throttle.c
 +++ b/crmd/throttle.c
 @@ -387,22 +387,36 @@ static bool throttle_io_load(float *load, unsigned int 
 *blocked)
 }
 
 static enum throttle_state_e
 -throttle_handle_load(float load, const char *desc)
 +throttle_handle_load(float load, const char *desc, int cores)
 {
 -if(load  THROTTLE_FACTOR_HIGH * throttle_load_target) {
 +float adjusted_load = load;
 +
 +if(cores = 0) {
 +/* No adjusting of the supplied load value */
 +
 +} else if(cores == 1) {
 +/* On a single core machine, a load of 1.0 is already too high */
 +adjusted_load = load * THROTTLE_FACTOR_MEDIUM;
 +
 +} else {
 +/* Normalize the load to be per-core */
 +adjusted_load = load / cores;
 +}
 +
 +if(adjusted_load  THROTTLE_FACTOR_HIGH * throttle_load_target) {
 crm_notice(High %s detected: %f, desc, load);
 return throttle_high;
 
 -} else if(load  THROTTLE_FACTOR_MEDIUM * throttle_load_target) {
 +} else if(adjusted_load  THROTTLE_FACTOR_MEDIUM * 
 throttle_load_target) {
 crm_info(Moderate %s detected: %f, desc, load);
 return throttle_med;
 
 -} else if(load  THROTTLE_FACTOR_LOW * throttle_load_target) {
 +} else if(adjusted_load  THROTTLE_FACTOR_LOW * throttle_load_target) {
 crm_debug(Noticable %s detected: %f, desc, load);
 return throttle_low;
 }
 
 -crm_trace(Negligable %s detected: %f, desc, load);
 +crm_trace(Negligable %s detected: %f, desc, adjusted_load);
 return throttle_none;
 }
 
 @@ -464,22 +478,12 @@ throttle_mode(void)
 }
 
 if(throttle_load_avg(load)) {
 -float

Re: [Pacemaker] No such device, problem with setting pacemaker

2013-11-18 Thread Andrew Beekhof

On 18 Nov 2013, at 11:59 pm, Miha m...@softnet.si wrote:

 Hi,
 
 I am setting up a cluster with pacemaker and corosync for the first time.
 Server A and server B can ping each other, and I have disabled selinux and 
 iptables, but I cannot get this going. I followed the tutorial step by step.

Have you configured a stonith device in pacemaker?
What does your config look like?
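
For example, the output of either of these would be enough to go on (a sketch, 
assuming a reasonably recent pcs):

   # pcs config
   # pcs stonith show --full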

 
 Here is the error that I am getting in messages:
 
 Nov 18 21:36:39 sip2 crmd[13483]:   notice: tengine_stonith_notify: Peer 
 sip1.domain.com was not terminated (reboot) by sip2.domain.com for 
 sip2.domain.com: No such device (ref=ee5d3db1-3c95-4230-8330-a279a03f5f90) by 
 client stonith_admin.cman.18267
 Nov 18 21:36:39 sip2 fence_pcmk[18266]: Call to fence sip1.domain.com (reset) 
 failed with rc=237
 Nov 18 21:36:42 sip2 fence_pcmk[18286]: Requesting Pacemaker fence 
 sip1.domain.com (reset)
 Nov 18 21:36:42 sip2 stonith_admin[18287]:   notice: crm_log_args: Invoked: 
 stonith_admin --reboot sip1.domain.com --tolerance 5s --tag cman
 Nov 18 21:36:42 sip2 stonith-ng[13479]:   notice: handle_request: Client 
 stonith_admin.cman.18287.46738a96 wants to fence (reboot) 'sip1.domain.com' 
 with device '(any)'
 Nov 18 21:36:42 sip2 stonith-ng[13479]:   notice: initiate_remote_stonith_op: 
 Initiating remote operation reboot for sip1.domain.com: 
 64a4294f-059a-416f-8e9d-944db759a0e6 (0)
 Nov 18 21:36:42 sip2 stonith-ng[13479]:error: remote_op_done: Operation 
 reboot of sip1.domain.com by sip2.domain.com for 
 stonith_admin.cman.18...@sip2.domain.com.64a4294f: No such device
 Nov 18 21:36:42 sip2 crmd[13483]:   notice: tengine_stonith_notify: Peer 
 sip1.domain.com was not terminated (reboot) by sip2.domain.com for 
 sip2.domain.com: No such device (ref=64a4294f-059a-416f-8e9d-944db759a0e6) by 
 client stonith_admin.cman.18287
 Nov 18 21:36:42 sip2 fence_pcmk[18286]: Call to fence sip1.domain.com (reset) 
 failed with rc=237
 
 
 tnx for help!
 
 miha
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org





Re: [Pacemaker] Finally. A REAL question.

2013-11-18 Thread Andrew Beekhof

On 19 Nov 2013, at 2:50 pm, Rob Thomas xro...@gmail.com wrote:

  On 19 Nov 2013, at 6:00 am, Rob Thomas xro...@gmail.com wrote:
  So... What's the -right- way to do it then? 8)
 
   <rsc_colocation id="pcs_rsc_colocation">
     <resource_set id="pcs_rsc_set">
       <resource_ref id="httpd"/>
       <resource_ref id="asterisk"/>
     </resource_set>
   </rsc_colocation>
 
 ... 
 
  I'll have a try with the setoptions and see if that works. Thanks!
 
 Without adding the ms resource, it won't fail the other service over 
 completely.
 
 This works (which is a LITTLE bit more pleasing to the eyes, hopefully! I 
 even set the font to monospace!)
 
 <rsc_colocation id="pcs_rsc_colocation" score="INFINITY">
   <resource_set id="pcs_rsc_set-1">
     <resource_ref id="httpd"/>
     <resource_ref id="asterisk"/>
   </resource_set>
   <resource_set id="pcs_rsc_set-2" role="Master">
     <resource_ref id="ms-asterisk"/>
     <resource_ref id="ms-httpd"/>
   </resource_set>
 </rsc_colocation>
 
 I'm still pretty sure you can't do that through pcs. 

The docs suggest this should work, but it seems not to:

   pcs constraint colocation set httpd asterisk set ms-asterisk ms-httpd 
setoptions role=Master score=INFINITY 

Chris: the command 'pcs constraint colocation set httpd asterisk setoptions 
role=Master' creates:

  <rsc_colocation id="pcs_rsc_colocation" role="Master">
    <resource_set id="pcs_rsc_set">
      <resource_ref id="httpd"/>
      <resource_ref id="asterisk"/>
    </resource_set>
  </rsc_colocation>

However role belongs with the resource_set.
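
Until pcs can express that, one workaround is to edit the constraint XML by 
hand and push it back, the same approach used elsewhere in this thread (a 
sketch):

   # pcs cluster cib > /tmp/cib.xml
   # vi /tmp/cib.xml    (move role="Master" onto the second resource_set)
   # pcs cluster push cib /tmp/cib.xml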

 
 And the reason why (I believe) I need them, to go into a bit more depth - 
 (sorry for everyone else who's getting bored with this incredibly arcane and 
 in-depth discussion, that has degenerated into pasting XML snippets 
 everywhere) here's the relevant associated constraints:
 
 <rsc_colocation id="c-1" rsc="asterisk_fs" score="INFINITY" 
 with-rsc="ms-asterisk" with-rsc-role="Master"/>
 <rsc_order first="ms-asterisk" first-action="promote" id="o-1" 
 score="INFINITY" then="asterisk_fs" then-action="start"/>
 <rsc_colocation id="c-2" rsc="httpd_fs" score="INFINITY" with-rsc="ms-httpd" 
 with-rsc-role="Master"/>
 <rsc_order first="ms-httpd" first-action="promote" id="o-2" score="INFINITY" 
 then="httpd_fs" then-action="start"/>
 
 (id's changed to aid reading) 
 
 Without pcs_rsc_set-2 before failing:
 
  Master/Slave Set: ms-asterisk [drbd_asterisk]
  Masters: [ freepbx-a ]
  Slaves: [ freepbx-b ]
  Master/Slave Set: ms-httpd [drbd_httpd]
  Masters: [ freepbx-a ]
  Slaves: [ freepbx-b ]
  Resource Group: httpd
  httpd_fs   (ocf::heartbeat:Filesystem):Started freepbx-a
  httpd_ip   (ocf::heartbeat:IPaddr2):   Started freepbx-a
  httpd_service  (ocf::heartbeat:apache):Started freepbx-a
  Resource Group: asterisk
  asterisk_fs(ocf::heartbeat:Filesystem):Started freepbx-a
  asterisk_ip(ocf::heartbeat:IPaddr2):   Started freepbx-a
  asterisk_service   (ocf::heartbeat:freepbx):   Started freepbx-a
  isymphony_service  (lsb:iSymphonyServer):  Started freepbx-a
 
 AFTER failing:
 
  Master/Slave Set: ms-asterisk [drbd_asterisk]
  Masters: [ freepbx-a ]  -- THIS IS WRONG
  Slaves: [ freepbx-b ]
  Master/Slave Set: ms-httpd [drbd_httpd]
  Masters: [ freepbx-b ]
  Slaves: [ freepbx-a ]
  Resource Group: asterisk
  asterisk_fs(ocf::heartbeat:Filesystem):Stopped
  asterisk_ip(ocf::heartbeat:IPaddr2):   Stopped
  asterisk_service   (ocf::heartbeat:freepbx):   Stopped
  isymphony_service  (lsb:iSymphonyServer):  Stopped
  Resource Group: httpd
  httpd_fs   (ocf::heartbeat:Filesystem):Started freepbx-b
  httpd_ip   (ocf::heartbeat:IPaddr2):   Started freepbx-b
  httpd_service  (ocf::heartbeat:apache):Started freepbx-b
 
 The asterisk group isn't starting because - obviously - it's not the master 
 for ms-asterisk.  So the constraint worked, BUT, because I don't have the 
 resource in there, I can't tell it to shut down.
 
 My other idea was having the ms- ordering start the group, but that doesn't 
 work either. 
 
 --Rob
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org





Re: [Pacemaker] Finally. A REAL question.

2013-11-18 Thread Andrew Beekhof

On 19 Nov 2013, at 3:09 pm, Andrew Beekhof and...@beekhof.net wrote:

 
 On 19 Nov 2013, at 2:50 pm, Rob Thomas xro...@gmail.com wrote:
 
 On 19 Nov 2013, at 6:00 am, Rob Thomas xro...@gmail.com wrote:
 So... What's the -right- way to do it then? 8)
 
  <rsc_colocation id="pcs_rsc_colocation">
    <resource_set id="pcs_rsc_set">
      <resource_ref id="httpd"/>
      <resource_ref id="asterisk"/>
    </resource_set>
  </rsc_colocation>
 
 ... 
 
 I'll have a try with the setoptions and see if that works. Thanks!
 
 Without adding the ms resource, it won't fail the other service over 
 completely.
 
 This works (which is a LITTLE bit more pleasing to the eyes, hopefully! I 
 even set the font to monospace!)
 
  <rsc_colocation id="pcs_rsc_colocation" score="INFINITY">
    <resource_set id="pcs_rsc_set-1">
      <resource_ref id="httpd"/>
      <resource_ref id="asterisk"/>
    </resource_set>
    <resource_set id="pcs_rsc_set-2" role="Master">
      <resource_ref id="ms-asterisk"/>
      <resource_ref id="ms-httpd"/>
    </resource_set>
  </rsc_colocation>

Also, yes that is correct.

The non-set equivalent would be:

pcs constraint colocation add asterisk with httpd  
pcs constraint colocation add master ms-asterisk with asterisk  
pcs constraint colocation add master ms-httpd with master ms-asterisk

If that doesn't work, send me cibadmin -Ql after the failure and I can 
investigate.
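
For completeness, capturing that right after the failure is just (a sketch; the 
file name is arbitrary):

   # cibadmin -Ql > /tmp/cib-after-failure.xml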

 
 I'm still pretty sure you can't do that through pcs. 
 
 The docs suggest this should work, but it seems not to:
 
   pcs constraint colocation set httpd asterisk set ms-asterisk ms-httpd 
 setoptions role=Master score=INFINITY 
 
 Chris: the command 'pcs constraint colocation set httpd asterisk setoptions 
 role=Master' creates:
 
  <rsc_colocation id="pcs_rsc_colocation" role="Master">
    <resource_set id="pcs_rsc_set">
      <resource_ref id="httpd"/>
      <resource_ref id="asterisk"/>
    </resource_set>
  </rsc_colocation>
 
 However role belongs with the resource_set.
 
 
 And the reason why (I believe) I need them, to go into a bit more depth - 
 (sorry for everyone else who's getting bored with this incredibly arcane and 
 in-depth discussion, that has degenerated into pasting XML snippets 
 everywhere) here's the relevant associated constraints:
 
  <rsc_colocation id="c-1" rsc="asterisk_fs" score="INFINITY" 
  with-rsc="ms-asterisk" with-rsc-role="Master"/>
  <rsc_order first="ms-asterisk" first-action="promote" id="o-1" 
  score="INFINITY" then="asterisk_fs" then-action="start"/>
  <rsc_colocation id="c-2" rsc="httpd_fs" score="INFINITY" with-rsc="ms-httpd" 
  with-rsc-role="Master"/>
  <rsc_order first="ms-httpd" first-action="promote" id="o-2" score="INFINITY" 
  then="httpd_fs" then-action="start"/>
 
 (id's changed to aid reading) 
 
 Without pcs_rsc_set-2 before failing:
 
 Master/Slave Set: ms-asterisk [drbd_asterisk]
 Masters: [ freepbx-a ]
 Slaves: [ freepbx-b ]
 Master/Slave Set: ms-httpd [drbd_httpd]
 Masters: [ freepbx-a ]
 Slaves: [ freepbx-b ]
 Resource Group: httpd
 httpd_fs   (ocf::heartbeat:Filesystem):Started freepbx-a
 httpd_ip   (ocf::heartbeat:IPaddr2):   Started freepbx-a
 httpd_service  (ocf::heartbeat:apache):Started freepbx-a
 Resource Group: asterisk
 asterisk_fs(ocf::heartbeat:Filesystem):Started freepbx-a
 asterisk_ip(ocf::heartbeat:IPaddr2):   Started freepbx-a
 asterisk_service   (ocf::heartbeat:freepbx):   Started freepbx-a
 isymphony_service  (lsb:iSymphonyServer):  Started freepbx-a
 
 AFTER failing:
 
 Master/Slave Set: ms-asterisk [drbd_asterisk]
 Masters: [ freepbx-a ]  -- THIS IS WRONG
 Slaves: [ freepbx-b ]
 Master/Slave Set: ms-httpd [drbd_httpd]
 Masters: [ freepbx-b ]
 Slaves: [ freepbx-a ]
 Resource Group: asterisk
 asterisk_fs(ocf::heartbeat:Filesystem):Stopped
 asterisk_ip(ocf::heartbeat:IPaddr2):   Stopped
 asterisk_service   (ocf::heartbeat:freepbx):   Stopped
 isymphony_service  (lsb:iSymphonyServer):  Stopped
 Resource Group: httpd
 httpd_fs   (ocf::heartbeat:Filesystem):Started freepbx-b
 httpd_ip   (ocf::heartbeat:IPaddr2):   Started freepbx-b
 httpd_service  (ocf::heartbeat:apache):Started freepbx-b
 
 The asterisk group isn't starting because - obviously - it's not the master 
 for ms-asterisk.  So the constraint worked, BUT, because I don't have the 
 resource in there, I can't tell it to shut down.
 
 My other idea was having the ms- ordering start the group, but that doesn't 
 work either. 
 
 --Rob
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




Re: [Pacemaker] Remove a ghost node

2013-11-18 Thread Andrew Beekhof

On 19 Nov 2013, at 3:21 am, Sean Lutner s...@rentul.net wrote:

 
 On Nov 17, 2013, at 7:40 PM, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 15 Nov 2013, at 2:28 pm, Sean Lutner s...@rentul.net wrote:
 
 Yes the varnish resources are in a group which is then cloned.
 
 -EDONTDOTHAT
 
 You cant refer to the things inside a clone.
 1.1.8 will have just been ignoring those constraints.
 
 So the implicit order and colocation constraints in a group and clone will 
 take care of those?
 
 Which means remove the constraints and retry the upgrade?
 
 No, it means rewrite them to refer to the clone - whatever is the outermost 
 container. 
 
 I see, thanks. Did I miss that in the docs or is it undocumented/implied? If 
 I didn't miss it, it'd be nice if that were explicitly documented.

Assuming:

<clone id="X" ...>
  <primitive id="Y" .../>
</clone>

For most of pacemaker's existence, it simply wasn't possible because there is 
no resource named Y (they were actually called Y:0 Y:1 .. Y:N).
Then pacemaker was made smarter and as a side effect, started being able to 
find something matching Y.

Then I closed the loophole :)

So it was never legal.

 
 Is there a combination of constraints I can configure for a single IP 
 resource and a cloned group such that if there is a failure only the IP 
 resource will move?

These are implied simply because they're in a group:

  <rsc_order first="Varnish" id="order-Varnish-Varnishlog-mandatory" 
then="Varnishlog"/>
  <rsc_order first="Varnishlog" id="order-Varnishlog-Varnishncsa-mandatory" 
then="Varnishncsa"/>
  <rsc_colocation id="colocation-Varnishlog-Varnish-INFINITY" 
rsc="Varnishlog" score="INFINITY" with-rsc="Varnish"/>
  <rsc_colocation id="colocation-Varnishncsa-Varnishlog-INFINITY" 
rsc="Varnishncsa" score="INFINITY" with-rsc="Varnishlog"/>
This:

  <rsc_order first="ClusterEIP_54.215.143.166" 
id="order-ClusterEIP_54.215.143.166-Varnish-mandatory" then="Varnish"/>

can be replaced with:

  pcs constraint order start ClusterEIP_54.215.143.166 then 
EIP-AND-VARNISH-clone

and:

  <rsc_colocation 
id="colocation-Varnish-ClusterEIP_54.215.143.166-INFINITY" rsc="Varnish" 
score="INFINITY" with-rsc="ClusterEIP_54.215.143.166"/>

is the same as:

  pcs constraint colocation EIP-AND-VARNISH-clone with 
ClusterEIP_54.215.143.166 

but that makes no sense because then the clone (which wants to run everywhere) 
can only run on the node the IP is on.
If that's what you want, then there is no point having a clone.

Better to reverse the colocation to be:

  pcs constraint colocation ClusterEIP_54.215.143.166 with 
EIP-AND-VARNISH-clone

and possibly the ordering too:

  pcs constraint order start EIP-AND-VARNISH-clone then 
ClusterEIP_54.215.143.166 
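
A quick way to sanity-check the resulting placement without touching the 
cluster is something along these lines (a sketch):

   # crm_simulate -Ls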


 Or in the case where I previously had the constraints applied to the 
 resources and not the clone was that causing a problem?
 
 Thanks again.
 
 
 
 
 I was able to get the upgrade done. I also had to upgrade the libqb 
 package. I know that's been mentioned in other threads, but I think that 
 should either be a dependency of pacemaker or explicitly documented.
 
 libqb is a dependency, just not a versioned one.
 We should probably change that next time.
 
 I would say it's a requirement that that be changed.

Well... pacemaker can build against older versions, but you need to run it with 
whatever version you built against.
Possibly the libqb versioning is being done incorrectly, preventing rpm from 
figuring this all out.
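
If in doubt, it's easy enough to compare what is installed with what the 
daemons actually link against (a sketch; the path is the usual RHEL 6 
location):

   # rpm -q libqb pacemaker
   # ldd /usr/libexec/pacemaker/cib | grep libqb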

 
 
 
 Second order of business is that failover is no longer working as expected. 
 Because the order and colocation constraints are gone, if one of the 
 varnish resources fails, the EIP resource does not move to the other node 
 like it used to.
 
 Is there a way I can create or re-create that behavior?
 
 See above :)
 
 
 The resource group EIP-AND_VARNISH has the three varnish services and is 
 then cloned so running on both nodes. If any of them fail I want the EIP 
 resource to move to the other node.
 
 Any advice for doing this?
 
 Thanks
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




Re: [Pacemaker] CentOS 6.4 and CFS.

2013-11-17 Thread Andrew Beekhof

On 16 Nov 2013, at 9:42 am, Rob Thomas xro...@gmail.com wrote:

 Line 363 of /usr/lib/python2.6/site-packages/pcs/cluster.py has this:
 
nodes = utils.getNodesFromCorosyncConf()
 
 Ahha. Look what I just spotted.
 
 https://github.com/feist/pcs/commit/8b888080c37ddea88b92dfd95aadd78b9db68b55

Are you building pcs yourself or using the packages supplied by CentOS?
I'd be surprised if the supplied packages had not been patched to look in 
cluster.conf instead of corosync.conf





Re: [Pacemaker] Remove a ghost node

2013-11-17 Thread Andrew Beekhof

On 15 Nov 2013, at 2:28 pm, Sean Lutner s...@rentul.net wrote:

 Yes the varnish resources are in a group which is then cloned.
 
 -EDONTDOTHAT
 
 You cant refer to the things inside a clone.
 1.1.8 will have just been ignoring those constraints.
 
 So the implicit order and colocation constraints in a group and clone will 
 take care of those?
 
 Which means remove the constraints and retry the upgrade?

No, it means rewrite them to refer to the clone - whatever is the outermost 
container. 

 
 I was able to get the upgrade done. I also had to upgrade the libqb package. 
 I know that's been mentioned in other threads, but I think that should either 
 be a dependency of pacemaker or explicitly documented.

libqb is a dependency, just not a versioned one.
We should probably change that next time.

 
 Second order of business is that failover is no longer working as expected. 
 Because the order and colocation constraints are gone, if one of the varnish 
 resources fails, the EIP resource does not move to the other node like it 
 used to.
 
 Is there a way I can create or re-create that behavior?

See above :)

 
 The resource group EIP-AND_VARNISH has the three varnish services and is then 
 cloned so running on both nodes. If any of them fail I want the EIP resource 
 to move to the other node.
 
 Any advice for doing this?
 
 Thanks





Re: [Pacemaker] Finally. A REAL question.

2013-11-17 Thread Andrew Beekhof

On 18 Nov 2013, at 12:43 pm, Rob Thomas xro...@gmail.com wrote:

 Previously, using crm, it was reasonably painless to ensure that
 resource groups ran on the same node.
 
 I'm having difficulties figuring out what the 'right' way to do this is with 
 pcs

You tried:

   pcs constraint colocation add asterisk with httpd

?

 
 Specifically, I want the 'asterisk' group to run on the same node as
 the 'httpd' group.
 
 Basically, this should never happen:
 
 pcs status
 Cluster name: freepbx-ha
 Last updated: Mon Nov 18 11:38:27 2013
 Last change: Mon Nov 18 11:30:44 2013 via cibadmin on freepbx-a
 Stack: cman
 Current DC: freepbx-a - partition with quorum
 Version: 1.1.10-1.el6_4.4-368c726
 2 Nodes configured
 16 Resources configured
 
 
 Online: [ freepbx-a freepbx-b ]
 
 Full list of resources:
 
 floating_ip(ocf::heartbeat:IPaddr2):   Started freepbx-b
 Master/Slave Set: ms-asterisk [drbd_asterisk]
 Masters: [ freepbx-b ]
 Slaves: [ freepbx-a ]
 Master/Slave Set: ms-mysql [drbd_mysql]
 Masters: [ freepbx-a ]
 Slaves: [ freepbx-b ]
 Master/Slave Set: ms-httpd [drbd_httpd]
 Masters: [ freepbx-a ]
 Slaves: [ freepbx-b ]
 Resource Group: mysql
 mysql_fs   (ocf::heartbeat:Filesystem):Started freepbx-a
 mysql_ip   (ocf::heartbeat:IPaddr2):   Started freepbx-a
 mysql_service  (ocf::heartbeat:mysql): Started freepbx-a
 Resource Group: asterisk
 asterisk_fs(ocf::heartbeat:Filesystem):Started freepbx-b
 asterisk_ip(ocf::heartbeat:IPaddr2):   Started freepbx-b
 asterisk_service   (ocf::heartbeat:freepbx):   Started freepbx-b
 Resource Group: httpd
 httpd_fs   (ocf::heartbeat:Filesystem):Started freepbx-a
 httpd_ip   (ocf::heartbeat:IPaddr2):   Started freepbx-a
 httpd_service  (ocf::heartbeat:apache):Started freepbx-a
 
 
 Note that asterisk and httpd are running on different nodes, after I
 caused asterisk to fail across (by shutting it down)
 
 What I want to happen is that when asterisk fails (or httpd), the
 cluster should shut down the other non failing resource, and move it
 
 I can do this by making a single resource group that contains both the
 asterisk_ and httpd_ resources, but, it seems untidy to me.
 
 I've tried a colocation set, which does let you add groups, but it
 doesn't seem to do what I'm after.
 
 Attached is the dump of my current config.
 
 Any advice or help would be appreciated!
 
 (Currently on CentOS 6.4, with pacemaker 1.1.10, and pcs 0.9.90)
 
 --Rob
 cibadmin.xml___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org





Re: [Pacemaker] CentOS 6.4 and CFS.

2013-11-15 Thread Andrew Beekhof

On 15 Nov 2013, at 5:56 pm, Rob Thomas xro...@gmail.com wrote:

 So I'm a long time corosync fan, and I've recently come back into the
 fold to change everything I've previously written to pcs, because
 that's the new cool thing.
 
 Sadly, things seem to be a bit broken.
 
 Here's how things have gone today!
 
 I managed to get things kinda sorta working with the old 1.1.8 version
 of PCS on CentOS 6.4.  I wasn't happy with it, but I went 'meh,
 that'll do, I'll fix it all with pacemaker 1.1.10'.
 
 No such luck.
 
 So, I create a few test resources and that all seems to work. Excellent.
 
 Now I want to start working on failing things over properly, 'pcs
 cluster standby node-a'
 
 Error: node 'node-a' does not appear to exist in configuration
 
 Looking through the pcs code, it's now checking that the node exists
 in /etc/corosync/corosync.conf

No.  Not on RHEL-6 anyway.
Before I address the rest of your email, you need to be using pacemaker with 
cman (cluster.conf) as described at:

   http://clusterlabs.org/quickstart-redhat.html

and:

   
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6-Beta/html/Configuring_the_Red_Hat_High_Availability_Add-On_with_Pacemaker/ch-clusteradmin-HAAR.html

If you're still having issues after reading one or more of those, we'll be only 
too happy to help :)
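
The short version from that quickstart looks roughly like this (a sketch from 
memory, so double-check each command against the page; node-a/node-b and the 
cluster name are placeholders):

   # yum install -y pacemaker cman pcs ccs resource-agents
   # ccs -f /etc/cluster/cluster.conf --createcluster freepbx-ha
   # ccs -f /etc/cluster/cluster.conf --addnode node-a
   # ccs -f /etc/cluster/cluster.conf --addnode node-b
   # ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
   # ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node-a
   # ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node-b
   # ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node-a pcmk-redirect port=node-a
   # ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node-b pcmk-redirect port=node-b
   # echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/sysconfig/cman
   # service cman start
   # service pacemaker start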

 
 Well that's cool, I can generate a conf with 'pcs cluster auth' and
 'pcs cluster setup' according to CFS. No. No I can't.  pcs cluster
 auth and pcs cluster setup requires pcsd (I assume?  Whatever's meant
 to be listening on port 2224) but that appears to be missing in RHEL
 based distros.
 
 (This is, apparently, by design according to
 https://github.com/feist/pcs/issues/3 which is - admittedly - quite
 old)
 
 OK, so I'll create the corosync config file based on the template I
 found in /usr/lib/python2.6/site-packages/pcs/corosync.conf.template
 -- except that one isn't used, and only the fedora.template is.
 
 The fedora template specifies that it's using a pacemaker service,
 but.. I thought that had been deprecated and removed?
 
 So. I'm now at the point where I'm confused.
 
 Question:  Am I doing something basically and fundamentally wrong? Is
 there a step that I've missed that generates the corosync.conf file
 that's required now?  If not, can I just use the .template one, or,
 should I be using the .fedora.template with 6.4?
 
 I normally would just derp around with it and add my old corosync.conf
 file and keep playing, but, I have to wander off to geek at a Roller
 Derby thing tonight, so I thought I may throw it to the crowds before
 I start going down too many blind alleys.
 
 Question 2: What else is going to bite me? 8)
 
 Sorry for the length!
 
 --Rob
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org





Re: [Pacemaker] Remove a ghost node

2013-11-14 Thread Andrew Beekhof

On 14 Nov 2013, at 2:55 pm, Sean Lutner s...@rentul.net wrote:

 
 On Nov 13, 2013, at 10:51 PM, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 14 Nov 2013, at 1:12 pm, Sean Lutner s...@rentul.net wrote:
 
 
 On Nov 10, 2013, at 8:03 PM, Sean Lutner s...@rentul.net wrote:
 
 
 On Nov 10, 2013, at 7:54 PM, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 11 Nov 2013, at 11:44 am, Sean Lutner s...@rentul.net wrote:
 
 
 On Nov 10, 2013, at 6:27 PM, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 8 Nov 2013, at 12:59 pm, Sean Lutner s...@rentul.net wrote:
 
 
 On Nov 7, 2013, at 8:34 PM, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 8 Nov 2013, at 4:45 am, Sean Lutner s...@rentul.net wrote:
 
 I have a confusing situation that I'm hoping to get help with. Last 
 night after configuring STONITH on my two node cluster, I suddenly 
 have a ghost node in my cluster. I'm looking to understand the 
 best way to remove this node from the config.
 
 I'm using the fence_ec2 device for for STONITH. I dropped the script 
 on each node, registered the device with stonith_admin -R -a 
 fence_ec2 and confirmed the registration with both
 
 # stonith_admin -I
 # pcs stonith list
 
 I then configured STONITH per the Clusters from Scratch doc
 
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html
 
 Here are my commands:
 # pcs cluster cib stonith_cfg
 # pcs -f stonith_cfg stonith create ec2-fencing fence_ec2 
 ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list 
 pcmk_host_list=ip-10-50-3-122 ip-10-50-3-251 op monitor 
 interval=300s timeout=150s op start start-delay=30s 
 interval=0
 # pcs -f stonith_cfg stonith
 # pcs -f stonith_cfg property set stonith-enabled=true
 # pcs -f stonith_cfg property
 # pcs cluster push cib stonith_cfg
 
 After that I saw that STONITH appears to be functioning but a new 
 node listed in pcs status output:
 
 Do the EC2 instances have fixed IPs?
 I didn't have much luck with EC2 because every time they came back up 
 it was with a new name/address which confused corosync and created 
 situations like this.
 
 The IPs persist across reboots as far as I can tell. I thought the 
 problem was due to stonith being enabled but not working so I removed 
 the stonith_id and disabled stonith. After that I restarted pacemaker 
 and cman on both nodes and things started as expected but the ghost 
 node it still there. 
 
 Someone else working on the cluster exported the CIB, removed the node 
 and then imported the CIB. They used this process 
 http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html
 
 Even after that, the ghost node is still there? Would pcs cluster cib 
  /tmp/cib-temp.xml and then pcs cluster push cib /tmp/cib-temp.xml 
 after editing the node out of the config?
 
 No. If its coming back then pacemaker is holding it in one of its 
 internal caches.
 The only way to clear it out in your version is to restart pacemaker on 
 the DC.
 
 Actually... are you sure someone didn't just slip while editing 
 cluster.conf?  [...].1251 does not look like a valid IP :)
 
 In the end this fixed it
 
 # pcs cluster cib  /tmp/cib-tmp.xml
 # vi /tmp/cib-tmp.xml # remove bad node
 # pcs cluster push cib /tmp/cib-tmp.xml
 
 Followed by restaring pacemaker and cman on both nodes. The ghost node 
 disappeared, so it was cached as you mentioned.
 
 I also tracked the bad IP down to bad non-printing characters in the 
 initial command line while configuring the fence_ec2 stonith device. I'd 
 put the command together from the github README and some mailing list 
 posts and laid it out in an external editor. Go me. :)
 
 
 
 Version: 1.1.8-7.el6-394e906
 
 There is now an update to 1.1.10 available for 6.4, that _may_ help in 
 the future.
 
 That's my next task. I believe I'm hitting the failure-timeout not 
 clearing failcount bug and want to upgrade to 1.1.10. Is it safe to yum 
 update pacemaker after stopping the cluster? I see there is also an 
 updated pcs in CentOS 6.4, should I update that as well?
 
 yes and yes
 
 you might want to check if you're using any OCF resource agents that 
 didn't make it into the first supported release though.
 
 http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
 
 Thanks, I'll give that a read. All the resource agents are custom so I'm 
 thinking I'm okay (I'll back them up before upgrading). 
 
 One last question related to the fence_ec2 script. Should crm_mon -VW show 
 it running on both nodes or just one?
 
 I just went through the upgrade to pacemaker 1.1.10 and pcs. After running 
 the yum update for those I ran a crm_verify and I'm seeing errors related 
 to my order and colocation constraints. Did the behavior of these change 
 from 1.1.8 to 1.1.10?
 
 # crm_verify -L -V
 error: unpack_order_template:Invalid constraint 
 'order-ClusterEIP_54.215.143.166-Varnish-mandatory': No resource or 
 template named 'Varnish'
 
 Is that true?
 
 No, it's

Re: [Pacemaker] why pacemaker does not control the resources

2013-11-14 Thread Andrew Beekhof

On 14 Nov 2013, at 5:06 pm, Andrey Groshev gre...@yandex.ru wrote:

 
 
 14.11.2013, 02:22, Andrew Beekhof and...@beekhof.net:
 On 14 Nov 2013, at 6:13 am, Andrey Groshev gre...@yandex.ru wrote:
 
  13.11.2013, 03:22, Andrew Beekhof and...@beekhof.net:
  On 12 Nov 2013, at 4:42 pm, Andrey Groshev gre...@yandex.ru wrote:
   11.11.2013, 03:44, Andrew Beekhof and...@beekhof.net:
   On 8 Nov 2013, at 7:49 am, Andrey Groshev gre...@yandex.ru wrote:
Hi, PPL!
I need help. I do not understand... Why has stopped working.
This configuration work on other cluster, but on corosync1.
 
So... cluster postgres with master/slave.
Classic config as in wiki.
I build cluster, start, he is working.
Next I kill postgres on Master with 6 signal, as if disk space left
 
# pkill -6 postgres
# ps axuww|grep postgres
root  9032  0.0  0.1 103236   860 pts/0S+   00:37   0:00 
 grep postgres
 
PostgreSQL die, But crm_mon shows that the master is still running.
 
Last updated: Fri Nov  8 00:42:08 2013
Last change: Fri Nov  8 00:37:05 2013 via crm_attribute on 
 dev-cluster2-node4
Stack: corosync
Current DC: dev-cluster2-node4 (172793107) - partition with quorum
Version: 1.1.10-1.el6-368c726
3 Nodes configured
7 Resources configured
 
Node dev-cluster2-node2 (172793105): online
   pingCheck   (ocf::pacemaker:ping):  Started
   pgsql   (ocf::heartbeat:pgsql): Started
Node dev-cluster2-node3 (172793106): online
   pingCheck   (ocf::pacemaker:ping):  Started
   pgsql   (ocf::heartbeat:pgsql): Started
Node dev-cluster2-node4 (172793107): online
   pgsql   (ocf::heartbeat:pgsql): Master
   pingCheck   (ocf::pacemaker:ping):  Started
   VirtualIP   (ocf::heartbeat:IPaddr2):   Started
 
Node Attributes:
* Node dev-cluster2-node2:
   + default_ping_set  : 100
   + master-pgsql  : -INFINITY
   + pgsql-data-status : STREAMING|ASYNC
   + pgsql-status  : HS:async
* Node dev-cluster2-node3:
   + default_ping_set  : 100
   + master-pgsql  : -INFINITY
   + pgsql-data-status : STREAMING|ASYNC
   + pgsql-status  : HS:async
* Node dev-cluster2-node4:
   + default_ping_set  : 100
   + master-pgsql  : 1000
   + pgsql-data-status : LATEST
   + pgsql-master-baseline : 0278
   + pgsql-status  : PRI
 
Migration summary:
* Node dev-cluster2-node4:
* Node dev-cluster2-node2:
* Node dev-cluster2-node3:
 
Tickets:
 
CONFIG:
node $id=172793105 dev-cluster2-node2. \
   attributes pgsql-data-status=STREAMING|ASYNC standby=false
node $id=172793106 dev-cluster2-node3. \
   attributes pgsql-data-status=STREAMING|ASYNC standby=false
node $id=172793107 dev-cluster2-node4. \
   attributes pgsql-data-status=LATEST
primitive VirtualIP ocf:heartbeat:IPaddr2 \
   params ip=10.76.157.194 \
   op start interval=0 timeout=60s on-fail=stop \
   op monitor interval=10s timeout=60s on-fail=restart \
   op stop interval=0 timeout=60s on-fail=block
primitive pgsql ocf:heartbeat:pgsql \
   params pgctl=/usr/pgsql-9.1/bin/pg_ctl 
 psql=/usr/pgsql-9.1/bin/psql pgdata=/var/lib/pgsql/9.1/data 
 tmpdir=/tmp/pg start_opt=-p 5432 
 logfile=/var/lib/pgsql/9.1//pgstartup.log rep_mode=async 
 node_list= dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4. 
  restore_command=gzip -cd 
 /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz  %p 
 primary_conninfo_opt=keepalives_idle=60 keepalives_interval=5 
 keepalives_count=5 master_ip=10.76.157.194 \
   op start interval=0 timeout=60s on-fail=restart \
   op monitor interval=5s timeout=61s on-fail=restart \
   op monitor interval=1s role=Master timeout=62s 
 on-fail=restart \
   op promote interval=0 timeout=63s on-fail=restart \
   op demote interval=0 timeout=64s on-fail=stop \
   op stop interval=0 timeout=65s on-fail=block \
   op notify interval=0 timeout=66s
primitive pingCheck ocf:pacemaker:ping \
   params name=default_ping_set host_list=10.76.156.1 
 multiplier=100 \
   op start interval=0 timeout=60s on-fail=restart \
   op monitor interval=10s timeout=60s on-fail=restart \
   op stop interval=0 timeout=60s on-fail=ignore
ms msPostgresql pgsql \
   meta master-max=1 master-node-max=1 clone-node-max=1 
 notify=true target-role=Master clone-max=3
clone clnPingCheck pingCheck \
   meta clone-max=3
location l0_DontRunPgIfNotPingGW msPostgresql \
   rule $id=l0_DontRunPgIfNotPingGW-rule -inf: not_defined 
 default_ping_set

Re: [Pacemaker] Remove a ghost node

2013-11-14 Thread Andrew Beekhof

On 15 Nov 2013, at 10:24 am, Sean Lutner s...@rentul.net wrote:

 
 On Nov 14, 2013, at 6:14 PM, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 14 Nov 2013, at 2:55 pm, Sean Lutner s...@rentul.net wrote:
 
 
 On Nov 13, 2013, at 10:51 PM, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 14 Nov 2013, at 1:12 pm, Sean Lutner s...@rentul.net wrote:
 
 
 On Nov 10, 2013, at 8:03 PM, Sean Lutner s...@rentul.net wrote:
 
 
 On Nov 10, 2013, at 7:54 PM, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 11 Nov 2013, at 11:44 am, Sean Lutner s...@rentul.net wrote:
 
 
 On Nov 10, 2013, at 6:27 PM, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 8 Nov 2013, at 12:59 pm, Sean Lutner s...@rentul.net wrote:
 
 
 On Nov 7, 2013, at 8:34 PM, Andrew Beekhof and...@beekhof.net 
 wrote:
 
 
 On 8 Nov 2013, at 4:45 am, Sean Lutner s...@rentul.net wrote:
 
 I have a confusing situation that I'm hoping to get help with. 
 Last night after configuring STONITH on my two node cluster, I 
 suddenly have a ghost node in my cluster. I'm looking to 
 understand the best way to remove this node from the config.
 
 I'm using the fence_ec2 device for for STONITH. I dropped the 
 script on each node, registered the device with stonith_admin -R 
 -a fence_ec2 and confirmed the registration with both
 
 # stonith_admin -I
 # pcs stonith list
 
 I then configured STONITH per the Clusters from Scratch doc
 
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html
 
 Here are my commands:
 # pcs cluster cib stonith_cfg
 # pcs -f stonith_cfg stonith create ec2-fencing fence_ec2 
 ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list 
 pcmk_host_list=ip-10-50-3-122 ip-10-50-3-251 op monitor 
 interval=300s timeout=150s op start start-delay=30s 
 interval=0
 # pcs -f stonith_cfg stonith
 # pcs -f stonith_cfg property set stonith-enabled=true
 # pcs -f stonith_cfg property
 # pcs cluster push cib stonith_cfg
 
 After that I saw that STONITH appears to be functioning but a new 
 node listed in pcs status output:
 
 Do the EC2 instances have fixed IPs?
 I didn't have much luck with EC2 because every time they came back 
 up it was with a new name/address which confused corosync and 
 created situations like this.
 
 The IPs persist across reboots as far as I can tell. I thought the 
 problem was due to stonith being enabled but not working so I 
 removed the stonith_id and disabled stonith. After that I restarted 
 pacemaker and cman on both nodes and things started as expected but 
 the ghost node it still there. 
 
 Someone else working on the cluster exported the CIB, removed the 
 node and then imported the CIB. They used this process 
 http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html
 
 Even after that, the ghost node is still there. Would pcs cluster 
 cib > /tmp/cib-temp.xml, editing the node out of the config, and then 
 pcs cluster push cib /tmp/cib-temp.xml work?
 
 No. If its coming back then pacemaker is holding it in one of its 
 internal caches.
 The only way to clear it out in your version is to restart pacemaker 
 on the DC.
 
 Actually... are you sure someone didn't just slip while editing 
 cluster.conf?  [...].1251 does not look like a valid IP :)
 
 In the end this fixed it
 
 # pcs cluster cib > /tmp/cib-tmp.xml
 # vi /tmp/cib-tmp.xml # remove bad node
 # pcs cluster push cib /tmp/cib-tmp.xml
 
 Followed by restarting pacemaker and cman on both nodes. The ghost node 
 disappeared, so it was cached as you mentioned.
 
 I also tracked the bad IP down to bad non-printing characters in the 
 initial command line while configuring the fence_ec2 stonith device. 
 I'd put the command together from the github README and some mailing 
 list posts and laid it out in an external editor. Go me. :)
 
 
 
 Version: 1.1.8-7.el6-394e906
 
 There is now an update to 1.1.10 available for 6.4, that _may_ help 
 in the future.
 
 That's my next task. I believe I'm hitting the failure-timeout not 
 clearing failcount bug and want to upgrade to 1.1.10. Is it safe to 
 yum update pacemaker after stopping the cluster? I see there is also 
 an updated pcs in CentOS 6.4, should I update that as well?
 
 yes and yes
 
 you might want to check if you're using any OCF resource agents that 
 didn't make it into the first supported release though.
 
 http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
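 (Roughly, the update on each node would look like the sketch below; this assumes the
 CMAN/pacemaker init services of CentOS 6.4 and is not an official procedure:

   service pacemaker stop && service cman stop
   yum update pacemaker pcs
   service cman start && service pacemaker start
 )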
 
 Thanks, I'll give that a read. All the resource agents are custom so I'm 
 thinking I'm okay (I'll back them up before upgrading). 
 
 One last question related to the fence_ec2 script. Should crm_mon -VW 
 show it running on both nodes or just one?
 
 I just went through the upgrade to pacemaker 1.1.10 and pcs. After 
 running the yum update for those I ran a crm_verify and I'm seeing errors 
 related to my order and colocation constraints. Did the behavior of these 
 change from 1.1.8 to 1.1.10?
 
 # crm_verify -L -V
 error: unpack_order_template

Re: [Pacemaker] Question about the resource to fence a node

2013-11-14 Thread Andrew Beekhof

On 14 Nov 2013, at 5:53 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote:

 Hi, Andrew
 
 2013/11/13 Kazunori INOUE kazunori.ino...@gmail.com:
 2013/11/13 Andrew Beekhof and...@beekhof.net:
 
 On 16 Oct 2013, at 8:51 am, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 15/10/2013, at 8:24 PM, Kazunori INOUE kazunori.ino...@gmail.com 
 wrote:
 
 Hi,
 
 I'm using pacemaker-1.1 (the latest devel).
 I started resource (f1 and f2) which fence vm3 on vm1.
 
 $ crm_mon -1
 Last updated: Tue Oct 15 15:16:37 2013
 Last change: Tue Oct 15 15:16:21 2013 via crmd on vm1
 Stack: corosync
 Current DC: vm1 (3232261517) - partition with quorum
 Version: 1.1.11-0.284.6a5e863.git.el6-6a5e863
 3 Nodes configured
 3 Resources configured
 
 Online: [ vm1 vm2 vm3 ]
 
 pDummy (ocf::pacemaker:Dummy): Started vm3
 Resource Group: gStonith3
   f1 (stonith:external/libvirt): Started vm1
   f2 (stonith:external/ssh): Started vm1
 
 
 A reset using f1, which had not been started on vm2, was performed from vm2 when 
 vm3 was fenced.
 
 $ ssh vm3 'rm -f /var/run/Dummy-pDummy.state'
 $ for i in vm1 vm2; do ssh $i 'hostname; egrep " reset | off " /var/log/ha-log'; done
 vm1
 Oct 15 15:17:35 vm1 stonith-ng[14870]:  warning: log_operation:
 f2:15076 [ Performing: stonith -t external/ssh -T reset vm3 ]
 Oct 15 15:18:06 vm1 stonith-ng[14870]:  warning: log_operation:
 f2:15464 [ Performing: stonith -t external/ssh -T reset vm3 ]
 vm2
 Oct 15 15:17:16 vm2 stonith-ng[9160]:  warning: log_operation: f1:9273
 [ Performing: stonith -t external/libvirt -T reset vm3 ]
 Oct 15 15:17:46 vm2 stonith-ng[9160]:  warning: log_operation: f1:9588
 [ Performing: stonith -t external/libvirt -T reset vm3 ]
 
 Is this the intended behaviour?
 
 Yes, although the host on which the device is started usually gets 
 priority.
 I will try to find some time to look through the report to see why this 
 didn't happen.
 
 Reading through this again, it sounds like it should be fixed by your 
 earlier pull request:
 
   https://github.com/beekhof/pacemaker/commit/6b4bfd6
 
 Yes?
 
 No.
 
 How about this change?

Thanks for this.  I tweaked it a bit further and pushed:

https://github.com/beekhof/pacemaker/commit/4cbbeb0

 
 diff --git a/fencing/remote.c b/fencing/remote.c
 index 6c11ba9..68b31c5 100644
 --- a/fencing/remote.c
 +++ b/fencing/remote.c
 @@ -778,6 +778,7 @@ stonith_choose_peer(remote_fencing_op_t * op)
 {
     st_query_result_t *peer = NULL;
     const char *device = NULL;
+    uint32_t active = fencing_active_peers();
 
     do {
         if (op->devices) {
 @@ -790,7 +791,8 @@ stonith_choose_peer(remote_fencing_op_t * op)
 
         if ((peer = find_best_peer(device, op, FIND_PEER_SKIP_TARGET
                                    | FIND_PEER_VERIFIED_ONLY))) {
             return peer;
-        } else if ((peer = find_best_peer(device, op, FIND_PEER_SKIP_TARGET))) {
+        } else if ((op->query_timer == 0 || op->replies >= op->replies_expected || op->replies >= active)
+                   && (peer = find_best_peer(device, op, FIND_PEER_SKIP_TARGET))) {
             return peer;
         } else if ((peer = find_best_peer(device, op, FIND_PEER_TARGET_ONLY))) {
             return peer;
 @@ -801,8 +803,13 @@ stonith_choose_peer(remote_fencing_op_t * op)
            stonith_topology_next(op) == pcmk_ok);
 
     if (op->devices) {
-        crm_notice("Couldn't find anyone to fence %s with %s", op->target,
-                   (char *)op->devices->data);
+        if (op->query_timer == 0 || op->replies >= op->replies_expected || op->replies >= active) {
+            crm_notice("Couldn't find anyone to fence %s with %s", op->target,
+                       (char *)op->devices->data);
+        } else {
+            crm_debug("Couldn't find verified device to fence %s with %s", op->target,
+                      (char *)op->devices->data);
+        }
     } else {
         crm_debug("Couldn't find anyone to fence %s", op->target);
     }
 
 
 I'm kind of swamped at the moment though.
 
 
 Best Regards,
 Kazunori INOUE
 stopped_resource_performed_reset.tar.bz2




Re: [Pacemaker] why pacemaker does not control the resources

2013-11-13 Thread Andrew Beekhof

On 14 Nov 2013, at 6:13 am, Andrey Groshev gre...@yandex.ru wrote:

 
 
 13.11.2013, 03:22, Andrew Beekhof and...@beekhof.net:
 On 12 Nov 2013, at 4:42 pm, Andrey Groshev gre...@yandex.ru wrote:
 
  11.11.2013, 03:44, Andrew Beekhof and...@beekhof.net:
  On 8 Nov 2013, at 7:49 am, Andrey Groshev gre...@yandex.ru wrote:
   Hi, PPL!
   I need help. I do not understand why it has stopped working.
   This configuration works on another cluster, but on corosync 1.
 
   So... a postgres cluster with master/slave.
   Classic config as in the wiki.
   I build the cluster and start it, and it is working.
   Next I kill postgres on the Master with signal 6, as if disk space had run out
 
   # pkill -6 postgres
   # ps axuww|grep postgres
   root  9032  0.0  0.1 103236   860 pts/0S+   00:37   0:00 grep 
 postgres
 
   PostgreSQL die, But crm_mon shows that the master is still running.
 
   Last updated: Fri Nov  8 00:42:08 2013
   Last change: Fri Nov  8 00:37:05 2013 via crm_attribute on 
 dev-cluster2-node4
   Stack: corosync
   Current DC: dev-cluster2-node4 (172793107) - partition with quorum
   Version: 1.1.10-1.el6-368c726
   3 Nodes configured
   7 Resources configured
 
   Node dev-cluster2-node2 (172793105): online
  pingCheck   (ocf::pacemaker:ping):  Started
  pgsql   (ocf::heartbeat:pgsql): Started
   Node dev-cluster2-node3 (172793106): online
  pingCheck   (ocf::pacemaker:ping):  Started
  pgsql   (ocf::heartbeat:pgsql): Started
   Node dev-cluster2-node4 (172793107): online
  pgsql   (ocf::heartbeat:pgsql): Master
  pingCheck   (ocf::pacemaker:ping):  Started
  VirtualIP   (ocf::heartbeat:IPaddr2):   Started
 
   Node Attributes:
   * Node dev-cluster2-node2:
  + default_ping_set  : 100
  + master-pgsql  : -INFINITY
  + pgsql-data-status : STREAMING|ASYNC
  + pgsql-status  : HS:async
   * Node dev-cluster2-node3:
  + default_ping_set  : 100
  + master-pgsql  : -INFINITY
  + pgsql-data-status : STREAMING|ASYNC
  + pgsql-status  : HS:async
   * Node dev-cluster2-node4:
  + default_ping_set  : 100
  + master-pgsql  : 1000
  + pgsql-data-status : LATEST
  + pgsql-master-baseline : 0278
  + pgsql-status  : PRI
 
   Migration summary:
   * Node dev-cluster2-node4:
   * Node dev-cluster2-node2:
   * Node dev-cluster2-node3:
 
   Tickets:
 
   CONFIG:
   node $id=172793105 dev-cluster2-node2. \
  attributes pgsql-data-status=STREAMING|ASYNC standby=false
   node $id=172793106 dev-cluster2-node3. \
  attributes pgsql-data-status=STREAMING|ASYNC standby=false
   node $id=172793107 dev-cluster2-node4. \
  attributes pgsql-data-status=LATEST
   primitive VirtualIP ocf:heartbeat:IPaddr2 \
  params ip=10.76.157.194 \
  op start interval=0 timeout=60s on-fail=stop \
  op monitor interval=10s timeout=60s on-fail=restart \
  op stop interval=0 timeout=60s on-fail=block
   primitive pgsql ocf:heartbeat:pgsql \
  params pgctl="/usr/pgsql-9.1/bin/pg_ctl" psql="/usr/pgsql-9.1/bin/psql" 
 pgdata="/var/lib/pgsql/9.1/data" tmpdir="/tmp/pg" start_opt="-p 5432" 
 logfile="/var/lib/pgsql/9.1//pgstartup.log" rep_mode="async" 
 node_list="dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4." 
 restore_command="gzip -cd /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz > %p" 
 primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" 
 master_ip="10.76.157.194" \
  op start interval=0 timeout=60s on-fail=restart \
  op monitor interval=5s timeout=61s on-fail=restart \
  op monitor interval=1s role=Master timeout=62s 
 on-fail=restart \
  op promote interval=0 timeout=63s on-fail=restart \
  op demote interval=0 timeout=64s on-fail=stop \
  op stop interval=0 timeout=65s on-fail=block \
  op notify interval=0 timeout=66s
   primitive pingCheck ocf:pacemaker:ping \
  params name=default_ping_set host_list=10.76.156.1 
 multiplier=100 \
  op start interval=0 timeout=60s on-fail=restart \
  op monitor interval=10s timeout=60s on-fail=restart \
  op stop interval=0 timeout=60s on-fail=ignore
   ms msPostgresql pgsql \
  meta master-max=1 master-node-max=1 clone-node-max=1 
 notify=true target-role=Master clone-max=3
   clone clnPingCheck pingCheck \
  meta clone-max=3
   location l0_DontRunPgIfNotPingGW msPostgresql \
  rule $id=l0_DontRunPgIfNotPingGW-rule -inf: not_defined 
 default_ping_set or default_ping_set lt 100
   colocation r0_StartPgIfPingGW inf: msPostgresql clnPingCheck
   colocation r1_MastersGroup inf: VirtualIP msPostgresql:Master
   order rsc_order-1 0: clnPingCheck msPostgresql
   order

Re: [Pacemaker] stonith_admin does not work as expected

2013-11-13 Thread Andrew Beekhof

On 13 Nov 2013, at 11:33 pm, andreas graeper agrae...@googlemail.com wrote:

 hi,
 pacemaker version is 1.1.7

quite a bit of work has gone into fencing since then, any chance you could try 
something newer?

 
 the fence agent (which I thought was one of the standard ones) calls
 snmpget -a ipaddr:udpport -c community oid
 snmpset -a ipaddr:udpport -c community oid i 0|1
 
 therefore it needs/uses command-line arguments
 -o action
 -n port (slot-index)
 -a ipaddr
 -c community
 (udpport is not necessary, since it is fixed at 161)
 
 or (as the logs tell me) the fence agent gets its parameters from stdin:
 fence_ifmib <<EOF
  action=
  port=
  ipaddr=
  community=
 EOF
 another, invalid 'nodename=xyz' is also given.
 the fence agent was written for another device, and because our device
 does not support a function (OID_PORT, used to get the port index from the
 port name) we have to use port numbers. But apart from other tiny limitations
 it works great.
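(The agent can also be smoke-tested by hand the same way the cluster drives it,
feeding the parameters on stdin; the agent path and the use of a non-destructive
"status" action here are assumptions:

/usr/sbin/fence_ifmib_epc8212 <<EOF
action=status
port=1
ipaddr=172.27.51.33
community=xxx
EOF
)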
 
 
 <primitive class="stonith" id="fence_1" type="fence_ifmib_epc8212">
   <instance_attributes id="fence_1-instance_attributes">
     <nvpair id="fence_1-instance_attributes-ipaddr" name="ipaddr" value="172.27.51.33"/>
     <nvpair id="fence_1-instance_attributes-community" name="community" value="xxx"/>
     <nvpair id="fence_1-instance_attributes-port" name="port" value="1"/>
     <nvpair id="fence_1-instance_attributes-action" name="action" value="off"/>
     <nvpair id="fence_1-instance_attributes-pcmk_poweroff_action" name="pcmk_poweroff_action" value="off"/>
     <nvpair id="fence_1-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="lisel1"/>
     <nvpair id="fence_1-instance_attributes-pcmk_host_check" name="pcmk_host_check" value="static-list"/>
     <nvpair id="fence_1-instance_attributes-verbose" name="verbose" value="true"/>
   </instance_attributes>
 </primitive>
 
 <primitive class="stonith" id="fence_2" type="fence_ifmib_epc8212">
   <instance_attributes id="fence_2-instance_attributes">
     <nvpair id="fence_2-instance_attributes-ipaddr" name="ipaddr" value="172.27.51.33"/>
     <nvpair id="fence_2-instance_attributes-community" name="community" value="xxx"/>
     <nvpair id="fence_2-instance_attributes-port" name="port" value="2"/>
     <nvpair id="fence_2-instance_attributes-action" name="action" value="off"/>
     <nvpair id="fence_2-instance_attributes-pcmk_poweroff_action" name="pcmk_poweroff_action" value="off"/>
     <nvpair id="fence_2-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="lisel2"/>
     <nvpair id="fence_2-instance_attributes-pcmk_host_check" name="pcmk_host_check" value="static-list"/>
     <nvpair id="fence_2-instance_attributes-verbose" name="verbose" value="true"/>
   </instance_attributes>
 </primitive>
 
 <rsc_location id="location-fence_1-lisel1--INFINITY" node="lisel1" rsc="fence_1" score="-INFINITY"/>
 <rsc_location id="location-fence_2-lisel2--INFINITY" node="lisel2" rsc="fence_2" score="-INFINITY"/>
 
 
 the old master is back now as a slave.
 now, on the (new) master, stonith_admin does not see the device/fence agent
 (see the last message).
 
 how can I repair this?
 
 thanks
 andreas
 
 
 
 
 
 2013/11/11, Andrew Beekhof and...@beekhof.net:
 Impossible to comment without knowing the pacemaker version, full config,
 and how fence_ifmib works (I assume its a custom agent?)
 
 On 12 Nov 2013, at 1:21 am, andreas graeper agrae...@googlemail.com
 wrote:
 
 hi,
 two nodes.
 n1 (slave) fence_2:stonith:fence_ifmib
 n2 (master) fence_1:stonith:fence_ifmib
 
 n1 was fenced cause suddenly not reachable. (reason still unknown)
 
 n2  stonith_admin -L -> 'fence_1'
 n2  stonith_admin -U fence_1 -> timed out
 n2  stonith_admin -L -> 'no devices found'
 
 crm_mon shows fence_1 is running
 
 after manually unfencing n1 with snmpset the slave n1 is up again, but
 still stonith_admin -L says 'no devices found' on n2,
 same on n1: 'fence_2 \n 1 devices found'
 
 what went wrong with stonith_admin ?
 
 when calling crm_mon -rA1 at the end 'Node Attributes' are listed :
 
 * Node lisel1:
   + master-p_drbd_r0:0  : 5
 * Node lisel2:
   + master-p_drbd_r0:0  : 5
   + master-p_drbd_r0:1  : 5
 
 looks strange ? resources are
 ms_drbd_r0 on primary
 p_drbd_r0 on secondary
 ?! or how this is to interpret ?
 
 thanks in advance
 andreas
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project

Re: [Pacemaker] crmd Segmentation fault at pacemaker 1.0.12

2013-11-13 Thread Andrew Beekhof

On 13 Nov 2013, at 7:36 pm, TAKATSUKA Haruka haru...@sraoss.co.jp wrote:

 Hello,  pacemaker hackers
 
 I report crmd's crash at pacemaker 1.0.12 .
 
 We are going to upgrade pacemaker 1.0.12 to 1.0.13 .
 But I was not able to find a fix for this problem from ChangeLog.
 tengine.c:do_te_invoke() does not seem to handle the transition_graph==NULL
 case, even in the 1.0.x head code.

This should help:

https://github.com/ClusterLabs/pacemaker-1.0/commit/20f169d9cccb6c889946c64ab09ab4fb7f572f7c

 
 regards,
 Haruka Takatsuka.
 -
 
 [log]
 Nov 07 00:00:08 srv1 crmd: [21843]: ERROR: crm_abort: abort_transition_graph: 
 Triggered assert at te_utils.c:259 : transition_graph != NULL
 Nov 07 00:00:08 srv1 heartbeat: [21823]: WARN: Managed 
 /usr/lib64/heartbeat/crmd process 21843 killed by signal 11 [SIGSEGV - 
 Segmentation violation].
 Nov 07 00:00:08 srv1 heartbeat: [21823]: ERROR: Managed 
 /usr/lib64/heartbeat/crmd process 21843 dumped core
 Nov 07 00:00:08 srv1 heartbeat: [21823]: EMERG: Rebooting system.  Reason: 
 /usr/lib64/heartbeat/crmd
 
 [gdb]
 $ gdb -c core.21843 -s crmd.debug crmd
 --(snip)--
 Program terminated with signal 11, Segmentation fault.
 #0  0x004199c4 in do_te_invoke (action=140737488355328,
cause=C_FSA_INTERNAL, cur_state=S_POLICY_ENGINE,
current_input=I_FINALIZED, msg_data=0x1b28e20) at tengine.c:186
 186 if(transition_graph->complete == FALSE) {
 --(snip)--
 (gdb) bt
 #0  0x004199c4 in do_te_invoke (action=140737488355328, cause=
C_FSA_INTERNAL, cur_state=S_POLICY_ENGINE, current_input=I_FINALIZED,
msg_data=0x1b28e20) at tengine.c:186
 #1  0x00405ca3 in do_fsa_action (fsa_data=0x1b28e20, an_action=
140737488355328, function=0x419831 do_te_invoke) at fsa.c:154
 #2  0x00406b22 in s_crmd_fsa_actions (fsa_data=0x1b28e20) at fsa.c:410
 #3  0x004061a1 in s_crmd_fsa (cause=C_FSA_INTERNAL) at fsa.c:267
 #4  0x0041208f in crm_fsa_trigger (user_data=0x0) at callbacks.c:631
 #5  0x003777a26146 in crm_trigger_dispatch (source=0x1b1b590, callback=
0x412026 crm_fsa_trigger, userdata=0x1b1b590) at mainloop.c:53
 #6  0x0031d8a38f0e in g_main_context_dispatch ()
   from /lib64/libglib-2.0.so.0
 #7  0x0031d8a3c938 in ?? () from /lib64/libglib-2.0.so.0
 #8  0x0031d8a3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
 #9  0x004051bb in crmd_init () at main.c:139
 #10 0x00405093 in main (argc=1, argv=0x7fff947d1388) at main.c:105
 (gdb) list
 181
 182 if(action & A_TE_CANCEL) {
 183     crm_debug("Cancelling the transition: %s",
 184               transition_graph->complete ? "inactive" : "active");
 185     abort_transition(INFINITY, tg_restart, "Peer Cancelled", NULL);
 186     if(transition_graph->complete == FALSE) {
 187         crmd_fsa_stall(NULL);
 188     }
 189
 190 } else if(action & A_TE_HALT) {
 (gdb) p transition_graph
 $1 = (crm_graph_t *) 0x0
 
 
 





Re: [Pacemaker] Remove a ghost node

2013-11-13 Thread Andrew Beekhof

On 14 Nov 2013, at 1:12 pm, Sean Lutner s...@rentul.net wrote:

 
 On Nov 10, 2013, at 8:03 PM, Sean Lutner s...@rentul.net wrote:
 
 
 On Nov 10, 2013, at 7:54 PM, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 11 Nov 2013, at 11:44 am, Sean Lutner s...@rentul.net wrote:
 
 
 On Nov 10, 2013, at 6:27 PM, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 8 Nov 2013, at 12:59 pm, Sean Lutner s...@rentul.net wrote:
 
 
 On Nov 7, 2013, at 8:34 PM, Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 8 Nov 2013, at 4:45 am, Sean Lutner s...@rentul.net wrote:
 
 I have a confusing situation that I'm hoping to get help with. Last 
 night after configuring STONITH on my two node cluster, I suddenly 
 have a ghost node in my cluster. I'm looking to understand the best 
 way to remove this node from the config.
 
 I'm using the fence_ec2 device for STONITH. I dropped the script 
 on each node, registered the device with stonith_admin -R -a fence_ec2 
 and confirmed the registration with both
 
 # stonith_admin -I
 # pcs stonith list
 
 I then configured STONITH per the Clusters from Scratch doc
 
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html
 
 Here are my commands:
 # pcs cluster cib stonith_cfg
 # pcs -f stonith_cfg stonith create ec2-fencing fence_ec2 
 ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list 
 pcmk_host_list=ip-10-50-3-122 ip-10-50-3-251 op monitor 
 interval=300s timeout=150s op start start-delay=30s interval=0
 # pcs -f stonith_cfg stonith
 # pcs -f stonith_cfg property set stonith-enabled=true
 # pcs -f stonith_cfg property
 # pcs cluster push cib stonith_cfg
 
 After that I saw that STONITH appears to be functioning but a new node 
 listed in pcs status output:
 
 Do the EC2 instances have fixed IPs?
 I didn't have much luck with EC2 because every time they came back up 
 it was with a new name/address which confused corosync and created 
 situations like this.
 
 The IPs persist across reboots as far as I can tell. I thought the 
 problem was due to stonith being enabled but not working so I removed 
 the stonith_id and disabled stonith. After that I restarted pacemaker 
 and cman on both nodes and things started as expected but the ghost node 
 is still there. 
 
 Someone else working on the cluster exported the CIB, removed the node 
 and then imported the CIB. They used this process 
 http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html
 
 Even after that, the ghost node is still there. Would pcs cluster cib > 
 /tmp/cib-temp.xml, editing the node out of the config, and then 
 pcs cluster push cib /tmp/cib-temp.xml work?
 
 No. If its coming back then pacemaker is holding it in one of its 
 internal caches.
 The only way to clear it out in your version is to restart pacemaker on 
 the DC.
 
 Actually... are you sure someone didn't just slip while editing 
 cluster.conf?  [...].1251 does not look like a valid IP :)
 
 In the end this fixed it
 
 # pcs cluster cib > /tmp/cib-tmp.xml
 # vi /tmp/cib-tmp.xml # remove bad node
 # pcs cluster push cib /tmp/cib-tmp.xml
 
 Followed by restarting pacemaker and cman on both nodes. The ghost node 
 disappeared, so it was cached as you mentioned.
 
 I also tracked the bad IP down to bad non-printing characters in the 
 initial command line while configuring the fence_ec2 stonith device. I'd 
 put the command together from the github README and some mailing list 
 posts and laid it out in an external editor. Go me. :)
 
 
 
 Version: 1.1.8-7.el6-394e906
 
 There is now an update to 1.1.10 available for 6.4, that _may_ help in 
 the future.
 
 That's my next task. I believe I'm hitting the failure-timeout not 
 clearing failcount bug and want to upgrade to 1.1.10. Is it safe to yum 
 update pacemaker after stopping the cluster? I see there is also an 
 updated pcs in CentOS 6.4, should I update that as well?
 
 yes and yes
 
 you might want to check if you're using any OCF resource agents that didn't 
 make it into the first supported release though.
 
 http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
 
 Thanks, I'll give that a read. All the resource agents are custom so I'm 
 thinking I'm okay (I'll back them up before upgrading). 
 
 One last question related to the fence_ec2 script. Should crm_mon -VW show 
 it running on both nodes or just one?
 
 I just went through the upgrade to pacemaker 1.1.10 and pcs. After running 
 the yum update for those I ran a crm_verify and I'm seeing errors related to 
 my order and colocation constraints. Did the behavior of these change from 
 1.1.8 to 1.1.10?
 
 # crm_verify -L -V
   error: unpack_order_template:Invalid constraint 
 'order-ClusterEIP_54.215.143.166-Varnish-mandatory': No resource or template 
 named 'Varnish'

Is that true?

   error: unpack_order_template:Invalid constraint 
 'order-Varnish-Varnishlog-mandatory': No resource or template named 'Varnish'
   error

Re: [Pacemaker] recover cib from raw file

2013-11-12 Thread Andrew Beekhof
I wouldn't be surprised to see a relevant pcs command in the future ;-)

On 12 Nov 2013, at 8:51 pm, s.oreilly s.orei...@linnovations.co.uk wrote:

 Brilliant, thanks Andrew. I was looking for a pcs option. Should have thought
 about cibadmin. Hopefully I will never break things badly enough to have to 
 use
 it :-)
 
 Regards
 
 
 Sean O'Reilly
 
 On Mon 11/11/13 10:03 PM , Andrew Beekhof and...@beekhof.net sent:
 
 On 11 Nov 2013, at 9:41 pm, s.oreilly 
 s.orei...@linnovations.co.uk wrote:
 Hi,
 
 Is it possible to recover/replace cib.xml from
 one of the raw files in /var/lib/pacemaker/cib?
 
 I would like to reset the cib to the
 configuration referenced in cib.last, in this case cib-89.raw.
 
 I haven't been able to find a command to do
 this.
 You can reload it into a running cluster with:
 
 cibadmin --replace --xml-file /path/to/cib-89.raw
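 (For completeness, a sketch of the whole sequence, with the paths from this thread;
 verify you are pointing at the version you want before replacing:

   cat /var/lib/pacemaker/cib/cib.last     # number of the newest cib-NN.raw
   cibadmin --replace --xml-file /var/lib/pacemaker/cib/cib-89.raw
 )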
 
 
 Thanks
 
 
 Sean O'Reilly
 
 
 
 
 
 
 
 
 





Re: [Pacemaker] Network outage debugging

2013-11-12 Thread Andrew Beekhof

On 13 Nov 2013, at 6:10 am, Sean Lutner s...@rentul.net wrote:

 The folks testing the cluster I've been building have run a script which 
 blocks all traffic except SSH on one node of the cluster for 15 seconds to 
 mimic a network failure. During this time, the network being down seems to 
 cause some odd behavior from pacemaker resulting in it dying.
 
 The cluster is two nodes and running four custom resources on EC2 instances. 
 The OS is CentOS 6.4 with the config below:
 
 I've attached the /var/log/messages and /var/log/cluster/corosync.log from 
 the time period during the test. I'm having some difficulty in piecing 
 together what happened and am hoping someone can shed some light on the 
 problem. Any indications why pacemaker is dying on that node?

Because corosync is dying underneath it:

Nov 09 14:51:49 [942] ip-10-50-3-251cib:error: send_ais_text:   
Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out 
(110)
Nov 09 14:51:49 [942] ip-10-50-3-251cib:error: pcmk_cpg_dispatch:   
Connection to the CPG API failed: 2
Nov 09 14:51:49 [942] ip-10-50-3-251cib:error: cib_ais_destroy: 
Corosync connection lost!  Exiting.
Nov 09 14:51:49 [942] ip-10-50-3-251cib: info: terminate_cib:   
cib_ais_destroy: Exiting fast...
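(Two things worth checking in a case like this; the log path and cluster.conf syntax
below are the usual CMAN/corosync 1.x ones and may need adjusting:

   # did corosync itself exit or lose membership during the 15s block?
   grep -iE "corosync|TOTEM" /var/log/messages | grep -iE "exit|fail|timed out"

   # a longer totem token in /etc/cluster/cluster.conf can ride out short outages, e.g.
   #   <totem token="30000"/>
)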


 
 
 [root@ip-10-50-3-122 ~]# pcs config
 Corosync Nodes:
 
 Pacemaker Nodes:
 ip-10-50-3-122 ip-10-50-3-251 
 
 Resources: 
 Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
  Attributes: first_network_interface_id=eni-e4e0b68c 
 second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 
 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f 
 interval=5s 
  Operations: monitor interval=5s
 Clone: EIP-AND-VARNISH-clone
  Group: EIP-AND-VARNISH
   Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
Operations: monitor interval=5s
   Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
Operations: monitor interval=5s
   Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
Operations: monitor interval=5s
 Resource: ec2-fencing (type=fence_ec2 class=stonith)
  Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list 
 pcmk_host_list=HA01 HA02 
  Operations: monitor start-delay=30s interval=0 timeout=150s
 
 Location Constraints:
 Ordering Constraints:
  ClusterEIP_54.215.143.166 then Varnish
  Varnish then Varnishlog
  Varnishlog then Varnishncsa
 Colocation Constraints:
  Varnish with ClusterEIP_54.215.143.166
  Varnishlog with Varnish
  Varnishncsa with Varnishlog
 
 Cluster Properties:
 dc-version: 1.1.8-7.el6-394e906
 cluster-infrastructure: cman
 last-lrm-refresh: 1384196963
 no-quorum-policy: ignore
 stonith-enabled: true
 
 net-failure-messages-110913.out  net-failure-corosync-110913.out





Re: [Pacemaker] Follow up: Colocation constraint to External Managed Resource (cluster-recheck-interval=5m ignored after 1.1.10 update?)

2013-11-12 Thread Andrew Beekhof

On 13 Nov 2013, at 12:06 am, Robert H. pacema...@elconas.de wrote:

 Hello,
 
 for PaceMaker 1.1.8 (CentOS Version) the thread 
 http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg18048.html was 
 solved with adding cluster-recheck-interval=5m, causing the LRM

It's the policy engine btw, not the lrmd.

 to be executed every 5 minutes and detecting externally managed resources as 
 started (in this case an externally managed percona cluster).

cluster-recheck-interval shouldn't have anything to do with it.
its completely handled by:

   op monitor enabled=true timeout=20s interval=11s role=Stopped

So the questions to ask:

1. is that recurring operation being executed?
2. is it reporting accurate results?

(For 1., this happens without the involvement of cluster-recheck-interval, the 
lrmd will re-run the command every 'interval' seconds).

 
 Now the same cluster was updated to 1.1.10 (new upstream) and it seems that 
 the problem is back again. It seems that cluster-recheck-interval=5m does 
 not cause the LRM to be executed again after 5 minutes and detect that 
 external, unmanaged resources are started again. The CIB is unmodified.
 
 Has something changed in the upstream release ?

Not intentionally.

 Any hints ?

Have a read of the section But wait there’s still more of 
http://blog.clusterlabs.org/blog/2013/pacemaker-logging/ and see if you can get 
the information from the lrmd process to answer questions 1 and 2 above.
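(A rough way to answer 1. and 2. on the node running the resource; $rsc_id below is a
placeholder for the resource id, and the log location is the CentOS default:

   # 1. is the role=Stopped monitor actually being re-run every 11s?
   grep monitor /var/log/cluster/corosync.log | grep "$rsc_id" | tail

   # 2. what result is it reporting?
   crm_mon -1 -o | grep -A2 "$rsc_id"   # -o lists operations and their rc codes
)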




Re: [Pacemaker] why pacemaker does not control the resources

2013-11-12 Thread Andrew Beekhof

On 12 Nov 2013, at 4:42 pm, Andrey Groshev gre...@yandex.ru wrote:

 
 
 11.11.2013, 03:44, Andrew Beekhof and...@beekhof.net:
 On 8 Nov 2013, at 7:49 am, Andrey Groshev gre...@yandex.ru wrote:
 
  Hi, PPL!
  I need help. I do not understand why it has stopped working.
  This configuration works on another cluster, but on corosync 1.
 
  So... a postgres cluster with master/slave.
  Classic config as in the wiki.
  I build the cluster and start it, and it is working.
  Next I kill postgres on the Master with signal 6, as if disk space had run out
 
  # pkill -6 postgres
  # ps axuww|grep postgres
  root  9032  0.0  0.1 103236   860 pts/0S+   00:37   0:00 grep 
 postgres
 
  PostgreSQL die, But crm_mon shows that the master is still running.
 
  Last updated: Fri Nov  8 00:42:08 2013
  Last change: Fri Nov  8 00:37:05 2013 via crm_attribute on 
 dev-cluster2-node4
  Stack: corosync
  Current DC: dev-cluster2-node4 (172793107) - partition with quorum
  Version: 1.1.10-1.el6-368c726
  3 Nodes configured
  7 Resources configured
 
  Node dev-cluster2-node2 (172793105): online
 pingCheck   (ocf::pacemaker:ping):  Started
 pgsql   (ocf::heartbeat:pgsql): Started
  Node dev-cluster2-node3 (172793106): online
 pingCheck   (ocf::pacemaker:ping):  Started
 pgsql   (ocf::heartbeat:pgsql): Started
  Node dev-cluster2-node4 (172793107): online
 pgsql   (ocf::heartbeat:pgsql): Master
 pingCheck   (ocf::pacemaker:ping):  Started
 VirtualIP   (ocf::heartbeat:IPaddr2):   Started
 
  Node Attributes:
  * Node dev-cluster2-node2:
 + default_ping_set  : 100
 + master-pgsql  : -INFINITY
 + pgsql-data-status : STREAMING|ASYNC
 + pgsql-status  : HS:async
  * Node dev-cluster2-node3:
 + default_ping_set  : 100
 + master-pgsql  : -INFINITY
 + pgsql-data-status : STREAMING|ASYNC
 + pgsql-status  : HS:async
  * Node dev-cluster2-node4:
 + default_ping_set  : 100
 + master-pgsql  : 1000
 + pgsql-data-status : LATEST
 + pgsql-master-baseline : 0278
 + pgsql-status  : PRI
 
  Migration summary:
  * Node dev-cluster2-node4:
  * Node dev-cluster2-node2:
  * Node dev-cluster2-node3:
 
  Tickets:
 
  CONFIG:
  node $id=172793105 dev-cluster2-node2. \
 attributes pgsql-data-status=STREAMING|ASYNC standby=false
  node $id=172793106 dev-cluster2-node3. \
 attributes pgsql-data-status=STREAMING|ASYNC standby=false
  node $id=172793107 dev-cluster2-node4. \
 attributes pgsql-data-status=LATEST
  primitive VirtualIP ocf:heartbeat:IPaddr2 \
 params ip=10.76.157.194 \
 op start interval=0 timeout=60s on-fail=stop \
 op monitor interval=10s timeout=60s on-fail=restart \
 op stop interval=0 timeout=60s on-fail=block
  primitive pgsql ocf:heartbeat:pgsql \
 params pgctl="/usr/pgsql-9.1/bin/pg_ctl" psql="/usr/pgsql-9.1/bin/psql" 
 pgdata="/var/lib/pgsql/9.1/data" tmpdir="/tmp/pg" start_opt="-p 5432" 
 logfile="/var/lib/pgsql/9.1//pgstartup.log" rep_mode="async" 
 node_list="dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4." 
 restore_command="gzip -cd /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz > %p" 
 primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" 
 master_ip="10.76.157.194" \
 op start interval=0 timeout=60s on-fail=restart \
 op monitor interval=5s timeout=61s on-fail=restart \
 op monitor interval=1s role=Master timeout=62s 
 on-fail=restart \
 op promote interval=0 timeout=63s on-fail=restart \
 op demote interval=0 timeout=64s on-fail=stop \
 op stop interval=0 timeout=65s on-fail=block \
 op notify interval=0 timeout=66s
  primitive pingCheck ocf:pacemaker:ping \
 params name=default_ping_set host_list=10.76.156.1 
 multiplier=100 \
 op start interval=0 timeout=60s on-fail=restart \
 op monitor interval=10s timeout=60s on-fail=restart \
 op stop interval=0 timeout=60s on-fail=ignore
  ms msPostgresql pgsql \
 meta master-max=1 master-node-max=1 clone-node-max=1 
 notify=true target-role=Master clone-max=3
  clone clnPingCheck pingCheck \
 meta clone-max=3
  location l0_DontRunPgIfNotPingGW msPostgresql \
 rule $id=l0_DontRunPgIfNotPingGW-rule -inf: not_defined 
 default_ping_set or default_ping_set lt 100
  colocation r0_StartPgIfPingGW inf: msPostgresql clnPingCheck
  colocation r1_MastersGroup inf: VirtualIP msPostgresql:Master
  order rsc_order-1 0: clnPingCheck msPostgresql
  order rsc_order-2 0: msPostgresql:promote VirtualIP:start symmetrical=false
  order rsc_order-3 0: msPostgresql:demote VirtualIP:stop symmetrical=false
  property $id=cib-bootstrap-options \
 dc-version=1.1.10-1

Re: [Pacemaker] The larger cluster is tested.

2013-11-12 Thread Andrew Beekhof
 throttle_get_total_job_limit pacemaker.log
 (snip)
 Nov 08 11:08:31 [2387] vm13   crmd: (  throttle.c:629   )   trace:
 throttle_get_total_job_limit:No change to batch-limit=0
 Nov 08 11:08:32 [2387] vm13   crmd: (  throttle.c:632   )   trace:
 throttle_get_total_job_limit:Using batch-limit=8
 (snip)
 Nov 08 11:10:32 [2387] vm13   crmd: (  throttle.c:632   )   trace:
 throttle_get_total_job_limit:Using batch-limit=16
 
 The above shows that the problem is not solved even if the total
 number of jobs is restricted by batch-limit.
 Are there any other methods of reducing the synchronous messages?
 
 There are not that many internal IPC messages.
 Could they not be handled, at least a little, while the
 synchronization messages are being processed?
 
 Regards,
 Yusuke
 
 2013/11/12 Andrew Beekhof and...@beekhof.net:
 
 On 11 Nov 2013, at 11:48 pm, yusuke iida yusk.i...@gmail.com wrote:
 
 Execution of the graph was also checked.
 Since the number of pending actions is restricted to 16 from the middle, I
 judge that batch-limit is taking effect.
 Observing this, even when jobs are restricted by batch-limit, two or
 more jobs are always fired within 1 second.
 These executed jobs return results, and those generate CIB
 synchronization messages.
 A node which keeps receiving synchronization messages processes
 them preferentially, and postpones internal IPC messages.
 I think that is what caused the timeout.
 
 What load-threshold were you running this with?
 
 I see this in the logs:
 Host vm10 supports a maximum of 4 jobs and throttle mode 0100.  New job 
 limit is 1
 
 Have you set LRMD_MAX_CHILDREN=4 on these nodes?
 I wouldn't recommend that for a single core VM.  I'd let the default of 
 2*cores be used.
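 (For reference, both knobs live in fairly standard places; the exact paths assume a
 RHEL-style install:

   grep LRMD_MAX_CHILDREN /etc/sysconfig/pacemaker   # comment it out to fall back to 2*cores
   pcs property list --all | grep load-threshold     # or: crm_attribute -G -t crm_config -n load-threshold
 )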
 
 
 Also, I'm not seeing Extreme CIB load detected.  Are these still single 
 core machines?
 If so it would suggest that something about:
 
    if(cores == 1) {
        cib_max_cpu = 0.4;
    }
    if(throttle_load_target > 0.0 && throttle_load_target < cib_max_cpu) {
        cib_max_cpu = throttle_load_target;
    }

    if(load > 1.5 * cib_max_cpu) {
        /* Can only happen on machines with a low number of cores */
        crm_notice("Extreme %s detected: %f", desc, load);
        mode |= throttle_extreme;
 
 is wrong.
 
 What was load-threshold configured as?
 
 
 
 
 
 
 -- 
 
 METRO SYSTEMS CO., LTD
 
 Yusuke Iida
 Mail: yusk.i...@gmail.com
 
 




