Re: [Pacemaker] Question about crm_mon -n option

2013-04-01 Thread Kazunori INOUE
(13.03.27 18:01), Andrew Beekhof wrote:
 On Wed, Mar 27, 2013 at 7:44 PM, Kazunori INOUE
 inouek...@intellilink.co.jp wrote:
 Hi,

 I'm using pacemaker-1.1 (c7910371a5. the latest devel).

 In the case of globally-unique=false, instance numbers are appended
 to the result of crm_mon -n, just as in the case of
 globally-unique=true.
 Is this the intended behavior?

 $ crm configure show
   :
 primitive prmDummy ocf:pacemaker:Dummy
 clone clnDummy prmDummy \
  meta clone-max=2 clone-node-max=1 globally-unique=false

 $ crm_mon -n
   :
 Node dev1 (3232261525): online
  prmDummy:1  (ocf::pacemaker:Dummy): Started
 Node dev2 (3232261523): online
  prmDummy:0  (ocf::pacemaker:Dummy): Started


 Without -n, instance numbers are not appended.

 Yeah, instance numbers shouldn't show up here

I wrote a patch that does not display instance numbers when
globally-unique is false.
https://github.com/inouekazu/pacemaker/commit/c9b0ef4e4b3be336a31d83a9297ef23f1adf7c8b
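
For readers comparing the logs referenced below: the visible effect of the change is only that anonymous clone instances lose their :<n> suffix in the node view. A purely illustrative shell sketch of the renaming (this is not the patch code itself):

$ echo "prmDummy:1" | sed 's/:[0-9]*$//'   # drop the instance suffix for display
prmDummy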

The following files show the crm_mon output before and after applying
this patch.
- before_applying.log
- after_applying.log

The cluster configuration is as follows.

$ crm configure show
node $id=3232261523 dev2
node $id=3232261525 dev1
primitive prmDummy ocf:pacemaker:Dummy \
op monitor on-fail=restart interval=10s
primitive prmDummy2 ocf:pacemaker:Dummy \
op monitor on-fail=restart interval=10s
primitive prmStateful ocf:pacemaker:Stateful \
op monitor interval=11s role=Master on-fail=restart \
op monitor interval=12s role=Slave on-fail=restart
ms msStateful prmStateful \
meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true globally-unique=false
clone clnDummy prmDummy \
meta clone-max=2 clone-node-max=1 globally-unique=false
property $id=cib-bootstrap-options \
dc-version=1.1.10-1.el6-e8caee8 \
cluster-infrastructure=corosync \
no-quorum-policy=ignore \
stonith-enabled=false \
startup-fencing=false
rsc_defaults $id=rsc-options \
resource-stickiness=INFINITY \
migration-threshold=1


 $ crm_mon -r
   :
 Full list of resources:

   Clone Set: clnDummy [prmDummy]
   Started: [ dev1 dev2 ]

 
 Best Regards,
 Kazunori INOUE


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

$ crm_mon -1
Last updated: Mon Apr  1 16:09:19 2013
Last change: Mon Apr  1 15:27:44 2013 via cibadmin on dev1
Stack: corosync
Current DC: dev2 (3232261523) - partition with quorum
Version: 1.1.10-1.el6-e8caee8
2 Nodes configured, unknown expected votes
5 Resources configured.


Online: [ dev1 dev2 ]

 prmDummy2  (ocf::pacemaker:Dummy): Started dev1
 Master/Slave Set: msStateful [prmStateful]
 Masters: [ dev1 ]
 Stopped: [ prmStateful:1 ]
 Clone Set: clnDummy [prmDummy]
 Started: [ dev1 ]
 Stopped: [ prmDummy:1 ]

Failed actions:
prmStateful_monitor_12000 (node=dev2, call=38, rc=7, status=complete): not running
prmDummy_monitor_1 (node=dev2, call=25, rc=7, status=complete): not running
$
$ crm_mon -n1
Last updated: Mon Apr  1 16:09:26 2013
Last change: Mon Apr  1 15:27:44 2013 via cibadmin on dev1
Stack: corosync
Current DC: dev2 (3232261523) - partition with quorum
Version: 1.1.10-1.el6-e8caee8
2 Nodes configured, unknown expected votes
5 Resources configured.


Node dev1 (3232261525): online
prmDummy2   (ocf::pacemaker:Dummy): Started
prmStateful:0   (ocf::pacemaker:Stateful):  Master
prmDummy:0  (ocf::pacemaker:Dummy): Started
Node dev2 (3232261523): online

Failed actions:
prmStateful_monitor_12000 (node=dev2, call=38, rc=7, status=complete): not running
prmDummy_monitor_1 (node=dev2, call=25, rc=7, status=complete): not running
$
$ crm_mon -r1
Last updated: Mon Apr  1 16:09:30 2013
Last change: Mon Apr  1 15:27:44 2013 via cibadmin on dev1
Stack: corosync
Current DC: dev2 (3232261523) - partition with quorum
Version: 1.1.10-1.el6-e8caee8
2 Nodes configured, unknown expected votes
5 Resources configured.


Online: [ dev1 dev2 ]

Full list of resources:

 prmDummy2  (ocf::pacemaker:Dummy): Started dev1
 Master/Slave Set: msStateful [prmStateful]
 Masters: [ dev1 ]
 Stopped: [ prmStateful:1 ]
 Clone Set: clnDummy [prmDummy]
 Started: [ dev1 ]
 Stopped: [ prmDummy:1 ]

Failed actions:
prmStateful_monitor_12000 (node=dev2, call=38, rc=7, 

Re: [Pacemaker] Speeding up startup after migration

2013-04-01 Thread David Vossel




- Original Message -
 From: Vladislav Bogdanov bub...@hoster-ok.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Friday, March 29, 2013 2:03:27 AM
 Subject: Re: [Pacemaker] Speeding up startup after migration
 
 29.03.2013 03:31, Andrew Beekhof wrote:
  On Fri, Mar 29, 2013 at 4:12 AM, Benjamin Kiessling
  mittages...@l.unchti.me wrote:
  Hi,
 
  we've got a small pacemaker cluster running which controls an
  active/passive router. On this cluster we've got a semi-large (~30)
  number of primitives which are grouped together. On migration it takes
  quite a long time until each resource is brought up again because they
  are started sequentially. Is there a way to speed up the process,
  ideally to execute these resource agents in parallel? They are fully
  independent so the order in which they finish is of no concern.
  
  I'm guessing you have them in a group?  Don't do that and they will
  fail over in parallel.
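
A minimal sketch of that ungrouped layout (the resource names and the IP below are hypothetical): keep the primitives independent and colocate each one with whatever defines where the router runs, with no ordering between them:

crm configure primitive router-ip ocf:heartbeat:IPaddr2 params ip=192.168.0.254
crm configure primitive svc1 ocf:heartbeat:Dummy
crm configure primitive svc2 ocf:heartbeat:Dummy
crm configure colocation svc1-with-ip inf: svc1 router-ip
crm configure colocation svc2-with-ip inf: svc2 router-ip
# no order constraints between svc1 and svc2, so they start and stop in parallel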
 
 Does the current lrmd implementation have a batch-limit like the one
 cluster-glue's lrmd had? I can't find where it is.

The batch-limit option is still around, but has nothing to do with the lrmd.  
It does limit how many resources can execute in parallel, but at the transition 
engine level rather than the lrmd.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_available_cluster_options
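
For example, a cluster-wide cap of 10 concurrent actions (the value is illustrative) can be set with either of the following:

$ crm configure property batch-limit=10
$ crm_attribute --type crm_config --name batch-limit --update 10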

-- Vossel

 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Same host displayed twice in crm status

2013-04-01 Thread David Vossel
- Original Message -
 From: Nicolas J. nikkro70+pacema...@gmail.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Friday, March 29, 2013 8:55:30 AM
 Subject: [Pacemaker] Same host displayed twice in crm status
 
 Hi,
 
 I have a problem with a Corosync/Pacemaker configuration.
 One host of the cluster has been renamed and now the host is displayed twice
 in the configuration.
 
 When I try to remove the host from the configuration it works but if corosync
 is restarted on one node, the old host appears again.
 I tried several ways to delete the host with no effect.
 
 How can I delete the wrong host?

For the pacemaker version you are using, try deleting the node from the
configuration in both the nodes and status sections, then use the crm_node -R
option to remove the node from the cluster's internal cache.  In pacemaker
versions >= 1.1.8, only the crm_node -R option is required to remove a node.
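
A hedged sketch of that sequence against the stale entry shown below (the exact cibadmin and crm_node syntax can differ slightly between versions):

STALE=VMTESTORADG2.it.dbi-services.com
cibadmin -D -o nodes  -X "<node uname=\"$STALE\"/>"
cibadmin -D -o status -X "<node_state uname=\"$STALE\"/>"
crm_node -R "$STALE"        # some versions also require --force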

-- Vossel
 
 I checked the Linux configuration and there is no place where the old name is
 referenced.
 It's an OEL/Red Hat linux.
 
 Output
 -
 [root@vmtestoradg2 ~]# crm status
 
 Last updated: Fri Mar 29 14:51:56 2013
 Stack: openais
 Current DC: vmtestoradg1 - partition with quorum
 Version: 1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
 4 Nodes configured, 3 expected votes
 1 Resources configured.
 
 
 Online: [ vmtestoradg1 vmtestora10g01 vmtestoradg2 ]
 OFFLINE: [ VMTESTORADG2.it.dbi-services.com ]
 
 DG_IP (ocf::heartbeat:IPaddr2): Started vmtestoradg1
 
 [root@vmtestoradg2 ~]# crm node clearstate VMTESTORADG2.it.dbi-services.com
 Do you really want to drop state for node VMTESTORADG2.it.dbi-services.com ?
 y
 [root@vmtestoradg2 ~]# crm node delete VMTESTORADG2.it.dbi-services.com
 INFO: node VMTESTORADG2.it.dbi-services.com not found by crm_node
 INFO: node VMTESTORADG2.it.dbi-services.com deleted
 
 Thanks in advance
 
 Best Regards,
 
 Nicolas J.
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Speeding up startup after migration

2013-04-01 Thread Vladislav Bogdanov
01.04.2013 17:28, David Vossel wrote:
 
 
 
 
 - Original Message -
 From: Vladislav Bogdanov bub...@hoster-ok.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Friday, March 29, 2013 2:03:27 AM
 Subject: Re: [Pacemaker] Speeding up startup after migration

 29.03.2013 03:31, Andrew Beekhof wrote:
 On Fri, Mar 29, 2013 at 4:12 AM, Benjamin Kiessling
 mittages...@l.unchti.me wrote:
 Hi,

 we've got a small pacemaker cluster running which controls an
 active/passive router. On this cluster we've got a semi-large (~30)
 number of primitives which are grouped together. On migration it takes
 quite a long time until each resource is brought up again because they
 are started sequentially. Is there a way to speed up the process,
 ideally to execute these resource agents in parallel? They are fully
 independent so the order in which they finish is of no concern.

 I'm guessing you have them in a group?  Don't do that and they will
 fail over in parallel.

 Does the current lrmd implementation have a batch-limit like the one
 cluster-glue's lrmd had? I can't find where it is.
 
 The batch-limit option is still around, but has nothing to do with
 the lrmd. It does limit how many resources can execute in parallel, but at
 the transition engine level rather than the lrmd.

Yep, I know that option, it was there for a very long time.

So, if I understand correctly, the new lrmd runs as many simultaneous jobs
as possible. Unfortunately, in some circumstances this would result in
high node load and timeouts. Is there a way to somehow limit that load?

 
 
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_available_cluster_options
 
 -- Vossel
 



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Speeding up startup after migration

2013-04-01 Thread David Vossel




- Original Message -
 From: Vladislav Bogdanov bub...@hoster-ok.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Monday, April 1, 2013 10:35:39 AM
 Subject: Re: [Pacemaker] Speeding up startup after migration
 
 01.04.2013 17:28, David Vossel wrote:
  
  
  
  
  - Original Message -
  From: Vladislav Bogdanov bub...@hoster-ok.com
  To: pacemaker@oss.clusterlabs.org
  Sent: Friday, March 29, 2013 2:03:27 AM
  Subject: Re: [Pacemaker] Speeding up startup after migration
 
  29.03.2013 03:31, Andrew Beekhof wrote:
  On Fri, Mar 29, 2013 at 4:12 AM, Benjamin Kiessling
  mittages...@l.unchti.me wrote:
  Hi,
 
  we've got a small pacemaker cluster running which controls an
  active/passive router. On this cluster we've got a semi-large (~30)
  number of primitives which are grouped together. On migration it takes
  quite a long time until each resource is brought up again because they
  are started sequentially. Is there a way to speed up the process,
  ideally to execute these resource agents in parallel? They are fully
  independent so the order in which they finish is of no concern.
 
  I'm guessing you have them in a group?  Don't do that and they will
  fail over in parallel.
 
  Does the current lrmd implementation have a batch-limit like the one
  cluster-glue's lrmd had? I can't find where it is.
  
  The batch-limit option is still around, but has nothing to do with
  the lrmd. It does limit how many resources can execute in parallel, but at
  the transition engine level rather than the lrmd.
 
 Yep, I know that option, it was there for a very long time.
 
 So, if I understand correctly, the new lrmd runs as many simultaneous jobs
 as possible. Unfortunately, in some circumstances this would result in
 high node load and timeouts. Is there a way to somehow limit that load?

Isn't that what the batch-limit option does?  Or are you saying you want a
batch-limit-type option that is node-specific? Why are you concerned about this
behavior living in the LRMD instead of at the transition processing level?

I believe if we do any batch limiting type behavior at the LRMD level we're 
going to run into problems with the transition timers in the crmd.  The LRMD 
needs to always perform the actions it is given as soon as possible.

-- Vossel

  
  
  http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_available_cluster_options
  
  -- Vossel
  
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Speeding up startup after migration

2013-04-01 Thread Vladislav Bogdanov
01.04.2013 20:09, David Vossel wrote:
 - Original Message -
 From: Vladislav Bogdanov bub...@hoster-ok.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Monday, April 1, 2013 10:35:39 AM
 Subject: Re: [Pacemaker] Speeding up startup after migration

 01.04.2013 17:28, David Vossel wrote:




 - Original Message -
 From: Vladislav Bogdanov bub...@hoster-ok.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Friday, March 29, 2013 2:03:27 AM
 Subject: Re: [Pacemaker] Speeding up startup after migration

 29.03.2013 03:31, Andrew Beekhof wrote:
 On Fri, Mar 29, 2013 at 4:12 AM, Benjamin Kiessling
 mittages...@l.unchti.me wrote:
 Hi,

 we've got a small pacemaker cluster running which controls an
 active/passive router. On this cluster we've got a semi-large (~30)
 number of primitives which are grouped together. On migration it takes
 quite a long time until each resource is brought up again because they
 are started sequentially. Is there a way to speed up the process,
 ideally to execute these resource agents in parallel? They are fully
 independent so the order in which they finish is of no concern.

 I'm guessing you have them in a group?  Don't do that and they will
 fail over in parallel.

 Does the current lrmd implementation have a batch-limit like the one
 cluster-glue's lrmd had? I can't find where it is.

 The batch-limit option is still around, but has nothing to do with
 the lrmd. It does limit how many resources can execute in parallel, but at
 the transition engine level rather than the lrmd.

 Yep, I know that option, it was there for a very long time.

 So, if I understand correctly, the new lrmd runs as many simultaneous jobs
 as possible. Unfortunately, in some circumstances this would result in
 high node load and timeouts. Is there a way to somehow limit that load?
 
 Isn't that what the batch-limit option does? Or are you saying you
 want a batch-limit-type option that is node-specific? Why are you
 concerned about this behavior living in the LRMD instead of at the
 transition processing level?

There was a limit in glue's lrmd, and I think it was there for a reason.
I do not know which behavior is better; they are just different.

 
 I believe if we do any batch limiting type behavior at the LRMD
 level we're going to run into problems with the transition timers in the crmd.

Did that change in the crmd after the lrmd replacement?

 The LRMD needs to always perform the actions it is given as soon as possible.

Yes, but... heavy load on a host (because of, e.g., 150 CPU-intensive
operations running in parallel) may cause, e.g., monitor timeouts, then
resource restarts, then stop timeouts and fencing.



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Problem][crmsh]The designation of the 'ordered' attribute becomes the error.

2013-04-01 Thread Andreas Kurz
Hi Dejan,

On 2013-03-06 11:59, Dejan Muhamedagic wrote:
 Hi Hideo-san,
 
 On Wed, Mar 06, 2013 at 10:37:44AM +0900, renayama19661...@ybb.ne.jp wrote:
 Hi Dejan,
 Hi Andrew,

 As for the crm shell, the meta attribute check was revised by the
 following patch:

  * http://hg.savannah.gnu.org/hgweb/crmsh/rev/d1174f42f4b3

 This patch was backported to Pacemaker 1.0.13.

  * 
 https://github.com/ClusterLabs/pacemaker-1.0/commit/fa1a99ab36e0ed015f1bcbbb28f7db962a9d1abc#shell/modules/cibconfig.py

 However, the ordered/colocated attributes of a group resource are treated as
 an error when I use a crm shell that includes this patch.

 --
 (snip)
 ### Group Configuration ###
 group master-group \
 vip-master \
 vip-rep \
 meta \
 ordered=false
 (snip)

 [root@rh63-heartbeat1 ~]# crm configure load update test2339.crm 
 INFO: building help index
 crm_verify[20028]: 2013/03/06_17:57:18 WARN: unpack_nodes: Blind faith: not fencing unseen nodes
 WARNING: vip-master: specified timeout 60s for start is smaller than the advised 90
 WARNING: vip-master: specified timeout 60s for stop is smaller than the advised 100
 WARNING: vip-rep: specified timeout 60s for start is smaller than the advised 90
 WARNING: vip-rep: specified timeout 60s for stop is smaller than the advised 100
 ERROR: master-group: attribute ordered does not exist  - WHY?
 Do you still want to commit? y
 --

 If I answer `yes` to the confirmation message, the change is applied, but it
 is a problem that an error message is displayed.
  * The error occurs in the same way when I specify the colocated attribute.
 And I noticed that there is no explanation of ordered/colocated for group
 resources in the Pacemaker online help.

 I think that specifying the ordered/colocated attributes of a group resource
 should not be treated as an error.
 In addition, I think that ordered/colocated should be added to the online help.
 
 These attributes are not listed in crmsh. Does the attached patch
 help?

Dejan, will this patch for the missing ordered and collocated group
meta-attributes be included in the next crmsh release? I can't see the
patch in the current tip.

Thanks & Regards,
Andreas

 
 Thanks,
 
 Dejan

 Best Regards,
 Hideo Yamauchi.







___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Speeding up startup after migration

2013-04-01 Thread Lars Marowsky-Bree
On 2013-04-01T13:09:14, David Vossel dvos...@redhat.com wrote:

  So, if I understand correctly, the new lrmd runs as many simultaneous jobs
  as possible. Unfortunately, in some circumstances this would result in
  high node load and timeouts. Is there a way to somehow limit that load?
 Isn't that what the batch-limit option does?  Or are you saying you want a
 batch-limit-type option that is node-specific? Why are you concerned about
 this behavior living in the LRMD instead of at the transition processing
 level?
 
 I believe if we do any batch limiting type behavior at the LRMD level we're 
 going to run into problems with the transition timers in the crmd.  The LRMD 
 needs to always perform the actions it is given as soon as possible.

Seriously, folks, the LRM rewrite may turn out not to be the best
example of pacemaker's attention to detail ;-)

Yes, the previous LRM had a per-node concurrency limit. This avoided
overloading the nodes via IO, which is why it was added. (And also
smoothed out spikes in the monitoring calls should they happen to
coincide.) Default limit of parallel executions was 4 or half the number
of CPU cores, if memory serves.
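
Treating the exact cluster-glue syntax as a recollection rather than a checked reference, that limit could be queried and changed at runtime along these lines:

lrmadmin -g max-children      # show the current per-node limit
lrmadmin -p max-children 4    # cap parallel RA executions at 4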

This turned out to actually improve performance (since it avoided said
spikes) and to avoid timeouts. (While it is true that, given a perfect
scheduler, the total runtime of N_1..100 being kicked off all at once
should be equal to N_1..100 being kicked off serially, it's quite
likely that doing the former will mean at least a few of those 100
operations hitting their *individual* timeouts at the LRM level.)

The TE doesn't have enough knowledge to enforce this, since it doesn't
know if monitors get scheduled. The transition timers weren't really a
problem, since they had some leeway accounted for.

If we don't have this functionality right now anymore, I do believe we
need it back.

I do seem to recall that at the time, Andrew preferred it to be
implemented at the LRM level, because it avoided more complex
transition graph logic (e.g., the batch-limit functionality at a
per-node level, and doing something smart about monitors); but my memory
is hazy on this detail.

Nowadays, since we have the migration-threshold anyway, it may be
possible to do something about it cleanly in the TE, but that still
would leave the monitors unsolved ...


Regards,
Lars

(PS: 1.1.8 really isn't turning out to be my favorite release. If I
wasn't afraid it'd be received as a rant, I'd try to write up a post-mortem
from my/our perspective to see what might be avoidable in the future.)

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org