[Pacemaker] increase the debug level

2011-03-21 Thread jiaju liu
Hi, I use corosync + pacemaker. I set debug: on in corosync.conf; however, there is
nothing more in the log output. If I want the log to output more debug info, what
should I do? Thanks a lot.
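A minimal sketch of the corosync.conf logging section this kind of tuning usually goes through; the logfile path is only an example and has to match your own setup:

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        to_syslog: yes
        debug: on
        timestamp: on
}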




Re: [Pacemaker] Resource failover error

2011-03-21 Thread cfk
Hello,

Many thanks.

BR,

CFK

-Original Message-
From: Andreas Kurz [mailto:andreas.k...@linbit.com] 
Sent: Friday, March 18, 2011 4:22 PM
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Resource failover error

hello,

On 2011-03-18 08:37, c...@itri.org.tw wrote:
 Dear all,
 
 I am a new member of this mailing list. Please let me know if the
 explanation is not clear enough.
 
 I set up a CentOS 5.4 cluster environment (2 nodes, alpha1 and alpha2)
 with the following software:
 
 Corosync 1.3.0
 Pacemaker 1.0.10
 DRBD 8.3.9
 
 The environment is constructed in Active/Passive cluster mode based on
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf.
 
 I set up four resources (IP, DRBD, Filesystem, Apache) and want to test
 different failover situations.
 
 When I kill the corosync process on the active host, Pacemaker seems to
 fail to move DRBD:Master to the original passive host, alpha2.

is there a log entry like 'Multiple primaries not allowed by config'?
... if you only kill corosync and DRBD is still connected and running
fine, DRBD will refuse to be promoted on both sides unless configured
to allow that.

and yes ... stonith would solve this problem.

Regards,
Andreas
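
A sketch of the DRBD-level fencing that is commonly paired with Pacemaker here, assuming the stock crm-fence-peer.sh handler shipped with DRBD 8.3 is installed at the usual path; the drbd_resource name matches the ccmadata resource referenced in the configuration below, but verify paths and options against your own installation:

resource ccmadata {
        disk {
                fencing resource-only;   # have DRBD place a constraint instead of allowing blind promotion
        }
        handlers {
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
}

Combined with stonith-enabled=true in the CIB, this is intended to keep a disconnected peer from being promoted.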

 
  
 
 Corosync and DRBD configuration files are attached to this mail, and the
 crm configuration is listed below
 
 =
 node alpha1
 node alpha2
 primitive ClusterIP ocf:heartbeat:IPaddr2 \
         params ip=192.168.75.10 cidr_netmask=32 \
         op monitor interval=10s
 primitive Disk ocf:linbit:drbd \
         params drbd_resource=ccmadata \
         op monitor interval=60s
 primitive FS ocf:heartbeat:Filesystem \
         params device=/dev/drbd0 directory=/var/www/html fstype=ext3
 primitive WebSite ocf:heartbeat:apache \
         params configfile=/etc/httpd/conf/httpd.conf \
         op monitor interval=1min
 ms DiskClone Disk \
         meta master-max=1 master-node-max=1 clone-max=2 \
         clone-node-max=1 notify=true
 colocation drbd-with-ip inf: ClusterIP DiskClone:Master
 colocation fs-on-drbd inf: FS DiskClone:Master
 colocation website-with-fs inf: WebSite FS
 order DiskClone-after-IP inf: DiskClone:promote ClusterIP:start
 order FS-after-DiskClone inf: DiskClone:promote FS:start
 order WebSite-after-FS inf: FS:start WebSite:start
 property $id=cib-bootstrap-options \
         dc-version=1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3 \
         cluster-infrastructure=openais \
         expected-quorum-votes=2 \
         stonith-enabled=false \
         no-quorum-policy=ignore
 =
 
  
 
 The first abnormal monitoring message by the crm_mon command is
 =
 Last updated: Thu Mar 17 18:19:04 2011
 Stack: openais
 Current DC: alpha2 - partition WITHOUT quorum
 Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
 2 Nodes configured, 2 expected votes
 4 Resources configured.
 
 Online: [ alpha2 ]
 OFFLINE: [ alpha1 ]
 
  Master/Slave Set: DiskClone
      Slaves: [ alpha2 ]
      Stopped: [ Disk:0 ]
 =
 
  
 
 The last abnormal monitoring message is
 =
 Last updated: Thu Mar 17 18:20:01 2011
 Stack: openais
 Current DC: alpha2 - partition WITHOUT quorum
 Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
 2 Nodes configured, 2 expected votes
 4 Resources configured.
 
 Online: [ alpha2 ]
 OFFLINE: [ alpha1 ]
 
  Master/Slave Set: DiskClone
      Slaves: [ alpha2 ]
      Stopped: [ Disk:1 ]
 
 Failed actions:
     Disk:1_promote_0 (node=alpha2, call=12, rc=-2, status=Timed Out): unknown exec error
     Disk:0_promote_0 (node=alpha2, call=22, rc=-2, status=Timed Out): unknown exec error
 =
 
  
 
 Corosync log on host Alpha1 is drbd_test_alpha1.log, and that on host
 Alpha2 is drbd_test_alpha2.log.
 
 My questions are:
 
 1) How do I solve this issue? Am I missing some crm configuration for
 this situation?
 
 2) According to the corosync log on host Alpha2, Pacemaker wants to
 promote 2 DRBD masters (please correct me if I am wrong). The action
 failed because the operation mode is set to Active/Passive and only
 1 DRBD master is allowed to exist. Should I add additional crm or
 drbd.conf configurations?
 
 3) I am still studying STONITH. Is my question a split-brain issue?
 
  
 
 Thanks for your help.
 
  
 
 BR,
 
 Chia-Feng Kang
 
  
 
  
 
  
 
  
 
 

Re: [Pacemaker] [pacemaker][patch 1/4] Simple changes for Pacemaker Explained, chapter 4 Ch_Nodes.xml

2011-03-21 Thread Andrew Beekhof
ack, some pretty bad english there :-)

On Mon, Mar 21, 2011 at 3:27 AM, Marcus Barrow mbar...@redhat.com wrote:

 Some simple changes for the Pacemaker Explained document. These are for 
 CH_Nodes.xml and consist of some typos, missing words etc.

 Regards,
 Marcus Barrow







Re: [Pacemaker] [pacemaker][patch 3/4] Simple changes for Pacemaker Explained, Chapter 6 CH_Constraints.xml

2011-03-21 Thread Andrew Beekhof
Needs some updates.


+   Scores of all kinds are integral to how a cluster works.

well, not all clusters, just pacemaker ones.
How about:

+   Scores of all kinds are integral to how Pacemaker clusters work.


Assume the intent was to avoid confusion with actual sets here?

- <title>Example set of opt-in location constraints</title>
+ <title>Example of opt-in location constraints</title>

Prefer something like:
+ <title>Example usage of opt-in location constraints</title>

or similar to indicate that they only make sense together.


I usually try to avoid questions as titles:
-   <title>What if Two Nodes Have the Same Score</title>
+   <title>What if Two Nodes Have the Same Score?</title>

How about:
+   <title>When Two Nodes Have the Same Score</title>


I like the existing text in this case
-  <title>Specifying the Order Resources Should Start/Stop In</title>
-  <para>The way to specify the order in which resources should
start is by creating <literal>rsc_order</literal> constraints.</para>
+  <title>Specifying Resource Start/Stop Order</title>
+  <para>Use a <literal>rsc_order</literal> constraint to specify
resource ordering.</para>

Also here:
- <entry>The name of a resource that must be started before the
then resource is allowed to. </entry>
+ <entry>The name of a resource that must be started before the
then resource. </entry>
Although changing to be a literal would be an improvement.

Also think colocation makes more sense than resource here:
-   <entry>The colocation target. The cluster will decide where to put
this resource first and then decide where to put the resource in the
rsc field</entry>
+   <entry>The resource target. The cluster will decide where to put
this resource first and then decide where to put the colocation
resource specified in the rsc field</entry>

+  <para>Resource sets were introduced for ordering and dependency
contraints to simplify this situation.</para>
Prefer instead:
+  <para>To simplify the construction of ordering chains, the
resource set syntax may be used instead.</para>

+Using resource sets for complex colocation contraints makes
things easier.
Prefer:
+  <para>To simplify the construction of colocation chains, the
resource set syntax may be used instead.</para>


nack, the word equivalent is important here
-   <title>The equivalent colocation chain expressed using
resource_sets</title>
+   <title>A resource set for the same colocation dependency chain</title>

and here:
-   <title>A group resource with the equivalent colocation rules</title>
+   <title>A group resource for the same colocation dependency chain</title>


Small improvement to:
+   The only thing that matters is that in order for any member of a set
to be active, all the members of the previous set must also be active
(and naturally on the same node). When a set has
<literal>sequential=true</literal>, then in order for any member to
be active, the previous members must also be active.

+   The only thing that matters is that in order for any member of a set
to be active, all the members of the previous set<footnote><para>as
determined by the display order in the configuration</para></footnote>
must also be active (and naturally on the same node).
+ When a set has <literal>sequential=true</literal>, then in order
for any member to be active, the previous members must also be active.


Strictly speaking, they do have ordering dependencies, just not within the set.
+   <caption>Visual representation of a colocation chain where the
members of the middle set have no order dependencies</caption>

Suggest:
+   <caption>Visual representation of a colocation chain where the
members of the middle set have no ordering dependencies with the other
sets</caption>
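
For readers following along, a rough sketch of what the resource set syntax discussed above looks like in the CIB XML; the resource names A, B, C and the constraint ids are hypothetical, so check the finished chapter for the exact schema:

<constraints>
  <rsc_order id="order-abc">
    <resource_set id="order-abc-set" sequential="true">
      <resource_ref id="A"/>
      <resource_ref id="B"/>
      <resource_ref id="C"/>
    </resource_set>
  </rsc_order>
  <rsc_colocation id="coloc-abc" score="INFINITY">
    <resource_set id="coloc-abc-set" sequential="true">
      <resource_ref id="A"/>
      <resource_ref id="B"/>
      <resource_ref id="C"/>
    </resource_set>
  </rsc_colocation>
</constraints>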






On Mon, Mar 21, 2011 at 3:51 AM, Marcus Barrow mbar...@redhat.com wrote:

 More simple changes for the Pacemaker Explained document. These are for
 CH_Constraints.xml and consist of typos and small changes. It also includes a
 change to Section 6.6, where dependencies on preceding sets and preceding
 members of sets were described as M=1 and N+1; these were changed to use
 the word preceding, which may be clearer.

 Regards,
 Marcus Barrow







Re: [Pacemaker] Filesystem resource agent patch

2011-03-21 Thread Lars Ellenberg
On Mon, Mar 21, 2011 at 07:17:52AM +0100, Marko Potocnik wrote:
 Actually the symbolic link is the beautifier. We use different versions of
 the database server, and by using the symbolic link the mount point is always the same.
 
 Do I need to do anything else for the patch to make it into the main branch?

I'm not sure about the availability of readlink,
and its actual behaviour (exit codes), if it exists.

But this patch should still behave anyway, so that's OK.
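
A rough sketch of the defensive pattern being discussed, assuming the agent keeps the mount point in a variable such as MOUNTPOINT (the variable name is illustrative, and readlink -f is a GNU extension that may be missing on some platforms):

# Resolve a symlinked mount point only if readlink is actually available;
# otherwise keep the configured path unchanged.
if [ -L "$MOUNTPOINT" ] && command -v readlink >/dev/null 2>&1; then
        resolved=$(readlink -f "$MOUNTPOINT" 2>/dev/null) && MOUNTPOINT=$resolved
fi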


I personally feel that using symlinks as mount points
should not even work, and will confuse more than beautify.

But maybe that's just me.


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



[Pacemaker] Fencing order

2011-03-21 Thread Pavel Levshin

Hi.

Today, we had a network outage. Quite a few problems suddenly arose in
our setup, including a crashed corosync, the known notify bug in the DRBD RA and
some problem with a VirtualDomain RA timeout on stop.

But particularly strange was the fencing behaviour.

Initially, one node (wapgw1-1) parted from the cluster. When the
connection was restored, corosync had died on that node. It was
considered offline and unclean and was scheduled to be fenced. Fencing by
HP iLO did not work (currently, I do not know why). The second-priority
fencing method is meatware, and it did take time.

The second node, wapgw1-2, hit the DRBD notify bug and failed to stop some
resources. It was online and unclean. It also was scheduled to be fenced.
HP iLO was available for this node, but it had not been STONITHed until
I manually confirmed STONITH for wapgw1-1.

When I confirmed the first node's restart, the second node was fenced automatically.

Is this ordering intended behaviour or a bug?

It's pacemaker 1.0.10, corosync 1.2.7. Three-node cluster.


--
Pavel Levshin





Re: [Pacemaker] Very strange behavior on asymmetric cluster

2011-03-21 Thread Serge Dubrouski
On Sat, Mar 19, 2011 at 4:14 PM, Pavel Levshin pa...@levshin.spb.ru wrote:
 19.03.2011 19:10, Dan Frincu:

 Even if that is set, we need to verify that the resources are, indeed,
 NOT running where they shouldn't be; remember, it is our job to ensure
 that the configured policy is enforced. So, we probe them everywhere to
 ensure they are indeed not around, and stop them if we find them.

 Again, WHY do you need to verify things which cannot happen by setup? If
 some resource cannot, REALLY CANNOT exist on a node, and the administrator can
 confirm this, why rely on network, cluster stack, resource agents,
 electricity in the power outlet, etc. to verify that 2+2 is still 4?

 Don't want to step on any toes or anything, mainly because me stepping on
 somebody's toes without the person wearing a pair of steel-toe cap boots
 would leave them toeless, but I've been hearing the ranting go on and on and
 just felt like maybe something's missing from the picture, specifically, an
 example for why checking for resources on passive nodes is a good thing,
 which I haven't seen thus far.

 ...

 Ok, so far it sounds perfect, but what happens if on the secondary/passive
 node, someone starts the service, by user error, by upgrading the software
 and thus activating its automatic startup at the given runlevel and
 restarting the secondary node (common practice when performing upgrades in a
 cluster environment), etc. If Pacemaker were not to check all the nodes for
 the service being active or not = epic fail. Its state-based model, where
 it maintains a state of the resources and performs the necessary actions to
 bring the cluster to that state is what saves us from the epic fail
 moment.

 Surely you are right. Resources must be monitored on standby nodes to
 prevent such a scenario. You can screw up your setup in many other ways,
 however. And Pacemaker (1.0.10, at least) does not execute a recurring monitor
 on the passive node, so you may start your service by hand, and it will go
 unnoticed for quite some time.

 What I am talking about is monitoring (probing) of a resource on a node
 where this resource cannot exist. For example, if you have five nodes in
 your cluster and a DRBD resource, which can, by its nature, work on no more
 than two nodes. Then the other three of your nodes will occasionally be probed
 for that resource. If that action fails, the resource will be restarted
 everywhere. If that node cannot be fenced, the resource will be dead.
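
As a hedged illustration of the setup being described, an asymmetric (opt-in) cluster could be expressed roughly like this in the crm shell; the node and resource names are hypothetical:

property symmetric-cluster=false
location drbd-on-node1 ms_drbd 100: node1
location drbd-on-node2 ms_drbd 100: node2
# Even with opt-in placement, one-shot probes are still run on the
# remaining nodes to verify the resource is not active there.

The last comment is the behaviour this thread is debating.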

As far as I understand, that would require a definition of a quorum
node or another special kind of node where the resource cannot exist.
Figuring out such a role from location/colocation rules seems too
complex to me. The idea of a quorum node was abandoned long ago in
favor of some other features/projects that Lars mentioned earlier.


 There is still at least one case when such a failure may happen even if the RA
 is perfect: a misbehaving or highly overloaded node may cause an RA timeout. And
 bugs or configuration errors may, of course.
 
 A resource should not depend on unrelated things, such as nodes which have
 no connection to the resource. Then the resource will be more stable.

 I'm trying to be impartial here, although I may be biased by my experience
 to rule in favor of Pacemaker, but here's a thought, it's a free world, we
 all have the freedom of speech, which I'm also exercising at the moment,
 want something done, do it yourself, patches are being accepted, don't have
 the time, ask people for their help, in a polite manner, wait for them to
 reply, kindly ask them again (and prayers are heard, Steven Dake released
 http://www.mail-archive.com/openais@lists.linux-foundation.org/msg06072.html  a
 patch for automatic redundant ring recovery, thank you Steven), want
 something done fast, pay some developers to do it for you, say the folks
 over at www.linbit.com wouldn't mind some sponsorship (and I'm not
 affiliated with them in any way, believe it or not, I'm actually doing this
 without external incentives, from the kindness of my heart so to speak).

 My goal for now is to make the problem clear to the team. It is doubtful
 that such a patch will be accepted without that, given the current reaction.
 Moreover, it is not clear how to fix the problem to the best advantage.

 This cluster stack is brilliant. It's a pity to see how it fails to keep a
 resource running while it is relatively simple to avoid unneeded downtime.

 Thank you for participating.


 P.S. There is a crude workaround: op monitor interval=0 timeout=10
 on_fail=nothing. Obviously, it has its own deficiencies.


 --
 Pavel Levshin







-- 
Serge Dubrouski.


Re: [Pacemaker] Very strange behavior on asymmetric cluster

2011-03-21 Thread Serge Dubrouski
On Mon, Mar 21, 2011 at 10:43 AM, Carlos G Mendioroz t...@huapi.ba.ar wrote:

 Serge Dubrouski @ 21/03/2011 13:10 -0300 dixit:

 What I am talking about is monitoring (probing) of a resource on a node
 where this resource cannot exist.

 As far as I understand, that would require a definition of a quorum
 node or another special kind of node where the resource cannot exist.
 Figuring out such a role from location/colocation rules seems too
 complex to me. The idea of a quorum node was abandoned long ago in
 favor of some other features/projects that Lars mentioned earlier.

 There is already a location rule, and a minus-infinity value.
 
 Is that value being used dynamically? If not, could it be used
 as a marker for "this resource cannot possibly run on this node,
 so monitoring is not necessary"?

It is used dynamically quite often. For example, moving a resource off
a node creates such a location rule. Does that mean that along with
moving the resource Pacemaker has to stop monitoring it on the node it left?
I don't think so.


 --
 Carlos G Mendioroz  t...@huapi.ba.ar  LW7 EQI  Argentina





-- 
Serge Dubrouski.



Re: [Pacemaker] Very strange behavior on asymmetric cluster

2011-03-21 Thread Carlos G Mendioroz

Serge Dubrouski @ 21/03/2011 13:49 -0300 dixit:

On Mon, Mar 21, 2011 at 10:43 AM, Carlos G Mendioroz t...@huapi.ba.ar wrote:

Serge Dubrouski @ 21/03/2011 13:10 -0300 dixit:

What I am talking about is monitoring (probing) of a resource on a node
where this resource cannot exist.

As far as I understand, that would require a definition of a quorum
node or another special kind of node where the resource cannot exist.
Figuring out such a role from location/colocation rules seems too
complex to me. The idea of a quorum node was abandoned long ago in
favor of some other features/projects that Lars mentioned earlier.

There is already a location rule, and a minus-infinity value.

Is that value being used dynamically? If not, could it be used
as a marker for "this resource cannot possibly run on this node,
so monitoring is not necessary"?


It is used dynamically quite often. For example, moving a resource off
a node creates such a location rule. Does that mean that along with
moving the resource Pacemaker has to stop monitoring it on the node it left?
I don't think so.


Neither do I. That was exactly my precondition :)
Given that RA absence is dealt with OK (i.e. no need to
install the RA to enable Pacemaker to do what it needs), I feel
it's OK anyway.

I've seen arguments of the kind "if the admin does this, then it
breaks" many times. I buy no such argument. I'm against systems playing
smarter than admins.

--
Carlos G Mendioroz  t...@huapi.ba.ar  LW7 EQI  Argentina



Re: [Pacemaker] Very strange behavior on asymmetric cluster

2011-03-21 Thread Pavel Levshin


21.03.2011 20:14, Carlos G Mendioroz:



It is used dynamically quite often. For example, moving a resource off
a node creates such a location rule. Does that mean that along with
moving the resource Pacemaker has to stop monitoring it on the node it left?
I don't think so.




You are right, location rules are not suitable for this case. I'd prefer
an additional meta parameter (or two) for the resource, listing
included or excluded nodes.



Neither do I. That was exactly my precondition :)
Given that RA absence is dealt with OK (i.e. no need to
install the RA to enable Pacemaker to do what it needs), I feel
it's OK anyway.


It's not completely OK.

First, I have personally been in a situation where an rc=5 ("not installed")
result was lost due to a (still existing) bug
(http://developerbugs.linux-foundation.org/show_bug.cgi?id=2568). It is
a particular case which will eventually be fixed. But there are other
ways to get into a similar situation. Why wait for disaster?

Second, an RA is not a resource. You may have two independent resources with
one RA, suitable for different nodes. You can overcome this by copying
the RA and accessing it by different names for each resource. That would
lead back to case #1.

Third, a deleted RA may resurrect after a software upgrade. You can defend
yourself against this by using a nonstandard location for your RAs. It may
be considered good practice anyway, but IMHO this 'best practice' is not
described in the documentation.

All of this makes building a highly available cluster more difficult.


I've seen arguments of the kind "if the admin does this, then it
breaks" many times. I buy no such argument. I'm against systems playing
smarter than admins.

So am I. Currently, the system tries to auto-detect resource existence
by probing it, even when the admin knows that the resource cannot exist there.



--
Pavel Levshin




[Pacemaker] Stonith

2011-03-21 Thread tariq fillah
Hello,

I need to build a cluster for a database and its filesystems, and I
understood that I need to use STONITH. Which device/agent do you think is the best,
and what parameters do I need? Also, how exactly does STONITH work?

Thanks in advance.
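
For orientation, a minimal sketch of what a STONITH setup can look like in the crm shell, assuming IPMI-capable hardware and the external/ipmi plugin; the node name, address and credentials are placeholders, and the right plugin and parameters depend entirely on your hardware (iLO, IPMI, PDU, ...):

property stonith-enabled=true
primitive st-node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=10.0.0.1 userid=admin passwd=secret \
        op monitor interval=60s
# keep the fencing device for node1 off node1 itself
location st-node1-not-on-node1 st-node1 -inf: node1

In short, when a node stops responding or fails to stop a resource, the surviving nodes use such a device to power it off or reset it before taking over its resources.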


Re: [Pacemaker] Fwd: All resources bounce on failback

2011-03-21 Thread Pavel Levshin


21.03.2011 1:39, David Morton:


order DB_SHARE_FIRST_DEPOT inf: CL_OCFS2_SHARED DEPOT
order DB_SHARE_FIRST_ESP_AUDIT inf: CL_OCFS2_SHARED ESP_AUDIT



Hmm, doesn't this cause the observed behaviour? An infinite score makes the
order mandatory. It is not simple ordering; it requires both
actions to be done together, always. Order is also symmetric by default. Your rules
could be written in plain language as follows:

1. Always start CL_OCFS2_SHARED, then start DEPOT;
1a. Always stop DEPOT, then stop CL_OCFS2_SHARED;
2. Always start CL_OCFS2_SHARED, then start ESP_AUDIT;
2a. Always stop ESP_AUDIT, then stop CL_OCFS2_SHARED;

In your described case, the cluster wants to execute 2a. That causes 1a to be
executed, because CL_OCFS2_SHARED stops. Then the cluster starts DEPOT
again.


Where this behaviour is useful is not clear to me. Could anyone explain?

I would suggest relaxing your ordering rules to a score of 0:
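
That is, something along these lines (the advisory version of the two constraints quoted above):

order DB_SHARE_FIRST_DEPOT 0: CL_OCFS2_SHARED DEPOT
order DB_SHARE_FIRST_ESP_AUDIT 0: CL_OCFS2_SHARED ESP_AUDIT

With a score of 0 the order is only advisory: it is honoured when both actions happen in the same transition, but stopping one resource no longer forces the other to stop.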


--
Pavel Levshin




Re: [Pacemaker] [PATCH]Bug 2567 - crm resource migrate should support an optional role parameter

2011-03-21 Thread Holger Teutsch
Hi Dejan,

On Mon, 2011-03-21 at 16:11 +0100, Dejan Muhamedagic wrote:
 Hi Holger,
 
 On Sat, Mar 19, 2011 at 11:55:57AM +0100, Holger Teutsch wrote:
  Hi Dejan,
  
  On Fri, 2011-03-18 at 14:24 +0100, Dejan Muhamedagic wrote:
   Hi,
   
   On Fri, Mar 18, 2011 at 12:21:40PM +0100, Holger Teutsch wrote:
Hi,
I would like to submit 2 patches of an initial implementation for
discussion.
  ..
To recall:

crm_resource --move resource
creates a standby rule that moves the resource off the currently
active node

while

crm_resource --move resource --node newnode
creates a prefer rule that moves the resource to the new node.
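
Roughly, the constraints these two invocations leave behind in the CIB look like the following in crm shell syntax; the resource and node names are illustrative and the generated constraint ids may differ between versions:

# crm_resource --move --resource myresource   (no node given)
location cli-standby-myresource myresource -inf: current_node
# crm_resource --move --resource myresource --node newnode
location cli-prefer-myresource myresource inf: newnode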

When dealing with clones and masters the behavior was random as the code
only considers the node where the first instance of the clone was
started.

The new code behaves consistently for the master role of an m/s
resource. The options --master and rsc:master are somewhat redundant
as a slave move is not supported. Currently it's more an
acknowledgement of the user.

On the other hand it is desirable (and was requested several times on
the ML) to stop a single resource instance of a clone or master on a
specific node.

Should that be implemented by something like
 
crm_resource --move-off --resource myresource --node devel2 ?

or should

crm_resource refuse to work on clones

and/or should moving the master role be the default for m/s resources
and the --master option discarded ?
   
   I think that we also need to consider the case when clone-max is
   less than the number of nodes. If I understood correctly what you
   were saying. So, all of move slave and move master and move clone
   should be possible.
   
  
  I think the following use cases cover what can be done with such kind of
  interface:
  
  crm_resource --moveoff --resource myresource --node mynode
 - all resource variants: check whether active on mynode, then create 
  standby constraint
  
  crm_resource --move --resource myresource
 - primitive/group: convert to --moveoff --node `current_node`
 - clone/master: refused
  
  crm_resource --move --resource myresource --node mynode
- primitive/group: create prefer constraint
- clone/master: refused
  
  crm_resource --move --resource myresource --master --node mynode
- master: create prefer constraint for master role
- others: refused
  
  They should work (with foreseeable outcome!) regardless of the setting of
  clone-max.
 
 This seems quite complicated to me. Took me a while to figure
 out what's what and where :) Why bother doing the thinking for

I'm afraid the matter *is* complicated. The current implementation of 

crm_resource --move --resource myResource

(without a node name) moves the resource off the node it is
currently active on by creating a standby constraint. For clones and
masters there is no such *single* active node for which the constraint can be
constructed.

Consider this use case:
I have 2 nodes and a clone or master and would like to safely get rid of
one instance on a particular node (e.g. with agents 1.0.5 the slave of a
DB2 HADR pair 8-) ). No idea how that should be done without a move-off
functionality. 

 users? The only case which seems to me worth considering is
 refusing to set a role for non-ms resources. Otherwise, let's let
 the user move things around and enjoy the consequences.

Definitely not true for production clusters. The tools should produce
least-surprise consequences.
  
 
 Cheers,
 

Over the weekend I implemented the above-mentioned functionality. Drop
me a note if you want to play with an early snapshot 8-)

Regards
Holger 




Re: [Pacemaker] Fwd: All resources bounce on failback

2011-03-21 Thread Pavel Levshin

22.03.2011 0:06, David Morton:
Many thanks Pavel!! Using a value of 0 changes the behavior to the
desired one; it makes perfect sense when explained in plain terms, too!!

I will experiment with some non-0 values; what situations could cause
the order directive not to be honored with a 0 value?




Advisory-only ordering is applied when both actions need to be executed.
You have a colocation constraint which takes care of starting OCFS2_SHARED
with DEPOT, and then the order constraint determines what starts first. The
same logic applies to stop. So this setup should be safe.

Note that the score in an ordering constraint is somewhat misleading. The
actual value does not matter; basically, there are only two possible
cases: above zero for a mandatory constraint and zero or less for an
advisory one.



--
Pavel Levshin

