Re: [Pacemaker] How to prevent locked I/O using Pacemaker with Primary/Primary DRBD/OCFS2 (Ubuntu 10.10)
On Wed, Apr 06, 2011 at 10:26:24AM -0600, Reid, Mike wrote:
> Lars,
>
> Thank you for your comments. I did confirm I was running 8.3.8.1, and I have
> even upgraded to 8.3.10 but am still experiencing the same I/O lock issue. I
> definitely agree with you, DRBD is behaving exactly as instructed, being
> properly fenced, etc.
>
> I am quite new to DRBD (and OCFS2), learning a lot as I go. To your
> question regarding copy/paste, yes, the configuration used was
> compiled from a series of different tutorials, plus personal trial
> and error related to this project. I have tried many variations of the
> DRBD config (including resource-and-stonith)
> but have not actually set up a functioning STONITH yet,

And that's why your ocfs2 does not unblock.
It waits for confirmation of a STONITH operation.

> hence the
> "resource-only". The Linbit
> docs have been an amazing resource.
>
> Yes, I realize that a Secondary node is not indicative of its
> data/sync state. The options I am testing here were referenced from
> these pages:
>
> http://www.drbd.org/users-guide/s-ocfs2-create-resource.html
> http://www.drbd.org/users-guide/s-configure-split-brain-behavior.html#s-automatic-split-brain-recovery-configuration
>
> When you say "You do configure automatic data loss here", are you
> suggesting that I am instructing the DRBD survivor to perform a full
> resync to its peer?

Nothing to do with full sync. Should usually be a bitmap based resync.
But it may be a sync in an "unexpected" direction.

> If so, that would make sense since I believe
> this behavior was something I experienced prior to getting fencing
> fully established. In my hard-boot testing, I did once notice the
> "victim" was completely resyncing, which sounds related to
> "after-sb-1pri discard-secondary".
>
> DRBD aside, have you used OCFS2? I fail to see why, if DRBD is
> fencing its peer, OCFS2 remains in a locked state, unable to run
> standalone. To me, this issue does not seem related to DRBD or Pacemaker, but
> rather a lower-level requirement of OCFS2 (DLM?), etc.
>
> To date, the ONLY way I can restore I/O to the remaining node is to bring the
> other node back online, which unfortunately won't work in our Production
> environment. On a separate ML, someone made a suggestion that "qdisk" might
> be required to make this work, and while I have tried "qdisk", my high-level
> research leads me to believe that is a legacy approach, not an option with
> Pacemaker. Is that correct?

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] How to prevent locked I/O using Pacemaker with Primary/Primary DRBD/OCFS2 (Ubuntu 10.10)
On Tue, Apr 05, 2011 at 01:59:00PM -0400, Jean-Francois Malouin wrote:
> Hi,
>
> I don't want to hijack this thread so feel free to change the Subject
> line if you feel like it.
>
> * Lars Ellenberg [20110404 16:56]:
> > On Mon, Apr 04, 2011 at 01:34:48PM -0600, Mike Reid wrote:
> > > All,
> > >
> > > I am running a two-node web cluster on OCFS2 (v1.5.0) via DRBD
> > > Primary/Primary (v8.3.8) and Pacemaker. Everything seems to be working
> >
> > If you want to stay with 8.3.8, make sure you are using 8.3.8.1 (note
> > the trailing .1), or you can run into stalled resyncs.
> > Or upgrade to "most recent".
>
> Just curious, I'm running 8.3.8 but not sure about the trailing '.1'.
> Am I safe with:
>
> ~# cat /proc/drbd
> version: 8.3.8 (api:88/proto:86-94)

No.

> GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by root@puck,
> 2010-11-29 18:13:54
>
> cheers,
> jf

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
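The check discussed in the message above (is the running DRBD at least 8.3.8.1?) could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, and it assumes that an 8.3.8.1 build reports the trailing ".1" on the "version:" line of /proc/drbd, which the thread implies but does not show.

```python
import re

# Minimum build recommended in the thread for people staying on 8.3.8.
MIN_VERSION = (8, 3, 8, 1)


def parse_drbd_version(proc_drbd_text):
    """Return the version from the 'version:' line as a tuple of ints."""
    match = re.search(r"^version:\s*([0-9]+(?:\.[0-9]+)*)",
                      proc_drbd_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'version:' line found")
    return tuple(int(part) for part in match.group(1).split("."))


def is_at_least_8_3_8_1(version):
    """Compare a parsed version against 8.3.8.1 (hypothetical helper)."""
    # Pad to four components so that (8, 3, 8) compares as 8.3.8.0.
    padded = version + (0,) * (4 - len(version))
    return padded >= MIN_VERSION


# Sample text copied from the message above.
sample = (
    "version: 8.3.8 (api:88/proto:86-94)\n"
    "GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by root@puck,\n"
    "2010-11-29 18:13:54\n"
)
print(is_at_least_8_3_8_1(parse_drbd_version(sample)))
```

On a live node the text would come from reading /proc/drbd instead of the sample string. The comparison says nothing about releases older than 8.3.8; the thread only covers 8.3.8 and later.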
Re: [Pacemaker] How to prevent locked I/O using Pacemaker with Primary/Primary DRBD/OCFS2 (Ubuntu 10.10)
Lars,

Thank you for your comments. I did confirm I was running 8.3.8.1, and I have even upgraded to 8.3.10 but am still experiencing the same I/O lock issue. I definitely agree with you, DRBD is behaving exactly as instructed, being properly fenced, etc.

I am quite new to DRBD (and OCFS2), learning a lot as I go. To your question regarding copy/paste, yes, the configuration used was compiled from a series of different tutorials, plus personal trial and error related to this project. I have tried many variations of the DRBD config (including resource-and-stonith) but have not actually set up a functioning STONITH yet, hence the "resource-only". The Linbit docs have been an amazing resource.

Yes, I realize that a Secondary node is not indicative of its data/sync state. The options I am testing here were referenced from these pages:

http://www.drbd.org/users-guide/s-ocfs2-create-resource.html
http://www.drbd.org/users-guide/s-configure-split-brain-behavior.html#s-automatic-split-brain-recovery-configuration

When you say "You do configure automatic data loss here", are you suggesting that I am instructing the DRBD survivor to perform a full resync to its peer? If so, that would make sense since I believe this behavior was something I experienced prior to getting fencing fully established. In my hard-boot testing, I did once notice the "victim" was completely resyncing, which sounds related to "after-sb-1pri discard-secondary".

DRBD aside, have you used OCFS2? I fail to see why, if DRBD is fencing its peer, OCFS2 remains in a locked state, unable to run standalone. To me, this issue does not seem related to DRBD or Pacemaker, but rather a lower-level requirement of OCFS2 (DLM?), etc.

To date, the ONLY way I can restore I/O to the remaining node is to bring the other node back online, which unfortunately won't work in our Production environment. On a separate ML, someone made a suggestion that "qdisk" might be required to make this work, and while I have tried "qdisk", my high-level research leads me to believe that is a legacy approach, not an option with Pacemaker. Is that correct?
Re: [Pacemaker] [PATCH]Bug 2567 - crm resource migrate should support an optional "role" parameter
On Wed, 2011-04-06 at 15:38 +0200, Dejan Muhamedagic wrote:
> On Wed, Apr 06, 2011 at 01:00:36PM +0200, Andrew Beekhof wrote:
> > On Tue, Apr 5, 2011 at 12:27 PM, Dejan Muhamedagic wrote:
> > > Ah, right, sorry, wanted to ask about the difference between
> > > move-off and move. The description looks the same as for move. Is
> > > it that in this case it is for clones so crm_resource needs an
> > > extra node parameter? You wrote in the doc:
> > >
> > > +Migrate a resource (-instance for clones/masters) off the specified node.
> > >
> > > The '-instance' looks somewhat funny. Why not say "Move/migrate a
> > > clone or master/slave instance away from the specified node"?
> > >
> > > I must say that I still find all this quite confusing, i.e. now
> > > we have "move", "unmove", and "move-off", but it's probably just me :)
> >
> > Not just you. The problem is that we didn't fully understand all the
> > use case permutations at the time.
> >
> > I think, notwithstanding legacy compatibility, "move" should probably
> > be renamed to "move-to" and this new option be called "move-from".
> > That seems more obvious and syntactically consistent with the rest of
> > the system.
>
> Yes, move-to and move-from seem more consistent than other
> options. The problem is that the old "move" is at times one and
> then at times another.
>
> > In the absence of a host name, each uses the current location for the
> > named group/primitive resource and complains for clones.
> >
> > The biggest question in my mind is what to call "unmove"...
> > "move-cleanup" perhaps?
>
> move-remove? :D
> Actually, though the word is a bit awkward, unmove sounds fine
> to me.

I would vote for "move-cleanup". It's consistent with move-XXX, and to my
(German) ears "unmove" seems to stand for the previous "move" being undone
and the stuff coming back.

BTW: Has anyone already tried out the code, or do you trust me 8-D ?

Stay tuned for updated patches...

- holger

> Thanks,
>
> Dejan
Re: [Pacemaker] [PATCH]Bug 2567 - crm resource migrate should support an optional "role" parameter
On Wed, Apr 06, 2011 at 01:00:36PM +0200, Andrew Beekhof wrote:
> On Tue, Apr 5, 2011 at 12:27 PM, Dejan Muhamedagic wrote:
> > Ah, right, sorry, wanted to ask about the difference between
> > move-off and move. The description looks the same as for move. Is
> > it that in this case it is for clones so crm_resource needs an
> > extra node parameter? You wrote in the doc:
> >
> > +Migrate a resource (-instance for clones/masters) off the specified node.
> >
> > The '-instance' looks somewhat funny. Why not say "Move/migrate a
> > clone or master/slave instance away from the specified node"?
> >
> > I must say that I still find all this quite confusing, i.e. now
> > we have "move", "unmove", and "move-off", but it's probably just me :)
>
> Not just you. The problem is that we didn't fully understand all the
> use case permutations at the time.
>
> I think, notwithstanding legacy compatibility, "move" should probably
> be renamed to "move-to" and this new option be called "move-from".
> That seems more obvious and syntactically consistent with the rest of
> the system.

Yes, move-to and move-from seem more consistent than other
options. The problem is that the old "move" is at times one and
then at times another.

> In the absence of a host name, each uses the current location for the
> named group/primitive resource and complains for clones.
>
> The biggest question in my mind is what to call "unmove"...
> "move-cleanup" perhaps?

move-remove? :D
Actually, though the word is a bit awkward, unmove sounds fine
to me.

Thanks,

Dejan
Re: [Pacemaker] [PATCH]Bug 2567 - crm resource migrate should support an optional "role" parameter
On Tue, Apr 5, 2011 at 12:27 PM, Dejan Muhamedagic wrote:
> Ah, right, sorry, wanted to ask about the difference between
> move-off and move. The description looks the same as for move. Is
> it that in this case it is for clones so crm_resource needs an
> extra node parameter? You wrote in the doc:
>
> +Migrate a resource (-instance for clones/masters) off the specified node.
>
> The '-instance' looks somewhat funny. Why not say "Move/migrate a
> clone or master/slave instance away from the specified node"?
>
> I must say that I still find all this quite confusing, i.e. now
> we have "move", "unmove", and "move-off", but it's probably just me :)

Not just you. The problem is that we didn't fully understand all the
use case permutations at the time.

I think, notwithstanding legacy compatibility, "move" should probably
be renamed to "move-to" and this new option be called "move-from".
That seems more obvious and syntactically consistent with the rest of
the system.

In the absence of a host name, each uses the current location for the
named group/primitive resource and complains for clones.

The biggest question in my mind is what to call "unmove"...
"move-cleanup" perhaps?
Re: [Pacemaker] proper "dampen" value for ping resource
On Fri, Mar 11, 2011 at 1:59 PM, Klaus Darilion wrote:
> Hi!
>
> I wonder what a proper value for "dampen" would be. Dampen is documented as:
>
> # attrd_updater --help|grep dampen
>  -d, --delay=value  The time to wait (dampening) in seconds further
>                     changes occur
>
> So, I would read this as the delay to forward changes, e.g. to not
> trigger fail-over on the first failed ping, but only after multiple
> failures.
>
> Now, the default values (and the examples) have:
>
> primitive pingtest ocf:pacemaker:ping \
>   params host_list="..." dampen="5s" \
>   op monitor interval="10s"
>
> This means that the host is pinged 5 times (the default number of
> attempts), then there is a 10-second pause, then another 5 pings, then
> again a 10-second pause.
>
> Thus, I would think that if pinging fails (5 of 5 fail), then this is
> an "error", but failover happens 5 seconds later (dampen). Is this
> correct? If yes, then it would make more sense to adapt the examples and
> increase the dampen value to cover at least 2 failed ping attempts (5*2s
> + 10s + 5*2s = 30s) or shorten the attempts and interval, e.g.:
> attempts=1, interval=2s, dampen=5.

Both are sensible ideas, _however_ there is a bug that makes this
inadvisable in pacemaker <= 1.0.10.
For those versions interval must be greater than dampen.

This might seem to make dampen useless, but its primary purpose was to
allow other nodes a chance to also detect connectivity changes. This
is unaffected by the above bug.

Bug fix is:
http://hg.clusterlabs.org/pacemaker/devel/rev/f7b4c065a75c

> If I am completely wrong, what behavior is really caused by "dampen"?
>
> Thanks
> Klaus
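The arithmetic in the exchange above can be checked with a small sketch. The helper names are invented for illustration, the 2 seconds per ping attempt is taken from Klaus's own calculation (5*2s) rather than from any documented default, and the rule "interval must be greater than dampen" is applied only as Andrew states it for pacemaker <= 1.0.10.

```python
def seconds_to_cover_failed_rounds(rounds, attempts, seconds_per_attempt, interval):
    """Time spanned by `rounds` fully failed ping rounds plus the pauses between them."""
    return rounds * attempts * seconds_per_attempt + (rounds - 1) * interval


def ok_for_pacemaker_1_0_10_or_older(dampen, interval):
    """The constraint quoted in the reply: interval must be greater than dampen."""
    return interval > dampen


# Klaus's first proposal: dampen long enough to cover two failed rounds.
dampen_needed = seconds_to_cover_failed_rounds(
    rounds=2, attempts=5, seconds_per_attempt=2, interval=10)
print(dampen_needed)                                                      # 5*2 + 10 + 5*2
print(ok_for_pacemaker_1_0_10_or_older(dampen=dampen_needed, interval=10))

# Klaus's second proposal: attempts=1, interval=2s, dampen=5.
print(ok_for_pacemaker_1_0_10_or_older(dampen=5, interval=2))

# The example configuration from the question: dampen=5s, interval=10s.
print(ok_for_pacemaker_1_0_10_or_older(dampen=5, interval=10))
```

Under these assumptions both proposals violate the constraint on the affected versions, while the stock example (dampen 5s, interval 10s) satisfies it, which matches the reply.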
Re: [Pacemaker] Remote monitor ?
I was thinking more like the ping resource, where you get "collective"
knowledge and use it.

I may think the service is OK, but it may be easier for another node to check.
Thinking of an active-standby arrangement, the standby could verify with
an ad-hoc client that the service (resource) is indeed working fine.

The idea works well in a number of network protocols.

-Carlos

Michael Schwartzkopff @ 06/04/2011 03:57 -0300 wrote:
> > Not really, unless you have the monitor op ssh to the other machine to
> > run the command
> >
> > On Tue, Apr 5, 2011 at 4:01 PM, Carlos G Mendioroz wrote:
> > > Is there a way to let pacemaker get info on the performance of
> > > a resource from another node point of view ?
>
> Well, you could use SNMP. Works nice here.

--
Carlos G Mendioroz  LW7 EQI  Argentina
Re: [Pacemaker] Pacemaker 1.0 Configuration Explained -- More Issues, Please Help -- 2
Added "be stopped".

On Wed, Apr 6, 2011 at 11:05 AM, Igor Kondrashkin wrote:
> Hi Andrew,
>
> Now let us look at Table 5.2, at the last column of the second row.
>
> We see:
>
> "Stopped - Force the resource to"
>
> and that is all. No end of sentence. I suppose the sentence ends with
> "...stop.", but was there anything more after the full stop in this line?
>
> The error persists in "Pacemaker 1.1 Configuration Explained".
>
> Please fix it.
>
> Sincerely Yours,
>
> Igor
Re: [Pacemaker] Pacemaker 1.0 Configuration Explained -- More Issues, Please Help
On Tue, Apr 5, 2011 at 1:37 PM, Igor Kondrashkin wrote:
> Hi Andrew,
>
> I've just found one more issue:
>
> The sentence (it is section 5.2.1)
>
> "The OCF Spec (as it relates to resource agents can be found at:
> http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/resource-agent-api.txt?rev=HEAD)
> [6] and is basically an extension of the Linux Standard Base conventions for
> init scripts to ..."
>
> is corrupted: either some words are missing between "[6]" and "and is
> basically..." or "and" is extraneous and should be deleted.

The closing bracket is in the wrong place.
Instead of being after "HEAD" it should be after "resource agents".

> The issue
> persists in "Pacemaker 1.1 Config Explained" as well.
>
> Please take a look at this sentence and tell me what's the correct variant.
>
> Sincerely Yours,
>
> Igor
Re: [Pacemaker] Remote monitor ?
> Not really, unless you have the monitor op ssh to the other machine to
> run the command
>
> On Tue, Apr 5, 2011 at 4:01 PM, Carlos G Mendioroz wrote:
> > Is there a way to let pacemaker get info on the performance of
> > a resource from another node point of view ?

Well, you could use SNMP. Works nice here.

--
Dr. Michael Schwartzkopff
Guardinistr. 63
81375 München

Tel: (0163) 172 50 98