Re: [Pacemaker] How to prevent locked I/O using Pacemaker with Primary/Primary DRBD/OCFS2 (Ubuntu 10.10)

2011-04-06 Thread Lars Ellenberg
On Wed, Apr 06, 2011 at 10:26:24AM -0600, Reid, Mike wrote:
> Lars,
> 
> Thank you for your comments. I did confirm I was running 8.3.8.1, and I have 
> even upgraded to 8.3.10 but am still experiencing the same I/O lock issue. I 
> definitely agree with you, DRBD is behaving exactly as instructed, being 
> properly fenced, etc.
> 
> I am quite new to DRBD (and OCFS2), learning a lot as I go. To your
> question regarding copy/paste, yes, the configuration was assembled
> from a series of different tutorials, plus personal trial and error
> related to this project. I have tried many variations of the
> DRBD config (including resource-and-stonith)

> but have not actually set up a functioning STONITH yet,

And that's why your ocfs2 does not unblock.
It waits for confirmation of a STONITH operation.

> hence the
> "resource-only". The Linbit
> docs have been an amazing resource.
> 
> Yes, I realize that a Secondary node is not indicative of its
> data/sync state. The options I am testing here were referenced from
> these pages:
> 
>   http://www.drbd.org/users-guide/s-ocfs2-create-resource.html
>   http://www.drbd.org/users-guide/s-configure-split-brain-behavior.html#s-automatic-split-brain-recovery-configuration
> 
> When you say "You do configure automatic data loss here", are you
> suggesting that I am instructing the DRBD survivor to perform a full
> resync to its peer?

It has nothing to do with a full sync. It should usually be a bitmap-based
resync.

But it may be a resync in an "unexpected" direction, in which case the
changes made on the node being overwritten are lost.

> If so, that would make sense since I believe
> this behavior was something I experienced prior to getting fencing
> fully established. In my hard-boot testing, I did once notice the
> "victim" was completely resynching, which sounds related to
> "after-sb-1pri discard-secondary". 
> 
> DRBD aside, have you used OCFS2? I fail to see why, if DRBD is 
> fencing its peer, OCFS2 remains in a locked state, unable to run 
> standalone. To me, this issue does not seem related to DRBD or Pacemaker, but 
> rather to a lower-level requirement of OCFS2 (DLM?), etc.
> 
> To date, the ONLY way I can restore I/O to the remaining node is to bring the 
> other node back online, which unfortunately won't work in our Production 
> environment. On a separate ML, someone made a suggestion that "qdisk" might 
> be required to make this work, and while I have tried "qdisk", my high-level 
> research leads me to believe that is a legacy approach, not an option with 
> Pacemaker.  Is that correct? 

> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [Pacemaker] How to prevent locked I/O using Pacemaker with Primary/Primary DRBD/OCFS2 (Ubuntu 10.10)

2011-04-06 Thread Lars Ellenberg
On Tue, Apr 05, 2011 at 01:59:00PM -0400, Jean-Francois Malouin wrote:
> Hi,
> 
> I don't want to hijack this thread so feel free to change the Subject
> line if you feel like it.
> 
> * Lars Ellenberg  [20110404 16:56]:
> > On Mon, Apr 04, 2011 at 01:34:48PM -0600, Mike Reid wrote:
> > > All,
> > > 
> > > I am running a two-node web cluster on OCFS2 (v1.5.0) via DRBD
> > > Primary/Primary (v8.3.8) and Pacemaker. Everything  seems to be working
> > 
> > If you want to stay with 8.3.8, make sure you are using 8.3.8.1 (note
> > the trailing .1), or you can run into stalled resyncs.
> > Or upgrade to "most recent".
> 
> 
> Just curious, I'm running 8.3.8 but not sure about the trailing '.1'. 
> Am I safe with:
> 
> ~# cat /proc/drbd
> version: 8.3.8 (api:88/proto:86-94)

No.

> GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by root@puck,
> 2010-11-29 18:13:54
> 
> cheers,
> jf
> 
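One way to check a node for this is to parse the version line of
/proc/drbd. The Python sketch below assumes (this thread does not confirm
it) that a fixed build reports itself as "8.3.8.1" on that line; the
function names are made up for illustration.

```python
import re

# Assumption: the fixed 8.3.8 build identifies itself as "8.3.8.1" in /proc/drbd.
MIN_SAFE = (8, 3, 8, 1)

def parse_drbd_version(proc_text):
    """Return the DRBD version found in /proc/drbd text as a tuple of ints."""
    match = re.search(r"^version:\s*(\d+(?:\.\d+)*)", proc_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'version:' line found")
    return tuple(int(part) for part in match.group(1).split("."))

def has_resync_fix(proc_text):
    """True if the reported version is 8.3.8.1 or newer."""
    version = parse_drbd_version(proc_text)
    # Pad with zeros so that (8, 3, 8) compares as (8, 3, 8, 0).
    padded = version + (0,) * (len(MIN_SAFE) - len(version))
    return padded >= MIN_SAFE

sample = "version: 8.3.8 (api:88/proto:86-94)\n"
print(parse_drbd_version(sample), has_resync_fix(sample))
```

With the output quoted above, the check fails because a plain "8.3.8" sorts
below "8.3.8.1".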

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [Pacemaker] How to prevent locked I/O using Pacemaker with Primary/Primary DRBD/OCFS2 (Ubuntu 10.10)

2011-04-06 Thread Reid, Mike
Lars,

Thank you for your comments. I did confirm I was running 8.3.8.1, and I have 
even upgraded to 8.3.10 but am still experiencing the same I/O lock issue. I 
definitely agree with you, DRBD is behaving exactly as instructed, being 
properly fenced, etc.

I am quite new to DRBD (and OCFS2), learning a lot as I go. To your question 
regarding copy/paste, yes, the configuration was assembled from a series 
of different tutorials, plus personal trial and error related to this project. 
I have tried many variations of the DRBD config (including 
resource-and-stonith) but have not actually set up a functioning STONITH yet, 
hence the "resource-only". The Linbit docs have been an amazing resource.

Yes, I realize that a Secondary node is not indicative of its data/sync 
state. The options I am testing here were referenced from these pages:

http://www.drbd.org/users-guide/s-ocfs2-create-resource.html
http://www.drbd.org/users-guide/s-configure-split-brain-behavior.html#s-automatic-split-brain-recovery-configuration

When you say "You do configure automatic data loss here", are you suggesting 
that I am instructing the DRBD survivor to perform a full resync to its peer? If 
so, that would make sense since I believe this behavior was something I 
experienced prior to getting fencing fully established. In my hard-boot 
testing, I did once notice the "victim" was completely resyncing, which sounds 
related to "after-sb-1pri discard-secondary". 

DRBD aside, have you used OCFS2? I fail to see why, if DRBD is fencing 
its peer, OCFS2 remains in a locked state, unable to run standalone. To 
me, this issue does not seem related to DRBD or Pacemaker, but rather to a 
lower-level requirement of OCFS2 (DLM?), etc.

To date, the ONLY way I can restore I/O to the remaining node is to bring the 
other node back online, which unfortunately won't work in our Production 
environment. On a separate ML, someone made a suggestion that "qdisk" might be 
required to make this work, and while I have tried "qdisk", my high-level 
research leads me to believe that is a legacy approach, not an option with 
Pacemaker.  Is that correct? 


Re: [Pacemaker] [PATCH]Bug 2567 - crm resource migrate should support an optional "role" parameter

2011-04-06 Thread Holger Teutsch
On Wed, 2011-04-06 at 15:38 +0200, Dejan Muhamedagic wrote:
> On Wed, Apr 06, 2011 at 01:00:36PM +0200, Andrew Beekhof wrote:
> > On Tue, Apr 5, 2011 at 12:27 PM, Dejan Muhamedagic wrote:
> > > Ah, right, sorry, wanted to ask about the difference between
> > > move-off and move. The description looks the same as for move. Is
> > > it that in this case it is for clones so crm_resource needs an
> > > extra node parameter? You wrote in the doc:
> > >
> > >+Migrate a resource (-instance for clones/masters) off the 
> > > specified node.
> > >
> > > The '-instance' looks somewhat funny. Why not say "Move/migrate a
> > > clone or master/slave instance away from the specified node"?
> > >
> > > I must say that I still find all this quite confusing, i.e. now
> > > we have "move", "unmove", and "move-off", but it's probably just me :)
> > 
> > Not just you.  The problem is that we didn't fully understand all the
> > use case permutations at the time.
> > 
> > I think, notwithstanding legacy compatibility, "move" should probably
> > be renamed to "move-to" and this new option be called "move-from".
> > That seems more obvious and syntactically consistent with the rest of
> > the system.
> 
> Yes, move-to and move-from seem more consistent than other
> options. The problem is that the old "move" is at times one and
> then at times another.
> 
> > In the absence of a host name, each uses the current location for the
> > named group/primitive resource and complains for clones.
> > 
> > The biggest question in my mind is what to call "unmove"...
> > "move-cleanup" perhaps?
> 
> move-remove? :D
> Actually, though the word is a bit awkward, unmove sounds fine
> to me.

I would vote for "move-cleanup". It's consistent with move-XXX, and to my
(German) ears "unmove" suggests that the previous "move" is undone and the
resource comes back.

BTW: Has someone already tried out the code or do you trust me 8-D ?

Stay tuned for updated patches...

- holger
> 
> Thanks,
> 
> Dejan
> 





Re: [Pacemaker] [PATCH]Bug 2567 - crm resource migrate should support an optional "role" parameter

2011-04-06 Thread Dejan Muhamedagic
On Wed, Apr 06, 2011 at 01:00:36PM +0200, Andrew Beekhof wrote:
> On Tue, Apr 5, 2011 at 12:27 PM, Dejan Muhamedagic wrote:
> > Ah, right, sorry, wanted to ask about the difference between
> > move-off and move. The description looks the same as for move. Is
> > it that in this case it is for clones so crm_resource needs an
> > extra node parameter? You wrote in the doc:
> >
> >        +Migrate a resource (-instance for clones/masters) off the specified 
> > node.
> >
> > The '-instance' looks somewhat funny. Why not say "Move/migrate a
> > clone or master/slave instance away from the specified node"?
> >
> > I must say that I still find all this quite confusing, i.e. now
> > we have "move", "unmove", and "move-off", but it's probably just me :)
> 
> Not just you.  The problem is that we didn't fully understand all the
> use case permutations at the time.
> 
> I think, notwithstanding legacy compatibility, "move" should probably
> be renamed to "move-to" and this new option be called "move-from".
> That seems more obvious and syntactically consistent with the rest of
> the system.

Yes, move-to and move-from seem more consistent than other
options. The problem is that the old "move" is at times one and
then at times another.

> In the absence of a host name, each uses the current location for the
> named group/primitive resource and complains for clones.
> 
> The biggest question in my mind is what to call "unmove"...
> "move-cleanup" perhaps?

move-remove? :D

Actually, though the word is a bit awkward, unmove sounds fine
to me.

Thanks,

Dejan




Re: [Pacemaker] [PATCH]Bug 2567 - crm resource migrate should support an optional "role" parameter

2011-04-06 Thread Andrew Beekhof
On Tue, Apr 5, 2011 at 12:27 PM, Dejan Muhamedagic  wrote:
> Ah, right, sorry, wanted to ask about the difference between
> move-off and move. The description looks the same as for move. Is
> it that in this case it is for clones so crm_resource needs an
> extra node parameter? You wrote in the doc:
>
>        +Migrate a resource (-instance for clones/masters) off the specified 
> node.
>
> The '-instance' looks somewhat funny. Why not say "Move/migrate a
> clone or master/slave instance away from the specified node"?
>
> I must say that I still find all this quite confusing, i.e. now
> we have "move", "unmove", and "move-off", but it's probably just me :)

Not just you.  The problem is that we didn't fully understand all the
use case permutations at the time.

I think, notwithstanding legacy compatibility, "move" should probably
be renamed to "move-to" and this new option be called "move-from".
That seems more obvious and syntactically consistent with the rest of
the system.

In the absence of a host name, each uses the current location for the
named group/primitive resource and complains for clones.
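
To make the proposed semantics concrete, here is a toy Python model. It is
only a sketch: "move-to" and "move-from" are the names suggested above, not
existing options, and the Cluster class, its cleanup method, and the way
constraints are stored are invented for illustration rather than taken from
crm_resource.

```python
INFINITY = float("inf")

class Cluster:
    """Toy model: where resources run and which location constraints exist."""

    def __init__(self, running_on, clones=()):
        self.running_on = dict(running_on)   # resource -> current node
        self.clones = set(clones)            # resources that are clones
        self.constraints = {}                # (resource, node) -> score

    def _resolve_node(self, resource, node):
        # Without a host name, fall back to the current location. That is
        # ambiguous for clones, so complain instead of guessing.
        if node is not None:
            return node
        if resource in self.clones:
            raise ValueError(f"{resource} is a clone: a node name is required")
        return self.running_on[resource]

    def move_to(self, resource, node=None):
        # Prefer the given node (hypothetical "move-to").
        self.constraints[(resource, self._resolve_node(resource, node))] = INFINITY

    def move_from(self, resource, node=None):
        # Ban the given node (hypothetical "move-from").
        self.constraints[(resource, self._resolve_node(resource, node))] = -INFINITY

    def move_cleanup(self, resource):
        # Drop the constraints only; this model does not relocate anything.
        for key in [k for k in self.constraints if k[0] == resource]:
            del self.constraints[key]

cluster = Cluster({"web": "node1"}, clones=["ms_drbd"])
cluster.move_from("web")        # no host given: bans the current node
print(cluster.constraints)
cluster.move_cleanup("web")     # the step currently called "unmove"
print(cluster.constraints)
```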

The biggest question in my mind is what to call "unmove"...
"move-cleanup" perhaps?



Re: [Pacemaker] proper "dampen" value for ping resource

2011-04-06 Thread Andrew Beekhof
On Fri, Mar 11, 2011 at 1:59 PM, Klaus Darilion wrote:
> Hi!
>
> I wonder what a proper value for "dampen" would be. Dampen is documented as:
>
> # attrd_updater --help|grep dampen
>  -d, --delay=value      The time to wait (dampening) in seconds further
> changes occur
>
>
> So, I would read this as the delay to forward changes, e.g. to not
> trigger fail-over on the first failed ping, but only after multiple
> failures.
>
> Now, the default values (and examples have):
>
> primitive pingtest ocf:pacemaker:ping \
>        params host_list="..." dampen="5s" \
>        op monitor interval="10s"
>
> This means that the host is pinged 5 times (the default number of
> attempts), then there is a 10-second pause, then another 5 pings, then
> again a 10-second pause.
>
> Thus, I would think that if pinging fails (5 of 5 fail), then this is
> an "error", but failover happens 5 seconds later (dampen). Is this
> correct? If yes, then it would make more sense to adapt the examples and
> increase the dampen value to cover at least 2 failed ping attempts (5*2s
> + 10s + 5*2s = 30s) or shorten the attempts and interval, e.g.:
> attempts=1, interval=2s, dampen=5.

Both are sensible ideas, _however_ there is a bug that makes this
inadvisable in pacemaker <= 1.0.10.
For those versions interval must be greater than dampen.

This might seem to make dampen useless, but its primary purpose was to
allow other nodes a chance to also detect connectivity changes.
This is unaffected by the above bug.

Bug fix is:
   http://hg.clusterlabs.org/pacemaker/devel/rev/f7b4c065a75c
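
The arithmetic and the constraint can be written down as a small Python
sketch. The numbers are the ones from the question (5 attempts, 2s per
attempt, 10s interval); the function names are hypothetical and nothing
here calls into Pacemaker.

```python
def two_failed_rounds_window(attempts, ping_timeout_s, interval_s):
    """Seconds spanned by two consecutive, fully failed ping rounds."""
    one_round = attempts * ping_timeout_s
    return one_round + interval_s + one_round

def dampen_is_safe(dampen_s, interval_s, affected_by_bug=True):
    """On pacemaker <= 1.0.10 the monitor interval must exceed dampen."""
    return (not affected_by_bug) or interval_s > dampen_s

# Values from the question: 5*2s + 10s + 5*2s
print(two_failed_rounds_window(5, 2, 10))
# The example config (dampen=5s, interval=10s) respects the constraint...
print(dampen_is_safe(5, 10))
# ...but raising dampen to 30s with a 10s interval does not on affected versions.
print(dampen_is_safe(30, 10))
```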

>
> If I am completely wrong, what behavior is really caused by "dampen"?
>
> Thanks
> Klaus
>



Re: [Pacemaker] Remote monitor ?

2011-04-06 Thread Carlos G Mendioroz

I was thinking more like the ping resource, where you get "collective"
knowledge and use it. I may think the service is OK, but it may be 
easier for another node to check.

Thinking of an active-standby arrangement, the standby could verify
with an ad-hoc client that the service (resource) is indeed working 
fine. The idea works well in a number of network protocols.
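
As a rough illustration of such an ad-hoc client, the Python sketch below
probes a TCP service from the checking node and maps the result to the OCF
return codes for "success" and "not running". The probe_service name, the
port, and the idea of wiring this into a monitor are assumptions; a real
check would speak the service's own protocol.

```python
import socket

OCF_SUCCESS = 0        # OCF return code: resource is running
OCF_NOT_RUNNING = 7    # OCF return code: resource is not running

def probe_service(host, port, timeout_s=2.0):
    """Try a TCP connect from this node and return an OCF-style code."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return OCF_SUCCESS
    except OSError:
        # Refused, unreachable, or timed out: treat as not running.
        return OCF_NOT_RUNNING

# Self-contained demo against a throwaway local listener.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
print(probe_service("127.0.0.1", port))   # listener is up
listener.close()
print(probe_service("127.0.0.1", port))   # listener is gone
```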

-Carlos

Michael Schwartzkopff @ 06/04/2011 03:57 -0300 wrote:
> > Not really, unless you have the monitor op ssh to the other machine to
> > run the command
> > 
> > On Tue, Apr 5, 2011 at 4:01 PM, Carlos G Mendioroz wrote:
> > > Is there a way to let pacemaker get info on the performance of
> > > a resource from another node point of view ?
> 
> Well, you could use SNMP. Works nice here.



--
Carlos G Mendioroz  LW7 EQI  Argentina



Re: [Pacemaker] Pacemaker 1.0 Configuration Explained -- More Issues, Please Help -- 2

2011-04-06 Thread Andrew Beekhof
Added "be stopped"

On Wed, Apr 6, 2011 at 11:05 AM, Igor Kondrashkin  wrote:
> Hi Andrew,
>
> Now let us look at Table 5.2, at the last column of second row.
>
> We see:
>
> "Stopped - Force the resource to"
>
> and that is all. No end of sentence. I suppose the sentence ends with
> "...stop.", but was there anything more after the full stop on this line?
>
> The error persists in "Pacemaker 1.1 Configuration Explained".
>
> Please fix it.
>
> Sincerely Yours,
>
> Igor
>



Re: [Pacemaker] Pacemaker 1.0 Configuration Explained -- More Issues, Please Help

2011-04-06 Thread Andrew Beekhof
On Tue, Apr 5, 2011 at 1:37 PM, Igor Kondrashkin  wrote:
> Hi Andrew,
>
> I've just found one more issue:
>
> The sentence (it is section 5.2.1)
>
> "The OCF Spec (as it relates to resource agents can be found at:
> http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/resource-agent-api.txt?rev=HEAD)
> [6] and is basically an extension of the Linux Standard Base conventions for
> init scripts to ..."
>
>
> is corrupted: either some words are missing between "[6]" and "and is
> basically..." or "and" is extraneous and should be deleted.

The closing bracket is in the wrong place.
Instead of being after "HEAD" it should be after "resource agents".

> The issue
> persists in "Pacemaker 1.1 Config Explained" as well.
>
> Please take a look at this sentence and tell me what's the correct variant.
>
> Sincerely Yours,
>
> Igor
>



Re: [Pacemaker] Remote monitor ?

2011-04-06 Thread Michael Schwartzkopff
> Not really, unless you have the monitor op ssh to the other machine to
> run the command
> 
> On Tue, Apr 5, 2011 at 4:01 PM, Carlos G Mendioroz  wrote:
> > Is there a way to let pacemaker get info on the performance of
> > a resource from another node point of view ?

Well, you could use SNMP. Works nice here.

-- 
Dr. Michael Schwartzkopff
Guardinistr. 63
81375 München

Tel: (0163) 172 50 98

