Re: [Pacemaker] System Health backend part

2009-06-04 Thread Andrew Beekhof
On Tue, Jun 2, 2009 at 11:35 PM, Mark Hamzy  wrote:

> I believe that general purpose solutions that follow standards should
> live in pacemaker.

Just returning to this for a moment, if it is a truly general purpose
solution, then it could be useful for those not running Pacemaker.

So if we're talking about a daemon, then unless it's pacemaker-specific
or needs pacemaker to compile, it really should go in with the
other resources (we'll have some announcements regarding where
resources are kept "soon").

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] pingd comments and metadata

2009-06-04 Thread Andrew Beekhof
On Thu, Jun 4, 2009 at 8:47 AM, Florian Haas  wrote:
> Andrew, Dejan, Dominik,
>
> I am by no means a pingd expert, but the current incarnation in
> stable-1.0 seems to have some outdated and misleading comments and meta
> data. Examples:
>
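> <parameter ...>
> <longdesc lang="en">
> The list of ping nodes to count.  Defaults to all configured ping nodes.
>  Rarely needs to be specified.
> </longdesc>
> <shortdesc lang="en">Host list</shortdesc>
> <content .../>
> </parameter>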
>
> Do we even still have "configured ping nodes" in the original, ha.cf sense?

For openais based clusters, no.
For heartbeat based ones, yes.

>
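> <parameter ...>
> <longdesc lang="en">
> The name of the attributes to set.  This is the name to be used in the
> constraints.
> </longdesc>
> <shortdesc lang="en">Attribute name</shortdesc>
> <content type="integer" .../>
> </parameter>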
>
> I may be mistaken, but I've never used integers as resource names.  And
> since they're XML IDs, I believe they can't start with a numeric
> character anyway. :)

Correct.  But where are integers mentioned?

> Also, checkbashisms complains about this:
> possible bashism in pingd line 215 (kill -[0-9] or -[A-Z]):
>        kill -TERM $pid
> possible bashism in pingd line 234 (kill -[0-9] or -[A-Z]):
>        kill -0 $pid

That's a bashism?

> Maybe you just want to change #!/bin/sh to #!/bin/bash and be done with it.
>
> And, you probably want to update the copyright years so people don't
> believe the RA has been left untouched for three years. :)

One day :-)



Re: [Pacemaker] pingd comments and metadata

2009-06-04 Thread Florian Haas
On 06/04/2009 09:33 AM, Andrew Beekhof wrote:
>> Do we even still have "configured ping nodes" in the original, ha.cf sense?
> 
> For openais based clusters, no.
> For heartbeat based ones, yes.

I see.

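>> <parameter ...>
>> <longdesc lang="en">
>> The name of the attributes to set.  This is the name to be used in the
>> constraints.
>> </longdesc>
>> <shortdesc lang="en">Attribute name</shortdesc>
>> <content type="integer" .../>
>> </parameter>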
>>
>> I may be mistaken, but I've never used integers as resource names.  And
>> since they're XML IDs, I believe they can't start with a numeric
>> character anyway. :)
> 
> Correct.  But where are integers mentioned?

Zoom in on this line:

<content type="integer" .../>

Cheers,
Florian



Re: [Pacemaker] reliable way to cib SEGFAULT -- how is cibadmin -Q --xpath supposed to work?

2009-06-04 Thread Lars Ellenberg
On Wed, Jun 03, 2009 at 10:45:41PM +0200, Andrew Beekhof wrote:
> fixed:
>http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/cf478ed1269f

thanks.
though that simply returns the parent XML_ELEMENT_NODE,
which makes both
'//primitive[@type="IPaddr2" and instance_attributes/nvpair[@name = "ip" and @value="10.0.0.1"]]' and
'//primitive[@type="IPaddr2" and instance_attributes/nvpair[@name = "ip" and @value="10.0.0.1"]]/@id'
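return the same, namely the full primitive xml:
      <primitive ... id="ip_try3" ... type="IPaddr2">
        <instance_attributes id="ip_try3-instance_attributes">
          <nvpair id="ip_try3-instance_attributes-ip" name="ip" value="10.0.0.1"/>
        </instance_attributes>
      </primitive>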

where I would have liked the /@id to only spit out
ip_try3

current workaround is obviously
 | sed -ne '1 { s/^<.* id="\([^"]*\)".*>$/\1/p; };q'

but "it would be nice..."
hm. maybe I can hack something there myself.
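
For comparison, the attribute-only result wished for above can be sketched outside cibadmin. This is a minimal Python sketch, not how cibadmin works internally: it uses the standard library's ElementTree, whose XPath subset cannot return attribute nodes either, so the id is read from each matching element. The sample XML is rebuilt from the ids shown in this thread; the class and provider values and the function name primitive_ids are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Sample rebuilt from the ids quoted in this thread.
# The class/provider attribute values are assumed, not taken from the original.
CIB_FRAGMENT = """
<resources>
  <primitive class="ocf" id="ip_try3" provider="heartbeat" type="IPaddr2">
    <instance_attributes id="ip_try3-instance_attributes">
      <nvpair id="ip_try3-instance_attributes-ip" name="ip" value="10.0.0.1"/>
    </instance_attributes>
  </primitive>
</resources>
"""


def primitive_ids(xml_text, ra_type, attr_name, attr_value):
    """Return the id of every primitive of ra_type that has the given nvpair."""
    root = ET.fromstring(xml_text)
    # ElementTree supports chained attribute predicates on a child path.
    nvpair_path = (
        f"instance_attributes/nvpair[@name='{attr_name}'][@value='{attr_value}']"
    )
    ids = []
    for prim in root.iter("primitive"):
        if prim.get("type") == ra_type and prim.find(nvpair_path) is not None:
            ids.append(prim.get("id"))
    return ids


if __name__ == "__main__":
    for prim_id in primitive_ids(CIB_FRAGMENT, "IPaddr2", "ip", "10.0.0.1"):
        print(prim_id)
```

Run as a script against the sample above, this prints ip_try3. In practice the input could be the output of a cibadmin -Q query piped into such a helper instead of the sed expression.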

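> > cat > tmp.xml <<___
> > <cib>
> >  <configuration>
> >    <resources>
> >      <primitive ... id="ip_try3" ... type="IPaddr2">
> >        <instance_attributes id="ip_try3-instance_attributes">
> >          <nvpair id="ip_try3-instance_attributes-ip" name="ip" value="10.0.0.1"/>
> >        </instance_attributes>
> >      </primitive>
> >    </resources>
> >  </configuration>
> > </cib>
> > ___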
> >
> >
> > xmllint --shell tmp.xml <<<'ls //@id'
> > / > ls //@id
> > tan        7 ip_try3
> > t--       27 ip_try3-instance_attributes
> > t--       30 ip_try3-instance_attributes-ip
> > / >
> >
> > xmllint --shell tmp.xml <<<'ls //primitive[@type="IPaddr2" and instance_attributes/nvpair[@name = "ip" and @value="10.0.0.1"]]/@id'
> > / > ls //primitive[@type="IPaddr2" and instance_attributes/nvpair[@name = "ip" and @value="10.0.0.1"]]/@id
> > tan        7 ip_try3
> > / >

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [Pacemaker] kernel.core_uses_pid and ulimit -c

2009-06-04 Thread Lars Ellenberg
On Thu, Jun 04, 2009 at 08:54:26AM +0200, Florian Haas wrote:
> On 06/04/2009 08:42 AM, Andrew Beekhof wrote:
> > On Thu, Jun 4, 2009 at 8:40 AM, Florian Haas  wrote:
> >> Andrew, Dejan et al.,
> >>
> >> The TODO page at http://clusterlabs.org/wiki/TODO states that Pacemaker
> >> now automagically sets the kernel.core_uses_pid sysctl to ease
> >> debugging. Wouldn't it make sense to check the currently set "ulimit -c"
> >> too, and at least issue a warning message on startup if that ulimit is
> >> set to zero?
> > 
> > Not a bad idea
> 
> Enhancement request filed:
> 
> http://developerbugs.linux-foundation.org/show_bug.cgi?id=2129

not necessary, and I'd even recommend against it.
it does already use setrlimit for that.
at least it does that for me, using cl_plumbing, cl_enable_coredumps(1).

so I'd not want to enable coredumps "globally".  I don't want some
random core files from novell-zislnxd lying around somewhere ;-)
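
The per-process approach described above can be illustrated with a minimal sketch. It is written in Python with the standard resource module rather than the C setrlimit(2) call that cl_enable_coredumps is said to use, and the function names are hypothetical. It raises the core-size soft limit for the calling process only, leaving every other process on the machine untouched, and shows the kind of startup warning suggested earlier in the thread.

```python
import resource


def core_limit_warning(soft_limit):
    """Return a warning string if core dumps are disabled, else None."""
    if soft_limit == 0:
        return "warning: core file size limit is 0, no core dumps will be written"
    return None


def enable_coredumps_for_this_process():
    """Raise the soft RLIMIT_CORE to the hard limit for this process only."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # Raising the soft limit up to the hard limit needs no privileges.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    message = core_limit_warning(soft)
    if message:
        print(message)
    print("limits now:", enable_coredumps_for_this_process())
```

If the hard limit itself is zero, the soft limit stays at zero and a warning is the only thing such a check can offer.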

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [Pacemaker] reliable way to cib SEGFAULT -- how is cibadmin -Q --xpath supposed to work?

2009-06-04 Thread Andrew Beekhof
On Thu, Jun 4, 2009 at 1:33 PM, Lars Ellenberg wrote:
> On Wed, Jun 03, 2009 at 10:45:41PM +0200, Andrew Beekhof wrote:
>> fixed:
>>    http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/cf478ed1269f
>
> thanks.
> though that simply returns the parent XML_ELEMENT_NODE,
> which makes both
> '//primitive[@type="IPaddr2" and instance_attributes/nvpair[@name = "ip" and @value="10.0.0.1"]]' and
> '//primitive[@type="IPaddr2" and instance_attributes/nvpair[@name = "ip" and @value="10.0.0.1"]]/@id'
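> return the same, namely the full primitive xml:
>       <primitive ... id="ip_try3" ... type="IPaddr2">
>         <instance_attributes id="ip_try3-instance_attributes">
>           <nvpair id="ip_try3-instance_attributes-ip" name="ip" value="10.0.0.1"/>
>         </instance_attributes>
>       </primitive>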
>
> where I would have liked the /@id to only spit out
>        ip_try3
>
> current workaround is obviously
>  | sed -ne '1 { s/^<.* id="\([^"]*\)".*>$/\1/p; };q'
>
> but "it would be nice..."

yeah, alas everything between the xpath result and the screen was
built for element nodes.
i guess cibadmin could reapply the xpath expression to the result it
got from the cluster.
that could work.

> hm. maybe I can hack something there myself.

if you happened to write a patch for the above i'd be sure to include it :-)



Re: [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

2009-06-04 Thread Lars Marowsky-Bree
On 2009-05-26T12:50:34, Andrew Beekhof  wrote:

> >> try all the time also after failure like was done before failure.
> >
> > Complete Totem amateur behind the keyboard, but I'd second that. Since
> > you're constantly checking the link status while it's up, why not keep
> > doing so after it's gone down, to see if it has recovered?
> 
> Perhaps even at a decreased (user configurable) interval/rate.

I think that was actually discussed on the openais list and on IRC in
the past and never completely explained why it wouldn't work ;-)



Regards,
Lars

-- 
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




Re: [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

2009-06-04 Thread Lars Marowsky-Bree
On 2009-05-25T18:10:32, Florian Haas  wrote:

> I've repeatedly told customers that NIC bonding is not a valid
> substitute for redundant Heartbeat links, I will stubbornly insist it
> isn't one for OpenAIS RRP links either.

I think your stubbornness is misguided, actually. I had a similar
initial reaction when I looked at this - before ending up recommending
bonding - but it turns out that bonding actually seemed preferable.

The downside with RRP, as mentioned on IRC, is that it is "only"
available to OpenAIS clients. The DLM and drbd and other software
however opens independent TCP connections, not to mention the
server-client connectivity, which only benefits if bonding is used.

> Some reasons:

These reasons are all technically valid, but I don't think they outweigh
the benefit from getting redundancy for all cluster communications.

> - You're not protected against bugs, currently known or unknown, in the
> bonding driver. If bonding itself breaks, you're screwed.

The same is true for bugs in the network stack in general.

> - Most people actually run bonding over interfaces of the same make,
> model, and chipset. That's not necessarily optimal, but it's a reality.
> Thus, if your driver breaks, you're screwed again. Granted, this is
> probably true if you ran two RRP links in that same configuration too.

Exactly.

Some of this can be balanced by running at least different NICs in
different nodes, which mitigates the problem at the cluster level, even
if a single node goes down.

> - Finally, you can't bond between a switched and a direct back-to-back
> connection, which makes bonding entirely unsuitable for the redundant
> links use case I described earlier.

Yes, bonding has a different deployment mode than the scenario you
described. On the other hand, modifying the deployment scenario would
give you more redundancy even for the replication, which has benefits
too.

> That I fully agree with. The question is what "working properly" means
> in this case -- should it be capable of auto-recovery, or should it not?

Despite the above arguments that nowadays I'd design my clusters with
bonding in mind, I of course agree that RRP _should_ work. 

Just like drbd/DLM/etc should work with SCTP to make use of the
redundant, un-bonded links.

But for the time being, I think bonded NICs are overall the best
solution.


Regards,
Lars

-- 
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




Re: [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

2009-06-04 Thread Juha Heinanen
Lars Marowsky-Bree writes:

 > > >> try all the time also after failure like was done before failure.
 > > >
 > > > Complete Totem amateur behind the keyboard, but I'd second that. Since
 > > > you're constantly checking the link status while it's up, why not keep
 > > > doing so after it's gone down, to see if it has recovered?
 > > 
 > > Perhaps even at a decreased (user configurable) interval/rate.
 > 
 > I think that was actually discussed on the openais list and on IRC in
 > the past and never completely explained why it wouldn't work ;-)

without that kind of self healing there is no way that openais could
ever replace current heartbeat2+pacemaker setup.  most users simply
expect the system to self heal from this kind of failure.  they don't
have resources to manually babysit the cluster 24x7.

-- juha



Re: [Pacemaker] [Openais] Pacemaker on OpenAIS, RRP, and link failure

2009-06-04 Thread Steven Dake
On Thu, 2009-06-04 at 17:54 +0200, Lars Marowsky-Bree wrote:
> On 2009-05-26T12:50:34, Andrew Beekhof  wrote:
> 
> > >> try all the time also after failure like was done before failure.
> > >
> > > Complete Totem amateur behind the keyboard, but I'd second that. Since
> > > you're constantly checking the link status while it's up, why not keep
> > > doing so after it's gone down, to see if it has recovered?
> > 
> > Perhaps even at a decreased (user configurable) interval/rate.
> 
> I think that was actually discussed on the openais list and on IRC in
> the past and never completely explained why it wouldn't work ;-)
> 
> 
> 

The problem with checking the link status with the current code is that
the protocol blocks I/O waiting for a response from the failed ring.
This could of course be modified to behave differently.  So the act of
failing a link is expensive and we don't want to retest that it is valid
very often.  The obvious solution to this is to redesign the protocol to
not have this constraint.  No patch has been written and I don't have
time to do such work at the present time.

Regards
-steve

> Regards,
> Lars
> 




Re: [Pacemaker] Do we have a repository for config files?

2009-06-04 Thread Shaffin Bhanji
Yes, that is correct.

On Thu, Jun 4, 2009 at 2:33 AM, Andrew Beekhof wrote:
> On Wed, Jun 3, 2009 at 9:40 PM, Shaffin Bhanji  
> wrote:
>> Hello,
>>
>> I am new to this list but do we have a repository of config files
>> (resources) that enable various HA capabilities yet?
>
> Do you mean the scripts or xml fragments that go in the cib?
>
> -- Andrew
>



Re: [Pacemaker] [Openais] Pacemaker on OpenAIS, RRP, and link failure

2009-06-04 Thread Lars Marowsky-Bree
On 2009-06-04T09:23:04, Steven Dake  wrote:

> The problem with checking the link status with the current code is that
> the protocol blocks I/O waiting for a response from the failed ring.
> This could of course be modified to behave differently.

Right, so the rechecking could possibly be a separate thread, sending an
occasional liveness packet on the failed ring and trigger the RRP
recovery after it has heard from other nodes on it?

Some smarts would be needed of course to not constantly retrigger
partially active rings (which would fail again immediately).
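
The idea in the two paragraphs above can be sketched as follows. This is not Totem or OpenAIS code, only a small Python model under stated assumptions: the class name FailedRingMonitor, its parameters and the notion of a "probe round" are all hypothetical. A failed ring is re-enabled only once every known peer has answered on it, and each unsuccessful round doubles the wait before the next probe, so a partially active ring is not retriggered constantly.

```python
class FailedRingMonitor:
    """Tracks liveness probes on a ring that has been marked failed."""

    def __init__(self, peers, base_interval=1.0, max_interval=60.0):
        self.peers = set(peers)
        self.base_interval = base_interval
        self.max_interval = max_interval
        # Seconds to wait before the next probe round.
        self.interval = base_interval

    def check(self, responders):
        """Process one probe round.

        responders is the set of peers heard on the failed ring.
        Returns True if the ring may be re-enabled.
        """
        if self.peers <= set(responders):
            # Every peer answered: reset the probe rate and allow recovery.
            self.interval = self.base_interval
            return True
        # Partially active or dead ring: back off, capped at max_interval.
        self.interval = min(self.interval * 2, self.max_interval)
        return False


if __name__ == "__main__":
    monitor = FailedRingMonitor({"node1", "node2"})
    for heard in ({"node1"}, set(), {"node1", "node2"}):
        ok = monitor.check(heard)
        print(sorted(heard), "recover" if ok else "wait", monitor.interval)
```

The sketch leaves open how the probes are sent; the reply below addresses whether that should be a separate thread at all.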

> So the act of failing a link is expensive and we don't want to retest
> that it is valid very often.

Does "expensive" mean that it'll actually slow down the healthy
ring(s)?


Regards,
Lars

-- 
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




Re: [Pacemaker] [Openais] Pacemaker on OpenAIS, RRP, and link failure

2009-06-04 Thread Steven Dake
On Thu, 2009-06-04 at 18:30 +0200, Lars Marowsky-Bree wrote:
> On 2009-06-04T09:23:04, Steven Dake  wrote:
> 
> > The problem with checking the link status with the current code is that
> > the protocol blocks I/O waiting for a response from the failed ring.
> > This could of course be modified to behave differently.
> 
> Right, so the rechecking could possibly be a separate thread, sending an
> occasional liveness packet on the failed ring and trigger the RRP
> recovery after it has heard from other nodes on it?

Well I prefer totem to remain nonthreaded except for encrypted xmit
operations, but in general, that is the basic idea.  

> Some smarts would be needed of course to not constantly retrigger
> partially active rings (which would fail again immediately).
> 
> > So the act of failing a link is expensive and we don't want to retest
> > that it is valid very often.
> 
> Does "expensive" mean that it'll actually slow down the healthy
> ring(s)?
> 
At the moment it blocks until the problem counter reaches the threshold
at which point the ring is declared failed and normal communication
continues.
> 
> Regards,
> Lars
> 




Re: [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

2009-06-04 Thread Lars Marowsky-Bree
On 2009-06-04T19:07:41, Juha Heinanen  wrote:

> without that kind of self healing there is no way that openais could
> ever replace current heartbeat2+pacemaker setup. 

This "self-healing" works just fine with bonded NICs.


Regards,
Lars

-- 
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




Re: [Pacemaker] dopd on openais

2009-06-04 Thread Lars Marowsky-Bree
On 2009-05-28T22:34:06, Florian Haas  wrote:

> Raoul,
> 
> No such thing currently exists. We're currently figuring out how to best
> do this with OpenAIS. Stay tuned.

BTW, while figuring that out, you could also consider going a step
beyond what dopd does right now - OpenAIS gives you access to the node
IP addresses, and dlm_controld for example uses this to figure out
between which nodes to establish a connection.

That'd be awesome for my floating peers pet project. ;-)

On join of a drbd_ group, you could see who else is there and
connect to them, and also figure out if people try to start on more than
2 nodes etc.

Is there a design plan/discussion one can participate in?


Regards,
Lars

-- 
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




Re: [Pacemaker] Do we have a repository for config files?

2009-06-04 Thread Lars Marowsky-Bree
On 2009-06-04T12:27:50, Shaffin Bhanji  wrote:

> >> I am new to this list but do we have a repository of config files
> >> (resources) that enable various HA capabilities yet?
> >
> > Do you mean the scripts or xml fragments that go in the cib?
> Yes, that is correct.

You answered an "or" question with "yes". So which one is it? ;-)

Besides the XML, I'd suggest providing "crm" style examples where
possible, as they tend to be more readable.


Regards,
Lars

-- 
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




Re: [Pacemaker] Do we have a repository for config files?

2009-06-04 Thread Shaffin Bhanji
Great catch Lars, I meant both, scripts as well as cib XML files.

Shaffin.


On Thu, Jun 4, 2009 at 1:01 PM, Lars Marowsky-Bree wrote:
> On 2009-06-04T12:27:50, Shaffin Bhanji  wrote:
>
>> >> I am new to this list but do we have a repository of config files
>> >> (resources) that enable various HA capabilities yet?
>> >
>> > Do you mean the scripts or xml fragments that go in the cib?
>> Yes, that is correct.
>
> You answered an "or" question with "yes". So which one is it? ;-)
>
> Besides the XML, I'd suggest providing "crm" style examples where
> possible, as they tend to be more readable.
>
>
> Regards,
>    Lars
>
> --
> SuSE Labs, OPS Engineering, Novell, Inc.
> SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
>



Re: [Pacemaker] [Openais] Pacemaker on OpenAIS, RRP, and link failure

2009-06-04 Thread Lars Marowsky-Bree
On 2009-06-04T11:38:08, Robert Wipfel  wrote:

> > I think that was actually discussed on the openais list and on IRC in
> > the past and never completely explained why it wouldn't work ;-)
> Link status can also be written to the other communication 
> medium: the shared disk (assuming different links for that I/O)

Right, and the corosync quorum framework can use such information to
determine quorum going forward.

It's an interesting idea to use this to implement a meta-channel for
recovery of network channels as well.


Regards,
Lars

-- 
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




[Pacemaker] cibadmin update

2009-06-04 Thread Infos E-Blokos

Hi,

When I try to update cib.xml, it fails every time:

Call cib_replace failed (-45): Update was older than existing configuration


I use 


# cibadmin -R -o cib -x cib.xml

Thanks
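
One likely cause, stated here as an assumption because the thread contains no diagnosis: the cluster orders configurations by the admin_epoch, epoch and num_updates attributes of the top-level cib element, in that order, and rejects a replacement whose version is lower than the live one. The following minimal Python sketch compares two such versions before attempting a replace. The function names and the sample attribute values are hypothetical.

```python
import xml.etree.ElementTree as ET

VERSION_FIELDS = ("admin_epoch", "epoch", "num_updates")


def cib_version(xml_text):
    """Return (admin_epoch, epoch, num_updates) from a <cib> document."""
    root = ET.fromstring(xml_text)
    # Missing attributes are treated as 0 in this sketch.
    return tuple(int(root.get(field, "0")) for field in VERSION_FIELDS)


def is_older(candidate_xml, live_xml):
    """True if candidate_xml carries a lower version than live_xml."""
    return cib_version(candidate_xml) < cib_version(live_xml)


if __name__ == "__main__":
    # Hypothetical values: the file on disk was saved before later updates.
    live = '<cib admin_epoch="0" epoch="42" num_updates="3"/>'
    on_disk = '<cib admin_epoch="0" epoch="40" num_updates="7"/>'
    print("file is older than live CIB:", is_older(on_disk, live))
```

If that is the cause, starting from a fresh copy of the live configuration (for example the output of cibadmin -Q) before editing, or raising the epoch in the file, would be the usual way out.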






[Pacemaker] ESX guest having SLE HA

2009-06-04 Thread Priyanka Ranjan
Hi All,
Is SLE HA supported as a guest OS on VMware ESX 3.5?

Thanks a lot for your help,
Priyanka.