Re: [Pacemaker] crm command line tool problem

2009-05-25 Thread Andrew Beekhof
Looks like a bug, can you post a hb_report archive of the scenario please?

On Fri, May 22, 2009 at 5:51 PM, Joe Armstrong jarmstr...@postpath.com wrote:
 Hi All,

 I am playing around with the crm command line tool to create an HA config for 
 pacemaker and am bumping into a problem.

 If I have a configuration running already, 3-node with ip & httpd (pretty
 simple) and I want to create a new configuration, then according to the CRM
 CLI document I should:

        crm configure erase

 and then create my new configuration.  But the erase directive also blows 
 away my cluster node definitions.  If I manually put them back again:

        crm configure node vm1
        crm configure node vm2
        crm configure node vm4

 I get this:

        ============
        Last updated: Fri May 22 08:32:16 2009
        Current DC: NONE
        3 Nodes configured, unknown expected votes
        0 Resources configured.
        ============

        Node vm1: UNCLEAN (offline)
        Node vm2: UNCLEAN (offline)
        Node vm4: UNCLEAN (offline)

 Now, no matter what I do I can't get the nodes back to the online
 state.  And worse, if I try to shut down heartbeat, all nodes try to kill all
 other nodes and crmd won't shut down (if I kill crmd, Bad Things(TM) happen).

 Is there a way I can either:
        - remove all resources/constraints but leave the node definitions (I 
 did this by manually removing each resource and it works fine, but this seems 
 like a pain in a large config)
        - get the cluster nodes to see each other again
        - do something before the erase so the nodes don't go to the UNCLEAN 
 state
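
For the first of these, one possible workaround -- a sketch only, assuming cibadmin's --replace, -o and --xml-text options behave as in pacemaker 1.0, untested here -- is to replace just the resources and constraints sections of the CIB with empty ones, leaving the node definitions untouched:

```shell
# Sketch: clear all resources and constraints but keep <nodes>.
# Wrapped in a function so nothing touches a live cluster on load.
wipe_config_keep_nodes() {
  cibadmin --replace -o resources   --xml-text '<resources/>' &&
  cibadmin --replace -o constraints --xml-text '<constraints/>'
}
```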

 Thanks.

 Joe

 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker




Re: [Pacemaker] [RfC] Redesigned Debian HA packages, try 2 (was: try 1)

2009-05-25 Thread Martin Gerhard Loschwitz

Simon Horman schrieb:

Hi,

Has there been any progress with getting these packages into experimental?



Hi folks,

it's me again after a somewhat longer absence; I have reworked the packages
and created new ones from the latest upstream versions.

openais-legacy has recently been ACCEPTed into Experimental; heartbeat and
pacemaker have been in the NEW queue since today, awaiting inclusion.

I also created Lenny packages. These packages are proven to work on Lenny
and allow you to run an openais-pacemaker or heartbeat-pacemaker-based
cluster on Debian GNU/Linux 5.0 alias Lenny. They are available only for
the amd64 architecture at the moment, but I hope to add x86 versions to
the archive tomorrow.

Anybody who wants to test them finds the versions for Unstable alias Sid
here:

deb http://people.debian.org/~madkiss/ha/sid/ ./
deb-src http://people.debian.org/~madkiss/ha/sid/ ./

Stable aka Debian GNU/Linux 5.0 (Lenny) users go here:

deb http://people.debian.org/~madkiss/ha/lenny/ ./
deb-src http://people.debian.org/~madkiss/ha/lenny/ ./

These packages are mostly lintian clean by now.

Please report back any glitches or packaging mistakes you spot.

Best Regards
Martin



[Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

2009-05-25 Thread Juha Heinanen
Florian Haas writes:

  1. Set rrp_problem_count_timeout and/or rrp_problem_count_threshold
  ridiculously high so the ring status never goes to faulty. (It seems
  that RRP problem counting can't be disabled altogether).
  
  2. Have package maintainers include some magic that does
  openais-cfgtool -r every time a network link changes its status to UP
  (where the network management subsystem permits this).
  
  3. Instruct users to install cron jobs that do openais-cfgtool -r in
  specified intervals, causing OpenAIS to re-check the link status
  periodically.

all of the above sound like hacks to me.  a better solution would be a
crm variable that tells whether automatic recovery is desired.

-- juha



Re: [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

2009-05-25 Thread Juha Heinanen
Florian Haas writes:

  Agree that they're hacks, but disagree with your alternative. Why should
  Pacemaker be concerned with low-level OpenAIS recovery procedures?

then have the variable in OpenAIS configuration.

-- juha



Re: [Pacemaker] PingD Failure-Timeout

2009-05-25 Thread Andrew Beekhof
On Thu, May 21, 2009 at 10:20 PM, Eliot Gable ega...@broadvox.net wrote:
 Is there a way to time-out the failure of PingD?

Yes, but you need version >= 1.0.0.
I assume you're not running it as a clone, right?




 In my configuration, I cannot run PingD all the time on every node. Only one
 node (the master) has public Internet access. I use PingD to cause the
 master to fail-over to one of the slaves. When a slave becomes master, it
 then gains public Internet connectivity. When it is a slave, the entire
 interface is down, so not even the gateway is reachable. So, I set up a
 PingD resource that is co-located with the master resource in the Master
 state. I also set up constraints that assign a -1000 score to a node for
 each resource if that node loses connectivity to the gateway. The result is
 that if I firewall off ICMP on the master, it correctly fails over to a
 slave. Then, it runs a stop on the master, as expected since it has a -1000
 score. The result is that my master resource runs as Master on the node that
 was the slave, and is Stopped on the node that was the master. However, it
 is still stuck with a -1000 score, and will never restart on the node that
 was the master until PingD thinks it has connectivity back. But that won’t
 happen, because PingD no longer runs on that node (the interface is down),
 and it wouldn’t see anything even if it did.



 I set a failure-timeout on the PingD resource, but it does not seem to do
 anything. Running ‘crm_verify -V -L 2>&1 | less’ shows that the -1000
 score stays there, even well past the failure-timeout.



 Anybody have any suggestions how I can automatically clear that -1000 score
 after a certain (small) interval of time?
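
One possibly relevant detail (a hedged guess, not a confirmed fix for this case): failure-timeout is only re-evaluated when the policy engine runs, so on an otherwise idle cluster nothing happens until cluster-recheck-interval fires. A sketch in crm shell syntax, with illustrative names and values:

```
primitive p-pingd ocf:pacemaker:pingd \
        params host_list="192.168.0.1" multiplier="1000" \
        meta failure-timeout="120s" \
        op monitor interval="15s"
property cluster-recheck-interval="60s"
```

Note also that a score coming from a location rule on the pingd node attribute is not a resource failure as such, so failure-timeout may not clear it; the stale attribute value itself would have to be reset.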





 Eliot Gable
 Senior Engineer
 1228 Euclid Ave, Suite 390
 Cleveland, OH 44115

 Direct: 216-373-4808
 Fax: 216-373-4657
 ega...@broadvox.net

 CONFIDENTIAL COMMUNICATION.  This e-mail and any files transmitted with it
 are confidential and are intended solely for the use of the individual or
 entity to whom it is addressed. If you are not the intended recipient,
 please call me immediately.  BROADVOX is a registered trademark of Broadvox,
 LLC.









Re: [Pacemaker] crm command line tool problem

2009-05-25 Thread Andrew Beekhof
On Mon, May 25, 2009 at 5:51 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:
 Hi,

 On Fri, May 22, 2009 at 08:51:45AM -0700, Joe Armstrong wrote:
 Hi All,

 I am playing around with the crm command line tool to create an
 HA config for pacemaker and am bumping into a problem.

 If I have a configuration running already, 3-node with ip &
 httpd (pretty simple) and I want to create a new configuration
 according to the CRM CLI document I should:

       crm configure erase

 and then create my new configuration.  But the erase
 directive also blows away my cluster node definitions.  If I

 This was fixed at the beginning of April: crm no longer
 removes nodes on erase. Please update if possible.

1.0.4 should be out later this week (if I can get my primary machine
running again!)



[Pacemaker] Redesigned Debian HA packages

2009-05-25 Thread Juha Heinanen
i replaced my older packages with the new debian packages (heartbeat and
pacemaker-heartbeat) and my cluster came up automatically without a need
to change anything.

regarding crm_mon, i would like to start it automatically to monitor the
cluster and send alert emails if something happens, but no /etc/init.d
script seems to exist for that purpose.

-- juha



Re: [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

2009-05-25 Thread Florian Haas
On 2009-05-25 17:45, Andrew Beekhof wrote:
 SUSE is currently recommending NIC bonding.
 We've not been able to get satisfactory behavior from clusters using RRP.

I've repeatedly told customers that NIC bonding is not a valid
substitute for redundant Heartbeat links, and I will stubbornly insist it
isn't one for OpenAIS RRP links either.

Some reasons:
- You're not protected against bugs, currently known or unknown, in the
bonding driver. If bonding itself breaks, you're screwed.
- Most people actually run bonding on interfaces of the same make,
model, and chipset. That's not necessarily optimal, but it's a reality.
Thus, if your driver breaks, you're screwed again. Granted, this would
probably be true if you ran two RRP links in that same configuration too.
- Finally, you can't bond between a switched and a direct back-to-back
connection, which makes bonding entirely unsuitable for the redundant
links use case I described earlier.

 1. Set rrp_problem_count_timeout and/or rrp_problem_count_threshold
 ridiculously high so the ring status never goes to faulty. (It seems
 that RRP problem counting can't be disabled altogether).

 2. Have package maintainers include some magic that does
 openais-cfgtool -r every time a network link changes its status to UP
 (where the network management subsystem permits this).

 3. Instruct users to install cron jobs that do openais-cfgtool -r in
 specified intervals, causing OpenAIS to re-check the link status
 periodically.
 
 You could add it to the drbd monitor action I guess.
 But it does seem sub-optimal.

I already made my point with regard to Juha's suggestion that it seems
odd for Pacemaker to fiddle with its own communication infrastructure.
To instead defer that task to a Pacemaker resource agent seems
positively disturbing.

 I think the best solution is to work with upstream to get the feature
 working properly.

That I fully agree with. The question is what "working properly" means
in this case -- should it be capable of auto-recovery, or should it not?

Cheers,
Florian





Re: [Pacemaker] Redesigned Debian HA packages

2009-05-25 Thread Andrew Beekhof
On Mon, May 25, 2009 at 6:05 PM, Juha Heinanen j...@tutpro.com wrote:
 i replaced my older packages with the new debian packages (heartbeat and
 pacemaker-heartbeat) and my cluster came up automatically without a need
 to change anything.

 regarding crm_mon, i would like to start it automatically to monitor the
 cluster and send alert emails if something happens, but no /etc/init.d
 script seems to exist for that purpose.

You probably want it running as a cluster resource.
There may even be an RA for it (if not, we need to write one).
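
Pacemaker ships an ocf:pacemaker:ClusterMon agent that wraps crm_mon in daemon mode; a sketch (parameter values illustrative, and --mail-to needs a crm_mon built with esmtp support, so treat that part as an assumption):

```
primitive cluster-mon ocf:pacemaker:ClusterMon \
        params user="root" update="30" \
               extra_options="--mail-to admin@example.com" \
        op monitor interval="60s"
```

Running it as a clone would put one monitor on every node; a plain primitive gives a single roving instance.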



Re: [Pacemaker] crm command line tool problem

2009-05-25 Thread Dejan Muhamedagic
On Mon, May 25, 2009 at 06:03:20PM +0200, Andrew Beekhof wrote:
 On Mon, May 25, 2009 at 5:51 PM, Dejan Muhamedagic deja...@fastmail.fm 
 wrote:
  Hi,
 
  On Fri, May 22, 2009 at 08:51:45AM -0700, Joe Armstrong wrote:
  Hi All,
 
  I am playing around with the crm command line tool to create an
  HA config for pacemaker and am bumping into a problem.
 
  If I have a configuration running already, 3-node with ip &
  httpd (pretty simple) and I want to create a new configuration
  according to the CRM CLI document I should:
 
         crm configure erase
 
  and then create my new configuration.  But the erase
  directive also blows away my cluster node definitions.  If I
 
  This has been fixed in the beginning of April. The crm now
  doesn't remove nodes on erase. Please update if possible.
 
 1.0.4 should be out later this week (if I can get my primary machine
 running again!)

Must be an epidemic: mine's broken too (motherboard went south, out of
warranty -- by just about a month).




Re: [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

2009-05-25 Thread Steven Dake
On Mon, 2009-05-25 at 18:32 +0300, Juha Heinanen wrote:
 Florian Haas writes:
 
   Agree that they're hacks, but disagree with your alternative. Why should
   Pacemaker be concerned with low-level OpenAIS recovery procedures?
 
 then have the variable in OpenAIS configuration.
 

Self-healing is not as obvious or easy as it sounds.  Totem (the
protocol) has no way to determine when the admin has replaced the faulty
switch in the network.

One option I see is to periodically probe the failed ring for
liveness.  The problem with this approach is that it is hard to implement.
Another option is to re-enable the ring after some period of time,
without probing, and hope for the best.  The problem with this approach
is that it causes performance degradation every time the failed ring is
re-enabled and restarted.

I think the first option is the best, but at the moment nobody has
written patches, and most people are focused on the 1.0 release...
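
That first option could be sketched as a small watcher loop -- purely illustrative, nothing here is part of Totem itself, and probe_ring/reenable_ring are hypothetical stand-ins for a real probe of the ring-1 peer and for "openais-cfgtool -r":

```shell
#!/bin/sh
# Sketch of option 1: keep probing the failed ring and re-enable it
# only after several consecutive successful probes, to avoid flapping.
probe_ring()    { ping -c 1 -W 1 "$RING1_PEER" >/dev/null 2>&1; }
reenable_ring() { openais-cfgtool -r; }

watch_ring() {
  needed=$1                 # consecutive good probes required
  ok=0
  while :; do
    if probe_ring; then
      ok=$((ok + 1))
      if [ "$ok" -ge "$needed" ]; then
        reenable_ring
        return 0
      fi
    else
      ok=0                  # any failure resets the streak
    fi
    sleep 60
  done
}
```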

Regards
-steve

 -- juha
 




Re: [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

2009-05-25 Thread Juha Heinanen
Steven Dake writes:

  Self-healing is not as obvious or easy as it sounds.  Totem (the
  protocol) has no way to determine when the admin has replaced the faulty
  switch in the network.

why can't it keep on pinging the interface/IP address even when there is
no response?

how does pingd handle this: does pinging stop when there is no response,
leaving the node dead forever after a ping failure until someone manually
intervenes?  i don't see any difference here from heartbeat/openais-level
pinging.

  The only options I see is to periodically try the failed ring for
  liveness.  The problem with this approach is it is hard to implement.

keep probing all the time after a failure, just as was done before the failure.

  I think the first option is the best, but atm there isn't anyone that
  has written patches and most people are focused on the 1.0 release...

a 1.0 release that people cannot migrate to from their current
heartbeat/pacemaker setup without losing self-healing capability makes
little sense.

-- juha
