Re: [Lustre-discuss] failover software - heartbeat

2009-07-14 Thread Cliff White
Lundgren, Andrew wrote:
 It is very difficult to find relevant documentation for heartbeat 1/2. I just
 finished configuring a heartbeat system and would not recommend it because of
 the documentation.  (They seem to have removed portions of the heartbeat
 documentation from the site.)
 
 Pacemaker is not a simple solution to configure either. I played briefly with 
 the RH clustering software.  It does not directly support any FS type other 
 than the basic ext2/ext3, and wasn't happy with a lustre type.  
 

That might be simple to fix, if it is script-based. We submitted a patch 
aeons ago to the heartbeat guys to add 'ldiskfs' as a supported FS. As I 
recall, it was a one-line change.
cliffw

 --
 Andrew
 
 -----Original Message-----
 From: lustre-discuss-boun...@lists.lustre.org [mailto:lustre-discuss-
 boun...@lists.lustre.org] On Behalf Of Carlos Santana
 Sent: Monday, July 13, 2009 11:42 AM
 To: lustre-discuss@lists.lustre.org
 Subject: [Lustre-discuss] failover software - heartbeat

 Howdy,

 The Lustre manual recommends heartbeat for handling failover. Pacemaker
 is the successor to heartbeat version 2. So what's recommended:
 should we be using Pacemaker or stick with heartbeat?

 -
 CS.
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] failover software - heartbeat

2009-07-14 Thread Jim Garlick
Hi,

OK I have posted it to https://bugzilla.lustre.org/show_bug.cgi?id=20165

  20165: scripts for heartbeat v1 integration

I added example config files from our test cluster.  Probably best to
redirect questions/comments/criticisms to the bug and I'll respond there.

Jim
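
The pre-start safeguards mentioned in the original announcement (MMP is on, the device is present) could be sketched roughly as below. This is not taken from the scripts attached to bug 20165. It assumes the header output of `dumpe2fs -h <device>` has already been captured as text, and the function names and the sample volume name are made up for illustration.

```python
import os
import stat


def has_mmp(dumpe2fs_header: str) -> bool:
    """Return True if the 'Filesystem features' line lists the mmp feature."""
    for line in dumpe2fs_header.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Filesystem features":
            return "mmp" in value.split()
    return False


def is_block_device(path: str) -> bool:
    """Return True if path exists and is a block device."""
    try:
        return stat.S_ISBLK(os.stat(path).st_mode)
    except OSError:
        return False


def safe_to_start(path: str, dumpe2fs_header: str) -> bool:
    """Both checks must pass before the target is mounted."""
    return is_block_device(path) and has_mmp(dumpe2fs_header)


# Hypothetical header text; a real one has many more fields.
SAMPLE = """\
Filesystem volume name:   example-OST0000
Filesystem features:      has_journal ext_attr dir_index filetype mmp sparse_super
"""
print(has_mmp(SAMPLE))
```

An init script would refuse to mount the target whenever `safe_to_start` returns False, so that a missing device or a disabled MMP is caught before heartbeat tries to bring the resource up.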


On Tue, Jul 14, 2009 at 12:26:24PM +1000, Atul Vidwansa wrote:
 Hi Jim,
 
 It would be great if you can attach the scripts to a Lustre bugzilla bug.
 
 Cheers,
 _Atul
 
 Jim Garlick wrote:
 We recently put heartbeat v1 in production and along the way
 developed some admin scripts including heartbeat resource agent compliant
 lustre init scripts, a script to initiate failover/failback and get
 detailed status, a powerman stonith interface, and various safeguards to
 ensure MMP is on, devices are present and usable, etc. before starting lustre.
 
 If this is of general interest I could post it to a bug for review.
 
 Jim
 
 -- 
 ==
 Atul Vidwansa
 Sun Microsystems Australia Pty Ltd
 Web: http://blogs.sun.com/atulvid
 Email: atul.vidwa...@sun.com
 


Re: [Lustre-discuss] failover software - heartbeat

2009-07-14 Thread Cliff White
Jim Garlick wrote:
 Hi,
 
 OK I have posted it to https://bugzilla.lustre.org/show_bug.cgi?id=20165
 
   20165: scripts for heartbeat v1 integration
 
 I added example config files from our test cluster.  Probably best to
 redirect questions/comments/criticisms to the bug and I'll respond there.

Looks very good, thanks bunches. I've added a few extras from the
discussion. Did you guys try ipfail, or only pingd?
cliffw

 
 Jim
 
 


Re: [Lustre-discuss] failover software - heartbeat

2009-07-14 Thread Jim Garlick
On Tue, Jul 14, 2009 at 09:37:54AM -0700, Cliff White wrote:
 Jim Garlick wrote:
 Hi,
 
 OK I have posted it to https://bugzilla.lustre.org/show_bug.cgi?id=20165
 
   20165: scripts for heartbeat v1 integration
 
 I added example config files from our test cluster.  Probably best to
 redirect questions/comments/criticisms to the bug and I'll respond there.
 
 Looks very good, thanks bunches. I've added a few extras from the
 discussion. Did you guys try ipfail, or only pingd?
 cliffw

We tried ipfail (unsuccessfully), not pingd.
I think pingd is a v2-only feature?  Our work is entirely with v1,
which seemed adequate and also much simpler to understand and get right.

 Jim
 
 


Re: [Lustre-discuss] failover software - heartbeat

2009-07-13 Thread Lundgren, Andrew
It is very difficult to find relevant documentation for heartbeat 1/2. I just
finished configuring a heartbeat system and would not recommend it because of
the documentation.  (They seem to have removed portions of the heartbeat
documentation from the site.)

Pacemaker is not a simple solution to configure either. I played briefly with 
the RH clustering software.  It does not directly support any FS type other 
than the basic ext2/ext3, and wasn't happy with a lustre type.  

--
Andrew



Re: [Lustre-discuss] failover software - heartbeat

2009-07-13 Thread Lundgren, Andrew
Were you able to get monitoring working to detect network failures?  (pingd?)

I have it configured, but haven't been able to get it to trigger a failover
when an MDS cannot ping the network.  (I tried with both 1.0 and 2.0 conf
files; I am currently using 2.0.)  I have a ticket open with the Pacemaker
project (there is no ticket system for the HA stuff...) but no resolution yet.
I am considering writing a script to take the node down when the ping fails,
but don't like the idea.

I would also like to get hpingd working to detect a fiber failure, but
there was less information available on that solution.

--
Andrew
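
For reference, the "script to take the node down when the ping fails" idea could look something like the sketch below. It is only an illustration under stated assumptions: the `PingWatchdog` class, its parameters, and the host names are hypothetical, `ping_once` relies on the Linux iputils `ping -c`/`-W` flags, and what "taking the node down" means is left to the `on_isolated` callback.

```python
import subprocess
from typing import Callable, Iterable, List


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo using the system ping (Linux iputils flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class PingWatchdog:
    """Call on_isolated after max_failures consecutive rounds with no reply."""

    def __init__(self, hosts: Iterable[str], max_failures: int,
                 on_isolated: Callable[[], None],
                 probe: Callable[[str], bool] = ping_once) -> None:
        self.hosts: List[str] = list(hosts)
        self.max_failures = max_failures
        self.on_isolated = on_isolated
        self.probe = probe
        self.failures = 0

    def check(self) -> bool:
        """Run one round; return True if on_isolated was called."""
        if any(self.probe(host) for host in self.hosts):
            self.failures = 0  # any reply means the network is still there
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.on_isolated()
            self.failures = 0
            return True
        return False


# Demo with a fake probe so nothing is actually pinged.
log: List[str] = []
dog = PingWatchdog(["gw-a", "gw-b"], 3, lambda: log.append("standby"),
                   probe=lambda host: False)
results = [dog.check() for _ in range(3)]
print(results, log)  # [False, False, True] ['standby']
```

In real use `check()` would be called from a loop with a sleep between rounds, and `on_isolated` would hand resources to the partner, for example by running heartbeat's `hb_standby` script if your install provides it. Requiring several consecutive failed rounds is there to avoid failing over on a single dropped packet.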



Re: [Lustre-discuss] failover software - heartbeat

2009-07-13 Thread Jim Garlick
No.  I originally did have it set up like this (a v1 ha.cf snippet):

# One partner losing contact with both lnet routers or MDS triggers failover.
#ping_group lnet-router 172.16.10.254 172.16.2.254
#ping_group tycho-mds1 172.16.10.200 172.16.2.200
#respawn hacluster /usr/lib64/heartbeat/ipfail

However, I ran into a problem when rebooting the MDS.  Apparently if one
partner re-establishes contact with the MDS before the other one, it
immediately triggers failover.  This is with heartbeat-2.1.4.

Jim
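
The reboot race described above can be illustrated with a small model. This is a simplification and not ipfail's actual code: it assumes ipfail compares how many ping nodes or groups each partner can reach and moves resources to the partner that sees more, and that a ping_group counts as reachable when any one of its members answers. The function names are made up; the addresses are the ones from the ha.cf snippet.

```python
from typing import Dict, List, Set


def reachable_groups(groups: Dict[str, List[str]], reachable: Set[str]) -> int:
    """Count ping groups that have at least one reachable member."""
    return sum(
        1 for members in groups.values()
        if any(ip in reachable for ip in members)
    )


def peer_should_take_over(local_count: int, peer_count: int) -> bool:
    """Simplified rule: the peer takes over if it sees more ping groups."""
    return peer_count > local_count


GROUPS = {
    "lnet-router": ["172.16.10.254", "172.16.2.254"],
    "tycho-mds1": ["172.16.10.200", "172.16.2.200"],
}
ROUTERS = {"172.16.10.254", "172.16.2.254"}
MDS = {"172.16.10.200", "172.16.2.200"}

# While the MDS reboots, both partners see only the routers: no failover.
a = reachable_groups(GROUPS, ROUTERS)
b = reachable_groups(GROUPS, ROUTERS)
print(peer_should_take_over(a, b))  # False

# Partner B notices the MDS is back slightly before partner A does.
b = reachable_groups(GROUPS, ROUTERS | MDS)
print(peer_should_take_over(a, b))  # True
```

Under this model nothing is actually wrong with partner A in the second case; it simply has not yet seen the MDS come back, which matches the spurious failover seen on MDS reboot.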

On Mon, Jul 13, 2009 at 02:25:17PM -0600, Lundgren, Andrew wrote:
 Were you able to get monitoring working to detect network failures?  (pingd?)
 


Re: [Lustre-discuss] failover software - heartbeat

2009-07-13 Thread Lundgren, Andrew
Are you doing anything to handle a network failure on one MDS?

How about if your fiber path fails?



Re: [Lustre-discuss] failover software - heartbeat (Lundgren, Andrew)

2009-07-13 Thread Daniel Kulinski
Andrew,

I was able to get ipfail to work on my heartbeat 2.1.3 installation.

Make sure the following line is uncommented in /etc/ha.d/ha.cf:
respawn hacluster /usr/lib64/heartbeat/ipfail

Along with that, you must have a ping line listing the hosts, separated
by spaces.

We have tested this and it works perfectly.  We have 3 ethernet networks to
each OSS and MDS pair.

I have no idea what pingd is or how it relates to heartbeat.

Dan Kulinski
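
A quick way to confirm that both pieces are really active (and not commented out) is to scan ha.cf for them. The sketch below is a hypothetical helper, not part of heartbeat; it only looks for an uncommented `respawn` line whose command ends in `/ipfail` and at least one uncommented `ping` or `ping_group` directive, and does no other validation.

```python
from typing import Iterator, List, Tuple


def active_directives(ha_cf_text: str) -> Iterator[Tuple[str, List[str]]]:
    """Yield (keyword, args) for each non-comment, non-blank ha.cf line."""
    for raw in ha_cf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if line:
            parts = line.split()
            yield parts[0], parts[1:]


def ipfail_configured(ha_cf_text: str) -> bool:
    """True if ipfail is respawned and at least one ping target is defined."""
    has_ipfail = False
    has_ping = False
    for keyword, args in active_directives(ha_cf_text):
        if keyword == "respawn" and args and args[-1].endswith("/ipfail"):
            has_ipfail = True
        elif keyword == "ping" and args:
            has_ping = True
        elif keyword == "ping_group" and len(args) >= 2:
            has_ping = True
    return has_ipfail and has_ping


COMMENTED_OUT = """\
#ping_group lnet-router 172.16.10.254 172.16.2.254
#respawn hacluster /usr/lib64/heartbeat/ipfail
"""
ENABLED = """\
ping 172.16.10.254 172.16.2.254
respawn hacluster /usr/lib64/heartbeat/ipfail
"""
print(ipfail_configured(COMMENTED_OUT), ipfail_configured(ENABLED))
```

On a real node you would pass in the contents of /etc/ha.d/ha.cf; the two sample strings here just mirror the commented-out and enabled forms discussed in this thread.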


Were you able to get monitoring working to detect network failures?
(pingd?)



Re: [Lustre-discuss] failover software - heartbeat

2009-07-13 Thread Jim Garlick
On network failures: no.

On fibre path failures: we configure ldiskfs with errors=panic so fibre
issues or other issues in the storage path will likely cause a panic and
trigger failover.

We're just getting started with failover so we elected to keep it simple
for now.

Jim

On Mon, Jul 13, 2009 at 02:41:09PM -0600, Lundgren, Andrew wrote:
 Are you doing anything if the network fails to one mds?
 
 How about if your fiber path fails?
 