No. I originally did have it set up like this (a v1 ha.cf snippet):

# One partner losing contact with both lnet routers or MDS triggers failover.
#ping_group lnet-router 172.16.10.254 172.16.2.254
#ping_group tycho-mds1 172.16.10.200 172.16.2.200
#respawn hacluster /usr/lib64/heartbeat/ipfail

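For readers unfamiliar with these directives, here is a rough Python model of what the snippet above asks heartbeat to do. It is a sketch, not heartbeat code, and rests on two assumptions about heartbeat's behaviour: a ping_group counts as reachable when any one of its addresses answers, and ipfail requests failover when the partner can reach more ping groups than the local node. The names reachable_groups and should_fail_over are made up for illustration.

```python
# Simplified model of the ping_group + ipfail setup above (not heartbeat code).
# Assumptions: a ping group counts as reachable when ANY of its addresses
# answers, and failover is requested when the peer can see more groups.

PING_GROUPS = {
    "lnet-router": ["172.16.10.254", "172.16.2.254"],
    "tycho-mds1": ["172.16.10.200", "172.16.2.200"],
}


def reachable_groups(answering):
    """Return the names of ping groups with at least one answering address."""
    return {
        name
        for name, addresses in PING_GROUPS.items()
        if any(addr in answering for addr in addresses)
    }


def should_fail_over(local_answering, peer_answering):
    """Hypothetical rule: give up resources if the peer sees more groups."""
    local = reachable_groups(local_answering)
    peer = reachable_groups(peer_answering)
    return len(peer) > len(local)


if __name__ == "__main__":
    # Both partners see one router address and one MDS address: no failover.
    everything = {"172.16.10.254", "172.16.2.200"}
    print(should_fail_over(everything, everything))  # False

    # The local node has lost both MDS addresses, the peer has not: failover.
    routers_only = {"172.16.10.254", "172.16.2.254"}
    print(should_fail_over(routers_only, everything))  # True
```

In this model the decision depends only on the difference between what the two partners can see at a given instant; there is no delay or hold-down.
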
However, I ran into a problem when rebooting the MDS. Apparently, if one
partner re-establishes contact with the MDS before the other one, it
immediately triggers failover. This is with heartbeat-2.1.4.

Jim

On Mon, Jul 13, 2009 at 02:25:17PM -0600, Lundgren, Andrew wrote:
> Were you able to get monitoring working to detect network failures? (pingd?)
>
> I have it configured, but haven't been able to get it to trigger a failover
> when an MDS cannot ping the network. (I tried with 1.0 and 2.0 conf files;
> I am currently using 2.0.) I have a ticket open with the pacemaker project
> (no ticket system for the HA stuff...) but no resolution. I am considering
> writing a script to down the node when the ping fails, but don't like the
> idea.
>
> I would also like to get the hpingd functioning to detect a fiber failure,
> but there was less available on that solution.
>
> --
> Andrew
>
> > -----Original Message-----
> > From: Jim Garlick [mailto:garl...@llnl.gov]
> > Sent: Monday, July 13, 2009 2:21 PM
> > To: Lundgren, Andrew
> > Cc: Carlos Santana; lustre-discuss@lists.lustre.org
> > Subject: Re: [Lustre-discuss] failover software - heartbeat
> >
> > We recently put heartbeat v1 in production and along the way developed
> > some admin scripts, including heartbeat resource agent compliant lustre
> > init scripts, a script to initiate failover/failback and get detailed
> > status, a powerman stonith interface, and various safeguards to ensure
> > MMP is on, devices are present and usable, etc. before starting lustre.
> >
> > If this is of general interest I could post it to a bug for review.
> >
> > Jim
> >
> > On Mon, Jul 13, 2009 at 01:45:02PM -0600, Lundgren, Andrew wrote:
> > > It is very difficult to find relevant documentation for heartbeat 1/2.
> > > I just finished configuring a heartbeat system and would not recommend
> > > it because of the documentation. (They seem to have removed portions
> > > of the heartbeat documentation from the site.)
> > >
> > > Pacemaker is not a simple solution to configure either. I played
> > > briefly with the RH clustering software. It does not directly support
> > > any FS type other than the basic ext2/ext3, and wasn't happy with a
> > > lustre type.
> > >
> > > --
> > > Andrew
> > >
> > > > -----Original Message-----
> > > > From: lustre-discuss-boun...@lists.lustre.org
> > > > [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Carlos Santana
> > > > Sent: Monday, July 13, 2009 11:42 AM
> > > > To: lustre-discuss@lists.lustre.org
> > > > Subject: [Lustre-discuss] failover software - heartbeat
> > > >
> > > > Howdy,
> > > >
> > > > The lustre manual recommends heartbeat for handling failover.
> > > > Pacemaker is the successor of heartbeat version 2. So what's
> > > > recommended - should we be using pacemaker or stick to heartbeat?
> > > >
> > > > -
> > > > CS.
> > > > _______________________________________________
> > > > Lustre-discuss mailing list
> > > > Lustre-discuss@lists.lustre.org
> > > > http://lists.lustre.org/mailman/listinfo/lustre-discuss