Hi,

I have a Heartbeat problem when trying automatic failover. Manual failover works fine: unmounting a partition from one OSS and remounting it on another lets the clients recover. It all starts with this error:

Filesystem[7650]: 2011/12/22_14:36:05 ERROR: Couldn't mount filesystem /dev/mpath/colosse4-lun60-sata on /mnt/data/clun60
Filesystem[7639]: 2011/12/22_14:36:05 ERROR: Generic error

As a result, the failover OSS is the wrong one and the clients stay in this state forever:

sata-OST0000_UUID : Resource temporarily unavailable

Here is my Heartbeat config:

[root@ib3-st02 ~]# cat /etc/ha.d/ha.cf
# log file settings
# write debug output to /var/log/ha-debug
debugfile /var/log/ha-debug
# write log messages to /var/log/ha-log
logfile /var/log/ha-log
# use syslog to write to logfiles
logfacility local0
# set some time-outs. these values are only recommendations, which
# depend e.g. on the OSS load
# send keep-alive packages every 2 seconds
keepalive 2
# wait 90 seconds before declaring a node dead
deadtime 90
# write a warning to the logfile after 30 seconds without an answer
# from the failover node
warntime 30
# wait for 120 seconds before declaring a node dead after heartbeat
# is brought up
initdead 120
# define communication channels
# use port 12345 to communicate with fail-over node
udpport 12345
# use network interfaces eth0 and ib0 to detect a failed node
bcast eth0 bond0
# Use manual failback
auto_failback off
# node names in this failover-pair. These names must match the
# output of `hostname`
node ib3-st01
node ib3-st02
node ib3-st03
node ib3-st04

[root@ib3-st02 ~]# cat /etc/ha.d/haresources
ib3-st01 Filesystem::/dev/emcssd-1/mdt-sata::/mnt/mdt-colosse::lustre
ib3-st01 Filesystem::/dev/mpath/colosse4-lun53-sata::/mnt/data/clun53::lustre
ib3-st02 Filesystem::/dev/mpath/colosse4-lun54-sata::/mnt/data/clun54::lustre
ib3-st03 Filesystem::/dev/mpath/colosse4-lun55-sata::/mnt/data/clun55::lustre
ib3-st04 Filesystem::/dev/mpath/colosse4-lun56-sata::/mnt/data/clun56::lustre
ib3-st01 Filesystem::/dev/mpath/colosse4-lun57-sata::/mnt/data/clun57::lustre
ib3-st02 Filesystem::/dev/mpath/colosse4-lun58-sata::/mnt/data/clun58::lustre
ib3-st03 Filesystem::/dev/mpath/colosse4-lun59-sata::/mnt/data/clun59::lustre
ib3-st04 Filesystem::/dev/mpath/colosse4-lun60-sata::/mnt/data/clun60::lustre

Both files are the same on all OSSs.

Has anybody ever encountered this problem?

Thanks for your help.

-- 
Patrice Hamelin
Specialiste sénior en systèmes d'exploitation | Senior OS specialist
Environnement Canada | Environment Canada
2121, route Transcanadienne | 2121 Transcanada Highway
Dorval, QC H9P 1J3
Téléphone | Telephone 514-421-5303
Télécopieur | Facsimile 514-421-7231
Gouvernement du Canada | Government of Canada
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
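
P.S. To make it easier to see which node Heartbeat treats as the preferred owner of each target in the haresources listing above, here is a minimal Python sketch that parses those lines and groups the mount points by node. The function names parse_haresources and by_node are made up for illustration, and the sketch assumes one Filesystem resource per line with no continuation lines, which holds for the file shown but not for every haresources file.

```python
from collections import defaultdict

# Contents of /etc/ha.d/haresources as posted above.
HARESOURCES = """\
ib3-st01 Filesystem::/dev/emcssd-1/mdt-sata::/mnt/mdt-colosse::lustre
ib3-st01 Filesystem::/dev/mpath/colosse4-lun53-sata::/mnt/data/clun53::lustre
ib3-st02 Filesystem::/dev/mpath/colosse4-lun54-sata::/mnt/data/clun54::lustre
ib3-st03 Filesystem::/dev/mpath/colosse4-lun55-sata::/mnt/data/clun55::lustre
ib3-st04 Filesystem::/dev/mpath/colosse4-lun56-sata::/mnt/data/clun56::lustre
ib3-st01 Filesystem::/dev/mpath/colosse4-lun57-sata::/mnt/data/clun57::lustre
ib3-st02 Filesystem::/dev/mpath/colosse4-lun58-sata::/mnt/data/clun58::lustre
ib3-st03 Filesystem::/dev/mpath/colosse4-lun59-sata::/mnt/data/clun59::lustre
ib3-st04 Filesystem::/dev/mpath/colosse4-lun60-sata::/mnt/data/clun60::lustre
"""


def parse_haresources(text):
    """Return (node, device, mountpoint, fstype) for each Filesystem resource.

    Hypothetical helper: assumes "node resource [resource ...]" per line and
    ignores blank lines, comments, and non-Filesystem resources.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        node, *resources = line.split()
        for res in resources:
            parts = res.split("::")
            if parts[0] == "Filesystem" and len(parts) >= 3:
                fstype = parts[3] if len(parts) > 3 else ""
                entries.append((node, parts[1], parts[2], fstype))
    return entries


def by_node(entries):
    """Group mount points by the node listed as their preferred owner."""
    grouped = defaultdict(list)
    for node, _device, mountpoint, _fstype in entries:
        grouped[node].append(mountpoint)
    return dict(grouped)


entries = parse_haresources(HARESOURCES)
owners = by_node(entries)
for node in sorted(owners):
    print(node, owners[node])
```

With the file above, /mnt/data/clun60 (the mount that fails in the log) is listed under ib3-st04, together with /mnt/data/clun56.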