Just an FYI - I asked a similar question recently and got the following answer from Rolf:
> In my case, it was specific to openib only and it required you to be running
> with two or more IB rails.
> Then, if one of them failed, we just shut it down and continued with the
> working ones.
> You could only use the failed rail again once it was fixed and a new job was
> started.
>
> To get this to work, I created a new PML called bfo. I also had to make some
> changes in the openib BTL.
> By default, none of the code is configured in. There is a README in the PML
> bfo directory that actually does quite a good job explaining what I did.

The bfo module is included in the 1.6 series and in the upcoming 1.7 series. I can't say anything as to its state of repair.

On Oct 25, 2012, at 10:41 AM, George Bosilca <bosi...@icl.utk.edu> wrote:

> On Oct 25, 2012, at 17:54, Lirong Jian <lirong.m...@gmail.com> wrote:
>
>> Hi folks,
>>
>> Sorry to bother you guys, but I have some questions about Open MPI and
>> would really appreciate your help.
>>
>> There are some papers (e.g., [1, 2, 3], although they are rather old)
>> mentioning that Open MPI supports NIC failover and message striping
>> over multiple NICs. However, when I read the source code of openmpi-1.6.2, I
>> couldn't find any component named DR or TEG (which are mentioned in those
>> papers and are supposed to support NIC failover and message striping). So
>> my question is:
>>
>> Does the 1.6.2 release of Open MPI support these two features? If so,
>> which part of the code implements them?
>
> Lirong,
>
> As you noticed, the papers are quite old and dusty.
>
> Due to a lack of interest from the community, the DR PML has been retired from
> our stable releases. In other words, no stable Open MPI version supports
> network failover. However, the code is still available in the trunk, but
> there is no guarantee it still does what it was designed for.
>
> TEG has been replaced with OB1, which is our current network management
> layer. It does striping over multiple NICs (identical or not) by default.
>
> george.
>
>> Many thanks in advance.
>>
>> P.S. I am a newbie in this domain. Maybe my questions are simple, even
>> naive, but your help would be highly appreciated.
>>
>> Best,
>> Lirong
>>
>> [1] Network Fault Tolerance in Open MPI.
>> [2] Open MPI: A High Performance, Flexible Implementation of MPI
>> Point-to-Point Communications.
>> [3] TEG: A High-Performance, Scalable, Multi-network, Point-to-Point
>> Communications Methodology.
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel