Re: [OMPI devel] Modex and others

Ralph Castain Thu, 13 Nov 2008 08:37:05 -0500

If you look at the Dec meeting wiki, you will see that we are moving quickly to a modex-less launch anyway. It won't be the default because it requires pre-discovery of the cluster's network resources (for which we will provide a tool or method), but it will help resolve some of these problems.

Outside of that, I will have to leave it to the FT folks to figure out how to resolve modex situations. We have the ability to support multiple modex models (and already do), but I don't know if you can do what you describe or not - I'm not sure how the MPI layer will handle that situation.


Ralph

On Nov 13, 2008, at 6:22 AM, Leonardo Fialho wrote:

Jeff,
I agree with your viewpoint, principally about the "reachability". But...
From the FT viewpoint, sometimes (or in some FT architectures) one wants to recover an application process on a node different from the original one. In that case a new modex would have to be performed. That's fine for coordinated C/R; for uncoordinated C/R, on the other hand, it's not a good choice, I think. Once again, the tradeoffs...
A possible solution is to perform n-1 modex exchanges, each involving the recovered process and one of the other processes... Is that better than an allgather modex? I don't know. I think not. And what is the impact of an allgather modex while an MPI thread is delivering messages? The answers to these questions could suggest that uncoordinated C/R is not possible on Open MPI.
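To put rough numbers on the n-1 pairwise exchanges versus an allgather, here is a minimal sketch that only counts how many contact records end up being delivered in each scheme. The function names are made up for illustration, and the count deliberately ignores latency, the collective algorithm actually used, and any interference with in-flight message delivery, so it says nothing definitive about which approach is better in practice.

```python
def records_delivered_pairwise(n):
    """Recovered process swaps contact info with each of the n-1 others."""
    # n-1 records sent out by the recovered process, n-1 records received.
    return 2 * (n - 1)


def records_delivered_allgather(n):
    """Every one of the n processes ends up holding the n-1 remote records."""
    return n * (n - 1)


for n in (4, 64, 1024):
    print(n, records_delivered_pairwise(n), records_delivered_allgather(n))
```

Under this simplified count the pairwise scheme grows linearly with n and the allgather quadratically, but the pairwise scheme serializes everything through the recovered process.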
Leonardo Fialho


Jeff Squyres wrote:
On Nov 7, 2008, at 10:18 AM, Leonardo Fialho wrote:
I understand that a process needs to have the contact information to send MPI messages to other processes, and the modex provides it. My question is: why not perform the contact exchange only when it is necessary?
For example, in an M/W application, the workers do not need any information beyond the master's contact info.
I think that it reduces the startup time, but increases the cost of the *first* communication between two peers.
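The on-demand idea in the question can be sketched as a per-process cache that fetches a peer's contact info the first time it is needed. This is only an illustration: `LazyContactTable`, the `lookup` callback, and the toy `directory` are hypothetical names and do not correspond to any Open MPI API.

```python
class LazyContactTable:
    """Hypothetical per-process cache of peer contact info, filled on first use."""

    def __init__(self, lookup):
        # `lookup` stands in for whatever out-of-band query would fetch a
        # peer's contact info; it is an assumption, not an Open MPI call.
        self._lookup = lookup
        self._cache = {}
        self.lookups = 0

    def contact_for(self, rank):
        # Pay the lookup cost only on first contact with this peer.
        if rank not in self._cache:
            self.lookups += 1
            self._cache[rank] = self._lookup(rank)
        return self._cache[rank]


# Toy "directory" standing in for the out-of-band lookup service.
directory = {rank: f"node{rank}:5000" for rank in range(8)}

# A worker in a master/worker job only ever talks to rank 0.
worker = LazyContactTable(directory.__getitem__)
for _ in range(3):
    worker.contact_for(0)
```

In this toy run the worker performs a single lookup and never learns about ranks 1 to 7, which is the startup saving being asked about; the price is the extra latency on that first `contact_for` call.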
FWIW, this is actually a pretty complex topic. There are many, many tradeoffs in terms of what performance do you want vs. what functionality do you want. This subject has been discussed for many, many hours by the OMPI developers. :-)
The modex is performed during MPI_INIT; the v1.3 series' modex is quite a bit more efficient than the v1.2 series' modex. The modex information comprises several things, some of which are either the contact info or the "reachability" info of BTL modules. For the openib BTL, for example, port subnet IDs and MTUs are passed in the modex, but LIDs don't need to be passed (in some cases) until two processes actually try to reach each other. We use the reachability information to determine whether a given BTL module *could* be used to connect to a remote peer. For example, if we get to the end of MPI_INIT and find a peer that cannot be reached, we abort (after hours of debate, we decided it was better to abort right away when there was a peer that could not be reached rather than abort only on attempted first contact, because it could be a simple network/configuration error that should be detected immediately, rather than erroring out [potentially] long into a multi-hour run).
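The fail-fast reachability check described above can be illustrated with a small sketch. It assumes, purely for illustration, that a peer counts as reachable when it shares at least one subnet ID with the local process; the real per-BTL logic in Open MPI is more involved, and the function names here are hypothetical.

```python
def unreachable_peers(my_subnets, peer_subnets):
    """Return ranks of peers that share no subnet ID with this process."""
    mine = set(my_subnets)
    return sorted(rank for rank, subnets in peer_subnets.items()
                  if mine.isdisjoint(subnets))


def check_reachability_at_init(my_subnets, peer_subnets):
    """Fail fast, as if at the end of init, when any peer cannot be reached."""
    missing = unreachable_peers(my_subnets, peer_subnets)
    if missing:
        raise RuntimeError(f"unreachable peers: {missing}")


# Subnet IDs each peer advertised in the (simulated) modex.
peers = {1: ["subnet-a"], 2: ["subnet-b"], 3: ["subnet-a", "subnet-c"]}
print(unreachable_peers(["subnet-a"], peers))  # [2]
```

Calling `check_reachability_at_init(["subnet-a"], peers)` raises immediately, which mirrors the choice to abort at startup instead of hours into a run.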
We have been discussing a "modex-less" startup for quite a while; this is actually one of the topics on the agenda for an engineering meeting that we're having in December. modex-less is quite important for scalability to many thousands of processes, but other tradeoffs may be necessary to make this work (read: we've talked about modex-less for forever; we're finally likely to do it in the near future because of some upcoming very very large scale machines at US DOE labs).
Does that make sense?
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
