Yes, I think the mmap code in the sm BTL actually has a sync point inside add_procs: once the root allocates and sets up the area, it locally broadcasts either a "yes, we're good -- mmap attach and let's continue" message or a "bad things happened; sm BTL is broken" message.
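To make the idea concrete, here is a rough sketch of that kind of sync point. It is not the sm BTL code: it is a toy in Python for Unix, and the status codes, the one-byte control area, and the run_demo helper are all made up for illustration. The point is only that every local peer waits for the root's verdict and sees the same one, so a setup failure is symmetric.

```python
import mmap
import os
import time

# Hypothetical status codes; not taken from the Open MPI source.
PENDING, OK, FAILED = 0, 1, 2


def run_demo(npeers, root_setup_fails):
    """Fork npeers local peers; return how many chose to use the shared area."""
    # Anonymous shared mapping (Unix): inherited by children across fork().
    area = mmap.mmap(-1, mmap.PAGESIZE)
    area[0] = PENDING

    pids = []
    for _ in range(npeers):
        pid = os.fork()
        if pid == 0:
            # Peer: wait at the sync point until the root publishes a result.
            while area[0] == PENDING:
                time.sleep(0.001)
            # Exit code 0 means "attach and continue"; 1 means "skip this transport".
            os._exit(0 if area[0] == OK else 1)
        pids.append(pid)

    # Root: pretend to set up the area, then publish the outcome to all peers.
    area[0] = FAILED if root_setup_fails else OK

    # Collect each peer's decision from its exit status.
    attached = 0
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0:
            attached += 1
    area.close()
    return attached
```

Calling run_demo(3, False) should have all three peers attach, while run_demo(3, True) should have none of them attach; there is no case where only some peers end up using the area.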
But I am not confident about the other BTLs.

On Jun 2, 2010, at 12:51 PM, Eugene Loh wrote:

> George Bosilca wrote:
>
> > We did assume that at least the errors are symmetric, i.e. if A fails
> > to connect to B then B will fail when trying to connect to A.
>
> I've not been following this thread closely, but thought I'd add a comment.
>
> It used to be that the sm BTL could fail asymmetrically. A shared
> memory could be allocated and processes start to allocate resources
> within shared memory. At some point, the shared area would be
> exhausted. So, some processes were set up to communicate to others, but
> the others would not be able to communicate back via the same BTL. I
> think this led to much brokenness. (E.g., how would a process return a
> sm fragment to a sender?)
>
> At this point, my recollection of those issues is very fuzzy.
>
> In any case, I think those issues went away with the shared-memory work
> I did a while back. The size of the area is now computed to be large
> enough that each process's initial allocation would succeed.

--
Jeff Squyres
jsquy...@cisco.com