Yes, I think the mmap code in the sm BTL actually has a sync point inside 
add_procs: once the root allocates and sets up the area, it locally 
broadcasts either a "yes, we're good -- mmap attach and let's continue" or a 
"bad things happened; sm BTL is broken" message.
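
Roughly, the pattern looks like the sketch below. This is not the actual 
sm BTL code; it is a standalone POSIX illustration in which forked children 
stand in for the local peers and a pipe stands in for the local broadcast. 
Every name in it (sm_sync_demo, peer_wait_and_attach, AREA_SIZE, MAX_PEERS, 
the /tmp path) is made up for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define AREA_SIZE 4096 /* hypothetical size of the shared area */
#define MAX_PEERS 16   /* hypothetical cap on local peers */

/* Peer side: block until the root reports the outcome, then attach or give up.
 * Returns 0 if the peer attached, 1 if it must treat shared memory as unusable. */
static int peer_wait_and_attach(int status_fd, const char *path)
{
    char status = 'N';
    if (read(status_fd, &status, 1) != 1 || status != 'Y')
        return 1; /* root reported failure (or went away) */
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return 1;
    void *area = mmap(NULL, AREA_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (area == MAP_FAILED)
        return 1;
    munmap(area, AREA_SIZE);
    return 0;
}

/* Root side: set up the area, then send every peer the same one-byte answer,
 * so nobody attaches unless the root's setup succeeded.
 * Returns the number of peers that attached, or -1 on a usage/setup error. */
int sm_sync_demo(int npeers, int simulate_root_failure)
{
    char path[] = "/tmp/sm_sketch_XXXXXX";
    int pipes[MAX_PEERS][2];
    pid_t pids[MAX_PEERS];
    int attached = 0;
    int ok = !simulate_root_failure;

    if (npeers < 0 || npeers > MAX_PEERS)
        return -1;
    int fd = mkstemp(path); /* backing file; still empty at this point */
    if (fd < 0)
        return -1;

    for (int i = 0; i < npeers; i++) {
        if (pipe(pipes[i]) != 0) { npeers = i; ok = 0; break; }
        pids[i] = fork();
        if (pids[i] < 0) {
            close(pipes[i][0]); close(pipes[i][1]);
            npeers = i; ok = 0; break;
        }
        if (pids[i] == 0) {
            for (int j = 0; j <= i; j++)
                close(pipes[j][1]); /* drop inherited write ends */
            _exit(peer_wait_and_attach(pipes[i][0], path));
        }
        close(pipes[i][0]);
    }

    /* The root allocates and maps the area before anyone else touches it. */
    void *area = MAP_FAILED;
    if (ok && ftruncate(fd, AREA_SIZE) != 0)
        ok = 0;
    if (ok) {
        area = mmap(NULL, AREA_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        ok = (area != MAP_FAILED);
    }

    /* The sync point: one status byte per peer, 'Y' or 'N'. */
    char status = ok ? 'Y' : 'N';
    for (int i = 0; i < npeers; i++) {
        if (write(pipes[i][1], &status, 1) != 1) {
            /* nothing to do: the peer will see EOF and give up */
        }
        close(pipes[i][1]);
    }
    for (int i = 0; i < npeers; i++) {
        int ws;
        if (waitpid(pids[i], &ws, 0) == pids[i] && WIFEXITED(ws) && WEXITSTATUS(ws) == 0)
            attached++;
    }

    if (area != MAP_FAILED)
        munmap(area, AREA_SIZE);
    close(fd);
    unlink(path);
    return attached;
}
```

The point of the sketch is only that the outcome is symmetric: every peer 
sees the same answer from the root, so either all of them attach or none do.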

But I am not confident about the other BTLs.


On Jun 2, 2010, at 12:51 PM, Eugene Loh wrote:

> George Bosilca wrote:
> 
> > We did assume that at least the errors are symmetric, i.e. if A fails
> > to connect to B then B will fail when trying to connect to A.
> 
> I've not been following this thread closely, but thought I'd add a comment.
> 
> It used to be that the sm BTL could fail asymmetrically.  A shared
> memory area could be allocated and processes would start to allocate resources
> within shared memory.  At some point, the shared area would be
> exhausted.  So, some processes were set up to communicate to others, but
> the others would not be able to communicate back via the same BTL.  I
> think this led to much brokenness.  (E.g., how would a process return a
> sm fragment to a sender?)
> 
> At this point, my recollection of those issues is very fuzzy.
> 
> In any case, I think those issues went away with the shared-memory work
> I did a while back.  The size of the area is now computed to be large
> enough that each process's initial allocation would succeed.


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

