On Jun 2, 2010, at 12:42 PM, George Bosilca wrote:

> > 1. In this case, the openib BTL was not finalized, so there was a stub 
> > still there listening on the RDMACM CPC.  When another process tried to 
> > connect to X's RDMACM CPC port, Bad Things happened (because it was only 
> > half set up) and we segv'ed.
> >
> > Obviously, this should be fixed.  "Fixed" in this case probably means 
> > closing down the RDMACM CPC listening port.  But then that leads to another 
> > form of Badness.
> 
> I wonder how this is possible. If a process X fails to connect to Y, how can 
> Y succeed in connecting to X? Please enlighten me ...

It doesn't.  Process X segvs after it goes into the RDMACM CPC accept code 
(because the openib BTL was only half set up).

> > 2. If the openib BTL cleanly shuts down and is *not* still listening on its 
> > modex-advertised RDMACM CPC contact port, then if some other process tries 
> > to contact process X via the modex info, it'll fail.  This will then be 
> > judged to be a fatal error.  Failover in the BML will simply have delayed 
> > the job abort until someone tries to contact X via the openib BTL.
> 
> Isn't there any kind of timeout mechanism in the RDMACM CPC? If there is one 
> and the connection fails, then the PML will automatically try to use the next 
> available BTL, so it will eventually fail over to TCP (if available).

Yes, there is a timeout.  I forget offhand what we do if the timeout occurs.  
We probably report the connect failure in the "normal" way, but I don't know 
that for sure.

> > I think that the majority of this discussion about the BML failure (or not) 
> > behavior assumed that *all* processes had the same failure (at least: *I* 
> > assumed this).  But if only *some* of the processes fail a given BTL 
> > add_procs, we have a problem because we're beyond the point of deciding who 
> > can connect to whom.  Shutting down a single BTL module at that point will 
> > create an inconsistency of the distributed data.
> 
> We did assume that at least the errors are symmetric, i.e. if A fails to 
> connect to B then B will fail when trying to connect to A. However, if there 
> are other BTLs, the connection is supposed to move smoothly over to another 
> BTL. As an example, in the MX BTL, if two nodes have MX support but do not 
> share the same mapper, add_procs will silently fail.

This sounds dodgy and icky.  We have to wait for a connect timeout to fail over 
to the next BTL?

How long is the typical/default TCP timeout?
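
To make the concern concrete, here is a minimal sketch (in Python, not Open MPI 
code) of trying transports in order and moving to the next one only after a 
connect attempt fails or times out.  Every name in it (connect_with_failover, 
stale_listener, working_tcp, the 5-second timeout) is hypothetical and chosen 
for illustration; it is not how the BML, PML, or RDMACM CPC are actually 
implemented.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a BTL: a name plus a connect attempt that
# raises TimeoutError / ConnectionError on failure.
Transport = Tuple[str, Callable[[float], None]]


def connect_with_failover(
    transports: List[Transport], timeout_s: float
) -> Tuple[Optional[str], List[str]]:
    """Try each transport in order.

    Returns (name of the transport that connected, names that failed first).
    The first element is None if every transport failed.
    """
    failed: List[str] = []
    for name, connect in transports:
        try:
            connect(timeout_s)
            return name, failed
        except (TimeoutError, ConnectionError):
            # Only now do we learn this transport is unusable for this peer.
            failed.append(name)
    return None, failed


def stale_listener(timeout_s: float) -> None:
    # Models a peer that advertised a contact port but is no longer
    # listening on it: the attempt ends only when the timeout expires.
    raise TimeoutError(f"no answer within {timeout_s}s")


def working_tcp(timeout_s: float) -> None:
    # Models a transport that connects successfully.
    return None


used, failed = connect_with_failover(
    [("openib", stale_listener), ("tcp", working_tcp)], timeout_s=5.0
)
print(used, failed)
```

In this toy model, each stale transport ahead of a working one adds up to one 
full timeout to connection setup, which is why the default timeout value 
matters.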

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

