On Jun 2, 2010, at 12:42 PM, George Bosilca wrote:

> > 1. In this case, the openib BTL was not finalized, so there was a stub
> > still there listening on the RDMACM CPC. When another process tried to
> > connect to X's RDMACM CPC port, Bad Things happened (because it was only
> > half setup) and we segv'ed.
> >
> > Obviously, this should be fixed. "Fixed" in this case probably means
> > closing down the RDMACM CPC listening port. But then that leads to another
> > form of Badness.
>
> I wonder how this is possible. If a process X fails to connect to Y, how can
> Y succeed to connect to X ? Please enlighten me ...

It doesn't. Process X segvs after it goes into the RDMACM CPC accept code
(because the openib BTL was only half set up).

> > 2. If the openib BTL cleanly shuts down and is *not* still listening on its
> > modex-advertised RDMACM CPC contact port, then if some other process tries
> > to contact process X via the modex info, it'll fail. This will then be
> > judged to be a fatal error. Failover in the BML will simply have delayed
> > the job abort until someone tries to contact X via the openib BTL.
>
> Isn't there any kind of timeout mechanism in the RDMACM CPC? If there is one
> and the connection fails, then the PML will automatically try to use the next
> available BTL, so it will eventually fail over to TCP (if available).

Yes, there is a timeout. I forget offhand what we do if the timeout occurs.
We probably report the connect failure in the "normal" way, but I don't know
that for sure.

> > I think that the majority of this discussion about the BML failure (or not)
> > behavior assumed that *all* processes had the same failure (at least: *I*
> > assumed this). But if only *some* of the processes fail a given BTL
> > add_procs, we have a problem because we're beyond the point of deciding who
> > can connect to whom. Shutting down a single BTL module at that point will
> > create an inconsistency of the distributed data.
>
> We did assume that at least the errors are symmetric, i.e. if A fails to
> connect to B then B will fail when trying to connect to A. However, if there
> are other BTLs the connection is supposed to smoothly move over to some other
> BTL. As an example in the MX BTL, if two nodes have MX support, but they do
> not share the same mapper the add_procs will silently fail.

This sounds dodgy and icky. We have to wait for a connect timeout to fail
over to the next BTL? How long is the typical/default TCP timeout?

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
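
As a rough illustration of the failover question above (this is not Open MPI
code; every name and timeout value below is hypothetical), the following
Python sketch models a process that tries transports in priority order and
only moves on to the next one after a connect timeout. It suggests that the
delay before reaching a working transport is the sum of the timeouts of the
transports that failed first. No real network I/O or sleeping is done; the
waiting time is only accounted for.

```python
class ConnectTimeout(Exception):
    """Raised by a (hypothetical) transport when its connect attempt times out."""


def stale_openib_connect(peer, timeout_s):
    # Models a peer that advertised a contact port but is no longer listening.
    raise ConnectTimeout(f"{peer}: no listener, gave up after {timeout_s}s")


def tcp_connect(peer, timeout_s):
    # Models a transport that connects successfully.
    return None


def connect_with_failover(transports, peer):
    """Try each (name, connect_fn, timeout_s) entry in priority order.

    Returns (name, waited), where waited is the total time that would have
    been spent on timeouts before a transport succeeded.
    """
    waited = 0.0
    for name, connect_fn, timeout_s in transports:
        try:
            connect_fn(peer, timeout_s)
            return name, waited
        except ConnectTimeout:
            # This transport is unusable for this peer; account for the
            # time lost and fall through to the next one.
            waited += timeout_s
    raise RuntimeError(f"no transport could reach {peer}")


# Hypothetical priority list; the timeout values are made up for illustration.
TRANSPORTS = [
    ("openib", stale_openib_connect, 5.0),
    ("tcp", tcp_connect, 30.0),
]

if __name__ == "__main__":
    print(connect_with_failover(TRANSPORTS, "X"))
```

In this toy model the caller ends up on "tcp" only after paying the full
"openib" timeout, which is the cost being questioned above. If every
transport times out, the sketch raises an error rather than continuing.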