On Wed, Jun 13, 2007 at 10:52:53AM -0600, Galen Shipman wrote:
>
> On Jun 13, 2007, at 10:48 AM, Jeff Squyres wrote:
>
> > I wonder if this is bringing up the point that there are several of
> > us working in the openib code base -- I wonder if it would be
> > worthwhile to have a [short] teleconference to discuss what we're all
> > doing in openib, where we're doing it (trunk, branch, whatever), when
> > we expect to have it done, what version we need it in, etc. Just a
> > coordination kind of teleconference. If people think this is a good
> > idea, I can set up the call.
>
> sounds good to me.
Sounds good to me too. Pasha is also working on the async event thread.
This patch is not something I had planned to work on, but the problem
prevented me from testing my changes to OB1 and is serious enough to be
fixed on v1.2.

>
> - Galen
>
>
> > For example, don't forget that Nysal and I have the openib btl
> > port-selection stuff off in /tmp/jnysal-openib-wireup (the
> > btl_openib_if_[in|ex]clude MCA params). Per my prior e-mail, if no one
> > objects, I will be bringing that stuff in to the trunk tomorrow evening
> > (I'm pretty sure it won't conflict with what Galen is doing; Galen and
> > I discussed on the phone this morning).
> >
> >
> > On Jun 13, 2007, at 11:38 AM, Galen Shipman wrote:
> >
> >> Hi Gleb,
> >>
> >> As we have discussed before I am working on adding support for
> >> multiple QPs with either per peer resources or shared resources.
> >> As a result of this I am trying to clean up a lot of the OpenIB code.
> >> It has grown up organically over the years and needs some attention.
> >> Perhaps we can coordinate on commits or even work from the same temp
> >> branch to do an overall cleanup as well as addressing the issue you
> >> describe in this email.
> >>
> >> I bring this up because this commit will conflict quite a bit with
> >> what I am working on. I can always merge it by hand, but it may make
> >> sense for us to get this all done in one area and then bring it all
> >> over?
> >>
> >> Thanks,
> >>
> >> Galen
> >>
> >>
> >> On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote:
> >>
> >>> Hello everyone,
> >>>
> >>> I encountered a problem with the openib on-demand connection code.
> >>> Basically it works only by pure luck if you have more than one
> >>> endpoint for the same proc, and it sometimes breaks in mysterious
> >>> ways.
> >>>
> >>> The algorithm works like this: A wants to connect to B, so it creates
> >>> a QP and sends it to B. B receives the QP from A, looks for an
> >>> endpoint that is not yet associated with a remote endpoint, creates a
> >>> QP for it and sends the info back.
> >>> Now A receives the QP and goes through the same logic as B, i.e. it
> >>> looks for an endpoint that is not yet connected, BUT there is no
> >>> guarantee that it will find the endpoint that initiated the
> >>> connection in the first place! If it finds another one, it will
> >>> create a QP for it and send it back to B, and so on and so forth. In
> >>> the end I sometimes get a peculiar mesh of connections where no QP
> >>> has a connection back to it from the peer process.
> >>>
> >>> To overcome this problem B needs to send back some info that will
> >>> allow A to determine the endpoint that initiated the connection
> >>> request. The lid:qp pair will allow for this. But even then the
> >>> problem will remain if two procs initiate a connection at the same
> >>> time. To deal with simultaneous connections an asymmetric protocol
> >>> has to be used: one peer becomes the master and the other the slave.
> >>> The slave always initiates the connection to the master. The master
> >>> chooses a local endpoint to satisfy the incoming request and sends
> >>> the info back to the slave. If the master wants to initiate a
> >>> connection, it sends a message to the slave and the slave initiates
> >>> a connection back to the master.
> >>>
> >>> The included patch implements the algorithm described above and works
> >>> for all scenarios in which the current code fails to create a
> >>> connection.
> >>>
> >>> --
> >>> Gleb.
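
To make the scheme quoted above concrete, here is a minimal Python sketch of
the asymmetric wire-up. It is only an illustration and not the code in
fix_openib_wireup.diff: the Proc and Endpoint classes, all function names,
and the rule that the lower rank acts as master are assumptions made for the
example (the mail does not say how the roles are picked), and messages are
modelled as plain function calls rather than real out-of-band sends.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

QpId = Tuple[int, int]  # (lid, qp_num): names one queue pair


@dataclass
class Endpoint:
    local: QpId                     # QP owned by this endpoint
    remote: Optional[QpId] = None   # peer QP, set once connected


class Proc:
    """Hypothetical model of a process with several endpoints to one peer."""

    def __init__(self, rank: int, qps: List[QpId]):
        self.rank = rank
        self.endpoints = [Endpoint(qp) for qp in qps]

    def is_master_of(self, peer: "Proc") -> bool:
        # Assumption: lower rank is master. The mail only requires that
        # exactly one side of each pair takes the role.
        return self.rank < peer.rank

    def free_endpoint(self) -> Endpoint:
        # First endpoint without a remote QP (raises StopIteration if none).
        return next(ep for ep in self.endpoints if ep.remote is None)

    def by_qp(self, qp: QpId) -> Endpoint:
        # Look an endpoint up by its own lid:qp pair.
        return next(ep for ep in self.endpoints if ep.local == qp)


def slave_connect(slave: Proc, master: Proc, ep: Endpoint) -> None:
    request = ep.local                      # slave -> master: my lid:qp
    m_ep = master.free_endpoint()           # master picks any free endpoint
    m_ep.remote = request
    # master -> slave: the reply echoes the initiating lid:qp, so the slave
    # finds the endpoint that started the handshake instead of guessing.
    reply = {"qp": m_ep.local, "in_reply_to": request}
    slave.by_qp(reply["in_reply_to"]).remote = reply["qp"]


def connect(initiator: Proc, peer: Proc) -> None:
    if initiator.is_master_of(peer):
        # The master never sends a QP first; it asks the slave to initiate.
        slave_connect(peer, initiator, peer.free_endpoint())
    else:
        slave_connect(initiator, peer, initiator.free_endpoint())


def fully_paired(x: Proc, y: Proc) -> bool:
    # True if every connected QP on x is pointed back at by its peer QP on y.
    return all(y.by_qp(ep.remote).remote == ep.local
               for ep in x.endpoints if ep.remote is not None)
```

For example, with a = Proc(0, [(1, 10), (1, 11)]) and
b = Proc(1, [(2, 20), (2, 21)]), calling connect(a, b) and connect(b, a) in
either order leaves two QP pairs that point at each other, so
fully_paired(a, b) holds. That is the property the original wire-up could
lose when the reply was matched against "any endpoint not yet connected".
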
> >>> <fix_openib_wireup.diff>
> >>> _______________________________________________
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> >
> > --
> > Jeff Squyres
> > Cisco Systems

--
Gleb.