[OMPI devel] Open IB BTL and iWARP
Last I looked, the openib BTL relied on the short eager RDMA buffers being written in order. Is this still the case? If so, how is this handled when iWARP is underneath the user verbs API instead of Mellanox IB HCAs?
Re: [OMPI devel] Open IB BTL and iWARP
On Jul 9, 2008, at 4:08 PM, Don Kerr wrote:

> Last I looked, the openib BTL relied on the short eager RDMA buffers being written in order. Is this still the case?

The eager RDMA optimization relies on the last byte of the short message being written last. I.e., when we see the last byte in the target buffer, we assume the rest of the message is there.

> If so, how is this handled when iWARP is underneath the user verbs API instead of Mellanox IB HCAs?

There's an MCA parameter that disables this optimization if the underlying hardware can't provide that guarantee. We also have this field in the INI file so that specific adapters can disable it automatically if they want/need to. Chelsio T3, NetEffect NE020, and NetXen adapters can all provide that guarantee (I asked those vendors). You can see this in:

https://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/mca/btl/openib/mca-btl-openib-hca-params.ini

-- Jeff Squyres
Cisco Systems
[OMPI devel] IOF repair
I have been investigating Ticket #1135: stdin is read twice if rank=0 shares the node with mpirun. Repairing this problem is going to be quite difficult due to the rather terrible spaghetti code in the IOF, and the fact that the IOF in the HNP actually rml.sends the IO to itself multiple times as it cycles through the spaghetti. Unfortunately, this problem -is- a regression from 1.2. Rather than spending weeks trying to fix it, I see two approaches we could pursue.

First, I could repair the problem by essentially returning the IOF to its 1.2 state. This will have to be done by hand, as most of the differences are in function calls to utilities that have changed due to the removal of the old NS framework. However, there are a few places where the logic itself has been modified - and the problem must stem from somewhere in there. If I make this change, then we will be no better, and no worse, than 1.2. Note that we currently advise people to read from a file instead of from stdin to avoid other issues that were present in 1.2.

Alternatively, we could ship 1.3 as-is and warn users (similar to 1.2) that they should avoid reading from stdin if there is any chance that rank=0 could be co-located with mpirun. Note that most of our clusters do not allow such co-location - but it is permitted by default by OMPI.

We already plan to revisit the IOF at next week's technical meeting, with a goal of redefining the IOF's API to a more reduced set that reflects a less ambitious requirement. I expect to implement those changes fairly soon thereafter, but that would be targeted to 1.4 - not 1.3.

Any thoughts on which way we should go?
Ralph
Re: [OMPI devel] IOF repair
I'd like to have a look at the diff between the two, but I can't do so until tomorrow at the earliest.

On Jul 9, 2008, at 7:26 PM, Ralph Castain wrote:

> I have been investigating Ticket #1135: stdin is read twice if rank=0 shares the node with mpirun. Repairing this problem is going to be quite difficult due to the rather terrible spaghetti code in the IOF, and the fact that the IOF in the HNP actually rml.sends the IO to itself multiple times as it cycles through the spaghetti. Unfortunately, this problem -is- a regression from 1.2. Rather than spending weeks trying to fix it, I see two approaches we could pursue.
>
> First, I could repair the problem by essentially returning the IOF to its 1.2 state. This will have to be done by hand, as most of the differences are in function calls to utilities that have changed due to the removal of the old NS framework. However, there are a few places where the logic itself has been modified - and the problem must stem from somewhere in there. If I make this change, then we will be no better, and no worse, than 1.2. Note that we currently advise people to read from a file instead of from stdin to avoid other issues that were present in 1.2.
>
> Alternatively, we could ship 1.3 as-is and warn users (similar to 1.2) that they should avoid reading from stdin if there is any chance that rank=0 could be co-located with mpirun. Note that most of our clusters do not allow such co-location - but it is permitted by default by OMPI.
>
> We already plan to revisit the IOF at next week's technical meeting, with a goal of redefining the IOF's API to a more reduced set that reflects a less ambitious requirement. I expect to implement those changes fairly soon thereafter, but that would be targeted to 1.4 - not 1.3.
>
> Any thoughts on which way we should go?
> Ralph

-- Jeff Squyres
Cisco Systems
[OMPI devel] v1.3 RM: need a ruling
v1.3 RMs:

Due to some recent work, the MCA parameter mpi_paffinity_alone disappeared -- it was moved and renamed to opal_paffinity_alone. This is Bad because we have a lot of historical precedent based on the MCA param name "mpi_paffinity_alone" (FAQ, PPT presentations, e-mails on public lists, etc.). So it needed to be restored for v1.3. I just noticed that I hadn't opened a ticket on this -- sorry -- I opened #1383 tonight.

For a variety of reasons described in #1383, Lenny and I first decided that it would be best to fix this problem via the functionality committed in r18770 (the ability to find out where an MCA parameter was set). This would allow us to register two MCA params, mpi_paffinity_alone and opal_paffinity_alone, and generally do the Right Thing (because we could then tell whether a user had set a value or whether it was a default MCA param value). This functionality will also be useful in the openib BTL, where there is a blend of MCA parameters and INI file parameters.

However, after doing that, it seemed like only a few more steps to implement an overall better solution: "synonyms" for MCA parameters. I.e., register the name mpi_paffinity_alone as a synonym for opal_paffinity_alone. Along the way, it was trivial to add a "deprecated" flag for MCA parameters that we no longer want used (this deprecated flag is also useful in the OB1 PML and openib BTL).

So to fix a problem that needed to be fixed for v1.3 (restore the MCA parameter mpi_paffinity_alone), I ended up implementing new functionality. Can this go into v1.3, or do we need to implement some kind of alternate fix? (I admit to not having thought through what it would take to fix this without the new MCA parameter functionality -- it might be kinda wonky.)

-- Jeff Squyres
Cisco Systems