Short version:
--------------

I propose that, for the v1.3 series, we disallow using multiple different mca_btl_openib_receive_queues values (or receive_queues values from the INI file) within a single MPI job.

More details:
-------------

The reason I'm looking into this heterogeneity stuff is to help Chelsio support their iWARP NIC in OMPI. Their NIC needs a specific value for mca_btl_openib_receive_queues (specifically: it does not support SRQ and it has the wireup race condition that we discussed before).
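
To make that concrete, the kind of value I have in mind contains only per-peer (P) queues and no shared receive (S) queues.  The exact queue sizes and counts below are made up for illustration; only the P-vs-S distinction matters:

    mpirun -np 4 \
        --mca btl openib,self \
        --mca btl_openib_receive_queues P,128,256,192,128:P,65536,256,192,128 \
        ./a.out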

The major problem is that all the BSRQ information is currently stored on the openib component -- it is *not* maintained on a per-HCA (or per-port) basis. We *could* move all the BSRQ info to live on the hca_t struct (or even the openib module struct), but doing so has at least 3 big consequences:

1. It would touch a lot of code. But touching all this code is relatively low risk; it will be easy to check for correctness because the changes will either compile or not.

2. There are functions (some of which are static inline) that read the BSRQ data. These functions would have to take an additional (hca_t*) (or (btl_openib_module_t*)) parameter.

3. Getting to the BSRQ info will take at least 1 or 2 more dereferences (e.g., module->hca->bsrq_info or module->bsrq_info...).

I'm not too concerned about #1 (it's grunt work), but I am a bit concerned about #2 and #3 since at least some of these places are in the critical performance path.
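
To make #2 and #3 concrete, here is roughly the shape of the change; the type, field, and function names below are simplified stand-ins for illustration, not the actual openib BTL declarations:

    #include <stddef.h>

    /* Illustrative stand-in for the parsed per-queue BSRQ parameters. */
    typedef struct {
        size_t size;       /* buffer size for this queue            */
        int    rd_num;     /* number of receive descriptors         */
        /* ... other per-queue BSRQ parameters ...                   */
    } bsrq_qp_info_t;

    /* Today (roughly): one global copy hanging off the component.   */
    typedef struct {
        int             num_qps;
        bsrq_qp_info_t *qps;
        /* ... */
    } openib_component_t;

    /* Proposed: the same fields move onto the per-HCA struct, so
       different HCAs (or ports) could carry different BSRQ layouts. */
    typedef struct {
        int             num_qps;
        bsrq_qp_info_t *qps;
        /* ... */
    } openib_hca_t;

    /* Consequence #2: every reader of the BSRQ data grows an extra
       (hca_t*) argument; consequence #3: every access picks up an
       extra dereference or two.                                     */
    static inline size_t qp_buffer_size(const openib_hca_t *hca, int qp)
    {
        return hca->qps[qp].size;   /* was: component.qps[qp].size */
    }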

Given these concerns, I propose the following for v1.3:

- Add a "receive_queues" field to the INI file so that the Chelsio adapter can run out of the box (i.e., "mpirun -np 4 a.out" with hosts containing Chelsio NICs will get a value for btl_openib_receive_queues that will work).  A sketch of such an INI entry is below this list.

- NetEffect NICs will also require overriding btl_openib_receive_queues, but will likely have a different value than Chelsio NICs (they don't have the wireup race condition that Chelsio does).

- Because the BSRQ info is on the component (i.e., global), we should detect when multiple different receive_queues values are specified and gracefully abort; a rough sketch of that check is also below.
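
As a made-up example of the first bullet, the INI entry could look something like the following.  The vendor/part IDs and queue values are placeholders (not necessarily Chelsio's real ones), and it assumes the new receive_queues key takes the same syntax as the MCA parameter:

    [Chelsio T3]
    vendor_id = 0x1425
    vendor_part_id = 0x0030
    receive_queues = P,128,256,192,128:P,65536,256,192,128

For the last bullet, the check can be as simple as remembering the first value that gets selected and string-comparing every later one against it.  A rough sketch (the helper below is hypothetical; the real code would use the component's existing error/abort machinery rather than exit()):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* First receive_queues value wins; any later, different value is
       treated as a fatal configuration error.                        */
    static char *selected_receive_queues = NULL;

    static void check_receive_queues(const char *requested,
                                     const char *source)
    {
        if (NULL == selected_receive_queues) {
            selected_receive_queues = strdup(requested);
        } else if (0 != strcmp(selected_receive_queues, requested)) {
            fprintf(stderr, "openib BTL: conflicting receive_queues values: "
                    "\"%s\" (from %s) vs. previously selected \"%s\"; "
                    "aborting.\n",
                    requested, source, selected_receive_queues);
            exit(1);
        }
    }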

I think it will be quite uncommon to need two different receive_queues values in a single job, so this proposal should be fine for v1.3.

Comments?



On May 12, 2008, at 6:44 PM, Jeff Squyres wrote:

After looking at the code a bit, I realized that I completely forgot
that the INI file was invented to solve at least the
heterogeneous-adapters-in-a-host problem.

So I amended the ticket to reflect that that problem is already
solved.  The other part is not, though -- consider two MPI procs on
different hosts, each with an iWARP NIC, but one NIC supports SRQs and
one does not.


On May 12, 2008, at 5:36 PM, Jeff Squyres wrote:

I think that this issue has come up before, but I filed a ticket
about it because at least one developer (Jon) has a system with both
IB and iWARP adapters:

  https://svn.open-mpi.org/trac/ompi/ticket/1282

My question: do we care about the heterogeneous adapter scenarios?
For v1.3?  For v1.4?  For ...some version in the future?

I think the first issue I identified in the ticket is grunt work to
fix (annoying and tedious, but not difficult), but the second one
will be a little dicey -- it has scalability issues (e.g., sending
around all info in the modex, etc.).



--
Jeff Squyres
Cisco Systems
