There was discussion about this issue on the call today. Several
vendors -- Sun, IBM, and Mellanox -- expressed willingness to "fix"
the problem and make OMPI interoperate across different HCAs and
RNICs in a single job run.
So the question is -- what exactly do you want to do to fix this?
Presumably this kind of auto-detection will only occur when we have a
modex -- and it further assumes that each port (module) can pass
around its vendor and part ID. Then, upon start_connect(), you can
take your peer's vendor and part ID, look up its values in the INI
file, and see if there's a clash. That's a very general idea; I
haven't looked at the code recently to see whether it would actually
work.
I posted one idea here:
http://www.open-mpi.org/community/lists/users/2009/01/7861.php
But I'm not entirely convinced that's the Right way to go.
On Jan 27, 2009, at 8:15 AM, Jeff Squyres wrote:
On Jan 26, 2009, at 4:46 PM, Jeff Squyres wrote:
Note that I did not say that. I specifically stated that OMPI
failed and it is due to the fact that we are customizing for the
individual hardware devices. To be clear: this is an OMPI issue.
I'm asking (at the request of the IWG) if anyone cares about fixing
it.
I should clarify something in this discussion: Open MPI is *capable*
of running on heterogeneous OpenFabrics hardware (assuming IB <-->
IB and iWARP <--> iWARP, of course -- not IB <--> iWARP) as long as
it uses the same verbs/hardware configuration values on all of the
hardware. Depending on the hardware, Open MPI may not be configured
to run this way by default, because it may customize differently for
different HCAs/RNICs.
However, if one manually configures Open MPI to use the same verbs/
hardware configuration values across all the HCAs/RNICs in your
cluster, Open MPI will probably work fine. If Open MPI doesn't work
in this kind of configuration, it may indicate some kind of vendor
HCA/RNIC incompatibility.
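As an illustration of what "manually configuring the same values
everywhere" could look like: one option is to pin the openib BTL's
parameters in a file such as openmpi-mca-params.conf on every node.
The parameter names and values below are only an example from memory
of that era's openib BTL -- verify them against `ompi_info --param
btl openib` for your actual build before relying on them:

```
# Hedged example: force identical openib BTL values on every node
# (e.g. in $prefix/etc/openmpi-mca-params.conf).  Check the exact
# parameter names/values with `ompi_info --param btl openib`.
btl = openib,self
btl_openib_receive_queues = P,65536,256,192,128
btl_openib_mtu = 4
```

The point is not these particular values, but that every HCA/RNIC in
the job must end up with one common, mutually supported set.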
Case in point: I regression test "limited heterogeneous" scenarios
on my MPI testing cluster at Cisco every night. Specifically, I
have a variety of different models of Mellanox HCAs, and they all
interoperate just fine across 2 air-gapped IB subnets. I don't know
if anyone has tested wildly different HCAs/RNICs using some
lowest-common-denominator verbs/hardware configuration values (i.e.,
some set of values that is supported by all HCAs/RNICs) to see if
OMPI works. I don't immediately see why that wouldn't work, but I
haven't tested it myself.
Out of the box, however, Open MPI is not necessarily configured to
have the same verbs/hardware configuration for each device. That is
what may fail by default.
--
Jeff Squyres
Cisco Systems
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel