Jeff Squyres wrote:

On Nov 9, 2007, at 1:24 PM, Don Kerr wrote:

Both. I was thinking of listing what I think are the multi-rail requirements, but first wanted to understand what the current state of things is.

I believe the OF portion of the FAQ describes what we do in the v1.2 series (right Gleb?); I honestly don't remember what we do today on the trunk (I'm pretty sure that Gleb has tweaked it recently).
Gleb's response answered this.

As for what we *should* do, it's a very complicated question.  :-\
OK. I knew that "closeness to the NIC" was a concern, but was not aware that an attempt to tackle it had begun. I will look at the "carto" framework.

Thanks
-DON

This is where all these discussions regarding affinity, NUMA, and NUNA (non uniform network architecture) come into play. A "very simple" scenario may be something like this:

- host A is UMA (perhaps even a uniprocessor) with 2 ports that are equidistant from the 1 MPI process on that host
- host B is the same, except it only has 1 active port on the same IB subnet as host A's 2 ports
- the ports on both hosts are all the same speed (e.g., DDR)
- the ports all share a single, common, non-blocking switch

But even with this "simple" case, the answer as to what you should do is still unclear. If host A is able to drive both of its DDR links at full speed, you could cause congestion at the link to host B if the MPI process on host A opens two connections. But if host A is only able to drive the same effective bandwidth out of its two ports as it is through a single port, then the end effect is probably fairly negligible -- it might not make much of a difference whether the MPI process on host A opens 1 or 2 connections to host B.
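To make that tradeoff concrete, here is a tiny sketch in C of the kind of heuristic a BTL might apply. This is illustration only, not Open MPI code; the function name and parameters are invented. The idea is simply: stripe across multiple ports only when the host can actually drive more aggregate bandwidth than a single port carries.

```c
/* Illustrative heuristic only -- not Open MPI code.  Decide how many
 * connections an MPI process on a multi-port host should open to a
 * peer, given per-port bandwidth and the aggregate bandwidth the host
 * can actually drive (all names and parameters are hypothetical). */
static int num_connections(int local_active_ports,
                           double port_gbps,
                           double host_drive_gbps)
{
    if (local_active_ports < 2) {
        return 1;                   /* nothing to stripe across */
    }
    /* If the host cannot push more than one port's worth of data,
     * a second connection buys little and may just add congestion
     * at a single-port peer like host B above. */
    if (host_drive_gbps <= port_gbps) {
        return 1;
    }
    return local_active_ports;      /* stripe across all active ports */
}
```

For example, with 2 DDR ports and a host able to drive twice the per-port rate, this heuristic would open 2 connections; if the host tops out at one port's worth of bandwidth, only 1.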

But then throw in other effects that I mentioned above (NUMA, NUNA, etc.), and the equation becomes much more complex. In some cases, it may be good to open 1 connection (e.g., bandwidth load balancing); in other cases it may be good to open 2 (e.g., congestion avoidance / spreading traffic around the network, particularly in the presence of other MPI jobs on the network). :-\

Such NUNA architectures may sound unusual to some, but both IBM and HP sell [many] blade-based HPC solutions with NUNA internal IB networks. Specifically: this is a fairly common scenario.

So this is a difficult question without a great answer. The hope is that the new carto framework (for which Sharon recently sent around requirements) will at least make topology information available for both the host and the network, so that BTLs can possibly make some intelligent decisions about what to do in these kinds of scenarios.
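As a hedged sketch of what that might enable (the carto interface was still being specified at the time, so the distance-array representation here is purely hypothetical), a BTL could rank ports by their topological distance from the calling process:

```c
/* Hypothetical sketch -- the real carto API may look nothing like
 * this.  Given per-port distances from the calling MPI process, as
 * some topology framework might report them, return the index of
 * the closest port. */
static int closest_port(const int *distance, int nports)
{
    int best = 0;
    for (int i = 1; i < nports; ++i) {
        if (distance[i] < distance[best]) {
            best = i;
        }
    }
    return best;
}
```

On a NUNA blade chassis, such distances could distinguish a port on the local switch hop from one several hops away, which is exactly the information the BTL lacks today.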
