On Mon, 15 Nov 2010, Terry Dontje wrote:

A few comments:

1.  Have you guys considered using hwloc for level 4-7 detection?
Yes, and I agree there may be something to improve in level 4-7 detection. But note that hitopo differs from hwloc because it does not discover the whole machine, only where MPI processes have been spawned. More on this below.

2. Is L2 related to L2 cache? If no then is there some other term you could use?
It is not L2 cache. However, claiming that "L2" always refers to L2 cache is a bit exaggerated in my opinion. The term in hitopo is "L2NUMA", which seems clear to me. And there are L2 Infiniband switches, L2 support, ... :-)

3.  What do you see if the process is bound to multiple cores/hyperthreads?
4.  What do you see if the process is not bound to any level 4-7 items?
Currently (and this is not optimal), as soon as the process is not bound to exactly one core, the cpuid component returns nothing (no socket, no core). We could improve this by returning only the socket when the process is bound to a socket.

When placement is not per-core, the socket number will therefore be 0 and core numbers will be assigned by the "renumber" phase from 0 to N-1 (N being the number of MPI processes on the node).

Hyperthreads are only used if two processes are bound to the same core (the renumber phase will mark them as 0, 1, ...).

5. What about L1 and L2 cache locality as some levels? (hwloc exposes these but these are also at different depths depending on the platform).
This is something hitopo doesn't [want to] show. But we could imagine calling hwloc to learn the properties of MPI processes on the same core/socket/...

Note I am working with Jeff Squyres and Josh Hursey on some new paffinity code that uses hwloc. Though the paffinity code may not have a direct relationship to hitopo, the use of hwloc and standardization of what you call level 4-7 might help avoid some user confusion.
I agree there is a big potential for confusion between hwloc, carto, hitopo, ... One could think we should share code, but that is often not possible, or not what we want.

My (maybe incorrect) vision is that hwloc and carto discover the hardware topology, i.e. what exists on the node (not what will be used). This is used by placement modules or BTLs to know which resources to use when launching processes.

HiTopo reports where (inside this discovered topology) MPI processes end up being spawned [btw, not only intra-node but also inter-node]. We could get this information from the Open MPI components that do the spawning, but since that is not enough (the resource manager may do part of the binding), we re-do the discovery at the end.

To sum up, here is the complete picture as I see it:

[ 0. Resource manager restricts node/cpu/io/mem sets ]
  1. Hwloc discovers what's available for intra-node
  2. Spawning/placement is done by a combination of RMs, paffinity, ...
  3. HiTopo discovers what is used from intra- to inter- node.

Sylvain

On 11/15/2010 06:56 AM, Sylvain Jeaugey wrote:
As a follow-up to the Stuttgart developers' meeting, here is an RFC for our topology detection framework.

WHAT: Add a framework for hardware topology detection to be used by any other part of Open MPI to help optimization.

WHY: Collective operations or shared memory algorithms, among others, may have optimizations that depend on the hardware relationship between two MPI processes. HiTopo is an attempt to provide this information in a unified manner.

WHERE: ompi/mca/hitopo/

WHEN: When wanted.

==========================================================================

We developed the HiTopo framework for our collective operation component, but it may be useful for other parts of Open MPI, so we'd like to contribute it.

A wiki page has been set up:
https://svn.open-mpi.org/trac/ompi/wiki/HiTopo

and a bitbucket repository:
http://bitbucket.org/jeaugeys/hitopo/

In a few words, there are 3 steps in HiTopo:

 - Detection: each MPI process detects its topology at various levels:
    - core/socket: through the cpuid component
    - node: through gethostname
    - switch/island: through openib (mad) or slurm
      [ Other topology detection components may be added for other
        resource managers, specific hardware or whatever we want ... ]

 - Collection: an allgather is performed so that every process has all other processes' addresses

 - Renumbering: "string" addresses are converted to numbers starting at 0 (example: nodenames "foo" and "bar" become 0 and 1).

Any comment welcome,
Sylvain
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com



