When I upgraded the OMPI SVN trunk to the hwloc 1.2.2ompi version last week,
one of the OMPI devs found a problem that I think I am only now beginning to
understand.

Brief reminder of our strategy:

- on each compute node, OMPI launches a local "orted" helper daemon
- this orted fork/exec's the local MPI processes

To avoid the penalty of each MPI process invoking hwloc discovery more-or-less 
simultaneously upon startup (which, as we've seen on this list before, can be 
painful when core counts are large), we have the orted do the hwloc discovery, 
serialize this into XML, and send it to each of its local processes.  The local 
processes receive this XML and then load it into hwloc and run from there.
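
For readers less familiar with this hand-off, here is a minimal sketch of
shipping a serialized topology string from a daemon to a local process.  It is
not OMPI's actual transport: the pipe, the length-prefixed framing, and the
toy_send_xml()/toy_recv_xml() names are all assumptions made for illustration.
In the real flow the buffer would presumably come from
hwloc_topology_export_xmlbuffer() on the orted side and be handed to
hwloc_topology_set_xmlbuffer() on the receiving side.

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Write exactly len bytes, looping over short writes.  Returns 0 or -1.
 * (A production version would also retry on EINTR.) */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes, looping over short reads.  Returns 0 or -1. */
static int read_all(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Daemon side (hypothetical): send a length header, then the XML bytes. */
int toy_send_xml(int fd, const char *xml, uint32_t len)
{
    if (write_all(fd, &len, sizeof len) != 0)
        return -1;
    return write_all(fd, xml, len);
}

/* Process side (hypothetical): return a malloc'd, NUL-terminated copy of
 * the XML and store its length in *len_out, or return NULL on failure. */
char *toy_recv_xml(int fd, uint32_t *len_out)
{
    uint32_t len;
    char *xml;

    if (read_all(fd, &len, sizeof len) != 0)
        return NULL;
    xml = malloc((size_t)len + 1);
    if (xml == NULL)
        return NULL;
    if (read_all(fd, xml, len) != 0) {
        free(xml);
        return NULL;
    }
    xml[len] = '\0';
    *len_out = len;
    return xml;
}
```

The point of the design is that discovery happens once per node; every local
process only pays for a read and an XML parse.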

However, it looks like the resulting loaded-from-XML topology has
topology->is_thissystem set to 0, and therefore functions like
hwloc_get_cpubind() actually get wired up to dontget_thisproc_cpubind()
(instead of the proper Linux backend, for example).
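
To make the failure mode concrete, here is a toy model of that wiring.  It is
not hwloc's code: struct toy_topology, toy_wire_hooks(), toy_get_cpubind(), and
the fake Linux backend are hypothetical stand-ins, and what the stub reports
(the whole machine) is an assumption.  Only the names is_thissystem and
dontget_thisproc_cpubind are borrowed from the description above.

```c
struct toy_topology;
typedef int (*toy_get_cpubind_fn)(const struct toy_topology *t,
                                  unsigned long *mask);

/* Hypothetical miniature of a topology with one binding hook. */
struct toy_topology {
    int is_thissystem;            /* 1 if known to describe this machine */
    unsigned long complete_mask;  /* every CPU in the topology */
    toy_get_cpubind_fn get_thisproc_cpubind;
};

/* Stub chosen when is_thissystem is 0: never asks the OS.
 * Assumption: it reports the whole machine as a placeholder. */
static int dontget_thisproc_cpubind(const struct toy_topology *t,
                                    unsigned long *mask)
{
    *mask = t->complete_mask;
    return 0;
}

/* Pretend OS backend: claims the process is bound to CPU 2 only. */
static int toy_linux_get_thisproc_cpubind(const struct toy_topology *t,
                                          unsigned long *mask)
{
    (void)t;
    *mask = 0x4UL;
    return 0;
}

/* Pick the hook once, at load time, based on is_thissystem. */
void toy_wire_hooks(struct toy_topology *t)
{
    t->get_thisproc_cpubind = t->is_thissystem
        ? toy_linux_get_thisproc_cpubind
        : dontget_thisproc_cpubind;
}

/* Public entry point: just dispatches through whatever was wired up. */
int toy_get_cpubind(const struct toy_topology *t, unsigned long *mask)
{
    return t->get_thisproc_cpubind(t, mask);
}
```

With is_thissystem set to 1 the call reaches the backend and reports the real
binding; with it set to 0 the same call silently goes to the stub, which is the
behavior described above for the XML-loaded topology.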

How do we avoid this?  We need working hwloc functions after loading up an XML 
topology string.

-- 
Jeff Squyres
[email protected]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/