We're integrating hwloc deeper into Open MPI, and are also using it in some 
other projects.  As a result, I have some practical feedback.

(SIDENOTE: reading this letter after I wrote it, I see that it sounds 
overwhelmingly negative.  Please do not take it that way!  We love the use of 
hwloc in our projects -- we just have some very specific feedback on things 
that we would like changed.  We're just not noting all the GOOD things about 
hwloc in this email.  I'll send pink hwloc ponies and rainbows in a different 
email)

-----

1. The depth-specific accessors are Bad.  Given the warning language in the 
docs paired with the practical realities that some people actually do mix and 
match CPUs in a single server (especially when testing new chips), the 
depth-based accessors *can/will* fail.  Meaning: you have to write application 
code that can handle the non-uniform depth cases, making the depth-based 
accessors essentially useless.

Note that this also affects the non-depth-based accessors, because many (all?) 
of them just turn around and call the depth-based accessors.  :-\

In every project where we've used hwloc, we've ended up writing tree traversal 
based accessors (e.g., DFS kinds of search algorithms) for things like finding 
the Nth object of a given type for the reasons listed above.

We suggest adding DFS-based accessors and possibly deprecating the depth-based 
accessors.  True, the depth-based accessors can be faster to find a given 
object, but even with a really large hwloc tree (e.g., 1,000 PUs), does the 
speed difference really matter?  I wonder if people put hwloc searches in 
critical performance code paths where such a cost difference would matter -- we 
certainly don't.  FWIW: we've generally used hwloc during setup phases.

2. All caches are listed as HWLOC_OBJ_CACHE, regardless of their level.  We 
would like to request changing these to having specific enums for each level of 
cache -- perhaps adding HWLOC_OBJ_CACHE_L1 through L10 to cover future possible 
platforms.

The reasons we are asking for this are as follows:

(2a) the depth-based accessors are automatically broken for any machine with 
more than one level of cache (i.e., they return -1 because caches exist at 
multiple levels).  Yes, #1 expounded on how the depth-based accessors are bad, 
but I mention this point anyway.  :-)

(2b) by the same logic, calling get_nbobjs() on HWLOC_OBJ_CACHE fails.

(2c) to find the set of any given Lx caches, you basically have to traverse the 
tree looking for HWLOC_OBJ_CACHE *and* attr->cache.depth==x.  It would be 
cleaner if we could just look for HWLOC_OBJ_CACHE_L<whatever>.

(2d) more specifically: since all caches are of type HWLOC_OBJ_CACHE, we find 
ourselves putting in special case logic for caches all over our code.  Ick.

Note: I'm not sure how to add new HWLOC_OBJ_CACHE_Lx types and preserve 
backwards compatibility.  :-\

3. It would be really great to have some kind of flag in each object that says 
whether all of its children are homogeneous or not.  

Specifically: if the flag is true, it means that the trees rooted by 
obj->children[i] are "the same", meaning that each contain the same number of 
same-typed objects in the same topology layout, and have the same attributes 
(e.g., their memory sizes are the same, etc.).

Of course, the OS indexes and cpusets will be different between the objects in 
the different trees.  The homogeneous flag does not apply to those kinds of 
things.

But having this flag means that you might be able to traverse just the 
obj->children[0] tree and then be able to prune all other DFS searches and 
extrapolate the discovered results.

We ended up implementing this kind of feature in a struct hanging off 
obj->userdata; it saved extra compute cycles and some extra logic in some cases.

4. src/topology-synthetic.c emits error messages on stderr when you try to 
import invalid XML.  I am guessing that this was put there because it's a much 
more specific error message than simply returning, for example, EINVAL -- the 
stderr message tells you the XML file line number of the problem, for example.  

Could this be done in a different way?  I ask because test suites that import 
synthetic XML to hwloc do not want their stdout/stderr interrupted.

Here's two suggestions from Andreas Kupries (tcl-hwloc guy) for this issue:

(4a) One possibility would be a general error callback the user can configure 
and which is invoked with such messages. An application can then set up a 
callback doing whatever processing is proper for it.  Or NULL to disable.

(4b) A second possibility would be to save the message internally somewhere 
(can't use the topology, unfortunately) and provide an API to read it from 
there. Then, when an application gets an error it can use that API to get a 
more detailed message, if any, and process it as it sees fit.  This has thread 
safety implications, however.

5. The XML dump of the topology doesn't include all the support information, 
such as whether you can bind to threads/cores/etc.  I'm guessing this was done 
because the emphasis on importing XML was for drawing pretty lstopo pictures.

But we're using the XML export in OMPI to send the topology of compute nodes up 
to the scheduler, where decisions are made about how to lay out processes on 
the back-end compute nodes, what the binding width will be, etc.  This 
front-end scheduler needs to know whether the back-end node is capable of 
supporting binding, for example.

We manually added this information into the message that we send up to the 
scheduler, but it would be much nicer if the XML export/import just handled 
that automatically.

-- 
Jeff Squyres
[email protected]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


Reply via email to