Hello,
Happy new year btw :D
Considering future network topology support, I believe we probably need
to fix a couple of things before releasing 1.0. Just to sum up a
bunch of points that have been raised in the past months:
- there should be a way to have the complete topology in just one tree,
so you can browse it and assign tasks/processes/whatever in it,
according to architectural details provided by hwloc, but also network
details like bandwidth etc.
- the core of hwloc mustn't impose any particular tool, it must be easy
to either build something around hwloc detection and binding
functions, or load detection & binding plugins.
The way I see it is to provide a hwloc_topology_combine() function that
takes several hwloc_topology_t trees and an object type,
and builds a tree that contains a new object of that type, under which
the trees appear. That combination can actually already be done by
hand by concatenating XML files. For instance, on a simple cluster you'd
run lstopo on each machine and save XML files, load them together,
combine them under a "network" object (being able to register dynamic
object types should be easy), and save the result as an XML file, which
thus contains the complete topology of the cluster. A task dispatcher
can then browse it at will etc. When it comes to binding, it'd be
the task dispatcher's role to first launch the application on the target
machine, and run hwloc there to perform the actual binding, according to
the cpuset in the tree.
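To make the combining idea a bit more concrete, here is a rough sketch on a toy tree. Nothing in it is hwloc API: struct toy_obj, toy_obj_new() and toy_combine() are made-up names for illustration only, and cpusets are plain strings. It only shows the intended shape: a new top object of the requested type, the existing roots hung below it, and no cpuset on the new object.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an hwloc object; NOT the real hwloc_obj structure. */
struct toy_obj {
    char type[16];              /* e.g. "machine", "network" */
    const char *cpuset;         /* NULL when no cpuset is defined */
    struct toy_obj **children;
    size_t arity;
};

/* Hypothetical helper: allocate a childless object of the given type. */
static struct toy_obj *toy_obj_new(const char *type, const char *cpuset)
{
    struct toy_obj *obj;

    assert(type != NULL);
    obj = calloc(1, sizeof(*obj));
    if (!obj)
        return NULL;
    /* calloc zeroed the buffer, so the copy stays NUL-terminated. */
    strncpy(obj->type, type, sizeof(obj->type) - 1);
    obj->cpuset = cpuset;
    return obj;
}

/* Hypothetical analogue of the proposed hwloc_topology_combine():
 * build a new top object of type `type` and hang the given roots
 * below it. The new object's cpuset stays NULL; fixing that up, if
 * wanted at all, is left to the caller. */
static struct toy_obj *toy_combine(struct toy_obj **roots, size_t n,
                                   const char *type)
{
    struct toy_obj *top = toy_obj_new(type, NULL);

    if (!top)
        return NULL;
    if (n > 0) {
        top->children = malloc(n * sizeof(*top->children));
        if (!top->children) {
            free(top);
            return NULL;
        }
        memcpy(top->children, roots, n * sizeof(*roots));
        top->arity = n;
    }
    return top;
}
```

Calling toy_combine() on two "machine" roots with type "network" gives a three-object tree whose top has arity 2 and a NULL cpuset, while each machine keeps its own cpuset untouched.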
Now, coming to semantic changes:
- The top node of the tree wouldn't necessarily be a system object.
Actually, always having a top object of system type does not
provide any useful information :), and it makes us duplicate fields
between system and machine. On usual (non-Kerrighed) machines, the top
node would just be machine. On Kerrighed systems, the top node would
be system. On networked systems, the top node would be a switch or the
Internet :)
As a consequence, hwloc_get_system_obj would have to be renamed to
hwloc_get_root_obj.
- Objects of network trees may not have cpusets defined (trees obtained
directly from hwloc with default parameters would still have cpusets
on every node, however). It does not make sense to merge cpusets of
different machines (they will conflict), and things like shifting
cpusets to be able to merge them would probably only bring issues.
That being said, that does not prevent anyone from writing a transparency
plugin that automatically discovers the network topology, shifts
cpusets, and when requested for binding, automatically migrates to
the machine according to the shift, and uses the underlying OS hooks
to perform the binding. My point is that the hwloc combining operation
wouldn't fix up cpusets itself and would leave them NULL. The caller of
the combining operation would be responsible for that.
- This also means there can't be "global" cpusets like the recently
added hwloc_topology_get_{topology,complete,online,allowed}_cpuset
functions (not released yet). These can just be moved to the hwloc_obj
structure, thus being available for each object, which could actually be
helpful btw.
- Helpers that take cpuset parameters of course don't make sense any more
when applied to networked topologies. But it probably doesn't make
sense for the caller to call them in the first place, and the caller
knows it since it's the caller that has first called the combining
operation or loaded an XML file resulting from it.
If, however, at some point (after having distributed tasks between
machines for instance), operations with cpusets are desired, we could
provide a duplication function that takes a topology object parameter
A and builds a new topology tree containing all the objects under
A, A thus being its root, and then (if A indeed has a cpuset, but
the caller should know that) helpers taking cpuset parameters can be
called.
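As a rough illustration of the last two points (per-object cpusets and the duplication function), here is a sketch on a toy tree. None of it is hwloc API: struct toy_obj, toy_dup_subtree() and toy_root_covers() are hypothetical names, and a cpuset is just an unsigned long bitmask with a flag saying whether it is defined.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for an hwloc object; NOT the real hwloc_obj structure. */
struct toy_obj {
    int has_cpuset;             /* 0 for e.g. network objects */
    unsigned long cpuset;       /* toy bitmask, valid only if has_cpuset */
    struct toy_obj **children;
    size_t arity;
};

/* Hypothetical helper: free a heap-allocated toy tree. */
static void toy_free(struct toy_obj *obj)
{
    if (!obj)
        return;
    for (size_t i = 0; i < obj->arity; i++)
        toy_free(obj->children[i]);
    free(obj->children);
    free(obj);
}

/* Hypothetical duplication function: deep-copy everything under `a`,
 * so that the copy of `a` becomes the root of a new, separate tree. */
static struct toy_obj *toy_dup_subtree(const struct toy_obj *a)
{
    struct toy_obj *copy;

    assert(a != NULL);
    copy = calloc(1, sizeof(*copy));
    if (!copy)
        return NULL;
    copy->has_cpuset = a->has_cpuset;
    copy->cpuset = a->cpuset;
    if (a->arity > 0) {
        copy->children = calloc(a->arity, sizeof(*copy->children));
        if (!copy->children) {
            free(copy);
            return NULL;
        }
        for (size_t i = 0; i < a->arity; i++) {
            copy->children[i] = toy_dup_subtree(a->children[i]);
            if (!copy->children[i]) {
                toy_free(copy);     /* frees the children copied so far */
                return NULL;
            }
            copy->arity++;
        }
    }
    return copy;
}

/* A cpuset helper only makes sense when the root carries a cpuset.
 * Returns 1 if `sub` is included in the root cpuset, 0 if it is not,
 * and -1 if the root has no cpuset at all (e.g. a network object). */
static int toy_root_covers(const struct toy_obj *root, unsigned long sub)
{
    assert(root != NULL);
    if (!root->has_cpuset)
        return -1;
    return (sub & ~root->cpuset) == 0;
}
```

On a network-rooted tree toy_root_covers() can only report "no cpuset here"; after toy_dup_subtree() on one of the machines, the same helper works on the copy because its root does carry a cpuset.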
So, to sum it up:
- hwloc_get_obj_by_depth(topo, 0, 0) may not be a system object any
more (actually it'd only be one on Kerrighed systems).
- no global cpuset field, only in objects.
The second point shouldn't do any harm, it's just a matter of fixing the
(not yet released) API. The first point clearly contradicts the current
documentation (“HWLOC_OBJ_SYSTEM will always be the highest”),
but I believe not breaking it now will hold us back from further
extensions, and I don't think much code relies on it anyway.
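For what it's worth, code that doesn't care about the type of the top object is easy to write. The sketch below uses a toy tree again (enum toy_type, struct toy_obj and toy_count_type() are hypothetical names, not hwloc API) and counts the objects of a given type by walking down from whatever the root happens to be.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object types; NOT the real hwloc object type enumeration. */
enum toy_type { TOY_NETWORK, TOY_SYSTEM, TOY_MACHINE, TOY_CORE };

/* Toy stand-in for an hwloc object. */
struct toy_obj {
    enum toy_type type;
    struct toy_obj **children;
    size_t arity;
};

/* Count the objects of type `type` at or below `obj`, without
 * assuming anything about the type of the top object. */
static size_t toy_count_type(const struct toy_obj *obj, enum toy_type type)
{
    size_t n;

    assert(obj != NULL);
    n = (obj->type == type) ? 1 : 0;
    for (size_t i = 0; i < obj->arity; i++)
        n += toy_count_type(obj->children[i], type);
    return n;
}
```

The same call works whether the root is a single machine, a Kerrighed-style system, or a network object sitting above several machines.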
The plan I see is that for 1.0 we only check that concatenating XML files
by hand to build misc levels representing network layers does indeed
work, which should mean that actual combining functions etc. can be
implemented later.
Please comment/disagree/agree :)
Samuel