On 02/21/12 19:29, Jeffrey Squyres wrote:
What's the output of running lstopo from hwloc 1.3.2? (this is the version
that's in the OMPI trunk and v1.5 branches)
http://www.open-mpi.org/software/hwloc/v1.3/
Is there any difference from v1.4 hwloc?
http://www.open-mpi.org/software/hwloc/v1.4/
    Machine (8192MB)
      NUMANode L#0 (P#0 4096MB) + PU L#0 (P#0)
      NUMANode L#1 (P#1 4096MB) + PU L#1 (P#1)
No difference between 1.3 and 1.4. No information about sockets.
As Paul says, doesn't look like a compiler thing. (I get the same with
Intel and gcc.)
The hwloc README has a sample program (the "third example") that contains
    depth = hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET);
    if (depth == HWLOC_TYPE_DEPTH_UNKNOWN) {
        printf("*** The number of sockets is unknown\n");
    } else {
        ...
    }
that reports that the number of sockets is unknown. So, "sockets" is
unknown and hwloc returns 0 for num_sockets and OMPI pukes on divide by
zero. OS info was listed in the original message (below). Might we
want to do something else? E.g., assume num_sockets==1 when
num_sockets==0 (if you know what I mean)? So, which one (or more) of
the following should be fixed?
*) on this platform, hwloc finds no socket level
*) therefore hwloc returns num_sockets==0 to OMPI
*) OMPI divides by 0 and barfs on basically everything
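For the third item, one option is to guard the division itself. This is only a sketch, not the actual OMPI code: cores_per_socket and its parameters are hypothetical names, and treating an undetected socket level as a single socket is just the assumption floated above.

```c
#include <assert.h>

/* Hypothetical helper, not part of OMPI: computes cores per socket
 * while tolerating a socket count of 0 (socket level not detected). */
static int cores_per_socket(int num_processors, int num_sockets)
{
    assert(num_processors >= 0);
    /* Assumption: an undetected socket level is treated as one socket. */
    if (num_sockets <= 0) {
        num_sockets = 1;
    }
    return num_processors / num_sockets;
}
```

With the lstopo output above (2 PUs, no socket level), cores_per_socket(2, 0) would give 2 rather than dividing by zero.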
On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:
We have some amount of MTT testing going on every night and on ONE of our
systems v1.5 has been dead since r25914. The system is
    Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256)
compilers. I haven't poked around enough yet to figure out what the
problematic characteristic of this configuration is.
In r25914, orte/mca/odls/base/odls_base_open.c, we get
222     /* get the number of local sockets unless we were given a number */
223     if (0 == orte_default_num_sockets_per_board) {
224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
225     }
226     /* get the number of local processors */
227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
228     /* compute the base number of cores/socket, if not given */
229     if (0 == orte_default_num_cores_per_socket) {
230         orte_odls_globals.num_cores_per_socket =
                orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
231     }
Well, we execute the branch at line 224, but num_sockets remains 0. This leads
to the divide-by-0 at line 230. Digging deeper, the call at line 224 led us to
opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left out):
    static int module_get_socket_info(int *num_sockets) {
        hwloc_topology_t *t = &opal_hwloc_topology;
        *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
        return OPAL_SUCCESS;
    }
Anyhow, SOCKET is somehow an unknown layer, so num_sockets comes back as 0.
I can poke around more, but does someone want to advise?
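Another possibility is to have the module report the missing socket level rather than return success with a count of 0. A rough sketch of that idea follows. Everything named here is hypothetical: the SKETCH_* codes are not the real OPAL constants, and the detected argument stands in for whatever hwloc_get_nbobjs_by_type() returned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch; not the real OPAL constants. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_NOT_FOUND = -1 };

/* Hypothetical wrapper: 'detected' stands in for the value returned by
 * hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_SOCKET). */
static int socket_info_checked(int detected, int *num_sockets)
{
    assert(num_sockets != NULL);
    if (detected <= 0) {
        /* No usable socket level: leave the output untouched and say so. */
        return SKETCH_ERR_NOT_FOUND;
    }
    *num_sockets = detected;
    return SKETCH_SUCCESS;
}
```

The caller could then decide what to do on SKETCH_ERR_NOT_FOUND (pick a default, skip the cores/socket computation, or abort with a message) instead of dividing by a silent 0.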