On 02/21/12 19:29, Jeffrey Squyres wrote:
What's the output of running lstopo from hwloc 1.3.2?  (this is the version 
that's in the OMPI trunk and v1.5 branches)

     http://www.open-mpi.org/software/hwloc/v1.3/

Is there any difference from v1.4 hwloc?

     http://www.open-mpi.org/software/hwloc/v1.4/

Machine (8192MB)
  NUMANode L#0 (P#0 4096MB) + PU L#0 (P#0)
  NUMANode L#1 (P#1 4096MB) + PU L#1 (P#1)

No difference between 1.3 and 1.4.  No information about sockets.

As Paul says, doesn't look like a compiler thing. (I get the same with Intel and gcc.)

The hwloc README has a sample program (the "third example") that contains

 depth = hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET);
 if (depth == HWLOC_TYPE_DEPTH_UNKNOWN) {
     printf("*** The number of sockets is unknown\n");
 } else {
     ...
 }

that reports that the number of sockets is unknown. So "sockets" is an unknown level, hwloc returns 0 for num_sockets, and OMPI pukes on the divide by zero. OS info was listed in the original message (below).

Might we want to do something else? E.g., assume num_sockets==1 when num_sockets==0 (if you know what I mean)?

So, which one (or more) of the following should be fixed?

*) on this platform, hwloc finds no socket level
*) therefore hwloc returns num_sockets==0 to OMPI
*) OMPI divides by 0 and barfs on basically everything
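
To make the num_sockets==1 suggestion concrete, here is a minimal sketch of what a guard on the OMPI side might look like. The helper name cores_per_socket is hypothetical and is not existing OMPI code; it only assumes that a count of 0 means "no socket level detected" and treats the node as a single socket.

```c
#include <assert.h>

/* Hypothetical helper (not OMPI code): compute cores per socket while
 * tolerating a platform on which no socket level was detected. */
static int cores_per_socket(int num_processors, int num_sockets)
{
    assert(num_processors >= 0 && num_sockets >= 0);
    if (0 == num_sockets) {
        /* Assumption: no socket level reported, so treat the whole
         * machine as one socket instead of dividing by zero. */
        num_sockets = 1;
    }
    return num_processors / num_sockets;
}
```

With the lstopo output above (2 PUs, no sockets), this would give 2 cores per socket rather than a crash. Whether "one socket" is the right answer for a two-NUMA-node box is a separate question.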

On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:
We have some amount of MTT testing going on every night and on ONE of our 
systems v1.5 has been dead since r25914.  The system is

Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) 
compilers.  I haven't poked around enough yet to figure out what the 
problematic characteristic of this configuration is.

In r25914, orte/mca/odls/base/odls_base_open.c, we get

    222     /* get the number of local sockets unless we were given a number */
    223     if (0 == orte_default_num_sockets_per_board) {
    224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
    225     }
    226     /* get the number of local processors */
    227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
    228     /* compute the base number of cores/socket, if not given */
    229     if (0 == orte_default_num_cores_per_socket) {
    230         orte_odls_globals.num_cores_per_socket =
                    orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
    231     }

Well, we execute the branch at line 224, but num_sockets remains 0.  This leads 
to the divide-by-0 at line 230.  Digging deeper, the call at line 224 led us to 
opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left out):

static int module_get_socket_info(int *num_sockets) {
    hwloc_topology_t *t = &opal_hwloc_topology;
    *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
    return OPAL_SUCCESS;
}

Anyhow, SOCKET is somehow an unknown layer, so num_sockets is returning 0.
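
One possible alternative to returning success with a count of 0 would be for the module to report the missing level, so the caller can decide what to do. The sketch below is self-contained and only illustrative: socket_info_from_count and the SKETCH_* codes are hypothetical names, and raw_count stands in for whatever hwloc_get_nbobjs_by_type() returned (0 when no object of that type exists, -1 when the type appears at several levels).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch; not the OPAL constants. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_NOT_FOUND = -1 };

/* Hypothetical wrapper: raw_count is the value obtained from
 * hwloc_get_nbobjs_by_type() for the socket type. */
static int socket_info_from_count(int raw_count, int *num_sockets)
{
    assert(NULL != num_sockets);
    if (raw_count <= 0) {
        /* No usable socket level: say so instead of claiming success. */
        *num_sockets = 0;
        return SKETCH_ERR_NOT_FOUND;
    }
    *num_sockets = raw_count;
    return SKETCH_SUCCESS;
}
```

The caller would then have to check the return value before dividing, which the code at line 224 does not currently do.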

I can poke around more, but does someone want to advise?
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel