I have been testing v1.5 with slightly older Intel "composerxe-2011.5.220" compilers. I see a "make check" failure in opal_datatype_test which is not present with any other compiler (such as gcc on the same node). This has been seen most recently on the 1.5.5rc2r25990 tarball generated earlier today. With "make check -k" I can confirm that opal_datatype_test is the ONLY failure I see with this compiler. So, I have just assumed this was a buggy compiler and thought nothing more of it.

I also have access to the same "composer_xe_2011_sp1.7.256" compiler and to a more recent "composer_xe_2011_sp1.8.273", though I have not yet tested either. I will test both ASAP and report back with my findings.

-Paul


On 2/21/2012 4:20 PM, Eugene Loh wrote:
We have some amount of MTT testing going on every night and on ONE of our systems v1.5 has been dead since r25914. The system is

Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) compilers. I haven't poked around enough yet to figure out what the problematic characteristic of this configuration is.

In r25914, orte/mca/odls/base/odls_base_open.c, we get

222     /* get the number of local sockets unless we were given a number */
223     if (0 == orte_default_num_sockets_per_board) {
224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
225     }
226     /* get the number of local processors */
227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
228     /* compute the base number of cores/socket, if not given */
229     if (0 == orte_default_num_cores_per_socket) {
230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
231     }

Well, we execute the branch at line 224, but num_sockets remains 0. This leads to the divide-by-0 at line 230. Digging deeper, the call at line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left out):

static int module_get_socket_info(int *num_sockets) {
    hwloc_topology_t *t = &opal_hwloc_topology;
    *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
    return OPAL_SUCCESS;
}

Anyhow, SOCKET is somehow an unknown layer, so num_sockets is returning 0.
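
For illustration only, here is a minimal sketch of how the division at line 230 could be guarded. The helper name cores_per_socket is hypothetical and not part of Open MPI, and treating an undetected socket count as a single socket is an assumption, not necessarily the right fix for the underlying hwloc issue:

```c
/* Hypothetical helper, not part of Open MPI: compute cores per socket
 * without dividing by zero when the socket count was not detected. */
static int cores_per_socket(int num_processors, int num_sockets)
{
    /* Assumption: a topology that reports no sockets (0, or a negative
     * error value) is treated as having exactly one socket. */
    if (num_sockets <= 0) {
        num_sockets = 1;
    }
    return num_processors / num_sockets;
}
```

This only avoids the crash; it does not explain why the SOCKET level is missing from the topology in the first place.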

I can poke around more, but does someone want to advise?
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
