I have been testing v1.5 with slightly older Intel
"composerxe-2011.5.220" compilers.
I see a "make check" failure in opal_datatype_test which is not present
with any other compiler (such as gcc on the same node).
This has been seen most recently on the 1.5.5rc2r25990 tarball generated
earlier today.
With "make check -k" I can confirm that opal_datatype_test is the ONLY
failure I see with this compiler.
So, I have just assumed this was a buggy compiler and thought nothing
more of it.
I also have the same "composer_xe_2011_sp1.7.256" compiler that Eugene
reports below, plus a more recent "composer_xe_2011_sp1.8.273", though
I have not yet tested either. I will test both ASAP and report back
with my findings.
-Paul
On 2/21/2012 4:20 PM, Eugene Loh wrote:
We have some amount of MTT testing going on every night and on ONE of
our systems v1.5 has been dead since r25914. The system is
    Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
and I'm encountering the problem with Intel
(composer_xe_2011_sp1.7.256) compilers. I haven't poked around enough
yet to figure out what the problematic characteristic of this
configuration is.
In r25914, orte/mca/odls/base/odls_base_open.c, we get
222     /* get the number of local sockets unless we were given a number */
223     if (0 == orte_default_num_sockets_per_board) {
224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
225     }
226     /* get the number of local processors */
227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
228     /* compute the base number of cores/socket, if not given */
229     if (0 == orte_default_num_cores_per_socket) {
230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
231     }
Well, we execute the branch at line 224, but num_sockets remains 0.
This leads to the divide-by-0 at line 230. Digging deeper, the call
at line 224 led us to
opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left
out):
    static int module_get_socket_info(int *num_sockets) {
        hwloc_topology_t *t = &opal_hwloc_topology;
        *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
        return OPAL_SUCCESS;
    }
Anyhow, SOCKET is somehow an unknown layer in the topology on this
system, so the call finds no socket objects and num_sockets comes back
as 0.
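
For illustration only, a guard of roughly the following shape would avoid
the divide-by-0 whenever the socket count comes back as 0. This is a
sketch under assumptions, not the actual Open MPI code or a proposed
patch: the helper names normalize_socket_count and cores_per_socket are
hypothetical, and treating an undetected socket layer as a single socket
is an assumed fallback policy.

```c
#include <assert.h>

/* Hypothetical helper, not part of Open MPI: normalize a socket count
 * obtained from a topology query. A count <= 0 (no socket layer found,
 * or an otherwise unusable result) is treated as one socket so that a
 * later division cannot fault. */
static int normalize_socket_count(int raw_count)
{
    return (raw_count > 0) ? raw_count : 1;
}

/* Hypothetical helper: compute cores per socket with a guarded divisor. */
static int cores_per_socket(int num_processors, int raw_num_sockets)
{
    assert(num_processors >= 0);
    return num_processors / normalize_socket_count(raw_num_sockets);
}
```

With this, cores_per_socket(4, 0) divides by 1 instead of by 0, while
cores_per_socket(4, 2) still divides by the detected count. Whether one
socket is the right default when the SOCKET layer is missing is a
separate question from why the layer is missing in the first place.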
I can poke around more, but does someone want to advise?
--
Paul H. Hargrove [email protected]
Future Technologies Group
HPC Research Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900