My build with the "2011_sp1.8.273" Intel compilers passes the same tests
as I detailed below for "2011_sp1.7.256".
I don't suspect any longer that the compiler is at fault, but am willing
to try additional/alternate tests to help confirm.
-Paul
On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:
Here are the first of the results of the testing I promised.
I am not 100% sure how to reach the code that Eugene reported as
problematic, so I tried just running the ring test with various
-bind-to-* options. I am quite willing to run additional test
cases. All runs are w/ OMPI_MCA_btl=sm,self.
+ 2011.5.220
FAIL: "make check" fails opal_datatype_test
OK: mpirun -np 2 ./ring_c
OK: mpirun -np 2 -bind-to-none ./ring_c
OK: mpirun -np 2 -bind-to-core ./ring_c
OK: mpirun -np 2 -bind-to-socket ./ring_c
+ 2011_sp1.7.256
OK: "make check"
OK: mpirun -np 2 -bind-to-none ./ring_c
OK: mpirun -np 2 -bind-to-core ./ring_c
OK: mpirun -np 2 -bind-to-socket ./ring_c
So, I don't think the "2011_sp1.7.256" compilers are broken (and are
"better" than the ones I've been using).
I have a build with "2011_sp1.8.273" churning away right now (est.
45 minutes to complete; I should have disabled the Fortran bindings).
If there is something other than the -bind-to-* flags I should be
using to reach the problematic code, let me know.
But based on what I've seen so far, I think we can probably rule out
the compiler as the problem.
-Paul
On 2/21/2012 4:37 PM, Paul H. Hargrove wrote:
I have been testing v1.5 with slightly older Intel
"composerxe-2011.5.220" compilers.
I see a "make check" failure in opal_datatype_test which is not
present with any other compiler (such as gcc on the same node).
This has been seen most recently on the 1.5.5rc2r25990 tarball
generated earlier today.
With "make check -k" I can confirm that opal_datatype_test is the
ONLY failure I see with this compiler.
So, I have just assumed this was a buggy compiler and thought nothing
more of it.
I have not yet tested them, but I also have the same
"composer_xe_2011_sp1.7.256" compiler that Eugene is using, as well as
a more recent "composer_xe_2011_sp1.8.273". I will test both ASAP and
report back with my findings.
-Paul
On 2/21/2012 4:20 PM, Eugene Loh wrote:
We have some amount of MTT testing going on every night and on ONE
of our systems v1.5 has been dead since r25914. The system is
Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST
2007 x86_64 x86_64 x86_64 GNU/Linux
and I'm encountering the problem with Intel
(composer_xe_2011_sp1.7.256) compilers. I haven't poked around
enough yet to figure out what the problematic characteristic of this
configuration is.
In r25914, orte/mca/odls/base/odls_base_open.c, we get
    222     /* get the number of local sockets unless we were given a number */
    223     if (0 == orte_default_num_sockets_per_board) {
    224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
    225     }
    226     /* get the number of local processors */
    227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
    228     /* compute the base number of cores/socket, if not given */
    229     if (0 == orte_default_num_cores_per_socket) {
    230         orte_odls_globals.num_cores_per_socket =
                    orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
    231     }
Well, we execute the branch at line 224, but num_sockets remains 0.
This leads to the divide-by-0 at line 230. Digging deeper, the call
at line 224 led us to
opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff
left out):
    static int module_get_socket_info(int *num_sockets) {
        hwloc_topology_t *t = &opal_hwloc_topology;
        *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
        return OPAL_SUCCESS;
    }
Anyhow, SOCKET is somehow an unknown layer in the topology, so
num_sockets comes back as 0.
I can poke around more, but does someone want to advise?
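For illustration only, here is a minimal sketch of how the division at
line 230 could be guarded when the topology reports no sockets. The
helper name cores_per_socket is hypothetical and is not part of Open MPI,
and treating the node as a single socket is an assumed fallback policy,
not necessarily the right fix for the tree.

```c
/*
 * Hypothetical helper (not an Open MPI function): compute the number of
 * cores per socket without dividing by zero.
 *
 * Assumption: if the topology reports no sockets (num_sockets <= 0),
 * we treat the node as having a single socket.
 */
int cores_per_socket(int num_processors, int num_sockets)
{
    if (num_sockets <= 0) {
        /* SOCKET layer unknown or empty: fall back to one socket */
        num_sockets = 1;
    }
    return num_processors / num_sockets;
}
```

With this sketch, cores_per_socket(4, 2) gives 2, while
cores_per_socket(4, 0) gives 4 instead of faulting. Whether a fallback
like this is preferable to reporting an error is a separate question.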
--
Paul H. Hargrove phhargr...@lbl.gov
Future Technologies Group
HPC Research Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900