My build with the "2011_sp1.8.273" Intel compilers passes the same tests as I detailed below for "2011_sp1.7.256". I don't suspect any longer that the compiler is at fault, but am willing to try additional/alternate tests to help confirm.

-Paul

On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:
Here are the first of the results of the testing I promised.
I am not 100% sure how to reach the code that Eugene reported as problematic, so I tried just running the ring test with various -bind-to-* options. I am quite willing to run additional test cases. All runs are w/ OMPI_MCA_btl=sm,self.

+ 2011.5.220
  FAIL: "make check" fails opal_datatype_test
  OK: mpirun -np 2 ./ring_c
  OK: mpirun -np 2 -bind-to-none ./ring_c
  OK: mpirun -np 2 -bind-to-core ./ring_c
  OK: mpirun -np 2 -bind-to-socket ./ring_c

+ 2011_sp1.7.256
  OK: "make check"
  OK: mpirun -np 2 -bind-to-none ./ring_c
  OK: mpirun -np 2 -bind-to-core ./ring_c
  OK: mpirun -np 2 -bind-to-socket ./ring_c

So, I don't think the "2011_sp1.7.256" compilers are broken (and they are "better" than the ones I've been using). I have a build with "2011_sp1.8.273" churning away right now (est. 45 minutes to complete; I should have disabled the Fortran bindings).

If there is something other than the -bind-to-* flags I should be using to reach the problematic code, let me know. But based on what I've seen so far, I think we can probably rule out the compiler as the problem.

-Paul


On 2/21/2012 4:37 PM, Paul H. Hargrove wrote:
I have been testing v1.5 with slightly older Intel "composerxe-2011.5.220" compilers. I see a "make check" failure in opal_datatype_test which is not present with any other compiler (such as gcc on the same node). This has been seen most recently on the 1.5.5rc2r25990 tarball generated earlier today. With "make check -k" I can confirm that opal_datatype_test is the ONLY failure I see with this compiler. So, I have just assumed this was a buggy compiler and thought nothing more of it.

I have not yet tested them, but I also have the same "composer_xe_2011_sp1.7.256" compiler that Eugene reports, as well as a more recent "composer_xe_2011_sp1.8.273". I will test both ASAP and report back with my findings.

-Paul


On 2/21/2012 4:20 PM, Eugene Loh wrote:
We have some amount of MTT testing going on every night and on ONE of our systems v1.5 has been dead since r25914. The system is

Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) compilers. I haven't poked around enough yet to figure out what the problematic characteristic of this configuration is.

In r25914, orte/mca/odls/base/odls_base_open.c, we get

    222     /* get the number of local sockets unless we were given a number */
    223     if (0 == orte_default_num_sockets_per_board) {
    224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
    225     }
    226     /* get the number of local processors */
    227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
    228     /* compute the base number of cores/socket, if not given */
    229     if (0 == orte_default_num_cores_per_socket) {
    230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
    231     }

Well, we execute the branch at line 224, but num_sockets remains 0. This leads to the divide-by-0 at line 230. Digging deeper, the call at line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left out):

static int module_get_socket_info(int *num_sockets) {
    hwloc_topology_t *t = &opal_hwloc_topology;
    *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
    return OPAL_SUCCESS;
}

Anyhow, SOCKET is somehow an unknown layer in the topology, so num_sockets comes back as 0.
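
To make the failure mode concrete, here is a minimal sketch of guarding the division at line 230 against a zero socket count. The helper name safe_cores_per_socket is hypothetical and is not part of Open MPI, and treating an undetected socket layer as a single socket is only an assumption for illustration; the right fix in the tree may well be different.

```c
#include <assert.h>

/*
 * Hypothetical helper (not Open MPI code): compute cores per socket
 * without dividing by zero when topology detection reports 0 sockets.
 */
int safe_cores_per_socket(int num_processors, int num_sockets)
{
    /* A negative processor count would indicate a caller bug. */
    assert(num_processors >= 0);

    if (num_sockets <= 0) {
        /* Assumption: no socket layer was found, so treat the
         * whole node as one socket instead of dividing by 0. */
        num_sockets = 1;
    }
    return num_processors / num_sockets;
}
```

With this guard, a node that reports 4 processors and 0 sockets yields 4 cores per socket instead of a divide-by-0, while a node that reports 2 sockets is computed as before.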

I can poke around more, but does someone want to advise?
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
