Nathan,

The onesided/test_acc2 test is failing in our Cisco MTT runs on the trunk and 
v1.7.5 branches:

----8<----
================ test_acc2 ========== Mon Mar 10 15:31:47 2014

Time per int accumulate 0.769040 microsecs
P0, Test No. 0, PASSED: accumulate performance Mon Mar 10 15:31:47 2014

================ test_acc2 ========== Mon Mar 10 15:31:47 2014

P7, Test No. 0, PASSED: multi-offset accumulate Mon Mar 10 15:31:47 2014

P0, Test No. 1 CHECK: accumulate self without permission, nfail=1, Mon Mar 10 
15:31:47 2014

P0, Test No. 2 CHECK: accumulate self, nfail=10000, Mon Mar 10 15:31:49 2014

P1, Test No. 0 CHECK: accumulate non-self, nfail=10000, Mon Mar 10 15:31:51 2014

-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus 
causing
the job to be terminated. The first process to do so was:

  Process name: [[30384,1],1]
  Exit code:    16
--------------------------------------------------------------------------
----8<----

I've bisected from v1.7.4 to r30977, and the test begins failing around 
r30894 (the big one-sided CMR from trunk): 
https://svn.open-mpi.org/trac/ompi/changeset/30894

The build on the v1.7 branch was in the (inclusive) range r30893:r30896, so 
there's a slim chance that this is one of r30893, r30895, or r30896 instead, 
but given the test failure that seems unlikely.

Valgrind runs are not clean on this test, but the reports weren't enough to 
point me at a specific suspicious part of the code (there are some 
finalization bugs that should probably be fixed, but I don't think they are 
causing this issue).  Sample VG messages and my (possibly flawed) 
interpretations are at the end of this mail.

Can you take a look?  This is a regression from v1.7.4 as we're headed into 
v1.7.5.

Thanks,
-Dave


No idea here, but probably unrelated to the one-sided test failures:
==22608== Conditional jump or move depends on uninitialised value(s)
==22608==    at 0xA72E6B9: ml_init_k_nomial_trees (coll_ml_module.c:651)
==22608==    by 0xA7336C8: mca_coll_ml_tree_hierarchy_discovery 
(coll_ml_module.c:2162)
==22608==    by 0xA733D46: mca_coll_ml_fulltree_ptp_only_hierarchy_discovery 
(coll_ml_module.c:2333)
==22608==    by 0xA7314C9: ml_discover_hierarchy (coll_ml_module.c:1565)
==22608==    by 0xA735DCD: mca_coll_ml_comm_query (coll_ml_module.c:2992)
==22608==    by 0x4CD23AE: query_2_0_0 (coll_base_comm_select.c:395)
==22608==    by 0x4CD2372: query (coll_base_comm_select.c:378)
==22608==    by 0x4CD2285: check_one_component (coll_base_comm_select.c:340)
==22608==    by 0x4CD20D7: check_components (coll_base_comm_select.c:304)
==22608==    by 0x4CCAE11: mca_coll_base_comm_select 
(coll_base_comm_select.c:131)
==22608==    by 0x4C60B49: ompi_mpi_init (ompi_mpi_init.c:888)
==22608==    by 0x4C93AE2: PMPI_Init (pinit.c:84)
==22608==  Uninitialised value was created by a heap allocation
==22608==    at 0x4A07844: malloc (vg_replace_malloc.c:291)
==22608==    by 0x4A079B8: realloc (vg_replace_malloc.c:687)
==22608==    by 0xA72F91B: get_new_subgroup_data (coll_ml_module.c:1044)
==22608==    by 0xA73292E: mca_coll_ml_tree_hierarchy_discovery 
(coll_ml_module.c:1939)
==22608==    by 0xA733D46: mca_coll_ml_fulltree_ptp_only_hierarchy_discovery 
(coll_ml_module.c:2333)
==22608==    by 0xA7314C9: ml_discover_hierarchy (coll_ml_module.c:1565)
==22608==    by 0xA735DCD: mca_coll_ml_comm_query (coll_ml_module.c:2992)
==22608==    by 0x4CD23AE: query_2_0_0 (coll_base_comm_select.c:395)
==22608==    by 0x4CD2372: query (coll_base_comm_select.c:378)
==22608==    by 0x4CD2285: check_one_component (coll_base_comm_select.c:340)
==22608==    by 0x4CD20D7: check_components (coll_base_comm_select.c:304)
==22608==    by 0x4CCAE11: mca_coll_base_comm_select 
(coll_base_comm_select.c:131)

Possibly related:
==22608== Syscall param writev(vector[...]) points to uninitialised byte(s)
==22608==    at 0x354BAE0C57: writev (in /lib64/libc-2.12.so)
==22608==    by 0x95F7CD2: mca_btl_tcp_frag_send (btl_tcp_frag.c:107)
==22608==    by 0x95F61CE: mca_btl_tcp_endpoint_send (btl_tcp_endpoint.c:261)
==22608==    by 0x95F1C08: mca_btl_tcp_send (btl_tcp.c:387)
==22608==    by 0x9CD4098: mca_bml_base_send (bml.h:276)
==22608==    by 0x9CD636F: mca_pml_ob1_send_request_start_prepare 
(pml_ob1_sendreq.c:650)
==22608==    by 0x9CC9B39: mca_pml_ob1_send_request_start_btl 
(pml_ob1_sendreq.h:388)
==22608==    by 0x9CC9DF1: mca_pml_ob1_send_request_start 
(pml_ob1_sendreq.h:461)
==22608==    by 0x9CCA4FA: mca_pml_ob1_isend (pml_ob1_isend.c:85)
==22608==    by 0x4C699BD: comm_allreduce_pml (allreduce.c:178)
==22608==    by 0xA73172C: ml_discover_hierarchy (coll_ml_module.c:1616)
==22608==    by 0xA735DCD: mca_coll_ml_comm_query (coll_ml_module.c:2992)
==22608==  Address 0xffeffac10 is on thread 1's stack
==22608==  Uninitialised value was created by a stack allocation
==22608==    at 0xA73134D: ml_discover_hierarchy (coll_ml_module.c:1529)

** Very likely related (10 bytes is exactly 
sizeof(ompi_osc_rdma_frag_header_t), so the uninitialized bytes are payload):
==22608== Syscall param writev(vector[...]) points to uninitialised byte(s)
==22608==    at 0x354BAE0C57: writev (in /lib64/libc-2.12.so)
==22608==    by 0x95F7CD2: mca_btl_tcp_frag_send (btl_tcp_frag.c:107)
==22608==    by 0x95F798C: mca_btl_tcp_endpoint_send_handler 
(btl_tcp_endpoint.c:792)
==22608==    by 0x52CB5FB: event_persist_closure (event.c:1318)
==22608==    by 0x52CB70A: event_process_active_single_queue (event.c:1362)
==22608==    by 0x52CB9D9: event_process_active (event.c:1437)
==22608==    by 0x52CC054: opal_libevent2021_event_base_loop (event.c:1645)
==22608==    by 0x526FF4B: opal_progress (opal_progress.c:168)
==22608==    by 0x9CC93AE: opal_condition_wait (condition.h:78)
==22608==    by 0x9CC9618: ompi_request_wait_completion (request.h:381)
==22608==    by 0x9CCABB5: mca_pml_ob1_send (pml_ob1_isend.c:123)
==22608==    by 0xADB8E6D: ompi_coll_tuned_scatter_intra_basic_linear 
(coll_tuned_scatter.c:256)
==22608==  Address 0x82e6fca is 10 bytes inside a block of size 8,208 alloc'd
==22608==    at 0x4A07844: malloc (vg_replace_malloc.c:291)
==22608==    by 0xBA1E40C: ompi_osc_rdma_frag_constructor (osc_rdma_frag.c:24)
==22608==    by 0x5265FC9: opal_obj_run_constructors (opal_object.h:424)
==22608==    by 0x5266988: opal_free_list_grow (opal_free_list.c:113)
==22608==    by 0x526679B: opal_free_list_init (opal_free_list.c:78)
==22608==    by 0xBA183DE: component_init (osc_rdma_component.c:255)
==22608==    by 0x4CEC336: ompi_osc_base_find_available (osc_base_frame.c:48)
==22608==    by 0x4C604B1: ompi_mpi_init (ompi_mpi_init.c:659)
==22608==    by 0x4C93AE2: PMPI_Init (pinit.c:84)
==22608==    by 0x401A7A: main (test_acc2.c:226)
==22608==  Uninitialised value was created by a heap allocation
==22608==    at 0x4A07844: malloc (vg_replace_malloc.c:291)
==22608==    by 0xBA1E40C: ompi_osc_rdma_frag_constructor (osc_rdma_frag.c:24)
==22608==    by 0x5265FC9: opal_obj_run_constructors (opal_object.h:424)
==22608==    by 0x5266988: opal_free_list_grow (opal_free_list.c:113)
==22608==    by 0x526679B: opal_free_list_init (opal_free_list.c:78)
==22608==    by 0xBA183DE: component_init (osc_rdma_component.c:255)
==22608==    by 0x4CEC336: ompi_osc_base_find_available (osc_base_frame.c:48)
==22608==    by 0x4C604B1: ompi_mpi_init (ompi_mpi_init.c:659)
==22608==    by 0x4C93AE2: PMPI_Init (pinit.c:84)
==22608==    by 0x401A7A: main (test_acc2.c:226)


These both appear to be unrelated finalization bugs (the tcp BTL is accessing 
some OMPI procs after they've been released by ompi_proc_finalize()):
==22608== Invalid read of size 4
==22608==    at 0x4FA845F: orte_util_hash_name (name_fns.c:558)
==22608==    by 0x95F8A02: mca_btl_tcp_proc_destruct (btl_tcp_proc.c:79)
==22608==    by 0x95F885B: opal_obj_run_destructors (opal_object.h:446)
==22608==    by 0x95FA8B7: mca_btl_tcp_proc_remove (btl_tcp_proc.c:695)
==22608==    by 0x95F5EA1: mca_btl_tcp_endpoint_destruct (btl_tcp_endpoint.c:99)
==22608==    by 0x95F0818: opal_obj_run_destructors (opal_object.h:446)
==22608==    by 0x95F20C2: mca_btl_tcp_finalize (btl_tcp.c:481)
==22608==    by 0x4CC99D6: mca_btl_base_close (btl_base_frame.c:155)
==22608==    by 0x52B0C34: mca_base_framework_close (mca_base_framework.c:187)
==22608==    by 0x4CC93BF: mca_bml_base_close (bml_base_frame.c:117)
==22608==    by 0x52B0C34: mca_base_framework_close (mca_base_framework.c:187)
==22608==    by 0x4C61D8D: ompi_mpi_finalize (ompi_mpi_finalize.c:405)
==22608==  Address 0x7ff3cf8 is 72 bytes inside a block of size 112 free'd
==22608==    at 0x4A0674A: free (vg_replace_malloc.c:468)
==22608==    by 0x4C5B1D8: ompi_proc_finalize (proc.c:310)
==22608==    by 0x4C61C23: ompi_mpi_finalize (ompi_mpi_finalize.c:340)
==22608==    by 0x4C89244: PMPI_Finalize (pfinalize.c:46)
==22608==    by 0x401AB0: main (test_acc2.c:232)
==22608==
==22608== Invalid read of size 4
==22608==    at 0x4FA8470: orte_util_hash_name (name_fns.c:560)
==22608==    by 0x95F8A02: mca_btl_tcp_proc_destruct (btl_tcp_proc.c:79)
==22608==    by 0x95F885B: opal_obj_run_destructors (opal_object.h:446)
==22608==    by 0x95FA8B7: mca_btl_tcp_proc_remove (btl_tcp_proc.c:695)
==22608==    by 0x95F5EA1: mca_btl_tcp_endpoint_destruct (btl_tcp_endpoint.c:99)
==22608==    by 0x95F0818: opal_obj_run_destructors (opal_object.h:446)
==22608==    by 0x95F20C2: mca_btl_tcp_finalize (btl_tcp.c:481)
==22608==    by 0x4CC99D6: mca_btl_base_close (btl_base_frame.c:155)
==22608==    by 0x52B0C34: mca_base_framework_close (mca_base_framework.c:187)
==22608==    by 0x4CC93BF: mca_bml_base_close (bml_base_frame.c:117)
==22608==    by 0x52B0C34: mca_base_framework_close (mca_base_framework.c:187)
==22608==    by 0x4C61D8D: ompi_mpi_finalize (ompi_mpi_finalize.c:405)
==22608==  Address 0x7ff3cfc is 76 bytes inside a block of size 112 free'd
==22608==    at 0x4A0674A: free (vg_replace_malloc.c:468)
==22608==    by 0x4C5B1D8: ompi_proc_finalize (proc.c:310)
==22608==    by 0x4C61C23: ompi_mpi_finalize (ompi_mpi_finalize.c:340)
==22608==    by 0x4C89244: PMPI_Finalize (pfinalize.c:46)
==22608==    by 0x401AB0: main (test_acc2.c:232)
