I will have a look

Btw, I was running the Fortran version, not the C one.
Did you configure with --enable-debug?
The program sends to a rank *not* in the communicator, so this behavior could 
make some sense on an optimized build.
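
For reference, a rough C sketch of the kind of erroneous send the test
performs (not the actual intel_tests source, just an illustration of why
an optimized build might crash instead of cleanly flagging the invalid
peer):

#include <mpi.h>

int main(int argc, char **argv)
{
    int size, payload = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* MPI_ERRORS_ARE_FATAL is already the default on MPI_COMM_WORLD;
     * set it explicitly for clarity. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_ARE_FATAL);

    /* Destination rank == size is outside the communicator, so this
     * send is erroneous: a debug build reports the invalid peer index,
     * while an optimized build may dereference the bad index and
     * segfault before the error handler runs. */
    MPI_Send(&payload, 1, MPI_INT, size, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}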

Cheers,

Gilles

Ralph Castain <r...@open-mpi.org> wrote:
>Ick - I’m getting a segfault when trying to run that test:
>
>
>MPITEST info  (0): Starting MPI_Errhandler_fatal test
>
>MPITEST info  (0): This test will abort after printing the results message
>
>MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
>
>[bend001:07714] *** Process received signal ***
>
>[bend001:07714] Signal: Segmentation fault (11)
>
>[bend001:07714] Signal code: Address not mapped (1)
>
>[bend001:07714] Failing at address: 0x50
>
>[bend001:07715] *** Process received signal ***
>
>[bend001:07715] Signal: Segmentation fault (11)
>
>[bend001:07715] Signal code: Address not mapped (1)
>
>[bend001:07715] Failing at address: 0x50
>
>[bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
>
>[bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
>
>[bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
>
>[bend001:07713] *** Process received signal ***
>
>[bend001:07713] Signal: Segmentation fault (11)
>
>[bend001:07713] Signal code: Address not mapped (1)
>
>[bend001:07713] Failing at address: 0x50
>
>[bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
>
>[bend001:07713] [ 1] 
>/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
>
>[bend001:07713] [ 2] [bend001:07714] [ 0] 
>/usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
>
>[bend001:07714] [ 1] 
>/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
>
>[bend001:07714] [ 2] [bend001:07715] [ 0] 
>/usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
>
>[bend001:07715] [ 1] 
>/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8eeeeca6]
>
>[bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED (3)
>
>
>
>This is with the head of the 1.8 branch. Any suggestions?
>
>Ralph
>
>
>
>On Nov 26, 2014, at 8:46 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>
>Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks like 
>the fix indeed failed to come over to 1.8. I’ll take a gander (pretty sure I 
>remember how I fixed it) - thanks!
>
>On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
><gilles.gouaillar...@iferc.org> wrote:
>
>Ralph,
>
>I noted several hangs in MTT with the v1.8 branch.
>
>A simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
>from the intel_tests suite: invoke mpirun on one node and run the tasks
>on another node:
>
>node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
>
>(Since this is a race condition, you might need to run this in a loop
>in order to hit the bug.)
>
>The attached tarball contains a patch (add debug + temporary hack) and
>some log files obtained with
>--mca errmgr_base_verbose 100 --mca odls_base_verbose 100
>
>Without the hack, I can reproduce the bug with -np 3 (log.ko.txt); with
>the hack, I can still reproduce the hang (though it might
>be a different one) with -np 16 (log.ko.2.txt).
>
>I remember some similar hangs were fixed on the trunk/master a few
>months ago.
>I tried to backport some commits, but it did not help :-(
>
>Could you please have a look at this?
>
>Cheers,
>
>Gilles
><abort_hang.tar.gz>
