Looks like this should be fixed in my PR #101 - could you please review it?
Thanks
Ralph

> On Nov 26, 2014, at 8:14 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> Aha - I see what happened. I have that param set to false in my default mca
> param file. If I set it to true on the cmd line, then I run without
> segfaulting.
> 
> Thanks!
> Ralph
> 
>> On Nov 26, 2014, at 5:55 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>> 
>> Ralph,
>> 
>> let me correct and enhance my previous statement:
>> 
>> - I cannot reproduce your crash in my environment (RHEL6-like vs your RHEL7-like)
>>   (I configured with --enable-debug --enable-picky)
>> 
>> - I can reproduce the crash with
>>   mpirun --mca mpi_param_check false
>> 
>> - if you configured with --without-mpi-param-check, I assume you would get
>>   the same crash (and if I understand correctly, there would then be no way
>>   to set --mca mpi_param_check true)
>> 
>> here is the relevant part of my config.status:
>> $ grep MPI_PARAM_CHECK config.status
>> D["MPI_PARAM_CHECK"]=" ompi_mpi_param_check"
>> D["OMPI_PARAM_CHECK"]=" 1"
>> 
>> I will try on a CentOS 7 box now.
>> In the meantime, can you check your config.status and try again with
>> mpirun --mca mpi_param_check true
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/11/27 10:06, Gilles Gouaillardet wrote:
>>> I will double check this (AFK right now).
>>> Are you running on a RHEL6-like distro with gcc?
>>> 
>>> IIRC, crash vs. MPI error is governed by --with-mpi-param-check or
>>> something like that...
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> Ralph Castain <r...@open-mpi.org> wrote:
>>>> I tried it with both the Fortran and C versions - got the same result.
>>>> 
>>>> This was indeed with a debug build. I wouldn't expect a segfault even with
>>>> an optimized build, though - I would expect an MPI error, yes?
>>>> 
>>>> On Nov 26, 2014, at 4:26 PM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@gmail.com> wrote:
>>>> 
>>>> I will have a look.
>>>> 
>>>> Btw, I was running the Fortran version, not the C one.
>>>> Did you configure with --enable-debug?
>>>> The program sends to a rank *not* in the communicator, so this behavior
>>>> could make some sense on an optimized build.
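For concreteness, the runtime-check behaviour discussed above can be exercised from the command line; the sketch below is hypothetical (the C test binary name MPI_Errhandler_fatal_c and its location are assumptions, not taken from the thread):

    # check how parameter checking was configured; in the config.status output
    # quoted above, OMPI_PARAM_CHECK is 1, i.e. runtime checking is compiled in
    $ grep MPI_PARAM_CHECK config.status

    # runtime checking on: the send to an out-of-range rank is flagged as an MPI
    # error and the job aborts through the default MPI_ERRORS_ARE_FATAL handler
    $ mpirun -np 3 --mca mpi_param_check true ./MPI_Errhandler_fatal_c

    # runtime checking off: the invalid peer index reaches mca_pml_ob1_send
    # and the processes segfault, as in the trace quoted below
    $ mpirun -np 3 --mca mpi_param_check false ./MPI_Errhandler_fatal_c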
>>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
>>>> 
>>>> Ralph Castain <r...@open-mpi.org> wrote:
>>>> Ick - I'm getting a segfault when trying to run that test:
>>>> 
>>>> MPITEST info (0): Starting MPI_Errhandler_fatal test
>>>> MPITEST info (0): This test will abort after printing the results message
>>>> MPITEST info (0): If it does not, then a f.a.i.l.u.r.e will be noted
>>>> [bend001:07714] *** Process received signal ***
>>>> [bend001:07714] Signal: Segmentation fault (11)
>>>> [bend001:07714] Signal code: Address not mapped (1)
>>>> [bend001:07714] Failing at address: 0x50
>>>> [bend001:07715] *** Process received signal ***
>>>> [bend001:07715] Signal: Segmentation fault (11)
>>>> [bend001:07715] Signal code: Address not mapped (1)
>>>> [bend001:07715] Failing at address: 0x50
>>>> [bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
>>>> [bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
>>>> [bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
>>>> [bend001:07713] *** Process received signal ***
>>>> [bend001:07713] Signal: Segmentation fault (11)
>>>> [bend001:07713] Signal code: Address not mapped (1)
>>>> [bend001:07713] Failing at address: 0x50
>>>> [bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
>>>> [bend001:07713] [ 1] /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
>>>> [bend001:07713] [ 2] [bend001:07714] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
>>>> [bend001:07714] [ 1] /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
>>>> [bend001:07714] [ 2] [bend001:07715] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
>>>> [bend001:07715] [ 1] /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8eeeeca6]
>>>> [bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED (3)
>>>> 
>>>> This is with the head of the 1.8 branch. Any suggestions?
>>>> Ralph
>>>> 
>>>> On Nov 26, 2014, at 8:46 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> 
>>>> Hmmm... yeah, I know we saw this and resolved it in the trunk, but it looks
>>>> like the fix indeed failed to come over to 1.8. I'll take a gander (pretty
>>>> sure I remember how I fixed it) - thanks!
>>>> 
>>>> On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>> 
>>>> Ralph,
>>>> 
>>>> I noted several hangs in MTT with the v1.8 branch.
>>>> 
>>>> A simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
>>>> from the intel_tests suite: invoke mpirun on one node and run the tasks
>>>> on another node:
>>>> 
>>>> node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
>>>> 
>>>> /* since this is a race condition, you might need to run this in a loop
>>>>    in order to hit the bug */
>>>> 
>>>> The attached tarball contains a patch (added debug plus a temporary hack)
>>>> and some log files obtained with
>>>> --mca errmgr_base_verbose 100 --mca odls_base_verbose 100
>>>> 
>>>> Without the hack, I can reproduce the bug with -np 3 (log.ko.txt); with
>>>> the hack, I can still reproduce the hang (though it might be a different
>>>> one) with -np 16 (log.ko.2.txt).
>>>> 
>>>> I remember some similar hangs were fixed on the trunk/master a few months
>>>> ago. I tried to backport some commits but it did not help :-(
>>>> 
>>>> Could you please have a look at this?
>>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
>>>> <abort_hang.tar.gz>
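Because the hang is a race, it can help to drive the reproducer from a loop and flag the iteration that stops making progress. The sketch below is a hypothetical wrapper (the node names, test path, and the use of coreutils timeout are assumptions, not part of the original report):

    # repeat the reproducer; each run is expected to abort quickly on its own.
    # timeout exits with status 124 if mpirun is still alive after 60 seconds,
    # which is a reasonable sign that this iteration hit the hang.
    node0$ for i in $(seq 1 100); do
               timeout 60 mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
               if [ $? -eq 124 ]; then
                   echo "possible hang on iteration $i"
                   break
               fi
           done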