Looks like this should be fixed in my PR #101 - could you please review it?

Thanks
Ralph


> On Nov 26, 2014, at 8:14 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> Aha - I see what happened. I have that param set to false in my default MCA 
> param file. If I set it to true on the cmd line, then I run without 
> segfaulting.
> 
> Thanks!
> Ralph
> 
> 
>> On Nov 26, 2014, at 5:55 PM, Gilles Gouaillardet 
>> <gilles.gouaillar...@iferc.org> wrote:
>> 
>> Ralph,
>> 
>> Let me correct and enhance my previous statement:
>> 
>> - I cannot reproduce your crash in my environment (RHEL6-like vs. your 
>> RHEL7-like)
>> (I configured with --enable-debug --enable-picky)
>> 
>> - I can reproduce the crash with
>> mpirun --mca mpi_param_check false
>> 
>> - If you configured with --without-mpi-param-check, I assume you would get 
>> the same crash
>> (and if I understand correctly, there would then be no way to override it with --mca 
>> mpi_param_check true)
>> 
>> Here is the relevant part of my config.status:
>> $ grep MPI_PARAM_CHECK config.status 
>> D["MPI_PARAM_CHECK"]=" ompi_mpi_param_check"
>> D["OMPI_PARAM_CHECK"]=" 1"
>> 
>> I will try on a CentOS 7 box now.
>> In the meantime, can you check your config.status and try again with 
>> mpirun --mca mpi_param_check true
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/11/27 10:06, Gilles Gouaillardet wrote:
>>> I will double-check this (AFK right now).
>>> Are you running on a RHEL6-like distro with gcc?
>>> 
>>> IIRC, crash vs. MPI error is governed by --with-param-check or something like 
>>> that...
>>> 
>>> Cheers,
>>> 
>>> Gilles 
>>> 
>>> Ralph Castain <r...@open-mpi.org> wrote:
>>>> I tried it with both the Fortran and C versions - got the same result.
>>>> 
>>>> 
>>>> This was indeed with a debug build. I wouldn’t expect a segfault even with 
>>>> an optimized build, though - I would expect an MPI error, yes?
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Nov 26, 2014, at 4:26 PM, Gilles Gouaillardet 
>>>> <gilles.gouaillar...@gmail.com> wrote:
>>>> 
>>>> 
>>>> I will have a look
>>>> 
>>>> BTW, I was running the Fortran version, not the C one.
>>>> Did you configure with --enable-debug?
>>>> The program sends to a rank *not* in the communicator, so this behavior 
>>>> could make some sense on an optimized build.
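>>>> For illustration only - a minimal C sketch (not the actual intel_tests 
>>>> source) of the kind of invalid-rank send the test performs; with 
>>>> mpi_param_check enabled this should fail with MPI_ERR_RANK (fatal by 
>>>> default) rather than segfault:
>>>> 
>>>>     #include <mpi.h>
>>>> 
>>>>     int main(int argc, char **argv)
>>>>     {
>>>>         int size, buf = 0;
>>>>         MPI_Init(&argc, &argv);
>>>>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>         /* deliberately send to a rank outside MPI_COMM_WORLD;
>>>>          * with the default MPI_ERRORS_ARE_FATAL handler this aborts */
>>>>         MPI_Send(&buf, 1, MPI_INT, size, 0, MPI_COMM_WORLD);
>>>>         MPI_Finalize();
>>>>         return 0;
>>>>     }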
>>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
>>>> 
>>>> Ralph Castain <r...@open-mpi.org> wrote:
>>>> Ick - I’m getting a segfault when trying to run that test:
>>>> 
>>>> 
>>>> MPITEST info  (0): Starting MPI_Errhandler_fatal test
>>>> 
>>>> MPITEST info  (0): This test will abort after printing the results message
>>>> 
>>>> MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
>>>> 
>>>> [bend001:07714] *** Process received signal ***
>>>> 
>>>> [bend001:07714] Signal: Segmentation fault (11)
>>>> 
>>>> [bend001:07714] Signal code: Address not mapped (1)
>>>> 
>>>> [bend001:07714] Failing at address: 0x50
>>>> 
>>>> [bend001:07715] *** Process received signal ***
>>>> 
>>>> [bend001:07715] Signal: Segmentation fault (11)
>>>> 
>>>> [bend001:07715] Signal code: Address not mapped (1)
>>>> 
>>>> [bend001:07715] Failing at address: 0x50
>>>> 
>>>> [bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
>>>> 
>>>> [bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
>>>> 
>>>> [bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
>>>> 
>>>> [bend001:07713] *** Process received signal ***
>>>> 
>>>> [bend001:07713] Signal: Segmentation fault (11)
>>>> 
>>>> [bend001:07713] Signal code: Address not mapped (1)
>>>> 
>>>> [bend001:07713] Failing at address: 0x50
>>>> 
>>>> [bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
>>>> 
>>>> [bend001:07713] [ 1] 
>>>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
>>>> 
>>>> [bend001:07713] [ 2] [bend001:07714] [ 0] 
>>>> /usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
>>>> 
>>>> [bend001:07714] [ 1] 
>>>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
>>>> 
>>>> [bend001:07714] [ 2] [bend001:07715] [ 0] 
>>>> /usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
>>>> 
>>>> [bend001:07715] [ 1] 
>>>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8eeeeca6]
>>>> 
>>>> [bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests 
>>>> PASSED (3)
>>>> 
>>>> 
>>>> 
>>>> This is with the head of the 1.8 branch. Any suggestions?
>>>> 
>>>> Ralph
>>>> 
>>>> 
>>>> 
>>>> On Nov 26, 2014, at 8:46 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> 
>>>> 
>>>> Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks 
>>>> like the fix indeed failed to come over to 1.8. I’ll take a gander (pretty 
>>>> sure I remember how I fixed it) - thanks!
>>>> 
>>>> On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>> 
>>>> Ralph,
>>>> 
>>>> I noted several hangs in MTT with the v1.8 branch.
>>>> 
>>>> A simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
>>>> from the intel_tests suite:
>>>> invoke mpirun on one node and run the tasks on another node:
>>>> 
>>>> node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
>>>> 
>>>> /* since this is a race condition, you might need to run this in a loop
>>>> in order to hit the bug */
>>>> 
>>>> The attached tarball contains a patch (added debug + a temporary hack) and
>>>> some log files obtained with
>>>> --mca errmgr_base_verbose 100 --mca odls_base_verbose 100
>>>> 
>>>> Without the hack, I can reproduce the bug with -np 3 (log.ko.txt); with
>>>> the hack, I can still reproduce the hang (though it might
>>>> be a different one) with -np 16 (log.ko.2.txt).
>>>> 
>>>> I remember some similar hangs were fixed on trunk/master a few
>>>> months ago.
>>>> I tried to backport some commits but it did not help :-(
>>>> 
>>>> Could you please have a look at this?
>>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
>>>> <abort_hang.tar.gz>
