And here is the backtrace I probably should have provided in the previous
email.
-Paul

#0  0x00002b4107ce9265 in raise () from /lib64/libc.so.6
#1  0x00002b4107ceaeb8 in abort () from /lib64/libc.so.6
#2  0x00002b4107ce26e6 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000044e8b3 in udcm_module_finalize (btl=0x1cf2ae0, cpc=0x1ce6f80)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:743
#4  0x000000000044d7e0 in udcm_component_query (btl=0x1cf2ae0, cpc=0x1cec948)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:485
#5  0x00000000004464c5 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x1cf2ae0)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
#6  0x0000000000442c5c in btl_openib_component_init (num_btl_modules=0x7fff6e9b5a10, enable_progress_threads=false, enable_mpi_threads=false)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/btl_openib_component.c:2837
#7  0x0000000000433328 in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/base/btl_base_select.c:111
#8  0x000000000043266c in mca_bml_r2_component_init (priority=0x7fff6e9b5ac4, enable_progress_threads=false, enable_mpi_threads=false)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/bml/r2/bml_r2_component.c:88
#9  0x000000000050e460 in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/bml/base/bml_base_init.c:69
#10 0x00000000004d9e09 in mca_pml_ob1_component_init (priority=0x7fff6e9b5bcc, enable_progress_threads=false, enable_mpi_threads=false)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/pml/ob1/pml_ob1_component.c:271
#11 0x00000000004d90b0 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/pml/base/pml_base_select.c:128
#12 0x0000000000423e2d in ompi_mpi_init (argc=1, argv=0x7fff6e9b5ea8, requested=0, provided=0x7fff6e9b5d5c)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/runtime/ompi_mpi_init.c:614
#13 0x000000000042dceb in PMPI_Init (argc=0x7fff6e9b5d9c, argv=0x7fff6e9b5d90) at pinit.c:84
#14 0x0000000000407285 in main (argc=1, argv=0x7fff6e9b5ea8) at ring_c.c:19


On Wed, Apr 22, 2015 at 9:41 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:

> Howard,
>
> Unless there is some reason the settings must be global, you should be
> able to set the limits w/o root privs:
>
> Bourne shells:
>     $ ulimit -l 64
> C shells:
>     % limit -h memorylocked 64
>
> I would have thought these lines might need to go in a .profile or .cshrc
> to affect the application processes, but perhaps mpirun propagates the
> rlimits.
> So, on NERSC's Carver I can reproduce the problem (in my build of
> 1.8.5rc2) quite easily (below).
> I have configured with --enable-debug, which probably explains why I see
> an assertion failure rather than the reported SEGV.
>
> -Paul
>
> {hargrove@c1436 BLD}$ ulimit -l 64
> {hargrove@c1436 BLD}$ mpirun -np 2 examples/ring_c
> ring_c:
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:743:
> udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) ==
> ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
> [c1436:05774] *** Process received signal ***
> [c1436:05774] Signal: Aborted (6)
> [c1436:05774] Signal code:  (-6)
> [c1436:05774] [ 0] /lib64/libpthread.so.0[0x2b4107aabb10]
> [c1436:05774] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b4107ce9265]
> [c1436:05774] [ 2] /lib64/libc.so.6(abort+0x110)[0x2b4107cead10]
> [c1436:05774] [ 3] /lib64/libc.so.6(__assert_fail+0xf6)[0x2b4107ce26e6]
> [c1436:05774] [ 4] examples/ring_c[0x44e8b3]
> [c1436:05774] [ 5] examples/ring_c[0x44d7e0]
> [c1436:05774] [ 6] examples/ring_c[0x4464c5]
> [c1436:05774] [ 7] examples/ring_c[0x442c5c]
> [c1436:05774] [ 8] examples/ring_c[0x433328]
> [c1436:05774] [ 9] examples/ring_c[0x43266c]
> [c1436:05774] [10] examples/ring_c[0x50e460]
> [c1436:05774] [11] examples/ring_c[0x4d9e09]
> [c1436:05774] [12] examples/ring_c[0x4d90b0]
> [c1436:05774] [13] examples/ring_c[0x423e2d]
> [c1436:05774] [14] examples/ring_c[0x42dceb]
> [c1436:05774] [15] examples/ring_c[0x407285]
> [c1436:05774] [16] /lib64/libc.so.6(__libc_start_main+0xf4)[0x2b4107cd6994]
> [c1436:05774] [17] examples/ring_c[0x407129]
> [c1436:05774] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 5774 on node c1436 exited on
> signal 6 (Aborted).
> --------------------------------------------------------------------------
>
>
>
> On Wed, Apr 22, 2015 at 9:16 AM, Howard Pritchard <hpprit...@gmail.com>
> wrote:
>
>> Hi Raphael,
>>
>> Thanks very much for the patches.
>>
>> Would one of the developers on the list have a system with HCAs
>> installed where they can make these kernel limit changes?
>>
>> I don't have access to any system where I have such permissions.
>>
>> Howard
>>
>>
>> 2015-04-22 8:55 GMT-06:00 Raphaël Fouassier <raphael.fouass...@atos.net>:
>>
>>> We are experiencing a bug in Open MPI 1.8.4 that also happens on
>>> master: if the locked memory limit is too low, a segfault happens
>>> in openib/udcm because some memory is not correctly deallocated.
>>>
>>> To reproduce it, modify /etc/security/limits.conf with:
>>> * soft memlock 64
>>> * hard memlock 64
>>> and launch with mpirun (not in a slurm allocation).
>>>
>>>
>>> I propose 2 patches, one for 1.8.4 and one for master (because of the
>>> btl move to opal), which:
>>> - free all allocated resources
>>> - print the limits error
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2015/04/17305.php
>>>
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/04/17306.php
>>
>
>
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>



