Hi Paul,

Silly me. I forgot this was a ulimit thing. I'll test on Carver.

Howard


2015-04-22 10:45 GMT-06:00 Paul Hargrove <phhargr...@lbl.gov>:

> And here is the backtrace I probably should have provided in the previous
> email.
> -Paul
>
> #0  0x00002b4107ce9265 in raise () from /lib64/libc.so.6
> #1  0x00002b4107ceaeb8 in abort () from /lib64/libc.so.6
> #2  0x00002b4107ce26e6 in __assert_fail () from /lib64/libc.so.6
> #3  0x000000000044e8b3 in udcm_module_finalize (btl=0x1cf2ae0, cpc=0x1ce6f80)
>     at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:743
> #4  0x000000000044d7e0 in udcm_component_query (btl=0x1cf2ae0, cpc=0x1cec948)
>     at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:485
> #5  0x00000000004464c5 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x1cf2ae0)
>     at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
> #6  0x0000000000442c5c in btl_openib_component_init (num_btl_modules=0x7fff6e9b5a10, enable_progress_threads=false, enable_mpi_threads=false)
>     at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/btl_openib_component.c:2837
> #7  0x0000000000433328 in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false)
>     at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/base/btl_base_select.c:111
> #8  0x000000000043266c in mca_bml_r2_component_init (priority=0x7fff6e9b5ac4, enable_progress_threads=false, enable_mpi_threads=false)
>     at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/bml/r2/bml_r2_component.c:88
> #9  0x000000000050e460 in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false)
>     at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/bml/base/bml_base_init.c:69
> #10 0x00000000004d9e09 in mca_pml_ob1_component_init (priority=0x7fff6e9b5bcc, enable_progress_threads=false, enable_mpi_threads=false)
>     at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/pml/ob1/pml_ob1_component.c:271
> #11 0x00000000004d90b0 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false)
>     at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/pml/base/pml_base_select.c:128
> #12 0x0000000000423e2d in ompi_mpi_init (argc=1, argv=0x7fff6e9b5ea8, requested=0, provided=0x7fff6e9b5d5c)
>     at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/runtime/ompi_mpi_init.c:614
> #13 0x000000000042dceb in PMPI_Init (argc=0x7fff6e9b5d9c, argv=0x7fff6e9b5d90) at pinit.c:84
> #14 0x0000000000407285 in main (argc=1, argv=0x7fff6e9b5ea8) at ring_c.c:19
>
>
> On Wed, Apr 22, 2015 at 9:41 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>> Howard,
>>
>> Unless there is some reason the settings must be global, you should be
>> able to set the limits w/o root privs:
>>
>> Bourne shells:
>>     $ ulimit -l 64
>> C shells:
>>     % limit -h memorylocked 64
>>
>> I would have thought these lines might need to go in a .profile or .cshrc
>> to affect the application processes, but perhaps mpirun propagates the
>> rlimits.
>> So, on NERSC's Carver I can reproduce the problem (in my build of
>> 1.8.5rc2) quite easily (below).
>> I have configured with --enable-debug, which probably explains why I see
>> an assertion failure rather than the reported SEGV.
>>
>> -Paul
>>
>> {hargrove@c1436 BLD}$ ulimit -l 64
>> {hargrove@c1436 BLD}$ mpirun -np 2 examples/ring_c
>> ring_c: /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:743: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
>> [c1436:05774] *** Process received signal ***
>> [c1436:05774] Signal: Aborted (6)
>> [c1436:05774] Signal code:  (-6)
>> [c1436:05774] [ 0] /lib64/libpthread.so.0[0x2b4107aabb10]
>> [c1436:05774] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b4107ce9265]
>> [c1436:05774] [ 2] /lib64/libc.so.6(abort+0x110)[0x2b4107cead10]
>> [c1436:05774] [ 3] /lib64/libc.so.6(__assert_fail+0xf6)[0x2b4107ce26e6]
>> [c1436:05774] [ 4] examples/ring_c[0x44e8b3]
>> [c1436:05774] [ 5] examples/ring_c[0x44d7e0]
>> [c1436:05774] [ 6] examples/ring_c[0x4464c5]
>> [c1436:05774] [ 7] examples/ring_c[0x442c5c]
>> [c1436:05774] [ 8] examples/ring_c[0x433328]
>> [c1436:05774] [ 9] examples/ring_c[0x43266c]
>> [c1436:05774] [10] examples/ring_c[0x50e460]
>> [c1436:05774] [11] examples/ring_c[0x4d9e09]
>> [c1436:05774] [12] examples/ring_c[0x4d90b0]
>> [c1436:05774] [13] examples/ring_c[0x423e2d]
>> [c1436:05774] [14] examples/ring_c[0x42dceb]
>> [c1436:05774] [15] examples/ring_c[0x407285]
>> [c1436:05774] [16] /lib64/libc.so.6(__libc_start_main+0xf4)[0x2b4107cd6994]
>> [c1436:05774] [17] examples/ring_c[0x407129]
>> [c1436:05774] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 5774 on node c1436 exited on
>> signal 6 (Aborted).
>> --------------------------------------------------------------------------
>>
>>
>>
>> On Wed, Apr 22, 2015 at 9:16 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>
>>> Hi Raphael,
>>>
>>> Thanks very much for the patches.
>>>
>>> Would one of the developers on the list have a system where they
>>> can make these kernel limit changes and that has HCAs installed?
>>>
>>> I don't have access to any system where I have such permissions.
>>>
>>> Howard
>>>
>>>
>>> 2015-04-22 8:55 GMT-06:00 Raphaël Fouassier <raphael.fouass...@atos.net>:
>>>
>>>> We are experiencing a bug in Open MPI 1.8.4 that also happens on
>>>> master: if the locked-memory limits are too low, a segfault occurs
>>>> in openib/udcm because some memory is not correctly deallocated.
>>>>
>>>> To reproduce it, modify /etc/security/limits.conf with:
>>>> * soft memlock 64
>>>> * hard memlock 64
>>>> and launch with mpirun (not in a slurm allocation).
>>>>
>>>>
>>>> I propose 2 patches for 1.8.4 and master (because of the btl move to
>>>> opal) which:
>>>> - free all allocated resources
>>>> - print the limits error
>>>>
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2015/04/17305.php
>>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>
>
>
>
>

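For anyone who wants to repeat the low-memlock experiment quoted above without changing the limits of their login shell, here is a minimal sketch. The function names `memlock_ok` and `run_with_memlock` are hypothetical helpers made up for illustration, not part of Open MPI. It assumes a Bourne-style shell whose `ulimit -l` works in units of 1024 bytes (as bash does), and it lowers the limit only inside a subshell.

```shell
# Hypothetical helpers for illustration; not part of Open MPI.

# memlock_ok LIMIT MIN_KIB
# LIMIT is a value as printed by `ulimit -l` (a number of KiB, or "unlimited").
# Succeeds if LIMIT is at least MIN_KIB.
memlock_ok() {
    if [ "$1" = "unlimited" ]; then
        return 0
    fi
    [ "$1" -ge "$2" ]
}

# run_with_memlock KIB COMMAND [ARGS...]
# Runs COMMAND with the locked-memory limit set to KIB.
# The limit is changed in a subshell, so the calling shell keeps its own limits.
run_with_memlock() {
    (
        ulimit -l "$1" || exit 1
        shift
        exec "$@"
    )
}
```

Possible usage (the 65536 KiB threshold is an arbitrary placeholder, not a value taken from this thread):

    $ memlock_ok "$(ulimit -l)" 65536 || echo "locked-memory limit looks low" >&2
    $ run_with_memlock 64 mpirun -np 2 examples/ring_c

Lowering a limit needs no special privileges, but raising it above the current hard limit will fail, in which case `run_with_memlock` returns non-zero without running the command.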