Howard,

Unless there is some reason the settings must be global, you should be able
to set the limits w/o root privs:

Bourne shells:
    $ ulimit -l 64
C shells:
    % limit -h memorylocked 64
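
To keep the lowered limit from sticking to the login shell, the change can be confined to a subshell. The following is only a minimal sketch for Bourne-compatible shells; the helper names show_memlock and with_memlock are made up for illustration, and whether the launched MPI processes actually inherit the limit still depends on how mpirun starts them.

```shell
#!/bin/sh
# show_memlock: print the soft and hard locked-memory limits of this shell.
# Values are reported in KiB, or as "unlimited".
show_memlock() {
    echo "soft=$(ulimit -S -l) hard=$(ulimit -H -l)"
}

# with_memlock KIB CMD [ARGS...]: run CMD with the locked-memory limit
# set to KIB. The parentheses start a subshell, so the limit of the
# calling shell is left unchanged. (Hypothetical helper, not part of
# Open MPI.)
with_memlock() {
    (
        ulimit -l "$1" || exit 1
        shift
        "$@"
    )
}
```

With these defined, something like "with_memlock 64 mpirun -np 2 examples/ring_c" would run the reproducer under the low limit, and show_memlock afterwards should still report the original values.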

I would have thought these lines might need to go in a .profile or .cshrc
to affect the application processes, but perhaps mpirun propagates the
rlimits.

So, on NERSC's Carver I can reproduce the problem (in my build of 1.8.5rc2)
quite easily (see below).
I have configured with --enable-debug, which probably explains why I see an
assertion failure rather than the reported SEGV.

-Paul

{hargrove@c1436 BLD}$ ulimit -l 64
{hargrove@c1436 BLD}$ mpirun -np 2 examples/ring_c
ring_c:
/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:743:
udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) ==
((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[c1436:05774] *** Process received signal ***
[c1436:05774] Signal: Aborted (6)
[c1436:05774] Signal code:  (-6)
[c1436:05774] [ 0] /lib64/libpthread.so.0[0x2b4107aabb10]
[c1436:05774] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b4107ce9265]
[c1436:05774] [ 2] /lib64/libc.so.6(abort+0x110)[0x2b4107cead10]
[c1436:05774] [ 3] /lib64/libc.so.6(__assert_fail+0xf6)[0x2b4107ce26e6]
[c1436:05774] [ 4] examples/ring_c[0x44e8b3]
[c1436:05774] [ 5] examples/ring_c[0x44d7e0]
[c1436:05774] [ 6] examples/ring_c[0x4464c5]
[c1436:05774] [ 7] examples/ring_c[0x442c5c]
[c1436:05774] [ 8] examples/ring_c[0x433328]
[c1436:05774] [ 9] examples/ring_c[0x43266c]
[c1436:05774] [10] examples/ring_c[0x50e460]
[c1436:05774] [11] examples/ring_c[0x4d9e09]
[c1436:05774] [12] examples/ring_c[0x4d90b0]
[c1436:05774] [13] examples/ring_c[0x423e2d]
[c1436:05774] [14] examples/ring_c[0x42dceb]
[c1436:05774] [15] examples/ring_c[0x407285]
[c1436:05774] [16] /lib64/libc.so.6(__libc_start_main+0xf4)[0x2b4107cd6994]
[c1436:05774] [17] examples/ring_c[0x407129]
[c1436:05774] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 5774 on node c1436 exited on
signal 6 (Aborted).
--------------------------------------------------------------------------



On Wed, Apr 22, 2015 at 9:16 AM, Howard Pritchard <hpprit...@gmail.com>
wrote:

> Hi Raphael,
>
> Thanks very much for the patches.
>
> Would one of the developers on the list have a system where they
> can make these kernel limit changes and which has HCAs installed?
>
> I don't have access to any system where I have such permissions.
>
> Howard
>
>
> 2015-04-22 8:55 GMT-06:00 Raphaël Fouassier <raphael.fouass...@atos.net>:
>
>> We are experiencing a bug in Open MPI 1.8.4 which also happens on
>> master: if the locked memory limits are too low, a segfault occurs
>> in openib/udcm because some memory is not correctly deallocated.
>>
>> To reproduce it, modify /etc/security/limits.conf with:
>> * soft memlock 64
>> * hard memlock 64
>> and launch with mpirun (not in a slurm allocation).
>>
>>
>> I propose 2 patches for 1.8.4 and master (because of the btl move to
>> opal) which:
>> - free all allocated resources
>> - print the limits error
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/04/17305.php
>>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/04/17306.php
>



-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
