Howard,

Unless there is some reason the settings must be global, you should be able to set the limits w/o root privs:
Bourne shells:
    $ ulimit -l 64
C shells:
    % limit -h memorylocked 64

I would have thought these lines might need to go in a .profile or .cshrc to affect the application processes, but perhaps mpirun propagates the rlimits.

So, on NERSC's Carver I can reproduce the problem (in my build of 1.8.5rc2) quite easily (below).
I have configured with --enable-debug, which probably explains why I see an assertion failure rather than the reported SEGV.

-Paul

{hargrove@c1436 BLD}$ ulimit -l 64
{hargrove@c1436 BLD}$ mpirun -np 2 examples/ring_c
ring_c: /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:743: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[c1436:05774] *** Process received signal ***
[c1436:05774] Signal: Aborted (6)
[c1436:05774] Signal code: (-6)
[c1436:05774] [ 0] /lib64/libpthread.so.0[0x2b4107aabb10]
[c1436:05774] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b4107ce9265]
[c1436:05774] [ 2] /lib64/libc.so.6(abort+0x110)[0x2b4107cead10]
[c1436:05774] [ 3] /lib64/libc.so.6(__assert_fail+0xf6)[0x2b4107ce26e6]
[c1436:05774] [ 4] examples/ring_c[0x44e8b3]
[c1436:05774] [ 5] examples/ring_c[0x44d7e0]
[c1436:05774] [ 6] examples/ring_c[0x4464c5]
[c1436:05774] [ 7] examples/ring_c[0x442c5c]
[c1436:05774] [ 8] examples/ring_c[0x433328]
[c1436:05774] [ 9] examples/ring_c[0x43266c]
[c1436:05774] [10] examples/ring_c[0x50e460]
[c1436:05774] [11] examples/ring_c[0x4d9e09]
[c1436:05774] [12] examples/ring_c[0x4d90b0]
[c1436:05774] [13] examples/ring_c[0x423e2d]
[c1436:05774] [14] examples/ring_c[0x42dceb]
[c1436:05774] [15] examples/ring_c[0x407285]
[c1436:05774] [16] /lib64/libc.so.6(__libc_start_main+0xf4)[0x2b4107cd6994]
[c1436:05774] [17] examples/ring_c[0x407129]
[c1436:05774] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 5774 on node c1436 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

On Wed, Apr 22, 2015 at 9:16 AM, Howard Pritchard <hpprit...@gmail.com> wrote:

> Hi Raphael,
>
> Thanks very much for the patches.
>
> Would one of the developers on the list have a system where they
> can make these kernel limit changes and which have HCAs installed?
>
> I don't have access to any system where I have such permissions.
>
> Howard
>
>
> 2015-04-22 8:55 GMT-06:00 Raphaël Fouassier <raphael.fouass...@atos.net>:
>
>> We are experiencing a bug in OpenMPI in 1.8.4 which happens also on
>> master: if locked memory limits are too low, a segfault happens
>> in openib/udcm because some memory is not correctly deallocated.
>>
>> To reproduce it, modify /etc/security/limits.conf with:
>> * soft memlock 64
>> * hard memlock 64
>> and launch with mpirun (not in a slurm allocation).
>>
>>
>> I propose 2 patches for 1.8.4 and master (because of the btl move to
>> opal) which:
>> - free all allocated resources
>> - print the limits error
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/04/17305.php
>>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/04/17306.php
>

-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900