And here is the backtrace I probably should have provided in the previous email. -Paul
#0  0x00002b4107ce9265 in raise () from /lib64/libc.so.6
#1  0x00002b4107ceaeb8 in abort () from /lib64/libc.so.6
#2  0x00002b4107ce26e6 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000044e8b3 in udcm_module_finalize (btl=0x1cf2ae0, cpc=0x1ce6f80)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:743
#4  0x000000000044d7e0 in udcm_component_query (btl=0x1cf2ae0, cpc=0x1cec948)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:485
#5  0x00000000004464c5 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x1cf2ae0)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
#6  0x0000000000442c5c in btl_openib_component_init (num_btl_modules=0x7fff6e9b5a10, enable_progress_threads=false, enable_mpi_threads=false)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/btl_openib_component.c:2837
#7  0x0000000000433328 in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/base/btl_base_select.c:111
#8  0x000000000043266c in mca_bml_r2_component_init (priority=0x7fff6e9b5ac4, enable_progress_threads=false, enable_mpi_threads=false)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/bml/r2/bml_r2_component.c:88
#9  0x000000000050e460 in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/bml/base/bml_base_init.c:69
#10 0x00000000004d9e09 in mca_pml_ob1_component_init (priority=0x7fff6e9b5bcc, enable_progress_threads=false, enable_mpi_threads=false)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/pml/ob1/pml_ob1_component.c:271
#11 0x00000000004d90b0 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/pml/base/pml_base_select.c:128
#12 0x0000000000423e2d in ompi_mpi_init (argc=1, argv=0x7fff6e9b5ea8, requested=0, provided=0x7fff6e9b5d5c)
    at /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/runtime/ompi_mpi_init.c:614
#13 0x000000000042dceb in PMPI_Init (argc=0x7fff6e9b5d9c, argv=0x7fff6e9b5d90) at pinit.c:84
#14 0x0000000000407285 in main (argc=1, argv=0x7fff6e9b5ea8) at ring_c.c:19

On Wed, Apr 22, 2015 at 9:41 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> Howard,
>
> Unless there is some reason the settings must be global, you should be
> able to set the limits w/o root privs:
>
> Bourne shells:
> $ ulimit -l 64
> C shells:
> % limit -h memorylocked 64
>
> I would have thought these lines might need to go in a .profile or .cshrc
> to affect the application processes, but perhaps mpirun propagates the
> rlimits.
> So, on NERSC's Carver I can reproduce the problem (in my build of
> 1.8.5rc2) quite easily (below).
> I have configured with --enable-debug, which probably explains why I see
> an assertion failure rather than the reported SEGV.
>
> -Paul
>
> {hargrove@c1436 BLD}$ ulimit -l 64
> {hargrove@c1436 BLD}$ mpirun -np 2 examples/ring_c
> ring_c:
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:743:
> udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) ==
> ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
> [c1436:05774] *** Process received signal ***
> [c1436:05774] Signal: Aborted (6)
> [c1436:05774] Signal code:  (-6)
> [c1436:05774] [ 0] /lib64/libpthread.so.0[0x2b4107aabb10]
> [c1436:05774] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b4107ce9265]
> [c1436:05774] [ 2] /lib64/libc.so.6(abort+0x110)[0x2b4107cead10]
> [c1436:05774] [ 3] /lib64/libc.so.6(__assert_fail+0xf6)[0x2b4107ce26e6]
> [c1436:05774] [ 4] examples/ring_c[0x44e8b3]
> [c1436:05774] [ 5] examples/ring_c[0x44d7e0]
> [c1436:05774] [ 6] examples/ring_c[0x4464c5]
> [c1436:05774] [ 7] examples/ring_c[0x442c5c]
> [c1436:05774] [ 8] examples/ring_c[0x433328]
> [c1436:05774] [ 9] examples/ring_c[0x43266c]
> [c1436:05774] [10] examples/ring_c[0x50e460]
> [c1436:05774] [11] examples/ring_c[0x4d9e09]
> [c1436:05774] [12] examples/ring_c[0x4d90b0]
> [c1436:05774] [13] examples/ring_c[0x423e2d]
> [c1436:05774] [14] examples/ring_c[0x42dceb]
> [c1436:05774] [15] examples/ring_c[0x407285]
> [c1436:05774] [16] /lib64/libc.so.6(__libc_start_main+0xf4)[0x2b4107cd6994]
> [c1436:05774] [17] examples/ring_c[0x407129]
> [c1436:05774] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 5774 on node c1436 exited on
> signal 6 (Aborted).
> --------------------------------------------------------------------------
>
>
> On Wed, Apr 22, 2015 at 9:16 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
>> Hi Raphael,
>>
>> Thanks very much for the patches.
>>
>> Would one of the developers on the list have a system where they
>> can make these kernel limit changes and which have HCAs installed?
>>
>> I don't have access to any system where I have such permissions.
>>
>> Howard
>>
>>
>> 2015-04-22 8:55 GMT-06:00 Raphaël Fouassier <raphael.fouass...@atos.net>:
>>
>>> We are experiencing a bug in Open MPI 1.8.4 which also happens on
>>> master: if locked memory limits are too low, a segfault happens
>>> in openib/udcm because some memory is not correctly deallocated.
>>>
>>> To reproduce it, modify /etc/security/limits.conf with:
>>> * soft memlock 64
>>> * hard memlock 64
>>> and launch with mpirun (not in a slurm allocation).
>>>
>>> I propose 2 patches, for 1.8.4 and for master (because of the btl move
>>> to opal), which:
>>> - free all allocated resources
>>> - print the limits error
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/04/17305.php
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/04/17306.php
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900

--
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900