Hi Paul,

Silly me, I forgot this was a ulimit thing. I'll test on Carver.

Howard

2015-04-22 10:45 GMT-06:00 Paul Hargrove <phhargr...@lbl.gov>:

> And here is the backtrace I probably should have provided in the previous
> email.
> -Paul
>
> #0 0x00002b4107ce9265 in raise () from /lib64/libc.so.6
> #1 0x00002b4107ceaeb8 in abort () from /lib64/libc.so.6
> #2 0x00002b4107ce26e6 in __assert_fail () from /lib64/libc.so.6
> #3 0x000000000044e8b3 in udcm_module_finalize (btl=0x1cf2ae0,
> cpc=0x1ce6f80)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:743
> #4 0x000000000044d7e0 in udcm_component_query (btl=0x1cf2ae0,
> cpc=0x1cec948)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:485
> #5 0x00000000004464c5 in
> ompi_btl_openib_connect_base_select_for_local_port (btl=0x1cf2ae0)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
> #6 0x0000000000442c5c in btl_openib_component_init
> (num_btl_modules=0x7fff6e9b5a10,
> enable_progress_threads=false, enable_mpi_threads=false)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/btl_openib_component.c:2837
> #7 0x0000000000433328 in mca_btl_base_select
> (enable_progress_threads=false, enable_mpi_threads=false)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/base/btl_base_select.c:111
> #8 0x000000000043266c in mca_bml_r2_component_init
> (priority=0x7fff6e9b5ac4, enable_progress_threads=false,
> enable_mpi_threads=false)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/bml/r2/bml_r2_component.c:88
> #9 0x000000000050e460 in mca_bml_base_init
> (enable_progress_threads=false, enable_mpi_threads=false)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/bml/base/bml_base_init.c:69
> #10 0x00000000004d9e09 in mca_pml_ob1_component_init
> (priority=0x7fff6e9b5bcc,
> enable_progress_threads=false, enable_mpi_threads=false)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/pml/ob1/pml_ob1_component.c:271
> #11 0x00000000004d90b0 in mca_pml_base_select
> (enable_progress_threads=false, enable_mpi_threads=false)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/pml/base/pml_base_select.c:128
> #12 0x0000000000423e2d in ompi_mpi_init (argc=1, argv=0x7fff6e9b5ea8,
> requested=0, provided=0x7fff6e9b5d5c)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/runtime/ompi_mpi_init.c:614
> #13 0x000000000042dceb in PMPI_Init (argc=0x7fff6e9b5d9c,
> argv=0x7fff6e9b5d90) at pinit.c:84
> #14 0x0000000000407285 in main (argc=1, argv=0x7fff6e9b5ea8) at ring_c.c:19
>
>
> On Wed, Apr 22, 2015 at 9:41 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>> Howard,
>>
>> Unless there is some reason the settings must be global, you should be
>> able to set the limits w/o root privs:
>>
>> Bourne shells:
>>     $ ulimit -l 64
>> C shells:
>>     % limit -h memorylocked 64
>>
>> I would have thought these lines might need to go in a .profile or .cshrc
>> to affect the application processes, but perhaps mpirun propagates the
>> rlimits.
>> So, on NERSC's Carver I can reproduce the problem (in my build of
>> 1.8.5rc2) quite easily (below).
>> I have configured with --enable-debug, which probably explains why I see
>> an assertion failure rather than the reported SEGV.
>>
>> -Paul
>>
>> {hargrove@c1436 BLD}$ ulimit -l 64
>> {hargrove@c1436 BLD}$ mpirun -np 2 examples/ring_c
>> ring_c:
>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:743:
>> udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) ==
>> ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
>> [c1436:05774] *** Process received signal ***
>> [c1436:05774] Signal: Aborted (6)
>> [c1436:05774] Signal code: (-6)
>> [c1436:05774] [ 0] /lib64/libpthread.so.0[0x2b4107aabb10]
>> [c1436:05774] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b4107ce9265]
>> [c1436:05774] [ 2] /lib64/libc.so.6(abort+0x110)[0x2b4107cead10]
>> [c1436:05774] [ 3] /lib64/libc.so.6(__assert_fail+0xf6)[0x2b4107ce26e6]
>> [c1436:05774] [ 4] examples/ring_c[0x44e8b3]
>> [c1436:05774] [ 5] examples/ring_c[0x44d7e0]
>> [c1436:05774] [ 6] examples/ring_c[0x4464c5]
>> [c1436:05774] [ 7] examples/ring_c[0x442c5c]
>> [c1436:05774] [ 8] examples/ring_c[0x433328]
>> [c1436:05774] [ 9] examples/ring_c[0x43266c]
>> [c1436:05774] [10] examples/ring_c[0x50e460]
>> [c1436:05774] [11] examples/ring_c[0x4d9e09]
>> [c1436:05774] [12] examples/ring_c[0x4d90b0]
>> [c1436:05774] [13] examples/ring_c[0x423e2d]
>> [c1436:05774] [14] examples/ring_c[0x42dceb]
>> [c1436:05774] [15] examples/ring_c[0x407285]
>> [c1436:05774] [16]
>> /lib64/libc.so.6(__libc_start_main+0xf4)[0x2b4107cd6994]
>> [c1436:05774] [17] examples/ring_c[0x407129]
>> [c1436:05774] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 5774 on node c1436 exited on
>> signal 6 (Aborted).
>> --------------------------------------------------------------------------
>>
>>
>>
>> On Wed, Apr 22, 2015 at 9:16 AM, Howard Pritchard <hpprit...@gmail.com>
>> wrote:
>>
>>> Hi Raphael,
>>>
>>> Thanks very much for the patches.
>>>
>>> Would one of the developers on the list have a system where they
>>> can make these kernel limit changes and which has HCAs installed?
>>>
>>> I don't have access to any system where I have such permissions.
>>>
>>> Howard
>>>
>>>
>>> 2015-04-22 8:55 GMT-06:00 Raphaël Fouassier <raphael.fouass...@atos.net>:
>>>
>>>> We are experiencing a bug in OpenMPI in 1.8.4 which happens also on
>>>> master: if locked memory limits are too low, a segfault happens
>>>> in openib/udcm because some memory is not correctly deallocated.
>>>>
>>>> To reproduce it, modify /etc/security/limits.conf with:
>>>> * soft memlock 64
>>>> * hard memlock 64
>>>> and launch with mpirun (not in a slurm allocation).
>>>>
>>>>
>>>> I propose 2 patches for 1.8.4 and master (because of the btl move to
>>>> opal) which:
>>>> - free all allocated resources
>>>> - print the limits error
>>>>
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2015/04/17305.php
>>>>
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2015/04/17306.php
>>>
>>
>>
>>
>> --
>> Paul H. Hargrove phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>
>
>
>
> --
> Paul H. Hargrove phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/04/17308.php
>
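
For reference, the locked-memory limit discussed in this thread (the value set by `ulimit -l` or by the `memlock` entries in limits.conf, both in KiB) can be read from inside a process with getrlimit(RLIMIT_MEMLOCK), which reports bytes. The following is only a minimal sketch of such a check, not the proposed Open MPI patch: the helper names and the `needed_bytes` parameter are hypothetical, and the thread does not state how much locked memory openib/udcm actually requires. RLIMIT_MEMLOCK is assumed to be available, as it is on Linux.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: returns 1 if `limit` (in bytes) covers `needed`
 * bytes, treating RLIM_INFINITY as always sufficient; returns 0 otherwise. */
int memlock_limit_sufficient(rlim_t limit, rlim_t needed)
{
    return limit == RLIM_INFINITY || limit >= needed;
}

/* Hypothetical helper: stores the soft RLIMIT_MEMLOCK limit of the calling
 * process (in bytes) in *soft_bytes. Returns 0 on success, -1 on failure. */
int query_memlock_soft_limit(rlim_t *soft_bytes)
{
    struct rlimit rl;

    assert(soft_bytes != NULL);
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        return -1;
    }
    *soft_bytes = rl.rlim_cur;
    return 0;
}

/* Hypothetical helper: returns 1 if the soft limit covers `needed_bytes`,
 * 0 (after printing a diagnostic) if it is too low, and -1 if the limit
 * could not be queried. */
int check_memlock(rlim_t needed_bytes)
{
    rlim_t soft;

    if (query_memlock_soft_limit(&soft) != 0) {
        perror("getrlimit(RLIMIT_MEMLOCK)");
        return -1;
    }
    if (!memlock_limit_sufficient(soft, needed_bytes)) {
        fprintf(stderr,
                "locked-memory limit too low: %llu bytes allowed, "
                "%llu bytes wanted\n",
                (unsigned long long) soft,
                (unsigned long long) needed_bytes);
        return 0;
    }
    return 1;
}
```

With `ulimit -l 64` in effect, the soft limit read this way would be 64 KiB (65536 bytes), so a caller could report the problem up front rather than failing later during setup.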