No, that shouldn't be the issue anymore - and it isn't what the backtrace indicates. It looks instead like there was a problem with the shared-memory backing file on a remote node (a bus error on a "non-existant physical address" during an mmap access usually means the backing file couldn't be fully mapped, e.g. because the filesystem holding it filled up), and that caused the vader shared-memory BTL to crash.

Try turning vader off and see if that helps - I'm not sure exactly how you are launching, but "-mca btl ^vader" on the mpirun command line should suffice. Nathan - any other suggestions?
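For example (a sketch only - the process count and hostfile are placeholders; the executable name is taken from your backtrace):

    mpirun -mca btl ^vader -np 16 -hostfile myhosts ./a_1_10_1.out

If excluding vader alone doesn't help, you could also try naming the transports explicitly, e.g. "-mca btl self,sm,tcp", which substitutes the older sm shared-memory BTL for vader.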
> On Mar 17, 2016, at 4:40 PM, Lane, William <william.l...@cshs.org> wrote:
>
> I remember years ago, OpenMPI (version 1.3.3) required the hard/soft open
> files limits be >= 4096 in order to function when large numbers of slots
> were requested (with 1.3.3 this was at roughly 85 slots). Is this requirement
> still present for OpenMPI versions 1.10.1 and greater?
>
> I'm having some issues now with OpenMPI version 1.10.1 that remind me
> of the issues I had w/1.3.3, where OpenMPI worked fine as long as I didn't
> request too many slots.
>
> When I look at ulimit -a (soft limits) I see:
> open files (-n) 1024
>
> ulimit -Ha (hard limits) gives:
> open files (-n) 4096
>
> I'm getting errors of the form:
>
> [csclprd3-5-5:15248] [[40732,0],0] plm:base:receive got update_proc_state for job [40732,1]
> [csclprd3-6-12:30567] *** Process received signal ***
> [csclprd3-6-12:30567] Signal: Bus error (7)
> [csclprd3-6-12:30567] Signal code: Non-existant physical address (2)
> [csclprd3-6-12:30567] Failing at address: 0x2b3d19f72000
> [csclprd3-6-12:30568] *** Process received signal ***
> [csclprd3-6-12:30567] [ 0] /lib64/libpthread.so.0(+0xf500)[0x2b3d0f71f500]
> [csclprd3-6-12:30567] [ 1] /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_shmem_mmap.so(+0x1524)[0x2b3d10cb0524]
> [csclprd3-6-12:30567] [ 2] /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_btl_vader.so(+0x3674)[0x2b3d18494674]
> [csclprd3-6-12:30567] [ 3] /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_btl_base_select+0x117)[0x2b3d0f4b0b07]
> [csclprd3-6-12:30567] [ 4] /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x2b3d13d917b2]
> [csclprd3-6-12:30567] [ 5] /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_bml_base_init+0x99)[0x2b3d0f4b0309]
> [csclprd3-6-12:30567] [ 6] /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_pml_ob1.so(+0x538c)[0x2b3d18ac238c]
> [csclprd3-6-12:30567] [ 7] /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_pml_base_select+0x1e0)[0x2b3d0f4c1780]
> [csclprd3-6-12:30567] [ 8] /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(ompi_mpi_init+0x51d)[0x2b3d0f47317d]
> [csclprd3-6-12:30567] [ 9] /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(MPI_Init+0x170)[0x2b3d0f492820]
> [csclprd3-6-12:30567] [10] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400ad0]
> [csclprd3-6-12:30567] [11] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b3d0f94bcdd]
> [csclprd3-6-12:30567] [12] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400999]
> [csclprd3-6-12:30567] *** End of error message ***
>
> Ugh.
>
> Bill L.
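As for the open-files limits quoted above: that old 4096 requirement shouldn't apply anymore, but if you want to rule it out, you can raise the soft limit up to the hard limit in the shell that launches mpirun (a sketch; bash syntax assumed):

    ulimit -n          # show the current soft limit (1024 in your case)
    ulimit -n 4096     # raise the soft limit to the hard limit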