No, that shouldn’t be the issue any more - and that isn’t what the backtrace 
indicates. It looks instead like there was a problem with the shared memory 
backing file on a remote node, and that caused the vader shared memory BTL to 
segfault.
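
A bus error coming out of mca_shmem_mmap usually means the mmap'd backing file couldn't actually be backed by the filesystem - most often because the partition holding the session directory on that node filled up. So it may be worth a quick look at free space on csclprd3-6-12 first, e.g. (assuming the session directory is under /tmp, which is the usual default):

  df -h /tmp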

Try turning vader off and see if that helps. I’m not sure what you are using, 
but adding “-mca btl ^vader” to your mpirun command line will probably suffice.
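
For example, something along these lines (the -np and -hostfile values are just placeholders for whatever you normally run with):

  mpirun -np 128 -hostfile myhosts -mca btl ^vader ./a_1_10_1.out

That simply excludes vader from BTL selection and lets the other BTLs (e.g. sm/tcp) be used instead.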

Nathan - any other suggestions?


> On Mar 17, 2016, at 4:40 PM, Lane, William <william.l...@cshs.org> wrote:
> 
> I remember years ago, OpenMPI (version 1.3.3) required the hard and soft open
> files limits to be >= 4096 in order to function when large numbers of slots
> were requested (with 1.3.3 the problem appeared at roughly 85 slots). Is this
> requirement still present for OpenMPI versions 1.10.1 and greater?
> 
> I'm having some issues now with OpenMPI version 1.10.1 that remind me
> of the issues I had with 1.3.3, where OpenMPI worked fine as long as I didn't
> request too many slots.
> 
> When I look at the ulimit -a output (soft limits) I see:
> open files                      (-n) 1024
> 
> ulimit -Ha (hard limits) gives:
> open files                      (-n) 4096
> 
> I'm getting errors of the form:
> [csclprd3-5-5:15248] [[40732,0],0] plm:base:receive got update_proc_state for 
> job [40732,1]
> [csclprd3-6-12:30567] *** Process received signal ***
> [csclprd3-6-12:30567] Signal: Bus error (7)
> [csclprd3-6-12:30567] Signal code: Non-existant physical address (2)
> [csclprd3-6-12:30567] Failing at address: 0x2b3d19f72000
> [csclprd3-6-12:30568] *** Process received signal ***
> [csclprd3-6-12:30567] [ 0] /lib64/libpthread.so.0(+0xf500)[0x2b3d0f71f500]
> [csclprd3-6-12:30567] [ 1] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_shmem_mmap.so(+0x1524)[0x2b3d10cb0524]
> [csclprd3-6-12:30567] [ 2] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_btl_vader.so(+0x3674)[0x2b3d18494674]
> [csclprd3-6-12:30567] [ 3] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_btl_base_select+0x117)[0x2b3d0f4b0b07]
> [csclprd3-6-12:30567] [ 4] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x2b3d13d917b2]
> [csclprd3-6-12:30567] [ 5] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_bml_base_init+0x99)[0x2b3d0f4b0309]
> [csclprd3-6-12:30567] [ 6] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_pml_ob1.so(+0x538c)[0x2b3d18ac238c]
> [csclprd3-6-12:30567] [ 7] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_pml_base_select+0x1e0)[0x2b3d0f4c1780]
> [csclprd3-6-12:30567] [ 8] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(ompi_mpi_init+0x51d)[0x2b3d0f47317d]
> [csclprd3-6-12:30567] [ 9] 
> /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(MPI_Init+0x170)[0x2b3d0f492820]
> [csclprd3-6-12:30567] [10] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400ad0]
> [csclprd3-6-12:30567] [11] 
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b3d0f94bcdd]
> [csclprd3-6-12:30567] [12] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400999]
> [csclprd3-6-12:30567] *** End of error message ***
> 
> Ugh.
> 
> Bill L.
