Hi folks,

One of our users (oh, OK, our director, one of the Dalton developers) has found an odd behaviour of OMPI 1.6.5 on our x86 clusters and has managed to get a small reproducer - a modified version of the ubiquitous F90 "hello world" MPI program.
We find that if we run this program (compiled with either Intel or GCC) after doing "ulimit -v $((1*1024*1024))" to simulate the default 1GB memory limit for jobs under Slurm, we get odd, but different, behaviour.

With the Intel compilers it appears to just hang, but if I run it under strace I see it looping, constantly SEGVing. With RHEL 6.4 gfortran it instead SEGVs straight away and gives a stack trace:

 Hello, world, I am 0 of 1
[barcoo:27489] *** Process received signal ***
[barcoo:27489] Signal: Segmentation fault (11)
[barcoo:27489] Signal code: Address not mapped (1)
[barcoo:27489] Failing at address: 0x2008e5708
[barcoo:27489] [ 0] /lib64/libpthread.so.0() [0x3f7b60f500]
[barcoo:27489] [ 1] /usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x982) [0x7f83caff6dd2]
[barcoo:27489] [ 2] /usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x52) [0x7f83caff7f42]
[barcoo:27489] [ 3] ./gnumyhello_f90(MAIN__+0x146) [0x400f6a]
[barcoo:27489] [ 4] ./gnumyhello_f90(main+0x2a) [0x4011ea]
[barcoo:27489] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3f7b21ecdd]
[barcoo:27489] [ 6] ./gnumyhello_f90() [0x400d69]
[barcoo:27489] *** End of error message ***

If I let it generate a core file, "bt" tells me:

(gdb) bt
#0  sYSMALLOc (av=0xffffffffffffefd0, bytes=<value optimized out>) at malloc.c:3240
#1  opal_memory_ptmalloc2_int_malloc (av=0xffffffffffffefd0, bytes=<value optimized out>) at malloc.c:4328
#2  0x00007f83caff7f42 in opal_memory_ptmalloc2_malloc (bytes=8560000000) at malloc.c:3433
#3  0x0000000000400f6a in main () at gnumyhello_f90.f90:26
#4  0x00000000004011ea in main ()

I've attached his reproducer program; I've just compiled it with:

mpif90 -g -o ./gnumyhello_f90 gnumyhello_f90.f90

We've reproduced it on two different Intel clusters (both RHEL 6.4, one Nehalem and one SandyBridge), so I'd be really interested to know whether this is a bug.

Thanks!
Chris

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
!
! Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
!                         University Research and Technology
!                         Corporation.  All rights reserved.
! Copyright (c) 2004-2005 The Regents of the University of California.
!                         All rights reserved.
! Copyright (c) 2006      Cisco Systems, Inc.  All rights reserved.
!
! Sample MPI "hello world" application in Fortran 90
!
program main
    use mpi
    implicit none
    integer :: ierr, rank, size
    !integer, parameter :: WRKMEM=1050*10**6
    integer, parameter :: WRKMEM=1070*10**6
    real (kind(0.d0)), allocatable, dimension(:) :: work

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
    print *, "Hello, world, I am ", rank, " of ", size

    allocate(work(WRKMEM), stat=ierr)
    if (ierr .eq. 0) then
        print *, "Task ", rank, " successfully allocated ", &
            (8.d0*WRKMEM/(1024**3)), "GB"
        deallocate(work)
    else
        print *, "Task ", rank, " failed to allocate ", &
            (8.d0*WRKMEM/(1024**3)), "GB"
    end if

    call MPI_FINALIZE(ierr)
end
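For reference, the whole sequence we use to hit this, gathered in one place. Only the ulimit and mpif90 commands above are quoted from the actual runs; the mpirun invocation here is just a plausible single-task launch to match the "0 of 1" output, so treat it as a sketch rather than the exact command line:

    # build the attached reproducer with debug symbols
    mpif90 -g -o ./gnumyhello_f90 gnumyhello_f90.f90

    # simulate Slurm's default 1GB per-job virtual memory limit
    ulimit -v $((1*1024*1024))

    # run a single task; the ~8.56 GB allocate() (cf. bytes=8560000000 in the
    # backtrace) should fail cleanly with stat /= 0, not SEGV inside ptmalloc2
    mpirun -np 1 ./gnumyhello_f90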