Hi folks,
One of our users (oh, OK, our director, one of the Dalton developers)
has found an odd behaviour of OMPI 1.6.5 on our x86 clusters and has
managed to get a small reproducer - a modified version of the
ubiquitous F90 "hello world" MPI program.
We find that if we run this program (compiled with either the Intel or
GNU compilers) after doing "ulimit -v $((1*1024*1024))" to simulate the
default 1GB memory limit for jobs under Slurm, we get odd, and
compiler-dependent, behaviour.
With the Intel compilers it appears to just hang, but if I run it under
strace I see it constantly SEGV'ing in a loop.
With RHEL 6.4 gfortran it instead SEGV's straight away and gives a
stack trace:
Hello, world, I am 0 of 1
[barcoo:27489] *** Process received signal ***
[barcoo:27489] Signal: Segmentation fault (11)
[barcoo:27489] Signal code: Address not mapped (1)
[barcoo:27489] Failing at address: 0x2008e5708
[barcoo:27489] [ 0] /lib64/libpthread.so.0() [0x3f7b60f500]
[barcoo:27489] [ 1]
/usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x982)
[0x7f83caff6dd2]
[barcoo:27489] [ 2]
/usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x52)
[0x7f83caff7f42]
[barcoo:27489] [ 3] ./gnumyhello_f90(MAIN__+0x146) [0x400f6a]
[barcoo:27489] [ 4] ./gnumyhello_f90(main+0x2a) [0x4011ea]
[barcoo:27489] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3f7b21ecdd]
[barcoo:27489] [ 6] ./gnumyhello_f90() [0x400d69]
[barcoo:27489] *** End of error message ***
If I let it generate a core file, gdb's "bt" tells me:
(gdb) bt
#0 sYSMALLOc (av=0xffffffffffffefd0, bytes=<value optimized out>) at
malloc.c:3240
#1 opal_memory_ptmalloc2_int_malloc (av=0xffffffffffffefd0, bytes=<value
optimized out>) at malloc.c:4328
#2 0x00007f83caff7f42 in opal_memory_ptmalloc2_malloc (bytes=8560000000) at
malloc.c:3433
#3 0x0000000000400f6a in main () at gnumyhello_f90.f90:26
#4 0x00000000004011ea in main ()
I've attached his reproducer program; I compiled it with just:
mpif90 -g -o ./gnumyhello_f90 gnumyhello_f90.f90
We've reproduced it on two different Intel clusters (both RHEL 6.4,
one Nehalem and one SandyBridge), so I'd be really interested to
know whether this is a bug.
Thanks!
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected] Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
!
! Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
! University Research and Technology
! Corporation. All rights reserved.
! Copyright (c) 2004-2005 The Regents of the University of California.
! All rights reserved.
! Copyright (c) 2006 Cisco Systems, Inc. All rights reserved.
!
! Sample MPI "hello world" application in Fortran 90
!
program main
  use mpi
  implicit none

  integer :: ierr, rank, size
  !integer, parameter :: WRKMEM=1050*10**6
  integer, parameter :: WRKMEM=1070*10**6
  real (kind(0.d0)), allocatable, dimension(:) :: work

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)

  print *, "Hello, world, I am ", rank, " of ", size

  allocate(work(WRKMEM), stat=ierr)
  if (ierr .eq. 0) then
     print *, "Task ", rank, " successfully allocated ", &
          (8.d0*WRKMEM/(1024**3)), "GB"
     deallocate(work)
  else
     print *, "Task ", rank, " failed to allocate ", &
          (8.d0*WRKMEM/(1024**3)), "GB"
  end if

  call MPI_FINALIZE(ierr)
end program main