Hi folks,

One of our users (oh, OK, our director, one of the Dalton developers)
has found an odd behaviour of OMPI 1.6.5 on our x86 clusters and has
managed to get a small reproducer - a modified version of the
ubiquitous F90 "hello world" MPI program.

We find that if we run this program (compiled with either Intel or GCC)
after doing "ulimit -v $((1*1024*1024))" to simulate the default 1GB
memory limit for jobs under Slurm, we get odd behaviour that differs
between the two compilers.
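
Concretely the steps are along these lines (the exact launch line is
from memory, but a single-task run is enough to trigger it):

$ ulimit -v $((1*1024*1024))      # 1048576 kB = 1GB of virtual memory, as per our Slurm default
$ mpirun -np 1 ./gnumyhello_f90   # the GCC build of the attached source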

With the Intel compilers it appears to just hang, but running it under
strace shows it looping, constantly SEGV'ing.
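
To see that I was doing roughly the following (the Intel build's binary
name here is just illustrative):

$ strace -f mpirun -np 1 ./intelmyhello_f90   # -f so strace follows mpirun's fork into the MPI process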

With RHEL 6.4 gfortran it instead SEGVs straight away and gives a
stack trace:

 Hello, world, I am            0  of            1
[barcoo:27489] *** Process received signal ***
[barcoo:27489] Signal: Segmentation fault (11)
[barcoo:27489] Signal code: Address not mapped (1)
[barcoo:27489] Failing at address: 0x2008e5708
[barcoo:27489] [ 0] /lib64/libpthread.so.0() [0x3f7b60f500]
[barcoo:27489] [ 1] /usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x982) [0x7f83caff6dd2]
[barcoo:27489] [ 2] /usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x52) [0x7f83caff7f42]
[barcoo:27489] [ 3] ./gnumyhello_f90(MAIN__+0x146) [0x400f6a]
[barcoo:27489] [ 4] ./gnumyhello_f90(main+0x2a) [0x4011ea]
[barcoo:27489] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3f7b21ecdd]
[barcoo:27489] [ 6] ./gnumyhello_f90() [0x400d69]
[barcoo:27489] *** End of error message ***


If I let it generate a core file, "bt" in gdb tells me:

(gdb) bt
#0  sYSMALLOc (av=0xffffffffffffefd0, bytes=<value optimized out>) at malloc.c:3240
#1  opal_memory_ptmalloc2_int_malloc (av=0xffffffffffffefd0, bytes=<value optimized out>) at malloc.c:4328
#2  0x00007f83caff7f42 in opal_memory_ptmalloc2_malloc (bytes=8560000000) at malloc.c:3433
#3  0x0000000000400f6a in main () at gnumyhello_f90.f90:26
#4  0x00000000004011ea in main ()
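
For anyone wanting to repeat that, the core was obtained in the usual
way, roughly:

$ ulimit -c unlimited              # in addition to the 1GB -v limit above
$ mpirun -np 1 ./gnumyhello_f90    # SEGVs and dumps core
$ gdb ./gnumyhello_f90 core.<pid>  # actual core file name depends on kernel.core_pattern
(gdb) bt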


I've attached his reproducer program; I compiled it with just:

mpif90 -g -o ./gnumyhello_f90 gnumyhello_f90.f90

We've reproduced it on two different Intel clusters (both RHEL 6.4,
one Nehalem and one SandyBridge), so I'd be really interested to know
whether this is a bug - the allocate() uses stat=, so a failed
allocation should surely just return a non-zero status rather than
SEGV inside ptmalloc2?

Thanks!
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
!
! Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
!                         University Research and Technology
!                         Corporation.  All rights reserved.
! Copyright (c) 2004-2005 The Regents of the University of California.
!                         All rights reserved.
! Copyright (c) 2006      Cisco Systems, Inc.  All rights reserved.
!
! Sample MPI "hello world" application in Fortran 90
!
program main
    use mpi
    implicit none
    integer :: ierr, rank, size

    ! Request 1070e6 doubles (~8.56 GB), well over the 1GB ulimit, so the
    ! allocate() below should fail and return a non-zero stat rather than crash.
    !integer, parameter :: WRKMEM=1050*10**6
    integer, parameter :: WRKMEM=1070*10**6

    real (kind(0.d0)), allocatable, dimension(:) :: work

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
    print *, "Hello, world, I am ", rank, " of ", size

    allocate(work(WRKMEM),stat=ierr)
    if (ierr .eq. 0) then
       print *, "Task ", rank, " successfully allocated ", &
                    (8.d0*WRKMEM/(1024**3)), "GB"
       deallocate(work)
    else
       print *, "Task ", rank, " failed to allocate ", &
                    (8.d0*WRKMEM/(1024**3)), "GB"
    end if

    call MPI_FINALIZE(ierr)
end
