Hi folks,
One of our users (oh, OK, our director, one of the Dalton developers)
has found an odd behaviour of OMPI 1.6.5 on our x86 clusters and has
managed to get a small reproducer - a modified version of the
ubiquitous F90 "hello world" MPI program.
We find that if we run this program (compiled with either the Intel or
GNU compilers) after doing "ulimit -v $((1*1024*1024))" to simulate the
default 1GB memory limit for jobs under Slurm, we get odd, and
compiler-dependent, behaviour.
With the Intel compilers it appears to just hang, but if I run it under
strace I see it constantly SEGV'ing in a loop.
With RHEL 6.4 gfortran it instead SEGV's straight away and gives a
stack trace:
Hello, world, I am 0 of 1
[barcoo:27489] *** Process received signal ***
[barcoo:27489] Signal: Segmentation fault (11)
[barcoo:27489] Signal code: Address not mapped (1)
[barcoo:27489] Failing at address: 0x2008e5708
[barcoo:27489] [ 0] /lib64/libpthread.so.0() [0x3f7b60f500]
[barcoo:27489] [ 1]
/usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x982)
[0x7f83caff6dd2]
[barcoo:27489] [ 2]
/usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x52)
[0x7f83caff7f42]
[barcoo:27489] [ 3] ./gnumyhello_f90(MAIN__+0x146) [0x400f6a]
[barcoo:27489] [ 4] ./gnumyhello_f90(main+0x2a) [0x4011ea]
[barcoo:27489] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3f7b21ecdd]
[barcoo:27489] [ 6] ./gnumyhello_f90() [0x400d69]
[barcoo:27489] *** End of error message ***
If I let it generate a core file, gdb's "bt" tells me:
(gdb) bt
#0 sYSMALLOc (av=0xffffffffffffefd0, bytes=<value optimized out>) at
malloc.c:3240
#1 opal_memory_ptmalloc2_int_malloc (av=0xffffffffffffefd0, bytes=<value
optimized out>) at malloc.c:4328
#2 0x00007f83caff7f42 in opal_memory_ptmalloc2_malloc (bytes=8560000000) at
malloc.c:3433
#3 0x0000000000400f6a in main () at gnumyhello_f90.f90:26
#4 0x00000000004011ea in main ()
I've attached his reproducer program; I compiled it with just:
mpif90 -g -o ./gnumyhello_f90 gnumyhello_f90.f90
We've reproduced it on two different Intel clusters (both RHEL 6.4,
one Nehalem and one SandyBridge), so I'd be really interested to
know whether this is a bug.
Thanks!
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected] Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
!
! Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
! University Research and Technology
! Corporation. All rights reserved.
! Copyright (c) 2004-2005 The Regents of the University of California.
! All rights reserved.
! Copyright (c) 2006 Cisco Systems, Inc. All rights reserved.
!
! Sample MPI "hello world" application in Fortran 90
!
program main
  use mpi
  implicit none

  integer :: ierr, rank, size
  !integer, parameter :: WRKMEM=1050*10**6
  integer, parameter :: WRKMEM=1070*10**6
  real (kind(0.d0)), allocatable, dimension(:) :: work

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)

  print *, "Hello, world, I am ", rank, " of ", size

  allocate(work(WRKMEM), stat=ierr)
  if (ierr .eq. 0) then
     print *, "Task ", rank, " successfully allocated ", &
          (8.d0*WRKMEM/(1024**3)), "GB"
     deallocate(work)
  else
     print *, "Task ", rank, " failed to allocate ", &
          (8.d0*WRKMEM/(1024**3)), "GB"
  end if

  call MPI_FINALIZE(ierr)
end program main