Daniel,

Thanks for the very precise and clear report. I checked the code and indeed we are missing the OMPI_ADDR macro call in a lot of places in the Fortran interface (though not for point-to-point communications). I will correct the issue shortly, and hopefully the fix will make its way into the next release (1.2.3).
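
Roughly, the change in each affected wrapper amounts to running the buffer argument through that translation before calling into the C layer, the same way the point-to-point wrappers already do. A sketch of the idea (the names below are placeholders for illustration, not the actual Open MPI sources or the committed fix):

#include <mpi.h>

/* Placeholder for the Fortran MPI_BOTTOM sentinel the bindings compare
   against; the real macro lives in ompi/mpi/f77/constants.h. */
static char fortran_bottom_sentinel;

/* Stand-in for OMPI_ADDR: map the Fortran MPI_BOTTOM address to the C
   MPI_BOTTOM, and pass any other address through untouched. */
#define ADDR_XLATE(buf) \
    (((char *)(buf) == &fortran_bottom_sentinel) ? MPI_BOTTOM : (void *)(buf))

/* Sketch of a Fortran MPI_BCAST wrapper with the translation applied. */
void sketch_mpi_bcast_f(char *buffer, MPI_Fint *count, MPI_Fint *datatype,
                        MPI_Fint *root, MPI_Fint *comm, MPI_Fint *ierr)
{
    MPI_Comm     c_comm = MPI_Comm_f2c(*comm);
    MPI_Datatype c_type = MPI_Type_f2c(*datatype);

    *ierr = (MPI_Fint) MPI_Bcast(ADDR_XLATE(buffer),   /* was: buffer */
                                 (int) *count, c_type, (int) *root, c_comm);
}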

 Thanks,
   george.

Daniel Spångberg wrote:
Dear openmpi developers,

I have very recently joined this list in order to submit this report, as recommended in the FAQ, so I have not yet had time to familiarize myself with the etiquette of this list, for which I apologize in advance. If you require any additional information, I am more than happy to provide it.

One of the users of our computer cluster has reported a problem with his simulation code. We have been able to trace it down to a problem with MPI structs when absolute addressing is used for the members of the struct, necessitating the use of MPI_BOTTOM. The problem occurs with MPI_BCAST, but everything works fine with MPI_SEND and MPI_RECV. I dug into the Open MPI code and found that the problem occurs only in Fortran. The reason seems to be that when MPI_BCAST is called, the OMPI_ADDR macro in ompi/mpi/f77/constants.h is never evaluated, so the Fortran MPI_BOTTOM is never turned into the C MPI_BOTTOM. When MPI_SEND and MPI_RECV are used, the OMPI_ADDR macro is evaluated and no problem occurs.

An example of problematic code (tested on two processes):

       program testme
       implicit none
       include 'mpif.h'
       integer ierr,rank,size
       integer btype(1),blen(1),param_batch
       real param
       integer(kind=mpi_address_kind) :: disp(1)
       integer status(MPI_STATUS_SIZE)

       param=0.

       call MPI_INIT(ierr)
       call MPI_GET_ADDRESS(param,disp(1),ierr)
       if (ierr.ne.MPI_SUCCESS) write(*,*) 'MPI_GET_ADDRESS FAILED'
       btype(1)=MPI_REAL
       blen(1)=1
       call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
       call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierr)
       call MPI_TYPE_CREATE_STRUCT(1,blen,disp,btype,param_batch,ierr)
       if (ierr.ne.MPI_SUCCESS) then
         write(*,*) 'MPI_TYPE_CREATE_STRUCT FAILED'
       endif
       call MPI_TYPE_COMMIT(param_batch,ierr)
       if (ierr.ne.MPI_SUCCESS) write(*,*) 'MPI_TYPE_COMMIT FAILED'
       if (rank.eq.0) param=100.
c      if (rank.eq.0) then
c        call MPI_SEND(MPI_BOTTOM,1,
c     x       param_batch,1,0,MPI_COMM_WORLD,ierr)
c      else
c        call MPI_RECV(MPI_BOTTOM,1,
c     x       param_batch,0,0,MPI_COMM_WORLD,status,ierr)
c      endif
       call MPI_BCAST(MPI_BOTTOM,1,param_batch,0,MPI_COMM_WORLD,ierr)
       if (ierr.ne.MPI_SUCCESS) write(*,*) 'MPI_BCAST FAILED'
       write(*,*) 'Rank:',rank,'Size:',size,'Param=',param
       call MPI_Finalize(ierr)

       end program testme

mpirun -np 2 testme_submitted_to_devel
[auchentoshan:21021] *** Process received signal ***
[auchentoshan:21021] Signal: Segmentation fault (11)
[auchentoshan:21021] Signal code: Address not mapped (1)
[auchentoshan:21021] Failing at address: 0x7fc04ffd7c
[auchentoshan:21021] [ 0] /lib64/tls/libpthread.so.0 [0x3d9660c430]
[auchentoshan:21021] [ 1] /lib64/tls/libc.so.6(memcpy+0x60) [0x3d95571ec0]
[auchentoshan:21021] [ 2] /opt/openmpi-1.2.1-gcc4/lib/libmpi.so.0(ompi_convertor_pack+0x164) [0x2a957cf5b4]
[auchentoshan:21021] [ 3] /opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x24c) [0x2a982851dc]
[auchentoshan:21021] [ 4] /opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x217) [0x2a9827fe77]
[auchentoshan:21021] [ 5] /opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_generic+0x354) [0x2a98ac08b4]
[auchentoshan:21021] [ 6] /opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_binomial+0xc8) [0x2a98ac0bd8]
[auchentoshan:21021] [ 7] /opt/openmpi-1.2.1-gcc4/lib/libmpi.so.0(PMPI_Bcast+0x15c) [0x2a957d62ac]
[auchentoshan:21021] [ 8] /opt/openmpi-1.2.1-gcc4/lib/libmpi_f77.so.0(pmpi_bcast_+0x5a) [0x2a9567e99a]
[auchentoshan:21021] [ 9] testme_submitted_to_devel(MAIN__+0x1f8) [0x401080]
[auchentoshan:21021] [10] testme_submitted_to_devel(main+0xe) [0x4011be]
[auchentoshan:21021] [11] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3d9551c3fb]
[auchentoshan:21021] [12] testme_submitted_to_devel [0x400dfa]
[auchentoshan:21021] *** End of error message ***
mpirun noticed that job rank 0 with PID 21021 on node auchentoshan exited on signal 11 (Segmentation fault).
1 additional process aborted (not shown)

We have tested this on Open MPI versions up to and including the latest release, 1.2.2, although the log above is from version 1.2.1. Distribution: Scientific Linux 4.4 (a RHEL4 clone). The problem occurs on both AMD64 and i386, and with both gcc/gfortran and the Portland Group compilers version 7.0-4. Interconnect: GbE.

daniels@auchentoshan:~ > gcc4 --version
gcc4 (GCC) 4.1.0 20060515 (Red Hat 4.1.0-18)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

daniels@auchentoshan:~ > gfortran --version
GNU Fortran 95 (GCC) 4.1.0 20060515 (Red Hat 4.1.0-18)
Copyright (C) 2006 Free Software Foundation, Inc.

GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Fortran
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING

Working C code which does the same thing:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   int rank,size;
   float param=0.;
   MPI_Datatype btype[1],param_batch;
   MPI_Aint disp[1];
   int blen[1];

   MPI_Init(&argc,&argv);
   if (MPI_Get_address(&param,&disp[0])!=MPI_SUCCESS)
     fprintf(stderr,"MPI_Get_address failed.\n");
   btype[0]=MPI_FLOAT;
   blen[0]=1;
   MPI_Comm_rank(MPI_COMM_WORLD,&rank);
   MPI_Comm_size(MPI_COMM_WORLD,&size);
   if (MPI_Type_create_struct(1,blen,disp,btype,&param_batch)!=MPI_SUCCESS)
     fprintf(stderr,"MPI_Type_Create_Struct failed.\n");
   if (MPI_Type_commit(&param_batch)!=MPI_SUCCESS)
     fprintf(stderr,"MPI_Type_Commit failed.\n");
   if (rank==0)
     param=100.;
   if (MPI_Bcast(MPI_BOTTOM,1,param_batch,0,MPI_COMM_WORLD)!=MPI_SUCCESS)
     fprintf(stderr,"MPI_Bcast failed.\n");
   printf("Rank:%d, Size:%d, Param=%f\n",rank,size,param);
   MPI_Finalize();
   return 0;
}


Best regards
Daniel Spångberg

