General information:
------------------------------------
3-node Opteron cluster, 24 CPUs, Mellanox InfiniBand 10Gb interconnect
Debian Lenny 5.0
self-built kernel from kernel.org: 2.6.32.12, all NFS functions
enabled on the kernel side
self-built nfs-utils 1.2.2 from the Debian sid sources: nfs-kernel-server,
nfs-common
NFS server with working lockd
fcntl() file locking is available on all NFS clients, tested with the
attached Perl script
Open MPI 1.4.2 (built with GCC 4.3.2)
configure options:
./configure --prefix=/opt/openMPI_gnu_4.3.2 --sysconfdir=/etc
--localstatedir=/var --with-libnuma=/usr --with-libnuma-libdir=/usr/lib
--enable-mpirun-prefix-by-default --enable-sparse-groups --enable-static
--enable-cxx-exceptions --with-wrapper-cflags='-O3 -march=opteron'
--with-wrapper-cxxflags='-O3 -march=opteron' --with-wrapper-fflags='-O3
-march=opteron' --with-wrapper-fcflags='-O3 -march=opteron'
--with-openib --with-gnu-ld CFLAGS='-O3 -march=opteron' CXXFLAGS='-O3
-march=opteron' FFLAGS='-O3 -march=opteron' FCFLAGS='-O3 -march=opteron'
=======================================================================================
Dear Open MPI developers,
I've found a bug in the current stable release of Open MPI 1.4.2 that is
related to the MPI_FILE_WRITE function when run on an NFSv3 cross-mount.
I've attached a small Fortran code snippet (testmpi.f) which uses
MPI_FILE_WRITE to create a file "test.dat" containing {1,2,3,4,5,6} as
binary MPI_REALs, with every MPI rank writing its values at the
displacement corresponding to that rank.
When I execute this code on a GlusterFS share, everything works like a
charm, no problems at all.
The problem is that when I compile and execute this program for two
nodes on an NFS cross-mount with Open MPI, I get the following MPI error:
[ppclus02:23440] *** An error occurred in MPI_Bcast
[ppclus02:23440] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[ppclus02:23440] *** MPI_ERR_TRUNCATE: message truncated
[ppclus02:23440] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpiexec has exited due to process rank 1 with PID 23440 on
node 192.168.11.2 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
My first educated guess was that my NFS cross-mounts were not able to
use fcntl() to lock the file as needed by MPI_FILE_WRITE. So I tried the
following Perl script (lock.pl). The result: fcntl() and NFS file
locking work fine.
For comparison, I also tried the recent unstable version of MPICH2
v1.3a2 on the same NFS cross-mount. With MPICH2 the same program also
works without any problems on NFSv3.
Thanks for your help.
Best regards,
Oliver Deppert
lock.pl (to test NFS fcntl() file locking)
-----------------------------------------------------------------------------------------------------------------------------------------------------
#!/usr/bin/perl
use Fcntl;
my $fn = "locktest.lock";
open FH, ">$fn" or die "Cannot open $fn: $!";
print "Testing fcntl...\n";
# struct flock fields: l_type, l_whence, l_start, l_len, l_pid
# exclusive write lock on the entire file (pack layout is platform-dependent)
@list = (F_WRLCK,0,0,0,0);
$struct = pack("SSLLL",@list);
fcntl(FH,F_SETLKW,$struct) or die("cannot lock because: $!\n");
print "fcntl() lock acquired\n";
------------------------------------------------------------------------------------------------------------------------------------------------------
testmpi.f (Fortran 90 code snippet to test MPI_FILE_WRITE on NFSv3)
-----------------------------------------------------------------------------------------------------------------------------------------------------
      program WRITE_FILE
      implicit none
      include 'mpif.h'
      integer info,pec
      integer npe,mpe,mtag
      integer :: realsize,file,displace,displaceloc
      integer(kind=MPI_OFFSET_KIND) :: disp
      integer :: status(MPI_STATUS_SIZE)
      real(kind=4) :: locidx(6)
c     INITIALIZATION
      call MPI_INIT(info)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,npe,info)
      call MPI_COMM_RANK(MPI_COMM_WORLD,mpe,info)
c     routine
      mtag=123
      displace=6
!     send data offset to every other rank
      do pec=0,mpe-1
         CALL MPI_SEND(displace,1,MPI_INTEGER,
     &        pec,mtag,MPI_COMM_WORLD,info)
      enddo
      do pec=mpe+1,npe-1
         CALL MPI_SEND(displace,1,MPI_INTEGER,
     &        pec,mtag,MPI_COMM_WORLD,info)
      enddo
      displaceloc=0
!     get data offset from all lower-numbered ranks
      do pec=0,mpe-1
         CALL MPI_RECV(displace,1,MPI_INTEGER,pec,mtag,
     &        MPI_COMM_WORLD,status,info)
         displaceloc=displaceloc+displace
      enddo
      CALL MPI_TYPE_EXTENT(MPI_REAL,realsize,info)
      disp=displaceloc*realsize
!     open file
      CALL MPI_FILE_OPEN(MPI_COMM_WORLD,'test.dat',
     &     MPI_MODE_WRONLY+MPI_MODE_CREATE,MPI_INFO_NULL,file,info)
!     set file view (displacement in bytes)
      CALL MPI_FILE_SET_VIEW(file,disp,MPI_REAL,
     &     MPI_REAL,'native',MPI_INFO_NULL,info)
!     write out data
      locidx(1)=1
      locidx(2)=2
      locidx(3)=3
      locidx(4)=4
      locidx(5)=5
      locidx(6)=6
      CALL MPI_FILE_WRITE(file,locidx,6,MPI_REAL,
     &     status,info)
!     wait until all processes are done:
!     sync-barrier-sync is recommended by the MPI Forum to guarantee
!     file consistency
!     http://www.mpi-forum.org/docs/mpi-20-html/node215.htm (2010)
      call MPI_FILE_SYNC(file,info)
      call MPI_BARRIER(MPI_COMM_WORLD,info)
      CALL MPI_FILE_SYNC(file,info)
!     close file
      call MPI_FILE_CLOSE(file,info)
      call MPI_FINALIZE(info)
      stop
      end
------------------------------------------------------------------------------------------------------------------------------------------------------