Hi,

I was trying to use non-blocking MPI I/O and to my surprise, MPI_File_iwrite() is *blocking*. Please see the attached iwrite-test.c and run it (mpiexec -n 1 or standalone). This is what I get:

MPI_File_iwrite: 10.706 s
MPI_Wait: 0.000 s

I take that to mean that MPI_File_iwrite() blocks until the write is complete, and MPI_Wait() has nothing to do and returns right away. I _was_ expecting the iwrite to return immediately, so I can crunch numbers in the meantime. It doesn't, so this non-blocking API gains me nothing.
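To make that reading of the two timings explicit, here is a minimal sketch in plain C. It only does the arithmetic on measured durations and needs no MPI. The names init_share and looks_blocking, and the idea of a cut-off threshold, are hypothetical helpers introduced for illustration; they are not part of MPI or of the attached test.

```c
#include <assert.h>

/* Hypothetical helper (not an MPI call): fraction of the total elapsed
 * time that was spent inside the initiating call (e.g. MPI_File_iwrite),
 * given the time spent there and the time spent in the later wait. */
double init_share(double t_init, double t_wait)
{
    double total = t_init + t_wait;
    if (total <= 0.0)
        return 0.0;            /* nothing measured; avoid dividing by zero */
    return t_init / total;
}

/* Hypothetical helper: returns 1 if the initiating call took at least
 * `threshold` (an assumed cut-off in (0, 1]) of the total time, i.e. the
 * "non-blocking" call left almost nothing for the wait to do. */
int looks_blocking(double t_init, double t_wait, double threshold)
{
    assert(threshold > 0.0 && threshold <= 1.0);
    return init_share(t_init, t_wait) >= threshold;
}
```

With the timings quoted above, init_share(10.706, 0.000) is 1.0: the whole transfer happened inside the initiating call, so there was no window in which computation could have overlapped with the write. Any cut-off such as 0.9 is an arbitrary choice for this sketch.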
We tried to see what the standard calls for, but came up with different ways of understanding the semantics of MPI's asynchronous file I/O APIs; a blocking iwrite might or might not be compliant with the specs.
We can reproduce this behavior on OpenMPI 1.3.3 (as well as SunMPI 8.2) and IntelMPI 3.2, on a NetApp file server and on a Lustre setup. In every case we see the expected throughput, but MPI_File_iwrite() blocks for the duration of the write and MPI_Wait() returns immediately.
Please see the attached ompi_info_dump.txt for our environment. I couldn't scare up a config.log just now; it probably never survived beyond the deployment.
What can we do to get non-blocking MPI I/O to work as expected?

Thanks,
Christoph Rackwitz
Student Assistant
High Performance Computing Group
Center for Computing and Communication
RWTH Aachen University

Attachment: ompi_info_dump.txt.bz2

#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

char *outfile = "./foo.dat"; // anywhere is fine

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int myrank, nranks;
    char *buf;
    double t0, t1, t2, dt;

    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    MPI_File fh;
    MPI_Request request;

    puts("MPI_File_open() ...");
    MPI_File_open(
        MPI_COMM_WORLD, //MPI_COMM_SELF, // makes no difference
        outfile,
        MPI_MODE_RDWR | MPI_MODE_CREATE,
        MPI_INFO_NULL,
        &fh);

#define BUFSIZE (((long long)1)<<30)
    buf = malloc(BUFSIZE);

    t0 = MPI_Wtime();

    puts("MPI_File_iwrite() ...");
    assert(MPI_SUCCESS == MPI_File_iwrite(
        fh,
        buf,
        BUFSIZE,
        MPI_BYTE,
        &request
    ));

    t1 = MPI_Wtime();

    puts("MPI_Wait() ...");
    MPI_Wait(&request, MPI_STATUS_IGNORE);

    t2 = MPI_Wtime();

    puts("MPI_File_close() ...");
    MPI_File_close(&fh);

    puts("MPI_File_delete() ...");
    MPI_File_delete(outfile, MPI_INFO_NULL);

    free(buf);

    puts("");

    dt = t1-t0;
    printf("MPI_File_iwrite: %.3f s\n", dt);

    dt = t2-t1;
    printf("MPI_Wait: %.3f s\n", dt);

    dt = t2-t0;
    printf("total time: %.3f s\n", dt);
    printf("throughput: %.3f MB/s\n", BUFSIZE / dt / 1e6);

    MPI_Finalize();

    return 0;
}