Hi,

I was trying to use non-blocking MPI I/O and, to my surprise, MPI_File_iwrite() *blocks*. Please see the attached iwrite-test.c and run it (with mpiexec -n 1 or standalone). It writes a 1 GiB buffer and times the iwrite and the wait separately. This is what I get:

MPI_File_iwrite:  10.706 s
MPI_Wait:         0.000 s

I take that to mean that MPI_File_iwrite() blocks until the write is complete, leaving MPI_Wait() with nothing to do, so it returns right away. I _was_ expecting the iwrite to return immediately so that I could crunch numbers in the meantime. It doesn't, so this non-blocking API gains me nothing.
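To put numbers on that reading, here is a small sketch in plain C (no MPI needed). The function names issue_fraction, overlap_saving and report are made up for illustration. It assumes a best case in which whatever part of the write is still pending after the initiating call really does progress in the background; an implementation may not guarantee that.

```c
#include <assert.h>
#include <stdio.h>

/* Fraction of the I/O time spent inside the initiating call (0.0 to 1.0).
 * A value near 1.0 means the call effectively blocked. */
double issue_fraction(double issue_s, double wait_s)
{
    assert(issue_s >= 0.0 && wait_s >= 0.0);
    double total = issue_s + wait_s;
    return total > 0.0 ? issue_s / total : 0.0;
}

/* Best-case time saved by computing between the initiating call and the
 * wait, compared with doing the I/O and the computation back to back.
 * Only the part of the I/O that would otherwise be spent in the wait can
 * overlap, so the saving is min(wait, compute).
 * Assumption: the pending I/O progresses while the application computes. */
double overlap_saving(double wait_s, double compute_s)
{
    assert(wait_s >= 0.0 && compute_s >= 0.0);
    return wait_s < compute_s ? wait_s : compute_s;
}

/* Hypothetical helper: print both figures for one measurement. */
void report(double issue_s, double wait_s, double compute_s)
{
    printf("time in initiating call: %.1f %%\n",
           100.0 * issue_fraction(issue_s, wait_s));
    printf("best-case saving:        %.3f s\n",
           overlap_saving(wait_s, compute_s));
}
```

Calling report(10.706, 0.000, 5.0), with 5 s standing in for a hypothetical compute phase, prints 100.0 % and a saving of 0.000 s: with the timings above, no amount of computation placed between the iwrite and the wait can be hidden behind the write.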

We checked what the standard calls for, but came away with conflicting readings of the semantics of MPI's non-blocking file I/O routines; we could not settle whether an iwrite that blocks is compliant with the specs.

We can reproduce this behavior with Open MPI 1.3.3 (as well as SunMPI 8.2) and Intel MPI 3.2, on a NetApp file server and on a Lustre setup. In every case the throughput is as expected, but MPI_File_iwrite() blocks and MPI_Wait() returns immediately.

Please see the attached ompi_info_dump.txt for our environment. I couldn't scare up a config.log just now; it probably never survived beyond the deployment.

What can we do to get non-blocking MPI I/O to work as expected?


Thanks,


Christoph Rackwitz
Student Assistant
High Performance Computing Group
Center for Computing and Communication
RWTH Aachen University

Attachment: ompi_info_dump.txt.bz2
Description: Binary data

#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

char *outfile = "./foo.dat"; // anywhere is fine

int main(int argc, char **argv)
{
        MPI_Init(&argc, &argv);
        int myrank, nranks;
        char *buf;
        double t0, t1, t2, dt;

        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        MPI_File fh;
        MPI_Request request;

        puts("MPI_File_open() ...");
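        // File operations default to MPI_ERRORS_RETURN, so check the result explicitly.
        if (MPI_File_open(
                MPI_COMM_WORLD,
                //MPI_COMM_SELF, // makes no difference
                outfile,
                MPI_MODE_RDWR | MPI_MODE_CREATE,
                MPI_INFO_NULL,
                &fh) != MPI_SUCCESS) {
                fprintf(stderr, "cannot open %s\n", outfile);
                MPI_Abort(MPI_COMM_WORLD, 1);
        }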

#define BUFSIZE (((long long)1)<<30)

        buf = malloc(BUFSIZE);
        if (buf == NULL) {
                fputs("malloc() failed\n", stderr);
                MPI_Abort(MPI_COMM_WORLD, 1);
        }

        t0 = MPI_Wtime();

        puts("MPI_File_iwrite() ...");
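        // Keep the call outside assert(): with -DNDEBUG the assert, and the
        // write along with it, would be compiled out.
        int rc = MPI_File_iwrite(
                fh,
                buf,
                (int)BUFSIZE, // count is an int; 1<<30 still fits
                MPI_BYTE,
                &request
        );
        assert(rc == MPI_SUCCESS);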

        t1 = MPI_Wtime();

        puts("MPI_Wait() ...");
        MPI_Wait(&request, MPI_STATUS_IGNORE);

        t2 = MPI_Wtime();

        puts("MPI_File_close() ...");
        MPI_File_close(&fh);

        puts("MPI_File_delete() ...");
        if (myrank == 0) // delete once; other ranks would fail on a missing file
                MPI_File_delete(outfile, MPI_INFO_NULL);

        free(buf);

        puts("");

        dt = t1-t0;
        printf("MPI_File_iwrite:  %.3f s\n", dt);
        dt = t2-t1;
        printf("MPI_Wait:         %.3f s\n", dt);
        dt = t2-t0;
        printf("total time:       %.3f s\n", dt);
        printf("throughput:       %.3f MB/s\n", BUFSIZE / dt / 1e6);

        MPI_Finalize();

        return 0;
}
