Greetings,

Attached is a small test Fortran program that triggers a failure in mpi_waitall. The problem is that after a couple of calls to mpi_startall and mpi_waitall, some of the mpi_requests become corrupted, which causes the next call to mpi_startall to fail. Here is the output from a 2-CPU run.

[44]% mpif90 -g test_ompi.f
[45]% mpirun -np 2 a.out
TEST(A):   0  1 |        2       3       4       5
TEST(B):   0  1 |        2       3       4       5
OUTPUT:   0  1 |      100     100     101     101
TEST(A):   0  2 |        2       3       4       5
TEST(B):   0  2 |   -32766  -32766       4       5
OUTPUT:   0  2 |      200     200     201     201
TEST(A):   1  1 |        2       3       4       5
TEST(B):   1  1 |        2       3       4       5
OUTPUT:   1  1 |      101     101     100     100
TEST(A):   1  2 |        2       3       4       5
TEST(B):   1  2 |   -32766  -32766       4       5
OUTPUT:   1  2 |      201     201     200     200
^Cmpirun: killing job...

The "-32766" values show up in the mpi_request array after the second call to mpi_waitall. Using print statements in the Open MPI code I have tracked the problem to

ompi/request/req_wait.c:ompi_request_wait_all().

I find that upon entry to ompi_request_wait_all() the values of request[:]->req_f_to_c_index are valid. However, upon exit from ompi_request_wait_all() the first two entries of request[:]->req_f_to_c_index have the value -32766.

I am testing with Open MPI version 1.2b2. The problem occurs on both x86_64 and Intel i386, and with both the Portland Group compilers and GCC/G95.
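For anyone who wants to reproduce the pattern without the attachment, the failing sequence (persistent requests built once, then repeatedly started with mpi_startall and completed with mpi_waitall) looks roughly like this. This is only a minimal sketch of the pattern, not the attached test_ompi.f itself; the buffer names, counts, and tag are placeholders:

```fortran
      program test_persist
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 4
      integer :: req(2), ierr, rank, nprocs, other, iter
      integer :: stats(MPI_STATUS_SIZE, 2)
      integer :: sbuf(n), rbuf(n)

      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
      other = mod(rank + 1, nprocs)

c     Build the persistent requests once ...
      call mpi_send_init(sbuf, n, MPI_INTEGER, other, 0,
     &                   MPI_COMM_WORLD, req(1), ierr)
      call mpi_recv_init(rbuf, n, MPI_INTEGER, other, 0,
     &                   MPI_COMM_WORLD, req(2), ierr)

c     ... then start and complete them repeatedly.  In the failing
c     runs, req(:) is corrupted after the second mpi_waitall.
      do iter = 1, 4
         sbuf = 100*rank + iter
         call mpi_startall(2, req, ierr)
         call mpi_waitall(2, req, stats, ierr)
      end do

      call mpi_request_free(req(1), ierr)
      call mpi_request_free(req(2), ierr)
      call mpi_finalize(ierr)
      end
```

Per the MPI standard, mpi_waitall must leave persistent requests inactive but valid so they can be passed to mpi_startall again, which is why the corrupted req_f_to_c_index values break the next iteration.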

Cheers,
Tim Campbell
Naval Research Laboratory

Attachment: test_ompi.f.gz
Description: GNU Zip compressed data
