Greetings,

Attached is a small test Fortran program that triggers a failure in mpi_waitall. The problem is that after a couple of calls to mpi_startall and mpi_waitall, some of the mpi_request handles become corrupted, which causes the next call to mpi_startall to fail.
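For reference, the attached program follows the usual persistent-request pattern sketched below. This is only a rough sketch, not the attached test_ompi.f itself: the buffer sizes, tags, send values, and print labels are placeholders of mine, chosen to mirror the output further down.

c     Sketch only -- not the attached test_ompi.f.  Two ranks (the
c     code assumes exactly two) build four persistent requests once,
c     then reuse them across several MPI_STARTALL/MPI_WAITALL
c     iterations.  The request handles are printed before (A) and
c     after (B) each wait; for persistent requests they should
c     never change.
      program persist_sketch
      implicit none
      include 'mpif.h'
      integer ierr, rank, other, iter, i
      integer reqs(4), stats(MPI_STATUS_SIZE,4)
      integer sbuf1(2), sbuf2(2), rbuf1(2), rbuf2(2)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      other = 1 - rank

c     Build the persistent requests once.
      call MPI_SEND_INIT(sbuf1, 2, MPI_INTEGER, other, 1,
     &                   MPI_COMM_WORLD, reqs(1), ierr)
      call MPI_SEND_INIT(sbuf2, 2, MPI_INTEGER, other, 2,
     &                   MPI_COMM_WORLD, reqs(2), ierr)
      call MPI_RECV_INIT(rbuf1, 2, MPI_INTEGER, other, 1,
     &                   MPI_COMM_WORLD, reqs(3), ierr)
      call MPI_RECV_INIT(rbuf2, 2, MPI_INTEGER, other, 2,
     &                   MPI_COMM_WORLD, reqs(4), ierr)

c     Reuse the same request array on every iteration.
      do iter = 1, 3
         do i = 1, 2
            sbuf1(i) = 100*iter + rank
            sbuf2(i) = 100*iter + rank
         end do
         write(*,*) 'TEST(A):', rank, iter, '|', (reqs(i), i=1,4)
         call MPI_STARTALL(4, reqs, ierr)
         call MPI_WAITALL(4, reqs, stats, ierr)
         write(*,*) 'TEST(B):', rank, iter, '|', (reqs(i), i=1,4)
         write(*,*) 'OUTPUT:', rank, iter, '|', rbuf1, rbuf2
      end do

c     Release the persistent requests before finalizing.
      do i = 1, 4
         call MPI_REQUEST_FREE(reqs(i), ierr)
      end do
      call MPI_FINALIZE(ierr)
      end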
Here is the output from a 2-CPU run:

[44]% mpif90 -g test_ompi.f
[45]% mpirun -np 2 a.out
 TEST(A): 0 1 | 2 3 4 5
 TEST(B): 0 1 | 2 3 4 5
 OUTPUT:  0 1 | 100 100 101 101
 TEST(A): 0 2 | 2 3 4 5
 TEST(B): 0 2 | -32766 -32766 4 5
 OUTPUT:  0 2 | 200 200 201 201
 TEST(A): 1 1 | 2 3 4 5
 TEST(B): 1 1 | 2 3 4 5
 OUTPUT:  1 1 | 101 101 100 100
 TEST(A): 1 2 | 2 3 4 5
 TEST(B): 1 2 | -32766 -32766 4 5
 OUTPUT:  1 2 | 201 201 200 200
^Cmpirun: killing job...

The "-32766" values show up in the mpi_request array after the second call to mpi_waitall. Using prints in the OpenMPI code I have tracked the problem to
ompi/request/req_wait.c:ompi_request_wait_all(). I find that upon entry to ompi_request_wait_all() the values of request[:]->req_f_to_c_index are valid. However, upon exit from ompi_request_wait_all() the first two entries of request[:]->req_f_to_c_index have the value -32766.
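From the Fortran side, the invariant being violated is that persistent request handles must come back unchanged from mpi_waitall. A hypothetical helper like the one below (the subroutine name and interface are mine, not part of the attached test) makes the corruption easy to flag without instrumenting the library:

c     Hypothetical helper: start and wait on a set of persistent
c     requests and report any Fortran handle value that changes
c     across the call.  The -32766 entries above are exactly the
c     kind of change this would flag.
      subroutine start_wait_check(n, reqs)
      implicit none
      include 'mpif.h'
      integer n, reqs(n)
      integer saved(n), stats(MPI_STATUS_SIZE,n)
      integer i, ierr

      do i = 1, n
         saved(i) = reqs(i)
      end do
      call MPI_STARTALL(n, reqs, ierr)
      call MPI_WAITALL(n, reqs, stats, ierr)
      do i = 1, n
         if (reqs(i) .ne. saved(i)) then
            write(*,*) 'request', i, 'changed from', saved(i),
     &                 'to', reqs(i)
         end if
      end do
      end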
I am testing with OpenMPI version 1.2b2. The problem occurs on both x86_64 and Intel i386, and with both the Portland Group compilers and GCC/G95.
Cheers,

Tim Campbell
Naval Research Laboratory
Attachment: test_ompi.f.gz