> Thanks, George.
> Are persistent send/receives matched from the start of the calculation? If
> so, then I guess MPI_CANCEL won’t work.

A persistent request is only matched once it is started. Calling MPI_Cancel
on a persistent receive doesn't affect the persistent request itself; it
only cancels the started instance of the request.

>  I don’t think Open MPI is the problem. I think there is something wrong
> with our cluster in that it just seems to hang up on these big packages.
> The calculation successfully exchanges hundreds or thousands before just
> hanging.

While possible, it is highly unlikely that a message gets dropped by the
network without some kind of warning (in the system log at least). You might
want to take a look at the dmesg output to see if anything unexpected shows
up there.

>  I’m not sure I understand completely your recommendation for dumping
> diagnostics. Is this documented somewhere?

Unfortunately not. This is basically a developer trick to dump the state of
the MPI library, and it goes roughly like this. Once you have attached a
debugger to your process (let's assume gdb), you need to find the
communicator on which you posted your requests (I can't help here, as this
is not part of the code you sent). With <communicator_index> set to that
communicator's Fortran (integer) handle:

gdb$ p ompi_comm_f_to_c_table.addr[<communicator_index>]

will give you the C pointer of the communicator.

gdb$ call mca_pml.pml_dump(ompi_comm_f_to_c_table.addr[<communicator_index>], 1)

should print all the messages known locally to the MPI library, including
pending sends and receives. It also prints additional information (the
status of the requests, the tag, the size, and so on) that the developers
can interpret. If you post the output here, we might be able to provide
additional information on the issue.
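
If it helps, the two gdb commands above can be wrapped in a small shell
helper that prints a ready-to-run batch invocation. This is only a sketch:
the function name pml_dump_cmd is made up for illustration, the pid and
communicator index are whatever applies on your system, and it assumes the
OB1 PML and an Open MPI build whose symbols gdb can resolve.

```shell
# Hypothetical helper (not part of Open MPI): print a gdb batch command that
# attaches to one MPI process and runs the two commands shown above.
#   $1 = pid of the hung MPI process
#   $2 = Fortran handle (integer) of the communicator used for the requests
pml_dump_cmd() {
    for arg in "$1" "$2"; do
        case "$arg" in
            # Reject empty or non-numeric arguments.
            ''|*[!0-9]*)
                echo "usage: pml_dump_cmd <pid> <communicator_index>" >&2
                return 1 ;;
        esac
    done
    # -p attaches to the pid, -batch exits when done, -ex runs one command.
    printf 'gdb -p %s -batch -ex "p ompi_comm_f_to_c_table.addr[%s]" -ex "call mca_pml.pml_dump(ompi_comm_f_to_c_table.addr[%s], 1)"\n' \
        "$1" "$2" "$2"
}
```

Review the printed command first, then run it on the node where the hung
rank lives, for example with eval "$(pml_dump_cmd <pid> <communicator_index>)"
using the real pid and index. Repeating this for both the sending and the
receiving process gives the two sides of the stuck exchange.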


> Kevin,
> In Open MPI we only support cancelling not-yet-matched receives. So you
> can cancel neither sends nor receive requests that have already been matched.
> While the latter are supposed to complete (otherwise they would not have
> been matched), the former are trickier to complete if the corresponding
> receive is never posted.
> To sum this up, the bad news is that there is no way to correctly cancel
> MPI requests without hitting deadlock.
> That being said, I can hardly understand how Open MPI can drop a message.
> There might be something else in here, that is more difficult to spot. We
> do have an internal way to dump all pending (or known) communication.
> Assuming you are using the OB1 PML here is how you dump all known
> communications. Attach to a process and find the communicator pointer (you
> will need to convert between the F90 communicator and the C pointer) and
> then call mca_pml.pml_dump( commptr, 1).
> Also, is it possible to check how one of the more recent versions of Open
> MPI (> 2.1) behaves with your code?
>
> I am running a large computational fluid dynamics code on a Linux cluster
> (CentOS 6.8, Open MPI 1.8.4). The code is written in Fortran and compiled
> with Intel Fortran 16.0.3. The cluster has 36 nodes, each node has two
> sockets, each socket has six cores. I have noticed that the code hangs when
> the size of the packages exchanged using persistent send and receive calls
> becomes large. I cannot say exactly how large, but generally on the order of
> 10 MB. Rather than let the code just hang, I implemented a timing loop
> using MPI_TESTALL. If MPI_TESTALL fails to return successfully after, say,
> 10 minutes, I attempt to MPI_CANCEL the unsuccessful request(s) and
> continue on with the calculation, even if the communication(s) did not
> succeed. It would not necessarily cripple the calculation if a few MPI
> communications were unsuccessful. This is a snippet of code that tests if
> the communications are successful and attempts to cancel if not:
>    FLAG = .FALSE.
>          WRITE(LU_ERR,'(A,A,I6,A,A)') 'Request timed out for MPI process ',MYID,' running on ',PNAME(1:PNAMELEN)
>          DO NNN=1,NREQ
>             write(LU_ERR,*) 'Request ',NNN,' returns from MPI_CANCEL'
>             write(LU_ERR,*) 'Request ',NNN,' returns from MPI_WAIT'
>             write(LU_ERR,*) 'Request ',NNN,' returns from
>          ENDDO
>      ENDIF
>    ENDDO
> The job still hangs, and when I look at the error file, I see that on MPI
> process A, one of the sends has not completed, and on process B, one of the
> receives has not completed. The failed send and failed receive are
> consistent – that is they are matching. What I do not understand is that
> for both the uncompleted send and receive, the code hangs in MPI_WAIT. That
> is, I do not get the printout that says that the process has returned from
> MPI_WAIT. I interpret this to mean that some of the large message has
> been sent or received, but not all. The MPI standard seems a bit vague
> on what is supposed to happen if part of the message simply disappears due
> to some network glitch. These errors occur after hundreds or thousands of
> successful exchanges. They never happen at the same point in the
> calculation. They are random, but they occur only when the messages are
> large (like MBs). When the messages are not large, the code can run for
> days or weeks without errors.
> So why does MPI_WAIT hang? The MPI standard says
> “If a communication is marked for cancellation, then an MPI_Wait call for
> that communication is guaranteed to return, irrespective of the activities
> of other processes (i.e., MPI_Wait behaves as a local function)”
> (https://www.open-mpi.org/doc/v2.0/man3/MPI_Cancel.3.php).
> Could the problem be with my cluster – in that the large message is broken
> up into smaller packets, and one of these packets disappears and there is
> no way to cancel it? That’s really what I am looking for – a way to cancel
> the failed communication but still continue the calculation.
