I am running a large computational fluid dynamics code on a Linux cluster (CentOS 6.8, Open MPI 1.8.4). The code is written in Fortran and compiled with Intel Fortran 16.0.3. The cluster has 36 nodes; each node has two sockets, and each socket has six cores. I have noticed that the code hangs when the messages exchanged using persistent send and receive calls become large. I cannot say exactly how large, but generally on the order of 10 MB. Rather than let the code just hang, I implemented a timing loop using MPI_TESTALL. If MPI_TESTALL fails to return successfully after, say, 10 minutes, I attempt to MPI_CANCEL the unsuccessful request(s) and continue with the calculation, even if the communication(s) did not succeed. It would not necessarily cripple the calculation if a few MPI communications were unsuccessful. This is a snippet of code that tests whether the communications are successful and attempts to cancel them if not:

    START_TIME = MPI_WTIME()
    FLAG = .FALSE.
    DO WHILE(.NOT.FLAG)
       CALL MPI_TESTALL(NREQ,REQ(1:NREQ),FLAG,ARRAY_OF_STATUSES,IERR)
       WAIT_TIME = MPI_WTIME() - START_TIME
       IF (WAIT_TIME>TIMEOUT) THEN
          WRITE(LU_ERR,'(A,A,I6,A,A)') 'Request timed out for MPI process ',MYID,' running on ',PNAME(1:PNAMELEN)
          DO NNN=1,NREQ
             IF (ARRAY_OF_STATUSES(1,NNN)==MPI_SUCCESS) CYCLE
             CALL MPI_CANCEL(REQ(NNN),IERR)
             write(LU_ERR,*) 'Request ',NNN,' returns from MPI_CANCEL'
             CALL MPI_WAIT(REQ(NNN),STATUS,IERR)
             write(LU_ERR,*) 'Request ',NNN,' returns from MPI_WAIT'
             CALL MPI_TEST_CANCELLED(STATUS,FLAG2,IERR)
             write(LU_ERR,*) 'Request ',NNN,' returns from MPI_TEST_CANCELLED'
          ENDDO
       ENDIF
    ENDDO

The job still hangs, and when I look at the error file, I see that on MPI process A, one of the sends has not completed, and on process B, one of the receives has not completed. The failed send and the failed receive are consistent - that is, they match each other.

What I do not understand is that for both the uncompleted send and the uncompleted receive, the code hangs in MPI_WAIT. That is, I do not get the printout that says that the process has returned from MPI_WAIT. I interpret this to mean that some of the large message has been sent or received, but not all of it. The MPI standard seems a bit vague on what is supposed to happen if part of the message simply disappears due to some network glitch.

These errors occur after hundreds or thousands of successful exchanges. They never happen at the same point in the calculation. They are random, but they occur only when the messages are large (on the order of megabytes). When the messages are not large, the code can run for days or weeks without errors.

So why does MPI_WAIT hang?
The MPI standard says "If a communication is marked for cancellation, then an MPI_Wait call for that communication is guaranteed to return, irrespective of the activities of other processes (i.e., MPI_Wait behaves as a local function)" (https://www.open-mpi.org/doc/v2.0/man3/MPI_Cancel.3.php). Could the problem be with my cluster - in that the large message is broken up into smaller packets, and one of these packets disappears and there is no way to cancel it? That's really what I am looking for - a way to cancel the failed communication but still continue the calculation.
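
For reference, the pattern described above (persistent requests, polling with MPI_TESTALL, then MPI_CANCEL after a timeout) can be reduced to a small two-rank sketch. This is not the actual CFD code: the program name, the buffer names SBUF and RBUF, the message size N, and the two-rank layout are all assumptions made for illustration. One deliberate difference from the snippet above: when MPI_TESTALL returns FLAG = .FALSE., the MPI standard leaves the status entries undefined, so this sketch uses a separate MPI_TEST on each request to find the ones that are still incomplete.

```fortran
PROGRAM PERSISTENT_TIMEOUT_SKETCH
   ! Hypothetical two-rank reproducer; run with exactly 2 MPI processes.
   USE MPI
   IMPLICIT NONE
   INTEGER, PARAMETER :: N = 1310720              ! 1310720 doubles = 10 MiB (assumed size)
   INTEGER, PARAMETER :: NREQ = 2
   DOUBLE PRECISION, PARAMETER :: TIMEOUT = 600.D0 ! seconds (the 10 minutes mentioned above)
   DOUBLE PRECISION, ALLOCATABLE :: SBUF(:), RBUF(:)
   INTEGER :: REQ(NREQ), ARRAY_OF_STATUSES(MPI_STATUS_SIZE,NREQ), STATUS(MPI_STATUS_SIZE)
   INTEGER :: IERR, MYID, NPROC, OTHER, NNN
   LOGICAL :: FLAG, DONE, CANCELLED
   DOUBLE PRECISION :: START_TIME

   CALL MPI_INIT(IERR)
   CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYID, IERR)
   CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
   IF (NPROC /= 2) CALL MPI_ABORT(MPI_COMM_WORLD, 1, IERR)
   OTHER = 1 - MYID

   ALLOCATE(SBUF(N), RBUF(N))
   SBUF = DBLE(MYID)

   ! Set up one persistent send and one persistent receive, then start both.
   CALL MPI_SEND_INIT(SBUF, N, MPI_DOUBLE_PRECISION, OTHER, 0, MPI_COMM_WORLD, REQ(1), IERR)
   CALL MPI_RECV_INIT(RBUF, N, MPI_DOUBLE_PRECISION, OTHER, 0, MPI_COMM_WORLD, REQ(2), IERR)
   CALL MPI_STARTALL(NREQ, REQ, IERR)

   START_TIME = MPI_WTIME()
   FLAG = .FALSE.
   DO WHILE (.NOT. FLAG)
      CALL MPI_TESTALL(NREQ, REQ, FLAG, ARRAY_OF_STATUSES, IERR)
      IF (.NOT. FLAG .AND. MPI_WTIME() - START_TIME > TIMEOUT) THEN
         WRITE(*,'(A,I6)') 'Request timed out for MPI process ', MYID
         DO NNN = 1, NREQ
            ! Test each request on its own; a completed persistent request
            ! becomes inactive and is skipped here.
            CALL MPI_TEST(REQ(NNN), DONE, STATUS, IERR)
            IF (DONE) CYCLE
            CALL MPI_CANCEL(REQ(NNN), IERR)
            CALL MPI_WAIT(REQ(NNN), STATUS, IERR)
            CALL MPI_TEST_CANCELLED(STATUS, CANCELLED, IERR)
            WRITE(*,*) 'Request ', NNN, ' cancelled: ', CANCELLED
         ENDDO
         EXIT   ! leave the polling loop once the cancel attempts have returned
      ENDIF
   ENDDO

   ! Persistent requests must be freed explicitly once they are inactive.
   DO NNN = 1, NREQ
      CALL MPI_REQUEST_FREE(REQ(NNN), IERR)
   ENDDO
   DEALLOCATE(SBUF, RBUF)
   CALL MPI_FINALIZE(IERR)
END PROGRAM PERSISTENT_TIMEOUT_SKETCH
```

On a healthy network this sketch should simply complete the exchange and exit; it is only meant to isolate the cancel path so that the hang in MPI_WAIT can be examined outside the full application.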