On Sep 26, 2008, at 1:45 PM, Robert Kubrick wrote:
I'm not sure how I should interpret this message:
[local:17344] *** An error occurred in MPI_Testsome
[local:17344] *** on communicator MPI COMMUNICATOR 5 CREATE FROM 0
[local:17344] *** MPI_ERR_TRUNCATE: message truncated
[local:17344] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpiexec noticed that job rank 0 with PID 17338 on node local exited
on signal 15 (Terminated).
3 additional processes aborted (not shown)
I am assuming that the error was triggered because one of the
buffers I set in the MPI_Recv_init() calls cannot contain the
incoming message.
Sorry for the delay in replying.
This is likely the cause -- MPI defines receiving a message that is
larger than the posted receive buffer as a run-time error
(MPI_ERR_TRUNCATE).
However, I don't understand why job rank 0 terminates first. The
only process that calls MPI_Testsome is actually rank 3, and it's
receiving messages from rank 0.
The aborting process sends a message to kill all the other processes
in the job before it dies itself (i.e., to obey the semantics of an
MPI abort). Hence, it's likely that there's a race going on here and
process 0 dies before 3, so mpirun reports that first.
Also I think it would be a good idea to print the message tag in the
error log.
Mm. Good point. I'll file this as a feature request -- we have
centralized error reporting for the abort sequence, so it'll take a
little noodling to get that in there. Probably won't happen for
v1.3[.0], but that's good real-world feedback to have. Thanks!
--
Jeff Squyres
Cisco Systems