I am seeing the Intel test suite tests MPI_Errhandler_fatal_c and MPI_Errhandler_fatal_f fail quite a bit with an OOB failure. I had not seen these tests fail under MTT until the epoch code was added, so I suspect the epoch code might be at fault. Could someone familiar with the epoch changes (Wesley) take a look at this failure?

Note that this fails intermittently, but it fails for me more often than not. Attached is a log file of a run that succeeds, followed by a failing run. The messages of concern are those involving mca_oob_tcp_msg_recv and below.
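For reference, here is roughly what the test exercises, judging from the messages in the log (this is only a minimal sketch, not the actual Intel test source): dup MPI_COMM_WORLD, then MPI_Send to an out-of-range rank so the default MPI_ERRORS_ARE_FATAL handler aborts the job after the PASSED message has been printed.

  /* Minimal sketch (assumed, not the Intel test source): trigger
   * MPI_ERRORS_ARE_FATAL on a dup'd communicator via an invalid rank. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size, buf = 42;
      MPI_Comm dup_comm;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);

      printf("MPITEST info  (%d): expecting abort from invalid-rank send\n", rank);
      /* The default error handler is MPI_ERRORS_ARE_FATAL, so this send
       * should fail with MPI_ERR_RANK and abort every process in the job. */
      MPI_Send(&buf, 1, MPI_INT, size + 1, 0, dup_comm);

      /* Not reached. */
      MPI_Comm_free(&dup_comm);
      MPI_Finalize();
      return 0;
  }

Both runs below show the PASSED line and the MPI_ERRORS_ARE_FATAL abort; only the failing run then shows the mca_oob_tcp_msg_recv and ORTE_ERROR_LOG messages during the teardown.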

thanks,

--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com



Script started on Thu Aug 18 09:15:10 2011
 burl-ct-x4150-1 101 =>mpirun -np 4 --mca btl tcp,self --mca coll_sm_priority 100 -- `pwd`/src/MPI_Errhandler_fatal_c

MPITEST info  (0): Starting MPI_Errhandler_fatal test
MPITEST info  (0): This test will abort after printing the results message
MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
[burl-ct-x4150-1:26951] *** An error occurred in MPI_Send
[burl-ct-x4150-1:26951] *** reported by process [2470772737,3]
[burl-ct-x4150-1:26951] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[burl-ct-x4150-1:26951] *** MPI_ERR_RANK: invalid rank
[burl-ct-x4150-1:26951] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[burl-ct-x4150-1:26951] ***    and potentially your MPI job)
[burl-ct-x4150-1:26945] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[burl-ct-x4150-1:26945] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
 burl-ct-x4150-1 102 =>mpirun -np 4 --mca btl tcp,self --mca coll_sm_priority 100 -- `pwd`/src/MPI_Errhandler_fatal_c

MPITEST info  (0): Starting MPI_Errhandler_fatal test
MPITEST info  (0): This test will abort after printing the results message
MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
[burl-ct-x4150-1:26955] *** An error occurred in MPI_Send
[burl-ct-x4150-1:26955] *** reported by process [2471231489,0]
[burl-ct-x4150-1:26955] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[burl-ct-x4150-1:26955] *** MPI_ERR_RANK: invalid rank
[burl-ct-x4150-1:26955] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[burl-ct-x4150-1:26955] ***    and potentially your MPI job)
[burl-ct-x4150-1:26952] [[37708,0],0,0]-[[37708,1],3,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (131)
[burl-ct-x4150-1:26952] [[37708,0],0,0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 149
[burl-ct-x4150-1:26952] [[37708,0],0,0] attempted to send to [[37708,1],1,0]: tag 38
[burl-ct-x4150-1:26952] [[37708,0],0,0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../orte/mca/odls/base/odls_base_default_fns.c at line 2345
[burl-ct-x4150-1:26952] [[37708,0],0,0]-[[37708,1],0,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (131)
[burl-ct-x4150-1:26952] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[burl-ct-x4150-1:26952] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
 burl-ct-x4150-1 103 =>^Dexit

script done on Thu Aug 18 09:15:57 2011
