Re: [OMPI devel] MPI_Errhandler_fatal_c failure

2011-08-18 Thread TERRY DONTJE
Thought I'd throw this out there: I retraced my MTT steps and did find 
that there were failures of this test going back to r24774.  r24775 has a 
comment that looks very relevant.  I am talking to the committer of that 
change now.


Sorry for the false accusation.

--td

On 8/18/2011 2:32 PM, George Bosilca wrote:

Terry,

The test succeeded in both of your runs.

However, I rolled back to before the epoch change (r24814) and the output is the 
following:

MPITEST info  (0): Starting MPI_Errhandler_fatal test
MPITEST info  (0): This test will abort after printing the results message
MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
[dancer.eecs.utk.edu:16098] *** An error occurred in MPI_Send
[dancer.eecs.utk.edu:16098] *** reported by process 
[766095392769,139869904961537]
[dancer.eecs.utk.edu:16098] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[dancer.eecs.utk.edu:16098] *** MPI_ERR_RANK: invalid rank
[dancer.eecs.utk.edu:16098] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
[dancer.eecs.utk.edu:16098] ***and potentially your MPI job)
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
[dancer.eecs.utk.edu:16096] [[24280,0],0]-[[24280,1],3] mca_oob_tcp_msg_recv: 
readv failed: Connection reset by peer (104)
[dancer.eecs.utk.edu:16096] 3 more processes have sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
[dancer.eecs.utk.edu:16096] Set MCA parameter "orte_base_help_aggregate" to 0 
to see all help / error messages

As you can see it is identical to the output in your test.

   george.


On Aug 18, 2011, at 12:29, TERRY DONTJE wrote:


Just ran MPI_Errhandler_fatal_c with r25063 and it still fails.  Everything is the same 
except I don't see the "readv failed.." message.

Have you tried to run this code yourself?  It is pretty simple and fails with 
one node using np=4.

--td

On 8/18/2011 10:57 AM, Wesley Bland wrote:

I just checked in a fix (I hope). I think the problem was that the errmgr
was removing children from the list of odls children without using the
mutex to prevent race conditions. Let me know if the MTT is still having
problems tomorrow.

Wes



I am seeing the Intel test suite tests MPI_Errhandler_fatal_c and
MPI_Errhandler_fatal_f fail with an oob failure quite a bit.  I have not
seen these tests failing under MTT until the epoch code was added, so I
have a suspicion the epoch code might be at fault.  Could someone
familiar with the epoch changes (Wesley) take a look at this failure?

Note this fails intermittently, but it fails for me more often than not.
Attached is a log file of a run that succeeds followed by the failing
run.  The pieces of concern are the messages involving
mca_oob_tcp_msg_recv and below.

thanks,

--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com







--

Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com







--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com





Re: [OMPI devel] MPI_Errhandler_fatal_c failure

2011-08-18 Thread TERRY DONTJE



On 8/18/2011 2:32 PM, George Bosilca wrote:

Terry,

The test succeeded in both of your runs.
Not really.  Granted, the test aborted in both cases; however, the case 
you show below has further issues while ORTE is trying to clean 
things up.  It certainly is not what I would call friendly.  But that is 
beside the point: the issue, IMO, is that ORTE is having trouble with the 
MPI_Errhandler_fatal_c test, and it looks like you have seen the same 
failure prior to the epoch changes.  Fair enough, I'll go back to the 
drawing board and see if I can narrow this down.


--td

However, I rolled back to before the epoch change (r24814) and the output is the 
following:

MPITEST info  (0): Starting MPI_Errhandler_fatal test
MPITEST info  (0): This test will abort after printing the results message
MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
[dancer.eecs.utk.edu:16098] *** An error occurred in MPI_Send
[dancer.eecs.utk.edu:16098] *** reported by process 
[766095392769,139869904961537]
[dancer.eecs.utk.edu:16098] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[dancer.eecs.utk.edu:16098] *** MPI_ERR_RANK: invalid rank
[dancer.eecs.utk.edu:16098] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
[dancer.eecs.utk.edu:16098] ***and potentially your MPI job)
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
[dancer.eecs.utk.edu:16096] [[24280,0],0]-[[24280,1],3] mca_oob_tcp_msg_recv: 
readv failed: Connection reset by peer (104)
[dancer.eecs.utk.edu:16096] 3 more processes have sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
[dancer.eecs.utk.edu:16096] Set MCA parameter "orte_base_help_aggregate" to 0 
to see all help / error messages

As you can see it is identical to the output in your test.

   george.


On Aug 18, 2011, at 12:29, TERRY DONTJE wrote:


Just ran MPI_Errhandler_fatal_c with r25063 and it still fails.  Everything is the same 
except I don't see the "readv failed.." message.

Have you tried to run this code yourself?  It is pretty simple and fails with 
one node using np=4.

--td

On 8/18/2011 10:57 AM, Wesley Bland wrote:

I just checked in a fix (I hope). I think the problem was that the errmgr
was removing children from the list of odls children without using the
mutex to prevent race conditions. Let me know if the MTT is still having
problems tomorrow.

Wes



I am seeing the Intel test suite tests MPI_Errhandler_fatal_c and
MPI_Errhandler_fatal_f fail with an oob failure quite a bit.  I have not
seen these tests failing under MTT until the epoch code was added, so I
have a suspicion the epoch code might be at fault.  Could someone
familiar with the epoch changes (Wesley) take a look at this failure?

Note this fails intermittently, but it fails for me more often than not.
Attached is a log file of a run that succeeds followed by the failing
run.  The pieces of concern are the messages involving
mca_oob_tcp_msg_recv and below.

thanks,

--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com







--

Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com







--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com





Re: [OMPI devel] MPI_Errhandler_fatal_c failure

2011-08-18 Thread George Bosilca
Terry,

The test succeeded in both of your runs.

However, I rolled back to before the epoch change (r24814) and the output is the 
following:

MPITEST info  (0): Starting MPI_Errhandler_fatal test
MPITEST info  (0): This test will abort after printing the results message
MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
[dancer.eecs.utk.edu:16098] *** An error occurred in MPI_Send
[dancer.eecs.utk.edu:16098] *** reported by process 
[766095392769,139869904961537]
[dancer.eecs.utk.edu:16098] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[dancer.eecs.utk.edu:16098] *** MPI_ERR_RANK: invalid rank
[dancer.eecs.utk.edu:16098] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
[dancer.eecs.utk.edu:16098] ***and potentially your MPI job)
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
[dancer.eecs.utk.edu:16096] [[24280,0],0]-[[24280,1],3] mca_oob_tcp_msg_recv: 
readv failed: Connection reset by peer (104)
[dancer.eecs.utk.edu:16096] 3 more processes have sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
[dancer.eecs.utk.edu:16096] Set MCA parameter "orte_base_help_aggregate" to 0 
to see all help / error messages

As you can see it is identical to the output in your test.

  george.


On Aug 18, 2011, at 12:29, TERRY DONTJE wrote:

> Just ran MPI_Errhandler_fatal_c with r25063 and it still fails.  Everything 
> is the same except I don't see the "readv failed.." message.
> 
> Have you tried to run this code yourself?  It is pretty simple and fails 
> with one node using np=4.
> 
> --td
> 
> On 8/18/2011 10:57 AM, Wesley Bland wrote:
>> I just checked in a fix (I hope). I think the problem was that the errmgr
>> was removing children from the list of odls children without using the
>> mutex to prevent race conditions. Let me know if the MTT is still having
>> problems tomorrow.
>> 
>> Wes
>> 
>> 
>>> I am seeing the Intel test suite tests MPI_Errhandler_fatal_c and
>>> MPI_Errhandler_fatal_f fail with an oob failure quite a bit.  I have not
>>> seen these tests failing under MTT until the epoch code was added, so I
>>> have a suspicion the epoch code might be at fault.  Could someone
>>> familiar with the epoch changes (Wesley) take a look at this failure?
>>> 
>>> Note this fails intermittently, but it fails for me more often than not.
>>> Attached is a log file of a run that succeeds followed by the failing
>>> run.  The pieces of concern are the messages involving
>>> mca_oob_tcp_msg_recv and below.
>>> 
>>> thanks,
>>> 
>>> --
>>> Terry D. Dontje | Principal Software Engineer
>>> Developer Tools Engineering | +1.781.442.2631
>>> Oracle - Performance Technologies
>>> 95 Network Drive, Burlington, MA 01803
>>> Email terry.don...@oracle.com
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
> 
> -- 
> 
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
> 
> 
> 




Re: [OMPI devel] MPI_Errhandler_fatal_c failure

2011-08-18 Thread TERRY DONTJE
Just ran MPI_Errhandler_fatal_c with r25063 and it still fails.  
Everything is the same except I don't see the "readv failed.." message.


Have you tried to run this code yourself?  It is pretty simple and 
fails with one node using np=4.
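
(If anyone wants to poke at this without the Intel suite installed: a minimal stand-alone program that exercises the same path, namely an erroneous MPI_Send on a dup of MPI_COMM_WORLD under MPI_ERRORS_ARE_FATAL, is sketched below.  This is only an approximation of what the MPITEST case does, not the actual test source; all names in it are made up for the example.)

/*
 * Minimal sketch (not the Intel/MPITEST source): raise MPI_ERR_RANK on a
 * duplicated communicator under the default MPI_ERRORS_ARE_FATAL handler,
 * so the job is expected to abort at the send.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 42;
    MPI_Comm dup_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);

    /* MPI_ERRORS_ARE_FATAL is the default handler; set it explicitly anyway. */
    MPI_Comm_set_errhandler(dup_comm, MPI_ERRORS_ARE_FATAL);

    if (0 == rank) {
        printf("sending to an invalid rank; the job should now abort\n");
        fflush(stdout);
    }

    /* 'size' is one past the last valid rank, so this send is erroneous. */
    MPI_Send(&token, 1, MPI_INT, size, 0, dup_comm);

    /* Not expected to be reached. */
    MPI_Comm_free(&dup_comm);
    MPI_Finalize();
    return 0;
}

Launched with something like "mpirun -np 4 ./fatal_sketch" on a single node, it should hit the same MPI_ERRORS_ARE_FATAL abort path shown in the attached logs, which makes it easier to watch what the runtime does during cleanup.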


--td

On 8/18/2011 10:57 AM, Wesley Bland wrote:

I just checked in a fix (I hope). I think the problem was that the errmgr
was removing children from the list of odls children without using the
mutex to prevent race conditions. Let me know if the MTT is still having
problems tomorrow.

Wes


I am seeing the Intel test suite tests MPI_Errhandler_fatal_c and
MPI_Errhandler_fatal_f fail with an oob failure quite a bit.  I have not
seen these tests failing under MTT until the epoch code was added, so I
have a suspicion the epoch code might be at fault.  Could someone
familiar with the epoch changes (Wesley) take a look at this failure?

Note this fails intermittently, but it fails for me more often than not.
Attached is a log file of a run that succeeds followed by the failing
run.  The pieces of concern are the messages involving
mca_oob_tcp_msg_recv and below.

thanks,

--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com






--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com





Re: [OMPI devel] MPI_Errhandler_fatal_c failure

2011-08-18 Thread Wesley Bland
I just checked in a fix (I hope). I think the problem was that the errmgr
was removing children from the list of odls children without using the
mutex to prevent race conditions. Let me know if the MTT is still having
problems tomorrow.

Wes
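
(For anyone following along who is not staring at the ORTE sources: the discipline being described is simply that every thread touching the shared children list has to take the same lock.  Below is a self-contained illustration of that pattern using plain pthreads; the struct, function, and variable names are made up for the example and are not the ORTE/OPAL types or the code that was actually committed.)

/*
 * Illustration of the locking idea only -- not the Open MPI/ORTE code.
 * One thread (think "launcher") walks a shared child list while another
 * (think "errmgr") removes entries; every access goes through the same
 * mutex, which closes the kind of race described above.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct child {
    int           pid;
    struct child *next;
};

static struct child   *children = NULL;
static pthread_mutex_t children_lock = PTHREAD_MUTEX_INITIALIZER;

static void add_child(int pid)
{
    struct child *c = malloc(sizeof(*c));
    c->pid = pid;
    pthread_mutex_lock(&children_lock);
    c->next  = children;
    children = c;
    pthread_mutex_unlock(&children_lock);
}

/* The errmgr-style path: remove under the same lock every reader takes. */
static void remove_child(int pid)
{
    pthread_mutex_lock(&children_lock);
    for (struct child **cur = &children; *cur != NULL; cur = &(*cur)->next) {
        if ((*cur)->pid == pid) {
            struct child *dead = *cur;
            *cur = dead->next;
            free(dead);
            break;
        }
    }
    pthread_mutex_unlock(&children_lock);
}

static void *reaper_thread(void *arg)
{
    (void)arg;
    for (int pid = 0; pid < 4; pid++) {
        remove_child(pid);          /* concurrent removals are now safe */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;

    for (int pid = 0; pid < 4; pid++) {
        add_child(pid);
    }
    pthread_create(&t, NULL, reaper_thread, NULL);

    /* Readers hold the same lock, so they never see a half-unlinked node. */
    pthread_mutex_lock(&children_lock);
    for (struct child *c = children; c != NULL; c = c->next) {
        printf("still tracking child %d\n", c->pid);
    }
    pthread_mutex_unlock(&children_lock);

    pthread_join(t, NULL);
    return 0;
}

Built with something like "cc -pthread children.c", the point is just that the add, remove, and traversal paths all serialize on children_lock; the actual fix presumably does the equivalent with the existing ODLS mutex and opal_list routines.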

> I am seeing the Intel test suite tests MPI_Errhandler_fatal_c and
> MPI_Errhandler_fatal_f fail with an oob failure quite a bit.  I have not
> seen these tests failing under MTT until the epoch code was added, so I
> have a suspicion the epoch code might be at fault.  Could someone
> familiar with the epoch changes (Wesley) take a look at this failure?
>
> Note this fails intermittently, but it fails for me more often than not.
> Attached is a log file of a run that succeeds followed by the failing
> run.  The pieces of concern are the messages involving
> mca_oob_tcp_msg_recv and below.
>
> thanks,
>
> --
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
>
>
>
>



[OMPI devel] MPI_Errhandler_fatal_c failure

2011-08-18 Thread TERRY DONTJE
I am seeing the Intel test suite tests MPI_Errhandler_fatal_c and 
MPI_Errhandler_fatal_f fail with an oob failure quite a bit.  I have not 
seen these tests failing under MTT until the epoch code was added, so I 
have a suspicion the epoch code might be at fault.  Could someone 
familiar with the epoch changes (Wesley) take a look at this failure?


Note this fails intermittently, but it fails for me more often than not.  
Attached is a log file of a run that succeeds followed by the failing 
run.  The pieces of concern are the messages involving 
mca_oob_tcp_msg_recv and below.


thanks,

--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com



Script started on Thu Aug 18 09:15:10 2011
 burl-ct-x4150-1 101 =>mpirun -np 4 --mca btl tcp,self --mca coll_sm_priority 100 -- `pwd`/src/MPI_Errhandler_fatal_c

MPITEST info  (0): Starting MPI_Errhandler_fatal test
MPITEST info  (0): This test will abort after printing the results message
MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
[burl-ct-x4150-1:26951] *** An error occurred in MPI_Send
[burl-ct-x4150-1:26951] *** reported by process [2470772737,3]
[burl-ct-x4150-1:26951] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[burl-ct-x4150-1:26951] *** MPI_ERR_RANK: invalid rank
[burl-ct-x4150-1:26951] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
[burl-ct-x4150-1:26951] ***and potentially your MPI job)
[burl-ct-x4150-1:26945] 3 more processes have sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
[burl-ct-x4150-1:26945] Set MCA parameter "orte_base_help_aggregate" to 0 to 
see all help / error messages
 burl-ct-x4150-1 102 =>mpirun -np 4 --mca btl tcp,self --mca coll_sm_priority 100 -- `pwd`/src/MPI_Errhandler_fatal_c

MPITEST info  (0): Starting MPI_Errhandler_fatal test
MPITEST info  (0): This test will abort after printing the results message
MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
[burl-ct-x4150-1:26955] *** An error occurred in MPI_Send
[burl-ct-x4150-1:26955] *** reported by process [2471231489,0]
[burl-ct-x4150-1:26955] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[burl-ct-x4150-1:26955] *** MPI_ERR_RANK: invalid rank
[burl-ct-x4150-1:26955] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
[burl-ct-x4150-1:26955] ***and potentially your MPI job)
[burl-ct-x4150-1:26952] [[37708,0],0,0]-[[37708,1],3,0] mca_oob_tcp_msg_recv: 
readv failed: Connection reset by peer (131)
[burl-ct-x4150-1:26952] [[37708,0],0,0] ORTE_ERROR_LOG: A message is attempting 
to be sent to a process whose contact information is unknown in file 
../../../../../orte/mca/rml/oob/rml_oob_send.c at line 149
[burl-ct-x4150-1:26952] [[37708,0],0,0] attempted to send to [[37708,1],1,0]: 
tag 38
[burl-ct-x4150-1:26952] [[37708,0],0,0] ORTE_ERROR_LOG: A message is attempting 
to be sent to a process whose contact information is unknown in file 
../../../../orte/mca/odls/base/odls_base_default_fns.c at line 2345
[burl-ct-x4150-1:26952] [[37708,0],0,0]-[[37708,1],0,0] mca_oob_tcp_msg_recv: 
readv failed: Connection reset by peer (131)
[burl-ct-x4150-1:26952] 3 more processes have sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
[burl-ct-x4150-1:26952] Set MCA parameter "orte_base_help_aggregate" to 0 to 
see all help / error messages
 burl-ct-x4150-1 103 =>^Dexit

script done on Thu Aug 18 09:15:57 2011
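
(One note that may help when chasing this further: the "3 more processes have sent help message" lines in the logs above are help-message aggregation at work.  Rerunning with the MCA parameter the runtime itself suggests, e.g.

 mpirun --mca orte_base_help_aggregate 0 -np 4 --mca btl tcp,self --mca coll_sm_priority 100 -- `pwd`/src/MPI_Errhandler_fatal_c

shows each process's error message individually instead of the aggregated summary.)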