On Mon, Apr 21, 2008 at 09:04:28AM -0400, Josh Hursey wrote:
> Adrian,
Hi!
> Has there been any progress on this bug? If you still cannot reproduce
> it, send either Tim Prins or me a debugging patch and we can run
> with it. Or we can try to arrange access to one of our machines for you.
Adrian,
Has there been any progress on this bug? If you still cannot reproduce
it, send either Tim Prins or me a debugging patch and we can run
with it. Or we can try to arrange access to one of our machines for you.
This bug is making it difficult for us to continue working off of the
To echo what Josh said, there are no special compile flags being used.
If you send me a patch with debug output, I'd be happy to run it for you.
Both odin and sif are fairly normal Linux-based clusters, with Ethernet
and openib IP networks. The Ethernet network has both IPv4 and IPv6, and
the op
On Fri, Apr 18, 2008 at 01:00:40PM -0400, Josh Hursey wrote:
> The trick is to force Open MPI to use only tcp,self and nothing else.
> Did you try adding this (-mca btl tcp,self) to the runtime parameter
> set?
Sure. Even with 64 processes, I cannot trigger this behaviour. Neither
on Linux no
I'm seeing this problem as well even running just 4 processes on a
single node (though not as frequently as with higher process counts).
The trick is to force Open MPI to use only tcp,self and nothing else.
Did you try adding this (-mca btl tcp,self) to the runtime parameter
set?
-- Josh
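
For anyone trying to reproduce this, the suggestion above (restricting Open MPI to the tcp and self BTLs) could be scripted roughly as follows. This is only a sketch: the helper name build_mpirun_cmd and the test binary ./mpi_test are hypothetical placeholders, not names from this thread. Only the "-mca btl tcp,self" setting comes from the message above, and the process count of 4 matches the single-node case described there.

```python
def build_mpirun_cmd(num_procs, program, btl="tcp,self"):
    """Build an mpirun argument list that restricts Open MPI to the given BTLs.

    build_mpirun_cmd is a hypothetical helper for illustration only.
    """
    return ["mpirun", "-np", str(num_procs), "-mca", "btl", btl, program]


# "./mpi_test" is a placeholder for whatever MPI test program is being run.
cmd = build_mpirun_cmd(4, "./mpi_test")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) on a machine with Open MPI installed; repeating the run in a loop may help, since the failure is reported as intermittent at low process counts.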
On Fri, Apr 18, 2008 at 08:04:17AM -0400, Tim Prins wrote:
> Hi Adrian,
Hi!
> After this change, I am getting a lot of errors of the form:
> [sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by
> peer (104)
>
> See for instanc
Hi Adrian,
After this change, I am getting a lot of errors of the form:
[sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by
peer (104)
See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615
I have found this espe