On Sep 17, 2007, at 11:26 AM, Daniel Rozenbaum wrote:

What seems to be happening is this: the server code is written so that the
server knows how many "responses" it's supposed to receive from all the
clients, so once all the calculation tasks have been distributed, the server
enters a loop in which it calls MPI_Waitany on an array of request handles
until it has received all the results it expects. However, from my debug
prints it looks like all the clients think they've sent every result they
could, and they're now all sitting in MPI_Probe, waiting for the server to
send out the next instruction (which is supposed to be the message indicating
the end of the run). So the server is stuck in MPI_Waitany() while all the
clients are stuck in MPI_Probe().
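
(In rough outline, the pattern is something like the sketch below -- this is
only an illustration of the structure described above, with made-up names,
not the actual application code.)

#include <mpi.h>
#include <stdio.h>

/* Server side: wait for every outstanding result (one posted receive
 * per expected response). */
void server_collect(MPI_Request reqs[], int nreqs)
{
    int remaining = nreqs;
    while (remaining > 0) {
        int idx;
        MPI_Status status;
        /* Completes one posted receive; its slot becomes MPI_REQUEST_NULL. */
        MPI_Waitany(nreqs, reqs, &idx, &status);
        printf("server: got result %d from rank %d\n", idx, status.MPI_SOURCE);
        remaining--;
    }
}

/* Client side: block until the server sends the next instruction
 * (or the end-of-run message). */
void client_wait_for_instruction(void)
{
    MPI_Status status;
    MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    /* ...MPI_Recv the instruction according to status... */
}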

On the server side, try putting in a debug loop and see if any of the requests that your app is waiting on are still not MPI_REQUEST_NULL (note that MPI_REQUEST_NULL is not necessarily the value 0 -- you need to compare against MPI_REQUEST_NULL itself). If there are any, see if you can trace backwards and figure out which requests they are.
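
Something along these lines -- "reqs" and "nreqs" here are just placeholders
for whatever array and count your server actually passes to MPI_Waitany:

#include <mpi.h>
#include <stdio.h>

/* Debug helper: print which requests the server still considers pending. */
void dump_outstanding_requests(MPI_Request *reqs, int nreqs)
{
    int i;
    for (i = 0; i < nreqs; ++i) {
        /* Compare against MPI_REQUEST_NULL itself; a completed / null
           request handle is not guaranteed to be the value 0. */
        if (reqs[i] != MPI_REQUEST_NULL) {
            printf("request %d is still outstanding\n", i);
        }
    }
    fflush(stdout);
}

Any index that prints there is a receive the server still thinks is pending; tracing that index back to the receive that posted it should tell you which client / result it corresponds to.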

I was wondering if you could comment on the "readv failed" messages I'm
seeing in the server's stderr:

[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=110

I'm seeing a few of these during the server's run, with errno=110
("Connection timed out", according to the "perl -e 'die$!=errno'" method
I found in the Open MPI FAQ), and I've also seen errno=113 ("No route to
host"). Could this mean there's an occasional infrastructure problem? That
would be strange, though, as it would then seem that this particular run
somehow triggers it. Could these messages also mean that some MPI messages
got lost due to these errors, and that's why the server thinks it still has
results to receive while the clients think they've sent everything out?

That is all possible. Sorry I missed that in your original message -- it basically means that MPI_COMM_WORLD rank 0 got a timeout from one of its peers when it shouldn't have.

You're sure that none of your processes are exiting early, right? You said they were all waiting in MPI_Probe, but I just wanted to double check that they're all still running.

Unfortunately, our error message is not very clear about which host it lost the connection with -- after you see that message, do you see incoming communications from all the slaves, or only some of them?

--
Jeff Squyres
Cisco Systems
