Truly am sorry about that - we were just talking today about the need to update 
and improve our FAQ on running on large clusters. Did you by any chance look at 
it? Would appreciate any thoughts on how it should be improved from a user's 
perspective.



On Sep 20, 2011, at 3:28 PM, Henderson, Brent wrote:

> Nope, but if I had, that would have saved me about an hour of coding time! 
>  
> I’m still curious if it would be beneficial to inject some barriers at 
> certain locations so that if you had a slow node, not everyone would end up 
> connecting to it all at once.  Anyway, if I get access to another large TCP 
> cluster, I’ll give it a try.
>  
> Thanks,
>  
> brent
>  
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Ralph Castain
> Sent: Tuesday, September 20, 2011 4:15 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Large TCP cluster timeout issue
>  
> Hmmm....perhaps you didn't notice the mpi_preconnect_all option? It does 
> precisely what you described - it pushes zero-byte messages around a ring to 
> force all the connections open at MPI_Init.
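> 
> (For reference, the option can be turned on from the mpirun command line with
> "--mca mpi_preconnect_all 1"; the sketch below shows the equivalent
> environment-variable route from inside a program.  The exact parameter name
> may vary between releases, so it is worth checking "ompi_info --param mpi all"
> on your installation first.)
> 
>     /* Minimal sketch: ask Open MPI to pre-establish all connections by
>      * exporting the corresponding MCA environment variable before MPI_Init.
>      * Assumes the v1.5-era parameter name mpi_preconnect_all. */
>     #include <mpi.h>
>     #include <stdlib.h>
> 
>     int main(int argc, char **argv)
>     {
>         /* must be set before MPI_Init so the MCA system picks it up */
>         setenv("OMPI_MCA_mpi_preconnect_all", "1", 1);
> 
>         MPI_Init(&argc, &argv);
>         /* ... connections to all peers are now opened eagerly ... */
>         MPI_Finalize();
>         return 0;
>     }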
>  
>  
> On Sep 20, 2011, at 3:06 PM, Henderson, Brent wrote:
> 
> 
> I recently had access to a 200+ node Magny Cours (24 ranks/host) 10G Linux 
> cluster.  I was able to use Open MPI v1.5.4 with hello world, IMB and HPCC, 
> but there were a couple of issues along the way.  After raising some system 
> tunables on all of the nodes, a hello_world program worked just fine – it 
> appears that the TCP connections between most or all of the ranks are 
> deferred until they are actually used, so the easy test ran reasonably 
> quickly.  I then moved on to IMB. 
>  
> I typically don’t care about the small rank counts, so I add the -npmin 99999 
> option to run just the ‘big’ rank count.  This ended with an abort after 
> MPI_Init(), but before running any tests.  Many (possibly all) of the ranks 
> emitted messages that looked like:
>  
>     ‘[n112][[13200,1],1858][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>     connect() to 172.23.4.1 failed: Connection timed out (110)’
>  
> Where n112 is one of the nodes in the job, and 172.23.4.1 is the first node 
> in the job.  One of the first things that IMB does before running a test is 
> create a communicator for each specific rank count it is testing.  Apparently 
> this collective operation causes a large number of connections to be made.  
> The abort messages (one example shown above) all show the connect failure to 
> a single node, so it would appear that a very large number of nodes attempted 
> to connect to that one at the same time and overwhelmed it.  (Or it was slow 
> and everyone ganged up on it as they worked their way around the ring. :-) )  
> Is there a supported/suggested way to work around this?  It was very 
> repeatable.
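> 
> (For what it’s worth, the communicator creation at issue is roughly of the 
> following form – an illustrative sketch, not IMB’s actual code.  The split is 
> a collective over MPI_COMM_WORLD, so on a lazily-connected TCP cluster it is 
> an early point where many connections open at once.)
> 
>     /* Illustrative sketch (not IMB's actual code): build a subcommunicator
>      * containing the first 'nprocs' ranks of MPI_COMM_WORLD. */
>     #include <mpi.h>
> 
>     MPI_Comm make_subcomm(int nprocs)
>     {
>         int rank;
>         MPI_Comm newcomm;
> 
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>         /* ranks below 'nprocs' join color 0; the rest are left out */
>         MPI_Comm_split(MPI_COMM_WORLD,
>                        rank < nprocs ? 0 : MPI_UNDEFINED,
>                        rank, &newcomm);
> 
>         return newcomm;   /* MPI_COMM_NULL for the excluded ranks */
>     }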
>  
> I was able to work around this by providing my own definitions of MPI_Init() 
> and MPI_Init_thread() that call the ‘P’ version of each routine, and then 
> having each rank send its rank number to the rank one to the right, then two 
> to the right, and so on around the ring.  I added an MPI_Barrier(MPI_COMM_WORLD) 
> call every N messages to keep things at a controlled pace.  N was 64 by 
> default, but settable via an environment variable in case that number didn’t 
> work well for some reason.  This fully connected the mesh (110k socket 
> connections per host!) and allowed the tests to run.  Not a great solution, I 
> know, but I’ll throw it out there until I know the right way.
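> 
> (A rough sketch of that style of wrapper is below.  It is a guess at the 
> approach described, not the actual code used: the helper name, the 
> PRECONNECT_BARRIER_INTERVAL environment variable, and the use of MPI_Sendrecv 
> to pair each send with a receive are illustrative assumptions.)
> 
>     /* Sketch of a preconnect wrapper: intercept MPI_Init/MPI_Init_thread via
>      * the PMPI profiling interface, then walk the ring so every pair of ranks
>      * exchanges one small message, with a barrier every N steps to pace the
>      * connection storm. */
>     #include <mpi.h>
>     #include <stdlib.h>
> 
>     static void preconnect_ring(void)
>     {
>         int rank, size, offset;
>         int interval = 64;    /* barrier every N messages by default */
>         const char *env = getenv("PRECONNECT_BARRIER_INTERVAL");
> 
>         if (env != NULL && atoi(env) > 0) {
>             interval = atoi(env);
>         }
> 
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>         for (offset = 1; offset < size; offset++) {
>             int to   = (rank + offset) % size;
>             int from = (rank - offset + size) % size;
>             int recvbuf;
> 
>             /* send our rank 'offset' places to the right while receiving
>              * from 'offset' places to the left */
>             MPI_Sendrecv(&rank, 1, MPI_INT, to, 0,
>                          &recvbuf, 1, MPI_INT, from, 0,
>                          MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> 
>             /* keep things at a controlled pace */
>             if (offset % interval == 0) {
>                 MPI_Barrier(MPI_COMM_WORLD);
>             }
>         }
>     }
> 
>     int MPI_Init(int *argc, char ***argv)
>     {
>         int ret = PMPI_Init(argc, argv);
>         if (ret == MPI_SUCCESS) {
>             preconnect_ring();
>         }
>         return ret;
>     }
> 
>     int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)
>     {
>         int ret = PMPI_Init_thread(argc, argv, required, provided);
>         if (ret == MPI_SUCCESS) {
>             preconnect_ring();
>         }
>         return ret;
>     }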
>  
> Once I had this in place, I used the workaround with HPCC as well.  Without 
> it, it would not get very far at all.  With it, I was able to make it through 
> the entire test.
>  
> Looking forward to getting the experts’ thoughts about the best way to handle 
> big TCP clusters – thanks!
>  
> Brent
>  
> P.S.  v1.5.4 worked *much* better than v1.4.3 on this cluster – not sure why, 
> but kudos to those working on changes since then!
>  
