Re: [OMPI devel] NP64 _gather_ problem

Steve Wise Fri, 17 Sep 2010 16:46:31 -0400

I'll look into Solaris Studio. I think the connections are somehow getting single-threaded or funneled due to the gather algorithm. And since each one takes ~160ms to set up, and there are ~3600 connections being set up, we end up with a 7 minute run time. Now, 160ms seems way too high for setting up even an iWARP connection, which has some streaming-mode TCP exchanges as part of connection setup. I would think it should be around a few hundred _usecs_. So I'm pursuing the connect latency too.
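As a rough check on the numbers above, here is a minimal Python sketch of the fully serialized case. The function name is made up for illustration, and the 300 usec figure is only an assumed stand-in for "a few hundred usecs". Strict serialization is an upper bound; any overlap between setups would shorten the total.

```python
def serialized_setup_seconds(num_connections, per_connection_seconds):
    """Total wall time if connections are set up strictly one after another."""
    return num_connections * per_connection_seconds


# ~3600 connections at the measured ~160 ms each.
observed = serialized_setup_seconds(3600, 0.160)

# Same connection count at an assumed 300 usec each (illustrative value only).
expected = serialized_setup_seconds(3600, 300e-6)

print(f"at 160 ms each:   {observed:.0f} s ({observed / 60:.1f} min)")
print(f"at 300 usec each: {expected:.2f} s")
```

At 160 ms per connection the serialized total is on the order of minutes, which is the same order as the observed run time; at a few hundred usecs it would be on the order of a second.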


Thanks,


Steve.

On 9/17/2010 12:13 PM, Terry Dontje wrote:

Right, by default all connections are set up on the fly. So when an MPI_Send is executed to a process to which there is no connection yet, a dance happens between the sender and the receiver. So why this happens with np > 60 may have to do with how many connections are happening at the same time, or with the destination of a connection request not being in the MPI library at that moment.
It would be interesting to figure out when in the timeline of the job such requests are being delayed. You can get such a timeline by using a tool like Solaris Studio collector/analyzer (which actually has a Linux version).
--td

Steve Wise wrote:
Yes it does. With mpi_preconnect_mpi set to 1, NP64 doesn't stall. So it's not the algorithm in and of itself, but rather some interplay between the algorithm and connection setup, I guess.
On 9/17/2010 5:24 AM, Terry Dontje wrote:
Does setting the MCA parameter mpi_preconnect_mpi to 1 help at all? This might help determine whether it is the actual connection setup between processes that is out of sync, as opposed to something in the actual gather algorithm.
--td

Steve Wise wrote:
Here's a clue: ompi_coll_tuned_gather_intra_dec_fixed() changes its algorithm for job sizes > 60 to some binomial method. I changed the threshold to 100 and my NP64 jobs run fine. Now to try and understand what about ompi_coll_tuned_gather_intra_binomial() is causing these connect delays...
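To make the difference in communication pattern concrete, here is a minimal Python sketch comparing a linear gather with a textbook binomial-tree gather. This is a generic binomial tree and an assumption on my part, not the exact tree that ompi_coll_tuned_gather_intra_binomial() builds; all function names are illustrative.

```python
def binomial_parent(rank, root, size):
    """Parent of `rank` in a generic binomial tree rooted at `root` (None for the root)."""
    vrank = (rank - root) % size          # renumber so the root is virtual rank 0
    if vrank == 0:
        return None
    parent_vrank = vrank & (vrank - 1)    # clear the lowest set bit
    return (parent_vrank + root) % size


def peers_linear(root, size):
    """(sender, receiver) pairs for a linear gather: every rank sends to the root."""
    return {(r, root) for r in range(size) if r != root}


def peers_binomial(root, size):
    """(sender, receiver) pairs for a binomial gather: every rank sends to its parent."""
    return {(r, binomial_parent(r, root, size)) for r in range(size) if r != root}


size = 64
linear = peers_linear(0, size)
binomial = peers_binomial(0, size)
# Pairs where neither end is the root; a linear gather has none of these.
non_root_pairs = {pair for pair in binomial if 0 not in pair}
print(len(linear), len(binomial), len(non_root_pairs))
```

Both patterns use 63 sends for 64 ranks, but in the binomial tree only the 6 ranks whose virtual rank is a power of two send directly to the root; the other 57 sends go between non-root ranks. With on-demand connection setup, those are peer pairs that a linear gather would never need to connect. Whether that is what triggers the delays here is still an open question.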
On 9/16/2010 1:01 PM, Steve Wise wrote:
Oops. One key typo here: This is the IMB-MPI1 gather test, not barrier. :(
On 9/16/2010 12:05 PM, Steve Wise wrote:
Hi,
I'm debugging a performance problem with running IMB-MPI1/barrier on an NP64 cluster (8 nodes, 8 cores each). I'm using openmpi-1.4.1 from the OFED-1.5.1 distribution. The BTL is openib/iWARP via Chelsio's T3 RNIC. In short, an NP60 or smaller run completes in a timely manner as expected, but NP61 and larger runs slow to a crawl at the 8KB IO size and take ~5-10min to complete. They do complete, though. It behaves this way even if I run on > 8 nodes so there are available cores. I.e., an NP64 run on a 16 node cluster still behaves the same way even though there are only 4 ranks on each node. So it's apparently not a thread starvation issue due to lack of cores.
When in the stalled state, I see on the order of 100 or so established iWARP connections on each node. And the connection count increases VERY slowly and sporadically (at its peak there are around 800 connections for an NP64 gather operation). In comparison, when I run the <= NP60 runs, the connections quickly ramp up to the expected amount.
I added hooks in the openib BTL to track the time it takes to set up each connection. In all runs, both <= NP60 and > NP60, the average connection setup time is around 200ms. And the max setup time seen is never much above this value. That tells me that it's not individual connection setup that is the issue.
I then added printfs/fflushes in librdmacm to visually see when a connection is attempted and when it is accepted. When I run with these printfs, I see the connections get set up quickly and evenly in the <= NP60 case. Initially when the job is started, I see a small flurry of connections getting set up, then the run begins and at around 1KB IO size I see a 2nd large flurry of connection setups. Then the test continues and completes. With the > NP60 case, this second round of connection setups is very sporadic and slow. Very slow! I'll see little bursts of ~10-20 connections set up, then long random pauses. The net is that full connection setup for the job takes 5-10min. During this time the ranks are basically spinning idle awaiting the connections to get set up. So I'm concluding that something above the BTL layer isn't issuing the endpoint connect requests in a timely manner.
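The reasoning above separates two quantities: how long each connection takes once it is attempted, and how long the job waits between attempts. A minimal Python sketch of that distinction follows. The timestamps are made up for illustration only (they are not measurements from this cluster), and the function name is hypothetical.

```python
from statistics import mean


def summarize(events):
    """events: list of (connect_start, connect_done) timestamps in seconds."""
    events = sorted(events)
    setup = [done - start for start, done in events]
    # Idle time between consecutive connect attempts.
    gaps = [b[0] - a[0] for a, b in zip(events, events[1:])]
    return {
        "mean_setup": mean(setup),
        "max_setup": max(setup),
        "max_gap": max(gaps) if gaps else 0.0,
    }


# Made-up timestamps: same per-connection setup time, very different pacing.
steady = [(0.0, 0.2), (0.1, 0.3), (0.2, 0.4), (0.3, 0.5)]
stalled = [(0.0, 0.2), (0.1, 0.3), (30.0, 30.2), (30.1, 30.3)]

for name, events in (("steady", steady), ("stalled", stalled)):
    print(name, summarize(events))
```

In this toy data both runs show the same mean and max setup time, while only the stalled run shows a long gap between attempts. That is the pattern described above: the per-connection numbers look alike, and the time is lost before the connect requests are issued.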
Attached are 3 padb dumps during the stall. Anybody see anything interesting in these?
Any ideas how I can further debug this? Once I get above the openib BTL layer my eyes glaze over and I get lost quickly. :) I would greatly appreciate any ideas from the OpenMPI experts!
Thanks in advance,

Steve.


_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle * - Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email [email protected] <mailto:[email protected]>

