Re: [OMPI devel] MPI_Barrier performance

George Bosilca Thu, 17 Apr 2008 16:23:35 -0400

This sounds like the fuel problem we're facing right now. Potentially, there are enough resources (for now). Simultaneously, there is enough demand (for ever). But they are connected by this artificially maintained tiny pipe ...

The tuned collectives are not supposed to adapt to all cases. They are supposed to deliver the best performance available when each process has its own dedicated network resources, in other words when there is one process per node. Why does CT6 deliver better performance? Process placement and the communication pattern are just a few of the factors that affect it. Change one of them and a specific configuration may see a [possibly] large improvement in performance. However, it's a temporary benefit, because it doesn't solve the real issue, it just hides it away.
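
To make the "tiny pipe" concrete, here is a rough sketch that counts how many processes on one node have to talk off-node in each round of a recursive doubling exchange. It assumes a power-of-two number of processes, np >= ppn, and block placement (rank r lives on node r // ppn). Both the placement and the function name are assumptions made for illustration, not something taken from the Open MPI code.

```python
def offnode_senders_per_round(np_, ppn):
    """Count, for each recursive doubling round, how many ranks on node 0
    have a partner on a different node.

    Assumptions (illustrative only): np_ is a power of two, np_ >= ppn,
    and ranks are placed in blocks, so rank r lives on node r // ppn.
    """
    counts = []
    dist = 1
    while dist < np_:
        # In this round, rank r exchanges with rank r XOR dist.
        count = sum(1 for r in range(ppn) if (r ^ dist) // ppn != 0)
        counts.append(count)
        dist <<= 1
    return counts


if __name__ == "__main__":
    # 64 processes packed 16 per node versus one process per node.
    print(offnode_senders_per_round(64, 16))
    print(offnode_senders_per_round(64, 1))
```

Under these assumptions, with 16 processes per node the first four rounds stay inside the node, and in every later round all 16 processes go through the same network interface at the same time. With one process per node, each round puts a single message per node on the wire, which is the case the tuned collectives are built for.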

Until we have the hierarch collective working, there is no miracle solution to this problem. Well ... except not starting 16 processes per node :)


  george.

On Apr 15, 2008, at 1:45 PM, Rolf Vandevaart wrote:

I have been running the IMB performance tests and noticed some strange behavior. This is running on a CentOS cluster with 16 processes per node and using the openib btl. Currently, I am looking at the MPI_Barrier performance. Since we make use of a recursive doubling algorithm (in the tuned collective) I would have expected to see a log2(np) type performance. However, the data is much worse than log2(np), with the trunk being worse than v1.2.4. One interesting piece of data is that I replaced the tuned algorithm with one that is very similar (copied from Sun Clustertools 6), but instead of each process doing send/recv, we have each one do a send to their lower partners followed by a receive with their upper partners. Then, everything is reversed, which finishes the barrier. For reasons unknown, this appears to perform better even though both algorithms should be log2(np).
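
For reference, the log2(np) expectation can be checked with a small simulation of a recursive doubling exchange. This is only a sketch under stated assumptions: it handles power-of-two process counts, it models each round as a paired exchange with partner rank XOR 2^k, and the function names are made up for illustration. It is not the tuned collective's code, and it does not attempt to reproduce the Clustertools 6 variant.

```python
def recursive_doubling_rounds(np_):
    """Return, for each round, the partner of every rank.

    Illustrative sketch: only power-of-two process counts are handled.
    """
    if np_ < 1 or np_ & (np_ - 1):
        raise ValueError("this sketch handles power-of-two process counts only")
    rounds = []
    dist = 1
    while dist < np_:
        # In the round with distance dist, rank r pairs with r XOR dist.
        rounds.append([rank ^ dist for rank in range(np_)])
        dist <<= 1
    return rounds


def simulate_barrier(np_):
    """Return (number of rounds, whether every rank heard from all ranks)."""
    # heard[r] is the set of ranks whose arrival rank r knows about.
    heard = [{rank} for rank in range(np_)]
    rounds = recursive_doubling_rounds(np_)
    for partners in rounds:
        # Each rank merges what its partner knew before this round.
        heard = [heard[r] | heard[partners[r]] for r in range(np_)]
    complete = all(len(h) == np_ for h in heard)
    return len(rounds), complete


if __name__ == "__main__":
    for n in (2, 16, 64):
        print(n, simulate_barrier(n))
```

The round count grows as log2(np), so if every round cost the same, the barrier time would follow that curve. The simulation says nothing about how long a round takes when many processes share one network interface.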
Another interesting fact is that when run on my really slow cluster over TCP (latency of 150 usec) the tuned barrier algorithm very closely follows the expected log2(np).
I have mentioned this issue to a few people, but thought I would share with a wider audience to see if anyone else has observed MPI_Barrier that is not log2(np). I have attached two pdfs. The first one shows my results and the second one is a picture of the two different barrier algorithms.
Rolf

--

=========================
rolf.vandeva...@sun.com
781-442-3043
=========================
<imb-barrier-ompi.pdf><barrier-tree.pdf>
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

