Rolf,
Wow! That's actually good news, since in our own tests hierarch is
always slower. But this might be due to various reasons, including the
fact that we only have two cores per node. BTW: I would actually expect
the IMB test to show worse performance for hierarch than many other
benchmarks, since the rotating root causes some additional work/overhead.
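For readers not familiar with the benchmark: the IMB Bcast kernel is
essentially a timing loop that advances the root rank on every
repetition, so the data almost always originates on a different node,
and a flat bcast pays the full internode cost every time. A minimal
sketch of that measurement pattern (not the actual IMB source; the
averaging and buffer handling are simplified):

  #include <mpi.h>

  /* Sketch of an IMB-style Bcast timing loop: the root rotates on
   * every repetition, so each broadcast typically originates on a
   * different node. Simplified, not the actual IMB code. */
  double time_bcast(void *buf, int count, int reps, MPI_Comm comm)
  {
      int size;
      MPI_Comm_size(comm, &size);
      MPI_Barrier(comm);                    /* synchronize the start */
      double t0 = MPI_Wtime();
      for (int i = 0; i < reps; i++) {
          int root = i % size;              /* rotate the root */
          MPI_Bcast(buf, count, MPI_BYTE, root, comm);
      }
      return (MPI_Wtime() - t0) / reps * 1.0e6;   /* avg usec */
  }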
Rolf Vandevaart wrote:
I am curious if anyone is doing any work currently on the hierarchical
collectives. I ask this because I just did some runs on a cluster made
up of 4 servers with 4 processors per server. I used TCP over IB. I
was running with np=16 and using the IMB benchmark to test MPI_Bcast.
What I am seeing is that the hierarchical collectives appear to boost
performance. The IMB test rotates the root, so one could imagine that
since the hierarchical component minimizes internode communication,
performance increases. See the table at the end of this post comparing
tuned and hierarchical for MPI_Bcast. This leads me to a few
other questions.
1. From what I can tell from the debug messages, we still cannot stack
the hierarchical on top of the tuned. I know that Brian Barrett did
some work after the collectives meeting to allow for this, but I could
not figure out how to get it to work.
actually, this should be possible. We are, however, seeing some
problems with the current trunk, so I cannot verify it right now. So,
just for the sake of clarity: did you run hierarch on top of tuned, or
on top of basic and/or sm?
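Just in case the selection logic is getting in the way: what I would
try is raising hierarch's priority explicitly while keeping the
components it should use underneath in the coll list, along these
lines (the priority value is just an example, and the exact component
list depends on what you want underneath):

  mpirun -np 16 --mca coll hierarch,tuned,sm,self \
         --mca coll_hierarch_priority 90 ./IMB-MPI1 Bcast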
2. Enabling the hierarchical collectives causes a massive slowdown
during MPI_Init. I know it was discussed a little at the collectives
meeting and it appears that this is still something we need to solve.
For a simple hello_world with np=4 on a 2-node cluster, the run takes
around 5 seconds with the tuned collectives but around 18 seconds with
hierarchical.
yes. A faster, though simpler, hierarchy detection has been
implemented, but is not yet committed.
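To illustrate where the MPI_Init time goes: the hierarchy has to be
discovered with at least one collective over all processes before the
sub-communicators can be built. A simplified sketch of the general
technique, not the actual hierarch code (the hostname hash is a
placeholder for real locality information):

  #include <mpi.h>
  #include <unistd.h>
  #include <limits.h>

  /* Simplified hierarchy detection: split the communicator into one
   * sub-communicator per node, plus a communicator of node leaders.
   * Illustration only, not the hierarch implementation. */
  void build_hierarchy(MPI_Comm comm, MPI_Comm *local, MPI_Comm *leaders)
  {
      char name[HOST_NAME_MAX + 1] = {0};
      gethostname(name, HOST_NAME_MAX);

      /* cheap non-negative hash of the hostname -> split color
       * (hash collisions are ignored in this sketch) */
      unsigned color = 0;
      for (char *p = name; *p; p++)
          color = color * 31u + (unsigned char)*p;

      int rank;
      MPI_Comm_rank(comm, &rank);

      /* one communicator per node */
      MPI_Comm_split(comm, (int)(color % INT_MAX), rank, local);

      /* local rank 0 on each node joins the leader communicator;
       * everyone else gets MPI_COMM_NULL */
      int lrank;
      MPI_Comm_rank(*local, &lrank);
      MPI_Comm_split(comm, lrank == 0 ? 0 : MPI_UNDEFINED, rank, leaders);
  }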
3. Apart from the MPI_Init issue, is hierarchical ready to go?
Clearly, the algorithms in hierarch are very simple, but they also
still lack large-scale testing, so that is something which would have
to happen first.
We have also experimented with various other hierarchical bcast
algorithms over the last few months, although our overall progress has
been significantly slower than I had hoped. I know, however, that
various other groups are also interested in the hierarch component and
might be ready to invest some time to bring it up to speed.
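For anyone who wants to experiment with the component, the two-level
structure underneath all of these bcast variants is the same: an
internode broadcast among the node leaders followed by an intranode
broadcast. Roughly, using the communicators from the detection sketch
above (again an illustration; the real code has to handle arbitrary
roots and message segmentation):

  /* Two-level broadcast over the communicators built by
   * build_hierarchy(). Assumes the data starts at leader rank 0;
   * the real component must map an arbitrary root onto the levels. */
  void bcast_two_level(void *buf, int count, MPI_Datatype type,
                       MPI_Comm local, MPI_Comm leaders)
  {
      if (leaders != MPI_COMM_NULL) {
          /* phase 1: internode, one process per node participates */
          MPI_Bcast(buf, count, type, 0, leaders);
      }
      /* phase 2: intranode, from the local leader (local rank 0) */
      MPI_Bcast(buf, count, type, 0, local);
  }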
Thanks
Edgar
4. As the nodes get fatter, I assume the need for hierarchical
collectives will increase, so this may become a larger issue for all of us?
RESULTS FROM TWO RUNS OF IMB-MPI1
#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 16              TUNED      HIERARCH
#----------------------------------------------------------------
  #bytes  #repetitions   t_avg[usec]   t_avg[usec]
       0          1000          0.11          0.22
       1          1000        205.97        319.86
       2          1000        159.23        180.80
       4          1000        175.32        189.16
       8          1000        153.10        184.26
      16          1000        170.98        192.33
      32          1000        160.69        187.17
      64          1000        159.75        182.62
     128          1000        175.47        185.19
     256          1000        160.77        194.68
     512          1000        265.45        313.89
    1024          1000        185.66        215.43
    2048          1000        815.97        257.37
    4096          1000       1208.48        442.93
    8192          1000       1521.23        530.54
   16384          1000       2357.45        813.44
   32768          1000       3341.29       1455.78
   65536           640       6485.70       3387.02
  131072           320      13488.35       5261.65
  262144           160      24783.09      10747.28
  524288            80      50906.06      21817.64
 1048576            40      95466.82      41397.49
 2097152            20     180759.72      81319.54
 4194304            10     322327.71     163274.55
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab http://pstl.cs.uh.edu
Department of Computer Science University of Houston
Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335