Greg,

Greg Lindahl wrote:
Compare the latency numbers in HPC Challenge to the 2-node ping-pong
latency reported by vendors. For some vendors, it's the same number.
For others, the latency from using all the nodes is much, much higher.

The ring test in HPCC is rather poorly implemented: only 3 iterations to measure something of the same order of magnitude as the precision of MPI_Wtime(). Someone just failed Benchmarking 101. If you replace a gettimeofday()-based implementation of MPI_Wtime() with a cycle-counter one, the numbers change quite a bit.
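To put rough numbers on the timing problem, here is a minimal sketch in Python (not the HPCC code, which is C/MPI). The 2 us per-hop latency is an assumed figure for illustration, the 1 us tick is the granularity of gettimeofday(), and all function names are made up for this example.

```python
import time

def timer_resolution_ns(samples=100):
    """Smallest nonzero step seen between consecutive clock readings, in ns."""
    smallest = None
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()
        while t1 == t0:  # spin until the clock visibly advances
            t1 = time.perf_counter_ns()
        step = t1 - t0
        if smallest is None or step < smallest:
            smallest = step
    return smallest

def quantization_error(per_op_ns, resolution_ns, iterations):
    """Size of one clock tick relative to the whole timed interval."""
    return resolution_ns / (iterations * per_op_ns)

def iterations_needed(per_op_ns, resolution_ns, max_error_percent):
    """Fewest iterations keeping one tick within max_error_percent of the total."""
    denom = max_error_percent * per_op_ns
    needed = -(-(100 * resolution_ns) // denom)  # integer ceiling division
    return max(1, needed)

# Assumed figures: 2000 ns per hop, 1000 ns timer tick.
err_3_iters = quantization_error(2000, 1000, 3)
iters_for_1_percent = iterations_needed(2000, 1000, 1)
```

With those assumed figures, 3 iterations leave a single timer tick at about 17% of the measured interval, and it takes 50 iterations to bring that under 1%.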

However, I agree with you: this is the right way to measure the sensitivity to concurrent traffic.

Note that the new MVAPICH has message coalescing, which causes its

It is unbelievable that so few people denounce it. It is clearly implemented only to cheat on a micro-benchmark. What's next? Checking that the buffer to send is identical to the previous one, to avoid sending "redundant" messages in ping-pong?!

message each to lots of other nodes before synchronizing. Message rate
benchmarks like "base" HPCC Gups get no benefit from message
coalescing.

HPCC Gups already does some sort of coalescing. If updates are going to the same process, they are put in the same bucket. The size of the messages depends on the number of updates in the buckets, so a smaller number of nodes means bigger messages. I don't understand why they would do that; it defeats the goal of scalability testing.
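A toy model of that bucketing shows the effect. This is a sketch under stated assumptions, not the HPCC source: the batch of 1024 pending updates, the uniform random destinations, and the function names are all chosen for illustration.

```python
import random
from collections import Counter

def bucket_sizes(num_updates, num_procs, seed=0):
    """Spread updates over destination processes and count each bucket."""
    rng = random.Random(seed)
    buckets = Counter(rng.randrange(num_procs) for _ in range(num_updates))
    return [buckets[p] for p in range(num_procs)]

def mean_message_size(num_updates, num_procs, seed=0):
    """Average number of updates per bucket, i.e. per coalesced message."""
    sizes = bucket_sizes(num_updates, num_procs, seed)
    return sum(sizes) / num_procs

# Same batch of pending updates (assumed: 1024), different process counts.
report = {procs: mean_message_size(1024, procs) for procs in (4, 64, 1024)}
```

For the same batch, the average bucket shrinks from 256 updates at 4 processes to 16 at 64 and 1 at 1024, so small runs end up measuring much larger messages than large runs do.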

HPC Challenge is much better than what has come before, but it too can

I think HPCC is something of a regression compared to, for example, the NAS benchmarks. The communication benchmarks are too analytic, not functional enough.

intra-node. And guess what? HPCC results are hard to come by, even though
it's pretty easy to run.

And HPCC is a pain in the bottom to compile and run. HPL is not really a shining example of a straightforward build process or config-less operation, so why build HPCC on top of it? Is autoconf still too bleeding-edge these days? Argh! And what about the three dozen parameters in the config file?! It's just insane.

I like the NAS benchmarks. You can run each of them independently, and you only choose the problem size and the number of processes. Easy to run, easy to compare. Pallas is nice too; anybody can run it.

Trust me, I'd love to see microbenchmarks which attack the real issues
that speed up applications. But usually they miss the mark, and my
attempt to create a new one (message rate) is now destroyed by message
coalescing. I should have used an N-node benchmark instead.

If you want to show the impact of concurrent communications, something latency-based like the HPCC ring test is the best way (possibly with more nodes). The millions of packets per second of a stream-based benchmark are lovely for the marketing folks, but have little meaning for real codes that do even a minimum of computation. An alltoall on many cores/nodes, however, would exercise the same metric (many sends/recvs on the same NIC at the same time) while being harder to cheat and much more meaningful, IMHO.
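To make the difference in communication pattern concrete, here is a small sketch that only lists who talks to whom; it does not run any MPI. It assumes one process per NIC and a ring in which each rank exchanges with both neighbours, and the function names are hypothetical.

```python
def ring_peers(rank, size):
    """Neighbours a rank exchanges with in a bidirectional ring."""
    return sorted({(rank - 1) % size, (rank + 1) % size} - {rank})

def alltoall_peers(rank, size):
    """Every other rank: an alltoall sends to and receives from all of them."""
    return [p for p in range(size) if p != rank]

def total_messages(peers_of, size):
    """Messages sent in one round, summed over all ranks."""
    return sum(len(peers_of(rank, size)) for rank in range(size))

size = 8
ring_total = total_messages(ring_peers, size)          # 2 per rank
alltoall_total = total_messages(alltoall_peers, size)  # size - 1 per rank
```

On 8 ranks, each NIC handles 2 concurrent exchanges in the ring and 7 in the alltoall, and the gap grows linearly with the number of ranks.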

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
