A colleague of mine ran some microkernels on an 8-way Barcelona box (Sun
x2200M2 at 2.3 GHz). Here are some performance comparisons with Scali.
The performance tests are modified versions of the HPCC pingpong
tests. The OMPI version is the trunk plus my "single-queue" fixes;
without them, OMPI latency at higher np would be noticeably worse.

        latency (ns)     bandwidth (MB/s)
       (8-byte msgs)      (2M-byte msgs)
       =============      ==============
  np   Scali    OMPI      Scali    OMPI
   2     327     661       1458    1295
   4     369     670       1517    1287
   8     414     758       1535    1294

OMPI latency is roughly 1.8x to 2x Scali's (661 vs. 327 ns at np=2,
758 vs. 414 ns at np=8). Presumably, "fastpath" PML latency
optimizations would help us a lot here. Thankfully, with the recent
"single-queue" fixes our latency grows only modestly with np (661 to
758 ns going from np=2 to np=8); otherwise our high-np latency story
would be so much worse. We're behind on bandwidth as well, by roughly
11% to 16%, though not as pitifully so.
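
For anyone who wants to check the comparison, here is a small Python
sketch that recomputes the OMPI/Scali ratios from the table. The
numbers are copied from the table above; the dictionary and function
names are mine and purely illustrative, not part of the benchmark.

```python
# Figures copied from the table: np -> (Scali, OMPI).
# Names below are hypothetical helpers, not part of HPCC or Open MPI.
latency_ns = {2: (327, 661), 4: (369, 670), 8: (414, 758)}
bandwidth_mbs = {2: (1458, 1295), 4: (1517, 1287), 8: (1535, 1294)}


def latency_ratio(nprocs):
    """How many times larger OMPI's 8-byte latency is than Scali's."""
    scali, ompi = latency_ns[nprocs]
    return ompi / scali


def bandwidth_deficit_pct(nprocs):
    """How far OMPI's 2M-byte bandwidth falls below Scali's, in percent."""
    scali, ompi = bandwidth_mbs[nprocs]
    return 100.0 * (scali - ompi) / scali


if __name__ == "__main__":
    for nprocs in sorted(latency_ns):
        print(f"np={nprocs}: OMPI latency {latency_ratio(nprocs):.2f}x Scali, "
              f"OMPI bandwidth {bandwidth_deficit_pct(nprocs):.1f}% below Scali")
```

This only does arithmetic on the reported figures; it says nothing
about run-to-run variance, which the table does not include.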