Re: [Beowulf] Performance characterising a HPC application

stephen mulcahy Fri, 23 Mar 2007 11:11:16 -0800

Mark Hahn wrote:

1. processor bound.
2. memory bound.


oprofile is the only thing I know of that will give you this distinction.

In practice, I don't think it is given the usage characteristics I mentioned in my previous mail.

3. interconnect bound.


with ethernet, this is obvious, since you can just look at user/system/idle
times.


You mean the system time will be high if nodes are busy sending/receiving?
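
As a rough illustration of the user/system/idle split being discussed, here is a minimal sketch that turns two samples of the aggregate "cpu" line from /proc/stat into percentages. It assumes a Linux node with the usual /proc/stat column order; the helper names (read_cpu_line, cpu_shares) are made up for this example and are not from any tool mentioned in the thread.

```python
# Sketch only: assumes the Linux /proc/stat layout; helper names are illustrative.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def read_cpu_line(path="/proc/stat"):
    """Return the aggregate 'cpu' line from /proc/stat."""
    with open(path) as stat:
        for line in stat:
            if line.startswith("cpu "):
                return line
    raise RuntimeError("no aggregate cpu line found in " + path)


def cpu_shares(before, after):
    """Percentage of CPU time spent in each category between two 'cpu' lines."""
    def parse(line):
        parts = line.split()
        # Skip the leading "cpu" label and keep only the columns named in FIELDS.
        return [int(value) for value in parts[1:1 + len(FIELDS)]]

    deltas = [b - a for a, b in zip(parse(before), parse(after))]
    total = sum(deltas) or 1  # avoid dividing by zero on identical samples
    return {name: 100.0 * delta / total for name, delta in zip(FIELDS, deltas)}
```

Calling read_cpu_line() twice, a few seconds apart on a compute node while the model is running, and passing both lines to cpu_shares() gives the breakdown. On Linux some of the network receive work is likely to be accounted under irq/softirq rather than system, so those columns are probably worth watching as well.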

4. headnode bound.


do you mean for NFS traffic?


More in terms of managing the responses from the compute nodes.

it's not _that_ hard to hit full wire speed with gigabit...
however, saturating the wire means it's entirely possible that nodes
are being bottlenecked by this.

It seems to be the case - but the peak usage of the network is relatively infrequent (every 30 minutes or so) - average usage is much lower.

* Network traffic in averages about 50 Mbit/sec but peaks to about 200 Mbit/sec. Network traffic out averages about 50 Mbit/sec but peaks to about 200 Mbit/sec. The peaks are very short (maybe a few seconds in duration, presumably at the end of an MPI "run" if that is the correct term).
you don't think the peaks correspond to inter-node communication (_during_
the MPI job)?

Sorry, that's what I meant - this particular model outputs to a history file every 30 minutes or so - and seems to do a lot of inter-node comms around the same time .. so yes, the traffic seems to be generated by the MPI job.

ouch. the cluster is doing very badly, and clearly bottlenecked on either inter-node or headnode IO. I guess I'd be tempted to capture some representative trace data with tcpdump (but I'm pretty old-fashioned and fundamentalist about these things.)


Phew, that sounds hardcore :)

I tried wireshark on the headnode for a few minutes but ended up with gigs of data and wasn't sure what I was looking for, so I'm currently trying to see why my model isn't generating mpe log data, which might be more manageable. What would you see in a tcpdump if the network was the bottleneck, lots of resends?

quantify that. Do others here running MPI jobs see big improvements in using Infiniband over Gigabit for MPI jobs or does it really depend on the
jeez: compare a 50 us interconnect to a 4 us one (or 80 MB/s vs >800).
anything which doesn't speed up going from gigabit to IB/10G/quadrics is what I would call embarrassingly parallel...
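
To put the 50 us vs 4 us and 80 MB/s vs >800 MB/s figures side by side, here is a minimal sketch using the simple "latency plus size over bandwidth" model for a single message. The model itself is a simplification (it ignores contention, protocol overhead and overlap with computation), 800 MB/s is taken as a lower bound from the figures above, and the function names are made up for the example.

```python
# Sketch only: a first-order cost model for one point-to-point message.
GIGABIT = (50e-6, 80e6)   # (latency in seconds, bandwidth in bytes/second)
FAST = (4e-6, 800e6)      # IB/10G/quadrics class figures; 800 MB/s is a lower bound


def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to deliver one message: fixed latency plus serialisation."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def speedup(size_bytes):
    """Ratio of estimated gigabit time to estimated fast-interconnect time."""
    return transfer_time(size_bytes, *GIGABIT) / transfer_time(size_bytes, *FAST)
```

With these numbers the ratio for the communication itself sits between about 10x (large messages, bandwidth dominated) and 12.5x (tiny messages, latency dominated). How much of that shows up in wall-clock time depends on how much of the run is spent communicating.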

True - I guess I'm trying to do some cost/benefit analysis so the magnitude of the improvement is important to me .. but maybe measuring it on a test cluster is the only way to be sure of this one.
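
For the cost/benefit side, one back-of-envelope approach is an Amdahl-style estimate: if a fraction of the run time is spent in communication and only that part gets faster, the overall gain is bounded. The sketch below assumes that fraction has been measured somehow (for example from MPI log data); both the fraction and the function name are hypothetical inputs for illustration, not numbers from this cluster.

```python
# Sketch only: Amdahl-style estimate; comm_fraction must come from measurement.
def overall_speedup(comm_fraction, comm_speedup):
    """Whole-job speedup when only the communication part gets faster.

    comm_fraction: share of current run time spent communicating (0.0 to 1.0).
    comm_speedup:  how many times faster communication becomes (> 0).
    """
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be between 0 and 1")
    if comm_speedup <= 0:
        raise ValueError("comm_speedup must be positive")
    return 1.0 / ((1.0 - comm_fraction) + comm_fraction / comm_speedup)
```

As an example of the shape of the curve, a job that spent 30% of its time communicating and got a 10x faster interconnect would come out roughly 1.37x faster overall under this model, so the measured communication share matters more than the headline interconnect numbers.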

characteristics of the MPI job? What characteristics should I be looking for?
well, have you run a simple MPI benchmark, to make sure you're seeing
reasonable performance? single-pair latency, bandwidth and some form of group communication are always good to know.

Not in a while - I did some testing early on when I was testing different compilers but I don't think I did any specific MPI testing. What would you recommend - pallas or hpl? Or something else? What's a good one that has other good publicly available reference data?

a) to identify what parts of the system any tuning exercises should focus on.
- some possible low hanging fruit includes enabling jumbo frames [some rough
jumbo frames are mainly a way to recover some CPU overhead - most systems,
especially those which are only 20%, can handle back-to-back 1500B frames.
it's easy enough to measure (with ttcp, netperf, etc).
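
To make the CPU-overhead point concrete, here is a minimal sketch of how many frames per second a host has to handle at a given line rate for different MTUs. It assumes standard Ethernet per-frame overhead on the wire (38 bytes: preamble, header, FCS and inter-frame gap) and a 9000-byte jumbo MTU, which is a common choice but not a value stated in this thread; the function name is made up for the example.

```python
# Sketch only: frame-rate arithmetic, not a measurement.
# Per-frame wire overhead: 8 preamble/SFD + 14 header + 4 FCS + 12 inter-frame gap.
ETHERNET_OVERHEAD = 38


def frames_per_second(link_bits_per_s, mtu_bytes, overhead_bytes=ETHERNET_OVERHEAD):
    """Frames per second needed to fill the link with full-sized frames."""
    return link_bits_per_s / 8.0 / (mtu_bytes + overhead_bytes)
```

At gigabit wire speed this works out to roughly 81,000 frames per second with a 1500-byte MTU and roughly 13,800 with a 9000-byte MTU, so about six times fewer frames (and correspondingly less per-packet work) for the same data. Measuring with ttcp or netperf, as suggested above, is still the way to see what a given node actually sustains.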

Interestingly enough - I enabled this on Friday and the first model we tested with showed a 2-3% performance improvement in some quick testing. We tested it with another model which uses a larger test set over the weekend and it showed a 30% improvement. So that's good news, but it's still not entirely obvious why we're seeing such a huge improvement when the network utilisation doesn't indicate that the switch is saturated - but I guess latency could be a big factor here.

I don't think you mentioned what your network looks like - all into one switch? what kind is it? have you verified that all the links are at 1000/fullduplex?

All the nodes are Tyan s2891 boards with onboard Broadcom bcm5704 integrated nics. They are all connected to a single hp procurve 3400cl 24-port switch. And I've verified that all ports are running at 1000/full (the switch is reporting some ports as using MDI and some using MDIX but I'm not sure that's a cause for concern, although it is mildly surprising since they all use a standard cable and mobo).

I notice that AMD (and Mellanox and Pathscale/Qlogic) have clusters available through their developer program for testing. Has anyone actually used these?
I haven't.  but if you'd like to try on our systems, we have quite a range.
(no IB, but our quadrics systems are roughly comparable.)

Thanks for the offer (if it was :) - I need to have a think about the effort required to set this up and see how much assistance the AMD/Mellanox/Pathscale cluster folks give for this kind of testing - if they don't make it too difficult I'm inclined to avail of their kit rather than hassle you.


Thanks again,

-stephen

--
Stephen Mulcahy, Applepie Solutions Ltd, Innovation in Business Center,
   GMIT, Dublin Rd, Galway, Ireland.      http://www.aplpi.com
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
