Mark Hahn wrote:
1. processor bound.
2. memory bound.

oprofile is the only thing I know of that will give you this distinction.

In practice, I don't think it is, given the usage characteristics I mentioned in my previous mail.

3. interconnect bound.

with ethernet, this is obvious, since you can just look at user/system/idle
times.

You mean the system time will be high if nodes are busy sending/receiving?
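One way to look at the user/system/idle split over an interval is to diff two readings of the first line of /proc/stat on a compute node. This is only a rough sketch, assuming Linux; the function names are made up for illustration. Note that on Linux a good part of network receive processing is accounted under irq/softirq rather than system, so those columns are worth watching as well.

```python
# Sketch: CPU time breakdown between two samples of the aggregate "cpu"
# line from /proc/stat (Linux). Function names are illustrative only.

FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq"]


def parse_cpu_line(line):
    """Return the first seven jiffy counters from a 'cpu ...' line."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(v) for v in parts[1:1 + len(FIELDS)]]


def cpu_shares(before, after):
    """Percentage of time spent in each state between two samples."""
    delta = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(delta)
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {name: 100.0 * d / total for name, d in zip(FIELDS, delta)}


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate cpu line (the first line of /proc/stat)."""
    with open(path) as f:
        return f.readline()
```

To use it, call read_cpu_line() twice a few seconds apart while the MPI job is running and pass both lines to cpu_shares(). High system plus softirq with little idle would point towards the network stack eating CPU; mostly idle nodes would suggest they are waiting on something else.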


4. headnode bound.

do you mean for NFS traffic?

More in terms of managing the responses from the compute nodes.

it's not _that_ hard to hit full wire speed with gigabit...
however, saturating the wire means it's entirely possible that nodes
are being bottlenecked by this.

It seems to be the case - but the peak usage of the network is relatively infrequent (every 30 minutes or so) - average usage is much lower.

* Network traffic in averages about 50 Mbit/sec but peaks to about 200 Mbit/sec. Network traffic out averages about 50 Mbit/sec but peaks to about 200 Mbit/sec. The peaks are very short (maybe a few seconds in duration, presumably at the end of an MPI "run" if that is the correct term).

you don't think the peaks correspond to inter-node communication (_during_
the MPI job)?

Sorry, that's what I meant - this particular model outputs to a history file every 30 minutes or so - and seems to do a lot of inter-node comms around the same time .. so yes, the traffic seems to be generated by the MPI job.

ouch. the cluster is doing very badly, and clearly bottlenecked on either inter-node or headnode IO. I guess I'd be tempted to capture some representative trace data with tcpdump (but I'm pretty old-fashioned and fundamentalist about these things.)

Phew, that sounds hardcore :)

I tried wireshark on the headnode for a few minutes but ended up with gigs of data and wasn't sure what I was looking for, so I'm currently trying to see why my model isn't generating MPE log data, which might be more manageable. What would you see in a tcpdump if the network was the bottleneck - lots of resends?
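On the resends question, a lighter-weight check than a full capture is to compare the kernel's TCP counters before and after a run. The sketch below assumes Linux and assumes the MPI traffic actually goes over TCP (which this thread does not confirm); the function names are hypothetical. It parses the Tcp: header/value line pair from /proc/net/snmp and reports retransmitted segments as a percentage of segments sent.

```python
# Sketch: TCP retransmission rate from two snapshots of /proc/net/snmp.
# Assumes Linux and MPI-over-TCP; function names are illustrative only.


def tcp_counters(snmp_text):
    """Parse the 'Tcp:' header line and value line into a dict of ints."""
    header = None
    for line in snmp_text.splitlines():
        if not line.startswith("Tcp:"):
            continue
        fields = line.split()[1:]
        if header is None:
            header = fields          # first Tcp: line holds the column names
        else:
            return dict(zip(header, (int(v) for v in fields)))
    raise ValueError("no Tcp: header/value pair found")


def retransmit_percent(before, after):
    """Retransmitted segments as a percentage of segments sent in between."""
    sent = after["OutSegs"] - before["OutSegs"]
    resent = after["RetransSegs"] - before["RetransSegs"]
    return 100.0 * resent / sent if sent > 0 else 0.0
```

Reading /proc/net/snmp on a compute node just before and just after one of the 30-minute bursts and feeding both snapshots through these two functions gives a single number to compare across nodes. A rate near zero would suggest that loss and resends are not the main problem.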

quantify that. Do others here running MPI jobs see big improvements in using Infiniband over Gigabit for MPI jobs or does it really depend on the

jeez: compare a 50 us interconnect to a 4 us one (or 80 MB/s vs >800).

anything which doesn't speed up going from gigabit to IB/10G/quadrics is what I would call embarrassingly parallel...
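To put rough numbers on that comparison, the usual first-order model for a single message is time = latency + size / bandwidth. The sketch below plugs in the figures quoted above (50 us and 80 MB/s for gigabit, 4 us and 800 MB/s for the faster interconnect, taking the ">800" as a lower bound). It is only a back-of-envelope model that ignores contention, protocol overhead and overlap with computation, and the names are illustrative.

```python
# Sketch: first-order message time model, time = latency + size / bandwidth.
# Figures are the ones quoted in this thread; 800 MB/s is a lower bound.

GIGABIT = {"latency_us": 50.0, "bandwidth_mb_s": 80.0}
FAST = {"latency_us": 4.0, "bandwidth_mb_s": 800.0}


def transfer_time_us(size_bytes, latency_us, bandwidth_mb_s):
    """One-way message time in microseconds.

    With MB taken as 10**6 bytes, bytes / (MB/s) comes out in microseconds.
    """
    return latency_us + size_bytes / bandwidth_mb_s


def speedup(size_bytes):
    """How many times faster the fast interconnect moves one message."""
    slow = transfer_time_us(size_bytes, **GIGABIT)
    fast = transfer_time_us(size_bytes, **FAST)
    return slow / fast


for size in (8, 1024, 65536, 1_000_000):
    print(f"{size:>9} bytes: {speedup(size):.1f}x")
```

Under these assumptions the per-message gain is roughly 10x to 12x at every size, so how much the whole job speeds up depends on what fraction of its runtime is spent communicating rather than computing.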

True - I guess I'm trying to do some cost/benefit analysis so the magnitude of the improvement is important to me .. but maybe measuring it on a test cluster is the only way to be sure of this one.

characteristics of the MPI job? What characteristics should I be looking for?

well, have you run a simple MPI benchmark, to make sure you're seeing
reasonable performance? single-pair latency, bandwidth and some form of group communication are always good to know.
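Once a ping-pong style benchmark has produced one-way times for a few message sizes, the single-pair latency and bandwidth can be estimated with a straight-line fit: the intercept is the latency and the inverse of the slope is the bandwidth. The sketch below is a plain least-squares fit with a hypothetical function name. Real MPI stacks are not perfectly linear (they often switch protocols at some message size), so treat the result as a rough summary only.

```python
# Sketch: estimate latency and bandwidth from ping-pong timings by fitting
# time_us = latency_us + size_bytes / bandwidth. Names are illustrative.


def fit_latency_bandwidth(samples):
    """samples: iterable of (message_bytes, one_way_time_us) pairs.

    Returns (latency_us, bandwidth_mb_s), with MB taken as 10**6 bytes.
    """
    samples = list(samples)
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two message sizes")
    mean_x = sum(size for size, _ in samples) / n
    mean_y = sum(t for _, t in samples) / n
    sxx = sum((size - mean_x) ** 2 for size, _ in samples)
    if sxx == 0:
        raise ValueError("need at least two distinct message sizes")
    sxy = sum((size - mean_x) * (t - mean_y) for size, t in samples)
    slope = sxy / sxx                      # microseconds per byte
    if slope <= 0:
        raise ValueError("times do not grow with message size")
    latency_us = mean_y - slope * mean_x   # intercept at zero bytes
    bandwidth_mb_s = 1.0 / slope           # bytes per microsecond == MB/s
    return latency_us, bandwidth_mb_s
```

For gigabit, the figures quoted earlier in this thread (about 50 us and 80 MB/s) give a ballpark to compare the fitted values against.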

Not in a while - I did some testing early on when I was testing different compilers but I don't think I did any specific MPI testing. What would you recommend - pallas or hpl? Or something else? What's a good one that has other good publicly available reference data?

a) to identify what parts of the system any tuning exercises should focus on. - some possible low hanging fruit includes enabling jumbo frames [some rough

jumbo frames are mainly a way to recover some CPU overhead - most systems,
especially those which are only 20%, can handle back-to-back 1500B frames.
it's easy enough to measure (with ttcp, netperf, etc).
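The CPU-overhead point can be made concrete by counting frames. At gigabit wire speed each frame carries a fixed 38 bytes of Ethernet overhead on the wire (8 preamble, 14 header, 4 FCS, 12 inter-frame gap), so the number of frames a host must handle per second depends almost entirely on the MTU. A small sketch, with illustrative names and assuming untagged frames (no VLAN header):

```python
# Sketch: frames per second and payload efficiency at wire speed for a
# given MTU, assuming standard untagged Ethernet framing.

ETHERNET_OVERHEAD_BYTES = 8 + 14 + 4 + 12   # preamble, header, FCS, gap
GIGABIT_BITS_PER_S = 1_000_000_000


def frames_per_second(mtu_bytes, link_bits_per_s=GIGABIT_BITS_PER_S):
    """Maximum back-to-back full-size frames per second on the link."""
    return link_bits_per_s / ((mtu_bytes + ETHERNET_OVERHEAD_BYTES) * 8)


def payload_efficiency(mtu_bytes):
    """Fraction of wire time that carries MTU payload (IP packet) bytes."""
    return mtu_bytes / (mtu_bytes + ETHERNET_OVERHEAD_BYTES)


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {frames_per_second(mtu):,.0f} frames/s, "
          f"{payload_efficiency(mtu):.1%} payload")
```

That works out to roughly 81,000 frames/s at MTU 1500 against roughly 14,000 at MTU 9000, so the raw bandwidth gain is only about 2% while the per-frame work drops by nearly a factor of six.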

Interestingly enough - I enabled this on Friday and the first model we tested with showed a 2-3% performance improvement in some quick testing. We tested it with another model which uses a larger test set over the weekend and it showed a 30% improvement. So that's good news, but it's still not entirely obvious why we're seeing such a huge improvement when the network utilisation doesn't indicate that the switch is saturated - but I guess latency could be a big factor here.
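One possible reason the utilisation graphs look unsaturated is the polling interval: a rate computed from byte counters is an average over the interval, so a short burst at wire speed gets diluted. The sketch below uses made-up counter values and an assumed 5-second poll purely for illustration (the actual polling interval is not stated in this thread).

```python
# Sketch: how the sampling interval changes the apparent peak rate.
# Counter values and the 5-second poll interval are hypothetical.


def mbit_per_s(byte_counts, interval_s):
    """Rates in Mbit/s between consecutive cumulative byte counters."""
    return [
        (later - earlier) * 8 / 1e6 / interval_s
        for earlier, later in zip(byte_counts, byte_counts[1:])
    ]


# Cumulative bytes sampled once per second: idle, then one second at
# gigabit wire speed (125,000,000 bytes), then idle again.
per_second = [0, 0, 0, 125_000_000, 125_000_000, 125_000_000]

# The same five seconds seen by a poller that samples every 5 seconds.
per_five_seconds = [per_second[0], per_second[-1]]

print(mbit_per_s(per_second, 1))        # one sample hits 1000 Mbit/s
print(mbit_per_s(per_five_seconds, 5))  # the same burst reads as 200 Mbit/s
```

In this made-up case a link that was fully saturated for one second shows up as a 200 Mbit/sec peak, so sampling the interface counters at a finer interval during one of the 30-minute bursts would help tell the two situations apart.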

I don't think you mentioned what your network looks like - all into one switch? what kind is it? have you verified that all the links are at 1000/fullduplex?

All the nodes are Tyan s2891 boards with onboard Broadcom bcm5704 integrated nics. They are all connected to a single hp procurve 3400cl 24-port switch. And I've verified that all ports are running at 1000/full (the switch is reporting some ports as using MDI and some using MDIX but I'm not sure that's a cause for concern, although it is mildly surprising since they all use a standard cable and mobo).

I notice that AMD (and Mellanox and Pathscale/Qlogic) have clusters available through their developer program for testing. Has anyone actually used these?

I haven't.  but if you'd like to try on our systems, we have quite a range.
(no IB, but our quadrics systems are roughly comparable.)

Thanks for the offer (if it was :) - I need to have a think about the effort required to set this up and see how much assistance the AMD/Mellanox/Pathscale cluster folks give for this kind of testing - if they don't make it too difficult I'm inclined to avail of their kit rather than hassle you.

Thanks again,

-stephen

--
Stephen Mulcahy, Applepie Solutions Ltd, Innovation in Business Center,
   GMIT, Dublin Rd, Galway, Ireland.      http://www.aplpi.com
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
