>> I'm not getting stellar performance with the PETSc linear solver on
>> a 64-bit Xeon (8 CPUs with 64 GB RAM). The machine's processors are
>> clocked at 3 GHz, but -log_summary tells me I'm running at 1e8 flops/s
>> (on a single processor; I don't see a big speedup with more processors,
>> but that's probably due to memory bandwidth).

I recently evaluated several options for a new cluster, and as a general
trend found that the Xeon's memory bandwidth is a severe limitation on speedup
for several highly implicit codes, not just libMesh/PETSc.  The Opterons,
however, seem to do much better with their HyperTransport bus.  We settled
on quad-socket/dual-core compute nodes, and will be evaluating a
quad-socket/quad-core upgrade in the next few months.  I'll let you
know how far this trend goes...
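A quick back-of-envelope shows why memory bandwidth, not clock speed, tends to
set the ceiling for these solvers.  A sparse matrix-vector product streams
roughly 12 bytes from memory per flop (an 8-byte matrix value plus a 4-byte
column index, with the vector entries mostly cached).  The bandwidth figure
below is an assumed ~6.4 GB/s sustained for a front-side-bus Xeon, not a
measured number -- plug in your own:

```shell
# Hypothetical numbers -- substitute what your machine actually sustains.
BW_BYTES_PER_SEC=6400000000   # assumed ~6.4 GB/s sustained memory bandwidth
BYTES_PER_FLOP=12             # ~8-byte matrix value + 4-byte column index per flop
FLOP_CEILING=$((BW_BYTES_PER_SEC / BYTES_PER_FLOP))
echo "$FLOP_CEILING"          # a few times 1e8 flops/s, regardless of the 3 GHz clock
```

So a few hundred Mflops/s on one core is about what a bandwidth-bound sparse
solve can deliver on that hardware, which is why adding cores on one bus
doesn't help much either.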

But you can likely tweak more performance out of what you've got...

>> That's 30x slower than the
>> clock speed. Is that normal? Have other users seen this order of
>> magnitude difference between clock speed and flop/s on other systems?
>> I'm testing a system with ~100,000 DOFs, using the cg solver.
>> 
>> I'd like to know if I should invest time in tuning my libraries/system,
>> or just give up and buy a better computer.

I'm not sure what BLAS/LAPACK you are using with PETSc, but that has a
first-order impact on performance.  I would suggest *not* letting PETSc
download and compile BLAS/LAPACK, since there are several assembly-level
BLAS implementations which tend to smoke any compiled version.  In no
particular order, they are:

libGoto     - BLAS only, http://www.tacc.utexas.edu/resources/software
Intel's MKL - BLAS & LAPACK ($$?)
AMD's ACML  - BLAS & LAPACK, free once you register
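For reference, pointing PETSc at an external BLAS/LAPACK is done at configure
time.  A sketch -- the paths are placeholders and the exact option spellings
may differ between PETSc versions, so check config/configure.py --help:

```shell
# Hypothetical install locations -- substitute your own.
./config/configure.py \
  --with-blas-lapack-dir=/opt/intel/mkl
# or point at specific libraries, e.g.:
#   --with-blas-lib=/opt/goto/libgoto.a --with-lapack-lib=/usr/lib/liblapack.a
```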

Also, I don't know what MPI you are using, but on an 8-core SMP node it is
definitely worth using an MPI that implements on-node communication through
shared memory.  It is just silly to push local information through sockets
and a heavy network protocol.  There is a non-default configuration option
to MPICH that turns this on, and I think OpenMPI picks it up out of the box.
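As a sketch of what I mean -- the option names vary between MPICH releases,
so treat this as illustrative and check your version's installation guide:

```shell
# Illustrative MPICH2 build selecting the sockets+shared-memory channel.
# Device/channel names differ across MPICH versions -- verify before using.
./configure --with-device=ch3:ssm --prefix=/opt/mpich2-shm
make && make install
```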

Finally, PETSc is written in C with F77 kernels, both of which can be highly
optimized by almost any compiler these days.  It may be worth building PETSc
with some more aggressive optimization flags.
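A hedged example of what that might look like -- the flags here are
illustrative starting points, not tested recommendations, and should be tuned
for your compiler:

```shell
# Build an optimized (non-debug) PETSc and pass flags to the C and F77 kernels.
./config/configure.py --with-debugging=0 \
  COPTFLAGS='-O3' FOPTFLAGS='-O3'
```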

> If you do put effort into tuning and see a big difference in results,
> would you post the options that worked best for you to the list?

I echo this -- if you find substantial gains it might be worth starting a
wiki page to document collective experiences.

-Ben


_______________________________________________
Libmesh-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/libmesh-users