>> I'm not getting stellar performance with the PETSc linear solver on
>> a 64-bit Xeon (8 CPUs with 64 GB RAM). The machine's processors are
>> clocked at 3 GHz, but -log_summary tells me I'm running at 1e8 flop/s
>> (on a single processor; I don't see a big speedup with more
>> processors, but that's probably due to memory bandwidth).
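For context on the numbers quoted above: 1e8 flop/s is in the range you'd expect for a memory-bandwidth-bound sparse kernel, which is what CG spends most of its time in. A back-of-envelope sketch (the bandwidth and bytes-per-flop figures are illustrative assumptions, not measurements of this machine):

```python
# Roofline-style estimate of the attainable flop rate for a sparse
# matrix-vector product, the dominant kernel in CG.
# Both numbers below are assumptions -- substitute your own measurements.
bandwidth_bytes_per_s = 3.0e9  # ~3 GB/s sustained per core (assumption)
bytes_per_flop = 6.0           # a CSR/AIJ matvec streams roughly one 8-byte
                               # value + 4-byte column index per 2 flops

attainable_flops = bandwidth_bytes_per_s / bytes_per_flop
print(f"attainable: {attainable_flops:.1e} flop/s")
# prints: attainable: 5.0e+08 flop/s
```

So a few times 1e8 flop/s per core is about the ceiling for this workload regardless of the 3 GHz clock, which is also why adding cores that share one memory bus buys little.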
I recently evaluated several options for a new cluster, and as a general
trend found that Xeon memory bandwidth is a severe limitation on speedup
for several highly implicit codes, not just libMesh/PETSc. The Opterons,
however, seem to do much better with their HyperTransport bus. We settled
on quad-socket/dual-core compute nodes, and will be evaluating a potential
quad-socket/quad-core upgrade in the next few months. I'll let you know
how far this trend goes... But you can likely tweak more performance out
of what you've got.

>> That's 30x slower than the clock speed. Is that normal? Have other
>> users seen this order of magnitude difference between clock speed and
>> flop/s on other systems? I'm testing a system with ~100,000 DOFs,
>> using the cg solver.
>>
>> I'd like to know if I should invest time in tuning my
>> libraries/system, or just give up and buy a better computer.

I'm not sure what BLAS/LAPACK you are using with PETSc, but that has a
first-order impact on performance. I would suggest *not* letting PETSc
download and compile BLAS/LAPACK, since there are several assembly-level
BLAS implementations that tend to smoke any compiled version. In no
particular order:

  libGoto     - BLAS only, http://www.tacc.utexas.edu/resources/software
  Intel's MKL - BLAS & LAPACK ($$?)
  AMD's ACML  - BLAS & LAPACK, free once you register

Also, I don't know what MPI you are using, but on an 8-core SMP node it
is definitely worth using an MPI that implements on-node communication
through shared memory. It is just silly to push local information through
a heavy network protocol over sockets. There is a non-default
configuration option to MPICH that turns this on, and I think OpenMPI
picks it up out of the box.

Finally, PETSc is written in C with F77 kernels, both of which can be
highly optimized by almost any compiler these days. It may be worth
building PETSc with some more aggressive optimization flags.
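To make the BLAS and compiler-flag suggestions concrete, here is a rough
sketch of what a tuned PETSc configure line might look like. The install
path is an assumption, and the exact option spellings may differ between
PETSc versions -- check `./configure --help` for yours:

```shell
# Point PETSc at a vendor BLAS/LAPACK (ACML shown here; MKL or libGoto
# work the same way) instead of letting it download the reference one,
# build without debugging, and pass aggressive flags for the C and F77
# kernels. The path below is an assumption -- adjust for your system.
./configure --with-blas-lapack-dir=/opt/acml/gnu64 \
            --with-debugging=0 \
            COPTFLAGS='-O3' \
            FOPTFLAGS='-O3'
```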
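For the MPI side, the MPICH option in question is the channel/device
selection at configure time. The exact spelling varies by MPICH version
(and the prefix below is made up), so treat this as a sketch rather than
a recipe:

```shell
# Build MPICH2 with a shared-memory channel so ranks on the same node
# bypass the socket/network stack. With OpenMPI the shared-memory
# transport is enabled out of the box.
./configure --with-device=ch3:shm --prefix=/opt/mpich2-shm
make && make install
```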
> If you do put effort into tuning and see a big difference in results,
> would you post the options that worked best for you to the list?

I echo this -- if you find substantial gains it might be worth starting
a wiki page to document collective experiences.

-Ben

_______________________________________________
Libmesh-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/libmesh-users
