Hi,

I have been running benchmarks on the OpenMP branch of petsc-dev on an Intel Westmere (Intel(R) Xeon(R) CPU X5670 @ 2.93GHz).
The tarball linked below contains all the graphs, the test code and the code to generate the results, so I am just going to give a quick summary here.

http://amcg.ese.ic.ac.uk/~ggorman/omp_vec_benchmarks.tar.gz

There are 3 sets of results:

gcc/           : GCC 4.6
intel/         : Intel 12.0 with MKL
intel-pinning/ : as above but applying hard affinity

Files matching mpi_*.pdf show the MPI speedup and parallel efficiency for a range of vector sizes. Files matching omp_*.pdf show the same for OpenMP. The remaining files directly compare the scaling of MPI vs OpenMP for the various tests at the largest vector size.

I think the results are very encouraging and there are many interesting little details in there. I am just going to summarise a few here that I think are particularly important.

1. In most cases the threaded code performs as well as, and in many cases better than, the MPI code.

2. For GCC I did not use a threaded BLAS. For Intel I used -lmkl_intel_thread. However, it appears dnrm2 is not threaded. It seems to be a common feature among other threaded BLAS libraries that Level 1 is not completely threaded (e.g. Cray). Unfortunately most of this is experience/anecdotal information; I do not know of any proper survey. We have the option here of either rolling our own or ignoring the issue until profiling shows it is a problem, and eventually someone else will release a fully threaded BLAS.

3. Comparing intel/ and intel-pinning/ is particularly interesting. "First touch" has been applied to all memory in VecCreate so that memory should be paged correctly for NUMA. But first touch does not gain you much if threads migrate, so for the intel-pinning/ results I set the environment variable KMP_AFFINITY=scatter to get hard affinity. You can clearly see from the results that this improves parallel efficiency by a few percentage points in many cases. It also really smooths out the efficiency dips as you run on different numbers of threads.
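To make points 2 and 3 concrete, here is a minimal sketch of what first touch and a hand-rolled threaded Level 1 kernel might look like. It is not the code in the branch: vec_alloc_first_touch and vec_sum_squares are hypothetical names, and the sum of squares is the naive form (the 2-norm is its square root), without the scaling a real dnrm2 uses to avoid overflow. The assumption is that both loops use the same static schedule, so each thread later reads the pages it touched first.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not a PETSc function: allocate n doubles and zero
 * them inside a parallel loop, so that each page is first touched (and so
 * placed on the local NUMA node) by the thread that initialises it.
 * Without OpenMP enabled the pragma is ignored and the loop runs serially. */
double *vec_alloc_first_touch(size_t n)
{
    double *x = malloc(n * sizeof *x);
    long i;

    if (x == NULL)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (i = 0; i < (long)n; i++)
        x[i] = 0.0;

    return x;
}

/* Hypothetical hand-rolled kernel: threaded sum of squares of x.
 * The same static schedule as above is assumed, so each thread reads the
 * index range it initialised. Take the square root of the result to get
 * the 2-norm; unlike dnrm2, no rescaling is done to guard against
 * overflow or underflow. */
double vec_sum_squares(const double *x, size_t n)
{
    double sum = 0.0;
    long i;

    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (i = 0; i < (long)n; i++)
        sum += x[i] * x[i];

    return sum;
}
```

The locality this buys only holds while threads stay where they were when the memory was touched, which is why the pinned runs set KMP_AFFINITY=scatter.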
Full-blown benchmarks would not make a lot of sense until we get the Mat classes threaded in a similar fashion. However, at this point I would like feedback on the direction this is taking and on whether we can start getting code committed.

Cheers
Gerard