[petsc-dev] OpenMP/Vec

Gerard Gorman Thu, 16 Feb 2012 16:33:16 +0000

Hi

I have been running benchmarks on the OpenMP branch of petsc-dev on an
Intel Westmere (Intel(R) Xeon(R) CPU X5670 @ 2.93GHz).


You can see all graphs + test code + code to generate results in the tar
ball linked below and I am just going to give a quick summary here.
http://amcg.ese.ic.ac.uk/~ggorman/omp_vec_benchmarks.tar.gz

There are 3 sets of results:

gcc/ : GCC 4.6
intel/ : Intel 12.0 with MKL
intel-pinning/ : as above but applying hard affinity.

Files matching mpi_*.pdf show the MPI speedup and parallel efficiency
for a range of vector sizes. The omp_*.pdf files show the same for
OpenMP. The remaining files directly compare the scaling of MPI vs OpenMP
for the various tests at the largest vector size.
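
For reference, the two quantities plotted are presumably the usual ones.
A minimal sketch, assuming the baseline is the run on a single
process/thread (the function names are illustrative, not taken from the
benchmark scripts):

```c
/* Speedup and parallel efficiency from wall-clock times.
 * Assumption: t1 is the time on 1 process/thread and tp is the
 * time on p processes/threads for the same vector size. */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

double parallel_efficiency(double t1, double tp, int p)
{
    /* An efficiency of 1.0 corresponds to ideal (linear) scaling. */
    return speedup(t1, tp) / (double)p;
}
```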

I think the results are very encouraging and there are many interesting
little details in there. I am just going to summarise a few here that I
think are particularly important.

1. In most cases the threaded code performs as well as, and in many
cases better than, the MPI code.

2. For GCC I did not use a threaded BLAS. For Intel I used
-lmkl_intel_thread. However, it appears dnrm2 is not threaded. It seems
to be common among other threaded BLAS libraries that Level 1 is not
completely threaded (e.g. Cray). Unfortunately most of this is
anecdotal experience; I do not know of any proper survey. We have the
option here of either rolling our own or ignoring the issue until
profiling shows it is a problem... and eventually someone else will
release a fully threaded BLAS.

3. Comparing intel/ and intel-pinning/ is particularly interesting.
"First touch" has been applied to all memory in VecCreate so that memory
should be paged correctly for NUMA. But first touch does not gain you
much if threads migrate, so for the intel-pinning/ results I set the
environment variable KMP_AFFINITY=scatter to get hard affinity. You can
clearly see from the results that this improves parallel efficiency by a
few percentage points in many cases. It also really smooths out the
efficiency dips as you run on different numbers of threads.
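
On point 2, "rolling our own" Level 1 kernel could be as small as an
OpenMP reduction. This is only a sketch under assumptions: the name
vec_sumsq_omp is hypothetical, it returns the sum of squares (the
caller takes the square root to get the 2-norm), and it omits the
scaling that the reference dnrm2 uses to guard against overflow and
underflow.

```c
/* Hypothetical helper, not a PETSc or BLAS routine: threaded sum of
 * squares of x[0..n-1]. The 2-norm is the square root of the result.
 * Without OpenMP enabled the pragma is ignored and the loop runs
 * serially with the same result. */
double vec_sumsq_omp(const double *x, long n)
{
    double sum = 0.0;
    long i;

    /* Each thread accumulates a private partial sum; the partial sums
     * are combined when the loop ends. */
#pragma omp parallel for reduction(+:sum) schedule(static)
    for (i = 0; i < n; i++)
        sum += x[i] * x[i];

    return sum;
}
```

Because the order of the additions depends on the number of threads,
the result can differ from a serial sum in the last few bits.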

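On point 3, the first-touch idea can be illustrated as follows. This is
a sketch only, not the code in the branch: the function names are
hypothetical, and it assumes an operating system that places a page on
the NUMA node of the thread that first writes to it. The key point is
that initialisation and later computation use the same static loop
schedule, so each thread works on the pages it touched first.

```c
#include <stdlib.h>

/* Hypothetical allocator: the memory is first written inside a
 * parallel loop, so under a first-touch policy each page should end
 * up local to the thread that initialised it. */
double *vec_alloc_first_touch(long n)
{
    double *x = malloc((size_t)n * sizeof *x);
    long i;

    if (x == NULL)
        return NULL;

#pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++)
        x[i] = 0.0;

    return x;
}

/* y <- y + a*x, using the same static schedule as the allocator so
 * that each thread mostly accesses memory it touched first. */
void vec_axpy(double *y, double a, const double *x, long n)
{
    long i;

#pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

This placement only pays off if threads stay where they were when the
memory was touched, which is what KMP_AFFINITY=scatter was used for in
the intel-pinning/ runs.
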
Full-blown benchmarks would not make a lot of sense until we get the Mat
classes threaded in a similar fashion. However, at this point I would
like feedback on the direction this is taking and on whether we can
start getting code committed.

Cheers
Gerard
