> I have been running benchmarks on the OpenMP branch of petsc-dev on an
> Intel Westmere (Intel(R) Xeon(R) CPU X5670 @ 2.93GHz).
> You can see all graphs + test code + code to generate results in the tar
> ball linked below and I am just going to give a quick summary here.
> There are 3 sets of results:
> gcc/ : GCC 4.6
> intel/ : Intel 12.0 with MKL
> intel-pinning/ : as above but applying hard affinity.
> Files matching mpi_*.pdf show the MPI speedup and parallel efficiency
> for a range of vector sizes. Similarly for omp_*.pdf with respect to
> OpenMP. The remaining files directly compare scaling of MPI vs OpenMP
> for the various tests for the largest vector size.
> I think the results are very encouraging and there are many interesting
> little details in there. I am just going to summarise a few here that I
> think are particularly important.
> 1. In most cases the threaded code performs as well as, and in many
> cases better than, the MPI code.


> 2. For GCC I did not use a threaded BLAS. For Intel I used
> -lmkl_intel_thread. However, it appears dnrm2 is not threaded. It seems
> to be a common feature among other threaded BLAS libraries that Level 1
> is not completely threaded (e.g. Cray). Unfortunately most of this is
> experience/anecdotal information. I do not know of any proper survey. We
> have the option here of either rolling our own or ignoring the issue
> until profiling shows it is a problem...and eventually someone else will
> release a fully threaded BLAS.

Even much of BLAS2 is not threaded in MKL. Note that opening a parallel
region is actually quite expensive, so, perhaps ironically, MPI is
expected to perform better than threading when parallel regions involve
relatively little work. In the case of BLAS-1, only quite large sizes could
possibly pay for spawning a parallel region.
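
To make the size cutoff concrete, here is a minimal sketch of a dot product
that only uses a full thread team above some length. The function name
vec_dot and the value of DOT_PAR_THRESHOLD are hypothetical, not anything
from petsc-dev or MKL; the real break-even point would have to be measured
on each machine. Compiled without OpenMP, the pragma is ignored and the
loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff: below this length the loop runs on one thread.
   The number is a placeholder, not a measured break-even point. */
#define DOT_PAR_THRESHOLD 100000L

/* Returns sum_i x[i]*y[i] for i in [0, n). */
double vec_dot(const double *x, const double *y, long n)
{
    double sum = 0.0;
    long i;

    assert(n >= 0 && (n == 0 || (x != NULL && y != NULL)));

    /* The if() clause makes the region execute with a team of one thread
       when n is small, so short vectors avoid most of the fork/join cost. */
    #pragma omp parallel for reduction(+:sum) if(n > DOT_PAR_THRESHOLD)
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

The if() clause is standard OpenMP, so this does not need a separate serial
code path, but the encounter of the construct itself is still not free.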

This is why I think the long-term solutions for threading involve
long-lived threads with mostly-private memory that prefer redundant
computation so that you only have to pay for synchronization instead of
also having to pay dearly to use interfaces. Unfortunately, I think the
current threaded programming models are challenging to use in this way,
and doing so imposes some extra complexity on users.
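
As a small step in that direction (a sketch only, far short of fully
long-lived threads), two kernels can share one parallel region so that the
fork/join cost is paid once and only the barriers between the work-sharing
loops remain. The function axpy_then_dot is a hypothetical name used for
illustration, not a PETSc routine.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y <- y + a*x and then returns dot(y, y), using a single
   parallel region for both loops. */
double axpy_then_dot(double a, const double *x, double *y, long n)
{
    double sum = 0.0;

    assert(n >= 0);

    #pragma omp parallel
    {
        long i;

        /* First kernel; the implicit barrier at the end of the loop
           guarantees y is fully updated before it is read below. */
        #pragma omp for schedule(static)
        for (i = 0; i < n; i++)
            y[i] += a * x[i];

        /* Second kernel reuses the same team and the same static
           partition, so each thread reads the entries it just wrote. */
        #pragma omp for schedule(static) reduction(+:sum)
        for (i = 0; i < n; i++)
            sum += y[i] * y[i];
    }

    return sum;
}
```

The catch is that this fusion has to happen in the caller, which is exactly
the kind of extra complexity pushed onto users.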

> 3. Comparing intel/ and intel-pinning/ is particularly interesting.
> "First touch" has been applied to all memory in VecCreate so that memory
> should be paged correctly for NUMA. But first touch does not gain you
> much if threads migrate, so for the intel-pinning/ results I set the env
> KMP_AFFINITY=scatter to get hard affinity. You can clearly see from the
> results that this improves parallel efficiency by a few percentage
> points in many cases. It also really smooths out efficiency dips as you
> run on different numbers of threads.

Are you choosing sizes so that thread partitions always fall on a page
boundary, or are some pages cut irregularly?
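
To illustrate what I mean by falling on a page boundary, here is a sketch
of a partition whose interior boundaries are rounded down to a whole page.
The name chunk_start is hypothetical, and it assumes the array itself
starts on a page boundary and that the page size is passed in by the caller
(e.g. 4096 bytes); I do not know how the branch currently splits vectors.

```c
#include <assert.h>
#include <stddef.h>

/* Start index of thread t's chunk when n doubles are split among nthreads
   threads. Interior boundaries are rounded down to a multiple of one page
   so that no page is shared by two threads. Passing t == nthreads returns
   n, i.e. the end of the last chunk.
   Assumes the array begins on a page boundary and n * t does not overflow. */
size_t chunk_start(size_t n, size_t nthreads, size_t t, size_t page_bytes)
{
    size_t per_page = page_bytes / sizeof(double); /* doubles per page */
    size_t start;

    assert(nthreads > 0 && t <= nthreads);
    assert(page_bytes >= sizeof(double) && page_bytes % sizeof(double) == 0);

    if (t == nthreads)
        return n;

    start = (n * t) / nthreads;       /* even split */
    return start - start % per_page;  /* round down to a page boundary */
}
```

Thread t would then own indices chunk_start(n, p, t, pg) up to, but not
including, chunk_start(n, p, t + 1, pg). For short vectors several threads
can end up with empty chunks, which is another reason small sizes may not
be worth threading.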

> Full blown benchmarks would not make a lot of sense until we get the Mat
> classes threaded in a similar fashion. However, at this point I would
> like feedback on the direction this is taking and if we can start
> getting code committed.

Did you post a repository yet? I'd like to have a look at the code.
