Sorry about the slow response.

On Thu, Feb 16, 2012 at 10:33, Gerard Gorman <g.gorman at imperial.ac.uk> wrote:
> Hi
>
> I have been running benchmarks on the OpenMP branch of petsc-dev on an
> Intel Westmere (Intel(R) Xeon(R) CPU X5670 @ 2.93GHz).
>
> You can see all graphs + test code + code to generate results in the
> tarball linked below and I am just going to give a quick summary here.
> http://amcg.ese.ic.ac.uk/~ggorman/omp_vec_benchmarks.tar.gz
>
> There are 3 sets of results:
>
> gcc/ : GCC 4.6
> intel/ : Intel 12.0 with MKL
> intel-pinning/ : as above but applying hard affinity.
>
> Files matching mpi_*.pdf show the MPI speedup and parallel efficiency
> for a range of vector sizes. Similarly for omp_*.pdf with respect to
> OpenMP. The remaining files directly compare scaling of MPI vs. OpenMP
> for the various tests for the largest vector size.
>
> I think the results are very encouraging and there are many interesting
> little details in there. I am just going to summarise a few here that I
> think are particularly important.
>
> 1. In most cases the threaded code performs as well as, and in many
> cases better than, the MPI code.
>

Cool.

>
> 2. For GCC I did not use a threaded blas. For Intel I used
> -lmkl_intel_thread. However, it appears dnrm2 is not threaded. It seems
> to be a common feature among other threaded blas libraries that Level 1
> is not completely threaded (e.g. cray). Unfortunately most of this is
> experience/anecdotal information. I do not know of any proper survey. We
> have the option here of either rolling our own or ignoring the issue
> until profiling shows it is a problem...and eventually someone else will
> release a fully threaded blas.
>

Even much of BLAS2 is not threaded in MKL. Note that opening a parallel
region is actually quite expensive, so (perhaps ironically) MPI is expected
to perform better than threading when parallel regions involve relatively
little work. In the case of BLAS-1, only quite large sizes could possibly
pay for spawning a parallel region.
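To make the BLAS-1 point concrete, here is a minimal C sketch (not taken from the OpenMP branch) of a dot product that only asks for a multi-thread team above a size cutoff. The names `vec_dot` and `PAR_THRESHOLD` and the cutoff value itself are made up for illustration; a real cutoff would have to be measured on the target machine. If the file is compiled without OpenMP, the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff: a placeholder, not a measured value. */
#define PAR_THRESHOLD 100000

/* Dot product of x and y, each of length n (a BLAS-1 style kernel). */
double vec_dot(const double *x, const double *y, size_t n)
{
    long i;
    double sum = 0.0;

    assert(n == 0 || (x != NULL && y != NULL));

    /* The if clause requests more than one thread only when the vector
       is long enough that the work might pay for the parallel region. */
    #pragma omp parallel for reduction(+:sum) if(n > PAR_THRESHOLD)
    for (i = 0; i < (long)n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

Even when the `if` clause is false, the runtime still executes the region with a team of one thread, so the overhead is reduced rather than eliminated.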
This is why I think the long-term solutions for threading involve
long-lived threads with mostly-private memory that prefer redundant
computation so that you only have to pay for synchronization instead of
also having to pay dearly to use interfaces. Unfortunately, I think the
current threaded programming models are challenging to use in this way and
it imposes some extra complexity on users.

>
> 3. Comparing intel/ and intel-pinning/ is particularly interesting.
> "First touch" has been applied to all memory in VecCreate so that memory
> should be paged correctly for NUMA. But first touch does not gain you
> much if threads migrate, so for the intel-pinning/ results I set the env
> KMP_AFFINITY=scatter to get hard affinity. You can clearly see from the
> results that this improves parallel efficiency by a few percentage
> points in many cases. It also really smooths out efficiency dips as you
> run on different numbers of threads.
>

Are you choosing sizes so that thread partitions always fall on a page
boundary, or are some pages cut irregularly?

>
> Full blown benchmarks would not make a lot of sense until we get the Mat
> classes threaded in a similar fashion. However, at this point I would
> like feedback on the direction this is taking and if we can start
> getting code committed.
>

Did you post a repository yet? I'd like to have a look at the code.
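To make the page-boundary question above concrete, here is a minimal C sketch (not code from the branch) of a first-touch loop whose partition boundaries are rounded to whole pages. The function names, the static schedule, and the idea of passing `elems_per_page` (for example 512 doubles for a 4096-byte page) are all assumptions for illustration. It also assumes the array starts on a page boundary, that the team has `nparts` threads, and that later compute loops use the same partitioning.

```c
#include <assert.h>
#include <stddef.h>

/* First element of partition `part` (0 <= part <= nparts) when n elements
   are split into nparts chunks whose boundaries are multiples of
   elems_per_page. part == nparts returns n, the end of the last chunk. */
size_t page_aligned_start(size_t n, size_t nparts, size_t part,
                          size_t elems_per_page)
{
    size_t npages, start;

    assert(nparts > 0 && part <= nparts && elems_per_page > 0);
    if (part == nparts)
        return n;

    /* Count pages (the last one may be partial), then hand out whole
       pages as evenly as integer division allows. */
    npages = (n + elems_per_page - 1) / elems_per_page;
    start = ((npages * part) / nparts) * elems_per_page;
    return start < n ? start : n;
}

/* Zero x so that each partition is first written by the thread that the
   static schedule assigns to it (one partition per thread if the team
   has nparts threads). Runs serially if OpenMP is not enabled. */
void first_touch(double *x, size_t n, int nparts, size_t elems_per_page)
{
    int p;

    #pragma omp parallel for schedule(static, 1)
    for (p = 0; p < nparts; p++) {
        size_t lo = page_aligned_start(n, (size_t)nparts, (size_t)p,
                                       elems_per_page);
        size_t hi = page_aligned_start(n, (size_t)nparts, (size_t)p + 1,
                                       elems_per_page);
        size_t i;
        for (i = lo; i < hi; i++)
            x[i] = 0.0;
    }
}
```

With short vectors some partitions come out empty, which is the price of never splitting a page between two threads.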