Hi Jed,

Many thanks for your feedback.
Jed Brown emailed the following on 24/02/12 04:06:

> On Thu, Feb 16, 2012 at 10:33, Gerard Gorman <g.gorman at imperial.ac.uk <mailto:g.gorman at imperial.ac.uk>> wrote:
> > 2. For GCC I did not use a threaded BLAS. For Intel I used -lmkl_intel_thread. However, it appears dnrm2 is not threaded. It seems to be a common feature among threaded BLAS libraries that Level 1 is not completely threaded (e.g. Cray). Unfortunately most of this is experience/anecdotal information; I do not know of any proper survey. We have the option here of either rolling our own or ignoring the issue until profiling shows it is a problem... and eventually someone else will release a fully threaded BLAS.
>
> Even much of BLAS2 is not threaded in MKL. Note that opening a parallel region is actually quite expensive, so (perhaps ironically) MPI is expected to perform better than threading when parallel regions involve relatively little work. In the case of BLAS-1, only quite large sizes could possibly pay for spawning a parallel region.

I agree that different data sizes might require different approaches; one might consider this as part of an autotuning framework for PETSc. The cost of spawning threads is generally minimised through the use of thread pools, as typically used by OpenMP implementations - i.e. you only pay a one-time cost for forking and joining threads. However, even with a pool there are still some overheads (e.g. scheduling chunks) which will affect you for small data sizes. I have not measured this myself (appending to the todo list) but it is frequently discussed, e.g.
http://software.intel.com/en-us/articles/performance-obstacles-for-threading-how-do-they-affect-openmp-code/
http://www2.fz-juelich.de/jsc/datapool/scalasca/scalasca_patterns-1.3.html

> This is why I think the long-term solutions for threading involve long-lived threads with mostly-private memory that prefer redundant computation so that you only have to pay for synchronization instead of also having to pay dearly to use interfaces. Unfortunately, I think the current threaded programming models are challenging to use in this way and it imposes some extra complexity on users.

I think you mean thread pools, as used by OpenMP. The same thing is done for pthreads (e.g. http://www.hlnum.org/english/projects/tools/threadpool/doc.html) and others.

> > 3. Comparing intel/ and intel-pinning/ is particularly interesting. "First touch" has been applied to all memory in VecCreate so that memory should be paged correctly for NUMA. But first touch does not gain you much if threads migrate, so for the intel-pinning/ results I set the env KMP_AFFINITY=scatter to get hard affinity. You can see clearly from the results that this improves parallel efficiency by a few percentage points in many cases. It also smooths out the efficiency dips as you run on different numbers of threads.
>
> Are you choosing sizes so that thread partitions always fall on a page boundary, or are some pages cut irregularly?

We are using static schedules, which means the chunk size = array_length/nthreads. Therefore we can have bad page/thread locality at the start of the array (i.e. malloc may have returned a pointer into the middle of a page that has already been faulted, and not necessarily on the same memory node where thread 0 is located), and wherever chunk boundaries do not align with page boundaries and the successive thread ids are on different memory nodes.
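For reference, the first-touch placement described above amounts to allocating page-aligned memory and then faulting it in with the same schedule(static) partitioning the compute loops will use, so each chunk's pages land on the touching thread's memory node. A sketch with hypothetical names, not the VecCreate code; the 4096-byte page size is an assumption matching the discussion below:

```c
#include <stdlib.h>

/* Hypothetical first-touch allocator. posix_memalign (rather than
 * malloc) makes the array start on a page boundary, so thread 0's
 * chunk does not begin in the middle of a page somebody else has
 * already faulted. The parallel initialisation then "first touches"
 * each static chunk, placing its pages on that thread's NUMA node -
 * which only helps if hard affinity (e.g. KMP_AFFINITY) stops the
 * threads migrating afterwards. */
double *first_touch_alloc(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 4096, n * sizeof(double)) != 0)
        return NULL;
    double *x = (double *)p;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        x[i] = 0.0;  /* first write decides the page's memory node */
    return x;
}
```

Pages straddling two threads' chunks are still touched by whichever thread reaches them first; that is the residual misalignment discussed below.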
I've attached a figure to fill in the deficiencies in my explanation - it is based on an Intel Westmere with two sockets (and two memory nodes), 6 cores per socket, an array of 10000 doubles, and a page size of 4096 bytes. You can control the page fault at the start of the array by replacing malloc with posix_memalign, where the alignment is the page size. For the pages that straddle chunks allocated to threads on different sockets, you would have to pad your arrays with gaps or something similar. I would do the first of these because it's easy. I don't know an easy way to implement the second, so I'd be inclined to accept that inefficiency unless profiling indicates it cannot be ignored.

> > Full blown benchmarks would not make a lot of sense until we get the Mat classes threaded in a similar fashion. However, at this point I would like feedback on the direction this is taking and whether we can start getting code committed.
>
> Did you post a repository yet? I'd like to have a look at the code.

It's on Barry's favourite collaborative software development site of course ;-)
https://bitbucket.org/wence/petsc-dev-omp/overview

Cheers
Gerard

-------------- next part --------------
A non-text attachment was scrubbed...
Name: array_page_threads.pdf
Type: application/pdf
Size: 19485 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20120226/8ebad503/attachment.pdf>