On Mon, Feb 27, 2012 at 16:31, Gerard Gorman <g.gorman at imperial.ac.uk> wrote:
> I had a quick go at trying to get some sensible benchmarks for this but
> I was getting too much system noise. I am particularly interested in
> seeing if the overhead goes to zero if num_threads(1) is used.

What timing method did you use? I did not see the overhead go to zero as
num_threads goes to 1 when using GCC compilers, but Intel seems to do
fairly well.

> I'm surprised by this. I'm not aware of any compiler that doesn't have
> OpenMP support - and if you do not actually enable OpenMP, compilers
> generally just ignore the pragma. Do you know of any compiler that does
> not have OpenMP support which will complain?

Sean points out that omp.h might not be available, but that misses the
point. As far as I know, recent mainstream compilers have enough sense to
at least ignore these directives, but I'm sure there are still cases where
it would be an issue. More importantly, #pragma was a misfeature that
should never be used now that _Pragma() exists. The latter is better not
just because it can be turned off, but because it can be manipulated using
macros and can be explicitly compiled out.

> This may not be flexible enough. You frequently want to have a parallel
> region, and then have multiple omp for's within that one region.

PetscPragmaOMPObject(obj, parallel)
{
  PetscPragmaOMP(whatever you normally write for this loop)
  for (....) { }
  ... and so on
}

> I think what you describe is close to Fig 3 of this paper written by
> your neighbours:
>
> http://greg.bronevetsky.com/papers/2008IWOMP.pdf
>
> However, before making the implementation more complex, it would be good
> to benchmark the current approach and use a tool like likwid to measure
> the NUMA traffic so we can get a good handle on the costs.

Sure.

> Well this is where the implementation details get richer and there are
> many options - they also become less portable. For example, what does
> all this mean for the sparc64 processors which are UMA.
Delay to runtime, use an ignorant partition for UMA. (Blue Gene/Q is also
essentially uniform.) But note that even with uniform memory, cache still
makes it somewhat hierarchical.

> Not to mention Intel MIC which also supports OpenMP. I guess I am
> cautious about getting too bogged down with very invasive optimisations
> until we have benchmarked the basic approach, which in a wide range of
> use cases will achieve good thread/page locality as illustrated
> previously.

I guess I'm just interested in exposing enough semantic information to be
able to schedule a few different ways using run-time (or, if absolutely
necessary, configure-time) options. I don't want to have to revisit
individual loops.