On Jun 19, 2011, at 8:13 AM, Jed Brown wrote:

> On Sun, Jun 19, 2011 at 03:33, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
> > No, Cray does not provide any threaded BLAS 1. Generally speaking it is
> > not worth threading a single nested loop unless the trip count is very high
> > and generally that does not happen often enough to warrant the special
> > BLAS. In fact, I am not even sure we omp BLAS 2, I don't think so.
>
> This shouldn't be surprising. It's worth noting that inside of multigrid (or
> a surface-volume split problem) the same vector operation will be called with
> quite different sizes. Managing the granularity so that threads are used when
> they will actually be faster, but not when the size is too small to offset
> the startup cost is not trivial.
   Huhh? VecDot() { if n is very big use 2 threads, else use 1 }. I don't see
why that is hard.

> A related matter that I keep harping on is that the memory hierarchy is very
> non-uniform. In the old days, it was reasonably uniform within a socket, but
> some of the latest hardware has multiple dies within a socket, each with
> more-or-less independent memory buses.

   So what is the numa.h you've been using? If we allocate the vector arrays
and matrix arrays that way, does that give you the locality?

   BTW: if it doesn't do this yet, ./configure needs to check for numa.h and
define PETSC_HAVE_NUMA_H.

   Barry

> Of course you can always move MPI down to finer granularity (e.g. one MPI
> process per die instead of one per socket). I think this is a good solution
> for many applications, and may perform better than threads for reasonably
> large subdomains (mostly because memory affinity can be reliably managed),
> but it is less appealing in the strong scaling limit.
>
> I have yet to see a threading library/language that offers a good platform
> for bandwidth-constrained strong scaling. The existing solutions tend to base
> everything on the absurd assumption that parallel computation is about
> parallelizing the computation. In reality, certainly for the sorts of
> problems that most PETSc users care about, everything hard about parallelism
> is how to communicate that which needs to be communicated while retaining
> good data locality on that which doesn't need to be communicated.
>
> It's an easy local optimization to get high performance out of a local
> kernel. In contrast, optimizing for data locality often involves major data
> structure and algorithm changes.
>
> The current systems all seem to be really bad at this, with data locality
> being something that is sometimes provided implicitly, but is ultimately very
> fragile. The number of recent threading papers that report less than 30% of
> hardware bandwidth peak for STREAM is not inspiring. (Rather few authors
> actually state the hardware peak, but if you look up the specs, it's really
> rare to see something respectable (e.g. 80%).)
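
For concreteness, a minimal sketch of the size-threshold dispatch Barry
describes for VecDot() above. The cutoff value, the function name, and the use
of OpenMP with two threads are illustrative assumptions, not PETSc's actual
implementation:

    /* Illustrative cutoff; a real value would be tuned per machine. */
    #define THREAD_CUTOFF 10000

    /* Dot product that only uses threads when the vector is long enough to
       amortize the thread-startup cost, and runs serially otherwise. */
    static double DotWithThreshold(const double *x, const double *y, int n)
    {
      double sum = 0.0;
      int    i;
      if (n > THREAD_CUTOFF) {
        #pragma omp parallel for reduction(+:sum) num_threads(2)
        for (i = 0; i < n; i++) sum += x[i]*y[i];
      } else {
        for (i = 0; i < n; i++) sum += x[i]*y[i];
      }
      return sum;
    }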
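And a sketch of the numa.h usage being asked about, guarded by the
PETSC_HAVE_NUMA_H flag Barry mentions. numa_available() and numa_alloc_local()
are real libnuma calls, but the wrapper routine and its malloc() fallback are
made up for illustration:

    #if defined(PETSC_HAVE_NUMA_H)
    #include <numa.h>
    #endif
    #include <stdlib.h>

    /* Allocate an array on the NUMA node of the calling thread when libnuma
       is available, falling back to plain malloc() otherwise.  Memory from
       numa_alloc_local() must be released with numa_free(), not free(). */
    static double *AllocateLocalArray(size_t n)
    {
    #if defined(PETSC_HAVE_NUMA_H)
      if (numa_available() != -1) return (double*)numa_alloc_local(n*sizeof(double));
    #endif
      return (double*)malloc(n*sizeof(double));
    }

Linking against libnuma (-lnuma) is needed when the header is found, which is
the sort of thing the ./configure test would also have to handle.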