On Jun 19, 2011, at 8:13 AM, Jed Brown wrote:

> On Sun, Jun 19, 2011 at 03:33, Barry Smith <bsmith at mcs.anl.gov> wrote:
> > No, Cray does not provide any threaded BLAS 1. Generally speaking it is
> > not worth threading a singly nested loop unless the trip count is very high,
> > and that does not happen often enough to warrant a special BLAS. In fact,
> > I am not even sure we use OpenMP in the BLAS 2; I don't think so.
> 
> This shouldn't be surprising. It's worth noting that inside multigrid (or
> a surface-volume split problem) the same vector operation will be called with
> quite different sizes. Managing the granularity so that threads are used when
> they will actually be faster, but not when the size is too small to offset
> the startup cost, is not trivial.

   Huh? VecDot(): if n is big enough use two threads, otherwise use one. I don't
see why that is hard.
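
   For what it's worth, a minimal sketch of that switch (plain C with OpenMP; the
function name and the cutoff are made up, and the right cutoff would have to be
tuned for the machine):

    /* illustrative threshold; needs tuning for the actual machine */
    #define DOT_THREAD_CUTOFF 10000

    /* compile with -fopenmp; serial below the cutoff, two threads above it */
    double dot_with_threshold(int n, const double *x, const double *y)
    {
      double sum = 0.0;
      int    i;
      if (n > DOT_THREAD_CUTOFF) {
    #pragma omp parallel for reduction(+:sum) num_threads(2)
        for (i = 0; i < n; i++) sum += x[i]*y[i];
      } else {
        for (i = 0; i < n; i++) sum += x[i]*y[i];
      }
      return sum;
    }

   The cutoff is the only tunable; everything else is mechanical.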

> 
> A related matter that I keep harping on is that the memory hierarchy is very 
> non-uniform. In the old days, it was reasonably uniform within a socket, but 
> some of the latest hardware has multiple dies within a socket, each with 
> more-or-less independent memory buses.

  So what is the numa.h you've been using? If we allocate the vector arrays and
matrix arrays with it, does that give you the locality?
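
  For concreteness, the two mechanisms I can imagine here, with made-up names and
no claim this is what you are doing: plain malloc gives locality only through
first touch (each page lands on the NUMA node of the thread that first writes
it), while numa.h (libnuma) lets you ask for placement explicitly:

    #include <stdlib.h>
    #include <numa.h>                  /* libnuma; link with -lnuma */

    /* First touch: malloc only reserves virtual pages; each page is placed on
       the NUMA node of the thread that first writes it, so initialize with the
       same thread layout (here a static OpenMP schedule) as the later loops. */
    void first_touch_fill(double *x, long n)
    {
      long i;
    #pragma omp parallel for schedule(static)
      for (i = 0; i < n; i++) x[i] = 0.0;
    }

    /* Explicit placement: pages go on the NUMA node of the calling thread
       (memory from numa_alloc_local() is freed with numa_free(), not free()). */
    double *alloc_local(long n)
    {
      if (numa_available() < 0) return malloc(n*sizeof(double));
      return numa_alloc_local(n*sizeof(double));
    }

  Either way the allocation only buys locality if the threads that later touch
the arrays stay on the node the pages were placed on, so it has to go together
with thread pinning.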

   BTW: If it doesn't do it yet, ./configure needs to check for numa.h and define
PETSC_HAVE_NUMA_H.
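
   The code could then guard on that macro; a sketch only (the function name is
made up, and the numa_alloc_local() branch needs a matching numa_free()):

    #include <stdlib.h>
    #if defined(PETSC_HAVE_NUMA_H)
    #include <numa.h>
    #endif

    /* allocate n doubles, on the caller's NUMA node when libnuma is available */
    static double *AllocVecArray(size_t n)
    {
    #if defined(PETSC_HAVE_NUMA_H)
      if (numa_available() >= 0) return (double*)numa_alloc_local(n*sizeof(double));
    #endif
      return (double*)malloc(n*sizeof(double));
    }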


   Barry

> 
> Of course you can always move MPI down to finer granularity (e.g. one MPI 
> process per die instead of one per socket). I think this is a good solution 
> for many applications, and may perform better than threads for reasonably 
> large subdomains (mostly because memory affinity can be reliably managed), 
> but it is less appealing in the strong scaling limit.
> 
> I have yet to see a threading library/language that offers a good platform 
> for bandwidth-constrained strong scaling. The existing solutions tend to base 
> everything on the absurd assumption that parallel computation is about 
> parallelizing the computation. In reality, certainly for the sorts of 
> problems that most PETSc users care about, everything hard about parallelism 
> is how to communicate that which needs to be communicated while retaining 
> good data locality on that which doesn't need to be communicated.
> 
> It's an easy local optimization to get high performance out of a local 
> kernel. In contrast, optimizing for data locality often involves major data 
> structure and algorithm changes.
> 
> The current systems all seem to be really bad at this, with data locality 
> being something that is sometimes provided implicitly, but is ultimately very 
> fragile. The number of recent threading papers that report less than 30% of 
> hardware bandwidth peak for STREAM is not inspiring. (Rather few authors
> actually state the hardware peak, but if you look up the specs, it's really
> rare to see something respectable, e.g. 80%.)
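
   For reference, the "hardware peak" here comes straight from the memory specs;
with illustrative numbers for a triple-channel DDR3-1333 socket:

    3 channels x 1333 MT/s x 8 bytes ~= 32 GB/s peak per socket
    STREAM triad at 10 GB/s on that socket ~= 31% of peak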

