Begin forwarded message:
> From: Nathan Wichmann <wichmann at cray.com>
> Date: June 19, 2011 4:15:48 PM CDT
> To: Barry Smith <bsmith at mcs.anl.gov>, John Shalf <JShalf at lbl.gov>
> Cc: Lois Curfman McInnes <curfman at mcs.anl.gov>, Satish Balay <balay at mcs.anl.gov>, Alice Koniges <aekoniges at lbl.gov>, Robert Preissl <rpreissl at lbl.gov>, Erich Strohmaier <EStrohmaier at lbl.gov>, Stephane Ethier <ethier at pppl.gov>
> Subject: RE: Poisson step in GTS
>
> Q: How does one "correctly map data using first-touch"? (Reference ok).
>
> A: The default policy on the XE6 is that whichever process/thread is the first to access a page, that page will be physically allocated in the memory closest to the core running that process/thread, if possible. This basically means that if you are running with 4 or more MPI ranks and 6 or fewer OMP threads per node, then you don't have to do anything. If you want to run with more than 6 OMP threads, then you have to worry about this a lot more.
>
> Q: How does one "configure as 4 NUMA nodes with 6 threads each"? Do you mean 4 MPI processes (each with 6 threads) or something different?
>
> A: If/when you run on a Cray XE system you will find that you have to use something called aprun to launch jobs. Various options to aprun inform it how to launch your job. Options include, but are not limited to, "-n", which specifies the number of MPI ranks, and "-d", which specifies the number of cores allocated to each rank that are available to run OMP threads. The actual number of OMP threads is set via the environment variable OMP_NUM_THREADS. There are many more options that can affect placement, but these are the easiest to understand and the most important for this discussion, IMHO.
>
> Q: What is an "OpenMP thread" mapped to on the Cray systems? A pthread? Some other kind of thread?
> A: I think that it is mapped to pthreads, but I prefer to keep it abstract, as it is implementation dependent, and one cannot mix user pthreads with OMP in the same app and hope to get decent results. :-)
>
> Nathan
>
> Nathan Wichmann                    Cray Inc.
> wichmann at cray.com              380 Jackson St
> Applications Engineer              St Paul, MN 55101
> office: 1-800-284-2729 x605-9079
> cell: 651-428-1131
>
> -----Original Message-----
> From: Barry Smith [mailto:bsmith at mcs.anl.gov]
> Sent: Sunday, June 19, 2011 2:44 PM
> To: John Shalf
> Cc: Nathan Wichmann; Lois Curfman McInnes; Satish Balay; Alice Koniges; Robert Preissl; Erich Strohmaier; Stephane Ethier
> Subject: Re: Poisson step in GTS
>
>
> On Jun 19, 2011, at 5:34 AM, John Shalf wrote:
>
>> Hi Barry,
>> here are the stream benchmark results that Hongzhang Shan collected on Hopper for Nick's COE studies. The red curve shows performance when you run stream with all of the data mapped to a single memory controller. The blue curve shows the case when you correctly map data using first-touch so that the stream benchmark accesses data on its local memory controller (the correct NUMA mapping).
>
> How does one "correctly map data using first-touch"? (Reference ok).
>
>> The bottom line is that it is essential that data is touched first on the memory controller nearest the OpenMP processes that will be accessing it (otherwise memory bandwidth will tank). This should occur naturally if you configure as 4 NUMA nodes with 6 threads each, as per Nathan's suggestion.
>
> How does one "configure as 4 NUMA nodes with 6 threads each"? Do you mean 4 MPI processes (each with 6 threads) or something different?
>
>> If we want to be more aggressive and use 24-way threaded parallelism per node, then extra care must be taken to ensure the memory affinity is not screwed up.
>
> BTW: What is an "OpenMP thread" mapped to on the Cray systems? A pthread? Some other kind of thread?
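The first-touch policy Nathan describes can be sketched in C with OpenMP. This is an illustration, not code from GTS or PETSc; the array names and sizes are made up. The key point is that the initialization loop uses the same parallel static schedule as the later compute loop, so each thread's pages land on its local NUMA node.

```c
#include <stdlib.h>

/* Allocate two arrays, initialize them with a parallel first-touch loop,
 * then compute their dot product with the same static schedule, so each
 * thread reuses the pages it placed on its local NUMA node. */
double dot_first_touch(long n) {
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    if (!a || !b) { free(a); free(b); return -1.0; }

    /* First touch: each thread writes the chunk of the arrays it will
     * later read.  A serial initialization loop here would instead put
     * every page on the master thread's NUMA node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    /* Same static schedule: each thread touches the chunk it initialized. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i] * b[i];

    free(a);
    free(b);
    return sum;
}
```

On the XE6 configuration discussed above, this would be launched with something like `OMP_NUM_THREADS=6 aprun -n 4 -d 6 ./a.out` (the exact aprun invocation is an assumption here, matching the 4-ranks-by-6-threads layout Nathan describes). Without OpenMP enabled, the pragmas are ignored and the code runs serially and correctly.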
>
> Barry
>
>>
>> -john
>>
>> On Jun 18, 2011, at 10:13 AM, Barry Smith wrote:
>>> On Jun 18, 2011, at 9:35 AM, Nathan Wichmann wrote:
>>>> Hi Robert, Barry and all,
>>>>
>>>> Is it our assumption that the Poisson version of GTS will normally be run with 1 MPI rank per die and 6 (on AMD Magny-Cours) OMP threads?
>>>
>>> Our new vector and matrix classes will allow the flexibility of any number of MPI processes and any number of threads under that. So 1 MPI rank and 6 threads is supportable.
>>>
>>>> In that case there should be sufficient bandwidth for decent scaling; I would expect something like Barry's Intel experience. Barry is certainly correct that as one uses more cores one will be more bandwidth limited.
>>>
>>> I would be interested in seeing the OpenMP streams for this system.
>>>
>>>> I also like John's comment: "we have little faith that the compiler will do anything intelligent." Which compiler are you using? If you are using CCE then you should get a .lst file to see what it is doing. Probably the only thing that can and should be done is to unroll the inner loop.
>>>
>>> Do you folks provide thread-based BLAS 1 operations, for example ddot, dscal, and daxpy? If so, we can piggyback on those to get the best possible performance on the vector operations.
>>>
>>>> Another consideration is the typical size of "n". Normally, the denser the matrix, the larger n is, no? But still, it would be interesting to know.
>>>
>>> In this application the matrix is extremely sparse, likely between 7 and 27 nonzeros per row. Matrices, of course, can get as big as you like.
>>>
>>> Barry
>>
>> <PastedGraphic-1.pdf>
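The thread-based BLAS 1 operations Barry asks about follow a simple pattern. The sketch below is a minimal OpenMP daxpy (y = alpha*x + y), for illustration only; a tuned vendor library such as Cray LibSci would normally supply this rather than hand-written code.

```c
#include <stddef.h>

/* Minimal threaded BLAS 1 daxpy: y <- alpha*x + y.
 * schedule(static) keeps each thread on a contiguous chunk, which
 * lines up with first-touch page placement of x and y. */
void daxpy_omp(long n, double alpha, const double *x, double *y) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

ddot and dscal follow the same pattern, with ddot additionally needing a `reduction(+:...)` clause on the accumulator.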
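Barry's point that the matrix has only 7 to 27 nonzeros per row explains why there is so little for the compiler to do: the kernel is a sparse matrix-vector product whose inner loop is only a handful of iterations, so unrolling it is about the only optimization available. A hedged sketch in CSR format (the storage format and names are assumptions; the actual GTS/PETSc kernel may differ):

```c
/* Sparse matrix-vector product y = A*x in CSR storage.
 * rowptr[i]..rowptr[i+1] delimits row i's entries; col[j] and val[j]
 * give the column index and value of the j-th stored nonzero.
 * With ~7-27 nonzeros per row, the inner loop is very short. */
void csr_spmv(int nrows, const int *rowptr, const int *col,
              const double *val, const double *x, double *y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += val[j] * x[col[j]];
        y[i] = sum;
    }
}
```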