Begin forwarded message:

> From: Nathan Wichmann <wichmann at cray.com>
> Date: June 19, 2011 4:15:48 PM CDT
> To: Barry Smith <bsmith at mcs.anl.gov>, John Shalf <JShalf at lbl.gov>
> Cc: Lois Curfman McInnes <curfman at mcs.anl.gov>, Satish Balay
> <balay at mcs.anl.gov>, Alice Koniges <aekoniges at lbl.gov>, Robert Preissl
> <rpreissl at lbl.gov>, Erich Strohmaier <EStrohmaier at lbl.gov>, Stephane
> Ethier <ethier at pppl.gov>
> Subject: RE: Poisson step in GTS
> 
> Q:  How does one "correctly map data using first-touch"? (Reference ok).
> 
> A:  The default policy on the XE6 is first-touch: a page is physically 
> allocated on the memory closest to the core running whichever 
> process/thread first accesses it, if possible.  This basically means that 
> if you are running with 4 or more mpi ranks per node and 6 or fewer omp 
> threads per rank, then you don't have to do anything.  If you want to run 
> with more than 6 omp threads per rank, then you have to worry about this 
> a lot more.
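> 
> A minimal first-touch sketch in C with OpenMP (my illustration, not GTS 
> code; the helper and array name are made up):
> 
>   #include <stdlib.h>
> 
>   void alloc_first_touch(size_t n, double **xp)
>   {
>       double *x = malloc(n * sizeof(double));
>       /* First touch: fault the pages in from the same threads, with
>          the same static schedule, that the compute loops will later
>          use, so each page lands on the NUMA node that will reuse it. */
>   #pragma omp parallel for schedule(static)
>       for (long i = 0; i < (long)n; i++)
>           x[i] = 0.0;
>       *xp = x;
>   }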
> 
> Q:  How does one "configure as 4 NUMA nodes with 6 threads each"? Do you mean 
> 4 MPI processes (each with 6 threads or something different?)
> 
> A:  If/when you run on a Cray XE system you will find that you have to 
> use something called aprun to launch jobs.  Various options to aprun tell 
> it how to launch your job.  Options include, but are not limited to, 
> "-n", which specifies the number of mpi ranks, and "-d", which specifies 
> the number of cores allocated to each rank that are available to run OMP 
> threads.  The actual number of OMP threads is set via the env var 
> OMP_NUM_THREADS.  There are many more options that can affect placement, 
> but these are the easiest to understand and the most important for this 
> discussion, IMHO.
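> 
> As a concrete example (my illustration; assumes a single 24-core XE6 node 
> and a binary named ./gts, both made up):
> 
>   export OMP_NUM_THREADS=6
>   aprun -n 4 -d 6 ./gts
> 
> This launches 4 mpi ranks, each with 6 cores reserved for its OMP 
> threads, i.e. one rank per Magny-Cours die.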
> 
> Q:  What is an "OpenMP thread"  mapped to on the Cray systems? A pthread? 
> Some other kind of thread?
> 
> A:  I think it is mapped to pthreads, but I prefer to keep it abstract 
> since it is implementation dependent, and one cannot mix user pthreads 
> with omp in the same app and hope to get decent results.  :-)
> 
> Nathan
> 
> 
> 
> 
> Nathan Wichmann                Cray Inc.
> wichmann at cray.com              380 Jackson St
> Applications Engineer          St Paul, MN 55101
> office:  1-800-284-2729  x605-9079          
>  cell:  651-428-1131
> 
> -----Original Message-----
> From: Barry Smith [mailto:bsmith at mcs.anl.gov] 
> Sent: Sunday, June 19, 2011 2:44 PM
> To: John Shalf
> Cc: Nathan Wichmann; Lois Curfman McInnes; Satish Balay; Alice Koniges; 
> Robert Preissl; Erich Strohmaier; Stephane Ethier
> Subject: Re: Poisson step in GTS
> 
> 
> On Jun 19, 2011, at 5:34 AM, John Shalf wrote:
> 
>> Hi Barry,
>> here are the STREAM benchmark results that Hongzhang Shan collected on 
>> Hopper for Nick's COE studies.  The red curve shows performance when you 
>> run STREAM with all of the data mapped to a single memory controller.  
>> The blue curve shows the case when you correctly map data using 
>> first-touch so that the STREAM benchmark accesses data on its local 
>> memory controller (the correct NUMA mapping).
> 
>   How does one "correctly map data using first-touch"? (Reference ok).
>> 
>> 
>> The bottom line is that it is essential that data be first touched on 
>> the memory controller nearest the OpenMP threads that will be accessing 
>> it (otherwise memory bandwidth will tank).  This should occur naturally 
>> if you configure as 4 NUMA nodes with 6 threads each, as per Nathan's 
>> suggestion.
> 
>   How does one "configure as 4 NUMA nodes with 6 threads each"? Do you mean 4 
> MPI processes (each with 6 threads or something different?)
> 
>> If we want to be more aggressive and use 24-way threaded parallelism per 
>> node, then extra care must be taken to ensure the memory affinity is not 
>> screwed up.
> 
>   BTW: What is an "OpenMP thread"  mapped to on the Cray systems? A pthread? 
> Some other kind of thread?
> 
>   Barry
> 
>> 
>> -john
>> 
>> On Jun 18, 2011, at 10:13 AM, Barry Smith wrote:
>>> On Jun 18, 2011, at 9:35 AM, Nathan Wichmann wrote:
>>>> Hi Robert, Barry and all,
>>>> 
>>>> Is it our assumption that the Poisson version of GTS will normally be 
>>>> run with 1 mpi rank per die and 6 omp threads (on AMD Magny-Cours)?
>>> 
>>> Our new vector and matrix classes will allow the flexibility of any number 
>>> of MPI processes and any number of threads under that. So 1 MPI rank and 6 
>>> threads is supportable.
>>> 
>>>> In that case there should be sufficient bandwidth for decent scaling; 
>>>> I would say something like Barry's Intel experience.  Barry is 
>>>> certainly correct that as one uses more cores one will be more 
>>>> bandwidth limited.
>>> 
>>> I would be interested in seeing the OpenMP STREAM results for this 
>>> system.
>>>> 
>>>> I also like John's comment: "we have little faith that the compiler 
>>>> will do anything intelligent."  Which compiler are you using?  If you 
>>>> are using CCE then you should get a .lst listing file to see what it 
>>>> is doing.  Probably the only thing that can and should be done is to 
>>>> unroll the inner loop.
>>> 
>>> Do you folks provide thread-based BLAS 1 operations, for example ddot, 
>>> dscal, daxpy? If so, we can piggy-back on those to get the best 
>>> possible performance on the vector operations.
>>>> 
>>>> Another consideration is the typical size of "n".  Normally the 
>>>> denser the matrix, the larger n is, no?  But still, it would be 
>>>> interesting to know.
>>> 
>>> In this application the matrix is extremely sparse, likely between 7 and 27 
>>> nonzeros per row. Matrices, of course, can get as big as you like.
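>>> 
>>> The kernel in question is essentially a CSR matrix-vector product (a 
>>> sketch assuming standard CSR arrays ia/ja/a, not the actual PETSc 
>>> code):
>>> 
>>>   /* y = A*x for a CSR matrix; with only ~7-27 nonzeros per row the
>>>      loop mostly streams ia, ja, a, and x, so it is bandwidth bound. */
>>>   void csr_matvec(int m, const int *ia, const int *ja,
>>>                   const double *a, const double *x, double *y)
>>>   {
>>>       for (int i = 0; i < m; i++) {
>>>           double sum = 0.0;
>>>           for (int k = ia[i]; k < ia[i+1]; k++)
>>>               sum += a[k] * x[ja[k]];
>>>           y[i] = sum;
>>>       }
>>>   }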
>>> 
>>> Barry
>> 
>> <PastedGraphic-1.pdf>
> 

