Begin forwarded message:

> From: John Shalf <jshalf at lbl.gov>
> Date: June 19, 2011 5:15:49 PM CDT
> To: Barry Smith <bsmith at mcs.anl.gov>
> Cc: Nathan Wichmann <wichmann at cray.com>, Lois Curfman McInnes 
> <curfman at mcs.anl.gov>, Satish Balay <balay at mcs.anl.gov>, Alice Koniges 
> <aekoniges at lbl.gov>, Robert Preissl <rpreissl at lbl.gov>, Erich Strohmaier 
> <EStrohmaier at lbl.gov>, Stephane Ethier <ethier at pppl.gov>
> Subject: Re: Poisson step in GTS
> 
> 
> On Jun 19, 2011, at 9:44 PM, Barry Smith wrote:
>> On Jun 19, 2011, at 5:34 AM, John Shalf wrote:
>>> Hi Barry,
>>> here are the stream benchmark results that Hongzhang Shan collected on 
>>> Hopper for Nick's COE studies.   The red curve shows performance when you 
>>> run stream when all of the data ends up mapped to a single memory 
>>> controller.  The blue curve shows the case when you correctly map data 
>>> using first-touch so that the stream benchmark accesses data on its local 
>>> memory controller (the correct NUMA mapping). 
>> 
>>  How does one "correctly map data using first-touch"? (Reference ok).
> 
> The AMD nodes (and even the Intel Nehalem nodes) have memory controllers 
> integrated onto the processor chips.  The processor chips are integrated 
> together into a node using HyperTransport for the AMD chips (or QPI for the 
> Intel chips), and those links happen to be slower per link than the memory 
> bandwidth of the local memory controller on each of these chips.  
> Consequently, accessing memory through the memory controller that is 
> co-located on the die with the CPUs has much lower latency and higher 
> bandwidth than accessing memory through the memory controller on one of the 
> other dies.  
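> 
> To make that layout concrete, here is a small sketch (assuming libnuma is 
> installed, which it is on most Linux clusters; compile with -lnuma) that 
> just asks the OS how many of these on-die memory controllers (NUMA nodes) 
> it sees.  On a Hopper XE6 node it should report 4.
> 
>   #include <stdio.h>
>   #include <numa.h>    /* libnuma: query the NUMA topology */
> 
>   int main(void)
>   {
>       if (numa_available() < 0) {
>           printf("NUMA is not available on this system\n");
>           return 1;
>       }
>       /* One NUMA node per die/memory controller; Hopper should show 4. */
>       printf("NUMA nodes: %d\n", numa_max_node() + 1);
>       printf("CPUs:       %d\n", numa_num_configured_cpus());
>       return 0;
>   }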
> 
> So you need to have some way of identifying which memory controller should 
> "own" a piece of memory so that you can keep it closer to the processors that 
> will primarily be using it.  The "first touch" memory affinity policy says 
> that a memory page gets mapped to the memory controller that is *closest* to 
> the first processor core to write a value to that page.  So you can malloc() 
> data at any time, but the pages get assigned to memory controllers based on 
> the first processor to "touch" that memory page.  If you touch the data (and 
> thereby assign it to memory controllers) in a different layout than you 
> actually use it, then most accesses will be non-local, and therefore will be 
> very slow.  If you touch data using the processor that will primarily be 
> accessing the data later on, then it will get mapped to a local memory 
> controller and will therefore be much faster.  So you have to be very 
> careful about how you first touch data to ensure good memory/stream 
> performance.
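> 
> Here is a minimal sketch of what that looks like in C with OpenMP (the array 
> names and sizes are just placeholders): the loop that first writes the 
> arrays uses the same static schedule as the loop that later computes on 
> them, so each page gets mapped to the memory controller local to the thread 
> that will actually use it.
> 
>   #include <stdlib.h>
> 
>   #define N (1L << 24)
> 
>   int main(void)
>   {
>       double *a = malloc(N * sizeof(double));   /* pages not mapped yet */
>       double *b = malloc(N * sizeof(double));
> 
>       /* First touch: initialize in parallel with the same static schedule
>          that the compute loop below will use. */
>       #pragma omp parallel for schedule(static)
>       for (long i = 0; i < N; i++) {
>           a[i] = 0.0;
>           b[i] = 1.0;
>       }
> 
>       /* Compute loop: each thread now accesses pages that live on its own
>          local memory controller (the "blue curve" case). */
>       #pragma omp parallel for schedule(static)
>       for (long i = 0; i < N; i++)
>           a[i] += 2.5 * b[i];
> 
>       free(a);
>       free(b);
>       return 0;
>   }
> 
> If instead a single thread initializes a[] and b[], all of the pages land on 
> one memory controller and you get the red curve, no matter how the compute 
> loop is parallelized.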
> 
> You can refer to the NERSC FAQ on the "first touch" principle:
> http://www.nersc.gov/users/computational-systems/hopper/getting-started/multi-core-faq/
> 
>>> The bottom line is that it is essential that data is touched first on the 
>>> memory controller that is nearest the OpenMP processes that will be 
>>> accessing it (otherwise memory bandwidth will tank).  This should occur 
>>> naturally if you configure as 4 NUMA nodes with 6 threads each, as per 
>>> Nathan's suggestion.
>> 
>>  How does one "configure as 4 NUMA nodes with 6 threads each"? Do you mean 4 
>> MPI processes (each with 6 threads or something different?)
> 
> That is correct.  The Cray XE6 node (dual-socket AMD Magny-Cours) has a 
> total of 4 dies (and hence 4 NUMA domains).  Cray refers to these dies (each 
> of which has its own local memory controller) as a NUMA-node to distinguish 
> it from the full node that contains four of these dies.  Within a NUMA node, 
> there are no NUMA issues. 
> 
> So Cray refers to these dies (these sub-sections of a node where there are no 
> NUMA issues) as numa_nodes.  You can use 'aprun' to launch tasks so that you 
> get one task per numa_node and the threads within that numa_node will not 
> have to worry about the first touch stuff we talked about above.  For Hopper, 
> that is 4 numa_nodes per node, and 6 OpenMP threads per numa_node.
> 
> e.g. something like (with OMP_NUM_THREADS=6 set in the environment):
> 
>       aprun -n <total_tasks> -N 4 -S 1 -d 6 ./a.out
> 
> where -N 4 puts 4 tasks on each node, -S 1 puts 1 task on each numa_node, 
> and -d 6 reserves 6 cores for the OpenMP threads of each task.
> 
>>> If we want to be more aggressive and use 24-way threaded parallelism per 
>>> node, then extra care must be taken to ensure the memory affinity is not 
>>> screwed up.
>> 
>>  BTW: What is an "OpenMP thread"  mapped to on the Cray systems? A pthread? 
>> Some other kind of thread?
> 
> I'm not sure what you mean here.  OpenMP is a set of compiler directives for 
> threading.  The number of OpenMP "threads" is just how many threads you 
> assign to each MPI task (with OpenMP operating within each MPI task).
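> 
> If it helps, here is a trivial hybrid sketch (a hypothetical hello_hybrid.c) 
> that shows the relationship: each MPI task spawns its own team of OpenMP 
> threads, and OMP_NUM_THREADS (6 in the layout above) sets the team size.
> 
>   #include <stdio.h>
>   #include <mpi.h>
>   #include <omp.h>
> 
>   int main(int argc, char **argv)
>   {
>       int rank;
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>       /* Each MPI task runs its own OpenMP parallel region. */
>       #pragma omp parallel
>       printf("MPI task %d, OpenMP thread %d of %d\n",
>              rank, omp_get_thread_num(), omp_get_num_threads());
> 
>       MPI_Finalize();
>       return 0;
>   }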
> 
> -john
> 

