Begin forwarded message:
> From: John Shalf <jshalf at lbl.gov>
> Date: June 19, 2011 5:15:49 PM CDT
> To: Barry Smith <bsmith at mcs.anl.gov>
> Cc: Nathan Wichmann <wichmann at cray.com>, Lois Curfman McInnes <curfman at mcs.anl.gov>, Satish Balay <balay at mcs.anl.gov>, Alice Koniges <aekoniges at lbl.gov>, Robert Preissl <rpreissl at lbl.gov>, Erich Strohmaier <EStrohmaier at lbl.gov>, Stephane Ethier <ethier at pppl.gov>
> Subject: Re: Poisson step in GTS
>
>
> On Jun 19, 2011, at 9:44 PM, Barry Smith wrote:
>> On Jun 19, 2011, at 5:34 AM, John Shalf wrote:
>>> Hi Barry,
>>> here are the STREAM benchmark results that Hongzhang Shan collected on Hopper for Nick's COE studies. The red curve shows performance when you run STREAM with all of the data mapped to a single memory controller. The blue curve shows the case when you correctly map data using first-touch, so that the STREAM benchmark accesses data on its local memory controller (the correct NUMA mapping).
>>
>> How does one "correctly map data using first-touch"? (Reference ok.)
>
> The AMD nodes (and even the Intel Nehalem nodes) have memory controllers integrated onto the processor chips. The processor chips are linked together into a node using HyperTransport for the AMD chips (or QPI for the Intel chips), which happens to be slower per link than the bandwidth of the local memory controller on each of these chips. Consequently, accessing memory through the memory controller that is co-located on the die with the CPUs has much lower latency and much higher bandwidth than going through the memory controller on one of the other dies.
>
> So you need some way of identifying which memory controller should "own" a piece of memory, so that you can keep it close to the processors that will primarily be using it. The "first touch" memory affinity policy says that a memory page gets mapped to the memory controller that is *closest* to the first processor core that writes a value to that page. So you can malloc() data at any time, but the pages get assigned to memory controllers based on the first processor to "touch" each page. If you touch the data (and thereby assign it to memory controllers) in a different layout than you actually use it, then most accesses will be non-local and therefore very slow. If you touch the data using the processor that will primarily be accessing it later on, then it will get mapped to a local memory controller and will therefore be much faster. So you have to be very careful about how you first touch data to ensure good memory/STREAM performance.
>
> See the NERSC FAQ on the "first touch" principle:
> http://www.nersc.gov/users/computational-systems/hopper/getting-started/multi-core-faq/
>
>>> The bottom line is that it is essential that data is touched first on the memory controller that is nearest the OpenMP processes that will be accessing it (otherwise memory bandwidth will tank). This should occur naturally if you configure as 4 NUMA nodes with 6 threads each, as per Nathan's suggestion.
>>
>> How does one "configure as 4 NUMA nodes with 6 threads each"? Do you mean 4 MPI processes (each with 6 threads) or something different?
>
> That is correct. The Cray XE6 node (the dual-socket AMD Magny-Cours) has a total of 4 dies (and hence 4 NUMA domains). Cray refers to each of these dies (each of which has its own local memory controller) as a NUMA node, to distinguish it from the full node that contains four of these dies.
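[A minimal sketch of the first-touch idiom described above; this is an editor's illustration, not code from the thread, and the array names and sizes are made up. The point is to initialize data with the same parallel loop schedule that the compute loop will use, so each page is first written, and therefore placed, by the thread that will later access it.]

    /* first_touch.c: illustrative first-touch NUMA initialization.
     * Build with OpenMP enabled, e.g.: cc -fopenmp first_touch.c */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const long n = 50L * 1000 * 1000;      /* ~400 MB per array */
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        if (!a || !b) return 1;

        /* BAD: a serial initialization (e.g. memset) here would first-touch
         * every page from thread 0, so all pages would land on one
         * numa_node's memory controller and other threads would stream
         * from remote controllers. */

        /* GOOD: first-touch in parallel with a static schedule, so each
         * page is placed on the touching thread's local controller. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; }

        /* The compute loop uses the identical static schedule, so each
         * thread reads and writes pages on its local memory controller. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++) a[i] += 2.5 * b[i];

        printf("done: a[0] = %g\n", a[0]);
        free(a); free(b);
        return 0;
    }

[Note that malloc() itself does not place pages; the first write does. So the initialization loop, not the allocation, determines locality, provided the threads stay pinned to their cores.]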
> Within a NUMA node, there are no NUMA issues.
>
> So Cray refers to these dies (these sub-sections of a node within which there are no NUMA issues) as numa_nodes. You can use 'aprun' to launch tasks so that you get one task per numa_node, and then the threads within that numa_node will not have to worry about the first-touch issues we talked about above. For Hopper, that is 4 numa_nodes per node and 6 OpenMP threads per numa_node.
>
> e.g., for one node, with OMP_NUM_THREADS set to 6:
> aprun -n 4 -sn 4 -S 1 -d 6 <executable>
> (-sn 4: use all 4 numa_nodes per node; -S 1: one task per numa_node; -d 6: reserve 6 cores per task for its OpenMP threads)
>
>>> If we want to be more aggressive and use 24-way threaded parallelism per node, then extra care must be taken to ensure the memory affinity is not screwed up.
>>
>> BTW: What is an "OpenMP thread" mapped to on the Cray systems? A pthread? Some other kind of thread?
>
> I'm not sure what you mean here. OpenMP is a set of directives for threading. So the OpenMP "threads" are just however many threads you assign to each MPI task (with OpenMP operating within each MPI task).
>
> -john
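[To confirm that a given aprun placement really puts one task per numa_node with its threads on local cores, a small check program can report where each thread lands. An editor's sketch, not from the thread: sched_getcpu() is glibc-specific, and the 6-cores-per-die numbering assumed in the comment is Hopper's.]

    /* placement.c: print which core each MPI rank's OpenMP threads run on.
     * Build with the Cray wrappers, e.g.: cc -fopenmp placement.c
     * Run: aprun -n 4 -sn 4 -S 1 -d 6 ./a.out  (OMP_NUM_THREADS=6) */
    #define _GNU_SOURCE
    #include <sched.h>       /* sched_getcpu(), glibc-specific */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* On Hopper's 24-core nodes, cores 0-5, 6-11, 12-17, and
             * 18-23 belong to numa_nodes 0-3; each rank's threads should
             * all report cores within a single such group. */
            int core = sched_getcpu();
            printf("rank %d thread %d -> core %d (numa_node %d?)\n",
                   rank, omp_get_thread_num(), core, core / 6);
        }

        MPI_Finalize();
        return 0;
    }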