Justin Chang <[email protected]> writes:

> On Thu, Aug 13, 2015 at 1:04 PM, Jed Brown <[email protected]> wrote:
>> It looks like with one core/socket, all your memory sits over one
>> channel. You can play tricks to avoid that or use 4 cores/socket in
>> order to use all memory channels.
>
> How do I play these tricks?

They generally aren't practical outside of simple benchmarks. Read
through this blog series if you want to dive into memory performance.

http://sites.utexas.edu/jdm4372/2010/11/11/optimizing-amd-opteron-memory-bandwidth-part-5-single-thread-read-only/

> I have no root access. Is there another way to confirm the clock speed?

I don't recall a way to access that information without root. You can
benchmark, obviously, but you're looking for an independent information
source. You can ask a sysadmin to run this on a compute node.

> ---
>
> So if I have two sockets per node, then the theoretical peak bandwidth
> is actually double than what I thought (whether it be 119.4 GB/s or
> 102.4 GB/s). And if 8 cores really is the optimal number to use for a
> single compute node, why are there 20 totals to begin with? Or would
> this depend on the particular application?

"20 totals"? Note that you might have hyperthreading, in which case
there are twice as many logical cores as physical cores.

> Also, can someone elaborate on the difference between the words
> "core", "processor", and "thread"?

Processor - typically a unit of manufacturing and sale that goes into a
socket. Sometimes it shares a last-level cache and other times it is
independent parts stuck together. Sometimes different parts of the
processor are connected to different memory channels (implying multiple
"NUMA nodes" on a single socket) and sometimes they are multiplexed (so
all cores see the same speed to any memory channel on that socket).

Core - the physical unit that processes ("integer") instructions. There
can be multiple floating point units per core (e.g., anything with
dual-issue FMA) or multiple cores per floating point unit (e.g., the AMD
processors on Titan).

Logical core/hardware thread - the logical unit exposed to the operating
system. Often there are 2, 4, or more hardware threads per core.
These have their own registers (as far as you can tell; it can be
complicated by "register renaming") and are used to cover high-latency
operations including waiting on memory and some arithmetic. Usually
only one hardware thread issues instructions in any given cycle, so if a
single thread has sufficient ILP (instruction-level parallelism) to keep
issuing every cycle, there may be no benefit to using multiple hardware
threads. On some architectures a single thread cannot issue every
cycle, so multiple hardware threads per core are needed to reach peak
flops, integer instruction throughput, and/or bandwidth.
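
To see how the socket / core / hardware thread distinction shows up on a
given node, here is a minimal sketch that counts each from
/proc/cpuinfo-style text. It assumes a Linux x86 system, where that file
is readable without root and usually has one blank-line-separated block
per logical core with "processor", "physical id", and "core id" fields.
Other architectures may omit those fields. The function name
count_topology and the SAMPLE text are made up for illustration.

```python
def count_topology(cpuinfo_text):
    """Return (sockets, physical cores, logical cores) from cpuinfo-style text.

    Assumes one blank-line-separated block per logical core, with
    "physical id" (socket) and "core id" fields as on Linux x86.
    """
    sockets = set()
    cores = set()
    logical = 0
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue  # not a per-CPU block
        logical += 1
        # Fall back to treating each logical CPU as its own core on
        # socket "0" if the topology fields are missing.
        socket = fields.get("physical id", "0")
        core = fields.get("core id", fields["processor"])
        sockets.add(socket)
        cores.add((socket, core))
    return len(sockets), len(cores), logical


# Hypothetical sample: one socket, two cores, two hardware threads per core.
SAMPLE = "\n\n".join(
    f"processor\t: {cpu}\nphysical id\t: 0\ncore id\t\t: {cpu % 2}"
    for cpu in range(4)
)

print(count_topology(SAMPLE))
```

On a real node you would pass in open("/proc/cpuinfo").read() instead of
SAMPLE. If the logical count is twice the physical core count, that
suggests hyperthreading is enabled.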
