Karel Gardas wrote:
Erik,
to be honest, I don't quite understand what you wrote here. My reasoning comes from the engineering: a 4-socket server means that each CPU is connected to two others, forming a kind of ring. A nice, simple picture is here:
http://ixbtlabs.com/articles2/cpu/rmma-numa2.html
Now, from this picture I would imagine the memory hierarchy should be divided into three
parts (not two, as you write). The first is local memory connected to the CPU's own memory
controller. The second is memory connected directly to the CPU's two neighbouring CPUs, and the
third is memory connected directly to the CPU in the "opposite corner", i.e. the CPU you
need to go through 2 HT links to reach. That's exactly why I can't understand why you
talk about just 2 CPU groups and seem to assume that local memory access has the same
latency as access over an HT link. Sorry, I don't understand. Could you be so kind as to
provide me with a link to some AMD documentation? I've really tried hard googling for it,
but to no avail. I've only found various articles like the one above, but none that
explains what you have described here.
I went back and looked at my AMD system documentation, and I think I led
you astray. All 800-series AMD chips have a total of 3 HyperTransport
links, and in a 4-socket system, they are indeed laid out in a ring
formation. So, any single socket has a direct connection to its "own"
memory, plus connections to two neighbouring CPUs.
http://support.amd.com/us/Processor_TechDocs/40555.pdf
The link above is actually really good for exactly your problem - how to
optimize workloads for a 4-socket system.
What this boils down to is that it takes 50 ns to get to local RAM
(the DIMMs wired directly to that socket), 50 ns + 1 "hop" time to get to
DRAM attached to either neighbouring CPU (which looks like 50 ns +
55 ns = 105 ns), and 50 ns + 2 "hop" times to get to the cross-wise CPU
(50 ns + 55 ns + 55 ns = 160 ns). So, you are correct - there are 3 levels to
the memory access hierarchy. You still want to pin processes to
individual CPUs if you can, to limit this NUMA penalty. If you need more
than one CPU's power, I would recommend grouping them in 2s as I had
originally mentioned, since that avoids the 2-hop penalty. That is, group
CPUs 0 & 1, and 2 & 3. Avoid grouping 0 & 2 or 1 & 3 (the opposite
corners), as they will definitely suffer the 2-hop problem.
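The hop arithmetic above can be sketched in a few lines. This is just an illustrative model using the rough figures quoted in this thread (~50 ns local, ~55 ns per HT hop) on a 4-socket ring; real latencies will vary with the chip revision and DIMM speed:

```python
# Rough latency model for a 4-socket AMD ring (sockets 0-1-2-3-0).
# The figures are the approximate numbers from the thread, not measurements.
LOCAL_NS = 50   # access to the socket's own DIMMs
HOP_NS = 55     # extra cost per HyperTransport hop
N = 4           # sockets on the ring

def hops(src, dst):
    """Minimum number of HT hops between two sockets on the ring."""
    d = abs(src - dst) % N
    return min(d, N - d)

def latency_ns(src, dst):
    """Estimated latency for socket src reaching DIMMs owned by socket dst."""
    return LOCAL_NS + hops(src, dst) * HOP_NS

print(latency_ns(0, 0))  # local access:     50
print(latency_ns(0, 1))  # one hop:          105
print(latency_ns(0, 2))  # opposite corner:  160

# Pairing adjacent sockets keeps cross-traffic to one hop; pairing
# opposite corners always pays the 2-hop penalty.
print(max(latency_ns(a, b) for a, b in [(0, 1), (1, 0)]))  # 105
print(max(latency_ns(a, b) for a, b in [(0, 2), (2, 0)]))  # 160
```

Nothing here is AMD-specific beyond the ring assumption; it just makes the 50/105/160 ns arithmetic explicit.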
What I had originally described is the layout of the 8-socket systems,
which, because they likewise have only 3 HT links per socket, are
grouped in a ring also, but each "corner" is a 2-CPU group.
Sorry about that - I should think more at 3am before I post. <yawn>
If your boss will let you, get a whole bunch of 512MB or 1GB HP DIMMs
from eBay, as they're dirt cheap (search for "HP (512*,1gb) PC-3200 ECC"
and see what it shows, or just search for the relevant HP part numbers:
376638-B21 and 376639-B21) - I'm seeing prices that are under $100 for
4GB of additional RAM. Otherwise, go to someone like www.memoryx.net and
get certified memory - it's still going to be super-cheap for these
machines, so you /should/ be able to get enough to balance out the banks
of memory for better performance.
In fact, DIMMs in the USA look like they're about 1/3 of the price here in the
EU. That's the reason I wrote about expensive memory - I had been looking at our
EU prices. Just for the record, I'll probably purchase from a US e-shop then.
Thanks,
Karel
MemoryX is here in Silicon Valley, and I highly recommend them. They're
very professional, very thorough, and they are /very/ experienced with
dealing with non-USA customers. They have a massive stock of stuff for
virtually everything, and they have a lifetime guarantee (including a
money-back compatibility guarantee). If you're not doing eBay, that's
who I'd deal with.
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
_______________________________________________
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org