Karel Gardas wrote:
Erik,

to be honest, I don't quite get what you write here. The reason is the engineering: 
a 4-socket server means that each CPU is connected to 2 others, and together they form 
a kind of ring. There is a nice, simple picture here: 
http://ixbtlabs.com/articles2/cpu/rmma-numa2.html

Now, from this picture I would imagine the memory hierarchy should be divided into 3 
parts (not two, as you write). The first is the local memory connected to the CPU's own memory 
controller. The second is memory connected directly to the CPU's two neighbour CPUs, and the 
third is memory connected directly to the CPU in the "opposite corner", i.e. the CPU you 
need to go through 2 HT links to reach. That's exactly why I'm not able to understand why 
you talk about just 2 CPU groups and seem to assume that local memory access has the same 
latency as access over an HT link. Sorry, I don't understand. Could you be so kind as to 
provide me with a link to some AMD documentation? I've really tried hard googling for it, 
but to no avail. I've just found various articles like the one above, but none that 
explains what you have described here.

I went back and looked at my AMD system documentation, and I think I led you astray. All 800-series AMD chips have a total of 3 HyperTransport links, and in a 4-socket system they are indeed laid out in a ring formation. So, any single socket has a direct connection to its "own" memory, plus connections to two neighbor CPUs.

http://support.amd.com/us/Processor_TechDocs/40555.pdf

The document above is actually really good for exactly your problem: how to optimize workloads for a 4-socket system.

What this boils down to is that it takes about 50 ns to get to local RAM (the DIMMs wired directly to that socket), 50 ns + 1 "hop" to get to DRAM attached to either neighbor CPU (which looks like 50 ns + 55 ns = 105 ns), and 50 ns + 2 "hops" to get to the cross-wise CPU (50 ns + 55 ns + 55 ns = 160 ns). So, you are correct - there are 3 levels to the memory access hierarchy. You still want to pin processes to individual CPUs if you can, to limit this NUMA penalty. If you need more than one CPU's power, I would recommend grouping them in 2s like I had originally mentioned, as it avoids the 2-hop penalty. That is, CPU0 & 1, and 2 & 3. Avoid grouping 1 & 2 or 0 & 3, as those pairs will definitely suffer the 2-hop problem.
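To make the arithmetic concrete, here is a small sketch that models the latency estimates above. The 50 ns local figure and 55 ns per-hop figure are the rough numbers from this thread, and the specific link layout (which socket pairs share a direct HT link) is an assumption for illustration - check your board's actual topology:

```python
# Sketch: estimated memory access latency in a 4-socket Opteron ring.
# Figures are the rough estimates from this thread, not measured values:
#   ~50 ns to local DRAM, plus ~55 ns per HyperTransport hop.
# Assumed link layout (hypothetical): sockets 0-1, 0-2, 1-3, 2-3 are
# directly linked, so the diagonals (0 & 3, 1 & 2) are 2 hops apart.

LOCAL_NS = 50
HOP_NS = 55

# Direct HT links in the assumed 4-socket ring (unordered pairs).
LINKS = {(0, 1), (0, 2), (1, 3), (2, 3)}

def hops(a, b):
    """Number of HT hops from socket a to socket b in the assumed ring."""
    if a == b:
        return 0
    if (min(a, b), max(a, b)) in LINKS:
        return 1
    return 2  # diagonally opposite sockets: two hops around the ring

def latency_ns(cpu, mem_socket):
    """Estimated latency for cpu accessing DRAM attached to mem_socket."""
    return LOCAL_NS + hops(cpu, mem_socket) * HOP_NS

for target in range(4):
    print(f"CPU0 -> socket {target} DRAM: {latency_ns(0, target)} ns")
# CPU0 sees 50 ns locally, 105 ns to its two neighbors, 160 ns diagonally.
```

This also shows why pairing directly-linked sockets matters: a process group spanning a diagonal pays the 160 ns worst case on every cross-socket access.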


What I had originally described is the layout of the 8-socket systems, which, because they likewise have only 3 HT links per socket, are also arranged in a ring, but each "corner" of that ring is a 2-CPU group.


Sorry about that - I should think more at 3am before I post.  <yawn>

If your boss will let you, get a whole bunch of 512MB or 1GB HP DIMMs from eBay, as they're dirt cheap (search for "HP (512*,1gb) PC-3200 ECC" and see what it shows, or just search for the relevant HP part numbers: 376638-B21 and 376639-B21) - I'm seeing prices under $100 for 4GB of additional RAM. Otherwise, go to someone like www.memoryx.net and get certified memory - it's still going to be super-cheap for these machines, so you /should/ be able to get enough to balance out the banks of memory for better performance.

In fact, DIMMs in the USA look like they're about 1/3 of the price here in the EU. That's the 
reason I wrote about expensive memory - I had been looking at our EU prices. For the 
record, I'm probably going to purchase from a US e-shop then.

Thanks,
Karel

MemoryX is here in Silicon Valley, and I highly recommend them. They're very professional, very thorough, and they are /very/ experienced with dealing with non-USA customers. They have a massive stock of stuff for virtually everything, and they have a lifetime guarantee (including a money-back compatibility guarantee). If you're not doing eBay, that's who I'd deal with.

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

_______________________________________________
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org
