On Wed, 22 Dec 2010, Yongjun Chen wrote:

> On Wed, Dec 22, 2010 at 5:54 PM, Satish Balay <balay at mcs.anl.gov> wrote:
>
> > On Wed, 22 Dec 2010, Yongjun Chen wrote:
> >
> > > On Wed, Dec 22, 2010 at 5:40 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> > >
> > > > > Processors: 4 CPUS * 4Cores/CPU, with each core 2500MHz
> > > > >
> > > > > Memories: 16 *2 GB DDR2 333 MHz, dual channel, data width 64 bit, so the
> > > > > memory Bandwidth for 2 memories is 64/8*166*2*2=5.4GB/s.
> > > >
> > > > Wait a minute. You have 16 cores that share 5.4 GB/s???? This is not
> > > > enough for iterative solvers, in fact this is absolutely terrible for
> > > > iterative solvers. You really want 5.4 GB/s PER core! This machine is
> > > > absolutely inappropriate for iterative solvers. No package can give you
> > > > good speedups on this machine.
> > >
> > > Barry, there are 16 memories, every 2 memories make up one dual channel,
> > > thus in this machine there are 8 dual channel, each dual channel has the
> > > memory bandwidth 5.4GB/s.
> >
> > What hardware is this? [processor/chipset?]
> >
>
> By dmidecode, it shows the processor is
>
> Handle 0x0010, DMI type 4, 40 bytes
> Processor Information
>         Socket Designation: CPU 4
>         Type: Central Processor
>         Family: Quad-Core Opteron
>         Manufacturer: AMD
>         ID: 06 05 F6 40 74 03 E8 3D
>         Signature: Family 5, Model 0, Stepping 6
>         Flags:
>                 DE (Debugging extension)
>                 TSC (Time stamp counter)
>                 MSR (Model specific registers)
>                 PAE (Physical address extension)
>                 CX8 (CMPXCHG8 instruction supported)
>                 APIC (On-chip APIC hardware supported)
>                 CLFSH (CLFLUSH instruction supported)
>                 DS (Debug store)
>                 ACPI (ACPI supported)
>                 MMX (MMX technology supported)
>                 FXSR (Fast floating-point save and restore)
>                 SSE2 (Streaming SIMD extensions 2)
>                 SS (Self-snoop)
>                 HTT (Hyper-threading technology)
>                 TM (Thermal monitor supported)
>         Version: Quad-Core AMD Opteron(tm) Processor 8360 SE
>         Voltage: 1.5 V
>         External Clock: 200 MHz
>         Max Speed: 4600 MHz
>         Current Speed: 2500 MHz
>         Status: Populated, Enabled
>         Upgrade: Other
>         L1 Cache Handle: 0x0011
>         L2 Cache Handle: 0x0012
>         L3 Cache Handle: 0x0013
>         Serial Number: N/A
>         Asset Tag: N/A
>         Part Number: N/A
>         Core Count: 4
>         Core Enabled: 4
>         Characteristics:
>                 64-bit capable

ok - your machine has the following schematic.. [from google]

http://www.qdpma.com/SystemArchitecture_files/013_Opteron.png

> > From what you say - it looks like each chip has 4 cores, and 2
> > dual-channel memory controllers for each of them.
> >
> > The question is - does the hardware provide scalable memory-bandwidth
> > per core? Most machines don't.
> >
>
> This point is not clear for me right now.

Hm.. the point is: the hardware designer had 2 choices:

- provide a single memory controller per core [so each core gets only
  2.7GB/s - i.e. 4 memory controllers per CPU, and a common L2 cache
  across all cores is not possible]

- provide a single memory controller with 2 dual-channel memory
  channels [i.e. 10.8GB/s] that's shared by 1-4 cores. With this - there
  can be a single L2 cache for all 4 cores.

Which of the above 2 is a good design? The first one provides scalable
performance - but the second one doesn't. Also the first one limits
the performance of sequential [np=1] applications. The second one
provides all the bandwidth to even np=1 codes - so they might have better
sequential performance.

And then there are performance differences due to different cache
synchronization issues..

Satish

>
> >
> > I.e. the same 5.4*2GB/s is available for the 1 core run as well as the
> > 4 core run.
> >
> > So if the algorithm is able to use 5.4GB/s [or more] for 1 thread,
> > 10.8 [or more] for 2 threads - you would just see scalable performance
> > from 1 to 2, and 3, 4 would perhaps be slightly incremental to the
> > 2-core performance.
> >
> > Satish
> >
>
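
[The arithmetic in this thread can be checked with a small Python sketch. It is only an idealized model, not a measurement: the function names are made up for illustration, 1 GB is taken as 1000 MB, and the assumption that one core can consume 5.4 GB/s is just the figure used in the discussion above. With these inputs 64/8*166*2*2 comes out at about 5.3 GB/s, which the thread rounds to 5.4.]

```python
def peak_bandwidth_gb_s(bus_width_bits, clock_mhz, transfers_per_clock, channels):
    """Theoretical peak memory bandwidth in GB/s (assuming 1 GB = 1000 MB)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz * transfers_per_clock * channels / 1000.0


def bandwidth_bound_speedup(n_cores, per_core_demand_gb_s, shared_bandwidth_gb_s):
    """Idealized speedup over 1 core for a purely bandwidth-bound code.

    Assumes all cores share one memory controller and that nothing
    other than memory bandwidth limits performance.
    """
    one_core = min(per_core_demand_gb_s, shared_bandwidth_gb_s)
    all_cores = min(n_cores * per_core_demand_gb_s, shared_bandwidth_gb_s)
    return all_cores / one_core


# The formula quoted in the thread: 64/8 * 166 * 2 * 2
dual_channel = peak_bandwidth_gb_s(64, 166, 2, 2)

# Two dual channels per chip, shared by its 4 cores (second design choice).
per_chip = 2 * dual_channel

# Speedups for 1..4 cores, using the thread's rounded figures
# (5.4 GB/s usable by one core, 10.8 GB/s shared per chip).
speedups = [bandwidth_bound_speedup(n, 5.4, 10.8) for n in range(1, 5)]

print(f"one dual channel: {dual_channel:.3f} GB/s")
print(f"per chip:         {per_chip:.3f} GB/s")
print("idealized speedup for 1..4 cores:", speedups)
```

[In this model the speedup stops growing after 2 cores per chip, which matches the "scalable from 1 to 2, then roughly flat" behaviour described above; a real run would likely show small gains for cores 3 and 4 because no code is purely bandwidth-bound.]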
