On Wed, Dec 22, 2010 at 6:32 PM, Satish Balay <balay at mcs.anl.gov> wrote:
> On Wed, 22 Dec 2010, Yongjun Chen wrote:
>
> > On Wed, Dec 22, 2010 at 5:54 PM, Satish Balay <balay at mcs.anl.gov> wrote:
> >
> > > On Wed, 22 Dec 2010, Yongjun Chen wrote:
> > >
> > > > On Wed, Dec 22, 2010 at 5:40 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> > > >
> > > > > > Processors: 4 CPUS * 4Cores/CPU, with each core 2500MHz
> > > > > >
> > > > > > Memories: 16 *2 GB DDR2 333 MHz, dual channel, data width 64 bit,
> > > > > > so the memory Bandwidth for 2 memories is 64/8*166*2*2=5.4GB/s.
> > > > >
> > > > > Wait a minute. You have 16 cores that share 5.4 GB/s???? This is not
> > > > > enough for iterative solvers, in fact this is absolutely terrible for
> > > > > iterative solvers. You really want 5.4 GB/s PER core! This machine is
> > > > > absolutely inappropriate for iterative solvers. No package can give you
> > > > > good speedups on this machine.
> > > >
> > > > Barry, there are 16 memories, every 2 memories make up one dual channel,
> > > > thus in this machine there are 8 dual channel, each dual channel has the
> > > > memory bandwidth 5.4GB/s.
> > >
> > > What hardware is this? [processor/chipset?]
> >
> > By dmidecode, it shows the processor is
> >
> > Handle 0x0010, DMI type 4, 40 bytes
> > Processor Information
> >     Socket Designation: CPU 4
> >     Type: Central Processor
> >     Family: Quad-Core Opteron
> >     Manufacturer: AMD
> >     ID: 06 05 F6 40 74 03 E8 3D
> >     Signature: Family 5, Model 0, Stepping 6
> >     Flags:
> >         DE (Debugging extension)
> >         TSC (Time stamp counter)
> >         MSR (Model specific registers)
> >         PAE (Physical address extension)
> >         CX8 (CMPXCHG8 instruction supported)
> >         APIC (On-chip APIC hardware supported)
> >         CLFSH (CLFLUSH instruction supported)
> >         DS (Debug store)
> >         ACPI (ACPI supported)
> >         MMX (MMX technology supported)
> >         FXSR (Fast floating-point save and restore)
> >         SSE2 (Streaming SIMD extensions 2)
> >         SS (Self-snoop)
> >         HTT (Hyper-threading technology)
> >         TM (Thermal monitor supported)
> >     Version: Quad-Core AMD Opteron(tm) Processor 8360 SE
> >     Voltage: 1.5 V
> >     External Clock: 200 MHz
> >     Max Speed: 4600 MHz
> >     Current Speed: 2500 MHz
> >     Status: Populated, Enabled
> >     Upgrade: Other
> >     L1 Cache Handle: 0x0011
> >     L2 Cache Handle: 0x0012
> >     L3 Cache Handle: 0x0013
> >     Serial Number: N/A
> >     Asset Tag: N/A
> >     Part Number: N/A
> >     Core Count: 4
> >     Core Enabled: 4
> >     Characteristics:
> >         64-bit capable
>
> ok - your machine has the following schematic.. [from google]
>
> http://www.qdpma.com/SystemArchitecture_files/013_Opteron.png
>
> > > From what you say - it looks like each chip has 4cores, and 2
> > > dual-channel memory controllers for each of them.
> > >
> > > The question is - does the hardware provide scalable memory-bandwidth
> > > per core? Most machines don't.
> >
> > This point is not clear for me right now.
>
> Hm..
> the point is: the hardware designer had 2 choices:
>
> - provide a single memory controller per core [so each core gets only
>   2.7GB/s - i.e 4 memory controllers per CPU, and common L2 cache
>   across all cores not possible]
>
> - provide a single memory controller with 2-dual memory channels [i.e
>   10.8GB/s] that's shared by 1-4 cores. With this - there can be a
>   single L2 cache for all 4 cores.
>
> Which of the above 2 is a good design? The first one provides scalable
> performance - but the second one doesn't. Also the first one limits
> the performance of sequential [np=1 applications]. The second one
> provides all bandwidth to even np=1 codes - so they might have better
> sequential performance. And then performance differences due to different
> cache synchronization issues..
>
> Satish

Thanks a lot, Satish. It is much clearer now. But as for which of the two
choices this machine uses, the program dmidecode does not show this
information. Do you know any way to get it?

> > > I.e the same 5.4*2GB/s is available for 1 core run as well as the 4 core
> > > run.
> > >
> > > So if the algorithm is able to use 5.4GB/s [or more] for 1 thread,
> > > 10.8 [or more] for 2 threads - you would just see scalable performance
> > > from 1 to 2, and 3, 4 would perhaps be slightly incremental to the
> > > 2-core performance.
> > >
> > > Satish
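
The arithmetic in this thread can be written down as a rough model. The
sketch below is illustrative only. The function names are made up for this
example, the inputs are the figures quoted above (64-bit bus, 166 MHz clock,
double data rate, two channels per pair, two dual channels per socket), and
it assumes a purely bandwidth-bound kernel in which each thread can consume
one dual channel's worth of bandwidth. With a 166 MHz clock the formula
gives about 5.3 GB/s, which the thread rounds to 5.4 GB/s. The model is
flat beyond 2 threads, so it does not capture the slight gain at 3 and 4
cores mentioned above.

```python
# Back-of-the-envelope model of the bandwidth figures quoted in this thread.
# All function names are hypothetical; the default inputs come from the messages.


def dual_channel_bandwidth_gb_s(bus_bits=64, clock_mhz=166, ddr_factor=2, channels=2):
    """Peak bandwidth of one dual-channel pair: 64/8 * 166 * 2 * 2 MB/s."""
    bytes_per_transfer = bus_bits // 8
    mb_per_s = bytes_per_transfer * clock_mhz * ddr_factor * channels
    return mb_per_s / 1000.0


def shared_controller_bandwidth_gb_s(threads, per_thread_demand, shared_limit):
    """Second design: all cores on a socket draw from one shared limit."""
    return min(threads * per_thread_demand, shared_limit)


def per_core_controller_bandwidth_gb_s(threads, per_thread_demand, per_core_limit):
    """First design: each core has its own, smaller, private limit."""
    return threads * min(per_thread_demand, per_core_limit)


def shared_speedup(threads, per_thread_demand, shared_limit):
    """Speedup over one thread for a bandwidth-bound kernel, shared design."""
    one = shared_controller_bandwidth_gb_s(1, per_thread_demand, shared_limit)
    many = shared_controller_bandwidth_gb_s(threads, per_thread_demand, shared_limit)
    return many / one


if __name__ == "__main__":
    pair = dual_channel_bandwidth_gb_s()   # one dual channel
    socket_limit = 2 * pair                # two dual channels per socket
    print(f"one dual channel: {pair:.2f} GB/s, per socket: {socket_limit:.2f} GB/s")
    for n in range(1, 5):
        shared = shared_controller_bandwidth_gb_s(n, pair, socket_limit)
        private = per_core_controller_bandwidth_gb_s(n, pair, socket_limit / 4)
        print(f"np={n}: shared {shared:.2f} GB/s "
              f"(speedup {shared_speedup(n, pair, socket_limit):.2f}), "
              f"per-core {private:.2f} GB/s")
```

Under these assumptions the shared design gives a single thread more
bandwidth but stops scaling at 2 threads per socket, while the per-core
design scales linearly from a lower starting point.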
