Can you run the app with the following options [use one per run] - and
see if any of them makes a difference in performance [in the
VecScatters]:
-vecscatter_rr
-vecscatter_ssend
-vecscatter_sendfirst

Also - you might want to try using the latest MPICH to see if there are
any improvements.

Regarding the hardware issues - yes, AMD has a NUMA architecture [i.e.
accessing memory attached to a different CPU is slower than accessing
the memory local to the CPU]. There could also be some OS issues wrt
memory layout for MPI messages - or some other contention [perhaps IO
interrupts from the OS?] that could be causing the slowdown. All of
this is just a guess..

Satish

On Wed, 7 Feb 2007, Shi Jin wrote:

> Thank you very much, Satish.
> You are right. From the log_summary, the communication takes slightly
> more time on shared memory than on the cluster, even after using
> taskset.
> This is still hard to understand, since I think in-memory operations
> have to be orders of magnitude faster than network operations
> (gigabit ethernet).
>
> By the way, I took a look at the specs of my shared-memory machine
> (a Sun Fire 4600 server). It seems that each CPU socket has its own
> DIMMs of RAM.
> I wonder if there is a speed penalty when one has to copy data from
> the RAM of one CPU to another.
>
> Thanks.
>
> Shi
>
> --- Satish Balay <balay at mcs.anl.gov> wrote:
>
> > A couple of comments:
> >
> > - With the dual-core Opteron, the memory bandwidth per core is now
> >   reduced by half - so the performance suffers. However, memory
> >   bandwidth across CPUs is scalable [6.4 GB/s per node, or 3.2 GB/s
> >   per core].
> >
> > - The current-generation Intel Core 2 Duo appears to claim
> >   sufficient bandwidth [15.3 GB/s per node = 7.65 GB/s per core?],
> >   so from this bandwidth number this chip might do better than the
> >   AMD chip. However, I'm not sure if there is an SMP with this chip
> >   which has a scalable memory system [across say 8 nodes - as you
> >   currently have..]
> > - Older Intel SMP boxes had a single memory bank shared across all
> >   the CPUs [so the effective bandwidth per CPU was pretty small.
> >   The Opterons' scalable architecture looked much better than the
> >   older Intel SMPs].
> >
> > - From the previous log_summary, part of the inefficiency of the
> >   SMP box [when compared to the cluster] was in the MPI
> >   performance. Do you still see this effect in the '-np 8' runs?
> >   If so, this could be [part of] the reason for this 30% reduction
> >   in performance.
> >
> > Satish
> >
> > On Mon, 5 Feb 2007, Shi Jin wrote:
> >
> > > Hi there,
> > >
> > > I have made some new progress on the issue of SMP performance.
> > > My shared-memory machine is an 8-socket dual-core Opteron
> > > machine, and I think the two cores on a single CPU chip share
> > > the memory bandwidth. Therefore, if I can avoid using both cores
> > > on a chip, I can get some performance improvement. Indeed, I am
> > > able to do this with the Linux command taskset.
> > > Here is what I did:
> > > petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 ../spAF
> > > This way, I specifically ask the processes to be run on the
> > > first core of each CPU.
> > > By doing this, my performance is doubled compared with the
> > > simple petscmpirun -n 8 ../spAF
> > >
> > > So this test shows that we do suffer from competition for
> > > resources among multiple processes, especially when we use 16
> > > processes.
> > >
> > > However, I should point out that even with the help of taskset,
> > > the shared-memory performance is still 30% less than that on the
> > > cluster.
> > >
> > > I am not sure whether this problem exists specifically on AMD
> > > machines or applies to any shared-memory architecture.
> > >
> > > Thanks.
> > > Shi
> > >
> > > --- Shi Jin <jinzishuai at yahoo.com> wrote:
> > >
> > > > Hi there,
> > > >
> > > > I am fairly new to PETSc but already have 5 years of MPI
> > > > programming experience. I recently took on a project of
> > > > analyzing a finite element code written in C with PETSc.
> > > > I found out that on a shared-memory machine (60GB RAM, 16
> > > > CPUs), the code runs around 4 times slower than on a
> > > > distributed-memory cluster (4GB RAM, 4 CPUs/node), although
> > > > they yield identical results.
> > > > There are 1.6 million finite elements in the problem, so it is
> > > > a fairly large calculation. The total memory used is
> > > > 3GB x 16 = 48GB.
> > > >
> > > > Both systems run Linux as the OS, and the same code is
> > > > compiled against the same versions of MPICH-2 and PETSc.
> > > >
> > > > The shared-memory machine is actually a little faster than the
> > > > cluster machines in terms of single-process runs.
> > > >
> > > > I am surprised at this result, since we usually tend to think
> > > > that shared memory would be much faster, because in-memory
> > > > operation is much faster than network communication.
> > > >
> > > > However, I read the PETSc FAQ and found that "the speed of
> > > > sparse matrix computations is almost totally determined by the
> > > > speed of the memory, not the speed of the CPU".
> > > > This makes me wonder whether the poor performance of my code
> > > > on a shared-memory machine is due to the competition of
> > > > different processes on the same memory bus. Since the code is
> > > > still MPI based, a lot of data is moving around inside the
> > > > memory. Is this a reasonable explanation of what I observed?
> > > >
> > > > Thank you very much.
> > > > Shi

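The per-core figures quoted in the thread are just the per-socket
bandwidth divided by the two cores sharing the socket; a quick check of
the arithmetic [numbers taken from the thread]:

```shell
# Per-core bandwidth = per-socket bandwidth / cores sharing the socket.
awk 'BEGIN {
  printf "Opteron:    %.2f GB/s per core\n", 6.4  / 2
  printf "Core 2 Duo: %.2f GB/s per core\n", 15.3 / 2
}'
# prints 3.20 GB/s and 7.65 GB/s per core respectively
```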