Hi there,

I have made some new progress on the issue of SMP performance. My shared-memory machine has 8 dual-core Opteron CPUs, and I think the two cores on a single CPU chip share the memory bandwidth. Therefore, if I can avoid using both cores on the same chip, I should get some performance improvement. Indeed, I am able to do this with the Linux command taskset. Here is what I did:

petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 ../spAF

This way, I specifically ask for the processes to be run on the first core of each CPU. By doing this, my performance is doubled compared with the simple

petscmpirun -n 8 ../spAF

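The same kind of placement could in principle be requested from inside the program rather than through taskset. What follows is only a minimal sketch of that idea, not something taken from the actual code. It assumes Linux with glibc, and it assumes that core ids are numbered chip by chip (cores 0 and 1 on the first chip, 2 and 3 on the second, and so on), which should be checked on the real machine. The helper names first_core_for_rank and pin_self_to_core are made up for illustration. Note that it pins each process to exactly one core, whereas the taskset command above lets every process run on any of the eight listed cores.

```c
#define _GNU_SOURCE   /* must come first: exposes sched_setaffinity and CPU_* macros */
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: id of the first core on the chip assigned to `rank`.
 * ASSUMPTION: core ids are numbered chip by chip, so chip k owns cores
 * k*cores_per_chip .. k*cores_per_chip + cores_per_chip - 1.
 * Ranks beyond num_chips wrap around and share a chip again. */
int first_core_for_rank(int rank, int cores_per_chip, int num_chips)
{
    assert(rank >= 0 && cores_per_chip > 0 && num_chips > 0);
    return (rank % num_chips) * cores_per_chip;
}

/* Hypothetical helper: restrict the calling process to a single core.
 * Returns 0 on success and -1 on failure (errno is set by the kernel). */
int pin_self_to_core(int core)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);          /* start from an empty CPU set */
    CPU_SET(core, &mask);     /* allow only the requested core */

    /* pid 0 means "the calling process" */
    return sched_setaffinity(0, sizeof(mask), &mask);
}
```

In an MPI code, one would presumably obtain the rank with MPI_Comm_rank and then call pin_self_to_core(first_core_for_rank(rank, 2, 8)) early in the run, checking the return value. With 2 cores per chip and 8 chips, ranks 0 to 7 map to cores 0, 2, 4, ..., 14, the same set used in the taskset command.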
So this test shows that we do suffer from competition for resources among multiple processes, especially when we use 16 processes. However, I should point out that even with the help of taskset, the shared-memory performance is still 30% less than that on the cluster. I am not sure whether this problem is specific to the AMD machines or applies to any shared-memory architecture.

Thanks.
Shi

--- Shi Jin <jinzishuai at yahoo.com> wrote:

> Hi there,
>
> I am fairly new to PETSc but have 5 years of MPI programming already. I recently took on a project of analyzing a finite element code written in C with PETSc.
> I found out that on a shared-memory machine (60GB RAM, 16 CPUs), the code runs around 4 times slower than on a distributed-memory cluster (4GB RAM, 4 CPUs/node), although they yield identical results.
> There are 1.6 million finite elements in the problem, so it is a fairly large calculation. The total memory used is 3GB x 16 = 48GB.
>
> Both systems run Linux as the OS, and the same code is compiled against the same version of MPICH-2 and PETSc.
>
> The shared-memory machine is actually a little faster than the cluster machines in terms of single-process runs.
>
> I am surprised at this result, since we usually tend to think that shared memory would be much faster because in-memory operations are much faster than network communication.
>
> However, I read the PETSc FAQ and found that "the speed of sparse matrix computations is almost totally determined by the speed of the memory, not the speed of the CPU".
> This makes me wonder whether the poor performance of my code on a shared-memory machine is due to the competition of different processes on the same memory bus. Since the code is still MPI based, a lot of data is moving around inside the memory. Is this a reasonable explanation of what I observed?
>
> Thank you very much.
>
> Shi
