Can you run the app with the following options [use one per run] - and
see if any of them makes a difference in performance [in the
VecScatters]:
-vecscatter_rr
-vecscatter_ssend
-vecscatter_sendfirst

Also - you might want to try using the latest MPICH to see if there are
any improvements.

Regarding the hardware issues - yes, AMD has a NUMA architecture [i.e.
accessing memory attached to a different CPU is slower than accessing
the memory local to the CPU]. There could also be some OS issues wrt
memory layout for MPI messages - or some other contention [perhaps IO
interrupts from the OS?] that could be causing the slowdown. All of
this is just a guess..

Satish

On Wed, 7 Feb 2007, Shi Jin wrote:

> Thank you very much, Satish.
> You are right. From the log_summary, the communication takes slightly
> more time on shared memory than on the cluster, even after using
> taskset.
> This is still hard to understand, since I think in-memory operations
> have to be orders of magnitude faster than network operations
> (gigabit ethernet).
>
> By the way, I took a look at the specs of my shared-memory machine
> (a Sun Fire 4600 server). It seems that each CPU socket has its own
> DIMMs of RAM.
> I wonder if there is a speed penalty when one has to copy data from
> the RAM of one CPU to another.
>
> Thanks.
>
> Shi
>
> --- Satish Balay <balay at mcs.anl.gov> wrote:
>
> > A couple of comments:
> >
> > - With the dual-core Opteron, the memory bandwidth per core is now
> >   reduced by half - so the performance suffers. However, memory
> >   bandwidth across CPUs is scalable [6.4 GB/s per node, or 3.2 GB/s
> >   per core].
> >
> > - The current-generation Intel Core 2 Duo appears to claim
> >   sufficient bandwidth [15.3 GB/s per node = 7.65 GB/s per core?],
> >   so from this bandwidth number this chip might do better than the
> >   AMD chip. However, I'm not sure if there is an SMP with this chip
> >   which has a scalable memory system [across say 8 nodes - as you
> >   currently have..]
> > - Older Intel SMP boxes had a single memory bank shared across all
> >   the CPUs [so the effective bandwidth per CPU was pretty small.
> >   The Opterons' scalable architecture looked much better than the
> >   older Intel SMPs].
> >
> > - From the previous log_summary, part of the inefficiency of the
> >   SMP box [when compared to the cluster] was in the MPI
> >   performance. Do you still see this effect in the '-np 8' runs?
> >   If so, this could be [part of] the reason for this 30% reduction
> >   in performance.
> >
> > Satish
> >
> > On Mon, 5 Feb 2007, Shi Jin wrote:
> >
> > > Hi there,
> > >
> > > I have made some new progress on the issue of SMP performance.
> > > My shared-memory machine is an 8-socket dual-core Opteron
> > > machine, and I think the two cores on a single CPU chip share
> > > the memory bandwidth. Therefore, if I can avoid using both cores
> > > on a chip, I can get some performance improvement. Indeed, I am
> > > able to do this with the Linux command taskset.
> > > Here is what I did:
> > > petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 ../spAF
> > > This way, I specifically ask the processes to be run on the
> > > first core of each CPU.
> > > By doing this, my performance is doubled compared with the
> > > simple petscmpirun -n 8 ../spAF
> > >
> > > So this test shows that we do suffer from competition for
> > > resources among multiple processes, especially when we use 16
> > > processes.
> > >
> > > However, I should point out that even with the help of taskset,
> > > the shared-memory performance is still 30% less than that on the
> > > cluster.
> > >
> > > I am not sure whether this problem exists specifically on AMD
> > > machines or applies to any shared-memory architecture.
> > >
> > > Thanks.
> > > Shi
> > >
> > > --- Shi Jin <jinzishuai at yahoo.com> wrote:
> > >
> > > > Hi there,
> > > >
> > > > I am fairly new to PETSc but already have 5 years of MPI
> > > > programming experience. I recently took on a project of
> > > > analyzing a finite element code written in C with PETSc.
> > > > I found out that on a shared-memory machine (60GB RAM, 16
> > > > CPUs), the code runs around 4 times slower than on a
> > > > distributed-memory cluster (4GB RAM, 4 CPUs/node), although
> > > > they yield identical results.
> > > > There are 1.6 million finite elements in the problem, so it is
> > > > a fairly large calculation. The total memory used is
> > > > 3GB x 16 = 48GB.
> > > >
> > > > Both systems run Linux as the OS, and the same code is
> > > > compiled against the same versions of MPICH-2 and PETSc.
> > > >
> > > > The shared-memory machine is actually a little faster than the
> > > > cluster machines in terms of single-process runs.
> > > >
> > > > I am surprised at this result, since we usually tend to think
> > > > that shared memory would be much faster, because in-memory
> > > > operation is much faster than network communication.
> > > >
> > > > However, I read the PETSc FAQ and found that "the speed of
> > > > sparse matrix computations is almost totally determined by the
> > > > speed of the memory, not the speed of the CPU".
> > > > This makes me wonder whether the poor performance of my code
> > > > on a shared-memory machine is due to the competition of
> > > > different processes on the same memory bus. Since the code is
> > > > still MPI based, a lot of data is moving around inside the
> > > > memory. Is this a reasonable explanation of what I observed?
> > > >
> > > > Thank you very much.
> > > > Shi

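The per-core figures quoted in the thread are just the per-socket
bandwidth divided by the two cores sharing the socket; a quick check of
the arithmetic [numbers taken from the thread]:

```shell
# Per-core bandwidth = per-socket bandwidth / cores sharing the socket.
awk 'BEGIN {
  printf "Opteron:    %.2f GB/s per core\n", 6.4  / 2
  printf "Core 2 Duo: %.2f GB/s per core\n", 15.3 / 2
}'
# prints 3.20 GB/s and 7.65 GB/s per core respectively
```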