I am very suspicious of any results where the Fortran BLAS is outperforming MKL.
A

On Thu, Mar 17, 2011 at 3:24 AM, Natarajan CS <csnataraj at gmail.com> wrote:
> Hello Rob,
>     Thanks for the update, this might be very valuable for other
> developers out there!
> I am still a little surprised by the performance of MKL; I use it for a
> variety of problems and haven't come across a situation where performance
> has been penalized! Maybe there are more experienced developers out here
> who have seen otherwise. Have you tried ATLAS by any chance?
>
> Where MPI performance is concerned, I wouldn't be very surprised if
> Intel MPI turns out to be faster than MPICH2. All this is of course
> fabric dependent but, as I said, that has been my experience for a
> variety of problems. Also, I think Intel MPI is a customized version of
> MPICH2, so it may not be a big surprise there.
>
> There are very experienced people in this forum who might be able to say
> otherwise and give more accurate answers.
>
> Cheers,
>
> C.S.N
>
>
> On Wed, Mar 16, 2011 at 4:22 PM, Robert Ellis <Robert.Ellis at geosoft.com>
> wrote:
>>
>> Hi All,
>>
>> For those still interested in this thread, timing tests with MKL
>> indicate that sequential MKL performs approximately the same as
>> parallel MKL with NUM_THREADS=1, which isn't too surprising. What is
>> a bit surprising is that MKL always, at least for this application,
>> gives significantly slower performance than direct compilation of the
>> code from --download-f-blas-lapack=1. My conclusion is that if your
>> code is written with explicit parallelization, in this case using
>> PETSc, and fully utilizes your hardware, using sophisticated libraries
>> may actually harm performance. Keep it simple!
>>
>> Now a question: all my tests used MPICH2. Does anyone think using
>> Intel MPI would significantly improve the performance of MKL with PETSc?
>>
>> Cheers,
>>
>> Rob
>>
>> From: petsc-users-bounces at mcs.anl.gov
>> [mailto:petsc-users-bounces at mcs.anl.gov] On Behalf Of Rob Ellis
>> Sent: Tuesday, March 15, 2011 3:33 PM
>> To: 'PETSc users list'
>> Subject: Re: [petsc-users] Building with MKL 10.3
>>
>> Yes, MKL_DYNAMIC was set to true. No, I haven't tested on Nehalem. I'm
>> currently comparing sequential MKL with --download-f-blas-lapack=1.
>>
>> Rob
>>
>> From: petsc-users-bounces at mcs.anl.gov
>> [mailto:petsc-users-bounces at mcs.anl.gov] On Behalf Of Natarajan CS
>> Sent: Tuesday, March 15, 2011 3:20 PM
>> To: PETSc users list
>> Cc: Robert Ellis
>> Subject: Re: [petsc-users] Building with MKL 10.3
>>
>> Thanks Eric and Rob.
>>
>> Indeed! Was MKL_DYNAMIC set to default (true)? It looks like using 1
>> thread per core (sequential MKL) is the right thing to do as a baseline.
>> I would think that the performance of the #cores = num_mpi_processes *
>> num_mkl_threads case might be <= the #cores = num_mpi_processes case
>> (# cores constant) unless some cache effects come into play (not sure
>> what; I would think the MKL installation should weed these issues out).
>>
>> P.S.:
>> Out of curiosity, have you also tested your app on Nehalem? Any
>> difference between Nehalem vs Westmere for similar bandwidth?
>>
>> On Tue, Mar 15, 2011 at 4:35 PM, Jed Brown <jed at 59a2.org> wrote:
>>
>> On Tue, Mar 15, 2011 at 22:30, Robert Ellis <Robert.Ellis at geosoft.com>
>> wrote:
>>
>> Regardless of setting the number of threads for MKL or OMP, the MKL
>> performance was worse than simply using --download-f-blas-lapack=1.
>>
>> Interesting. Does this statement include using just one thread, perhaps
>> with a non-threaded MKL? Also, when you used threading, were you putting
>> an MPI process on every core or were you making sure that you had enough
>> cores for num_mpi_processes * num_mkl_threads?
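For anyone wanting to reproduce the comparison discussed in this thread, a
minimal sketch of the two builds and the run-time settings follows. The MKL
path, MPI process count, and application name are placeholders, and the exact
configure option spellings can differ between PETSc and MKL versions:

  # Build 1: reference Fortran BLAS/LAPACK downloaded by PETSc configure
  ./configure --download-f-blas-lapack=1

  # Build 2: link against an installed MKL (path is a placeholder; the
  # directory layout depends on the MKL version)
  ./configure --with-blas-lapack-dir=/opt/intel/mkl

  # Run with sequential MKL behaviour: one thread per BLAS call,
  # one MPI process per core (process count and app name are placeholders)
  export MKL_NUM_THREADS=1
  export MKL_DYNAMIC=false
  mpiexec -n 8 ./app

As the thread shows, which build wins is application dependent; timing both
under the same MPI launch is the only reliable check.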
