[Pw_forum] openmp vs mpich performance with MKL 10.x
Thank you Axel, for all the time and the explanations of the deep numerical details. I will give up at this point. I am satisfied to know that I am not wasting CPU resources due to bad configuration, and I will keep both the MPI and OpenMP compilations. I will try to add to Nicola's benchmark with CP.

Best regards
Eduardo Menendez
[Pw_forum] openmp vs mpich performance with MKL 10.x
On Wed, 7 May 2008, Eduardo Ariel Menendez Proupin wrote:

EAM> Hi,
EAM> Please, find attached my best make.sys, to be run serially. Try this in your
EAM> system. My timings are close to yours. Below are the details. However, it [...]

ok, i tried running on my machine: with the intel wrapper i get a wall time of 13m43s, and using a multi-threaded fftw3 i need a wall time of 16m13s to complete the job, but i have not yet added the additional tunings that i added to CPMD that finally made fftw3 faster. in summary it looks as if on my hardware MPI is the winner. i would be interested to see if you get different timings with OpenMPI instead of MPICH.

EAM> runs faster serially than using mpiexec -n 1. [...]
EAM> > obviously, switching to the intel fft didn't help.
EAM>
EAM> FOR ME, IT HELPS ONLY WHEN RUNNING SERIAL.

on CPMD i found that actually using the multi-threaded fftw3 is _even_ faster. you will need to add one function call to tell the fftw3 planner that all future plans should be generated for $OMP_NUM_THREADS threads. the fact that it helps in the serial code only is easily understandable if you look at what QE's FFT modules do differently when running in serial or in parallel. if you run in serial, QE calls a 3d-FFT directly instead of a sequence of 1d/2d-FFTs. with the 3d-FFT you have the chance to parallelize in the same way as with MPI by using threads. if you run in parallel, you already call many small 1d-FFTs and those don't parallelize well; instead it would be required to distribute those calls across threads to get a similar gain.

EAM> > your system with many states and only gamma point
EAM> > is definitely a case that benefits the most from
EAM> > multi-threaded BLAS/LAPACK.
EAM>
EAM> TYPICAL FOR BO MOLECULAR DYNAMICS.
EAM> I WOULD SAY, AVOID MIXING MPI AND OPENMP. ALSO AVOID INTEL FFTW WRAPPERS
EAM> WITH MPI, EVEN IF OMP_NUM_THREADS=1.
EAM> USE THREADED BLAS/LAPACK/FFTW2(3) FOR SERIAL RUNS.

i don't think that this can be said in general, because your system is a best-case scenario. in my experience a serial executable is about 10% faster than a parallel one for one task with plane-wave pseudopotential calculations. the fact that you have a large system with only the gamma point gives you the maximum benefit from parallel LAPACK/BLAS and the multi-threaded FFT. however, if you want to do BO dynamics i suspect that you may lose the performance advantage, since the wavefunction extrapolation will cut down the number of SCF cycles needed, and at the same time the force calculation is not multi-threaded at all. to get a real benefit from a multi-core machine, additional OpenMP directives need to be added to the QE code. the fact that OpenMP libraries and MPI parallelization are somewhat comparable could indicate that there is some more room to improve the MPI parallelization. luckily, for most QE users the first, simple level of parallelization across k-points will apply and give them a lot of speedup without much effort, and only _then_ should the parallelization across the G-space, task groups and finally threads/libraries/OpenMP directives be applied.

cheers,
   axel.

EAM> ANYWAY, THE DIFFERENCE BETWEEN THE BEST MPI AND THE BEST OPENMP IS LESS THAN
EAM> 10% (11m30s vs 12m43s)
EAM>
EAM> > i'm curious to learn how these numbers match up
EAM> > with your performance measurements.
EAM> >
EAM> > cheers,
EAM> > axel.
--
===
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S. 34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
===
If you make something idiot-proof, the universe creates a better idiot.
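The one extra fftw3 planner call mentioned above boils down to two library calls made once, before any plans are created. A minimal C sketch (QE itself is Fortran, so the real change would go through FFTW's Fortran interface; taking the thread count from omp_get_max_threads() is an assumption):

    /* one-time setup so that FFTW3 plans created afterwards run multi-threaded */
    #include <fftw3.h>
    #include <omp.h>

    void init_threaded_fftw3(void)
    {
        if (fftw_init_threads() == 0)   /* returns 0 only if threading support is unavailable */
            return;
        /* all future plans will be created for this many threads;
           omp_get_max_threads() normally reflects OMP_NUM_THREADS */
        fftw_plan_with_nthreads(omp_get_max_threads());
    }

    /* link with something like: -lfftw3_threads -lfftw3 -lm (plus the OpenMP flag) */

After this setup, every subsequent fftw_plan_dft_* call produces a plan that executes multi-threaded.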
[Pw_forum] openmp vs mpich performance with MKL 10.x
On Tue, 6 May 2008, Eduardo Ariel Menendez Proupin wrote:

EAMP> Dear Axel,
EAMP>
EAMP> > you should also compare against the parallel executable
EAMP> > run with -np 1 against the serial executable.
EAMP>
EAMP> It is a bit slower.

eduardo,

ok, thanks. here are some timing results on my desktop. the machine was not entirely idle and freshly booted, so take the numbers with a bit of caution. i have a one-year-old two-socket intel dual-core 2.66GHz machine (i.e. more or less equivalent to a single-socket intel quad-core, with two dual-core dies in one case). this is using the latest cvs code:

- with serial MKL, serial FFTW-2.1.5 and OpenMPI with 4 MPI tasks, i get a wall time of 12m12s and a cpu time of 10m40s.
- changing MKL to threaded MKL using 4 threads and 1 MPI task, i get a wall time of 18m8s and a cpu time of 28m30s (which means that roughly 40% of the time the code was running multi-threaded BLAS/LAPACK).
- with serial FFT, threaded MKL using 2 threads and 2 MPI tasks, i get a wall time of 12m45s and a cpu time of 14m42s.
- when i swap the serial FFTW2 against the intel MKL FFTW2 wrapper, i get with 2 threads and 2 MPI tasks a wall time of 15m2s and a cpu time of 24m11s,
- and with 4 threads and 1 MPI task i get a wall time of 0h19m and a cpu time of 1h2m.
- finally, when disabling threading and with 4 MPI tasks, i get 12m38s wall time and 11m14s cpu time.

obviously, switching to the intel fft didn't help. your system with many states and only gamma point is definitely a case that benefits the most from multi-threaded BLAS/LAPACK.

i'm curious to learn how these numbers match up with your performance measurements.

cheers,
   axel.

EAMP>
EAMP> Attached is my input.

--
===
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S. 34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
===
If you make something idiot-proof, the universe creates a better idiot.
[Pw_forum] openmp vs mpich performance with MKL 10.x
> there are two issues that need to be considered.
>
> 1) how large are your test jobs? if they are not large enough, timings are
> pointless.

about 15 minutes on an Intel quad-core. 66 atoms: Cd_30Te_30O_6, 576 electrons in total. My test may be very particular. If you have a balanced benchmark, I would like to run it.

> 2) it is most likely that you are still tricked by the
> auto-parallelization of intel MKL. the export OMP_NUM_THREADS
> will usually only work for the _local_ copy, for some
> MPI startup mechanisms not at all. thus your MPI jobs will
> be slowed down.

I am using only SMP. Sorry, I still don't have a cluster of quad-cores.

> to make certain that you only link the serial version of
> MKL with your MPI executable, please replace -lmkl_em64t
> in your make.sys file with
> -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

Yes, I also tried that. The test runs in 14m2s. Using only -lmkl_em64t it runs in 14m31s. Using the serial compilation it ran in 12m20s.

Thanks,
Eduardo
[Pw_forum] openmp vs mpich performance with MKL 10.x
On Tue, 6 May 2008, Eduardo Ariel Menendez Proupin wrote:

EAMP> > there are two issues that need to be considered.
EAMP> >
EAMP> > 1) how large are your test jobs? if they are not large enough, timings are
EAMP> > pointless.
EAMP>
EAMP> about 15 minutes on an Intel quad-core. 66 atoms: Cd_30Te_30O_6, 576
EAMP> electrons in total. My test may be very particular. If you have

hmmm... that is pretty large. would you mind sending me the input? i'd like to make some verifications on my machine (two-socket dual-core).

EAMP> a balanced benchmark, I would like to run it.

i've only done these kinds of benchmarks systematically with CPMD, and only a few confirmation tests with QE. in general the G-space parallelization is comparable. while the individual performance for a specific problem can be quite different (QE is far superior with ultra-soft pseudopotentials and k-points, CPMD outruns cp.x with norm-conserving pseudos), the scaling behavior was always quite similar for small to medium numbers of nodes.

EAMP> > 2) it is most likely that you are still tricked by the
EAMP> > auto-parallelization of intel MKL. the export OMP_NUM_THREADS
EAMP> > will usually only work for the _local_ copy, for some
EAMP> > MPI startup mechanisms not at all. thus your MPI jobs will
EAMP> > be slowed down.
EAMP>
EAMP> I am using only SMP. Sorry, I still don't have a cluster of quad-cores.

that still does not mean that the environment is exported. some MPICH versions have pretty awkward ways of starting MPI environments that do not always forward the environment at all.

EAMP> > to make certain that you only link the serial version of
EAMP> > MKL with your MPI executable, please replace -lmkl_em64t
EAMP> > in your make.sys file with
EAMP> > -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
EAMP>
EAMP> Yes, I also tried that. The test runs in 14m2s. Using only -lmkl_em64t it
EAMP> runs in 14m31s. Using the serial compilation it ran in 12m20s.

you should also compare the parallel executable run with -np 1 against the serial executable. depending on your hardware (memory speed) and the fact that the 10.0 MKL has about a 20% speed improvement on recent cpus, it is quite possible. since your problem is quite large, i guess that a lot of time is spent in the libraries. with a single quad-core cpu you also have the maximum amount of memory contention when running 4 individual MPI tasks, whereas using multi-threading may take better advantage of data locality and reduce the load on the memory bus.

cheers,
   axel.

--
===
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S. 34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
===
If you make something idiot-proof, the universe creates a better idiot.
[Pw_forum] openmp vs mpich performance with MKL 10.x
On Tue, 6 May 2008, Nicola Marzari wrote:

NM> Dear Eduardo,

hi nicola,

NM> 1) no improvements with the Intel fftw2 wrapper, as opposed to the fftw2
NM> Q-E sources, when using mpi. I also never managed to successfully run
NM> with the Intel fftw3 wrapper (or with fftw3 - that probably says
NM> something about me).

no it doesn't.

NM> 2) great improvements of a serial code (different from Q-E) when using
NM> the automatic parallelism of MKL in quad-cores.

nod. just yesterday i made some tests with different BLAS/LAPACK implementations, and it turns out that the 10.0 MKL is pretty efficient in parallelizing tasks like DGEMM. through the use of SSE2/3/4 and multi-threading you can easily get a factor of 6 improvement on a 4-core node.

NM> 3) btw, MPICH has always been for us the slower protocol, compared with
NM> LAMMPI or OpenMPI
NM>
NM> I actually wonder if the best solution on a quad-core would be, say,
NM> to use two cores for MPI, and the other two for the openmp threads.

this is a _very_ tricky issue. usually for plane-wave pseudopotential codes the distributed-data parallelization is pretty efficient, except for the 3d Fourier transforms across the whole data set, which are very sensitive to network latencies. for jobs using k-points, you also have the option to parallelize over k-points, which is very efficient even on not-so-fast networks. with the CVS versions, you have another level of parallelism added (parallelization over function instead of data = task groups). thus, given an ideal network, you first want to exploit MPI parallelism maximally, and then what is left is rather small and - sadly - OpenMP doesn't work very efficiently on that: the overhead of spawning, synchronizing and joining threads is too high compared to the gain through parallelism.

but we live in a real world and there are unexpected side effects and non-ideal machines and networks. e.g. when using nodes with many cores, e.g. two-socket quad-core, you have to "squeeze" a lot of communication through just one network card (be it infiniband, myrinet or ethernet), which will serialize communication and add unwanted conflicts and latencies. i've seen this happen particularly when using a very large number of nodes, where you can run out of (physical) memory simply because of the way the low-level communication was programmed. in that case you may indeed be better off using only half or a quarter of the cores with MPI and then setting OMP_NUM_THREADS to 2, or even keeping it at 1 (because that will, provided you have an MPI with processor affinity and optimal job placement, double the cpu cache per task).

it is particularly interesting to discuss this from the perspective of having multi-core nodes connected by a high-latency TCP/IP network (e.g. gigabit ethernet). here, with one MPI task per node you reach the limit of scaling pretty fast, and using multiple MPI tasks per node mostly multiplies the latencies, which is not helping. under those circumstances the data set is still rather large, and then OpenMP parallelism can help to get the most out of a given machine. as noted before, it would be _even_ better if OpenMP directives were added to time-critical and multi-threadable parts of QE. i have experienced this in CPMD, where i managed to get about 80% of the MPI performance with the latest (extensively threaded) development sources and a fully multi-threaded toolchain on a single node. however, running across multiple nodes quickly reduces the effectiveness of the OpenMP support; just with two nodes you are at 60% only.
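For the single quad-core case Nicola asked about, the "half the cores for MPI, the rest for threads" variant amounts to something like the lines below (a sketch only; the input/output file names are placeholders, and processor-affinity options are omitted because they differ between MPI implementations and versions):

    # 2 MPI tasks x 2 MKL/OpenMP threads on one 4-core node (illustrative)
    export OMP_NUM_THREADS=2
    export MKL_NUM_THREADS=2
    mpiexec -n 2 pw.x < pw.in > pw.out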
now, deciding on what is the best combination of options is a very tricky multi-dimensional optimization problem. you have to consider the following:

- the size of the typical problem and job type
- whether you can benefit from k-point parallelism
- whether you prefer faster execution over cost efficiency and throughput
- the total amount of money you want to spend
- the skill set of the people that have to run the machine
- how many people have to share the machine
- how I/O bound the jobs are
- how much memory you need and how much money you are willing to invest in faster memory
- failure rates and the level of service (gigabit equipment is easily available)

also, some of those parameters are (non-linearly) coupled, which makes the decision-making process even nastier.

cheers,
   axel.

NM> I eagerly await Axel's opinion.
NM>
NM> nicola
[Pw_forum] openmp vs mpich performance with MKL 10.x
Dear Eduardo,

our own experiences are summarized here:

http://quasiamore.mit.edu/pmwiki/index.php?n=Main.CP90Timings

It would be great if you could contribute your own data, either for pw.x or cp.x, under the conditions you describe.

I noticed indeed, informally, a few of the things you mention:

1) no improvements with the Intel fftw2 wrapper, as opposed to the fftw2 Q-E sources, when using mpi. I also never managed to successfully run with the Intel fftw3 wrapper (or with fftw3 - that probably says something about me).

2) great improvements of a serial code (different from Q-E) when using the automatic parallelism of MKL in quad-cores.

3) btw, MPICH has always been for us the slower protocol, compared with LAMMPI or OpenMPI.

I actually wonder if the best solution on a quad-core would be, say, to use two cores for MPI, and the other two for the openmp threads.

I eagerly await Axel's opinion.

nicola

Eduardo Ariel Menendez Proupin wrote:
> Hi,
> I have noted recently that I am able to obtain faster binaries of pw.x
> using the OpenMP parallelism implemented in the Intel MKL libraries
> of version 10.xxx, than using MPICH, on Intel cpus. [...]

--
Prof Nicola Marzari
Department of Materials Science and Engineering
13-5066 MIT, 77 Massachusetts Avenue, Cambridge MA 02139-4307 USA
tel 617.4522758  fax 2586534
marzari at mit.edu  http://quasiamore.mit.edu
[Pw_forum] openmp vs mpich performance with MKL 10.x
Hi,

I have noted recently that I am able to obtain faster binaries of pw.x using the OpenMP parallelism implemented in the Intel MKL libraries of version 10.xxx than using MPICH, on Intel cpus. Previously I had always gotten better performance using MPI. I would like to know of other experiences on how to make the machines faster. Let me explain in more detail.

Compiling using MPI means using mpif90 as linker and compiler, linking against mkl_ia32 or mkl_em64t, and using the link flags -i-static -openmp. This is just what appears in the make.sys after running configure in version 4cvs.

At runtime, I set

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

and run using

mpiexec -n $NCPUs pw.x < input > output

where NCPUs is the number of cores available in the system.

The second choice is

./configure --disable-parallel

and at runtime

export OMP_NUM_THREADS=$NCPU
export MKL_NUM_THREADS=$NCPU

and run using

pw.x < input > output

I have tested it on quad-cores (NCPU=4) and on an old dual Xeon B.C. (before cores) (NCPU=2).

Before April 2007, the first choice had always worked faster. After that, when I came to use MKL 10.xxx, the second choice is working faster. I have found no significant difference between versions 3.2.3 and 4cvs.

A special comment is for the FFT library. The MKL has a wrapper to the FFTW that must be compiled after installation (it is very easy). This creates additional libraries named like libfftw3xf_intel.a and libfftw2xf_intel.a. This improves the performance in the second choice, especially with libfftw3xf_intel.a.

Using MPI, libfftw2xf_intel.a is as fast as using the FFTW source distributed with espresso, i.e., there is no gain in using libfftw2xf_intel.a. With libfftw3xf_intel.a and MPI, I have never been able to run pw.x successfully, it just aborts.

I would like to hear of your experiences.

Best regards
Eduardo Menendez
University of Chile
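The wrapper build mentioned above is done from the interfaces/ directory of the MKL installation. For an EM64T MKL 10.x it goes roughly like this; the install path and the make target name are from memory and may differ between MKL releases, so treat it as a sketch rather than a recipe:

    # build the FFTW2/FFTW3 wrapper libraries shipped with MKL (path and target are assumptions)
    cd /opt/intel/mkl/10.0/interfaces/fftw3xf && make libem64t   # -> libfftw3xf_intel.a
    cd /opt/intel/mkl/10.0/interfaces/fftw2xf && make libem64t   # -> libfftw2xf_intel.a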
[Pw_forum] openmp vs mpich performance with MKL 10.x
On Tue, 6 May 2008, Eduardo Ariel Menendez Proupin wrote:

EAMP> Hi,
EAMP> I have noted recently that I am able to obtain faster binaries of pw.x using
EAMP> the OpenMP parallelism implemented in the Intel MKL libraries of version
EAMP> 10.xxx, than using MPICH, on Intel cpus. Previously I had always gotten
EAMP> better performance using MPI. I would like to know of other experiences on
EAMP> how to make the machines faster. [...]
EAMP>
EAMP> I would like to hear of your experiences.

eduardo,

there are two issues that need to be considered.

1) how large are your test jobs? if they are not large enough, timings are pointless.

2) it is most likely that you are still tricked by the auto-parallelization of intel MKL. the export OMP_NUM_THREADS will usually only work for the _local_ copy, and for some MPI startup mechanisms not at all. thus your MPI jobs will be slowed down.

to make certain that you only link the serial version of MKL with your MPI executable, please replace -lmkl_em64t in your make.sys file with

-lmkl_intel_lp64 -lmkl_sequential -lmkl_core

you may have to add

-Wl,-rpath,/opt/intel/path/to/your/mkl

to make your executable find the libraries at runtime. with that executable you can try again, and i would be _very_ surprised if using MPI were slower than serial and multi-threading. i made tests with the intel FFT vs. FFTW in a number of plane-wave codes, and the intel FFT was always slower.

cheers,
   axel.

EAMP>
EAMP> Best regards
EAMP> Eduardo Menendez
EAMP> University of Chile

--
===
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S. 34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
===
If you make something idiot-proof, the universe creates a better idiot.
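Applied to a QE make.sys of that era, the sequential-MKL link line would end up looking roughly like the fragment below. The MKL install path and the split between the BLAS_LIBS and LAPACK_LIBS variables are assumptions; only the library names and the -Wl,-rpath idea come from the advice above:

    # hypothetical make.sys fragment: serial (non-threaded) MKL on em64t
    # (MKL already contains LAPACK, so LAPACK_LIBS can stay empty)
    MKLROOT     = /opt/intel/mkl/10.0
    BLAS_LIBS   = -L$(MKLROOT)/lib/em64t -Wl,-rpath,$(MKLROOT)/lib/em64t \
                  -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
    LAPACK_LIBS =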