Peter, thanks for the response. I'm getting a small speedup from multithreading in libgoto. Here are timings from the wien2k serial benchmark:

OMP_NUM_THREADS=1: 195.463u 0.307s 3:15.80 99.9% 0+0k 0+33264io 0pf+0w
OMP_NUM_THREADS=2: 199.565u 0.569s 2:57.40 112.8% 0+0k 0+33264io 0pf+0w
OMP_NUM_THREADS=3: 204.145u 0.635s 2:51.02 119.7% 0+0k 0+33264io 0pf+0w
OMP_NUM_THREADS=4: 211.666u 0.736s 2:49.02 125.6% 0+0k 0+33264io 0pf+0w
OMP_NUM_THREADS=5: 222.604u 1.032s 2:48.41 132.7% 0+0k 0+33264io 0pf+0w
OMP_NUM_THREADS=6: 231.258u 0.927s 2:47.54 138.5% 0+0k 0+33264io 0pf+0w
OMP_NUM_THREADS=7: 243.170u 0.996s 2:46.55 146.5% 0+0k 0+33264io 0pf+0w
OMP_NUM_THREADS=8: 252.584u 0.916s 2:46.57 152.1% 0+0k 0+33264io 0pf+0w

I would like to explore the k-point parallelization. But when I run
'x lapw1 -p' it aborts with an error message about being unable to run
lapw1c_mpi. This appears to me like it's trying to run the fine-grained
MPI parallel version. I'm not building wien2k with MPI, so I don't have a
lapw1c_mpi. I must be misunderstanding something. What am I doing wrong
that's causing it to try to run this lapw1c_mpi executable?

Which of these are appropriate .machines files to do k-point
parallelization across N cpu cores on a single machine?

This?

1:localhost:N

Or this?

N:localhost

And do I need any of these lines?

extrafine
granularity:1
residue:localhost

Or do I need something else, either in .machines, in some other file, or
on the command line?

--
Todd Pfaff <pfaff at mcmaster.ca>
Research & High-Performance Computing Support
McMaster University, Hamilton, Ontario, Canada
http://www.rhpcs.mcmaster.ca/~pfaff

On Mon, 11 Aug 2008, Peter Blaha wrote:

> The program lapw1 spends a large fraction in BLAS-routines, thus it can
> benefit from multithreading of GOTOLIBS (or MKL).
> Setting the variables you mentioned to 2 (or 4) you should see a
> speedup. The improvement may depend on many factors but it will be at
> most about 50%.
>
> Another possibility to utilize the multiple cores is to do k-point
> parallelism.
> Generate a .machines file with 2,4 or 8 times your machine name
> and test the performance with x lapw1 -p.
> On some architectures (with slow memory bus) it can be that only 4
> parallel jobs give best performance (because the slow memory bus cannot
> feed all 8 cpus properly), on others you can use 8 parallel jobs.
> Sometimes a mixture (4 k-point parallel + OMP_NUM_THREADS=2) is best.
>
> Todd Pfaff wrote:
>> We're using:
>>
>> wien2k-08.2-20080407
>>
>> built with:
>>
>> GNU Fortran (GCC) 4.2.3 (4.2.3-6mnb1)
>> GotoBLAS-1.26
>>
>> and running on an 8 core (2 x quad core) Xeon machine.
>>
>> Can wien2k take advantage of multithreading inherent to GotoBLAS
>> when either GOTO_NUM_THREADS or OMP_NUM_THREADS is set?
>>
>> If so, can someone provide, or direct me to a document about details of
>> the best way to build and run wien2k for such an environment?
>>
>> Thank you,
>> --
>> Todd Pfaff <pfaff at mcmaster.ca>
>> Research & High-Performance Computing Support
>> McMaster University, Hamilton, Ontario, Canada
>> http://www.rhpcs.mcmaster.ca/~pfaff
>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
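
P.S. For reference, the wall-clock speedup implied by the timings at the top of this message can be worked out with a short script. This is only a sketch: it assumes the third field of each csh `time` line is the elapsed time in m:ss.ss form (no hours field), and the names TIMINGS, elapsed_seconds and speedups are made up here for illustration.

```python
# Timing lines copied from the message above (csh `time` output, first
# four fields only). Only the 1, 2, 4 and 8 thread runs are listed to
# keep the example short.
TIMINGS = {
    1: "195.463u 0.307s 3:15.80 99.9%",
    2: "199.565u 0.569s 2:57.40 112.8%",
    4: "211.666u 0.736s 2:49.02 125.6%",
    8: "252.584u 0.916s 2:46.57 152.1%",
}


def elapsed_seconds(line: str) -> float:
    """Return wall-clock seconds from the third field.

    Assumption: the field has the form m:ss.ss (minutes and seconds only).
    """
    field = line.split()[2]
    minutes, seconds = field.split(":")
    return int(minutes) * 60 + float(seconds)


def speedups(timings: dict) -> dict:
    """Map thread count to wall-clock speedup relative to the 1-thread run."""
    base = elapsed_seconds(timings[1])
    return {n: base / elapsed_seconds(t) for n, t in sorted(timings.items())}


if __name__ == "__main__":
    for n, s in speedups(TIMINGS).items():
        wall = elapsed_seconds(TIMINGS[n])
        print(f"OMP_NUM_THREADS={n}: {wall:7.2f} s elapsed, speedup {s:.2f}x")
```

With these numbers the 8-thread run comes out at roughly 1.18 times the speed of the single-thread run, while the user CPU time grows from about 195 s to about 253 s.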