[Pw_forum] QE on Xeon Phi : Execution Issue
Dear Fabio Affinito,

Thank you so much for the information.

"Apologizing does not mean that you are wrong and the other one is right... It simply means that you value the relationship much more than your ego."

On Wed, Jul 23, 2014 at 5:32 PM, Nisha Agrawal wrote:
> [...]
[Pw_forum] QE on Xeon Phi : Execution Issue
Hi,

I set up the Quantum ESPRESSO Intel Xeon Phi version using the instructions provided at the following link:

https://software.intel.com/en-us/articles/quantum-espresso-for-intel-xeon-phi-coprocessor

However, when I run it, the work is not being offloaded to the Intel Xeon Phi. Below is the script I am using to run the QE MIC version. Please let me know if I missed a required setting or am doing something wrong.

---
source /home/opt/ICS-2013.1.039-intel64/bin/compilervars.sh intel64
source /home/opt/ICS-2013.1.039-intel64/mkl/bin/mklvars.sh intel64
source /home/opt/ICS-2013.1.039-intel64/impi/4.1.2.040/bin64/mpivars.sh

export MKL_MIC_ENABLE=1
export MKL_DYNAMIC=false
export MKL_MIC_DISABLE_HOST_FALLBACK=1
export MIC_LD_LIBRARY_PATH=$MKLROOT/lib/mic:$MIC_LD_LIBRARY_PATH

export OFFLOAD_DEVICES=0

export I_MPI_FALLBACK_DEVICE=disable
export I_MPI_PIN=disable
export I_MPI_DEBUG=5

export MKL_MIC_ZGEMM_AA_M_MIN=500
export MKL_MIC_ZGEMM_AA_N_MIN=500
export MKL_MIC_ZGEMM_AA_K_MIN=500
export MKL_MIC_THRESHOLDS_ZGEMM=500,500,500

export OFFLOAD_REPORT=2
mpirun -np 8 -perhost 4 ./espresso-5.0.2/bin/pw.x -in ./BN.in 2>&1 | tee test.log
---

"Apologizing does not mean that you are wrong and the other one is right... It simply means that you value the relationship much more than your ego."

On Mon, Jul 14, 2014 at 8:16 PM, Axel Kohlmeyer wrote:
> [...]
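A quick way to narrow this down (a minimal sketch, not part of the original post): first confirm that the offload variables really are set in the shell that launches mpirun, since a lost export is a common culprit. The variable names below are taken from the script above; the values mirror it.

```shell
#!/bin/bash
# Minimal sanity check (sketch): re-export the offload settings from the
# job script and print what a child process such as mpirun would actually see.
export MKL_MIC_ENABLE=1
export MKL_MIC_DISABLE_HOST_FALLBACK=1
export OFFLOAD_DEVICES=0
export OFFLOAD_REPORT=2

for v in MKL_MIC_ENABLE MKL_MIC_DISABLE_HOST_FALLBACK OFFLOAD_DEVICES OFFLOAD_REPORT; do
    # ${!v} is bash indirect expansion: the value of the variable named by v
    printf '%s=%s\n' "$v" "${!v:-<unset>}"
done
```

If these are all set but test.log contains no offload report lines (OFFLOAD_REPORT=2 should make MKL print a report for each offloaded call, as far as I understand), one plausible explanation is that the ZGEMM calls in this run never reach the 500x500x500 thresholds configured above, in which case MKL keeps the work on the host.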
[Pw_forum] QE on Xeon Phi : Execution Issue
Dear Nisha,

The page that you mentioned refers to the use of QE with the Automatic Offload mode of Intel MKL. In my experience, this mode is not efficient, except in a few cases where the matrices involved in QE are huge. At present, there is an effort aimed at releasing a version of QE that can take better advantage of offloading to the MIC cards. I hope that this version will be released soon after the summer.

An alternative to offload is native mode: you can compile QE natively for the MIC architecture simply by adding the -mmic flag to the Intel compiler (and linking the MKL appropriately).

Best,
Fabio

----- Original message -----
From: "Nisha Agrawal"
To: "PWSCF Forum"
Sent: Wednesday, 23 July 2014 14:02:28
Subject: [Pw_forum] QE on Xeon Phi : Execution Issue

> [...]
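To sketch what Fabio's native route might involve, here is a hypothetical make.sys fragment for a QE 5.x build. The variable names follow QE's make.sys conventions, but the exact flags and MKL link line are illustrative assumptions and depend on the compiler and MKL versions installed:

```makefile
# Hypothetical make.sys fragment for a native MIC (-mmic) build.
# Illustrative only; not a tested recipe.
CC        = icc
F77       = ifort
F90       = ifort
CFLAGS    = -O3 -mmic
FFLAGS    = -O2 -mmic
# Link MKL's MIC-native libraries instead of the host ones:
BLAS_LIBS = -L$(MKLROOT)/lib/mic -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
```

Note that a pw.x built this way runs on the card itself (launched on the coprocessor, e.g. over ssh to mic0), not on the host.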
[Pw_forum] QE on Xeon Phi
On Mon, Jul 14, 2014 at 9:34 AM, Eduardo Menendez wrote:
> Thank you Axel. Your advice raises another doubt. Can we get the maximum
> performance from a highly clocked CPU? I used to consider that the fastest
> CPUs were too fast for the memory access, resulting in bottlenecks. Of
> course it depends on cache size.

your concern is justified, but the situation is more complex these days. highly clocked CPUs have fewer cores and thus each core receives a larger share of the available memory bandwidth, and the highest clocked inter-CPU and memory bus is only available for a subset of the CPUs. now you have an optimization problem that has to consider the strong scaling (or lack thereof) of the code in question as an additional input parameter.

to give an example: we purchased at the same time dual-socket nodes that had the same mainboard, but either 2x 3.5GHz quad-core or 2x 2.8GHz hex-core. the 3.5GHz was the fastest clock available at the time. for classical MD, i get better performance out of the 12-core nodes; for plane-wave DFT i get about the same performance out of both; for CP2k i get better performance with the 8-core (in fact, CP2k runs fastest on the 12-core nodes when using only 8 cores). now, the cost of the 2.8GHz CPUs is significantly lower, so that is why we procured the majority of the cluster with those. but we do have applications that scale less well than CP2k or are serial, yet require high per-core memory bandwidth, so we got a few of the 3.5GHz ones, too (and since they are already expensive we filled them with as much RAM as does not result in underclocking of the memory bus; in turn we put "only" 1GB/core into the 12-core nodes).

so it all boils down to finding the right balance and adjusting it to the application mix that you are running. last time i checked the intel spec sheets, it looked as if the best deal was to be had for CPUs with the second largest number of cores and as high a clock as required to keep the full memory bus speed. that will also keep the heat in check, as the highest clocked CPUs usually have a much higher TDP (>50% more), which places a much larger demand on cooling and power and incurs additional indirect costs as well.

HTH,
axel.

> >> Stick with the cpu. For QE you should be best off with intel. Also you are
> >> likely to get the best price/performance ratio with CPUs that have less
> >> than the maximum number of cpu cores and a higher clock instead.
>
> Eduardo Menendez Proupin
> Departamento de Fisica, Facultad de Ciencias, Universidad de Chile
> URL: http://www.gnm.cl/emenendez
>
> "Science may be described as the art of systematic oversimplification." Karl Popper
>
> ___
> Pw_forum mailing list
> Pw_forum at pwscf.org
> http://pwscf.org/mailman/listinfo/pw_forum

--
Dr. Axel Kohlmeyer  akohlmey at gmail.com  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste. Italy.
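The per-core bandwidth-share argument above can be made concrete with a back-of-envelope calculation. The aggregate figure below is purely hypothetical (real numbers depend on the CPUs and DIMM configuration); the point is only that the same node bandwidth split over fewer cores leaves each core a larger share:

```shell
#!/bin/bash
# Back-of-envelope: per-core share of a node's memory bandwidth.
# 102 GB/s is a made-up dual-socket aggregate, not a measured value.
BW_NODE=102
for CORES in 8 12; do
    awk -v bw="$BW_NODE" -v c="$CORES" \
        'BEGIN { printf "%2d cores: %.2f GB/s per core\n", c, bw/c }'
done
```

With these made-up numbers, the 8-core node gets 12.75 GB/s per core versus 8.50 GB/s per core for the 12-core node, which is why a bandwidth-bound code can favor the node with fewer, faster cores.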
[Pw_forum] QE on Xeon Phi
Thank you Axel. Your advice raises another doubt. Can we get the maximum performance from a highly clocked CPU? I used to consider that the fastest CPUs were too fast for the memory access, resulting in bottlenecks. Of course it depends on cache size.

>> Stick with the cpu. For QE you should be best off with intel. Also you are
>> likely to get the best price/performance ratio with CPUs that have less
>> than the maximum number of cpu cores and a higher clock instead.

Eduardo Menendez Proupin
Departamento de Fisica, Facultad de Ciencias, Universidad de Chile
URL: http://www.gnm.cl/emenendez

"Science may be described as the art of systematic oversimplification." Karl Popper
[Pw_forum] QE on Xeon Phi
Stick with the CPU. For QE you should be best off with Intel. Also, you are likely to get the best price/performance ratio with CPUs that have fewer than the maximum number of CPU cores and a higher clock instead.

Axel.

On Jul 11, 2014 7:45 PM, "Eduardo Menendez" wrote:
> [...]
[Pw_forum] QE on Xeon Phi
Dear fellows,

I need to know whether it is worth buying Xeon Phi coprocessors for use with Quantum ESPRESSO. I need to choose between more CPUs or fewer CPUs plus a coprocessor, and even between a few Intel cores and not-so-few AMD Opterons. Searching the web, I have found these implementations of QE on Xeon Phi:

(1) a Quantum ESPRESSO modified for Xeon Phi, by Fabio Affinito, and
(2) instructions for installing standard Quantum ESPRESSO with some automagic use of the Xeon Phi by the MKL. Here is the site:
https://software.intel.com/en-us/articles/quantum-espresso-for-intel-xeon-phi-coprocessor

Is choice (1) available and mature enough for general use, or at least for using PWscf for calculations of defects in supercells containing a multiple of 64 atoms?

Is choice (2) efficient? I see a benchmark on that site that, if I interpret it correctly, indicates only a 12% improvement. Hence I think this automagic choice is not worthwhile. Am I wrong?

Choice (2) requires installing MPSS (Manycore Platform Software Stack) and Intel MPI. Does choice (1) also require these components?

MPSS is supported for Red Hat and SUSE. Is there any good experience with Debian or Ubuntu?

At this point I feel rather conservative :-( . If not warmed by enthusiastic praise of coprocessors or GPUs, I will keep looking for as many CPU cores as possible.

Cheers,

Eduardo Menendez Proupin
Departamento de Fisica, Facultad de Ciencias, Universidad de Chile
URL: http://www.gnm.cl/emenendez

"Science may be described as the art of systematic oversimplification." Karl Popper