[Pw_forum] QE on Xeon Phi : Execution Issue

2014-07-24 Thread Nisha Agrawal
Dear Fabio Affinito,

Thank you so much for the information.

"Apologizing does not mean that you are wrong and the other one is right...
It simply means that you value the relationship much more than your ego.."



[Pw_forum] QE on Xeon Phi : Execution Issue

2014-07-23 Thread Nisha Agrawal
Hi,

I set up the Intel Xeon Phi version of Quantum ESPRESSO following the
instructions at this link:

https://software.intel.com/en-us/articles/quantum-espresso-for-intel-xeon-phi-coprocessor


However, when I run it, the work is not being offloaded to the Intel Xeon
Phi. Below is the script I am using to run the QE MIC version. Please let
me know if I missed a required setting or am doing something wrong.

---
source /home/opt/ICS-2013.1.039-intel64/bin/compilervars.sh intel64
source /home/opt/ICS-2013.1.039-intel64/mkl/bin/mklvars.sh intel64
source /home/opt/ICS-2013.1.039-intel64/impi/4.1.2.040/bin64/mpivars.sh

export MKL_MIC_ENABLE=1
export MKL_DYNAMIC=false
export MKL_MIC_DISABLE_HOST_FALLBACK=1
export MIC_LD_LIBRARY_PATH=$MKLROOT/lib/mic:$MIC_LD_LIBRARY_PATH

export OFFLOAD_DEVICES=0

export I_MPI_FALLBACK_DEVICE=disable
export I_MPI_PIN=disable
export I_MPI_DEBUG=5


export MKL_MIC_ZGEMM_AA_M_MIN=500
export MKL_MIC_ZGEMM_AA_N_MIN=500
export MKL_MIC_ZGEMM_AA_K_MIN=500
export MKL_MIC_THRESHOLDS_ZGEMM=500,500,500


export OFFLOAD_REPORT=2
mpirun -np 8 -perhost 4 ./espresso-5.0.2/bin/pw.x -in ./BN.in 2>&1 | tee test.log

---
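
In case it helps with debugging, a quick sanity check of this kind could
rule out problems with the card itself (a sketch only, assuming the MPSS
defaults such as the mic0 hostname):

---
# confirm the coprocessor is up and reachable before debugging offload
micinfo                  # MPSS tool; should list the card and driver state
ssh mic0 uname -a        # the card must be booted for offload to work

# with MKL_MIC_ENABLE=1 and OFFLOAD_REPORT=2 set, offloaded MKL calls print
# report lines to stdout; if test.log contains none, one likely cause is
# that the ZGEMM sizes in a small BN input never reach the 500x500x500
# thresholds set above, so MKL keeps all work on the host
grep -i mic test.log
---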


"Apologizing does not mean that you are wrong and the other one is right...
It simply means that you value the relationship much more than your ego.."



[Pw_forum] QE on Xeon Phi : Execution Issue

2014-07-23 Thread Fabio Affinito
Dear Nisha,

The page you mentioned refers to running QE with the Automatic Offload 
mode of Intel MKL. In my experience, this mode is not efficient, except in 
a few cases where the matrices involved in QE are huge.

At present, there is an effort aimed at releasing a version of QE that can 
take better advantage of offloading to the MIC cards. I hope that this 
version will be released soon after the summer.

An alternative to offload is native mode. You can compile QE natively for 
the MIC architecture simply by adding the -mmic flag to the Intel compiler 
(and linking MKL appropriately).
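
For illustration, a native build might look roughly like this (an untested 
sketch: the MPI wrapper name, paths, and MKL link line are assumptions to 
adapt to your installation; only the -mmic flag is essential):

---
# hypothetical native-mode build of QE for the coprocessor
# (compiler path taken from the script in the earlier message)
source /home/opt/ICS-2013.1.039-intel64/bin/compilervars.sh intel64
cd espresso-5.0.2
./configure MPIF90=mpiifort CC=mpiicc \
            FFLAGS="-O2 -mmic" CFLAGS="-O2 -mmic" \
            BLAS_LIBS="-L$MKLROOT/lib/mic -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm"
make pw
# the resulting pw.x is a MIC binary: it runs on the card itself (e.g.
# launched over ssh/mpirun onto mic0), not via offload from the host
---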

Best,

Fabio



[Pw_forum] QE on Xeon Phi

2014-07-14 Thread Axel Kohlmeyer
On Mon, Jul 14, 2014 at 9:34 AM, Eduardo Menendez  wrote:
> Thank you Axel. Your advice raises another doubt. Can we get the maximum
> performance from a highly clocked CPU?
> I used to consider that the fastest CPUs were too fast for the memory
> access, resulting in bottlenecks. Of course it depends on cache size.

your concern is justified, but the situation is more complex these
days. highly clocked CPUs have fewer cores, so each core receives a
larger share of the available memory bandwidth, and the highest clocked
inter-CPU and memory buses are only available on a subset of the CPUs.
now you have an optimization problem that has to consider the strong
scaling (or lack thereof) of the code in question as an additional
input parameter.
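
to make the bandwidth argument concrete, here is a back-of-the-envelope
sketch with assumed numbers (four DDR3-1600 channels, i.e. ~51.2 GB/s per
socket; the real figures depend on the platform):

---
# illustrative only: per-core share of a fixed per-socket memory bandwidth
total_bw=51.2   # GB/s per socket, assumed
for cores in 4 6; do
    awk -v bw="$total_bw" -v n="$cores" \
        'BEGIN { printf "%d cores/socket -> %4.1f GB/s per core\n", n, bw/n }'
done
# prints: 4 cores/socket -> 12.8 GB/s per core
#         6 cores/socket ->  8.5 GB/s per core
---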

to give an example: we purchased at the same time dual socket nodes
that had the same mainboard, but either 2x 3.5GHz quad-core or 2x
2.8GHz hex-core CPUs. the 3.5GHz was the fastest clock available at the
time. for classical MD, i get better performance out of the 12-core
nodes; for plane-wave DFT i get about the same performance out of
both; for CP2K i get better performance with the 8-core nodes (in fact,
CP2K runs fastest on the 12-core nodes using only 8 cores). now, the
cost of the 2.8GHz CPUs is significantly lower, so that is why we
procured the majority of the cluster with those. but we do have
applications that scale less well than CP2K or are serial, yet require
high per-core memory bandwidth, so we got a few of the 3.5GHz ones, too
(and since those are already expensive, we filled them with as much RAM
as possible without underclocking the memory bus; in turn we put "only"
1GB/core into the 12-core nodes).

so it all boils down to finding the right balance and adjusting it to
the application mix that you are running. last time i checked the
intel spec sheets, it looked as if the best deal was to be had for
CPUs with the second largest number of cores and as high a clock
as required to run the memory bus at full speed. that will also keep the
heat in check, as the highest clocked CPUs usually have a much higher
TDP (>50% more), which places a much larger demand on cooling and
power and incurs additional indirect costs as well.

HTH,
axel.





-- 
Dr. Axel Kohlmeyer  akohlmey at gmail.com  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste, Italy.



[Pw_forum] QE on Xeon Phi

2014-07-14 Thread Eduardo Menendez
Thank you Axel. Your advice raises another doubt. Can we get the maximum
performance from a highly clocked CPU?
I used to consider that the fastest CPUs were too fast for the memory
access, resulting in bottlenecks. Of course it depends on cache size.

>Stick with the CPU. For QE you should be best off with Intel. Also, you are
>likely to get the best price/performance ratio with CPUs that have fewer
>than the maximum number of CPU cores and a higher clock instead.

Eduardo Menendez Proupin
Departamento de Fisica, Facultad de Ciencias, Universidad de Chile
URL: http://www.gnm.cl/emenendez

  "Science may be described as the art of systematic oversimplification."
Karl Popper


[Pw_forum] QE on Xeon Phi

2014-07-12 Thread Axel Kohlmeyer
Stick with the CPU. For QE you should be best off with Intel. Also, you are
likely to get the best price/performance ratio with CPUs that have fewer
than the maximum number of CPU cores and a higher clock instead.

Axel.


[Pw_forum] QE on Xeon Phi

2014-07-11 Thread Eduardo Menendez
Dear fellows,

I need to know whether it is worth buying Xeon Phi coprocessors for use with
Quantum ESPRESSO. I need to make a choice between more CPUs or fewer CPUs
plus a coprocessor, and even to choose between a few Intel cores vs.
not-so-few AMD Opterons. Searching the Web, I have found these
implementations of QE on Xeon Phi:
(1) a version of Quantum ESPRESSO modified for Xeon Phi, by Fabio Affinito, and
(2) instructions for installing standard Quantum ESPRESSO with some
automagic use of the Xeon Phi by MKL. Here is the site:
https://software.intel.com/en-us/articles/quantum-espresso-for-intel-xeon-phi-coprocessor

Is choice (1) available and mature enough for general use, or at least for
using PWscf for calculations of defects in supercells containing a multiple
of 64 atoms?
Is choice (2) efficient? I see a benchmark on that site which, if I
interpret it correctly, indicates only a 12% improvement. Hence I think
this automagic choice is not worthwhile. Am I wrong?

Choice (2) needs installing MPSS (the Manycore Platform Software Stack) and
Intel MPI. Does choice (1) also require these components?

MPSS is supported on Red Hat and SUSE. Is there any good experience with
Debian or Ubuntu?

At this point I feel rather conservative :-( . Unless warmed by
enthusiastic praise of coprocessors or GPUs, I will keep looking for as
many CPU cores as possible.


Cheers,


Eduardo Menendez Proupin
Departamento de Fisica, Facultad de Ciencias, Universidad de Chile
URL: http://www.gnm.cl/emenendez

  "Science may be described as the art of systematic oversimplification."
Karl Popper