Re: [Wien] Intel(R) Xeon(R) CPU X5550 @ 2.67GHz vs Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

2013-10-18 Thread Yundi Quan
First, thank you, Peter. I should have described my problem more thoroughly.

:RKM  : MATRIX SIZE 9190LOs:1944  RKM= 4.88  WEIGHT= 2.00  PGR

The reduced RKM is 4.88. The reduced matrix size is 9190, which is about 2/5 of 
the full matrix. So that explains a lot. I'm using P1 symmetry. Therefore, the 
complex versions of lapw1 and lapw2 are used. Compared with LDA calculations, 
LSDA almost doubles the time spent in lapw1 and lapw2.
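
As a rough check of these numbers, the sketch below compares the reduced and full matrix sizes reported in the :RKM lines of this thread. It assumes that the cost of lapw1 is dominated by an n^3 step, as Peter's scaling remark suggests; the function name is made up for illustration and the result is only an estimate.

```python
def size_and_cost_ratio(n_reduced, n_full):
    """Return (n_reduced / n_full, (n_full / n_reduced) ** 3).

    The cubic factor assumes an O(n^3) diagonalization cost, which is an
    assumption about where the time goes, not a measured value.
    """
    size_ratio = n_reduced / n_full
    cost_ratio = (n_full / n_reduced) ** 3
    return size_ratio, cost_ratio


# Matrix sizes taken from the :RKM lines quoted in this thread.
size_ratio, cost_ratio = size_and_cost_ratio(9190, 23486)
print(f"size ratio: {size_ratio:.2f}")       # close to 2/5
print(f"n^3 cost ratio: {cost_ratio:.1f}")   # full basis vs reduced basis
```

With these inputs the full basis comes out more than an order of magnitude more expensive than the reduced one, so differences of several hours between the two runs are not surprising.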

Because of the P1 symmetry, symmetry cannot reduce the number of stars 
(i.e. plane waves) in the interstitial region or the number of spherical 
harmonics inside the muffin-tin spheres. I guess that's why my job takes so 
long. Moreover, I'm only using k-point parallelization, without mpi.

Oxygen is the smallest atom in the unit cell. Reducing RKMAX to 6.5 is what I'm 
going to do first.

One of the clusters to which I have access has 8 cores and 8GB of memory 
per node. Given the memory constraint, I wonder how to improve the core 
usage when calculating compounds with large unit cells. For the compound I'm 
currently working on, I request one core per node and 8 nodes (= nkp) per job. 
So 7*8=56 cores are wasted while my job is running. I'm in dire need of help.
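
The bookkeeping behind the 56 idle cores can be written down as a small sketch. The helper name and the dictionary keys are hypothetical; the inputs are the cluster A figures given above (8 nodes, 8 cores and 8 GB per node, one core used per node).

```python
def allocation_summary(nodes, cores_per_node, used_cores_per_node, mem_gb_per_node):
    """Summarize how many allocated cores sit idle and how much memory
    each process can use, for a job that reserves whole nodes."""
    total = nodes * cores_per_node
    used = nodes * used_cores_per_node
    return {
        "allocated_cores": total,
        "used_cores": used,
        "idle_cores": total - used,
        "mem_gb_per_used_core": mem_gb_per_node / used_cores_per_node,
        "mem_gb_per_core_if_all_used": mem_gb_per_node / cores_per_node,
    }


summary = allocation_summary(nodes=8, cores_per_node=8,
                             used_cores_per_node=1, mem_gb_per_node=8.0)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The last entry shows the trade-off: using every core with one process each would leave only about 1 GB per process on this cluster, which is why the memory per process and the number of busy cores cannot both be maximized without sharing memory (threads) or distributing the matrix (mpi).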


Yundi


On Oct 17, 2013, at 10:58 PM, Peter Blaha pbl...@theochem.tuwien.ac.at wrote:

 You still did not tell us the matrix size for the truncated RKmax, but yes,
 the scaling is probably ok (scaling goes with n^3; i.e. for matrix sizes
 12000 and 24000 we expect almost a factor of 8 !!! in cpu time).
 It also explains the memory 
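
Matrix size also drives memory use. The sketch below is a back-of-the-envelope estimate only: it assumes that the Hamiltonian and overlap matrices are both held as dense complex double-precision arrays (16 bytes per element). That storage layout is an assumption made for illustration; WIEN2k's actual layout and workspace needs may differ.

```python
def dense_matrix_gib(n, bytes_per_element=16, n_matrices=2):
    """Estimate memory in GiB for n_matrices dense n x n matrices.

    bytes_per_element=16 corresponds to complex double precision.
    n_matrices=2 stands for H and S; both values are assumptions.
    """
    return n_matrices * n * n * bytes_per_element / 2**30


# Matrix sizes from the :RKM lines in this thread.
for n in (9190, 23486):
    print(f"n = {n:5d}: about {dense_matrix_gib(n):.1f} GiB")
```

Under these assumptions the reduced basis needs a few GiB while the full basis needs well over 10 GiB per process, which would be consistent with the job fitting on the 8 GB nodes only after RKM was reduced and occasionally running out of memory on the 24 GB nodes.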
 
 You also did not tell us if you have inversion or not.
 
 One of my real cases with  NMAT= 21500   takes 400 sec on 64 cores (mpi), so
 one could estimate something like 2 sec on a single core, which comes to the
 right order of magnitude compared to your case.
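
Taking the 400 sec on 64 cores at face value, the single-core equivalent can be sketched as below. Ideal parallel scaling is assumed (real mpi efficiency is lower), and the function name and the efficiency parameter are hypothetical.

```python
def single_core_estimate(wall_seconds, n_cores, parallel_efficiency=1.0):
    """Estimate the time one core would need for the same work.

    parallel_efficiency=1.0 means perfect scaling, which is an
    idealization; smaller values reduce the estimate accordingly.
    """
    return wall_seconds * n_cores * parallel_efficiency


estimate = single_core_estimate(400, 64)
print(f"{estimate:.0f} s, i.e. about {estimate / 3600:.1f} h on one core")
```

The result is in the tens of thousands of seconds, the same order as the roughly 26000 s of user time per k-point that appears in the dayfile elsewhere in this thread.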
 
 And: you may have 72 inequivalent atoms, but you did not tell us how many
 atoms in total you have.
 The total number of atoms is the important info !!
 
 Probably you can reduce RKMAX (you did not tell us which atom has RMT=1.65;
 probably O ??), and most likely you should use mpi AND iterative
 diagonalization.
 
 As I said, a case with 72 atoms (or whatever you have) can run in minutes on
 a reasonable cluster with a properly optimized setup (not just the defaults).
 
 
 On 17.10.2013 18:05, Yundi Quan wrote:
 Thanks a lot.
 On cluster A, RKM was automatically reduced to 4.88, while on cluster B RKM 
 was kept at 7. I didn't expect this, though I was aware that WIEN2k would 
 automatically reduce RKM in some cases. But is it reasonable for an iteration 
 to run for eight hours with the following parameters?
 Minimum sphere size: 1.65000 Bohr.
 Total k-mesh : 8
 Gmax : 12
 
 :RKM  : MATRIX SIZE23486LOs:1944  RKM= 7.00  WEIGHT= 2.00  PGR:
 :RKM  : MATRIX SIZE23486LOs:1944  RKM= 7.00  WEIGHT= 2.00  PGR:
 
 
 On Thu, Oct 17, 2013 at 8:54 AM, Peter Blaha pbl...@theochem.tuwien.ac.at wrote:
 
The Xeon X5550 processor is a 4 core processor, and your cluster may have
combined a few of them on one node (2-4 ?). Anyway, 14 cores are not really
possible ??

Have you done more than just looking at the total time ?

Is the machines file the same on both clusters ? Such a machines file
does NOT use  mpi.

One guess, in case you really use mpi on cluster B (with a different
.machines file): in the sequential run (A) the basis set is limited by
NMATMAX; in the mpi-parallel run it is not (or it is scaled up by
sqrt(N-core)).
So it could be that run (A) has a MUCH smaller RKMAX than run (B).
Check   grep :RKM case.scf   of the two runs.
What are the real matrix sizes ?
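
For comparing the two runs, the :RKM lines can also be picked apart programmatically. The sketch below is a minimal parser written against the line format shown in this thread, where the matrix size may run directly into "LOs:"; the function name and dictionary keys are hypothetical, and other WIEN2k versions may format the line differently.

```python
import re

# Fields are fixed-width in the output, so there may be no space between
# the matrix size and "LOs:".
RKM_LINE = re.compile(
    r":RKM\s*:\s*MATRIX SIZE\s*(\d+)\s*LOs:\s*(\d+)\s+RKM=\s*([\d.]+)"
)


def parse_rkm_lines(text):
    """Return one dict per :RKM line found in the given scf text."""
    results = []
    for line in text.splitlines():
        match = RKM_LINE.search(line)
        if match:
            results.append({
                "matrix_size": int(match.group(1)),
                "local_orbitals": int(match.group(2)),
                "rkm": float(match.group(3)),
            })
    return results


sample = """\
:RKM  : MATRIX SIZE 9190LOs:1944  RKM= 4.88  WEIGHT= 2.00  PGR
:RKM  : MATRIX SIZE23486LOs:1944  RKM= 7.00  WEIGHT= 2.00  PGR:
"""
for entry in parse_rkm_lines(sample):
    print(entry)
```

In practice one would pass the contents of each run's case.scf to the function and compare the reported matrix sizes and RKM values side by side.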
 
Alternatively, NMATMAX could be chosen differently on the two machines,
since somebody else installed WIEN2k.

Please compare carefully the resulting case.output1_1 files and, if
necessary, send the RELEVANT PARTS OF THEM.


In any case, a 72 atom cell should NOT take 2 h / iteration (or even 8 ??).

What are your sphere sizes ??? What does :RKM in case.scf give ???
 
At least one can set   OMP_NUM_THREADS=2 or 4   and speed up the code by a
factor of almost 2. (You should see in the dayfile something close to 200 %
instead of ~100 %.)

c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w
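
The percentage in such a timing line is the ratio of CPU time (user plus system) to elapsed time, so a value near 100 % indicates that only one thread was busy. The sketch below recomputes it from the line above. It assumes the line follows the csh-style `time` layout (user, system, elapsed, percent); the function names are hypothetical.

```python
import re

# user seconds, system seconds, elapsed [h:]mm:ss, percent
TIME_LINE = re.compile(r"([\d.]+)u\s+([\d.]+)s\s+([\d:.]+)\s+([\d.]+)%")


def elapsed_to_seconds(elapsed):
    """Convert 'h:mm:ss.ss' or 'mm:ss.ss' to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def cpu_utilisation(line):
    """Return 100 * (user + system) / elapsed for one dayfile timing line."""
    match = TIME_LINE.search(line)
    if match is None:
        raise ValueError("no timing fields found")
    user, system = float(match.group(1)), float(match.group(2))
    wall = elapsed_to_seconds(match.group(3))
    return 100.0 * (user + system) / wall


line = "c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w"
print(f"{cpu_utilisation(line):.1f} %")
```

A run that really used two threads for most of the time would show a value approaching 200 % here.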
 
In essence: for a matrix size of 1 (real, with inversion), lapw1 should
take on the order of 10 min (no mpi, maybe with OMP_NUM_THREADS=2).
 
 
 

Re: [Wien] Intel(R) Xeon(R) CPU X5550 @ 2.67GHz vs Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

2013-10-18 Thread Peter Blaha
As was mentioned before, such a big case needs   mpi  in order to run 
efficiently.


As a quick small improvement, set the OMP_NUM_THREADS variable to 2 or 
4. This should give a speedup of about 2, and in the dayfile you should 
see that not 905% of the cpu was used, but 180% or so.




[Wien] Intel(R) Xeon(R) CPU X5550 @ 2.67GHz vs Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

2013-10-17 Thread Yundi Quan
Hi,
I have access to two clusters as a low-level user. One cluster (cluster A)
consists of nodes with 8 cores and 8 GB of memory per node. The other cluster
(cluster B) has 24 GB of memory per node, and each node has 14 cores or more.
The cores on cluster A are Xeon CPU E5620@2.40GHz, while the cores on cluster B
are Xeon CPU X5550@2.67GHz. From the specifications (2.40GHz+12288 KB cache
vs 2.67GHz+8192 KB cache), the two machines should be very close in
performance. But that does not seem to be the case.

I have a job with 72 atoms per unit cell. I initialized the job on cluster A
and ran it for a few iterations. Each iteration took 2 hours. Then I moved
the job to cluster B (14 cores per node at 2.67GHz). Now it takes more
than 8 hours to finish one iteration. On both clusters, I request one core
per node and 8 nodes per job (8 is the number of k points). I compiled
WIEN2k_13 on cluster A without mpi. On cluster B, WIEN2k_12 was compiled by
the administrator with mpi.

What could have caused the poor performance on cluster B? Is it because of MPI?

On an unrelated note: sometimes the job runs out of memory on cluster B, which
has 24 GB of memory per node. Nevertheless, the same job runs smoothly on
cluster A, which only has 8 GB per node.

Thanks.
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] Intel(R) Xeon(R) CPU X5550 @ 2.67GHz vs Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

2013-10-17 Thread Laurence Marks
There are so many possibilities, a few:

a) If you only request 1 core/node most queuing systems (qsub/msub
etc) will allocate the other cores to other jobs. You are then going
to be very dependent upon what those other jobs are doing. Normal is
to use all the cores on a given node.

b) When you run on cluster B, in addition to a) it is going to be
inefficient to run with mpi communications across nodes and it is much
better to run on a given node across cores. Are you using a machines
file with eight 1: nodeA lines (for instance) or one with a single 1:
nodeA nodeB? The first does not use mpi, the second does. To use
mpi within a node you would use lines such as 1:node:8. Knowledge of
your .machines file will help people assist you.

c) The memory on those clusters is very small, whoever bought them was
not thinking about large scale jobs. I look for at least 4G/core, and
2G/core is barely acceptable. You are going to have to use mpi.

d) All mpi is equal, but some mpi is more equal than others. Depending
upon whether you have infiniband, ethernet, openmpi, impi and how
everything was compiled you can see enormous differences. One thing to
look at is the difference between the cpu time and wall time (both in
case.dayfile and at the bottom of case.output1_*). With a good mpi
setup the wall time should be 5-10% more than the cpu time; with a bad
setup it can be several times it.
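
The check in point d) can be sketched as follows. The 10 % limit is taken from the rule of thumb above; the function names and the example timings are hypothetical and serve only to illustrate the comparison.

```python
def wall_overhead_percent(cpu_seconds, wall_seconds):
    """Percentage by which wall time exceeds cpu time."""
    return 100.0 * (wall_seconds - cpu_seconds) / cpu_seconds


def looks_healthy(cpu_seconds, wall_seconds, limit_percent=10.0):
    """True if wall time is at most limit_percent above cpu time."""
    return wall_overhead_percent(cpu_seconds, wall_seconds) <= limit_percent


# Hypothetical timings, not taken from any real run.
print(wall_overhead_percent(1000.0, 1070.0), looks_healthy(1000.0, 1070.0))
print(wall_overhead_percent(1000.0, 3000.0), looks_healthy(1000.0, 3000.0))
```

The cpu and wall times to feed in would come from case.dayfile or from the bottom of case.output1_*, as described in d).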




-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
Research is to see what everybody else has seen, and to think what
nobody else has thought
Albert Szent-Gyorgi


Re: [Wien] Intel(R) Xeon(R) CPU X5550 @ 2.67GHz vs Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

2013-10-17 Thread Yundi Quan
Thanks for your reply.
a). both machines are set up in a way that once a node is assigned to a
job, it cannot be assigned to another.
b). The .machines file looks like this
1:node1
1:node2
1:node3
1:node4
1:node5
1:node6
1:node7
1:node8
granularity:1
extrafine:1
lapw2_vector_split:1

I've been trying to avoid using mpi because sometimes mpi can slow down my
calculations due to poor communication between nodes.

c). the amount of memory available to a core does not seem to be the
problem in my case, because my job could run smoothly on cluster A (where
each node has 8G memory and 8 cores). But my job runs into memory problems
on cluster B, where each core has much more memory available. I wonder
whether there are parameters which I should change in WIEN2k to reduce the
memory usage.

d). My dayfile for a single iteration looks like this. The wallclocks are
around 500.


cycle 1 (Fri Oct 11 02:14:05 PDT 2013) (40/99 to go)

   lapw0 -p (02:14:05) starting parallel lapw0 at Fri Oct 11 02:14:06 PDT 2013
 .machine0 : processors
running lapw0 in single mode
1431.414u 22.267s 24:14.84 99.9% 0+0k 0+0io 0pf+0w
   lapw1  -up -p-c (02:38:20) starting parallel lapw1 at Fri Oct 11 02:38:20 PDT 2013
-  starting parallel LAPW1 jobs at Fri Oct 11 02:38:21 PDT 2013
running LAPW1 in parallel mode (using .machines)
8 number_of_parallel_jobs
 c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w
 c1201-ib(1) 26845.212u 15.496s 7:39:59.37 97.3% 0+0k 0+0io 0pf+0w
 c1180-ib(1) 25872.609u 18.143s 7:23:53.43 97.2% 0+0k 0+0io 0pf+0w
 c1179-ib(1) 26040.482u 17.868s 7:26:38.66 97.2% 0+0k 0+0io 0pf+0w
 c1178-ib(1) 26571.271u 17.946s 7:34:16.23 97.5% 0+0k 0+0io 0pf+0w
 c1177-ib(1) 27108.070u 34.294s 8:32:55.53 88.1% 0+0k 0+0io 0pf+0w
 c1171-ib(1) 26729.399u 14.175s 7:36:22.67 97.6% 0+0k 0+0io 0pf+0w
 c0844-ib(1) 25883.863u 47.148s 8:12:35.54 87.7% 0+0k 0+0io 0pf+0w
   Summary of lapw1para:
   c1208-ib k=1 user=26558.3 wallclock=454
   c1201-ib k=1 user=26845.2 wallclock=459
   c1180-ib k=1 user=25872.6 wallclock=443
   c1179-ib k=1 user=26040.5 wallclock=446
   c1178-ib k=1 user=26571.3 wallclock=454
   c1177-ib k=1 user=27108.1 wallclock=512
   c1171-ib k=1 user=26729.4 wallclock=456
   c0844-ib k=1 user=25883.9 wallclock=492
97.935u 34.265s 8:32:58.38 0.4% 0+0k 0+0io 0pf+0w
   lapw1  -dn -p-c (11:11:19) starting parallel lapw1 at Fri Oct 11 11:11:19 PDT 2013
-  starting parallel LAPW1 jobs at Fri Oct 11 11:11:19 PDT 2013
running LAPW1 in parallel mode (using .machines.help)
8 number_of_parallel_jobs
 c1208-ib(1) 26474.686u 16.142s 7:33:36.01 97.3% 0+0k 0+0io 0pf+0w
 c1201-ib(1) 26099.149u 40.330s 8:04:42.58 89.8% 0+0k 0+0io 0pf+0w
 c1180-ib(1) 26809.287u 14.724s 7:38:56.52 97.4% 0+0k 0+0io 0pf+0w
 c1179-ib(1) 26007.527u 17.959s 7:26:10.62 97.2% 0+0k 0+0io 0pf+0w
 c1178-ib(1) 26565.723u 17.576s 7:35:20.11 97.3% 0+0k 0+0io 0pf+0w
 c1177-ib(1) 27114.619u 31.180s 8:21:28.34 90.2% 0+0k 0+0io 0pf+0w
 c1171-ib(1) 26474.665u 15.309s 7:33:38.15 97.3% 0+0k 0+0io 0pf+0w
 c0844-ib(1) 26586.569u 15.010s 7:35:22.88 97.3% 0+0k 0+0io 0pf+0w
   Summary of lapw1para:
   c1208-ib k=1 user=26474.7 wallclock=453
   c1201-ib k=1 user=26099.1 wallclock=484
   c1180-ib k=1 user=26809.3 wallclock=458
   c1179-ib k=1 user=26007.5 wallclock=446
   c1178-ib k=1 user=26565.7 wallclock=455
   c1177-ib k=1 user=27114.6 wallclock=501
   c1171-ib k=1 user=26474.7 wallclock=453
   c0844-ib k=1 user=26586.6 wallclock=455
104.607u 18.798s 8:21:30.92 0.4% 0+0k 0+0io 0pf+0w
   lapw2 -up -p   -c (19:32:50) running LAPW2 in parallel mode
  c1208-ib 1016.517u 13.674s 17:11.10 99.9% 0+0k 0+0io 0pf+0w
  c1201-ib 1017.359u 13.669s 17:11.82 99.9% 0+0k 0+0io 0pf+0w
  c1180-ib 1033.056u 13.283s 17:27.07 99.9% 0+0k 0+0io 0pf+0w
  c1179-ib 1037.551u 13.447s 17:31.50 99.9% 0+0k 0+0io 0pf+0w
  c1178-ib 1019.156u 13.729s 17:13.49 99.9% 0+0k 0+0io 0pf+0w
  c1177-ib 1021.878u 13.731s 17:16.07 99.9% 0+0k 0+0io 0pf+0w
  c1171-ib 1032.417u 13.681s 17:26.70 99.9% 0+0k 0+0io 0pf+0w
  c0844-ib 1022.315u 13.870s 17:16.81 99.9% 0+0k 0+0io 0pf+0w
   Summary of lapw2para:
   c1208-ib user=1016.52 wallclock=1031.1
   c1201-ib user=1017.36 wallclock=1031.82
   c1180-ib user=1033.06 wallclock=1047.07
   c1179-ib user=1037.55 wallclock=1051.5
   c1178-ib user=1019.16 wallclock=1033.49
   c1177-ib user=1021.88 wallclock=1036.07
   c1171-ib user=1032.42 wallclock=1046.7
   c0844-ib user=1022.32 wallclock=1036.81
31.923u 13.526s 18:20.12 4.1% 0+0k 0+0io 0pf+0w
   lapw2 -dn -p   -c (19:51:10) running LAPW2 in parallel mode
  c1208-ib 947.942u 13.364s 16:01.75 99.9% 0+0k 0+0io 0pf+0w
  c1201-ib 932.766u 13.640s 15:49.22 99.7% 0+0k 0+0io 0pf+0w
  c1180-ib 932.474u 13.609s 15:47.76 99.8% 0+0k 0+0io 0pf+0w
  c1179-ib 936.171u 13.691s 15:50.33 99.9% 0+0k 0+0io 0pf+0w
  c1178-ib 947.798u 13.493s 16:04.99 99.6% 0+0k 0+0io 0pf+0w
  c1177-ib 947.786u 13.350s 
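
A note on units in the lapw1para summary above: comparing the wallclock values with the elapsed column suggests that, for these multi-hour runs, they correspond to hours*60 + minutes, i.e. minutes rather than seconds. The sketch below checks this against the numbers in this dayfile. It is an observation from the data only, not a statement about how the summary is computed; the function name is hypothetical.

```python
def elapsed_minutes(elapsed):
    """Convert 'h:mm:ss.ss' or 'mm:ss.ss' to whole minutes, dropping seconds."""
    parts = [float(p) for p in elapsed.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0.0)  # pad missing hours
    hours, minutes, _seconds = parts
    return int(hours * 60 + minutes)


# (elapsed time, wallclock from "Summary of lapw1para") for the -up run.
pairs = [
    ("7:34:14.39", 454),
    ("7:39:59.37", 459),
    ("7:23:53.43", 443),
    ("8:32:55.53", 512),
    ("8:12:35.54", 492),
]
for elapsed, wallclock in pairs:
    print(elapsed, elapsed_minutes(elapsed), wallclock)
```

For lapw2, where each job takes less than an hour, the summary values do match seconds (17:11.10 is listed as wallclock=1031.1), so the two summaries should not be compared directly.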

Re: [Wien] Intel(R) Xeon(R) CPU X5550 @ 2.67GHz vs Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

2013-10-17 Thread Laurence Marks
I assume the dayfile was for cluster A, as wall is about 8x cpu which
is about right for mkl multithreading which you are presumably using.
You are not using mpi. You may want to compare the wall time to using
on cluster A

1:node1:8

depending upon many factors it may be faster, or slower. This is only
doing mpi using the bus not between nodes.
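
The two .machines layouts discussed in this thread can be generated with a small sketch like the one below. The job-line forms (`1:node` for k-point parallel, `1:node:8` for mpi within a node) and the three option lines are copied from the messages in this thread; the function name is hypothetical, and whether those option values suit an mpi run is not addressed here.

```python
# Option lines as they appear in the .machines file quoted in this thread.
OPTION_LINES = ["granularity:1", "extrafine:1", "lapw2_vector_split:1"]


def machines_text(job_lines):
    """Join job lines and the option lines into .machines file content."""
    return "\n".join(list(job_lines) + OPTION_LINES) + "\n"


nodes = [f"node{i}" for i in range(1, 9)]

# k-point parallel: one line per node, one process each, no mpi.
kpoint_parallel = machines_text(f"1:{node}" for node in nodes)

# mpi within a single node: one line, 8 processes on node1.
mpi_single_node = machines_text(["1:node1:8"])

print(kpoint_parallel)
print(mpi_single_node)
```

Timing one lapw1 run with each layout on the same node type is the comparison suggested above.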

Is it 72 unique atoms, or 72 total?

My guess is that cluster A is about right. You can make it faster by
using iterative diagonalization (-it or -it -noHinv) and perhaps
reducing RKMAX -- you don't say what your RMTs are.

For cluster B what blas/lapack are you using? Does it really have that
many cores/node or is it using hyperthreading (which does not really
give you much)? How is your NFS structured -- good communications or
just slow ethernet?



Re: [Wien] Intel(R) Xeon(R) CPU X5550 @ 2.67GHz vs Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

2013-10-17 Thread Yundi Quan
Sorry that I didn't make it clear. The dayfile was for cluster B. As I said
before, I always request one core per node and 8 nodes per job (the number of
k points). I have 72 crystallographically non-equivalent atoms.

On cluster B, I used the following R_LIB (LAPACK+BLAS) options to compile
WIEN2k: -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -iomp5


Yundi



Re: [Wien] Intel(R) Xeon(R) CPU X5550 @ 2.67GHz vs Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

2013-10-17 Thread Laurence Marks
Something is not right. I think I misread your dayfile, and in fact mkl
threading is not active. Try something like  env | grep -e MKL . I
suspect that your job is just running on a single core.
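
The same check can be done from Python, for example inside a job script. The sketch below filters the environment for variable names containing MKL; including OMP as well is an addition made here because OMP_NUM_THREADS is discussed elsewhere in this thread. The function name is hypothetical.

```python
import os


def threading_env(environ=None, keywords=("MKL", "OMP")):
    """Return environment variables whose names contain any keyword."""
    env = os.environ if environ is None else environ
    return {
        name: value
        for name, value in sorted(env.items())
        if any(word in name for word in keywords)
    }


# Example with a made-up environment; call threading_env() with no
# arguments to inspect the real one.
fake_env = {"PATH": "/usr/bin", "MKL_NUM_THREADS": "1", "OMP_NUM_THREADS": "4"}
for name, value in threading_env(fake_env).items():
    print(f"{name}={value}")
```

An empty result, or a thread count of 1, would be consistent with the job running on a single core.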

On Thu, Oct 17, 2013 at 10:13 AM, Yundi Quan quanyu...@gmail.com wrote:
 Sorry that I didn't make it clear. The dayfile was for cluster B. As I said
 before, I always request one core per node and 8 nodes per job (number of k
 points).  I have 72 crystallographically non-equivalent atoms.

 On cluster B, I used the following R_LIB (LAPACK+BLAS) option to compile
 WIEN2k. -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -iomp5


 Yundi


 On Thu, Oct 17, 2013 at 7:50 AM, Laurence Marks l-ma...@northwestern.edu
 wrote:

 I assume the dayfile was for cluster A, as wall is about 8x cpu which
 is about right for mkl multithreading which you are presumably using.
 You are not using mpi. You may want to compare the wall time to using
 on cluster A

 1:node1:8

 depending upon many factors it may be faster, or slower. This is only
 doing mpi using the bus not between nodes.

 Is it 72 unique atoms, or 72 total?

 My guess is that cluster A is about right. You can make it faster by
 using iterative diagonalization (-it or -it -noHinv) and perhaps
 reducing RKMAX -- you don't say what your RMTs are.

 For cluster B what blas/lapack are you using? Does it really have that
 many cores/node or is it using hyperthreading (which does not really
 give you much)? How is your NFS structured -- good communications or
 just slow ethernet?


 On Thu, Oct 17, 2013 at 9:33 AM, Yundi Quan q...@ms.physics.ucdavis.edu
 wrote:
  Thanks for your reply.
  a). both machines are set up in a way that once a node is assigned to a
  job,
  it cannot be assigned to another.
  b). The .machines file looks like this
  1:node1
  1:node2
  1:node3
  1:node4
  1:node5
  1:node6
  1:node7
  1:node8
  granularity:1
  extrafine:1
  lapw2_vector_split:1
 
  I've been trying to avoid using mpi because sometime mpi can slow down
  my
  calculations because of poor communications between nodes.
 
  c). The amount of memory available to a core does not seem to be the problem
  in my case, because my job runs smoothly on cluster A, where each node has
  8 GB of memory and 8 cores. But my job runs into memory problems on cluster B,
  where each core has much more memory available. I wonder whether there are
  parameters that I should change in WIEN2k to reduce the memory usage.
 
  d). My dayfile for a single iteration looks like this. The wallclocks are
  around 500.
 
 
  cycle 1 (Fri Oct 11 02:14:05 PDT 2013) (40/99 to go)
 
lapw0 -p (02:14:05) starting parallel lapw0 at Fri Oct 11 02:14:06 PDT 2013
   .machine0 : processors
  running lapw0 in single mode
  1431.414u 22.267s 24:14.84 99.9% 0+0k 0+0io 0pf+0w
lapw1  -up -p -c (02:38:20) starting parallel lapw1 at Fri Oct 11 02:38:20 PDT 2013
  -  starting parallel LAPW1 jobs at Fri Oct 11 02:38:21 PDT 2013
  running LAPW1 in parallel mode (using .machines)
  8 number_of_parallel_jobs
   c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w
   c1201-ib(1) 26845.212u 15.496s 7:39:59.37 97.3% 0+0k 0+0io 0pf+0w
   c1180-ib(1) 25872.609u 18.143s 7:23:53.43 97.2% 0+0k 0+0io 0pf+0w
   c1179-ib(1) 26040.482u 17.868s 7:26:38.66 97.2% 0+0k 0+0io 0pf+0w
   c1178-ib(1) 26571.271u 17.946s 7:34:16.23 97.5% 0+0k 0+0io 0pf+0w
   c1177-ib(1) 27108.070u 34.294s 8:32:55.53 88.1% 0+0k 0+0io 0pf+0w
   c1171-ib(1) 26729.399u 14.175s 7:36:22.67 97.6% 0+0k 0+0io 0pf+0w
   c0844-ib(1) 25883.863u 47.148s 8:12:35.54 87.7% 0+0k 0+0io 0pf+0w
 Summary of lapw1para:
 c1208-ib k=1 user=26558.3 wallclock=454
 c1201-ib k=1 user=26845.2 wallclock=459
 c1180-ib k=1 user=25872.6 wallclock=443
 c1179-ib k=1 user=26040.5 wallclock=446
 c1178-ib k=1 user=26571.3 wallclock=454
 c1177-ib k=1 user=27108.1 wallclock=512
 c1171-ib k=1 user=26729.4 wallclock=456
 c0844-ib k=1 user=25883.9 wallclock=492
  97.935u 34.265s 8:32:58.38 0.4% 0+0k 0+0io 0pf+0w
lapw1  -dn -p -c (11:11:19) starting parallel lapw1 at Fri Oct 11 11:11:19 PDT 2013
  -  starting parallel LAPW1 jobs at Fri Oct 11 11:11:19 PDT 2013
  running LAPW1 in parallel mode (using .machines.help)
  8 number_of_parallel_jobs
   c1208-ib(1) 26474.686u 16.142s 7:33:36.01 97.3% 0+0k 0+0io 0pf+0w
   c1201-ib(1) 26099.149u 40.330s 8:04:42.58 89.8% 0+0k 0+0io 0pf+0w
   c1180-ib(1) 26809.287u 14.724s 7:38:56.52 97.4% 0+0k 0+0io 0pf+0w
   c1179-ib(1) 26007.527u 17.959s 7:26:10.62 97.2% 0+0k 0+0io 0pf+0w
   c1178-ib(1) 26565.723u 17.576s 7:35:20.11 97.3% 0+0k 0+0io 0pf+0w
   c1177-ib(1) 27114.619u 31.180s 8:21:28.34 90.2% 0+0k 0+0io 0pf+0w
   c1171-ib(1) 26474.665u 15.309s 7:33:38.15 97.3% 0+0k 0+0io 0pf+0w
   c0844-ib(1) 26586.569u 15.010s 7:35:22.88 97.3% 0+0k 0+0io 0pf+0w
 Summary of lapw1para:
 c1208-ib k=1 user=26474.7 wallclock=453
  

Re: [Wien] Intel(R) Xeon(R) CPU X5550 @ 2.67GHz vs Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

2013-10-17 Thread Peter Blaha

You still did not tell us the matrix size for the truncated RKmax, but yes,
the scaling is probably ok (scaling goes with n^3; i.e. in case of
matrix sizes 12000 and 24000 we expect almost a factor of 8 !!! in cpu time).
It also explains the memory 

You also did not tell us if you have inversion or not.

One of my real cases with NMAT= 21500 takes 400 sec on 64 cores (mpi), so one
could estimate something like 2 sec on a single core, which comes to the right
order of magnitude compared to your case.
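
The two estimates above can be checked with a few lines of Python. This is only a sketch under idealized assumptions: pure n^3 scaling of the cpu time and perfect parallel efficiency when converting 64-core wall time into core-seconds. The function names are illustrative; the matrix sizes 9190 and 23486 are the ones reported in this thread for RKM 4.88 and 7.00.

```python
def cubic_ratio(n_small, n_large):
    """CPU-time ratio expected if the cost scales as n^3."""
    return (n_large / n_small) ** 3


def core_seconds(wall_seconds, cores):
    """Total core-seconds, assuming perfect parallel efficiency (optimistic)."""
    return wall_seconds * cores


if __name__ == "__main__":
    print(cubic_ratio(12000, 24000))   # 8.0
    print(cubic_ratio(9190, 23486))    # reduced vs. full basis in this thread
    print(core_seconds(400, 64))       # 25600
```

Under these assumptions the 64-core case corresponds to 25600 core-seconds, which is close to the user times of about 26000 s per k-point in the dayfile.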

And: you may have 72 inequivalent atoms, but you did not tell us how many atoms
in total you have.
The total number of atoms is the important info !!

Probably you can reduce RKMAX (you did not tell us which atom has RMT=1.65,
probably O ??), and most likely you should use mpi AND iterative
diagonalization.

As I said, a case with 72 atoms (or whatever you have) can run in minutes on a
reasonable cluster with a properly optimized setup (not just the defaults).


On 17.10.2013 18:05, Yundi Quan wrote:

Thanks a lot.
On cluster A, RKM was automatically reduced to 4.88 while on cluster B RKM was
kept at 7. I didn't expect this, though I was aware that WIEN2k would
automatically reduce RKM in some cases. But is it reasonable for an iteration
to run for eight hours with the following parameters?
Minimum sphere size: 1.65000 Bohr.
Total k-mesh : 8
Gmax : 12

:RKM  : MATRIX SIZE23486LOs:1944  RKM= 7.00  WEIGHT= 2.00  PGR:
:RKM  : MATRIX SIZE23486LOs:1944  RKM= 7.00  WEIGHT= 2.00  PGR:
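
For comparing two runs, lines like these can be pulled apart with a short Python sketch. The regular expression is written against the two :RKM lines shown in this thread only (note that the matrix size can run directly into "LOs:" without a space); it is an assumption that other WIEN2k versions print the same layout, and the function name parse_rkm is made up.

```python
import re

# Pattern fitted to the :RKM lines quoted in this thread (an assumption,
# not a documented format).
RKM_RE = re.compile(
    r":RKM\s*:\s*MATRIX SIZE\s*(\d+)\s*LOs:\s*(\d+)\s+RKM=\s*([\d.]+)"
)


def parse_rkm(line):
    """Return (matrix_size, n_local_orbitals, rkm) or None if no match."""
    match = RKM_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


if __name__ == "__main__":
    full = ":RKM  : MATRIX SIZE23486LOs:1944  RKM= 7.00  WEIGHT= 2.00  PGR:"
    print(parse_rkm(full))  # (23486, 1944, 7.0)
```

Feeding it the output of grep :RKM case.scf from both clusters makes the difference in matrix size visible at a glance.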


On Thu, Oct 17, 2013 at 8:54 AM, Peter Blaha pbl...@theochem.tuwien.ac.at wrote:

The Xeon X5550 processor is a 4 core processor and your cluster may have
combined a few of them on one node (2-4 ?). Anyway, 14 cores are not really
possible ??

Have you done more than just looking at the total time ?

Is the machines file the same on both clusters ? Such a machines file does 
NOT use  mpi.

One guess in case you really use mpi on cluster B (with a different
.machines file): In the sequential run (A) the basis set is limited by NMATMAX,
in the mpi-parallel run it is not (or it is scaled up by sqrt(N-core)).
So it could be that run A has a MUCH smaller RKMAX than run (B).
Check grep :RKM case.scf   of the two runs.
What are the real matrix sizes ?

Alternatively, NMATMAX could be chosen differently on the two machines 
since somebody else installed WIEN2k.
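
To see why the matrix size matters for memory, here is a rough Python estimate of the storage for one dense n x n matrix. It assumes full (not packed) storage in double precision, with 16 bytes per element in the complex case and 8 bytes in the real case. How WIEN2k actually lays out and distributes its matrices is not covered here, so treat the numbers as an order of magnitude only. The function name is illustrative.

```python
def dense_matrix_gib(n, complex_valued=True):
    """Memory in GiB for one full n x n double-precision matrix."""
    bytes_per_element = 16 if complex_valued else 8
    return n * n * bytes_per_element / 2**30


if __name__ == "__main__":
    # Matrix sizes reported in this thread for RKM = 7.00 and RKM = 4.88.
    for n in (23486, 9190):
        print(n, round(dense_matrix_gib(n), 2), "GiB per complex matrix")
```

With these assumptions a single complex matrix of size 23486 already needs more than 8 GiB, while size 9190 needs a bit over 1 GiB, which would be consistent with the smaller basis being the one that fits on a node with 8 GB.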

Please compare carefully the resulting case.output1_1 files and, if necessary,
send the RELEVANT PARTS OF THEM.


In any case, a 72 atom cell should NOT take 2 h / iteration (or even 8 ??).

What are your sphere sizes ???, what gives :RKM in case.scf ???

At least one can set   OMP_NUM_THREADS=2 or 4   and speed up the code by a
factor of almost 2. (You should see in the dayfile something close to 200 %
instead of ~100%.)

c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w
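
The percentage column can be reproduced from the other fields of such a line: it is (user + system) time divided by wall time. A minimal Python sketch, assuming the csh-style layout "<user>u <system>s <wall> <pct>% ..." seen in the dayfile; the function names are made up.

```python
def parse_wall(field):
    """Convert a wall-clock field like '7:34:14.39' or '24:14.84' to seconds."""
    seconds = 0.0
    for part in field.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def cpu_percent(time_fields):
    """CPU utilisation in percent from '<user>u <sys>s <wall> ...'."""
    fields = time_fields.split()
    user = float(fields[0].rstrip("u"))
    system = float(fields[1].rstrip("s"))
    wall = parse_wall(fields[2])
    return 100.0 * (user + system) / wall


if __name__ == "__main__":
    line = "c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w"
    # Drop the leading host field before parsing the timing fields.
    print(round(cpu_percent(line.split(None, 1)[1]), 1))
```

A value near 100 means a single thread was busy for the whole run. With two threads working, the user time is summed over both threads, so the same calculation should give something close to 200.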

In essence:  for a matrix size of 1 (real, with inversion), lapw1 should
take on the order of 10 min  (no mpi, maybe with OMP_NUM_THREADS=2)



On 10/17/2013 04:33 PM, Yundi Quan wrote:

Thanks for your reply.
a). both machines are set up in a way that once a node is assigned to a
job, it cannot be assigned to another.
b). The .machines file looks like this
1:node1
1:node2
1:node3
1:node4
1:node5
1:node6
1:node7
1:node8
granularity:1
extrafine:1
lapw2_vector_split:1

I've been trying to avoid using mpi because sometimes mpi slows down
my calculations because of poor communication between nodes.

c). The amount of memory available to a core does not seem to be the
problem in my case, because my job runs smoothly on cluster A, where
each node has 8 GB of memory and 8 cores. But my job runs into memory
problems on cluster B, where each core has much more memory available. I
wonder whether there are parameters that I should change in WIEN2k to
reduce the memory usage.

d). My dayfile for a single iteration looks like this. The wallclocks
are around 500.


  cycle 1 (Fri Oct 11 02:14:05 PDT 2013) (40/99 to go)

 lapw0 -p (02:14:05) starting parallel lapw0 at Fri Oct 11 02:14:06 PDT 2013
 .machine0 : processors
running lapw0 in single mode
1431.414u 22.267s 24:14.84 99.9% 0+0k 0+0io 0pf+0w
 lapw1  -up -p -c (02:38:20) starting parallel lapw1 at Fri Oct 11 02:38:20 PDT 2013
-  starting parallel LAPW1 jobs at Fri Oct 11 02:38:21 PDT 2013
running LAPW1 in parallel mode (using .machines)
8 number_of_parallel_jobs
   c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w
   c1201-ib(1) 26845.212u 15.496s 7:39:59.37 97.3%0+0k