First, thanks to Peter. I should have described my problem more thoroughly from the start.

:RKM  : MATRIX SIZE 9190LOs:1944  RKM= 4.88  WEIGHT= 2.00  PGR

The reduced RKM is 4.88, and the reduced matrix size is 9190, which is about 2/5 
of the full matrix. So that explains a lot. Since my cell has P1 symmetry (no 
inversion), the complex versions of lapw1 and lapw2 are used. Compared with an 
LDA calculation, LSDA also almost doubles the lapw1 and lapw2 time, since each 
runs separately for spin up and spin down.
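
(For reference, the :RKM line above comes from  grep :RKM case.scf  on the 
truncated run, the check Peter suggests below.)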

Because of the P1 symmetry, symmetry operations cannot reduce the number of 
stars (i.e., plane waves) in the interstitial region or the number of spherical 
harmonics inside the muffin-tin spheres. I guess that's why my job takes so 
long. Moreover, I'm only using k-point parallelization, without mpi.

Oxygen, with RMT=1.65, is the smallest atom in the unit cell. Reducing RKMAX to 
6.5 is the first thing I will try.
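
If I read the input format correctly, that just means lowering the first number 
on the second line of case.in1 (case.in1c here, since the complex version 
runs), e.g. something like

   6.50       10    4   (R-MT*K-MAX; MAX L IN WF, V-NMT)

(the 10 and 4 above are only the documented defaults for the MAX L values, not 
necessarily my actual input).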

One of the clusters to which I have access has 8 cores and 8 GB of memory per 
node. Given the memory constraint, I wonder how to improve core usage when 
calculating compounds with large unit cells. For the compound I'm currently 
working on, I request one core per node and 8 nodes (= nkp) per job, so 
7*8 = 56 cores sit idle while my job runs. I'm in dire need of help.
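
If I follow Laurence's and Peter's advice below, a .machines file along these 
lines (node names hypothetical) would put one k-point on each node and use all 
8 cores of each node via mpi:

1:node1:8
1:node2:8
...
1:node8:8
granularity:1
extrafine:1

Alternatively, staying without mpi, setting OMP_NUM_THREADS=2 (or 4) would at 
least use a few cores per node. Is that the right approach with only 1 GB of 
memory per core?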


Yundi


On Oct 17, 2013, at 10:58 PM, Peter Blaha <pbl...@theochem.tuwien.ac.at> wrote:

> You still did not tell us the matrix size for the truncated RKmax, but yes,
> the scaling is probably ok. (The scaling goes with n^3; i.e. in the case of
> matrix sizes of 12000 and 24000 we expect almost a factor of 8 !!! in cpu
> time. It also explains the memory ....)
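> (With the two matrix sizes reported in this thread, 9190 and 23486, that
> ratio would be (23486/9190)^3 ~ 17 in cpu time.)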
> 
> You also did not tell us if you have inversion or not.
> 
> One of my real cases with NMAT= 21500 takes 400 sec on 64 cores (mpi), so one
> could estimate something like 20000 sec on a single core, which comes to the
> right order of magnitude compared to your case.
> 
> And: you may have 72 inequivalent atoms, but you did not tell us how many
> atoms in total you have. The total number of atoms is the important info !!
> 
> Probably you can reduce RKMAX (you did not tell us which atom has RMT=1.65;
> probably O ??), and most likely you should use mpi AND iterative
> diagonalization.
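> (As a sketch: a .machines line like  1:node1:8  gives mpi on one node, and
> the -it switch turns on iterative diagonalization, e.g.  runsp_lapw -p -it
> for a spin-polarized case.)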
> 
> As I said, a case with 72 atoms (or whatever you have) can run in minutes on
> a reasonable cluster and with a properly optimized setup (not just the
> defaults).
> 
> 
> On 17.10.2013 18:05, Yundi Quan wrote:
>> Thanks a lot.
>> On cluster A, RKM was automatically reduced to 4.88 while on cluster B RKM
>> was kept at 7. I didn't expect this, though I was aware that WIEN2k would
>> automatically reduce RKM in some cases. But is it reasonable for an
>> iteration to run for eight hours with the following parameters?
>> Minimum sphere size: 1.65000 Bohr.
>> Total k-mesh : 8
>> Gmax             : 12
>> 
>> :RKM  : MATRIX SIZE23486LOs:1944  RKM= 7.00  WEIGHT= 2.00  PGR:
>> :RKM  : MATRIX SIZE23486LOs:1944  RKM= 7.00  WEIGHT= 2.00  PGR:
>> 
>> 
>> On Thu, Oct 17, 2013 at 8:54 AM, Peter Blaha
>> <pbl...@theochem.tuwien.ac.at> wrote:
>> 
>>    The Xeon X5550 processor is a 4 core processor, and your cluster may
>>    have combined a few of them on one node (2-4 ?). Anyway, 14 cores per
>>    node are not really possible ??
>> 
>>    Have you done more than just looking at the total time ?
>> 
>>    Is the machines file the same on both clusters ? Such a machines file
>>    does NOT use mpi.
>> 
>>    One guess in case you really use mpi on cluster B (with a different
>>    .machines file): in the sequential run (A) the basis set is limited by
>>    NMATMAX; in the mpi-parallel run it is not (or it is scaled up by
>>    sqrt(N-core)). So it could be that run A has a MUCH smaller RKMAX than
>>    run B.
>>    Check  grep :RKM case.scf  for the two runs.
>>    What are the real matrix sizes ????
>> 
>>    Alternatively, NMATMAX could be chosen differently on the two machines
>>    since somebody else installed WIEN2k.
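>>    (NMATMAX is a compile-time parameter set in SRC_lapw1/param.inc during
>>    siteconfig, so it can be checked there on both installations.)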
>> 
>>    Please compare carefully the resulting case.output1_1 files and, if
>>    necessary, send the RELEVANT PARTS OF THEM.
>> 
>> 
>>    In any case, a 72 atom cell should NOT take 2 h / iteration (or even
>>    8 ??).
>> 
>>    What are your sphere sizes ??? What does :RKM in case.scf give ???
>> 
>>    At least one can set OMP_NUM_THREADS=2 or 4 and speed up the code by a
>>    factor of almost 2. (You should see in the dayfile something close to
>>    200 % instead of the ~100% in a line like this:)
>> 
>>     >       c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w
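>>    (A minimal sketch for the job script:  export OMP_NUM_THREADS=2  in
>>    bash, or  setenv OMP_NUM_THREADS 2  in tcsh.)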
>> 
>>    In essence: for a matrix size of 10000 (real, with inversion), lapw1
>>    should take on the order of 10 min (no mpi, maybe with
>>    OMP_NUM_THREADS=2).
>> 
>> 
>> 
>>    On 10/17/2013 04:33 PM, Yundi Quan wrote:
>> 
>>        Thanks for your reply.
>>        a) Both machines are set up in a way that once a node is assigned to
>>        a job, it cannot be assigned to another.
>>        b) The .machines file looks like this:
>>        1:node1
>>        1:node2
>>        1:node3
>>        1:node4
>>        1:node5
>>        1:node6
>>        1:node7
>>        1:node8
>>        granularity:1
>>        extrafine:1
>>        lapw2_vector_split:1
>> 
>>        I've been trying to avoid mpi, because poor communication between
>>        nodes can sometimes slow my calculations down.
>> 
>>        c) The amount of memory available per core does not seem to be the
>>        problem in my case, because my job runs smoothly on cluster A, where
>>        each node has 8 GB of memory and 8 cores. But my job runs into
>>        memory problems on cluster B, where each core has much more memory
>>        available. I wonder whether there are parameters I should change in
>>        WIEN2k to reduce the memory usage.
>> 
>>        d) My dayfile for a single iteration looks like this. The wallclocks
>>        are around 500 minutes.
>> 
>> 
>>              cycle 1 (Fri Oct 11 02:14:05 PDT 2013) (40/99 to go)
>> 
>>          >   lapw0 -p     (02:14:05) starting parallel lapw0 at Fri Oct 11
>>        02:14:06 PDT 2013
>>        -------- .machine0 : processors
>>        running lapw0 in single mode
>>        1431.414u 22.267s 24:14.84 99.9% 0+0k 0+0io 0pf+0w
>>          >   lapw1  -up -p    -c  (02:38:20) starting parallel lapw1 at Fri
>>        Oct 11 02:38:20 PDT 2013
>>        ->  starting parallel LAPW1 jobs at Fri Oct 11 02:38:21 PDT 2013
>>        running LAPW1 in parallel mode (using .machines)
>>        8 number_of_parallel_jobs
>>               c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w
>>               c1201-ib(1) 26845.212u 15.496s 7:39:59.37 97.3% 0+0k 0+0io 0pf+0w
>>               c1180-ib(1) 25872.609u 18.143s 7:23:53.43 97.2% 0+0k 0+0io 0pf+0w
>>               c1179-ib(1) 26040.482u 17.868s 7:26:38.66 97.2% 0+0k 0+0io 0pf+0w
>>               c1178-ib(1) 26571.271u 17.946s 7:34:16.23 97.5% 0+0k 0+0io 0pf+0w
>>               c1177-ib(1) 27108.070u 34.294s 8:32:55.53 88.1% 0+0k 0+0io 0pf+0w
>>               c1171-ib(1) 26729.399u 14.175s 7:36:22.67 97.6% 0+0k 0+0io 0pf+0w
>>               c0844-ib(1) 25883.863u 47.148s 8:12:35.54 87.7% 0+0k 0+0io 0pf+0w
>> 
>>             Summary of lapw1para:
>>             c1208-ib  k=1  user=26558.3  wallclock=454
>>             c1201-ib  k=1  user=26845.2  wallclock=459
>>             c1180-ib  k=1  user=25872.6  wallclock=443
>>             c1179-ib  k=1  user=26040.5  wallclock=446
>>             c1178-ib  k=1  user=26571.3  wallclock=454
>>             c1177-ib  k=1  user=27108.1  wallclock=512
>>             c1171-ib  k=1  user=26729.4  wallclock=456
>>             c0844-ib  k=1  user=25883.9  wallclock=492
>>        97.935u 34.265s 8:32:58.38 0.4% 0+0k 0+0io 0pf+0w
>>          >   lapw1  -dn -p    -c  (11:11:19) starting parallel lapw1 at Fri
>>        Oct 11 11:11:19 PDT 2013
>>        ->  starting parallel LAPW1 jobs at Fri Oct 11 11:11:19 PDT 2013
>>        running LAPW1 in parallel mode (using .machines.help)
>>        8 number_of_parallel_jobs
>>               c1208-ib(1) 26474.686u 16.142s 7:33:36.01 97.3% 0+0k 0+0io 0pf+0w
>>               c1201-ib(1) 26099.149u 40.330s 8:04:42.58 89.8% 0+0k 0+0io 0pf+0w
>>               c1180-ib(1) 26809.287u 14.724s 7:38:56.52 97.4% 0+0k 0+0io 0pf+0w
>>               c1179-ib(1) 26007.527u 17.959s 7:26:10.62 97.2% 0+0k 0+0io 0pf+0w
>>               c1178-ib(1) 26565.723u 17.576s 7:35:20.11 97.3% 0+0k 0+0io 0pf+0w
>>               c1177-ib(1) 27114.619u 31.180s 8:21:28.34 90.2% 0+0k 0+0io 0pf+0w
>>               c1171-ib(1) 26474.665u 15.309s 7:33:38.15 97.3% 0+0k 0+0io 0pf+0w
>>               c0844-ib(1) 26586.569u 15.010s 7:35:22.88 97.3% 0+0k 0+0io 0pf+0w
>>             Summary of lapw1para:
>>             c1208-ib  k=1  user=26474.7  wallclock=453
>>             c1201-ib  k=1  user=26099.1  wallclock=484
>>             c1180-ib  k=1  user=26809.3  wallclock=458
>>             c1179-ib  k=1  user=26007.5  wallclock=446
>>             c1178-ib  k=1  user=26565.7  wallclock=455
>>             c1177-ib  k=1  user=27114.6  wallclock=501
>>             c1171-ib  k=1  user=26474.7  wallclock=453
>>             c0844-ib  k=1  user=26586.6  wallclock=455
>>        104.607u 18.798s 8:21:30.92 0.4% 0+0k 0+0io 0pf+0w
>> 
>>          >   lapw2 -up -p   -c (19:32:50) running LAPW2 in parallel mode
>>                c1208-ib 1016.517u 13.674s 17:11.10 99.9% 0+0k 0+0io 0pf+0w
>>                c1201-ib 1017.359u 13.669s 17:11.82 99.9% 0+0k 0+0io 0pf+0w
>>                c1180-ib 1033.056u 13.283s 17:27.07 99.9% 0+0k 0+0io 0pf+0w
>>                c1179-ib 1037.551u 13.447s 17:31.50 99.9% 0+0k 0+0io 0pf+0w
>>                c1178-ib 1019.156u 13.729s 17:13.49 99.9% 0+0k 0+0io 0pf+0w
>>                c1177-ib 1021.878u 13.731s 17:16.07 99.9% 0+0k 0+0io 0pf+0w
>>                c1171-ib 1032.417u 13.681s 17:26.70 99.9% 0+0k 0+0io 0pf+0w
>>                c0844-ib 1022.315u 13.870s 17:16.81 99.9% 0+0k 0+0io 0pf+0w
>>             Summary of lapw2para:
>>             c1208-ib  user=1016.52  wallclock=1031.1
>>             c1201-ib  user=1017.36  wallclock=1031.82
>>             c1180-ib  user=1033.06  wallclock=1047.07
>>             c1179-ib  user=1037.55  wallclock=1051.5
>>             c1178-ib  user=1019.16  wallclock=1033.49
>>             c1177-ib  user=1021.88  wallclock=1036.07
>>             c1171-ib  user=1032.42  wallclock=1046.7
>>             c0844-ib  user=1022.32  wallclock=1036.81
>>        31.923u 13.526s 18:20.12 4.1% 0+0k 0+0io 0pf+0w
>> 
>>          >   lapw2 -dn -p   -c (19:51:10) running LAPW2 in parallel mode
>>                c1208-ib 947.942u 13.364s 16:01.75 99.9% 0+0k 0+0io 0pf+0w
>>                c1201-ib 932.766u 13.640s 15:49.22 99.7% 0+0k 0+0io 0pf+0w
>>                c1180-ib 932.474u 13.609s 15:47.76 99.8% 0+0k 0+0io 0pf+0w
>>                c1179-ib 936.171u 13.691s 15:50.33 99.9% 0+0k 0+0io 0pf+0w
>>                c1178-ib 947.798u 13.493s 16:04.99 99.6% 0+0k 0+0io 0pf+0w
>>                c1177-ib 947.786u 13.350s 16:04.89 99.6% 0+0k 0+0io 0pf+0w
>>                c1171-ib 930.971u 13.874s 15:45.22 99.9% 0+0k 0+0io 0pf+0w
>>                c0844-ib 950.723u 13.426s 16:04.69 99.9% 0+0k 0+0io 0pf+0w
>>             Summary of lapw2para:
>>             c1208-ib  user=947.942  wallclock=961.75
>>             c1201-ib  user=932.766  wallclock=949.22
>>             c1180-ib  user=932.474  wallclock=947.76
>>             c1179-ib  user=936.171  wallclock=950.33
>>             c1178-ib  user=947.798  wallclock=964.99
>>             c1177-ib  user=947.786  wallclock=964.89
>>             c1171-ib  user=930.971  wallclock=945.22
>>             c0844-ib  user=950.723  wallclock=964.69
>>        31.522u 13.879s 16:53.13 4.4% 0+0k 0+0io 0pf+0w
>>          >   lcore -up (20:08:03) 2.993u 0.587s 0:03.75 95.2% 0+0k 0+0io 0pf+0w
>>          >   lcore -dn (20:08:07) 2.843u 0.687s 0:03.66 96.1% 0+0k 0+0io 0pf+0w
>>          >   mixer     (20:08:21) 23.206u 32.513s 0:56.63 98.3% 0+0k 0+0io 0pf+0w
>> 
>>        :ENERGY convergence:  0 0.00001 416.9302585700000000
>>        :CHARGE convergence:  0 0.0000 3.6278086
>> 
>> 
>>        On Thu, Oct 17, 2013 at 7:11 AM, Laurence Marks
>>        <l-ma...@northwestern.edu> wrote:
>> 
>>             There are so many possibilities, a few:
>> 
>>             a) If you only request 1 core/node, most queuing systems
>>             (qsub/msub etc) will allocate the other cores to other jobs. You
>>             are then going to be very dependent upon what those other jobs
>>             are doing. Normal is to use all the cores on a given node.
>> 
>>             b) When you run on cluster B, in addition to a) it is going to
>>             be inefficient to run with mpi communications across nodes; it
>>             is much better to run on a given node across cores. Are you
>>             using a machines file with eight '1:nodeA' lines (for instance)
>>             or one with a single '1:nodeA nodeB ...' line? The first does
>>             not use mpi, the second does. To use mpi within a node you
>>             would use lines such as 1:node:8. Knowledge of your .machines
>>             file will help people assist you.
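>>             (For instance, with hypothetical node names:
>>                 1:nodeA          k-point parallel only, no mpi
>>                 1:nodeA nodeB    mpi across two nodes
>>                 1:nodeA:8        mpi across 8 cores of one node )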
>> 
>>             c) The memory on those clusters is very small; whoever bought
>>             them was not thinking about large scale jobs. I look for at
>>             least 4G/core, and 2G/core is barely acceptable. You are going
>>             to have to use mpi.
>> 
>>             d) All mpi is equal, but some mpi is more equal than others.
>>             Depending upon whether you have infiniband, ethernet, openmpi,
>>             impi and how everything was compiled, you can see enormous
>>             differences. One thing to look at is the difference between the
>>             cpu time and wall time (both in case.dayfile and at the bottom
>>             of case.output1_*). With a good mpi setup the wall time should
>>             be 5-10% more than the cpu time; with a bad setup it can be
>>             several times it.
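>>             (E.g. something like  grep -i time case.output1_1  should pull
>>             out both timings; the exact labels vary with the WIEN2k
>>             version.)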
>> 
>>             On Thu, Oct 17, 2013 at 8:44 AM, Yundi Quan
>>             <quanyu...@gmail.com> wrote:
>>              > Hi,
>>              > I have access to two clusters as a low-level user. One
>>              > cluster (cluster A) consists of nodes with 8 cores and 8 GB
>>              > of memory per node. The other cluster (cluster B) has 24 GB
>>              > of memory per node, and each node has 14 cores or more. The
>>              > cores on cluster A are Xeon CPU E5620@2.40GHz, while the
>>              > cores on cluster B are Xeon CPU X5550@2.67GHz. From the
>>              > specifications (2.40GHz + 12288 KB cache vs 2.67GHz + 8192 KB
>>              > cache), the two machines should be very close in
>>              > performance. But it does not seem to be so.
>>              >
>>              > I have a job with 72 atoms per unit cell. I initialized the
>>              > job on cluster A and ran it for a few iterations. Each
>>              > iteration took 2 hours. Then I moved the job to cluster B
>>              > (14 cores per node at 2.67GHz). Now it takes more than 8
>>              > hours to finish one iteration. On both clusters, I request
>>              > one core per node and 8 nodes per job (8 is the number of
>>              > k-points). I compiled WIEN2k_13 on cluster A without mpi. On
>>              > cluster B, WIEN2k_12 was compiled by the administrator with
>>              > mpi.
>>              >
>>              > What could have caused the poor performance of cluster B? Is
>>              > it because of MPI?
>>              >
>>              > An unrelated question: sometimes memory runs out on cluster
>>              > B, which has 24 GB of memory per node, while the same job
>>              > runs smoothly on cluster A, which has only 8 GB per node.
>>              >
>>              > Thanks.
>> 
>> 
>> 
>>             --
>>             Professor Laurence Marks
>>             Department of Materials Science and Engineering
>>             Northwestern University
>>             www.numis.northwestern.edu
>>             1-847-491-3996
>> 
>>             "Research is to see what everybody else has seen, and to think
>>             what nobody else has thought"
>>             Albert Szent-Gyorgi
>> 
>> 
>>    --
>>                                          P.Blaha
>>    --------------------------------------------------------------------------
>>    Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
>>    Phone: +43-1-58801-165300    FAX: +43-1-58801-165982
>>    Email: bl...@theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
>>    --------------------------------------------------------------------------
>> 
>> 
>> 
>> 
>> 
>> 
> 
> -- 
> -----------------------------------------
> Peter Blaha
> Inst. Materials Chemistry, TU Vienna
> Getreidemarkt 9, A-1060 Vienna, Austria
> Tel: +43-1-5880115671
> Fax: +43-1-5880115698
> email: pbl...@theochem.tuwien.ac.at
> -----------------------------------------

_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
