The different runtime fractions for small and large systems are due to
how the individual programs scale.
lapw0 scales basically linearly with the number of atoms, but lapw1 scales
cubically with the size of the basis set.
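As a back-of-the-envelope illustration of what cubic scaling means (a sketch; the 136632 is the nanowire matrix dimension quoted later in this mail, while the 30000 is a hypothetical basis size for a well-optimized cell):

```python
# Rough cost model: lapw1 diagonalization time grows as O(N^3)
# in the matrix (basis-set) dimension N.

def relative_lapw1_cost(n_small: int, n_large: int) -> float:
    """Ratio of O(N^3) diagonalization cost between two basis sizes."""
    return (n_large / n_small) ** 3

# Nanowire matrix dimension vs. a hypothetical well-optimized basis:
print(relative_lapw1_cost(30_000, 136_632))  # roughly 94x more expensive
```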
And here is the second problem: for your nanowire you get a matrix size
of about 130000x130000, and this for just 97 atoms.
It is not the number of atoms that determines the memory, but the size
of the plane-wave basis set. This info is printed in the :RKM line of the scf
file, and you can even get it using
x lapw1 -nmat_only
So your cell dimensions / RMT settings must be very bad. Remember: also
"vacuum" costs a lot in plane-wave methods. You have to optimize your
RMTs and reduce the cell parameters (vacuum).
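For example, from inside the case directory (these are the standard WIEN2k commands mentioned above; the exact output format may differ between versions, and "case.scf" stands for your actual scf file name):

```shell
# Report the matrix size without running the full diagonalization
x lapw1 -nmat_only

# Inspect the basis-set size per iteration in an existing scf file
grep :RKM case.scf
```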
lapw2: you can add a line to .machines:
lapw2_vector_split:4 (or 8 or 16)
which will reduce the memory consumption of lapw2.
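A minimal .machines sketch with that line added (the hostnames node1/node2 and the 20 cores per node are placeholders matching the setup described below; the remaining lines follow the usual .machines layout):

```text
# hypothetical .machines for two 20-core nodes
lapw0: node1:20 node2:20
1:node1:20
1:node2:20
granularity:1
lapw2_vector_split:4
```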
On 11/07/2017 04:09 PM, Luigi Maduro - TNW wrote:
There are 2 different things:
lapw0para executes:
$remote $machine "cd $PWD;$t $exe $def.def"
where $remote is either ssh or rsh (depending on your configuration setup).
Once this is defined, it goes to the remote node and executes
$exe, which usually refers to mpirun.
mpirun is a script on your system, and it may honor this
I_MPI_HYDRA_BOOTSTRAP=rsh variable, while by default it seems to use ssh (even if your
system does not support this). WIEN2k does not know about such a variable and assumes
that a plain mpirun will do the correct thing. The sysadmin should set up the
system such that rsh is used by default with mpirun, or should tell
people which mpi commands/variables they should set.
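If the defaults cannot be changed system-wide, one workaround (this is Intel MPI's Hydra launcher variable mentioned above; export it in the job script or shell startup file, and adjust to your MPI flavor):

```shell
# Make Intel MPI's Hydra process manager launch remote processes
# with rsh instead of the default ssh
export I_MPI_HYDRA_BOOTSTRAP=rsh
```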
PS: I do not quite understand how it can happen that you get rsh in lapw1para,
but ssh in lapw0para??
I do not understand either, because when I check the lapw2para script I
see that “set remote = rsh”
I have a couple of questions concerning the parallel version of WIEN2k,
one concerning insufficient virtual memory and the other concerning lapw1.
I’ve been trying to do simulations of MoS2 in two types of
configurations. One is a monolayer calculation (4x4x1 unit cells) with
48 atoms,
and another calculation deals with a “nanowire” (13x2x1 unit cells) with
97 atoms.
For the 4x4x1 unit cell I have an rkmax of 6.0 and a 10 k-point mesh.
For the calculation I used 2 nodes and 20 processors per node (so 40 in
total).
The command run is: run_lapw -p -nlvdw -ec 0.0001.
What I noticed is that both lapw1 and nlvdw take a long time to run.
Lapw0 takes about a minute, as does lapw2. Lapw1 and nlvdw take about
16-19 minutes to run.
When I log into the nodes and use the ‘top’ command to check the CPU% I
see that all processors are at 100%; however, I’ve been notified that
only 2% of the requested CPU time is actually used.
I don’t really understand why there is such a big discrepancy in
computation time between lapw1 and lapw2. In smaller calculations lapw1
and lapw2 take computation times of the same order of magnitude.
For the nanowire calculation I chose an rkmax of 6.0 and a single
k-point and only used LDA because I want to compare LDA with NLVDW later
on. I always get a “forrtl: severe (41): insufficient virtual memory”
error at lapw1 or lapw2 in the first SCF cycle, no matter how many
nodes I request, from 1 node to 20 nodes.
Each time I requested 20 processors per node. Only with 20 nodes and
20 processors per node did the SCF cycle make it to lapw2, but it crashed not
long after reaching lapw2. Each node is equipped with 128 GB of memory,
and the end of output1_1 looks like this:
MPI-parallel calculation using 400 processors
Scalapack processors array (row,col): 20 20
Matrix size 136632
Nice Optimum Blocksize 112 Excess % 0.000D+00
allocate H 712.2 MB dimensions 6832 6832
allocate S 712.2 MB dimensions 6832 6832
allocate spanel 11.7 MB dimensions 6832 112
allocate hpanel 11.7 MB dimensions 6832 112
allocate spanelus 11.7 MB dimensions 6832 112
allocate slen 5.8 MB dimensions 6832 112
allocate x2 5.8 MB dimensions 6832 112
allocate legendre 75.9 MB dimensions 6832 13 112
allocate al,bl (row) 2.3 MB dimensions 6832 11
allocate al,bl (col) 0.0 MB dimensions 112 11
allocate YL 1.7 MB dimensions 15 6832 1
Time for al,bl (hamilt, cpu/wall) : 14.7 14.7
Time for legendre (hamilt, cpu/wall) : 4.1 4.1
Time for phase (hamilt, cpu/wall) : 29.7 30.2
Time for us (hamilt, cpu/wall) : 38.8 39.2
Time for overlaps (hamilt, cpu/wall) : 115.6 116.3
Time for distrib (hamilt, cpu/wall) : 0.3 0.3
Time sum iouter (hamilt, cpu/wall) : 203.5 205.7
number of local orbitals, nlo (hamilt) 749
allocate YL 33.4 MB dimensions 15 136632 1
allocate phsc 2.1 MB dimensions 136632
Time for los (hamilt, cpu/wall) : 0.4 0.4
Time for alm (hns) : 1.0
Time for vector (hns) : 7.2
Time for vector2 (hns) : 6.8
Time for VxV (hns) : 114.8
Wall Time for VxV (hns) : 1.2
Scalapack Workspace size 100.38 and 804.35 Mb
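The 712.2 MB figures for H and S above are consistent with simple arithmetic (a sketch; it assumes double-precision complex storage at 16 bytes per element and the 20x20 ScaLAPACK grid reported in the output):

```python
# Reproduce the per-process allocation figures from the lapw1 output.
nmat = 136632          # "Matrix size" line
grid = 20              # 20x20 ScaLAPACK processor grid
bytes_per_elem = 16    # double-precision complex

local_dim = -(-nmat // grid)                       # ceil division -> 6832
local_mb = local_dim ** 2 * bytes_per_elem / 1024 ** 2

print(f"local block {local_dim}x{local_dim}: {local_mb:.1f} MB")  # 712.2 MB
print(f"H + S per process: {2 * local_mb / 1024:.2f} GB")
```

With 20 such processes per node, H and S alone already occupy roughly 28 GB per node, before the workspace, eigenvectors, and the other arrays listed above, which is why even 128 GB nodes run out of memory at this basis-set size.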
Any help is appreciated.
Kind regards,
Luigi
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: bl...@theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/TC_Blaha
--------------------------------------------------------------------------
_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html