Very approximately: one month per SCF iteration, possibly much more. My
advice is to give up and use either PBE or another code with a much
faster implementation of hybrid functionals.
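
For what it is worth, a crude back-of-envelope extrapolation (Python, purely
illustrative) from the MgO timing quoted further down (about 90 s per hybrid
SCF cycle for the 2-atom cell) points in the same direction. The cubic and
quartic scaling exponents are assumptions made for the sketch, not measured
properties of the WIEN2k hf module, and differences in basis size and
parallelization are ignored:

# Back-of-envelope only: scale the ~90 s/cycle measured for the 2-atom MgO
# cell (quoted below) to 96 atoms under an *assumed* power-law cost growth.
mgo_atoms, big_atoms = 2, 96
mgo_cycle_s = 90.0                       # ~90 s per PBE0 SCF cycle for MgO
size_ratio = big_atoms / mgo_atoms       # 48 times more atoms

for exponent in (3, 4):                  # assumed cubic vs. quartic scaling
    est_days = mgo_cycle_s * size_ratio**exponent / 86400.0
    print(f"assuming N^{exponent}: roughly {est_days:,.0f} days per SCF iteration")

Even the optimistic cubic assumption ends up at months per SCF iteration.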


On Wed, 25 Mar 2015, Paul Fons wrote:

I would like to ask for advice on scaling for the 96-atom amorphous system
I am attempting to run hybrid calculations on.  As stated below, my attempts
with a (very) small system showed that the hybrid setup is functional and
that there was a slowdown by a factor of about six between PBE and PBE0.
In light of the comment I received that there may be issues with size
allocations when working with the 2x2x2 MP grid I was initially using, I
attempted again to run a hybrid calculation with just one k-point at Gamma.
In this case my .machines file looked like the following (a small sketch
for tallying the requested MPI ranks follows the listing):

lapw0:localhost:12
1:localhost:24 sagittarius-ib:24 draco-ib:24 libra-ib:24
granularity:1
extrafine:1
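
As a side note, here is a minimal standalone sketch (Python), not part of
WIEN2k and purely illustrative, that tallies how many MPI ranks each
parallel-job line of a .machines file in the "weight:host:count host:count ..."
form requests; the file name and output format are assumptions:

def ranks_per_line(path=".machines"):
    """Return (label, total ranks) for each job line of a .machines file."""
    totals = []
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue                          # skip blanks and comments
            if line.split(":")[0] in ("granularity", "extrafine"):
                continue                          # skip control keywords
            tokens = line.split()
            label = tokens[0].split(":", 1)[0]    # "lapw0" or the job weight, e.g. "1"
            ranks = sum(int(tok.rsplit(":", 1)[1]) for tok in tokens)
            totals.append((label, ranks))
    return totals

if __name__ == "__main__":
    for label, ranks in ranks_per_line():
        print(f"job '{label}': {ranks} MPI ranks")

For the file above it would report 12 ranks for the lapw0 line and 96 ranks
(4 nodes x 24 cores) for the single Gamma-point job.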


This calculation quickly finished the lapw0 step and has been running for
several days in lapw1c.  Each node has two 12-core Ivy Bridge CPUs, and the
nodes are interconnected via InfiniBand.  I know that a 2x2x2 MP-grid VASP
hybrid calculation took about 24 hours on the ARCHER supercomputer using
8 nodes, which is approximately twice the number of CPUs that I have.  The
number of valence electrons in the pseudopotentials was significantly
smaller, though, and did not include the d-states that I have included in
my calculations.  Would it be possible to hear a (wild is OK) guess of the
number of CPU hours required to complete a calculation on the 96-atom
system?


The MgO results are repeated below:





As suggested, I have run a hybrid-functional calculation for the MgO primitive cell.

My .machines file was

# 
lapw0:localhost:2
8:localhost:2
8:draco-ib:2
granularity
extrafine:1

and my MgO.inhf file looks like

0.25         alpha
T            screened (T) or unscreened (F)
0.165        lambda
9            nband
6            gmax
3            lmaxe
3            lmaxv
1d-3         tolu



For the regular PBE calculation (just run_lapw -p -in1new 2), an SCF cycle
took about 15 seconds.  For the hybrid calculation with the options
"run_lapw -hf -p -in1new 2" and the above case.inhf file (screening off),
the same structure took about 90 seconds per SCF cycle, making the hybrid
calculation about a factor of six slower.

I encountered no errors with this run, unlike my attempts with the 96-atom
system.  In response to the question about the size of aCGT.weighhf, the
file is 117,020 bytes.


Best wishes,
Paul Fons




      On Mar 13, 2015, at 11:37 PM, t...@theochem.tuwien.ac.at wrote:

What is the size of aCGT.weighhf? Is it empty?

Also, before continuing further with your big system, it would be
interesting to know whether the same problem occurs with a very small
system like MgO (on the same machine and in MPI mode).

Anyway, I still think that it is hopeless to apply hybrid functionals to
such a big system.

F. Tran


On Fri, 13 Mar 2015, Paul Fons wrote:

      I attempted to run an SCF loop using a hybrid functional and have run
      into some problems.  In my earlier try I had an incorrectly specified
      .machines file; I have now addressed this problem.  I also changed the
      SCRATCH environment variable to "./" so that it points to the main
      directory for the calculation.  I have run a PBE SCF loop to normal
      termination for an amorphous cluster of 96 atoms of Cu-Ge-Te.  I then
      ran the init_hf script and, after setting the number of bands to 770
      for my 1526-electron system, set the MP grid to 2x2x2 for a total of
      four k-points.  I then ran the command "run_lapw -hf -p -in1new 2".
      The SCF loop ran through lapw0, lapw1, lapw2, core, and then crashed
      in the program hf.  Upon crashing, the MPI processes reported the
      following error:

      forrtl: severe (67): input statement requires too much data, unit 26, 
file /usr/local/share/wien2k/Fons/aCGT/aCGT.weighhf

      I have no idea why the program failed this time.  I am using the Intel
      compiler (15) and the Intel MPI environment (e.g. mpiexec.hydra) to
      launch parallel programs, as can be seen in the "parallel_options"
      file.  The only *.error files are those like "hf_1.error" to
      "hf_4.error", which contain the not particularly useful message
      "error in hf", so the error occurred in the routine "hf".  I would be
      most grateful for any advice as to what to try next.  I have included
      what I hope is relevant debugging information below.


      My parallel_options file (on all nodes) is:


      mats...@gemini.a04.aist.go.jp:~/Wien2K>cat parallel_options
      setenv TASKSET "no"
      setenv USE_REMOTE 0
      setenv MPI_REMOTE 0
      setenv WIEN_GRANULARITY 1
      setenv WIEN_MPIRUN "mpiexec.hydra -n _NP_ -machinefile _HOSTS_ _EXEC_"


      My .machines file is as follows:


      lapw0:localhost:12
      1:localhost:12
      1:localhost:12
      1:draco-ib:12
      1:draco-ib:12
      granularity:1
      extrafine:1



      CONTENTS of :parallel

      -----------------------------------------------------------------
      starting parallel lapw1 at Thu Mar 12 14:25:13 JST 2015
         localhost localhost localhost localhost localhost localhost localhost 
localhost localhost localhost localhost localhost(1) 25254.539u 519.601s 
35:44.50 1201.8% 0+0k 8+882304io 0pf+0w
         localhost localhost localhost localhost localhost localhost localhost 
localhost localhost localhost localhost localhost(1) 24889.112u 585.238s 
35:41.95 1189.3% 0+0k 0+719488io 0pf+0w
         draco-ib draco-ib draco-ib draco-ib draco-ib draco-ib draco-ib 
draco-ib draco-ib draco-ib draco-ib draco-ib(1) 0.034u 0.021s 32:40.68 0.0% 
0+0k 0+0io 0pf+0w
         draco-ib draco-ib draco-ib draco-ib draco-ib draco-ib draco-ib 
draco-ib draco-ib draco-ib draco-ib draco-ib(1) 0.035u 0.017s 32:39.14 0.0% 
0+0k 0+0io 0pf+0w
       Summary of lapw1para:
       localhost k=0 user=0 wallclock=0
       draco-ib k=0 user=0 wallclock=0
      <-  done at Thu Mar 12 15:01:00 JST 2015
      -----------------------------------------------------------------
      ->  starting Fermi on gemini.a04.aist.go.jp at Thu Mar 12 15:28:19 JST 
2015
      ->  starting parallel lapw2c at Thu Mar 12 15:28:20 JST 2015
          localhost 389.940u 7.565s 0:36.04 1102.9% 0+0k 718416+253704io 0pf+0w
          localhost 347.944u 5.749s 0:31.67 1116.7% 0+0k 718528+199776io 0pf+0w
          draco-ib 0.029u 0.026s 0:33.86 0.1% 0+0k 8+0io 0pf+0w
          draco-ib 0.032u 0.020s 0:33.80 0.1% 0+0k 8+0io 0pf+0w
       Summary of lapw2para:
       localhost user=737.884 wallclock=67.71
       draco-ib user=0.061 wallclock=67.66
      <-  done at Thu Mar 12 15:28:57 JST 2015
      ->  starting sumpara 4 on gemini.a04.aist.go.jp at Thu Mar 12 15:28:58 
JST 2015
      <-  done at Thu Mar 12 15:29:27 JST 2015
      -----------------------------------------------------------------
      ->  starting parallel hfc at Thu Mar 12 15:29:35 JST 2015
      **  HF crashed at Thu Mar 12 15:29:40 JST 2015
      **  check ERROR FILES!
      -----------------------------------------------------------------



      DAYFILE

      mats...@libra.a04.aist.go.jp:/usr/local/share/wien2k/Fons/aCGT>cat aCGT.dayfile

      Calculating aCGT in /usr/local/share/wien2k/Fons/aCGT
      on gemini.a04.aist.go.jp with PID 46746
      using WIEN2k_14.2 (Release 15/10/2014) in /home/matstud/Wien2K


        start (Thu Mar 12 11:22:09 JST 2015) with lapw0 (40/99 to go)

        cycle 1 (Thu Mar 12 11:22:09 JST 2015) (40/99 to go)

             lapw0 -grr -p (11:22:09) starting parallel lapw0 at Thu Mar 12 
11:22:09 JST 2015

      -------- .machine0 : 12 processors
      745.365u 3.350s 1:05.09 1150.2% 0+0k 144+796936io 0pf+0w
             lapw0 -p (11:23:14) starting parallel lapw0 at Thu Mar 12 11:23:15 
JST 2015

      -------- .machine0 : 12 processors
      620.682u 2.669s 0:54.15 1151.1% 0+0k 40+203264io 0pf+0w
             lapw1    -c (11:24:09) 20736.270u 146.444s 3:01:03.72 192.2% 0+0k 
11992+5913840io 0pf+0w
             lapw1  -p   -c (14:25:13) starting parallel lapw1 at Thu Mar 12 
14:25:13 JST 2015

      ->  starting parallel LAPW1 jobs at Thu Mar 12 14:25:13 JST 2015
      running LAPW1 in parallel mode (using .machines)
      4 number_of_parallel_jobs
         localhost localhost localhost localhost localhost localhost localhost 
localhost localhost localhost localhost localhost(1) 25254.539u 519.601s 
35:44.50 1201.8% 0+0k 8+882304io 0pf+0w
         localhost localhost localhost localhost localhost localhost localhost 
localhost localhost localhost localhost localhost(1) 24889.112u 585.238s 
35:41.95 1189.3% 0+0k 0+719488io 0pf+0w
         draco-ib draco-ib draco-ib draco-ib draco-ib draco-ib draco-ib 
draco-ib draco-ib draco-ib draco-ib draco-ib(1) 0.034u 0.021s 32:40.68 0.0% 
0+0k 0+0io 0pf+0w
         draco-ib draco-ib draco-ib draco-ib draco-ib draco-ib draco-ib 
draco-ib draco-ib draco-ib draco-ib draco-ib(1) 0.035u 0.017s 32:39.14 0.0% 
0+0k 0+0io 0pf+0w
       Summary of lapw1para:
       localhost k=0 user=0 wallclock=0
       draco-ib k=0 user=0 wallclock=0
      50150.035u 1107.587s 35:46.80 2387.6% 0+0k 72+1603320io 0pf+0w
             lapw2   -c (15:01:00) 1728.441u 186.944s 27:18.21 116.9% 0+0k 
5749640+254352io 0pf+0w
             lapw2 -p   -c   (15:28:18) running LAPW2 in parallel mode

          localhost 389.940u 7.565s 0:36.04 1102.9% 0+0k 718416+253704io 0pf+0w
          localhost 347.944u 5.749s 0:31.67 1116.7% 0+0k 718528+199776io 0pf+0w
          draco-ib 0.029u 0.026s 0:33.86 0.1% 0+0k 8+0io 0pf+0w
          draco-ib 0.032u 0.020s 0:33.80 0.1% 0+0k 8+0io 0pf+0w
       Summary of lapw2para:
       localhost user=737.884 wallclock=67.71
       draco-ib user=0.061 wallclock=67.66
      753.258u 15.562s 1:08.22 1126.9% 0+0k 2229568+654112io 0pf+0w
             lcore (15:29:27) 4.166u 0.370s 0:06.35 71.3% 0+0k 8+69416io 0pf+0w
             hf       -p -c (15:29:34) running HF in parallel mode

      **  HF crashed!
      0.987u 2.750s 0:06.26 59.5% 0+0k 1208+1344io 19pf+0w
      error: command   /home/matstud/Wien2K/hfcpara -c hf.def   failed

             stop error




_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
