Do not call mpirun yourself -- it is called by run_lapw.
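
As a rough sketch, the job steps might look like this with the mpirun prefixes removed. It uses only the commands from your quoted job script. The function name wien_job_steps and the choice to stop when run_lapw fails are illustrative assumptions of mine, and the scheduler directives are omitted.

```shell
# Job steps without any explicit mpirun (sketch; wien_job_steps is a hypothetical name).
wien_job_steps() {
    # Record the allocated hosts, as in the original job script.
    srun hostname -s > slurm.hosts

    # No mpirun prefix: run_lapw launches the parallel binaries itself.
    # Assumption: stop here if the SCF cycle fails instead of continuing to qtl.
    run_lapw -p || return 1

    # Likewise started through the WIEN2k "x" script, not through mpirun.
    x qtl -p -telnes
}
```

Calling wien_job_steps at the end of the submit script would then run the three steps in order.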

What is in your $WIENROOT/parallel_options file? It was set up during
installation, and needs to be correct for your srun environment.

What is in case.scf0, case.output000* and lapw0.error? These may indicate
what you did wrong.
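
One way to gather all of these in a single pass is sketched below. The helper names show_file and collect_diagnostics are hypothetical, and the case name is passed in as an argument on the assumption that it matches the name of the working directory.

```shell
# Print a file under a header line, or note that it is missing.
show_file() {
    if [ -f "$1" ]; then
        echo "==== $1 ===="
        cat "$1"
    else
        echo "==== $1 (not found) ===="
    fi
}

# Dump the files asked about above; $1 is the case name (an assumption, see lead-in).
collect_diagnostics() {
    case_name=$1
    show_file "$WIENROOT/parallel_options"
    show_file "$case_name.scf0"
    # If nothing matches, the unexpanded pattern is reported as "not found".
    for f in "$case_name".output000*; do
        show_file "$f"
    done
    show_file lapw0.error
}
```

Run from the case directory, for example as collect_diagnostics "$(basename "$PWD")" > diagnostics.txt, this produces one file that can be pasted into a reply.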

Professor Laurence Marks
"Research is to see what everybody else has seen, and to think what nobody
else has thought", Albert Szent-Györgyi

On Mon, Oct 12, 2020, 15:17 Christian Søndergaard Pedersen wrote:

> Dear everybody
> I am following up on this thread to report on two separate errors in my
> attempts to properly parallelize a calculation. For the first, a
> calculation utilized 0.00% of available CPU resources. My .machines file
> looks like this:
> #
> dstart:g004:8 g010:8 g011:8 g040:8
> lapw0:g004:8 g010:8 g011:8 g040:8
> 1:g004:16
> 1:g010:16
> 1:g011:16
> 1:g040:16
> With my submit script calling the following commands:
> srun hostname -s > slurm.hosts
> run_lapw -p
> x qtl -p -telnes
> Of course, the job didn't reach x qtl. The resultant case.dayfile is
> short, so I am dumping all of it here:
> Calculating test-machines in /path/to/directory
> on with PID XXXXX
> using WIEN2k_19.1 (Release 25/6/2019) in
> /path/to/installation/directory/WIEN2k/19.1-intel-2019a
>     start       (Mon Oct 12 19:04:06 CEST 2020) with lapw0 (40/99 to go)
>     cycle 1     (Mon Oct 12 19:04:06 CEST 2020)         (40/99 to go)
> >   lapw0   -p  (19:04:06) starting parallel lapw0 at Mon Oct 12 19:04:06
> CEST 2020
> -------- .machine0 : 32 processors
> [1] 16095
> The .machine0 file displays the lines
> g004 [repeated for 8 lines]
> g010 [repeated for 8 lines]
> g011 [repeated for 8 lines]
> g040 [repeated for 8 lines]
> which tells me that the .machines file works as intended, and that the
> cause of the problem is located somewhere else. Which brings me to the
> second error, which occurred when I tried calling mpirun explicitly like so:
> srun hostname -s > slurm.hosts
> mpirun run_lapw -p
> mpirun qtl -p -telnes
> from within the job script. This crashed the job right away. The
> lapw0.error file prints out "Error in Parallel lapw0" and "check ERROR
> FILES!" a number of times. The case.clmsum file is present and looks
> correct, and the .machines file looks like the one from before (with
> different node numbers). However, the .machine0 file now looks like:
> g094
> g094
> g094
> g081
> g081
> g08g094
> g094
> g094
> g094
> g094
> [...]
> I.e. there's an error on line 6, where a node is not properly named and a
> line break is missing. The dayfile repeatedly prints out "> stop error" a
> total of sixteen times. I don't know if the above .machine0 file is the
> culprit, but it seems the obvious conclusion. Any help in this matter will
> be much appreciated.
> Best regards
> Christian
> _______________________________________________
> Wien mailing list