Clearly you should write your job script such that it divides the 36 k-points
in a meaningful way.
In principle you can use 36, 18, 9, 6, 4, or 3 parallel jobs, but 16 is not
meaningful.
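As a rough illustration of why 16 is a poor choice, here is a minimal Python
sketch. It assumes the simplest possible model, in which every k-point costs
the same time and each job works on one k-point per round. The function names
are made up for this example and are not part of WIEN2k. It lists the job
counts up to 16 that divide 36 evenly (this also includes 12, 2 and 1) and
counts how many job slots sit idle in the last round.

```python
import math


def useful_job_counts(n_kpoints, max_jobs):
    """Job counts (up to max_jobs) that divide the k-points evenly."""
    return [n for n in range(1, max_jobs + 1) if n_kpoints % n == 0]


def load_balance(n_kpoints, n_jobs):
    """Return (rounds, idle_slots) for an idealized equal-cost model.

    rounds     : passes needed if each job handles one k-point per pass
    idle_slots : job slots left without work in the final pass
    """
    rounds = math.ceil(n_kpoints / n_jobs)
    idle_slots = rounds * n_jobs - n_kpoints
    return rounds, idle_slots


print(useful_job_counts(36, 16))  # [1, 2, 3, 4, 6, 9, 12]
print(load_balance(36, 16))       # (3, 12)
print(load_balance(36, 12))       # (3, 0)
```

In this simplified picture 16 jobs still need three rounds, the same as 12
jobs, but 12 slots are idle in the last round while all 16 processes compete
for memory and I/O.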
Furthermore, it seems that your cluster has problems with heavy I/O (NFS), and
this is most likely the reason for the observed high load and the crash. Thus
I would
i) not use too many cores. Does one node of your cluster really have 16 cores,
or is this just due to multithreading, and in fact it has only 8? Do you have
enough memory per node?
ii) try to use a (local) $SCRATCH directory, which reduces the NFS load. But
this works only if your k-list and .machines file are compatible, as mentioned
above.
It also seems to be a fairly big calculation (lapw1 took nearly 2 h), so you
may either need MPI, or you should not use all cores of one node on your
cluster because of memory restrictions.
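A job script could generate a matching .machines file along the following
lines. This is only a sketch under stated assumptions: node01 is a
hypothetical hostname, the helper name is invented for this example, and the
choice of 9 jobs on one node is just one of the meaningful divisions of 36
k-points that stays well below 16 cores. The "1:host" lines and the
granularity/extrafine options follow the usual layout of a k-point parallel
.machines file as far as I know, so please check them against the user guide
of your WIEN2k version.

```python
def machines_file(hosts, jobs_per_host):
    """Build the text of a k-point parallel .machines file.

    hosts         : list of hostnames (hypothetical names in this example)
    jobs_per_host : number of k-point parallel jobs to start on each host
    """
    lines = []
    for host in hosts:
        # One "weight:hostname" line per parallel job, all with equal weight.
        lines.extend(f"1:{host}" for _ in range(jobs_per_host))
    # Assumed options: one chunk of k-points per job, no further splitting.
    lines.append("granularity:1")
    lines.append("extrafine:1")
    return "\n".join(lines) + "\n"


# 9 jobs on a single node: 36 k-points give 4 k-points per job.
print(machines_file(["node01"], 9), end="")
```

In a real PBS job the host list would come from the node file provided by the
queuing system rather than being typed in by hand.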
On 03.02.2012 13:56, Bin Shao wrote:
Dear all,
I am running WIEN2k 11.1 on a cluster with CentOS 6 under a PBS queuing
system. The job is submitted in k-point parallel mode, and the total of 36
k-points is divided among 16 CPUs.
However, errors occur in lapw2, and the dnlapw2_18/19/20.error files are not
empty. At the same time, the job in the PBS system seems dead and cannot be
killed with the PBS command. The administrator checked the computing node, and
top shows that the node is experiencing a very heavy load, above 40. Further,
ps aux shows that there are 16 lapw2 processes, but they are not running, i.e.
they are suspended. The jobs caused a heavy load and triggered the
self-protection mechanism of the OS, which automatically suspends any running
process, including ssh logins, except for the root account.
Any comments will be appreciated, and thanks in advance.
The error files and case.dayfile follow.
dnlapw2_18/19/20.error--
Error in LAPW2
-case.output2dn_19
...
KVEC( 73563) = -19 -599.10461
KVEC( 73564) = -19 24 -99.10461
KVEC( 73565) = -19 2499.10461
KVEC( 73566) =19 -24 -99.10461
KVEC( 73567) =19 -2499.10461
KVEC( 73568) =195 -99.10461
KVEC( 73569) =19599.10461
KVE
case.dayfile---
...
[14] Done ( ( $remote $machine[$p] cd $PWD;$t
$exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop; rm -f
.lock_$lockfile[$p] ) .stdout2_$loop;
if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop .temp2_$loop;
grep \% .temp2_$loop .time2_$loop; grep -v \% .temp2_$loop | perl -e
print stderr STDIN )
[9]Done ( ( $remote $machine[$p] cd $PWD;$t
$exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop; rm -f
.lock_$lockfile[$p] ) .stdout2_$loop;
if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop .temp2_$loop;
grep \% .temp2_$loop .time2_$loop; grep -v \% .temp2_$loop | perl -e
print stderr STDIN )
[4]Done ( ( $remote $machine[$p] cd $PWD;$t
$exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop; rm -f
.lock_$lockfile[$p] ) .stdout2_$loop;
if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop .temp2_$loop;
grep \% .temp2_$loop .time2_$loop; grep -v \% .temp2_$loop | perl -e
print stderr STDIN )
[4] 18809
-
-:log
...
Thu Feb 2 17:58:03 CST 2012 (x) lapw1 -c -dn -p -orb
Thu Feb 2 19:46:53 CST 2012 (x) lapw2 -c -up -p
Thu Feb 2 19:51:36 CST 2012 (x) sumpara -up -d
Thu Feb 2 19:52:07 CST 2012 (x) lapw2 -c -dn -p
(If more information is needed, I will provide.)
Best,
--
Bin Shao, Ph.D. Candidate
College of Information Technical Science, Nankai University
94 Weijin Rd. Nankai Dist. Tianjin 300071, China
Email: bshao at mail.nankai.edu.cn
--
P.Blaha
--
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
--