[Wien] errors in lapw

2012-02-03 Thread Bin Shao
Dear all,

I am running WIEN2k 11.1 on a cluster with CentOS 6 under a PBS queuing
system. The job is submitted in k-point parallel mode, and the 36 k-points
in total are divided among 16 CPUs. However, errors occur in lapw2, and
the dnlapw2_18/19/20.error files are not empty. At the same time, the job
seems dead in the PBS system and cannot be killed with the PBS commands. The
administrator checked the compute node, and top shows that the node
is under a very heavy load, above 40. Further, ps aux shows that there
are 16 lapw2 processes, but they are not running (they appear suspended).
The job caused a heavy load and triggered the self-protection mechanism of
the OS, which automatically suspends any running process, including ssh
logins, except for the root account.

Any comments will be appreciated. Thanks in advance.

The following are the error files and case.dayfile.
--- dnlapw2_18/19/20.error ---
Error in LAPW2


--- case.output2dn_19 ---
...
   KVEC( 73563) =   -19   -599.10461
   KVEC( 73564) =   -19   24   -99.10461
   KVEC( 73565) =   -19   2499.10461
   KVEC( 73566) =19  -24   -99.10461
   KVEC( 73567) =19  -2499.10461
   KVEC( 73568) =195   -99.10461
   KVEC( 73569) =19599.10461
   KVE


--- case.dayfile ---
...
[14]   Done  ( ( $remote $machine[$p] cd $PWD;$t
$exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop; rm -f
.lock_$lockfile[$p] )  .stdout2_$loop; if ( -f .stdout2_$loop )
bashtime2csh.pl_lapw .stdout2_$loop  .temp2_$loop; grep \% .temp2_$loop 
.time2_$loop; grep -v \% .temp2_$loop | perl -e print stderr STDIN )
[9]Done  ( ( $remote $machine[$p] cd $PWD;$t
$exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop; rm -f
.lock_$lockfile[$p] )  .stdout2_$loop; if ( -f .stdout2_$loop )
bashtime2csh.pl_lapw .stdout2_$loop  .temp2_$loop; grep \% .temp2_$loop 
.time2_$loop; grep -v \% .temp2_$loop | perl -e print stderr STDIN )
[4]Done  ( ( $remote $machine[$p] cd $PWD;$t
$exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop; rm -f
.lock_$lockfile[$p] )  .stdout2_$loop; if ( -f .stdout2_$loop )
bashtime2csh.pl_lapw .stdout2_$loop  .temp2_$loop; grep \% .temp2_$loop 
.time2_$loop; grep -v \% .temp2_$loop | perl -e print stderr STDIN )
[4] 18809
-

--- :log ---
...
Thu Feb  2 17:58:03 CST 2012 (x) lapw1 -c -dn -p -orb
Thu Feb  2 19:46:53 CST 2012 (x) lapw2 -c -up -p
Thu Feb  2 19:51:36 CST 2012 (x) sumpara -up -d
Thu Feb  2 19:52:07 CST 2012 (x) lapw2 -c -dn -p


(If more information is needed, I will provide it.)

Best,

-- 
Bin Shao, Ph.D. Candidate
College of Information Technical Science, Nankai University
94 Weijin Rd. Nankai Dist. Tianjin 300071, China
Email: bshao at mail.nankai.edu.cn


[Wien] errors in lapw

2012-02-03 Thread Peter Blaha


Clearly you should write your job script such that it divides the 36 k-points
in a meaningful way.
In principle you can use 36, 18, 9, 6, 4, or 3 parallel jobs, but 16 is not
meaningful.
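
To illustrate the point about meaningful job counts, here is a minimal Python sketch. It assumes the k-points are split as evenly as possible over the jobs; the real distribution done by the WIEN2k parallel scripts depends on the .machines settings, so the helper names and the even-split model are illustrative assumptions only. The full set of divisors of 36 also includes 12, 2 and 1.

```python
def divisors(n):
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def load_per_job(n_kpoints, n_jobs):
    """Return the k-point count per job for an as-even-as-possible split."""
    base, extra = divmod(n_kpoints, n_jobs)
    # 'extra' jobs get one k-point more than the rest.
    return [base + 1] * extra + [base] * (n_jobs - extra)


n_kpoints = 36
print(divisors(n_kpoints))

for n_jobs in (16, 12, 9):
    load = load_per_job(n_kpoints, n_jobs)
    # The slowest job sets the wall time; idle slots are wasted core time.
    idle_slots = max(load) * n_jobs - n_kpoints
    print(n_jobs, "jobs: max k-points per job =", max(load),
          ", idle slots =", idle_slots)
```

Under this model, 16 jobs finish no sooner than 12 jobs would (the busiest job handles 3 k-points either way), while 16 jobs put more simultaneous I/O and memory pressure on the node.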

Furthermore, it seems that your cluster has problems with heavy I/O (NFS), and
this is most likely the reason for the observed high load and the crash. Thus I would
i) not use too many cores. Does one node of your cluster really have 16 cores, or is
this just due to multithreading, so that in fact it has only 8? Do you have enough
memory per node?
ii) try to use a (local) $SCRATCH directory, which reduces the NFS load. But
this works only if your k-list and .machines file are compatible, as mentioned above.
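
As a rough illustration of a .machines file that matches the k-list, the following Python sketch writes one "1:host" line per k-point parallel job. The function name build_machines and the host name node01 are hypothetical, and the granularity and extrafine settings shown are only one possible choice; please check them against the WIEN2k user's guide for your version. Setting $SCRATCH itself is not covered here.

```python
def build_machines(hosts, n_jobs):
    """Return .machines text with n_jobs k-point parallel lines spread over hosts."""
    if n_jobs < 1 or not hosts:
        raise ValueError("need at least one host and one job")
    lines = ["granularity:1"]
    for i in range(n_jobs):
        # One "1:host" line per k-point parallel job, cycling over the hosts.
        lines.append(f"1:{hosts[i % len(hosts)]}")
    lines.append("extrafine:1")
    return "\n".join(lines) + "\n"


# Hypothetical single node; 12 jobs divide the 36 k-points evenly.
print(build_machines(["node01"], 12), end="")
```

In a PBS job script the host list would typically be read from the file named by $PBS_NODEFILE instead of being hard-coded.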

It also seems to be a rather big calculation (lapw1 took nearly 2 h), thus you
may either need MPI, or you should not use all cores on one node of your cluster
because of memory restrictions.
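
The memory argument can be sketched in the same way. The numbers below (24 GB per node, 3 GB per lapw process) are purely hypothetical placeholders, not values from this thread; the actual memory use should be measured for the case at hand.

```python
def max_jobs_per_node(cores, node_mem_gb, mem_per_job_gb):
    """Cap the job count by both the core count and the available memory."""
    return max(1, min(cores, int(node_mem_gb // mem_per_job_gb)))


def pick_job_count(n_kpoints, limit):
    """Return the largest divisor of n_kpoints that does not exceed limit."""
    return max(d for d in range(1, limit + 1) if n_kpoints % d == 0)


# Hypothetical node: 16 cores, 24 GB RAM, about 3 GB per lapw process.
limit = max_jobs_per_node(cores=16, node_mem_gb=24, mem_per_job_gb=3)
print("memory-limited jobs:", limit)
print("jobs to use for 36 k-points:", pick_job_count(36, limit))
```

With these placeholder values, memory allows at most 8 simultaneous jobs, and the largest count that still divides 36 evenly is 6.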


On 03.02.2012 13:56, Bin Shao wrote:

-- 

   P.Blaha
--
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
--