Dear all, I am running WIEN2k 11.1 on a cluster with CentOS 6 under a PBS queuing system. The job is submitted in k-point parallel mode, with the 36 k-points in total divided among 16 CPUs. However, lapw2 fails with errors, and the dnlapw2_18/19/20.error files are not empty. At the same time, the job appears dead in the PBS system and cannot be killed with the PBS command.

The administrator checked the compute node: top shows a very heavy load, above 40, and ps aux shows 16 lapw2 processes, none of which is running; they appear to be suspended. The job caused a heavy load and triggered the self-protection mechanism of the OS, which automatically suspends all running processes, including ssh logins, except those of the root account.
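To see which parallel parts reported a failure, the non-empty error files in the case directory can be listed. The following is only a minimal sketch: the function name `nonempty_error_files` and the `*.error` glob pattern are assumptions made for illustration and are not part of WIEN2k.

```python
from pathlib import Path


def nonempty_error_files(case_dir="."):
    """Return the names of all non-empty *.error files in case_dir, sorted.

    Hypothetical helper: the '*.error' pattern is an assumption based on
    file names such as dnlapw2_18.error.
    """
    return sorted(
        p.name
        for p in Path(case_dir).glob("*.error")
        if p.is_file() and p.stat().st_size > 0
    )


if __name__ == "__main__":
    # Print one non-empty error file per line for the current directory.
    for name in nonempty_error_files():
        print(name)
```

Run from the case directory, this prints only the error files that contain a message, which narrows down the parts of the parallel run that failed.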
Any comments will be appreciated, and thanks in advance. The error files and case.dayfile follow.

--------------------dnlapw2_18/19/20.error------------------
Error in LAPW2
------------------------------------------------------------------------
---------------------case.output2dn_19------------------------
...
KVEC( 73563) = -19 -5 9 9.1046 1
KVEC( 73564) = -19 24 -9 9.1046 1
KVEC( 73565) = -19 24 9 9.1046 1
KVEC( 73566) = 19 -24 -9 9.1046 1
KVEC( 73567) = 19 -24 9 9.1046 1
KVEC( 73568) = 19 5 -9 9.1046 1
KVEC( 73569) = 19 5 9 9.1046 1
KVE
------------------------------------------------------------------------
--------------------case.dayfile-----------------------------------
...
[14] Done ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout2_$loop; if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop > .temp2_$loop; grep \% .temp2_$loop >> .time2_$loop; grep -v \% .temp2_$loop | perl -e "print stderr <STDIN>" )
[9] Done ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout2_$loop; if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop > .temp2_$loop; grep \% .temp2_$loop >> .time2_$loop; grep -v \% .temp2_$loop | perl -e "print stderr <STDIN>" )
[4] Done ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout2_$loop; if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop > .temp2_$loop; grep \% .temp2_$loop >> .time2_$loop; grep -v \% .temp2_$loop | perl -e "print stderr <STDIN>" )
[4] 18809
-----------------------------------------------------------------------------
-----------------------------:log--------------------------------------------
...
Thu Feb 2 17:58:03 CST 2012> (x) lapw1 -c -dn -p -orb
Thu Feb 2 19:46:53 CST 2012> (x) lapw2 -c -up -p
Thu Feb 2 19:51:36 CST 2012> (x) sumpara -up -d
Thu Feb 2 19:52:07 CST 2012> (x) lapw2 -c -dn -p
--------------------------------------------------------------------------------

(If more information is needed, I will provide it.)

Best,
--
Bin Shao, Ph.D. Candidate
College of Information Technical Science, Nankai University
94 Weijin Rd. Nankai Dist. Tianjin 300071, China
Email: bshao at mail.nankai.edu.cn