[Wien] errors in lapw

2012-02-03 Thread Bin Shao
Dear all,

I am running WIEN2k 11.1 on a cluster with CentOS 6 under a PBS queuing
system. The job is submitted in k-point parallel mode, and the 36 k-points
are distributed over 16 CPUs. However, some errors occur in lapw2, and the
dnlapw2_18/19/20.error files are not empty. At the same time, the job in the
PBS system appears dead and cannot be killed by the PBS commands. The
administrator checked the compute node: top shows that the node is under a
very heavy load (above 40), and ps aux shows 16 lapw2 processes that are not
running, i.e. they are suspended. The jobs caused the heavy load and
triggered the self-protection mechanism of the OS, which automatically
suspends any running process, including ssh logins, except for the root
account.

Any comments will be appreciated, and thanks in advance.

The following are the error files and the case.dayfile.
dnlapw2_18/19/20.error--
Error in LAPW2


-case.output2dn_19
...
   KVEC( 73563) =   -19    -5     9     9.10461
   KVEC( 73564) =   -19    24    -9     9.10461
   KVEC( 73565) =   -19    24     9     9.10461
   KVEC( 73566) =    19   -24    -9     9.10461
   KVEC( 73567) =    19   -24     9     9.10461
   KVEC( 73568) =    19     5    -9     9.10461
   KVEC( 73569) =    19     5     9     9.10461
   KVE


case.dayfile---
...
[14]   Done  ( ( $remote $machine[$p] "cd $PWD;$t
$exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f
.lock_$lockfile[$p] ) >& .stdout2_$loop; if ( -f .stdout2_$loop )
bashtime2csh.pl_lapw .stdout2_$loop > .temp2_$loop; grep \% .temp2_$loop >>
.time2_$loop; grep -v \% .temp2_$loop | perl -e "print stderr " )
[9]Done  ( ( $remote $machine[$p] "cd $PWD;$t
$exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f
.lock_$lockfile[$p] ) >& .stdout2_$loop; if ( -f .stdout2_$loop )
bashtime2csh.pl_lapw .stdout2_$loop > .temp2_$loop; grep \% .temp2_$loop >>
.time2_$loop; grep -v \% .temp2_$loop | perl -e "print stderr " )
[4]Done  ( ( $remote $machine[$p] "cd $PWD;$t
$exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f
.lock_$lockfile[$p] ) >& .stdout2_$loop; if ( -f .stdout2_$loop )
bashtime2csh.pl_lapw .stdout2_$loop > .temp2_$loop; grep \% .temp2_$loop >>
.time2_$loop; grep -v \% .temp2_$loop | perl -e "print stderr " )
[4] 18809
-

-:log
...
Thu Feb  2 17:58:03 CST 2012> (x) lapw1 -c -dn -p -orb
Thu Feb  2 19:46:53 CST 2012> (x) lapw2 -c -up -p
Thu Feb  2 19:51:36 CST 2012> (x) sumpara -up -d
Thu Feb  2 19:52:07 CST 2012> (x) lapw2 -c -dn -p


(If more information is needed, I will provide it.)

Best,

-- 
Bin Shao, Ph.D. Candidate
College of Information Technical Science, Nankai University
94 Weijin Rd. Nankai Dist. Tianjin 300071, China
Email: bshao at mail.nankai.edu.cn

[Wien] errors in lapw

2012-02-03 Thread Peter Blaha


Clearly you should write your job script such that it divides the 36 k-points
in a "meaningful" way.
In principle you can use 36, 18, 12, 9, 6, 4, or 3 parallel jobs, but 16 is
not meaningful.
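For example (the hostname "node01" is only a placeholder), a .machines file
for 12 k-parallel jobs on a single node could look roughly like the sketch
below; every job then gets exactly 3 of the 36 k-points:

  granularity:1
  1:node01
  1:node01
  1:node01
  1:node01
  1:node01
  1:node01
  1:node01
  1:node01
  1:node01
  1:node01
  1:node01
  1:node01

Each "1:node01" line starts one k-parallel lapw1/lapw2 job on that host.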

Furthermore, it seems that your cluster has problems with heavy I/O (NFS), and
this is most likely the reason for the observed high load and the crash. Thus
I would
i) not use too many cores. Does one node of your cluster really have 16 cores,
or is this just due to "multithreading" and in fact it has only 8? Do you have
enough memory per node?
ii) try to use a (local) $SCRATCH directory, which reduces the NFS load. But
this works only if your k-list and .machines file are "compatible", as
mentioned above.
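For example, a bash job script could set something along these lines (the path
is only an example; it must be a directory on the local disk of the compute
node, and with csh you would use setenv instead):

  export SCRATCH=/tmp/$USER/$PBS_JOBID   # local scratch directory (example path)
  mkdir -p $SCRATCH

WIEN2k then writes the large case.vector* files into this local directory
instead of the NFS-mounted working directory.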

It also seems to be a rather big calculation (lapw1 took nearly 2 h), thus you
may either need MPI, or you should not use all cores on one node of your
cluster because of memory restrictions.
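If you go the MPI route, a .machines line that lists several cores for one
k-point group (again only a sketch with a placeholder hostname), e.g.

  1:node01 node01 node01 node01

would run the MPI versions of lapw1/lapw2 with 4 processes for that group;
this of course requires that the MPI-parallel binaries were compiled.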



-- 

   P.Blaha
--
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tu

[Wien] errors in lapw

2012-02-04 Thread Bin Shao
Dear Prof. Peter Blaha,

Thank you for your reply.

> In principle you can use 36, 18, 12, 9, 6, 4, or 3 parallel jobs, but 16 is
> not meaningful.


The compute node really has 16 cores (two AMD Opteron(tm) 6136 processors) and
32 GB of memory. So the 36 k-points are distributed over 16 cores: 3 k-points
each on 4 cores and 2 k-points each on the other 12 cores. As you suggest, if
I use only 12 cores, lapw1 might take less time.
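Just as a sketch of what I have in mind (the exact script layout is my own
assumption), the .machines file for 12 k-parallel jobs could be generated in
the PBS job script from the assigned host list, e.g. in bash:

  # build .machines from the first 12 assigned processor slots
  echo "granularity:1" > .machines
  head -12 $PBS_NODEFILE | awk '{print "1:" $1}' >> .machines

so that each of the 12 jobs handles exactly 3 of the 36 k-points (or I could
simply request only 12 cores from PBS in the first place).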

> ii) try to use a (local) $SCRATCH directory, which reduces the NFS load. But
> this works only if your k-list and .machines file are "compatible", as
> mentioned above.


Actually, the administrator recently moved my /home directory to a local disk
on the login node. Before that, when /home was on a networked disk array, this
heavy I/O never occurred. I guess this may be the reason for the crash.

Any comments will be appreciated.

Best,

