[Wien] LAPW2 crashed when running in parallel

2010-11-01 Thread Maxim Rakitin
Hi,

It looks like Intel's mpirun doesn't have a '-machinefile' option. Instead 
it has a '-hostfile' option (from here: 
http://downloadmirror.intel.com/18462/eng/nes_release_notes.txt).

Try 'mpirun -h' for information about the available options and apply the appropriate one.
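
If -hostfile indeed turns out to be the right option for your mpirun, the change 
would be, for example, in the MPIRUN setting (the WIEN_MPIRUN line of 
$WIENROOT/parallel_options):

    setenv WIEN_MPIRUN "mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"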

Best regards,
Maxim Rakitin
email: rms85 at physics.susu.ac.ru
web: http://www.susu.ac.ru


01.11.2010 4:56, Wei Xie wrote:
 Dear all WIEN2k community members:

 We encountered a problem when running in parallel (K-point, MPI or 
 both): the calculations crash at LAPW2. Note that we have no problem 
 running it in serial. We have tried to diagnose the problem, recompile 
 the code with different options, and test different cases and 
 parameters based on similar problems reported on the mailing list, but 
 the problem persists. So we write here hoping someone can offer us 
 some suggestions. We have attached the related files below for your 
 reference. Your replies are appreciated in advance!

 This is a TiC example running in both K-point and MPI parallel on two 
 nodes r1i0n0 and r1i0n1 (8 cores/node):

 1. stdout (abridged)
 MPI: invalid option -machinefile
 real    0m0.004s
 user    0m0.000s
 sys     0m0.000s
 ...
 MPI: invalid option -machinefile
 real    0m0.003s
 user    0m0.000s
 sys     0m0.004s
 TiC.scf1up_1: No such file or directory.

 LAPW2 - Error. Check file lapw2.error
 cp: cannot stat `.in.tmp': No such file or directory
 rm: cannot remove `.in.tmp': No such file or directory
 rm: cannot remove `.in.tmp1': No such file or directory

 2. TiC.dayfile (abridged)
 ...
 start (Sun Oct 31 16:25:06 MDT 2010) with lapw0 (40/99 to go)
 cycle 1 (Sun Oct 31 16:25:06 MDT 2010) (40/99 to go)

lapw0 -p(16:25:06) starting parallel lapw0 at Sun Oct 31 16:25:07 
 MDT 2010
  .machine0 : 16 processors
 invalid local arg: -machinefile

 0.436u 0.412s 0:04.63 18.1%0+0k 2600+0io 1pf+0w
lapw1  -up -p (16:25:12) starting parallel lapw1 at Sun Oct 31 
 16:25:12 MDT 2010
 -  starting parallel LAPW1 jobs at Sun Oct 31 16:25:12 MDT 2010
 running LAPW1 in parallel mode (using .machines)
 2 number_of_parallel_jobs
  r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1) 
  r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1(1) 
  r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)Summary 
 of lapw1para:
r1i0n0 k=0 user=0 wallclock=0
r1i0n1 k=0 user=0 wallclock=0
 ...
 0.116u 0.316s 0:10.48 4.0%0+0k 0+0io 0pf+0w
lapw2 -up -p (16:25:34) running LAPW2 in parallel mode
 **  LAPW2 crashed!
 0.032u 0.104s 0:01.13 11.5%0+0k 82304+0io 8pf+0w
 error: command   /home/xiew/WIEN2k_10/lapw2para -up uplapw2.def   failed

 3. uplapw2.error
 Error in LAPW2
  'LAPW2' - can't open unit: 18
  'LAPW2' -filename: TiC.vspup
  'LAPW2' -  status: old  form: formatted
 **  testerror: Error in Parallel LAPW2

 4. .machines
 #
 1:r1i0n0:8
 1:r1i0n1:8
 lapw0:r1i0n0:8 r1i0n1:8
 granularity:1
 extrafine:1

 5. compilers, MPI and options
 Intel Compilers  and MKL 11.1.046
 Intel MPI 3.2.0.011

 current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
 current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
 current:LDFLAGS:$(FOPT) 
 -L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -pthread
 current:DPARALLEL:'-DParallel'
 current:R_LIBS:-lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread 
 -lmkl_core -openmp -lpthread -lguide
 current:RP_LIBS:-L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t 
 -lmkl_scalapack_lp64 
 /usr/local/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_solver_lp64.a 
 -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core 
 -lmkl_blacs_intelmpi_lp64 -Wl,--end-group -openmp -lpthread 
 -L/home/xiew/fftw-2.1.5/lib -lfftw_mpi -lfftw $(R_LIBS)
 current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_

 Best regards,
 Wei Xie
 Computational Materials Group
 University of Wisconsin-Madison




[Wien] LAPW2 crashed when running in parallel

2010-11-01 Thread Wei Xie
Hi Maxim,

Thanks for your reply! 
We tried MPIRUN=mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_, but the problem 
persists. The only difference is that stdout now shows "MPI: invalid option 
-hostfile".

Thanks,
Wei


On Oct 31, 2010, at 10:40 PM, Maxim Rakitin wrote:

 Hi,
 
 It looks like Intel's mpirun doesn't have a '-machinefile' option. Instead 
 it has a '-hostfile' option (from here: 
 http://downloadmirror.intel.com/18462/eng/nes_release_notes.txt).
 
 Try 'mpirun -h' for information about the available options and apply the appropriate one.
 Best regards,
Maxim Rakitin
email: rms85 at physics.susu.ac.ru
web: http://www.susu.ac.ru
 



[Wien] LAPW2 crashed when running in parallel

2010-11-01 Thread Maxim Rakitin
Dear Wei,

Maybe -machinefile is OK for your mpirun after all. Which options are appropriate 
for it? What does the help say?

Try restoring your MPIRUN variable with -machinefile and rerun the 
calculation. Then see what is in the .machine0/1/2 files and let us know. They 
should contain 8 lines with r1i0n0 and 8 lines with r1i0n1.

One more thing you should check is $WIENROOT/parallel_options file. What 
is its content?
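
For example (hypothetical commands, run from the case directory), both can be 
checked with:

    cat .machine0 .machine1 .machine2    # should show the 8+8 host lines
    cat $WIENROOT/parallel_options       # typically shows USE_REMOTE, MPI_REMOTE and WIEN_MPIRUN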

Best regards,
Maxim Rakitin
email: rms85 at physics.susu.ac.ru
web: http://www.susu.ac.ru


01.11.2010 9:06, Wei Xie wrote:
 Hi Maxim,

 Thanks for your reply!
 We tried MPIRUN=mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_, but the 
 problem persists. The only difference is that stdout now shows 
 "MPI: invalid option -hostfile".

 Thanks,
 Wei



[Wien] LAPW2 crashed when running in parallel

2010-11-01 Thread Lyudmila V. Dobysheva
01 Nov 2010 02:56:47 Wei Xie wrote:
 We encountered some problem when running in parallel (K-point, MPI or
  both)--the calculations crashed at LAPW2. Note we had no problem running
  it in serial.
 This is a TiC example running

Dear Wei,

Isn't the error connected with mixing spin-polarised and spin-UNpolarised setups? TiC is 
to be calculated unpolarised, as far as I know.
 1. stdout (abridged)
...
 TiC.scf1up_1: No such file or directory.

Was lapw1 really successful?

 3. uplapw2.error
 Error in LAPW2
  'LAPW2' - can't open unit: 18
  'LAPW2' -filename: TiC.vspup
  'LAPW2' -  status: old  form: formatted

It looks like your initialization was done without spin-polarization, but 
runsp_lapw was run. But in that case lapw1 should also have been unsuccessful (?).

Best regards,
Lyudmila Dobysheva
--
Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci.
426001 Izhevsk, ul.Kirova 132
RUSSIA
--
Tel.:7(3412) 442118 (home), 218988(office), 250614(Fax)
E-mail: lyu at otf.fti.udmurtia.su
lyuka17 at mail.ru
lyu at otf.pti.udm.ru
http://fti.udm.ru/content/view/25/103/lang,english/
--


[Wien] LAPW2 crashed when running in parallel

2010-11-01 Thread Wei Xie
Hi Maxim,

Thanks for the follow-up!

I think it is -machinefile that's appropriate. Here's the relevant line from the help:
-machinefile # file mapping procs to machine
No -hostfile option is mentioned in the help for my current version of MPI.

Yes, the .machine0/1/2 files are exactly as you described.

The parallel_options is: 
setenv USE_REMOTE 1
setenv MPI_REMOTE 1
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_

I think the problem is likely due to my MPI. However, even if we disable MPI 
parallelization, the problem still persists (no evident difference in the 
output files, including case.dayfile, stdout and :log). Note that we can run with 
exactly the same set of input files in serial mode with no problem.
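
(For reference, a minimal check of the raw mpirun behaviour, independent of the 
WIEN2k scripts, would be something like the following, run from the case 
directory with the .machine1 generated by lapw1para:

    which mpirun                                  # confirm which mpirun is actually picked up
    mpirun -np 8 -machinefile .machine1 hostname  # should print the node name 8 times

If this already fails with "invalid option -machinefile", the problem lies in the 
mpirun/option combination rather than in WIEN2k itself.)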

Again, thanks for your help!

Cheers,
Wei


On Oct 31, 2010, at 11:27 PM, Maxim Rakitin wrote:

 Dear Wei,
 
 Maybe -machinefile is OK for your mpirun after all. Which options are appropriate for 
 it? What does the help say?
 
 Try restoring your MPIRUN variable with -machinefile and rerun the 
 calculation. Then see what is in the .machine0/1/2 files and let us know. They 
 should contain 8 lines with r1i0n0 and 8 lines with r1i0n1.
 
 One more thing you should check is $WIENROOT/parallel_options file. What is 
 its content?
 Best regards,
Maxim Rakitin
email: rms85 at physics.susu.ac.ru
web: http://www.susu.ac.ru
 

[Wien] LAPW2 crashed when running in parallel

2010-11-01 Thread Maxim Rakitin
Hi Wei,

The parallel_options file controls how the parallel programs are launched, so change 
the following line in it:

setenv WIEN_MPIRUN mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_

to

setenv WIEN_MPIRUN mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_

Your .machine0/1/2 files are correct.

Also, I believe the 'USE_REMOTE' variable, which is set to 1, makes the 
parallel scripts (I mean lapw[012]para_lapw) launch jobs via 
ssh/rsh, so switch it to '0'. I'm not sure about the 'MPI_REMOTE' option, 
as it is a new one; try different values (0 or 1) for it.
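
With those changes, $WIENROOT/parallel_options would then read roughly as follows 
(the MPI_REMOTE value still to be tested):

    setenv USE_REMOTE 0
    setenv MPI_REMOTE 0
    setenv WIEN_GRANULARITY 1
    setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"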

Hope this will help.

Best regards,
Maxim Rakitin
email: rms85 at physics.susu.ac.ru
web: http://www.susu.ac.ru


01.11.2010 21:35, Wei Xie wrote:
 Hi Maxim,

 Thanks for the follow-up!

 I think it is -machinefile that's appropriate. Here's the relevant line from the help:
 -machinefile # file mapping procs to machine

 No -hostfile option is mentioned in the help for my current version of MPI.

 Yes, the machine0/1/2 files are exactly like what you described.

 The parallel_options is:
 setenv USE_REMOTE 1
 setenv MPI_REMOTE 1
 setenv WIEN_GRANULARITY 1
 setenv WIEN_MPIRUN mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_

 I think the problem is likely due to my MPI. However, even if we disable 
 MPI parallelization, the problem still persists (no evident difference 
 in the output files, including case.dayfile, stdout and :log). Note that we 
 can run with exactly the same set of input files in serial mode with 
 no problem.

 Again, thanks for your help!

 Cheers,
 Wei



[Wien] LAPW2 crashed when running in parallel

2010-11-01 Thread Wei Xie
Dear Lyudmila,

On Nov 1, 2010, at 8:36 AM, Lyudmila V. Dobysheva wrote:

 01 Nov 2010 02:56:47 Wei Xie wrote:
 We encountered some problem when running in parallel (K-point, MPI or
 both)--the calculations crashed at LAPW2. Note we had no problem running
 it in serial.
 This is a TiC example running
 
 Dear Wei,
 
 Isn't the error connected with spin-polarised - spin-UNpolarised cases? TiC 
 is 
 to be calculated unpolarised, as far as I know.
TiC is non-spin-polarized in the first example of the UG, but we ran it spin-polarized here just for 
testing. We can run SP calculations for TiC in serial without any problem.
 1. stdout (abridged)
 ...
 TiC.scf1up_1: No such file or directory.
 
 Was lapw1 really successful?
Thanks for your reminder. We checked and found that lapw1 was actually not 
successful either: there is no case.output1, case.output2 or other case.output file in 
the case directory. My guess is that the compute nodes are not communicating 
properly with the head node, so even if lapw1 finished OK, the output files 
were not written back from the compute nodes to the head node. We are testing the 
communication now.
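
(A simple check along these lines is what we have in mind; the case directory 
path here is just an example:

    ssh r1i0n0 hostname            # passwordless ssh to the compute node works
    ssh r1i0n0 ls /home/xiew/TiC   # the case directory is visible on the node

If either of these fails or asks for a password, the parallel scripts will not 
be able to collect the lapw1 output.)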
 
 3. uplapw2.error
 Error in LAPW2
 'LAPW2' - can't open unit: 18
 'LAPW2' -filename: TiC.vspup
 'LAPW2' -  status: old  form: formatted
 
 It looks like your initialization was done without spin-polarization, but 
 runsp_lapw was run. But in this case lapw1 must also be unsuccessful (?).
See my answers above.

Any further suggestions are appreciated in advance.

Thanks,
Wei
 
 Best regards,
 Lyudmila Dobysheva
 --
 Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci.
 426001 Izhevsk, ul.Kirova 132
 RUSSIA
 --
 Tel.:7(3412) 442118 (home), 218988(office), 250614(Fax)
 E-mail: lyu at otf.fti.udmurtia.su
lyuka17 at mail.ru
lyu at otf.pti.udm.ru
 http://fti.udm.ru/content/view/25/103/lang,english/
 --



[Wien] LAPW2 crashed when running in parallel

2010-10-31 Thread Wei Xie
Dear all WIEN2k community members:

We encountered a problem when running in parallel (K-point, MPI or 
both): the calculations crash at LAPW2. Note that we have no problem running it in 
serial. We have tried to diagnose the problem, recompile the code with 
different options, and test different cases and parameters based on 
similar problems reported on the mailing list, but the problem persists. So we 
write here hoping someone can offer us some suggestions. We have attached the 
related files below for your reference. Your replies are appreciated in 
advance!

This is a TiC example running in both K-point and MPI parallel on two nodes 
r1i0n0 and r1i0n1 (8 cores/node):

1. stdout (abridged) 
MPI: invalid option -machinefile
real    0m0.004s
user    0m0.000s
sys     0m0.000s
...
MPI: invalid option -machinefile
real    0m0.003s
user    0m0.000s
sys 0m0.004s
TiC.scf1up_1: No such file or directory.

LAPW2 - Error. Check file lapw2.error
cp: cannot stat `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp1': No such file or directory

2. TiC.dayfile (abridged) 
...
start   (Sun Oct 31 16:25:06 MDT 2010) with lapw0 (40/99 to go)
cycle 1 (Sun Oct 31 16:25:06 MDT 2010)  (40/99 to go)

   lapw0 -p(16:25:06) starting parallel lapw0 at Sun Oct 31 16:25:07 MDT 
 2010
 .machine0 : 16 processors
invalid local arg: -machinefile

0.436u 0.412s 0:04.63 18.1% 0+0k 2600+0io 1pf+0w
   lapw1  -up -p   (16:25:12) starting parallel lapw1 at Sun Oct 31 
 16:25:12 MDT 2010
-  starting parallel LAPW1 jobs at Sun Oct 31 16:25:12 MDT 2010
running LAPW1 in parallel mode (using .machines)
2 number_of_parallel_jobs
 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)  r1i0n1 
r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1(1)  r1i0n0 r1i0n0 r1i0n0 
r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)Summary of lapw1para:
   r1i0n0k=0 user=0  wallclock=0
   r1i0n1k=0 user=0  wallclock=0
...
0.116u 0.316s 0:10.48 4.0%  0+0k 0+0io 0pf+0w
   lapw2 -up -p(16:25:34) running LAPW2 in parallel mode
**  LAPW2 crashed!
0.032u 0.104s 0:01.13 11.5% 0+0k 82304+0io 8pf+0w
error: command   /home/xiew/WIEN2k_10/lapw2para -up uplapw2.def   failed

3. uplapw2.error 
Error in LAPW2
 'LAPW2' - can't open unit: 18
 'LAPW2' -filename: TiC.vspup 
 'LAPW2' -  status: old  form: formatted  
**  testerror: Error in Parallel LAPW2

4. .machines
#
1:r1i0n0:8
1:r1i0n1:8
lapw0:r1i0n0:8 r1i0n1:8 
granularity:1
extrafine:1

5. compilers, MPI and options
Intel Compilers  and MKL 11.1.046
Intel MPI 3.2.0.011

current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
current:LDFLAGS:$(FOPT) -L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t 
-pthread
current:DPARALLEL:'-DParallel'
current:R_LIBS:-lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core 
-openmp -lpthread -lguide
current:RP_LIBS:-L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t 
-lmkl_scalapack_lp64 
/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_solver_lp64.a 
-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core 
-lmkl_blacs_intelmpi_lp64 -Wl,--end-group -openmp -lpthread 
-L/home/xiew/fftw-2.1.5/lib -lfftw_mpi -lfftw $(R_LIBS)
current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_

Best regards,
Wei Xie
Computational Materials Group
University of Wisconsin-Madison
