[Wien] LAPW2 crashed when running in parallel
Hi,

It looks like Intel's mpirun doesn't have a '-machinefile' option. Instead it has a '-hostfile' option (from here: http://downloadmirror.intel.com/18462/eng/nes_release_notes.txt). Try 'mpirun -h' for information about the available options and apply the appropriate one.

Best regards,
Maxim Rakitin
email: rms85 at physics.susu.ac.ru
web: http://www.susu.ac.ru

01.11.2010 4:56, Wei Xie wrote:
> Dear all WIEN2k community members: We encountered some problems when
> running in parallel (k-point, MPI or both) -- the calculations crashed
> at LAPW2. Note we had no problem running it in serial. [...]
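A quick way to see which host-list flag a given mpirun actually accepts (a minimal sketch; the exact option names vary between MPI distributions):

    # List mpirun's help text and search it for the host-list options;
    # Intel MPI builds typically accept -machinefile, others -hostfile.
    mpirun -h 2>&1 | grep -i -E 'machinefile|hostfile'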
[Wien] LAPW2 crashed when running in parallel
Hi Maxim,

Thanks for your reply! We tried MPIRUN='mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_', but the problem persists. The only difference is that stdout now shows "MPI: invalid option -hostfile" instead.

Thanks,
Wei

On Oct 31, 2010, at 10:40 PM, Maxim Rakitin wrote:
> It looks like Intel's mpirun doesn't have a '-machinefile' option.
> Instead it has a '-hostfile' option. Try 'mpirun -h' for information
> about the available options and apply the appropriate one. [...]
[Wien] LAPW2 crashed when running in parallel
Dear Wei,

Maybe -machinefile is OK for your mpirun after all. Which options are appropriate for it? What does the help say? Try restoring your MPIRUN variable with -machinefile and rerun the calculation. Then see what is in the .machine0/1/2 files and let us know: they should contain 8 lines of the r1i0n0 node and 8 lines of the r1i0n1 node.

One more thing you should check is the $WIENROOT/parallel_options file. What is its content?

Best regards,
Maxim Rakitin
email: rms85 at physics.susu.ac.ru
web: http://www.susu.ac.ru

01.11.2010 9:06, Wei Xie wrote:
> We tried MPIRUN=mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_, but the
> problem persists. The only difference is that stdout changes to
> "MPI: invalid option -hostfile". [...]
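For reference, a .machine0 consistent with the dayfile's ".machine0 : 16 processors" line would simply list one hostname per MPI process; given the .machines file in the original report, something like this (a sketch based on the 8-cores-per-node layout):

    r1i0n0
    r1i0n0
    r1i0n0
    r1i0n0
    r1i0n0
    r1i0n0
    r1i0n0
    r1i0n0
    r1i0n1
    r1i0n1
    r1i0n1
    r1i0n1
    r1i0n1
    r1i0n1
    r1i0n1
    r1i0n1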
[Wien] LAPW2 crashed when running in parallel
01 Nov 2010 02:56:47 Wei Xie wrote:
> We encountered some problems when running in parallel (k-point, MPI or
> both) -- the calculations crashed at LAPW2. Note we had no problem
> running it in serial. This is a TiC example running [...]

Dear Wei,

Isn't the error connected with mixing the spin-polarised and spin-UNpolarised cases? TiC is to be calculated unpolarised, as far as I know.

> 1. stdout (abridged)
> ...
> TiC.scf1up_1: No such file or directory.

Was lapw1 really successful?

> 3. uplapw2.error
> Error in LAPW2
> 'LAPW2' - can't open unit: 18
> 'LAPW2' - filename: TiC.vspup
> 'LAPW2' - status: old  form: formatted

It looks like your initialization was done without spin-polarization, but runsp_lapw was run. But in that case lapw1 must also be unsuccessful (?).

Best regards,
Lyudmila Dobysheva
--
Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci.
426001 Izhevsk, ul. Kirova 132, RUSSIA
Tel.: 7(3412) 442118 (home), 218988 (office), 250614 (fax)
E-mail: lyu at otf.fti.udmurtia.su, lyuka17 at mail.ru, lyu at otf.pti.udm.ru
http://fti.udm.ru/content/view/25/103/lang,english/
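One quick way to test that hypothesis is to look for the spin-polarised potential files the run expects (a hypothetical check; TiC.vspup is exactly the file LAPW2 reports it cannot open):

    # After a spin-polarised initialization, both spin channels of the
    # potential should exist in the case directory; if only TiC.vsp is
    # present, the initialization was non-spin-polarised.
    ls -l TiC.vsp TiC.vspup TiC.vspdn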
[Wien] LAPW2 crashed when running in parallel
Hi Maxim,

Thanks for the follow-up! I think it is -machinefile that's appropriate. Here's the help:

    -machinefile          # file mapping procs to machine

No -hostfile option is mentioned in the help for my current version of MPI.

Yes, the .machine0/1/2 files are exactly like what you described. The parallel_options file is:

    setenv USE_REMOTE 1
    setenv MPI_REMOTE 1
    setenv WIEN_GRANULARITY 1
    setenv WIEN_MPIRUN mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_

I think the problem should be due to my MPI. However, even if we disable MPI parallelization, the problem still persists (no evident difference in the output files, including case.dayfile, stdout and :log). Note we can run with exactly the same set of input files in serial mode with no problem.

Again, thanks for your help!

Cheers,
Wei

On Oct 31, 2010, at 11:27 PM, Maxim Rakitin wrote:
> Try to restore your MPIRUN variable with -machinefile and rerun the
> calculation. Then see what is in .machine0/1/2 files and let us know.
> [...] One more thing you should check is $WIENROOT/parallel_options
> file. What is its content? [...]
[Wien] LAPW2 crashed when running in parallel
Hi Wei,

The parallel_options file controls how the parallel programs are run, so change the following line in it:

    setenv WIEN_MPIRUN mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_

to

    setenv WIEN_MPIRUN mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_

Your .machine0/1/2 files are correct. Also, I believe the 'USE_REMOTE' variable, which is set to 1, makes the parallel scripts (I mean lapw[012]para_lapw) launch jobs via ssh/rsh, so switch it to '0'. I'm not sure about the 'MPI_REMOTE' option, it's a new one; try different values (0 or 1) for it.

Hope this will help.

Best regards,
Maxim Rakitin
email: rms85 at physics.susu.ac.ru
web: http://www.susu.ac.ru

01.11.2010 21:35, Wei Xie wrote:
> I think it is -machinefile that's appropriate. [...] I think the
> problem should be due to my MPI. However, even if we disable MPI
> parallelization, the problem still persists. [...] Note we can run
> with exactly the same set of input files in serial mode with no
> problem. [...]
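Putting those suggestions together, the edited $WIENROOT/parallel_options would look roughly like this (a sketch only; the right MPI_REMOTE value still has to be tested, and the quoting follows csh conventions):

    # $WIENROOT/parallel_options (csh syntax, read by the lapw*para scripts)
    setenv USE_REMOTE 0          # do not launch k-point parallel jobs via ssh/rsh
    setenv MPI_REMOTE 0          # try 0 or 1, as suggested above
    setenv WIEN_GRANULARITY 1
    setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"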
[Wien] LAPW2 crashed when running in parallel
Dear Lyudmila,

On Nov 1, 2010, at 8:36 AM, Lyudmila V. Dobysheva wrote:
> Isn't the error connected with mixing the spin-polarised and
> spin-UNpolarised cases? TiC is to be calculated unpolarised, as far
> as I know.

TiC is non-SP in the first example of the UG, but we ran it with SP here just for testing. We can run SP calculations for TiC in serial.

> > TiC.scf1up_1: No such file or directory.
>
> Was lapw1 really successful?

Thanks for your reminder. We checked and found that lapw1 was actually not successful either -- there is no case.output1, case.output2 or case.output? file in the case directory. My guess is that the computing nodes are not communicating well with the headnode, so that even if lapw1 finished OK, the output files were not written back from the computing nodes to the headnode. We are testing the communication now.

> It looks like your initialization was done without spin-polarization,
> but runsp_lapw was run. But in this case lapw1 must also be
> unsuccessful (?).

See my answers above. Your possible follow-up is appreciated in advance.

Thanks,
Wei
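A minimal way to test headnode-to-node communication and the shared case directory (a hypothetical check; it assumes the case directory is meant to be visible under the same path on the compute nodes):

    # From the headnode, inside the case directory: check passwordless
    # login and that the compute node sees and can write into the same path.
    ssh r1i0n1 "hostname; cd $PWD && touch .write_test && ls -l .write_test"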
[Wien] LAPW2 crashed when running in parallel
Dear all WIEN2k community members:

We encountered some problems when running in parallel (k-point, MPI or both) -- the calculations crashed at LAPW2. Note we had no problem running it in serial. We have tried to diagnose the problem, recompile the code with different options and test with different cases and parameters based on similar problems reported on the mailing list, but the problem persists. So we write here hoping someone can offer us some suggestions. We have attached the related files below for your reference. Your replies are appreciated in advance!

This is a TiC example running in both k-point and MPI parallel on two nodes, r1i0n0 and r1i0n1 (8 cores/node):

1. stdout (abridged)

MPI: invalid option -machinefile
real 0m0.004s
user 0m0.000s
sys 0m0.000s
...
MPI: invalid option -machinefile
real 0m0.003s
user 0m0.000s
sys 0m0.004s
TiC.scf1up_1: No such file or directory.
LAPW2 - Error. Check file lapw2.error
cp: cannot stat `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp1': No such file or directory

2. TiC.dayfile (abridged)

...
start (Sun Oct 31 16:25:06 MDT 2010) with lapw0 (40/99 to go)
cycle 1 (Sun Oct 31 16:25:06 MDT 2010) (40/99 to go)
lapw0 -p (16:25:06) starting parallel lapw0 at Sun Oct 31 16:25:07 MDT 2010
.machine0 : 16 processors
invalid local arg: -machinefile
0.436u 0.412s 0:04.63 18.1% 0+0k 2600+0io 1pf+0w
lapw1 -up -p (16:25:12) starting parallel lapw1 at Sun Oct 31 16:25:12 MDT 2010
- starting parallel LAPW1 jobs at Sun Oct 31 16:25:12 MDT 2010
running LAPW1 in parallel mode (using .machines)
2 number_of_parallel_jobs
r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)
r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1(1)
r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)
Summary of lapw1para:
r1i0n0 k=0 user=0 wallclock=0
r1i0n1 k=0 user=0 wallclock=0
...
0.116u 0.316s 0:10.48 4.0% 0+0k 0+0io 0pf+0w
lapw2 -up -p (16:25:34) running LAPW2 in parallel mode
** LAPW2 crashed!
0.032u 0.104s 0:01.13 11.5% 0+0k 82304+0io 8pf+0w
error: command /home/xiew/WIEN2k_10/lapw2para -up uplapw2.def failed

3. uplapw2.error

Error in LAPW2
'LAPW2' - can't open unit: 18
'LAPW2' - filename: TiC.vspup
'LAPW2' - status: old  form: formatted
** testerror: Error in Parallel LAPW2

4. .machines

#
1:r1i0n0:8
1:r1i0n1:8
lapw0:r1i0n0:8 r1i0n1:8
granularity:1
extrafine:1

5. compilers, MPI and options

Intel Compilers and MKL 11.1.046
Intel MPI 3.2.0.011
current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
current:LDFLAGS:$(FOPT) -L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -pthread
current:DPARALLEL:'-DParallel'
current:R_LIBS:-lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
current:RP_LIBS:-L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -lmkl_scalapack_lp64 /usr/local/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_solver_lp64.a -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -Wl,--end-group -openmp -lpthread -L/home/xiew/fftw-2.1.5/lib -lfftw_mpi -lfftw $(R_LIBS)
current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_

Best regards,
Wei Xie
Computational Materials Group
University of Wisconsin-Madison
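As a reading aid, the .machines file above breaks down as follows (a sketch; the comments reflect common WIEN2k .machines conventions and are annotations, not part of the original file):

    # .machines -- lines starting with # are comments
    1:r1i0n0:8                 # one k-point parallel job (weight 1): 8 MPI processes on r1i0n0
    1:r1i0n1:8                 # a second job (weight 1): 8 MPI processes on r1i0n1
    lapw0:r1i0n0:8 r1i0n1:8    # lapw0 runs MPI-parallel over 16 processes on both nodes
    granularity:1              # k-point distribution granularity
    extrafine:1                # distribute leftover k-points one by one

This matches the dayfile output above: "2 number_of_parallel_jobs" for lapw1 and ".machine0 : 16 processors" for lapw0.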