Dear Wei,

Maybe -machinefile is OK for your mpirun. Which options are appropriate for it? What does the help say?
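For context, the WIEN2k parallel scripts expand the _NP_, _HOSTS_ and _EXEC_ placeholders in the MPIRUN template before calling mpirun. A minimal sketch of that substitution (the values 16, .machine0 and "lapw0_mpi lapw0.def" are illustrative, not taken from your setup):

```shell
# Expand the placeholders in a WIEN2k-style MPIRUN template
# (illustrative values; WIEN2k's scripts do this internally).
MPIRUN='mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_'
cmd=$(echo "$MPIRUN" | sed -e 's/_NP_/16/' \
                           -e 's/_HOSTS_/.machine0/' \
                           -e 's/_EXEC_/lapw0_mpi lapw0.def/')
echo "$cmd"
# -> mpirun -np 16 -machinefile .machine0 lapw0_mpi lapw0.def
```

So whatever flag replaces -machinefile must be one your site's mpirun actually accepts; the rest of the line is filled in automatically.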
Try to restore your MPIRUN variable with -machinefile and rerun the calculation. Then see what is in the .machine0/1/2 files and let us know. They should contain 8 lines with the r1i0n0 node and 8 lines with the r1i0n1 node. One more thing you should check is the $WIENROOT/parallel_options file. What is its content?

Best regards,
Maxim Rakitin
email: rms85 at physics.susu.ac.ru
web: http://www.susu.ac.ru

01.11.2010 9:06, Wei Xie wrote:
> Hi Maxim,
>
> Thanks for your reply!
> We tried MPIRUN=mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_, but the
> problem persists. The only difference is that stdout changes to
> "MPI: invalid option -hostfile".
>
> Thanks,
> Wei
>
> On Oct 31, 2010, at 10:40 PM, Maxim Rakitin wrote:
>
>> Hi,
>>
>> It looks like Intel's mpirun doesn't have a '-machinefile' option.
>> Instead it has a '-hostfile' option (from here:
>> http://downloadmirror.intel.com/18462/eng/nes_release_notes.txt).
>>
>> Try 'mpirun -h' for information about the available options and apply
>> the appropriate one.
>>
>> Best regards,
>> Maxim Rakitin
>> email: rms85 at physics.susu.ac.ru
>> web: http://www.susu.ac.ru
>>
>> 01.11.2010 4:56, Wei Xie wrote:
>>> Dear WIEN2k community members:
>>>
>>> We encountered a problem when running in parallel (k-point, MPI or
>>> both): the calculations crash at LAPW2. Note that we have no problem
>>> running in serial. We have tried to diagnose the problem, recompiled
>>> the code with different options, and tested different cases and
>>> parameters based on similar problems reported on the mailing list,
>>> but the problem persists. So we write here hoping someone can offer
>>> a suggestion. We have attached the related files below for your
>>> reference. Your replies are appreciated in advance!
>>>
>>> This is the TiC example running in both k-point and MPI parallel
>>> mode on two nodes, r1i0n0 and r1i0n1 (8 cores/node):
>>>
>>> 1. stdout (abridged)
>>> MPI: invalid option -machinefile
>>> real 0m0.004s
>>> user 0m0.000s
>>> sys 0m0.000s
>>> ...
>>> MPI: invalid option -machinefile
>>> real 0m0.003s
>>> user 0m0.000s
>>> sys 0m0.004s
>>> TiC.scf1up_1: No such file or directory.
>>>
>>> LAPW2 - Error. Check file lapw2.error
>>> cp: cannot stat `.in.tmp': No such file or directory
>>> rm: cannot remove `.in.tmp': No such file or directory
>>> rm: cannot remove `.in.tmp1': No such file or directory
>>>
>>> 2. TiC.dayfile (abridged)
>>> ...
>>> start (Sun Oct 31 16:25:06 MDT 2010) with lapw0 (40/99 to go)
>>> cycle 1 (Sun Oct 31 16:25:06 MDT 2010) (40/99 to go)
>>>
>>> > lapw0 -p (16:25:06) starting parallel lapw0 at Sun Oct 31 16:25:07 MDT 2010
>>> -------- .machine0 : 16 processors
>>> invalid "local" arg: -machinefile
>>>
>>> 0.436u 0.412s 0:04.63 18.1% 0+0k 2600+0io 1pf+0w
>>> > lapw1 -up -p (16:25:12) starting parallel lapw1 at Sun Oct 31 16:25:12 MDT 2010
>>> -> starting parallel LAPW1 jobs at Sun Oct 31 16:25:12 MDT 2010
>>> running LAPW1 in parallel mode (using .machines)
>>> 2 number_of_parallel_jobs
>>> r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)
>>> r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1(1)
>>> r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)
>>> Summary of lapw1para:
>>> r1i0n0 k=0 user=0 wallclock=0
>>> r1i0n1 k=0 user=0 wallclock=0
>>> ...
>>> 0.116u 0.316s 0:10.48 4.0% 0+0k 0+0io 0pf+0w
>>> > lapw2 -up -p (16:25:34) running LAPW2 in parallel mode
>>> ** LAPW2 crashed!
>>> 0.032u 0.104s 0:01.13 11.5% 0+0k 82304+0io 8pf+0w
>>> error: command /home/xiew/WIEN2k_10/lapw2para -up uplapw2.def failed
>>>
>>> 3. uplapw2.error
>>> Error in LAPW2
>>> 'LAPW2' - can't open unit: 18
>>> 'LAPW2' - filename: TiC.vspup
>>> 'LAPW2' - status: old form: formatted
>>> ** testerror: Error in Parallel LAPW2
>>>
>>> 4. .machines
>>> #
>>> 1:r1i0n0:8
>>> 1:r1i0n1:8
>>> lapw0:r1i0n0:8 r1i0n1:8
>>> granularity:1
>>> extrafine:1
>>>
>>> 5. Compilers, MPI and options
>>> Intel Compilers and MKL 11.1.046
>>> Intel MPI 3.2.0.011
>>>
>>> current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
>>> current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
>>> current:LDFLAGS:$(FOPT) -L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -pthread
>>> current:DPARALLEL:'-DParallel'
>>> current:R_LIBS:-lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
>>> current:RP_LIBS:-L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -lmkl_scalapack_lp64 /usr/local/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_solver_lp64.a -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -Wl,--end-group -openmp -lpthread -L/home/xiew/fftw-2.1.5/lib -lfftw_mpi -lfftw $(R_LIBS)
>>> current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
>>>
>>> Best regards,
>>> Wei Xie
>>> Computational Materials Group
>>> University of Wisconsin-Madison
>>>
>>> _______________________________________________
>>> Wien mailing list
>>> Wien at zeus.theochem.tuwien.ac.at
>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
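As a quick way to verify the .machine0 contents Maxim asks about, the sketch below builds a sample file with the expected layout (8 slots on each of the two nodes; the layout and the file name machine0.sample are assumptions for illustration) and counts the slots per node:

```shell
# Build a sample machines file with the expected layout:
# 8 lines for r1i0n0 and 8 for r1i0n1 (16 processors total).
for i in $(seq 8); do echo r1i0n0; done >  machine0.sample
for i in $(seq 8); do echo r1i0n1; done >> machine0.sample
# Count how many times each node appears; each should show 8.
sort machine0.sample | uniq -c | awk '{print $1, $2}'
# -> 8 r1i0n0
#    8 r1i0n1
```

Running the same sort | uniq -c pipeline on the real .machine0 quickly shows whether lapw0para distributed the slots as intended.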