Hi Wei,

The parallel_options file controls how the parallel programs are launched, so change the following line in it:

setenv WIEN_MPIRUN "mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"

to

setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"

Your .machine0/1/2 files are correct. Also, I believe the 'USE_REMOTE' variable, which is set to 1, makes the parallel scripts (I mean lapw[012]para_lapw) launch their jobs via ssh/rsh, so switch it to '0'. I am not sure about the 'MPI_REMOTE' option; it is a new one, so try both values (0 and 1) for it.
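For reference, here is roughly how your $WIENROOT/parallel_options could look with these changes applied (only a sketch based on the file you posted; the MPI_REMOTE value is just a first guess, try the other one if it does not help):

    # $WIENROOT/parallel_options (csh syntax, sketch)
    setenv USE_REMOTE 0        # do not launch the lapw[012]para jobs via ssh/rsh
    setenv MPI_REMOTE 0        # new option, first guess -- try 1 if 0 does not work
    setenv WIEN_GRANULARITY 1  # unchanged from your current file
    setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"

As I understand it, the WIEN2k parallel scripts substitute _NP_, _HOSTS_ and _EXEC_ with the number of processes, the generated .machine* file and the executable before calling mpirun, so the option name in WIEN_MPIRUN must be one that your mpirun actually accepts.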
Hope this helps.

Best regards,
Maxim Rakitin
email: rms85 at physics.susu.ac.ru
web: http://www.susu.ac.ru

01.11.2010 21:35, Wei Xie wrote:
> Hi Maxim,
>
> Thanks for the follow-up!
>
> I think it should be -machinefile that's appropriate. Here's the help:
> -machinefile  # file mapping procs to machine
>
> No -hostfile option is mentioned in the help for my current version of MPI.
>
> Yes, the machine0/1/2 files are exactly like what you described.
>
> The parallel_options is:
> setenv USE_REMOTE 1
> setenv MPI_REMOTE 1
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
>
> I think the problem should be due to my MPI. However, even if I disable MPI parallelization, the problem still persists (no evident difference in the output files, including case.dayfile, stdout and :log). Note we can run with exactly the same set of input files in serial mode with no problem.
>
> Again, thanks for your help!
>
> Cheers,
> Wei
>
> On Oct 31, 2010, at 11:27 PM, Maxim Rakitin wrote:
>
>> Dear Wei,
>>
>> Maybe -machinefile is ok for your mpirun. Which options are appropriate for it? What does the help say?
>>
>> Try to restore your MPIRUN variable with -machinefile and rerun the calculation. Then see what is in the .machine0/1/2 files and let us know. They should contain 8 lines of the r1i0n0 node and 8 lines of the r1i0n1 node.
>>
>> One more thing you should check is the $WIENROOT/parallel_options file. What is its content?
>>
>> Best regards,
>> Maxim Rakitin
>> email: rms85 at physics.susu.ac.ru
>> web: http://www.susu.ac.ru
>>
>> 01.11.2010 9:06, Wei Xie wrote:
>>> Hi Maxim,
>>>
>>> Thanks for your reply!
>>> We tried MPIRUN=mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_, but the problem persists. The only difference is that stdout changes to "MPI: invalid option -hostfile".
>>>
>>> Thanks,
>>> Wei
>>>
>>> On Oct 31, 2010, at 10:40 PM, Maxim Rakitin wrote:
>>>
>>>> Hi,
>>>>
>>>> It looks like Intel's mpirun doesn't have a '-machinefile' option. Instead it has a '-hostfile' option (from here:
>>>> http://downloadmirror.intel.com/18462/eng/nes_release_notes.txt).
>>>>
>>>> Try 'mpirun -h' for information about the options and apply the appropriate one.
>>>>
>>>> Best regards,
>>>> Maxim Rakitin
>>>> email: rms85 at physics.susu.ac.ru
>>>> web: http://www.susu.ac.ru
>>>>
>>>> 01.11.2010 4:56, Wei Xie wrote:
>>>>> Dear all WIEN2k community members:
>>>>>
>>>>> We encountered a problem when running in parallel (k-point, MPI or both): the calculations crashed at LAPW2. Note we had no problem running in serial. We have tried to diagnose the problem, recompile the code with different options and test with different cases and parameters based on similar problems reported on the mailing list, but the problem persists. So we write here hoping someone can offer us some suggestions. We have attached the related files below for your reference. Your replies are appreciated in advance!
>>>>>
>>>>> This is a TiC example running in both k-point and MPI parallel mode on two nodes, r1i0n0 and r1i0n1 (8 cores/node):
>>>>>
>>>>> 1. stdout (abridged)
>>>>> MPI: invalid option -machinefile
>>>>> real 0m0.004s
>>>>> user 0m0.000s
>>>>> sys 0m0.000s
>>>>> ...
>>>>> MPI: invalid option -machinefile
>>>>> real 0m0.003s
>>>>> user 0m0.000s
>>>>> sys 0m0.004s
>>>>> TiC.scf1up_1: No such file or directory.
>>>>>
>>>>> LAPW2 - Error. Check file lapw2.error
>>>>> cp: cannot stat `.in.tmp': No such file or directory
>>>>> rm: cannot remove `.in.tmp': No such file or directory
>>>>> rm: cannot remove `.in.tmp1': No such file or directory
>>>>>
>>>>> 2. TiC.dayfile (abridged)
>>>>> ...
>>>>> start (Sun Oct 31 16:25:06 MDT 2010) with lapw0 (40/99 to go)
>>>>> cycle 1 (Sun Oct 31 16:25:06 MDT 2010) (40/99 to go)
>>>>>
>>>>> > lapw0 -p (16:25:06) starting parallel lapw0 at Sun Oct 31 16:25:07 MDT 2010
>>>>> -------- .machine0 : 16 processors
>>>>> invalid "local" arg: -machinefile
>>>>>
>>>>> 0.436u 0.412s 0:04.63 18.1% 0+0k 2600+0io 1pf+0w
>>>>> > lapw1 -up -p (16:25:12) starting parallel lapw1 at Sun Oct 31 16:25:12 MDT 2010
>>>>> -> starting parallel LAPW1 jobs at Sun Oct 31 16:25:12 MDT 2010
>>>>> running LAPW1 in parallel mode (using .machines)
>>>>> 2 number_of_parallel_jobs
>>>>> r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)
>>>>> r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1(1)
>>>>> r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)
>>>>> Summary of lapw1para:
>>>>> r1i0n0 k=0 user=0 wallclock=0
>>>>> r1i0n1 k=0 user=0 wallclock=0
>>>>> ...
>>>>> 0.116u 0.316s 0:10.48 4.0% 0+0k 0+0io 0pf+0w
>>>>> > lapw2 -up -p (16:25:34) running LAPW2 in parallel mode
>>>>> ** LAPW2 crashed!
>>>>> 0.032u 0.104s 0:01.13 11.5% 0+0k 82304+0io 8pf+0w
>>>>> error: command /home/xiew/WIEN2k_10/lapw2para -up uplapw2.def failed
>>>>>
>>>>> 3. uplapw2.error
>>>>> Error in LAPW2
>>>>> 'LAPW2' - can't open unit: 18
>>>>> 'LAPW2' - filename: TiC.vspup
>>>>> 'LAPW2' - status: old form: formatted
>>>>> ** testerror: Error in Parallel LAPW2
>>>>>
>>>>> 4. .machines
>>>>> #
>>>>> 1:r1i0n0:8
>>>>> 1:r1i0n1:8
>>>>> lapw0:r1i0n0:8 r1i0n1:8
>>>>> granularity:1
>>>>> extrafine:1
>>>>>
>>>>> 5. compilers, MPI and options
>>>>> Intel Compilers and MKL 11.1.046
>>>>> Intel MPI 3.2.0.011
>>>>>
>>>>> current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
>>>>> current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
>>>>> current:LDFLAGS:$(FOPT) -L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -pthread
>>>>> current:DPARALLEL:'-DParallel'
>>>>> current:R_LIBS:-lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
>>>>> current:RP_LIBS:-L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -lmkl_scalapack_lp64 /usr/local/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_solver_lp64.a -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -Wl,--end-group -openmp -lpthread -L/home/xiew/fftw-2.1.5/lib -lfftw_mpi -lfftw $(R_LIBS)
>>>>> current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
>>>>>
>>>>> Best regards,
>>>>> Wei Xie
>>>>> Computational Materials Group
>>>>> University of Wisconsin-Madison