Re: [Wien] commlib error
Dear Prof. Blaha, Laurence Marks, and Gavin Abo,

Thank you for your valuable suggestions. I am working through them now and will let you know if the problem is solved.

For Prof. Laurence Marks: Sir, these were my options during installation (k-point parallelization).

System: linuxif111
Wien version: WIEN2k_14.2
f90 compiler: ifort; C compiler: icc

Current settings:
  O   Compiler options: -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
  F   FFTW options: -DFFTW3 -/applic/compilers/intel/11.1/mpi/openmpi/1.6.3/applib2/FFTW3/3.3.4/double/include
  L   Linker Flags: $(FOPT) -L/applic/compilers/intel/11.1/mkl/lib/em64t -pthread
  P   Preprocessor flags: '-DParallel'
  R   R_LIB (LAPACK+BLAS): -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
  FL  FFTW_LIBS: -lfftw3_mpi -lfftw3 -L/applic/compilers/intel/11.1/mpi/openmpi/1.6.3/applib2/FFTW3/3.3.4/double/lib

Parallel f90 compiler: mpif90
  FFTW3 FFTW_LIB + FFTW_OPT: -lfftw3_mpi -lfftw3 -L/applic/compilers/intel/11.1/mpi/openmpi/1.6.3/applib2/FFTW3/3.3.4/double/lib + -DFFTW3 -I/applic/compilers/intel/11.1/mpi/openmpi/1.6.3/applib2/FFTW3/3.3.4/double/include (already set)
  RP  RP_LIB(SCALAPACK+PBLAS): -lmkl_scalapack_lp64 -lmkl_solver_lp64 -lmkl_blacs_lp64 $(R_LIBS)
  FP  FPOPT(par.comp.options): -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -Dmkl_scalapack -traceback
  MP  MPIRUN commando: mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_

And this is my job script:

  #!/bin/bash
  #$ -V
  #$ -cwd
  #$ -N FM-Pr
  #$ -pe mpi_fu 47
  #$ -q normal
  #$ -R yes
  #$ -l h_rt=48:00:00
  echo Got $NSLOTS slots.
  cat $TMPDIR/machines
  # enables $TMPDIR/rsh to catch rsh calls if available
  cd $SGE_O_WORKDIR
  rm -f .machines
  echo 'granularity:1' >> .machines
  echo 'extrafine:1' >> .machines
  i=1
  while (( i <= NSLOTS ))
  do
  echo -n '1:' >> .machines
  head -n $i $TMPDIR/machines | tail -n 1 >> .machines
  ((i=i+1))
  done
  runsp_lapw -p -orb -i 1000 -ec 0.0001 -cc 0.001

Sir, I previously ran some calculations for monolayer phosphorene and did not face any problem like this during that calculation.

Best regards,
Imran Khan

___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
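The job script above writes one `1:host` line into .machines for every slot granted by SGE. If the number of k-parallel jobs is instead meant to divide the number of k-points evenly (a tip given elsewhere in this thread), the file could be generated along the following lines. This is only a sketch: the helper names `largest_even_split` and `write_machines` are hypothetical, and the k-point count passed in is assumed to be the number of k-points actually listed in case.klist, which the thread does not state.

```shell
#!/bin/bash
# Hypothetical helper: print the largest job count n <= max_jobs
# for which nkpt is an exact multiple of n.
largest_even_split() {
    local nkpt=$1 max_jobs=$2 n
    for ((n = max_jobs; n >= 1; n--)); do
        if ((nkpt % n == 0)); then
            echo "$n"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: write a k-point-parallel .machines file that uses
# only the first njobs hosts of hostfile (one host name per line, as in
# the $TMPDIR/machines file read by the job script above).
write_machines() {
    local hostfile=$1 njobs=$2 out=${3:-.machines}
    {
        echo 'granularity:1'
        echo 'extrafine:1'
        head -n "$njobs" "$hostfile" | sed 's/^/1:/'
    } > "$out"
}
```

In a job script this might be used as `njobs=$(largest_even_split "$nkpt" "$NSLOTS")` followed by `write_machines "$TMPDIR/machines" "$njobs"`, with `nkpt` set by hand. The remaining slots stay idle, which trades some throughput for an even distribution of k-points.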
[Wien] commlib error
Dear WIEN2k experts and users,

I am using WIEN2k version 14.2 on a queuing system (SGE), with Intel compiler 11.1, the MPI library mpi/openmpi-1.6.3, and the math library fftw-3.3.4. With these options, WIEN2k installed without any compile-time errors.

The purpose of my calculation is to find the stable site for different substituents in NdFeB intermetallics. I am running the case.struct given in the attachment, using 200 (6 6 4) k-points. My RKmax value is 7 and Gmax is 12, and I am using the LDA+U method. I am using the following command:

  runsp_lapw -p -orb -i 80 -ec 0.0001 -cc 0.001

Every time I submit my job, it is terminated after a few SCF cycles with the following error in the error tag file:

  error: commlib error: got select error (Connection reset by peer)
  error: executing task of job 2424636 failed: failed sending task to execd@tachyon1478: can't find connection
  . . .
  LAPW2 END
  LAPW2 END
  LAPW2 END
  LAPW2 END
  real 0m53.638s
  forrtl: No such file or directory
  forrtl: severe (29): file not found, unit 21, file /home01/x1030imr/khan/Wien2K/Neomagnet/Pr-doped/f-site/AFM/Pr-Af/Pr-Af.scf2up_31
  Image      PC            Routine  Line     Source
  sumpara    004A671D      Unknown  Unknown  Unknown
  sumpara    004A5225      Unknown  Unknown  Unknown
  sumpara    00456259      Unknown  Unknown  Unknown
  sumpara    00416A5A      Unknown  Unknown  Unknown
  sumpara    00416250      Unknown  Unknown  Unknown
  sumpara    00421E3D      Unknown  Unknown  Unknown
  sumpara    00410771      scfsum_  126      scfsum.f
  sumpara    0040EE82      MAIN__   219      sumpara.f
  sumpara    004033DC      Unknown  Unknown  Unknown
  libc.so.6  0035AA81D974  Unknown  Unknown  Unknown
  sumpara    004032E9      Unknown  Unknown  Unknown
  cp: cannot stat `.in.tmp': No such file or directory

I have discussed this error with the engineers of the queuing system (tachyon) and have searched the mailing list as well, but could not find a solution. Your guidance in solving this issue would be greatly appreciated.

Best regards,
Imran
Attachment: Pr-Af.struct (binary data)
Re: [Wien] commlib error
From a brief Google search, this is an MPI error. How did you compile? It is easy to use the wrong BLACS combination. Have you run simpler cases such as TiC first?

---
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
http://www.numis.northwestern.edu
Corrosion in 4D: http://MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody else has thought" Albert Szent-Gyorgi

On Jul 10, 2015 03:05, Imran Khan imrankhanswat...@gmail.com wrote:
> Dear wien2k experts and users, [...]
Re: [Wien] commlib error
An additional comment: in the post at

  https://arc.liv.ac.uk/pipermail/gridengine-users/2010-October/032729.html

you can see that they have an error of the same form:

  error: commlib error: got select error (Connection reset by peer)
  error: executing task of job x failed: failed sending task to execd@hostname: can't find connection

It looks like they might have tracked the problem down to the master daemon (qmaster), as seen in the post at

  https://arc.liv.ac.uk/pipermail/gridengine-users/2010-October/032758.html

So the error could perhaps be caused by a daemon problem (with the tachyon1478 node).

On 7/10/2015 5:01 AM, Laurence Marks wrote:
> From a brief Google search this is an mpi error. How did you compile, it is easy to use wrong blacs combinations. Have you run simpler cases such as TiC first?
Re: [Wien] commlib error
The commlib error is certainly a system error: the communication between the nodes is broken somehow.

From WIEN2k you got the error that in the sumpara step (after lapw2) it could not find the file Pr-Af.scf2up_31.

So the first question to ask yourself is: do I have this file, and is it OK?

  ls -alsrp *scf2up_*

You should find many of these files (as many as the number of k-parallel jobs submitted), and ALL of them should have a reasonable length (at least non-zero).

My suspicion is that the network filesystem on your system is a bit slow in updating the files on different nodes, and therefore the errors occur randomly after a few iterations.

You did not say how you parallelize or what the CPU time is, but here are a few tips:

- Reduce the number of k-point parallel jobs (I hope you did NOT distribute the 200 k-points onto 200 cores!). Depending on the matrix size, you may try some (higher) MPI parallelism.
- Make sure you are using a local SCRATCH directory to reduce network load (AND a compatible k-parallelism, i.e. (num-kpt / n-core) must be an integer).
- Increase the sleep times in $WIENROOT/lapw2para (and maybe lapw1para) from the defaults to larger values, such as:

  setenv DELAY 0.5   # delay launching of processes by n seconds
  setenv SLEEPY 4    # additional sleep before checking

On 07/09/2015 07:51 AM, Imran Khan wrote:
> Dear wien2k experts and users, [...]

--
P. Blaha
--
Peter BLAHA, Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300  FAX: +43-1-58801-165982
Email: bl...@theochem.tuwien.ac.at
WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/staff/tc_group_e.php
--
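The file check suggested in the reply above (every per-job scf2 file should exist and be non-empty) can also be scripted. The following is a minimal sketch, not part of WIEN2k: the function name `check_scf2_files` is hypothetical, and it assumes the files are named case.scf2up_1 through case.scf2up_N (or scf2dn) for N k-parallel jobs, following the pattern of the Pr-Af.scf2up_31 file named in the error message.

```shell
#!/bin/bash
# Hypothetical helper: report per-job scf2 files that are missing or empty.
# Arguments: case name, spin suffix (up or dn), number of k-parallel jobs.
# Returns 0 if every file exists with non-zero size, 1 otherwise.
check_scf2_files() {
    local case_name=$1 spin=$2 njobs=$3
    local bad=0 i f
    for ((i = 1; i <= njobs; i++)); do
        f="${case_name}.scf2${spin}_${i}"
        # -s is true only for an existing file of non-zero size
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            bad=$((bad + 1))
        fi
    done
    [ "$bad" -eq 0 ]
}
```

Run from the case directory after lapw2 finishes, for example as `check_scf2_files Pr-Af up 47`, it lists the offending files. If the list changes from one SCF cycle to the next, that would be consistent with the slow network filesystem suspected above, though it does not prove it.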