[Wien] Re[wien] commlib error

2015-07-11 Thread Imran Khan
Dear Prof. Blaha, Laurence Marks, and Gavin Abo,
Thank you for your valuable suggestions. I am currently working through them
and will let you know whether the problem is solved.
For Prof. Laurence Marks:
Sir, these were my options during installation (k-point parallelization):
System: linuxif111
Wien Version: WIEN2k_14.2
f90 compiler: ifort and C compiler icc

Current settings:
 O   Compiler options:
       -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
 F   FFTW options:
       -DFFTW3 -/applic/compilers/intel/11.1/mpi/openmpi/1.6.3/applib2/FFTW3/3.3.4/double/include
 L   Linker Flags:
       $(FOPT) -L/applic/compilers/intel/11.1/mkl/lib/em64t -pthread
 P   Preprocessor flags
       '-DParallel'
 R   R_LIB (LAPACK+BLAS):
       -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
 FL  FFTW_LIBS:
       -lfftw3_mpi -lfftw3 -L/applic/compilers/intel/11.1/mpi/openmpi/1.6.3/applib2/FFTW3/3.3.4/double/lib

     parallel f90 compiler
       mpif90
     FFTW3 FFTW_LIB + FFTW_OPT:
       -lfftw3_mpi -lfftw3 -L/applic/compilers/intel/11.1/mpi/openmpi/1.6.3/applib2/FFTW3/3.3.4/double/lib
       + -DFFTW3 -I/applic/compilers/intel/11.1/mpi/openmpi/1.6.3/applib2/FFTW3/3.3.4/double/include
       (already set)
 RP  RP_LIB(SCALAPACK+PBLAS):
       -lmkl_scalapack_lp64 -lmkl_solver_lp64 -lmkl_blacs_lp64 $(R_LIBS)
 FP  FPOPT(par.comp.options):
       -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -Dmkl_scalapack -traceback
 MP  MPIRUN commando:
       mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_

And this is my job script:

#!/bin/bash
#$ -V
#$ -cwd
#$ -N FM-Pr
#$ -pe mpi_fu 47
#$ -q normal
#$ -R yes
#$ -l h_rt=48:00:00
echo Got $NSLOTS slots.
cat $TMPDIR/machines
# enables $TMPDIR/rsh to catch rsh calls if available
cd $SGE_O_WORKDIR
rm -f .machines
echo 'granularity:1' >> .machines
echo 'extrafine:1' >> .machines
i=1
while ((i <= NSLOTS))
do
echo -n '1:' >> .machines
head -n $i $TMPDIR/machines | tail -n 1 >> .machines
((i=i+1))
done
runsp_lapw -p -orb -i 1000 -ec 0.0001 -cc 0.001

And sir, I did some calculations for monolayer phosphorene previously, but
faced no problem like this during those calculations.

Best regards
Imran khan
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


[Wien] commlib error

2015-07-10 Thread Imran Khan
Dear wien2k experts and users,
I am using WIEN2k version 14.2 on a queuing system (SGE), with Intel
compiler 11.1, MPI libraries mpi/openmpi-1.6.3 and math libraries
fftw-3.3.4. With these options I installed WIEN2k without any compile-time
error.
The purpose of my calculation is to find the stable site for different
substituents in NdFeB intermetallics.
I am running the case.struct given in the attachment, using 200 (6 6 4)
k-points. My RKmax value is 7 and Gmax is 12, and I am using the LDA+U method.
I am using the following command:  runsp_lapw -p -orb -i 80 -ec 0.0001 -cc
0.001
Every time I submit my job, it is terminated after a few SCF cycles with
the following error in the error tag file.

error: commlib error: got select error (Connection reset by peer)
error: executing task of job 2424636 failed: failed sending task to
execd@tachyon1478: can't find connection
.
.
.
 LAPW2 END
 LAPW2 END
 LAPW2 END
 LAPW2 END
real    0m53.638s
forrtl: No such file or directory
forrtl: severe (29): file not found, unit 21, file
/home01/x1030imr/khan/Wien2K/Neomagnet/Pr-doped/f-site/AFM/Pr-Af/Pr-Af.scf2up_31
Image      PC            Routine  Line     Source
sumpara    004A671D      Unknown  Unknown  Unknown
sumpara    004A5225      Unknown  Unknown  Unknown
sumpara    00456259      Unknown  Unknown  Unknown
sumpara    00416A5A      Unknown  Unknown  Unknown
sumpara    00416250      Unknown  Unknown  Unknown
sumpara    00421E3D      Unknown  Unknown  Unknown
sumpara    00410771      scfsum_  126      scfsum.f
sumpara    0040EE82      MAIN__   219      sumpara.f
sumpara    004033DC      Unknown  Unknown  Unknown
libc.so.6  0035AA81D974  Unknown  Unknown  Unknown
sumpara    004032E9      Unknown  Unknown  Unknown
cp: cannot stat `.in.tmp': No such file or directory

I have discussed this error with the engineers of that queuing system
(tachyon) and searched the mailing list as well, but could not find
any solution.
Your guidance in solving this issue will be greatly appreciated.
Best regards
Imran.


Pr-Af.struct
Description: Binary data


Re: [Wien] commlib error

2015-07-10 Thread Laurence Marks
From a brief Google search, this is an MPI error.

How did you compile? It is easy to use the wrong BLACS combinations.

Have you run simpler cases such as TiC first?

---
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
http://www.numis.northwestern.edu
Corrosion in 4D http://MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
Research is to see what everybody else has seen, and to think what nobody
else has thought
Albert Szent-Gyorgi


Re: [Wien] commlib error

2015-07-10 Thread Gavin Abo

An additional comment, in the post at:

https://arc.liv.ac.uk/pipermail/gridengine-users/2010-October/032729.html

You can see that they have the error of the form:

error: commlib error: got select error (Connection reset by peer)
error: executing task of job x failed: failed sending task to 
execd@hostname: can't find connection


It looks like they might have tracked down the problem to the master 
daemon (qmaster), as seen in the post at:


https://arc.liv.ac.uk/pipermail/gridengine-users/2010-October/032758.html

So, maybe, the error could be caused by a daemon problem (with the 
tachyon1478 node).




Re: [Wien] commlib error

2015-07-09 Thread Peter Blaha
The commlib error is certainly a system error, where the communication 
between the nodes is broken somehow.


From wien2k you got the error that in the sumpara step (after lapw2) it 
could not find the file Pr-Af.scf2up_31


So the first question you have to ask yourself is: do I have this file, 
and is it OK?


ls -alsrp *scf2up_*

You should find many of these files (as many as k-parallel jobs are 
submitted) and ALL of them should have a reasonable length (at least 
non-zero).
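
The same check can be scripted so that it is easy to repeat after a failed cycle. This is only a minimal sketch and not part of WIEN2k: the function name check_scf2 and its arguments are hypothetical, and it assumes the k-parallel jobs write files named case.scf2up_1 to case.scf2up_N, as the file name in the error message suggests.

```shell
# Hypothetical helper (not a WIEN2k tool): report case.scf2<spin>_<i> files
# that are missing or have zero length, for i = 1 .. NJOBS.
# Usage: check_scf2 CASE NJOBS [SPIN]   (SPIN defaults to "up")
check_scf2() (
    case_name=$1
    njobs=$2
    spin=${3:-up}
    bad=0
    i=1
    while [ "$i" -le "$njobs" ]; do
        f="${case_name}.scf2${spin}_${i}"
        if [ ! -e "$f" ]; then
            echo "missing: $f"
            bad=1
        elif [ ! -s "$f" ]; then
            echo "empty:   $f"
            bad=1
        fi
        i=$((i + 1))
    done
    # Exit status 0 only if every expected file exists and is non-empty
    [ "$bad" -eq 0 ]
)
```

Run in the case directory, for example as `check_scf2 Pr-Af 47 up`, where 47 stands for however many k-parallel jobs the .machines file defines.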


My suspicion is that the network filesystem on your system is a bit 
slow in updating the files on different nodes, and therefore the errors 
occur randomly after a few iterations.


You did not say how you parallelize nor what the cputime is, but a few tips:

- reduce the number of k-point parallel jobs (I hope you did NOT 
distribute the 200 k-points onto 200 cores !). Depending on the matrix 
size, you may try some (higher) mpi-parallelism.


- make sure you are using a local SCRATCH directory to reduce network 
load (AND a compatible k-parallelism, i.e. (num-kpt / n-core) must be an 
integer)


- increase the sleep times in $WIENROOT/lapw2para (and maybe 
lapw1para) from the defaults to larger values like

setenv DELAY   0.5  # delay launching of processes by n seconds
setenv SLEEPY  4    # additional sleep before checking
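
The divisibility condition in the second tip can be checked before choosing the number of k-parallel jobs. A small sketch under stated assumptions: the function name even_splits is made up for illustration, the first argument should be the number of k-points actually listed in case.klist, and the second is the largest number of jobs you are willing to start.

```shell
# Illustrative helper: print every job count n <= MAXJOBS for which
# NKPT / n is an integer, together with the k-points each job would get.
# Usage: even_splits NKPT MAXJOBS
even_splits() (
    nkpt=$1
    maxjobs=$2
    n=1
    while [ "$n" -le "$maxjobs" ]; do
        if [ $((nkpt % n)) -eq 0 ]; then
            printf '%s jobs x %s k-points\n' "$n" "$((nkpt / n))"
        fi
        n=$((n + 1))
    done
)
```

For instance, `even_splits 200 48` lists 1, 2, 4, 5, 8, 10, 20, 25 and 40 jobs, so with 200 k-points a job count such as 40 would give every job the same load.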



--

  P.Blaha
--
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: bl...@theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/staff/tc_group_e.php
--


[Wien] commlib error

2015-07-08 Thread Imran Khan
Dear wien2k experts and users,
I am using WIEN2k version 14.2 on a queuing system (SGE), with Intel
compiler 11.1, MPI libraries mpi/openmpi-1.6.3 and math libraries
fftw-3.3.4. With these options I installed WIEN2k without any compile-time
error.
The purpose of my calculation is to find the stable site for different
substituents in NdFeB intermetallics.
I am running the case.struct given in the attachment, using 200 (6 6 4)
k-points. My RKmax value is 7 and Gmax is 12, and I am using the LDA+U method.
I am using the following command:  runsp_lapw -p -orb -i 80 -ec 0.0001 -cc
0.001
Every time I submit my job, it is terminated after a few SCF cycles with
the following error in the error tag file.

error: commlib error: got select error (Connection reset by peer)
error: executing task of job 2424636 failed: failed sending task to
execd@tachyon1478: can't find connection
.
.
.
 LAPW2 END
 LAPW2 END
 LAPW2 END
 LAPW2 END
real    0m53.638s
forrtl: No such file or directory
forrtl: severe (29): file not found, unit 21, file
/home01/x1030imr/khan/Wien2K/Neomagnet/Pr-doped/f-site/AFM/Pr-Af/Pr-Af.scf2up_31
Image      PC            Routine  Line     Source
sumpara    004A671D      Unknown  Unknown  Unknown
sumpara    004A5225      Unknown  Unknown  Unknown
sumpara    00456259      Unknown  Unknown  Unknown
sumpara    00416A5A      Unknown  Unknown  Unknown
sumpara    00416250      Unknown  Unknown  Unknown
sumpara    00421E3D      Unknown  Unknown  Unknown
sumpara    00410771      scfsum_  126      scfsum.f
sumpara    0040EE82      MAIN__   219      sumpara.f
sumpara    004033DC      Unknown  Unknown  Unknown
libc.so.6  0035AA81D974  Unknown  Unknown  Unknown
sumpara    004032E9      Unknown  Unknown  Unknown
cp: cannot stat `.in.tmp': No such file or directory

I have discussed this error with the engineers of that queuing system
(tachyon) and searched the mailing list as well, but could not find
any solution.
Your guidance in solving this issue will be greatly appreciated.
Imran


Pr-Af.struct
Description: Binary data