Re: [Wien] Problem when running MPI-parallel version of LAPW0

2014-10-23 Thread Rémi Arras

Thank you all for your answers.
For the .machines file, we already have a script and the file is generated correctly.
We will check the links again and test another version of the
fftw3 library. I will keep you informed if the problem is solved.


Best regards,
Rémi Arras

On 22/10/2014 14:22, Peter Blaha wrote:

Usually the crucial point for lapw0 is the fftw3 library.

I noticed you have fftw-3.3.4, which I have never tested. Since fftw2 and
fftw3 are incompatible, maybe they have changed something again ...


Besides that, I assume you have installed fftw using the same ifort and
MPI versions ...




On 10/22/2014 01:29 PM, Rémi Arras wrote:

Dear Pr. Blaha, Dear Wien2k users,

We tried to install the latest version of WIEN2k (14.1) on a supercomputer
and we are facing some trouble with the MPI-parallel version.

1) lapw0 runs correctly in sequential mode, but crashes systematically
when the parallel option is activated (independently of the number of
cores we use):


lapw0 -p    (16:08:13) starting parallel lapw0 at lun. sept. 29 16:08:13 CEST 2014
 .machine0 : 4 processors
Child id 1 SIGSEGV
Child id 2 SIGSEGV
Child id 3 SIGSEGV
Child id 0 SIGSEGV
**  lapw0 crashed!
0.029u 0.036s 0:50.91 0.0%  0+0k 5248+104io 17pf+0w
error: command   /eos3/p1229/remir/INSTALLATION_WIEN/14.1/lapw0para -up -c lapw0.def   failed

stop error


w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
Child with myid of 1 has an error
'Unknown' - SIGSEGV
Child id 1 SIGSEGV
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 1
**  lapw0 crashed!
cat: No match.
0.027u 0.034s 1:33.13 0.0%  0+0k 5200+96io 16pf+0w
error: command   /eos3/p1229/remir/INSTALLATION_WIEN/14.1/lapw0para -up -c lapw0.def   failed


2) lapw2 also crashes sometimes when MPI parallelization is used.
Sequential or k-parallel runs are OK, and, unlike lapw0, the error
does not occur in all cases (we did not notice any problem when testing
the MPI benchmark with lapw1):

w2k_dispatch_signal(): received: Segmentation fault
application called MPI_Abort(MPI_COMM_WORLD, 768) - process 0

Our system is a Bullx DLC cluster (Red Hat Linux + Intel Ivy Bridge) and
we use the compiler (+ MKL) intel/14.0.2.144 and intelmpi/4.1.3.049.
The batch scheduler is SLURM.

Here are the settings and options we used for the installation:

OPTIONS:
current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML
-Dmkl_scalapack -traceback -xAVX
current:FFTW_OPT:-DFFTW3
-I/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/include
current:FFTW_LIBS:-lfftw3_mpi -lfftw3
-L/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/lib
current:LDFLAGS:$(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
current:DPARALLEL:'-DParallel'
current:R_LIBS:-lmkl_lapack95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread
-lmkl_core -openmp -lpthread
current:RP_LIBS:-mkl=cluster -lfftw3_mpi -lfftw3
-L/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/lib
current:MPIRUN:mpirun -np _NP_ _EXEC_
current:MKL_TARGET_ARCH:intel64

PARALLEL_OPTIONS:
setenv TASKSET no
setenv USE_REMOTE 1
setenv MPI_REMOTE 1
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN mpirun -np _NP_ _EXEC_

Any suggestions which could help us to solve this problem would be
greatly appreciated.

Best regards,
Rémi Arras


___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] Problem when running MPI-parallel version of LAPW0

2014-10-22 Thread Michael Sluydts

Hello Rémi,

While I'm not sure this is the (only) problem, in our setup we also give 
mpirun the machines file:


setenv WIEN_MPIRUN mpirun  -np _NP_ -machinefile _HOSTS_ _EXEC_

which I generate, assuming one k-point per node, with the following
Python script:


/wienhybrid
#!/usr/bin/env python
#Machines file generator for WIEN2k
#May 13th 2013
#
#Michael Sluydts
#Center for Molecular Modeling
#Ghent University
from collections import Counter
import subprocess, os

#locate the PBS nodefile (one line per allocated core)
nodefile = subprocess.Popen('echo $PBS_NODEFILE',stdout=subprocess.PIPE,shell=True)
nodefile = nodefile.communicate()[0].strip()
nodefile = open(nodefile,'r')

machines = nodefile.readlines()
nodefile.close()

node = ''
corecount = Counter()

#gather cores per node
for core in machines:
    node = core.split('.')[0]
    corecount[node] += 1

#if there are more nodes than k-points we must redistribute the remaining cores

#count the irreducible k-points
IBZ = int(subprocess.Popen('wc -l < ' + os.getcwd().split('/')[-1] + '.klist',stdout=subprocess.PIPE,shell=True).communicate()[0]) - 2

corerank = corecount.most_common()

alloc = Counter()
total = Counter()
nodemap = []
#pick out the largest nodes and redivide the remaining ones by adding
#the largest leftover node to the k-point with the least allocated cores
for node,cores in corerank:
    if len(alloc) < IBZ:
        alloc[node] += cores
        total[node] += cores
    else:
        lowcore = total.most_common()[-1][0]
        total[lowcore] += cores
        nodemap.append((node,lowcore))

#give lapw0 all cores of the first node
machinesfile = 'lapw0: ' + corecount.keys()[0] + ':' + str(corecount[corecount.keys()[0]]) + '\n'

#for node in corecount.keys():
#    machinesfile += node + ':' + str(corecount[node]) + ' '
#machinesfile += '\n'

#machinesfile = ''
for node in alloc.keys():
    #allocate main node
    machinesfile += '1:' + node + ':' + str(alloc[node])
    #machinesfile += '1:' + node
    #for i in range(1,alloc[node]):
    #    machinesfile += ' ' + node
    #distribute leftover nodes
    extra = [x for x,y in nodemap if y == node]
    for ext in extra:
        #machinesfile += ' ' + ext + ':' + str(corecount[ext])
        for i in range(1,corecount[ext]):
            machinesfile += ' ' + ext
    machinesfile += '\n'

#If your nodes do not all have the same specifications you may have to
#change the weights (the '1:' prefixes) above and the granularity below;
#if you use a residue machine you should remove extrafine and add the
#residue configuration
machinesfile += 'granularity:1\nextrafine:1\n'

#if you have memory issues or limited bandwidth between nodes try
#uncommenting the following line (you can always try it and see if it
#speeds things up)
#machinesfile += 'lapw2 vector split:2\n'

machines = file('.machines','w')
machines.write(machinesfile)
machines.close()
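
For illustration: on a hypothetical job with two 16-core nodes (node001,
node002) and at least two irreducible k-points, the script above would write
a .machines file along these lines (node names are placeholders):

  lapw0: node001:16
  1:node001:16
  1:node002:16
  granularity:1
  extrafine:1

The 'lapw0:' line assigns machines to the MPI run of lapw0, and each '1:'
line defines one parallel group (weight 1) used for lapw1/lapw2.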



___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] Problem when running MPI-parallel version of LAPW0

2014-10-22 Thread Michael Sluydts
Perhaps an important note: the Python script is written for a Torque/PBS
queuing system (it reads $PBS_NODEFILE).



___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] Problem when running MPI-parallel version of LAPW0

2014-10-22 Thread Laurence Marks
It is often hard to know exactly what the issues are with MPI. Most often they
are due to incorrect combinations of scalapack/blacs libraries in the linking options.

The first thing to check is your linking options, using the Intel MKL Link Line
Advisor: https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/.
What you have does not look exactly right to me, but I have not used your
release.
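
For comparison only, a sketch of the kind of link line the advisor typically
returns for this stack (the MKL shipped with the 14.0 compilers, Intel MPI,
intel64/LP64, threaded, with ScaLAPACK); the exact output depends on the
advisor and MKL versions:

  -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 \
  -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread -lm

together with the -openmp compiler option. The -mkl=cluster flag already in
RP_LIBS is meant to pull in the ScaLAPACK/BLACS parts automatically, so the
explicit list is mainly useful for cross-checking what actually gets linked.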

If that does not work, look in case.dayfile, the log file.

If there is still nothing, it is sometimes useful to comment out the line

  CALL W2kinit

in lapw0.F, recompile, then just do 'x lapw0 -p'. You will sometimes get
more information, although it is not as safe: without it, MPI tasks can
hang forever in some cases.
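
A minimal sketch of that test, assuming a standard installation with
$WIENROOT set (the sed pattern and the recompile step may need adjusting to
your source tree and siteconfig setup):

  cd $WIENROOT/SRC_lapw0
  # comment out the signal-handler initialization (free-form Fortran '!', GNU sed)
  sed -i 's/CALL W2kinit/! &/I' lapw0.F
  # recompile lapw0, e.g. through siteconfig_lapw ('Compile/Recompile programs')
  cd /path/to/your/case
  x lapw0 -p
  # a crash should now give an ordinary Fortran/MPI traceback instead of the
  # bare SIGSEGV; restore the original lapw0.F afterwards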





-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
Corrosion in 4D: MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
Research is to see what everybody else has seen, and to think what nobody
else has thought
Albert Szent-Györgyi
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] Problem when running MPI-parallel version of LAPW0

2014-10-22 Thread Peter Blaha

Usually the crucial point for lapw0 is the fftw3 library.

I noticed you have fftw-3.3.4, which I have never tested. Since fftw2 and
fftw3 are incompatible, maybe they have changed something again ...


Besides that, I assume you have installed fftw using the same ifort and
MPI versions ...
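
For reference, a minimal sketch of how one might rebuild fftw-3.3.4 with the
same Intel compilers and Intel MPI used for WIEN2k (the install prefix is a
placeholder; adapt it and the loaded modules to your site):

  cd fftw-3.3.4
  ./configure --prefix=$HOME/fftw-3.3.4-Intel_MPI --enable-mpi \
              CC=icc F77=ifort MPICC=mpiicc
  make -j 8
  make install

FFTW_OPT and FFTW_LIBS in the WIEN2k siteconfig should then point to the
include/ and lib/ directories under that prefix.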







--

  P.Blaha
--
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: bl...@theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/staff/tc_group_e.php
--
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html