Perhaps an important note: the Python script is written for a Torque/PBS queuing system (it relies on $PBS_NODEFILE).
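For reference, a minimal sketch of what such a $PBS_NODEFILE-based helper typically does; the .machines layout below (one k-point line per core) is only an assumption for illustration, not the actual script:

#!/usr/bin/env python
# Minimal sketch, not the actual script: read the node list that Torque/PBS
# exposes in $PBS_NODEFILE (one hostname per allocated core) and write a
# simple k-point-parallel WIEN2k .machines file from it.
import os

with open(os.environ["PBS_NODEFILE"]) as f:
    hosts = [line.strip() for line in f if line.strip()]

with open(".machines", "w") as f:
    for host in hosts:              # one "1:<host>" line per core
        f.write("1:%s\n" % host)
    f.write("granularity:1\n")
    f.write("extrafine:1\n")

Under SLURM (as in the setup described below) $PBS_NODEFILE is not set, so the script would need adapting.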

Rémi Arras wrote on 22/10/2014 13:29:
Dear Pr. Blaha, Dear Wien2k users,

We tried to install the latest version of Wien2k (14.1) on a supercomputer and we are running into some trouble with the MPI-parallel version.

1) lapw0 runs correctly in sequential mode, but crashes systematically when the parallel option is activated (regardless of the number of cores we use):

> lapw0 -p    (16:08:13)
starting parallel lapw0 at lun. sept. 29 16:08:13 CEST 2014
-------- .machine0 : 4 processors
Child id 1 SIGSEGV
Child id 2 SIGSEGV
Child id 3 SIGSEGV
Child id 0 SIGSEGV
**  lapw0 crashed!
0.029u 0.036s 0:50.91 0.0% 0+0k 5248+104io 17pf+0w
error: command   /eos3/p1229/remir/INSTALLATION_WIEN/14.1/lapw0para -up -c lapw0.def   failed
>   stop error

w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
Child with myid of 1 has an error
'Unknown' - SIGSEGV
Child id 1 SIGSEGV
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 1
**  lapw0 crashed!
cat: No match.
0.027u 0.034s 1:33.13 0.0% 0+0k 5200+96io 16pf+0w
error: command   /eos3/p1229/remir/INSTALLATION_WIEN/14.1/lapw0para -up -c lapw0.def   failed


2) lapw2 also crashes sometimes when MPI parallelization is used. Sequential and k-point-parallel runs are fine, and, unlike lapw0, the error does not occur in all cases (we did not notice any problem when testing the MPI benchmark with lapw1):

w2k_dispatch_signal(): received: Segmentation fault
application called MPI_Abort(MPI_COMM_WORLD, 768) - process 0

Our system is a Bullx DLC cluster (Red Hat Linux + Intel Ivy Bridge), and we use the Intel compiler (+ MKL) intel/14.0.2.144 and intelmpi/4.1.3.049.
The batch scheduler is SLURM.
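(In connection with the note at the top about $PBS_NODEFILE: under SLURM that file is not created, and the node list is normally recovered from SLURM_JOB_NODELIST, for example expanded with scontrol. A rough, untested sketch, with a made-up helper name:

import os, subprocess

def slurm_hosts():
    # Hypothetical helper: expand SLURM's compact node list, e.g. "node[01-04]",
    # into individual hostnames using "scontrol show hostnames".
    nodelist = os.environ["SLURM_JOB_NODELIST"]
    out = subprocess.check_output(["scontrol", "show", "hostnames", nodelist])
    return out.decode().split()

Note that this gives one entry per node, not per core, so the tasks-per-node count would still have to be taken into account.)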

Here are the settings and options we used for the installation:

OPTIONS:
current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -Dmkl_scalapack -traceback -xAVX
current:FFTW_OPT:-DFFTW3 -I/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/include
current:FFTW_LIBS:-lfftw3_mpi -lfftw3 -L/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/lib
current:LDFLAGS:$(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
current:DPARALLEL:'-DParallel'
current:R_LIBS:-lmkl_lapack95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread
current:RP_LIBS:-mkl=cluster -lfftw3_mpi -lfftw3 -L/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/lib
current:MPIRUN:mpirun -np _NP_ _EXEC_
current:MKL_TARGET_ARCH:intel64

PARALLEL_OPTIONS:
setenv TASKSET "no"
setenv USE_REMOTE 1
setenv MPI_REMOTE 1
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ _EXEC_"
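(As far as I understand it, _NP_ and _EXEC_ in the WIEN_MPIRUN template are placeholders that the lapw*para scripts replace with the number of MPI processes and with the executable plus its .def file before launching the command. A rough illustration only, with made-up values:

# How a template such as "mpirun -np _NP_ _EXEC_" becomes a concrete
# command line; the real substitution is done by the *para scripts.
template = "mpirun -np _NP_ _EXEC_"
np_procs = 4                        # e.g. the 4 processors listed in .machine0
exec_cmd = "lapw0_mpi lapw0.def"    # executable plus its .def file
print(template.replace("_NP_", str(np_procs)).replace("_EXEC_", exec_cmd))
# -> mpirun -np 4 lapw0_mpi lapw0.def
)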

Any suggestions that could help us solve this problem would be greatly appreciated.

Best regards,
Rémi Arras


_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
