Hello,
I am getting random crashes (segmentation faults) on a supercomputer
(guillimin) using 3 nodes with 12 cores per node. The same program
(Ray) runs without any problem on the other supercomputers I use.
The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342 QDR
HCA" and the messages transit using "Performance Scaled Messaging"
(PSM), which I think is some sort of replacement for InfiniBand verbs,
although I am not sure.
Adding '--mca mtl ^psm' to the Open MPI mpiexec options solves the
problem, but increases the latency from 20 microseconds to 55
microseconds.
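For reference, the same exclusion can probably be set without editing
the mpiexec command line. This is only a sketch, assuming the usual
Open MPI convention that an MCA parameter named "mtl" is also read from
the environment variable OMPI_MCA_mtl. The quoting is there because
some shells treat '^' specially.

```shell
# Assumption: Open MPI picks up MCA parameters from OMPI_MCA_<name>
# environment variables, so this should be equivalent to passing
# '--mca mtl ^psm' on the mpiexec command line.
export OMPI_MCA_mtl='^psm'

# Print the value so it ends up in the job log next to the results.
echo "mtl selection: ${OMPI_MCA_mtl}"
```

With this exported in the job script, the mpiexec command itself stays
identical between the PSM and non-PSM runs, which makes the two easier
to compare.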
There seems to be some sort of message corruption during the transit,
but I cannot rule out other explanations.
I have no idea what is going on or why disabling PSM solves the problem.
Versions
module load gcc/4.5.3
module load openmpi/1.4.3-gcc
Command that randomly crashes
mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
Ray -k 31 \
-o MiSeq-bug-2012-06-28.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
Command that completes successfully
mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
--mca mtl ^psm \
Ray -k 31 \
-o psm-bug-2012-06-26-hotfix.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
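Since the crashes are random, a failure count over several runs may be
more informative than a single run. The helper below is a hypothetical
sketch (count_failures is not part of Ray or Open MPI). It assumes a
crashed job makes mpiexec return a non-zero exit status.

```shell
# Hypothetical helper: run a command a given number of times and print
# how many of those runs exited with a non-zero status.
# Usage: count_failures <runs> <command> [args...]
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Assumption: a segmentation fault in any rank surfaces as a
        # non-zero exit status of the launched command.
        if ! "$@" > /dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

Passing each of the two mpiexec commands above as the arguments after
the run count would give a crash count with and without PSM. Note that
the helper discards the command's output, and that repeated runs reuse
the same -o directory unless it is changed between runs.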
Sébastien Boisvert