Hello,

I am getting random crashes (segmentation faults) on a supercomputer (guillimin) when using 3 nodes with 12 cores per node.
The same program (Ray) runs without any problem on the other supercomputers I use.

The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342 QDR HCA" and the messages travel over "Performance Scaled Messaging" (PSM),
which I think is some sort of replacement for InfiniBand verbs, although I am not sure.

Adding '--mca mtl ^psm' to the Open MPI mpiexec options solves
the problem, but increases the latency from 20 microseconds to 55 microseconds.

There seems to be some sort of message corruption in transit, but I cannot rule out
other explanations.


I have no idea what is going on or why disabling PSM solves the problem.


Versions

module load gcc/4.5.3
module load openmpi/1.4.3-gcc


Command that randomly crashes

mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
Ray -k 31 \
-o MiSeq-bug-2012-06-28.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq


Command that completes successfully

mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
--mca mtl ^psm \
Ray -k 31 \
-o psm-bug-2012-06-26-hotfix.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
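
Since the crashes are random, one way to compare the two commands above is to run each of them several times and count the failures. The following is only a rough sketch: the helper name count_failures is made up for illustration (it is not part of Open MPI or Ray), and it assumes that mpiexec returns a non-zero exit status when a rank dies with a segmentation fault.

```shell
# count_failures RUNS CMD [ARGS...]
# Hypothetical helper: runs CMD the given number of times, discards its
# output, and prints how many runs ended with a non-zero exit status.
count_failures() {
    runs=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # A non-zero exit status (assumed to include a crashed rank
        # reported by mpiexec) is counted as one failure.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}

# Example usage (command arguments as in the two commands above):
#   count_failures 20 mpiexec -n 36 Ray -k 31 ...
#   count_failures 20 mpiexec -n 36 --mca mtl ^psm Ray -k 31 ...
```

If the first count is clearly above zero and the second stays at zero, that would support the observation that the crashes only appear when PSM is in use, although it would not explain why.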



Sébastien Boisvert
