Hi
Biagio Lucini wrote:
Hello,
I am new to this list, where I hope to find a solution for a problem
that I have been having for quite a long time.
I run various versions of openmpi (from 1.1.2 to 1.2.8) on a cluster
with Infiniband interconnects that I use and administer at the same
time. The OpenFabrics stack is OFED-1.2.5, the compilers are gcc 4.2
and Intel. The queue manager is SGE 6.0u8.
The trouble is with an MPI code that runs fine with an openmpi 1.1.2
library compiled without infiniband support (I have tested the
scalability of the code up to 64 cores, the nodes are 4 or 8 cores,
the results are exactly what I expect), but if I try to use a version
compiled for InfiniBand, then only a subset of communications (the ones
connecting cores in the same node) are enabled, and because of this
the program fails (in particular, it gets stuck in a permanent waiting
phase). This happens with any combination of compiler/library
releases (1.1.2, 1.2.7, 1.2.8) I have tried. On other codes, and in
particular on benchmarks downloaded from the net, openmpi over
infiniband seems to work (I compared the latency with the tcp btl, so
I am pretty sure that infiniband works). The two variables I kept
fixed are SGE and the OFED module stack. I would like not to touch
them, if possible, because the cluster seems to run fine for other
purposes.
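(For reference, this is the kind of invocation I use when I compare the
tcp and openib btls; the application name and host list below are just
placeholders, not my actual setup:)

```shell
# Known-good baseline: force the TCP transport
mpirun --mca btl tcp,self -np 8 -host node1,node2 ./my_mpi_app

# Force the InfiniBand (openib) transport; a hang here but not above
# points at the openib btl rather than the application
mpirun --mca btl openib,sm,self -np 8 -host node1,node2 ./my_mpi_app
```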
My question is: does anyone have a suggestion on what I could try next?
I'm pretty sure that to get an answer I need to provide more details,
which I am willing to do, but in more than two months of
testing/trying/hoping/praying I have accumulated so much material and
information that if I post everything in this e-mail I am likely to
confuse a potential helper more than help them understand the problem.
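(For concreteness, the simplest of the tests I have been running is an
all-pairs point-to-point exchange, so that ranks placed on different
nodes exercise the inter-node transport; a minimal sketch, not my
actual code:)

```c
/* All-pairs ping test: every rank exchanges one int with every other
   rank. With ranks spread across nodes, the inter-node btl is hit. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, i, buf;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < size; i++) {
        if (i == rank)
            continue;
        if (rank < i) {
            /* Lower rank sends first, then receives */
            buf = rank;
            MPI_Send(&buf, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &st);
        } else {
            /* Higher rank receives first, then sends back */
            MPI_Recv(&buf, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &st);
            buf = rank;
            MPI_Send(&buf, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
        }
        printf("rank %d <-> rank %d ok\n", rank, i);
    }

    MPI_Finalize();
    return 0;
}
```

With the tcp btl this prints one "ok" line per pair on every rank; over
the openib btl, in my situation, the inter-node pairs never complete.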
Thank you in advance,
Biagio Lucini

Does the problem only show up with openmpi? Did you try to use mvapich
(http://mvapich.cse.ohio-state.edu/) to test whether it is a hardware or
software problem? (I don't know any other open-source MPI implementation
which supports infiniband)
Dorian