Hi

Biagio Lucini wrote:
Hello,

I am new to this list, where I hope to find a solution to a problem I have been having for quite a long time.

I run various versions of Open MPI (from 1.1.2 to 1.2.8) on a cluster with InfiniBand interconnects that I both use and administer. The OpenFabrics stack is OFED-1.2.5, the compilers are gcc 4.2 and Intel, and the queue manager is SGE 6.0u8.

The trouble is with an MPI code that runs fine with an Open MPI 1.1.2 library compiled without InfiniBand support: I have tested the scalability of the code up to 64 cores (the nodes have 4 or 8 cores) and the results are exactly what I expect. But if I use a version compiled with InfiniBand support, only a subset of the communications (the ones connecting cores in the same node) work, and because of this the program fails; specifically, it gets stuck waiting forever. This happens with every combination of compilers and library releases (1.1.2, 1.2.7, 1.2.8) I have tried.

On other codes, and in particular on benchmarks downloaded from the net, Open MPI over InfiniBand seems to work (I compared the latency with that of the tcp btl, so I am fairly sure InfiniBand itself works). The two components I have kept fixed are SGE and the OFED stack; I would prefer not to touch them, if possible, because the cluster runs fine for other purposes.
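For concreteness, this failure pattern is the kind of thing a minimal point-to-point test should expose. A sketch of such a test is below (hypothetical code, not my actual application): rank 0 exchanges a short message with every other rank in turn, so with more ranks than cores per node, the first off-node partner is where a hang should appear if inter-node traffic is broken.

    /* pingpong.c - minimal send/recv check between rank 0 and every
     * other rank (a hypothetical reproducer, not the real application).
     * With more ranks than cores per node, the first hang identifies
     * the first off-node partner. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, i;
        char buf[64];
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            for (i = 1; i < size; i++) {
                snprintf(buf, sizeof(buf), "ping %d", i);
                /* round trip with rank i; a hang here points at the
                 * path between rank 0's node and rank i's node */
                MPI_Send(buf, sizeof(buf), MPI_CHAR, i, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof(buf), MPI_CHAR, i, 1, MPI_COMM_WORLD, &st);
                printf("rank 0 <-> rank %d ok\n", i);
                fflush(stdout);
            }
        } else {
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Built with mpicc and launched across two or more nodes (the process count is just an example), e.g. mpirun -np 16 ./pingpong, it should report each pair as it completes.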

My question is: does anyone have a suggestion on what I could try next?
I am pretty sure that to get an answer I will need to provide more details, which I am happy to do; but in more than two months of testing, trying, hoping and praying I have accumulated so much material and information that posting all of it in this e-mail would be more likely to confuse a potential helper than to help them understand the problem.

Does the problem only show up with Open MPI? Did you try MVAPICH (http://mvapich.cse.ohio-state.edu/) to test whether it is a hardware or a software problem? (I don't know of any other open-source MPI implementation that supports InfiniBand.)
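Another way to narrow it down within Open MPI itself is to pin the transport from the command line; if I remember the 1.2-series syntax correctly, running the same job once as

    mpirun --mca btl tcp,self -np 16 ./a.out

and once as

    mpirun --mca btl openib,self,sm -np 16 ./a.out

should tell you whether only the openib BTL misbehaves (the process count and executable name are placeholders). Running ompi_info | grep openib on the library you actually launch with also confirms that the InfiniBand component was built in at all.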

Dorian


Thank you in advance,
Biagio Lucini

