Hi Dorian,

Thank you for your message.

doriankrause wrote:


The trouble is with an MPI code that runs fine with an OpenMPI 1.1.2 library compiled without InfiniBand support (I have tested the scalability of the code up to 64 cores, on nodes with 4 or 8 cores each, and the results are exactly what I expect). If I try to use a version compiled for InfiniBand, however, only a subset of communications (the ones connecting cores in the same node) are enabled, and because of this the program fails (in particular, it gets stuck waiting forever). This happens with any combination of compilers and library releases (1.1.2, 1.2.7, 1.2.8) I have tried. On other codes, and in particular on benchmarks downloaded from the net, OpenMPI over InfiniBand seems to work (I compared the latency with the tcp btl, so I am pretty sure that InfiniBand works). The two variables I have kept fixed are SGE and the OFED module stack. I would prefer not to touch them, if possible, because the cluster seems to run fine for other purposes.

Does the problem only show up with OpenMPI? Did you try to use MVAPICH (http://mvapich.cse.ohio-state.edu/) to test whether it is a hardware or a software problem? (I don't know of any other open-source MPI implementation that supports InfiniBand.)
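
Before I answer: when I say above that InfiniBand "seems to work" for other codes, I mean that I select the btl explicitly on the mpirun command line rather than relying on auto-detection, roughly along these lines (the process count and the program name are just placeholders):

  mpirun --mca btl openib,sm,self -np 16 ./code    # over InfiniBand
  mpirun --mca btl tcp,self       -np 16 ./code    # over TCP, for comparison

and then compare the measured latencies, so I am fairly confident that the openib btl itself is functional, at least for those codes.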


I have had bad experiences with MPICH, on which MVAPICH is based. The short answer to your question is yes, and it did not work for other reasons (not even over Ethernet). The interesting development today is that Intel MPI (which should be more or less MVAPICH2, if I am not wrong) seems to work (I will verify this with MVAPICH2 as well). This seems to point towards a problem with the OpenMPI libraries, but I have reservations: they seem to work even for complicated benchmarking tests (like the Intel Benchmark), AND I also have troubles with MPICH, which I have not sorted out. A possibility is that the problem comes from the interaction between MPI, SGE and my code. I would love it if someone more experienced than me could take a look at the code (which unfortunately is Fortran). I will try to trim the more than 4000 lines down to a manageable proof of concept, if anyone is interested in following this up, but that is unlikely to happen before the new year :-)
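
In the meantime, to give an idea of what I have in mind for the proof of concept: since the communications that never complete are the ones between ranks on different nodes, a first trimmed-down test could be as simple as the sketch below (the names and the message size are only illustrative, this is not the real code):

  program ib_probe
    implicit none
    include 'mpif.h'
    integer :: ierr, rank, nprocs
    integer :: status(MPI_STATUS_SIZE)
    double precision :: buf(1024)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

    if (nprocs < 2) then
       if (rank == 0) print *, 'needs at least 2 ranks'
       call MPI_FINALIZE(ierr)
       stop
    end if

    buf = dble(rank)

    ! pair rank 0 with the last rank; with a hostfile that spreads the
    ! ranks across nodes this forces the message over InfiniBand, which
    ! is exactly the kind of communication that hangs in the full code
    if (rank == 0) then
       call MPI_SEND(buf, 1024, MPI_DOUBLE_PRECISION, nprocs-1, 1, MPI_COMM_WORLD, ierr)
       call MPI_RECV(buf, 1024, MPI_DOUBLE_PRECISION, nprocs-1, 2, MPI_COMM_WORLD, status, ierr)
       print *, 'inter-node round trip ok'
    else if (rank == nprocs-1) then
       call MPI_RECV(buf, 1024, MPI_DOUBLE_PRECISION, 0, 1, MPI_COMM_WORLD, status, ierr)
       call MPI_SEND(buf, 1024, MPI_DOUBLE_PRECISION, 0, 2, MPI_COMM_WORLD, ierr)
    end if

    call MPI_FINALIZE(ierr)
  end program ib_probe

Of course the real code does much more than this, but if even a skeleton of this kind hangs when run across nodes under SGE with the openib btl, that would already narrow things down considerably.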

Thanks again,
Biagio
