One of the big cosmology codes is Gadget-3 (Springel et al.).
The code uses MPI for interprocess communication. At the ICC in Durham we use Open MPI and have done so for about three years. Gadget-3 is one of the ICC's major research codes; we have been running it since it was written, and we have observed something very worrying:

When running over gigabit with -mca btl tcp,self,sm the code runs fine, which is good, as the largest part of our cluster is on gigabit, and since Gadget-3 scales rather well, the penalty for running over gigabit is not prohibitive. We also have a Myrinet cluster, on which larger runs freeze. However, as the gigabit cluster was available, we had not really investigated this until just now.
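For reference, the launch lines look roughly like this (a sketch only: the process count and parameter file name are placeholders, and the openib btl name assumes a stock Open MPI build with InfiniBand support):

```shell
# gigabit cluster: TCP plus shared-memory and self btls (as in the working runs)
mpirun -np 64 --mca btl tcp,self,sm ./Gadget3 param.txt

# InfiniBand cluster: the openib btl in place of tcp
mpirun -np 64 --mca btl openib,self,sm ./Gadget3 param.txt
```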

We currently have access to an InfiniBand cluster, and we found the following: in a specific set of blocked sendrecv operations, the processes appear to communicate in pairs until only one pair is left, at which point the run deadlocks. That last pair has set up communication: the two processes know each other's IDs and know what datatype to exchange, but they never transfer the data. The point at which it freezes is not reproducible, i.e. consecutive runs do not freeze at the same place. This is with Open MPI, and (judging from our Myrinet experience) it persists across different Open MPI versions.

I should mention that communication on both the Myrinet cluster and the InfiniBand cluster does work properly, as runs of other codes (CASTEP, b_eff) show.

So my question(s) is (are): has anybody had similar experiences, does anybody have an idea why this could happen, and/or what could we do about it?

Lydia


------------------------------------------
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___________________________________________