One of the big cosmology codes is Gadget-3 (Springel et al.).
The code uses MPI for interprocess communication. At the ICC in Durham we use Open MPI and have done so for about three years. Gadget-3 is one of the ICC's major research codes; we have been running it since it was written, and we have observed something very worrying:

When running over gigabit with -mca btl tcp,self,sm the code runs fine, which is good, as the largest part of our cluster is on gigabit, and since Gadget-3 scales rather well, the penalty for running over gigabit is not prohibitive. We also have a Myrinet cluster, on which larger runs freeze. However, as the gigabit cluster was available, we had not really investigated this until just now.
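For reference, the launch lines look roughly like this (a sketch only: the process count and parameter file name are placeholders, and the openib btl name assumes a stock Open MPI build with InfiniBand support):

```shell
# gigabit cluster: TCP plus shared-memory and self btls (as in the working runs)
mpirun -np 64 --mca btl tcp,self,sm ./Gadget3 param.txt

# InfiniBand cluster: the openib btl in place of tcp
mpirun -np 64 --mca btl openib,self,sm ./Gadget3 param.txt
```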

We currently have access to an InfiniBand cluster, and we found the following: in a specific set of blocked sendrecv operations, the processes appear to communicate in pairs until only one pair is left, at which point the run deadlocks. That last pair has set up communication: the two processes know each other's IDs and know what datatype to exchange, but they never transfer the data. The point at which it freezes is not reproducible, i.e. consecutive runs do not freeze at the same place. This is with Open MPI, and (judging from our Myrinet experience) it persists across different Open MPI versions.

I should mention that communication on both the Myrinet cluster and the InfiniBand cluster does work properly, as runs of other codes (CASTEP, b_eff) show.

So my question(s) is (are): has anybody had similar experiences, does anybody have an idea why this could happen, and/or what could we do about it?

Lydia


------------------------------------------
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___________________________________________