Recently the organization I work for bought a modest-sized Linux cluster for running large atmospheric data assimilation systems. In my experience, a glaring problem with systems of this kind is poor I/O performance. Typically they have two networks: 1) a high-speed, low-latency network (e.g., InfiniBand) dedicated to MPI communication, and 2) a lower-speed network (e.g., 1 Gb or 10 Gb Ethernet) for I/O. On clusters this second network is usually the basis for a global parallel file system (GPFS), through which nearly all I/O traffic must pass. So the I/O performance of applications such as ours depends entirely on the speed of the GPFS, and therefore on the network hardware it uses.
We have seen that a cluster whose GPFS runs over a 1 Gb network is painfully slow for our applications, and that one with a 10 Gb network is much better. We are therefore making the case to the IT staff that all our systems should have GPFS running on 10 Gb networks. Some of them have a hard time accepting this, since they don't really understand the requirements of our applications.

With that background, here is my MPI-related question. I recently added an option to use MPI-IO to do the heavy I/O lifting in our applications. I would like to know the relative importance of the dedicated MPI network vis-à-vis the GPFS network for typical MPI-IO collective reads and writes. I assume there must be some hand-off of data between the networks during the process, but how is it done, and are there any rules of thumb to help understand it? (For concreteness, a minimal sketch of the collective write pattern I mean is appended below.) Any insights would be welcome.

T. Rosmond

P.S. I am running with Open MPI 1.4.2.
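Here is the sketch I mentioned: a minimal collective write, roughly the pattern our new MPI-IO option follows. The file name, block size, and offsets are placeholders, not taken from our actual code.

/* Minimal sketch of a collective MPI-IO write: each rank writes one
 * contiguous block of doubles to a shared file.  Compile with mpicc.
 * The file name and block size below are placeholders. */
#include <mpi.h>
#include <stdlib.h>

#define NDOUBLES 1000000   /* doubles written per rank; illustrative only */

int main(int argc, char **argv)
{
    int rank, i;
    MPI_File fh;
    MPI_Offset offset;
    double *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank prepares its own contiguous block of data. */
    buf = (double *) malloc(NDOUBLES * sizeof(double));
    for (i = 0; i < NDOUBLES; i++)
        buf[i] = (double) rank;

    /* All ranks open the same file on the parallel file system. */
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* Rank r writes its block at offset r * NDOUBLES doubles.  The
     * _all suffix makes the call collective, so the library is free
     * to aggregate data across ranks (over the MPI network) before
     * it is written to the file system (over the GPFS network). */
    offset = (MPI_Offset) rank * NDOUBLES * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, NDOUBLES,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}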