On 09/01/2015 02:01 PM, Wolf Dapp wrote:
On 09/01/2015 05:43 PM, Scot Breitenfeld wrote:
I was also getting the same error with MOAB from ANL when we were
benchmarking small mesh reads with a large number of processors. When I
ran on 16384 processes, the job would terminate with:

Out of memory in file
/bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bg/ad_bg_rdcoll.c,
line 1073

A semi-discussion about the problem can be found here:

http://lists.mpich.org/pipermail/devel/2013-May/000154.html

We did not have time in the project to look into the problem any further.

Scot

Thanks for pointing out this discussion, Scot. It seems that not only
did you not have time to investigate the problem further, but neither
did IBM nor MPICH :)

I guess this indicates that, at heart, it's not an HDF5 problem but an
MPICH problem, and that there are some memory allocations that scale
with the number of ranks.

Though it seems your team hit the "invisible barrier" much later than we
did.

Hello!  I'm pleased to see another Blue Gene user.

MPI collective I/O works at Blue Gene scale -- most of the time. The exception appears to be when the distribution of data among processes is lumpy; e.g. everyone reads the exact same data, or some processes have more to write than others. In those cases, some internal memory allocations end up exhausting Blue Gene's memory.

You can limit the size of the intermediate buffer by setting the "cb_buffer_size" hint. Doing this splits the read or write into more rounds and so indirectly limits the total memory used. It's only a band-aid, though.
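For illustration, something along these lines gets that hint to ROMIO through HDF5's MPI-IO driver (parallel HDF5 assumed). The 4 MiB value is just an example, not a recommendation; you'd tune it for your runs:

#include <mpi.h>
#include <hdf5.h>

/* Sketch: pass the ROMIO "cb_buffer_size" hint to HDF5's MPI-IO driver.
 * The 4 MiB value below is only an example; tune it for your system. */
void open_with_small_cb_buffer(const char *filename)
{
    MPI_Info info;
    MPI_Info_create(&info);
    /* Hint values are strings; this caps the collective buffering buffer. */
    MPI_Info_set(info, "cb_buffer_size", "4194304");

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);

    hid_t file = H5Fopen(filename, H5F_ACC_RDONLY, fapl);

    /* ... collective reads and writes as usual ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Info_free(&info);
}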

The read-and-broadcast approach is the best for your workload, and the one I end up suggesting any time this comes up.
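A minimal sketch of that pattern, assuming a 1-d dataset of doubles named "/data" (both assumptions on my part, not anything from your code): rank 0 opens the file with the serial driver, reads the whole dataset, and broadcasts the size and then the buffer.

#include <mpi.h>
#include <hdf5.h>
#include <stdlib.h>

/* Sketch: read-and-broadcast.  Rank 0 reads the dataset with the serial
 * HDF5 driver; every other rank receives the data via MPI_Bcast. */
double *read_and_bcast(const char *filename, hsize_t *npoints_out)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long long npoints = 0;
    double *buf = NULL;

    if (rank == 0) {
        hid_t file  = H5Fopen(filename, H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset  = H5Dopen2(file, "/data", H5P_DEFAULT);
        hid_t space = H5Dget_space(dset);
        npoints = (long long) H5Sget_simple_extent_npoints(space);

        buf = malloc(npoints * sizeof(double));
        H5Dread(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

        H5Sclose(space);
        H5Dclose(dset);
        H5Fclose(file);
    }

    /* Tell everyone how big the dataset is, then ship the data itself.
     * For datasets with more than INT_MAX elements, broadcast in chunks
     * instead of casting the count to int. */
    MPI_Bcast(&npoints, 1, MPI_LONG_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0)
        buf = malloc(npoints * sizeof(double));
    MPI_Bcast(buf, (int) npoints, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    *npoints_out = (hsize_t) npoints;
    return buf;
}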

Why don't we do this inside the MPI-IO library? Glad you asked! It turns out that, for a lot of reasons (file views, etypes, filetypes, and the fact that different datatypes may have identical type maps yet there's no good way to compare types in that sense), answering "did you all want to read the same data?" is actually kind of challenging in the MPI-IO library.

It's easier to detect identical reads in HDF5, because one need only look at the (hyperslab) selection: determining "you are all asking for the entire dataset" or "you are all asking for one row of this 3-d variable" requires only comparing two N-d arrays. That comparison is likely expensive at scale, though, so "easier" does not necessarily mean "good idea" -- I don't think we'd want this turned on for every access.
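Just to illustrate what such a check might look like (this is not something HDF5 does today), here is a toy that only compares the bounding box of each rank's selection with two reductions; identical bounding boxes don't prove identical selections in general, so take it as a sketch of the idea, nothing more:

#include <mpi.h>
#include <hdf5.h>
#include <string.h>

/* Sketch: did every rank select the same bounding box in this dataspace?
 * Element-wise min and max of the bounds across ranks agree exactly when
 * every rank supplied the same bounds. */
int all_ranks_same_bounds(hid_t space_id, MPI_Comm comm)
{
    hsize_t start[H5S_MAX_RANK], end[H5S_MAX_RANK];
    H5Sget_select_bounds(space_id, start, end);

    int ndims = H5Sget_simple_extent_ndims(space_id);

    unsigned long long mine[2 * H5S_MAX_RANK];
    unsigned long long gmin[2 * H5S_MAX_RANK], gmax[2 * H5S_MAX_RANK];
    for (int i = 0; i < ndims; i++) {
        mine[i]         = (unsigned long long) start[i];
        mine[ndims + i] = (unsigned long long) end[i];
    }

    MPI_Allreduce(mine, gmin, 2 * ndims, MPI_UNSIGNED_LONG_LONG, MPI_MIN, comm);
    MPI_Allreduce(mine, gmax, 2 * ndims, MPI_UNSIGNED_LONG_LONG, MPI_MAX, comm);

    return memcmp(gmin, gmax, 2 * ndims * sizeof(unsigned long long)) == 0;
}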

So that leaves the application, which indeed knows everyone is reading the same data. It sort of sounds like passing the buck, and perhaps it is, but not for lack of effort from the other layers of the software stack.

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
