On 09/01/2015 02:01 PM, Wolf Dapp wrote:
Am 01.09.2015 um 17:43 schrieb Scot Breitenfeld:
I was also getting the same error with MOAB from ANL when we were
benchmarking small mesh reads with a large number of processes. When I
ran on 16384 processes, the job would terminate with:
Out of memory in file
/bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bg/ad_bg_rdcoll.c,
line 1073
A semi-discussion about the problem can be found here:
http://lists.mpich.org/pipermail/devel/2013-May/000154.html
We did not have time in the project to look into the problem any further.
Scot
Thanks for pointing out this discussion, Scot. It seems that not only
did you not have time to investigate the problem further, but neither
did IBM nor MPICH :)
I guess this indicates that it's not an HDF5 problem but an MPICH
problem at heart, and that there are some memory allocations that scale
with the number of ranks.
Though it seems your team hit the "invisible barrier" much later than we
did.
Hello! I'm pleased to see another Blue Gene user.
MPI collective I/O works at Blue Gene scale -- most of the time. The
exception appears to be when the distribution of data among processes
is lumpy; e.g. everyone reads the exact same data, or some processes
have more to write than others. In those cases, some internal memory
allocations end up exhausting Blue Gene's memory.
You can limit the size of the intermediate buffer by setting the
"cb_buffer_size" hint. Doing this splits up the read or write into more
rounds and so indirectly limits the total memory used. It's only a
band-aid, though.
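Just as a sketch of what that looks like from the HDF5 side (the 4 MiB
value, the file name, and the helper function are made up for
illustration, not recommendations):

    /* Sketch: pass a smaller cb_buffer_size through an MPI_Info object
     * into the MPI-IO file access property list. */
    #include <mpi.h>
    #include <hdf5.h>

    hid_t open_with_small_cb_buffer(const char *fname, MPI_Comm comm)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_buffer_size", "4194304");  /* bytes, per aggregator */

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, comm, info);   /* the hint rides along with the fapl */

        hid_t file = H5Fopen(fname, H5F_ACC_RDONLY, fapl);

        H5Pclose(fapl);
        MPI_Info_free(&info);
        return file;
    }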
The read-and-broadcast approach is the best for your workload, and the
one I end up suggesting any time this comes up.
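A minimal read-and-broadcast sketch, assuming everyone wants the same
whole dataset; the file name, dataset path, size, and element type
below are placeholders:

    /* Sketch: rank 0 reads the dataset serially, everyone else gets it
     * over the network.  No collective buffering, so none of the
     * rank-count-scaled allocations inside ROMIO. */
    #include <mpi.h>
    #include <hdf5.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int nelems = 1000000;               /* assumed dataset size */
        double *buf = malloc(nelems * sizeof(double));

        if (rank == 0) {
            /* only rank 0 touches the file system */
            hid_t file = H5Fopen("mesh.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
            hid_t dset = H5Dopen2(file, "/coordinates", H5P_DEFAULT);
            H5Dread(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
            H5Dclose(dset);
            H5Fclose(file);
        }

        /* data travels over the interconnect instead of the file system */
        MPI_Bcast(buf, nelems, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        free(buf);
        MPI_Finalize();
        return 0;
    }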
Why don't we do this inside the MPI-IO library? Glad you asked! It
turns out that, for a lot of reasons (file views, etypes, ftypes, and
the fact that different datatypes may have identical type maps yet
there's no good way to compare types in that respect), answering "did
you all want to read the same data?" is actually kind of challenging in
the MPI-IO library.
It's easier to detect identical reads in HDF5, because one need only
look at the (hyperslab) selection: determining "you are all asking for
the entire dataset" or "you are all asking for one row of this 3d
variable" requires only comparing two N-d arrays. That comparison is
likely expensive at scale, though, so "easier" does not necessarily mean
"good idea" -- I don't think we'd want it turned on for every access.
So that leaves the application, which indeed knows that everyone is
reading the same data. It sort of sounds like passing the buck, and
perhaps it is, but not for lack of effort from the other layers of the
software stack.
==rob
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5