I probably have a number of silly/dumb questions, but someone is bound to ask them.

From: Hdf-forum <[email protected]> on behalf of Wolf Dapp <[email protected]>
Reply-To: HDF Users Discussion List <[email protected]>
Date: Tuesday, September 1, 2015 7:34 AM
To: HDF Users Discussion List <[email protected]>
Cc: Wolf Dapp <[email protected]>
Subject: [Hdf-forum] H5Dread crashes on Bluegene/Q with "out of memory"

Dear forum members,

this may be too specialized a problem, but maybe somebody still has some
insights.

Our code (running on an IBM BlueGene/Q machine) reads in some data
using HDF5. This is done collectively, on each core (everyone reads in
the same data, at the same time). It is not known a priori which
processor owns which part of the data; each processor has to compute this
itself and discard the data it doesn't own.

Hmm. For this use case, I assume the data to be read is *always* small enough 
for a single core. Why not read it independently on one core and broadcast 
and/or MPI_Send it yourself? I understand that is not what you're doing now, but 
what I suggest will very likely be much more scalable at large core counts than 
all cores attacking the filesystem for the same bunch of bytes.
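
Roughly what I have in mind, as a sketch (the file name, dataset name, and 
memtype handle below are placeholders, not taken from your code):

/* Read the whole dataset on rank 0 with serial HDF5, then broadcast it.
 * "data.h5" and "/mydata" are invented names for illustration. */
#include <mpi.h>
#include <hdf5.h>

void read_and_bcast(MPI_Comm comm, hid_t memtype, size_t elem_size,
                    hsize_t nelems, void *buf /* nelems * elem_size bytes */)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (rank == 0) {
        /* Plain serial HDF5 on rank 0 only -- no MPI-IO involved. */
        hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset = H5Dopen2(file, "/mydata", H5P_DEFAULT);
        H5Dread(dset, memtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
        H5Dclose(dset);
        H5Fclose(file);
    }

    /* Everyone ends up with the same bytes; each rank then keeps only
     * the part it owns and discards the rest, as you do now. */
    MPI_Bcast(buf, (int)(nelems * elem_size), MPI_BYTE, 0, comm);
}

For a ~9.4 MB file, the single MPI_Bcast should be far cheaper than 1024+ cores 
each issuing the same read.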

The data file is ~9.4 MB
in a simple test case.

That certainly sounds small enough that what I describe above (read once and 
broadcast) should work at any core count.

The data is a custom data type of a nested struct
with two 32-bit integers and two 64-bit doubles that form a complex
number, with a total of 192 bits.

Do you happen to know if any HDF5 'filters' are involved in reading this data 
(compression, custom conversion, etc.)?
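
For reference, the kind of compound type I'm picturing from your description; 
the struct layout and field names below are guesses on my part, not your actual 
code:

/* Guess at the 192-bit element: two 32-bit ints plus a nested complex
 * number made of two 64-bit doubles. All names are invented. */
#include <hdf5.h>

typedef struct { double re, im; } complex_t;
typedef struct { int i, j; complex_t val; } elem_t;   /* 4+4+8+8 = 24 bytes */

static hid_t make_elem_type(void)
{
    /* Inner compound for the complex number. */
    hid_t ctype = H5Tcreate(H5T_COMPOUND, sizeof(complex_t));
    H5Tinsert(ctype, "re", HOFFSET(complex_t, re), H5T_NATIVE_DOUBLE);
    H5Tinsert(ctype, "im", HOFFSET(complex_t, im), H5T_NATIVE_DOUBLE);

    /* Outer compound nesting the inner one. */
    hid_t etype = H5Tcreate(H5T_COMPOUND, sizeof(elem_t));
    H5Tinsert(etype, "i",   HOFFSET(elem_t, i),   H5T_NATIVE_INT);
    H5Tinsert(etype, "j",   HOFFSET(elem_t, j),   H5T_NATIVE_INT);
    H5Tinsert(etype, "val", HOFFSET(elem_t, val), ctype);
    H5Tclose(ctype);   /* the outer type keeps its own copy of the member */
    return etype;
}

If the memory-side and file-side types don't match bit for bit, H5Dread goes 
through HDF5's datatype conversion path on top of the MPI-IO collective 
buffering, which is why I ask.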


If I use less than 1024 cores, there is no problem. However, for >=1024
cores, I get a crash with the error

Is there really 'no problem', or is the problem really happening but just not 
bad enough to cause the OOM? I mean, maybe 512 cores would fail on a file of 
18.8 MB?


"Out of memory in file
/bgsys/source/srcV1R2M3.12428/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bg/ad_bg_rdcoll.c,
line 1073"

That's only where the last allocation failed, causing the OOM, right? Can you run 
a small problem under valgrind (maybe with the massif heap-profiling tool) to see 
what's happening as far as mallocs and frees?


We use parallel HDF5 1.8.15; I've also tried 1.8.14. Another library
dependency is FFTW 3.3.3, but that should not really matter.

I traced the crash with Totalview to the call of H5Dread(). The
second-to-last call in the crash trace is MPIDO_Alltoallv, the last one
is PAMI_Context_trylock_advancev. I don't have exact calls or line
numbers since the HDF5 library was not compiled with debug symbols. [The
file mentioned in the error message is not accessible.]

Again, that's only getting you to the last malloc that failed. You need to use 
some kind of tool, like valgrind or memtrace or something, to find out where all 
the memory is getting allocated.


Is this an HDF5 problem, or a problem with IBM's MPI implementation?
Might it be an MPI buffer overflow?!? Or is there maybe a problem with
data contiguity in the struct?

It's not possible to say at this point.


The problem disappears if I read in the file in chunks of less than
192 KiB at a time. A more workable workaround is to replace collective
I/O with independent I/O, which also makes the problem disappear:
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE); -->
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_INDEPENDENT);
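
For context, a condensed sketch of how that property enters the read path; the 
identifiers here are illustrative, not the actual ones from the attached code:

#include <mpi.h>
#include <hdf5.h>

/* Condensed read path; only the transfer property list setting changes
 * between the failing (collective) and working (independent) variants. */
static void read_dataset(hid_t memtype, void *buf)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);  /* parallel file access */
    hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, fapl);
    hid_t dset = H5Dopen2(file, "/mydata", H5P_DEFAULT);

    hid_t plist_id = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_INDEPENDENT);  /* COLLECTIVE crashes at >=1024 cores */

    H5Dread(dset, memtype, H5S_ALL, H5S_ALL, plist_id, buf);

    H5Pclose(plist_id);
    H5Dclose(dset);
    H5Fclose(file);
    H5Pclose(fapl);
}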

Since this data file is quite small (usually not larger than a few
hundred megabytes at most), reading in the file independently is not a
huge performance problem at this stage, but for very large simulations
it might be.

In other, older parts of the code, we're (successfully!) reading in (up
to) 256 GiB of data in predefined data types (double, float) using
H5FD_MPIO_COLLECTIVE without any problem, so I'm thinking this problem
is connected with the user-defined data type in some way.

Are these collective calls all reading the same part of a dataset, or all 
reading different parts? The use case you described above sounded like all 
cores were reading the same (whole) dataset. And 256 GiB is large enough 
that no one core could hold all of it, so each core here *must* be reading a 
different part of the dataset. Perhaps that is relevant?
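
For clarity, the pattern I'm asking about in the 256 GiB case is the usual 
one-slab-per-rank collective read, roughly like this (all names are 
illustrative, and I assume a 1-D dataset that divides evenly among ranks just 
to keep the sketch short):

#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

/* Each rank collectively reads its own contiguous slice of a 1-D dataset. */
static void read_my_slice(hid_t file, hid_t memtype, size_t elem_size)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    hid_t dset   = H5Dopen2(file, "/bigdata", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    hsize_t total;
    H5Sget_simple_extent_dims(fspace, &total, NULL);

    hsize_t count = total / nprocs;            /* assume it divides evenly */
    hsize_t start = (hsize_t)rank * count;
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &count, NULL);

    hid_t mspace = H5Screate_simple(1, &count, NULL);
    void  *buf   = malloc(count * elem_size);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dread(dset, memtype, mspace, fspace, dxpl, buf);

    /* ... use buf ... */

    free(buf);
    H5Pclose(dxpl);
    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
}

That is a rather different MPI-IO pattern from 1024+ ranks collectively reading 
the identical small region, so the fact that the 256 GiB case works may not tell 
us much about this one.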


I attach some condensed code with all calls to the HDF5 library; I'm not
sure anyone is in a position to actually reproduce this problem, so
the main() routine and the data file are probably unnecessary. However,
I'd be happy to also send those if need be.

If you cannot run valgrind/massif on BG/Q, try running the same code on another 
machine where you *can* run valgrind/massif. If non-system code is the culprit, 
you will be able to reproduce memory growth elsewhere. OTOH, if there is a 
problem down in BG/Q system code, moving to another machine would hide the 
problem.

Sorry I have only questions, but it's worth asking.


Thanks in advance for any hints.

Best regards,
Wolf
