On 01.09.2015 at 17:14, Miller, Mark C. wrote:
> Our code (running on an IBM BlueGene/Q machine) reads in some data,
> using HDF5. This is done collectively, on each core (everyone reads in
> the same data, at the same time). It is not known a priori which
> processor owns which part of the data, they have to compute this
> themselves and discard the data they don't own.
>
> Hmm. For this use case, I assume the data to be read is *always* small
> enough for a single core. Why not read it independently to one core and
> broadcast and/or MPI_Send yourself? I understand that is not your use
> case, but what I suggest will very likely be much more scalable for large
> core counts vs. all cores attacking the filesystem for the same bunch of
> bytes.
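For reference, a minimal sketch of the read-on-one-rank-and-broadcast approach suggested above. The dataset name "/data", the plain-double element type, and the helper name read_and_broadcast are placeholders, not taken from the actual code; it also assumes hsize_t matches MPI_UNSIGNED_LONG_LONG.

#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

/* Sketch only: rank 0 reads a small dataset serially, then broadcasts it
 * to every rank on the communicator. */
double *read_and_broadcast(const char *fname, MPI_Comm comm, hsize_t *n_out)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    hsize_t n = 0;
    double *buf = NULL;

    if (rank == 0) {
        /* Serial HDF5 access: only rank 0 touches the file system. */
        hid_t file  = H5Fopen(fname, H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset  = H5Dopen2(file, "/data", H5P_DEFAULT);
        hid_t space = H5Dget_space(dset);
        H5Sget_simple_extent_dims(space, &n, NULL);   /* 1-D dataset assumed */

        buf = malloc(n * sizeof(double));
        H5Dread(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

        H5Sclose(space);
        H5Dclose(dset);
        H5Fclose(file);
    }

    /* Everyone learns the element count, then receives the data.
     * hsize_t is assumed to map to MPI_UNSIGNED_LONG_LONG here. */
    MPI_Bcast(&n, 1, MPI_UNSIGNED_LONG_LONG, 0, comm);
    if (rank != 0)
        buf = malloc(n * sizeof(double));
    MPI_Bcast(buf, (int)n, MPI_DOUBLE, 0, comm);

    *n_out = n;
    return buf;
}

For data in the megabyte range, a single MPI_Bcast across the partition is typically far cheaper than every rank opening and reading the same file, which seems to be the point of the suggestion.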
Thanks for your reply, Mark. In the code snippet I sent, we read the data in chunks if it is not sufficiently small -- we limit each chunk to ~48 MiB, which should always fit. We did consider reading with one process and then broadcasting (or some mildly parallel version of the same procedure), but decided against it. However, we might reconsider if the current approach turns out to be infeasible or a performance problem.

> The data file is ~9.4 MB
> in a simple test case.
>
> That certainly sounds small enough that what you describe should work at
> any core count.

Precisely. There's nothing else in memory in the test case, either, and it crashes even if each core has 2 GiB of memory.

> The data is a custom data type of a nested struct
> with two 32-bit integers and two 64-bit doubles that form a complex
> number, with a total of 192 bits.
>
> Do you happen to know if any HDF5 'filters' are involved in reading this
> data (compression, custom conversion, etc.)?

No compression or further conversions, just nested structs of native datatypes, as indicated in the sample code.

> If I use fewer than 1024 cores, there is no problem. However, for >=1024
> cores, I get a crash with the error
>
> Is there really 'no problem' or is the problem really happening but it's
> just not bad enough to cause OOM? I mean, maybe 512 cores fail on files
> of 18.8 MB size?

There really seems to be no problem in that case. As I mentioned, for fewer cores the problem goes away if the chunk size is 192 KiB. In a different test, 512 ranks worked fine with much larger data sets, while 1024 ranks failed in every case.

> "Out of memory in file
> /bgsys/source/srcV1R2M3.12428/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bg/ad_bg_rdcoll.c,
> line 1073"
>
> That's only where the last allocation failed causing OOM, right? Can you
> run a small problem with valgrind (maybe with the massif heap-sizing tool)
> to see what's happening as far as mallocs and frees?
>
> [...]
>
> Again, that's only getting you to the last malloc that failed. You need
> to use some kind of tool to find out where all the memory is getting
> allocated, like valgrind or memtrace or something.

Haven't done that yet, but I will try (if possible). On a standard cluster it worked fine for tests with up to 128 ranks (I couldn't try with 1024 ranks there).

> The problem disappears if I read in the file in chunks of less than
> 192 KiB at a time. A more workable workaround is to replace collective
> communication by independent communication, in which case the problem
> disappears.
> H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE); -->
> H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_INDEPENDENT);
>
> Since this data file is quite small (usually not larger than a few
> hundred megabytes at most), reading the file independently is not a
> huge performance problem at this stage, but for very large simulations
> it might be.
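For reference, a minimal sketch of the kind of nested compound type and the collective transfer property list described above. The struct layout, member names, the dataset name "/records", and the helper read_records are illustrative placeholders rather than the actual code; the file handle is assumed to have been opened with an MPI-IO file access property list (H5Pset_fapl_mpio).

#include <hdf5.h>
#include <stdint.h>

/* Illustrative layout only: two 32-bit integers plus a complex number made
 * of two doubles, 192 bits per record; names are not from the actual code. */
typedef struct { double re, im; } complex_t;
typedef struct { int32_t i, j; complex_t val; } record_t;

/* 'file' is assumed to have been opened with an MPI-IO fapl;
 * 'buf' is assumed to be large enough to hold the whole dataset. */
static herr_t read_records(hid_t file, record_t *buf)
{
    /* Inner compound: the complex number. */
    hid_t c_tid = H5Tcreate(H5T_COMPOUND, sizeof(complex_t));
    H5Tinsert(c_tid, "re", HOFFSET(complex_t, re), H5T_NATIVE_DOUBLE);
    H5Tinsert(c_tid, "im", HOFFSET(complex_t, im), H5T_NATIVE_DOUBLE);

    /* Outer compound: two 32-bit integers plus the nested complex. */
    hid_t r_tid = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
    H5Tinsert(r_tid, "i",   HOFFSET(record_t, i),   H5T_NATIVE_INT32);
    H5Tinsert(r_tid, "j",   HOFFSET(record_t, j),   H5T_NATIVE_INT32);
    H5Tinsert(r_tid, "val", HOFFSET(record_t, val), c_tid);

    /* Collective transfer property list; the workaround quoted above is to
     * request H5FD_MPIO_INDEPENDENT here instead. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    hid_t dset = H5Dopen2(file, "/records", H5P_DEFAULT);
    herr_t status = H5Dread(dset, r_tid, H5S_ALL, H5S_ALL, dxpl, buf);

    H5Dclose(dset);
    H5Pclose(dxpl);
    H5Tclose(r_tid);
    H5Tclose(c_tid);
    return status;
}

With H5S_ALL for both the memory and the file dataspace, every rank requests the entire dataset, which is exactly the "everyone reads everything" pattern discussed in this thread.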
> In other, older parts of the code, we're (successfully!) reading in (up
> to) 256 GiB of data in predefined data types (double, float) using
> H5FD_MPIO_COLLECTIVE without any problem, so I'm thinking this problem
> is connected with the user-defined data type in some way.
>
> Are these collective calls all reading the same part of a dataset or all
> reading different parts? The use case you described above sounded like
> all cores were reading the same (whole) dataset. And 256 GiB is
> large enough that no one core could hold all of it, so each core here
> *must* be reading a different part of a dataset. Perhaps that is relevant?

Yes, indeed, it's the fact that every rank reads /everything/ that causes the problem. If every rank reads only part of the file, there is no problem, but that's not our use case here. I'm asking on the hdf-forum because the problem occurred using H5Dread, and the case seems innocent, simple, and small enough that this shouldn't happen. But (see Scot's email in this thread) it seems that MPICH is the real culprit, and no solution has been put forth in the last two years, only workarounds.

Thanks again for your questions and suggestions.

Wolf
