I'm reasonably confident now that this hang is unrelated to the
"writers contributing zero data" workaround.  The three ranks that
have made it to H5Dmpio.c:1479 all have nonzero nelmts in the call to
H5D__chunk_collective_write() up the stack.  (And I did check that
they're all still trying to write to the same dataset.)
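
(For reference, the check was just walking up the stack to the
H5D__chunk_collective_write() frame on each stuck rank and printing
nelmts there; the frame number below is illustrative, not the real one:)

(gdb) bt
(gdb) frame 4          <- whichever frame is H5D__chunk_collective_write()
(gdb) p nelmts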

Here's what I see in rank 0:

(gdb) p *chunk_entry
$5 = {index = 0, scaled = {0, 0, 0, 18446744073709551615 <repeats 30 times>},
      full_overwrite = false, num_writers = 4, io_size = 832, buf = 0x0,
      chunk_states = {chunk_current = {offset = 4720, length = 6},
                      new_chunk = {offset = 4720, length = 6}},
      owners = {original_owner = 0, new_owner = 0},
      async_info = {receive_requests_array = 0x30c2870,
                    receive_buffer_array = 0x30c2f20,
                    num_receive_requests = 3}}

And here's what I see in rank 3:

(gdb) p *chunk_list
$3 = {index = 0, scaled = {0 <repeats 33 times>}, full_overwrite = false,
      num_writers = 4, io_size = 592, buf = 0x0,
      chunk_states = {chunk_current = {offset = 4720, length = 6},
                      new_chunk = {offset = 4720, length = 6}},
      owners = {original_owner = 3, new_owner = 0},
      async_info = {receive_requests_array = 0x0,
                    receive_buffer_array = 0x0,
                    num_receive_requests = 0}}

The loop index "j" in the receive loop in rank 0 is still 0, which
suggests it has not received any of the messages from the other ranks.
The breakage could certainly be down in the MPI implementation.  I'm
running Intel's build of MVAPICH2 2.2 (as bundled with their current
Omni-Path release), and it has visible performance "issues" in my dev
environment, so it isn't implausible that it's simply not delivering
these messages.  It's just odd that it slogs through the unfiltered
case fine and hangs only in this filtered case.
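
For context, here's roughly the pattern I believe this receive path
follows, pieced together from the async_info fields above -- a hedged
sketch with my own function name, tag, and buffer handling, not the
actual H5Dmpio.c code.  The point is just that the chunk's new owner
posts one nonblocking receive per contributing writer and then waits
for them, so a single undelivered message is enough to hang the
collective write:

#include <stdlib.h>
#include <mpi.h>

/* Hypothetical sketch only: the chunk's new owner collects modification
 * messages from the other writers of the chunk (num_receive_requests of
 * them, each io_size bytes here for simplicity), then applies them and
 * re-runs the filter pipeline. */
static int
receive_chunk_modifications(int num_receive_requests, size_t io_size,
                            int chunk_msg_tag, MPI_Comm comm)
{
    MPI_Request    *requests = malloc((size_t)num_receive_requests * sizeof(*requests));
    unsigned char **buffers  = malloc((size_t)num_receive_requests * sizeof(*buffers));

    /* Post one nonblocking receive per contributing writer. */
    for (int j = 0; j < num_receive_requests; j++) {
        buffers[j] = malloc(io_size);
        MPI_Irecv(buffers[j], (int)io_size, MPI_BYTE, MPI_ANY_SOURCE,
                  chunk_msg_tag, comm, &requests[j]);
    }

    /* Receive loop: if no message is ever delivered, we block here with
     * j still at 0, which is the state rank 0 appears to be stuck in. */
    for (int j = 0; j < num_receive_requests; j++) {
        int idx;
        MPI_Waitany(num_receive_requests, requests, &idx, MPI_STATUS_IGNORE);
        /* ... apply buffers[idx] to the owner's copy of the chunk ... */
    }

    for (int j = 0; j < num_receive_requests; j++)
        free(buffers[j]);
    free(buffers);
    free(requests);
    return 0;
}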



On Wed, Nov 8, 2017 at 12:57 PM, Jordan Henderson
<[email protected]> wrote:
> Ah yes, I see what you mean about how the choice between these two sets
> of routines causes problems for in-tree versus out-of-tree plugins.  It's
> a particularly interesting conflict: allocating the chunk data buffers
> with the H5MM_ routines is the compliant choice by the standards of HDF5
> library development, but it causes issues for plugins that use the raw
> memory routines.  Conversely, if the chunk buffers were allocated with
> the raw routines, that would break compatibility with the in-tree
> filters.  Thank you for bringing this to my attention; I'll need to think
> on this one, as there are a few different ways of approaching the
> problem, some more "correct" than others.
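
To make that mismatch concrete, here's a hedged sketch of what a typical
out-of-tree filter callback does with the chunk buffer it is handed
(made-up names, no real compression -- not the actual plugin or library
code).  The plugin resizes/frees the buffer with the public raw memory
routines (or plain malloc/free), so this only works if the buffer the
library handed it came from a compatible allocator:

#include <stdlib.h>
#include <string.h>
#include "hdf5.h"

/* Hypothetical H5Z_func_t callback for an out-of-tree filter. */
static size_t
my_filter(unsigned int flags, size_t cd_nelmts, const unsigned int cd_values[],
          size_t nbytes, size_t *buf_size, void **buf)
{
    (void)flags; (void)cd_nelmts; (void)cd_values;

    /* Pretend the filtered data needs a bigger buffer. */
    size_t out_nbytes = nbytes + 64;
    void  *out_buf    = H5allocate_memory(out_nbytes, 0);

    if (out_buf == NULL)
        return 0;
    memcpy(out_buf, *buf, nbytes);

    /* This is where the allocator families collide: the plugin frees the
     * old buffer with the raw routine, which is only safe if that buffer
     * was allocated with a matching allocator.  If the library allocated
     * it with its internal H5MM_ routines, the two families have to be
     * interchangeable for this to be well-defined. */
    H5free_memory(*buf);

    *buf      = out_buf;
    *buf_size = out_nbytes;
    return out_nbytes;
}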

