It's not even clear to me yet whether this is the same dataset that triggered the assert. Working on getting complete details. But FWIW the PETSc code does not call H5Sselect_none(). It calls H5Sselect_hyperslab() in all ranks, and that's why the ranks in which the slice is zero columns wide hit the "empty sel_chunks" pathway I added to H5D__create_chunk_mem_map_hyper().
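To make that concrete, here is roughly the shape of the selection every rank ends up making (a simplified sketch, not the actual PETSc code; the function and variable names are placeholders I made up). The point is that a rank whose slice is zero columns wide still selects a zero-width hyperslab rather than calling H5Sselect_none():

#include <hdf5.h>

/* Simplified sketch (not the actual PETSc code): each rank writes its own
 * slice of columns of a 2-D (nrows x ncols) dataset.  "dset" and "dxpl" are
 * assumed to be a chunked dataset and a collective transfer property list;
 * "my_first_col" and "my_ncols" come from the column partitioning, and
 * my_ncols is 0 on ranks that own no columns. */
static herr_t write_my_columns(hid_t dset, hid_t dxpl, hsize_t nrows,
                               hsize_t my_first_col, hsize_t my_ncols,
                               const double *local_buf)
{
    hsize_t start[2] = {0, my_first_col};
    hsize_t count[2] = {nrows, my_ncols};   /* zero-width on the empty ranks */

    /* Every rank selects a hyperslab, even when my_ncols == 0;
     * nobody calls H5Sselect_none(). */
    hid_t filespace = H5Dget_space(dset);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t memspace = H5Screate_simple(2, count, NULL);

    herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                             dxpl, local_buf);

    H5Sclose(memspace);
    H5Sclose(filespace);
    return status;
}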
On Wed, Nov 8, 2017 at 12:02 PM, Michael K. Edwards <[email protected]> wrote:
> Thanks, Jordan. I recognize that this is very recent feature work, and my goal is to help push it forward.
>
> My current use case is relatively straightforward, though there are a couple of layers on top of HDF5 itself. The problem can be reproduced by building PETSc 3.8.1 against libraries built from the develop branch of HDF5, adding in the H5Pset_filter() calls, and running an example that exercises them. (I'm using src/snes/examples/tutorials/ex12.c with the -dm_view_hierarchy flag to induce HDF5 writes.) If you want, I can supply full details for you to reproduce it locally, or I can do any experiments you'd like me to within this setup. (It also involves patches to the out-of-tree H5Z plugins to make them use H5MM_malloc/H5MM_xfree rather than raw malloc/free, which in turn involves exposing H5MMprivate.h to the plugins. Is this something you've solved in a different way?)
>
> On Wed, Nov 8, 2017 at 11:44 AM, Jordan Henderson <[email protected]> wrote:
>> Hi Michael,
>>
>> During the design phase of this feature I tried to both account for and test the case where some of the writers do not have any data to contribute. However, it seems that your use case falls outside of what I have tested (perhaps I have not used enough ranks?). In particular, my test cases were small and simply had some of the ranks call H5Sselect_none(), which doesn't seem to trigger this particular assertion failure. Is this how you're approaching these particular ranks in your code, or is there a different way you are having them participate in the write operation?
>>
>> As for the hanging issue, it looks as though rank 0 is waiting to receive some modification data from another rank for a particular chunk. Whether or not there is actually valid data that rank 0 should be waiting for, I cannot easily tell without being able to trace it through. Since the other ranks have finished modifying their particular sets of chunks, they have moved on and are waiting for everyone to get together and broadcast their new chunk sizes so that free space in the file can be collectively re-allocated, but of course rank 0 is not proceeding. My best guess is that either:
>>
>> The "num_writers" field of the chunk struct corresponding to the particular chunk that rank 0 is working on has been set incorrectly, causing rank 0 to think that more ranks are writing to the chunk than actually are, and consequently to wait forever for a non-existent MPI message,
>>
>> or
>>
>> The "new_owner" field of the chunk struct for this chunk was set incorrectly on the other ranks, causing them to never issue an MPI_Isend to rank 0, which also leaves rank 0 waiting for a non-existent MPI message.
>>
>> This feature should still be regarded as being in beta, and its complexity can lead to difficult-to-track-down bugs such as the ones you are currently encountering. That being said, your feedback is very useful and will help push this feature towards a production-ready level of quality. Also, if it is feasible to come up with a minimal example that reproduces this issue, it would be very helpful and would make it much easier to diagnose exactly why these failures are occurring.
>>
>> Thanks,
>> Jordan
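(For contrast, my understanding of the pattern Jordan describes testing is something like the following. This is a rough sketch, not his actual test code; "dset" and "dxpl" are placeholder handles.)

#include <hdf5.h>

/* Rough sketch of the alternative described above (not Jordan's actual test
 * code): a rank with nothing to contribute still participates in the
 * collective H5Dwrite(), but with an explicitly empty selection made via
 * H5Sselect_none(). */
static herr_t write_nothing(hid_t dset, hid_t dxpl)
{
    hid_t filespace = H5Dget_space(dset);
    H5Sselect_none(filespace);          /* zero elements selected in the file */

    hid_t memspace = H5Scopy(filespace);
    H5Sselect_none(memspace);           /* zero elements selected in memory */

    double dummy = 0.0;                 /* never dereferenced; nothing is selected */
    herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                             dxpl, &dummy);

    H5Sclose(memspace);
    H5Sclose(filespace);
    return status;
}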
Edwards <[email protected]> >> Sent: Wednesday, November 8, 2017 11:23 AM >> To: Miller, Mark C. >> Cc: HDF Users Discussion List >> Subject: Re: [Hdf-forum] Collective IO and filters >> >> Closer to 1000 ranks initially. There's a bug in handling the case >> where some of the writers don't have any data to contribute (because >> there's a dimension smaller than the number of ranks), which I have >> worked around like this: >> >> diff --git a/src/H5Dchunk.c b/src/H5Dchunk.c >> index af6599a..9522478 100644 >> --- a/src/H5Dchunk.c >> +++ b/src/H5Dchunk.c >> @@ -1836,6 +1836,9 @@ H5D__create_chunk_mem_map_hyper(const H5D_chunk_map_t >> *fm) >> /* Indicate that the chunk's memory space is shared */ >> chunk_info->mspace_shared = TRUE; >> } /* end if */ >> + else if(H5SL_count(fm->sel_chunks)==0) { >> + /* No chunks, because no local data; avoid >> HDassert(fm->m_ndims==fm->f_ndims) on null mem_space */ >> + } /* end else if */ >> else { >> /* Get bounding box for file selection */ >> if(H5S_SELECT_BOUNDS(fm->file_space, file_sel_start, file_sel_end) >> < 0) >> >> That makes the assert go away. Now I'm investigating a hang in the >> chunk redistribution logic in rank 0, with a backtrace that looks like >> this: >> >> #0 0x00007f4bd456a6c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2 >> #1 0x00007f4bd5d3b341 in psm_progress_wait () from >> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12 >> #2 0x00007f4bd5d3012d in MPID_Mprobe () from >> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12 >> #3 0x00007f4bd5cbeeb4 in PMPI_Mprobe () from >> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12 >> #4 0x00007f4bd81aadf6 in H5D__chunk_redistribute_shared_chunks >> (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, >> local_chunk_array=0x17f0f80, >> local_chunk_array_num_entries=0x7ffdfb83d9f8) at H5Dmpio.c:3041 >> #5 0x00007f4bd81a9696 in H5D__construct_filtered_io_info_list >> (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, >> chunk_list=0x7ffdfb83daf0, num_entries=0x7ffdfb83db00) >> at H5Dmpio.c:2794 >> #6 0x00007f4bd81a2d58 in H5D__link_chunk_filtered_collective_io >> (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, >> dx_plist=0x16f7230) at H5Dmpio.c:1447 >> #7 0x00007f4bd81a027d in H5D__chunk_collective_io >> (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0) at >> H5Dmpio.c:933 >> #8 0x00007f4bd81a0968 in H5D__chunk_collective_write >> (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, nelmts=104, >> file_space=0x17e2dc0, mem_space=0x17dc770, fm=0x17eeec0) at >> H5Dmpio.c:1018 >> #9 0x00007f4bd7ce3d63 in H5D__write (dataset=0x17e0010, >> mem_type_id=216172782113783851, mem_space=0x17dc770, >> file_space=0x17e2dc0, dxpl_id=720575940379279384, buf=0x17d6240) at >> H5Dio.c:835 >> #10 0x00007f4bd7ce181c in H5D__pre_write (dset=0x17e0010, >> direct_write=false, mem_type_id=216172782113783851, >> mem_space=0x17dc770, file_space=0x17e2dc0, dxpl_id=720575940379279384, >> buf=0x17d6240) >> at H5Dio.c:394 >> #11 0x00007f4bd7ce0fd1 in H5Dwrite (dset_id=360287970189639680, >> mem_type_id=216172782113783851, mem_space_id=288230376151711749, >> file_space_id=288230376151711750, dxpl_id=720575940379279384, >> buf=0x17d6240) at H5Dio.c:318 >> >> The other ranks have moved past this and are hanging here: >> >> #0 0x00007feb6e6546c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2 >> #1 0x00007feb6fe25341 in psm_progress_wait () from >> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12 >> #2 0x00007feb6fdd8975 in MPIC_Wait () from >> 
>>
>> #0  0x00007feb6e6546c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
>> #1  0x00007feb6fe25341 in psm_progress_wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>> #2  0x00007feb6fdd8975 in MPIC_Wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>> #3  0x00007feb6fdd918b in MPIC_Sendrecv () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>> #4  0x00007feb6fcf0fda in MPIR_Allreduce_pt2pt_rd_MV2 () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>> #5  0x00007feb6fcf48ef in MPIR_Allreduce_index_tuned_intra_MV2 () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>> #6  0x00007feb6fca1534 in MPIR_Allreduce_impl () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>> #7  0x00007feb6fca1b93 in PMPI_Allreduce () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>> #8  0x00007feb72287c2a in H5D__mpio_array_gatherv (local_array=0x125f2d0, local_array_num_entries=0, array_entry_size=368, _gathered_array=0x7ffff083f1d8, _gathered_array_num_entries=0x7ffff083f1e8, nprocs=4, allgather=true, root=0, comm=-1006632952, sort_func=0x0) at H5Dmpio.c:479
>> #9  0x00007feb7228cfb8 in H5D__link_chunk_filtered_collective_io (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280, dx_plist=0x11cf240) at H5Dmpio.c:1479
>> #10 0x00007feb7228a27d in H5D__chunk_collective_io (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280) at H5Dmpio.c:933
>> #11 0x00007feb7228a968 in H5D__chunk_collective_write (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, nelmts=74, file_space=0x12514e0, mem_space=0x124b450, fm=0x125d280) at H5Dmpio.c:1018
>> #12 0x00007feb71dcdd63 in H5D__write (dataset=0x124e7d0, mem_type_id=216172782113783851, mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:835
>> #13 0x00007feb71dcb81c in H5D__pre_write (dset=0x124e7d0, direct_write=false, mem_type_id=216172782113783851, mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:394
>> #14 0x00007feb71dcafd1 in H5Dwrite (dset_id=360287970189639680, mem_type_id=216172782113783851, mem_space_id=288230376151711749, file_space_id=288230376151711750, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:318
>>
>> (I'm currently running with this patch atop commit bf570b1, on an earlier theory that the crashing bug may have crept in after Jordan's big merge. I'll rebase on current develop but I doubt that'll change much.)
>>
>> The hang may or may not be directly related to the workaround being a bit of a hack. I can set you up with full reproduction details if you like; I seem to be getting some traction on it, but more eyeballs are always good, especially if they're better set up for MPI tracing than I am right now.
>>
>> On Wed, Nov 8, 2017 at 8:48 AM, Miller, Mark C. <[email protected]> wrote:
>>> Hi Michael,
>>>
>>> I have not tried this in parallel yet. That said, what scale are you trying to do this at? 1000 ranks or 1,000,000 ranks? Something in between?
>>>
>>> My understanding is that there are some known scaling issues out past maybe 10,000 ranks. Not heard of outright assertion failures there though.
>>>
>>> Mark
>>>
>>> "Hdf-forum on behalf of Michael K. Edwards" wrote:
>>>
>>> I'm trying to write an HDF5 file with dataset compression from an MPI job. (Using PETSc 3.8 compiled against MVAPICH2, if that matters.) After running into the "Parallel I/O does not support filters yet" error message in release versions of HDF5, I have turned to the develop branch. Clearly there has been much work towards collective filtered IO in the run-up to a 1.11 (1.12?) release; equally clearly it is not quite ready for prime time yet. So far I've encountered a livelock scenario with ZFP, reproduced it with SZIP, and, with no filters at all, obtained this nifty error message:
>>>
>>> ex12: H5Dchunk.c:1849: H5D__create_chunk_mem_map_hyper: Assertion `fm->m_ndims==fm->f_ndims' failed.
>>>
>>> Has anyone on this list been able to write parallel HDF5 using a recent state of the develop branch, with or without filters configured?
>>>
>>> Thanks,
>>> - Michael

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
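For anyone who wants to poke at this area without the PETSc layers on top, the general shape of what is being exercised is a collective write to a chunked, filtered dataset, roughly as in the sketch below. This is a minimal illustration only, not the PETSc code path: it uses the built-in deflate filter as a stand-in for ZFP/SZIP, the file and dataset names, sizes, and chunk shape are made up, and error checking is omitted.

#include <hdf5.h>
#include <mpi.h>

/* Minimal sketch of a collective filtered write (placeholder names/sizes). */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Parallel file access. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("filtered.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Chunked + filtered dataset: one row (one chunk) per rank. */
    hsize_t dims[2]  = {(hsize_t)nranks, 16};
    hsize_t chunk[2] = {1, 16};
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_deflate(dcpl, 6);    /* stand-in for ZFP/SZIP */
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Each rank selects and writes its own row, collectively. */
    hsize_t start[2] = {(hsize_t)rank, 0};
    hsize_t count[2] = {1, 16};
    hid_t filespace = H5Dget_space(dset);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    double buf[16];
    for (int i = 0; i < 16; i++)
        buf[i] = rank + 0.01 * i;
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
    H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}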
