It's not even clear to me yet whether this is the same dataset that triggered the assert. Working on getting complete details. But FWIW the PETSc code does not call H5Sselect_none(). It calls H5Sselect_hyperslab() in all ranks, and that's why the ranks in which the slice is zero columns wide hit the "empty sel_chunks" pathway I added to H5D__create_chunk_mem_map_hyper().
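To make that concrete, here is roughly the shape of the selection every rank ends up making (a simplified sketch, not the actual PETSc code; the function and variable names are placeholders I made up). The point is that a rank whose slice is zero columns wide still selects a zero-width hyperslab rather than calling H5Sselect_none():

#include <hdf5.h>

/* Simplified sketch (not the actual PETSc code): each rank writes its own
 * slice of columns of a 2-D (nrows x ncols) dataset.  "dset" and "dxpl" are
 * assumed to be a chunked dataset and a collective transfer property list;
 * "my_first_col" and "my_ncols" come from the column partitioning, and
 * my_ncols is 0 on ranks that own no columns. */
static herr_t write_my_columns(hid_t dset, hid_t dxpl, hsize_t nrows,
                               hsize_t my_first_col, hsize_t my_ncols,
                               const double *local_buf)
{
    hsize_t start[2] = {0, my_first_col};
    hsize_t count[2] = {nrows, my_ncols};   /* zero-width on the empty ranks */

    /* Every rank selects a hyperslab, even when my_ncols == 0;
     * nobody calls H5Sselect_none(). */
    hid_t filespace = H5Dget_space(dset);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t memspace = H5Screate_simple(2, count, NULL);

    herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                             dxpl, local_buf);

    H5Sclose(memspace);
    H5Sclose(filespace);
    return status;
}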
On Wed, Nov 8, 2017 at 12:02 PM, Michael K. Edwards <[email protected]> wrote:
> Thanks, Jordan. I recognize that this is very recent feature work, and my goal is to help push it forward.
>
> My current use case is relatively straightforward, though there are a couple of layers on top of HDF5 itself. The problem can be reproduced by building PETSc 3.8.1 against libraries built from the develop branch of HDF5, adding in the H5Pset_filter() calls, and running an example that exercises them. (I'm using src/snes/examples/tutorials/ex12.c with the -dm_view_hierarchy flag to induce HDF5 writes.) If you want, I can supply full details for you to reproduce it locally, or I can do any experiments you'd like me to within this setup. (It also involves patches to the out-of-tree H5Z plugins to make them use H5MM_malloc/H5MM_xfree rather than raw malloc/free, which in turn involves exposing H5MMprivate.h to the plugins. Is this something you've solved in a different way?)
>
> On Wed, Nov 8, 2017 at 11:44 AM, Jordan Henderson <[email protected]> wrote:
>> Hi Michael,
>>
>> During the design phase of this feature I tried to both account for and test the case where some of the writers do not have any data to contribute. However, it seems that your use case falls outside of what I have tested (perhaps I have not used enough ranks?). In particular, my test cases were small and simply had some of the ranks call H5Sselect_none(), which doesn't seem to trigger this particular assertion failure. Is this how you're approaching these particular ranks in your code, or is there a different way you are having them participate in the write operation?
>>
>> As for the hanging issue, it looks as though rank 0 is waiting to receive some modification data from another rank for a particular chunk. Whether or not there is actually valid data that rank 0 should be waiting for, I cannot easily tell without being able to trace it through. Since the other ranks have finished modifying their particular sets of chunks, they have moved on and are waiting for everyone to get together and broadcast their new chunk sizes so that free space in the file can be collectively re-allocated, but of course rank 0 is not proceeding. My best guess is that either:
>>
>> The "num_writers" field of the chunk struct corresponding to the particular chunk that rank 0 is working on has been set incorrectly, causing rank 0 to think that more ranks are writing to the chunk than actually are, and consequently to wait forever for a non-existent MPI message,
>>
>> or
>>
>> The "new_owner" field of the chunk struct for this chunk was set incorrectly on the other ranks, causing them to never issue an MPI_Isend to rank 0, which also leaves rank 0 waiting for a non-existent MPI message.
>>
>> This feature should still be regarded as being in beta, and its complexity can lead to difficult-to-track-down bugs such as the ones you are currently encountering. That being said, your feedback is very useful and will help push this feature towards a production-ready level of quality. Also, if it is feasible to come up with a minimal example that reproduces this issue, it would be very helpful and would make it much easier to diagnose exactly why these failures are occurring.
>>
>> Thanks,
>> Jordan
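(For contrast, my understanding of the pattern Jordan describes testing is something like the following. This is a rough sketch, not his actual test code; "dset" and "dxpl" are placeholder handles.)

#include <hdf5.h>

/* Rough sketch of the alternative described above (not Jordan's actual test
 * code): a rank with nothing to contribute still participates in the
 * collective H5Dwrite(), but with an explicitly empty selection made via
 * H5Sselect_none(). */
static herr_t write_nothing(hid_t dset, hid_t dxpl)
{
    hid_t filespace = H5Dget_space(dset);
    H5Sselect_none(filespace);          /* zero elements selected in the file */

    hid_t memspace = H5Scopy(filespace);
    H5Sselect_none(memspace);           /* zero elements selected in memory */

    double dummy = 0.0;                 /* never dereferenced; nothing is selected */
    herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                             dxpl, &dummy);

    H5Sclose(memspace);
    H5Sclose(filespace);
    return status;
}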
Edwards <[email protected]> >> Sent: Wednesday, November 8, 2017 11:23 AM >> To: Miller, Mark C. >> Cc: HDF Users Discussion List >> Subject: Re: [Hdf-forum] Collective IO and filters >> >> Closer to 1000 ranks initially. There's a bug in handling the case >> where some of the writers don't have any data to contribute (because >> there's a dimension smaller than the number of ranks), which I have >> worked around like this: >> >> diff --git a/src/H5Dchunk.c b/src/H5Dchunk.c >> index af6599a..9522478 100644 >> --- a/src/H5Dchunk.c >> +++ b/src/H5Dchunk.c >> @@ -1836,6 +1836,9 @@ H5D__create_chunk_mem_map_hyper(const H5D_chunk_map_t >> *fm) >> /* Indicate that the chunk's memory space is shared */ >> chunk_info->mspace_shared = TRUE; >> } /* end if */ >> + else if(H5SL_count(fm->sel_chunks)==0) { >> + /* No chunks, because no local data; avoid >> HDassert(fm->m_ndims==fm->f_ndims) on null mem_space */ >> + } /* end else if */ >> else { >> /* Get bounding box for file selection */ >> if(H5S_SELECT_BOUNDS(fm->file_space, file_sel_start, file_sel_end) >> < 0) >> >> That makes the assert go away. Now I'm investigating a hang in the >> chunk redistribution logic in rank 0, with a backtrace that looks like >> this: >> >> #0 0x00007f4bd456a6c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2 >> #1 0x00007f4bd5d3b341 in psm_progress_wait () from >> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12 >> #2 0x00007f4bd5d3012d in MPID_Mprobe () from >> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12 >> #3 0x00007f4bd5cbeeb4 in PMPI_Mprobe () from >> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12 >> #4 0x00007f4bd81aadf6 in H5D__chunk_redistribute_shared_chunks >> (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, >> local_chunk_array=0x17f0f80, >> local_chunk_array_num_entries=0x7ffdfb83d9f8) at H5Dmpio.c:3041 >> #5 0x00007f4bd81a9696 in H5D__construct_filtered_io_info_list >> (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, >> chunk_list=0x7ffdfb83daf0, num_entries=0x7ffdfb83db00) >> at H5Dmpio.c:2794 >> #6 0x00007f4bd81a2d58 in H5D__link_chunk_filtered_collective_io >> (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, >> dx_plist=0x16f7230) at H5Dmpio.c:1447 >> #7 0x00007f4bd81a027d in H5D__chunk_collective_io >> (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0) at >> H5Dmpio.c:933 >> #8 0x00007f4bd81a0968 in H5D__chunk_collective_write >> (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, nelmts=104, >> file_space=0x17e2dc0, mem_space=0x17dc770, fm=0x17eeec0) at >> H5Dmpio.c:1018 >> #9 0x00007f4bd7ce3d63 in H5D__write (dataset=0x17e0010, >> mem_type_id=216172782113783851, mem_space=0x17dc770, >> file_space=0x17e2dc0, dxpl_id=720575940379279384, buf=0x17d6240) at >> H5Dio.c:835 >> #10 0x00007f4bd7ce181c in H5D__pre_write (dset=0x17e0010, >> direct_write=false, mem_type_id=216172782113783851, >> mem_space=0x17dc770, file_space=0x17e2dc0, dxpl_id=720575940379279384, >> buf=0x17d6240) >> at H5Dio.c:394 >> #11 0x00007f4bd7ce0fd1 in H5Dwrite (dset_id=360287970189639680, >> mem_type_id=216172782113783851, mem_space_id=288230376151711749, >> file_space_id=288230376151711750, dxpl_id=720575940379279384, >> buf=0x17d6240) at H5Dio.c:318 >> >> The other ranks have moved past this and are hanging here: >> >> #0 0x00007feb6e6546c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2 >> #1 0x00007feb6fe25341 in psm_progress_wait () from >> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12 >> #2 0x00007feb6fdd8975 in MPIC_Wait () from >> 
>>
>> #0  0x00007feb6e6546c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
>> #1  0x00007feb6fe25341 in psm_progress_wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>> #2  0x00007feb6fdd8975 in MPIC_Wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>> #3  0x00007feb6fdd918b in MPIC_Sendrecv () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>> #4  0x00007feb6fcf0fda in MPIR_Allreduce_pt2pt_rd_MV2 () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>> #5  0x00007feb6fcf48ef in MPIR_Allreduce_index_tuned_intra_MV2 () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>> #6  0x00007feb6fca1534 in MPIR_Allreduce_impl () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>> #7  0x00007feb6fca1b93 in PMPI_Allreduce () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>> #8  0x00007feb72287c2a in H5D__mpio_array_gatherv (local_array=0x125f2d0, local_array_num_entries=0, array_entry_size=368, _gathered_array=0x7ffff083f1d8, _gathered_array_num_entries=0x7ffff083f1e8, nprocs=4, allgather=true, root=0, comm=-1006632952, sort_func=0x0) at H5Dmpio.c:479
>> #9  0x00007feb7228cfb8 in H5D__link_chunk_filtered_collective_io (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280, dx_plist=0x11cf240) at H5Dmpio.c:1479
>> #10 0x00007feb7228a27d in H5D__chunk_collective_io (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280) at H5Dmpio.c:933
>> #11 0x00007feb7228a968 in H5D__chunk_collective_write (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, nelmts=74, file_space=0x12514e0, mem_space=0x124b450, fm=0x125d280) at H5Dmpio.c:1018
>> #12 0x00007feb71dcdd63 in H5D__write (dataset=0x124e7d0, mem_type_id=216172782113783851, mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:835
>> #13 0x00007feb71dcb81c in H5D__pre_write (dset=0x124e7d0, direct_write=false, mem_type_id=216172782113783851, mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:394
>> #14 0x00007feb71dcafd1 in H5Dwrite (dset_id=360287970189639680, mem_type_id=216172782113783851, mem_space_id=288230376151711749, file_space_id=288230376151711750, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:318
>>
>> (I'm currently running with this patch atop commit bf570b1, on an earlier theory that the crashing bug may have crept in after Jordan's big merge. I'll rebase on current develop but I doubt that'll change much.)
>>
>> The hang may or may not be directly related to the workaround being a bit of a hack. I can set you up with full reproduction details if you like; I seem to be getting some traction on it, but more eyeballs are always good, especially if they're better set up for MPI tracing than I am right now.
>>
>> On Wed, Nov 8, 2017 at 8:48 AM, Miller, Mark C. <[email protected]> wrote:
>>> Hi Michael,
>>>
>>> I have not tried this in parallel yet. That said, what scale are you trying to do this at? 1000 ranks or 1,000,000 ranks? Something in between?
>>>
>>> My understanding is that there are some known scaling issues out past maybe 10,000 ranks. Not heard of outright assertion failures there though.
>>>
>>> Mark
>>>
>>> "Hdf-forum on behalf of Michael K. Edwards" wrote:
>>>
>>> I'm trying to write an HDF5 file with dataset compression from an MPI job. (Using PETSc 3.8 compiled against MVAPICH2, if that matters.) After running into the "Parallel I/O does not support filters yet" error message in release versions of HDF5, I have turned to the develop branch. Clearly there has been much work towards collective filtered IO in the run-up to a 1.11 (1.12?) release; equally clearly it is not quite ready for prime time yet. So far I've encountered a livelock scenario with ZFP, reproduced it with SZIP, and, with no filters at all, obtained this nifty error message:
>>>
>>> ex12: H5Dchunk.c:1849: H5D__create_chunk_mem_map_hyper: Assertion `fm->m_ndims==fm->f_ndims' failed.
>>>
>>> Has anyone on this list been able to write parallel HDF5 using a recent state of the develop branch, with or without filters configured?
>>>
>>> Thanks,
>>> - Michael

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
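For anyone who wants to poke at this area without the PETSc layers on top, the general shape of what is being exercised is a collective write to a chunked, filtered dataset, roughly as in the sketch below. This is a minimal illustration only, not the PETSc code path: it uses the built-in deflate filter as a stand-in for ZFP/SZIP, the file and dataset names, sizes, and chunk shape are made up, and error checking is omitted.

#include <hdf5.h>
#include <mpi.h>

/* Minimal sketch of a collective filtered write (placeholder names/sizes). */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Parallel file access. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("filtered.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Chunked + filtered dataset: one row (one chunk) per rank. */
    hsize_t dims[2]  = {(hsize_t)nranks, 16};
    hsize_t chunk[2] = {1, 16};
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_deflate(dcpl, 6);    /* stand-in for ZFP/SZIP */
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Each rank selects and writes its own row, collectively. */
    hsize_t start[2] = {(hsize_t)rank, 0};
    hsize_t count[2] = {1, 16};
    hid_t filespace = H5Dget_space(dset);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    double buf[16];
    for (int i = 0; i < 16; i++)
        buf[i] = rank + 0.01 * i;
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
    H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}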
