Thank you. That got me farther along. The crash is now in the H5Z-blosc
filter glue, which should be easy to fix. It's interesting that the filter
is applied on a per-chunk basis, including on zero-sized chunks; it's
possible that something is wrong higher up the stack. I haven't really
thought about collective read with filters yet. Jordan, can you fill me in
on how that's supposed to work, especially if the reader has a different
number of MPI ranks than the writer had?
HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
  #000: H5Dio.c line 319 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 395 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 836 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #003: H5Dmpio.c line 1019 in H5D__chunk_collective_write(): write error
    major: Dataspace
    minor: Write failed
  #004: H5Dmpio.c line 934 in H5D__chunk_collective_io(): couldn't finish filtered linked chunk MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #005: H5Dmpio.c line 1474 in H5D__link_chunk_filtered_collective_io(): couldn't process chunk entry
    major: Dataset
    minor: Write failed
  #006: H5Dmpio.c line 3277 in H5D__filtered_collective_chunk_entry_io(): couldn't unfilter chunk for modifying
    major: Data filters
    minor: Filter operation failed
  #007: H5Z.c line 1256 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed
  #008: /home/centos/blosc/hdf5-blosc/src/blosc_filter.c line 250 in blosc_filter(): Can't allocate decompression buffer
    major: Data filters
    minor: Callback failed
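
For reference, the guard I have in mind for the filter glue looks roughly like
the sketch below. This is only a sketch of the idea, not the code in
hdf5-blosc/src/blosc_filter.c: the callback shape matches H5Z_func_t, but the
variable names and the pass-through handling of empty chunks are my
assumptions.

    #include <stdlib.h>
    #include <string.h>

    /* Sketch of a zero-size guard for an H5Z filter callback.  Illustrative
     * only; not the actual blosc_filter() implementation. */
    static size_t blosc_filter_sketch(unsigned int flags, size_t cd_nelmts,
                                      const unsigned int cd_values[],
                                      size_t nbytes, size_t *buf_size,
                                      void **buf)
    {
        void  *outbuf;
        size_t outbuf_size = *buf_size;
        size_t copied;

        (void)flags; (void)cd_nelmts; (void)cd_values;

        /* Zero-sized chunk: nothing to compress or decompress.  Returning
         * nbytes (here 0) is ambiguous, because 0 also means "filter failed"
         * to H5Z_pipeline(), which is why the real fix may belong higher up
         * the stack rather than in the glue. */
        if (nbytes == 0 || outbuf_size == 0)
            return nbytes;

        outbuf = malloc(outbuf_size);
        if (outbuf == NULL)
            return 0;   /* genuine allocation failure */

        /* Placeholder pass-through; the real glue calls blosc_compress() or
         * blosc_decompress() here depending on H5Z_FLAG_REVERSE. */
        copied = (nbytes < outbuf_size) ? nbytes : outbuf_size;
        memcpy(outbuf, *buf, copied);

        free(*buf);
        *buf      = outbuf;
        *buf_size = outbuf_size;
        return copied;
    }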
On Thu, Nov 9, 2017 at 9:22 AM, Dana Robinson <[email protected]> wrote:
> In develop, H5MM_malloc() and H5MM_calloc() will trigger an assert if the size
> is zero. That assert should not be there; the function docs even say that we
> return NULL on a size of zero.
>
> The offending asserts are at lines 271 and 360 in H5MM.c if you want to try
> yanking them out and rebuilding.
>
> Dana
>
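
(For anyone following along: the contract Dana describes would look something
like the sketch below. This is just the shape of the documented behavior, not
the actual H5MM.c code.)

    #include <stdlib.h>

    /* Sketch of the documented contract: a zero-byte request returns NULL
     * instead of asserting.  Illustrative only, not H5MM.c. */
    static void *h5mm_malloc_sketch(size_t size)
    {
        if (size == 0)
            return NULL;        /* documented: NULL on size zero */
        return malloc(size);
    }

    static void *h5mm_calloc_sketch(size_t size)
    {
        if (size == 0)
            return NULL;        /* same contract for the calloc variant */
        return calloc(1, size);
    }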
> On 11/9/17, 09:06, "Hdf-forum on behalf of Michael K. Edwards"
> <[email protected] on behalf of [email protected]>
> wrote:
>
> Actually, it's not the H5Screate() that crashes; that has worked fine since
> HDF5 1.8.7. It's a zero-sized malloc somewhere inside the call to H5Dwrite(),
> possibly in the filter. I think this is close to resolution; I just have to
> get tools on it.
>
> On Thu, Nov 9, 2017 at 8:47 AM, Michael K. Edwards
> <[email protected]> wrote:
> > Apparently this has been reported before as a problem with PETSc/HDF5
> > integration:
> https://lists.mcs.anl.gov/pipermail/petsc-users/2012-January/011980.html
> >
> > On Thu, Nov 9, 2017 at 8:37 AM, Michael K. Edwards
> > <[email protected]> wrote:
> >> Thank you for the validation, and for the suggestion to use
> >> H5Sselect_none(). That is probably the right thing for the dataspace.
> >> I'm not quite sure what to do about the memspace, though; the comment is
> >> correct that we crash if any of the dimensions is zero.
> >>
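
(My current reading of the H5Sselect_none() suggestion, covering the memspace
as well, is roughly the sketch below. The handles and the one-dimensional
layout are assumptions on my part, not the PETSc code, and I haven't verified
this against the filtered collective path yet.)

    #include <hdf5.h>

    /* Sketch: every rank keeps real (simple) file and memory dataspaces and
     * simply selects nothing when it has no data, rather than passing a NULL
     * dataspace.  The handles and the 1-D layout are illustrative. */
    static herr_t write_my_rows(hid_t dset_id, hid_t dxpl_id,
                                hsize_t offset, hsize_t count,
                                const double *buf)
    {
        hid_t  filespace = H5Dget_space(dset_id);
        hid_t  memspace  = H5Screate_simple(1, &count, NULL); /* count may be 0 */
        herr_t status;

        if (count == 0) {
            /* Non-contributing rank: empty selection, not a NULL dataspace. */
            H5Sselect_none(filespace);
            H5Sselect_none(memspace);
        } else {
            H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                                &offset, NULL, &count, NULL);
        }

        /* Every rank still makes the (collective) H5Dwrite() call. */
        status = H5Dwrite(dset_id, H5T_NATIVE_DOUBLE, memspace, filespace,
                          dxpl_id, buf);

        H5Sclose(memspace);
        H5Sclose(filespace);
        return status;
    }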
> >> On Thu, Nov 9, 2017 at 8:34 AM, Jordan Henderson
> >> <[email protected]> wrote:
> >>> It seems you're discovering the issues right as I'm typing this!
> >>>
> >>>
> >>> I'm glad you were able to solve the issue with the hanging. I was starting
> >>> to suspect an issue with the MPI implementation, but that's usually the
> >>> last thing on the list after inspecting the code itself.
> >>>
> >>> As you've seen, it seems that PETSc is creating a NULL dataspace for the
> >>> ranks which are not contributing, instead of creating a Scalar/Simple
> >>> dataspace on all ranks and calling H5Sselect_none() for those that don't
> >>> participate. This would most likely explain the assertion failure you saw
> >>> in the non-filtered case, as the legacy code probably was not expecting to
> >>> receive a NULL dataspace. On top of that, the NULL dataspace seems to be
> >>> causing the parallel operation to break collective mode, which is not
> >>> allowed when filters are involved. I would need to do some research into
> >>> why this happens before deciding whether it's more appropriate to modify
> >>> this in HDF5 or to have PETSc not use NULL dataspaces.
> >>>
> >>> Avoiding deadlock from the final sort is an issue I have had to re-tackle
> >>> a few different times due to the complexity of the code, but I will
> >>> investigate using the chunk offset as a secondary sort key and see if it
> >>> runs into problems in any other cases. Ideally, the chunk redistribution
> >>> would be updated in the future to involve all ranks in the operation
> >>> instead of just rank 0, which would also allow improvements to the
> >>> redistribution algorithm that might solve these problems, but for the time
> >>> being this may be sufficient.
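
To make the secondary-sort-key idea concrete, the comparator I have in mind
looks roughly like the sketch below. The struct and field names are stand-ins,
not HDF5's internal chunk-entry structures in H5Dmpio.c; the point is only
that tie-breaking on the chunk's file offset makes the ordering total and
identical on every rank, so the final sort can't leave two ranks disagreeing
about the exchange order.

    #include <stdlib.h>

    /* Stand-in chunk entry; the real structures live in H5Dmpio.c. */
    struct chunk_entry_sketch {
        int                primary_key;  /* whatever the existing sort key is */
        unsigned long long chunk_offset; /* chunk's file address (tie-break)  */
    };

    static int cmp_chunk_entries(const void *a, const void *b)
    {
        const struct chunk_entry_sketch *ca = a;
        const struct chunk_entry_sketch *cb = b;

        if (ca->primary_key != cb->primary_key)
            return (ca->primary_key < cb->primary_key) ? -1 : 1;
        if (ca->chunk_offset != cb->chunk_offset)
            return (ca->chunk_offset < cb->chunk_offset) ? -1 : 1;
        return 0;   /* fully deterministic: equal only if both keys match */
    }

    /* usage: qsort(entries, nentries, sizeof(*entries), cmp_chunk_entries); */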
>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5