Just so it's clear: the fixes are mostly in the plugins and in how
PETSc calls into the HDF5 code. (It should probably never have mixed
simple and null dataspaces in one collective write.) The fixes to
HDF5 itself are:
* Dana's observations with regard to the H5MM APIs:
  * the inappropriate assert(size > 0) in H5MM_[mc]alloc in the
    develop branch; and
  * the recommendation to use H5allocate/resize/free_memory() rather
    than the private APIs.
* The recommendation to sort by chunk address within each owner's
range of chunk entries, to avoid risk of deadlock in the
H5D__chunk_redistribute_shared_chunks() code.
I haven't switched to H5allocate/resize/free_memory() yet, but here's
(minimally tested) code to handle the other two issues.
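For what it's worth, the eventual switch to the public memory routines in a
filter plugin would look roughly like the sketch below. This is untested and
schematic: example_filter, its placeholder memcpy "codec" step, and the error
handling are illustrations, not the actual H5Z-blosc code; the point is just
that the output buffer is allocated, and the input buffer released, through
H5allocate_memory()/H5free_memory() so both sides of the pipeline use the
same allocator.

#include <string.h>
#include "hdf5.h"

/* Hypothetical H5Z filter callback (H5Z_func_t signature) showing the
 * public-API allocation pattern.  The real codec call is elided; an optional
 * filter can simply return 0 on failure or expansion and let HDF5 store the
 * chunk unfiltered. */
static size_t
example_filter(unsigned int flags, size_t cd_nelmts, const unsigned int cd_values[],
               size_t nbytes, size_t *buf_size, void **buf)
{
    size_t out_size = nbytes;                        /* allow no expansion */
    void  *out_buf  = H5allocate_memory(out_size, 0 /* don't zero-fill */);

    (void)flags; (void)cd_nelmts; (void)cd_values;   /* unused in this sketch */

    if (NULL == out_buf)
        return 0;                                    /* 0 == "skip me" for an optional filter */

    /* ... run the codec from *buf into out_buf here; fall back to a straight
     *     copy if the encoded result would exceed nbytes ... */
    memcpy(out_buf, *buf, nbytes);

    H5free_memory(*buf);                             /* free with the matching allocator */
    *buf      = out_buf;
    *buf_size = out_size;
    return out_size;                                 /* number of valid bytes now in *buf */
}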
Cheers,
- Michael
diff --git a/src/H5Dmpio.c b/src/H5Dmpio.c
index 79572c0..60e9f03 100644
--- a/src/H5Dmpio.c
+++ b/src/H5Dmpio.c
@@ -2328,14 +2328,22 @@ H5D__cmp_filtered_collective_io_info_entry(const void *filtered_collective_io_in
 static int
 H5D__cmp_filtered_collective_io_info_entry_owner(const void *filtered_collective_io_info_entry1, const void *filtered_collective_io_info_entry2)
 {
-    int owner1 = -1, owner2 = -1;
+    int owner1 = -1, owner2 = -1, delta = 0;
+    haddr_t addr1 = HADDR_UNDEF, addr2 = HADDR_UNDEF;
 
     FUNC_ENTER_STATIC_NOERR
 
     owner1 = ((const H5D_filtered_collective_io_info_t *) filtered_collective_io_info_entry1)->owners.original_owner;
     owner2 = ((const H5D_filtered_collective_io_info_t *) filtered_collective_io_info_entry2)->owners.original_owner;
-
-    FUNC_LEAVE_NOAPI(owner1 - owner2)
+    if (owner1 != owner2) {
+        delta = owner1 - owner2;
+    } else {
+        addr1 = ((const H5D_filtered_collective_io_info_t *) filtered_collective_io_info_entry1)->chunk_states.new_chunk.offset;
+        addr2 = ((const H5D_filtered_collective_io_info_t *) filtered_collective_io_info_entry2)->chunk_states.new_chunk.offset;
+        delta = H5F_addr_cmp(addr1, addr2);
+    }
+
+    FUNC_LEAVE_NOAPI(delta)
 } /* end H5D__cmp_filtered_collective_io_info_entry_owner() */
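For context: this comparator sorts the shared-chunk entry array in
H5D__chunk_redistribute_shared_chunks(), so entries within a single owner's
range now come back in ascending chunk-address order and every rank walks
them in the same order.  Schematically, it gets handed to HDqsort() like this
(chunk_list and num_chunk_entries are placeholder names, not the actual
locals):

HDqsort(chunk_list, num_chunk_entries, sizeof(H5D_filtered_collective_io_info_t),
        H5D__cmp_filtered_collective_io_info_entry_owner);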
diff --git a/src/H5MM.c b/src/H5MM.c
index ee3b28f..3f06850 100644
--- a/src/H5MM.c
+++ b/src/H5MM.c
@@ -268,8 +268,6 @@ H5MM_malloc(size_t size)
 {
     void *ret_value = NULL;
 
-    HDassert(size);
-
     /* Use FUNC_ENTER_NOAPI_NOINIT_NOERR here to avoid performance issues */
     FUNC_ENTER_NOAPI_NOINIT_NOERR
@@ -357,8 +355,6 @@ H5MM_calloc(size_t size)
 {
     void *ret_value = NULL;
 
-    HDassert(size);
-
     /* Use FUNC_ENTER_NOAPI_NOINIT_NOERR here to avoid performance issues */
     FUNC_ENTER_NOAPI_NOINIT_NOERR
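A note on the assert removal: with HDassert(size) gone, a zero-byte request
just falls through to HDmalloc()/HDcalloc(), which the C standard allows to
return NULL.  So callers that can legitimately see size == 0 (a chunk_entry
with a zero io_size, say) shouldn't treat a NULL return by itself as an
allocation failure.  A hypothetical illustration of that caller-side pattern,
written against the public API rather than the actual H5Dmpio.c code:

#include <stddef.h>
#include "hdf5.h"

/* Only report an error when a non-zero request actually failed; NULL with
 * io_size == 0 simply means there is nothing to allocate. */
static void *
alloc_chunk_buffer(size_t io_size, int *failed)
{
    void *buf = NULL;

    *failed = 0;
    if (io_size > 0) {
        buf = H5allocate_memory(io_size, 0 /* don't zero-fill */);
        if (NULL == buf)
            *failed = 1;
    }
    return buf;
}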
On Thu, Nov 9, 2017 at 2:27 PM, Michael K. Edwards
<[email protected]> wrote:
> That does appear to have been the problem. I modified H5Z-blosc to
> allocate enough room for the BLOSC header, and to fall back to memcpy
> mode (clevel=0) if the data expands during "compressed" encoding.
> This unblocks me, though I think it might be a good idea for the
> collective filtered I/O path to handle H5Z_FLAG_OPTIONAL properly.
>
> Would it be helpful for me to send a patch once I've cleaned up my
> debugging goop? What's a good way to do that -- github pull request?
> Do you need a contributor agreement / copyright assignment / some such
> thing?
>
>
> On Thu, Nov 9, 2017 at 1:44 PM, Michael K. Edwards
> <[email protected]> wrote:
>> I observe this comment in the H5Z-blosc code:
>>
>> /* Allocate an output buffer exactly as long as the input data; if
>> the result is larger, we simply return 0. The filter is flagged
>> as optional, so HDF5 marks the chunk as uncompressed and
>> proceeds.
>> */
>>
>> In my current setup, I have not marked the filter with
>> H5Z_FLAG_MANDATORY, for this reason. Is this comment accurate for the
>> collective filtered path, or is it possible that the zero return code
>> is being treated as "compressed data is zero bytes long"?
>>
>>
>>
>> On Thu, Nov 9, 2017 at 1:37 PM, Michael K. Edwards
>> <[email protected]> wrote:
>>> Thank you for the explanation. That's consistent with what I see when
>>> I add a debug printf into H5D__construct_filtered_io_info_list(). So
>>> I'm now looking into the filter situation. It's possible that the
>>> H5Z-blosc glue is mishandling the case where the compressed data is
>>> larger than the uncompressed data.
>>>
>>> About to write 12 of 20
>>> About to write 0 of 20
>>> About to write 0 of 20
>>> About to write 8 of 20
>>> Rank 0 selected 12 of 20
>>> Rank 1 selected 8 of 20
>>> HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
>>> #000: H5Dio.c line 319 in H5Dwrite(): can't prepare for writing data
>>> major: Dataset
>>> minor: Write failed
>>> #001: H5Dio.c line 395 in H5D__pre_write(): can't write data
>>> major: Dataset
>>> minor: Write failed
>>> #002: H5Dio.c line 836 in H5D__write(): can't write data
>>> major: Dataset
>>> minor: Write failed
>>> #003: H5Dmpio.c line 1019 in H5D__chunk_collective_write(): write error
>>> major: Dataspace
>>> minor: Write failed
>>> #004: H5Dmpio.c line 934 in H5D__chunk_collective_io(): couldn't
>>> finish filtered linked chunk MPI-IO
>>> major: Low-level I/O
>>> minor: Can't get value
>>> #005: H5Dmpio.c line 1474 in
>>> H5D__link_chunk_filtered_collective_io(): couldn't process chunk entry
>>> major: Dataset
>>> minor: Write failed
>>> #006: H5Dmpio.c line 3278 in
>>> H5D__filtered_collective_chunk_entry_io(): couldn't unfilter chunk for
>>> modifying
>>> major: Data filters
>>> minor: Filter operation failed
>>> #007: H5Z.c line 1256 in H5Z_pipeline(): filter returned failure during
>>> read
>>> major: Data filters
>>> minor: Read failed
>>>
>>>
>>>
>>> On Thu, Nov 9, 2017 at 1:02 PM, Jordan Henderson
>>> <[email protected]> wrote:
>>>> For the purpose of collective I/O it is true that all ranks must call
>>>> H5Dwrite() so that they can participate in those collective operations that
>>>> are necessary (the file space re-allocation and so on). However, even
>>>> though
>>>> they called H5Dwrite() with a valid memspace, the fact that they have a
>>>> NONE
>>>> selection in the given file space should cause their chunk-file mapping
>>>> struct (see lines 357-385 of H5Dpkg.h for the struct's definition and the
>>>> code for H5D__link_chunk_filtered_collective_io() to see how it uses this
>>>> built up list of chunks selected in the file) to contain no entries in the
>>>> "fm->sel_chunks" field. That alone should mean that during the chunk
>>>> redistribution, they will not actually send anything at all to any of the
>>>> ranks. They participate there only so that, if the redistribution method
>>>> were ever changed, ranks which previously had no chunks selected could
>>>> potentially be given some chunks to work on.
>>>>
>>>>
>>>> For all practical purposes, every single chunk_entry seen in the list from
>>>> rank 0's perspective should be a valid I/O caused by some rank writing some
>>>> positive amount of bytes to the chunk. On rank 0's side, you should be able
>>>> to check the io_size field of each of the chunk_entry entries and see how
>>>> big the I/O is from the "original_owner" to that chunk. If any of these are
>>>> 0, something is likely very wrong. If that is indeed the case, you could
>>>> likely pull a hacky workaround by manually removing them from the list, but
>>>> I'd be more concerned about the root of the problem if there are zero-size
>>>> I/O chunk_entry entries being added to the list.