Mark,

Thank you for the excellent report and patches. I have added JIRA issue HDFFV-9208.

Elena
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal  The HDF Group  http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~




On Apr 1, 2015, at 4:46 PM, Hodson, Mark (Contractor) <[email protected]> wrote:


UNCLASSIFIED

Elena et al,

Further to previous emails regarding problems using nested named data types 
(HDFFV-9174, HDFFV-9176), I can now report my colleague and I have tracked down 
the crash that is occurring in our application and have been able to patch the 
problem locally. Can I request that a new ticket be raised in JIRA, and the 
following information attached? Thanks.

--8<--

THE PROBLEM:

The problem requires use of (a) named data types (not necessarily nested); (b) 
soft links; and (c) data type conversions (which may be implicitly generated by 
dataset read or write functions). All source file and line references in the 
following explanation are to the 1.8.14 release.

When reading or writing dataset values, data type conversions are often 
required. When a data type conversion structure is initialised, it takes a copy 
of the source and destination data types [H5Tconv.c:1963 and others]. If either 
data type is a named data type, the H5T_t oloc.file record is non-NULL, but (in 
the cases we tested) its oloc.holding_file record is FALSE. These fields are 
copied as part of the data type copy, and so the data types held by the 
converter have the same characteristics. Furthermore, data type conversions 
are registered in a global conversion path cache that outlives any H5F_t 
file handle.
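
For concreteness, here is a minimal sketch (not the attached test) of the 
kind of public-API sequence that populates the cache; the file, type and 
dataset names are invented for illustration:

    #include "hdf5.h"

    /* Sketch: an implicit conversion involving a committed (named) data
     * type.  The H5Dwrite() below converts from the native int memory
     * type to the big-endian 64-bit file type, so the library sets up
     * and caches a conversion path holding copies of both H5T_t
     * structures. */
    int main(void)
    {
        hsize_t dim = 4;
        int     data[4] = {1, 2, 3, 4};

        hid_t file  = H5Fcreate("typeconv.h5", H5F_ACC_TRUNC, H5P_DEFAULT,
                                H5P_DEFAULT);
        hid_t space = H5Screate_simple(1, &dim, NULL);

        /* Commit a named data type; its H5T_t now has oloc.file != NULL
         * but oloc.holding_file == FALSE. */
        hid_t dtype = H5Tcopy(H5T_STD_I64BE);
        H5Tcommit2(file, "my_type", dtype, H5P_DEFAULT, H5P_DEFAULT,
                   H5P_DEFAULT);

        hid_t dset = H5Dcreate2(file, "my_dset", dtype, space, H5P_DEFAULT,
                                H5P_DEFAULT, H5P_DEFAULT);

        /* Memory type != file type: a conversion path is registered in
         * the global cache, complete with copies of both data types. */
        H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        H5Dclose(dset);
        H5Tclose(dtype);
        H5Sclose(space);
        H5Fclose(file);  /* cached copies now reference a closed file */
        return 0;
    }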

Because oloc.holding_file is FALSE, it is possible to close the file that holds 
the named data types being referred to. THIS IS THE BUG: there are now 
registered global data type identifiers, held by global data type conversion 
structures, that carry a dangling reference (the oloc.file record) to a file 
that is no longer open; the record has been deallocated. However, until some 
other operation attempts to dereference this dangling pointer, there is no 
failure in the HDF library.

We found, however, that deleting a soft link will trigger such a failure. 
Having opened a new file (a different one to the one containing the original 
named data types) that contains a soft link, calling H5Ldelete() on the soft 
link includes code that searches all currently held identifiers of all types 
(datasets, groups and data types) and attempts to rename them to a NULL or 
empty string [H5Gname.c:815 callback switch statement]. Because the data types 
with dangling file references report as named data types [at H5Gname.c:828], 
the oloc is retrieved and the file pointer is dereferenced when it tries to 
locate the parent of the file record [H5Gname.c:859]. This causes the crash.
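
Schematically, and continuing the sketch above after the first file has been 
closed (names again invented for illustration):

    /* Open a different file that contains (or is given) a soft link. */
    hid_t file2 = H5Fcreate("other.h5", H5F_ACC_TRUNC, H5P_DEFAULT,
                            H5P_DEFAULT);
    H5Lcreate_soft("/some/target", file2, "slink", H5P_DEFAULT, H5P_DEFAULT);

    /* H5Ldelete() runs the H5Gname.c rename callback over all open
     * identifiers; the cached conversion-path types still report as
     * named, so their stale oloc.file is dereferenced [H5Gname.c:859]
     * and the library crashes (or silently "succeeds" if the freed
     * H5F_t block still looks valid). */
    H5Ldelete(file2, "slink", H5P_DEFAULT);
    H5Fclose(file2);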

REPRODUCTION:

Reproducing this, especially in the HDF test harness and without any source 
code changes, is surprisingly difficult. The reason is that when the file is 
closed, creating the dangling reference, the memory occupied by the H5F_t 
structure is freed onto the internal HDF heap. This structure still looks valid 
(enough), especially if the “parent” field is NULL. If you re-open the same 
file, the library picks the same block back up from the free list, 
effectively re-hooking the file reference, and everything continues to work. If 
you open a different file, you often still can’t induce a crash. In our 
application, we would sometimes have to open and close a cooperating pair of 
files 6 times before we would get the crash. Our new test placed into the HDF 
test framework uses two files to help ensure that the dangling file reference 
refers to a file that is closed.
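
As a generic illustration of why the failure is masked (nothing 
HDF5-specific here, and allocator behaviour varies; HDF5's internal free 
lists make the reuse even more predictable):

    #include <stdio.h>
    #include <stdlib.h>

    struct fake_file { struct fake_file *parent; int id; };

    int main(void)
    {
        struct fake_file *a = malloc(sizeof *a);
        void *old = a;
        free(a);                                  /* "close" the file */

        struct fake_file *b = malloc(sizeof *b);  /* "re-open" it     */
        printf("same block reused: %s\n", (void *)b == old ? "yes" : "no");
        /* When the block is reused, a dangling pointer to the old
         * structure again points at live-looking memory, so nothing
         * visibly goes wrong. */
        free(b);
        return 0;
    }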

Because of this, a (temporary) source code modification is required to 
reliably trigger the failure exercised by the test that we have submitted. 
This modification introduces debug-heap-style behaviour by writing a 
0xCDCDCDCD pattern over memory when it is returned to the free list. See 
below for more information on the patches involved.
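
Schematically, the modification amounts to something like the following at 
the point in H5Fint.c where the H5F_t is released (the attached 
H5Fint.c.memory.patch has the exact hunk):

    /* Poison the block before it goes back on the internal free list so
     * that any later dereference of a dangling file pointer trips
     * immediately instead of reading stale-but-plausible data. */
    HDmemset(f, 0xCD, sizeof(H5F_t));
    f = H5FL_FREE(H5F_t, f);   /* existing free-list release */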

OUR FIX:

Our fix works, in that the HDF regression tests all seem to pass (on Windows, 
Visual Studio 2008), our system tests pass and the application no longer 
crashes. However, it is probably not the most elegant fix. Where the data 
type objects are copied into the conversion structure, I reach in and free 
the entire oloc field using internal API calls. As the 
oloc.holding_file field is FALSE, this has no real reference counting 
side-effects, and even if this field were TRUE, this is a copy of a data type 
so the original would still keep the file alive. Even though this sets 
oloc.file to NULL, the copy still reports as a named data type when the soft 
link renaming callback occurs. Therefore I also had to patch this callback to 
perform a secondary check for a NULL file pointer before proceeding.
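
In outline, the two halves of the fix look like this (the attached patches 
are authoritative; H5T_copy(), H5O_loc_free() and H5T_oloc() are internal 
routines from the 1.8.14 source tree):

    /* H5Tconv.c (schematic): wherever a conversion structure takes its
     * copy of a data type, immediately discard the copy's object
     * location so it no longer points into any file.  Because
     * oloc.holding_file is FALSE this has no reference-counting
     * side-effects. */
    if (NULL == (dt_copy = H5T_copy(dt, H5T_COPY_ALL)))
        HGOTO_ERROR(H5E_DATATYPE, H5E_CANTCOPY, FAIL, "unable to copy datatype")
    if (H5O_loc_free(&(dt_copy->oloc)) < 0)
        HGOTO_ERROR(H5E_DATATYPE, H5E_CANTFREE, FAIL, "unable to free object location")

    /* H5Gname.c rename callback (schematic): before dereferencing the
     * file pointer of a type that still reports as named, bail out if
     * that pointer was explicitly set to NULL by the change above. */
    oloc = H5T_oloc(type);
    if (oloc && NULL == oloc->file)
        HGOTO_DONE(SUCCEED)   /* conversion-path copy; nothing to rename */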

We tried COPY_TRANSIENT instead of COPY_ALL, but this still copies the oloc 
structure and file reference and so was not a solution.

We believe that holding_file must be FALSE because the data type conversion 
structures are in the global conversion path cache and reference counting 
should not be performed in this context.

I wonder if a more elegant solution may be possible as part of fixing 
HDFFV-9174. If, when you create a dataset with a named data type, you take a 
copy of the data type to associate with the dataset, and that copy correctly 
reports as *not* named, then the same technique could be used when taking 
copies of data types for the conversion structures. The existing check in 
H5Gname.c would then prevent dereferencing a dangling file reference if one 
exists. However, if you consider the very existence of the dangling 
reference bad practice, consider our method of freeing the oloc structure as 
part of the final solution.

Furthermore, we are not sure whether the “shared” field in the H5T_t structure 
should also be freed; it may also contain a dangling reference.

PATCHES: (attached)

[*] = We applied this patch to the standard build of 1.8.14 for the purpose of 
our production application build.

dtypes.c.patch = A patch for test/dtypes.c that adds two tests: one that 
shows how H5Tcommitted() gives inconsistent results for committed data types 
depending on whether you have reopened the file (HDFFV-9174; a sketch of 
this probe appears after this list), and another that shows how dangling 
file references from committed data types can be dereferenced when deleting 
a soft link (note: this requires H5Fint.c.memory.patch to be applied as 
well).

H5Fint.c.memory.patch = A patch for src/H5Fint.c that adds one line of code to 
set the content of the H5F_t memory freed on the HDF internal heap (and 
returned to the free list) to 0xCDCDCDCD so that we reliably trip the dangling 
file reference problem. Without this, the memory is untouched and with a 
non-null parent pointer the H5Ldelete() operation typically succeeds, even 
though it is dereferencing "deallocated" memory.

[*] H5Gname.c.patch = A patch that detects explicitly set NULL file pointers 
(as set in H5Tconv.c patch, next) in copies of named data types generated for 
the internal data conversion paths. Detecting these allows us to avoid 
dereferencing the dangling file reference that would otherwise be there. 
If The HDF Group fixes the H5Tcommitted() behaviour (HDFFV-9174), the check 
for not-a-named-datatype just above where this patch is applied would 
prevent the problem in a more robust fashion.

[*] H5Tconv.c.patch = A patch that explicitly frees the "oloc" reference 
(including its internal file pointer, which is the pointer that becomes a 
dangling reference) when data types are copied for the purposes of placing them 
in the global cache of converters (for various conversion paths). These 
outlive the files that would otherwise be referred to. We attempted a 
transient copy of these data types, but this has no effect on "oloc". We're 
uncertain about the shared location information, but testing suggests it is 
not being dereferenced, even though it might also hold a dangling file 
reference.
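
For reference, the H5Tcommitted() inconsistency probed by the first 
dtypes.c test looks roughly like this (dset is assumed to be a dataset 
handle created with the committed type):

    /* Probe H5Tcommitted() on the type retrieved from the dataset, both
     * before and after reopening the file.  The two answers should
     * agree but do not (HDFFV-9174). */
    hid_t dt = H5Dget_type(dset);
    printf("committed before reopen: %d\n", (int)H5Tcommitted(dt));
    H5Tclose(dt);

    /* ... close the dataset and file, H5Fopen() the file again,
     * H5Dopen2() the dataset, then probe again ... */
    dt = H5Dget_type(dset);
    printf("committed after reopen:  %d\n", (int)H5Tcommitted(dt));
    H5Tclose(dt);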

--8<--

All the best!
Mark



