Dear Nathaniel!
h5read is in fact designed to be able to call a user-defined function
h5read.<myclass>, but this is not yet fully implemented or tested.
I put it on hold because of the complexity of the task. But maybe you and the
Bioc-devel list can help.
I can imagine the following scenario:
- h5write can write the attribute attr(foo, "class") <- "myclass" to the HDF5
object.
This is already set up; one can invoke it with write.attributes=TRUE, as
you mentioned.
h5write is a generic function, so one can write one's own h5write.<myclass>
function.
- Before h5read reads the object, it tries to read the class attribute and
invokes h5read.<myclass>,
which is defined somewhere outside rhdf5.
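To make the second step concrete, here is a rough sketch of the lookup in plain R. It does not touch rhdf5 at all: dispatch_reader is a hypothetical helper, "myclass" is a made-up class, and the numeric vector only stands in for data that would come from the file.

```r
# Sketch only: look up a reader named h5read.<class> and apply it if found.
# 'dispatch_reader' is a hypothetical helper, not part of rhdf5.
dispatch_reader <- function(obj, cl, envir = parent.frame()) {
  if (is.null(cl) || length(cl) == 0L) return(obj)
  for (k in cl) {                          # try each class name in order
    fname <- paste("h5read", k, sep = ".")
    if (exists(fname, envir = envir, mode = "function")) {
      reader <- get(fname, envir = envir, mode = "function")
      return(reader(obj))
    }
  }
  obj                                      # no reader found: return data unchanged
}

# Example reader for the made-up class "myclass"
h5read.myclass <- function(obj) structure(obj, class = "myclass")

raw <- c(42, 41, 40)                       # stands in for data read from the file
res <- dispatch_reader(raw, cl = "myclass")
```

If no h5read.<class> function is visible, the data come back unchanged, which is probably the behaviour one wants for classes without a dedicated reader.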
The problems I came across are:
1.) Usually, the h5read.<myclass> is implemented in some package "mypackage".
How do I know which package it is if the package is not yet loaded?
Do we have to store an additional "BioCpackage" attribute in the HDF5
object?
2.) What happens if the package provider changes the class definition in the
next BioC-release?
Do we have to store a package version number as well?
3.) How shall we deal with R-attributes?
HDF5-attributes are not able to store all R-attributes, because
HDF5-attributes are restricted to
a maximum size, whereas R-attributes can be almost as large as you like. One way
would be
to store attributes in a group called /obj.ATTRIBUTES.
E.g. assume you have an R-object foo with attribute names = c("A","B", ...)
of length 2^30
and geneNames = c("ENSGA","ENSGB", ...)
Should h5write write the following:
/foo : an HDF5 object, e.g. an integer array
/foo.ATTRIBUTES : a group
/foo.ATTRIBUTES/names : a string vector
/foo.ATTRIBUTES/geneNames : a string vector
This definitely breaks if someone wants to write a list that contains
both
elements "foo" and "foo.ATTRIBUTES". Is this acceptable?
4.) What is the best standard for storing S3/S4-objects in HDF5?
Assume there is an object foo of class baa with slots a = "integer", b =
"double" and c = "mysecondclass".
Should h5write write the following:
/foo : a group with attributes class="baa", BioCpackage="baapackage"
/foo/slots : a group
/foo/slots/a : integer
/foo/slots/b : double
/foo/slots/c : a group with attributes class="mysecondclass",
BioCpackage="mysecondpackage"
/foo/slots/c/slots
And assuming foo has additional attributes as above, h5write would write in
addition:
/foo.ATTRIBUTES : a group
/foo.ATTRIBUTES/names : a string vector
/foo.ATTRIBUTES/geneNames : a string vector
This standard would allow the definition of a function that reads
S3/S4-objects of any kind
and still allow the user to define their own function h5read.<myclass>.
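For problems 1.) and 2.), a lookup along the following lines might work. This is only a sketch in plain R: resolve_reader is a hypothetical helper, and the package name and version are assumed to come from stored attributes such as "BioCpackage". The prefix argument is parameterised only so that the sketch can be tried against functions that exist today.

```r
# Sketch only: find a reader function in the package named by a stored
# attribute. 'resolve_reader' is a hypothetical helper, not part of rhdf5.
resolve_reader <- function(cl, pkg, min_version = NULL, prefix = "h5read") {
  # Load the namespace without attaching it; give up quietly if not installed
  if (!requireNamespace(pkg, quietly = TRUE)) return(NULL)
  # Compare the installed version with the version recorded in the file
  if (!is.null(min_version) && utils::packageVersion(pkg) < min_version) {
    warning("installed ", pkg, " is older than the version that wrote the object")
  }
  fname <- paste(prefix, cl, sep = ".")
  ns <- asNamespace(pkg)
  if (exists(fname, envir = ns, mode = "function", inherits = FALSE)) {
    get(fname, envir = ns, mode = "function")
  } else {
    NULL
  }
}

# Try it with a function that certainly exists: summary.lm in package stats
f <- resolve_reader("lm", "stats", prefix = "summary")
is.function(f)
```

A NULL result means either the package is not installed or it does not provide the reader, so the caller can fall back to returning the raw data.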
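For the name clash mentioned under 3.), the proposed layout can be written down as a list of paths and checked for duplicates before anything is written. Again a sketch in plain R only: layout_paths and list_paths are hypothetical helpers, and paths are given relative to the group that would hold the list.

```r
# Sketch only: list the HDF5 paths the proposed layout would create for one
# named object. 'layout_paths' is a hypothetical helper, not part of rhdf5.
layout_paths <- function(name, obj) {
  path <- paste0("/", name)
  attrs <- attributes(obj)
  if (is.null(attrs)) return(path)         # no attributes: only the object itself
  c(path,
    paste0(path, ".ATTRIBUTES"),
    paste0(path, ".ATTRIBUTES/", names(attrs)))
}

# Paths for every element of a named list, so that clashes can be detected
list_paths <- function(x) {
  unlist(lapply(names(x), function(n) layout_paths(n, x[[n]])))
}

foo <- structure(1:2, names = c("A", "B"), geneNames = c("ENSGA", "ENSGB"))
paths <- layout_paths("foo", foo)

# A list holding both "foo" and "foo.ATTRIBUTES" produces the same path twice
clash <- list(foo = foo, foo.ATTRIBUTES = 1L)
has_clash <- any(duplicated(list_paths(clash)))
```

With such a check h5write could at least stop with an error instead of silently overwriting one of the two elements.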
What do you think about this? I guess that is the direction that you have in
mind. Any other
suggestions and comments are welcome.
Bernd
On 07.08.2014, at 02:49, Nathaniel Hayden wrote:
> When reading from an hdf5 file I would like to automatically call a function
> I define when datasets of an arbitrary type (see: 'class') are read from an
> hdf5 file. Since it looks like the existing infrastructure (courtesy of the
> 'callGeneric' parameter in h5read) in rhdf5 was made for this, I would like
> to avoid duplicating work. But I can't find an example of the
> h5read.<classname> functionality indicated in the callGeneric description in
> the h5read man page.
>
> A simple example is if the type is integer, I want as.integer to be
> automatically called on the read-in object before it gets passed back. But I
> intend to extend this to other Bioconductor classes of arbitrary complexity.
>
> Based on the documentation, it seems like either using attr(foo, "class") <-
> "integer" (in conjunction with h5write(<...>, write.attributes=TRUE)) or
> adding a 'class' attribute through the h5writeAttribute interface should be
> enough to trigger the h5read.integer function upon calling h5read. Neither
> seems to work. Note that I can pass read.attributes=TRUE and the attributes
> get assigned the object (for example, the object comes back with a "class"
> attribute), but that's not exactly what I'm after.
>
> In looking at the R/h5read.R source code, it looks like the block where the
> h5read.<classname> call gets set up (around line 59) queries the "class"
> attribute of the read-in obj before the h5 object's attributes are actually
> read, so the 'cl' variable never seems to get set.
>
> Here's an example where I would expect h5read.<classname> to be invoked, but
> it doesn't:
>
> library(rhdf5)
> h5read.integer <- function(obj) { as.integer(obj) } ## h5read.<classname>
> debug(h5read.integer)
> exists(paste("h5read","integer",sep="."),mode="function")
>
> h5fl <- tempfile(fileext=".h5")
> h5createFile(h5fl)
> ints <- 42L:33L
> attr(ints, "class") <- "integer"
> h5write(ints, h5fl, "foo", write.attributes=TRUE)
> H5close()
>
> ## h5writeAttribute route
> ##fid <- H5Fopen(h5fl)
> ##did <- H5Dopen(fid, "foo")
> ##h5writeAttribute("integer", did, name="class")
> ##H5close()
>
> ##res <- h5read(h5fl, "foo", read.attributes=FALSE)
> res <- h5read(h5fl, "foo", read.attributes=TRUE)
>
> Running the external h5dump utility confirms that a "class" attribute is
> attached to the foo DATASET, which seems to match what the h5read man page
> prescribes. If I edit the source code to set the 'cl' variable to "integer"
> my h5read.integer function gets invoked, as expected.
>
> Any help would be much appreciated. Thank you.