On Thursday 31 January 2008, you wrote:
> It looks like pytables already works well with existing hard links. I
> wanted to ask though if there might be any caching issues lurking
> beneath the surface that I should investigate. An example data set is
> attached (created with the nexus library using Paul Kienzle's python
> wrapper) where t.root.entry.r8_data is a 5x4 64bit float array and
> t.root.link.renLinkData is a hard link to the same. I can modify the
> array following either path, and support looks completely
> transparent, but I just wanted to be sure.
I've been talking with Ivan about this, and we have come to the
conclusion that implementing links in PyTables is much hairier than
I anticipated. The problem is mainly with metadata coherency (there
should be no problems with the data itself, as you have checked; maybe
tables, which have I/O buffers, could have some, but only in rather
exceptional cases).
As you probably know, PyTables puts a lot of effort into caching
metadata in order to speed up access to it (this is why it is so
efficient when handling potentially large hierarchies, see [1]_). The
cached metadata is basically found in three places:
* Node objects
* AttributeSet objects
* Index objects (this is a special case for the Pro version, but
important to us).
Here are a couple of examples of the kind of problems that can be
seen. Firstly, problems with the node cache:
In [72]: f.root.entry.sample
Out[72]:
/entry/sample (Group) ''
children := ['ch_data' (Array)]
In [73]: f.root.link.renLinkGroup
Out[73]:
/link/renLinkGroup (Group) ''
children := ['ch_data' (Array)] # it's a link to '/entry/sample'
In [74]: new_node=f.createArray(f.root.link.renLinkGroup, "new_array", [1,2])
In [75]: f.root.link.renLinkGroup
Out[75]:
/link/renLinkGroup (Group) ''
children := ['new_array' (Array), 'ch_data' (Array)]
In [76]: f.root.entry.sample
Out[76]:
/entry/sample (Group) ''
children := ['ch_data' (Array)]
where you can see that the 'new_array' node is missing in 'sample' (!).
Secondly, problems with the attribute metadata cache:
In [51]: f.root.entry.r8_data.attrs
Out[51]:
/entry/r8_data._v_attrs (AttributeSet), 4 attributes:
[ch_attribute := 'NeXus',
i4_attribute := 42,
r4_attribute := 3.14159274101,
target := '/entry/r8_data']
In [52]: f.root.link.renLinkData.attrs
Out[52]:
/link/renLinkData._v_attrs (AttributeSet), 4 attributes:
[ch_attribute := 'NeXus',
i4_attribute := 42,
r4_attribute := 3.14159274101,
target := '/entry/r8_data']
In [53]: f.root.link.renLinkData.attrs.userattr = "a test"
In [54]: f.root.link.renLinkData.attrs
Out[54]:
/link/renLinkData._v_attrs (AttributeSet), 5 attributes:
[ch_attribute := 'NeXus',
i4_attribute := 42,
r4_attribute := 3.14159274101,
target := '/entry/r8_data',
userattr := 'a test']
In [55]: f.root.entry.r8_data.attrs
Out[55]:
/entry/r8_data._v_attrs (AttributeSet), 4 attributes:
[ch_attribute := 'NeXus',
i4_attribute := 42,
r4_attribute := 3.14159274101,
target := '/entry/r8_data']
Note that the 'userattr' attribute is missing in the 'r8_data' node.
I'll skip the discussion of the problems with indexes, as the ones
already mentioned are more than enough to make the point.
These metadata cache coherency issues are pretty difficult to solve, as
we would need to rethink the whole structure of PyTables to take links
into account, because the current design was simply not conceived to
deal with them.
An additional problem is that HDF5 seems to allow hard links to
Groups, which makes it possible to create 'loops' in the hierarchy.
Of course, supporting this in PyTables introduces more complexity,
but, as a first approximation, we could 'disable' this feature, so
I'll skip the discussion of the issues in that regard.
Going back to the cache coherency problem, a key aspect for solving it
would be how to uniquely determine the data area of a node on disk
(i.e. the equivalent of an 'inode' in a filesystem), and take this
identifier as the new 'primary key' for the node cache (right now, this
role is played by the node path, but this is precisely what introduces
the cache coherency problem). A possible candidate for playing
this 'primary key' role would be the '_v_objectID' node attribute, but
unfortunately HDF5 returns different IDs for links pointing to the
same 'inode':
In [77]: f.root.entry.sample._v_objectID
Out[77]: 134217739
In [78]: f.root.link.renLinkGroup._v_objectID
Out[78]: 134217748
Ummm, I will ask on the HDF5 mailing list whether it would be possible
to get a unique identifier for all the links pointing to the same data
area. If HDF5 can provide such an identifier, the next step would be to
rethink the structure of the metadata cache in PyTables and implement a
new one based on the 'inode' concept instead of the 'node path' one,
which is certainly not a trivial task (to say the least).
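To make the difference between the two kinds of 'primary key' concrete,
here is a toy sketch in plain Python. It is not PyTables code: the
classes, the PATH_TO_INODE table and the integer 'inode' are all
hypothetical stand-ins, and it simply assumes that some unique
identifier per data area is available.

```python
# Toy model only: none of these names exist in PyTables or HDF5.

class NodeMeta:
    """Cached metadata for one node (here, just its list of children)."""
    def __init__(self):
        self.children = []

# Two paths (hard links) referring to the same on-disk object.
# The integer is an assumed 'inode'-like identifier.
PATH_TO_INODE = {
    "/entry/sample": 1,
    "/link/renLinkGroup": 1,
}

class PathKeyedCache:
    """Cache keyed by node path: every link gets its own metadata copy."""
    def __init__(self):
        self._cache = {}

    def get(self, path):
        if path not in self._cache:
            self._cache[path] = NodeMeta()
        return self._cache[path]

class InodeKeyedCache:
    """Cache keyed by 'inode': all links share one metadata object."""
    def __init__(self, path_to_inode):
        self._path_to_inode = path_to_inode
        self._cache = {}

    def get(self, path):
        inode = self._path_to_inode[path]
        if inode not in self._cache:
            self._cache[inode] = NodeMeta()
        return self._cache[inode]

# Path-keyed: the update made through the link is invisible via the
# other path, as in the node cache session above.
by_path = PathKeyedCache()
by_path.get("/link/renLinkGroup").children.append("new_array")
print(by_path.get("/entry/sample").children)    # []

# Inode-keyed: both paths resolve to the same cached metadata.
by_inode = InodeKeyedCache(PATH_TO_INODE)
by_inode.get("/link/renLinkGroup").children.append("new_array")
print(by_inode.get("/entry/sample").children)   # ['new_array']
```

The whole difficulty is hidden in PATH_TO_INODE: the sketch only works
if something can map every link to the same identifier.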
> It looks like rounding out support for hard links would simply
> require adding a new method to File to create the link. I propose
> something like
>
> File.linkNode(self, where, name, curObject)
> or
> File.createLink(self, where, name, curObject)
>
> The argument list here follows the pattern in createTable.
Yeah, I like both. Perhaps the 'createLink' flavor is more consistent
with the 'actionNode' pattern that is used in other constructors.
> Soft links would take more work. I don't think I would use them
> myself, so I probably am the wrong person to suggest their
> implementation. Maybe they would require a new pytables object
> deriving from Leaf, I don't really know how such a thing should
> behave. They could be added later, and be created with the same file
> method through the addition of a linktype kwarg that defaults to hard
> links.
Curiously enough, Ivan and I think that 'soft' links would be far
cheaper to implement in current PyTables than their 'hard'
counterparts. This is because the metadata 'primary keys' in
our object tree cache could continue to be based on 'node paths'
instead of 'inodes'. There is still the problem of metadata cache
coherency, but we could think about maintaining lists of 'soft links'
pointing to each 'real' node, so that we can update them on each
modification of the real node. Or perhaps implementing the 'soft
links' using a proxy pattern would be more than enough. These are
ideas off the top of my head, and we should think more about them
anyway.
Well, sorry for not having been able to anticipate so many
difficulties earlier.
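As a rough illustration of the proxy idea, here is a minimal sketch in
plain Python. Again, this is not PyTables code: Node, SoftLink and the
'registry' dictionary (standing in for a path-keyed node cache) are
hypothetical names made up for the example.

```python
# Toy model only: none of these names exist in PyTables.

class Node:
    """A 'real' node, cached under its own path."""
    def __init__(self, path):
        self.path = path
        self.attrs = {}

class SoftLink:
    """Proxy that stores only a target path and resolves it on access."""
    def __init__(self, link_path, target_path, registry):
        self._link_path = link_path
        self._target_path = target_path
        self._registry = registry

    def dereference(self):
        # Look the target up by path each time, so the link holds
        # no metadata copy of its own that could go stale.
        return self._registry[self._target_path]

    def __getattr__(self, name):
        # Only called when normal lookup fails. Private names are not
        # forwarded, which avoids recursion on half-built instances.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self.dereference(), name)

# The registry plays the role of the path-keyed cache of real nodes.
registry = {"/entry/r8_data": Node("/entry/r8_data")}
link = SoftLink("/link/renLinkData", "/entry/r8_data", registry)

# An attribute set through the link lands on the real node.
link.attrs["userattr"] = "a test"
print(registry["/entry/r8_data"].attrs)   # {'userattr': 'a test'}
```

Since the link keeps no metadata of its own, the cache can stay keyed by
path and there is nothing to synchronise, at the price of one extra
lookup per access.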
.. [1] http://www.carabos.com/downloads/resources/NewObjectTreeCache.pdf
Cheers,
--
>0,0< Francesc Altet http://www.carabos.com/
V V Cárabos Coop. V. Enjoy Data
"-"