Node futures

Andy Seaborne Fri, 20 Jul 2012 08:04:28 -0700

Here are some thoughts on Jena3-timeframe changes - we have the opportunity to make changes and, while we can't change too much without going into a black hole, a few key changes can be done.


== Node Hierarchy

At the moment, every node has an Object "label" in Node itself and it keys the cache, which is where LiteralLabels and AnonId come in. No cache means no need for a general label.

The class hierarchy then overrides the Node operations specific to its type. There is a Node_Concrete between URIs/Literals/blank nodes and Node.

With no node cache there is no need for a general Object label and each subclass can have its own slots.


(costs here are any virtual method dispatch)

= URI

A Java object is 2 slots + data, so a label adds 2 slots just by being there. Not too bad for URI - the label is the string - so the key to space saving is in parsers reusing (but not completely interning) strings.


= Literal

The label is a LiteralLabel, which is a bit more heavyweight.

It has lexical form, datatype, lang, hash, value, a wellformed flag, and an exceptionMsg cache: 5 object slots, a boolean and the exception cache slot.

The value is calculated at construction time - this is unhelpful for TDB, and ARQ does not use the value anyway (it has its own NodeValue system which does use the datatype code extensively but not Node values).
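
A rough sketch of what delaying the value calculation might look like. This is not Jena code: LazyLiteral and the parser function are hypothetical names, and the parser simply stands in for whatever datatype-specific code turns a lexical form into a value.

```java
import java.util.function.Function;

/** Hypothetical literal holder; names are illustrative, not Jena's API. */
final class LazyLiteral {
    private final String lexicalForm;
    private final String datatypeURI;               // datatype kept as a plain URI string
    private final Function<String, Object> parser;  // assumed stand-in for datatype parsing

    private Object value;           // filled in on the first getValue() call
    private boolean valueComputed;  // separates "not computed yet" from "ill-formed"
    private boolean wellFormed;

    LazyLiteral(String lexicalForm, String datatypeURI, Function<String, Object> parser) {
        this.lexicalForm = lexicalForm;
        this.datatypeURI = datatypeURI;
        this.parser = parser;
    }

    String getLexicalForm() { return lexicalForm; }

    String getDatatypeURI() { return datatypeURI; }

    /** Parses the lexical form once, on demand; returns null if it is ill-formed. */
    synchronized Object getValue() {
        if (!valueComputed) {
            try {
                value = parser.apply(lexicalForm);
                wellFormed = true;
            } catch (RuntimeException ex) {
                value = null;
                wellFormed = false;
            }
            valueComputed = true;
        }
        return value;
    }

    /** Forces the value calculation, then reports whether parsing succeeded. */
    boolean isWellFormed() {
        getValue();
        return wellFormed;
    }
}
```

Code that only ever looks at the lexical form and datatype (a storage layer, say) never pays for the parse.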


= Blank Nodes

Blank nodes (which matter less) have an AnonId as an extra object. I'd like to change to using UUIDs for bNodes, recorded as 2 longs (+ an option to use a string for backwards compatibility and because it is sometimes useful to debug with a preallocated label). We can remove AnonId except for compatibility (it becomes a transient object not stored in Node/Node_Blank).
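
To make the "2 longs plus optional string" idea concrete, here is a minimal sketch. BlankNodeId and its methods are hypothetical names, not existing Jena classes, and the choice that a labelled id never equals a UUID-based one is an assumption of the sketch.

```java
import java.util.Objects;
import java.util.UUID;

/** Hypothetical blank node identity: a UUID held as two longs, or a given label. */
final class BlankNodeId {
    private final long mostSig;
    private final long leastSig;
    private final String label;   // null unless a preallocated label was supplied

    private BlankNodeId(long mostSig, long leastSig, String label) {
        this.mostSig = mostSig;
        this.leastSig = leastSig;
        this.label = label;
    }

    /** Fresh identity; the UUID object itself is not retained. */
    static BlankNodeId create() {
        UUID u = UUID.randomUUID();
        return new BlankNodeId(u.getMostSignificantBits(), u.getLeastSignificantBits(), null);
    }

    /** Compatibility/debugging route: identity given by a caller-chosen label. */
    static BlankNodeId fromLabel(String label) {
        return new BlankNodeId(0L, 0L, Objects.requireNonNull(label));
    }

    /** String form, built on demand for UUID-based ids. */
    String getLabelString() {
        return label != null ? label : new UUID(mostSig, leastSig).toString();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof BlankNodeId)) return false;
        BlankNodeId other = (BlankNodeId) o;
        // Assumption: labelled ids only ever equal other labelled ids.
        if (label != null || other.label != null) return Objects.equals(label, other.label);
        return mostSig == other.mostSig && leastSig == other.leastSig;
    }

    @Override
    public int hashCode() {
        return label != null ? label.hashCode()
                             : 31 * Long.hashCode(mostSig) + Long.hashCode(leastSig);
    }
}
```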


= Variable

Minor - but the label is a VariableName object which holds a string. I don't think this is necessary if there is no cache (it is sort-of needed by the cache to distinguish variables from other things with the same string, and so the same hashCode).
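
A small sketch of why the wrapper stops being needed once each subclass holds its own data: equality can check the class, so a variable and a URI with the same string never collide. SketchNode, SketchURI and SketchVariable are made-up names for illustration, not the real Node classes.

```java
/** Hypothetical base class with no general Object label slot. */
abstract class SketchNode {
    @Override public abstract boolean equals(Object other);
    @Override public abstract int hashCode();
}

/** URI node: the only slot is the URI string. */
final class SketchURI extends SketchNode {
    private final String uri;

    SketchURI(String uri) { this.uri = uri; }

    String getURI() { return uri; }

    @Override
    public boolean equals(Object other) {
        return other instanceof SketchURI && uri.equals(((SketchURI) other).uri);
    }

    @Override
    public int hashCode() { return uri.hashCode(); }
}

/** Variable node: holds the name directly, with no wrapper object. */
final class SketchVariable extends SketchNode {
    private final String name;

    SketchVariable(String name) { this.name = name; }

    String getName() { return name; }

    @Override
    public boolean equals(Object other) {
        // The class test does the job the VariableName wrapper did for the cache.
        return other instanceof SketchVariable && name.equals(((SketchVariable) other).name);
    }

    @Override
    public int hashCode() { return name.hashCode(); }
}
```

The two classes may share a hashCode for the same string; that is only a hash collision, since equals still tells them apart.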


= And some new node types:

1/ Node Graph - for graphs-within-graphs. This is to future proof the hierarchy.

2/ Node Ext - this is a carrier - it means you can build RDF-like structures using triples and (in-memory) graphs. For example, when ordering a BGP, ARQ has a "known to be defined" marker. I'd like to put that in a Triple ... but can't as it's not a Node. Node Ext would have a contract that is "don't let it leak out of your usage scope" - i.e. no guarantees.


== Interfaces vs Concrete Nodes

Having a NodeFactory, rather than Node being its own factory, seems like a sensible step to take anyway. A bit more disruptive but, on balance, I think, worth it.


What about the age-old Node-as-interface, Node-as-fixed-class
and Node-as-concrete decision?

On the surface, interfaces look like the right thing. If the Node label slot is gone, and there is a NodeFactory, Node itself can be an interface.

But what does it cost these days? It isn't much if there is, by analysis, only one implementation (I think the JIT removes the virtual method call - anyone know?) but then what's the point of an interface?


1/ Interface to associate local data e.g. per store

This assumes nodes going in and out of a storage layer - so not for SPARQL. It can be provided within a store by keeping a map/cache of node to local info.


Where might it be used?

Not of help to TDB or SDB. The heavy operations are SPARQL, not programmatic nodes.

TDB looks up a Node in the node table cache - and may move to using a (long) hash as the key anyway.


2/ Interface for alternative implementations

Can't think of a use case aside from per-subsystem info. Even if we wanted that, a slot in a generic Node object is 8 bytes and a cast.


3/ Parser-independence

A parser can be passed a factory that creates Nodes of a specific implementation. This avoids Node->local node translation.
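
As an illustration only, a parser parameterised by a factory could look something like this. TermFactory and ToyTermParser are hypothetical, the token handling is a toy (one N-Triples-like term per token, no escapes or language tags), and the datatype is passed as a plain URI string rather than a datatype object.

```java
/** Hypothetical factory; N is whatever node representation the caller wants. */
interface TermFactory<N> {
    N createURI(String uri);
    N createLiteral(String lexicalForm, String datatypeURI);  // datatypeURI may be null
    N createBlankNode(String label);
}

/** Toy term parser: it only ever talks to the factory, never to a concrete node class. */
final class ToyTermParser<N> {
    private final TermFactory<N> factory;

    ToyTermParser(TermFactory<N> factory) {
        this.factory = factory;
    }

    N parseTerm(String token) {
        if (token.startsWith("<") && token.endsWith(">")) {
            return factory.createURI(token.substring(1, token.length() - 1));
        }
        if (token.startsWith("_:")) {
            return factory.createBlankNode(token.substring(2));
        }
        if (token.startsWith("\"")) {
            int close = token.lastIndexOf('"');
            if (close == 0) {
                throw new IllegalArgumentException("Unterminated literal: " + token);
            }
            String lex = token.substring(1, close);
            String rest = token.substring(close + 1);
            // Optional ^^<datatype> suffix; anything else is ignored in this sketch.
            String dt = (rest.startsWith("^^<") && rest.endsWith(">"))
                    ? rest.substring(3, rest.length() - 1)
                    : null;
            return factory.createLiteral(lex, dt);
        }
        throw new IllegalArgumentException("Unrecognised term: " + token);
    }
}
```

A store would supply a factory that builds its own local nodes directly, so nothing needs translating after the parse.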


So when it comes to real use cases, I don't see many!

One downside - interfaces have no static methods, so the transition to a NodeFactory cannot be smoothed.

Therefore leave as concrete for now, migrate, then maybe make an interface when the Node.create statics go.


== Parser encapsulation

In a chat with @quoll about reusing parser code without too much of the rest of Jena, we agreed that it would help if the parser emitted some simple container object for IRIs/literals/bnodes. There is some testing based on type quite late in the pipeline (e.g. no literals as predicates) as Triples get created.

RIOT creates Jena Nodes and Triples (and Quads). The point where more of the rest of Jena gets dragged in is datatypes, because literals have a datatype and in a Jena Node the datatype is an RDFDatatype.

ARP has separate AResource objects that wrap the parser output - there is a second Node creation (it's only a shallow copy as the strings etc. are not copied) to convert ARP-only objects to Jena Nodes but, given XML parsing is going on, that is not noticeable.

So Node is not currently suitable as the encapsulation directly, because of datatype. Maybe the datatype argument ought to be a Node(URI) and datatypes are an internal aspect.


This then fits with Node-as-fixed-class.

== Summary

1/ Remove Node cache
2/ Remove the Node label, put the per-class data in the subclasses.
3/ Delay Node literal "value" determination until first call.
4/ BNodes as UUIDs.

Maybe change the typed literals construction to take a Node, not the datatype itself.
