Here are some thoughts on Jena3-timeframe changes - we have the
opportunity to make changes and while we can't change too much without
going into a black hole, a few key changes can be done.
== Node Hierarchy
At the moment, every node has an Object "label" in Node itself and it
keys the cache, which is where LiteralLabels and AnonId come in. No
cache means no need to use a general Label.
Each class in the hierarchy then overrides the Node operations specific
to its type. There is a Node_Concrete between URIs/Literals/blank nodes
and Node.
With no node cache there is no need for a general Object label and each
subclass can have its own slots.
(The cost here is any virtual method dispatch.)
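To make the "own slots" idea concrete, here is a rough sketch of a hierarchy with no shared label field. All class names are made up for illustration and are not the current Node API; it only shows URI and variable, assuming each subclass holds exactly the data it needs.

```java
import java.util.Objects;

/** Hypothetical names throughout; this is not the existing Jena Node API. */
final class HierarchySketch {
    /** Base class: no generic Object label, only operations that subclasses override. */
    static abstract class SNode {
        boolean isURI()     { return false; }
        boolean isLiteral() { return false; }
        boolean isBlank()   { return false; }
    }

    /** Mirrors the Node_Concrete layer between Node and URIs/literals/blank nodes. */
    static abstract class SConcrete extends SNode { }

    /** A URI node keeps just the string. */
    static final class SUri extends SConcrete {
        private final String uri;
        SUri(String uri) { this.uri = Objects.requireNonNull(uri); }
        String getURI() { return uri; }
        @Override boolean isURI() { return true; }
        @Override public boolean equals(Object o) {
            return o instanceof SUri && uri.equals(((SUri) o).uri);
        }
        @Override public int hashCode() { return uri.hashCode(); }
    }

    /** A variable keeps its name directly, with no wrapper object. */
    static final class SVar extends SNode {
        private final String name;
        SVar(String name) { this.name = Objects.requireNonNull(name); }
        String getName() { return name; }
        @Override public boolean equals(Object o) {
            return o instanceof SVar && name.equals(((SVar) o).name);
        }
        // Mixing in a constant so a variable and a URI with the same string hash differently.
        @Override public int hashCode() { return 31 * name.hashCode() + 7; }
    }
}
```

Type tests such as isURI() go through a virtual call here, which is the dispatch cost mentioned above.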
= URI
A Java object is 2 slots + data, so a label adds 2 slots just by
existing. Not too bad for URI - the label is the string - so the key to
space saving is in parsers reusing (but not completely interning) strings.
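One way a parser might reuse strings without full interning is a small bounded cache. This is only a sketch: StringReuseCache is a hypothetical helper, and the LRU policy and size limit are my assumptions, not anything that exists in the parsers today.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: a small LRU cache so a parser hands out shared String instances. */
final class StringReuseCache {
    private final Map<String, String> cache;

    StringReuseCache(final int maxSize) {
        // accessOrder=true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, String>(64, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Evict the least recently used entry once the limit is exceeded.
                return size() > maxSize;
            }
        };
    }

    /** Returns a previously seen equal string if still cached, else remembers this one. */
    String reuse(String s) {
        String prev = cache.get(s);
        if (prev != null) return prev;
        cache.put(s, s);
        return s;
    }

    int size() { return cache.size(); }
}
```

Because the cache is bounded, a string that falls out of it may later be duplicated, which is the trade-off against complete interning.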
= Literal
The label is a LiteralLabel which is a bit more heavyweight.
It has lexical form, datatype, lang, hash, value, a wellformed flag, and
an exceptionMsg cache. 5 object slots, a boolean and the exception
cache slot.
The value is calculated at construction time - this is unhelpful for TDB,
and ARQ does not use the value anyway (it has its own NodeValue system
which does use the datatype code extensively but not Node values).
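Delaying the value until it is first asked for could look roughly like this. LazyLiteral is a hypothetical class, and the Function parameter is a stand-in I've assumed for whatever the datatype code does to parse a lexical form.

```java
import java.util.function.Function;

/** Hypothetical literal that parses its value on first request, not at construction. */
final class LazyLiteral {
    private final String lexicalForm;
    private final String datatypeURI;
    private final Function<String, Object> parser; // stand-in for the datatype's parse step
    private Object value;          // null until computed, or if ill-formed
    private boolean valueComputed; // distinguishes "not yet tried" from "ill-formed"

    LazyLiteral(String lexicalForm, String datatypeURI, Function<String, Object> parser) {
        this.lexicalForm = lexicalForm;
        this.datatypeURI = datatypeURI;
        this.parser = parser;
    }

    String getLexicalForm() { return lexicalForm; }
    String getDatatypeURI() { return datatypeURI; }

    /** Computes the value at most once; an ill-formed lexical form yields null. */
    synchronized Object getValue() {
        if (!valueComputed) {
            try {
                value = parser.apply(lexicalForm);
            } catch (RuntimeException e) {
                value = null; // ill-formed lexical form
            }
            valueComputed = true;
        }
        return value;
    }

    boolean isWellFormed() { return getValue() != null; }
}
```

A store that only moves lexical forms and datatype URIs around never pays for the parse.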
= Blank Nodes
Blank nodes (which matter less) have an AnonId as an extra object. I'd
like to change to using UUIDs for bNodes, recorded as 2 longs (+ an
option to use a string for backwards compatibility and because it is
useful to sometimes debug with a preallocated label). We can remove
AnonId except for compatibility (it becomes a transient object not stored
in Node/Node_Blank).
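A sketch of what the two-longs form might look like, using java.util.UUID only to generate and print the label. BlankNodeSketch and its method names are hypothetical; how the string-label option is represented is my assumption.

```java
import java.util.Objects;
import java.util.UUID;

/** Hypothetical blank node: the label is a UUID held as two longs, or a given string. */
final class BlankNodeSketch {
    private final long mostSig;
    private final long leastSig;
    private final String label; // non-null only for the string-label option

    private BlankNodeSketch(long mostSig, long leastSig, String label) {
        this.mostSig = mostSig;
        this.leastSig = leastSig;
        this.label = label;
    }

    /** Fresh blank node backed by a random UUID. */
    static BlankNodeSketch create() {
        UUID u = UUID.randomUUID();
        return new BlankNodeSketch(u.getMostSignificantBits(), u.getLeastSignificantBits(), null);
    }

    /** Blank node with a caller-supplied label (compatibility, debugging). */
    static BlankNodeSketch create(String label) {
        return new BlankNodeSketch(0L, 0L, Objects.requireNonNull(label));
    }

    /** Builds the label string on demand; no separate id object is stored in the node. */
    String getLabelString() {
        return label != null ? label : new UUID(mostSig, leastSig).toString();
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof BlankNodeSketch)) return false;
        BlankNodeSketch other = (BlankNodeSketch) o;
        return mostSig == other.mostSig && leastSig == other.leastSig
            && Objects.equals(label, other.label);
    }

    @Override public int hashCode() {
        return label != null ? label.hashCode() : Long.hashCode(mostSig ^ leastSig);
    }
}
```

An AnonId-style object could then be created from getLabelString() when old code asks for one, rather than being held by the node.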
= Variable
Minor - but the label is a VariableName object which holds a string. I
don't think this is necessary if there is no cache (it is sort-of needed
by the cache to distinguish it from other things with the same string,
and so the same hashCode).
= And some new node types:
1/ Node Graph - for graphs-within-graphs. This is to future proof the
hierarchy.
2/ Node Ext - this is a carrier - it means you can build RDF-like
structures using triples and (in-memory) graphs. For example, when
ordering a BGP, ARQ has a "known to be defined" marker. I'd like to put
that in a Triple ... but can't as it's not a Node. Node Ext would have
a contract that is "don't let it leak out of your usage scope" - i.e. no
guarantees.
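A Node Ext carrier might be as small as the sketch below. Everything here is hypothetical (XNode, XExt, XTriple and the Marker enum are illustrative names); it just shows a non-RDF object sitting in a triple slot.

```java
/** Hypothetical names; a sketch of a carrier node, not an existing Jena class. */
final class NodeExtSketch {
    static abstract class XNode { }

    static final class XUri extends XNode {
        final String uri;
        XUri(String uri) { this.uri = uri; }
    }

    /** Carrier: holds any application object so it can occupy a triple slot. */
    static final class XExt<T> extends XNode {
        private final T payload;
        XExt(T payload) { this.payload = payload; }
        T get() { return payload; }
    }

    static final class XTriple {
        final XNode subject, predicate, object;
        XTriple(XNode s, XNode p, XNode o) { subject = s; predicate = p; object = o; }
    }

    /** Illustrative stand-in for a "known to be defined" marker. */
    enum Marker { KNOWN_DEFINED }
}
```

Nothing stops such a triple reaching a serializer, which is why the contract has to be "keep it inside your own scope".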
== Interfaces vs Concrete Nodes
Having a NodeFactory, not Node being its own factory, seems like a
sensible step to take anyway. A bit more disruptive but, on balance, I
think, worth it.
What about the age-old Node-as-interface, Node-as-fixed-class
and Node-as-concrete decision?
On the surface, interfaces look like the right thing. If the Node label
slot is gone, and there is a NodeFactory, Node itself can be an interface.
But what does it cost these days? It isn't much if there is, by
analysis, only one implementation (I think the JIT removes the virtual
method call - anyone know?) but then what's the point of an interface?
1/ Interface to associate local data e.g. per store
This assumes nodes going in and out of a storage layer - so not for
SPARQL. It can be provided within a store by keeping a map/cache of node
to local info.
Where might it be used?
Not of help to TDB or SDB. The heavy operations are SPARQL, not
programmatic nodes.
TDB looks up a Node in the node table cache - and may move to using a
(long) hash as the key anyway.
2/ Interface for alternative implementations
Can't think of a use case aside from per-subsystem info. Even if we
wanted that, a slot in a generic Node object is 8 bytes and a cast.
3/ Parser-independence
A parser can be passed a factory that creates Nodes of a specific
implementation. This avoids Node->local node translation.
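For the parser-independence case, a sketch of a parser parameterised by a factory. NodeMaker and parseTerms are hypothetical, and the one-term-per-line input is a toy format invented for the example, not a real RDF syntax.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical names; shows a parser that builds whatever node type the factory makes. */
final class FactorySketch {
    /** Factory the parser is handed; N is the caller's own node type. */
    interface NodeMaker<N> {
        N createURI(String uri);
        N createLiteral(String lexicalForm, String datatypeURI);
        N createBlank(String label);
    }

    /** Toy reader: one term per line, written as <uri>, _:label or "plain string". */
    static <N> List<N> parseTerms(String input, NodeMaker<N> maker) {
        List<N> out = new ArrayList<>();
        for (String line : input.split("\n")) {
            String t = line.trim();
            if (t.isEmpty()) continue;
            if (t.startsWith("<") && t.endsWith(">")) {
                out.add(maker.createURI(t.substring(1, t.length() - 1)));
            } else if (t.startsWith("_:")) {
                out.add(maker.createBlank(t.substring(2)));
            } else if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
                // No datatype syntax in this toy format, so pass null.
                out.add(maker.createLiteral(t.substring(1, t.length() - 1), null));
            } else {
                throw new IllegalArgumentException("Unrecognised term: " + t);
            }
        }
        return out;
    }
}
```

A store would pass a NodeMaker that returns its local node representation, so there is no second pass converting generic nodes to local ones.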
So when it comes to real use cases, I don't see many!
One downside - interfaces have no static methods, so the transition to a
NodeFactory can't be made smooth.
Therefore leave as concrete for now, migrate, then maybe make an
interface when the Node.create statics go.
== Parser encapsulation
In a chat with @quoll about reusing parser code without too much of the
rest of Jena, we agreed that it would help if the parser emitted some
simple container object for IRIs/literals/bnodes. There is some testing
based on type quite late in the pipeline (e.g. no literals as
predicates) as Triples get created.
RIOT creates Jena Nodes and Triples (and Quads). The point where more of
the rest of Jena gets dragged in is datatypes, because literals have a
datatype and in a Jena Node the datatype is an RDFDatatype.
ARP has separate AResource objects that wrap the parser output - there
is a second Node creation (it's only a shallow copy as the strings etc
are not copied) to convert ARP-only objects to Jena Nodes but, given XML
parsing is going on, that is not noticeable.
So Node is not currently suitable as the encapsulation directly, because
of datatype. Maybe the datatype argument ought to be a Node (URI), with
datatypes as an internal aspect.
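Roughly what "datatype as a URI node" could mean in code. UriNode, LiteralNode and DatatypeRegistry are hypothetical names, and the registry keyed by URI string is my assumption about how the internal datatype machinery might be reached.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical names; the literal holds only a URI node for its datatype. */
final class DatatypeAsNodeSketch {
    static final class UriNode {
        final String uri;
        UriNode(String uri) { this.uri = uri; }
    }

    /** What a parser would emit: no dependency on any datatype class. */
    static final class LiteralNode {
        final String lexicalForm;
        final UriNode datatype;
        LiteralNode(String lexicalForm, UriNode datatype) {
            this.lexicalForm = lexicalForm;
            this.datatype = datatype;
        }
    }

    /** Internal side: datatype handling is looked up by URI only when a value is needed. */
    static final class DatatypeRegistry {
        private final Map<String, Function<String, Object>> parsers = new HashMap<>();

        void register(String uri, Function<String, Object> parser) {
            parsers.put(uri, parser);
        }

        /** Returns the parsed value, or null if the datatype is not registered. */
        Object valueOf(LiteralNode lit) {
            Function<String, Object> p = parsers.get(lit.datatype.uri);
            return p == null ? null : p.apply(lit.lexicalForm);
        }
    }
}
```

The parser side only ever touches UriNode and LiteralNode; the registry stays with the rest of the system.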
This then fits with Node-as-fixed-class.
== Summary
1/ Remove Node cache
2/ Remove the Node label, put the per-class data in the subclasses.
3/ Delay Node literal "value" determination until first call.
4/ BNodes as UUIDs.
Maybe change the typed literals construction to take a Node, not the
datatype itself.