Here are some thoughts on Jena3-timeframe changes. We have the opportunity to make changes and, while we can't change too much without going into a black hole, a few key changes can be made.

== Node Hierarchy

At the moment, every node has an Object "label" in Node itself and it keys the cache, which is where LiteralLabels and AnonId come in. No cache means no need to use a general label.

The class hierarchy then overrides the Node operations specific to its type. There is a Node_Concrete between URIs/literals/blank nodes and Node.

With no node cache there is no need for a general Object label and each subclass can have its own slots.

(costs here are any virtual method dispatch)

= URI

A Java object is 2 slots + data, so a label adds 2 slots just by being. Not too bad for URI - the label is the string - so the key to space saving is in parsers reusing (but not completely interning) strings.

= Literal

The label is a LiteralLabel which is a bit more heavyweight.
It has lexical form, datatype, lang, hash, value, a wellformed flag, and an exceptionMsg cache. 5 object slots, a boolean and the exception cache slot.

The value is calculated at construction time - this is unhelpful for TDB, and ARQ does not use the value anyway (it has its own NodeValue system which does use the datatype code extensively but not Node values).
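As a rough sketch of delaying the value calculation until the first call, something like the following would do. LazyLiteral and the Function standing in for the datatype code are hypothetical names for illustration only, not Jena's LiteralLabel API.

```java
import java.util.function.Function;

// Hypothetical sketch: LazyLiteral is not a Jena class.
final class LazyLiteral {
    private final String lexicalForm;
    private final String datatypeURI;
    // Stand-in for the datatype parsing code; assumed to throw a
    // RuntimeException when the lexical form is ill-formed.
    private final Function<String, Object> parser;

    private boolean computed = false;    // has the value been worked out yet?
    private boolean wellFormed = false;
    private Object value = null;

    LazyLiteral(String lexicalForm, String datatypeURI, Function<String, Object> parser) {
        this.lexicalForm = lexicalForm;
        this.datatypeURI = datatypeURI;
        this.parser = parser;
        // No parsing here: code that never asks for the value never pays for it.
    }

    String getLexicalForm() { return lexicalForm; }
    String getDatatypeURI() { return datatypeURI; }

    // synchronized keeps the sketch simple; a real implementation
    // might choose a cheaper scheme.
    synchronized Object getValue()      { ensureComputed(); return value; }
    synchronized boolean isWellFormed() { ensureComputed(); return wellFormed; }

    // Runs the parser at most once; callers hold the lock.
    private void ensureComputed() {
        if (computed) return;
        try {
            value = parser.apply(lexicalForm);
            wellFormed = true;
        } catch (RuntimeException e) {
            value = null;
            wellFormed = false;
        }
        computed = true;
    }
}
```

The wellformed flag is delayed along with the value here, since both come out of the same parse.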

= Blank Nodes

Blank nodes (which matter less) have an AnonId as an extra object. I'd like to change to using UUIDs for bNodes, recorded as 2 longs (+ an option to use a string for backwards compatibility and because it is sometimes useful to debug with a preallocated label). We can remove AnonId except for compatibility (it becomes a transient object not stored in Node/Node_Blank).
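A minimal sketch of a blank node held as two longs, with the optional string label, might look like this. BlankSketch and its method names are made up for illustration; only java.util.UUID is a real API here.

```java
import java.util.Objects;
import java.util.UUID;

// Hypothetical sketch: BlankSketch is not Jena's Node_Blank.
final class BlankSketch {
    private final long mostSig;
    private final long leastSig;
    private final String label;   // null unless a preallocated label was supplied

    private BlankSketch(long mostSig, long leastSig, String label) {
        this.mostSig = mostSig;
        this.leastSig = leastSig;
        this.label = label;
    }

    // Fresh blank node: the UUID is stored as two longs, no extra object kept.
    static BlankSketch create() {
        UUID u = UUID.randomUUID();
        return new BlankSketch(u.getMostSignificantBits(), u.getLeastSignificantBits(), null);
    }

    // Backwards-compatible / debugging form with a caller-chosen label.
    static BlankSketch create(String label) {
        return new BlankSketch(0L, 0L, Objects.requireNonNull(label, "label"));
    }

    // The string form is built on demand rather than stored.
    String getLabelString() {
        return (label != null) ? label : new UUID(mostSig, leastSig).toString();
    }

    @Override public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof BlankSketch)) return false;
        BlankSketch b = (BlankSketch) other;
        return mostSig == b.mostSig && leastSig == b.leastSig
                && Objects.equals(label, b.label);
    }

    @Override public int hashCode() {
        return (label != null) ? label.hashCode() : Long.hashCode(mostSig ^ leastSig);
    }
}
```

In this sketch a labelled node is never equal to a UUID-based one, even if the label happens to spell out the same UUID; whether that is the right call is a design choice left open.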

= Variable

Minor - but the label is a VariableName object which holds a string. I don't think this is necessary if there is no cache (it is sort-of needed by the cache to split from other things with the same string, so same hashCode).
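To illustrate per-subclass slots without a shared label, here is a small sketch covering just URIs and variables. SNode, SUri and SVar are hypothetical names, not the Jena classes. The point is that a class check in equals() keeps a variable apart from a URI with the same string, so no wrapper object is needed for that.

```java
import java.util.Objects;

// Hypothetical sketch: none of these are Jena classes.
abstract class SNode {
    // No shared "Object label" slot; each subclass keeps its own data.
    boolean isURI()      { return false; }
    boolean isVariable() { return false; }
}

final class SUri extends SNode {
    private final String uri;            // the only slot a URI needs

    SUri(String uri) { this.uri = Objects.requireNonNull(uri, "uri"); }

    String getURI() { return uri; }
    @Override boolean isURI() { return true; }

    @Override public boolean equals(Object o) {
        return (o instanceof SUri) && uri.equals(((SUri) o).uri);
    }
    @Override public int hashCode() { return uri.hashCode(); }
}

final class SVar extends SNode {
    private final String name;           // plain string, no wrapper object

    SVar(String name) { this.name = Objects.requireNonNull(name, "name"); }

    String getName() { return name; }
    @Override boolean isVariable() { return true; }

    // Same string as a URI gives the same hashCode, but the class check
    // means the two are never equal.
    @Override public boolean equals(Object o) {
        return (o instanceof SVar) && name.equals(((SVar) o).name);
    }
    @Override public int hashCode() { return name.hashCode(); }
}
```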

= And some new node types:

1/ Node Graph - for graphs-within-graphs. This is to future proof the hierarchy.

2/ Node Ext - this is a carrier - it means you can build RDF-like structures using triples and (in-memory) graphs. For example, when ordering a BGP, ARQ has a "known to be defined" marker. I'd like to put that in a Triple ... but can't as it's not a Node. Node Ext would have a contract that is "don't let it leak out of your usage scope" - ie no guarantees.
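A rough sketch of the Node Ext carrier idea follows. All the names (CNode, CNodeExt, CTriple, DefMarker) are invented for illustration, and putting the marker in the object position is just an assumption to have something to show.

```java
import java.util.Objects;

// Hypothetical sketch: none of these are Jena classes.
abstract class CNode { }

final class CUri extends CNode {
    final String uri;
    CUri(String uri) { this.uri = Objects.requireNonNull(uri, "uri"); }
}

// Carrier node: wraps an arbitrary payload so it can sit in a triple.
// Contract: do not let it leak out of your usage scope.
final class CNodeExt<T> extends CNode {
    private final T payload;
    CNodeExt(T payload) { this.payload = payload; }
    T get() { return payload; }
}

final class CTriple {
    final CNode subject, predicate, object;
    CTriple(CNode subject, CNode predicate, CNode object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

// Stand-in for an optimizer's "known to be defined" marker.
enum DefMarker { KNOWN_DEFINED }
```

Usage would be along the lines of `new CTriple(s, p, new CNodeExt<DefMarker>(DefMarker.KNOWN_DEFINED))`, with the code that built the triple being the only code expected to unpack it.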

== Interfaces vs Concrete Nodes

Having a NodeFactory, not Node being its own factory, seems like a sensible step to do anyway. A bit more disruptive but, on balance, I think, worth it.

What about the age-old Node-as-interface, Node-as-fixed-class and Node-as-concrete decision?

On the surface, interfaces look like the right thing. If the Node label slot is gone, and there is a NodeFactory, Node itself can be an interface.

But what does it cost these days? It isn't much if there is, by analysis, only one implementation (I think the JIT removes the virtual method call - anyone know?) but then what's the point of an interface?

1/ Interface to associate local data e.g. per store

This assumes nodes going in and out of a storage layer - so not for SPARQL. It can be provided within a store by keeping a map/cache of node to local info.

Where might it be used?

Not of help to TDB or SDB. The heavy operations are SPARQL, not programmatic nodes.

TDB looks up a Node in the node table cache - and may move to using a (long) hash as the key anyway.

2/ Interface for alternative implementations

Can't think of a use case aside from per-subsystem info. Even if we wanted that, a slot in a generic Node object is 8 bytes and a cast.

3/ Parser-independence

A parser can be passed a factory that creates Nodes of a specific implementation. This avoids Node->local node translation.
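As an illustration of a parser being handed a factory, here is a toy sketch. TermFactory and TinyParser are hypothetical names, and the "parser" only understands whitespace-separated `<uri>` and `"literal"` tokens, nothing like a real RDF syntax.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the caller decides what a "node" is.
interface TermFactory<N> {
    N createURI(String uri);
    N createLiteral(String lexicalForm);
}

final class TinyParser {
    private TinyParser() { }

    // Toy tokenizer: <...> is a URI, "..." is a plain literal.
    // Literals containing whitespace are not handled.
    static <N> List<N> parse(String input, TermFactory<N> factory) {
        List<N> out = new ArrayList<>();
        for (String tok : input.trim().split("\\s+")) {
            if (tok.length() >= 2 && tok.startsWith("<") && tok.endsWith(">")) {
                out.add(factory.createURI(tok.substring(1, tok.length() - 1)));
            } else if (tok.length() >= 2 && tok.startsWith("\"") && tok.endsWith("\"")) {
                out.add(factory.createLiteral(tok.substring(1, tok.length() - 1)));
            } else {
                throw new IllegalArgumentException("Unrecognised token: " + tok);
            }
        }
        return out;
    }
}
```

The parser never names a concrete node class, so a store could supply a factory that produces its local node form directly and skip the translation step.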

So when it comes to real usage cases, I don't see many!

One downside - no static methods, so the transition to a NodeFactory cannot be smoothed.

Therefore leave as concrete for now, migrate, then maybe make an interface when the Node.create statics go.
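One way the migration step could go, sketched with made-up names (MNode and MNodeFactory are not the Jena classes): the concrete class keeps its static creator for now, marked deprecated and delegating to the factory, so existing callers keep working while they move over.

```java
import java.util.Objects;

// Hypothetical sketch of the migration; not the actual Jena classes.
class MNode {
    private final String uri;

    // Package-private: only the factory (and the legacy static) construct nodes.
    MNode(String uri) { this.uri = uri; }

    String getURI() { return uri; }

    /** Legacy entry point kept during migration; callers should move to MNodeFactory. */
    @Deprecated
    static MNode createURI(String uri) {
        return MNodeFactory.createURI(uri);
    }
}

final class MNodeFactory {
    private MNodeFactory() { }

    static MNode createURI(String uri) {
        return new MNode(Objects.requireNonNull(uri, "uri"));
    }
}
```

Once nothing calls the deprecated static any more, it can be deleted, and at that point MNode has no statics left to stand in the way of becoming an interface.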

== Parser encapsulation

In a chat with @quoll about reusing parser code without too much of the rest of Jena, we agreed that it would help if the parser emitted some simple container object for IRIs/literals/bnodes. There is some testing based on type quite late in the pipeline (e.g. no literals as predicates) as Triples get created.

RIOT creates Jena Nodes and Triples (and Quads). The point where more of the rest of Jena gets dragged in is datatypes, because literals have a datatype and in a Jena Node the datatype is an RDFDatatype.

ARP has separate AResource objects that wrap the parser output - there is a second Node creation (it's only a shallow copy as the strings etc are not copied) to convert ARP-only objects to Jena Nodes but, given XML parsing is going on, that is not noticeable.

So Node is not currently suitable as the encapsulation directly, because of datatype. Maybe the datatype argument ought to be a Node (URI), with datatypes as an internal aspect.
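A sketch of what "datatype as a URI node, datatypes as an internal aspect" could mean in code. UriTerm, TypedLiteralTerm and DatatypeRegistry are hypothetical names; in particular the registry is only a stand-in for whatever the datatype machinery would be inside the full system.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: what a parser would emit. No datatype machinery needed.
final class UriTerm {
    final String uri;
    UriTerm(String uri) { this.uri = uri; }
}

final class TypedLiteralTerm {
    final String lexicalForm;
    final UriTerm datatype;      // just a URI node, not a datatype object
    TypedLiteralTerm(String lexicalForm, UriTerm datatype) {
        this.lexicalForm = lexicalForm;
        this.datatype = datatype;
    }
}

// Internal to the full system; a standalone parser never sees this.
final class DatatypeRegistry {
    private final Map<String, Function<String, Object>> parsers = new HashMap<>();

    void register(String datatypeURI, Function<String, Object> parser) {
        parsers.put(datatypeURI, parser);
    }

    // Returns null when the datatype URI is not known to the registry.
    Object valueOf(TypedLiteralTerm lit) {
        Function<String, Object> p = parsers.get(lit.datatype.uri);
        return (p == null) ? null : p.apply(lit.lexicalForm);
    }
}
```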

This then fits with Node-as-fixed-class.

== Summary

1/ Remove Node cache
2/ Remove the Node label, put the per-class data in the subclasses.
3/ Delay Node literal "value" determination until first call.
4/ BNodes as UUIDs.

Maybe change the typed literals construction to take a Node, not the datatype itself.
