Hi Andy,
first of all, thanks for this.
Re: JENA-44... what is blocking JENA-44 from going into trunk is just the lack
of a common way to serialize bindings. By the way, we are using a patched
version of ARQ on some of our servers (with no problems, and with improvements
in stability and RAM consumption, in particular when users submit queries that
need to sort large result sets and then time out).
So, all this is more than welcome from my point of view (i.e. one patch less
to manage).
Andy Seaborne wrote:
== A Design for a Persistent Bindings Mini-language
There are a number of activities that require being able to serialize,
and read back, bindings. They use different serializations. A shared
"bindings I/O" would mean all activities could use one, tuned, set of
serialization and I/O classes.
JENA-44 (External sort) encodes a binding as a length-denoted byte
array. The byte array uses length-denoted byte arrays within the
bindings. I/O is done using Data(In|Out)putStream, specifically
putInt/getInt() and put/get(byte[]) and ByteBuffer putInt/getInt() and
put/get(byte[]) for the per-row serialization as (var,Turtle string
form) pairs. It uses a null for no such value.
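
To make the length-denoted style concrete, here is a minimal sketch of
writing and reading one row as (var, Turtle string) pairs. It is not the
JENA-44 patch: the class name LengthDenotedRows, the leading pair count and
the use of a length of -1 to stand for a null value are all assumptions made
for illustration. Note that java.io.DataOutputStream and DataInputStream
expose writeInt/readInt, while putInt/getInt belong to java.nio.ByteBuffer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of ARQ or JENA-44.
final class LengthDenotedRows {

    // Row layout (assumed): pair count, then for each pair a
    // length-prefixed variable name and a length-prefixed Turtle term.
    static void writeRow(DataOutputStream out, Map<String, String> row) throws IOException {
        out.writeInt(row.size());
        for (Map.Entry<String, String> e : row.entrySet()) {
            writeString(out, e.getKey());
            writeString(out, e.getValue());
        }
    }

    static Map<String, String> readRow(DataInputStream in) throws IOException {
        int pairs = in.readInt();
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < pairs; i++) {
            String var = readString(in);
            String term = readString(in);
            row.put(var, term);
        }
        return row;
    }

    // Assumption: a length of -1 marks "no such value" (null).
    private static void writeString(DataOutputStream out, String s) throws IOException {
        if (s == null) {
            out.writeInt(-1);
            return;
        }
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    private static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            return null;
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Serializes a row to memory and reads it back.
    static Map<String, String> roundTrip(Map<String, String> row) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                writeRow(out, row);
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return readRow(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The variable name is repeated in every row in this layout, which is the cost
that a shared list of variables would remove.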
JENA-45 (Spill to disk SPARQL Update) uses a more textual representation
based on a binding encoded as (var, Turtle term). End of row is denoted
by a DOT. It uses modified RIOT for input reading.
There is also use of TSV I/O for writing and reading result sets. In
this form, the variables are written once at the start, and not in each
line.
== Proposed mini-language
This proposal takes those separate designs, and adds high-level
compression.
A sequence of bindings is written assuming there is a list of variables
in force. Position in the row determines which variable is bound to
which value (=> compression of variable names).
What do we write when a variable is not bound?
We need to be able to write a symbol/token for that, right?
(ok... I saw DASH below!)
> Turtle-style
prefixes can be used (=> compression for IRIs) and the value of a slot
in a row can be "same as the row before" (=> compression for repeated
terms) or undefined.
Ack.
Rows end in a DOT - this is not strictly necessary but adds robustness
against truncated data and bugs.
Every row is the same length, in number of terms, as the list of
variables in force.
Directives are lines starting with a keyword. End on DOT.
The directives are:
PREFIX : <http://example> .
Like Turtle, except keyword based to fit with being a keyword-driven
mini-language.
VARS ?x ?y .
Set the variables in force for subsequent rows,
until the next VARS directive.
We need VARS because it's not always possible to determine all
the possible variables before starting to write out bindings.
This is not completely clear to me. An example of when it's not possible
to determine all the possible variables before starting to write out bindings
would probably convince me and help me understand.
A binding row is a sequence of terms, encoded like Turtle, including
prefixed names and short forms for numbers (more compression). In
addition STAR ("*") means "same term as the row before" and DASH ("-")
means undef. Don't use * for - from previous row.
Rows end in DOT. Preferred style is one space after each term. This
makes writing safe.
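
The row rules above can be sketched as a small writer. This is an
illustration under stated assumptions, not a proposed API: the class
BindingRowWriter and its method names are hypothetical, terms are passed in
already encoded as Turtle strings, a Java null stands for an unbound variable,
and the "row before" is forgotten at each VARS directive (the proposal does
not say whether STAR may refer across a VARS change).

```java
import java.util.List;

// Hypothetical writer for illustration; names are not from ARQ.
final class BindingRowWriter {
    private final StringBuilder out = new StringBuilder();
    private int varCount = -1;   // -1 until a VARS directive has been written
    private String[] previous;   // terms of the row before; null entry = undef

    void prefix(String prefix, String iri) {
        out.append("PREFIX ").append(prefix).append(": <").append(iri).append("> .\n");
    }

    void vars(List<String> names) {
        out.append("VARS");
        for (String name : names) {
            out.append(" ?").append(name);
        }
        out.append(" .\n");
        varCount = names.size();
        previous = null;  // assumption: STAR never refers across a VARS change
    }

    // Each term is an already-encoded Turtle term, or null for "not bound".
    void row(String... terms) {
        if (terms.length != varCount) {
            throw new IllegalArgumentException(
                "Row has " + terms.length + " terms but " + varCount + " variables are in force");
        }
        for (int i = 0; i < terms.length; i++) {
            String term = terms[i];
            if (term == null) {
                out.append("- ");            // DASH: undef, never written as STAR
            } else if (previous != null && term.equals(previous[i])) {
                out.append("* ");            // STAR: same term as the row before
            } else {
                out.append(term).append(' ');
            }
        }
        out.append(".\n");                   // one space after each term, then DOT
        previous = terms.clone();
    }

    String result() {
        return out.toString();
    }
}
```

Because an undefined slot is always written as DASH and a null never equals
a term, the writer cannot emit STAR for a slot that was undefined in the
previous row.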
Terms can be written without intermediate copies (except local name
processing) or buffers. The OutputLangUtils does not do this currently
but it should.
For presentation reasons only, blank lines are allowed (this would all
get lost in the lexing/tokenization anyway).
Example:
-------------
VARS ?x ?y .
PREFIX : <http://example/> .
:local1 <http://example.other/text> .
* - .
* 123 .
-------------
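
Reading the example back can be sketched as follows. This is a simplified
reader for illustration only: BindingRowReader is a hypothetical name, tokens
are split on whitespace (so literals containing spaces are not handled, unlike
a real RIOT-based tokenizer), prefixed names are left unexpanded, and a STAR
that points at an undefined or missing slot is rejected.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for illustration; not the RIOT tokenizer.
final class BindingRowReader {

    // Returns one map per row; unbound variables are simply absent.
    static List<Map<String, String>> read(String text) {
        List<Map<String, String>> rows = new ArrayList<>();
        List<String> vars = new ArrayList<>();
        String[] previous = null;

        for (String rawLine : text.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;                       // blank lines are allowed
            }
            String[] tokens = line.split("\\s+");
            int n = tokens.length - 1;          // number of tokens before the DOT
            if (!tokens[n].equals(".")) {
                throw new IllegalArgumentException("Missing DOT: " + line);
            }
            if (tokens[0].equals("PREFIX")) {
                continue;                       // prefixes are not expanded in this sketch
            }
            if (tokens[0].equals("VARS")) {
                vars = new ArrayList<>();
                for (int i = 1; i < n; i++) {
                    vars.add(tokens[i].substring(1));   // drop the leading '?'
                }
                previous = null;
                continue;
            }
            if (n != vars.size()) {
                throw new IllegalArgumentException("Wrong row length: " + line);
            }
            String[] terms = new String[n];
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < n; i++) {
                String token = tokens[i];
                if (token.equals("-")) {
                    terms[i] = null;            // DASH: undef
                } else if (token.equals("*")) {
                    if (previous == null || previous[i] == null) {
                        throw new IllegalArgumentException("STAR without a previous term: " + line);
                    }
                    terms[i] = previous[i];     // STAR: same term as the row before
                } else {
                    terms[i] = token;
                }
                if (terms[i] != null) {
                    row.put(vars.get(i), terms[i]);
                }
            }
            rows.add(row);
            previous = terms;
        }
        return rows;
    }
}
```

Applied to the example above, the second row binds only ?x (to :local1) and
the third row binds ?x to :local1 and ?y to 123.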
== Discussion
The format is text - but we're writing strings anyway, so a binary form,
rather than a delimited text form, is unlikely to give much advantage
and can't reuse the standard bytes<->chars stuff without intermediate
copies.
This would all be hidden behind an interface anyway. A binary tokenizer
and binary OutputLangUtils would enable binary output.
Dynamic choosing of prefixes can be done.
Like Stephen, I am not 100% sure what "dynamic choosing of prefixes" means.
I suppose the next step is to create a JIRA issue for it.
Jotting down interface names and method signatures would be extremely helpful
for me, and tomorrow I have time to work on this (since it directly helps
JENA-44 get into trunk).
Thanks again,
Paolo