[
https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paolo Castagna updated JENA-85:
-------------------------------
Description:
( from: http://markmail.org/thread/ljjrsiun3oxtrchw )
There are a number of activities that require being able to serialize, and
read back, bindings. They use different serializations. A shared "bindings
I/O" would mean all activities could use one, tuned, set of serialization and
I/O classes.
JENA-44 (External sort) encodes a binding as a length-denoted byte array. The
byte array uses length-denoted byte arrays within the bindings. I/O is done
using Data(In|Out)putStream and ByteBuffer, specifically putInt/getInt() and
put/get(byte[]), for the per-row serialization as (var, Turtle string form)
pairs. It uses a null for no such value.
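As a rough illustration of the length-denoted layout described above, here is a minimal sketch. It is not the JENA-44 code: the class name LengthDenotedRow, the leading pair count and the use of a length of -1 to stand for the null are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the JENA-44 implementation: encodes one row of
// (var, Turtle string form) pairs as a length-denoted byte array.
final class LengthDenotedRow {
    // Assumption: a length of -1 marks "no such value" (the null).
    private static final int NULL_LENGTH = -1;

    static byte[] encode(List<String[]> pairs) {
        // First pass: convert every string to bytes and compute the total size.
        List<byte[]> chunks = new ArrayList<>();
        int size = 4; // leading int: number of pairs
        for (String[] pair : pairs) {
            for (String s : pair) {
                byte[] b = (s == null) ? null : s.getBytes(StandardCharsets.UTF_8);
                chunks.add(b);
                size += 4 + (b == null ? 0 : b.length);
            }
        }
        // Second pass: write each chunk as (length, bytes).
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(pairs.size());
        for (byte[] b : chunks) {
            if (b == null) {
                buf.putInt(NULL_LENGTH);
            } else {
                buf.putInt(b.length);
                buf.put(b);
            }
        }
        return buf.array();
    }

    static List<String[]> decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int n = buf.getInt();
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String var = readString(buf);
            String term = readString(buf);
            pairs.add(new String[] { var, term });
        }
        return pairs;
    }

    private static String readString(ByteBuffer buf) {
        int len = buf.getInt();
        if (len == NULL_LENGTH) {
            return null;
        }
        byte[] b = new byte[len];
        buf.get(b);
        return new String(b, StandardCharsets.UTF_8);
    }
}
```

Note that the variable name is repeated in every row here, which is the overhead the positional format proposed in this issue is meant to avoid.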
JENA-45 (Spill to disk SPARQL Update) uses a more textual representation based
on a binding encoded as (var, Turtle term). End of row is denoted by a DOT.
It uses modified RIOT for input reading.
There is also use of TSV I/O for writing and reading result sets. In this
form, the variables are written once at the start, and not in each line.
== Proposed mini-language
This proposal takes those separate designs, and adds high-level compression.
A sequence of bindings is written assuming there is a list of variables in
force. Position in the row determines which term is bound to which
variable (=> compression of variable names). Turtle-style prefixes can be used
(=> compression for IRIs) and the value of a slot in a row can be "same as the
row before" (=> compression for repeated terms) or undefined.
Rows end in a DOT - this is not strictly necessary but adds robustness against
truncated data and bugs.
Every row has the same length, in number of terms, as the list of variables in
force.
Directives are lines starting with a keyword and ending in a DOT.
The directives are:
PREFIX : <http://example> .
Like Turtle, except keyword-based to fit with being a keyword-driven
mini-language.
VARS ?x ?y .
Set the variables in force for subsequent rows,
until the next VARS directive.
We need VARS because it's not always possible to determine all
the possible variables before starting to write out bindings.
A binding row is a sequence of terms, encoded like Turtle, including prefixed
names and short forms for numbers (more compression). In addition STAR ("*")
means "same term as the row before" and DASH ("-") means undef. Don't use *
to repeat a - from the previous row; write - again.
Rows end in DOT. Preferred style is one space after each term. This makes
writing safe.
Terms can be written without intermediate copies (except local name processing)
or buffers. The OutputLangUtils does not do this currently but it should.
For presentation reasons only, blank lines are allowed (this would all get lost
in the lexing/tokenization anyway).
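The writing rules above could be sketched roughly as follows. This is only an illustration, not an ARQ class: BindingRowWriter and its methods are hypothetical names, terms are assumed to arrive already in Turtle form with null standing for undefined, and resetting the "row before" on each VARS is an assumption the proposal does not state.

```java
import java.util.List;

// Hypothetical writer for the proposed row syntax; not part of ARQ.
// Terms are assumed to be already in Turtle form; null means "undefined".
final class BindingRowWriter {
    private final StringBuilder out = new StringBuilder();
    private int width = -1;           // number of variables in force
    private String[] previous = null; // terms of the row before, if any

    void vars(List<String> names) {
        out.append("VARS");
        for (String n : names) {
            out.append(" ?").append(n);
        }
        out.append(" .\n");
        width = names.size();
        previous = null; // assumption: a new VARS resets "same as the row before"
    }

    void prefix(String prefix, String iri) {
        out.append("PREFIX ").append(prefix).append(": <").append(iri).append("> .\n");
    }

    void row(String... terms) {
        if (terms.length != width) {
            throw new IllegalArgumentException("row length must match the variables in force");
        }
        for (int i = 0; i < terms.length; i++) {
            String t = terms[i];
            if (t == null) {
                out.append("- ");          // undefined is always DASH, never STAR
            } else if (previous != null && t.equals(previous[i])) {
                out.append("* ");          // same term as the row before
            } else {
                out.append(t).append(' '); // one space after each term
            }
        }
        out.append(".\n");
        previous = terms.clone();
    }

    String result() {
        return out.toString();
    }
}
```

Because an undefined slot is checked first, a - in the previous row is never written as *.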
Example:
-------------
VARS ?x ?y .
PREFIX : <http://example/> .
:local1 <http://example.other/text> .
* - .
* 123 .
-------------
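To show how the example above might be read back, here is a minimal reader sketch. It is not the RIOT tokenizer: BindingRowReader is a hypothetical name, tokens are split on whitespace (so it assumes no term contains whitespace, which rules out most literals), and treating * after an undefined slot as an error is one reading of the rule about not using * for -.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for the proposed syntax; not the RIOT tokenizer.
// Assumption: whitespace separates tokens and no term contains whitespace.
final class BindingRowReader {
    static List<Map<String, String>> read(String text) {
        List<Map<String, String>> rows = new ArrayList<>();
        Map<String, String> prefixes = new HashMap<>();
        List<String> vars = new ArrayList<>();
        String[] previous = null;
        List<String> stmt = new ArrayList<>();
        for (String tok : text.trim().split("\\s+")) {
            if (tok.isEmpty()) {
                continue;
            }
            if (!tok.equals(".")) {
                stmt.add(tok);
                continue;
            }
            // A DOT closes the current directive or row.
            if (!stmt.isEmpty() && stmt.get(0).equals("VARS")) {
                vars = new ArrayList<>();
                for (String v : stmt.subList(1, stmt.size())) {
                    vars.add(v.substring(1)); // drop the leading '?'
                }
                previous = null;
            } else if (stmt.size() == 3 && stmt.get(0).equals("PREFIX")) {
                String p = stmt.get(1);
                String iri = stmt.get(2);
                prefixes.put(p.substring(0, p.length() - 1), iri.substring(1, iri.length() - 1));
            } else {
                if (stmt.size() != vars.size()) {
                    throw new IllegalArgumentException("row length differs from variables in force");
                }
                String[] terms = new String[vars.size()];
                Map<String, String> row = new LinkedHashMap<>();
                for (int i = 0; i < terms.length; i++) {
                    String t = stmt.get(i);
                    if (t.equals("*")) {
                        if (previous == null || previous[i] == null) {
                            throw new IllegalArgumentException("* with no term in the row before");
                        }
                        terms[i] = previous[i];
                    } else if (!t.equals("-")) {
                        terms[i] = expand(t, prefixes);
                    }
                    if (terms[i] != null) {
                        row.put(vars.get(i), terms[i]); // undefined slots are left out
                    }
                }
                previous = terms;
                rows.add(row);
            }
            stmt.clear();
        }
        if (!stmt.isEmpty()) {
            throw new IllegalArgumentException("truncated input: missing final DOT");
        }
        return rows;
    }

    // Expands a prefixed name to a full IRI; other terms pass through unchanged.
    private static String expand(String t, Map<String, String> prefixes) {
        if (t.startsWith("<") || t.startsWith("\"") || t.startsWith("_:")) {
            return t;
        }
        int colon = t.indexOf(':');
        if (colon < 0) {
            return t; // e.g. a number in short form
        }
        String ns = prefixes.get(t.substring(0, colon));
        if (ns == null) {
            throw new IllegalArgumentException("unknown prefix in " + t);
        }
        return "<" + ns + t.substring(colon + 1) + ">";
    }
}
```

The missing-DOT check at the end is where the trailing DOT earns its keep: a row cut off by truncation is reported rather than silently accepted.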
== Discussion
The format is text, but we're writing strings anyway, so a binary form, rather
than a delimited text form, is unlikely to give much advantage, and it can't
reuse the standard bytes<->chars stuff without intermediate copies.
This would all be hidden behind an interface anyway. A binary tokenizer and
binary OutputLangUtils would enable binary output.
Dynamic choosing of prefixes can be done.
> Common bindings I/O
> -------------------
>
> Key: JENA-85
> URL: https://issues.apache.org/jira/browse/JENA-85
> Project: Jena
> Issue Type: New Feature
> Components: ARQ
> Reporter: Paolo Castagna
>