Re: RDF Writer JSON LD FRAME performance

2021-08-20 Thread Andy Seaborne

Hi Roland,

Jena uses github/jsonld-java for its JSON-LD handling.  Provided the 
cost isn't going into the translation across the Jena/jsonld-java 
boundary, then it is the cost in jsonld-java.


I haven't had the opportunity to check but it has been said that the 
JSON-LD algorithms are not cheap.


Are you able to attach VisualVM (or similar) and see where the time is 
going?
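
For a quick first check, a minimal timing sketch (reusing the graph and 
ctx setup from the code quoted below, and the same RDFFormat.JSONLD_FRAME_FLAT) 
shows whether the whole cost sits inside the asString() call:

import org.apache.jena.graph.Graph;
import org.apache.jena.riot.JsonLDWriteContext;
import org.apache.jena.riot.RDFFormat;
import org.apache.jena.riot.RDFWriter;

public class FrameWriteTiming {
    // 'graph' and 'ctx' are assumed to be set up exactly as in the mail below.
    static String timedWrite(Graph graph, JsonLDWriteContext ctx) {
        long start = System.nanoTime();
        String out = RDFWriter.create()
                .format(RDFFormat.JSONLD_FRAME_FLAT)
                .source(graph)
                .context(ctx)
                .build()
                .asString();
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("JSON-LD frame write: " + ms + " ms, " + out.length() + " chars");
        return out;
    }
}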


BTW In the next release, Jena will have support for reading JSON-LD 1.1 
using the Titanium project.


Andy

On 20/08/2021 08:36, Roland Bailly wrote:

Hello !

I have a question related to the performance of the RDFWriter for the JSON-LD FRAME 
format.
Currently I have to process 40k objects in an RDF file. When I process 
them into JSON with the following code:


JsonLDWriteContext ctx = new JsonLDWriteContext();
JsonLdOptions opts = new JsonLdOptions();
opts.setOmitGraph(true);
opts.setEmbed(Embed.ALWAYS);
opts.setProcessingMode(JSON_LD_1_1);
ctx.setOptions(opts);

RDFWriter.create().format(RDFFormat.JSONLD_FRAME_FLAT).source(graph).context(ctx).build().asString();

Processing it takes on average 1 to 10 seconds. That is a bit too slow.
Does anyone know how to speed the process up?

Yours faithfully,

Roland Bailly




Re: RDF / Linked data track @ ApacheCon

2021-04-07 Thread Claude Warren
The call for presentations for ApacheCon closes *Monday, May 3rd, 2021
8:00 AM* (UTC). *Please do not wait until the last minute*.

If we want to have an RDF / Linked data track we need talks.

Please submit your proposal at https://acah2021.jamhosted.net/

Thank you,
Claude


On Sat, Mar 6, 2021 at 9:42 PM Claude Warren  wrote:

> Sorry for the cross posting.
>
> I am the chair for the RDF and Linked data track at ApacheCon this year.
> I am looking for a few speakers to talk about RDF and or Linked data in
> projects, particularly with respect to the use of Apache libraries, tool
> kits, etc.  If you are interested please drop me a line and I will make
> sure you are notified directly when the call for papers opens.
>
> Thank you,
> Claude
>


Re: RDF-* and permissions

2021-03-10 Thread Andy Seaborne




On 09/03/2021 23:12, Marco Neumann wrote:

can RDF-star further be nested?

as in:

<< << << :s :p :o >> :p1 :o1 >> :q :y >> :r :z .


Yes. Arbitrary depth.

Quoting already quoted triples.

Andy




On Tue, Mar 9, 2021 at 10:08 PM Andy Seaborne  wrote:


There are two variants around: RDF* and RDF-star.

RDF* is the original work by Olaf and <<:s :p :o>> means
the triple :s :p :o is asserted as well.
That has permissions implications.

RDF-star is the name the active community (including Olaf) is giving to
its work, (1) to make it searchable and (2) because it's different.
<<:s :p :o>> does not imply anything about the existence of :s :p :o .

https://github.com/apache/jena/pull/951
is RDF-star (only).

There are various reasons why RDF-star is the way it is, including being
able to separate the target data from the triples about the data (e.g.
wikidata).

In RDF-star, you can "say anything about anything" - including triples
that are not asserted in the graph or otherwise don't exist.

For permissions, one approach is that only asserted triples matter. No
need to filter inside a <<>>. So you can make assertions about things
you don't have access to : they don't exist as asserted from your POV.
No difference.

But if it is what you want, it is a recursive analysis of the statement.
   The matcher in PR#951 is recursive.

Don't forget:

<< << :s :p :o >> :q :y >> :r :z .

  > Without a way to distinguish the
  > RDF-*Node from a regular resource

Model:
RDFNode.isStmtResource()

Graph:
Node.isTripleTerm()

  Andy

On 09/03/2021 16:24, Claude Warren wrote:

Greetings,

RDF-* seems like it may cause problems for permissions.

 From what I have seen we take a statement and convert that to a node where
the label is the statement (or the triple?).  But there does not seem to be
a way to differentiate the RDF-* edge nodes from other Nodes.

My question arises around the following:

Let's say there are 2 statements in the model
(s,p,o) and (x,y, (s,p,o)) where the (x,y,(s,p,o)) is an RDF-* statement
about (s,p,o).

If a user does not have access to see (s,p,o) they probably should not be
able to see (x,y,(s,p,o)) either.  Without a way to distinguish the
RDF-*Node from a regular resource I can't do the filtering.


The best I can
hope for is that the SecuritEngine implementation can, but I expect that
will have problems too.

Does anyone with RDF-* background see a way around this?

Claude



  Andy






Re: RDF-* and permissions

2021-03-09 Thread Marco Neumann
can RDF-star further be nested?

as in:

<< << << :s :p :o >> :p1 :o1 >> :q :y >> :r :z .


On Tue, Mar 9, 2021 at 10:08 PM Andy Seaborne  wrote:

> There are two variants around: RDF* and RDF-star.
>
> RDF* is the original work by Olaf and <<:s :p :o>> means
> the triple :s :p :o is asserted as well.
> That has permissions implications.
>
> RDF-star is the name the active community (including Olaf) is giving to
> its work (1) to make it search able and (2) it's different.
> <<:s :p :o>> does not imply anything about the existence of :s :p :o .
>
> https://github.com/apache/jena/pull/951
> is RDF-star (only).
>
> There are various reasons why RDF-star is the way it is, including being
> able to separate the target data from the triples about the data (e.g.
> wikidata).
>
> In RDF-star, you can "say anything about anything" - including triples
> that are not asserted in the graph or otherwise don't exist.
>
> For permissions, one approach is that only asserted triples matter. No
> need to filter inside a <<>>. So you can make assertions about things
> you don't have access to : they don't exist as asserted from your POV.
> No difference.
>
> But if it is what you want, it is a recursive analysis of the statement.
>   The matcher in PR#951 is recursive.
>
> Don't forget:
>
> << << :s :p :o >> :q :y >> :r :z .
>
>  > Without a way to distinguish the
>  > RDF-*Node from a regular resource
>
> Model:
> RDFNode.isStmtResource()
>
> Graph:
> Node.isTripleTerm()
>
>  Andy
>
> On 09/03/2021 16:24, Claude Warren wrote:
> > Greetings,
> >
> > RDF-* seems like it may cause problems for permissions.
> >
> >  From what I have seen we take a statement and convert that to a node
> where
> > the label is the statement (or the triple?).  But there does not seem to
> be
> > a way to differentiate the RDF-* edge nodes from other Nodes.
> >
> > My question arises around the following:
> >
> > Let's say there are 2 statements in the model
> > (s,p,o) and (x,y, (s,p,o)) where the (x,y,(s,p,o)) is an RDF-* statement
> > about (s,p,o).
> >
> > If a user does not have access to see (s,p,o) they probably should not be
> > able to see (x,y,(s,p,o)) either.  Without a way to distinguish the
> > RDF-*Node from a regular resource I can't do the filtering.
>  >
> > The best I can
> > hope for is that the SecuritEngine implementation can, but I expect that
> > will have problems too.
> >
> > Does anyone with RDF-* background see a way around this?
> >
> > Claude
> >
>
>  Andy
>


-- 


---
Marco Neumann
KONA


Re: RDF-* and permissions

2021-03-09 Thread Andy Seaborne

There are two variants around: RDF* and RDF-star.

RDF* is the original work by Olaf and <<:s :p :o>> means
the triple :s :p :o is asserted as well.
That has permissions implications.

RDF-star is the name the active community (including Olaf) is giving to 
its work, (1) to make it searchable and (2) because it's different.

<<:s :p :o>> does not imply anything about the existence of :s :p :o .

https://github.com/apache/jena/pull/951
is RDF-star (only).

There are various reasons why RDF-star is the way it is, including being 
able to separate the target data from the triples about the data (e.g. 
wikidata).


In RDF-star, you can "say anything about anything" - including triples 
that are not asserted in the graph or otherwise don't exist.


For permissions, one approach is that only asserted triples matter. No 
need to filter inside a <<>>. So you can make assertions about things 
you don't have access to : they don't exist as asserted from your POV. 
No difference.


But if it is what you want, it is a recursive analysis of the statement. 
 The matcher in PR#951 is recursive.


Don't forget:

<< << :s :p :o >> :q :y >> :r :z .

> Without a way to distinguish the
> RDF-*Node from a regular resource

Model:
RDFNode.isStmtResource()

Graph:
Node.isTripleTerm()
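
A minimal sketch of that recursive check (this is not the jena-permissions 
SecurityEvaluator API; the Node accessor names follow the ones above and may 
differ between Jena versions):

import java.util.function.Predicate;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.Triple;

public class QuotedTripleFilter {
    private final Predicate<Triple> canSee;   // caller-supplied permission check

    public QuotedTripleFilter(Predicate<Triple> canSee) {
        this.canSee = canSee;
    }

    // A triple is visible only if it passes the check and every triple term
    // nested inside it (subject or object position, to any depth) does too.
    public boolean visible(Triple t) {
        return canSee.test(t)
            && visibleTerm(t.getSubject())
            && visibleTerm(t.getObject());
    }

    private boolean visibleTerm(Node n) {
        // isTripleTerm()/getTriple() as named above; adjust to the released API.
        return !n.isTripleTerm() || visible(n.getTriple());
    }
}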

Andy

On 09/03/2021 16:24, Claude Warren wrote:

Greetings,

RDF-* seems like it may cause problems for permissions.

 From what I have seen we take a statement and convert that to a node where
the label is the statement (or the triple?).  But there does not seem to be
a way to differentiate the RDF-* edge nodes from other Nodes.

My question arises around the following:

Let's say there are 2 statements in the model
(s,p,o) and (x,y, (s,p,o)) where the (x,y,(s,p,o)) is an RDF-* statement
about (s,p,o).

If a user does not have access to see (s,p,o) they probably should not be
able to see (x,y,(s,p,o)) either.  Without a way to distinguish the
RDF-*Node from a regular resource I can't do the filtering.


The best I can
hope for is that the SecuritEngine implementation can, but I expect that
will have problems too.

Does anyone with RDF-* background see a way around this?

Claude



Andy


Re: RDF*

2020-04-13 Thread Andy Seaborne




On 06/04/2020 21:09, Andy Seaborne wrote:

WIP

RDF* is adding triples-as-term to RDF so that you can annotate triples.
This is not reification.

Summary:

* Aim to get it working for Turtle*, in-memory, SPARQL* and JSON results 
(Fuseki in other words).


Done.
This is a SPARQL extension (Syntax.syntaxARQ).
Fuseki uses that mode anyway.

It is hardwired into Turtle and is not a new Lang.

"Things go wrong" if used with the XML result set format; the text format works. 
There is an extension to the JSON results for term type "triple":


 "bindings": [
   {
 "s": {
   "type": "triple" ,
   "value": {
 "subject":   { "type": "bnode" , "value": "b0" } ,
 "predicate": { "type": "uri" , "value": "http://example/p; } ,
 "object":
 { "type": "literal" ,
   "datatype": "http://www.w3.org/2001/XMLSchema#integer; ,
   "value": "123" }
  }
}

SPARQL Update templates and data done.

N-Triples output works, input does not yet.
This is the blocker for a PR.

Pretty turtle does not cope with all cases; TURTLE_BLOCKS does.

RDF/Thrift not done.

SPARQL BIND not done: but see below for new functions.

No Model API support.


* "sometime"
* This is not by converting to reification and back again!
* Experimental - may change without warning.


Not yet in TDB1,TDB2.



-


There are some accessor/constructor functions:

afn:subject(t), afn:predicate(t), afn:object(t)
afn:triple(s,p,o)
afn:isTriple(t)
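
A hedged usage sketch of these from Java (assuming afn: maps to the ARQ 
function namespace <http://jena.apache.org/ARQ/function#>; details may differ 
in the released code):

import java.io.StringReader;

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class TripleTermAccessors {
    public static void main(String[] args) {
        String data = String.join("\n",
            "PREFIX : <http://example/>",
            ":s :p 123 .",
            "<< :s :p 123 >> :q 678 .");
        Model model = ModelFactory.createDefaultModel();
        // Turtle* is handled by the Turtle parser, as described above.
        model.read(new StringReader(data), null, "TTL");

        String qs = String.join("\n",
            "PREFIX :    <http://example/>",
            "PREFIX afn: <http://jena.apache.org/ARQ/function#>",
            "SELECT ?s ?p ?o {",
            "  ?t :q 678 .",
            "  BIND(afn:subject(?t)   AS ?s)",
            "  BIND(afn:predicate(?t) AS ?p)",
            "  BIND(afn:object(?t)    AS ?o)",
            "}");
        Query query = QueryFactory.create(qs, Syntax.syntaxARQ);
        try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}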

Details:

This is an "SA" mode engine.

RDF* has "PG mode" and "SA mode". PG is what is spec'ed and it requires 
the named triple to be in the graph. That makes things a bit weird when 
you get into the details:


e.g. BIND(<<:s :p :o>> AS ?t) and related assignment cases.

and especially N-Triples would no longer be a simple "one triple, one 
line" style. In "PG mode", deleting a triple should also cause all 
triples with a matching triple term to be deleted.


My plan is to overlay "PG mode" functionality on this "SA mode" engine 
if the user demand is there - I'm not convinced it will be.


For now, put the triple in the data if you want the triple term to refer 
to only triples in the data.  If used this way, future changes will be 
transparent.


Andy


Re: RDF*

2020-04-07 Thread Andy Seaborne




On 06/04/2020 22:28, aj...@apache.org wrote:

More later, just a quick note re:


(ajs6f: Does JSON-LD 1.1 say anything about this?)


Nope, not a thing. If anything, the direction has been in the other
direction. E.g. 1.0 allows bnodes-as-predicates and we strongly discourage
that in 1.1-- pushing folks towards "vanilla" RDF.

Our next WG meeting is this Friday, and I am happy to raise any questions
we want raised, then or going forward.


No need from my POV.  RDF* isn't a standard and details of it aren't 
clear, to the point where a real WG would probably have opinions.

There is no reason JSON-LD needs to pay attention to it while in CR.

Andy



Adam

On Mon, Apr 6, 2020, 4:10 PM Andy Seaborne  wrote:


WIP

RDF* is adding triples-as-term to RDF so that you can annotate triples.
This is not reification.

Summary:

* Aim to get it working for Turtle*, in-memory, SPARQL* and JSON results
(Fuseki in other words).
* "sometime"
* This is not by converting to reification and back again!
* Experimental - may change without warning.

-

Details of RDF*


https://blog.liu.se/olafhartig/2019/01/10/position-statement-rdf-star-and-sparql-star/
https://arxiv.org/pdf/1406.3399.pdf

 Example 
PREFIX : 

:s :p 123 .
<<:s :p 123>> :q 678 .


The RDF Triple term is <<:s :p 123>>.

(it is lucky that early versions of SPARQL, pre 1.0, had <<>> for
reification ... so the grammar works!)

Currently working:
Node_Triple
Turtle* parsing
Turtle* writing
In-memory storage
SPARQL syntax
SPARQL execution with <> pattern matching.
SPARQL text format output of result sets.
Conversion to/from reification

(it is lucky that early ideas for SPARQL, pre 1.0, had <<>> for
reification ... so the grammar works!)

including nested cases (yup: triple terms inside triple terms is legal:
<<:s :p  <<:s :p 123>> >> is legal).

There are other areas impacted:
  Model API
  Result sets (JSON, XML)
  JSON-LD
  RDF/XML syntax

and there are fewer degrees of freedom in the design for compatibility reasons.

Ideas:

1. API:

RDF Triple terms can be in the subject position, so it is a Resource,
even though they are really abstractly literals.

Going through this list, 1.3 is my first choice at the moment.

1.1. Encoding

Have special blank nodes or URIs that carry the triple encoding information.

Having some encoding as a fallback mechanism is probably wise. It is
needed for RDF/XML, and maybe N-Triples.

1.2. Resource-Statement -- built-in

To avoid a signature change, I think we can put it in Resource with
"Resource::isTriple()" and "getTriple() -> Statement" with the goal of
"no RDF* -> no impact".

1.3. Resource-Statement -- as(Statement.class)

A case of Resource that is not a blank node nor a URI but can be used
with RDFNode.as(Statement.class) to get the Statement - or even a blank
node where RDFNode.canAs(Statement.class) holds - a sort of on-the-fly reification.

1.4. New RDFNode subclass.

The ultimate "do it properly" - add a new kind of RDFNode - is an API
change as the subject position changes. I don't see any compatibility
path. Only consider when it is clear what a stable answer is.

2. Result sets

I favour 2.1, together with being switchable to "encoding", so that foreign,
plain-RDF code does work on SPARQL* results to some degree. That
includes YASGUI.

2.1. "Do it properly"
New term type:
   { "type": "triple" , "value":  }
   and "value" is a JSON object
 {"subject": ... , "predicate":  "object":  }

 XML results is similar .. which is TriX format 
  


http://www.w3.org/2001/XMLSchema#integer;>123
  

2.2. An "Encoding" approach:

 { "type": "triple" , "value": "_encoding_" }
and/or
 { "type": "uri" , "value": "encoding" }


3. RDF/XML

Unlike Turtle*, changing the syntax is impractical.
Doing nothing and waiting to see what common practice emerges is quite
reasonable. Encoding works.

4. JSON-LD

Encoding is probably the way to go.
(ajs6f: Does JSON-LD 1.1 say anything about this?)


Term encoding:

It's going to be long! Prefixes can't be assumed to be present. It needs
3 components, and maybe a datatype or language. And so we need a
separator, and an encoding for uses of the separator.
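
Purely as an illustration of that point (not a proposed Jena format), a sketch 
of such an encoding over terms already written out in N-Triples syntax:

public class TripleTermEncoding {
    // ASCII unit separator: unlikely to appear in data, but escaped anyway.
    private static final char SEP = '\u001F';

    // subjectNT/predicateNT/objectNT are the three terms in N-Triples syntax.
    static String encode(String subjectNT, String predicateNT, String objectNT) {
        return esc(subjectNT) + SEP + esc(predicateNT) + SEP + esc(objectNT);
    }

    // Escape the escape character first, then the separator, so a decoder can
    // split on unescaped separators and reverse the two replacements.
    private static String esc(String term) {
        return term.replace("\\", "\\\\")
                   .replace(String.valueOf(SEP), "\\" + SEP);
    }
}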

  Andy





Re: RDF*

2020-04-06 Thread ajs6f
More later, just a quick note re:

> (ajs6f: Does JSON-LD 1.1 say anything about this?)

Nope, not a thing. If anything, the direction has been in the other
direction. E.g. 1.0 allows bnodes-as-predicates and we strongly discourage
that in 1.1-- pushing folks towards "vanilla" RDF.

Our next WG meeting is this Friday, and I am happy to raise any questions
we want raised, then or going forward.

Adam

On Mon, Apr 6, 2020, 4:10 PM Andy Seaborne  wrote:

> WIP
>
> RDF* is adding triples-as-term to RDF so that you can annotate triples.
> This is not reification.
>
> Summary:
>
> * Aim to get it working for Turtle*, in-memory, SPARQL* and JSON results
> (Fuseki in other words).
> * "sometime"
> * This is not by converting to reification and back again!
> * Experimental - may change without warning.
>
> -
>
> Details of RDF*
>
>
> https://blog.liu.se/olafhartig/2019/01/10/position-statement-rdf-star-and-sparql-star/
> https://arxiv.org/pdf/1406.3399.pdf
>
>  Example 
> PREFIX : 
>
> :s :p 123 .
> <<:s :p 123>> :q 678 .
> 
>
> The RDF Triple term is <<:s :p 123>>.
>
> (it is lucky that early versions of SPARQL, pre 1.0, had <<>> for
> reification ... so the grammar works!)
>
> Currently working:
>Node_Triple
>Turtle* parsing
>Turtle* writing
>In-memory storage
>SPARQL syntax
>SPARQL execution with <> pattern matching.
>SPARQL text format output of result sets.
>Conversion to/from reification
>
> (it is lucky that early ideas for SPARQL, pre 1.0, had <<>> for
> reification ... so the grammar works!)
>
> including nested cases (yup: triple terms inside triple terms is legal:
> <<:s :p  <<:s :p 123>> >> is legal).
>
> There are other areas impacted:
>  Model API
>  Result sets (JSON, XML)
>  JSON-LD
>  RDF/XML syntax
>
> and there is less degrees of freedom in design for compatibility reasons.
>
> Ideas:
>
> 1. API:
>
> RDF Triple terms can be in the subject position, so it is a Resource,
> even though they are really abstractly literals.
>
> Going through this list, 1.3 is my first choice at the moment.
>
> 1.1. Encoding
>
> Have special blank nodes or URI that carry the triple encoding information.
>
> Having some encoding as a fallback mechanism is probably wise. It is
> needed for RDF/XML, and may be N-triples.
>
> 1.2. Resource-Statement -- built-in
>
> To avoid a signature change, I think we can put it in Resource with
> "Resource::isTriple()" and "getTriple() -> Statement" with the goal of
> "no RDF* -> no impact".
>
> 1.3. Resource-Statement -- as(Statement.class)
>
> A case of Resource that is not a blank node nor a URI but can be used
> with RDFNode.as(Statement.class) to get the Statement.  or even a blank
> node, that RDFNode.canAs(Statement.class) - sort of on the fly reification.
>
> 1.4. New RDFNode subclass.
>
> The ultimate "do it properly" - add a new kind of RDFNode - is an API
> change as the subject position changes. I don't see any compatibility
> path. Only consider when it is clear what a stable answer is.
>
> 2. Result sets
>
> I favour 2.1, together with switchable to "encoding", so that foreign,
> plain-RDF, code does work on SPARQL* results to some degree. That
> includes YASGUI.
>
> 2.1. "Do it properly"
>New term type:
>   { "type": "triple" , "value":  }
>   and "value" is a JSON object
> {"subject": ... , "predicate":  "object":  }
>
> XML results is similar .. which is TriX format 
>  
>
>
> datatype="http://www.w3.org/2001/XMLSchema#integer;>123
>  
>
> 2.2. An "Encoding" approach:
>
> { "type": "triple" , "value": "_encoding_" }
> and/or
> { "type": "uri" , "value": "encoding" }
>
>
> 3. RDF/XML
>
> Unlike Turtle* changing the syntax is impractical.
> Doing nothing is quite reasonable and wait to see what common practice
> emerges. Encoding works.
>
> 4. JSON-LD
>
> Encoding is probably the way to go.
> (ajs6f: Does JSON-LD 1.1 say anything about this?)
>
>
> Term encoding:
>
> It's going to be long! Prefixes can't be assumed to be present. It needs
> 3 components, and maybe a datatype or language. And so we need a
> separator/encoding of use of the separator.
>
>  Andy
>


Re: RDF Diff/patch

2017-12-27 Thread Claude Warren
Currently I am using https://github.com/Claudenw/java-diff-utils (forked
from https://github.com/dnaumenko/java-diff-utils -- no changes yet).

I start with the assumption that the datastore will always produce the same
ID for the blank node across queries.  I assume they will change if deleted
and reinserted but as long as there is no change I assume they are the same
id.  If that assumption does not hold the diff probably won't work
correctly.

I basically perform a query against the 2 datasets to produce ordered
(g,s,p,o) quads.

I feed the results into the diff/patch routine.
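
A minimal set-difference sketch of the idea (not the java-diff-utils based code 
referenced in this mail; blank nodes compare by label, per the assumption stated 
earlier):

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

import org.apache.jena.atlas.iterator.Iter;
import org.apache.jena.query.Dataset;
import org.apache.jena.sparql.core.Quad;

public class DatasetDiff {
    // Quads in d1 but not d2 become deletes; quads in d2 but not d1 become inserts.
    public static void diff(Dataset d1, Dataset d2) {
        Set<Quad> left  = new LinkedHashSet<>(Iter.toList(d1.asDatasetGraph().find()));
        Set<Quad> right = new LinkedHashSet<>(Iter.toList(d2.asDatasetGraph().find()));

        List<Quad> deletes = new ArrayList<>(left);
        deletes.removeAll(right);
        List<Quad> inserts = new ArrayList<>(right);
        inserts.removeAll(left);

        deletes.forEach(q -> System.out.println("DELETE " + q));
        inserts.forEach(q -> System.out.println("INSERT " + q));
    }
}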

Currently if the blank nodes have different ids they would be deleted and
reinserted in the first case and just one deleted in the second case.

The code is at https://github.com/Claudenw/rdf-diff-patch (sorry Andy got
"rdf" and "patch" in the name -- I'll change it if I can find another good
descriptor -- alternatively, we might be able to generate RDF-patch format
output).

Use PatchFactory to create the patch object and UpdateFactory to create the
UpdateRequest.

This code does need the recent fixes for jena-querybuilder 3.7.0-SNAPSHOT.

I have only been working on this for a couple of days and there are several
places to improve it.


   1. I think the diff/patch routine has some equality plugin points that
   might make matching different blank node ids within a graph possible in the
   diff processing.
   2. Since the patch generated by java-diff-utils would have both the
   delete and the insert quads it should be possible to create models for each
   named graph in the quad list, perform some queries against them to remove
   any blank nodes that are the "same" (your choice of definition for "same")
   and perform mapping between old and new node ids.

There are lots of edge cases to explore here.

Claude


On Wed, Dec 27, 2017 at 4:26 PM, ajs6f  wrote:

> I'm curious too, Claude. Is the idea that one assumes that bnodes are
> already using the same pool of labels, or something like that? IOW, if I
> have dataset1:
>
> _:a a my:type .
> _:b a my:type .
>
> and dataset2:
>
> _:c a my:type .
>
> and I want to convert dataset1 into dataset2, will your algorithm delete
> both triples and add a new one, or just remove a triple, and if so, is that
> deterministic? If dataset2 is instead:
>
> _:a a my:type .
>
> will the algorithm only remove one triple and be done, or remove both and
> add a new one?
>
> ajs6f
>
> > On Dec 27, 2017, at 11:00 AM, Andy Seaborne  wrote:
> >
> > It would be interesting to see especially the handling of blank nodes
> cycles and other structures.
> >
> > Please don't call it "RDF Patch" or a names similar to that - that term
> is already used.
> >
> >Andy
> >
> > On 26/12/17 18:17, Claude Warren wrote:
> >> Howdy,
> >> I am working on a tool that can create UpdateRequests that will convert
> one
> >> Dataset into another.
> >> The basic idea is to extract the quads sorted by (g,s,p,o) and then
> perform
> >> a diff on the lists (like a text diff but each quad is a "line").
> >> The result is that I can create statements to delete insert and delete
> one
> >> dataset to make it "identical" to the other.  Identical in this case
> means
> >> that each model in the two datasets are isomorphic.
> >> Is anyone else interested in this?
> >> Claude
>
>


-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: RDF Diff/patch

2017-12-27 Thread ajs6f
I'm curious too, Claude. Is the idea that one assumes that bnodes are already 
using the same pool of labels, or something like that? IOW, if I have dataset1:

_:a a my:type .
_:b a my:type .

and dataset2:

_:c a my:type .

and I want to convert dataset1 into dataset2, will your algorithm delete both 
triples and add a new one, or just remove a triple, and if so, is that 
deterministic? If dataset2 is instead:

_:a a my:type .

will the algorithm only remove one triple and be done, or remove both and add a 
new one?

ajs6f

> On Dec 27, 2017, at 11:00 AM, Andy Seaborne  wrote:
> 
> It would be interesting to see especially the handling of blank nodes cycles 
> and other structures.
> 
> Please don't call it "RDF Patch" or a names similar to that - that term is 
> already used.
> 
>Andy
> 
> On 26/12/17 18:17, Claude Warren wrote:
>> Howdy,
>> I am working on a tool that can create UpdateRequests that will convert one
>> Dataset into another.
>> The basic idea is to extract the quads sorted by (g,s,p,o) and then perform
>> a diff on the lists (like a text diff but each quad is a "line").
>> The result is that I can create statements to delete insert and delete one
>> dataset to make it "identical" to the other.  Identical in this case means
>> that each model in the two datasets are isomorphic.
>> Is anyone else interested in this?
>> Claude



Re: RDF Diff/patch

2017-12-27 Thread Andy Seaborne
It would be interesting to see, especially the handling of blank node 
cycles and other structures.


Please don't call it "RDF Patch" or a name similar to that - that term 
is already used.


Andy

On 26/12/17 18:17, Claude Warren wrote:

Howdy,

I am working on a tool that can create UpdateRequests that will convert one
Dataset into another.

The basic idea is to extract the quads sorted by (g,s,p,o) and then perform
a diff on the lists (like a text diff but each quad is a "line").

The result is that I can create statements to insert and delete against one
dataset to make it "identical" to the other.  Identical in this case means
that each model in the two datasets is isomorphic.

Is anyone else interested in this?

Claude



Re: RDF Patch - experiences suggesting changes

2016-10-20 Thread Rob Vesse
I think we’re both coming at this from different angles hence the disagreement. 
Your assumption seems to be that patches are packets, or perhaps messages, 
exchanged between parts of the system. Whereas my assumption is that the patch 
is similar to a journal and represents a continuous stream of changes with 
potentially many packets in your parlance concatenated together. In your 
scenario I agree that there is probably no need for reversible at the 
transaction level, however in my scenario this could be extremely useful.

 Ultimately you probably have to make a call one way or another. You have 
actual practical implementation experience with this format so you are probably 
better placed to decide what works in the real world. As with the notion of 
repeat it is not like a decision made now is a permanent one, you can always 
revise the standard in the future provided you include a way to define what 
version of the standard a patch conforms to.

Rob

On 20/10/2016 13:50, "Andy Seaborne"  wrote:



On 19/10/16 10:51, Rob Vesse wrote:
> On 14/10/2016 17:09, "Andy Seaborne"  wrote:
>
> I don't understand what capabilities are enabled by transaction
> granularity if there are multiple transactions in a single patch.
> Concrete examples of where it helps?
>
> However, I've normally been working with one transaction per patch 
anyway.
>
> Allowing multiple transaction per patch is for making a collect of
> (semantically) related changes into a unit, by consolidating small
> patches "today's changes " (c.f. git squash).
>
> Leaving the transaction boundaries in gives internal checkpoints, not
> just one big transaction. It also makes the consolidate patch
> decomposable (unlike squash).
>
> Internal checkpoints are useful not just for keeping the transaction
> manageable but also to be able to restart a very large update in case 
it
> failed part way through for system reasons (server power cut, user
> reboots laptop by accident, ...)  Imagine keeping a DBpedia copy up 
to date.
>
> I think the thought is that a producer of A patch can decide whether
> each transaction being recorded should be reversible or not. For
> example if you are a very large dataset to an already large database
> you probably don’t want to slow down the import process by having to
> check whether every triple/quad is already in the database as you
> import it. Therefore you might choose to output a non-reversible
> transaction for performance reasons.
>
> On the other hand if you’re accepting a small change to the data then
> that cost is probably acceptable and you would output a reversible
> transaction.
>
> I am not arguing that you shouldn’t have transaction boundaries, in
> fact I think they are essential, but simply that you may want to be
> to annotate the properties of a transaction Beyond just stating the
> boundaries.

Rob,

I agree the producer needs to have control.  What I am asking is why one 
patch unit (packet) would have multiple transactions with different 
characteristics in it.  The properties of patch packet include 
reversibility of contents. A patch overall isn't reversible unless each 
transaction within it is so there is now an opportunity for errors.

I think unit of patch packet is enough - it is supposed to be a sensible 
set of changes to move the dataset from one consistent state to another. 
  In developing that set of changes, there may have been several 
transactions (c.f. git squash).  It happens to give a checkpoint effect 
on large patches as well.

Analogy that may not help : a "TB/TC" is a database-transaction and a 
"patch" is more like a "business transaction".


(The use of "transaction" may not be the best - "action"? but with a 
need for "abort" as well as "commit", "transaction"

Andy







Re: RDF Patch - experiences suggesting changes

2016-10-20 Thread A. Soroka
For my cases, I would like intra-patch transactions because I have several 
different possible implementations of "patch"-- in other words, a patch might 
be an HTTP request, a section of a journal on a filesystem, a feed from a queue 
between time x and time y, an isolated file, etc. Having an independent notion 
of transaction would let me easily keep a common entity (transaction) in my 
systems even though the concrete manifestation of "patch" is varying.

---
A. Soroka
The University of Virginia Library

> On Oct 20, 2016, at 8:50 AM, Andy Seaborne  wrote:
> 
> 
> 
> On 19/10/16 10:51, Rob Vesse wrote:
>> On 14/10/2016 17:09, "Andy Seaborne"  wrote:
>> 
>>I don't understand what capabilities are enabled by transaction
>>granularity if there are multiple transactions in a single patch.
>>Concrete examples of where it helps?
>> 
>>However, I've normally been working with one transaction per patch anyway.
>> 
>>Allowing multiple transaction per patch is for making a collect of
>>(semantically) related changes into a unit, by consolidating small
>>patches "today's changes " (c.f. git squash).
>> 
>>Leaving the transaction boundaries in gives internal checkpoints, not
>>just one big transaction. It also makes the consolidate patch
>>decomposable (unlike squash).
>> 
>>Internal checkpoints are useful not just for keeping the transaction
>>manageable but also to be able to restart a very large update in case it
>>failed part way through for system reasons (server power cut, user
>>reboots laptop by accident, ...)  Imagine keeping a DBpedia copy up to 
>> date.
>> 
>> I think the thought is that a producer of A patch can decide whether
>> each transaction being recorded should be reversible or not. For
>> example if you are a very large dataset to an already large database
>> you probably don’t want to slow down the import process by having to
>> check whether every triple/quad is already in the database as you
>> import it. Therefore you might choose to output a non-reversible
>> transaction for performance reasons.
>> 
>> On the other hand if you’re accepting a small change to the data then
>> that cost is probably acceptable and you would output a reversible
>> transaction.
>> 
>> I am not arguing that you shouldn’t have transaction boundaries, in
>> fact I think they are essential, but simply that you may want to be
>> to annotate the properties of a transaction Beyond just stating the
>> boundaries.
> 
> Rob,
> 
> I agree the producer needs to have control.  What I am asking is why one 
> patch unit (packet) would have multiple transactions with different 
> characteristics in it.  The properties of patch packet include reversibility 
> of contents. A patch overall isn't reversible unless each transaction within 
> it is so there is now an opportunity for errors.
> 
> I think unit of patch packet is enough - it is supposed to be a sensible set 
> of changes to move the dataset from one consistent state to another.  In 
> developing that set of changes, there may have been several transactions 
> (c.f. git squash).  It happens to give a checkpoint effect on large patches 
> as well.
> 
> Analogy that may not help : a "TB/TC" is a database-transaction and a "patch" 
> is more like a "business transaction".
> 
> 
> (The use of "transaction" may not be the best - "action"? but with a need for 
> "abort" as well as "commit", "transaction"
> 
>   Andy



Re: RDF Patch - experiences suggesting changes

2016-10-20 Thread Andy Seaborne



On 19/10/16 10:51, Rob Vesse wrote:

On 14/10/2016 17:09, "Andy Seaborne"  wrote:

I don't understand what capabilities are enabled by transaction
granularity if there are multiple transactions in a single patch.
Concrete examples of where it helps?

However, I've normally been working with one transaction per patch anyway.

Allowing multiple transaction per patch is for making a collect of
(semantically) related changes into a unit, by consolidating small
patches "today's changes " (c.f. git squash).

Leaving the transaction boundaries in gives internal checkpoints, not
just one big transaction. It also makes the consolidate patch
decomposable (unlike squash).

Internal checkpoints are useful not just for keeping the transaction
manageable but also to be able to restart a very large update in case it
failed part way through for system reasons (server power cut, user
reboots laptop by accident, ...)  Imagine keeping a DBpedia copy up to date.

I think the thought is that a producer of A patch can decide whether
each transaction being recorded should be reversible or not. For
example if you are a very large dataset to an already large database
you probably don’t want to slow down the import process by having to
check whether every triple/quad is already in the database as you
import it. Therefore you might choose to output a non-reversible
transaction for performance reasons.

On the other hand if you’re accepting a small change to the data then
that cost is probably acceptable and you would output a reversible
transaction.

I am not arguing that you shouldn’t have transaction boundaries, in
fact I think they are essential, but simply that you may want to be
to annotate the properties of a transaction Beyond just stating the
boundaries.


Rob,

I agree the producer needs to have control.  What I am asking is why one 
patch unit (packet) would have multiple transactions with different 
characteristics in it.  The properties of a patch packet include the 
reversibility of its contents. A patch overall isn't reversible unless each 
transaction within it is, so there is now an opportunity for errors.


I think the unit of a patch packet is enough - it is supposed to be a sensible 
set of changes to move the dataset from one consistent state to another. 
 In developing that set of changes, there may have been several 
transactions (c.f. git squash).  It happens to give a checkpoint effect 
on large patches as well.


Analogy that may not help : a "TB/TC" is a database-transaction and a 
"patch" is more like a "business transaction".



(The use of "transaction" may not be the best - "action"? but with a 
need for "abort" as well as "commit", "transaction"


Andy


Re: RDF Patch - experiences suggesting changes

2016-10-19 Thread Andy Seaborne



On 19/10/16 11:34, Stian Soiland-Reyes wrote:

I had a quick go, and the penalty from gzip with using expanded forms
without "R" was negligible (~ 0.1%, a bit higher with no prefixes). It
also means you can't process the RDF Patch in a parallel way without
preprocessing.  (Same for prefixes).


Good point ... for certain restricted patches like all QA or all QD 
where reordering (necessary for parallel processing) is possible.


At this point, specifying RDF Patch v2 without R until the interactions 
with gzip etc. compression are better understood seems to me to be the way 
forward.


It's easier to add later than add now and remove.

FYI:

The RIOT parsers do interning of Nodes using a 1000-slot LRU cache (so 
not large) - this leads to 30%, sometimes 50%, less memory being used 
due to shared terms.  In practice, it results in interning all the properties 
in a vocabulary (1000 well-used properties being quite unusual), which 
R does not do.


Andy





Using "R" could also restrict possible compression pattern, for instance in :

A 

 .
A 

 .

a good compression algorithm might recognize patterns in here like:

 .\nA   .\nA R R" (which sometimes would work well).



Can RDF Patch items within a transaction be considered in any order
(first all the DELETEs, then all the ADDs), or do they have to be
played back linearly?


On 19 October 2016 at 10:57, Rob Vesse  wrote:

Yes but ANY is a form of lossy compression. You lost the actual details of what 
was removed. Also it can only be used for removals and yields no benefit for 
additions.

 On the other hand REPEAT is lossless compression.

 However if you apply a general-purpose compression like gzip on top of the 
patch you probably get just as good compression without needing any special 
tokens. In my experience repeat is more useful in compact binary formats where 
you can use fewer bytes to encode it then either the term itself or a reference 
to the term in some lookup table.

On 14/10/2016 17:09, "Andy Seaborne"  wrote:

These two together seem a bit contradictory.  The advantage of ANY, with
versions, is that it is form of compression.










Re: RDF Patch - experiences suggesting changes

2016-10-19 Thread Andy Seaborne

They must be in the order presented.

QD then QA is a different outcome to QA then QD.

On 19/10/16 15:00, Stian Soiland-Reyes wrote:

Obviously that would be the case for the flat file (no transactions) and
order of any transactions.

So if that is the case also *inside* a transaction, then you are
effectively doing suboperations with a new transactional state per line in
the transaction.

How about restricting transactions to always have the order DDD ..?


Because it is a stream


That would help on reversibility as well as you can't then remove triples
added in the same transaction. (Reversibility is just to swap A/D blocks).


The reverse of QD depends on the data.

If the quad existed at the start, the reverse is a no-op, else it's QA.


Perhaps DDD ordering could be a restriction only for Reversible
transactions as it could prevent a more "naive" log approach to be used
with transactions..?


The important point is higher-level semantics (that word!) can be 
imposed by systems on top of a basic patch format.


The headers indicate the additional properties of the patch.

Andy



On 19 Oct 2016 1:40 pm, "Rob Vesse"  wrote:


I am pretty sure that the intent is that a patch must be read in linear
order i.e. It is not designed for parallel processing

On 19/10/2016 11:34, "Stian Soiland-Reyes"  wrote:

I had a quick go, and the penalty from gzip with using expanded forms
without "R" was negligible (~ 0.1%, a bit higher with no prefixes). It
also means you can't process the RDF Patch in a parallel way without
preprocessing.  (Same for prefixes).

Using "R" could also restrict possible compression pattern, for
instance in :

A 

 .
A 

 .

a good compression algorithm might recognize patterns in here like:

 .\nA    .\nA R R" (which sometimes would work well).



Can RDF Patch items within a transaction be considered in any order
(first all the DELETEs, then all the ADDs), or do they have to be
played back linearly?


On 19 October 2016 at 10:57, Rob Vesse  wrote:
> Yes but ANY is a form of lossy compression. You lost the actual
details of what was removed. Also it can only be used for removals and
yields no benefit for additions.
>
>  On the other hand REPEAT is lossless compression.
>
>  However if you apply a general-purpose compression like gzip on top
of the patch you probably get just as good compression without needing any
special tokens. In my experience repeat is more useful in compact binary
formats where you can use fewer bytes to encode it then either the term
itself or a reference to the term in some lookup table.
>
> On 14/10/2016 17:09, "Andy Seaborne"  wrote:
>
> These two together seem a bit contradictory.  The advantage of
ANY, with
> versions, is that it is form of compression.
>
>
>
>



--
Stian Soiland-Reyes
http://orcid.org/-0001-9842-9718










Re: RDF Patch - experiences suggesting changes

2016-10-19 Thread Rob Vesse
I am pretty sure that the intent is that a patch must be read in linear order 
i.e. It is not designed for parallel processing

On 19/10/2016 11:34, "Stian Soiland-Reyes"  wrote:

I had a quick go, and the penalty from gzip with using expanded forms
without "R" was negligible (~ 0.1%, a bit higher with no prefixes). It
also means you can't process the RDF Patch in a parallel way without
preprocessing.  (Same for prefixes).

Using "R" could also restrict possible compression pattern, for instance in 
:

A 

 .
A 

 .

a good compression algorithm might recognize patterns in here like:

 .\nA    .\nA R R" (which sometimes would work well).



Can RDF Patch items within a transaction be considered in any order
(first all the DELETEs, then all the ADDs), or do they have to be
played back linearly?


On 19 October 2016 at 10:57, Rob Vesse  wrote:
> Yes but ANY is a form of lossy compression. You lost the actual details 
of what was removed. Also it can only be used for removals and yields no 
benefit for additions.
>
>  On the other hand REPEAT is lossless compression.
>
>  However if you apply a general-purpose compression like gzip on top of 
the patch you probably get just as good compression without needing any special 
tokens. In my experience repeat is more useful in compact binary formats where 
you can use fewer bytes to encode it then either the term itself or a reference 
to the term in some lookup table.
>
> On 14/10/2016 17:09, "Andy Seaborne"  wrote:
>
> These two together seem a bit contradictory.  The advantage of ANY, 
with
> versions, is that it is form of compression.
>
>
>
>



-- 
Stian Soiland-Reyes
http://orcid.org/-0001-9842-9718







Re: RDF Patch - experiences suggesting changes

2016-10-19 Thread Stian Soiland-Reyes
I had a quick go, and the penalty from gzip with using expanded forms
without "R" was negligible (~ 0.1%, a bit higher with no prefixes). It
also means you can't process the RDF Patch in a parallel way without
preprocessing.  (Same for prefixes).

Using "R" could also restrict possible compression pattern, for instance in :

A 

 .
A 

 .

a good compression algorithm might recognize patterns in here like:

 .\nA    .\nA R R" (which sometimes would work well).



Can RDF Patch items within a transaction be considered in any order
(first all the DELETEs, then all the ADDs), or do they have to be
played back linearly?


On 19 October 2016 at 10:57, Rob Vesse  wrote:
> Yes but ANY is a form of lossy compression. You lost the actual details of 
> what was removed. Also it can only be used for removals and yields no benefit 
> for additions.
>
>  On the other hand REPEAT is lossless compression.
>
>  However if you apply a general-purpose compression like gzip on top of the 
> patch you probably get just as good compression without needing any special 
> tokens. In my experience repeat is more useful in compact binary formats 
> where you can use fewer bytes to encode it then either the term itself or a 
> reference to the term in some lookup table.
>
> On 14/10/2016 17:09, "Andy Seaborne"  wrote:
>
> These two together seem a bit contradictory.  The advantage of ANY, with
> versions, is that it is form of compression.
>
>
>
>



-- 
Stian Soiland-Reyes
http://orcid.org/-0001-9842-9718


Re: RDF Patch - experiences suggesting changes

2016-10-19 Thread Rob Vesse
Yes but ANY is a form of lossy compression. You lose the actual details of what 
was removed. Also it can only be used for removals and yields no benefit for 
additions.

 On the other hand REPEAT is lossless compression.

 However if you apply a general-purpose compression like gzip on top of the 
patch you probably get just as good compression without needing any special 
tokens. In my experience repeat is more useful in compact binary formats where 
you can use fewer bytes to encode it than either the term itself or a reference 
to the term in some lookup table.

On 14/10/2016 17:09, "Andy Seaborne"  wrote:

These two together seem a bit contradictory.  The advantage of ANY, with 
versions, is that it is form of compression.






Re: RDF Patch - experiences suggesting changes

2016-10-19 Thread Rob Vesse
On 14/10/2016 17:09, "Andy Seaborne"  wrote:

I don't understand what capabilities are enabled by transaction 
granularity if there are multiple transactions in a single patch. 
Concrete examples of where it helps?

However, I've normally been working with one transaction per patch anyway.

Allowing multiple transaction per patch is for making a collect of 
(semantically) related changes into a unit, by consolidating small 
patches "today's changes " (c.f. git squash).

Leaving the transaction boundaries in gives internal checkpoints, not 
just one big transaction. It also makes the consolidate patch 
decomposable (unlike squash).

Internal checkpoints are useful not just for keeping the transaction 
manageable but also to be able to restart a very large update in case it 
failed part way through for system reasons (server power cut, user 
reboots laptop by accident, ...)  Imagine keeping a DBpedia copy up to date.

 I think the thought is that a producer of a patch can decide whether each 
transaction being recorded should be reversible or not. For example if you are adding 
a very large dataset to an already large database you probably don’t want to 
slow down the import process by having to check whether every triple/quad is 
already in the database as you import it. Therefore you might choose to output 
a non-reversible transaction for performance reasons.

On the other hand if you’re accepting a small change to the data then that cost 
is probably acceptable and you would output a reversible transaction.

 I am not arguing that you shouldn’t have transaction boundaries, in fact I 
think they are essential, but simply that you may want to be able to annotate the 
properties of a transaction beyond just stating the boundaries.






Re: RDF Patch - experiences suggesting changes

2016-10-17 Thread Andy Seaborne



On 14/10/16 11:59, A. Soroka wrote:
...


6/ Packets of change.

To have 4 (label a patch with reversible) and 5 (the version details), there 
needs to be somewhere to put the information. Having it in the patch itself 
means that the whole unit can be stored in a file.  If it is in the protocol, 
like HTTP for E-tags then the information becomes separated.  That is not to 
say that it can't also be in the protocol but it needs support in the data 
format.


As long as the sort of information about which we are thinking makes sense on a 
per-transaction basis, that could be as I suggest above, as "metadata" on BEGIN.


So a patch packet for a single transaction:

PARENT 
VERSION 
REVERSIBLE   optional
TB
QA ...
QD ...
PA ...
PD ...
TC
H 

where QA and QD are "quad add" "quad delete", and "PA" "PD" are "add prefix" and 
"delete prefix"


I'm suggesting something more like:

TB PARENT  VERSION  REVERSIBLE
QA ...
QD ...
PA ...
PD ...
TC H 

Or even just positionally

TB   REVERSIBLE
QA ...
QD ...
PA ...
PD ...
TC 


An "HTTP header"-like form (key-value lines) would be better because it 
is an open design where new header fields can be put in as needed. 
(e.g. "Author", "Date", "Signed-off-by").


It could be done as a different style, maybe identical to HTTP, so the 
patch specific parsing begins at the first TB, or include a blank line 
or marker like "". Or the header could reuse the same parsing 
tokens, though values may end up in strings for grouping more often.


Andy





Re: RDF Patch - experiences suggesting changes

2016-10-14 Thread Andy Seaborne



On 14/10/16 15:22, Rob Vesse wrote:

Thanks for sending this out

Another use case that springs to mind is for write ahead logging
particularly for reversible patches.


Yes. Whether it has to be reversible depends on how it relates to the
journal.  If it is the journal, it only has to be a replayable log. If 
the commit journal is separate, it may need to be reversible.



On the subject prefixes I agree that being able to record prefix
definitions it Is useful and I am strongly in favour of not using
them to compact the data. As you say it actually makes reading and
writing the Data slower as well as requiring additional state to be
recorded during processing.

I like the use of transaction boundaries, I also like A.Soroka’s
suggestion on making the reversible flag be Applied to transaction
begin rather than to the patch as a whole though I don’t see any
problem with supporting both forms. I think reversible patches are an
essential feature.


I don't understand what capabilities are enabled by transaction 
granularity if there are multiple transactions in a single patch. 
Concrete examples of where it helps?


However, I've normally been working with one transaction per patch anyway.

Allowing multiple transactions per patch is for making a collection of 
(semantically) related changes into a unit, by consolidating small 
patches ("today's changes", c.f. git squash).


Leaving the transaction boundaries in gives internal checkpoints, not 
just one big transaction. It also makes the consolidated patch 
decomposable (unlike squash).


Internal checkpoints are useful not just for keeping the transaction 
manageable but also to be able to restart a very large update in case it 
failed part way through for system reasons (server power cut, user 
reboots laptop by accident, ...)  Imagine keeping a DBpedia copy up to date.



For the version control aspect I would be tempted to not constrain it
to UUID and simply say that it is an identifier for the parent state
to which the patch is applied. This will then allow people the
freedom to use hash algorithms, simple counters etc or any other
Version identification scheme they desired. I might even be tempted
to suggest that it should be a URI so that people can use identifiers
in their own name spaces to reduce the chance collisions.


As long as the ref is globally unique (so not counters without a uniquifier).

I mentioned UUIDs really to turn up the contrast. It is not naming a web 
resource if it is a version.  The web resource is mutable - it's the 
dataset.  If someone wants to use http: versions for a 
way-back-database, that's cool, but making that the way for systems that 
don't have temporal capabilities (the majority) gets into philosophical 
debates.


And to keep patches protocol independent.

I have separate work on a protocol for keeping two datasets synced (soft 
consistency).



I can see the value of supporting meta data about the patch both
within it and in any protocol used to communicate it. Checksums are
fine although if you include this then you probably need to define
exactly how each checksum should be calculated.


Yes.



As for some of the other suggestions you have received:

- I would be strongly against including an ANY term. As soon as you
get into wild cards you may as well just use SPARQL Update. Plus the
meaning of the wild card is dependent on the dataset to which it is
applied which completely defeats the purpose of being a canonical
description of changes

> - I am strongly for including the REPEAT term.

This has the potential to offer significant compression particularly
if the system producing the patch chooses to group changes by subject
and predicate À la turtle and most other syntaxes.


These two together seem a bit contradictory.  The advantage of ANY, with 
versions, is that it is a form of compression.


Without a version, I agree that it is stepping towards a higher-level 
language for changes.



The compression by subject/predicate leaves me mixed - compression after 
hashing would treat them as more orthogonal, even compressing with R.


My rule of thumb is x8 to x10 compression of N-triple/N-quads.  That's 
not all coming from same-subject etc.  I assume it comes from 
effectively spotting the namespaces and making them compression tokens.


> - Having a term for the default graph could prove useful

Andy



Rob


On 13/10/2016 16:32, "Andy Seaborne"  wrote:

I've been using modified RDF Patch for the data exchanged to keep
multiple datasets synchronized.

My primary use case is having multiple copies of the datasets for a
high availability solution.  It has to be a general solution for any
data.

There are some changes to the format that this work has highlighted.

[RDF Patch - v1] https://afs.github.io/rdf-patch/


1/ Record changes to prefixes

Just handling quads/triples isn't enough - to keep two datasets
in-step, we also need to record changes to 

Re: RDF Patch - experiences suggesting changes

2016-10-14 Thread Rob Vesse
Thanks for sending this out

Another use case that springs to mind is for write ahead logging particularly 
for reversible patches.

On the subject of prefixes I agree that being able to record prefix definitions 
is useful and I am strongly in favour of not using them to compact the data. As 
you say it actually makes reading and writing the data slower as well as 
requiring additional state to be recorded during processing.

 I like the use of transaction boundaries. I also like A.Soroka’s suggestion on 
making the reversible flag be applied to transaction begin rather than to the 
patch as a whole, though I don’t see any problem with supporting both forms. I 
think reversible patches are an essential feature.

 For the version control aspect I would be tempted to not constrain it to UUID 
and simply say that it is an identifier for the parent state to which the patch 
is applied. This will then allow people the freedom to use hash algorithms, 
simple counters etc or any other version identification scheme they desire. I 
might even be tempted to suggest that it should be a URI so that people can use 
identifiers in their own name spaces to reduce the chance of collisions.

 I can see the value of supporting meta data about the patch both within it and 
in any protocol used to communicate it. Checksums are fine although if you 
include this then you probably need to define exactly how each checksum should 
be calculated.

 As for some of the other suggestions you have received:

- I would be strongly against including an ANY term. As soon as you get into 
wild cards you may as well just use SPARQL Update. Plus the meaning of the wild 
card is dependent on the dataset to which it is applied which completely 
defeats the purpose of being a canonical description of changes
- I am strongly for including the REPEAT term. This has the potential to offer 
significant compression particularly if the system producing the patch chooses 
to group changes by subject and predicate À la turtle and most other syntaxes.
- Having a term for the default graph could prove useful

Rob


On 13/10/2016 16:32, "Andy Seaborne"  wrote:

I've been using modified RDF Patch for the data exchanged to keep 
multiple datasets synchronized.

My primary use case is having multiple copies of the datasets for a high 
availability solution.  It has to be a general solution for any data.

There are some changes to the format that this work has highlighted.

[RDF Patch - v1]
https://afs.github.io/rdf-patch/


1/ Record changes to prefixes

Just handling quads/triples isn't enough - to keep two datasets in-step, 
we also need to record changes to prefixes.  While they don't change the 
meaning of the data, application developers and users like prefixes.

2/ Remove the in-data prefixes feature.

RDF Patch has the feature to define prefixes in the data and use them 
for prefix names later in the data using @prefix.

This seems to have no real advantage, it can slow things down (c.f. 
N-Triples parsing is faster than Turtle parsing - prefixes are part of 
that), and it generally complicates the data form.

When including "add"/"delete" prefixes on the dataset (1) it also makes 
it quite confusing.

Whether the "R" for "repeat" entry from previous row should also be 
removed is an open question.

3/ Record transaction boundaries.

(A.3 in RDF Patch v1)
http://afs.github.io/rdf-patch/#transaction-boundaries

Having the transaction boundaries recorded means that they can be 
replayed when applying the patch.  While often a patch will be one 
transaction, patches can be consolidated by concatenation.

There are 3 operations:

TB, TC, TA - Transaction Begin, Commit, Abort.

Abort is useful to include because to know whether a transaction in a 
patch is going to commit or abort means waiting until the end.  That 
could be buffering client-side, or buffering server-side (or not writing 
the patch to a file) and having a means to discard a patch stream.

Instead, allow a transaction to record an abort, and say that aborted 
transactions in patches can be discarded downstream.
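
For illustration (the URIs are invented): two concatenated transactions, the second of which aborts and so can be discarded by anything applying the patch downstream:

TB
QA <http://example/s> <http://example/p> <http://example/o> .
TC
TB
QD <http://example/s> <http://example/p> <http://example/o> .
TA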

4/ Reversibility is a patch feature.

The RDF Patch v1 document includes "canonical patch" (section 9)
http://afs.github.io/rdf-patch/#canonical-patches

Such a patch is reversible (it can undo changes) if the adds and deletes 
are recorded only if they lead to a real change.  "Add quad" must mean 
"there was no quad in the set before".  But this only makes sense if the 
whole patch has this property.

RDF Patches are in general entries in a "redo log" - you can apply the 
patch over and over again and it will end up in the same state (they are 
idempotent).

A reversible patch is also an "undo log" entry and if you apply it in 
reverse order, it acts to undo the patch played forwards.
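
As a sketch (data invented): a reversible patch such as

TB
QD <http://example/s> <http://example/p> "old" .
QA <http://example/s> <http://example/p> "new" .
TC

can be undone by reading it bottom-up and swapping QA and QD - which is only safe if every QA/QD line recorded a real change to the dataset.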

Re: RDF Patch - experiences suggesting changes

2016-10-14 Thread Andy Seaborne



On 14/10/16 12:09, A. Soroka wrote:

+1 to ANY, because it offers the potential to actually remove lines
from a patch. For example, removing a whole graph could shrink pretty
rapidly. But just to be clear, ANY would seem to be illegal inside a
reversible patch, right?


Yes.



--- A. Soroka The University of Virginia Library


Re: RDF Patch - experiences suggesting changes

2016-10-14 Thread A. Soroka
+1 to ANY, because it offers the potential to actually remove lines from a 
patch. For example, removing a whole graph could shrink pretty rapidly. But 
just to be clear, ANY would seem to be illegal inside a reversible patch, right?

---
A. Soroka
The University of Virginia Library

> On Oct 14, 2016, at 5:34 AM, Andy Seaborne  wrote:
> 
> Hi Paul,
> 
> The general goal of RDF Patch is to be "assembler" for changes, or "N-Triples 
> for changes" - and there is no pattern matching capability.
> 
> In your example you'd have to know the old value:
> 
> TB
> QD 
>    "r3.xlarge" .
> QA 
>    "r3.2xlarge" .
> TC
> 
> 
> (aside: I wonder if instead of the 3/4 rule for triples/quads, a marker for 
> the default graph is better so the tuple is always QD and 4 terms.
> 
> QD _ 
>    "r3.xlarge" .
> 
> or have TD, TA
> )
> 
> On 13/10/16 17:02, Paul Houle wrote:
>> There is another use case for an "RDF Patch" which applies to
>> hand-written models.  For instance I have a model which describes a job
>> that is run in AWS that looks like
>> 
>> @prefix : 
>> @prefix parameter: 
>> 
>> :Server
>>   :subnetId "subnet-e0ab0197";
>>   :baseImage "ami-ea602afd";
>>   :instanceType "r3.xlarge";
>>   :keyName "o2key";
>>   :keyFile "~/AMZN Keys/o2key.ppk" ;
>>   :securityGroupIds "sg-bca0b2d9" ;
>>   :todo "dbpedia-load" ;
>>   parameter:RDF_SOURCE "s3://abnes/dbpedia/2015-10-gz/" ;
>>   parameter:GRAPH_NAME "http://dbpedia.org/; ;
>>   :imageCommand "/home/ubuntu/RDFeasy/bin/shred_evidence_and_halt";
>>   :iamProfile  ;
>>   :instanceName "Image Build Server";
>>   :qBase  .
>> 
>> one thing you might want to do is modify it so it uses a different
>> :baseImage or a different :instanceType and a natural way to do that is
>> to say
>> 
>> 'remove :Server :instanceType ?x and insert :Server :instanceType
>> "r3.2xlarge"'
> 
> SPARQL Update can provide the "pattern matching" (or some subset like 
> SparqlPatch [https://www.w3.org/2001/sw/wiki/SparqlPatch]):
> 
> 
> DELETE { :Server :instanceType ?x }
> INSERT { :Server :instanceType "r3.2xlarge" }
> WHERE  { :Server :instanceType ?x }
> 
> or
> 
> DELETE WHERE { :Server :instanceType ?x }
> ;
> INSERT DATA { :Server :instanceType "r3.2xlarge" }
> 
> 
> That said, the one useful additional to RDF Patch which is "pattern matching" 
> might be limited bulk delete.
> 
> QD   ANY ANY .
> 
> because listing all the triples to delete when they can be found from the 
> data anyway is a big space saving.
> 
>   Andy
> 
>> but better than that if you have a schema that says ":instanceType is a
>> single valued property" you can write another graph like
>> 
>> :Server
>>   :instanceType "r3.2xlarge" .
>> 
>> and merge it with the first graph to get the desired effect.
>> 
>> More generally this fits into the theme that "the structure of
>> commonsense knowledge is that there are rules,  then exceptions to the
>> rules,  then exceptions to the exceptions of the rules,  etc."
>> For instance I extracted a geospatial database out of Freebase that was
>> about 10 million facts and I found I had to add and remove about 10
>> facts on the route to a 99% success rate at a geospatial recognition
>> task.  A disciplined approach to "agreeing to disagree" goes a long way
>> to solve the problem that specific applications require us to split
>> hairs in different ways.
>> 
>> 
>> 



Re: RDF Patch - experiences suggesting changes

2016-10-14 Thread A. Soroka
Thoughts in-line. (Incidentally, my immediate interest in RDF Patch is pretty 
similar; robustness via distribution, but there's also a smaller, more 
theoretical interest for me in automatically "shredding" or "sharding" datasets 
across networks for higher persistence and query throughput.)

---
A. Soroka
The University of Virginia Library

> On Oct 13, 2016, at 11:32 AM, Andy Seaborne  wrote:
> 
> ...
> 1/ Record changes to prefixes
> 
> Just handling quads/triples isn't enough - to keep two datasets in-step, we 
> also need to record changes to prefixes.  While they don't change the meaning 
> of the data, application developers and users like prefixes.

Boo, hiss, but I can see your point. The worry to me would be the inevitable 
semantic overloading that will come with it. But I guess that cake has already 
been baked by all the other RDF formats except NTriples.

> 2/ Remove the in-data prefixes feature.
> 
> RDF Patch has the feature to define prefixes in the data and use them for 
> prefix names later in the data using @prefix.
> 
> This seems to have no real advantage, it can slow things down (c.f. N-Triples 
> parsing is faster than Turtle parsing - prefixes is part of that), and it 
> generally complicates the data form.
> 
> When including "add"/"delete" prefixes on the dataset (1) it also makes it 
> quite confusing.
> 
> Whether the "R" for "repeat" entry from previous row should also be removed 
> is an open question.

I would agree with removing R, and the reason is that it doesn't remove lines. 
In other words, the abbreviation it offers is pretty minimal. On the other 
hand, it is relatively cheap to implement (4 slots of state) so I wouldn't 
argue very much to remove it.
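
For reference, the kind of saving R offers (reading the v1 document's repeat marker as repeating the term in the same position of the previous row; data invented):

A <http://example/bob> <http://example/name> "Bob" .
A R R "Robert" .

It shortens rows rather than removing them, which is the point above.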

> 3/ Record transaction boundaries.
> 
> (A.3 in RDF Patch v1)
> http://afs.github.io/rdf-patch/#transaction-boundaries
> 
> Having the transaction boundaries recorded means that they can be replayed 
> when applying the patch.  While often a patch will be one transaction, 
> patches can be consolidated by concatenation.
> 
> There 3 operations:
> 
> TB, TC, TA - Transaction Begin, Commit, Abort.
> 
> Abort is useful to include because to know whether a transaction in a patch 
> is going to commit or abort means waiting until the end.  That could be 
> buffering client-side, or buffering server-side (or not writing the patch to 
> a file) and having a means to discard a patch stream.
> 
> Instead, allow a transaction to record an abort, and say that aborted 
> transactions in patches can be discarded downstream.

This is very good stuff. It would be nice to include a definition of 
"transaction-compact" in which no TA may appear. It would enable RDF Patch 
readers to make a very convenient assumption. 

> 4/ Reversibility is a patch feature.
> 
> The RDF Patch v1 document includes "canonical patch" (section 9)
> http://afs.github.io/rdf-patch/#canonical-patches
> 
> Such a patch is reversible (it can undo changes) if the adds and deletes are 
> recorded only if they lead to a real change.  "Add quad" must mean "there was 
> no quad in the set before".  But this only makes sense if the whole patch has 
> this property.
> ...
> What would be useful is to label the patch itself to say whether it is 
> reversible.

Just a thought-- you could change BEGIN to permit "flags". So you could have:

BEGIN REVERSIBLE
patch
patch
patch
END

and you get "canonicity" on a per-transaction level. A patch could optionally 
make explicit its wrapping BEGIN and END for this kind of use.

> 5/ "RDF Git"
> 
> A patch should be able to record where it can be applied.  If RDF Patch is 
> being used to keep two datasets in-step, then some checking to know that the 
> patch can be applied to a copy because it is a patch created from the 
> previous version
> 
> So give each version of the dataset a UUID for a version then record the old 
> ("parent") UUID and the new UUID in the patch.
> ...
> Or some system may want to apply any patch and so create a tree of changes.  
> For the use case of keeping two datasets in-step, that's not what is wanted 
> but other use cases may be better served by having the primary version chain 
> sorted out by higher level software; a patch may be a "proposed change".

Yes, the roaring success of Git (and other DVCS) may imply that letting patches 
be pure changes (not connected to particular versions of the dataset, just 
"isolated" deltas) is the right way to think about them. The word "patch", 
itself, is usefully suggestive. That doesn't mean avoiding any versioning info, 
just making clear that datasets have versions, and the UUIDs associated with a 
given patch refer to where it _came from_, but you can still apply it to 
whatever you want (like cherry-picking Git commits).
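
A sketch of how that could appear in a patch (the record names and UUID placeholders below are illustrative only, not a proposal):

H parent <uuid:...>
H id     <uuid:...>
TB
...
TC

The patch records which state it was computed against, but a consumer remains free to apply it to whatever it wants.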

Or another way to think about it: any dataset is just the sum of a series of 
patches (a random dataset with no history has an implicit history of one 
"virtual" patch with nothing but adds). So those UUIDs are roughly 

Re: RDF Patch - experiences suggesting changes

2016-10-14 Thread Andy Seaborne

Hi Paul,

The general goal of RDF Patch is to be "assembler" for changes, or 
"N-Triples for changes" - and there is no pattern matching capability.


In your example you'd have to know the old value:

TB
QD 
    "r3.xlarge" .
QA 
    "r3.2xlarge" .
TC


(aside: I wonder if instead of the 3/4 rule for triples/quads, a marker 
for the default graph is better so the tuple is always QD and 4 terms.


QD _ 
    "r3.xlarge" .

or have TD, TA
)

On 13/10/16 17:02, Paul Houle wrote:

There is another use case for an "RDF Patch" which applies to
hand-written models.  For instance I have a model which describes a job
that is run in AWS that looks like

@prefix : 
@prefix parameter: 

:Server
   :subnetId "subnet-e0ab0197";
   :baseImage "ami-ea602afd";
   :instanceType "r3.xlarge";
   :keyName "o2key";
   :keyFile "~/AMZN Keys/o2key.ppk" ;
   :securityGroupIds "sg-bca0b2d9" ;
   :todo "dbpedia-load" ;
   parameter:RDF_SOURCE "s3://abnes/dbpedia/2015-10-gz/" ;
   parameter:GRAPH_NAME "http://dbpedia.org/" ;
   :imageCommand "/home/ubuntu/RDFeasy/bin/shred_evidence_and_halt";
   :iamProfile  ;
   :instanceName "Image Build Server";
   :qBase  .

one thing you might want to do is modify it so it uses a different
:baseImage or a different :instanceType and a natural way to do that is
to say

'remove :Server :instanceType ?x and insert :Server :instanceType
"r3.2xlarge"'


SPARQL Update can provide the "pattern matching" (or some subset like 
SparqlPatch [https://www.w3.org/2001/sw/wiki/SparqlPatch]):



DELETE { :Server :instanceType ?x }
INSERT { :Server :instanceType "r3.2xlarge" }
WHERE  { :Server :instanceType ?x }

or

DELETE WHERE { :Server :instanceType ?x }
;
INSERT DATA { :Server :instanceType "r3.2xlarge" }


That said, the one useful addition to RDF Patch which is "pattern 
matching" might be limited bulk delete.


QD   ANY ANY .

because listing all the triples to delete when they can be found from 
the data anyway is a big space saving.


Andy


but better than that if you have a schema that says ":instanceType is a
single valued property" you can write another graph like

:Server
   :instanceType "r3.2xlarge" .

and merge it with the first graph to get the desired effect.

More generally this fits into the theme that "the structure of
commonsense knowledge is that there are rules,  then exceptions to the
rules,  then exceptions to the exceptions of the rules,  etc."
For instance I extracted a geospatial database out of Freebase that was
about 10 million facts and I found I had to add and remove about 10
facts on the route to a 99% success rate at a geospatial recognition
task.  A disciplined approach to "agreeing to disagree" goes a long way
to solve the problem that specific applications require us to split
hairs in different ways.





Re: RDF Patch - experiences suggesting changes

2016-10-13 Thread Paul Houle
There is another use case for an "RDF Patch" which applies to
hand-written models.  For instance I have a model which describes a job
that is run in AWS that looks like

@prefix : 
@prefix parameter: 

:Server
   :subnetId "subnet-e0ab0197";
   :baseImage "ami-ea602afd";
   :instanceType "r3.xlarge";
   :keyName "o2key";
   :keyFile "~/AMZN Keys/o2key.ppk" ;
   :securityGroupIds "sg-bca0b2d9" ;
   :todo "dbpedia-load" ;
   parameter:RDF_SOURCE "s3://abnes/dbpedia/2015-10-gz/" ;
   parameter:GRAPH_NAME "http://dbpedia.org/" ;
   :imageCommand "/home/ubuntu/RDFeasy/bin/shred_evidence_and_halt";
   :iamProfile  ;
   :instanceName "Image Build Server";
   :qBase  .

one thing you might want to do is modify it so it uses a different
:baseImage or a different :instanceType and a natural way to do that is
to say

'remove :Server :instanceType ?x and insert :Server :instanceType
"r3.2xlarge"'

but better than that if you have a schema that says ":instanceType is a
single valued property" you can write another graph like

:Server
   :instanceType "r3.2xlarge" .

and merge it with the first graph to get the desired effect.

More generally this fits into the theme that "the structure of
commonsense knowledge is that there are rules,  then exceptions to the
rules,  then exceptions to the exceptions of the rules,  etc."

For instance I extracted a geospatial database out of Freebase that was
about 10 million facts and I found I had to add and remove about 10
facts on the route to a 99% success rate at a geospatial recognition
task.  A disciplined approach to "agreeing to disagree" goes a long way
to solve the problem that specific applications require us to split
hairs in different ways.



-- 
  Paul Houle
  paul.ho...@ontology2.com

On Thu, Oct 13, 2016, at 11:32 AM, Andy Seaborne wrote:
> I've been using modified RDF Patch for the data exchanged to keep 
> multiple datasets synchronized.
> 
> My primary use case is having multiple copies of the datasets for a high 
> availability solution.  It has to be a general solution for any data.
> 
> There are some changes to the format that this work has highlighted.
> 
> [RDF Patch - v1]
> https://afs.github.io/rdf-patch/
> 
> 
> 1/ Record changes to prefixes
> 
> Just handling quads/triples isn't enough - to keep two datasets in-step, 
> we also need to record changes to prefixes.  While they don't change the 
> meaning of the data, application developers and users like prefixes.
> 
> 2/ Remove the in-data prefixes feature.
> 
> RDF Patch has the feature to define prefixes in the data and use them 
> for prefix names later in the data using @prefix.
> 
> This seems to have no real advantage, it can slow things down (c.f. 
> N-Triples parsing is faster than Turtle parsing - prefixes is part of 
> that), and it generally complicates the data form.
> 
> When including "add"/"delete" prefixes on the dataset (1) it also makes 
> it quite confusing.
> 
> Whether the "R" for "repeat" entry from previous row should also be 
> removed is an open question.
> 
> 3/ Record transaction boundaries.
> 
> (A.3 in RDF Patch v1)
> http://afs.github.io/rdf-patch/#transaction-boundaries
> 
> Having the transaction boundaries recorded means that they can be 
> replayed when applying the patch.  While often a patch will be one 
> transaction, patches can be consolidated by concatenation.
> 
> There 3 operations:
> 
> TB, TC, TA - Transaction Begin, Commit, Abort.
> 
> Abort is useful to include because to know whether a transaction in a 
> patch is going to commit or abort means waiting until the end.  That 
> could be buffering client-side, or buffering server-side (or not writing 
> the patch to a file) and having a means to discard a patch stream.
> 
> Instead, allow a transaction to record an abort, and say that aborted 
> transactions in patches can be discarded downstream.
> 
> 4/ Reversibility is a patch feature.
> 
> The RDF Patch v1 document includes "canonical patch" (section 9)
> http://afs.github.io/rdf-patch/#canonical-patches
> 
> Such a patch is reversible (it can undo changes) if the adds and deletes 
> are recorded only if they lead to a real change.  "Add quad" must mean 
> "there was no quad in the set before".  But this only makes sense if the 
> whole patch has this property.
> 
> RDF Patches are in general entries in a "redo log" - you can apply the 
> patch over and over again and it will end up in the same state (they are 
> idempotent).
> 
> A reversible patch is also an "undo log" entry and if you apply it in 
> reverse order, it acts to undo the patch played forwards.
> 
> Testing whether a triple or quad is already present while performing 
> updates is not cheap - and in some cases where the patch is being 
> computed without reference to an existing dataset may not be possible.
> 
> What would be useful is to label the patch 

Re: RDF 1.1 handling of xsd:sring datatype

2015-09-04 Thread Rob Vesse
RDF 1.1 support is enabled by default in Jena 3.0.0 onwards

Some older versions of Jena 2.x and Fuseki can have this turned on but I
don't remember off hand where the switch is

Also even if you turn the switch on it doesn't affect existing data in TDB
databases because data loaded with RDF 1.1 support enabled is encoded
slightly differently to that loaded without

So basically use Jena 3.x (Fuseki 2.3.x) and reload the data if using a
pre-existing TDB database

Rob

On 04/09/2015 17:13, "Reto Gmür"  wrote:

>Hi all,
>
>I've noticed that with the graph
>
>
>
>"A test
>resource"^^
>.
>
>running the query
>
>ASK {
>
>"A test resource".}
>
>with TDB 1.1.2 or with the Stian's fuseki 2.0.1-SNAPSHOT docker image
>returns false.
>
>Is there are setting to enable RDF 1.1 Literals, so that the above query
>would return true?
>
>Cheers,
>Reto
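
A minimal term-level sketch of the behaviour behind this, using current Jena 3.x class names (which is what Rob is recommending above; the expected output is an assumption about RDF 1.1 mode, not something from the original mails):

import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;

public class StringLiteralCheck {
    public static void main(String[] args) {
        // In RDF 1.1 a simple literal is an xsd:string literal, so these are the same term.
        Node plain = NodeFactory.createLiteral("A test resource");
        Node typed = NodeFactory.createLiteral("A test resource", XSDDatatype.XSDstring);
        // Expected: true with RDF 1.1 semantics, false with RDF-2004 behaviour.
        System.out.println(plain.equals(typed));
    }
}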






Re: RDF 1.1 : Work-in-progress : xsd:string

2014-12-01 Thread Andy Seaborne

Update: jena-core  jena-arq status

If you want to play, set:

JenaRuntime.isRDF11 = true

Do not run on persistent storage yet.


** jena-core:

Notes:
The Literal.getInt etc. operations now cast xsd:strings, not just plain 
literals.  It's the most compatible way forward.  Personally, I think the 
whole casting thing in this area is now wrong (it's an XML-centric 
feature because of the prevalence of <property>123</property>).


Todo-1/
There is one RDF/XML-ABBREV writer thing to sort out - it is not using 
attributes for simple strings because they have type xsd:string now.


** jena-arq:

Todo-2/
ARQ sorts deterministically, even where the SPARQL spec does not require 
it.  The spec requires simple literals before datatype literals.  ARQ 
then sorts by lexical form then by datatype URI.


So in SPARQL/RDF-1.0 :

"5"  "5"^^xsd:integer  "5"^^xsd:string

but RDF-1.1 makes "abc" the same term as "abc"^^xsd:string.

Generally, 1.0-1.1 is about treating xsd:string like simple literals 
(no datatype, no language) so that becomes:


"5"  "5"^^xsd:integer

There are 3 tests that catch this.

Todo-3/
Similarly, it's sorting rdf:langString as a datatype, not as a language 
literal.  2 tests.


Todo-4/
Some of the scripted tests need fixing, including some WG ones which use 
DISTINCT, and now "a" and "a"^^xsd:string are the same when they used to 
be different.


Note: Bonus: TransformFilterEquality can now process the string case as 
an optimization.


Note: Discovery:

SimpleSelector(null, null, 5) does not find 5^^xsd:integer, only 5, 
which I find bizarre in the extreme.  It has always been this way - it's not 
an RDF-1.1-ism.


Andy


Re: RDF 1.1 : Work-in-progress

2014-11-23 Thread Andy Seaborne

On 22/11/14 17:28, Andy Seaborne wrote:

 From the commits list, and JENA-816, you may have noticed I've started
working through the RDF 1.1 isms.

There is a flag in JenaRuntime to control the system mode and a test in
the jena-core to stop that building with the RDF 1.1 mode set.

The test suites don't yet pass in RDF 1.1 mode - there is code that
makes reasonably but RDF 1.0 assumptions about xsd:string not being the
same as simple literals, and datatype code not being aware that language
tags imply a datatype of rdf:langString.

No work on the persistent storage implications yet.

 Andy



The set of commits I've just pushed is part of cleaning up 
Literal(Label) creation.  It removes the use of defaults (where lang is 
"" or null, or the datatype is null) and makes a call to a factory 
method: create(String, String), create(String, Datatype) or 
create(String).  This reduces the use of consta


In RDF 1.1, there is always a datatype and while the code all works, it 
is less clear when the caller says datatype = null and the created 
Node has datatype not null.


The original (String, String, Datatype) operations are still there.  Any 
existing code is unaffected in RDF 1.0.


Andy



Re: RDF Thrift for Jena

2014-09-04 Thread Rob Vesse
Thanks Andy,

I have started experimenting, more on that to follow

Rob

On 31/08/2014 15:36, Andy Seaborne a...@apache.org wrote:

On 26/08/14 21:20, Andy Seaborne wrote:
 I've been working on a binary format for RDF and SPARQL result sets:

 http://afs.github.io/rdf-thrift/

 This is now ready to go if everyone is OK with that.

 I'm flagging this up for passive consensus because it adds a new
 dependency (for Apache Thrift).

 And of course any questions or comments.

 Summary, as an RDF syntax:

 + x3 faster to parse than N-triples
 + same size as N-triples, and same compression effects with gzip (8-10
 compression).
 + Not much additional work to add because Thrift does most of the work.

  Andy

Migration done (JENA-774).  Some cleaning up to do (putting classes in
more logical places mostly) but tests in and passing.

   Andy







Re: RDF Thrift for Jena

2014-09-04 Thread Andy Seaborne
Cool - my first attempt at write speed testing suggested it was about 
the same as N-triples.


Write performance testing is harder (!!!) because you need a big enough 
source of data to run without the source itself affecting the numbers.


N-Triples writing has always been faster than reading - it's much closer 
to push strings straight into the output with no single character 
mangling most of the time.


From looking at the thrift implementation, it has to do small 
char-byte conversions.


It may be faster to not use Java's native converter (which involves a 
copy) but to do direct chars -> output stream using BlockUTF8.


When I last tested, BlockUTF8 was faster for strings < ~100 characters 
but after that Java JDK was faster for larger.


Andy

On 04/09/14 10:05, Rob Vesse wrote:

Thanks Andy,

I have started experimenting, more on that to follow

Rob

On 31/08/2014 15:36, Andy Seaborne a...@apache.org wrote:


On 26/08/14 21:20, Andy Seaborne wrote:

I've been working on a binary format for RDF and SPARQL result sets:

http://afs.github.io/rdf-thrift/

This is now ready to go if everyone is OK with that.

I'm flagging this up for passive consensus because it adds a new
dependency (for Apache Thrift).

And of course any questions or comments.

Summary, as an RDF syntax:

+ x3 faster to parse than N-triples
+ same size as N-triples, and same compression effects with gzip (8-10
compression).
+ Not much additional work to add because Thrift does most of the work.

  Andy


Migration done (JENA-774).  Some cleaning up to do (putting classes in
more logical places mostly) but tests in and passing.

Andy










Re: RDF Thrift for Jena

2014-09-04 Thread Rob Vesse
Comments inline:

On 04/09/2014 10:34, Andy Seaborne a...@apache.org wrote:

Cool - my first attempt at write speed testing suggested it was about
the same as N-triples.

Write performance testing is harder (!!!) because you need a big enough
source of data to run without the source itself affecting the numbers.

Personally my approach is to use a machine with enough RAM to hold the
source data entirely in memory (first parsed into relevant data structures
e.g. Dataset/Model) and then just write to disk from memory


N-Triples writing has always been faster than reading - it's much closer
to push strings straight into the output with no single character
mangling most of the time.

Yes I have NTriples as one of the fastest writers in the tests I've run so
far.  It is data dependent though as for some source data it is equivalent
to RDF Thrift in performance

Rob


 From looking at the thrift implementation, it has to do small
char-byte conversions.

It maybe faster to not use Java's native converter (which involves a
copy) but to do direct chars - output stream using BlockUTF8.

When I last tested, BlockUTF8 was faster for strings ~100 characters
but after that Java JDK was faster for larger.

   Andy

On 04/09/14 10:05, Rob Vesse wrote:
 Thanks Andy,

 I have started experimenting, more on that to follow

 Rob

 On 31/08/2014 15:36, Andy Seaborne a...@apache.org wrote:

 On 26/08/14 21:20, Andy Seaborne wrote:
 I've been working on a binary format for RDF and SPARQL result sets:

 http://afs.github.io/rdf-thrift/

 This is now ready to go if everyone is OK with that.

 I'm flagging this up for passive consensus because it adds a new
 dependency (for Apache Thrift).

 And of course any questions or comments.

 Summary, as an RDF syntax:

 + x3 faster to parse than N-triples
 + same size as N-triples, and same compression effects with gzip (8-10
 compression).
 + Not much additional work to add because Thrift does most of the
work.

   Andy

 Migration done (JENA-774).  Some cleaning up to do (putting classes in
 more logical places mostly) but tests in and passing.

 Andy












Re: RDF Thrift for Jena

2014-09-02 Thread Andy Seaborne
Thrift is 3 layers: a service model, an encoding layer and a handling of 
the bytes in/out.  RDF Thrift is not using the service layer; I'm using 
it elsewhere (Lizard) and it is just fine - it's simpler than netty for 
a tightly bound system.


Java Thrift only has dependencies
  org.apache.httpcomponents:httpcore
  org.apache.commons:commons-lang3

and the httpcore part is for Thrift over HTTP (TServlet - thrift-encoded 
RPC over HTTP).


Avro is the system to look at if you want encoding schema evolution.

Andy

On 01/09/14 23:25, Stian Soiland-Reyes wrote:

Thanks for your clarifications, don't worry I am now officially relieved! :)

I am sorry for being that versioning guy - I guess I've had too many
bad experiences trying to manage dependencies of dependencies of
dependencies over the years.. (even down to having our own class
loader mechanism...!)

I check now and see that Apache Thrift is in fact a long-running
project that, although still evolving, seems to do things the
right way.

If I understand it right (just clicking through the Thrift
documentation) it seems it would mainly be the code-generation step
from the Thrift IDLs that would be suspectible to change - which is
not very different from the situation with XSDs and JAXB-API,
and thus less of a concern for users of Jena which might themselves
also (indirectly) use a newer Thrift version.



On 1 September 2014 22:40, Andy Seaborne a...@apache.org wrote:

On 01/09/14 19:57, Stian Soiland-Reyes wrote:


Sounds proper enough :) with a binary format obviously one has to be very
careful about any changes, but I was more thinking of versioning of the
API
of Apache Thrift that your module would use through dependencies.



Same applies to text forms. Their strength is that they are W3C standards.
If that is of paramount important, then possibly RDF 1.1 N-Quads is the
best choice because it is fixed.



If I was to use Jena 1.14.0 depending on Apache Thrift say 0.6.0, but
instead also depended on  (something that depends on) a newer Apache
Thrift
0.9.0, have that project committed themselves to semantic versioning so
that this would still in theory work? E.g. not deleting or breaking
existing API signatures (adding is ok)



I'm not seeing that RDF Thrift is different from anything else in terms of
versioning.  You always have the issue that dependency versions might
conflict.  At some level, you have to judge the community - at least with
open source you have that option, as well as taking the source and keeping
what you need.  Archive a record of the dependencies you use!  (if you trust
maven central - Apache projects do not make releases that depend on anything
not maven central - no dependencies on obscure transient jar repos!)

Incremental versioning had better work as Apache Thrift depends on an
earlier version of org.apache.httpcomponents:httpcore (v 4.2.4) and Jena
currently uses 4.2.6.

We have:

<dependency>
   <groupId>org.apache.thrift</groupId>
   <artifactId>libthrift</artifactId>
   <version>0.9.1</version>
   <exclusions>
     <!-- Use whatever version Jena is using -->
     <exclusion>
       <groupId>org.apache.httpcomponents</groupId>
       <artifactId>httpcore</artifactId>
     </exclusion>
   </exclusions>
</dependency>



In theory it should not make anything fall over unless you tried to use
the
Jena Thrift serialization.. but that depends on how it is wired in. In
RIOT
the standard language serializers are hardcoded somewhere, right?



Wired in but not hard coded.  They have never been hardcoded but it was
quite hard to rewire.  Think of them as a standard library of things to use
if you want to.

Now RIOT has registries (for parsers, for writers, for stream writers, then
registries for SPARQL Result Sets readers and writers) which have a set of
languages included and set up but you can remove one, replace one or add one
(RDF Thrift and JSON-LD were developed outside Jena and wired in at run time
until they moved into RIOT when stable).  Or call any code you like and put
the outcome into a graph/dataset.

 Andy




On 1 Sep 2014 09:35, Andy Seaborne a...@apache.org wrote:


On 31/08/14 19:03, Stian Soiland-Reyes wrote:


How have you tested this for IRIs and international characters in
literals?
sorry, I am out travelling and have not checked the code yet.. :)



Yes.

Thrift encodes strings as UTF-8.

The wire form of an IRI is a tagged string:
http://afs.github.io/rdf-thrift/rdf-binary-thrift.html

struct RDF_IRI {
1: required string iri
}

   The new dependency on Apache Thrift would be my main concern if this is


not
in a separate module. How stable are Thrift APIs?E.g. do they follow
semantic versioning so that a Jena build will work with a newer Thrift
version (with same major)?



Stronger than that - Thrift cares a lot about wire/storage format
compatibility because of the large scale of deployments in which it's
used.

A system wide, cross-language change of format simply isn't practical. It
> would have to be a parallel evolution.

Re: RDF Thrift for Jena

2014-09-01 Thread Andy Seaborne

To easily use the new format, use riot with the --out argument

e.g. convert n-triples to RDF Thrift:

 riot --out=TRDF data.nt > data.trdf

Andy
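
Programmatically, the same conversion is a couple of calls; a sketch using the current RIOT class names (org.apache.jena.riot; the file names are placeholders):

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.RDFFormat;

public class ThriftRoundTrip {
    public static void main(String[] args) throws Exception {
        Model model = ModelFactory.createDefaultModel();
        RDFDataMgr.read(model, "data.nt");                       // N-Triples in, format chosen by extension
        try (OutputStream out = new FileOutputStream("data.trdf")) {
            RDFDataMgr.write(out, model, RDFFormat.RDF_THRIFT);  // RDF Thrift out
        }
        Model back = RDFDataMgr.loadModel("data.trdf", Lang.RDFTHRIFT);
        System.out.println(back.isIsomorphicWith(model));        // expect true
    }
}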



Re: RDF Thrift for Jena

2014-09-01 Thread Stian Soiland-Reyes
Sounds proper enough :) with a binary format obviously one has to be very
careful about any changes, but I was more thinking of versioning of the API
of Apache Thrift that your module would use through dependencies.

If I was to use Jena 1.14.0 depending on Apache Thrift say 0.6.0, but
instead also depended on (something that depends on) a newer Apache Thrift
0.9.0, has that project committed itself to semantic versioning so
that this would still in theory work? E.g. not deleting or breaking
existing API signatures (adding is ok)

In theory it should not make anything fall over unless you tried to use the
Jena Thrift serialization.. but that depends on how it is wired in. In RIOT
the standard language serializers are hardcoded somewhere, right?
On 1 Sep 2014 09:35, Andy Seaborne a...@apache.org wrote:

 On 31/08/14 19:03, Stian Soiland-Reyes wrote:

 How have you tested this for IRIs and international characters in
 literals?
 sorry, I am out travelling and have not checked the code yet.. :)


 Yes.

 Thrift encodes strings as UTF-8.

 The wire form of an IRI is a tagged string:
 http://afs.github.io/rdf-thrift/rdf-binary-thrift.html

 struct RDF_IRI {
 1: required string iri
 }

  The new dependency on Apache Thrift would be my main concern if this is
 not
 in a separate module. How stable are Thrift APIs?E.g. do they follow
 semantic versioning so that a Jena build will work with a newer Thrift
 version (with same major)?


 Stronger than that - Thrift cares a lot about wire/storage format
 compatibility because of the large scale of deployments in which it's used.

 A system wide, cross-language change of format simply isn't practical. It
 would have to be a parallel evolution.

 See their discussion of adding the union type - on the wire its a struct
 of one element (i.e. each element is 'optional') and union-ness is provided
 by the encode/decode.  Old implementations that are not aware of union
 still work.

 What is open (but closing) is whether the RDF encoding is the right one.
 Evidence from real use is always going to be valuable.

 Andy

  On 31 Aug 2014 15:37, Andy Seaborne a...@apache.org wrote:

  On 26/08/14 21:20, Andy Seaborne wrote:

  I've been working on a binary format for RDF and SPARQL result sets:

 http://afs.github.io/rdf-thrift/

 This is now ready to go if everyone is OK with that.

 I'm flagging this up for passive consensus because it adds a new
 dependency (for Apache Thrift).

 And of course any questions or comments.

 Summary, as an RDF syntax:

 + x3 faster to parse than N-triples
 + same size as N-triples, and same compression effects with gzip (8-10
 compression).
 + Not much additional work to add because Thrift does most of the work.

   Andy


 Migration done (JENA-774).  Some cleaning up to do (putting classes in
 more logical places mostly) but tests in and passing.

  Andy







Re: RDF Thrift for Jena

2014-08-27 Thread Rob Vesse
Andy

I assume the intent is to add this into Apache Jena as a new module?

Rob

On 26/08/2014 21:20, Andy Seaborne a...@apache.org wrote:

I've been working on a binary format for RDF and SPARQL result sets:

http://afs.github.io/rdf-thrift/

This is now ready to go if everyone is OK with that.

I'm flagging this up for passive consensus because it adds a new
dependency (for Apache Thrift).

And of course any questions or comments.

Summary, as an RDF syntax:

+ x3 faster to parse than N-triples
+ same size as N-triples, and same compression effects with gzip (8-10
compression).
+ Not much additional work to add because Thrift does most of the work.

   Andy






Re: RDF Thrift for Jena

2014-08-27 Thread Benson Margulies
I'm curious if you compared this to SMILE.

On Wed, Aug 27, 2014 at 4:26 AM, Rob Vesse rve...@dotnetrdf.org wrote:
 Andy

 I assume the intent it to add this into Apache Jena as a new module?

 Rob

 On 26/08/2014 21:20, Andy Seaborne a...@apache.org wrote:

I've been working on a binary format for RDF and SPARQL result sets:

http://afs.github.io/rdf-thrift/

This is now ready to go if everyone is OK with that.

I'm flagging this up for passive consensus because it adds a new
dependency (for Apache Thrift).

And of course any questions or comments.

Summary, as an RDF syntax:

+ x3 faster to parse than N-triples
+ same size as N-triples, and same compression effects with gzip (8-10
compression).
+ Not much additional work to add because Thrift does most of the work.

   Andy






Re: RDF Thrift for Jena

2014-08-27 Thread Andy Seaborne

On 27/08/14 11:16, Benson Margulies wrote:

I'm curious if you compared this to SMILE.


Reference?

I wrote a comparison of RDF HDT, Sesame's binary format and RDF Thrift:

http://lists.w3.org/Archives/Public/semantic-web/2014Aug/0049.html

Summary:
1/ Dictionaries blow up for scale
2/ Using Thrift means it's much less work to implement.

Andy



Re: RDF Thrift for Jena

2014-08-27 Thread Andy Seaborne

On 27/08/14 09:26, Rob Vesse wrote:

Andy

I assume the intent it to add this into Apache Jena as a new module?


It wasn't my plan because it's quite small but I'm open to suggestions. 
 Pros and cons?


https://github.com/afs/rdf-thrift/tree/master/src/main/java/org/apache/jena/riot/binary

5 files and thrift/ which is the Thrift-generated part.

Given the general tendency to not get the right jars in the classpath, 
any sort of ServiceLoader games, or directly using reflection, don't help.


Andy



Rob

On 26/08/2014 21:20, Andy Seaborne a...@apache.org wrote:


I've been working on a binary format for RDF and SPARQL result sets:

http://afs.github.io/rdf-thrift/

This is now ready to go if everyone is OK with that.

I'm flagging this up for passive consensus because it adds a new
dependency (for Apache Thrift).

And of course any questions or comments.

Summary, as an RDF syntax:

+ x3 faster to parse than N-triples
+ same size as N-triples, and same compression effects with gzip (8-10
compression).
+ Not much additional work to add because Thrift does most of the work.

Andy









Re: RDF Thrift for Jena

2014-08-27 Thread Andy Seaborne

On 26/08/14 23:13, Stephen Owens wrote:

Very cool Andy. How's the write performance?



Good question - not completely sure yet - I'm not expecting a huge gain.

It's easier to test read performance and read performance was what I was 
more interested in initially.


For read, you can parse and send the result to the moral equivalent of 
/dev/null so producing a big file and timing gets the numbers.  riot 
does this already.
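
For example, something along these lines (the --time and --sink flags are as in current riot; check riot --help for the version in hand):

 riot --time --sink data.trdf
 riot --time --sink data.nt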


For writing, you need a generator capable of going faster than the 
output writer and that needs to be such that it itself isn't slowing the 
computer down.


Writing N-triples is currently faster than parsing - parsing is fiddling 
about with character-by-character checking for the markers like '<' and 
'>'.   Length encoded structures and someone else's tuned code beats 
that easily.  Just need to get the bytes -> java characters going 
efficiently, which is also the trick for parsing N-Triples.  Writing 
does not need such a copy-heavy/single-character manipulation code path 
and you can output the strings directly to a write buffer, 
still checking for escape sequences on literals (single character 
operations - yuk).


Writing Thrift is building data structures and outputting them, but no escape 
sequence checking is needed.


Andy



Re: RDF Thrift for Jena

2014-08-27 Thread Benson Margulies
Smile is the binary format for Json that comes in Jackson. Since the
transformation from RDF to Json is pretty simple, it occurred to me to
wonder how well that stack would, well, stack up.


On Wed, Aug 27, 2014 at 6:23 AM, Andy Seaborne a...@apache.org wrote:
 On 27/08/14 11:16, Benson Margulies wrote:

 I'm curious if you compared this to SMILE.


 Reference?

 I wrote a comparison of RDF HDT, Sesame's binary format and RDF Thrift:

 http://lists.w3.org/Archives/Public/semantic-web/2014Aug/0049.html

 Summary:
 1/ Dictionaries blow up for scale
 2/ Using Thrift means it's much less work to implement.

 Andy



Re: RDF Thrift for Jena

2014-08-27 Thread Andy Seaborne



On 27/08/14 11:58, Benson Margulies wrote:

Smile is the binary format for Json that comes in Jackson. Since the
transformation from RDF to Json is pretty simple, it occurred to me to
wonder how well that stack would, well, stack up.


Oh, that SMILE :-)

I haven't tried for this but it's going to need encoding RDF terms into 
binary JSON (c.f. RDF/JSON or JSON-LD or SPARQL Results in JSON) which 
adds a layer of complexity to the process.   I'd guess it's going to be 
a measurable time cost with more Java objects being churned to do the 
parsing: bytes -> JSON java objects, JSON java objects -> Jena Nodes/Triples.


RDF Thrift takes the string directly off the wire and builds Jena RDF 
objects. The encoding puts the RDF terms directly onto the wire with 
little overhead.


A nice feature of RDF Thrift files for graphs and datasets is that they 
can be concatenated like N-triples files.  JSON structures can't.


The converse is possible - there is a JSON version of Apache Thrift so 
that could be SMILE'ed.  Seems like a lot of layers though.


Andy




On Wed, Aug 27, 2014 at 6:23 AM, Andy Seaborne a...@apache.org wrote:

On 27/08/14 11:16, Benson Margulies wrote:


I'm curious if you compared this to SMILE.



Reference?

I wrote a comparison of RDF HDT, Sesame's binary format and RDF Thrift:

http://lists.w3.org/Archives/Public/semantic-web/2014Aug/0049.html

Summary:
1/ Dictionaries blow up for scale
2/ Using Thrift means it's much less work to implement.

 Andy





Re: RDF Thrift for Jena

2014-08-26 Thread Stephen Owens
Very cool Andy. How's the write performance?


On Tue, Aug 26, 2014 at 4:20 PM, Andy Seaborne a...@apache.org wrote:

 I've been working on a binary format for RDF and SPARQL result sets:

 http://afs.github.io/rdf-thrift/

 This is now ready to go if everyone is OK with that.

 I'm flagging this up for passive consensus because it adds a new
 dependency (for Apache Thrift).

 And of course any questions or comments.

 Summary, as an RDF syntax:

 + x3 faster to parse than N-triples
 + same size as N-triples, and same compression effects with gzip (8-10
 compression).
 + Not much additional work to add because Thrift does most of the work.

 Andy




-- 
Regards,

*Stephen Owens*
CTO, ThoughtWire
t 647.351.9473 ext.104  I  m 416.697.3466


Re: RDF 1.1 -- changes to plain literals

2013-10-15 Thread Simon Helsen
Hi all,

regarding some sort of migration utility, wouldn't that be a must? Or is 
the expectation that all previously built databases are thrown out and 
recreated?

From our point of view, we've usually had ways to recreate TDB databases, 
but the cost can be enormous (depending on the size of the DB). I would 
think a migration utility would be able to convert a database much faster

Simon





From:
Andy Seaborne a...@apache.org
To:
dev@jena.apache.org, 
Date:
10/14/2013 10:41 AM
Subject:
Re: RDF 1.1 -- changes to plain literals





On 14/10/13 09:11, Rob Vesse wrote:
 Andy

 Thanks for the great overview, I've been looking at supporting this on
 dotNetRDF as well lately so have been thinking much along the same 
lines.

 I think the "check language first" needs to be emphasized in messaging to
 users about this change, dotNetRDF has the same issue and I've seen
 recently that Sesame was also affected by this.  Therefore I think we 
need
 to be clear about the need for this change in usage.

 My feeling is we should make this a configurable behavior, the default
 going forward should be RDF 1.1 but it would be nice if users could 
toggle
 that back to RDF-2004 behaviors if they need to produce data for older
 systems.

Some way of reverting to old behaviour would be good.  As long as it's 
system-wide I don't foresee any problems.  On a per graph basis would be 
very hard; on a per parser run is possible but does not catch API 
created data.

Once data has passed through in RDF 1.1 mode and written to file, 
whether database or syntax written to disk, it gets confusing to 
mix-and-match and go back to RDF-2004 style.

There is reasonable need for some compatibility style, then, yes, let's 
put it in.

One thing I think is worth avoiding is too much speculative 
compatibility (i.e. guessing!), like putting in all variations of Node 
creation into NodeFactory as different factory methods.  These tend to 
end up with a life beyond the transition.

 On the database side particularly for TDB would it be feasible to 
produce
 a migration utility which would check a database to see if it is 
affected
 and if so produce a migrated version of the database?

Backup to N-Quads in RDF-2004 style, update software and restore in RDF 
1.1 style will work and it will leave a backup should the deployment 
wish to reverse the process.

A special utility to convert TDB databases would be possible by looking 
in the node table for explicit xsd:strings, then looking in the indexes 
for the internal value of term and changing it (delete-add).

Doing a backup first is a good thing (tm) at that point anyway.

It would be an offline process as it is munging the internal tables 
directly.  A transactional version is also doable but each layer of 
complexity increases the risk of getting it wrong in some corner case. 
A special utility has the disadvantage of not being well-used so at risk 
of bugs.

So, currently, I would want to see a significant need for this before 
embarking on something other than backup-upgrade-restore.

 Andy


 Rob





Re: RDF 1.1 -- changes to plain literals

2013-10-14 Thread Andy Seaborne



On 14/10/13 09:11, Rob Vesse wrote:

Andy

Thanks for the great overview, I've been looking at supporting this on
dotNetRDF as well lately so have been thinking much along the same lines.

I think the "check language first" needs to be emphasized in messaging to
users about this change, dotNetRDF has the same issue and I've seen
recently that Sesame was also affected by this.  Therefore I think we need
to be clear about the need for this change in usage.
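
Concretely, the pattern being recommended is roughly this; a sketch against the Jena Model API (class and method names as in current Jena, not tied to the versions discussed here):

import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.rdf.model.Literal;

public class LiteralKind {
    // Under RDF 1.1 every literal has a datatype, so check the language tag first.
    static String classify(Literal lit) {
        if (!lit.getLanguage().isEmpty())
            return "language-tagged string (rdf:langString)";
        if (XSDDatatype.XSDstring.getURI().equals(lit.getDatatypeURI()))
            return "string (a plain literal in RDF-2004 terms)";
        return "other typed literal";
    }
}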

My feeling is we should make this a configurable behavior, the default
going forward should be RDF 1.1 but it would be nice if users could toggle
that back to RDF-2004 behaviors if they need to produce data for older
systems.


Some way of reverting to old behaviour would be good.  As long as it's 
system-wide I don't foresee any problems.  On a per graph basis would be 
very hard; on a per parser run is possible but does not catch API 
created data.


Once data has passed through in RDF 1.1 mode and written to file, 
whether database or syntax written to disk, it gets confusing to 
mix-and-match and go back to RDF-2004 style.


There is reasonable need for some compatibility style, then, yes, let's 
put it in.


One thing I think is worth avoiding is too much speculative 
compatibility (i.e. guessing!), like putting in all variations of Node 
creation into NodeFactory as different factory methods.  These tend to 
end up with a life beyond the transition.



On the database side particularly for TDB would it be feasible to produce
a migration utility which would check a database to see if it is affected
and if so produce a migrated version of the database?


Backup to N-Quads in RDF-2004 style, update software and restore in RDF 
1.1 style will work and it will leave a backup should the deployment 
wish to reverse the process.
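
In command-line terms the backup-upgrade-restore route is roughly (a sketch; database locations and file names are placeholders):

 tdbdump --loc=DB > backup.nq      # dump with the old Jena version
 # upgrade Jena, then load into a fresh database
 tdbloader --loc=DB2 backup.nq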


A special utility to convert TDB databases would be possible by looking 
in the node table for explicit xsd:strings, then looking in the indexes 
for the internal value of term and changing it (delete-add).


Doing a backup first is a good thing (tm) at that point anyway.

It would be an offline process as it is munging the internal tables 
directly.  A transactional version is also doable but each layer of 
complexity increases the risk of getting it wrong in some corner case. 
A special utility has the disadvantage of not being well-used so at risk 
of bugs.


So, currently, I would want to see a significant need for this before 
embarking on something other than backup-upgrade-restore.


Andy



Rob




Re: RDF-patch : doc updated

2013-08-20 Thread Rob Vesse
Hey Jakob

Thanks for the feedback, some comments inline:


On 8/20/13 7:29 AM, Jakob Frank ja...@apache.org wrote:

Hi Andy, all

I had a quick look through the document - and overall I like the clean
and simple approach of the proposed format.

Here are my first comments:
- HTTP PATCH targets a resource - IMHO it should be allowed that the
server limits changes to this addressed resource. (the illustrative
example in the doc modifies two resources)

It depends what the resource identifies, the HTTP resource could easily be
a SPARQL Graph store in which case modifying multiple different RDF
resources is perfectly fine IMO.

Possibly making this a MAY constraint in the specification would make
sense.


- if you continue the resource-centered approach, you could allow to
skip the subject in the patch file. (but then: how to distinguish
between POC and SPO statements?)

Interesting thought, though I think we likely want to allow modifying many
resources per my above comment so having this mechanism doesn't
necessarily make sense.  If we did this to avoid the triple/quad
distinction we would need to properly put some up front definition


- what about allowing wildcards for deletion? e.g.
D <http://example/bob> foaf:name ?x .
to delete all foaf:names for ex:bob

This makes the language more query like, the intention was very much to
just stream a set of specific changes not to be able to express more
update language style constructs.


- is the ordering of the statements significant? i.e. what is the
result of the following patch:
D <http://example/s> <http://example/p> <http://example/o> .
A <http://example/s> <http://example/p> <http://example/o> .
is it different to the result of
A <http://example/s> <http://example/p> <http://example/o> .
D <http://example/s> <http://example/p> <http://example/o> .

Yes the ordering is significant, it is a streaming format by design so
changes MUST be processed in order.

Thanks,

Rob

ps. cc'd dev@jena.apache.org so people interested there can also see your
feedback



Best,
Jakob

On 12 August 2013 11:23, Andy Seaborne a...@apache.org wrote:
 On 11/08/13 18:07, Andy Seaborne wrote:

 Rob, all,


 Wrong dev@ :-(

 But your welcome to comment and make suggestions :-)

 The doc is:

 http://afs.github.io/rdf-patch/

 Andy



 I've made some changes : I've moved the discussion of features to an
 appendix and added some possibilities for some of these items.

 http://afs.github.io/rdf-patch/#notes

  A.1 Line Mode
  A.2 Metadata
  A.2.1 Linking
  A.2.2 Inline
  A.3 Transaction Boundaries
  A.4 Alternative Blank Node Syntax
  A.5 Alignment Errors
  A.6 Binary Format

 - - - - -

 Prompted by
 https://twitter.com/pdxleif/status/366267325818736640

 Where should we encourage discussion in going to a wider audience?

 One possibility is github, using the issues area of the git repo .  But
 you have to know where to look and the semweb world is still quite
 mailing-list driven.

 public-sparql-...@w3.org makes some sense (it's not a high volume
list).
   Other more general lists like semantic-...@w3.org have their pros and
 cons.

  Andy





Re: RDF-patch : doc updated

2013-08-20 Thread Andy Seaborne

On 20/08/13 18:35, Rob Vesse wrote:

Hey Jakob

Thanks for the feedback, some comments inline:


On 8/20/13 7:29 AM, Jakob Frank ja...@apache.org wrote:


Hi Andy, all


Hi Jakob,



I had a quick look through the document - and overall I like the clean
and simple approach of the proposed format.

Here are my first comments:
- HTTP PATCH targets a resource - IMHO it should be allowed that the
server limits changes to this addressed resource. (the illustrative
example in the doc modifies two resources)


It depends what the resource identifies, the HTTP resource could easily be
a SPARQL Graph store in which case modifying multiple different RDF
resources is perfectly fine IMO.

Possibly making this a MAY constraint in the specification would make
sense.


It's only one possible use case of how RDF patch might be used.  In 
section 8.2, the example is triples - a graph.  There is no reason why 
further restrictions of a general format can't be introduced for 
specific usages.  For example, only allowing A records for an 
append-only target.


Any server can always add constraints and/or simply decide not to 
execute any request (although checking request wide constraints can 
conflict with streaming).


Unlike Talis changesets, there is no assumption that the web resource - the 
target of the action - is the subject of all triples at that location.


The idea that the web resource naming hierarchy aligns to graph primary 
subject is quite weak.  When LDP gets to containers with inline members, 
multiple subjects in one web resource occur.


Try adding a new member for the LDP-C in example 8 of
http://www.w3.org/TR/ldp/#informative-1 !


- if you continue the resource-centered approach, you could allow to
skip the subject in the patch file. (but then: how to distinguish
between POC and SPO statements?)



Interesting thought, though I think we likely want to allow modifying many
resources per my above comment so having this mechanism doesn't
necessarily make sense.  If we did this to avoid the triple/quad
distinction we would need to properly put some up front definition



- what about allowing wildcards for deletion? e.g.
D http://example/bob foaf:name ?x .
to delete all foaf:names for ex:bob


This makes the language more query like, the intention was very much to
just stream a set of specific changes not to be able to express more
update language style constructs.


The restricted DELETE WHERE functionality can be useful.  I'd use * 
(or the word ANY) to avoid variable naming


D <http://example/bob> foaf:name ANY .

Variables get interesting:

D ?x foaf:knows ?x .

:-)

Andy





- is the ordering of the statements significant? i.e. what is the
result of the following patch:
D http://example/s http://example/p http://example/o .
A http://example/s http://example/p http://example/o .
is it different to the result of
A http://example/s http://example/p http://example/o .
D http://example/s http://example/p http://example/o .


Yes the ordering is significant, it is a streaming format by design so
changes MUST be processed in order.

Thanks,

Rob

ps. cc'd dev@jena.apache.org so people interested there can also see your
feedback




Best,
Jakob

On 12 August 2013 11:23, Andy Seaborne a...@apache.org wrote:

On 11/08/13 18:07, Andy Seaborne wrote:


Rob, all,



Wrong dev@ :-(

But your welcome to comment and make suggestions :-)

The doc is:

http://afs.github.io/rdf-patch/

 Andy




I've made some changes : I've moved the discussion of features to an
appendix and added some possibilities for some of these items.

http://afs.github.io/rdf-patch/#notes

  A.1 Line Mode
  A.2 Metadata
  A.2.1 Linking
  A.2.2 Inline
  A.3 Transaction Boundaries
  A.4 Alternative Blank Node Syntax
  A.5 Alignment Errors
  A.6 Binary Format

- - - - -

Prompted by
https://twitter.com/pdxleif/status/366267325818736640

Where should we encourage discussion in going to a wider audience?

One possibility is github, using the issues area of the git repo .  But
you have to know where to look and the semweb world is still quite
mailing-list driven.

public-sparql-...@w3.org makes some sense (it's not a high volume
list).
   Other more general lists like semantic-...@w3.org have their pros and
cons.

  Andy









Re: [RDF Patch] Looking at Talis Changesets and other proposals.

2013-07-31 Thread Andy Seaborne

It'll be worth while expanding on the streaming and scalability points.

This metadata is a bit complicated: I've fallen into some traps here.

The RDF patch format is not an RDF serialization.  Blank node labels 
work differently so embedding N-Triples or Turtle isn't so automatic. 
An RDF patch parser needing a Turtle parser seems a bit heavy.


It could be considered as a string (block escaping? c.f. CDATA), to be sent 
off to a real RDF language parser.  While a combined RDF-patch+Turtle is 
easy for RIOT (same tokenizer so no issues of read-ahead grabbing tokens 
from the other language), it's not normal for it to be easy to have 
mixed languages when using parser generators.


And <> isn't a sensible way to refer to this change because (common SW 
issue) it really means where the copy of this document came from.


The change itself needs a unique name like a UUID so it's the same 
wherever the copy is obtained from.


We could have link headers, rather than inline metadata, except they can 
get broken and not accessible at the time of access.


If this is an area where there is doubt (thinking to do, choices to be 
made), then I think it's better to put the more speculative stuff in a separate 
section and keep the core document simple and stable.  But it's worthwhile 
putting it in the doc as something is needed.


Andy


On 30/07/13 16:56, Rob Vesse wrote:

Andy

I am familiar with Talis Changesets having used them heavily in my PhD
research.

My concerns are much the same as yours in that Changesets really don't
scale well.  The other big problem is that since they are RDF graphs they
are unordered, since one cannot rely on a serializer/parser producing the
data in the same order as was originally intended, especially if you start
crossing boundaries between different toolkits/APIs.  This makes them
effectively useless as a streaming patch format unless you send a stream
of small changesets; this however adds copious overhead to a format
intended for speed and simplicity.

Perhaps more simply you can do the following

#METADATA
<> rp:create [ foaf:name "Andy" ; foaf:orgURL <http://jena.apache.org> ] ;
   rp:createdDate "2013-07-30"^^xsd:date ;
   rdfs:comment "A valid Turtle graph" .
#METADATA

The #METADATA is used to denote the start/end of a metadata block (which
ideally we permit only at the start of the patch).  This can then be
easily discarded by line oriented processors since if you see #METADATA
you just throw away all subsequent lines until you see #METADATA again.
Within the metadata block you could allow full blown Turtle or restrict to
a simpler tuple format if preferable?
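(A rough sketch of that line-oriented discard, purely illustrative; the row
handler and the exact marker handling are assumptions, not part of the proposal:)

import java.io.BufferedReader;
import java.io.IOException;
import java.util.function.Consumer;

// Sketch: drop everything between a pair of #METADATA markers and pass
// all other lines on to the patch row processor.
public class SkipMetadata {
    public static void process(BufferedReader in, Consumer<String> rowHandler) throws IOException {
        boolean inMetadata = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().equals("#METADATA")) {
                inMetadata = !inMetadata;   // toggles on both the opening and closing marker
                continue;
            }
            if (!inMetadata) {
                rowHandler.accept(line);    // an A/D/@prefix row for the real processor
            }
        }
    }
}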

Is it worth adding a comparison to alternative approaches as an Appendix
to the RDF patch proposal?

Rob


On 7/30/13 7:49 AM, Andy Seaborne a...@apache.org wrote:


Rob, all,

Leigh Dodds expressed a preference for Talis Changesets for patches.  I
have tried to analyse their pros and cons.

For me, the scale issue alone makes changesets the wrong starting point.
  They really solve a different problem of managing some remote data
with small, incremental changes.

It would be useful to add to RDF patch the ability to have metadata
about the change itself.

One way is to introduce a new marker "M", which permits, effectively,
N-Triples.  (Maybe required to be at the front.)

Not Turtle, but I see RDF patch as machine-oriented, not human-readable.

M <> rp:create _:a .
M _:a foaf:name "Andy" .
M _:a foaf:orgURL <http://jena.apache.org/> .
M <> rp:createdDate "2013-07-30"^^xsd:date .
M <> rdfs:comment "Seems like a good idea" .

Andy





Re: [RDF Patch] Looking at Talis Changesets and other proposals.

2013-07-30 Thread Stephen Allen
Andy,

I agree about scalability and stream processing being important
goals, which as you note the Changesets do not address.  To me, the cleanest
way to implement that would be to go beyond an RDF representation and
create a new language as you've done.

The metadata block you mention is basically what I suggested as a header
block.  I do think it will be important to talk about the patch itself.
 I'm sort of leaning towards requiring it to appear before any data (but
after prefixes).  I'm trying to think when it might be necessary to allow
it to appear at any arbitrary point in the document (change some kind of
state after it's processed, say, 10 million triples?), and I'm not coming up
with a lot of examples.  If it were required (let's say some statistics at
the end of the file about the preceding data), then maybe we have a header
(which has to appear up front) and metadata (which can appear anywhere)?

-Stephen


On Tue, Jul 30, 2013 at 10:49 AM, Andy Seaborne a...@apache.org wrote:

 Rob, all,

 Leigh Dodds expressed a preference for Talis Changesets for patches.  I
 have tried to analyse their pros and cons.

 For me, the scale issue alone makes changesets the wrong starting point.
  They really solve a different problem of managing some remote data with
 small, incremental changes.

 It would be useful to add to RDF patch the ability to have metadata about
 the change itself.

 One way is to introduce a new marker M, which permits effectively,
 N-Triples.  (Maybe required to be at the front.)

 Not Turtle but I see RDF patch as machine oriented, not human readable.

 M <> rp:create _:a .
 M _:a foaf:name "Andy" .
 M _:a foaf:orgURL <http://jena.apache.org/> .
 M <> rp:createdDate "2013-07-30"^^xsd:date .
 M <> rdfs:comment "Seems like a good idea" .

 Andy

 ----------

 Talis Changesets (TCS)

 http://docs.api.talis.com/getting-started/changesets
 http://docs.api.talis.com/getting-started/changeset-protocol
 http://vocab.org/changeset/schema.html

 == Brief Description

 A Changeset is a set of triples to remove and a set of triples to add,
 recorded as a single RDF graph.  There is a fixed subject of change - a
 changeset is a change to a single resource.  The triples of the change must
 all have the same subject and this must be the subject of change.

 The triples of the change are recorded as reified statements.  This is
 necessary so that triples can be grouped into removal and addition sets.
 The change set can have descriptive information about the change. Because
 the changeset is an RDF graph, the graph can say who was the creator, record
 the reason for the change, and the date the modification was created (not
 executed).  This also requires that the change triples are reified.
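(For a concrete picture, a hand-written sketch loosely following the vocab.org
changeset schema - not copied from the Talis documentation - replacing one
foaf:name with another:)

@prefix cs:   <http://purl.org/vocab/changeset/schema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

[] a cs:ChangeSet ;
   cs:subjectOfChange <http://example/bob> ;
   cs:changeReason    "Corrected spelling of name" ;
   cs:removal  [ a rdf:Statement ;
                 rdf:subject   <http://example/bob> ;
                 rdf:predicate foaf:name ;
                 rdf:object    "Boab" ] ;
   cs:addition [ a rdf:Statement ;
                 rdf:subject   <http://example/bob> ;
                 rdf:predicate foaf:name ;
                 rdf:object    "Bob" ] .

The removal and addition triples are only described (reified), not asserted,
which is what allows them to be grouped into the two sets.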

 ChangeSets can be linked together to produce a sequence of changes.  This
 is how to get changes to several resources - a list of changesets.

 == Pros and Cons

 This approach has some advantages and some disadvantages:
 (some of these can be overcome by fairly obvious changes to the definition.)

 1/ Changes relate only to one resource.  You can't make a coordinated set
 of changes, such as adding a batch of several new resources in a single
 HTTP request.

 2/ Blank nodes can't be handled.  There is no way to give the subject of
 change if it is a blank node. (The Talis platform didn't support blank
 nodes.)

 3/ A changeset is an RDF graph.

 It needs the whole changeset graph to be available before any changes are
 made.  The whole graph is needed to validate the changeset (e.g. all
 reified triples have the same subject), and order of triples in a
 serialization of a graph is arbitrary (esp. if produced by a generic RDF
 serializer) so, for example, the subject-of-change triple could be last,
 or the additions and removals can be mixed in any order.  To get stable
 changes, it is necessary to have a rule like "all removals are done before
 the additions".

 This is a limitation at scale. In practice, a changeset must be parsed
 into memory (standard parser), validated (changeset specific code) and
 applied (changeset specific code).  The design can't support streaming nor
 changes which may be larger than available RAM (e.g. millions of triples).

 It does mean that a standard RDF tool kit can be used to produce the
 change set (with suitable application code to build the graph structure)
 and to parse it at the receiver, together with some application code for
 producing, validating and executing a changeset.

 4/ The feature of metadata per change is a useful feature.

 5/ Change sets only work for a change to a resource in a single graph.

 == Other

 Graph literals:

 Some other proposals have been made (like Delta, or variants based on
 TriG) where named graphs are used instead of reified 

Re: RDF Patch

2013-06-21 Thread Rob Vesse
I went ahead and submitted a pull request for various
typographical/editorial tweaks.

I also went ahead and renamed "Minimise Actions" to "Canonical Patches" as
that makes a much clearer name for it; not sure this is quite the correct
terminology though.

Rob



On 6/20/13 3:38 PM, Rob Vesse rve...@cray.com wrote:

I did read some of the working group discussions around the patch format
and some of the stuff they were discussing made me want to cry at the
horrific syntax abuses some people were proposing to make

Steering them towards something that is simpler like RDF patch would seem
a good idea

Rob



On 6/20/13 3:03 PM, Andy Seaborne a...@apache.org wrote:


BTW, I got a ping from LDP-WG about a patch format.  That WG want
something sub-SPARQL, this maybe a useful input.


I've looked before at RDF-encoded versions (Talis ChangeSets, using
TriG) but without further syntax or processing rules, they don't stream
and it needs a whole request read in before processing.  That a severe
limitation.

Example:

@prefix diff: <http://example/diff#> .
@prefix : <http://example/data#> .

<#g2> { :s :p 456 }
<#g1> { :s :p 123 }

<#g1> { :x :q "foo" }

{ <> diff:delete <#g1> ;
     diff:insert <#g2> .
}

with the manifest default graph last, you can't tell anything about
#g1 or #g2 so the best I can imagine is to stash them away somewhere.

And does not cope with datasets (a graph-grouped complex manifest would
work but then any simplicity is lost and production of such patches is
looking a bit troublesome)

And then there's blank nodes.

Restricted SPARQL Update(INSERT DATA, DELETE DATA) sort of works ...
except bNodes.  An advantage is adding naturally DROP GRAPH and
DELETE WHERE.

  Andy





Re: RDF Patch

2013-06-21 Thread Andy Seaborne



On 21/06/13 18:31, Rob Vesse wrote:

I went ahead and submitted a pull request for various
typographical/editorial tweaks


I've added a use case section that talks about HTTP PATCH.  Not really a 
use ("why did the app call PATCH?") but, going back, it's good to point 
out the role it could play in PATCH.



I also went ahead and renamed Minimise Actions to Canonical Patches as
that makes a much clearer name for it, not sure this is quite the correct
terminology though


That's a better name ...

Maybe just call it "reversible", which is the main point.

Or a "Strong RDF Patch".

Andy



Rob



On 6/20/13 3:38 PM, Rob Vesse rve...@cray.com wrote:


I did read some of the working group discussions around the patch format
and some of the stuff they were discussing made me want to cry at the
horrific syntax abuses some people were proposing to make

Steering them towards something that is simpler like RDF patch would seem
a good idea

Rob



On 6/20/13 3:03 PM, Andy Seaborne a...@apache.org wrote:



BTW, I got a ping from LDP-WG about a patch format.  That WG want
something sub-SPARQL, this maybe a useful input.


I've looked before at RDF-encoded versions (Talis ChangeSets, using
TriG) but without further syntax or processing rules, they don't stream
and it needs a whole request read in before processing.  That a severe
limitation.

Example:

@prefix diff: <http://example/diff#> .
@prefix : <http://example/data#> .

<#g2> { :s :p 456 }
<#g1> { :s :p 123 }

<#g1> { :x :q "foo" }

{ <> diff:delete <#g1> ;
     diff:insert <#g2> .
}

with the manifest default graph last, you can't tell anything about
#g1 or #g2 so the best I can imagine is to stash them away somewhere.

And does not cope with datasets (a graph-grouped complex manifest would
work but then any simplicity is lost and production of such patches is
looking a bit troublesome)

And then there's blank nodes.

Restricted SPARQL Update(INSERT DATA, DELETE DATA) sort of works ...
except bNodes.  An advantage is adding naturally DROP GRAPH and
DELETE WHERE.

Andy









Re: RDF Delta - recording changes to RDF Datasets

2013-06-20 Thread Simon Helsen
as an aside, I always think of G in the first spot as well and I have 
always found the N-Quads format very counterintuitive. Andy, I suspect that is 
because in our heads, we group triples per graph, i.e. we lexicographically 
sort per graph, then the rest (very much like a dictionary). In N-Quads, if 
you sort by graph, everything is reversed. 

On the issue, I also favor using the standard, i.e. N-Quad

Simon





From:
Andy Seaborne a...@apache.org
To:
dev@jena.apache.org, 
Date:
06/20/2013 09:17 AM
Subject:
Re: RDF Delta - recording changes to RDF Datasets



I think the use of N-Quads order is better, given N-Quads exists.  I 
always think of quads as G-S-P-O (but I have no idea why!) and it just 
got written that way because.

The format does really need to be parsed in complete rows before 
deciding what to do with a row so, caveat very large literals (VLL), 
batching by graph isn't greatly affected.

VLL (Very Long Literals) of themselves could do with special handling.
But at the same time, I'd like to assume subjects-as-literals, which 
means they are not necessarily in the final object slot in GSPO order, 
when you could imagine special handling enabled by G-first.

Added a comments/todo section so as not to lose any of these points.

 Andy


On 19/06/13 00:56, Rob Vesse wrote:
 The format already allows arbitrarily sized tuples (well in the current
 form it is capped at 255 columns per tuple) though it assumes that this
 will be used to convey SPARQL results and thus currently requires that
 column headers be provided.  Both those restrictions would be fairly 
easy
 to remove.

 I will raise the issue of open sourcing with management again and see if 
I
 get any traction.

 On the subject of column ordering I can see benefits of putting the g
 field first in that it may make it easier to batch operations on a 
single
 graph though I don't think putting it at the end to align with NQuads
 precludes this you just require slightly more lookahead to determine
 whether to continue adding statements to your batch.

 Rob



 On 6/18/13 4:41 PM, Stephen Allen sal...@apache.org wrote:

 On Tue, Jun 18, 2013 at 6:05 PM, Andy Seaborne a...@apache.org wrote:

 On 18/06/13 22:13, Rob Vesse wrote:

 Hey Andy


 Hi Rob - thanks for the comments - really appreciate feedback -



 The basic approach looks sound and I like the simple text based 
format,
 see my notes later on about maybe having a binary serialization as
 well.


 A binary forms would excellent for this and for NT and NQ.  One of the
 speed limitations is parsing and Turtle is slower than NT (this isn't
 just
 a Jena effect).  gzip is neutral for reading but slows down writing.
 So a
 fast file format would be quite useful to add to the tool box.


   How do you envisage incremental backups being implemented in 
practice,
 you
 suggest in the document that you would take a full RDF dump and then
 compute the RDF delta from a previous backup.  Talking from the
 experience
 of having done this as part of one of my experiments in my PhD this
 can be
 very complex and time consuming to do especially if you need to take
 care
 of BNode isomorphism.  I assume from some of the other discussion on
 BNodes that you assume that IDs will remain stable across dumps, thus
 there is an implicit requirement here that the database be able to 
dump
 RDF using consistent BNode IDs (either internal IDs or some stable
 round
 trippable IDs).  Taking ARQ as an example the existing NQuads/TriG
 writers
 do not do this so there would need to be an option for those writers
 to be
 able to support this.


 Shh, don't tell anyone but n-quads and n-triples outputs do dump
 recoverable bNode labels :-)  TriG and Turtle do not - they try to be
 pretty.  The readers need a  tweak to recover them but the label-Node
 code
 has an option for various label policies and recover id from label is
 one
 of them.  This is not exposed formally - it's strictly illegal for RDF
 syntaxes.  Or use _:label URIs.

 I have prototyped a wrapper dataset that records changes as they 
happen
 driven off add(quad) and delete(quad).  This produces the RDF Delta
 (sp!)
 form so couple to xtn and you can have a live incremental backup.

 A strict after-the-event delta would be prohibitively expensive.


   Even without any concerns of BNode isomorphism comparing two RDF 
dumps
 to
 create a delta could be a potentially very time consuming operation 
and
 recording the deltas as changes happen may be far more efficient.  Of
 course depending on the exact use case the RDF dump and compute delta
 approach may be acceptable.


 It isn't a delta in the set theory A\B sense - nor is it a diff (it's
 not
 reversible without the additional condition).  delta and diff are
 both
 names I've toyed with - RDF changes might better capture the idea. 
Or
 RDF Changes Log.


   My main criticism is on the Minimise actions section, there needs 
to
 be
 a more solid clarification of definitions and when minimization can

Re: RDF Patch

2013-06-20 Thread Andy Seaborne

On 20/06/13 20:39, Stephen Allen wrote:

On Thu, Jun 20, 2013 at 3:15 PM, Andy Seaborne a...@apache.org wrote:


Moved:

http://afs.github.io/rdf-patch/

and ReSpec'ed.



Another idea.  Maybe a header of some type to record various bits of
metadata.  One important one might be whether or not the file was in
minimized form.  Presumably, you'd want this to be RDF as well.  Example
(H stands for header):

H _:b a <http://jena.apache.org/2013/06/rdf-patch#Patch> .
H _:b rdfs:comment "Generated by Jena Fuseki" .
H _:b dc:date "2013-06-20"^^xsd:date .
H _:b <http://jena.apache.org/2013/06/rdf-patch#minimizedForm> true .

etc.

I think you'd only allow H rows to appear before any A or D rows appear
(but allow @prefix statements before it).

I don't know exactly what you'd want to put in an ontology like this, but
it may be useful.  Also I used a blank node as the subject in my example,
but perhaps a fixed resource would be better.

-Stephen



Good points - need to mark whether it's reversible or not (minimal isn't 
quite the right word for the required characteristic).


We could break the "no relative URI" rule and use <> for this document - 
the bNodes are tricky because the label is interpreted not as file 
scoped, but as something to name real store bnodes.  Flipping label scopes 
might get confusing!


Maybe a format-specific syntax (RDF Patch isn't RDF) and the parser 
generates RDF from it.  A link to a general file is always possible.


H <> rdf:type <http://jena.apache.org/2013/06/rdf-patch#Patch> .
H <> rdfs:comment "Generated by Jena Fuseki" .
H <> dc:date "2013-06-20"^^xsd:date .
H <> <http://jena.apache.org/2013/06/rdf-patch#minimizedForm> true .
H <> link <http://example/more/information.ttl> .

Andy
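(A small sketch of the "parser generates RDF from it" idea: once an H row is
tokenised, it becomes a statement in a side model describing the patch.  The
tokenising and the urn:uuid stand-in for <> are assumptions, not part of the
format.)

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

// Sketch: accumulate H rows into an RDF model describing the patch itself.
public class PatchHeader {
    private final Model header = ModelFactory.createDefaultModel();
    // Stand-in for <> : a real parser might mint a UUID URN so the header
    // still names the same change wherever the copy came from.
    private final Resource thisPatch =
            header.createResource("urn:uuid:00000000-0000-0000-0000-000000000000");

    public void addHeaderRow(String predicateUri, String literalValue) {
        Property p = header.createProperty(predicateUri);
        header.add(thisPatch, p, literalValue);   // plain literal object
    }

    public Model asModel() { return header; }
}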




Re: RDF Patch

2013-06-20 Thread Rob Vesse
You could add a Binary Serialization to the TODO list

Rob



On 6/20/13 1:46 PM, Andy Seaborne a...@apache.org wrote:

On 20/06/13 20:39, Stephen Allen wrote:
 On Thu, Jun 20, 2013 at 3:15 PM, Andy Seaborne a...@apache.org wrote:

 Moved:

 http://afs.github.io/rdf-patch/

 and ReSpec'ed.


 Another idea.  Maybe a header of some type to record various bits of
 metadata.  One important one might be whether or not the file was in
 minimized form.  Presumably, you'd want this to be RDF as well.
Example
 (H stands for header):

 H _:b a <http://jena.apache.org/2013/06/rdf-patch#Patch> .
 H _:b rdfs:comment "Generated by Jena Fuseki" .
 H _:b dc:date "2013-06-20"^^xsd:date .
 H _:b <http://jena.apache.org/2013/06/rdf-patch#minimizedForm> true .

 etc.

 I think you'd only allow H rows to appear before any A or D rows appear
 (but allow @prefix statements before it).

 I don't know exactly what you'd want to put in an ontology like this,
but
 it may be useful.  Also I used a blank node as the subject in my
example,
 but perhaps a fixed resource would be better.

 -Stephen


Good points - need to mark whether it's reversible or not (minimal isn't
quite the right word for the required characteristic).

We could break the "no relative URI" rule and use <> for this document -
the bNodes are tricky because the label is interpreted not as file
scoped, but as something to name real store bnodes.  Flipping label scopes
might get confusing!

Maybe a format specific syntax (RDF Patch isn't RDF) and the parser
generates RDF from it.  A link to general file is always possible.

H <> rdf:type <http://jena.apache.org/2013/06/rdf-patch#Patch> .
H <> rdfs:comment "Generated by Jena Fuseki" .
H <> dc:date "2013-06-20"^^xsd:date .
H <> <http://jena.apache.org/2013/06/rdf-patch#minimizedForm> true .
H <> link <http://example/more/information.ttl> .

   Andy





Re: RDF Patch

2013-06-20 Thread Andy Seaborne

On 20/06/13 22:33, Rob Vesse wrote:

You could add a Binary Serialization to the TODO list


OK - I didn't want to imply anything at the moment but since you mention 
it ... done!


Andy



Rob



On 6/20/13 1:46 PM, Andy Seaborne a...@apache.org wrote:


On 20/06/13 20:39, Stephen Allen wrote:

On Thu, Jun 20, 2013 at 3:15 PM, Andy Seaborne a...@apache.org wrote:


Moved:

http://afs.github.io/rdf-patch/

and ReSpec'ed.



Another idea.  Maybe a header of some type to record various bits of
metadata.  One important one might be whether or not the file was in
minimized form.  Presumably, you'd want this to be RDF as well.
Example
(H stands for header):

H _:b a <http://jena.apache.org/2013/06/rdf-patch#Patch> .
H _:b rdfs:comment "Generated by Jena Fuseki" .
H _:b dc:date "2013-06-20"^^xsd:date .
H _:b <http://jena.apache.org/2013/06/rdf-patch#minimizedForm> true .

etc.

I think you'd only allow H rows to appear before any A or D rows appear
(but allow @prefix statements before it).

I don't know exactly what you'd want to put in an ontology like this,
but
it may be useful.  Also I used a blank node as the subject in my
example,
but perhaps a fixed resource would be better.

-Stephen



Good points - need to mark whether it's reversible or not (minimal isn't
quite the right word for the required characteristic).

We could break the "no relative URI" rule and use <> for this document -
the bNodes are tricky because the label is interpreted not as file
scoped, but as something to name real store bnodes.  Flipping label scopes
might get confusing!

Maybe a format specific syntax (RDF Patch isn't RDF) and the parser
generates RDF from it.  A link to general file is always possible.

H <> rdf:type <http://jena.apache.org/2013/06/rdf-patch#Patch> .
H <> rdfs:comment "Generated by Jena Fuseki" .
H <> dc:date "2013-06-20"^^xsd:date .
H <> <http://jena.apache.org/2013/06/rdf-patch#minimizedForm> true .
H <> link <http://example/more/information.ttl> .

Andy








Re: RDF Patch

2013-06-20 Thread Andy Seaborne


BTW, I got a ping from LDP-WG about a patch format.  That WG wants 
something sub-SPARQL, so this may be a useful input.



I've looked before at RDF-encoded versions (Talis ChangeSets, using 
TriG) but, without further syntax or processing rules, they don't stream 
and a whole request needs to be read in before processing.  That's a severe 
limitation.


Example:

@prefix diff: <http://example/diff#> .
@prefix : <http://example/data#> .

<#g2> { :s :p 456 }
<#g1> { :s :p 123 }

<#g1> { :x :q "foo" }

{ <> diff:delete <#g1> ;
     diff:insert <#g2> .
}

with the "manifest" default graph last, you can't tell anything about 
<#g1> or <#g2>, so the best I can imagine is to stash them away somewhere.


And it does not cope with datasets (a graph-grouped complex manifest would 
work, but then any simplicity is lost and production of such patches is 
looking a bit troublesome).


And then there's blank nodes.

Restricted SPARQL Update (INSERT DATA, DELETE DATA) sort of works ... 
except for bNodes.  An advantage is naturally adding DROP GRAPH and 
DELETE WHERE.


Andy



Re: RDF Patch

2013-06-20 Thread Rob Vesse
I did read some of the working group discussions around the patch format
and some of the stuff they were discussing made me want to cry at the
horrific syntax abuses some people were proposing to make

Steering them towards something that is simpler like RDF patch would seem
a good idea

Rob



On 6/20/13 3:03 PM, Andy Seaborne a...@apache.org wrote:


BTW, I got a ping from LDP-WG about a patch format.  That WG want
something sub-SPARQL, this maybe a useful input.


I've looked before at RDF-encoded versions (Talis ChangeSets, using
TriG) but without further syntax or processing rules, they don't stream
and it needs a whole request read in before processing.  That a severe
limitation.

Example:

@prefix diff: <http://example/diff#> .
@prefix : <http://example/data#> .

<#g2> { :s :p 456 }
<#g1> { :s :p 123 }

<#g1> { :x :q "foo" }

{ <> diff:delete <#g1> ;
     diff:insert <#g2> .
}

with the manifest default graph last, you can't tell anything about
#g1 or #g2 so the best I can imagine is to stash them away somewhere.

And does not cope with datasets (a graph-grouped complex manifest would
work but then any simplicity is lost and production of such patches is
looking a bit troublesome)

And then there's blank nodes.

Restricted SPARQL Update(INSERT DATA, DELETE DATA) sort of works ...
except bNodes.  An advantage is adding naturally DROP GRAPH and
DELETE WHERE.

   Andy




Re: RDF Delta - recording changes to RDF Datasets

2013-06-18 Thread Rob Vesse
Hey Andy

The basic approach looks sound and I like the simple text based format,
see my notes later on about maybe having a binary serialization as well.

How do you envisage incremental backups being implemented in practice, you
suggest in the document that you would take a full RDF dump and then
compute the RDF delta from a previous backup.  Talking from the experience
of having done this as part of one of my experiments in my PhD this can be
very complex and time consuming to do especially if you need to take care
of BNode isomorphism.  I assume from some of the other discussion on
BNodes that you assume that IDs will remain stable across dumps, thus
there is an implicit requirement here that the database be able to dump
RDF using consistent BNode IDs (either internal IDs or some stable round
trippable IDs).  Taking ARQ as an example the existing NQuads/TriG writers
do not do this so there would need to be an option for those writers to be
able to support this.

Even without any concerns of BNode isomorphism comparing two RDF dumps to
create a delta could be a potentially very time consuming operation and
recording the deltas as changes happen may be far more efficient.  Of
course depending on the exact use case the RDF dump and compute delta
approach may be acceptable.

My main criticism is on the "Minimise actions" section; there needs to be
a more solid clarification of definitions and of when minimization can and
should happen.

For example:

When written in minimise form the RDF Delta can be run backwards, to undo
a change. This only works when real changes are recorded because otherwise
knowing a triple is added does not mean it was not there before.

While I agree it is necessary to record real changes for deltas to be
reverse applied, I'm not convinced they have to be in minimized form (at
least based on how the definition of minimized form reads right now); if
only real changes are recorded then deltas will be in a minimal form.

Yet it is not entirely clear by your definition the following delta would
be considered minimal:

A <http://s> <http://p> <http://o>
R <http://s> <http://p> <http://o>
A <http://s> <http://p> <http://o>

I'm assuming that your intention was that such deltas should not be
minimized but perhaps this needs to be more clear in the document.
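(A reversal sketch for reference, not from the proposal text: a patch of real
changes is undone by inverting each row and applying them in reverse order.)

Forward patch (real changes only):
A <http://example/s> <http://example/p> "v1" .
D <http://example/s> <http://example/p> "v0" .

Reverse patch (rows inverted, order reversed):
A <http://example/s> <http://example/p> "v0" .
D <http://example/s> <http://example/p> "v1" .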

On the topic of related work:

I think I may have mentioned previously that I've done some research work
internally here at YarcData on a general-purpose binary serialization for
Triples, Quads and Tuples, which likely could be fairly trivially extended
to carry a binary encoding of the deltas as well, which may save space.
For ballpark comparison purposes, compression is roughly equivalent to
GZipping raw NTriples, with the key advantage being that the format is
significantly faster to process even in its current prototype
single-threaded implementation (the design was written to take advantage of
parallelism).  There are a bunch of further optimizations that I had ideas
for that I never got as far as implementing because of lack of management
support for the concept.

There has been some discussion of open sourcing this work (likely as a
contributed Experimental module to Jena) so that it could be developed
outside of the company, if this sounds like it may be of interest I will
broach the subject with relevant management again and see whether this can
happen in the near future.

Rob


On 6/18/13 7:26 AM, Andy Seaborne a...@apache.org wrote:

I started writing up a format for transferring changes between dataset
copies (copies in time and in location).

https://cwiki.apache.org/confluence/display/JENA/RDF+Delta

Still rough and ready but I hope it gives a general impression of the
format and usage.

Comments, thoughts, discussion here on dev@ please.

   Andy



Re: RDF Delta - recording changes to RDF Datasets

2013-06-18 Thread Andy Seaborne

On 18/06/13 22:13, Rob Vesse wrote:

Hey Andy


Hi Rob - thanks for the comments - really appreciate feedback -



The basic approach looks sound and I like the simple text based format,
see my notes later on about maybe having a binary serialization as well.


A binary form would be excellent for this and for NT and NQ.  One of the 
speed limitations is parsing, and Turtle is slower than NT (this isn't 
just a Jena effect).  gzip is neutral for reading but slows down 
writing.  So a fast file format would be quite useful to add to the tool 
box.



How do you envisage incremental backups being implemented in practice, you
suggest in the document that you would take a full RDF dump and then
compute the RDF delta from a previous backup.  Talking from the experience
of having done this as part of one of my experiments in my PhD this can be
very complex and time consuming to do especially if you need to take care
of BNode isomorphism.  I assume from some of the other discussion on
BNodes that you assume that IDs will remain stable across dumps, thus
there is an implicit requirement here that the database be able to dump
RDF using consistent BNode IDs (either internal IDs or some stable round
trippable IDs).  Taking ARQ as an example the existing NQuads/TriG writers
do not do this so there would need to be an option for those writers to be
able to support this.


Shh, don't tell anyone, but the n-quads and n-triples outputs do dump 
recoverable bNode labels :-)  TriG and Turtle do not - they try to be 
pretty.  The readers need a tweak to recover them, but the label-to-Node 
code has an option for various label policies and "recover id from label" 
is one of them.  This is not exposed formally - it's strictly illegal 
for RDF syntaxes.  Or use <_:label> URIs.


I have prototyped a wrapper dataset that records changes as they happen, 
driven off add(quad) and delete(quad).  This produces the RDF Delta 
(sp!) form, so couple it to a transaction (xtn) and you can have a live 
incremental backup.


A strict after-the-event delta would be prohibitively expensive.


Even without any concerns of BNode isomorphism comparing two RDF dumps to
create a delta could be a potentially very time consuming operation and
recording the deltas as changes happen may be far more efficient.  Of
course depending on the exact use case the RDF dump and compute delta
approach may be acceptable.


It isn't a delta in the set theory A\B sense - nor is it a diff (it's 
not reversible without the additional condition).  "delta" and "diff" 
are both names I've toyed with - "RDF changes" might better capture the 
idea.  Or "RDF Changes Log".



My main criticism is on the Minimise actions section, there needs to be
a more solid clarification of definitions and when minimization can and
should happen.


Yes - it isn't as well covered in the doc.

Logically - or generally - in the event-generating dataset wrapper:

if ( contains(g,s,p,o) ) {
    record(QuadAction.NO_ADD,g,s,p,o) ;   // No action.
    return ;
}

add(g,s,p,o) ;
record(QuadAction.ADD,g,s,p,o) ;          // Action.

https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/recorder

but implementations like TDB can do it without the contains() as the 
indexes already return true/false for whether a change occurred or not.
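(A compressed sketch of that wrapper idea - names are illustrative, not the
actual prototype; it assumes Jena's DatasetGraphWrapper, and the ChangeLog
interface is hypothetical:)

import org.apache.jena.graph.Node;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.core.DatasetGraphWrapper;

// Sketch: only emit a change row when the add/delete really changes the dataset.
public class RecordingDatasetGraph extends DatasetGraphWrapper {
    public interface ChangeLog { void row(char action, Node g, Node s, Node p, Node o); }

    private final ChangeLog log;

    public RecordingDatasetGraph(DatasetGraph base, ChangeLog log) {
        super(base);
        this.log = log;
    }

    @Override
    public void add(Node g, Node s, Node p, Node o) {
        if (contains(g, s, p, o)) return;      // no real change : record nothing
        super.add(g, s, p, o);
        log.row('A', g, s, p, o);
    }

    @Override
    public void delete(Node g, Node s, Node p, Node o) {
        if (!contains(g, s, p, o)) return;     // no real change : record nothing
        super.delete(g, s, p, o);
        log.row('D', g, s, p, o);
    }
}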




For example:

When written in minimise form the RDF Delta can be run backwards, to undo
a change. This only works when real changes are recorded because otherwise
knowing a triple is added does not mean it was not there before.

While I agree it is necessary to record real changes for deltas to be
reverse applied I'm not convinced they have to be in minimized form (at
least based on how the definition of minimized form reads right now), if
only real changes are recorded then deltas will be in a minimal form.

Yet it is not entirely clear by your definition the following delta would
be considered minimal:

 A <http://s> <http://p> <http://o>
 R <http://s> <http://p> <http://o>
 A <http://s> <http://p> <http://o>


If the dataset did not originally contain <http://s> <http://p> 
<http://o> then that is minimal.  Each row makes a real change; it's 
the fact that graphs/datasets are sets of triples/quads that means the real 
change condition is needed.
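Worked against an initially empty dataset, every row of that delta is a real
change:

Start : { }                                      (triple absent)
A <http://s> <http://p> <http://o>   ->  added   (real change)
R <http://s> <http://p> <http://o>   ->  removed (real change)
A <http://s> <http://p> <http://o>   ->  added   (real change)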



I'm assuming that your intention was that such deltas should not be
minimized but perhaps this needs to be more clear in the document.


There is no reason not to allow the redundant first two A-D to be 
removed but it's not required.



On the topic of related work:

I think I may have mentioned previously that I've done some research work
internally here at YarcData on a general purpose binary serialization for
Triples, Quads and Tuples which likely could be fairly trivially extended
to carry a binary encoding of the deltas as well which may save space.
For ball park comparison purposes compression is roughly equivalent to
GZipping raw NTriples with the key advantage being that the format is
significantly faster to process even in its current 

Re: RDF Delta - recording changes to RDF Datasets

2013-06-18 Thread Stephen Allen
On Tue, Jun 18, 2013 at 6:05 PM, Andy Seaborne a...@apache.org wrote:

 On 18/06/13 22:13, Rob Vesse wrote:

 Hey Andy


 Hi Rob - thanks for the comments - really appreciate feedback -



 The basic approach looks sound and I like the simple text based format,
 see my notes later on about maybe having a binary serialization as well.


 A binary forms would excellent for this and for NT and NQ.  One of the
 speed limitations is parsing and Turtle is slower than NT (this isn't just
 a Jena effect).  gzip is neutral for reading but slows down writing.  So a
 fast file format would be quite useful to add to the tool box.


  How do you envisage incremental backups being implemented in practice, you
 suggest in the document that you would take a full RDF dump and then
 compute the RDF delta from a previous backup.  Talking from the experience
 of having done this as part of one of my experiments in my PhD this can be
 very complex and time consuming to do especially if you need to take care
 of BNode isomorphism.  I assume from some of the other discussion on
 BNodes that you assume that IDs will remain stable across dumps, thus
 there is an implicit requirement here that the database be able to dump
 RDF using consistent BNode IDs (either internal IDs or some stable round
 trippable IDs).  Taking ARQ as an example the existing NQuads/TriG writers
 do not do this so there would need to be an option for those writers to be
 able to support this.


 Shh, don't tell anyone but n-quads and n-triples outputs do dump
 recoverable bNode labels :-)  TriG and Turtle do not - they try to be
 pretty.  The readers need a  tweak to recover them but the label-Node code
 has an option for various label policies and recover id from label is one
 of them.  This is not exposed formally - it's strictly illegal for RDF
 syntaxes.  Or use _:label URIs.

 I have prototyped a wrapper dataset that records changes as they happen
 driven off add(quad) and delete(quad).  This produces the RDF Delta (sp!)
 form so couple to xtn and you can have a live incremental backup.

 A strict after-the-event delta would be prohibitively expensive.


  Even without any concerns of BNode isomorphism comparing two RDF dumps to
 create a delta could be a potentially very time consuming operation and
 recording the deltas as changes happen may be far more efficient.  Of
 course depending on the exact use case the RDF dump and compute delta
 approach may be acceptable.


 It isn't a delta in the set theory A\B sense - nor is it a diff (it's not
 reversible without the additional condition).  delta and diff are both
 names I've toyed with - RDF changes might better capture the idea.  Or
 RDF Changes Log.


  My main criticism is on the Minimise actions section, there needs to be
 a more solid clarification of definitions and when minimization can and
 should happen.


 Yes - it isn't as well covered in the doc.

 Logically - or generally - in the event-generating dataset wrapper:

 if ( contains(g,s,p,o) ) {
     record(QuadAction.NO_ADD,g,s,p,o) ;   // No action.
     return ;
 }

 add(g,s,p,o) ;
 record(QuadAction.ADD,g,s,p,o) ;          // Action.

 https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/recorder

 but implementations like TDB can do it without the contains() as the
 indexes already return true/false for whether a change occurred or not.



 For example:

 When written in minimise form the RDF Delta can be run backwards, to undo
 a change. This only works when real changes are recorded because otherwise
 knowing a triple is added does not mean it was not there before.

 While I agree it is necessary to record real changes for deltas to be
 reverse applied I'm not convinced they have to be in minimized form (at
 least based on how the definition of minimized form reads right now), if
 only real changes are recorded then deltas will be in a minimal form.

 Yet it is not entirely clear by your definition the following delta would
 be considered minimal:

 A <http://s> <http://p> <http://o>
 R <http://s> <http://p> <http://o>
 A <http://s> <http://p> <http://o>


 If the dataset did not originally contain <http://s> <http://p> <http://o>
 then that is minimal.  Each row makes a real change; it's the fact that
 graphs/datasets are sets of triples/quads that means the real change condition is needed.


  I'm assuming that your intention was that such deltas should not be
 minimized but perhaps this needs to be more clear in the document.


 There is no reason not to allow the redundant first two A-D to be removed
 but it's not required.


  On the topic of related work:

 I think I may have mentioned previously that I've done some research work
 internally here at YarcData on a general purpose binary serialization for
 Triples, Quads and Tuples which likely could be fairly trivially extended
 to carry a binary encoding of the deltas as well 

Re: RDF Delta - recording changes to RDF Datasets

2013-06-18 Thread Rob Vesse
The format already allows arbitrarily sized tuples (well in the current
form it is capped at 255 columns per tuple) though it assumes that this
will be used to convey SPARQL results and thus currently requires that
column headers be provided.  Both those restrictions would be fairly easy
to remove.

I will raise the issue of open sourcing with management again and see if I
get any traction.

On the subject of column ordering, I can see benefits of putting the g
field first in that it may make it easier to batch operations on a single
graph, though I don't think putting it at the end to align with NQuads
precludes this; you just require slightly more lookahead to determine
whether to continue adding statements to your batch.

Rob
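(A sketch of the lookahead point, illustrative only - Quad is Jena's, the sink
is left abstract: the batcher just has to notice when a quad's graph field
differs from the previous one, even though that field comes last on the wire.)

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.jena.graph.Node;
import org.apache.jena.sparql.core.Quad;

// Sketch: batch consecutive quads sharing a graph; flush when the graph changes.
public class GraphBatcher {
    public interface BatchSink { void batch(Node graph, List<Quad> quads); }

    public static void run(Iterator<Quad> quads, BatchSink sink) {
        List<Quad> batch = new ArrayList<>();
        Node currentGraph = null;
        while (quads.hasNext()) {
            Quad q = quads.next();
            if (currentGraph != null && !currentGraph.equals(q.getGraph())) {
                sink.batch(currentGraph, batch);   // graph changed : emit the batch
                batch = new ArrayList<>();
            }
            currentGraph = q.getGraph();
            batch.add(q);
        }
        if (!batch.isEmpty())
            sink.batch(currentGraph, batch);       // final batch
    }
}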



On 6/18/13 4:41 PM, Stephen Allen sal...@apache.org wrote:

On Tue, Jun 18, 2013 at 6:05 PM, Andy Seaborne a...@apache.org wrote:

 On 18/06/13 22:13, Rob Vesse wrote:

 Hey Andy


 Hi Rob - thanks for the comments - really appreciate feedback -



 The basic approach looks sound and I like the simple text based format,
 see my notes later on about maybe having a binary serialization as
well.


 A binary forms would excellent for this and for NT and NQ.  One of the
 speed limitations is parsing and Turtle is slower than NT (this isn't
just
 a Jena effect).  gzip is neutral for reading but slows down writing.
So a
 fast file format would be quite useful to add to the tool box.


  How do you envisage incremental backups being implemented in practice,
you
 suggest in the document that you would take a full RDF dump and then
 compute the RDF delta from a previous backup.  Talking from the
experience
 of having done this as part of one of my experiments in my PhD this
can be
 very complex and time consuming to do especially if you need to take
care
 of BNode isomorphism.  I assume from some of the other discussion on
 BNodes that you assume that IDs will remain stable across dumps, thus
 there is an implicit requirement here that the database be able to dump
 RDF using consistent BNode IDs (either internal IDs or some stable
round
 trippable IDs).  Taking ARQ as an example the existing NQuads/TriG
writers
 do not do this so there would need to be an option for those writers
to be
 able to support this.


 Shh, don't tell anyone but n-quads and n-triples outputs do dump
 recoverable bNode labels :-)  TriG and Turtle do not - they try to be
 pretty.  The readers need a  tweak to recover them but the label-Node
code
 has an option for various label policies and recover id from label is
one
 of them.  This is not exposed formally - it's strictly illegal for RDF
 syntaxes.  Or use _:label URIs.

 I have prototyped a wrapper dataset that records changes as they happen
 driven off add(quad) and delete(quad).  This produces the RDF Delta
(sp!)
 form so couple to xtn and you can have a live incremental backup.

 A strict after-the-event delta would be prohibitively expensive.


  Even without any concerns of BNode isomorphism comparing two RDF dumps
to
 create a delta could be a potentially very time consuming operation and
 recording the deltas as changes happen may be far more efficient.  Of
 course depending on the exact use case the RDF dump and compute delta
 approach may be acceptable.


 It isn't a delta in the set theory A\B sense - nor is it a diff (it's
not
 reversible without the additional condition).  delta and diff are
both
 names I've toyed with - RDF changes might better capture the idea.  Or
 RDF Changes Log.


  My main criticism is on the Minimise actions section, there needs to
be
 a more solid clarification of definitions and when minimization can and
 should happen.


 Yes - it isn't as well covered in the doc.

 Logically - or generally - in the event-generating dataset wrapper:

 if ( contains(g,s,p,o) ) {
     record(QuadAction.NO_ADD,g,s,p,o) ;   // No action.
     return ;
 }

 add(g,s,p,o) ;
 record(QuadAction.ADD,g,s,p,o) ;          // Action.

 https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/recorder

 but implementations like TDB can do it without the contains() as the
 indexes already return true/false for whether a change occurred or not.



 For example:

 When written in minimise form the RDF Delta can be run backwards, to
undo
 a change. This only works when real changes are recorded because
otherwise
 knowing a triple is added does not mean it was not there before.

 While I agree it is necessary to record real changes for deltas to be
 reverse applied I'm not convinced they have to be in minimized form (at
 least based on how the definition of minimized form reads right now),
if
 only real changes are recorded then deltas will be in a minimal form.

 Yet it is not entirely clear by your definition the following delta
would
 be considered minimal:

 A <http://s> <http://p> <http://o>
 R <http://s> <http://p> <http://o>
 A