Claus Stadler created JENA-1894:
-----------------------------------

             Summary: Insert-order preserving dataset
                 Key: JENA-1894
                 URL: https://issues.apache.org/jira/browse/JENA-1894
             Project: Apache Jena
          Issue Type: Improvement
          Components: ARQ
    Affects Versions: Jena 3.14.0
            Reporter: Claus Stadler


To the best of my knowledge, there is no backend for datasets that retains 
insert order.
This feature is particularly useful when changing RDF files in a git 
repository, as it makes for nice commits. An insert-order preserving 
Triple/QuadTable implementation enables:
* Writing (subject-grouped) RDF files or events from an RDF stream out in 
nearly the same way they were read in - this makes it easier to compare outputs 
of data transformations
* Combining ORDER BY with CONSTRUCT queries:

{code:java}
// createOrderPreservingDataset() is the factory method proposed here; it does not exist yet
Dataset ds = DatasetFactory.createOrderPreservingDataset();
QueryExecutionFactory.create("CONSTRUCT WHERE { ?s ?p ?o } ORDER BY ?s ?p ?o", ds);
RDFDataMgr.write(System.out, ds, RDFFormat.TURTLE_BLOCKS);
{code}

I created an implementation of this some time ago. The main classes of the 
machinery are:

* 
[QuadTableFromNestedMaps.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/QuadTableFromNestedMaps.java#L26]
* In addition, I created a lazy (but adequate?) wrapper for re-using a quad 
table as a triple table:
[TripleTableFromQuadTable.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/TripleTableFromQuadTable.java#L30]
* The DatasetGraph wrapper:
[DatasetGraphQuadsImpl.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/DatasetGraphQuadsImpl.java#L32]

Note that DatasetGraphQuadsImpl at present falsely claims to be 
transaction-aware, because otherwise any SPARQL insert caused an exception (I 
have not tried with the latest fixes for 3.15.0-SNAPSHOT yet). In any case, for 
the use case of writing out RDF, transactions may not even be necessary, but if 
there is an easy way to add them, then it should be done.

An example of the above code in action is here: [Git Diff based on ordered 
turtle-blocks output 
|https://github.com/SmartDataAnalytics/lodservatory/commit/ec50cd33230a771c557c1ed2751799401ea3fd89]

The downside of this kind of order-preserving dataset is that it essentially 
only features a gspo index. Hence, the performance characteristics of such a 
dataset - which is intended mostly for serialization or presentation - differ 
greatly from those of the query-optimized implementations.
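
To illustrate the nested-maps idea and the gspo limitation, here is a minimal 
sketch using only the Java standard library. It is not the linked 
QuadTableFromNestedMaps code: the class name OrderedQuadTable, its methods, and 
the use of plain strings instead of Jena Node objects are all assumptions made 
for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: a g -> s -> p -> o index that keeps insertion order at every level. */
final class OrderedQuadTable {
    // LinkedHashMap and LinkedHashSet iterate in insertion order, unlike HashMap and HashSet.
    private final Map<String, Map<String, Map<String, Set<String>>>> gspo = new LinkedHashMap<>();

    /** Adds a quad; returns false if it was already present. */
    boolean add(String g, String s, String p, String o) {
        return gspo.computeIfAbsent(g, k -> new LinkedHashMap<>())
                   .computeIfAbsent(s, k -> new LinkedHashMap<>())
                   .computeIfAbsent(p, k -> new LinkedHashSet<>())
                   .add(o);
    }

    /** Removes a quad and prunes emptied levels, so a later re-insert is appended at the end. */
    boolean delete(String g, String s, String p, String o) {
        Map<String, Map<String, Set<String>>> spo = gspo.get(g);
        if (spo == null) return false;
        Map<String, Set<String>> po = spo.get(s);
        if (po == null) return false;
        Set<String> os = po.get(p);
        if (os == null || !os.remove(o)) return false;
        if (os.isEmpty()) po.remove(p);
        if (po.isEmpty()) spo.remove(s);
        if (spo.isEmpty()) gspo.remove(g);
        return true;
    }

    /** Returns the matching quads as [g, s, p, o] lists; null acts as a wildcard. */
    List<List<String>> find(String g, String s, String p, String o) {
        List<List<String>> out = new ArrayList<>();
        for (var ge : select(gspo, g))
            for (var se : select(ge.getValue(), s))
                for (var pe : select(se.getValue(), p))
                    for (String obj : pe.getValue())
                        if (o == null || o.equals(obj))
                            out.add(List.of(ge.getKey(), se.getKey(), pe.getKey(), obj));
        return out;
    }

    // A bound key is a single map lookup; an unbound key iterates the whole level.
    // This is why only patterns with a bound g, gs or gsp prefix are answered cheaply.
    private static <V> Iterable<Map.Entry<String, V>> select(Map<String, V> m, String key) {
        if (key == null) return m.entrySet();
        V v = m.get(key);
        if (v == null) return List.of();
        return List.of(Map.entry(key, v));
    }
}
```

With the quads (g, s1, p, o1), (g, s2, p, o2) and (g, s1, p, o3) added in that 
order, find(null, null, null, null) returns both s1 quads before the s2 quad. 
Each subject keeps the position of its first occurrence, which is why the output 
is subject-grouped and only "nearly" in insert order. A pattern such as 
(null, null, p, null) has no bound prefix and walks every graph and subject.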

In any case, order preserving datasets are a highly useful feature for Jena and 
I'd gladly contribute a PR for that. My main questions are:
* What should the factory methods in DatasetFactory, DatasetGraphFactory etc. be 
called - createOrderPreservingDataset?
* Is the approach using QuadTableFromNestedMaps needed - or can a different 
implementation of QuadTable be repurposed?
* It seems that the abstract class DatasetGraphQuads does not have any 
implementation, at least in ARQ and the Jena modules I use (according to 
Eclipse) - so my custom implementation of DatasetGraphQuadsImpl seems to be 
needed, or is there a similar class lying around in another Jena package?




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
