[
https://issues.apache.org/jira/browse/JENA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205538#comment-17205538
]
Adam Soroka commented on JENA-1894:
-----------------------------------
[~Aklakan] and [~andy], I apologize for having been nowhere to be found for
this discussion. Long story involving ${work} and ${family}, but I should be
able to participate more usefully going forward, at least for a few weeks.
Luckily, it turns out that I would have had little to offer! What Claus has
here is already better and more general than TIM's undergirding (which was not
well-designed and was really a "get 'er done" approach).
To (finally) answer a particular question Andy asked, yes, TIM does use a
selected index to quickly answer graph names, but it is a hacky design that
relies on special knowledge, not the nicely-planned version that Claus shows
here.
I need to read the paper that Claus linked (sounds fascinating, I have been
intrigued by tensor-based approaches for years but I don't really know the math
well enough to do anything novel and I don't have time to learn right now-- I'd
also be really curious if [~rvesse] had any thoughts about it). I will try to
look carefully at Claus' PR soon.
One last thing: I think that if we take up the general framework as Claus
offers, we should start with the desideratum that it be implemented in DBOE. I think
we should be moving as hard as we can to unify storage abstractions (modulo
technical reality that impl details matter) because one of the difficult things
about learning Jena right now seems to me to be that we have so many very
independent impls of {{DatasetGraph}}.
> Insert-order preserving dataset
> -------------------------------
>
> Key: JENA-1894
> URL: https://issues.apache.org/jira/browse/JENA-1894
> Project: Apache Jena
> Issue Type: Improvement
> Components: ARQ
> Affects Versions: Jena 3.14.0
> Reporter: Claus Stadler
> Priority: Major
>
> To the best of my knowledge, there is no backend for datasets that retains
> insert order.
> This feature is particularly useful when changing RDF files in a git
> repository, as it makes for nice commits. An insert-order preserving
> Triple/QuadTable implementation enables:
> * Writing (subject-grouped) RDF files or events from an RDF stream out in
> nearly the same way they were read in - this makes it easier to compare
> outputs of data transformations
> * Combining ORDER BY with CONSTRUCT queries:
> {code:java}
> Dataset ds = DatasetFactory.createOrderPreservingDataset();
> QueryExecutionFactory.create("CONSTRUCT WHERE { ?s ?p ?o } ORDER BY ?s ?p ?o", ds);
> RDFDataMgr.write(System.out, ds, RDFFormat.TURTLE_BLOCKS);
> {code}
> I have created an implementation for this some time ago with the main classes
> of the machinery being:
> *
> [QuadTableFromNestedMaps.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/QuadTableFromNestedMaps.java#L26]
> * In addition, I created a lazy (but adequate?) wrapper for re-using a quad
> table as a triple table:
>
> [TripleTableFromQuadTable.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/TripleTableFromQuadTable.java#L30]
> * The DatasetGraph wrapper:
>
> [DatasetGraphQuadsImpl.java|https://github.com/SmartDataAnalytics/jena-sparql-api/blob/a18b069e963bdef6cc9e8915f3e8f766893bab15/jena-sparql-api-rx/src/main/java/org/aksw/jena_sparql_api/rx/DatasetGraphQuadsImpl.java#L32]
> The actual factory code then uses:
> {code:java}
> public static DatasetGraph createOrderPreservingDatasetGraph() {
>     QuadTable quadTable = new QuadTableFromNestedMaps();
>     TripleTable tripleTable = new TripleTableFromQuadTable(quadTable);
>     DatasetGraph result = new DatasetGraphInMemory(quadTable, tripleTable);
>     return result;
> }
> {code}
> Note that DatasetGraphQuadsImpl at present falsely claims to be
> transaction-aware, because otherwise any SPARQL insert caused an exception
> (I have not tried with the latest fixes for 3.15.0-SNAPSHOT yet). In any
> case, for the use case of writing out RDF, transactions may not even be
> necessary, but if there is an easy way to add them, then it should be done.
> An example of the above code in action is here: [Git Diff based on ordered
> turtle-blocks output
> |https://github.com/SmartDataAnalytics/lodservatory/commit/ec50cd33230a771c557c1ed2751799401ea3fd89]
> The downside of this kind of order-preserving dataset is that it
> essentially only features a gspo index. Hence, the performance
> characteristics of this kind of order-preserving dataset, which is intended
> mostly for serialization or presentation, vary greatly from those of the
> query-optimized implementations.
> In any case, order preserving datasets are a highly useful feature for Jena
> and I'd gladly contribute a PR for that. My main questions are:
> * How to call the factory methods in DatasetFactory, DatasetGraphFactory etc
> - createOrderPreservingDataset?
> * Is the approach using QuadTableFromNestedMaps needed, or can a different
> implementation of QuadTable be repurposed?
> * It seems that the abstract class DatasetGraphQuads does not have any
> implementation at least in ARQ and the jena modules I use (according to
> eclipse) - so my custom implementation of DatasetGraphQuadsImpl seems to be
> needed, or is there a similar class lying around in another jena package?
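
To make the nested-map idea in the issue description concrete, here is a
minimal sketch of an insert-order preserving quad store with a single gspo
index. It is an illustration only: the class name OrderedQuadStore and its
methods are hypothetical, nodes are plain strings rather than Jena Node
objects, and it is not the QuadTableFromNestedMaps class linked above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration; not the QuadTableFromNestedMaps class from the issue. */
class OrderedQuadStore {
    // g -> s -> p -> {o}. LinkedHash* collections iterate in insertion order,
    // so output is grouped by graph and subject in first-seen order.
    private final Map<String, Map<String, Map<String, Set<String>>>> gspo =
            new LinkedHashMap<>();

    /** Adds a quad; returns false if it was already present. */
    boolean add(String g, String s, String p, String o) {
        return gspo.computeIfAbsent(g, k -> new LinkedHashMap<>())
                   .computeIfAbsent(s, k -> new LinkedHashMap<>())
                   .computeIfAbsent(p, k -> new LinkedHashSet<>())
                   .add(o);
    }

    /** Graph names in first-insertion order: simply the keys of the outer map. */
    List<String> graphNames() {
        return new ArrayList<>(gspo.keySet());
    }

    /** Matches a quad pattern; a null argument acts as a wildcard. */
    List<String[]> find(String g, String s, String p, String o) {
        List<String[]> out = new ArrayList<>();
        for (var ge : select(gspo, g).entrySet())
            for (var se : select(ge.getValue(), s).entrySet())
                for (var pe : select(se.getValue(), p).entrySet())
                    for (String obj : pe.getValue())
                        // Objects are not indexed, so they are filtered by scanning.
                        if (o == null || o.equals(obj))
                            out.add(new String[] {ge.getKey(), se.getKey(), pe.getKey(), obj});
        return out;
    }

    // Narrows one level to a single entry when the key is bound,
    // otherwise keeps the whole level.
    private static <V> Map<String, V> select(Map<String, V> level, String key) {
        if (key == null) return level;
        V v = level.get(key);
        return v == null ? Map.of() : Map.of(key, v);
    }
}
```

In this sketch, a pattern that binds a prefix of (g, s, p) is answered by
direct map lookups, whereas a pattern that binds only the object has to walk
every entry. That is the trade-off the description mentions when it says the
dataset essentially only features a gspo index. Note also that iteration is
grouped by graph and subject, so quads come back in nearly, not exactly, the
order they were added.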