[ 
https://issues.apache.org/jira/browse/JENA-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17505461#comment-17505461
 ] 

Andy Seaborne edited comment on JENA-2309 at 3/13/22, 1:55 PM:
---------------------------------------------------------------

Reformat my first comment because JIRA \{quote} has changed to something rather 
unhelpful.


was (Author: andy.seaborne):
Reformat my first comment because JITA \{quote} has changed to something rather 
unhelpful.

> Enhancing Riot for Big Data
> ---------------------------
>
>                 Key: JENA-2309
>                 URL: https://issues.apache.org/jira/browse/JENA-2309
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: RIOT
>    Affects Versions: Jena 4.5.0
>            Reporter: Claus Stadler
>            Priority: Major
>
> We have successfully adapted Jena RIOT to work quite efficiently within 
> Apache Spark. However, we needed to make certain adaptations that rely on 
> brittle reflection hacks and APIs that are marked for removal (namely 
> PipedRDFIterator):
> In principle, for writing RDF data out, we implemented a mapPartitions 
> operation that maps the input RDF to lines of text via StreamRDF, which can 
> then be written with Apache Spark's RDD.saveAsTextFile().
> However, for use with Big Data we need to
>  * disable blank node relabeling
>  * preconfigure the StreamRDF with a given set of prefixes (that is 
> broadcasted to each node)
> Furthermore
>  * The default PrefixMapping implementation is very inefficient when it comes 
> to handling a dump of prefix.cc. I am using 2500 prefixes. Each RDF term in 
> the output results in a scan of the full prefix map.
>  * Even if the PrefixMapping is optimized, the recently added PrefixMap 
> adapter again does scanning, and it's a final class, so there is no easy 
> override.
> And finally, we have a use case for allowing relative IRIs in the RDF: we are 
> creating DCAT catalogs with relative IRIs from directory content, as in this 
> file: [work-in-progress 
> example|https://hobbitdata.informatik.uni-leipzig.de/lsqv2/dumps/dcat.trig]
> If you retrieve the file with a semantic web client (riot, rapper, etc.), it 
> will automatically use the download location as the base URL, thus yielding 
> absolute URLs for the published artifacts, regardless of the URL under which 
> that directory is hosted.
>  * IRIxResolver: We rely on IRIProviderJDK, which states "do not use in 
> production"; however, it is the only one that lets us achieve the goal. [our 
> code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/irixresolver/IRIxResolverUtils.java#L30]
>  * Prologue: We use reflection to set the resolver and would like an 
> accessible setResolver method [our 
> code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prologue/PrologueUtils.java#L65]
>  * WriterStreamRDFBase: We need to be able to create instances of 
> WriterStreamRDF classes which we can configure with our own PrefixMap 
> instance (e.g. trie-backed), and our own LabelToNode strategy ("asGiven") - 
> [our 
> code|https://github.com/SANSA-Stack/SANSA-Stack/blob/40fa6f89f421eee22c9789973ec828ec3f970c33/sansa-spark-jena-java/src/main/java/net/sansa_stack/spark/io/rdf/output/RddRdfWriter.java#L387]
>  * PrefixMapAdapter: We need an adapter that inherits the performance 
> characteristics of the backing PrefixMapping [our 
> code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prefix/PrefixMapAdapter.java#L57]
>  * PrefixMapping: We need a trie-based implementation for efficiency. We 
> created one based on the trie class in Jena, which in initial experiments was 
> sufficiently fast, though we did not benchmark whether e.g. PatriciaTrie from 
> Commons Collections would be faster. [our 
> code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prefix/PrefixMappingTrie.java#L27]
> With PrefixMapTrie the profiler showed that the amount of time spent on 
> abbreviate went from ~100% to 1% - though we are not totally sure about 
> standard conformance here.
>  * PipedRDFIterator / AsyncParser: We can read TriG as a splittable format 
> (which is pretty cool) - however, this requires being able to start and stop 
> the RDF parser at will for probing. In other words, AsyncParser needs to 
> return ClosableIterators whose close method actually stops the parsing 
> thread. Also, when scanning for prefixes we want to be able to create rules 
> such as "as long as the parser emits a prefix with fewer than e.g. 100 
> non-prefix events in between, keep looking for prefixes" - AsyncParser has 
> the API for it with EltStreamRDF, but it is private.
> To be future-proof, we'd like these use cases to be reflected in Jena. 
> Because we have mostly sorted out all of the above issues, I'd prefer to 
> address them with only one or a few PRs (the ClosableIterators on 
> AsyncParser may be more work, because our code only did that for 
> PipedRDFIterator and I haven't looked into the new architecture in detail).
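
The trie-based prefix lookup requested in the issue can be illustrated with a small plain-Java sketch. This is not the jenax PrefixMappingTrie code and does not use any Jena API; the class and method names (PrefixTrieSketch, add, abbreviate) are hypothetical. It only shows why abbreviation cost depends on the IRI length rather than on the number of registered prefixes. Like the issue's own caveat about standard conformance, it does not check that the resulting local name is legal in Turtle.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: longest-match prefix abbreviation backed by a character trie.
final class PrefixTrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String prefix; // non-null if a registered namespace IRI ends at this node
    }

    private final Node root = new Node();

    // Register a prefix for a (non-empty) namespace IRI.
    void add(String prefix, String namespaceIri) {
        Node node = root;
        for (int i = 0; i < namespaceIri.length(); i++) {
            node = node.children.computeIfAbsent(namespaceIri.charAt(i), c -> new Node());
        }
        node.prefix = prefix;
    }

    // Returns "prefix:localName" for the longest registered namespace, or null if none matches.
    // The walk visits at most iri.length() nodes, independent of how many prefixes exist.
    String abbreviate(String iri) {
        Node node = root;
        String bestPrefix = null;
        int bestLength = -1;
        for (int i = 0; i < iri.length(); i++) {
            node = node.children.get(iri.charAt(i));
            if (node == null) {
                break;
            }
            if (node.prefix != null) {
                bestPrefix = node.prefix;
                bestLength = i + 1;
            }
        }
        return bestPrefix == null ? null : bestPrefix + ":" + iri.substring(bestLength);
    }

    public static void main(String[] args) {
        PrefixTrieSketch prefixes = new PrefixTrieSketch();
        prefixes.add("ex", "http://example.org/");
        prefixes.add("exv", "http://example.org/vocab#");
        System.out.println(prefixes.abbreviate("http://example.org/vocab#name"));
        System.out.println(prefixes.abbreviate("http://example.org/thing"));
    }
}
```

A linear scan over 2500 prefixes does one string comparison per prefix for every term, whereas the trie walk above does one map lookup per character of the IRI.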

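The request that close() on the iterator "actually stops the parsing thread" can be sketched without Jena as a bounded queue fed by a producer thread. Everything here is an assumption made for illustration: ClosableAsyncIterator, Source and Sink are hypothetical names, not the AsyncParser or PipedRDFIterator API, and the sketch assumes the producer reacts to thread interruption (a real parser blocked on I/O may additionally need its input stream closed).

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: an iterator over items produced on a background thread,
// whose close() interrupts that thread and waits for it to finish.
final class ClosableAsyncIterator<T> implements Iterator<T>, AutoCloseable {
    interface Sink<T> { void accept(T item) throws InterruptedException; }
    interface Source<T> { void run(Sink<T> sink) throws InterruptedException; }

    private static final Object END = new Object(); // end-of-stream marker
    private final BlockingQueue<Object> queue;
    private final Thread producer;
    private Object pending;   // item fetched by hasNext() but not yet returned
    private boolean finished;

    ClosableAsyncIterator(Source<T> source, int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
        producer = new Thread(() -> {
            try {
                source.run(queue::put); // blocks while the queue is full
                queue.put(END);
            } catch (InterruptedException e) {
                // close() was called: stop producing
            }
        }, "producer");
        producer.setDaemon(true);
        producer.start();
    }

    @Override
    public boolean hasNext() {
        if (finished) {
            return false;
        }
        if (pending == null) {
            try {
                pending = queue.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                finished = true;
                return false;
            }
            if (pending == END) {
                pending = null;
                finished = true;
                return false;
            }
        }
        return true;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        @SuppressWarnings("unchecked")
        T item = (T) pending;
        pending = null;
        return item;
    }

    // Stops the producer; assumes the source honours interruption.
    @Override
    public void close() {
        finished = true;
        producer.interrupt();
        try {
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    boolean isProducerAlive() {
        return producer.isAlive();
    }

    public static void main(String[] args) {
        // An endless source stands in for a parser over a large file.
        try (ClosableAsyncIterator<Integer> it = new ClosableAsyncIterator<Integer>(
                sink -> { for (int i = 0; ; i++) sink.accept(i); }, 4)) {
            for (int n = 0; n < 3; n++) {
                System.out.println(it.next());
            }
        } // close() interrupts and joins the producer thread here
    }
}
```

Probing a split would then amount to reading a few events and closing the iterator, without leaving a parser thread blocked on a full queue.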

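The relative-IRI use case (a catalog whose links follow wherever the file is hosted) can be illustrated with the JDK's java.net.URI. This is only a sketch of the idea, not Jena's IRIx resolution; the host names and the file name data/part-1.nt.bz2 are made-up examples, and java.net.URI follows the older RFC 2396 rules, which may differ from an RFC 3986 resolver in edge cases.

```java
import java.net.URI;

// Hypothetical sketch: the same relative reference resolves against
// whichever base URL the catalog was downloaded from.
final class RelativeIriSketch {
    static String resolve(String baseUrl, String relativeIri) {
        return URI.create(baseUrl).resolve(relativeIri).toString();
    }

    public static void main(String[] args) {
        String relative = "data/part-1.nt.bz2"; // as written in the catalog file
        System.out.println(resolve("https://example.org/dumps/dcat.trig", relative));
        System.out.println(resolve("https://mirror.example.net/lsq/dcat.trig", relative));
    }
}
```

Writing such a file requires that the writer keeps the relative form instead of resolving it against a base, which is the behaviour the issue asks to be supported.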

--
This message was sent by Atlassian Jira
(v8.20.1#820001)