[
https://issues.apache.org/jira/browse/JENA-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17505461#comment-17505461
]
Andy Seaborne edited comment on JENA-2309 at 3/13/22, 1:55 PM:
---------------------------------------------------------------
Reformat my first comment because JIRA \{quote} has changed to something rather
unhelpful.
was (Author: andy.seaborne):
Reformat my first comment because JITA \{quote} has changed to something rather
unhelpful.
> Enhancing Riot for Big Data
> ---------------------------
>
> Key: JENA-2309
> URL: https://issues.apache.org/jira/browse/JENA-2309
> Project: Apache Jena
> Issue Type: Improvement
> Components: RIOT
> Affects Versions: Jena 4.5.0
> Reporter: Claus Stadler
> Priority: Major
>
> We have successfully adapted Jena RIOT to work quite efficiently within Apache
> Spark. However, this required adaptations that rely on brittle reflection
> hacks and on APIs that are marked for removal (namely PipedRDFIterator).
> In principle, for writing RDF data out, we implemented a mapPartitions
> operation that maps the input RDF to lines of text via StreamRDF; these lines
> are then written out with Apache Spark's RDD.saveAsTextFile().
> However, for use with Big Data we need to
> * disable blank node relabeling
> * preconfigure the StreamRDF with a given set of prefixes (that is
> broadcast to each node)
> Furthermore
> * The default PrefixMapping implementation is very inefficient when it comes
> to handling a dump of prefix.cc. I am using 2500 prefixes. Each RDF term in
> the output results in a scan of the full prefix map
> * Even if the PrefixMapping is optimized, the recently added PrefixMap
> adapter again does scanning, and it's a final class, so there is no easy
> override.
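
To illustrate why a trie avoids the per-term scan described above, here is a minimal, self-contained sketch of longest-prefix abbreviation. It is not Jena's PrefixMapping or PrefixMap API; the class name PrefixTrieSketch and its methods are hypothetical, and it skips the local-name validity checks a real Turtle/TriG writer would need. Lookup cost depends on the length of the IRI, not on the number of registered prefixes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration only; not part of Jena. */
class PrefixTrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String prefix; // non-null if a registered namespace IRI ends at this node
    }

    private final Node root = new Node();

    /** Register a prefix label for a namespace IRI. */
    void add(String prefix, String namespaceIri) {
        Node node = root;
        for (int i = 0; i < namespaceIri.length(); i++) {
            node = node.children.computeIfAbsent(namespaceIri.charAt(i), c -> new Node());
        }
        node.prefix = prefix;
    }

    /** Abbreviate with the longest registered namespace that the IRI starts with. */
    Optional<String> abbreviate(String iri) {
        Node node = root;
        String bestPrefix = null;
        int bestLength = -1;
        for (int i = 0; i < iri.length(); i++) {
            node = node.children.get(iri.charAt(i));
            if (node == null) {
                break; // no registered namespace continues along this path
            }
            if (node.prefix != null) {
                bestPrefix = node.prefix;
                bestLength = i + 1;
            }
        }
        if (bestPrefix == null) {
            return Optional.empty();
        }
        // Assumption: the remainder is used as the local name without validation.
        return Optional.of(bestPrefix + ":" + iri.substring(bestLength));
    }

    public static void main(String[] args) {
        PrefixTrieSketch trie = new PrefixTrieSketch();
        trie.add("ex", "http://example.org/");
        trie.add("exv", "http://example.org/vocab/");
        System.out.println(trie.abbreviate("http://example.org/vocab/name").orElse("<none>"));
        System.out.println(trie.abbreviate("http://other.example/x").orElse("<none>"));
    }
}
```

With 2500 prefixes, a scan-based abbreviate compares each term against every namespace, whereas the walk above touches at most one trie node per character of the term.
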
> And finally, we have a use case that requires relative IRIs in the RDF: we are
> creating DCAT catalogs with relative IRIs over directory content, as in this
> [work-in-progress
> example|https://hobbitdata.informatik.uni-leipzig.de/lsqv2/dumps/dcat.trig].
> If you retrieve the file with a semantic web client (riot, rapper, etc.), it
> will automatically use the download location as the base URL and thus give
> absolute URLs to the published artifacts, regardless of the URL under which
> the directory is hosted.
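
The effect described above, where the same relative IRI yields different absolute URLs depending on where the catalog was downloaded from, can be sketched with the JDK alone. This uses java.net.URI purely for illustration and is not Jena's IRIx machinery; the class name, host names, and file names below are made-up placeholders.

```java
import java.net.URI;

/** Hypothetical illustration only; not part of Jena. */
class RelativeIriSketch {
    /** Resolve a possibly relative reference against the URL the document was retrieved from. */
    static URI resolve(String retrievalUrl, String reference) {
        return URI.create(retrievalUrl).resolve(reference);
    }

    public static void main(String[] args) {
        String relative = "data/part-0001.nt"; // placeholder artifact name
        // The same catalog content served from two placeholder locations.
        System.out.println(resolve("https://example.org/dumps/dcat.trig", relative));
        System.out.println(resolve("https://mirror.example.net/lsq/dcat.trig", relative));
    }
}
```

Because the catalog itself stores only the relative reference, the directory can be moved or mirrored without rewriting the file.
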
> * IRIxResolver: We rely on IRIProviderJDK, which states "do not use in
> production"; however, it is the only one that lets us achieve the goal. [our
> code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/irixresolver/IRIxResolverUtils.java#L30]
> * Prologue: We use reflection to set the resolver and would like a
> setResolver method instead. [our
> code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prologue/PrologueUtils.java#L65]
> * WriterStreamRDFBase: We need to be able to create instances of
> WriterStreamRDF classes which we can configure with our own PrefixMap
> instance (e.g. trie-backed), and our own LabelToNode strategy ("asGiven") -
> [our
> code|https://github.com/SANSA-Stack/SANSA-Stack/blob/40fa6f89f421eee22c9789973ec828ec3f970c33/sansa-spark-jena-java/src/main/java/net/sansa_stack/spark/io/rdf/output/RddRdfWriter.java#L387]
> * PrefixMapAdapter: We need an adapter that inherits the performance
> characteristics of the backing PrefixMapping [our
> code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prefix/PrefixMapAdapter.java#L57]
> * PrefixMapping: We need a trie-based implementation for efficiency. We
> created one based on the trie class in Jena, which was sufficiently fast in
> initial experiments, though we did not benchmark whether e.g. PatriciaTrie
> from Commons Collections would be faster. [our
> code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prefix/PrefixMappingTrie.java#L27]
> With PrefixMapTrie the profiler showed that the amount of time spent on
> abbreviate went from ~100% to 1%, though we are not totally sure about
> standards conformance here.
> * PipedRDFIterator / AsyncParser: We can read TriG as a splittable format
> (which is pretty cool); however, this requires being able to start and stop
> the RDF parser at will for probing. In other words, AsyncParser needs to
> return ClosableIterators whose close method actually stops the parsing
> thread. Also, when scanning for prefixes we want to be able to create rules
> such as "as long as the parser emits a prefix with fewer than e.g. 100
> non-prefix events in between, keep looking for prefixes". AsyncParser has the
> API for it with EltStreamRDF, but it is private.
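
The requirement that close actually stops the parsing thread can be sketched without Jena as a producer thread feeding a bounded queue, where closing the iterator interrupts the producer. This is not AsyncParser or PipedRDFIterator; the class ClosableProducerIterator and its Source/Sink interfaces are hypothetical, and the sketch assumes the producer reacts to thread interruption.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical illustration only; not part of Jena. */
class ClosableProducerIterator<T> implements Iterator<T>, AutoCloseable {
    interface Sink<E> { void accept(E item) throws InterruptedException; }
    interface Source<E> { void run(Sink<E> sink) throws InterruptedException; }

    private static final Object END = new Object(); // end-of-stream marker

    private final BlockingQueue<Object> queue = new ArrayBlockingQueue<>(16);
    private final Thread producer;
    private Object lookahead; // fetched but not yet returned element, or null
    private boolean finished;

    ClosableProducerIterator(Source<T> source) {
        producer = new Thread(() -> {
            try {
                source.run(queue::put); // blocks when the consumer falls behind
                queue.put(END);
            } catch (InterruptedException e) {
                // close() was called: stop producing and let the thread end
            }
        }, "producer");
        producer.setDaemon(true);
        producer.start();
    }

    @Override
    public boolean hasNext() {
        if (finished) {
            return false;
        }
        if (lookahead == null) {
            try {
                lookahead = queue.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                finished = true;
                return false;
            }
            if (lookahead == END) {
                lookahead = null;
                finished = true;
                return false;
            }
        }
        return true;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T item = (T) lookahead;
        lookahead = null;
        return item;
    }

    /** Stops the producer thread instead of merely abandoning it. */
    @Override
    public void close() {
        finished = true;
        producer.interrupt();
    }

    /** Waits up to the given time and reports whether the producer has terminated. */
    boolean awaitStopped(long millis) {
        try {
            producer.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !producer.isAlive();
    }

    public static void main(String[] args) {
        // The endless loop stands in for a parser that would otherwise run to the end of input.
        ClosableProducerIterator<Integer> it = new ClosableProducerIterator<Integer>(sink -> {
            for (int i = 0; ; i++) {
                sink.accept(i);
            }
        });
        System.out.println(it.next());
        it.close();
        System.out.println("producer stopped: " + it.awaitStopped(2000));
    }
}
```

Probing a split then amounts to reading a few elements and calling close, without leaving a blocked thread behind for every probe.
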
> To be future-proof, we would like these use cases to be reflected in Jena.
> Because we have mostly sorted out all of the above issues, I'd prefer to
> address them with only one or a few PRs (the ClosableIterators on
> AsyncParser may be more work, because our code only did that for
> PipedRDFIterator and I haven't looked in detail into the new architecture).