Claus Stadler created JENA-2309:
-----------------------------------
Summary: Enhancing Riot for Big Data
Key: JENA-2309
URL: https://issues.apache.org/jira/browse/JENA-2309
Project: Apache Jena
Issue Type: Improvement
Components: RIOT
Affects Versions: Jena 4.5.0
Reporter: Claus Stadler
We have successfully adapted Jena Riot to work quite efficiently within
Apache Spark; however, we needed to make certain adaptations that rely on
brittle reflection hacks and on APIs that are marked for removal (namely
PipedRDFIterator):
In principle, for writing RDF data out, we implemented a mapPartitions
operation that maps the input RDF to lines of text via StreamRDF, which Apache
Spark's RDD.saveAsTextFile() understands.
However, for use with Big Data we need to
* disable blank node relabeling
* preconfigure the StreamRDF with a given set of prefixes (that is broadcast
to each node) - see the sketch below
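The sketch below is a minimal, hedged illustration rather than our actual
RddRdfWriter: the class name RddRdfWriteSketch is made up, the prefixes are
broadcast as a plain Map, and each triple is formatted as one line via FmtUtils
instead of going through a full StreamRDF - but it shows the
mapPartitions-plus-saveAsTextFile shape described above.
{code:java}
// Hypothetical sketch of the write path described above (not our production code).
import java.util.Iterator;
import java.util.Map;

import org.apache.jena.graph.Triple;
import org.apache.jena.shared.PrefixMapping;
import org.apache.jena.shared.impl.PrefixMappingImpl;
import org.apache.jena.sparql.serializer.SerializationContext;
import org.apache.jena.sparql.util.FmtUtils;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class RddRdfWriteSketch {
    // Assumes the RDD of Jena Triples is already serializable (e.g. via Kryo registrators).
    public static void write(JavaSparkContext sc, JavaRDD<Triple> triples,
                             Map<String, String> prefixes, String outPath) {
        // Broadcast the raw prefix map once; each partition rebuilds its PrefixMapping locally.
        Broadcast<Map<String, String>> bcPrefixes = sc.broadcast(prefixes);

        JavaRDD<String> lines = triples.mapPartitions(it -> {
            // In practice a trie-backed PrefixMapping would be used here (see below).
            PrefixMapping pm = new PrefixMappingImpl();
            pm.setNsPrefixes(bcPrefixes.getValue());
            SerializationContext cxt = new SerializationContext(pm);
            return new Iterator<String>() {
                @Override public boolean hasNext() { return it.hasNext(); }
                @Override public String next() {
                    // One abbreviated triple per output line; prefix declarations
                    // themselves would be emitted separately (e.g. in a header file).
                    return FmtUtils.stringForTriple(it.next(), cxt) + " .";
                }
            };
        });
        // Each line becomes one record in the text output.
        lines.saveAsTextFile(outPath);
    }
}
{code}
In our actual code the per-partition mapping goes through a StreamRDF that we
configure with our own PrefixMap and a LabelToNode "asGiven" strategy (see the
WriterStreamRDFBase point below).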
Furthermore:
* The default PrefixMapping implementation is very inefficient when it comes to
handling a dump of prefix.cc - I am using 2500 prefixes. Abbreviating each RDF
term in the output results in a scan of the full prefix map (see the lookup
sketch after this list).
* Even if the PrefixMapping is optimized, the recently added PrefixMap adapter
again does scanning - and it is a final class, so there is no easy override.
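The lookup sketch just mentioned: a hedged illustration (neither Jena code nor
our trie-backed PrefixMapping, which is linked further below) of how a sorted
map - or equivalently a trie - supports longest-prefix abbreviation without
scanning all registered namespaces.
{code:java}
// Sketch only: longest-prefix abbreviation via a sorted map instead of a full scan.
import java.util.Map;
import java.util.TreeMap;

public class LongestPrefixSketch {
    // Maps namespace IRI -> prefix label, kept sorted by namespace IRI.
    private final TreeMap<String, String> nsToPrefix = new TreeMap<>();

    public void add(String prefix, String namespaceIri) {
        nsToPrefix.put(namespaceIri, prefix);
    }

    /** Returns "prefix:local" or null if no registered namespace is a prefix of the IRI. */
    public String abbreviate(String iri) {
        // Candidate: the greatest namespace IRI that is <= the full IRI.
        Map.Entry<String, String> e = nsToPrefix.floorEntry(iri);
        while (e != null) {
            String ns = e.getKey();
            if (iri.startsWith(ns)) {
                // A real implementation would also check that the remainder
                // is a syntactically valid local name.
                return e.getValue() + ":" + iri.substring(ns.length());
            }
            // Walk down to shorter candidates; a trie avoids this walk entirely.
            e = nsToPrefix.lowerEntry(ns);
        }
        return null;
    }
}
{code}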
And finally, we have a use case that requires allowing relative IRIs in the
RDF: we are creating DCAT catalogs from directory content, as in this file:
DCAT catalog with relative IRIs over directory content: [work-in-progress
example|https://hobbitdata.informatik.uni-leipzig.de/lsqv2/dumps/dcat.trig]
If you retrieve the file with a semantic web client (riot, rapper, etc.), it
will automatically use the download location as the base URL and thus yield
absolute URLs for the published artifacts - regardless of the URL under which
that directory is hosted.
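As an illustration (a hedged sketch, not from our code; the mirror base URL is
made up), parsing the same TriG file with two different explicit bases yields
different absolute IRIs for the same relative references - a plain download
effectively makes the retrieval URL play the role of that base.
{code:java}
// Sketch: the base IRI chosen at parse time determines the resulting absolute IRIs.
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.system.StreamRDFLib;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.core.DatasetGraphFactory;

public class RelativeIriSketch {
    static DatasetGraph parseWithBase(String url, String base) {
        DatasetGraph dsg = DatasetGraphFactory.create();
        RDFParser.create()
                .source(url)
                .lang(Lang.TRIG)
                .base(base)   // override the base instead of using the download location
                .parse(StreamRDFLib.dataset(dsg));
        return dsg;
    }

    public static void main(String[] args) {
        String url = "https://hobbitdata.informatik.uni-leipzig.de/lsqv2/dumps/dcat.trig";
        // Same file, two hosting locations -> two different sets of absolute IRIs.
        parseWithBase(url, "https://hobbitdata.informatik.uni-leipzig.de/lsqv2/dumps/")
                .find().forEachRemaining(System.out::println);
        parseWithBase(url, "https://mirror.example.org/lsqv2/dumps/")   // hypothetical mirror
                .find().forEachRemaining(System.out::println);
    }
}
{code}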
* IRIxResolver: We rely on IRIProviderJDK, which states "do not use in
production"; however, it is the only one that lets us achieve this goal. [our
code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/irixresolver/IRIxResolverUtils.java#L30]
* Prologue: We use reflection to set the resolver and would like a public
setResolver method instead. [our
code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prologue/PrologueUtils.java#L65]
* WriterStreamRDFBase: We need to be able to create instances of the
WriterStreamRDF classes that we can configure with our own PrefixMap instance
(e.g. trie-backed) and our own LabelToNode strategy ("asGiven") - [our
code|https://github.com/SANSA-Stack/SANSA-Stack/blob/40fa6f89f421eee22c9789973ec828ec3f970c33/sansa-spark-jena-java/src/main/java/net/sansa_stack/spark/io/rdf/output/RddRdfWriter.java#L387]
* PrefixMapAdapter: We need an adapter that inherits the performance
characteristics of the backing PrefixMapping. [our
code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prefix/PrefixMapAdapter.java#L57]
* PrefixMapping: We need a trie-based implementation for efficiency. We created
one based on the trie class in Jena, which in initial experiments was
sufficiently fast, though we did not benchmark whether e.g. PatriciaTrie from
Commons Collections would be faster. [our
code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prefix/PrefixMappingTrie.java#L27]
With PrefixMapTrie, the profiler showed that the amount of time spent in
abbreviate dropped from ~100% to 1% - though we are not totally sure about
standards conformance here.
* PipedRDFIterator / AsyncParser: We can read TriG as a splittable format
(which is pretty cool); however, this requires being able to start and stop the
RDF parser at will for probing. In other words, AsyncParser needs to return
ClosableIterators whose close method actually stops the parsing thread. Also,
when scanning for prefixes we want to be able to express rules such as "as long
as the parser emits a prefix with fewer than e.g. 100 non-prefix events in
between, keep looking for prefixes" - AsyncParser has the API for this with
EltStreamRDF, but it is private. A sketch of such a prefix-scan rule follows
below.
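The sketch uses a hypothetical event model: the Event interface below is made
up for illustration (in Jena this role would be played by the currently private
EltStreamRDF elements). Prefixes are collected until a configurable number of
consecutive non-prefix events has passed since the last prefix, after which the
caller would close the underlying iterator - which is exactly where the close
method needs to actually stop the parsing thread.
{code:java}
// Sketch only: "keep scanning for prefixes until N non-prefix events pass without one".
// The Event type is hypothetical; in Jena it corresponds to the (currently private)
// EltStreamRDF elements emitted by AsyncParser.
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class PrefixScanSketch {

    /** Hypothetical stand-in for one parser event (triple, quad, prefix, base, ...). */
    public interface Event {
        boolean isPrefix();
        String prefixLabel();
        String prefixIri();
    }

    /**
     * Collects prefixes from the event stream, giving up once {@code maxGap}
     * consecutive non-prefix events have been seen since the last prefix.
     */
    public static Map<String, String> scanPrefixes(Iterator<Event> events, int maxGap) {
        Map<String, String> prefixes = new HashMap<>();
        int nonPrefixSinceLastPrefix = 0;
        while (events.hasNext() && nonPrefixSinceLastPrefix < maxGap) {
            Event e = events.next();
            if (e.isPrefix()) {
                prefixes.put(e.prefixLabel(), e.prefixIri());
                nonPrefixSinceLastPrefix = 0;   // reset the gap counter
            } else {
                nonPrefixSinceLastPrefix++;
            }
        }
        // The caller should now close the underlying iterator so the parsing thread stops.
        return prefixes;
    }
}
{code}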
For future-proofing, we would like these use cases to be reflected in Jena. We
have mostly sorted out all of the above issues on our side, and I would prefer
to address them with only one or a few PRs (though the ClosableIterators on
AsyncParser would probably be more work, because our code only did that for
PipedRDFIterator).
--
This message was sent by Atlassian Jira
(v8.20.1#820001)