[ https://issues.apache.org/jira/browse/JENA-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17505930#comment-17505930 ]
Andy Seaborne commented on JENA-2309:
-------------------------------------

{{RDFParserBuilder.resolver(IRIxResolver resolver)}}

Parsing with a predefined set of prefixes is a matter of sending them to the destination first (passing them any other way will not work).

{code:java}
StreamRDF dest = ...;
// Assuming prefixes is a Map<String, String> of prefix label -> namespace IRI.
prefixes.forEach((prefix, iri) -> dest.prefix(prefix, iri));
RDFParser.create().resolver(myResolver)..... parse(dest);
{code}

There is also {{RDFFactory}}.

> Enhancing Riot for Big Data
> ---------------------------
>
> Key: JENA-2309
> URL: https://issues.apache.org/jira/browse/JENA-2309
> Project: Apache Jena
> Issue Type: Improvement
> Components: RIOT
> Affects Versions: Jena 4.5.0
> Reporter: Claus Stadler
> Priority: Major
>
> We have successfully managed to adapt Jena Riot to work quite efficiently within Apache Spark; however, we needed to make certain adaptations that rely on brittle reflection hacks and APIs that are marked for removal (namely PipedRDFIterator):
> In principle, for writing RDF data out, we implemented a mapPartition operation that maps the input RDF to lines of text via StreamRDF, which is understood by Apache Spark's RDD.saveAsTextFile();
> However, for use with Big Data we need to
> * disable blank node relabeling
> * preconfigure the StreamRDF with a given set of prefixes (that is broadcast to each node)
> Furthermore
> * The default PrefixMapping implementation is very inefficient when it comes to handling a dump of prefix.cc. I am using 2500 prefixes. Each RDF term in the output results in a scan of the full prefix map
> * Even if the PrefixMapping is optimized, the recently added PrefixMap adapter again does scanning - and it's a final class, so no easy override.
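The per-term scan of the full prefix map described above can be avoided with a trie keyed on the namespace IRIs. The following is only a minimal sketch of that idea: the class {{PrefixTrie}} and its {{add}} and {{abbreviate}} methods are hypothetical names made up for illustration. It is not Jena's PrefixMap or PrefixMapping API, nor the reporter's implementation, and it does not check whether the resulting local name is legal in Turtle/TriG.

```java
import java.util.HashMap;
import java.util.Map;

/** Character trie mapping namespace IRIs to prefix labels (illustrative only). */
final class PrefixTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String prefix; // non-null if a registered namespace IRI ends at this node
    }

    private final Node root = new Node();

    /** Registers a prefix label for a namespace IRI. */
    void add(String prefix, String namespace) {
        Node node = root;
        for (int i = 0; i < namespace.length(); i++) {
            node = node.children.computeIfAbsent(namespace.charAt(i), c -> new Node());
        }
        node.prefix = prefix;
    }

    /**
     * Returns "prefix:localName" for the longest registered namespace that the
     * IRI starts with, or null if no registered namespace matches.
     */
    String abbreviate(String iri) {
        Node node = root;
        String bestPrefix = null;
        int bestLength = 0;
        for (int i = 0; i < iri.length(); i++) {
            node = node.children.get(iri.charAt(i));
            if (node == null) {
                break; // no registered namespace continues along this path
            }
            if (node.prefix != null) {
                bestPrefix = node.prefix; // remember the longest match so far
                bestLength = i + 1;
            }
        }
        return bestPrefix == null ? null : bestPrefix + ":" + iri.substring(bestLength);
    }

    public static void main(String[] args) {
        PrefixTrie trie = new PrefixTrie();
        trie.add("ex", "http://example.org/");
        trie.add("exv", "http://example.org/vocab#");
        System.out.println(trie.abbreviate("http://example.org/vocab#name")); // exv:name
        System.out.println(trie.abbreviate("http://example.org/thing"));      // ex:thing
        System.out.println(trie.abbreviate("http://other.example/x"));        // null
    }
}
```

In this sketch a lookup walks at most one trie node per character of the IRI, so its cost depends on the IRI length rather than on the number of registered prefixes (2500 in the case described above).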
> And finally, we have a use case to allow for relative IRIs in the RDF: We are creating DCAT catalogs from directory content as in this file:
> DCAT catalog with relative IRIs over directory content: [work-in-progress example|https://hobbitdata.informatik.uni-leipzig.de/lsqv2/dumps/dcat.trig]
> If you retrieve the file with a semantic web client (riot, rapper, etc.) it will automatically use the download location as the base URL, thus giving absolute URLs to the published artifacts - regardless of the URL under which that directory is hosted.
> * IRIxResolver: We rely on IRIProviderJDK, which states "do not use in production"; however, it is the only one that lets us achieve the goal. [our code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/irixresolver/IRIxResolverUtils.java#L30]
> * Prologue: We use reflection to set the resolver and would like a setResolver method [our code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prologue/PrologueUtils.java#L65]
> * WriterStreamRDFBase: We need to be able to create instances of WriterStreamRDF classes which we can configure with our own PrefixMap instance (e.g. trie-backed) and our own LabelToNode strategy ("asGiven") - [our code|https://github.com/SANSA-Stack/SANSA-Stack/blob/40fa6f89f421eee22c9789973ec828ec3f970c33/sansa-spark-jena-java/src/main/java/net/sansa_stack/spark/io/rdf/output/RddRdfWriter.java#L387]
> * PrefixMapAdapter: We need an adapter that inherits the performance characteristics of the backing PrefixMapping [our code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prefix/PrefixMapAdapter.java#L57]
> * PrefixMapping: We need a trie-based implementation for efficiency.
> We created one based on the trie class in Jena, which in initial experiments was sufficiently fast, though we did not benchmark whether e.g. PatriciaTrie from Commons Collections would be faster. [our code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prefix/PrefixMappingTrie.java#L27]
> With PrefixMapTrie the profiler showed that the amount of time spent on abbreviate went from ~100% to 1% - though we are not totally sure about standards conformance here.
> * PipedRDFIterator / AsyncParser: We can read TriG as a splittable format (which is pretty cool) - however, this requires being able to start and stop the RDF parser at will for probing. In other words, AsyncParser needs to return ClosableIterators whose close method actually stops the parsing thread. Also, when scanning for prefixes we want to be able to create rules such as "as long as the parser emits a prefix with fewer than e.g. 100 non-prefix events in between, keep looking for prefixes" - AsyncParser has the API for it with EltStreamRDF, but it is private.
> For future-proofing, we'd like to have these use cases reflected in Jena.
> Because we have mostly sorted out all of the above issues, I'd prefer to address these things with only one or a few PRs (maybe the ClosableIterators on AsyncParsers would be more work, because our code only did that for PipedRDFIterator and I haven't looked in detail into the new architecture).

--
This message was sent by Atlassian Jira
(v8.20.1#820001)