[
https://issues.apache.org/jira/browse/JENA-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Claus Stadler updated JENA-2309:
--------------------------------
Description:
We have successfully adapted Jena RIOT to work quite efficiently within Apache Spark, however we needed to make certain adaptations that rely on brittle reflection hacks and on APIs that are marked for removal (namely PipedRDFIterator):
In principle, for writing RDF data out, we implemented a mapPartitions operation that maps the input RDF to lines of text via StreamRDF, which is understood by Apache Spark's RDD.saveAsTextFile() (a minimal sketch of this mapping follows the list below).
However, for use with Big Data we need to
* disable blank node relabeling
* preconfigure the StreamRDF with a given set of prefixes (broadcast to each node)
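As an illustration, here is a minimal sketch of that per-partition mapping. It is not our actual code: the class and method names are made up, it buffers a partition in memory, and it does not yet disable blank node relabeling or inject a broadcast prefix map - which is exactly what the hooks requested below are needed for.
{code:java}
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Iterator;

import org.apache.jena.graph.Triple;
import org.apache.jena.riot.RDFFormat;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWriter;

/** Hypothetical sketch: map one Spark partition of triples to lines of N-Triples text. */
public class TriplesToLines {
    public static Iterator<String> asLines(Iterator<Triple> partition) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // A line-oriented output format: each triple becomes one line of text.
        StreamRDF sink = StreamRDFWriter.getWriterStream(out, RDFFormat.NTRIPLES);
        sink.start();
        partition.forEachRemaining(sink::triple);
        sink.finish();
        // The resulting lines are what RDD.saveAsTextFile() then writes out.
        return Arrays.stream(out.toString(StandardCharsets.UTF_8).split("\n"))
                .filter(line -> !line.isEmpty())
                .iterator();
    }
}
{code}
In our real writer we instantiate the WriterStreamRDF* classes directly so that the prefix map and label policy can be injected, and that is where the reflection hacks come in.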
Furthermore:
* The default PrefixMapping implementation is very inefficient when it comes to handling a dump of prefix.cc. I am using 2500 prefixes, and abbreviating each RDF term in the output results in a scan of the full prefix map (see the lookup sketch after this list).
* Even if the PrefixMapping is optimized, the recently added PrefixMap adapter again does scanning - and it's a final class, so there is no easy override.
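To illustrate the direction of the fix, here is a small self-contained sketch (not Jena code; the class name is hypothetical) of an abbreviation lookup whose cost depends on the length of the IRI rather than on the number of registered prefixes - the same effect a trie-backed PrefixMapping achieves.
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: abbreviate an IRI without scanning every registered prefix. */
public class NamespaceIndex {
    // namespace IRI -> prefix, e.g. "http://xmlns.com/foaf/0.1/" -> "foaf"
    private final Map<String, String> prefixByNamespace = new HashMap<>();

    public void add(String prefix, String namespace) {
        prefixByNamespace.put(namespace, prefix);
    }

    /** Returns "prefix:local" for the longest registered namespace that starts the IRI. */
    public Optional<String> abbreviate(String iri) {
        // Probe progressively shorter leading substrings of the IRI; the number of
        // probes depends on the IRI length, not on the 2500 registered prefixes.
        // (Real code must additionally check that the local part is a legal PN_LOCAL.)
        for (int i = iri.length(); i > 0; i--) {
            String prefix = prefixByNamespace.get(iri.substring(0, i));
            if (prefix != null) {
                return Optional.of(prefix + ":" + iri.substring(i));
            }
        }
        return Optional.empty();
    }
}
{code}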
And finally, we have a use case for allowing relative IRIs in the RDF: we are creating DCAT catalogs from directory content, as in this file:
DCAT catalog with relative IRIs over directory content: [work-in-progress example|https://hobbitdata.informatik.uni-leipzig.de/lsqv2/dumps/dcat.trig]
If you retrieve the file with a semantic web client (riot, rapper, etc.), it will automatically use the download location as the base URL, thus yielding absolute URLs to the published artifacts - regardless of the URL under which that directory is hosted.
* IRIxResolver: We rely on IRIProviderJDK, which states "do not use in production", however it is the only one that lets us achieve the goal (see the resolver sketch after this list). [our code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/irixresolver/IRIxResolverUtils.java#L30]
* Prologue: We use reflection to set the resolver and would like a public setResolver method. [our code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prologue/PrologueUtils.java#L65]
* WriterStreamRDFBase: We need to be able to create instances of the WriterStreamRDF classes that we can configure with our own PrefixMap instance (e.g. trie-backed) and our own LabelToNode strategy ("asGiven"). [our code|https://github.com/SANSA-Stack/SANSA-Stack/blob/40fa6f89f421eee22c9789973ec828ec3f970c33/sansa-spark-jena-java/src/main/java/net/sansa_stack/spark/io/rdf/output/RddRdfWriter.java#L387]
* PrefixMapAdapter: We need an adapter that inherits the performance characteristics of the backing PrefixMapping. [our code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prefix/PrefixMapAdapter.java#L57]
* PrefixMapping: We need a trie-based implementation for efficiency. We created one based on the trie class in Jena, which in initial experiments was sufficiently fast, though we did not benchmark whether e.g. PatriciaTrie from Commons Collections would be faster. [our code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prefix/PrefixMappingTrie.java#L27]
With PrefixMapTrie the profiler showed that the amount of time spent in abbreviate went from ~100% to 1% - though I am not totally sure about standards conformance here.
* PipedRDFIterator / AsyncParser: We can read TriG as a splittable format (which is pretty cool) - however this requires being able to start and stop the RDF parser at will for probing. In other words, AsyncParser needs to return ClosableIterators whose close method actually stops the parsing thread. Also, when scanning for prefixes we want to be able to express rules such as "as long as the parser emits a prefix with fewer than e.g. 100 non-prefix events in between, keep looking for prefixes" - AsyncParser has the API for this with EltStreamRDF, but it is private (see the probing sketch after this list).
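For the relative-IRI use case above, the following is a rough outline of what our IRIxResolverUtils helper does. Treat it as an assumption-laden sketch rather than working code: in particular I am assuming the IRIxResolver builder exposes allowRelative(..) and resolve(..) with the obvious meaning, and the provider switch is global state, which is part of what makes the current approach brittle.
{code:java}
import org.apache.jena.irix.IRIProviderJDK;
import org.apache.jena.irix.IRIxResolver;
import org.apache.jena.irix.SystemIRIx;

/** Hypothetical sketch: set up a resolver that passes relative IRIs through as given. */
public class RelativeIriSetup {
    public static IRIxResolver newRelativeTolerantResolver(String baseIri) {
        // Globally switch to the lenient JDK-backed provider. It is the only one that
        // accepts relative IRIs, but it is documented as "do not use in production".
        SystemIRIx.setProvider(new IRIProviderJDK());

        // Assumption: with resolution disabled, relative IRIs reach the output unchanged.
        return IRIxResolver.create(baseIri)
                .allowRelative(true)
                .resolve(false)
                .build();
    }
}
{code}
The resolver then still has to be pushed into the Prologue, which is the part we currently do via reflection and which a public setResolver method would make unnecessary.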
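And for the prefix-scanning rule, here is a rough sketch of the behaviour we want, written against the public synchronous API only (class and method names are made up). It counts non-prefix events and aborts the parser by throwing from the sink; with AsyncParser we would like to achieve the same via ClosableIterators and the EltStreamRDF events instead.
{code:java}
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.jena.graph.Triple;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.system.StreamRDFBase;
import org.apache.jena.sparql.core.Quad;

/** Hypothetical sketch: collect prefixes from the head of a TriG stream. */
public class PrefixProbe {
    /** Thrown to stop parsing once the probing budget is exhausted. */
    private static class StopProbing extends RuntimeException {}

    public static Map<String, String> probePrefixes(InputStream in, int nonPrefixBudget) {
        Map<String, String> prefixes = new LinkedHashMap<>();
        StreamRDFBase sink = new StreamRDFBase() {
            int nonPrefixEvents = 0;

            @Override public void prefix(String prefix, String iri) {
                prefixes.put(prefix, iri);
                nonPrefixEvents = 0; // every prefix seen resets the budget
            }
            @Override public void triple(Triple triple) { count(); }
            @Override public void quad(Quad quad) { count(); }

            private void count() {
                if (++nonPrefixEvents > nonPrefixBudget) throw new StopProbing();
            }
        };
        try {
            RDFParser.create().source(in).lang(Lang.TRIG).parse(sink);
        } catch (StopProbing e) {
            // Expected: we only wanted the prefixes near the start of the data
            // (this assumes the parser lets exceptions from the sink propagate).
        }
        return prefixes;
    }
}
{code}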
For future-proofing, we would like these use cases to be reflected in Jena.
Because we have mostly sorted out all of the above issues, I'd prefer to address them with only one or a few PRs (maybe the ClosableIterators on AsyncParser would be more work, because our code only did that for PipedRDFIterator and I haven't looked in detail into the new architecture).
> Enhancing Riot for Big Data
> ---------------------------
>
> Key: JENA-2309
> URL: https://issues.apache.org/jira/browse/JENA-2309
> Project: Apache Jena
> Issue Type: Improvement
> Components: RIOT
> Affects Versions: Jena 4.5.0
> Reporter: Claus Stadler
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)