On 01/11/13 09:15, Coughlan, Barry wrote:
Hi all,
According to the RIOT docs, iterating over triples/quads with piped
streams requires separate threads for producer/consumer.
For some applications this isn't practical. In my case I am running a
Hadoop job on N-Triples datasets, so I am parsing one triple at a time.
The overhead and extra code complexity of kicking off a thread to parse
each triple is too high, and this may be true for other use cases
involving small datasets.
I wrote some StreamRDF implementations which store the results in Java
Collections, so that parsing can be run on a single thread. Attached is
a patch with the implementations, tests and an example (I borrowed the
term 'Collector' from Apache Lucene). But I now suspect that I've
overlooked some simple existing API call to do this.
Any feedback appreciated.
Regards,
Barry
Barry,
Thanks for the contribution - I've created JENA-581 [1] and attached your
patch. Looks like a useful thing to add to StreamRDFLib.
You could use a graph or model to collect your triples but at the
granularity of one-by-one, even that may incur some overhead.
(You can pass the same StreamRDF to multiple calls of the parser
machinery to aggregate triples. e.g. RDFDataMgr.parse)
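
As a rough illustration of the collector idea and of reusing one sink across several parse calls, here is a minimal single-threaded sketch. It deliberately does not use the Jena API: SimpleTriple, TripleSink, ListCollector and parseLine are hypothetical stand-ins (TripleSink plays the role of a StreamRDF-style triple callback, parseLine the role of the parser machinery), and the toy parser only handles simple "<s> <p> <o> ." lines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a triple; this is not Jena's Triple class.
final class SimpleTriple {
    final String subject;
    final String predicate;
    final String object;

    SimpleTriple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

// Hypothetical stand-in for the triple callback of a streaming sink.
interface TripleSink {
    void triple(SimpleTriple t);
}

// Stores every triple it receives in a list, on the caller's own thread.
final class ListCollector implements TripleSink {
    private final List<SimpleTriple> triples = new ArrayList<>();

    @Override
    public void triple(SimpleTriple t) {
        triples.add(t);
    }

    List<SimpleTriple> getCollected() {
        return triples;
    }
}

final class CollectorSketch {
    // Toy parser (assumption for this sketch): one statement per line,
    // subject and predicate contain no whitespace, line ends with " .".
    static void parseLine(String line, TripleSink sink) {
        String body = line.trim();
        if (body.isEmpty() || body.startsWith("#")) {
            return; // skip blank lines and comments
        }
        if (body.endsWith(".")) {
            body = body.substring(0, body.length() - 1).trim();
        }
        String[] parts = body.split("\\s+", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Not a triple: " + line);
        }
        sink.triple(new SimpleTriple(parts[0], parts[1], parts[2]));
    }

    public static void main(String[] args) {
        // The same collector is passed to two separate parse calls,
        // so the results aggregate in one list without any extra thread.
        ListCollector collector = new ListCollector();
        parseLine("<http://example/s1> <http://example/p> <http://example/o1> .", collector);
        parseLine("<http://example/s2> <http://example/p> \"a literal\" .", collector);
        System.out.println(collector.getCollected().size());
    }
}
```

With real Jena code the collector would implement StreamRDF instead of the stand-in interface, and the parse calls would go through RDFDataMgr.parse.
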
The documentation needs to spell out its implicit assumption that the
piped streams are meant for parallel processing of data (typically a
very large file); your use case isn't that.
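
To show why a piped, iterator-style design wants two threads, here is a generic sketch using only java.util.concurrent. It is not RIOT's piped-stream implementation; PipedSketch, consumeAll and the END sentinel are hypothetical names for illustration. The producer blocks whenever the bounded buffer is full, so if it ran on the same thread as the consumer, it would stall as soon as the input exceeded the buffer size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class PipedSketch {
    // End-of-input marker (a convention of this sketch only; input lines
    // must not equal it).
    private static final String END = "<<end-of-input>>";

    // Feeds `lines` through a bounded buffer from a producer thread and
    // drains them on the calling thread, returning them in arrival order.
    static List<String> consumeAll(List<String> lines, int bufferSize) {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(bufferSize);

        Thread producer = new Thread(() -> {
            try {
                for (String line : lines) {
                    buffer.put(line); // blocks while the buffer is full
                }
                buffer.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        List<String> seen = new ArrayList<>();
        try {
            // take() blocks while the buffer is empty.
            for (String item = buffer.take(); !item.equals(END); item = buffer.take()) {
                seen.add(item);
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while consuming", e);
        }
        return seen;
    }
}
```

This thread start-up and hand-off cost is easy to justify for one large file, but not when it has to be paid once per triple.
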
thanks
Andy
[1] https://issues.apache.org/jira/browse/JENA-581