On 01/11/13 09:15, Coughlan, Barry wrote:
Hi all,

According to the RIOT docs, iterating over triples/quads with piped
streams requires separate threads for producer/consumer.

For some applications this isn't practical. In my case I am running a
Hadoop job on N-Triples datasets, so I am parsing one triple at a time.
The overhead and extra code complexity of kicking off a thread to parse
each triple is too high, and this may be true for other use cases
involving small datasets.

I wrote some StreamRDF implementations which store the results in Java
Collections, so that parsing can be run on a single thread. Attached is
a patch with the implementations, tests and an example (I borrowed the
term 'Collector' from Apache Lucene). But I now suspect that I've
overlooked some simple existing API call to do this.

Any feedback appreciated.

Regards,
Barry

Barry,

Thanks for the contribution - I've created JENA-581 [1] and attached your patch. Looks like a useful thing to add to StreamRDFLib.

You could use a graph or model to collect your triples but at the granularity of one-by-one, even that may incur some overhead.

(You can pass the same StreamRDF to multiple calls of the parser machinery, e.g. RDFDataMgr.parse, to aggregate triples.)
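
To illustrate the pattern, here is a minimal single-threaded sketch. It deliberately does not use the Jena API: TripleSink, CollectorSink, SimpleTriple and ToyParser are hypothetical stand-ins (TripleSink plays the callback role of StreamRDF, ToyParser the role of the parser machinery), and the toy parser only understands very simple N-Triples-like lines. The point is just that one sink object can be handed to several parse calls and accumulates everything on the calling thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a parsed triple; this is not Jena's Triple class.
final class SimpleTriple {
    final String subject;
    final String predicate;
    final String object;

    SimpleTriple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

// Hypothetical stand-in for the callback role that StreamRDF plays.
interface TripleSink {
    void triple(SimpleTriple t);
}

// Stores every triple it receives in a list, on the caller's thread.
final class CollectorSink implements TripleSink {
    private final List<SimpleTriple> triples = new ArrayList<>();

    @Override
    public void triple(SimpleTriple t) {
        triples.add(t);
    }

    // Read-only view of what has been collected so far.
    List<SimpleTriple> getTriples() {
        return Collections.unmodifiableList(triples);
    }
}

// Toy parser (an assumption for illustration, not a real N-Triples parser):
// expects "subject predicate object ." where subject and predicate contain
// no whitespace and the object is the rest of the line before the final dot.
final class ToyParser {
    static void parse(TripleSink sink, String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return; // skip blank lines and comments
        }
        if (!trimmed.endsWith(".")) {
            throw new IllegalArgumentException("Missing terminating dot: " + line);
        }
        String body = trimmed.substring(0, trimmed.length() - 1).trim();
        String[] parts = body.split("\\s+", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected three terms: " + line);
        }
        sink.triple(new SimpleTriple(parts[0], parts[1], parts[2]));
    }
}

final class CollectorDemo {
    public static void main(String[] args) {
        CollectorSink sink = new CollectorSink();
        // The same sink is passed to two separate parse calls and aggregates both.
        ToyParser.parse(sink, "<http://example/s1> <http://example/p> <http://example/o1> .");
        ToyParser.parse(sink, "<http://example/s2> <http://example/p> \"a literal\" .");
        for (SimpleTriple t : sink.getTriples()) {
            System.out.println(t.subject + " " + t.predicate + " " + t.object);
        }
    }
}
```

With Jena itself, the collector would implement StreamRDF instead of the hypothetical TripleSink, and the parse calls would go through the parser machinery such as RDFDataMgr.parse.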

The documentation needs to spell out the implicit assumption that it's about parallel processing of data (typically a very large file); your use case isn't that.

        thanks
        Andy

[1] https://issues.apache.org/jira/browse/JENA-581
