Re: Single-threaded RIOT parsing of InputStream

Andy Seaborne Fri, 01 Nov 2013 16:08:58 -0700

On 01/11/13 09:15, Coughlan, Barry wrote:

Hi all,


According to the RIOT docs, iterating over triples/quads with piped
streams requires separate threads for producer/consumer.

For some applications this isn't practical. In my case I am running a
Hadoop job on N-Triples datasets, so I am parsing one triple at a time.
The overhead and extra code complexity of kicking off a thread to parse
each triple is too high, and this may be true for other use cases
involving small datasets.

I wrote some StreamRDF implementations which store the results in Java
Collections, so that parsing can be run on a single thread. Attached is
a patch with the implementations, tests and an example (I borrowed the
term 'Collector' from Apache Lucene). But I now suspect that I've
overlooked some simple existing API call to do this.

Any feedback appreciated.

Regards,
Barry


Barry,

Thanks for the contribution - I've created JENA-581 [1] and attached your patch. Looks like a useful thing to add to StreamRDFLib.

You could use a graph or model to collect your triples, but at the granularity of one-by-one, even that may incur some overhead.

(You can pass the same StreamRDF to multiple calls of the parser machinery to aggregate triples, e.g. RDFDataMgr.parse.)
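
As a rough illustration of that aggregation pattern, here is a minimal single-threaded sketch. It deliberately does not use Jena: TripleSink is a hypothetical stand-in for StreamRDF, ListCollector is a hypothetical collector (not the classes from the JENA-581 patch), and parseLine is a toy splitter standing in for the real parser call, handling only simple whitespace-separated N-Triples lines. The point is only that one sink instance can be handed to several parse calls and accumulates the results, with no producer/consumer threads.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CollectorSketch {

    /** Hypothetical stand-in for a RIOT StreamRDF sink; this is NOT Jena's interface. */
    interface TripleSink {
        void triple(String subject, String predicate, String object);
    }

    /** Hypothetical collector: stores every triple it receives in an in-memory list. */
    static final class ListCollector implements TripleSink {
        private final List<String[]> triples = new ArrayList<>();

        @Override
        public void triple(String subject, String predicate, String object) {
            triples.add(new String[] { subject, predicate, object });
        }

        /** Read-only view of everything collected so far. */
        List<String[]> getTriples() {
            return Collections.unmodifiableList(triples);
        }
    }

    /**
     * Toy replacement for the real parser call (assumption: input is one
     * simple N-Triples line of the form "subject predicate object .").
     * Blank lines and comment lines are skipped.
     */
    static void parseLine(String line, TripleSink sink) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return;
        }
        if (!trimmed.endsWith(".")) {
            throw new IllegalArgumentException("Missing terminating '.': " + line);
        }
        // Drop the trailing '.', then split into at most three terms so that
        // spaces inside a literal object are preserved.
        String body = trimmed.substring(0, trimmed.length() - 1).trim();
        String[] terms = body.split("\\s+", 3);
        if (terms.length != 3) {
            throw new IllegalArgumentException("Expected three terms: " + line);
        }
        sink.triple(terms[0], terms[1], terms[2]);
    }

    public static void main(String[] args) {
        // One collector, several independent parse calls, all on the calling thread.
        ListCollector collector = new ListCollector();
        parseLine("<http://example/s1> <http://example/p> \"one\" .", collector);
        parseLine("<http://example/s2> <http://example/p> \"two words\" .", collector);
        for (String[] t : collector.getTriples()) {
            System.out.println(t[0] + " " + t[1] + " " + t[2]);
        }
    }
}
```

With Jena itself, the same shape would presumably be a StreamRDF implementation passed repeatedly to RDFDataMgr.parse, but check the current RIOT API before relying on exact signatures.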

The documentation needs to spell out the implicit assumption that it's about parallel processing of data (typically a very large file); your use case isn't that.


        thanks
        Andy

[1] https://issues.apache.org/jira/browse/JENA-581
