Re: Single-threaded RIOT parsing of InputStream

Rob Vesse Mon, 04 Nov 2013 01:56:29 -0800

Barry

As Andy has stated in his replies, no, we didn't already have this
functionality, and he has now added it to trunk.


As far as your described use case goes, I would point out that this mode of
operation will not scale unless you have appropriately partitioned the
data.  Parsing is inherently a blocking process, which is why the iterator
model provided by RIOT relies on a producer thread and a consumer thread
with a bounded, thread-safe queue between them.  The bound stops the producer
from filling memory with as much data as it can read before the consumer
ever gets to start processing it.
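
As a rough illustration of that producer/consumer arrangement, here is a
generic Java sketch.  It is not RIOT's actual implementation: the class name
BoundedPipeSketch, the method parseAndConsume and the use of trim() as a
stand-in for real parsing are all made up for the example.  The only point it
shows is that a bounded queue makes the producer block once it gets too far
ahead of the consumer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration only; none of these names come from Jena/RIOT.
class BoundedPipeSketch {
    // End-of-input marker; a distinct object so it can be compared by identity.
    private static final String END = new String("<end>");

    static List<String> parseAndConsume(List<String> lines, int capacity)
            throws InterruptedException {
        // At most 'capacity' parsed items are held in memory at any time.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);

        // Producer thread: "parses" each line and hands it over.
        Thread producer = new Thread(() -> {
            try {
                for (String line : lines) {
                    queue.put(line.trim()); // blocks while the queue is full
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consumer (the calling thread): takes items as they become available.
        // They are collected here only so the sketch can return a result; a
        // real consumer would process each item and then discard it.
        List<String> consumed = new ArrayList<>();
        for (String item = queue.take(); item != END; item = queue.take()) {
            consumed.add(item);
        }
        producer.join();
        return consumed;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> input = List.of(
                "<s1> <p> <o1> .", "<s2> <p> <o2> .", "<s3> <p> <o3> .");
        System.out.println(parseAndConsume(input, 2));
    }
}
```

With a capacity of 2 the producer can never be more than two items ahead of
the consumer, however large the input is.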

In your described model you will need to parse the entirety of the data into
memory before you can start consuming it, which risks OOM errors with larger
datasets.  If your real target is Hadoop input formats then you may want to
take a look at Paolo Castagna's jena-grande repository on GitHub instead:

https://github.com/castagna/jena-grande

It is a little out of date with respect to the latest Hadoop versions but
demonstrates how to create input formats for RDF:

https://github.com/castagna/jena-grande/tree/master/src/main/java/org/apache/jena/grande/mapreduce/io

Hope this helps,

Rob

From:  "Coughlan, Barry" <[email protected]>
Reply-To:  <[email protected]>
Date:  Friday, 1 November 2013 09:15
To:  "[email protected]" <[email protected]>
Subject:  Single-threaded RIOT parsing of InputStream

> Hi all,
> 
> According to the RIOT docs, iterating over triples/quads with piped streams
> requires separate threads for producer/consumer.
> 
> For some applications this isn't practical. In my case I am running a Hadoop
> job on N-Triples datasets, so I am parsing one triple at a time. The overhead
> and extra code complexity of kicking off a thread to parse each triple is too
> high, and this may be true for other use cases involving small datasets.
> 
> I wrote some StreamRDF implementations which store the results in Java
> Collections, so that parsing can be run on a single thread. Attached is a
> patch with the implementations, tests and an example (I borrowed the term
> 'Collector' from Apache Lucene). But I now suspect that I've overlooked some
> simple existing API call to do this.
> 
> Any feedback appreciated.
> 
> Regards,
> Barry
