On 16/09/11 21:22, Glenn Ammons wrote:
I have a number of CSV files over which I would like to do SPARQL
queries, without converting them to RDF first.  I'm trying to figure
out how to extend Jena so that each flat file would appear to ARQ
queries as a new named graph.  This page:

http://jena.sourceforge.net/ARQ/arq-query-eval.html

suggests extending GraphBase.java, which is straightforward enough,
but it doesn't explain how to register the new Graph implementation
with the system.  It seems to me that, at a minimum, I would need some
way to inform query execution of the named graphs that my extension
supplies.

Are there any examples of such an extension?  I know about
ARQ-2.8.8/src-examples/arq/examples/engine/MyQueryEngine.java, but I'm
not sure that I need to write my own query engine.  I've also been
looking at the TDB initialization code.

Thanks.
--glenn

Glenn,

You don't need to introduce the graph implementation to the system - TDB does some initialization for other reasons.

If you extend GraphBase (one method needed - find(s,p,o)) to deal with CSV mapping then it's done.
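A skeleton of that route might look like the following. This is a sketch, not a definitive implementation: the class name CsvGraph is invented, the triples are assumed to have been built from the CSV rows elsewhere, and the graphBaseFind(TripleMatch) signature shown is the Jena 2.x one from the ARQ 2.8.8 era (later Jena versions take a Triple pattern instead).

```java
import java.util.ArrayList;
import java.util.List;

import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.Triple;
import com.hp.hpl.jena.graph.TripleMatch;
import com.hp.hpl.jena.graph.impl.GraphBase;
import com.hp.hpl.jena.util.iterator.ExtendedIterator;
import com.hp.hpl.jena.util.iterator.WrappedIterator;

/**
 * Sketch of a read-only Graph over one CSV file.  The CSV rows are assumed
 * to have been translated into a triple list already; a lazier version
 * could scan the file inside graphBaseFind instead.
 */
public class CsvGraph extends GraphBase {
    private final List<Triple> triples;

    public CsvGraph(List<Triple> triplesFromCsv) {
        this.triples = triplesFromCsv;
    }

    @Override
    protected ExtendedIterator<Triple> graphBaseFind(TripleMatch m) {
        // Linear scan; fine for modest files, replace with an index if needed.
        List<Triple> matched = new ArrayList<Triple>();
        for (Triple t : triples)
            if (slotMatches(m.getMatchSubject(), t.getSubject())
                    && slotMatches(m.getMatchPredicate(), t.getPredicate())
                    && slotMatches(m.getMatchObject(), t.getObject()))
                matched.add(t);
        return WrappedIterator.create(matched.iterator());
    }

    // In a TripleMatch, null and Node.ANY both mean "wildcard".
    private static boolean slotMatches(Node pattern, Node node) {
        return pattern == null || Node.ANY.equals(pattern) || pattern.equals(node);
    }
}
```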

Then  (1)

Model model = ModelFactory.createModelForGraph(graph) ;

and you can put one model per file in a DataSource.
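Spelled out, option (1) might look like this (Jena 2.x-era names; the helper class and its shape are invented for illustration, with the CSV-backed graphs passed in keyed by the name each should have in the dataset):

```java
import java.util.Map;

import com.hp.hpl.jena.graph.Graph;
import com.hp.hpl.jena.query.DataSource;
import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.DatasetFactory;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class CsvDatasetBuilder {
    /** Wrap each CSV-backed Graph as a Model and register it under its name. */
    public static Dataset build(Map<String, Graph> csvGraphs) {
        DataSource ds = DatasetFactory.create();   // empty in-memory dataset
        for (Map.Entry<String, Graph> e : csvGraphs.entrySet())
            ds.addNamedModel(e.getKey(),
                             ModelFactory.createModelForGraph(e.getValue()));
        return ds;
    }
}
```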

or (2) you can skip the Model layer: build a DatasetGraph (e.g. DatasetGraphMap) and wrap it up as a Dataset with DatasetFactory.create(datasetGraph).
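Option (2) might be sketched like this (again a sketch, with invented class and method names around the real ARQ calls; the DatasetGraphMap constructor and DatasetGraph.addGraph shown here are from the ARQ 2.8.x-era API and may differ in other versions):

```java
import java.util.Map;

import com.hp.hpl.jena.graph.Graph;
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.DatasetFactory;
import com.hp.hpl.jena.sparql.core.DatasetGraphMap;

public class CsvDatasetGraphBuilder {
    /** Put each CSV-backed Graph straight into a DatasetGraphMap. */
    public static Dataset build(Graph defaultGraph, Map<String, Graph> csvGraphs) {
        DatasetGraphMap dsg = new DatasetGraphMap(defaultGraph);
        for (Map.Entry<String, Graph> e : csvGraphs.entrySet())
            dsg.addGraph(Node.createURI(e.getKey()), e.getValue());
        return DatasetFactory.create(dsg);
    }
}
```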

These two get you to the same place from the point-of-view of the SPARQL system. They each use a general purpose Dataset(Graph) implementation that maps down to specific graphs.

This general purpose Dataset implementation is already in ARQ - no need to register specific query engines for this case.

Both execute SPARQL queries via the Graph.find operation. Everything will work.
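For completeness, running a query over such a dataset is ordinary ARQ usage, nothing special for the CSV-backed case (the helper class here is invented; the query API calls are standard ARQ):

```java
import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;

public class CsvQuery {
    /** Run a SELECT over the dataset; ARQ answers it through Graph.find. */
    public static void run(Dataset dataset) {
        Query query = QueryFactory.create(
            "SELECT * WHERE { GRAPH ?g { ?s ?p ?o } }");
        QueryExecution qexec = QueryExecutionFactory.create(query, dataset);
        try {
            ResultSetFormatter.out(qexec.execSelect());
        } finally {
            qexec.close();
        }
    }
}
```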

Only if you want a specialised engine that does something more (specialised indexing, say, or so many CSV files that a dedicated storage system is worthwhile) do you need to go further.

Even if that's needed, I suggest implementing the GraphBase route first because it is the quickest way to get something working. The thing you will have to add is the translation between RDF and the CSV data model used in the columns.
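That translation is a design decision of yours, not something Jena fixes. One possible convention, purely as an illustration (all URI shapes here are invented): row i of a file becomes subject <fileUri#row-i>, each column name becomes a predicate in the file's namespace, and the cell value becomes a plain literal.

```java
/**
 * A tiny, self-contained sketch of one possible CSV-to-triple naming
 * convention.  The URI shapes are invented for illustration only.
 */
public class CsvMapping {
    /** Build the subject/predicate/object strings for one CSV cell. */
    public static String[] cellToTriple(String fileUri, int rowIndex,
                                        String columnName, String value) {
        String subject = fileUri + "#row-" + rowIndex;
        String predicate = fileUri + "#" + columnName;
        return new String[] { subject, predicate, value };
    }

    public static void main(String[] args) {
        String[] t = cellToTriple("http://example/data.csv", 0, "name", "Glenn");
        // prints: http://example/data.csv#row-0 http://example/data.csv#name "Glenn"
        System.out.println(t[0] + " " + t[1] + " \"" + t[2] + "\"");
    }
}
```

Whatever convention you pick, keep it deterministic so that the same file always yields the same triples; that is what makes the graph view behave like stored data.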

ARQ has the CSV and TSV output that is now a W3C draft, in case that helps for getting information back out of the RDF view.

http://www.w3.org/TR/sparql11-results-csv-tsv/
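If your ARQ build has it (the CSV/TSV result writers appeared around this time, so check your version), writing a result set out in that format is one call. A minimal sketch, run here over an empty model just to show the call:

```java
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class CsvResults {
    public static void main(String[] args) {
        QueryExecution qe = QueryExecutionFactory.create(
            QueryFactory.create("SELECT * { ?s ?p ?o }"),
            ModelFactory.createDefaultModel());
        // Write the result set in the draft CSV format (outputAsTSV for TSV).
        ResultSetFormatter.outputAsCSV(System.out, qe.execSelect());
        qe.close();
    }
}
```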

TDB implements a specialized DatasetGraph and executes SPARQL queries (the BGP parts) directly over its indexes, without converting back and forth through Node objects - that's why it registers itself. If you put TDB-backed graphs in a general purpose dataset, then ARQ is going to access TDB via the Graph.find route.

        Andy
