On 16/09/11 21:22, Glenn Ammons wrote:
I have a number of CSV files over which I would like to do SPARQL
queries, without converting them to RDF first. I'm trying to figure
out how to extend Jena so that each flat file would appear to ARQ
queries as a new named graph. This page:
http://jena.sourceforge.net/ARQ/arq-query-eval.html
suggests extending GraphBase.java, which is straightforward enough,
but it doesn't explain how to register the new Graph implementation
with the system. It seems to me that, at a minimum, I would need some
way to inform query execution of the named graphs that my extension
supplies.
Are there any examples of such an extension? I know about
ARQ-2.8.8/src-examples/arq/examples/engine/MyQueryEngine.java, but I'm
not sure that I need to write my own query engine. I've also been
looking at the TDB initialization code.
Thanks.
--glenn
Glenn,
You don't need to register the graph implementation with the system - the
initialization TDB does is for other reasons.
If you extend GraphBase (one method needed - find(s,p,o)) to deal with
the CSV mapping, then it's done.
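A rough sketch of what that subclass could look like (the CsvGraph name,
the row/column URI scheme and the naive line-splitting are all made up for
illustration; written from memory against the Jena 2.6.x API that ARQ
2.8.8 uses, untested):

    import java.io.*;
    import java.util.*;

    import com.hp.hpl.jena.graph.*;
    import com.hp.hpl.jena.graph.impl.GraphBase;
    import com.hp.hpl.jena.util.iterator.ExtendedIterator;
    import com.hp.hpl.jena.util.iterator.WrappedIterator;

    /** Read-only graph view of one CSV file: one subject per row,
     *  one predicate per column, cell values as plain literals. */
    public class CsvGraph extends GraphBase {
        private final List<Triple> triples = new ArrayList<Triple>();

        public CsvGraph(File file, String baseURI) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(file));
            try {
                // Assumes a header row and no quoted commas.
                String[] header = in.readLine().split(",");
                String line;
                int row = 0;
                while ((line = in.readLine()) != null) {
                    String[] cells = line.split(",");
                    Node subject = Node.createURI(baseURI + "row" + (row++));
                    for (int col = 0; col < cells.length; col++) {
                        triples.add(new Triple(subject,
                                               Node.createURI(baseURI + header[col]),
                                               Node.createLiteral(cells[col])));
                    }
                }
            } finally {
                in.close();
            }
        }

        // The one method GraphBase requires: answer a triple pattern.
        // A null slot in the TripleMatch means "match anything".
        @Override
        protected ExtendedIterator<Triple> graphBaseFind(TripleMatch m) {
            List<Triple> result = new ArrayList<Triple>();
            for (Triple t : triples) {
                if (slotMatches(m.getMatchSubject(), t.getSubject())
                        && slotMatches(m.getMatchPredicate(), t.getPredicate())
                        && slotMatches(m.getMatchObject(), t.getObject()))
                    result.add(t);
            }
            return WrappedIterator.create(result.iterator());
        }

        private static boolean slotMatches(Node pattern, Node node) {
            return pattern == null || pattern.matches(node);
        }
    }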
Then (1)
Model model = ModelFactory.createModelForGraph(graph) ;
and you can put one model per file in a DataSource.
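For instance (graph names and file names made up; in this ARQ version
DatasetFactory.create() gives you the updatable DataSource):

    import java.io.File;
    import com.hp.hpl.jena.query.DataSource;
    import com.hp.hpl.jena.query.DatasetFactory;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class CsvDataSourceExample {
        public static void main(String[] args) throws Exception {
            // One named model per CSV file.
            DataSource ds = DatasetFactory.create();
            ds.addNamedModel("http://example/graph/people",
                ModelFactory.createModelForGraph(
                    new CsvGraph(new File("people.csv"), "http://example/people#")));
            ds.addNamedModel("http://example/graph/orders",
                ModelFactory.createModelForGraph(
                    new CsvGraph(new File("orders.csv"), "http://example/orders#")));
            // ds is a Dataset - pass it to QueryExecutionFactory as usual.
        }
    }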
or (2) you can skip the Model layer, build a DatasetGraph
(DatasetGraphMap), and wrap it up as a Dataset with
DatasetFactory.create(datasetGraph).
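For instance (again with made-up names; I'm assuming the DatasetGraphMap
constructor that takes a default graph - check against your ARQ version):

    import java.io.File;
    import com.hp.hpl.jena.graph.Node;
    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.query.DatasetFactory;
    import com.hp.hpl.jena.sparql.core.DatasetGraph;
    import com.hp.hpl.jena.sparql.core.DatasetGraphMap;
    import com.hp.hpl.jena.sparql.graph.GraphFactory;

    public class CsvDatasetGraphExample {
        public static void main(String[] args) throws Exception {
            // Empty in-memory default graph; each named graph is a CSV file.
            DatasetGraph dsg = new DatasetGraphMap(GraphFactory.createDefaultGraph());
            dsg.addGraph(Node.createURI("http://example/graph/people"),
                         new CsvGraph(new File("people.csv"), "http://example/people#"));
            Dataset dataset = DatasetFactory.create(dsg);
        }
    }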
These two routes get you to the same place from the point of view of the
SPARQL system. Each uses a general-purpose Dataset(Graph) implementation
that maps down to the specific graphs.
This general-purpose Dataset implementation is already in ARQ - no need
to register specific query engines for this case.
These execute SPARQL queries over the Graph.find operation; everything
will work.
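e.g., using the DataSource from (1) (the query and all names are just
illustrative):

    import java.io.File;
    import com.hp.hpl.jena.query.*;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class CsvQueryExample {
        public static void main(String[] args) throws Exception {
            DataSource ds = DatasetFactory.create();
            ds.addNamedModel("http://example/graph/people",
                ModelFactory.createModelForGraph(
                    new CsvGraph(new File("people.csv"), "http://example/people#")));

            String q = "SELECT * { GRAPH <http://example/graph/people> { ?s ?p ?o } }";
            QueryExecution qexec = QueryExecutionFactory.create(q, ds);
            try {
                // The GRAPH pattern is answered via CsvGraph.graphBaseFind.
                ResultSetFormatter.out(qexec.execSelect());
            } finally {
                qexec.close();
            }
        }
    }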
You only need to go further if you want a specialised engine that does
something else (specialised indexing maybe, or so many CSV files that it
is worth implementing a specialised storage system for them).
Even if that's needed, I suggest implementing the GraphBase route first
because it is the quickest way to get something working. The thing you will
have to add is the translation between RDF and the CSV data model used in
the columns.
ARQ has the CSV and TSV result set output formats that are now a W3C
draft, in case that helps with getting information back out of the RDF view.
http://www.w3.org/TR/sparql11-results-csv-tsv/
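Something like this, if I remember the method names right ("results"
being a ResultSet from execSelect()):

    // Write a SELECT result set in the draft CSV / TSV result formats.
    ResultSetFormatter.outputAsCSV(System.out, results);
    ResultSetFormatter.outputAsTSV(System.out, results);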
TDB implements a specialised DatasetGraph and executes SPARQL queries
(the BGP parts) directly over the indexes, without going back and forth
through Node objects - that's why it registers itself. If you put
TDB-backed graphs in a general-purpose dataset, then ARQ will
access TDB via the Graph.find route.
Andy