Re: [GSoC 2014] Data Tables for SPARQL

Andy Seaborne Mon, 10 Mar 2014 07:51:43 -0700

Hi Ying,

Good questions. I'll try to give a response to the specific pointsyou've brought up but also there is a different I want to put forwardfor discussion.

I'll write up a first draft of a project plan then we can see if thesize and scope is realistic.

You asked about whether variables are column names. That is how TARQLand SPARQL VALUES works but I've realised there is a different approachand it's one that will give a better system. It is to translate the CSVto RDF, and this may be materialized or dynamically mapped. If"materialized" it's likely to be a lot bigger; as "property tables" orsomethign inspired by that idea, it'll be more compact.


There are some issues with variables-as-columns include:

1/ Fixed variable names don't combine with other part of a query patternvery well.

If there is common use of the same name it a join - that's what anatural join in SQL is. If there are two tables, then ?a is overloaded.If column names are used to derive a variable name, we may not want toequate them in the query because column names in different CSV filesweren't designed with that in mind.

2/ You can't describe (in RDF) the data very easily - e.g. annotate thata column is of years.


3/  It needs the language to change (i.e. TABLE to access it)

In TARQL, which is focusing on a controlled transform from CSV to RDF,it works out quite nicely - variables go into the CONSTRUCT template.It produces RDF.

Property tables are a style of approach where the CSV data is accessedas RDF.

The data table columns be predicate URIs. The data table itself is anRDF graph of regular structure. It can be accessed with normal(unmodified) SPARQL syntax. It would be better if the storage andexecution of that part of the SPARQL query were adapted to such regulardata. Something for after getting an initial cut down.


Suppose we have a CSV file:
-------------------
Town,Population
Southton,123000
Northville,654000
-------------------

One header row, two data rows.

Aside: this is regular-shaped CSV (and some CSV files are definitely notregular at all!). There is the current editors working draft from theCSV on the Web Working Group (not yet published, likely to change, onlypart of the picture, etc etc)


http://w3c.github.io/csvw/syntax/

which is defining a more regular data out of CSV. This is the targetfor the CSV work: table shaped CSV; not arbitrary, irregularly shaped CSV.

There is no way the working group will have standardised any CSV to RDFmapping in the lifetime of the GSoC project but the WG charter says itmust be covered. So the mapping below is made up and ahead of where theworking group is currently but a standardized, "direct mapping" (nometadata, no templates) style is going to happen. The mapping detailsmay change but the general approach is clear.


As RDF this might be

-------------
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .

[ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
[ csv:row 2 ; :Town "Northville" ; :Population 654000 ] .
-------------

or without the bnode abbreviation:

-------------
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .

_:b0  csv:row 1 ;
      :Town "Southton" ;
      :Population 123000 .

_:b1  csv:row 2 ;
      :Town "Northville" ;
      :Population 654000 .
-------------

Each row is modelling one "entity" (here, a population observation).There is a subject (a blank node) and one predicate-value for each cellof the row. Row numbers are added because it can be important.


Background:

A related idea for property has come up before

  http://www.hpl.hp.com/techreports/2006/HPL-2006-140.html

That paper should only be taken as giving a flavour. The motivation wasdifferent, more about making RDF look like regular database especiallywhen the data is regular. At the workshop last week, I talk to OrriErling (OpenLink/Virtuoso) and apparently, maybe by parallel evolution,Virtuoso does something similar.



Aside:

There is a whole design space (outside this project) for translating CSVto RDF.


Just if anyone is interested: see the related SQL-to-RDF work:

http://www.w3.org/TR/r2rml/
http://www.w3.org/TR/rdb-direct-mapping/

If the metadata said that one of the columns was uniquely defining (aprimary key in SQL terms, or inverse functional property in OWL-terms),we wouldn't need blank nodes at all - we could use a URI template, forif town names were unique (they are not!) a IRI template ofhttp://data/town/{Town} would give:


-------------
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .

<http://data/town/Southton>
      csv:row 1 ;
      rdfs:label "Southton" ;
      :Population 123000 .

<http://data/town/Northville>
      csv:row 2 ;
      rdfs:label "Northville" ;
      :Population 654000 .
-------------

Doing this transformation in rules is one route.  JENA-650 connection?
</aside>

In SPARQL:

Now the CSV file is viewed as an graph - normal, unmodified SPARQL canbe used. Multiple CSVs files can be multiple graphs in one dataset togive query across different data sources.


# Towns over 500,000 people.
SELECT ?townName ?pop {
{ GRAPH <http://example/population> {
    ?x :Town ?townName ;
       :Popuation ?pop .
    FILTER(?pop > 500000)
  }
}


A few comments inline - the bulk of this message is above.

I hope this makes some sense. Having spent time with people who reallydo work with CSVs files last week around the linked geospatial workshop, the user needs and requirements are much clearer.


        Andy

PS I was on a panel that included mentioning the work you did last year.It went well.


On 07/03/14 12:10, Ying Jiang wrote:
...

2. Storage of the table (in-memory is enough, with reading from a file).
   - Questions:
2.1 What's the life cycle of the in-memory table? Should we discard
the table after the query execution, or keep it in-memory for later
reuse with the same query or update, or use by a subsequent query?
When will the table be discarded?



That'll need refining but a way to read and reuse.  There needs to be away
for the app to pass in tables (a Map<Sting, ???> and a tool forerading CSVs
to get the ???) because ...


When will the tables be passed in? TARQL loads the CSVs when parsing
the SPARQL query string. Shall we load the tables and create the Map
before querying and cache them for resue? This could be similar to
querying a Dataset, and the simplest way goes something like:

DataTableMap<String, DataTable> dtm =
DataTableSetFactory.createDataTableMap(); // The keys of dts are the
URI of the DataTables loaded.
dtm.addDataTable( "<ex:table_1>", "file:table_1.csv", true); // The
table data are loaded when added into the map.
dtm.addDataTable( "<ex:table_2>", "file:table_2.csv", false); // Or
the table data are *lazy* loaded during querying later on, i.e. not
loaded now.
Query query = QueryFactory.create(queryString) ; // New .jj will be
created for parsing TABLE and FROM TABLE clauses. However the
QueryFactory interface remains the same as before.
QueryExecution qExec = QueryExecutionFactory.create(query, model,
dtm) ; // New create method for QueryExecutionFactory to accomendate
dtm
... //dtm can be reused later on for other QueryExecutions, or be
discarded when the app ends.

Is the above what you mean? Any comments?


Yes, using TABLE.

With property tables it can be done as

// Default graph of the dataset

Model csv1 =
  ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
QueryExecution qExec = QueryExecutionFactory.create(query, csv1) ;

or for multiple CSV files and/or other RDF data:

Model csv1 =
  ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
Model csv2 =
  ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;

Dataset dataset = ... ;
dataset.addNamedModel("http://example/population";, csv1) ;
dataset.addNamedModel("http://example/table2";, csv2) ;

... normal SPARQL execution ...

3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
inclusion inside a larger query, c.f. SPARQL VALUES clause).
   - Questions:
3.1 What're the differences between FROM TABLE and TABLE?



FROM TABLE would be one way to get tables into the query as would passing it
in in the query context.

Queries can't be assumed to

TABLE in a query is accessing the table, using it to get the

TARQL, and I've only read the documentation, is a query over a single CSV
file.  This project should be about multiple CSVs and combining with other
RDF data.

A quick sketch and the syntax is not checked as sensible:

SELECT ... {
   # Fixed column names
   TABLE <uri> {
      BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
      BIND (STRLANG(?a, 'en') AS ?with_language_tag)
      FILTER (?v > 57)
   }
}

More ambitious to have column naming and FILTERs:

SELECT ...
WHERE {

    TABLE <uri> { "col1" AS ?myVar1 ,
                  "col10" AS ?V ,
                  "col5" AS ?appName
                  FILTER(?V > 57) }
}

creates a set of bindings based on access description.


Are the <uri> after TABLE the key of the Map<Sting, ???>? If so, I now
understand the TABLE clauses from the examples. However, still not
sure about FROM TABLE. Could you please show me some query string
examples containing the FROM TABLE clauses?


FROM TABLE would set the map entry.  c.f. FROM NAMED

In this case the name of the table (graph) is the location it comes from- it's not a general choice of name. A common issue for FROM NAMED, notspecific to CSV processing.

Re: [GSoC 2014] Data Tables for SPARQL

Reply via email to