Hi Ying,

Good questions. I'll try to give a response to the specific points you've brought up but also there is a different I want to put forward for discussion.

I'll write up a first draft of a project plan then we can see if the size and scope is realistic.

You asked about whether variables are column names. That is how TARQL and SPARQL VALUES works but I've realised there is a different approach and it's one that will give a better system. It is to translate the CSV to RDF, and this may be materialized or dynamically mapped. If "materialized" it's likely to be a lot bigger; as "property tables" or somethign inspired by that idea, it'll be more compact.

There are some issues with variables-as-columns include:

1/ Fixed variable names don't combine with other part of a query pattern very well.

If there is common use of the same name it a join - that's what a natural join in SQL is. If there are two tables, then ?a is overloaded. If column names are used to derive a variable name, we may not want to equate them in the query because column names in different CSV files weren't designed with that in mind.

2/ You can't describe (in RDF) the data very easily - e.g. annotate that a column is of years.

3/  It needs the language to change (i.e. TABLE to access it)

In TARQL, which is focusing on a controlled transform from CSV to RDF, it works out quite nicely - variables go into the CONSTRUCT template. It produces RDF.

Property tables are a style of approach where the CSV data is accessed as RDF.

The data table columns be predicate URIs. The data table itself is an RDF graph of regular structure. It can be accessed with normal (unmodified) SPARQL syntax. It would be better if the storage and execution of that part of the SPARQL query were adapted to such regular data. Something for after getting an initial cut down.

Suppose we have a CSV file:
-------------------
Town,Population
Southton,123000
Northville,654000
-------------------

One header row, two data rows.

Aside: this is regular-shaped CSV (and some CSV files are definitely not regular at all!). There is the current editors working draft from the CSV on the Web Working Group (not yet published, likely to change, only part of the picture, etc etc)

http://w3c.github.io/csvw/syntax/

which is defining a more regular data out of CSV. This is the target for the CSV work: table shaped CSV; not arbitrary, irregularly shaped CSV.

There is no way the working group will have standardised any CSV to RDF mapping in the lifetime of the GSoC project but the WG charter says it must be covered. So the mapping below is made up and ahead of where the working group is currently but a standardized, "direct mapping" (no metadata, no templates) style is going to happen. The mapping details may change but the general approach is clear.

As RDF this might be

-------------
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .

[ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
[ csv:row 2 ; :Town "Northville" ; :Population 654000 ] .
-------------

or without the bnode abbreviation:

-------------
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .

_:b0  csv:row 1 ;
      :Town "Southton" ;
      :Population 123000 .

_:b1  csv:row 2 ;
      :Town "Northville" ;
      :Population 654000 .
-------------


Each row is modelling one "entity" (here, a population observation). There is a subject (a blank node) and one predicate-value for each cell of the row. Row numbers are added because it can be important.

Background:

A related idea for property has come up before

  http://www.hpl.hp.com/techreports/2006/HPL-2006-140.html

That paper should only be taken as giving a flavour. The motivation was different, more about making RDF look like regular database especially when the data is regular. At the workshop last week, I talk to Orri Erling (OpenLink/Virtuoso) and apparently, maybe by parallel evolution, Virtuoso does something similar.


Aside:
There is a whole design space (outside this project) for translating CSV to RDF.

Just if anyone is interested: see the related SQL-to-RDF work:

http://www.w3.org/TR/r2rml/
http://www.w3.org/TR/rdb-direct-mapping/

If the metadata said that one of the columns was uniquely defining (a primary key in SQL terms, or inverse functional property in OWL-terms), we wouldn't need blank nodes at all - we could use a URI template, for if town names were unique (they are not!) a IRI template of http://data/town/{Town} would give:

-------------
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .

<http://data/town/Southton>
      csv:row 1 ;
      rdfs:label "Southton" ;
      :Population 123000 .

<http://data/town/Northville>
      csv:row 2 ;
      rdfs:label "Northville" ;
      :Population 654000 .
-------------

Doing this transformation in rules is one route.  JENA-650 connection?
</aside>

In SPARQL:

Now the CSV file is viewed as an graph - normal, unmodified SPARQL can be used. Multiple CSVs files can be multiple graphs in one dataset to give query across different data sources.

# Towns over 500,000 people.
SELECT ?townName ?pop {
{ GRAPH <http://example/population> {
    ?x :Town ?townName ;
       :Popuation ?pop .
    FILTER(?pop > 500000)
  }
}


A few comments inline - the bulk of this message is above.

I hope this makes some sense. Having spent time with people who really do work with CSVs files last week around the linked geospatial workshop , the user needs and requirements are much clearer.

        Andy

PS I was on a panel that included mentioning the work you did last year. It went well.

On 07/03/14 12:10, Ying Jiang wrote:
...
2. Storage of the table (in-memory is enough, with reading from a file).
   - Questions:
2.1 What's the life cycle of the in-memory table? Should we discard
the table after the query execution, or keep it in-memory for later
reuse with the same query or update, or use by a subsequent query?
When will the table be discarded?


That'll need refining but a way to read and reuse.  There needs to be away
for the app to pass in tables (a Map<Sting, ???> and a tool forerading CSVs
to get the ???) because ...

When will the tables be passed in? TARQL loads the CSVs when parsing
the SPARQL query string. Shall we load the tables and create the Map
before querying and cache them for resue? This could be similar to
querying a Dataset, and the simplest way goes something like:

DataTableMap<String, DataTable> dtm =
DataTableSetFactory.createDataTableMap(); // The keys of dts are the
URI of the DataTables loaded.
dtm.addDataTable( "<ex:table_1>", "file:table_1.csv", true); // The
table data are loaded when added into the map.
dtm.addDataTable( "<ex:table_2>", "file:table_2.csv", false); // Or
the table data are *lazy* loaded during querying later on, i.e. not
loaded now.
Query query = QueryFactory.create(queryString) ; // New .jj will be
created for parsing TABLE and FROM TABLE clauses. However the
QueryFactory interface remains the same as before.
QueryExecution qExec = QueryExecutionFactory.create(query, model,
dtm) ; // New create method for QueryExecutionFactory to accomendate
dtm
... //dtm can be reused later on for other QueryExecutions, or be
discarded when the app ends.

Is the above what you mean? Any comments?

Yes, using TABLE.

With property tables it can be done as

// Default graph of the dataset

Model csv1 =
  ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
QueryExecution qExec = QueryExecutionFactory.create(query, csv1) ;

or for multiple CSV files and/or other RDF data:

Model csv1 =
  ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
Model csv2 =
  ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;

Dataset dataset = ... ;
dataset.addNamedModel("http://example/population";, csv1) ;
dataset.addNamedModel("http://example/table2";, csv2) ;

... normal SPARQL execution ...



3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
inclusion inside a larger query, c.f. SPARQL VALUES clause).
   - Questions:
3.1 What're the differences between FROM TABLE and TABLE?


FROM TABLE would be one way to get tables into the query as would passing it
in in the query context.

Queries can't be assumed to

TABLE in a query is accessing the table, using it to get the

TARQL, and I've only read the documentation, is a query over a single CSV
file.  This project should be about multiple CSVs and combining with other
RDF data.

A quick sketch and the syntax is not checked as sensible:

SELECT ... {
   # Fixed column names
   TABLE <uri> {
      BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
      BIND (STRLANG(?a, 'en') AS ?with_language_tag)
      FILTER (?v > 57)
   }
}

More ambitious to have column naming and FILTERs:

SELECT ...
WHERE {

    TABLE <uri> { "col1" AS ?myVar1 ,
                  "col10" AS ?V ,
                  "col5" AS ?appName
                  FILTER(?V > 57) }
}

creates a set of bindings based on access description.


Are the <uri> after TABLE the key of the Map<Sting, ???>? If so, I now
understand the TABLE clauses from the examples. However, still not
sure about FROM TABLE. Could you please show me some query string
examples containing the FROM TABLE clauses?

FROM TABLE would set the map entry.  c.f. FROM NAMED

In this case the name of the table (graph) is the location it comes from - it's not a general choice of name. A common issue for FROM NAMED, not specific to CSV processing.

Reply via email to