On 16/03/14 04:31, Ying Jiang wrote:
Dear Andy,
I greatly appreciate your detailed explanations. I've studied all the
examples and the links you mentioned. I'll try to summarise here with
further questions below:
1. We have 2 possible ways for the project: "variables-as-columns" and
"property tables". I can understand both the ideas, thanks to your
instructions. The former one has its issues you pointed out, and the
latter one seems to make more sense for the users. Do you mean we
should discard the former one and focus on the latter in this project?
Yes - "predicates-for-columns" = "property tables"
From that, you can recover "variables-as-columns" by query pattern.
The reverse is messy at best. Either very unnatural variable names to
stop clashes or beign careful about scoping (and that will confuse people).
2. We can have some lessons learned from SQL-to-RDF work. But CSV
(even regular-shaped CSV) is different from database in some ways,
which requires us to dig in deeper on the details. Some questions
like:
The W3C "CSV on the Web Working Group" [1] is working on a standard
mechanism for converting CSV to other forms, RDF included. The details
of that mechanism aren't clear yet and won't be in time for the project
- it's an area that (my current belief) will chop and change a fair bit
in getting to a final specification.
The area of CSV-RDF is bigger than a GSoC project anyway and fairly open
ended given all the sorts of the things people do with CSV files (e.g.
encoding author lists in fields).
But there is a simpler case - one need is a "direct mapping" whereby a
CSV file with no additional metadata is mapped to RDF. I think we can
focus on a design for this in the project.
The translation is fixed : blank node for each row (addresses the
primary key issue - and alternative below), the base URL of the CSV file
is used to generate the predicate names.
Then, the project gets all the machinery working - otherwise the output
will CSV to RDF without the Jena architectural chnages to support it in
the long term.
[1] https://www.w3.org/2013/csvw/wiki/Main_Page
2.1 How to determine the data type of the column? All the values in
CSV are firstly parsed as Strings line by line. Suppose the parser
found a number string of "123000.0", how can we know whether it's an
integer, a float/double or even just a string in RDF?
Initially, they can be strings.
Later, and maybe an option the user can turn on, then a dynamic choice
which is a posh way of saying attempt to parse it as an integer and if
it passes, it's an integer. Spreadsheets do this guessing.
"Duck datatyping" - if it looks like an integer (decimal, double, date)
it is an integer (decimal, double, date).
Actually, this is then the same as tokenizing and there is code to reuse
to do that.
2.2 How to deal with the namespaces? RDF requires that the subjects
and the predicates are URIs. We need to pass in the namespaces (or
just the default namespaces) to make URIs by combining the namespaces
with the values in CSV. Things may get more complicated if different
columns are to be bound with different namespaces.
Subject a can be blank nodes which is useful because each row is then a
new blank node.
One row written in RDF might be:
[ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
or
_:b0 csv:row 1 ;
:Town "Southton" ;
:Population 123000 .
It's the same RDF triples (3 of them).
For predicates, suppose the URL of the CSV file is <FILE> then the
columns can be <FILE#Town> and <FILE#Population>.
Rules or SPARQL Update can be used to turn that into a better data model
if the users wants to write that code.
2.3 The hp 2006 report [1] says "Jena supports three kinds of property
tables as well as a triple store". The "town" example you provided
conforms to the "single-valued" property table. Shall we consider the
others (e.g. the "multi-valued" one and the "triple store" one) in
this project? Does Jena in the latest release still support these
property tables? If so, where're the related source codes?
Single-valued.
In the CSV-WG it looks like duplicate column names are not going to be
supported (at best, the parser has to make then unique by adding "1",
"2" etc).
Despite what the report says, the code didn't make it into the public
Jena codebase. (And we have removed the old RDB subsystem it refers to.)
2.4 There's no "primary key" definition in CSV. All the RDF are not
OWL in fact. How do we know the column in CSV is uniquely defining? It
seems CSV lacks of some kind of "metadata" of the columns and the
values. If we have such metadata, how to pass in the namespace of the
IRI template of http://data/town/{Town} (something related to the
question 2.2)?
It's not necessary to have a defined primary row - that is generated
subject URI. It might be nice if available but that's metadata.
So one of:
1/ The triples for each row have a blank node for subject
2/ The triples for row N have a URI which is <FILE#_N>.
In both cases, the subject node is generated automatically.
3. For the "property tables" way, it seems that all we need to do is
to resolve the problems in 2., and to code "GraphCSV" accordingly. I
can make the GraphCSV class by implementing the Graph interface. In
this way, for Jena ARP, a CSV table is actually a Graph, without any
differences from other types of Graphs. It looks like that there's no
need to introduce TABLE and FROM TABLE clauses in the SPARQL language
grammar. We can just use the existing GRAPH, FROM and FROM NAMED
clauses for the CSV "property tables", can't we?
s/ARP/ARQ/ -- ARP is the RDF/XML parser; ARQ is the query engine :-)
Yes - correct.
In the later stages of the project, there is an item to make OpExecutor
(which is the class that actually drives the SPARQL execution) do better
for GraphCSV than just treating it as a Graph by accessing the
PropertyTable behind it.
The big gain for PropertyTables is the space saving they enable as well
as the possibility of making them persistent in a special storage system
(not in this project but the design should not make that too hard at
some later time).
Andy
Best regards,
Ying Jiang
[1] http://www.hpl.hp.com/techreports/2006/HPL-2006-140.pdf
On Mon, Mar 10, 2014 at 10:50 PM, Andy Seaborne <[email protected]> wrote:
Hi Ying,
Good questions. I'll try to give a response to the specific points you've
brought up but also there is a different I want to put forward for
discussion.
I'll write up a first draft of a project plan then we can see if the size
and scope is realistic.
You asked about whether variables are column names. That is how TARQL and
SPARQL VALUES works but I've realised there is a different approach and it's
one that will give a better system. It is to translate the CSV to RDF, and
this may be materialized or dynamically mapped. If "materialized" it's
likely to be a lot bigger; as "property tables" or somethign inspired by
that idea, it'll be more compact.
There are some issues with variables-as-columns include:
1/ Fixed variable names don't combine with other part of a query pattern
very well.
If there is common use of the same name it a join - that's what a natural
join in SQL is. If there are two tables, then ?a is overloaded. If column
names are used to derive a variable name, we may not want to equate them in
the query because column names in different CSV files weren't designed with
that in mind.
2/ You can't describe (in RDF) the data very easily - e.g. annotate that a
column is of years.
3/ It needs the language to change (i.e. TABLE to access it)
In TARQL, which is focusing on a controlled transform from CSV to RDF, it
works out quite nicely - variables go into the CONSTRUCT template. It
produces RDF.
Property tables are a style of approach where the CSV data is accessed as
RDF.
The data table columns be predicate URIs. The data table itself is an RDF
graph of regular structure. It can be accessed with normal (unmodified)
SPARQL syntax. It would be better if the storage and execution of that part
of the SPARQL query were adapted to such regular data. Something for after
getting an initial cut down.
Suppose we have a CSV file:
-------------------
Town,Population
Southton,123000
Northville,654000
-------------------
One header row, two data rows.
Aside: this is regular-shaped CSV (and some CSV files are definitely not
regular at all!). There is the current editors working draft from the CSV on
the Web Working Group (not yet published, likely to change, only part of the
picture, etc etc)
http://w3c.github.io/csvw/syntax/
which is defining a more regular data out of CSV. This is the target for
the CSV work: table shaped CSV; not arbitrary, irregularly shaped CSV.
There is no way the working group will have standardised any CSV to RDF
mapping in the lifetime of the GSoC project but the WG charter says it must
be covered. So the mapping below is made up and ahead of where the working
group is currently but a standardized, "direct mapping" (no metadata, no
templates) style is going to happen. The mapping details may change but the
general approach is clear.
As RDF this might be
-------------
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .
[ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
[ csv:row 2 ; :Town "Northville" ; :Population 654000 ] .
-------------
or without the bnode abbreviation:
-------------
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .
_:b0 csv:row 1 ;
:Town "Southton" ;
:Population 123000 .
_:b1 csv:row 2 ;
:Town "Northville" ;
:Population 654000 .
-------------
Each row is modelling one "entity" (here, a population observation). There
is a subject (a blank node) and one predicate-value for each cell of the
row. Row numbers are added because it can be important.
Background:
A related idea for property has come up before
http://www.hpl.hp.com/techreports/2006/HPL-2006-140.html
That paper should only be taken as giving a flavour. The motivation was
different, more about making RDF look like regular database especially when
the data is regular. At the workshop last week, I talk to Orri Erling
(OpenLink/Virtuoso) and apparently, maybe by parallel evolution, Virtuoso
does something similar.
Aside:
There is a whole design space (outside this project) for translating CSV to
RDF.
Just if anyone is interested: see the related SQL-to-RDF work:
http://www.w3.org/TR/r2rml/
http://www.w3.org/TR/rdb-direct-mapping/
If the metadata said that one of the columns was uniquely defining (a
primary key in SQL terms, or inverse functional property in OWL-terms), we
wouldn't need blank nodes at all - we could use a URI template, for if town
names were unique (they are not!) a IRI template of http://data/town/{Town}
would give:
-------------
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .
<http://data/town/Southton>
csv:row 1 ;
rdfs:label "Southton" ;
:Population 123000 .
<http://data/town/Northville>
csv:row 2 ;
rdfs:label "Northville" ;
:Population 654000 .
-------------
Doing this transformation in rules is one route. JENA-650 connection?
</aside>
In SPARQL:
Now the CSV file is viewed as an graph - normal, unmodified SPARQL can be
used. Multiple CSVs files can be multiple graphs in one dataset to give
query across different data sources.
# Towns over 500,000 people.
SELECT ?townName ?pop {
{ GRAPH <http://example/population> {
?x :Town ?townName ;
:Popuation ?pop .
FILTER(?pop > 500000)
}
}
A few comments inline - the bulk of this message is above.
I hope this makes some sense. Having spent time with people who really do
work with CSVs files last week around the linked geospatial workshop , the
user needs and requirements are much clearer.
Andy
PS I was on a panel that included mentioning the work you did last year. It
went well.
On 07/03/14 12:10, Ying Jiang wrote:
...
2. Storage of the table (in-memory is enough, with reading from a file).
- Questions:
2.1 What's the life cycle of the in-memory table? Should we discard
the table after the query execution, or keep it in-memory for later
reuse with the same query or update, or use by a subsequent query?
When will the table be discarded?
That'll need refining but a way to read and reuse. There needs to be
away
for the app to pass in tables (a Map<Sting, ???> and a tool forerading
CSVs
to get the ???) because ...
When will the tables be passed in? TARQL loads the CSVs when parsing
the SPARQL query string. Shall we load the tables and create the Map
before querying and cache them for resue? This could be similar to
querying a Dataset, and the simplest way goes something like:
DataTableMap<String, DataTable> dtm =
DataTableSetFactory.createDataTableMap(); // The keys of dts are the
URI of the DataTables loaded.
dtm.addDataTable( "<ex:table_1>", "file:table_1.csv", true); // The
table data are loaded when added into the map.
dtm.addDataTable( "<ex:table_2>", "file:table_2.csv", false); // Or
the table data are *lazy* loaded during querying later on, i.e. not
loaded now.
Query query = QueryFactory.create(queryString) ; // New .jj will be
created for parsing TABLE and FROM TABLE clauses. However the
QueryFactory interface remains the same as before.
QueryExecution qExec = QueryExecutionFactory.create(query, model,
dtm) ; // New create method for QueryExecutionFactory to accomendate
dtm
... //dtm can be reused later on for other QueryExecutions, or be
discarded when the app ends.
Is the above what you mean? Any comments?
Yes, using TABLE.
With property tables it can be done as
// Default graph of the dataset
Model csv1 =
ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
QueryExecution qExec = QueryExecutionFactory.create(query, csv1) ;
or for multiple CSV files and/or other RDF data:
Model csv1 =
ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
Model csv2 =
ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
Dataset dataset = ... ;
dataset.addNamedModel("http://example/population", csv1) ;
dataset.addNamedModel("http://example/table2", csv2) ;
... normal SPARQL execution ...
3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
inclusion inside a larger query, c.f. SPARQL VALUES clause).
- Questions:
3.1 What're the differences between FROM TABLE and TABLE?
FROM TABLE would be one way to get tables into the query as would passing
it
in in the query context.
Queries can't be assumed to
TABLE in a query is accessing the table, using it to get the
TARQL, and I've only read the documentation, is a query over a single CSV
file. This project should be about multiple CSVs and combining with
other
RDF data.
A quick sketch and the syntax is not checked as sensible:
SELECT ... {
# Fixed column names
TABLE <uri> {
BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
BIND (STRLANG(?a, 'en') AS ?with_language_tag)
FILTER (?v > 57)
}
}
More ambitious to have column naming and FILTERs:
SELECT ...
WHERE {
TABLE <uri> { "col1" AS ?myVar1 ,
"col10" AS ?V ,
"col5" AS ?appName
FILTER(?V > 57) }
}
creates a set of bindings based on access description.
Are the <uri> after TABLE the key of the Map<Sting, ???>? If so, I now
understand the TABLE clauses from the examples. However, still not
sure about FROM TABLE. Could you please show me some query string
examples containing the FROM TABLE clauses?
FROM TABLE would set the map entry. c.f. FROM NAMED
In this case the name of the table (graph) is the location it comes from -
it's not a general choice of name. A common issue for FROM NAMED, not
specific to CSV processing.