Re: [GSoC 2014] Data Tables for SPARQL

Ying Jiang Thu, 20 Mar 2014 21:08:57 -0700

Hi Andy,

It's OK. Here's the copy of my proposal in the attachment.


Cheers,
Ying Jiang

On Wed, Mar 19, 2014 at 10:13 PM, Andy Seaborne wrote:
> On 19/03/14 04:22, Ying Jiang wrote:
>>
>> Dear Andy,
>>
>> I've submitted a proposal [1] to GSoC, according to our previous
>> discussions. Please let me know if anything can be improved.
>> Thanks a lot!
>
>
> Looks fine - and congratulations on the lecturer position.
>
>         Andy
>
>
>>
>> Cheers,
>> Ying Jiang
>>
>> [1]
>> http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/jpz6311whu/5632763709358080
>
>
> These URLs are restricted until projects are accepted.  I can read it as:
>
> http://www.google-melange.com/gsoc/proposal/review/org/google/gsoc2014/jpz6311whu/5632763709358080
>
> not sure if that's organisation specific though.
>
> When projects are accepted, the proposal becomes public.
>
> Ying - in Apache, we do everything in public where possible.  Would you mind
> emailing dev@ with a copy?  (Remove anything you don't want on an archived
> list)
>
>
>>
>> On Mon, Mar 17, 2014 at 10:17 PM, Andy Seaborne wrote:
>>>
>>> On 16/03/14 04:31, Ying Jiang wrote:
>>>>
>>>>
>>>> Dear Andy,
>>>>
>>>> I greatly appreciate your detailed explanations. I've studied all the
>>>> examples and the links you mentioned. I'll try to summarise here with
>>>> further questions below:
>>>>
>>>> 1. We have 2 possible ways for the project: "variables-as-columns" and
>>>> "property tables". I can understand both the ideas, thanks to your
>>>> instructions. The former one has its issues you pointed out, and the
>>>> latter one seems to make more sense for the users. Do you mean we
>>>> should discard the former one and focus on the latter in this project?
>>>
>>>
>>>
>>> Yes - "predicates-for-columns" = "property tables"
>>>
>>> From that, you can recover "variables-as-columns" by query pattern. The
>>> reverse is messy at best: it needs either very unnatural variable names
>>> to stop clashes, or being careful about scoping (and that will confuse
>>> people).
>>>
>>>
>>>> 2. We can have some lessons learned from SQL-to-RDF work. But CSV
>>>> (even regular-shaped CSV) is different from database in some ways,
>>>> which requires us to dig in deeper on the details. Some questions
>>>> like:
>>>
>>>
>>>
>>> The W3C "CSV on the Web Working Group" [1] is working on a standard
>>> mechanism for converting CSV to other forms, RDF included.  The details
>>> of that mechanism aren't clear yet and won't be in time for the project -
>>> it's an area that (my current belief) will chop and change a fair bit in
>>> getting to a final specification.
>>>
>>>
>>> The area of CSV-RDF is bigger than a GSoC project anyway and fairly open
>>> ended given all the sorts of the things people do with CSV files (e.g.
>>> encoding author lists in fields).
>>>
>>> But there is a simpler case - one need is a "direct mapping" whereby a
>>> CSV file with no additional metadata is mapped to RDF.  I think we can
>>> focus on a design for this in the project.
>>>
>>> The translation is fixed: a blank node for each row (this addresses the
>>> primary key issue - an alternative is below), and the base URL of the CSV
>>> file is used to generate the predicate names.
>>>
>>> Then, the project gets all the machinery working - otherwise the output
>>> will be CSV to RDF without the Jena architectural changes to support it
>>> in the long term.
>>>
>>> [1] https://www.w3.org/2013/csvw/wiki/Main_Page
>>>
>>>
>>>> 2.1 How to determine the data type of the column? All the values in
>>>> CSV are firstly parsed as Strings line by line. Suppose the parser
>>>> found a number string of "123000.0", how can we know whether it's an
>>>> integer, a float/double or even just a string in RDF?
>>>
>>>
>>>
>>> Initially, they can be strings.
>>>
>>> Later, and maybe as an option the user can turn on, a dynamic choice,
>>> which is a posh way of saying attempt to parse it as an integer and if it
>>> passes, it's an integer.  Spreadsheets do this guessing.
>>>
>>> "Duck datatyping" - if it looks like an integer (decimal, double, date)
>>> it is an integer (decimal, double, date).
>>>
>>> Actually, this is then the same as tokenizing and there is code to reuse
>>> to do that.
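
A minimal sketch of the "duck datatyping" idea, in plain Java with no Jena dependency. The class name DuckDatatype, the method guess and the regular expressions are assumptions made for illustration (they loosely follow the Turtle numeric token forms). Dates are left out, and this is not the tokenizer code the message refers to.

```java
/** Hypothetical helper, not part of Jena: guesses an XSD datatype for a CSV cell. */
final class DuckDatatype {

    /** Returns a prefixed XSD datatype name chosen from the lexical form alone. */
    static String guess(String cell) {
        String s = cell.trim();
        // Integer: optional sign followed by digits only.
        if (s.matches("[+-]?\\d+")) {
            return "xsd:integer";
        }
        // Decimal: has a decimal point with digits after it, and no exponent.
        if (s.matches("[+-]?\\d*\\.\\d+")) {
            return "xsd:decimal";
        }
        // Double: a mantissa followed by an exponent part.
        if (s.matches("[+-]?(\\d+(\\.\\d*)?|\\.\\d+)[eE][+-]?\\d+")) {
            return "xsd:double";
        }
        // Anything else stays a plain string.
        return "xsd:string";
    }

    public static void main(String[] args) {
        for (String cell : new String[] {"123000", "123000.0", "1.23e5", "Southton"}) {
            System.out.println(cell + " -> " + guess(cell));
        }
    }
}
```

Under these rules the "123000.0" value from question 2.1 is read as a decimal rather than an integer. Whether that is the right default is a design choice the sketch does not settle.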
>>>
>>>
>>>> 2.2 How to deal with the namespaces? RDF requires that the subjects
>>>> and the predicates are URIs. We need to pass in the namespaces (or
>>>> just the default namespaces) to make URIs by combining the namespaces
>>>> with the values in CSV. Things may get more complicated if different
>>>> columns are to be bound with different namespaces.
>>>
>>>
>>>
>>> Subjects can be blank nodes, which is useful because each row is then a
>>> new blank node.
>>>
>>> One row written in RDF might be:
>>>
>>>
>>> [ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
>>>
>>> or
>>>
>>>
>>> _:b0  csv:row 1 ;
>>>        :Town "Southton" ;
>>>        :Population 123000 .
>>>
>>> It's the same RDF triples (3 of them).
>>>
>>> For predicates, suppose the URL of the CSV file is <FILE>; then the
>>> columns can be <FILE#Town> and <FILE#Population>.
>>>
>>> Rules or SPARQL Update can be used to turn that into a better data model
>>> if the user wants to write that code.
>>>
>>>
>>>> 2.3 The hp 2006 report [1] says "Jena supports three kinds of property
>>>> tables as well as a triple store". The "town" example you provided
>>>> conforms to the "single-valued" property table. Shall we consider the
>>>> others (e.g. the "multi-valued" one and the "triple store" one) in
>>>> this project? Does Jena in the latest release still support these
>>>> property tables? If so, where is the related source code?
>>>
>>>
>>>
>>> Single-valued.
>>>
>>> In the CSV-WG it looks like duplicate column names are not going to be
>>> supported (at best, the parser has to make them unique by adding "1", "2"
>>> etc).
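
One possible reading of "make them unique by adding 1, 2 etc", sketched in plain Java. The class UniqueColumns and the exact numbering scheme (the first occurrence keeps its name, later ones get a numeric suffix) are assumptions for illustration, not anything decided by the working group or implemented in Jena.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: renames repeated CSV column names so each one is unique. */
final class UniqueColumns {

    static List<String> makeUnique(List<String> header) {
        Set<String> used = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String name : header) {
            String candidate = name;
            int suffix = 1;
            // Set.add returns false when the candidate is already taken,
            // so keep trying name1, name2, ... until one is free.
            while (!used.add(candidate)) {
                candidate = name + suffix++;
            }
            result.add(candidate);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(makeUnique(List.of("Town", "Value", "Value", "Value")));
    }
}
```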
>>>
>>> Despite what the report says, the code didn't make it into the public
>>> Jena codebase.  (And we have removed the old RDB subsystem it refers to.)
>>>
>>>
>>>> 2.4 There's no "primary key" definition in CSV. Not all RDF is OWL, in
>>>> fact. How do we know whether a column in CSV is uniquely defining? It
>>>> seems CSV lacks some kind of "metadata" for the columns and the
>>>> values. If we have such metadata, how do we pass in the namespace of the
>>>> IRI template of http://data/town/{Town} (something related to
>>>> question 2.2)?
>>>
>>>
>>>
>>> It's not necessary to have a defined primary key - the subject is
>>> generated.  A key might be nice if available but that's metadata.
>>>
>>> So one of:
>>> 1/ The triples for each row have a blank node for subject
>>> 2/ The triples for row N have a URI which is <FILE#_N>.
>>>
>>> In both cases, the subject node is generated automatically.
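
The two subject options can be sketched as follows, in plain Java that emits N-Triples lines. The class CsvDirectMapping and its methods are hypothetical (this is not GraphCSV or any Jena API), and the csv:row namespace is the placeholder from the examples in this thread. Cell values are kept as plain strings, and column names are assumed to be usable in a URI fragment as they are.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not Jena API: direct mapping of CSV rows to N-Triples lines. */
final class CsvDirectMapping {
    // Placeholder vocabulary from the thread's examples; not a real namespace.
    static final String CSV_ROW = "<http://w3c/future-csv-vocab/row>";
    static final String XSD_INTEGER = "<http://www.w3.org/2001/XMLSchema#integer>";

    /**
     * fileUrl is the URL of the CSV file, used to build predicate names.
     * useRowUris = false gives option 1 (blank nodes), true gives option 2 (FILE#_N).
     */
    static List<String> toTriples(String fileUrl, String[] header,
                                  List<String[]> rows, boolean useRowUris) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            int rowNum = i + 1;
            // The subject is generated automatically in both cases.
            String subject = useRowUris
                    ? "<" + fileUrl + "#_" + rowNum + ">"
                    : "_:b" + i;
            out.add(subject + " " + CSV_ROW + " \"" + rowNum + "\"^^" + XSD_INTEGER + " .");
            String[] cells = rows.get(i);
            for (int c = 0; c < header.length && c < cells.length; c++) {
                // One predicate per column: <FILE#ColumnName>.
                String predicate = "<" + fileUrl + "#" + header[c] + ">";
                out.add(subject + " " + predicate + " \"" + escape(cells[c]) + "\" .");
            }
        }
        return out;
    }

    /** Escapes backslashes and double quotes for an N-Triples string literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Southton", "123000"});
        rows.add(new String[] {"Northville", "654000"});
        String[] header = {"Town", "Population"};
        for (String triple : toTriples("http://example/table.csv", header, rows, false)) {
            System.out.println(triple);
        }
    }
}
```

Passing true as the last argument switches the subjects from blank nodes to row URIs. The predicates and objects are the same either way.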
>>>
>>>
>>>> 3. For the "property tables" way, it seems that all we need to do is
>>>> to resolve the problems in 2., and to code "GraphCSV" accordingly. I
>>>> can make the GraphCSV class by implementing the Graph interface. In
>>>> this way, for Jena ARP, a CSV table is actually a Graph, without any
>>>> differences from other types of Graphs. It looks like that there's no
>>>> need to introduce TABLE and FROM TABLE clauses in the SPARQL language
>>>> grammar. We can just use the existing GRAPH, FROM and FROM NAMED
>>>> clauses for the CSV "property tables", can't we?
>>>
>>>
>>>
>>> s/ARP/ARQ/ -- ARP is the RDF/XML parser; ARQ is the query engine :-)
>>>
>>> Yes - correct.
>>>
>>> In the later stages of the project, there is an item to make OpExecutor
>>> (which is the class that actually drives the SPARQL execution) do better
>>> for GraphCSV than just treating it as a Graph, by accessing the
>>> PropertyTable behind it.
>>>
>>> The big gain for PropertyTables is the space saving they enable, as well
>>> as the possibility of making them persistent in a special storage system
>>> (not in this project, but the design should not make that too hard at
>>> some later time).
>>>
>>>          Andy
>>>
>>>
>>>>
>>>> Best regards,
>>>> Ying Jiang
>>>>
>>>> [1] http://www.hpl.hp.com/techreports/2006/HPL-2006-140.pdf
>>>>
>>>>
>>>>
>>>> On Mon, Mar 10, 2014 at 10:50 PM, Andy Seaborne wrote:
>>>>>
>>>>>
>>>>> Hi Ying,
>>>>>
>>>>> Good questions.  I'll try to give a response to the specific points
>>>>> you've brought up but also there is a different approach I want to put
>>>>> forward for discussion.
>>>>>
>>>>> I'll write up a first draft of a project plan then we can see if the
>>>>> size and scope is realistic.
>>>>>
>>>>> You asked about whether variables are column names.  That is how TARQL
>>>>> and SPARQL VALUES work but I've realised there is a different approach
>>>>> and it's one that will give a better system.  It is to translate the CSV
>>>>> to RDF, and this may be materialized or dynamically mapped. If
>>>>> "materialized" it's likely to be a lot bigger; as "property tables" or
>>>>> something inspired by that idea, it'll be more compact.
>>>>>
>>>>> Some issues with variables-as-columns include:
>>>>>
>>>>> 1/ Fixed variable names don't combine with other parts of a query
>>>>> pattern very well.
>>>>>
>>>>> If there is common use of the same name it is a join - that's what a
>>>>> natural join in SQL is.  If there are two tables, then ?a is overloaded.
>>>>> If column names are used to derive a variable name, we may not want to
>>>>> equate them in the query because column names in different CSV files
>>>>> weren't designed with that in mind.
>>>>>
>>>>> 2/ You can't describe (in RDF) the data very easily - e.g. annotate
>>>>> that a column is of years.
>>>>>
>>>>> 3/  It needs the language to change (i.e. TABLE to access it)
>>>>>
>>>>> In TARQL, which is focusing on a controlled transform from CSV to RDF,
>>>>> it works out quite nicely - variables go into the CONSTRUCT template. It
>>>>> produces RDF.
>>>>>
>>>>> Property tables are a style of approach where the CSV data is accessed
>>>>> as RDF.
>>>>>
>>>>> The data table columns become predicate URIs.  The data table itself is
>>>>> an RDF graph of regular structure.  It can be accessed with normal
>>>>> (unmodified) SPARQL syntax. It would be better if the storage and
>>>>> execution of that part of the SPARQL query were adapted to such regular
>>>>> data.  Something for after getting an initial cut down.
>>>>>
>>>>> Suppose we have a CSV file:
>>>>> -------------------
>>>>> Town,Population
>>>>> Southton,123000
>>>>> Northville,654000
>>>>> -------------------
>>>>>
>>>>> One header row, two data rows.
>>>>>
>>>>> Aside: this is regular-shaped CSV (and some CSV files are definitely
>>>>> not regular at all!). There is the current editors' working draft from
>>>>> the CSV on the Web Working Group (not yet published, likely to change,
>>>>> only part of the picture, etc etc)
>>>>>
>>>>> http://w3c.github.io/csvw/syntax/
>>>>>
>>>>> which is defining more regular data out of CSV.  This is the target
>>>>> for the CSV work: table shaped CSV; not arbitrary, irregularly shaped
>>>>> CSV.
>>>>>
>>>>> There is no way the working group will have standardised any CSV to RDF
>>>>> mapping in the lifetime of the GSoC project but the WG charter says it
>>>>> must be covered.  So the mapping below is made up and ahead of where the
>>>>> working group is currently but a standardized, "direct mapping" (no
>>>>> metadata, no templates) style is going to happen.  The mapping details
>>>>> may change but the general approach is clear.
>>>>>
>>>>> As RDF this might be
>>>>>
>>>>> -------------
>>>>> @prefix : <http://example/table#> .
>>>>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>>>>
>>>>> [ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
>>>>> [ csv:row 2 ; :Town "Northville" ; :Population 654000 ] .
>>>>> -------------
>>>>>
>>>>> or without the bnode abbreviation:
>>>>>
>>>>> -------------
>>>>> @prefix : <http://example/table#> .
>>>>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>>>>
>>>>> _:b0  csv:row 1 ;
>>>>>         :Town "Southton" ;
>>>>>         :Population 123000 .
>>>>>
>>>>> _:b1  csv:row 2 ;
>>>>>         :Town "Northville" ;
>>>>>         :Population 654000 .
>>>>> -------------
>>>>>
>>>>>
>>>>> Each row is modelling one "entity" (here, a population observation).
>>>>> There is a subject (a blank node) and one predicate-value for each cell
>>>>> of the row.  Row numbers are added because row order can be important.
>>>>>
>>>>> Background:
>>>>>
>>>>> A related idea for property tables has come up before
>>>>>
>>>>>     http://www.hpl.hp.com/techreports/2006/HPL-2006-140.html
>>>>>
>>>>> That paper should only be taken as giving a flavour. The motivation was
>>>>> different, more about making RDF look like a regular database,
>>>>> especially when the data is regular.  At the workshop last week, I
>>>>> talked to Orri Erling (OpenLink/Virtuoso) and apparently, maybe by
>>>>> parallel evolution, Virtuoso does something similar.
>>>>>
>>>>>
>>>>> Aside:
>>>>> There is a whole design space (outside this project) for translating
>>>>> CSV to RDF.
>>>>>
>>>>> Just if anyone is interested: see the related SQL-to-RDF work:
>>>>>
>>>>> http://www.w3.org/TR/r2rml/
>>>>> http://www.w3.org/TR/rdb-direct-mapping/
>>>>>
>>>>> If the metadata said that one of the columns was uniquely defining (a
>>>>> primary key in SQL terms, or inverse functional property in OWL terms),
>>>>> we wouldn't need blank nodes at all - we could use a URI template. For
>>>>> example, if town names were unique (they are not!), an IRI template of
>>>>> http://data/town/{Town} would give:
>>>>>
>>>>> -------------
>>>>> @prefix : <http://example/table#> .
>>>>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>>>> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
>>>>>
>>>>> <http://data/town/Southton>
>>>>>         csv:row 1 ;
>>>>>         rdfs:label "Southton" ;
>>>>>         :Population 123000 .
>>>>>
>>>>> <http://data/town/Northville>
>>>>>         csv:row 2 ;
>>>>>         rdfs:label "Northville" ;
>>>>>         :Population 654000 .
>>>>> -------------
>>>>>
>>>>> Doing this transformation in rules is one route.  JENA-650 connection?
>>>>> </aside>
>>>>>
>>>>> In SPARQL:
>>>>>
>>>>> Now the CSV file is viewed as a graph - normal, unmodified SPARQL can
>>>>> be used.  Multiple CSV files can be multiple graphs in one dataset to
>>>>> give query across different data sources.
>>>>>
>>>>> # Towns over 500,000 people.
>>>>> SELECT ?townName ?pop {
>>>>>   GRAPH <http://example/population> {
>>>>>       ?x :Town ?townName ;
>>>>>          :Population ?pop .
>>>>>       FILTER(?pop > 500000)
>>>>>   }
>>>>> }
>>>>>
>>>>>
>>>>> A few comments inline - the bulk of this message is above.
>>>>>
>>>>> I hope this makes some sense.  Having spent time with people who really
>>>>> do work with CSV files last week around the linked geospatial workshop,
>>>>> the user needs and requirements are much clearer.
>>>>>
>>>>>           Andy
>>>>>
>>>>> PS I was on a panel that included mentioning the work you did last
>>>>> year.  It went well.
>>>>>
>>>>> On 07/03/14 12:10, Ying Jiang wrote:
>>>>> ...
>>>>>
>>>>>>>> 2. Storage of the table (in-memory is enough, with reading from a
>>>>>>>> file).
>>>>>>>>      - Questions:
>>>>>>>> 2.1 What's the life cycle of the in-memory table? Should we discard
>>>>>>>> the table after the query execution, or keep it in-memory for later
>>>>>>>> reuse with the same query or update, or use by a subsequent query?
>>>>>>>> When will the table be discarded?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> That'll need refining but a way to read and reuse.  There needs to be
>>>>>>> a way for the app to pass in tables (a Map<String, ???> and a tool
>>>>>>> for reading CSVs to get the ???) because ...
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> When will the tables be passed in? TARQL loads the CSVs when parsing
>>>>>> the SPARQL query string. Shall we load the tables and create the Map
>>>>>> before querying and cache them for reuse? This could be similar to
>>>>>> querying a Dataset, and the simplest way goes something like:
>>>>>>
>>>>>> // The keys of dtm are the URIs of the DataTables loaded.
>>>>>> DataTableMap<String, DataTable> dtm =
>>>>>>     DataTableSetFactory.createDataTableMap();
>>>>>> // The table data are loaded when added into the map.
>>>>>> dtm.addDataTable("<ex:table_1>", "file:table_1.csv", true);
>>>>>> // Or the table data are *lazy* loaded during querying later on,
>>>>>> // i.e. not loaded now.
>>>>>> dtm.addDataTable("<ex:table_2>", "file:table_2.csv", false);
>>>>>> // New .jj will be created for parsing TABLE and FROM TABLE clauses.
>>>>>> // However the QueryFactory interface remains the same as before.
>>>>>> Query query = QueryFactory.create(queryString) ;
>>>>>> // New create method for QueryExecutionFactory to accommodate dtm.
>>>>>> QueryExecution qExec = QueryExecutionFactory.create(query, model, dtm) ;
>>>>>> // dtm can be reused later on for other QueryExecutions, or be
>>>>>> // discarded when the app ends.
>>>>>> ...
>>>>>>
>>>>>> Is the above what you mean? Any comments?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Yes, using TABLE.
>>>>>
>>>>> With property tables it can be done as
>>>>>
>>>>> // Default graph of the dataset
>>>>>
>>>>> Model csv1 =
>>>>>     ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
>>>>> QueryExecution qExec = QueryExecutionFactory.create(query, csv1) ;
>>>>>
>>>>> or for multiple CSV files and/or other RDF data:
>>>>>
>>>>> Model csv1 =
>>>>>     ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
>>>>> Model csv2 =
>>>>>     ModelFactory.createModelForGraph(new GraphCSV("data2.csv")) ;
>>>>>
>>>>> Dataset dataset = ... ;
>>>>> dataset.addNamedModel("http://example/population", csv1) ;
>>>>> dataset.addNamedModel("http://example/table2", csv2) ;
>>>>>
>>>>> ... normal SPARQL execution ...
>>>>>
>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> 3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
>>>>>>>> inclusion inside a larger query, c.f. SPARQL VALUES clause).
>>>>>>>>      - Questions:
>>>>>>>> 3.1 What're the differences between FROM TABLE and TABLE?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> FROM TABLE would be one way to get tables into the query, as would
>>>>>>> passing it in via the query context.
>>>>>>>
>>>>>>> Queries can't be assumed to
>>>>>>>
>>>>>>> TABLE in a query is accessing the table, using it to get the
>>>>>>>
>>>>>>> TARQL, and I've only read the documentation, is a query over a single
>>>>>>> CSV file.  This project should be about multiple CSVs and combining
>>>>>>> with other RDF data.
>>>>>>>
>>>>>>> A quick sketch and the syntax is not checked as sensible:
>>>>>>>
>>>>>>> SELECT ... {
>>>>>>>      # Fixed column names
>>>>>>>      TABLE <uri> {
>>>>>>>         BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
>>>>>>>         BIND (STRLANG(?a, 'en') AS ?with_language_tag)
>>>>>>>         FILTER (?v > 57)
>>>>>>>      }
>>>>>>> }
>>>>>>>
>>>>>>> More ambitious to have column naming and FILTERs:
>>>>>>>
>>>>>>> SELECT ...
>>>>>>> WHERE {
>>>>>>>
>>>>>>>       TABLE <uri> { "col1" AS ?myVar1 ,
>>>>>>>                     "col10" AS ?V ,
>>>>>>>                     "col5" AS ?appName
>>>>>>>                     FILTER(?V > 57) }
>>>>>>> }
>>>>>>>
>>>>>>> creates a set of bindings based on access description.
>>>>>>>
>>>>>>
>>>>>> Is the <uri> after TABLE the key of the Map<String, ???>? If so, I now
>>>>>> understand the TABLE clauses from the examples. However, still not
>>>>>> sure about FROM TABLE. Could you please show me some query string
>>>>>> examples containing the FROM TABLE clauses?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> FROM TABLE would set the map entry.  c.f. FROM NAMED
>>>>>
>>>>> In this case the name of the table (graph) is the location it comes
>>>>> from - it's not a general choice of name.  A common issue for FROM
>>>>> NAMED, not specific to CSV processing.
>>>>>
>>>
>
