Hi Andy, It's OK. Here's the copy of my proposal in the attachment.
Cheers,
Ying Jiang

On Wed, Mar 19, 2014 at 10:13 PM, Andy Seaborne <[email protected]> wrote:
> On 19/03/14 04:22, Ying Jiang wrote:
>>
>> Dear Andy,
>>
>> I've submitted a proposal [1] to GSoC, according to our previous
>> discussions. Please let me know if anything can be improved.
>> Thanks a lot!
>
> Looks fine - and congratulations on the lecturer position.
>
>     Andy
>
>> Cheers,
>> Ying Jiang
>>
>> [1]
>> http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/jpz6311whu/5632763709358080
>
> These URLs are restricted until projects are accepted. I can read it as:
>
> http://www.google-melange.com/gsoc/proposal/review/org/google/gsoc2014/jpz6311whu/5632763709358080
>
> not sure if that's organisation specific though.
>
> When projects are accepted, the proposal becomes public.
>
> Ying - in Apache, we do everything in public where possible. Would you mind
> emailing dev@ with a copy? (Remove anything you don't want on an archived
> list)
>
>> On Mon, Mar 17, 2014 at 10:17 PM, Andy Seaborne <[email protected]> wrote:
>>>
>>> On 16/03/14 04:31, Ying Jiang wrote:
>>>>
>>>> Dear Andy,
>>>>
>>>> I greatly appreciate your detailed explanations. I've studied all the
>>>> examples and the links you mentioned. I'll try to summarise here with
>>>> further questions below:
>>>>
>>>> 1. We have 2 possible ways for the project: "variables-as-columns" and
>>>> "property tables". I can understand both the ideas, thanks to your
>>>> instructions. The former one has its issues you pointed out, and the
>>>> latter one seems to make more sense for the users. Do you mean we
>>>> should discard the former one and focus on the latter in this project?
>>>
>>> Yes - "predicates-for-columns" = "property tables"
>>>
>>> From that, you can recover "variables-as-columns" by query pattern. The
>>> reverse is messy at best. Either very unnatural variable names to stop
>>> clashes or being careful about scoping (and that will confuse people).
>>>
>>>> 2. We can have some lessons learned from SQL-to-RDF work. But CSV
>>>> (even regular-shaped CSV) is different from a database in some ways,
>>>> which requires us to dig in deeper on the details. Some questions
>>>> like:
>>>
>>> The W3C "CSV on the Web Working Group" [1] is working on a standard
>>> mechanism for converting CSV to other forms, RDF included. The details of
>>> that mechanism aren't clear yet and won't be in time for the project - it's
>>> an area that (my current belief) will chop and change a fair bit in getting
>>> to a final specification.
>>>
>>> The area of CSV-RDF is bigger than a GSoC project anyway and fairly open
>>> ended given all the sorts of things people do with CSV files (e.g.
>>> encoding author lists in fields).
>>>
>>> But there is a simpler case - one need is a "direct mapping" whereby a
>>> CSV file with no additional metadata is mapped to RDF. I think we can focus
>>> on a design for this in the project.
>>>
>>> The translation is fixed: a blank node for each row (addresses the primary
>>> key issue - and an alternative below), the base URL of the CSV file is used
>>> to generate the predicate names.
>>>
>>> Then, the project gets all the machinery working - otherwise the output will
>>> be CSV to RDF without the Jena architectural changes to support it in the
>>> long term.
>>>
>>> [1] https://www.w3.org/2013/csvw/wiki/Main_Page
>>>
>>>> 2.1 How to determine the data type of the column? All the values in
>>>> CSV are firstly parsed as Strings line by line. Suppose the parser
>>>> found a number string of "123000.0", how can we know whether it's an
>>>> integer, a float/double or even just a string in RDF?
>>>
>>> Initially, they can be strings.
>>>
>>> Later, and maybe an option the user can turn on, then a dynamic choice which
>>> is a posh way of saying attempt to parse it as an integer and if it passes,
>>> it's an integer. Spreadsheets do this guessing.
>>>
>>> "Duck datatyping" - if it looks like an integer (decimal, double, date) it
>>> is an integer (decimal, double, date).
>>>
>>> Actually, this is then the same as tokenizing and there is code to reuse to
>>> do that.
>>>
>>>> 2.2 How to deal with the namespaces? RDF requires that the subjects
>>>> and the predicates are URIs. We need to pass in the namespaces (or
>>>> just the default namespaces) to make URIs by combining the namespaces
>>>> with the values in CSV. Things may get more complicated if different
>>>> columns are to be bound with different namespaces.
>>>
>>> Subjects can be blank nodes which is useful because each row is then a new
>>> blank node.
>>>
>>> One row written in RDF might be:
>>>
>>> [ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
>>>
>>> or
>>>
>>> _:b0 csv:row 1 ;
>>>      :Town "Southton" ;
>>>      :Population 123000 .
>>>
>>> It's the same RDF triples (3 of them).
>>>
>>> For predicates, suppose the URL of the CSV file is <FILE> then the columns
>>> can be <FILE#Town> and <FILE#Population>.
>>>
>>> Rules or SPARQL Update can be used to turn that into a better data model if
>>> the user wants to write that code.
>>>
>>>> 2.3 The HP 2006 report [1] says "Jena supports three kinds of property
>>>> tables as well as a triple store". The "town" example you provided
>>>> conforms to the "single-valued" property table. Shall we consider the
>>>> others (e.g. the "multi-valued" one and the "triple store" one) in
>>>> this project? Does Jena in the latest release still support these
>>>> property tables? If so, where is the related source code?
>>>
>>> Single-valued.
>>>
>>> In the CSV-WG it looks like duplicate column names are not going to be
>>> supported (at best, the parser has to make them unique by adding "1", "2"
>>> etc).
>>>
>>> Despite what the report says, the code didn't make it into the public Jena
>>> codebase.
>>> (And we have removed the old RDB subsystem it refers to.)
>>>
>>>> 2.4 There's no "primary key" definition in CSV. Not all RDF is OWL, in
>>>> fact. How do we know the column in CSV is uniquely defining? It
>>>> seems CSV lacks some kind of "metadata" of the columns and the
>>>> values. If we have such metadata, how to pass in the namespace of the
>>>> IRI template of http://data/town/{Town} (something related to the
>>>> question 2.2)?
>>>
>>> It's not necessary to have a defined primary row - that is the generated
>>> subject URI. It might be nice if available but that's metadata.
>>>
>>> So one of:
>>> 1/ The triples for each row have a blank node for subject
>>> 2/ The triples for row N have a URI which is <FILE#_N>.
>>>
>>> In both cases, the subject node is generated automatically.
>>>
>>>> 3. For the "property tables" way, it seems that all we need to do is
>>>> to resolve the problems in 2., and to code "GraphCSV" accordingly. I
>>>> can make the GraphCSV class by implementing the Graph interface. In
>>>> this way, for Jena ARP, a CSV table is actually a Graph, without any
>>>> differences from other types of Graphs. It looks like there's no
>>>> need to introduce TABLE and FROM TABLE clauses in the SPARQL language
>>>> grammar. We can just use the existing GRAPH, FROM and FROM NAMED
>>>> clauses for the CSV "property tables", can't we?
>>>
>>> s/ARP/ARQ/ -- ARP is the RDF/XML parser; ARQ is the query engine :-)
>>>
>>> Yes - correct.
>>>
>>> In the later stages of the project, there is an item to make OpExecutor
>>> (which is the class that actually drives the SPARQL execution) do better for
>>> GraphCSV than just treating it as a Graph by accessing the PropertyTable
>>> behind it.
>>>
>>> The big gain for PropertyTables is the space saving they enable as well as
>>> the possibility of making them persistent in a special storage system (not
>>> in this project but the design should not make that too hard at some later
>>> time).
>>>
>>>     Andy
>>>
>>>> Best regards,
>>>> Ying Jiang
>>>>
>>>> [1] http://www.hpl.hp.com/techreports/2006/HPL-2006-140.pdf
>>>>
>>>> On Mon, Mar 10, 2014 at 10:50 PM, Andy Seaborne <[email protected]> wrote:
>>>>>
>>>>> Hi Ying,
>>>>>
>>>>> Good questions. I'll try to give a response to the specific points you've
>>>>> brought up but also there is a different approach I want to put forward
>>>>> for discussion.
>>>>>
>>>>> I'll write up a first draft of a project plan then we can see if the size
>>>>> and scope is realistic.
>>>>>
>>>>> You asked about whether variables are column names. That is how TARQL and
>>>>> SPARQL VALUES work but I've realised there is a different approach and it's
>>>>> one that will give a better system. It is to translate the CSV to RDF, and
>>>>> this may be materialized or dynamically mapped. If "materialized" it's
>>>>> likely to be a lot bigger; as "property tables" or something inspired by
>>>>> that idea, it'll be more compact.
>>>>>
>>>>> Some issues with variables-as-columns include:
>>>>>
>>>>> 1/ Fixed variable names don't combine with other parts of a query pattern
>>>>> very well.
>>>>>
>>>>> If there is common use of the same name it is a join - that's what a
>>>>> natural join in SQL is. If there are two tables, then ?a is overloaded. If
>>>>> column names are used to derive a variable name, we may not want to equate
>>>>> them in the query because column names in different CSV files weren't
>>>>> designed with that in mind.
>>>>>
>>>>> 2/ You can't describe (in RDF) the data very easily - e.g. annotate that a
>>>>> column is of years.
>>>>>
>>>>> 3/ It needs the language to change (i.e. TABLE to access it)
>>>>>
>>>>> In TARQL, which is focusing on a controlled transform from CSV to RDF, it
>>>>> works out quite nicely - variables go into the CONSTRUCT template. It
>>>>> produces RDF.
>>>>>
>>>>> Property tables are a style of approach where the CSV data is accessed as
>>>>> RDF.
>>>>>
>>>>> The data table columns become predicate URIs. The data table itself is an
>>>>> RDF graph of regular structure. It can be accessed with normal (unmodified)
>>>>> SPARQL syntax. It would be better if the storage and execution of that part
>>>>> of the SPARQL query were adapted to such regular data. Something for after
>>>>> getting an initial cut down.
>>>>>
>>>>> Suppose we have a CSV file:
>>>>> -------------------
>>>>> Town,Population
>>>>> Southton,123000
>>>>> Northville,654000
>>>>> -------------------
>>>>>
>>>>> One header row, two data rows.
>>>>>
>>>>> Aside: this is regular-shaped CSV (and some CSV files are definitely not
>>>>> regular at all!). There is the current editors' working draft from the CSV
>>>>> on the Web Working Group (not yet published, likely to change, only part of
>>>>> the picture, etc etc)
>>>>>
>>>>> http://w3c.github.io/csvw/syntax/
>>>>>
>>>>> which is defining more regular data out of CSV. This is the target for
>>>>> the CSV work: table shaped CSV; not arbitrary, irregularly shaped CSV.
>>>>>
>>>>> There is no way the working group will have standardised any CSV to RDF
>>>>> mapping in the lifetime of the GSoC project but the WG charter says it must
>>>>> be covered. So the mapping below is made up and ahead of where the working
>>>>> group is currently but a standardized, "direct mapping" (no metadata, no
>>>>> templates) style is going to happen. The mapping details may change but the
>>>>> general approach is clear.
>>>>>
>>>>> As RDF this might be
>>>>>
>>>>> -------------
>>>>> @prefix : <http://example/table> .
>>>>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>>>>
>>>>> [ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
>>>>> [ csv:row 2 ; :Town "Northville" ; :Population 654000 ] .
>>>>> -------------
>>>>>
>>>>> or without the bnode abbreviation:
>>>>>
>>>>> -------------
>>>>> @prefix : <http://example/table> .
>>>>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>>>>
>>>>> _:b0 csv:row 1 ;
>>>>>      :Town "Southton" ;
>>>>>      :Population 123000 .
>>>>>
>>>>> _:b1 csv:row 2 ;
>>>>>      :Town "Northville" ;
>>>>>      :Population 654000 .
>>>>> -------------
>>>>>
>>>>> Each row is modelling one "entity" (here, a population observation). There
>>>>> is a subject (a blank node) and one predicate-value for each cell of the
>>>>> row. Row numbers are added because they can be important.
>>>>>
>>>>> Background:
>>>>>
>>>>> A related idea for property tables has come up before
>>>>>
>>>>> http://www.hpl.hp.com/techreports/2006/HPL-2006-140.html
>>>>>
>>>>> That paper should only be taken as giving a flavour. The motivation was
>>>>> different, more about making RDF look like a regular database especially
>>>>> when the data is regular. At the workshop last week, I talked to Orri Erling
>>>>> (OpenLink/Virtuoso) and apparently, maybe by parallel evolution, Virtuoso
>>>>> does something similar.
>>>>>
>>>>> Aside:
>>>>> There is a whole design space (outside this project) for translating CSV to
>>>>> RDF.
>>>>>
>>>>> Just if anyone is interested: see the related SQL-to-RDF work:
>>>>>
>>>>> http://www.w3.org/TR/r2rml/
>>>>> http://www.w3.org/TR/rdb-direct-mapping/
>>>>>
>>>>> If the metadata said that one of the columns was uniquely defining (a
>>>>> primary key in SQL terms, or inverse functional property in OWL-terms), we
>>>>> wouldn't need blank nodes at all - we could use a URI template. For
>>>>> example, if town names were unique (they are not!) an IRI template of
>>>>> http://data/town/{Town} would give:
>>>>>
>>>>> -------------
>>>>> @prefix : <http://example/table> .
>>>>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>>>> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
>>>>>
>>>>> <http://data/town/Southton>
>>>>>     csv:row 1 ;
>>>>>     rdfs:label "Southton" ;
>>>>>     :Population 123000 .
>>>>>
>>>>> <http://data/town/Northville>
>>>>>     csv:row 2 ;
>>>>>     rdfs:label "Northville" ;
>>>>>     :Population 654000 .
>>>>> -------------
>>>>>
>>>>> Doing this transformation in rules is one route. JENA-650 connection?
>>>>> </aside>
>>>>>
>>>>> In SPARQL:
>>>>>
>>>>> Now the CSV file is viewed as a graph - normal, unmodified SPARQL can be
>>>>> used. Multiple CSV files can be multiple graphs in one dataset to give
>>>>> query across different data sources.
>>>>>
>>>>> # Towns over 500,000 people.
>>>>> SELECT ?townName ?pop {
>>>>>   GRAPH <http://example/population> {
>>>>>     ?x :Town ?townName ;
>>>>>        :Population ?pop .
>>>>>     FILTER(?pop > 500000)
>>>>>   }
>>>>> }
>>>>>
>>>>> A few comments inline - the bulk of this message is above.
>>>>>
>>>>> I hope this makes some sense. Having spent time with people who really do
>>>>> work with CSV files last week around the linked geospatial workshop, the
>>>>> user needs and requirements are much clearer.
>>>>>
>>>>>     Andy
>>>>>
>>>>> PS I was on a panel that included mentioning the work you did last year. It
>>>>> went well.
>>>>>
>>>>> On 07/03/14 12:10, Ying Jiang wrote:
>>>>> ...
>>>>>
>>>>>>>> 2. Storage of the table (in-memory is enough, with reading from a
>>>>>>>> file).
>>>>>>>> - Questions:
>>>>>>>> 2.1 What's the life cycle of the in-memory table? Should we discard
>>>>>>>> the table after the query execution, or keep it in-memory for later
>>>>>>>> reuse with the same query or update, or use by a subsequent query?
>>>>>>>> When will the table be discarded?
>>>>>>>
>>>>>>> That'll need refining but a way to read and reuse. There needs to be a
>>>>>>> way for the app to pass in tables (a Map<String, ???> and a tool for
>>>>>>> reading CSVs to get the ???) because ...
>>>>>>
>>>>>> When will the tables be passed in? TARQL loads the CSVs when parsing
>>>>>> the SPARQL query string. Shall we load the tables and create the Map
>>>>>> before querying and cache them for reuse? This could be similar to
>>>>>> querying a Dataset, and the simplest way goes something like:
>>>>>>
>>>>>> // The keys of dtm are the URIs of the DataTables loaded.
>>>>>> DataTableMap<String, DataTable> dtm =
>>>>>>     DataTableSetFactory.createDataTableMap();
>>>>>> // The table data are loaded when added into the map.
>>>>>> dtm.addDataTable("<ex:table_1>", "file:table_1.csv", true);
>>>>>> // Or the table data are *lazy* loaded during querying later on,
>>>>>> // i.e. not loaded now.
>>>>>> dtm.addDataTable("<ex:table_2>", "file:table_2.csv", false);
>>>>>> // New .jj will be created for parsing TABLE and FROM TABLE clauses.
>>>>>> // However the QueryFactory interface remains the same as before.
>>>>>> Query query = QueryFactory.create(queryString) ;
>>>>>> // New create method for QueryExecutionFactory to accommodate dtm.
>>>>>> QueryExecution qExec = QueryExecutionFactory.create(query, model, dtm) ;
>>>>>> ...
>>>>>> // dtm can be reused later on for other QueryExecutions, or be
>>>>>> // discarded when the app ends.
>>>>>>
>>>>>> Is the above what you mean? Any comments?
>>>>>
>>>>> Yes, using TABLE.
>>>>>
>>>>> With property tables it can be done as
>>>>>
>>>>> // Default graph of the dataset
>>>>>
>>>>> Model csv1 =
>>>>>     ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
>>>>> QueryExecution qExec = QueryExecutionFactory.create(query, csv1) ;
>>>>>
>>>>> or for multiple CSV files and/or other RDF data:
>>>>>
>>>>> Model csv1 =
>>>>>     ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
>>>>> Model csv2 =
>>>>>     ModelFactory.createModelForGraph(new GraphCSV("data2.csv")) ;
>>>>>
>>>>> Dataset dataset = ... ;
>>>>> dataset.addNamedModel("http://example/population", csv1) ;
>>>>> dataset.addNamedModel("http://example/table2", csv2) ;
>>>>>
>>>>> ... normal SPARQL execution ...
>>>>>
>>>>>>>> 3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
>>>>>>>> inclusion inside a larger query, c.f. SPARQL VALUES clause).
>>>>>>>> - Questions:
>>>>>>>> 3.1 What're the differences between FROM TABLE and TABLE?
>>>>>>>
>>>>>>> FROM TABLE would be one way to get tables into the query as would passing
>>>>>>> it in in the query context.
>>>>>>>
>>>>>>> Queries can't be assumed to
>>>>>>>
>>>>>>> TABLE in a query is accessing the table, using it to get the
>>>>>>>
>>>>>>> TARQL, and I've only read the documentation, is a query over a single CSV
>>>>>>> file. This project should be about multiple CSVs and combining with other
>>>>>>> RDF data.
>>>>>>>
>>>>>>> A quick sketch and the syntax is not checked as sensible:
>>>>>>>
>>>>>>> SELECT ... {
>>>>>>>   # Fixed column names
>>>>>>>   TABLE <uri> {
>>>>>>>     BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
>>>>>>>     BIND (STRLANG(?a, 'en') AS ?with_language_tag)
>>>>>>>     FILTER (?v > 57)
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> More ambitious to have column naming and FILTERs:
>>>>>>>
>>>>>>> SELECT ...
>>>>>>> WHERE {
>>>>>>>
>>>>>>>   TABLE <uri> { "col1" AS ?myVar1 ,
>>>>>>>                 "col10" AS ?V ,
>>>>>>>                 "col5" AS ?appName
>>>>>>>                 FILTER(?V > 57) }
>>>>>>> }
>>>>>>>
>>>>>>> creates a set of bindings based on access description.
>>>>>>>
>>>>>>
>>>>>> Is the <uri> after TABLE the key of the Map<String, ???>? If so, I now
>>>>>> understand the TABLE clauses from the examples. However, still not
>>>>>> sure about FROM TABLE. Could you please show me some query string
>>>>>> examples containing the FROM TABLE clauses?
>>>>>
>>>>> FROM TABLE would set the map entry. c.f. FROM NAMED
>>>>>
>>>>> In this case the name of the table (graph) is the location it comes from -
>>>>> it's not a general choice of name. A common issue for FROM NAMED, not
>>>>> specific to CSV processing.
>>>>>
>>>
>
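
The "direct mapping" and "duck datatyping" ideas discussed in the thread can be illustrated with a small, standalone Java sketch. It is not Jena code and not the GraphCSV/PropertyTable design: the class name DirectMappingSketch, its methods, the naive comma split (no quoted fields), and the use of the thread's made-up vocabulary URI http://w3c/future-csv-vocab/row are all assumptions chosen for illustration. It generates one subject per row (a blank node, or <FILE#_N>), uses <FILE#Column> for predicates, and leaves a cell unquoted only when it looks like a number.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DirectMappingSketch {

    // Made-up vocabulary term taken from the example in the thread (an assumption).
    static final String CSV_ROW = "<http://w3c/future-csv-vocab/row>";

    /** "Duck datatyping": classify a cell by what its lexical form looks like. */
    public static String guessType(String cell) {
        if (cell.matches("[+-]?\\d+")) return "integer";
        if (cell.matches("[+-]?\\d*\\.\\d+")) return "decimal";
        if (cell.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)[eE][+-]?\\d+")) return "double";
        return "string";
    }

    /** Numbers are written bare; everything else becomes an escaped string literal. */
    public static String term(String cell) {
        if (!guessType(cell).equals("string")) return cell;
        return "\"" + cell.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /**
     * Map regular-shaped CSV lines (header row first) to triple strings.
     * Subjects are generated: _:bN for blank nodes, or <fileUrl#_N> otherwise.
     * Assumes no quoted fields and no embedded commas.
     */
    public static List<String> toTriples(String fileUrl, List<String> lines, boolean blankNodes) {
        String[] header = lines.get(0).split(",", -1);
        List<String> triples = new ArrayList<>();
        for (int row = 1; row < lines.size(); row++) {
            String[] cells = lines.get(row).split(",", -1);
            String subject = blankNodes
                    ? "_:b" + (row - 1)
                    : "<" + fileUrl + "#_" + row + ">";
            triples.add(subject + " " + CSV_ROW + " " + row + " .");
            for (int col = 0; col < header.length && col < cells.length; col++) {
                String predicate = "<" + fileUrl + "#" + header[col] + ">";
                triples.add(subject + " " + predicate + " " + term(cells[col]) + " .");
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList(
                "Town,Population", "Southton,123000", "Northville,654000");
        for (String t : toTriples("http://example/table", csv, true)) {
            System.out.println(t);
        }
    }
}
```

Passing false as the last argument switches the generated subjects from blank nodes to row URIs of the form <FILE#_N>, the second option listed in the thread.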
