Hi!

> There's a brief summary in the paper of the models used. In terms of all
> the "gory" details of how everything was generated, (hopefully) all of
> the relevant details supporting the paper should be available here:
>
> http://users.dcc.uchile.cl/~dhernand/wquery/
Yes, the gory part is what I'm after :) Thank you, I'll read through it
in the next couple of days and come back with any questions/comments I
might have.

> We just generalised sitelinks and references as a special type of
> qualifier (actually I don't think the paper mentions sitelinks but we
> mention this in the context of references).

Sitelinks cannot be qualifiers, since they belong to the entity, not to
the statement. They can, I imagine, be considered a special case of
properties (we do not do it, but in theory it is not impossible to
represent them this way if one wanted to). I am not sure how exactly one
would make references a special case of qualifiers, since a qualifier
has one (possibly complex) value, while each reference can have multiple
properties and values. But I'll read through the details and the code
before I say more about it; it's possible that I'll find my answers
there.

> I guess that depends on what you mean by "automatic" or "manual". :)
>
> Automatic scripts were manually coded to convert from the JSON dump to
> each representation. The code is linked above.

Here I meant queries, not data.

> I'm not sure I follow on this part, in particular on the part of
> "semantic completeness" and why this is hard to achieve in the context
> of relational databases. (I get the gist but don't understand enough to
> respond directly ... but perhaps below I can answer indirectly?)

Check out:

https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format

This is the range of data we need to represent and allow people to
query. We found it hard to do this using the relational model. It's
probably possible in theory, but producing efficient queries for it
looked very challenging, unless we were essentially to duplicate the
effort that is implemented in any graph database and use the database
itself only for the most basic storage needs.
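To make the "duplicate the graph database's effort" point concrete, here is a toy Python sketch of the kind of machinery a graph store provides out of the box: unbounded transitive traversal (what SPARQL writes as wdt:P279*). The labels below are illustrative stand-ins, not real Wikidata lookups.

```python
from collections import deque

# Toy edge list standing in for "subclass of" (P279) triples;
# the labels are illustrative, not real Wikidata entities.
subclass_of = {
    "house cat": ["felid"],
    "felid": ["carnivore"],
    "carnivore": ["mammal"],
    "mammal": ["animal"],
}

def transitive_superclasses(start):
    """All classes reachable via subclass-of edges (SPARQL: wdt:P279*)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for parent in subclass_of.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

assert transitive_superclasses("house cat") == {
    "felid", "carnivore", "mammal", "animal"
}
```

A graph engine evaluates this as a built-in path operator; on an SQL platform each such path expression would need something like a hand-written recursive query, which is the duplicated effort mentioned above.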
That's pretty much what the Titan + Cassandra combo did, which we
initially used until Titan's devs were acquired by DataStax and the
resulting uncertainty prompted us to look into different solutions. I
imagine in theory it's also possible to create a Something+PostgreSQL
combo doing the same, but PostgreSQL alone does not look sufficient. In
any case, dealing with things like property paths seems to be rather
hard on an SQL-based platform, and they are practically a must for
Wikidata querying.

> * It's not so good when there's a lot of "self-joins" in the query
> (compared with Virtuoso), like for "bushy queries" (or what we call
> "snowflake queries"), or when multiple values for a tuple are given
> (i.e., a single pattern contains multiple constants) but neither on
> their own are particularly selective. We figure that perhaps Virtuoso
> has special optimisations for such self-joins since they would be much
> more common in an RDF/SPARQL scenario than a relational/SQL scenario.

That confirms my intuition about it, thanks for the details :)

> * Encoding object values with different datatypes (booleans, dates,
> etc.) was a pain. One option was to have separate tables/columns for
> each datatype, which would complicate queries and also leave the
> question of how to add calendars, precisions, etc. Another option was to
> use JSON strings to encode the values (the version of Postgres we used
> just considered these as strings, but I think the new version has some
> JSONB(?) support that could help get around this).

That is also an issue. We have a number of specialty data types (e.g.
dates extending billions of years into the future/past, coordinates
including different globes, etc.) which may present a challenge unless
the platform offers an easy way to encode custom types and deal with
them. RDF has a rather flexible model here (basically string + type
IRI), and Blazegraph too; I am not sure how accommodating SQL databases
would be.
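As a minimal Python sketch of the specialty-date problem: time values in the style of Wikidata's time datatype (signed year of arbitrary length, precision) do not compare correctly as raw strings, so a JSON-string encoding needs a derived sort key before it can be indexed or range-scanned. The field names below are illustrative, not the exact dump schema.

```python
import json

# Hypothetical JSON-wrapped time values in the style of Wikidata's
# time datatype (signed year, precision); field names are illustrative.
values = [
    {"time": "-13798000000-00-00T00:00:00Z", "precision": 3},  # ~Big Bang
    {"time": "+0632-06-08T00:00:00Z", "precision": 11},
    {"time": "+2015-07-01T00:00:00Z", "precision": 11},
]

# Sorting the raw JSON strings compares characters, not time: since
# '+' < '-' in ASCII, the negative-year value sorts *last* here even
# though it is the earliest date on the timeline.
as_strings = sorted(json.dumps(v) for v in values)

def sort_key(v):
    """Derive a comparable key: the signed year as an integer."""
    t = v["time"]
    year = int(t[: t.index("-", 1)])  # sign + digits up to the next '-'
    return year

by_time = sorted(values, key=sort_key)
assert [sort_key(v) for v in by_time] == [-13798000000, 632, 2015]
```

In Postgres terms this would mean maintaining an expression index over such a derived key for every type you want to sort or range over, rather than getting comparable typed literals from the store itself.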
Also, relational DBs mostly prefer a very predictable data type model -
i.e. the same column always contains the same type. This is obviously
not true for any generic representation, and may not be true even in a
very restricted context - e.g. the same property can have values of
different types (rare, but it happens). Of course, one can wrap
everything into JSON - but then how do you index, i.e. sort or do range
scans by date, if the date is a JSON object with no natural way to
compare?

Given such issues, we stopped investigating the SQL direction very early
(we also didn't have a lot of manpower to investigate every avenue
completely, so we had to quickly evaluate a number of solutions, choose
a preferred one and concentrate there).

-- 
Stas Malyshev
smalys...@wikimedia.org

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata