Hi!

> There's a brief summary in the paper of the models used. In terms of all
> the "gory" details of how everything was generated, (hopefully) all of
> the relevant details supporting the paper should be available here:
> 
> http://users.dcc.uchile.cl/~dhernand/wquery/

Yes, the gory part is what I'm after :) Thank you, I'll read through it
in the next couple of days and come back with any questions/comments I
might have.

> We just generalised sitelinks and references as a special type of
> qualifier (actually I don't think the paper mentions sitelinks but we
> mention this in the context of references).

Sitelinks cannot be qualifiers, since they belong to the entity, not to
the statement. They can, I imagine, be considered a special case of
properties (we do not do this, but in theory they could be represented
that way if one wanted to).
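
For illustration, here's a minimal sketch (Python, assuming the rdflib
package; the Turtle is modelled on the RDF dump format linked further
down in this mail) of how a sitelink hangs off the entity rather than
off any statement:

  from rdflib import Graph

  # Sitelink triples from the dump: the Wikipedia article resource
  # points at the item via schema:about; no statement node is involved.
  ttl = '''
  @prefix schema: <http://schema.org/> .
  @prefix wd: <http://www.wikidata.org/entity/> .

  <https://en.wikipedia.org/wiki/Douglas_Adams>
      schema:about wd:Q42 ;
      schema:inLanguage "en" .
  '''

  g = Graph()
  g.parse(data=ttl, format="turtle")
  print(len(g))  # 2 triples, both anchored on the article/entity pair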

I am not sure how exactly one would make references a special case of
qualifiers, since a qualifier has one (possibly complex) value, while
each reference can have multiple properties and values. But I'll read
through the details and the code before I say more about it; it's
possible I'll find my answers there.
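
To make sure we're talking about the same structures, here is a
simplified sketch (as a Python dict) of the relevant shapes in the JSON
dump - the field names are the dump's own, the values are abbreviated.
A qualifier is a single snak on the statement, while each reference is
a record that bundles several property/snak pairs:

  statement = {
      "mainsnak": {"property": "P69", "datavalue": "..."},    # educated at
      "qualifiers": {
          "P582": [{"property": "P582", "datavalue": "..."}]  # end time
      },
      "references": [
          {   # one reference, itself grouping multiple properties
              "snaks": {
                  "P248": [{"property": "P248", "datavalue": "..."}],  # stated in
                  "P813": [{"property": "P813", "datavalue": "..."}],  # retrieved
              }
          }
      ],
  }

So a reference looks less like a single qualifier value and more like a
small set of snaks of its own - which is exactly the part I want to see
handled in the code.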

> I guess that depends on what you mean by "automatic" or "manual". :)
> 
> Automatic scripts were manually coded to convert from the JSON dump to
> each representation. The code is linked above.

Here I meant queries, not data.

> I'm not sure I follow on this part, in particular on the part of
> "semantic completeness" and why this is hard to achieve in the context
> of relational databases. (I get the gist but don't understand enough to
> respond directly ... but perhaps below I can answer indirectly?)

Check out https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format

This is the range of data we need to represent and allow people to
query. We found it hard to do this using the relational model. It's
probably possible in theory, but producing efficient queries for it
looked very challenging, unless we essentially duplicated the effort
already implemented in any graph database and used the DB itself only
for the most basic storage needs. That's pretty much what the Titan +
Cassandra combo did, which we initially used until Titan's devs were
acquired by DataStax and the resulting uncertainty prompted us to look
into different solutions. I imagine in theory it's also possible to
create a Something+PostgreSQL combo doing the same, but PostgreSQL on
its own does not look sufficient.

In any case, dealing with things like property paths seems to be rather
hard on an SQL-based platform, and they are practically a must for
Wikidata querying.
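
As a concrete example, here is the kind of path query that is a
one-liner in SPARQL - sketched in Python with the SPARQLWrapper package
against the public endpoint; "instance of any subclass of" via
wdt:P31/wdt:P279* - but that needs recursive SQL to express at all:

  from SPARQLWrapper import SPARQLWrapper, JSON

  sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
  sparql.setQuery('''
      SELECT ?item WHERE {
        ?item wdt:P31/wdt:P279* wd:Q5 .  # humans, via any subclass chain
      } LIMIT 5
  ''')
  sparql.setReturnFormat(JSON)
  for binding in sparql.query().convert()["results"]["bindings"]:
      print(binding["item"]["value"])

The SQL counterpart is a recursive CTE over the subclass hierarchy, and
making that perform well at Wikidata's scale is the hard part.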

> * It's not so good when there's a lot of "self-joins" in the query
> (compared with Virtuoso), like for "bushy queries" (or what we call
> "snowflake queries"), or when multiple values for a tuple are given
> (i.e., a single pattern contains multiple constants) but neither on
> their own are particularly selective. We figure that perhaps Virtuoso
> has special optimisations for such self-joins since they would be much
> more common in an RDF/SPARQL scenario than a relational/SQL scenario.

That confirms my intuition about it, thanks for the details :)
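
To make the self-join point concrete for other readers, here's a toy
sketch (Python's built-in sqlite3; the single-table triple layout and
IDs are illustrative, not the paper's actual schema) of how a
three-pattern "bushy" query on one subject already needs two
self-joins:

  import sqlite3

  con = sqlite3.connect(":memory:")
  con.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
  con.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
      ("Q42", "P31", "Q5"),       # instance of: human
      ("Q42", "P106", "Q36180"),  # occupation: writer
      ("Q42", "P27", "Q145"),     # citizenship: United Kingdom
  ])

  # Each extra triple pattern on the same subject adds another
  # self-join; on real data none of these constants alone is selective.
  rows = con.execute("""
      SELECT t1.s
      FROM triples t1
      JOIN triples t2 ON t2.s = t1.s
      JOIN triples t3 ON t3.s = t1.s
      WHERE t1.p = 'P31'  AND t1.o = 'Q5'
        AND t2.p = 'P106' AND t2.o = 'Q36180'
        AND t3.p = 'P27'  AND t3.o = 'Q145'
  """).fetchall()
  print(rows)  # [('Q42',)]

An RDF store sees this join shape constantly and can specialise for it;
a general-purpose SQL optimiser usually doesn't.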

> * Encoding object values with different datatypes (booleans, dates,
> etc.) was a pain. One option was to have separate tables/columns for
> each datatype, which would complicate queries and also leave the
> question of how to add calendars, precisions, etc. Another option was to
> use JSON strings to encode the values (the version of Postgres we used
> just considered these as strings, but I think the new version has some
> JSONB(?) support that could help get around this).

That is also an issue. We have a number of specialty data types (e.g.
dates extending billions of years into the future/past, coordinates on
different globes, etc.) which may present a challenge unless the
platform offers an easy way to encode custom types and deal with them.
RDF has a rather flexible model here (basically a string plus a type
IRI), and so does Blazegraph; I'm not sure how accommodating SQL
databases would be.
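
For a flavour of what those values look like (illustrative only,
simplified from the dump format, some fields omitted): the time and
coordinate records below have no direct counterpart among stock SQL
column types, while in RDF each can still travel as a literal plus a
datatype IRI.

  # Wikidata-style time value: far outside the range of SQL date types.
  time_value = {
      "time": "-13798000000-00-00T00:00:00Z",  # ~13.8 billion years ago
      "precision": 0,                          # billion-year precision
      "calendarmodel": "http://www.wikidata.org/entity/Q1985727",
  }

  # Coordinate value: the globe is part of the value, not assumed Earth.
  coordinate_value = {
      "latitude": 18.65,
      "longitude": 226.2,
      "globe": "http://www.wikidata.org/entity/Q111",  # Mars
  }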

Also, relational DBs mostly prefer a very predictable data type model,
i.e. the same column always contains the same type. This is obviously
not true for any generic representation, and it may not be true even in
a very restricted context - e.g. the same property can have values of
different types (rare, but it happens).

Of course, one can wrap everything in JSON - but then how do you index,
i.e. sort or do range scans by date, if the date is a JSON object with
no natural way to compare?
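
A two-line illustration of the problem, in Python - with
variable-length years, string order and chronological order part ways:

  a = "+9999-01-01T00:00:00Z"
  b = "+10000-01-01T00:00:00Z"  # one year later than a
  print(a < b)  # False: lexicographically "+1..." sorts before "+9..."

So indexing the raw JSON (or the raw string inside it) doesn't give you
a usable date order; you'd have to extract and normalise a sort key
first, which is extra machinery we'd need to build and maintain.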

Issues like these are why we stopped investigating the SQL direction
very early (we also didn't have the manpower to investigate every
avenue completely, so we had to quickly evaluate a number of solutions,
choose a preferred one, and concentrate there).

-- 
Stas Malyshev
smalys...@wikimedia.org
