On 07.02.20 14:32, Guillaume Lederrey wrote:
Keeping all of Wikidata in a single graph is most probably not going to work long term. We have not found examples of public SPARQL endpoints with > 10 B triples and there is probably a good reason for that. We will probably need to split the graphs at some point. We don't know how yet (that's why we loaded the dumps into Hadoop, that might give us some more insight). We might expose a subgraph with only truthy statements. Or have language specific graphs, with only language specific labels. Or something completely different.

I have not looked in detail at query runtimes or at how Blazegraph's indexing works internally, but I noticed that queries involving SPARQL property paths (and especially joins of those) often take a long time to run. At the same time, I recently discovered that if we only store which entity is connected to which other entity (without the actual statement details, such as property, qualifiers or ranks), those connections take up only about 2 GB compressed with Zstandard (I represented each connection as a pair of 32-bit integers: <source entity> <destination entity>). Of course that discards a lot of important information, but it made me wonder whether queries could be evaluated more efficiently, given the relatively strict schema that the RDF representation of Wikidata adheres to, since it is generated from a more structured form (Statements). As an example, Blazegraph doesn't know the relationship between wdt:Pxxx and p:Pxxx, or even things like p:Pxxx/ps:Pxxx.
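To illustrate what I mean by that packed representation, here is a rough sketch (the entity IDs are made up, and I'm assuming numpy plus the zstandard Python package; it is not meant as anything more than an example):

import numpy as np
import zstandard as zstd

# Hypothetical adjacency list: each row is (source entity ID, destination entity ID),
# i.e. just the numeric part of the Q-IDs, with all statement details dropped.
edges = np.array(
    [
        [42, 5],        # Q42 -> Q5
        [42, 6581097],  # Q42 -> Q6581097
        [5, 830077],    # Q5  -> Q830077
    ],
    dtype=np.uint32,    # one 32-bit int per entity, 8 bytes per connection
)

# Serialize the raw pairs and compress them with Zstandard.
raw = edges.tobytes()
compressed = zstd.ZstdCompressor(level=19).compress(raw)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")

# Decompressing gives back the same pairs.
restored = np.frombuffer(
    zstd.ZstdDecompressor().decompress(compressed), dtype=np.uint32
).reshape(-1, 2)
assert (restored == edges).all()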

Another, somewhat related idea: perhaps it's possible to keep the SPARQL interface as the frontend, but use a more efficient, split representation of the graph in the backend? I'm not sure how different that would be from the indexing that Blazegraph already does, though.

Regards,

Benno

PS: Apologies to Guillaume if you receive this mail twice; I clicked the wrong button when replying.



