Re: [Wikidata] Scaling Wikidata Query Service

Sebastian Hellmann Mon, 10 Jun 2019 12:04:28 -0700

Hi Guillaume,

On 10.06.19 16:54, Guillaume Lederrey wrote:

Hello!


On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
<hellm...@informatik.uni-leipzig.de> wrote:

Hi Guillaume,

On 06.06.19 21:32, Guillaume Lederrey wrote:

Hello all!

There have been a number of concerns raised about the performance and
scaling of Wikidata Query Service. We share those concerns and we are
doing our best to address them. Here is some info about what is going
on:

In an ideal world, WDQS should:

* scale in terms of data size
* scale in terms of number of edits
* have low update latency
* expose a SPARQL endpoint for queries
* allow anyone to run any queries on the public WDQS endpoint
* provide great query performance
* provide a high level of availability

Scaling graph databases is a "known hard problem", and we are reaching
a scale where there are no obvious easy solutions to address all the
above constraints. At this point, just "throwing hardware at the
problem" is not an option anymore. We need to go deeper into the
details and potentially make major changes to the current architecture.
Some scaling considerations are discussed in [1]. This is going to take
time.

I am not sure how to evaluate this correctly. Scaling databases in general is a "known hard 
problem", and graph databases are a sub-field of it: they are optimized for graph-like queries, as 
opposed to column stores or relational databases. If you say that "throwing hardware at the 
problem" does not help, you are admitting that Blazegraph does not scale for what is needed by 
Wikidata.

Yes, I am admitting that Blazegraph (at least in the way we are using
it at the moment) does not scale to our future needs. Blazegraph does
have support for sharding (what they call "Scale Out"). And yes, we
need to have a closer look at how that works. I'm not the expert here,
so I won't even try to assert if that's a viable solution or not.

Yes, sharding is what you need, I think, instead of replication. This is the technique where data is repartitioned into more manageable chunks across servers.
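
To make the idea concrete, here is a minimal sketch of hash-based sharding of RDF triples by subject. It is only an illustration under my own assumptions: the function names (shard_for, partition), the choice of SHA-1 and the sample triples are hypothetical, and this is not how Blazegraph or Virtuoso actually partition data.

```python
import hashlib
from collections import defaultdict


def shard_for(subject: str, num_shards: int) -> int:
    """Map a subject IRI to a shard index using a stable hash (assumed scheme)."""
    digest = hashlib.sha1(subject.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(triples, num_shards):
    """Group triples by the shard of their subject, so each server holds a subset."""
    shards = defaultdict(list)
    for s, p, o in triples:
        shards[shard_for(s, num_shards)].append((s, p, o))
    return shards


# Hypothetical sample data, written as prefixed names for readability.
triples = [
    ("wd:Q42", "wdt:P31", "wd:Q5"),
    ("wd:Q42", "wdt:P106", "wd:Q36180"),
    ("wd:Q64", "wdt:P31", "wd:Q515"),
]
shards = partition(triples, 3)
```

All triples with the same subject end up on the same shard, so a lookup by subject touches one server, while queries that join across subjects may have to contact several shards.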


Here is a good explanation of it:

http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF

http://docs.openlinksw.com/virtuoso/ch-clusterprogramming/

Sharding, scale-out or repartitioning is a classical enterprise feature for open-source databases. I am rather surprised that Blazegraph is full GPL without an enterprise edition. But then they really sounded like their goal as a company was to be bought by a bigger fish, in this case Amazon Web Services. What is their deal? Are they offering support?

So if you go open-source, I think you will have a hard time finding good free databases with sharding/repartitioning. FoundationDB, as proposed in the grant [1], is from Apple.


[1] https://meta.wikimedia.org/wiki/Grants:Project/WDQS_On_FoundationDB

I mean, try the sharding feature. At some point, though, it might be worth considering going enterprise. Corporate open source often has a twist.

Just a note here: Virtuoso is also a full RDBMS, so you could probably keep the Wikibase DB in the same cluster and fix the asynchronicity. That is also true for any mappers like Sparqlify: http://aksw.org/Projects/Sparqlify.html However, these shift the problem; then you need a sharded/repartitioned relational database....



All the best,

Sebastian

 From [1]:

At the moment, each WDQS cluster is a group of independent servers, sharing 
nothing, with each server independently updated and each server holding a full 
data set.

Then it is not a "cluster" in the sense of databases. It is more a redundancy 
architecture like RAID 1. Is this really how Blazegraph does it? Don't they have a proper 
cluster solution, where they repartition data across servers? Or are these independent 
servers a Wikimedia staff homebuild?

It all depends on your definition of a cluster. We have groups of
machines collectively serving some coherent traffic, but each machine
is completely independent from others. So yes, the comparison to RAID1
is adequate.
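
The RAID 1 comparison can be sketched as a toy model: every write is applied to every server, and any single server can answer a read. The class name and methods are hypothetical, chosen only for illustration; this is not the actual WDQS updater code.

```python
class ReplicatedGroup:
    """Toy model of independent servers that each hold the full data set."""

    def __init__(self, num_servers: int):
        self.servers = [set() for _ in range(num_servers)]
        self._next = 0  # round-robin pointer for reads

    def write(self, triple):
        # Every server applies every update, so write cost grows with group size.
        for server in self.servers:
            server.add(triple)

    def read(self, triple) -> bool:
        # Any single server can answer, so read capacity grows with group size.
        server = self.servers[self._next]
        self._next = (self._next + 1) % len(self.servers)
        return triple in server


group = ReplicatedGroup(3)
group.write(("wd:Q42", "wdt:P31", "wd:Q5"))
group.write(("wd:Q64", "wdt:P31", "wd:Q515"))
```

Adding servers to such a group increases read capacity and availability, but each server still has to store the whole data set and ingest every edit, which is why it does not help with data size or edit rate.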

Some info here:

- We evaluated some stores according to their performance: 
http://www.semantic-web-journal.net/content/evaluation-metadata-representations-rdf-stores-0
  "Evaluation of Metadata Representations in RDF stores"

Thanks for the link! That looks quite interesting!

- Virtuoso has proven quite useful. I don't want to advertise here, but the 
instance they run for DBpedia uses ridiculously modest hardware, i.e. 64GB RAM, and 
it is also the open-source version, not the professional one with clustering and 
repartitioning capability. So we have been playing this game for ten years now: everybody 
tries other databases, but then most people come back to Virtuoso. I have to admit 
that OpenLink maintains the hosting for DBpedia themselves, so they know 
how to optimise. Their usual customers are large banks with millions of 
write transactions per hour. In LOD2 they also implemented column store 
features with MonetDB and repartitioning in clusters.

I'm not entirely sure how to read the above (and a quick look at
the Virtuoso website does not give me the answer either), but it looks
like the sharding / partitioning options are only available in the
enterprise version. That probably makes it a non-starter for us.

- I recently heard a presentation from ArangoDB and they had a good cluster 
concept as well, although I don't know anybody who has tried it. The slides seemed 
to make sense.

Nice, another one to add to our list of options to test.

All the best,

Sebastian




Reasonably, addressing all of the above constraints is unlikely to
ever happen. Some of the constraints are non-negotiable: if we can't
keep up with Wikidata in terms of data size or number of edits, it does
not make sense to address query performance. On some constraints, we
will probably need to compromise.

For example, the update process is asynchronous. It is by nature
expected to lag. In the best case, this lag is measured in minutes,
but can climb to hours occasionally. This is a case of prioritizing
stability and correctness (ingesting all edits) over update latency.
And while we can work to reduce the maximum latency, this will still
be an asynchronous process and needs to be considered as such.
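
A minimal sketch of what "asynchronous, expected to lag" means in practice: edits are queued with their timestamps, applied in order without skipping any, and the lag is the age of the oldest edit not yet applied. The class and method names are hypothetical and the timestamps are plain seconds; this is not the real WDQS updater.

```python
from collections import deque


class AsyncUpdater:
    """Toy asynchronous updater: applies every edit in order, possibly late."""

    def __init__(self):
        self.pending = deque()  # (edit_timestamp, edit) awaiting ingestion
        self.applied = []       # edits already applied, in arrival order

    def enqueue(self, timestamp: float, edit: str):
        self.pending.append((timestamp, edit))

    def apply_batch(self, max_edits: int):
        # Correctness first: edits are never dropped, only delayed.
        for _ in range(min(max_edits, len(self.pending))):
            self.applied.append(self.pending.popleft())

    def lag(self, now: float) -> float:
        """Age in seconds of the oldest edit not yet applied (0 if caught up)."""
        if not self.pending:
            return 0.0
        return now - self.pending[0][0]


updater = AsyncUpdater()
updater.enqueue(0.0, "edit A")
updater.enqueue(10.0, "edit B")
updater.enqueue(20.0, "edit C")
updater.apply_batch(1)          # only "edit A" has been ingested so far
current_lag = updater.lag(60.0) # 50.0: "edit B" was made at t=10
```

Increasing the batch size or applying batches more often reduces the lag, but as long as ingestion is decoupled from editing, the lag can never be assumed to be zero.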

We currently have one Blazegraph expert working with us to address a
number of performance and stability issues. We are planning to hire an
additional engineer to help us support the service in the long term.
You can follow our current work in Phabricator [2].

If anyone has experience with scaling large graph databases, please
reach out to us, we're always happy to share ideas!

Thanks all for your patience!

    Guillaume

[1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/ScalingStrategy
[2] https://phabricator.wikimedia.org/project/view/1239/

--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) 
Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, http://linguistics.okfn.org, 
https://www.w3.org/community/ld4lt
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org


_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
