Re: [Wikidata] Scaling Wikidata Query Service

Stas Malyshev Thu, 13 Jun 2019 16:53:30 -0700

Hi!

> It handles data locality across a shared nothing cluster just fine i.e.,
> you can interact with any node in a Virtuoso cluster and experience
> identical behavior (every node looks like a single node in the eyes of
> the operator).


Does this mean no sharding, i.e. each server stores the full DB? This is
the model we're using currently, but given the growth of the data it may
not be sustainable on current hardware. I see in your tables that
Uniprot has about 30B triples, but I wonder what the update loads there
look like. Our main issue is that the hardware we have now is showing
its limits when there are a lot of updates in parallel with significant
query load. So I wonder if the "single server holds everything" model is
sustainable in the long term.

> There are live instances of Virtuoso that demonstrate its capabilities.
> If you want to explore shared-nothing cluster capabilities then our live
> LOD Cloud cache is the place to start [1][2][3]. If you want to see the
> single-server open source edition then you have DBpedia, DBpedia-Live,
> Uniprot and many other nodes in the LOD Cloud to choose from. All of
> these instances are highly connected.

Again, the question here is not so much "can you load 7bn triples
into Virtuoso" - we know we can. What we want to figure out is whether,
given the specific query/update patterns we have now, it is going to
give us significantly better performance, allowing us to support our
projected growth.
And also possibly whether Virtuoso has ways to make our update workflow
more efficient - e.g. right now if one triple changes in a Wikidata
item, we're essentially downloading and updating the whole item (not
exactly, since triples that stay the same are preserved, but it requires
a lot of data transfer to express that in SPARQL). Would there be ways
to update things more efficiently?
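
To make the "update only what changed" idea concrete, here is a minimal
sketch, not the actual WDQS updater. It assumes both versions of an item
are available as sets of already-serialized triples with no blank nodes
(DELETE DATA does not allow them). The function names and the sample
triples are hypothetical, chosen only for illustration.

```python
def diff_triples(old, new):
    """Return (removed, added) triples between two versions of an item."""
    old_set, new_set = set(old), set(new)
    return sorted(old_set - new_set), sorted(new_set - old_set)


def to_sparql_update(removed, added):
    """Build a SPARQL 1.1 Update request that touches only changed triples."""
    def block(keyword, triples):
        body = "\n".join(f"  {s} {p} {o} ." for s, p, o in triples)
        return f"{keyword} {{\n{body}\n}}"

    parts = []
    if removed:
        parts.append(block("DELETE DATA", removed))
    if added:
        parts.append(block("INSERT DATA", added))
    # Operations in one update request are separated by ';'.
    return " ;\n".join(parts)


# Hypothetical item versions; prefixes are assumed to be declared elsewhere.
old = [
    ("wd:Q42", "wdt:P31", "wd:Q5"),
    ("wd:Q42", "wdt:P1082", '"100"'),
]
new = [
    ("wd:Q42", "wdt:P31", "wd:Q5"),
    ("wd:Q42", "wdt:P1082", '"101"'),
]

removed, added = diff_triples(old, new)
update = to_sparql_update(removed, added)
print(update)
```

With this approach the unchanged triple never appears in the update
request, so the data sent to the store is proportional to the size of the
change rather than the size of the item. Computing the diff still needs
the previous version of the item from somewhere, which is its own cost.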

> Virtuoso handles both shared-nothing clusters and replication i.e., you
> can have a cluster configuration used in conjunction with a replication
> topology if your solution requires that.

Replication could certainly be useful, I think, if it's faster to update
a single server and then replicate than to update all servers
simultaneously (which is what is happening now).

-- 
Stas Malyshev
smalys...@wikimedia.org

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
