On 6/10/19 3:49 PM, Guillaume Lederrey wrote:
> On Mon, Jun 10, 2019 at 9:03 PM Sebastian Hellmann
> <hellm...@informatik.uni-leipzig.de> wrote:
>> Hi Guillaume,
>>
>> On 10.06.19 16:54, Guillaume Lederrey wrote:
>>
>> Hello!
>>
>> On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
>> <hellm...@informatik.uni-leipzig.de> wrote:
>>
>> Hi Guillaume,
>>
>> On 06.06.19 21:32, Guillaume Lederrey wrote:
>>
>> Hello all!
>>
>> There has been a number of concerns raised about the performance and
>> scaling of Wikidata Query Service. We share those concerns and we are
>> doing our best to address them. Here is some info about what is going
>> on:
>>
>> In an ideal world, WDQS should:
>>
>> * scale in terms of data size
>> * scale in terms of number of edits
>> * have low update latency
>> * expose a SPARQL endpoint for queries
>> * allow anyone to run any queries on the public WDQS endpoint
>> * provide great query performance
>> * provide a high level of availability
>>
>> Scaling graph databases is a "known hard problem", and we are reaching
>> a scale where there are no obvious easy solutions to address all the
>> above constraints. At this point, just "throwing hardware at the
>> problem" is not an option anymore. We need to go deeper into the
>> details and potentially make major changes to the current architecture.
>> Some scaling considerations are discussed in [1]. This is going to take
>> time.
>>
>> I am not sure how to evaluate this correctly. Scaling databases in general 
>> is a "known hard problem", and graph databases are a sub-field of it, 
>> optimized for graph-like queries as opposed to column stores or relational 
>> databases. If you say that "throwing hardware at the problem" does not help, 
>> you are admitting that Blazegraph does not scale for what is needed by 
>> Wikidata.
>>
>> Yes, I am admitting that Blazegraph (at least in the way we are using
>> it at the moment) does not scale to our future needs. Blazegraph does
>> have support for sharding (what they call "Scale Out"). And yes, we
>> need to have a closer look at how that works. I'm not the expert here,
>> so I won't even try to assert if that's a viable solution or not.
>>
>> Yes, sharding is what you need, I think, instead of replication. This is the 
>> technique where data is repartitioned into more manageable chunks across 
>> servers.
> Well, we need sharding for scalability and replication for
> availability, so we do need both. The hard problem is sharding.
>
>> Here is a good explanation of it:
>>
>> http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF
> Interesting read. I don't see how Virtuoso addresses data locality, it
> looks like sharding of their RDF store is just hash based (I'm
> assuming some kind of uniform hash).


It handles data locality across a shared-nothing cluster just fine, i.e.,
you can interact with any node in a Virtuoso cluster and experience
identical behavior (every node looks like a single node in the eyes of
the operator).
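For readers unfamiliar with the hash-based sharding mentioned above, here is a minimal sketch (Python, with hypothetical Wikidata-style triples; the node count and data are illustrative assumptions, not how any particular store partitions). Hashing on the subject colocates all triples of one entity on a single node, which preserves one-hop locality, but a multi-hop traversal can still fan out across nodes:

```python
import hashlib

NUM_NODES = 4  # assumed cluster size for illustration

def shard_for(subject: str) -> int:
    """Map a triple's subject to a node via a uniform hash."""
    digest = hashlib.sha256(subject.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES

# Hypothetical (subject, predicate, object) triples
triples = [
    ("wd:Q42", "wdt:P31", "wd:Q5"),     # Douglas Adams -> human
    ("wd:Q42", "wdt:P19", "wd:Q350"),   # Douglas Adams -> Cambridge
    ("wd:Q350", "wdt:P17", "wd:Q145"),  # Cambridge -> United Kingdom
]

# Group triples by the node their subject hashes to.
shards: dict[int, list] = {}
for s, p, o in triples:
    shards.setdefault(shard_for(s), []).append((s, p, o))

# All triples with subject wd:Q42 land on the same node (one-hop
# locality), but following wdt:P19 onward to wd:Q350's triples may
# require contacting a different node.
for node, ts in sorted(shards.items()):
    print(node, ts)
```

The point of the sketch is that uniform hashing bounds skew but says nothing about graph locality beyond the chosen hash key, which is why the question of how a given product places data matters for a highly connected graph like Wikidata.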


>  I'm not enough of an expert on
> graph databases, but I doubt that a highly connected graph like
> Wikidata will be able to scale reads without some way to address data
> locality. Obviously, this needs testing.
>
>> http://docs.openlinksw.com/virtuoso/ch-clusterprogramming/


There are live instances of Virtuoso that demonstrate its capabilities.
If you want to explore shared-nothing cluster capabilities, our live
LOD Cloud cache is the place to start [1][2][3]. If you want to see the
single-server open source edition, you have DBpedia, DBpedia-Live,
Uniprot and many other nodes in the LOD Cloud to choose from. All of
these instances are highly connected.

If you want to get into the depths of Linked Data regarding query
processing pipelines that include URI (or Super Key) de-reference, you
can take a look at our URIBurner Service [4][5].

Virtuoso handles both shared-nothing clusters and replication, i.e., you
can use a cluster configuration in conjunction with a replication
topology if your solution requires it.

Virtuoso is a full-blown SQL RDBMS that leverages SPARQL and a SQL
extension for handling challenges associated with Entity Relationship
Graphs represented as RDF statement collections. You can even use SPARQL
inside SQL from any ODBC- or JDBC-compliant app or service.
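As a sketch of the SPARQL-inside-SQL point (Virtuoso calls this SPASQL): a statement sent over any SQL channel can be switched into SPARQL by prefixing it with the SPARQL keyword. The graph IRI below is a hypothetical placeholder:

```sql
-- Issued via ODBC/JDBC or Virtuoso's isql shell; the leading
-- SPARQL keyword switches this SQL statement into SPARQL mode.
SPARQL
SELECT ?s ?p ?o
FROM <http://example.org/graph>   -- hypothetical graph IRI
WHERE { ?s ?p ?o }
LIMIT 10;
```

Because the query travels through the ordinary SQL interface, any ODBC- or JDBC-compliant application can issue it without SPARQL-specific client support.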


Links:

[1] http://lod.openlinksw.com

[2]
https://twitter.com/search?f=tweets&vertical=default&q=%23PermID%20%40kidehen&src=typd
-- query samplings via links included in tweets

[3] https://tinyurl.com/y47prg9h -- SPARQL transitive option applied to
a skos taxonomy tree

[4] https://linkeddata.uriburner.com -- this service provides Linked
Data transformation combined with an ability to de-ref URI-variables and
URI-constants in the body of a query as part of the solution production
pipeline; it also includes a service that adds image processing to the
aforementioned pipeline via the PivotViewer module for data visualization

[5]
https://medium.com/virtuoso-blog/what-is-small-data-and-why-is-it-important-fbf5f267884
-- About Small Data (use of URI-dereference to tackle thorny data access
challenges by leveraging the power of HTTP URIs as Super Keys)


-- 
Regards,

Kingsley Idehen       
Founder & CEO 
OpenLink Software   
Home Page: http://www.openlinksw.com
Community Support: https://community.openlinksw.com
Weblogs (Blogs):
Company Blog: https://medium.com/openlink-software-blog
Virtuoso Blog: https://medium.com/virtuoso-blog
Data Access Drivers Blog: 
https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers

Personal Weblogs (Blogs):
Medium Blog: https://medium.com/@kidehen
Legacy Blogs: http://www.openlinksw.com/blog/~kidehen/
              http://kidehen.blogspot.com

Profile Pages:
Pinterest: https://www.pinterest.com/kidehen/
Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen
Twitter: https://twitter.com/kidehen
Google+: https://plus.google.com/+KingsleyIdehen/about
LinkedIn: http://www.linkedin.com/in/kidehen

Web Identities (WebID):
Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
        : 
http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this


_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
