Dropping my two cents here: I'm wondering about usage of the Wikidata Linked Data Fragments (LDF) service [1].

LDF [2] is nice because it shifts the computation burden to the client, at the cost of less expressive SPARQL queries, IIRC. I think it would be a good idea to forward simple queries to that service instead of WDQS.

Cheers,

Marco

[1] https://query.wikidata.org/bigdata/ldf
[2] https://linkeddatafragments.org/
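To make the idea concrete, here is a minimal sketch of building a Triple Pattern Fragments (TPF) request URL against the endpoint in [1]. The `subject`/`predicate`/`object` parameter names follow the usual TPF convention, but that is an assumption here — the endpoint's own hypermedia controls are the authoritative source for the real URL template.

```python
# Sketch (assumed TPF parameter names): build a fragment request URL
# for the Wikidata LDF endpoint [1]. Unset positions act as wildcards.
from urllib.parse import urlencode

LDF_ENDPOINT = "https://query.wikidata.org/bigdata/ldf"

def tpf_url(subject=None, predicate=None, obj=None):
    """Build a TPF request URL for one triple pattern."""
    pattern = {"subject": subject, "predicate": predicate, "object": obj}
    params = {k: v for k, v in pattern.items() if v is not None}
    return f"{LDF_ENDPOINT}?{urlencode(params)}" if params else LDF_ENDPOINT

# e.g. all triples with Douglas Adams (Q42) as subject:
print(tpf_url(subject="http://www.wikidata.org/entity/Q42"))
```

The client would then page through the returned fragments and do joins locally, which is exactly the computation shift mentioned above.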

On 8/19/21 12:48 AM, Imre Samu wrote:
> (i) identify and delete lower priority data (e.g. labels, descriptions, aliases, non-normalized values, etc);

Ouch.
For me:
- as a native Hungarian, the labels, descriptions, and aliases are extremely important;
- as a data user, I am using labels and aliases in my concordance tools (mapping Wikidata IDs to external IDs).

So please clarify the practical meaning of *"delete"*.

Thanks in advance,
   Imre
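For context, a toy sketch of the kind of label/alias concordance described above — mapping external names onto Wikidata IDs. The sample rows are invented for illustration; a real tool would load them from a dump or API.

```python
# Hypothetical sketch of a label/alias concordance: map each name
# (case-folded) to its Wikidata ID so external IDs can be matched up.

def build_concordance(pairs):
    """Index (qid, label-or-alias) pairs by case-folded name."""
    index = {}
    for qid, name in pairs:
        index[name.casefold()] = qid
    return index

# Invented sample rows: (Wikidata ID, label or alias)
rows = [
    ("Q1781", "Budapest"),
    ("Q1781", "Budapest, Hungary"),
    ("Q28", "Hungary"),
    ("Q28", "Magyarország"),
]

concordance = build_concordance(rows)
print(concordance["magyarország"])  # → Q28
```

If labels and aliases were dropped from WDQS, this kind of lookup table could no longer be rebuilt from query results.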



Mike Pham <mp...@wikimedia.org <mailto:mp...@wikimedia.org>> wrote (on Wed, 18 Aug 2021, 23:08):

    Wikidata community members,


    Thank you for all of your work helping Wikidata grow and improve
    over the years. In the spirit of better communication, we would like
    to take this opportunity to share some of the current challenges
    Wikidata Query Service (WDQS) is facing, and some strategies we have
    for dealing with them.


    WDQS currently risks failing to provide acceptable service quality
    for the following reasons:

     1. Blazegraph scaling

         1. Graph size. WDQS uses Blazegraph as our graph backend. While
            Blazegraph can theoretically support 50 billion edges
            <https://blazegraph.com/>, in reality Wikidata is the
            largest graph we know of running on Blazegraph (~13 billion
            triples
            <https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&orgId=1&refresh=1m>),
            and there is a risk that we will reach a size limit
            <https://www.w3.org/wiki/LargeTripleStores#Bigdata.28R.29_.2812.7B.29>
            of what it can realistically support
            <https://phabricator.wikimedia.org/T213210>. Once Blazegraph
            is maxed out, WDQS can no longer be updated, which will also
            break Wikidata tools that rely on WDQS.

         2. Software support. Blazegraph is end-of-life software that is
            no longer actively maintained, making it an unsustainable
            backend to keep building on long term.


    Blazegraph maxing out in size poses the greatest risk of
    catastrophic failure, as it would effectively prevent WDQS from
    being updated further, so it would inevitably fall out of date. Our
    long-term strategy to address this is to move to a new graph backend
    that best meets our WDQS needs and is actively maintained, and to
    begin the migration off of Blazegraph as soon as a viable
    alternative is identified <https://phabricator.wikimedia.org/T206560>.


    In the interim period, we are exploring disaster mitigation options
    for reducing Wikidata’s graph size in the case that we hit this
    upper graph size limit: (i) identify and delete lower priority data
    (e.g. labels, descriptions, aliases, non-normalized values, etc);
    (ii) separate out certain subgraphs (such as Lexemes and/or
    scholarly articles). This would be a last resort scenario to keep
    Wikidata and WDQS running with reduced functionality while we are
    able to deploy a more long-term solution.



     2. Update and access scaling

         1. Throughput. WDQS currently tries to provide fast updates
            and fast, unlimited queries for all users. As the number of
            SPARQL queries grows over time
            <https://www.mediawiki.org/wiki/User:MPopov_(WMF)/Wikimania_2021_Hackathon>
            alongside graph updates, WDQS is struggling to keep up
            <https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=8&from=now-6M&to=now&refresh=1d>
            in every dimension of service quality without compromising
            somewhere. For users, this often leads to timed-out queries.

         2. Equitable service. We are currently unable to adjust system
            behavior per user/agent, so it is not possible to provide
            equitable service: for example, a heavy user could swamp
            WDQS enough to hinder usability for community users.


    In addition to being a querying service for Wikidata, WDQS is also
    part of the edit pipeline of Wikidata (every edit on Wikidata is
    pushed to WDQS to update the data there). While deploying the new
    Flink-based Streaming Updater
    <https://phabricator.wikimedia.org/T244590> will help increase the
    throughput of Wikidata updates, there is a substantial risk that
    WDQS will be unable to keep up with the combination of increased
    querying and updating, resulting in more tradeoffs between update
    lag and querying latency/timeouts.


    In the near term, we would like to work more closely with you to
    determine what acceptable trade-offs would be for preserving WDQS
    functionality while we scale up Wikidata querying. In the long term,
    we will be conducting more user research to better understand your
    needs so we can (i) optimize querying via SPARQL and/or other
    methods, (ii) explore better user management that will allow us to
    prevent heavy use of WDQS that does not align with the goals of our
    movement and projects, and (iii) make it easier for users to set up
    and run their own query services.


    Though this information about the current state of WDQS may not
    surprise many of you, we want to be as transparent as possible to
    ensure that there are as few surprises as possible in the case of
    any potential service disruptions or catastrophic failures, and that
    we can accommodate your work as best we can as WDQS evolves. We plan
    to hold a session on WDQS scaling challenges during WikidataCon this
    year at the end of October.


    Thanks for your understanding with these scaling challenges, and for
    any feedback you have already been providing. If you have new
    concerns, comments, or questions, you can best reach us on this talk
    page
    <https://www.wikidata.org/wiki/Wikidata_talk:Query_Service_scaling_update_Aug_2021>.
    Additionally, if you have not yet had a chance to fill out our survey
    <https://docs.google.com/forms/d/e/1FAIpQLSe1H_OXQFDCiGlp0QRwP6-Z2CGCgm96MWBBmiqsMLu0a6bhLg/viewform?usp=sf_link>,
    please tell us how you use the Wikidata Query Service (see privacy
    statement
    <https://foundation.wikimedia.org/wiki/WDQS_User_Survey_2021_Privacy_Statement>)!
    Whether you are an occasional user or a tool creator, your feedback
    is needed to guide our future development.


    Best,


    WMF Search + WMDE

    _______________________________________________
    Wikidata mailing list -- wikidata@lists.wikimedia.org
    <mailto:wikidata@lists.wikimedia.org>
    To unsubscribe send an email to wikidata-le...@lists.wikimedia.org
    <mailto:wikidata-le...@lists.wikimedia.org>

