On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata <
wikidata@lists.wikimedia.org> wrote:

>
> On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
> > Hello all!
> >
> > TL;DR: We expect to successfully complete the recent data reload on
> > Wikidata Query Service soon, but we've encountered multiple failures
> > related to the size of the graph, and anticipate that this issue may
> > worsen in the future. Although we succeeded this time, we cannot
> > guarantee that future reload attempts will be successful given the
> > current trend of the data reload process. Thank you for your
> > understanding and patience.
> >
> > Longer version:
> >
> > WDQS is updated from a stream of recent changes on Wikidata, with a
> > maximum delay of ~2 minutes. This process was improved as part of the
> > WDQS Streaming Updater project to ensure data coherence[1] . However,
> > the update process is still imperfect and can lead to data
> > inconsistencies in some cases[2][3]. To address this, we reload the
> > data from dumps a few times per year to reinitialize the system from a
> > known good state.
> >
> > The recent reload of data from dumps started in mid-December and was
> > initially met with issues related to downloads and instabilities
> > in Blazegraph, the database used by WDQS[4]. Loading the data into
> > Blazegraph takes a couple of weeks due to the size of the graph, and
> > we had multiple attempts where the reload failed after >90% of the
> > data had been loaded. Our understanding is that a "race
> > condition"[5] in Blazegraph is to blame: subtle timing changes lead
> > to corruption of the journal in some rare cases.[6]
> >
> > We want to reassure you that the last reload job was successful on one
> > of our servers. The data still needs to be copied over to all of the
> > WDQS servers, which will take a couple of weeks, but should not bring
> > any additional issues. However, reloading the full data from dumps is
> > becoming more complex as the data size grows, and we wanted to let you
> > know why the process took longer than expected. We understand that
> > data inconsistencies can be problematic, and we appreciate your
> > patience and understanding while we work to ensure the quality and
> > consistency of the data on WDQS.
> >
> > Thank you for your continued support and understanding!
> >
> >
> >     Guillaume
> >
> >
> > [1] https://phabricator.wikimedia.org/T244590
> > [2] https://phabricator.wikimedia.org/T323239
> > [3] https://phabricator.wikimedia.org/T322869
> > [4] https://phabricator.wikimedia.org/T323096
> > [5] https://en.wikipedia.org/wiki/Race_condition#In_software
> > [6] https://phabricator.wikimedia.org/T263110
> >
> Hi Guillaume,
>
> Are there plans to decouple WDQS from the back-end database? Doing that
> would provide a more resilient architecture for Wikidata as a whole,
> since you would be able to swap and interchange SPARQL-compliant backends.
>

It depends on what you mean by decoupling. The coupling points, as I see
them, are:

* update process
* UI
* exposed SPARQL endpoint

The update process is mostly decoupled from the backend. It produces a
stream of RDF updates that is backend independent, with a very thin
Blazegraph-specific adapter to load the data into Blazegraph.
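To illustrate the shape of that split, here is a minimal sketch (the class and method names are hypothetical, not the actual Streaming Updater code): a backend-independent patch of triple changes, plus a thin adapter that turns it into something the backend can load.

```python
from dataclasses import dataclass, field


@dataclass
class RdfPatch:
    """A backend-independent batch of triple changes,
    as produced by the update stream."""
    inserts: list[str] = field(default_factory=list)  # N-Triples lines to add
    deletes: list[str] = field(default_factory=list)  # N-Triples lines to remove


class BlazegraphAdapter:
    """The thin backend-specific layer: translates a patch into
    SPARQL Update statements for the backend to execute."""

    def to_sparql_update(self, patch: RdfPatch) -> str:
        parts = []
        if patch.deletes:
            parts.append("DELETE DATA { %s }" % " ".join(patch.deletes))
        if patch.inserts:
            parts.append("INSERT DATA { %s }" % " ".join(patch.inserts))
        return " ;\n".join(parts)
```

The point of the design is that swapping backends only means writing another small adapter; the stream of patches itself never changes.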

The UI is mostly backend independent. It relies on Search for some
features, and of course the queries themselves might depend on
Blazegraph-specific features.

The exposed SPARQL endpoint is at the moment a direct exposition of the
Blazegraph endpoint, so it exposes all the Blazegraph-specific features
and quirks.


What we would like to do at some point (though this is no more than a
rough idea) is to add a proxy in front of the SPARQL endpoint that would
filter out backend-specific SPARQL features, limiting what is available
to a standard set supported across most potential backends. This would
help reduce the coupling of queries to the backend. Of course, it would
have the drawback of limiting the feature set.
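A crude sketch of what such a proxy check might look like (the marker list is illustrative, not exhaustive, and a real proxy would parse the query rather than scan for substrings):

```python
# Markers of Blazegraph-specific extensions that a standards-only
# proxy might reject. Illustrative examples only.
NON_STANDARD_MARKERS = [
    "bds:",            # Blazegraph full-text search service
    "hint:",           # Blazegraph query hints
    "bd:serviceParam", # Blazegraph service parameters
]


def is_standard_sparql(query: str) -> bool:
    """Return False if the query uses any known backend-specific marker."""
    return not any(marker in query for marker in NON_STANDARD_MARKERS)
```

A query that passes the check would be forwarded to whatever backend sits behind the proxy; one that fails would be rejected with an explanation, so query authors learn which features are portable.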

I'm not sure I entirely understood the question, please let me know if my
answer is missing the point.

  Have fun!

    Guillaume


> BTW -- we are going to make AWS and even Azure hosted instances (offered
> on a PAGO basis) of our Virtuoso-hosted edition of Wikidata (which we
> recently reloaded) available.
>
> --
> Regards,
>
> Kingsley Idehen
> Founder & CEO
> OpenLink Software
> Home Page: http://www.openlinksw.com
> Community Support: https://community.openlinksw.com
> Weblogs (Blogs):
> Company Blog: https://medium.com/openlink-software-blog
> Virtuoso Blog: https://medium.com/virtuoso-blog
> Data Access Drivers Blog:
> https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers
>
> Personal Weblogs (Blogs):
> Medium Blog: https://medium.com/@kidehen
> Legacy Blogs: http://www.openlinksw.com/blog/~kidehen/
>                http://kidehen.blogspot.com
>
> Profile Pages:
> Pinterest: https://www.pinterest.com/kidehen/
> Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen
> Twitter: https://twitter.com/kidehen
> Google+: https://plus.google.com/+KingsleyIdehen/about
> LinkedIn: http://www.linkedin.com/in/kidehen
>
> Web Identities (WebID):
> Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
>          :
> http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this
>
> _______________________________________________
> Wikidata mailing list -- wikidata@lists.wikimedia.org
> Public archives at
> https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/6ND7MOVXL3F73SR37MBWEIT5CCOK2EES/
> To unsubscribe send an email to wikidata-le...@lists.wikimedia.org
>


-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
_______________________________________________
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/BMBS6MNJ2BMMGZ5W2QZX2XJY6OOQ43IR/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org