[Wikidata] Re: Wikidata Query Service scaling update Aug 2021

Samuel Klein Wed, 01 Sep 2021 05:40:54 -0700

I like the idea of comparing live instances; could we pose a test-instance
challenge, with some benchmarks, and invite different communities to take
it up, hosting their own demos of what a well-tuned instance of WD could
look like?  (Could also be hosted by us / spun up by advocates for a tool
in our community; could also spur some kaggle interest)


The size of the community actively interested in the health of Wikidata
seems like complementary information, alongside overall community size/health
(which appears on the existing metrics list).   //S


On Fri, Aug 27, 2021 at 10:19 AM Kingsley Idehen via Wikidata <
wikidata@lists.wikimedia.org> wrote:

> On 8/25/21 3:17 PM, Mike Pham wrote:
>
> Thanks for all suggestions, and general enthusiasm in helping scale WDQS!
> A number of you have suggested various graph backends to consider moving to
> from Blazegraph, and I wanted to take a minute to respond more generically.
>
> There are several criteria we need to consider for a Blazegraph
> alternative. Ideally we would have this list of criteria ready and
> available to share, so that the community can help vet alternatives with
> us. Unfortunately, we do not currently have a full list of these criteria.
> While the criteria we judged candidate graph backends on are available
> here
> <https://docs.google.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit?usp=sharing>,
> it is highly unlikely these will be the exact set we will use in this next
> stage of scaling, and should only be used as a historical reference.
>
> It is likely that there is no silver bullet solution that will satisfy
> every criterion. We will probably need to make compromises in some areas in
> order to optimize for others. This is a primary reason for conducting the WDQS
> user survey
> <https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2021/08#Wikidata_Query_Service_(WDQS)_User_Survey_2021>:
> we would like a better understanding of what the overall community
> priorities are, including from those who may be less vocal in existing
> discussions. These priorities will then be a major component in distilling
> the criteria (and weights) for a new graph backend.
>
> The current plan is to share the survey results (as up to date as we can
> make them) at WikidataCon
> <https://www.wikidata.org/wiki/Wikidata:WikidataCon_2021> this year. I
> appreciate the discussion around potential candidates so far, and welcome
> the continued insight/help, but wanted to also be clear that we will not be
> making any decisions about a new graph backend, or have a complete list of
> criteria or testing process, at the moment — WikidataCon will be the next
> strategic check-in point.
>
> As always, your patience is appreciated, and I’m looking forward to the
> continuing discussions and collaboration!
>
> Best,
> Mike
>
>
>
> —
>
> *Mike Pham* (he/him)
> Sr Product Manager, Search
> Wikimedia Foundation <https://wikimediafoundation.org/>
>
>
> Hi Mike,
>
> Here's a suggestion regarding this important matter, circa 2021:
>
> At the very least, a candidate platform should be able to deliver a
> live instance of the Wikidata dataset, accessible for interaction via a SPARQL
> Query Service Endpoint.
>
> Based on the interesting list of suggestions presented in this mailing
> list (and in the Google Spreadsheet
> <https://docs.google.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0&range=M1>
> it's spawned), the larger goal of a vibrant LOD Cloud Knowledge Graph would
> benefit exponentially if each platform actually offered a live instance.
>
> Irrespective of the final decision made, we are always going to offer a
> live Wikidata instance, just as we do a LOD Cloud Cache etc.
>
> Also note, the WDQS and SPARQL loose-coupling suggested by Jerven is
> ultra-important: making that cool Query Services App independent of the SPARQL
> Query Service backend will improve utility and general resilience
> immensely.
>
> *Links*
>
> [1] https://wikidata.demo.openlinksw.com/sparql -- Wikidata instance
> we've been hosting for quite some time
>
> [2] http://lod.openlinksw.com/sparql -- 40 Billion+ Triples instance
> (used to be the largest live SPARQL Query Service instance until Uniprot
> dethroned it!).
>
> [3]
> https://medium.com/virtuoso-blog/on-the-mutually-beneficial-nature-of-dbpedia-and-wikidata-5fb2b9f22ada
> -- On the Mutually Beneficial Nature of DBpedia and Wikidata
>
>
> Kingsley
>
>
>
> On 25 August, 2021 at 09:41:28, Samuel Klein (meta...@gmail.com) wrote:
>
> Aha, hello jerven :)  I should have remembered your earlier comment,
> delighted you are here.
>
> Thank you again for sharing your promising experience + benchmarks +
> suggestions -- and for highlighting both similarities and differences.
>
> SJ
>
> On Tue, Aug 24, 2021 at 2:18 AM jerven Bolleman
> <jerven.bolleman@sib.swiss> wrote:
>
>> Hi Samuel, All,
>>
>> I am the software engineer responsible for sparql.uniprot.org.
>> I already offered to help in https://phabricator.wikimedia.org/T206561.
>> So no need to ask Andra or Egon ;)
>>
>> We are good users of Virtuoso, and strongly suggest it is
>> evaluated, as it is in general a good product that does scale.[1]
>>
>> One of the things we did differently than WDQS is to introduce a
>> controlled layer between the "public" and the "database".
>> This allows things like query rewriting/redirection upon data model
>> changes, as well as rewriting some schema rediscovery queries to a known
>> faster query. We also parse the queries with RDF4J before handing them
>> to Virtuoso. This makes sure that the queries that we accept are only
>> valid SPARQL 1.1, so that users do not get used to almost-SPARQL dialects
>> (i.e. we retain the flexibility to move to a different endpoint). We are in
>> the process of updating this code and contributing it to RDF4J, with the
>> first contribution in the develop/4.0.0 branch.
>>
>> I think a number of current customizations in WDQS can be moved to a
>> front RDF4J layer. Then the RDF4J sail/repository layer can be used to
>> preserve flexibility, so that WDQS can more easily switch between
>> backend databases in the future.
>>
>> One large difference between UniProt and WDQS is that Wikidata is
>> continually updated while UniProt is batch released a few times a year.
>> WDQS is somewhat easier in some areas and more difficult in others
>> because of that.
>>
>> Regards,
>> Jerven
>>
>> [1] No database is perfect, but it does scale a lot better than
>> Blazegraph did, which we also evaluated in the past. There is still a
>> lot of potential in Virtuoso to scale even better in the future.
>>
>>
>>
>>
>>
>> On 23/08/2021 21:36, Samuel Klein wrote:
>> > Ah, that's lovely.  Thanks for the update, Kingsley!  Uniprot is a good
>> > parallel to keep in mind.
>> >
>> > For Egon, Andra, others who work with them: Is there someone you'd
>> > recommend chatting with at uniprot?
>> > "scaling alongside uniprot" or at least engaging them on how to solve
>> > shared + comparable issues (they also offer authentication-free SPARQL
>> > querying) sounds like a compelling option.
>> >
>> > S.
>> >
>> > On Thu, Aug 19, 2021 at 4:32 PM Kingsley Idehen via Wikidata
>> > <wikidata@lists.wikimedia.org> wrote:
>> >
>> >     On 8/18/21 5:07 PM, Mike Pham wrote:
>> >>
>> >>     Wikidata community members,
>> >>
>> >>
>> >>     Thank you for all of your work helping Wikidata grow and improve
>> >>     over the years. In the spirit of better communication, we would
>> >>     like to take this opportunity to share some of the current
>> >>     challenges Wikidata Query Service (WDQS) is facing, and some
>> >>     strategies we have for dealing with them.
>> >>
>> >>
>> >>     WDQS currently risks failing to provide acceptable service quality
>> >>     due to the following reasons:
>> >>
>> >>     1.  Blazegraph scaling
>> >>
>> >>         1.  Graph size. WDQS uses Blazegraph as our graph backend.
>> >>             While Blazegraph can theoretically support 50 billion
>> >>             edges <https://blazegraph.com/>, in reality Wikidata is
>> >>             the largest graph we know of running on Blazegraph (~13
>> >>             billion triples
>> >>             <https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&orgId=1&refresh=1m>),
>> >>             and there is a risk that we will reach a size
>> >>             <https://www.w3.org/wiki/LargeTripleStores#Bigdata.28R.29_.2812.7B.29> limit
>> >>             of what it can realistically support
>> >>             <https://phabricator.wikimedia.org/T213210>. Once
>> >>             Blazegraph is maxed out, WDQS can no longer be updated.
>> >>             This will also break Wikidata tools that rely on WDQS.
>> >>
>> >>         2.  Software support. Blazegraph is end of life software,
>> >>             which is no longer actively maintained, making it an
>> >>             unsustainable backend to continue moving forward with long
>> >>             term.
>> >>
>> >>
>> >>     Blazegraph maxing out in size poses the greatest risk for
>> >>     catastrophic failure, as it would effectively prevent WDQS from
>> >>     being updated further, so WDQS would inevitably fall out of date. Our long
>> >>     term strategy to address this is to move to a new graph backend
>> >>     that best meets our WDQS needs and is actively maintained, and
>> >>     begin the migration off of Blazegraph as soon as a viable
>> >>     alternative is identified
>> >>     <https://phabricator.wikimedia.org/T206560>.
>> >>
>> >
>> >     Hi Mike,
>> >
>> >     Do bear in mind that pre and post selection of Blazegraph for
>> >     Wikidata, we've always offered an RDF-based DBMS that can handle
>> >     current and future requirements for Wikidata, just as we do DBpedia.
>> >
>> >     At the time of our first rendezvous, handling 50 billion triples
>> >     would have typically required our Cluster Edition which is a
>> >     Commercial Only offering -- basically, that was the deal breaker
>> >     back then.
>> >
>> >     Anyway, in recent times, our Open Source Edition has evolved to
>> >     handle some 80 Billion+ triples (exemplified by the live Uniprot
>> >     instance) where performance and scale are primarily a function of
>> >     available memory.
>> >
>> >     I hope this helps.
>> >
>> >     Related:
>> >
>> >     [1] https://wikidata.demo.openlinksw.com/sparql -- Our Live Wikidata
>> >     SPARQL Query Endpoint
>> >     [2] https://docs.google.com/spreadsheets/d/15AXnxMgKyCvLPil_QeGC0DiXOP-Hu8Ln97fZ683ZQF0/edit#gid=0
>> >     -- Google Spreadsheet about various Virtuoso Configurations
>> >     associated with some well-known public endpoints
>> >     [3] https://t.co/EjAAO73wwE -- this query
>> >     doesn't complete with the current Blazegraph-based Wikidata endpoint
>> >     [4] https://t.co/GTATPPJNBI -- same query
>> >     completing when applied to the Virtuoso-based endpoint
>> >     [5] https://t.co/X7mLmcYC69 -- about
>> >     loading Wikidata's datasets into a Virtuoso instance
>> >     [6] https://twitter.com/search?q=%23Wikidata%20%23VirtuosoRDBMS%20%40kidehen&src=typed_query&f=live
>> >     -- various demos shared via Twitter over the years regarding Wikidata
>> >
>> >     --
>> >     Regards,
>> >
>> >     Kingsley Idehen
>> >     Founder & CEO
>> >     OpenLink Software
>> >     Home Page: http://www.openlinksw.com
>> >     Community Support: https://community.openlinksw.com
>> >     Weblogs (Blogs):
>> >     Company Blog: https://medium.com/openlink-software-blog
>> >     Virtuoso Blog: https://medium.com/virtuoso-blog
>> >     Data Access Drivers Blog: https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers
>> >
>> >     Personal Weblogs (Blogs):
>> >     Medium Blog: https://medium.com/@kidehen
>> >     Legacy Blogs: http://www.openlinksw.com/blog/~kidehen/
>> >                   http://kidehen.blogspot.com
>> >
>> >     Profile Pages:
>> >     Pinterest: https://www.pinterest.com/kidehen/
>> >     Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen
>> >     Twitter: https://twitter.com/kidehen
>> >     Google+: https://plus.google.com/+KingsleyIdehen/about
>> >     LinkedIn: http://www.linkedin.com/in/kidehen
>> >
>> >     Web Identities (WebID):
>> >     Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
>> >             : http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this
>> >
>> >     _______________________________________________
>> >     Wikidata mailing list -- wikidata@lists.wikimedia.org
>> >     <mailto:wikidata@lists.wikimedia.org>
>> >     To unsubscribe send an email to wikidata-le...@lists.wikimedia.org
>> >     <mailto:wikidata-le...@lists.wikimedia.org>
>> >
>> >
>> >
>> > --
>> > Samuel Klein          @metasj           w:user:sj          +1 617 529 4266
>> >
>> >
>>
>> --
>>
>>         *Jerven Tjalling Bolleman*
>> Principal Software Developer
>> *SIB | Swiss Institute of Bioinformatics*
>> 1, rue Michel Servet - CH 1211 Geneva 4 - Switzerland
>> t +41 22 379 58 85
>> Jerven.Bolleman@sib.swiss - www.sib.swiss
>>
>
>
> --
> Samuel Klein          @metasj           w:user:sj          +1 617 529 4266
>
>
>


-- 
Samuel Klein          @metasj           w:user:sj          +1 617 529 4266

