I like the idea of comparing live instances. Could we pose a test-instance challenge, with some benchmarks, and invite different communities to take it up, hosting their own demos of what a well-tuned instance of Wikidata could look like? (These could also be hosted by us, or spun up by advocates for a tool in our community; it could also spur some Kaggle interest.)
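To make the challenge a bit more concrete, here is a rough sketch of how live instances could be compared on latency. It is only an illustration under assumptions: the query is a placeholder rather than an agreed benchmark, the function names and the User-Agent string are made up for this example, and the two endpoints are simply the public WDQS one and the OpenLink-hosted instance linked further down in this thread.

```python
import statistics
import time
import urllib.parse
import urllib.request

# Endpoints to compare. The first is the public WDQS endpoint; the second is the
# Virtuoso-hosted instance linked in this thread. Challenge entries would be added here.
ENDPOINTS = {
    "wdqs-blazegraph": "https://query.wikidata.org/sparql",
    "openlink-virtuoso": "https://wikidata.demo.openlinksw.com/sparql",
}

# Placeholder query (an assumption, not an agreed benchmark).
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"


def build_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        # Hypothetical client name; public endpoints generally expect one.
        "User-Agent": "wd-benchmark-sketch/0.1",
    }
    return urllib.request.Request(url, headers=headers)


def http_fetch(endpoint, query, timeout=60):
    """Run the query over HTTP and return the raw response body."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def time_query(endpoint, query, runs=5, fetch=http_fetch, clock=time.perf_counter):
    """Return per-run wall-clock latencies in seconds; failed runs are None."""
    latencies = []
    for _ in range(runs):
        start = clock()
        try:
            fetch(endpoint, query)
            latencies.append(clock() - start)
        except Exception:
            # Timeouts and HTTP errors count as failures, not as latencies.
            latencies.append(None)
    return latencies


def summarize(latencies):
    """Median latency over successful runs, plus the number of failed runs."""
    ok = [t for t in latencies if t is not None]
    return {
        "median_s": statistics.median(ok) if ok else None,
        "failures": len(latencies) - len(ok),
    }
```

Calling `summarize(time_query(url, QUERY))` for each entry in `ENDPOINTS` gives one comparable row per instance. A real challenge would need an agreed query set and would also have to check result correctness and data freshness, not just speed.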
The size of the community actively interested in the health of Wikidata seems like complementary information, alongside overall community size/health (which appears on the existing metrics list).

//S

On Fri, Aug 27, 2021 at 10:19 AM Kingsley Idehen via Wikidata <wikidata@lists.wikimedia.org> wrote:

> On 8/25/21 3:17 PM, Mike Pham wrote:
>
> Thanks for all suggestions, and general enthusiasm in helping scale WDQS!
> A number of you have suggested various graph backends to consider moving to
> from Blazegraph, and I wanted to take a minute to respond more generically.
>
> There are several criteria we need to consider for a Blazegraph
> alternative. Ideally we would have this list of criteria ready and
> available to share, so that the community can help vet alternatives with
> us. Unfortunately, we do not currently have a full list of these criteria.
> While the criteria we judged candidate graph backends on are available
> here
> <https://docs.google.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit?usp=sharing>,
> it is highly unlikely these will be the exact set we will use in this next
> stage of scaling, and should only be used as a historical reference.
>
> It is likely that there is no silver bullet solution that will satisfy
> every criteria. We will probably need to make compromises in some areas in
> order to optimize for others. This is a primary reason for conducting the WDQS
> user survey
> <https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2021/08#Wikidata_Query_Service_(WDQS)_User_Survey_2021>:
> we would like a better understanding of what the overall community
> priorities are, including from those who may be less vocal in existing
> discussions. These priorities will then be a major component in distilling
> the criteria (and weights) for a new graph backend.
>
> The current plan is to share the (most up to date as we can) survey
> results at WikidataCon
> <https://www.wikidata.org/wiki/Wikidata:WikidataCon_2021> this year.
> I appreciate the discussion around potential candidates so far, and welcome
> the continued insight/help, but wanted to also be clear that we will not be
> making any decisions about a new graph backend, or have a complete list of
> criteria or testing process, at the moment - WikidataCon will be the next
> strategic check-in point.
>
> As always, your patience is appreciated, and I'm looking forward to the
> continuing discussions and collaboration!
>
> Best,
> Mike
>
> --
>
> *Mike Pham* (he/him)
> Sr Product Manager, Search
> Wikimedia Foundation <https://wikimediafoundation.org/>
>
>
> Hi Mike,
>
> Here's a suggestion regarding this important matter, circa 2021:
>
> At the very least, a candidate platform should be able to deliver on a
> live instance of the Wikidata dataset accessible for interaction via SPARQL
> Query Services Endpoint.
>
> Based on the interesting list of suggestions presented in this mailing
> list (and in the Google Spreadsheet
> <https://docs.google.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0&range=M1>
> it's spawned), the larger goal of a vibrant LOD Cloud Knowledge Graph would
> benefit exponentially if each platform actually offered a live instance.
>
> Irrespective of the final decision made, we are always going to offer a
> live Wikidata instance, just as we do a LOD Cloud Cache etc..
>
> Also note, the WDQS and SPARQL loose-coupling suggested by Jerven is
> ultra-important, making that cool Query Services App independent of SPARQL
> Query Service backend will improve utility and general resilience,
> immensely.
>
> *Links*
>
> [1] https://wikidata.demo.openlinksw.com/sparql -- Wikidata instance
> we've been hosting for quite some time
>
> [2] http://lod.openlinksw.com/sparql -- 40 Billion+ Triples instance
> (used to be the largest live SPARQL Query Service instance until Uniprot
> dethroned it!).
>
> [3]
> https://medium.com/virtuoso-blog/on-the-mutually-beneficial-nature-of-dbpedia-and-wikidata-5fb2b9f22ada
> -- On the Mutually Beneficial Nature of DBpedia and Wikidata
>
>
> Kingsley
>
>
> On 25 August, 2021 at 09:41:28, Samuel Klein (meta...@gmail.com) wrote:
>
> Aha, hello jerven :) I should have remembered your earlier comment,
> delighted you are here.
>
> Thank you again for sharing your promising experience + benchmarks +
> suggestions -- and for highlighting both similarities and differences.
>
> SJ
>
> On Tue, Aug 24, 2021 at 2:18 AM jerven Bolleman
> <jerven.bolleman@sib.swiss> wrote:
>
>> Hi Samuel, All,
>>
>> I am the software engineer responsible for sparql.uniprot.org.
>> I already offered to help in https://phabricator.wikimedia.org/T206561.
>> So no need to ask Andra or Egon ;)
>>
>> While we are good users of virtuoso, and strongly suggest it is
>> evaluated. As it is in general a good product that does scale.[1]
>>
>> One of the things we did differently than WDQS is to introduce a
>> controlled layer between the "public" and the "database".
>> To allow things like query rewriting/redirection upon data model
>> changes, as well as rewriting some schema rediscovery queries to a known
>> faster query. We also parse the queries with RDF4J before handing them
>> to virtuoso. This makes sure that the queries that we accept are only
>> valid SPARQL 1.1. Avoiding users getting used to almost SPARQL dialects
>> (i.e. retain the flexibility to move to a different endpoint). We are in
>> the process of updating this code and contributing it to RDF4J, with the
>> first contribution in the develop/4.0.0 branch
>>
>> I think a number of current customizations in WDQS can be moved to a
>> front RDF4J layer. Then the RDF4J sail/repository layer can be used to
>> preserve flexibility. So that WDQS can more easily switch between
>> backend databases in the future.
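A minimal sketch of the kind of controlled layer Jerven describes above, written in Python rather than RDF4J/Java. It is not UniProt's code: `QueryGateway`, the rewrite table, and the keyword-based validator are all hypothetical stand-ins, and a real layer would run a full SPARQL 1.1 parser where this sketch only checks for a query-form keyword.

```python
import re


def normalize(query):
    """Collapse whitespace so trivially different spellings match one rewrite rule."""
    return re.sub(r"\s+", " ", query).strip()


def looks_like_sparql(query):
    """Stand-in validator (assumption): only checks for a query-form keyword.

    A real layer would run a full SPARQL 1.1 parser here instead.
    """
    pattern = r"\b(SELECT|ASK|CONSTRUCT|DESCRIBE)\b"
    return re.search(pattern, query, re.IGNORECASE) is not None


class QueryGateway:
    """Hypothetical front layer between public clients and a swappable backend."""

    def __init__(self, backend, rewrites=None, validator=looks_like_sparql):
        self.backend = backend      # callable: query string -> result
        self.validator = validator  # callable: query string -> bool
        # Keys are normalized once so lookups ignore whitespace differences.
        self.rewrites = {normalize(k): v for k, v in (rewrites or {}).items()}

    def execute(self, query):
        """Validate, apply any known rewrite, then hand the query to the backend."""
        if not self.validator(query):
            raise ValueError("rejected: query did not pass validation")
        # Swap a known slow query for its registered faster form, if any.
        query = self.rewrites.get(normalize(query), query)
        return self.backend(query)
```

With this shape, switching databases means passing a different `backend` callable; the public-facing validation and the rewrite rules stay where they are.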
>>
>> One large difference between UniProt and WDQS is that WikiData is
>> continually updated while UniProt is batch released a few times a year.
>> WDQS is somewhat easier in some areas and more difficult in others
>> because of that.
>>
>> Regards,
>> Jerven
>>
>> [1] No Database is perfect, but it does scale a lot better than
>> Blazegraph did. Which we also evaluated in the past. There is still a
>> lot of potential in Virtuoso to scale even better in the future.
>>
>>
>> On 23/08/2021 21:36, Samuel Klein wrote:
>> > Ah, that's lovely. Thanks for the update, Kingsley! Uniprot is a good
>> > parallel to keep in mind.
>> >
>> > For Egon, Andra, others who work with them: Is there someone you'd
>> > recommend chatting with at uniprot?
>> > "scaling alongside uniprot" or at least engaging them on how to solve
>> > shared + comparable issues (they also offer authentication-free SPARQL
>> > querying) sounds like a compelling option.
>> >
>> > S.
>> >
>> > On Thu, Aug 19, 2021 at 4:32 PM Kingsley Idehen via Wikidata
>> > <wikidata@lists.wikimedia.org> wrote:
>> >
>> >     On 8/18/21 5:07 PM, Mike Pham wrote:
>> >>
>> >>     Wikidata community members,
>> >>
>> >>     Thank you for all of your work helping Wikidata grow and improve
>> >>     over the years. In the spirit of better communication, we would
>> >>     like to take this opportunity to share some of the current
>> >>     challenges Wikidata Query Service (WDQS) is facing, and some
>> >>     strategies we have for dealing with them.
>> >>
>> >>     WDQS currently risks failing to provide acceptable service quality
>> >>     due to the following reasons:
>> >>
>> >>     1. Blazegraph scaling
>> >>
>> >>        1. Graph size. WDQS uses Blazegraph as our graph backend.
>> >>           While Blazegraph can theoretically support 50 billion
>> >>           edges <https://blazegraph.com/>, in reality Wikidata is
>> >>           the largest graph we know of running on Blazegraph (~13
>> >>           billion triples
>> >>           <https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&orgId=1&refresh=1m>),
>> >>           and there is a risk that we will reach a size
>> >>           <https://www.w3.org/wiki/LargeTripleStores#Bigdata.28R.29_.2812.7B.29>
>> >>           limit of what it can realistically support
>> >>           <https://phabricator.wikimedia.org/T213210>. Once
>> >>           Blazegraph is maxed out, WDQS can no longer be updated.
>> >>           This will also break Wikidata tools that rely on WDQS.
>> >>
>> >>        2. Software support. Blazegraph is end of life software,
>> >>           which is no longer actively maintained, making it an
>> >>           unsustainable backend to continue moving forward with long
>> >>           term.
>> >>
>> >>     Blazegraph maxing out in size poses the greatest risk for
>> >>     catastrophic failure, as it would effectively prevent WDQS from
>> >>     being updated further, and inevitably fall out of date. Our long
>> >>     term strategy to address this is to move to a new graph backend
>> >>     that best meets our WDQS needs and is actively maintained, and
>> >>     begin the migration off of Blazegraph as soon as a viable
>> >>     alternative is identified
>> >>     <https://phabricator.wikimedia.org/T206560>.
>> >>
>> >
>> >     Hi Mike,
>> >
>> >     Do bear in mind that pre and post selection of Blazegraph for
>> >     Wikidata, we've always offered an RDF-based DBMS that can handle
>> >     current and future requirements for Wikidata, just as we do DBpedia.
>> >
>> >     At the time of our first rendezvous, handling 50 billion triples
>> >     would have typically required our Cluster Edition which is a
>> >     Commercial Only offering -- basically, that was the deal breaker
>> >     back then.
>> >
>> >     Anyway, in recent times, our Open Source Edition has evolved to
>> >     handle some 80 Billion+ triples (exemplified by the live Uniprot
>> >     instance) where performance and scale is primarily a function of
>> >     available memory.
>> >
>> >     I hope this helps.
>> >
>> >     Related:
>> >
>> >     [1] https://wikidata.demo.openlinksw.com/sparql -- Our Live Wikidata
>> >     SPARQL Query Endpoint
>> >     [2] https://docs.google.com/spreadsheets/d/15AXnxMgKyCvLPil_QeGC0DiXOP-Hu8Ln97fZ683ZQF0/edit#gid=0
>> >     -- Google Spreadsheet about various Virtuoso Configurations
>> >     associated with some well-known public endpoints
>> >     [3] https://t.co/EjAAO73wwE -- this query
>> >     doesn't complete with the current Blazegraph-based Wikidata endpoint
>> >     [4] https://t.co/GTATPPJNBI -- same query
>> >     completing when applied to the Virtuoso-based endpoint
>> >     [5] https://t.co/X7mLmcYC69 -- about
>> >     loading Wikidata's datasets into a Virtuoso instance
>> >     [6] https://twitter.com/search?q=%23Wikidata%20%23VirtuosoRDBMS%20%40kidehen&src=typed_query&f=live
>> >     -- various demos shared via Twitter over the years regarding Wikidata
>> >
>> >     --
>> >     Regards,
>> >
>> >     Kingsley Idehen
>> >     Founder & CEO
>> >     OpenLink Software
>> >     Home Page: http://www.openlinksw.com
>> >     Community Support: https://community.openlinksw.com
>> >     Weblogs (Blogs):
>> >     Company Blog: https://medium.com/openlink-software-blog
>> >     Virtuoso Blog: https://medium.com/virtuoso-blog
>> >     Data Access Drivers Blog: https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers
>> >
>> >     Personal Weblogs (Blogs):
>> >     Medium Blog: https://medium.com/@kidehen
>> >     Legacy Blogs: http://www.openlinksw.com/blog/~kidehen/
>> >                   http://kidehen.blogspot.com
>> >
>> >     Profile Pages:
>> >     Pinterest: https://www.pinterest.com/kidehen/
>> >     Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen
>> >     Twitter: https://twitter.com/kidehen
>> >     Google+: https://plus.google.com/+KingsleyIdehen/about
>> >     LinkedIn: http://www.linkedin.com/in/kidehen
>> >
>> >     Web Identities (WebID):
>> >     Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
>> >             : http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this
>> >
>> >     _______________________________________________
>> >     Wikidata mailing list -- wikidata@lists.wikimedia.org
>> >     To unsubscribe send an email to wikidata-le...@lists.wikimedia.org
>> >
>> > --
>> > Samuel Klein @metasj w:user:sj +1 617 529 4266
>> >
>>
>> --
>>
>> *Jerven Tjalling Bolleman*
>> Principal Software Developer
>> *SIB | Swiss Institute of Bioinformatics*
>> 1, rue Michel Servet - CH
>> 1211 Geneva 4 - Switzerland
>> t +41 22 379 58 85
>> Jerven.Bolleman@sib.swiss - www.sib.swiss
>
> --
> Samuel Klein @metasj w:user:sj +1 617 529 4266
>
> --
> Regards,
>
> Kingsley Idehen
> Founder & CEO
> OpenLink Software

--
Samuel Klein @metasj w:user:sj +1 617 529 4266