I like the idea of dumps as an alternative too. But I think this should be a service that is offered via the WM Clouds. Some might remember me asking related questions on this very mailing list several months ago.
Having a DB called "latest_dump" which actually has the latest dump of all wikis would be tremendously helpful. Many cross-wiki queries can work off of several-days-old data. On Sat, Nov 14, 2020 at 2:52 PM Amir Sarabadani <ladsgr...@gmail.com> wrote: > Hello, > I actually welcome the change and am quite happy about it. It might break > several tools (including some of mine) but as a database nerd, I can see > the benefits outweighing the problems (and I wish benefits would have been > communicated in the announcement). > > The short version is that this change would make labs replicas blazing > fast. > > The long version: Database of all of wikis is currently being replicated > to a set of giant "cloud" or "labs" replicas. IIRC correctly, these dbs > have 512GB memory (while being massive is not big enough to hold > everything), the space left for InnoDB Buffer pool should be around 350GB > and storing everything in there is impossible (the rest would be for > temporary tables and other critical functions), so I assume when you query > quarry (sorry, I had to make the pun), most it is actually coming from > reading disk which is ten times slower. Looking at graphs > <https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=13&orgId=1&var-server=labsdb1011&var-port=9104&from=now-7d&to=now>, > The Innodb buffer pool efficiency for labs dbs is around 99% (two nines), > while production databases (similar hardware but split into eight different > sections) is 99.99% (four nines), these two orders of magnitude difference > is mostly because of cache locality which I hope we would achieve if these > changes get done (unless the new hardware will be commodity hardware > instead of beefy servers but I doubt that, correct me if I'm wrong). > Meaning less timeouts, less slow apps and tools, etc. It's not just speed > though, the updates coming in to replicas would be split too so it wouldn't > saturate the network and less heavy I/O in memory and disk meaning better > scalability (adding commons/wikidata on each section would be the exact > opposite of that and even if we do it now, we eventually have to pull the > plug as wikis are growing and we are not the same size or growth speed we > used to be years ago). > > I understand it would break tools and queries but I have a feeling that > lots of them should be already split into multiple queries, or should read > dumps instead or sometimes it's more of an x/y problem > <https://en.wikipedia.org/wiki/XY_problem> > > I think this is great and a big thank you for doing it. > > On Fri, Nov 13, 2020 at 11:39 AM Kimmo Virtanen <kimmo.virta...@gmail.com> > wrote: > >> As a follow up comment. >> >> If I understand correctly the main problems are a) databases are growing >> too big to be stored in single instances and b) query complexity is >> growing. >> >> a) the growth of the data is not going away as the major drivers for the >> growth are automated edits from Wikidata and Structured data on Commons. >> They are generating new data with increasing speed faster than humans ever >> could. So the longer term answer is to store the data to separate instances >> and use something like federated queries. This is how the access to the >> commonwiki replica was originally done when toolserver moved to toollabs in >> 2014.[1] Another long term solution to make databases smaller is to >> replicate only the current state of the wikidata/commonswiki and leave for >> example the revision history out. >> >> b) a major factor for query complexity which affects the query execution >> times is afaik the actor migration and the data sanitization which executes >> the queries through the multiple views.[2,3] I have no idea how bad the >> problem currently is, but one could think that replication could be >> implemented with lighter sanitation by leaving some of the problematic data >> out altogether from replication. >> >> Anyway, my question is, are there more detailed plans for the *Wiki >> Replicas 2020 Redesign *than what is on the wikipage[4] or tickets >> linked from it? I guess there is if the plan is to buy new hardware in >> October and now we are in the implementation phase? Also is there >> information on the actual bottlenecks at table level? I.e., which tables >> (in which databases) are the too big ones, hard to keep up in replication >> and slow in terms of query time? >> >> [1] >> https://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Migration_of_Toolserver_tools#Will_the_commons_database_be_replicated_to_all_clusters,_like_it_is_on_the_Toolserver >> ? >> [2] >> https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_Replicas >> [3] https://phabricator.wikimedia.org/T215445 >> [4] https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign >> >> Br, >> -- Kimmo Virtanen, Zache >> >> On Fri, Nov 13, 2020 at 8:51 AM Kimmo Virtanen <kimmo.virta...@gmail.com> >> wrote: >> >>> > Maarten: Having 6 servers with each one having a slice + s4 (Commons) >>> + s8 (Wikidata) might be a good compromise. >>> > Martin: Another idea is to have the database structured as-planned, >>> but add a server with *all* databases that would be slower/less stable, >>> but will provide a solution for those who really need cross database joins >>> >>> From the point of view of a person who is using cross database joins on >>> both tools and analysis queries I would say that both ideas would be >>> suitable. I think that 90% of my crosswiki queries are written against >>> *wiki + wikidata/commons. However, I would not say that it is only for >>> those who really need it but more like that cross database joins are an >>> awesome feature for everybody and it is a loss if it will be gone. >>> >>> In older times we had also ability to do joins between user databases >>> and replica databases, which was removed in 2017 if I googled correctly.[1] >>> My guess is that one reason for the increasing query complexity is that >>> there is no possibility for creating tmp tables or joining to preselected >>> data so everything is done in single queries. In any case, if the solution >>> is what Martin suggests to move cross joinable databases to a single server >>> and the original problem was that it was hard to keep in sync multiple >>> servers then we could reintroduce the user database joins as well. >>> >>> [1] >>> https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/ >>> >>> Br, >>> -- Kimmo Virtanen, Zache >>> >>> On Fri, Nov 13, 2020 at 2:17 AM Martin Urbanec < >>> martin.urba...@wikimedia.cz> wrote: >>> >>>> +1 to Marteen >>>> >>>> Another idea is to have the database structured as-planned, but add a >>>> server with *all* databases that would be slower/less stable, but will >>>> provide a solution for those who really need cross database joins >>>> >>>> Martin >>>> >>>> pá 13. 11. 2020 v 0:31 odesílatel Maarten Dammers <maar...@mdammers.nl> >>>> napsal: >>>> >>>>> I recall some point in time (Toolserver maybe?) when all the slices >>>>> (overview at https://tools-info.toolforge.org/?listmetap ) were at >>>>> different servers, but the Commons slice (s4) was on every server. >>>>> At some point new fancy database servers were introduced with all the >>>>> slices on all servers. Having 6 servers with each one having a slice + s4 >>>>> (Commons) + s8 (Wikidata) might be a good compromise. >>>>> On 12-11-2020 00:58, John wrote: >>>>> >>>>> I’ll throw my hat in this too. Moving it to the application layer will >>>>> make a number of queries just not feasible any longer. It might make sense >>>>> from the administration side, but from the user perspective it beaks one >>>>> of >>>>> the biggest features that toolforge has. >>>>> >>>>> On Wed, Nov 11, 2020 at 6:40 PM Martin Urbanec < >>>>> martin.urba...@wikimedia.cz> wrote: >>>>> >>>>>> MusikAnimal is right, however, Wikidata and Commons either have a sui >>>>>> generis slice, or they share it with a few very large wikis. Tools that >>>>>> do >>>>>> any kind of crosswiki analysis would instantly break, as most of them >>>>>> utilise joining by Wikidata items at the very least. >>>>>> >>>>>> I second Maarten here. This would mean a lot of things that currently >>>>>> require a (relatively simple) SQL query would need a full script, which >>>>>> would do the join at the application level. >>>>>> >>>>>> I fully understand the reasoning, but there needs to be some >>>>>> replacement. Intentionally introduce breaking changes while providing no >>>>>> "new standard" is a bad pattern in a community environment. >>>>>> >>>>>> Martin >>>>>> >>>>>> On Wed, Nov 11, 2020, 10:31 PM MusikAnimal <musikani...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Technically, cross-wiki joins aren't completely disallowed, you just >>>>>>> have to make sure each of the db names are on the same slice/section, >>>>>>> right? >>>>>>> >>>>>>> ~ MA >>>>>>> >>>>>>> On Wed, Nov 11, 2020 at 4:11 PM Maarten Dammers <maar...@mdammers.nl> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Joaquin, >>>>>>>> On 10-11-2020 21:26, Joaquin Oltra Hernandez wrote: >>>>>>>> >>>>>>>> TLDR: Wiki Replicas' architecture is being redesigned for stability >>>>>>>> and performance. Cross database JOINs will not be available and a host >>>>>>>> connection will only allow querying its associated DB. See [1] >>>>>>>> <https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign> >>>>>>>> for more details. >>>>>>>> >>>>>>>> If you only think of Wikipedia, not a lot will break probably, but >>>>>>>> if you take into account Commons and Wikidata a lot will break. A quick >>>>>>>> grep in my folder with Commons queries returns 123 lines with cross >>>>>>>> database joins. So yes, stuff will break and tools will be abandoned. >>>>>>>> This >>>>>>>> follows the practice that seems to have become standard for the WMF >>>>>>>> these >>>>>>>> days: Decisions are made with a small group within the WMF without any >>>>>>>> community involved. Only after the decision has been made, it's >>>>>>>> announced. >>>>>>>> >>>>>>>> Unhappy and disappointed, >>>>>>>> >>>>>>>> Maarten >>>>>>>> _______________________________________________ >>>>>>>> Wikimedia Cloud Services mailing list >>>>>>>> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org) >>>>>>>> https://lists.wikimedia.org/mailman/listinfo/cloud >>>>>>>> >>>>>>> _______________________________________________ >>>>>>> Wikimedia Cloud Services mailing list >>>>>>> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org) >>>>>>> https://lists.wikimedia.org/mailman/listinfo/cloud >>>>>>> >>>>>> _______________________________________________ >>>>>> Wikimedia Cloud Services mailing list >>>>>> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org) >>>>>> https://lists.wikimedia.org/mailman/listinfo/cloud >>>>>> >>>>> >>>>> _______________________________________________ >>>>> Wikimedia Cloud Services mailing listcl...@lists.wikimedia.org (formerly >>>>> lab...@lists.wikimedia.org)https://lists.wikimedia.org/mailman/listinfo/cloud >>>>> >>>>> _______________________________________________ >>>>> Wikimedia Cloud Services mailing list >>>>> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org) >>>>> https://lists.wikimedia.org/mailman/listinfo/cloud >>>>> >>>> _______________________________________________ >>>> Wikimedia Cloud Services mailing list >>>> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org) >>>> https://lists.wikimedia.org/mailman/listinfo/cloud >>>> >>> _______________________________________________ >> Wikimedia Cloud Services mailing list >> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org) >> https://lists.wikimedia.org/mailman/listinfo/cloud >> > > > -- > Amir (he/him) > > _______________________________________________ > Wikimedia Cloud Services mailing list > Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org) > https://lists.wikimedia.org/mailman/listinfo/cloud >
_______________________________________________ Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud