Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

Huji Lee Sat, 14 Nov 2020 12:25:05 -0800

I like the idea of dumps as an alternative too. But I think this should be
a service that is offered via the WM Clouds. Some might remember me asking
related questions on this very mailing list several months ago.


Having a DB called "latest_dump" which actually has the latest dump of all
wikis would be tremendously helpful. Many cross-wiki queries can work off
of several-days-old data.

On Sat, Nov 14, 2020 at 2:52 PM Amir Sarabadani <ladsgr...@gmail.com> wrote:

> Hello,
> I actually welcome the change and am quite happy about it. It might break
> several tools (including some of mine) but as a database nerd, I can see
> the benefits outweighing the problems (and I wish the benefits had been
> communicated in the announcement).
>
> The short version is that this change would make labs replicas blazing
> fast.
>
> The long version: The databases of all wikis are currently replicated
> to a set of giant "cloud" or "labs" replicas. IIRC, these dbs have 512GB
> of memory, which, while massive, is not big enough to hold everything.
> The space left for the InnoDB buffer pool should be around 350GB (the
> rest is for temporary tables and other critical functions), and storing
> everything in there is impossible. So I assume that when you query
> quarry (sorry, I had to make the pun), most of it actually comes from
> reading disk, which is ten times slower. Looking at graphs
> <https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=13&orgId=1&var-server=labsdb1011&var-port=9104&from=now-7d&to=now>,
> the InnoDB buffer pool efficiency for labs dbs is around 99% (two nines),
> while for production databases (similar hardware but split into eight
> different sections) it is 99.99% (four nines). This difference of two
> orders of magnitude in the miss rate (1% versus 0.01%) is mostly because
> of cache locality, which I hope we would achieve if these changes get
> done (unless the new hardware is commodity hardware instead of beefy
> servers, but I doubt that; correct me if I'm wrong). That means fewer
> timeouts, fewer slow apps and tools, etc. It's not just speed, though:
> the updates coming in to the replicas would be split too, so they
> wouldn't saturate the network, and there would be less heavy I/O in
> memory and on disk, meaning better scalability (adding commons/wikidata
> to each section would be the exact opposite of that, and even if we do it
> now, we would eventually have to pull the plug, as wikis are growing and
> we are not the same size or growth speed we used to be years ago).
>
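To make the "two nines versus four nines" comparison above concrete, here is a minimal arithmetic sketch. It only uses the two hit rates quoted in the message; the variable and function names are made up for illustration, and nothing here queries a real server.

```python
from fractions import Fraction
import math

# Hit rates as quoted in the message above: about 99% for the labs
# replicas and about 99.99% for production. Exact fractions are used
# so that float rounding does not blur the ratio.
labs_hit = Fraction(99, 100)
prod_hit = Fraction(9999, 10000)


def miss_rate(hit_rate: Fraction) -> Fraction:
    """Share of page requests that miss the buffer pool and go to disk."""
    return 1 - hit_rate


# How many times more often the labs replicas have to read from disk.
ratio = miss_rate(labs_hit) / miss_rate(prod_hit)
orders_of_magnitude = math.log10(ratio)

print(f"labs miss rate: {float(miss_rate(labs_hit)):.2%}")
print(f"prod miss rate: {float(miss_rate(prod_hit)):.2%}")
print(f"ratio of miss rates: {ratio}")
print(f"orders of magnitude: {orders_of_magnitude:.0f}")
```

The hit rates look close (99% versus 99.99%), but what matters for disk reads is the miss rate, and that is where the factor of 100 comes from.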
> I understand it would break tools and queries, but I have a feeling that
> lots of them should already be split into multiple queries, or should read
> dumps instead, or sometimes it's more of an XY problem
> <https://en.wikipedia.org/wiki/XY_problem>.
>
> I think this is great and a big thank you for doing it.
>
> On Fri, Nov 13, 2020 at 11:39 AM Kimmo Virtanen <kimmo.virta...@gmail.com>
> wrote:
>
>> As a follow up comment.
>>
>> If I understand correctly the main problems are a) databases are growing
>> too big to be stored in single instances and b) query complexity is
>> growing.
>>
>> a) the growth of the data is not going away, as the major drivers for the
>> growth are automated edits from Wikidata and Structured Data on Commons.
>> They are generating new data at an increasing speed, faster than humans
>> ever could. So the longer-term answer is to store the data in separate
>> instances and use something like federated queries. This is how access to
>> the commonswiki replica was originally done when toolserver moved to
>> toollabs in 2014.[1] Another long-term solution to make the databases
>> smaller is to replicate only the current state of wikidata/commonswiki
>> and leave, for example, the revision history out.
>>
>> b) a major factor for query complexity, which affects query execution
>> times, is afaik the actor migration and the data sanitization, which
>> executes the queries through multiple views.[2,3] I have no idea how bad
>> the problem currently is, but one could think that replication could be
>> implemented with lighter sanitization by leaving some of the problematic
>> data out of replication altogether.
>>
>> Anyway, my question is: are there more detailed plans for the *Wiki
>> Replicas 2020 Redesign* than what is on the wiki page[4] or the tickets
>> linked from it? I guess there are, if the plan is to buy new hardware in
>> October and we are now in the implementation phase? Also, is there
>> information on the actual bottlenecks at table level? I.e., which tables
>> (in which databases) are too big, hard to keep up with in replication,
>> and slow in terms of query time?
>>
>> [1]
>> https://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Migration_of_Toolserver_tools#Will_the_commons_database_be_replicated_to_all_clusters,_like_it_is_on_the_Toolserver?
>> [2]
>> https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_Replicas
>> [3] https://phabricator.wikimedia.org/T215445
>> [4] https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign
>>
>> Br,
>> -- Kimmo Virtanen, Zache
>>
>> On Fri, Nov 13, 2020 at 8:51 AM Kimmo Virtanen <kimmo.virta...@gmail.com>
>> wrote:
>>
>>> > Maarten: Having 6 servers with each one having a slice + s4 (Commons)
>>> > + s8 (Wikidata) might be a good compromise.
>>> > Martin: Another idea is to have the database structured as-planned,
>>> > but add a server with *all* databases that would be slower/less stable,
>>> > but will provide a solution for those who really need cross database joins
>>>
>>> From the point of view of a person who is using cross database joins in
>>> both tools and analysis queries, I would say that both ideas would be
>>> suitable. I think that 90% of my crosswiki queries are written against
>>> *wiki + wikidata/commons. However, I would not say that it is only for
>>> those who really need it; rather, cross database joins are an awesome
>>> feature for everybody, and it would be a loss if they were gone.
>>>
>>> In the past we also had the ability to do joins between user databases
>>> and replica databases, which was removed in 2017 if I googled correctly.[1]
>>> My guess is that one reason for the increasing query complexity is that
>>> there is no possibility of creating temporary tables or joining to
>>> preselected data, so everything is done in single queries. In any case,
>>> if the solution is what Martin suggests, to move the cross-joinable
>>> databases to a single server, and the original problem was that it was
>>> hard to keep multiple servers in sync, then we could reintroduce the user
>>> database joins as well.
>>>
>>> [1]
>>> https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/
>>>
>>> Br,
>>> -- Kimmo Virtanen, Zache
>>>
>>> On Fri, Nov 13, 2020 at 2:17 AM Martin Urbanec <
>>> martin.urba...@wikimedia.cz> wrote:
>>>
>>>> +1 to Maarten
>>>>
>>>> Another idea is to have the database structured as-planned, but add a
>>>> server with *all* databases that would be slower/less stable, but will
>>>> provide a solution for those who really need cross database joins
>>>>
>>>> Martin
>>>>
>>>> On Fri, Nov 13, 2020 at 0:31, Maarten Dammers <maar...@mdammers.nl>
>>>> wrote:
>>>>
>>>>> I recall some point in time (Toolserver maybe?) when all the slices
>>>>> (overview at https://tools-info.toolforge.org/?listmetap ) were on
>>>>> different servers, but the Commons slice (s4) was on every server.
>>>>> At some point new fancy database servers were introduced with all the
>>>>> slices on all servers. Having 6 servers with each one having a slice + s4
>>>>> (Commons) + s8 (Wikidata) might be a good compromise.
>>>>> On 12-11-2020 00:58, John wrote:
>>>>>
>>>>> I’ll throw my hat in this too. Moving it to the application layer will
>>>>> make a number of queries just not feasible any longer. It might make sense
>>>>> from the administration side, but from the user perspective it breaks one
>>>>> of the biggest features that toolforge has.
>>>>>
>>>>> On Wed, Nov 11, 2020 at 6:40 PM Martin Urbanec <
>>>>> martin.urba...@wikimedia.cz> wrote:
>>>>>
>>>>>> MusikAnimal is right; however, Wikidata and Commons either have a sui
>>>>>> generis slice, or they share it with a few very large wikis. Tools that
>>>>>> do any kind of crosswiki analysis would instantly break, as most of them
>>>>>> utilise joining by Wikidata items at the very least.
>>>>>>
>>>>>> I second Maarten here. This would mean a lot of things that currently
>>>>>> require a (relatively simple) SQL query would need a full script, which
>>>>>> would do the join at the application level.
>>>>>>
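As a rough illustration of what "doing the join at the application level" could look like, here is a minimal sketch. Two in-memory SQLite databases stand in for two replica hosts that cannot be joined in SQL; the table names, column names, sample rows, and the `app_level_join` helper are all hypothetical and much simpler than the real MediaWiki schema.

```python
import sqlite3

# Two independent connections stand in for two replica hosts.
# Schema and rows are simplified, made-up sample data.
wiki_db = sqlite3.connect(":memory:")
wikidata_db = sqlite3.connect(":memory:")

wiki_db.execute("CREATE TABLE page_item (page_title TEXT, item_id TEXT)")
wiki_db.executemany(
    "INSERT INTO page_item VALUES (?, ?)",
    [("Universe", "Q1"), ("Earth", "Q2"),
     ("Douglas_Adams", "Q42"), ("Unlinked_page", None)],
)

wikidata_db.execute("CREATE TABLE item_label (item_id TEXT, label TEXT)")
wikidata_db.executemany(
    "INSERT INTO item_label VALUES (?, ?)",
    [("Q1", "universe"), ("Q2", "Earth"), ("Q5", "human")],
)


def app_level_join(wiki_conn, wikidata_conn, batch_size=500):
    """Inner-join page rows to item labels in Python instead of in SQL."""
    # Step 1: fetch the left side of the join from the first host.
    pages = wiki_conn.execute(
        "SELECT page_title, item_id FROM page_item "
        "WHERE item_id IS NOT NULL ORDER BY page_title"
    ).fetchall()

    # Step 2: look up matching rows on the second host, in batches so
    # the IN (...) parameter list stays bounded.
    labels = {}
    item_ids = sorted({item_id for _, item_id in pages})
    for start in range(0, len(item_ids), batch_size):
        batch = item_ids[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        rows = wikidata_conn.execute(
            "SELECT item_id, label FROM item_label "
            f"WHERE item_id IN ({placeholders})",
            batch,
        )
        labels.update(rows)  # each row is an (item_id, label) pair

    # Step 3: combine both result sets; rows without a match are dropped.
    return [(title, item_id, labels[item_id])
            for title, item_id in pages if item_id in labels]


result = app_level_join(wiki_db, wikidata_db)
for row in result:
    print(row)
```

This is what a one-line SQL `JOIN` turns into: two queries, batching, and a dictionary lookup. It works, but the intermediate result has to fit in the tool's memory, which is part of why some queries may stop being practical.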
>>>>>> I fully understand the reasoning, but there needs to be some
>>>>>> replacement. Intentionally introducing breaking changes while providing
>>>>>> no "new standard" is a bad pattern in a community environment.
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>> On Wed, Nov 11, 2020, 10:31 PM MusikAnimal <musikani...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Technically, cross-wiki joins aren't completely disallowed; you just
>>>>>>> have to make sure all of the db names are on the same slice/section,
>>>>>>> right?
>>>>>>>
>>>>>>> ~ MA
>>>>>>>
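If the reading in the question above holds, a tool could check up front whether a set of databases shares a section before attempting a SQL join. The sketch below assumes a hand-written mapping: s4 (Commons) and s8 (Wikidata) are named in this thread, while the other entries and both helper functions are illustrative assumptions, not an official list or API.

```python
from collections import defaultdict

# Illustrative database-to-section mapping. s4 (Commons) and s8 (Wikidata)
# are mentioned in this thread; the remaining entries are assumptions
# made for the example and should be checked against the real section list.
SECTION_OF = {
    "commonswiki": "s4",
    "wikidatawiki": "s8",
    "enwiki": "s1",
    "fiwiki": "s2",
    "cswiki": "s2",
}


def can_join_in_sql(*db_names: str) -> bool:
    """True if every named database lives on the same section."""
    sections = {SECTION_OF[name] for name in db_names}
    return len(sections) == 1


def group_by_section(*db_names: str) -> dict:
    """Group databases by section, e.g. to plan one query per section."""
    groups = defaultdict(list)
    for name in db_names:
        groups[SECTION_OF[name]].append(name)
    return dict(groups)


print(can_join_in_sql("fiwiki", "cswiki"))
print(can_join_in_sql("fiwiki", "wikidatawiki"))
print(group_by_section("fiwiki", "cswiki", "commonswiki", "wikidatawiki"))
```

Under this mapping, any query that pairs a wiki with Commons or Wikidata spans two sections, which matches the concern raised elsewhere in the thread.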
>>>>>>> On Wed, Nov 11, 2020 at 4:11 PM Maarten Dammers <maar...@mdammers.nl>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Joaquin,
>>>>>>>> On 10-11-2020 21:26, Joaquin Oltra Hernandez wrote:
>>>>>>>>
>>>>>>>> TLDR: Wiki Replicas' architecture is being redesigned for stability
>>>>>>>> and performance. Cross database JOINs will not be available and a host
>>>>>>>> connection will only allow querying its associated DB. See [1]
>>>>>>>> <https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign>
>>>>>>>> for more details.
>>>>>>>>
>>>>>>>> If you only think of Wikipedia, not a lot will break probably, but
>>>>>>>> if you take into account Commons and Wikidata a lot will break. A quick
>>>>>>>> grep in my folder with Commons queries returns 123 lines with cross
>>>>>>>> database joins. So yes, stuff will break and tools will be abandoned.
>>>>>>>> This follows the practice that seems to have become standard for the
>>>>>>>> WMF these days: Decisions are made with a small group within the WMF
>>>>>>>> without any community involved. Only after the decision has been made,
>>>>>>>> it's announced.
>>>>>>>>
>>>>>>>> Unhappy and disappointed,
>>>>>>>>
>>>>>>>> Maarten
>>>>>>>> _______________________________________________
>>>>>>>> Wikimedia Cloud Services mailing list
>>>>>>>> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/cloud
>
>
> --
> Amir (he/him)
>
