This might be a trivial suggestion (for me it was a game changer):

Segment your large queries based on an indexed column. By that I mean: add
an additional WHERE clause so that each query processes only a small subset
of the entire DB (e.g. page_id >= 130000 AND page_id < 140000), and then
loop in your application over many of those small, fast-running queries,
moving the window from MIN(page_id) to MAX(page_id). This made the amount
of data returned manageable in my application logic and got rid of query
timeouts. I'm sure it reduces pressure on the DB servers too, as the memory
requirement for these smaller queries is a lot smaller.
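
In case it helps, here is a rough sketch of that loop in Python with
pymysql. The hostname, the credentials file and the example query are
placeholders of mine (not an official recommendation), so adapt them to
your wiki and use case:

    import pymysql

    # Placeholders: on Toolforge the credentials live in ~/replica.my.cnf
    # and the host would be one of the new *.db.svc.wikimedia.cloud names.
    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.wikimedia.cloud",  # assumed hostname
        database="enwiki_p",
        read_default_file="~/replica.my.cnf",
    )

    WINDOW = 10000  # tune so each query stays well below the timeout

    with conn.cursor() as cur:
        # Find the bounds of the indexed column we window over.
        cur.execute("SELECT MIN(page_id), MAX(page_id) FROM page")
        lo, hi = cur.fetchone()

        rows = []
        start = lo
        while start <= hi:
            # Each iteration touches only a small, index-bounded slice,
            # so the individual queries run fast and don't time out.
            cur.execute(
                "SELECT page_id, page_title FROM page"
                " WHERE page_namespace = 0"
                " AND page_id >= %s AND page_id < %s",
                (start, start + WINDOW),
            )
            rows.extend(cur.fetchall())
            start += WINDOW

    conn.close()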

Either way, I'd be interested in hearing other devs' experiences and tips
on how to cope with the new restrictions. It's the new reality and there is
little point in complaining about it IMO (not that I haven't done that
already :-D)
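
On the cross-wiki join discussion further down the thread: many of those
joins can be rebuilt as two per-wiki queries whose results are combined in
application code. A rough sketch along those lines, again in Python (the
hostname pattern, the batch size and the exact tables are my assumptions,
not something from the Cloud team):

    import pymysql

    def connect(wikidb):
        # Hostname pattern and credentials file are assumptions; adjust as
        # needed for your setup.
        return pymysql.connect(
            host=f"{wikidb}.analytics.db.svc.wikimedia.cloud",
            database=f"{wikidb}_p",
            read_default_file="~/replica.my.cnf",
        )

    # Query 1: a manageable batch of file names from the local wiki.
    local = connect("enwiki")
    with local.cursor() as cur:
        cur.execute("SELECT img_name FROM image LIMIT 1000")
        local_names = [row[0] for row in cur.fetchall()]
    local.close()

    # Query 2: ask Commons which of those names it also has. The "join"
    # happens here, in application code, instead of in SQL.
    overlap = set()
    if local_names:
        commons = connect("commonswiki")
        with commons.cursor() as cur:
            placeholders = ", ".join(["%s"] * len(local_names))
            sql = ("SELECT img_name FROM image WHERE img_name IN ({})"
                   .format(placeholders))
            cur.execute(sql, local_names)
            overlap = {row[0] for row in cur.fetchall()}
        commons.close()

    print("%d of %d local files also exist on Commons"
          % (len(overlap), len(local_names)))

The caveat from Joaquin's mail still applies, of course: for the full
local-vs-Commons overlap the data volume makes this impractical, so it only
works for bounded batches like the one above.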

On Mon, Mar 15, 2021 at 11:58 AM Joaquin Oltra Hernandez <
jhernan...@wikimedia.org> wrote:

> Hi,
>
> These changes are not arbitrary; they are a necessity. They are happening
> because the clusters are out of capacity: they keep having problems with
> replication lag and crashes (e.g. [1]), and restoring servers takes days,
> during which the rest of the cluster remains under increased load.
> Additionally, they can't be upgraded to the new MariaDB versions that we
> need in order to stay in sync with the production databases and ensure
> security in the cluster [2].
>
> There is a set of trade-offs, as outlined in [3], and the new architecture
> manages to keep the replicas with almost all of their features: a complete
> data set updated in real time; with an SQL interface; in a stable and
> performant cluster that can be upgraded and maintained in sync with the
> production database clusters.
> With the growth experienced in recent years (Commons, Wikidata), the
> technology currently used by the Wiki Replicas is no longer feasible.
>
> Not being able to do cross-wiki joins with SQL is a jump in complexity,
> sometimes so big that it is not easily overcome.
> Most cross-join queries can be recreated using two separate queries and
> filtering the results with code, which can already be daunting for some,
> but at least there is a solution.
> However, there are some queries that are impractical due to the large
> amount of data involved. For example, one we have identified from the
> feedback is querying for overlapping local and Commons images.
>
> We have been keeping track of all the email responses and Phabricator
> comments with specific examples and code using cross-wiki joins, and we
> have been working on a system to gather data from Quarry so that we can
> analyze cross-join queries in depth. [4] was created to identify common
> use cases and possible ways to enable them again.
>
> There is no easy answer. Some things are becoming harder to do, and others
> much harder; we need to figure out together which ones become practically
> impossible because of the size of the datasets, and find better solutions.
>
> [1] Latest crash: https://phabricator.wikimedia.org/T276980 - Replication
> lagging:
> https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=labsdb1009&var-port=9104&from=now-30d&to=now
> [2] https://lists.wikimedia.org/pipermail/cloud/2020-November/001322.html
> [3] https://lists.wikimedia.org/pipermail/cloud/2020-November/001326.html
> [4] https://phabricator.wikimedia.org/T215858
>
> On Sun, Mar 14, 2021 at 12:15 AM Daniel Schwen <li...@schwen.de> wrote:
>
>> This is the next step after disallowing user databases on replicas. It
>> broke some of my tools but I recently rewrote them to move joining logic
>> into my application. I also replicate small amounts of data (e.g. page
>> titles for a subset of pages) into my user db for joins.
>> I found it quite off-putting at first. Volunteers are effectively forced
>> to do substantial amounts of extra work after creating tools that "work
>> perfectly fine".
>> But I can understand the need to provide more scalable infrastructure
>> for the ever-growing projects.
>>
>> On Sat, Mar 13, 2021, 3:46 PM Yetkin Sakal via Cloud <
>> cloud@lists.wikimedia.org> wrote:
>>
>>> I completely agree with Maarten. It would be a step backward to stop
>>> supporting cross-database joins on wiki replicas. This is a breaking change
>>> and should not be applied unless a feasible solution to the problem is
>>> found.
>>> On Saturday, March 13, 2021, 8:17:39 PM GMT+3, Maarten Dammers <
>>> maar...@mdammers.nl> wrote:
>>>
>>>
>>> Hi Joaquin,
>>>
>>> Nothing at all was done with the community input about no longer being
>>> able to do cross-wiki joins [1]. In the past, the WMF would do
>>> something with community input. I guess the new strategy is to just give
>>> the appearance of community input to legitimize a decision. Nice way to
>>> alienate the volunteers.
>>>
>>> Maarten
>>>
>>> [1]
>>> https://lists.wikimedia.org/pipermail/cloud/2020-November/thread.html#1309
>>> On 12-03-2021 19:16, Joaquin Oltra Hernandez wrote:
>>>
>>> TLDR:
>>> - Instead of `*.db.svc.eqiad.wmflabs` use `*.db.svc.wikimedia.cloud` to
>>> use the new replicas
>>> - Quarry will migrate March 23 to use the new cluster
>>> - In about a month (April 15) the old cluster will start to be retired.
>>> See
>>> https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign#Timeline
>>> for more details
>>>
>>> Hi everyone,
>>>
>>> I'm happy to announce that the new replicas cluster is available for use
>>> after being open for testing for a few weeks.
>>>
>>> To use the new cluster, you will only need to change the hostname when
>>> connecting to the databases. Instead of `*.db.svc.eqiad.wmflabs` you can
>>> use `*.db.svc.wikimedia.cloud`. See:
>>>
>>> -
>>> https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign#New_host_names
>>> -
>>> https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign#How_can_I_test_the_new_replicas_before_the_switchover?
>>> - https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database
>>>
>>>
>>> This brings us to the timeline updates:
>>>
>>> https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign#Timeline
>>>
>>> - March 2021
>>>   - Announce new cluster domains for all users - March 12
>>>   - Migrate Quarry to use the new cluster - March 23
>>> - April 2021
>>>   - PAWS migration - Estimated first week of April
>>>   - Migrate the old cluster to utilize new replication hosts.
>>> Replication may stop. - April 15
>>>   - Redirect old hostnames to the new cluster - April 28
>>>
>>> Quarry is ready, so it will be the first to transition to the new
>>> replicas. This will happen on March 23, at which point we will deploy the
>>> changes and update the docs. We are publishing a note in Tech News, but
>>> if there are other venues where this information would be useful, please
>>> help us spread the word.
>>>
>>> PAWS is in the process of being migrated. We estimate it could be ready
>>> to use the new cluster at the beginning of April, and we will publish
>>> more details as soon as we know more.
>>>
>>> On April 15 the old cluster will be moved under the new replication
>>> hosts, at which point there may be side effects and replication may stop.
>>> We recommend updating your code to use the new hostnames ahead of time,
>>> to verify that things work normally.
>>>
>>> Not long after, the old hostnames (`*.db.svc.eqiad.wmflabs`) will be
>>> redirected to the new ones (`*.db.svc.wikimedia.cloud`), at which point the
>>> old cluster will effectively be inaccessible.
>>>
>>>
>>> If you have questions or need help please reply to
>>> cloud@lists.wikimedia.org, join the #wikimedia-cloud IRC channel, or
>>> open a ticket in Phabricator with the Data-Services tag.
>>>
>>> Thanks,
>>>
>>
>
>
> --
> Joaquin Oltra Hernandez
> Developer Advocate - Wikimedia Foundation
>
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud
