Changing queries to support a new database format is one thing.
Writing migration code to deal with a situation that should not exist
(columns being dropped before the migration is completed) is another.
I suppose I am lucky in that the only tool I maintain that queries the
pagelinks table is single-wiki.
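
For tools that do query multiple wikis, the branching itself isn't
complicated; it's carrying it around per wiki that's the nuisance. A
minimal sketch of what I mean, using pymysql and assuming the new
layout's pl_target_id/linktarget join as described in the migration
task (the wiki list and helper name here are placeholders of mine, not
anything official):

    import pymysql

    # Placeholder: wikis whose replicas still have the old pagelinks
    # layout (in practice, the s1/s2/s6/s7/s8 wikis until their
    # migration finishes). Hypothetical contents for illustration.
    OLD_SCHEMA_WIKIS = {"enwiki", "wikidatawiki"}

    def links_to_target(conn, wiki, namespace, title):
        """Page IDs linking to (namespace, title), handling both layouts."""
        if wiki in OLD_SCHEMA_WIKIS:
            # Old layout: the target is stored directly on pagelinks.
            query = """SELECT pl_from FROM pagelinks
                       WHERE pl_namespace = %s AND pl_title = %s"""
        else:
            # New layout: pagelinks carries pl_target_id, which joins
            # to the normalized linktarget table.
            query = """SELECT pl_from FROM pagelinks
                       JOIN linktarget ON pl_target_id = lt_id
                       WHERE lt_namespace = %s AND lt_title = %s"""
        with conn.cursor() as cur:
            cur.execute(query, (namespace, title))
            return [row[0] for row in cur.fetchall()]

Single-wiki, I get to skip all of that.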

AntiCompositeNumber
(he/him)

On Wed, Jan 17, 2024 at 3:05 PM Amir Sarabadani
<asarabad...@wikimedia.org> wrote:
>
> Hi!
>
> Am Mi., 17. Jan. 2024 um 19:37 Uhr schrieb Ben Kurtovic 
> <wikipedia.ear...@gmail.com>:
>>
>> Hi Amir & others,
>>
>> I’m glad we are making changes to improve DB storage/query efficiency. I 
>> wanted to express my agreement with Tacsipacsi that dropping the data before 
>> the migration has completed is a really bad outcome. Now tool maintainers 
>> need to deal with multiple migrations depending on the wikis they query or 
>> add more code complexity. And there is little time to make the changes for 
>> those of us who had planned to wait until the new data was available.
>
>
> I totally understand the frustration. In my volunteer capacity, I also 
> maintain numerous tools and they break every now and then because of changes.
>>
>>
>> > Commons has grown to 1.8TB already
>>
>> That’s a big number yes, but it doesn’t really answer the question — is the 
>> database actually about to fill up?
>
>
> It's a bit more nuanced. We are not hitting limits on the storage, but the 
> memory for the data cache on each replica is about 350GB, and we need to 
> serve almost everything from memory since disk is 1000 times slower than 
> memory. If we read too much from disk, reads start to pile up, appserver 
> requests start to pile up in turn, and a general outage follows (which has 
> happened before with Wikidata's database). You can have a 3TB database with 
> only 100GB of "hot" data and you'd be fine, but Commons is both big and very 
> heavily read across its tables and rows. Ratio-wise, the Commons database is 
> already reading twice as much from disk as English Wikipedia.
>
>> How much time do you have until that happens and how much time until s1/s8 
>> finish their migration?
>
> The database is already in a "fragile", "high risk" state. I can't give you 
> an exact date for when it'll go down, but for the reasons mentioned above I 
> can tell you that even now, any noticeable increase in its traffic or any 
> sudden shift in its read patterns will take it down and bring all wikis down 
> with it. There are already user-facing parts of Commons that shouldn't be 
> slow but are, due to excessive reads from disk.
>
> Also, in the case of Wikidata, the migration might take a long time, 
> possibly three more months, to finish, due to its unique pagelinks usage 
> pattern driven by scholarly articles.
>
>> Is there a reason you can share why this work wasn’t started earlier if the 
>> timing is so close?
>
> We have been constantly working to reduce its size over the past several 
> years: the templatelinks migration, the externallinks redesign, and so on 
> have been done back to back, starting in 2021 (we even bumped the priority 
> of the externallinks migration because of Commons alone), but at the same 
> time the wiki has been growing way too fast. (To emphasize: the growth 
> doesn't have much to do with images being uploaded; the image table is only 
> 100GB. The problem is the overly large links tables: templatelinks is 270GB, 
> categorylinks is 200GB, pagelinks is 190GB, and so on.) This has put us in a 
> Red Queen situation with no easy way out.
>>
>>
>> > so you need to use the old way for the list of thirty-ish wikis (s1, s2, 
>> > s6, s7, s8) and for any wiki not a member of that, you can just switch to 
>> > the new system
>>
>> IMO, a likely outcome is some tools/bots will simply be broken on a subset 
>> of wikis until the migration is completed across all DBs.
>
>
> The most urgent one is Commons. What about only dropping it from Commons to 
> reduce the risk of an outage and leave the rest until all of them are finished (or 
> all except Wikidata)? You'd have to write something for the new schema 
> regardless.
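
A wiki-by-wiki drop like that is exactly where runtime detection helps,
rather than hard-coding which sections have migrated. A rough sketch,
assuming information_schema is queryable through the Wiki Replicas
views and given a PEP 249-style connection like the pymysql one above
(the helper name and the caching advice are mine):

    def uses_new_pagelinks_schema(conn):
        """True if this wiki's pagelinks has the new, normalized layout.

        Probes for pl_target_id, which only exists after the migration;
        cache the result per wiki instead of re-checking on every query.
        """
        with conn.cursor() as cur:
            cur.execute(
                """SELECT 1 FROM information_schema.COLUMNS
                   WHERE TABLE_SCHEMA = DATABASE()
                     AND TABLE_NAME = 'pagelinks'
                     AND COLUMN_NAME = 'pl_target_id'"""
            )
            return cur.fetchone() is not None
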
>>
>>
>> Thanks for all your work on this task so far.
>
>
> Thank you and sorry for the inconvenience.
>>
>>
>> Ben / Earwig
>
_______________________________________________
Cloud mailing list -- cloud@lists.wikimedia.org
List information: 
https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/
