Changing queries to support a new database format is one thing. Writing migration code to deal with a situation that should not exist (columns being dropped before the migration is completed) is another. I suppose I am lucky in that the only tool I maintain that queries the pagelinks table is single-wiki.
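For anyone sketching out that query change: as I understand the migration, the old (pl_namespace, pl_title) columns on pagelinks are replaced by a pl_target_id foreign key into the normalized linktarget table, so per-wiki tools need to pick the right query shape. A minimal sketch, assuming those column names; the wiki list and helper name here are purely illustrative:

```python
# Sketch: choose a pagelinks query depending on whether a wiki has
# finished the schema migration. OLD_SCHEMA_WIKIS is a placeholder --
# in practice you would track which sections (s1, s2, s6, s7, s8)
# still carry the old columns.

OLD_SCHEMA_WIKIS = {"enwiki", "wikidatawiki"}  # illustrative, not authoritative

def pagelinks_query(wiki: str) -> str:
    """Return SQL selecting the link targets of a page (pl_from = %s)."""
    if wiki in OLD_SCHEMA_WIKIS:
        # Old layout: target namespace/title stored directly on pagelinks.
        return (
            "SELECT pl_namespace, pl_title FROM pagelinks "
            "WHERE pl_from = %s"
        )
    # New layout (as I understand it): pagelinks holds only pl_target_id,
    # which joins to the normalized linktarget table.
    return (
        "SELECT lt_namespace, lt_title FROM pagelinks "
        "JOIN linktarget ON pl_target_id = lt_id "
        "WHERE pl_from = %s"
    )
```

Single-wiki tools can hard-code one branch; multi-wiki tools end up carrying both until the migration finishes everywhere.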
AntiCompositeNumber (he/him)

On Wed, Jan 17, 2024 at 3:05 PM Amir Sarabadani <asarabad...@wikimedia.org> wrote:
>
> Hi!
>
> On Wed, Jan 17, 2024 at 19:37, Ben Kurtovic <wikipedia.ear...@gmail.com> wrote:
>>
>> Hi Amir & others,
>>
>> I’m glad we are making changes to improve DB storage/query efficiency. I
>> wanted to express my agreement with Tacsipacsi that dropping the data
>> before the migration has completed is a really bad outcome. Now tool
>> maintainers need to deal with multiple migrations depending on the wikis
>> they query, or add more code complexity. And there is little time to make
>> the changes for those of us who had planned to wait until the new data
>> was available.
>
> I totally understand the frustration. In my volunteer capacity, I also
> maintain numerous tools, and they break every now and then because of
> changes.
>
>> > Commons has grown to 1.8TB already
>>
>> That’s a big number, yes, but it doesn’t really answer the question — is
>> the database actually about to fill up?
>
> It's a bit more nuanced. We are not hitting limits on storage, but the
> memory for the data cache on each replica is about 350GB, and we need to
> serve almost everything from memory, since disk is about 1000 times slower
> than memory. If we read too much from disk, reads start to pile up, then
> appserver requests start to pile up, and a general outage follows (which
> has happened before with Wikidata's database). You can have a 3TB database
> with only 100GB of "hot" data and be fine, but Commons is both big and
> very heavily read across its tables and rows. Ratio-wise, the Commons
> database is already reading twice as much from disk as English Wikipedia.
>
>> How much time do you have until that happens, and how much time until
>> s1/s8 finish their migration?
>
> The database is already in a "fragile" and "high risk" state.
> I can't give you an exact date when it'll go down, but for the reasons
> mentioned above, I can tell you that even now, any noticeable increase in
> its traffic or sudden shift in its read patterns will take it down and
> bring all wikis down with it. There are already user-facing parts of
> Commons that shouldn't be slow but are, due to excessive reads from disk.
>
> Also, in the case of Wikidata, it might take a long time, possibly three
> more months, to finish, due to its unique pagelinks usage pattern caused
> by scholarly articles.
>
>> Is there a reason you can share why this work wasn’t started earlier if
>> the timing is so close?
>
> We have been constantly working to reduce its size for the past several
> years; the templatelinks migration, the externallinks redesign, and so on
> have been done back to back (starting in 2021; we even bumped the priority
> of the externallinks migration because of Commons alone), but at the same
> time, the wiki has been growing far too fast. (Emphasizing that the growth
> doesn't have much to do with images being uploaded; the image table is
> only 100GB. The problem is the overly large links tables: templatelinks is
> 270GB, categorylinks is 200GB, pagelinks is 190GB, and so on.) This has
> put us in a red queen situation with no easy way out.
>
>> > so you need to use the old way for the list of thirty-ish wikis (s1,
>> > s2, s6, s7, s8) and for any wiki not a member of that, you can just
>> > switch to the new system
>>
>> IMO, a likely outcome is that some tools/bots will simply be broken on a
>> subset of wikis until the migration is completed across all DBs.
>
> The most urgent one is Commons. What about only dropping it from Commons,
> to reduce the risk of outage, and leaving the rest until all are finished
> (or all except Wikidata)? You'd have to write something for the new schema
> regardless.
>
>> Thanks for all your work on this task so far.
>
> Thank you, and sorry for the inconvenience.
>> Ben / Earwig

_______________________________________________
Cloud mailing list -- cloud@lists.wikimedia.org
List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/