(If you don’t work with links tables such as templatelinks, pagelinks and
so on, feel free to ignore this message)

TLDR: The schema of links tables (starting with templatelinks) will change
to have numeric id pointing to linktarget table instead of repeating
namespace and title.

Hello,

The current schema and storage of most links tables are: page id (the
source), namespace id of the target link and title of the target. For
example, if a page with id of 1 uses Template:Foo, the row in the database
would be 1, 6, and Foo (Template namespace has id of 6)

Repeating the target’s title is not sustainable, for example more than half
of Wikimedia Commons database is just three links tables. The sheer size of
these tables makes a considerable portion of all queries slower, backups
and dumps taking longer and taking much more space than needed due to
unnecessary duplication. In Wikimedia Commons, on average a title is
duplicated around 100 times for templatelinks and around 20 times for
pagelinks. The numbers for other wikis depend on the usage patterns.

Moving forward, these tables will be normalized, meaning a typical row will
hold mapping of page id to linktarget id instead. Linktarget is a new table
deployed in production and contains immutable records of namespace id and
string. The major differences between page and linktarget tables are: 1-
linktarget values won’t change (unlike page records that change with page
move) 2- linktarget values can point to non-existent pages (=red links).

The first table being done is templatelinks, then pagelinks, imagelinks and
categorylinks will follow. During the migration phase both values will be
accessible but we will turn off writing to the old columns once the values
are backfilled and switched to be read from the new schema. We will
announce any major changes beforehand but this is to let you know these
changes are coming.

While the normalization of all links tables will take several years to
finish, templatelinks will finish in the next few months and is the most
pressing one.

So if you:

   -

   … rely on the schema of these tables in cloud replicas, you will need to
   change your tools.
   -

   … rely on dumps of these tables, you will need to change your scripts.


Currently, templatelinks writes to both data schemes for new rows in most
wikis. This week we will start backfilling the data with the new schema but
it will take months to finish in large wikis.

You can keep track of the general long-term work in
https://phabricator.wikimedia.org/T300222 and the specific work for
templatelinks in https://phabricator.wikimedia.org/T299417. You can also
read more on the reasoning in https://phabricator.wikimedia.org/T222224.

Thanks
-- 
*Amir Sarabadani (he/him)*
Staff Database Architect
Wikimedia Foundation <https://wikimediafoundation.org/>
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/

Reply via email to