[Wikidata-bugs] [Maniphest] [Commented On] T174028: Finalize database schema for MCR content meta-data

Marostegui Fri, 06 Oct 2017 04:38:12 -0700

Marostegui added a comment.

In T174028#3625308, @daniel wrote:

@jcrespo After some discussion on https://www.mediawiki.org/wiki/Talk:Multi-Content_Revisions/Database_Schema, on https://gerrit.wikimedia.org/r/#/c/378724/4, and on the mailing list, there are a few questions that require a judgement call from a DBA. Ideally of course we would test and measure. But I'm afraid at least some of these decisions need to be made beforehand, so an assessment from experience would be very helpful.

So here are the most pressing questions:

Should we introduce content_sha1?

The content table is going to be tall (initially, as tall as revision, but two or three times taller later), so I want to keep it narrow.

If we have it, we probably want to have an index on it (this would clash with sharding, though, see below)

If we don't have it, we need to load the content of all slots to calculate rev_sha1 for each new revision.

...or we drop rev_sha1, and calculate the hash for use in API results and XML dumps on the fly.

If we introduce content_sha1 now, we can still write empty strings to it later to reduce size.

We could choose a more compact representation than base36 used for rev_sha1

I don't really have an opinion on that to be honest. What would you prefer?.
I do agree that the narrower the table is, the better.

Should we build in support for sharing (partitioning) on the page ID?

we would need slot_page and content_page

all (unique/primary) indexes would have to start with the page id

all selects and joins will have to use the page ID (this needs some changes in the code, but not much)

it's two integer (or bigint) columns in two very tall tables

queries across pages will be inefficient, or need a separate table (note that we don't need a contributions table unless we also partition the revision table)

in some (rare) cases (e.g. undelete of a very old revision, or a history split), page ids would have to be re-written or content rows would have to be duplicated with different page ids.

Regarding whether we should use partitioning or not - the only concern I'd have (by looking at the numbers listed on https://www.mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data#Database_Schema) is the disk size. The access pattern and performance...I am not super sure it would get an instant benefit from the sharding (look at the current revision or page table numbers).
It is hard to tell without having some scenarios where we can test.

By looking at the current size of revision and page table (for enwiki), and looking at the draft numbers posted on the above link, for the first years we should probably be fine.
Not sure if we can an optimal design now that would prevent us from dropping this in production and avoid touching it for 5 years :-)
I would prefer no partitioning for now.

Should slot_revision be declared BIGINT right away?

we will need to convert rev_id to BIGIN soonish

until then, is it bad to join an int to a bigint in queries?

if we declare it INT for now, we'll have to convert it when we convert rev_id.

It is true that in some cases joining different column types can give some performance drawbacks, we should do some testing to see how much it would degrade the query.
If at some point we have to convert rev_id to BIGINT, that will be a big maintenance that can potentially take weeks/months, so I guess we can include the conversion of slot_revision to bigint along with it.
If that maintenance is going to happen anyways, we could probably go for int with slot_revision and schedule both maintenances at the same time?

This is just my opinion, not an statement on what we should do.
It is hard to give more ideas without knowing mediawiki in depth. Actually you guys are probably more capable than me to do set a real statement here! :-)

TASK DETAIL

https://phabricator.wikimedia.org/T174028

EMAIL PREFERENCES

https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Marostegui
Cc: Marostegui, cicalese, Abit, brion, Anomie, jcrespo, gerritbot, Aklapper, daniel, E1presidente, Ramsey-WMF, Jmmuguerza, SandraF_WMF, Lordiis, GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, Acer, Lewizho99, Maathavan, Susannaanas, Izno, Wikidata-bugs, PKM, Base, matthiasmullie, aude, Ricordisamoa, Fabrice_Florin, Raymond, Mbch331

_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Commented On] T174028: Finalize database schema for MCR content meta-data

Reply via email to