jida...@jidanni.org wrote:
> I'm curious what does
> SELECT COUNT(DISTINCT old_text), COUNT(*) FROM text;
> show on Wikipedia's database? On mine I get
> COUNT(DISTINCT old_text): 2913
> COUNT(*): 3560
> I.e., 1/7 of the rows are redundant.

On Wikimedia wikis, text is stored in compressed blobs in extra database clusters. There is no way to get this information efficiently. If you want it, walk through a full history dump and store a hash of each revision's text.

> Currently undos, so frequent on wikis, just blindly create a duplicate row
> instead of checking if the old one could be reused,
> https://bugzilla.wikimedia.org/show_bug.cgi?id=18333 . Maybe some hardware
> savings could even be achieved.

For *reverts* this is already done, and that is indeed the only situation where it is reliable. "Undo" can also be applied to older changes, and it is basically a reverse patch. That is, the result of an undo may be a text that differs from all previous revisions. However, undoing the *last* edit is indeed equivalent to a revert. Perhaps MediaWiki could make use of that.

Because multiple revisions are compressed together into one blob, redundant text is not so bad - but it's only "nice" if both copies end up in the same blob. This is increasingly the case since Tim Starling implemented the revision reordering thingy.

-- daniel

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
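
P.S. A rough sketch of the "walk through a full history dump and store hashes" idea, in Python. It assumes an uncompressed XML history dump whose revisions carry their content in <text> elements inside <page> elements. The function name count_revision_texts and the choice of SHA-1 are only for illustration; any hash with a negligible collision rate would do.

```python
import hashlib
import xml.etree.ElementTree as ET


def count_revision_texts(source):
    """Return (distinct, total) revision-text counts for an XML history dump.

    `source` is a filename or a binary file object (assumed uncompressed).
    """
    seen = set()   # one digest per distinct revision text
    total = 0
    # iterparse streams the file, so the dump is never loaded as a whole
    for _event, elem in ET.iterparse(source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
        if tag == "text":
            text = elem.text or ""
            seen.add(hashlib.sha1(text.encode("utf-8")).digest())
            total += 1
        elif tag == "page":
            elem.clear()  # discard the finished page's children
    return len(seen), total
```

The two numbers correspond to COUNT(DISTINCT old_text) and COUNT(*) in the query above. For a big wiki the in-memory set of digests would get large, so the hashes would more likely have to go into a file or a database table.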