jida...@jidanni.org wrote:
> I'm curious what does
>   SELECT COUNT(DISTINCT old_text), COUNT(*) FROM text;
> show on Wikipedia's database? On mine I get
>   COUNT(DISTINCT old_text): 2913
>                   COUNT(*): 3560
> I.e., 1/7 of the rows are redundant.

On Wikimedia wikis, text is stored in compressed blobs on separate database
clusters, so there is no efficient way to get this information with a query.
If you want it, walk through a full history dump and store a hash of each
revision's text.
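
As a rough illustration of that approach, here is a minimal sketch. It assumes
an uncompressed XML history dump in the usual MediaWiki export layout
(<page>/<revision>/<text>); the function name count_revision_texts is made up
for this example, and SHA-1 is just one possible choice of hash.

```python
import hashlib
import io
import xml.etree.ElementTree as ET


def count_revision_texts(dump):
    """Return (distinct, total) revision text counts for an XML dump.

    `dump` is a filename or a binary file object.
    """
    seen = set()
    total = 0
    for _event, elem in ET.iterparse(dump, events=("end",)):
        # Dump elements are namespaced; compare on the local name only.
        if elem.tag.rsplit("}", 1)[-1] != "revision":
            continue
        for child in elem:
            if child.tag.rsplit("}", 1)[-1] == "text":
                text = child.text or ""
                # Keep only the digest, never the text itself.
                seen.add(hashlib.sha1(text.encode("utf-8")).digest())
                total += 1
        # Drop the revision's children once hashed to limit memory use.
        elem.clear()
    return len(seen), total


# Tiny hand-written sample: three revisions, two with identical text.
sample = (
    b"<mediawiki xmlns='http://www.mediawiki.org/xml/export-0.10/'><page>"
    b"<title>T</title>"
    b"<revision><id>1</id><text>a</text></revision>"
    b"<revision><id>2</id><text>b</text></revision>"
    b"<revision><id>3</id><text>a</text></revision>"
    b"</page></mediawiki>"
)
print(count_revision_texts(io.BytesIO(sample)))  # (2, 3)
```

The two numbers correspond to COUNT(DISTINCT old_text) and COUNT(*) in the
query above. On a real dump the set of digests still grows with the number of
distinct texts, so this is only a starting point.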

> Currently undos, so frequent on wikis, just blindly create a duplicate row
> instead of checking if the old one could be reused,
> https://bugzilla.wikimedia.org/show_bug.cgi?id=18333 . Maybe some hardware
> savings could even be achieved.

For *reverts* this is already done, and that is indeed the only situation where
it is reliable. "Undo" can also be applied to older changes, and it basically is
a reverse patch. That is, the result of an undo may be a text that is different
from all previous revisions. However, when undoing the *last* edit, it is indeed
equivalent to a revert. Perhaps MediaWiki could make use of that.
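
To make the difference concrete, here is a toy sketch. It is not MediaWiki's
undo code: the undo function below is hypothetical and only handles edits that
replace lines in place without changing the line count.

```python
def undo(before, after, current):
    """Reverse the edit before -> after on top of `current` (toy version)."""
    b, a, c = before.split("\n"), after.split("\n"), current.split("\n")
    if not (len(b) == len(a) == len(c)):
        raise ValueError("toy undo only handles in-place line replacements")
    out = []
    for old, new, cur in zip(b, a, c):
        if old != new and cur != new:
            # A later edit touched the same line; a reverse patch cannot apply.
            raise ValueError("conflicting intermediate edit")
        # Restore lines the edit changed, keep everything else as it is now.
        out.append(old if old != new else cur)
    return "\n".join(out)


r0 = "alpha\nbeta\ngamma"
r1 = "ALPHA\nbeta\ngamma"   # edit A changes line 1
r2 = "ALPHA\nbeta\nGAMMA"   # edit B changes line 3

# Undoing the older edit A yields a text that never existed before.
print(undo(r0, r1, r2) in (r0, r1, r2))  # False
# Undoing the last edit B gives back r1 exactly, i.e. a plain revert.
print(undo(r1, r2, r2) == r1)            # True
```

Only in the second case could an existing text row be reused.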

Because multiple revisions are compressed together into one blob, redundant
text is not so bad, but it is only "nice" if both copies end up in the same
blob. This is increasingly the case since Tim Starling implemented the
revision reordering thingy.
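
The effect is easy to see with a small experiment. This sketch uses zlib as a
stand-in compressor, which is an assumption for illustration and says nothing
about the actual blob format; sample_text and compressed_sizes are made-up
helper names.

```python
import random
import zlib


def sample_text(words=2000, seed=0):
    """Build a deterministic pseudo-random text of roughly 10 KB."""
    rng = random.Random(seed)
    vocab = ["wiki", "revision", "text", "blob", "undo", "revert",
             "page", "edit", "history", "dump", "cluster", "hash"]
    return " ".join(rng.choice(vocab) for _ in range(words))


def compressed_sizes(text, copies=2):
    """Return (separate, together) compressed sizes in bytes."""
    data = text.encode("utf-8")
    # Each copy compressed on its own, as if stored in different blobs.
    separate = copies * len(zlib.compress(data))
    # All copies compressed in one stream, as if stored in the same blob.
    together = len(zlib.compress(data * copies))
    return separate, together


separate, together = compressed_sizes(sample_text())
print(separate, together)
```

With both copies in one stream, the second copy costs very little, because the
compressor can refer back to the first. The sample is kept small on purpose:
DEFLATE only looks back 32 KiB, so copies that sit far apart gain nothing,
much like duplicates that land in different blobs.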

-- daniel

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
