This all sounds very reasonable.
Some thoughts:
* If revisions are not wrapped in <page> elements, reconstructing the
history of a page requires scanning the entire dump, unless there is
an index of all revisions
* Such an index should probably accompany the XML file, ideally if the
XML is in a seeka
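
To make the index idea concrete, here is a rough sketch in Python. It assumes a hypothetical flat dump with one <revision> element per line, each carrying a <page_id> child; neither the layout nor the tag name is taken from an actual dump format. The index maps each page id to the byte offsets of its revisions, so a page history can be read with seeks instead of a full scan.

```python
import io
import re
from collections import defaultdict

# Hypothetical flat dump: one <revision> element per line, each with its page id.
SAMPLE = (
    b"<revision><id>10</id><page_id>1</page_id><text>a</text></revision>\n"
    b"<revision><id>11</id><page_id>2</page_id><text>b</text></revision>\n"
    b"<revision><id>12</id><page_id>1</page_id><text>c</text></revision>\n"
)

PAGE_ID = re.compile(rb"<page_id>(\d+)</page_id>")


def build_index(dump):
    """Scan the dump once and map page id -> byte offsets of its revisions."""
    index = defaultdict(list)
    while True:
        offset = dump.tell()
        line = dump.readline()
        if not line:
            break
        match = PAGE_ID.search(line)
        if match:
            index[int(match.group(1))].append(offset)
    return index


def page_history(dump, index, page_id):
    """Read only the revisions of one page by seeking to the indexed offsets."""
    history = []
    for offset in index.get(page_id, []):
        dump.seek(offset)
        history.append(dump.readline().rstrip(b"\n"))
    return history


dump = io.BytesIO(SAMPLE)
index = build_index(dump)
history = page_history(dump, index, 1)
```

This only works if the file can be seeked by byte offset, so a compressed dump would need a seekable compression scheme or an index over compressed blocks.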
Hi,
(I don’t post often here and I’m not a MW developer but I try to follow,
correct me if I’m wrong.)
I see a couple of things about page titles that need to be handled
carefully and deliberately. Currently there is a difference between page_id and page
title, since the page_id is conserved when t
On Thu, Aug 18, 2011 at 10:30 AM, Diederik van Liere wrote:
> 1. Denormalization of the schema
> Instead of having a <page> tag with multiple <revision> tags, I
> propose to just have <revision> tags. Each <revision> tag would
> include a , , and
> tag. This denormalization would make it much easier to
> build an incremental dump
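
As an illustration of the proposed denormalization, here is a minimal Python sketch that flattens nested <page>/<revision> elements into standalone <revision> records. The page_id and page_title field names are placeholders chosen for this example, not the tags from the proposal, and the sample input omits the XML namespace that real export files declare.

```python
import xml.etree.ElementTree as ET

# Small nested sample in the current style: one <page> wrapping its revisions.
NESTED = """<mediawiki>
  <page>
    <title>Example</title>
    <id>1</id>
    <revision><id>10</id><text>first</text></revision>
    <revision><id>11</id><text>second</text></revision>
  </page>
</mediawiki>"""


def denormalize(xml_text):
    """Return a tree of flat <revision> records that repeat the page fields."""
    root = ET.fromstring(xml_text)
    flat = ET.Element("mediawiki")
    for page in root.iter("page"):
        page_id = page.findtext("id")
        page_title = page.findtext("title")
        for revision in page.findall("revision"):
            record = ET.SubElement(flat, "revision")
            # Hypothetical field names: copy page-level data into every record.
            ET.SubElement(record, "page_id").text = page_id
            ET.SubElement(record, "page_title").text = page_title
            # Carry over the revision's own children unchanged.
            for child in revision:
                record.append(child)
    return flat


flat = denormalize(NESTED)
flat_xml = ET.tostring(flat, encoding="unicode")
```

Since each record is self-contained, new revisions could be appended to an incremental dump without rewriting any enclosing <page> element, at the cost of repeating the page fields in every record.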
On Tue, Aug 23, 2011 at 5:35 PM, Brion Vibber wrote:
> Broadly speaking some sort of diff storage makes a lot of sense; especially
> if it doesn't require reproducing those diffs all the time. :)
>
> But be warned that there are different needs and different ways of
> processing data; diffs again