Sounds all very reasonable. Some thoughts:

* Having revisions not wrapped into <page> means that reconstructing the
  history of a page requires scanning the entire dump, unless there is an
  index of all revisions.
* Such an index should probably accompany the XML file, ideally with the
  XML in a seekable compressed container (bgzip etc.).
* I suggest that the current article version at the time of the dump be
  stored in full, and not as a diff; if you want to do history, you'll
  probably calculate all the diffs anyway, but the current version should
  be accessible right away.
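
A minimal sketch of the revision index mentioned above, assuming an
uncompressed, flat dump and the <revision> / <page_id> tag names from the
proposal quoted below; the function name and the JSON index format are only
illustrative placeholders:

    import json
    import re

    def build_revision_index(dump_path, index_path):
        # Map each page id to the byte offsets of its <revision> elements
        # in an uncompressed, flat dump, so one page's history can be read
        # back with seeks instead of a full scan.
        index = {}
        offset = 0
        rev_offset = None
        with open(dump_path, 'rb') as dump:
            for line in dump:
                pos = line.find(b'<revision')
                if pos != -1:
                    rev_offset = offset + pos
                match = re.search(rb'<page_id>(\d+)</page_id>', line)
                if match and rev_offset is not None:
                    index.setdefault(match.group(1).decode(), []).append(rev_offset)
                    rev_offset = None
                offset += len(line)
        with open(index_path, 'w') as out:
            json.dump(index, out)

With a block-compressed container such as bgzip, these uncompressed offsets
could then be mapped to (block, within-block) positions for random access
without decompressing the whole file.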

Magnus

On Thu, Aug 18, 2011 at 6:30 PM, Diederik van Liere <dvanli...@gmail.com> wrote:
> Hi!
>
> Over the last year, I have been using the Wikipedia XML dumps
> extensively. I used them to conduct the Editor Trends Study [0], and
> the Summer Research Fellows [1] and I have used them in the last three
> months during the Summer of Research. I am proposing some changes to
> the current XML schema based on those experiences.
>
> The current XML schema presents a number of challenges, both for the
> people who create the dump files and for the people who consume them.
> Challenges include:
>
> 1) The embedded structure of the schema (a single <page> tag with
> multiple <revision> tags) makes it very hard to develop an incremental
> dump utility.
> 2) A lot of post-processing is required.
> 3) By storing the entire text for each revision, the dump files are
> getting so large that they become unmanageable for most people.
>
>
> 1. Denormalization of the schema
> Instead of having a <page> tag with multiple <revision> tags, I
> propose to have just <revision> tags. Each <revision> tag would
> include a <page_id>, <page_title>, <page_namespace> and
> <page_redirect> tag. This denormalization would make it much easier to
> build an incremental dump utility: you only need to keep track of the
> final revision of each article at the moment of dump creation, and
> then you can create a new incremental dump continuing from the last
> dump. It would also be easier to restore a dump process that crashed.
> Finally, tools like Hadoop would have a much easier time handling this
> XML schema than the current one.
>
>
> 2. Post-processing of data
> Currently, a significant amount of time is required for
> post-processing the data. Some examples include:
> * The title includes the namespace, so excluding pages from a
> particular namespace requires generating a separate namespace
> variable. In particular, focusing on the main namespace is tricky
> because that can only be done by checking that a page does not belong
> to any other namespace (see bug
> https://bugzilla.wikimedia.org/show_bug.cgi?id=27775).
> * The <redirect> tag is currently either True or False; more useful
> would be the article_id of the page to which a page redirects.
> * Revisions within a <page> are sorted by revision_id, but they should
> be sorted by timestamp. The current ordering makes it even harder to
> generate diffs between two revisions (see bug
> https://bugzilla.wikimedia.org/show_bug.cgi?id=27112).
> * Some useful variables in the MySQL database are not yet exposed in
> the XML files. Examples include:
>   - Length of revision (part of MediaWiki 1.17)
>   - Namespace of article
>
>
> 3. Smaller dump sizes
> The dump files continue to grow as the text of each revision is stored
> in the XML file. Currently, the uncompressed XML dump files of the
> English Wikipedia are about 5.5 TB in size, and this will only
> continue to grow. An alternative would be to replace the <text> tag
> with <text_added> and <text_removed> tags. A page can still be
> reconstructed by patching multiple <text_added> and <text_removed>
> tags. We can provide a simple script / tool that would reconstruct the
> full text of an article up to a particular date / revision id. This
> has two advantages:
> 1) The dump files will be significantly smaller.
> 2) It will be easier and faster to analyze the types of edits: who is
> adding a template, who is wikifying an edit, who is fixing spelling
> and grammar mistakes.
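
For illustration, a rough sketch of how the proposed <text_added> /
<text_removed> content could be derived with Python's difflib, assuming the
tags carry the lines added and removed between consecutive revisions; the
function name is hypothetical:

    import difflib

    def split_revision_diff(old_text, new_text):
        # Collect the lines removed from the old revision and the lines
        # added in the new one; one possible encoding for the proposed
        # <text_added> / <text_removed> tags.
        old_lines = old_text.splitlines()
        new_lines = new_text.splitlines()
        added, removed = [], []
        matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ('replace', 'delete'):
                removed.extend(old_lines[i1:i2])
            if tag in ('replace', 'insert'):
                added.extend(new_lines[j1:j2])
        return added, removed

Note that this particular line-set encoding loses positional information, so
exact reconstruction would need positions as well (or the full current text
stored alongside, as suggested above).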

> 4. Downsides
> This suggestion is obviously not backwards compatible and it might
> break some tools out there. I think that the upsides (incremental
> dumps, Hadoop-readiness and smaller sizes) outweigh the downside of
> being backwards incompatible. The current way of dump generation
> cannot continue forever.
>
> [0] http://strategy.wikimedia.org/wiki/Editor_Trends_Study,
>     http://strategy.wikimedia.org/wiki/March_2011_Update
> [1] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/
>
> I would love to hear your thoughts and comments!
>
> Best,
> Diederik

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l