Sounds all very reasonable. Some thoughts:

* Having revisions not wrapped into <page> means that reconstructing the
  history of a page requires scanning the entire dump, unless there is an
  index of all revisions.
* Such an index should probably accompany the XML file, ideally with the
  XML in a seekable compressed container (bgzip etc.).
* I suggest that the current article version at the time of the dump be
  stored in full, and not as a diff; if you want to do history, you'll
  probably calculate all the diffs anyway, but the current version should
  be accessible right away.
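
A minimal sketch of the revision index mentioned above, assuming an
uncompressed, flat dump and the <revision> / <page_id> tag names from the
proposal quoted below; the function name and the JSON index format are only
illustrative placeholders:

    import json
    import re

    def build_revision_index(dump_path, index_path):
        # Map each page id to the byte offsets of its <revision> elements
        # in an uncompressed, flat dump, so one page's history can be read
        # back with seeks instead of a full scan.
        index = {}
        offset = 0
        rev_offset = None
        with open(dump_path, 'rb') as dump:
            for line in dump:
                pos = line.find(b'<revision')
                if pos != -1:
                    rev_offset = offset + pos
                match = re.search(rb'<page_id>(\d+)</page_id>', line)
                if match and rev_offset is not None:
                    index.setdefault(match.group(1).decode(), []).append(rev_offset)
                    rev_offset = None
                offset += len(line)
        with open(index_path, 'w') as out:
            json.dump(index, out)

With a block-compressed container such as bgzip, these uncompressed offsets
could then be mapped to (block, within-block) positions for random access
without decompressing the whole file.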

Magnus

On Thu, Aug 18, 2011 at 6:30 PM, Diederik van Liere <dvanli...@gmail.com> wrote:
> Hi!
>
> Over the last year, I have been using the Wikipedia XML dumps
> extensively. I used them to conduct the Editor Trends Study [0], and
> the Summer Research Fellows [1] and I have used them in the last three
> months during the Summer of Research. I am proposing some changes to
> the current XML schema based on those experiences.
>
> The current XML schema presents a number of challenges, both for the
> people who create the dump files and for the people who consume them.
> Challenges include:
>
> 1) The embedded structure of the schema (a single <page> tag with
> multiple <revision> tags) makes it very hard to develop an incremental
> dump utility.
> 2) A lot of post-processing is required.
> 3) By storing the entire text for each revision, the dump files are
> getting so large that they become unmanageable for most people.
>
>
> 1. Denormalization of the schema
> Instead of having a <page> tag with multiple <revision> tags, I
> propose to have just <revision> tags. Each <revision> tag would
> include a <page_id>, <page_title>, <page_namespace> and
> <page_redirect> tag. This denormalization would make it much easier to
> build an incremental dump utility: you only need to keep track of the
> final revision of each article at the moment of dump creation, and
> then you can create a new incremental dump continuing from the last
> dump. It would also be easier to restore a dump process that crashed.
> Finally, tools like Hadoop would have a much easier time handling this
> XML schema than the current one.
>
>
> 2. Post-processing of data
> Currently, a significant amount of time is required for
> post-processing the data. Some examples include:
> * The title includes the namespace, so excluding pages from a
> particular namespace requires generating a separate namespace
> variable. In particular, focusing on the main namespace is tricky
> because that can only be done by checking that a page does not belong
> to any other namespace (see bug
> https://bugzilla.wikimedia.org/show_bug.cgi?id=27775).
> * The <redirect> tag is currently either True or False; more useful
> would be the article_id of the page to which a page redirects.
> * Revisions within a <page> are sorted by revision_id, but they should
> be sorted by timestamp. The current ordering makes it even harder to
> generate diffs between two revisions (see bug
> https://bugzilla.wikimedia.org/show_bug.cgi?id=27112).
> * Some useful variables in the MySQL database are not yet exposed in
> the XML files. Examples include:
>   - Length of revision (part of MediaWiki 1.17)
>   - Namespace of article
>
>
> 3. Smaller dump sizes
> The dump files continue to grow as the text of each revision is stored
> in the XML file. Currently, the uncompressed XML dump files of the
> English Wikipedia are about 5.5 TB in size, and this will only
> continue to grow. An alternative would be to replace the <text> tag
> with <text_added> and <text_removed> tags. A page can still be
> reconstructed by patching multiple <text_added> and <text_removed>
> tags. We can provide a simple script / tool that would reconstruct the
> full text of an article up to a particular date / revision id. This
> has two advantages:
> 1) The dump files will be significantly smaller.
> 2) It will be easier and faster to analyze the types of edits: who is
> adding a template, who is wikifying an edit, who is fixing spelling
> and grammar mistakes.
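
For illustration, a rough sketch of how the proposed <text_added> /
<text_removed> content could be derived with Python's difflib, assuming the
tags carry the lines added and removed between consecutive revisions; the
function name is hypothetical:

    import difflib

    def split_revision_diff(old_text, new_text):
        # Collect the lines removed from the old revision and the lines
        # added in the new one; one possible encoding for the proposed
        # <text_added> / <text_removed> tags.
        old_lines = old_text.splitlines()
        new_lines = new_text.splitlines()
        added, removed = [], []
        matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ('replace', 'delete'):
                removed.extend(old_lines[i1:i2])
            if tag in ('replace', 'insert'):
                added.extend(new_lines[j1:j2])
        return added, removed

Note that this particular line-set encoding loses positional information, so
exact reconstruction would need positions as well (or the full current text
stored alongside, as suggested above).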

> 4. Downsides
> This suggestion is obviously not backwards compatible and it might
> break some tools out there. I think that the upsides (incremental
> dumps, Hadoop-readiness and smaller sizes) outweigh the downside of
> being backwards incompatible. The current way of dump generation
> cannot continue forever.
>
> [0] http://strategy.wikimedia.org/wiki/Editor_Trends_Study,
>     http://strategy.wikimedia.org/wiki/March_2011_Update
> [1] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/
>
> I would love to hear your thoughts and comments!
>
> Best,
> Diederik

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l