Re: [Wikitech-l] Changing XML Wikipedia Schema to Enable Smaller Incremental Dumps that are Hadoop ready

2011-08-23 Thread Brion Vibber
On Thu, Aug 18, 2011 at 10:30 AM, Diederik van Liere dvanli...@gmail.com wrote:

 1. Denormalization of the schema
 Instead of having a <page> tag with multiple <revision> tags, I
 propose to just have <revision> tags. Each <revision> tag would
 include a page_id, page_title, page_namespace and
 page_redirect tag. This denormalization would make it much easier to
 build an incremental dump utility. You only need to keep track of the
 final revision of each article at the moment of dump creation, and then
 you can create a new incremental dump continuing from the last dump.
 It would also be easier to restore a dump process that crashed.
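(For concreteness, a denormalized dump would then be a flat stream of self-contained records along these lines; just a hypothetical sketch built with ElementTree, with element names taken from the proposal above:)

    import xml.etree.ElementTree as ET

    def denormalized_revision(rev_id, page_id, page_title, page_namespace,
                              page_redirect, text):
        """Build one self-contained <revision> record that carries its page
        fields along with it (hypothetical element names, per the proposal)."""
        rev = ET.Element("revision", id=str(rev_id))
        for tag, value in (
            ("page_id", page_id),
            ("page_title", page_title),
            ("page_namespace", page_namespace),
            ("page_redirect", page_redirect),
            ("text", text),
        ):
            ET.SubElement(rev, tag).text = str(value)
        return ET.tostring(rev, encoding="unicode")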


Page title/namespace and redirect-ness are not fixed to a revision, and may
change over time. This means that simply knowing the last revision you left
off at doesn't give you enough information for a continuation point; you'd
have to go back and see whether any revisions have been deleted or whether
their pages' title, redirect status, or other properties have changed.


I think it may be better to abandon the single XML stream data model and
allow for structure and random access. A directory tree with separate files
for various pages/revisions may be a lot easier to produce and update
in place, and could be downloaded and resynced with standard tools like
rsync or a custom tool that optimizes what files it looks for.
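For example (only a sketch of one possible layout, with made-up path conventions), revisions could be bucketed into per-page directories, sharded so that no single directory gets huge:

    import hashlib
    from pathlib import Path

    def revision_path(root, page_id, rev_id):
        """Illustrative on-disk layout: <root>/<shard>/<page_id>/<rev_id>.xml,
        where the shard is a hash prefix of the page id, to keep directories small."""
        shard = hashlib.md5(str(page_id).encode("utf-8")).hexdigest()[:2]
        return Path(root) / shard / str(page_id) / f"{rev_id}.xml"

    # e.g. revision_path("enwiki", 12, 34567) -> enwiki/<shard>/12/34567.xml

A resync would then only need to transfer the shards and page directories that actually changed since the previous dump.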

There are basically a couple of different problems to solve:

1) Building a complete data set and getting that out to people

2) Updating an existing data set with new data

3) Processing a data set in some useful way

Generating the initial dump today is super expensive -- because it's a
single compressed XML stream, we have to copy and re-copy most of the same
data over, and over, and over.

And today there's no good way to just apply an incremental dump on top of
your existing download.


 3. Smaller dump sizes
 The dump files continue to grow as the text of each revision is stored
 in the XML file. Currently, the uncompressed XML dump files of the
 English Wikipedia are about 5.5 TB in size, and this will only continue
 to grow. An alternative would be to replace the <text> tag with
 <text_added> and <text_removed> tags. A page can still be
 reconstructed by applying multiple <text_added> and <text_removed>
 tags. We can provide a simple script / tool that would reconstruct the
 full text of an article up to a particular date / revision id. This
 has two advantages:
 1) The dump files will be significantly smaller
 2) It will be easier and faster to analyze the types of edits: who is
 adding a template, who is wikifying an edit, who is fixing spelling
 and grammar mistakes.


Broadly speaking, some sort of diff storage makes a lot of sense, especially
if it doesn't require reproducing those diffs all the time. :)

But be warned that there are different needs and different ways of
processing data; diffs again interfere with random access, as you need to be
able to fetch adjacent items to reproduce the text. If you're just trundling
along through the entire dump and applying diffs as you go to reconstruct
the text, then you're basically doing what you already do when doing
on-the-fly decompression of the .xml.bz2 or .xml.7z -- it may, or may not,
actually save you anything for this case.
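(To make that concrete, this is roughly what the sequential pass looks like today; a minimal sketch that streams revision text straight out of a .xml.bz2, matching tag names loosely to sidestep the export XML namespace:)

    import bz2
    import xml.etree.ElementTree as ET

    def iter_revision_texts(dump_path):
        """Stream every revision's text from a .xml.bz2 dump, decompressing
        on the fly instead of materializing the whole file."""
        with bz2.open(dump_path, "rb") as f:
            for _event, elem in ET.iterparse(f, events=("end",)):
                if elem.tag.rsplit("}", 1)[-1] == "revision":
                    text_node = next(
                        (c for c in elem if c.tag.rsplit("}", 1)[-1] == "text"),
                        None,
                    )
                    yield (text_node.text or "") if text_node is not None else ""
                    elem.clear()  # free memory as we go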

Of course if all you really wanted was the diff, then obviously that's going
to help you. :)

-- brion


Re: [Wikitech-l] Changing XML Wikipedia Schema to Enable Smaller Incremental Dumps that are Hadoop ready

2011-08-23 Thread Robert Rohde
On Tue, Aug 23, 2011 at 5:35 PM, Brion Vibber br...@pobox.com wrote:
snip
 Broadly speaking, some sort of diff storage makes a lot of sense, especially
 if it doesn't require reproducing those diffs all the time. :)

 But be warned that there are different needs and different ways of
 processing data; diffs again interfere with random access, as you need to be
 able to fetch adjacent items to reproduce the text. If you're just trundling
 along through the entire dump and applying diffs as you go to reconstruct
 the text, then you're basically doing what you already do when doing
 on-the-fly decompression of the .xml.bz2 or .xml.7z -- it may, or may not,
 actually save you anything for this case.

 Of course if all you really wanted was the diff, then obviously that's going
 to help you. :)

I've found that diff representations of the full history can knock off
about 95% of the uncompressed size.  When stacked with generic
compressors such as bz2 and 7z, an intelligent differencing scheme can
still see improvement, such that .diff.7z is about 10-50% smaller than
.xml.7z while representing the same content.  As you note, though, the
trade-off is that you have to look at many diffs to reconstruct a
page's content.  Given that hard disks are cheap, the biggest
advantage is probably for people whose main object of study is the
diffs themselves.
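(For anyone who wants to reproduce that kind of measurement on a single page history, here is a rough sketch using stdlib difflib; the exact ratio will of course depend on the diff format and the page:)

    import difflib

    def full_vs_diff_bytes(revision_texts):
        """Compare total bytes of full revision texts against line-level
        unified diffs between consecutive revisions (first diff is vs. empty)."""
        full_size = sum(len(t.encode("utf-8")) for t in revision_texts)
        diff_size, prev = 0, []
        for text in revision_texts:
            cur = text.splitlines(keepends=True)
            delta = "".join(difflib.unified_diff(prev, cur))
            diff_size += len(delta.encode("utf-8"))
            prev = cur
        return full_size, diff_size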

-Robert Rohde



Re: [Wikitech-l] Changing XML Wikipedia Schema to Enable Smaller Incremental Dumps that are Hadoop ready

2011-08-19 Thread Seb35
Hi,

(I don’t post here often and I’m not a MW developer, but I try to follow
along; correct me if I’m wrong.)

I see a couple of things that must be handled carefully and deliberately
regarding page titles [ref]. Currently there is a difference between page_id
and page title, since the page_id is preserved when the title of the page
changes (during a move), so there is currently no canonical page title
associated with a revision, only a page_id. In other words, I think it is
theoretically not possible to retrieve the original page title of a given
past revision (this could be discussed in another thread), and I also have
some doubts about retrieving the original page_id of a revision in very
rare cases (with a succession of deletions and undeletions of some
revisions, plus moves), but I’m not sure of that.

So introducing a page_title in the revisions (your §1) adds an interesting
new piece of information if you consider it to be the title as of the date
the revision was saved; the title reached through the page_id and the
per-revision page_title could then differ, and the same goes for the
namespace. But this information is not currently available in the database.
This poses the problem of how to define it for existing revisions in the
dumps: use the current page title associated with the current page_id? If
you put the current page_title associated with the current page_id of the
revision, the page_title will change across dumps every time a move is
done, which I don’t find semantically correct, but at least it should be
clearly explained. This is the current behaviour, but since the page_title
sits outside the revision, you implicitly accept this behaviour as
semantically correct.

In your §2 there is a similar issue for the redirect: currently a redirect
points to a title, not a page_id (if you move the page it points to, the
redirect will point to the new page).

ref: Two years ago I tried to work on an extension to restore an old
revision, ideally pixel-for-pixel, but I think it is not (currently)
possible, mainly because of this problem with page titles. There are other
problems, but this is the main one. Others include retrieving old versions
of the templates (related to the title problem), the colour of links and
categories, the version of an image, external resources like site CSS/JS,
the status of deleted revisions (displayed or not), and finer things like
user preferences and rights, and ultimately differences due to changes of
MW configuration or MW version, etc. (I don’t consider a change of version
of the user’s browser :) I didn’t publish it at the time (Sumana was not
here to tell me to publish it ;) but I found it again on my computer, and I
will try to publish it and explain it on mw.org.

Sébastien

Thu, 18 Aug 2011 13:30:18 -0400, Diederik van Liere dvanli...@gmail.com  
wrote:
 Hi!

 Over the last year, I have been using the Wikipedia XML dumps
 extensively. I used them to conduct the Editor Trends Study [0], and the
 Summer Research Fellows [1] and I have used them over the last three
 months during the Summer of Research. I am proposing some changes to
 the current XML schema based on those experiences.

 The current XML schema presents a number of challenges, both for the
 people who create the dump files and for the people who consume them.
 Challenges include:

 1) The nested structure of the schema (a single <page> tag with
 multiple <revision> tags) makes it very hard to develop an incremental
 dump utility
 2) A lot of post-processing is required.
 3) By storing the entire text of each revision, the dump files are
 getting so large that they become unmanageable for most people.


 1. Denormalization of the schema
 Instead of having a <page> tag with multiple <revision> tags, I
 propose to just have <revision> tags. Each <revision> tag would
 include a page_id, page_title, page_namespace and
 page_redirect tag. This denormalization would make it much easier to
 build an incremental dump utility. You only need to keep track of the
 final revision of each article at the moment of dump creation, and then
 you can create a new incremental dump continuing from the last dump.
 It would also be easier to restore a dump process that crashed. Finally,
 tools like Hadoop would have a much easier time handling this XML
 schema than the current one.


 2. Post-processing of data
 Currently, a significant amount of time is required for
 post-processing the data. Some examples include:
 * The title includes the namespace, so excluding pages from a
 particular namespace requires generating a separate namespace
 variable. In particular, focusing on the main namespace is tricky
 because that can only be done by checking that a page does not
 belong to any other namespace (see bug
 https://bugzilla.wikimedia.org/show_bug.cgi?id=27775).
 * The <redirect> tag currently is either True or False; more useful
 would be the article_id of the page to which a page is redirected.
 * Revisions within a page are sorted by revision_id, but they 

Re: [Wikitech-l] Changing XML Wikipedia Schema to Enable Smaller Incremental Dumps that are Hadoop ready

2011-08-18 Thread Magnus Manske
Sounds all very reasonable.

Some thoughts:
* Having revisions not wrapped inside <page> means that reconstructing
the history of a page requires scanning the entire dump, unless there
is an index of all revisions
* Such an index should probably accompany the XML file, ideally with the
XML in a seekable compressed container (bgzip etc.)
* I suggest that the current article version at the time of the dump be
stored in full, and not as a diff; if you want to do history, you'll
probably calculate all diffs anyway, but the current version should be
accessible right away
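A rough sketch of the index idea, assuming an uncompressed dump for simplicity; a real index would key on the revision id rather than the ordinal, and for a seekable container like bgzip it would store virtual offsets instead of raw byte positions:

    def build_revision_index(dump_path, index_path):
        """Record the byte offset of every '<revision>' open tag so that a
        reader can later seek() straight to the revision it wants."""
        offsets, pos = [], 0
        with open(dump_path, "rb") as f:
            for line in f:
                i = line.find(b"<revision>")
                if i != -1:
                    offsets.append(pos + i)
                pos += len(line)
        with open(index_path, "w") as out:
            out.writelines(f"{n}\t{off}\n" for n, off in enumerate(offsets))
        return offsets

A consumer could then open the dump, seek() to the recorded offset and parse just that element.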

Magnus


On Thu, Aug 18, 2011 at 6:30 PM, Diederik van Liere dvanli...@gmail.com wrote:
 Hi!

 Over the last year, I have been using the Wikipedia XML dumps
 extensively. I used them to conduct the Editor Trends Study [0], and the
 Summer Research Fellows [1] and I have used them over the last three
 months during the Summer of Research. I am proposing some changes to
 the current XML schema based on those experiences.

 The current XML schema presents a number of challenges, both for the
 people who create the dump files and for the people who consume them.
 Challenges include:

 1) The nested structure of the schema (a single <page> tag with
 multiple <revision> tags) makes it very hard to develop an incremental
 dump utility
 2) A lot of post-processing is required.
 3) By storing the entire text of each revision, the dump files are
 getting so large that they become unmanageable for most people.


 1. Denormalization of the schema
 Instead of having a <page> tag with multiple <revision> tags, I
 propose to just have <revision> tags. Each <revision> tag would
 include a page_id, page_title, page_namespace and
 page_redirect tag. This denormalization would make it much easier to
 build an incremental dump utility. You only need to keep track of the
 final revision of each article at the moment of dump creation, and then
 you can create a new incremental dump continuing from the last dump.
 It would also be easier to restore a dump process that crashed. Finally,
 tools like Hadoop would have a much easier time handling this XML
 schema than the current one.


 2. Post-processing of data
 Currently, a significant amount of time is required for
 post-processing the data. Some examples include:
 * The title includes the namespace, so excluding pages from a
 particular namespace requires generating a separate namespace
 variable. In particular, focusing on the main namespace is tricky
 because that can only be done by checking that a page does not
 belong to any other namespace (see bug
 https://bugzilla.wikimedia.org/show_bug.cgi?id=27775).
 * The <redirect> tag currently is either True or False; more useful
 would be the article_id of the page to which a page is redirected.
 * Revisions within a page are sorted by revision_id, but they should
 be sorted by timestamp. The current ordering makes it even harder to
 generate diffs between two revisions (see bug
 https://bugzilla.wikimedia.org/show_bug.cgi?id=27112)
 * Some useful variables in the MySQL database are not yet exposed in
 the XML files. Examples include:
        - Length of revision (part of MediaWiki 1.17)
        - Namespace of article
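(As an illustration of the namespace point above, the post-processing currently looks something like the sketch below; namespace_names would be parsed from the <namespaces> list in the dump's <siteinfo>, and the function name is made up:)

    def split_title(full_title, namespace_names):
        """Split a dump title such as 'Talk:Foo' into (namespace_id, 'Foo').
        namespace_names maps local namespace names to ids; any title without
        a recognised prefix is treated as the main namespace (0)."""
        if ":" in full_title:
            prefix, rest = full_title.split(":", 1)
            if prefix in namespace_names:
                return namespace_names[prefix], rest
        return 0, full_title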


 3. Smaller dump sizes
 The dump files continue to grow as the text of each revision is stored
 in the XML file. Currently, the uncompressed XML dump files of the
 English Wikipedia are about 5.5 TB in size, and this will only continue
 to grow. An alternative would be to replace the <text> tag with
 <text_added> and <text_removed> tags. A page can still be
 reconstructed by applying multiple <text_added> and <text_removed>
 tags. We can provide a simple script / tool that would reconstruct the
 full text of an article up to a particular date / revision id. This
 has two advantages:
 1) The dump files will be significantly smaller
 2) It will be easier and faster to analyze the types of edits: who is
 adding a template, who is wikifying an edit, who is fixing spelling
 and grammar mistakes.
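A rough sketch of how such <text_added>/<text_removed> fragments could be computed from consecutive revisions with stdlib difflib (the field layout here is hypothetical; a real format would also need positions so the change can be reapplied unambiguously):

    import difflib

    def added_removed(prev_text, cur_text):
        """Line-level added/removed fragments between two consecutive revisions."""
        sm = difflib.SequenceMatcher(a=prev_text.splitlines(),
                                     b=cur_text.splitlines(), autojunk=False)
        added, removed = [], []
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op in ("insert", "replace"):
                added.extend(sm.b[j1:j2])
            if op in ("delete", "replace"):
                removed.extend(sm.a[i1:i2])
        return added, removed

Reconstructing the full text of a revision would then mean replaying these fragments from the first revision (or the nearest stored full text) up to the revision of interest.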


 4. Downsides
 This suggestion is obviously not backwards compatible and it might
 break some existing tools. I think that the upsides (incremental
 backups, Hadoop-readiness and smaller sizes) outweigh the downside of
 breaking backwards compatibility. The current way of generating dumps
 cannot continue forever.

 [0] http://strategy.wikimedia.org/wiki/Editor_Trends_Study,
 http://strategy.wikimedia.org/wiki/March_2011_Update
 [1] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/

 I would love to hear your thoughts and comments!

 Best,
 Diederik


