Re: [Wikitech-l] Changing XML Wikipedia Schema to Enable Smaller Incremental Dumps that are Hadoop ready

2011-08-18 Thread Magnus Manske
Sounds all very reasonable. Some thoughts:
* Having revisions not wrapped into <page> elements means that for reconstructing the history of a page, the entire dump has to be scanned, unless there is an index of all revisions.
* Such an index should probably accompany the XML file, ideally if the XML is in a seekable …
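
A minimal sketch of the kind of side index Magnus describes, assuming (hypothetically) a denormalized, uncompressed dump where each revision carries its own <page_id> tag and elements start on their own lines: record the byte offset of every <revision>, keyed by page_id, so a page's history can be reassembled with seeks instead of a full scan. The tag names and file layout here are illustrative assumptions, not a confirmed schema.

    import re
    from collections import defaultdict

    def build_revision_index(dump_path):
        """Map page_id -> list of byte offsets of <revision> elements.

        Assumes (hypothetically) a denormalized dump where each revision
        carries its own <page_id> tag and elements start on their own lines.
        """
        index = defaultdict(list)
        offset = 0
        pending = None  # offset of the <revision> we are currently inside
        with open(dump_path, "rb") as f:
            for line in f:
                if b"<revision" in line:
                    pending = offset
                m = re.search(rb"<page_id>(\d+)</page_id>", line)
                if m and pending is not None:
                    index[int(m.group(1))].append(pending)
                    pending = None
                offset += len(line)
        return index

    def read_revision(dump_path, offset):
        """Seek straight to one revision instead of scanning the whole dump."""
        with open(dump_path, "rb") as f:
            f.seek(offset)
            chunk = []
            for line in f:
                chunk.append(line)
                if b"</revision>" in line:
                    break
        return b"".join(chunk).decode("utf-8")

Such an index only helps if the underlying file supports random access, which is why the seekability of the (compressed) XML matters to this proposal.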

Re: [Wikitech-l] Changing XML Wikipedia Schema to Enable Smaller Incremental Dumps that are Hadoop ready

2011-08-19 Thread Seb35
Hi, (I don’t post often here and I’m not a MW developer, but I try to follow; correct me if I’m wrong.) I see a couple of things which must be done carefully and deliberately about page titles. Currently there is a difference between page_id and page title, since the page_id is conserved when the page is renamed …
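
A toy illustration of Seb35's point, using hypothetical data: because page_id survives a rename, it is the stable key for grouping a page's revisions, while the title recorded on each denormalized revision can vary over time.

    from itertools import groupby

    # Hypothetical denormalized revision records: the title changes
    # between revisions, but page_id stays the same across the rename.
    revisions = [
        {"rev_id": 100, "page_id": 42, "page_title": "Foo",        "text": "v1"},
        {"rev_id": 101, "page_id": 42, "page_title": "Foo (band)", "text": "v2"},
        {"rev_id": 102, "page_id": 42, "page_title": "Foo (band)", "text": "v3"},
    ]

    # Grouping by page_id keeps the history together despite the rename;
    # grouping by title would wrongly split it into two pages.
    by_id = {k: list(g) for k, g in groupby(revisions, key=lambda r: r["page_id"])}
    assert len(by_id[42]) == 3

    by_title = {k: len(list(g)) for k, g in groupby(revisions, key=lambda r: r["page_title"])}
    assert by_title == {"Foo": 1, "Foo (band)": 2}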

Re: [Wikitech-l] Changing XML Wikipedia Schema to Enable Smaller Incremental Dumps that are Hadoop ready

2011-08-23 Thread Brion Vibber
On Thu, Aug 18, 2011 at 10:30 AM, Diederik van Liere wrote:
> 1. Denormalization of the schema
> Instead of having a <page> tag with multiple <revision> tags, I
> propose to just have <revision> tags. Each <revision> tag would
> include a <page_id>, <page_title>, and <page_namespace> tag. This
> denormalization would make it much easier to build an incremental dump …
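
A sketch of what one denormalized record might look like under this proposal; the exact tag names are my reading of the stripped-out markup in the quoted mail, not a confirmed schema.

    import xml.etree.ElementTree as ET

    def make_revision_element(rev):
        """Build one self-contained <revision> element: page metadata is
        repeated inside every revision instead of living on a <page> wrapper."""
        el = ET.Element("revision")
        for tag in ("rev_id", "page_id", "page_title", "page_namespace",
                    "timestamp", "text"):
            child = ET.SubElement(el, tag)
            child.text = str(rev[tag])
        return el

    rev = {
        "rev_id": 101, "page_id": 42, "page_title": "Foo (band)",
        "page_namespace": 0, "timestamp": "2011-08-18T10:30:00Z", "text": "v2",
    }
    print(ET.tostring(make_revision_element(rev), encoding="unicode"))

Because each record is self-contained, new revisions can simply be appended to an incremental dump without rewriting an enclosing <page> element.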

Re: [Wikitech-l] Changing XML Wikipedia Schema to Enable Smaller Incremental Dumps that are Hadoop ready

2011-08-23 Thread Robert Rohde
On Tue, Aug 23, 2011 at 5:35 PM, Brion Vibber wrote:
> Broadly speaking some sort of diff storage makes a lot of sense, especially
> if it doesn't require reproducing those diffs all the time. :)
>
> But be warned that there are different needs and different ways of
> processing data; diffs against …
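
As a rough illustration of the trade-off being discussed, here is a minimal diff-storage sketch using Python's difflib: store each revision as a delta against its predecessor, and pay a reconstruction cost (replaying deltas) when reading. This is my own toy model of the idea, not anything proposed verbatim in the thread.

    import difflib

    def make_delta(old_lines, new_lines):
        """Store a revision as opcodes against its predecessor instead of full text."""
        sm = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
        delta = []
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op == "equal":
                delta.append(("copy", i1, i2))              # reuse old lines
            else:
                delta.append(("insert", new_lines[j1:j2]))  # literal new lines
        return delta

    def apply_delta(old_lines, delta):
        out = []
        for step in delta:
            if step[0] == "copy":
                _, i1, i2 = step
                out.extend(old_lines[i1:i2])
            else:
                out.extend(step[1])
        return out

    v1 = ["Foo is a band.\n"]
    v2 = ["Foo is a band.\n", "They formed in 1998.\n"]
    delta = make_delta(v1, v2)
    assert apply_delta(v1, delta) == v2
    # Reading revision N means replaying N deltas unless periodic full
    # snapshots are kept, which is exactly the access-pattern concern
    # raised in the thread: sequential consumers love diffs, random-access
    # consumers pay for them.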