Re: [Wikitech-l] Moving the Dump Process to another language

Andrew Dunbar Fri, 25 Mar 2011 00:48:35 -0700

On 25 March 2011 18:21, Ariel T. Glenn <ar...@wikimedia.org> wrote:
> Στις 24-03-2011, ημέρα Πεμ, και ώρα 20:29 -0400, ο/η James Linden
> έγραψε:
>> >> So, thoughts on this? Is 'Move Dumping Process to another language' a
>> >> good idea at all?
>> >>
>> >
>> > I'd worry a lot less about what languages are used than whether the process
>> > itself is scalable.
>>
>> I'm not a mediawiki / wikipedia developer, but as a developer / sys
>> admin, I'd think that adding another environment stack requirement (in
>> the case of C# or Java) to the overall architecture would be a bad
>> idea in general.
>>
>> > The current dump process (which I created in 2004-2005 when we had a LOT
>> > less data, and a LOT fewer computers) is very linear, which makes it 
>> > awkward
>> > to scale up:
>> >
>> > * pull a list of all page revisions, in page/rev order
>> >  * as they go through, pump page/rev data to a linear XML stream
>> > * pull that linear XML stream back in again, as well as the last time's
>> > completed linear XML stream
>> >  * while going through those, combine the original page text from the last
>> > XML dump, or from the current database, and spit out a linear XML stream
>> > containing both page/rev data and rev text
>> >  * and also stick compression on the end
>> >
>> > About the only way we can scale it beyond a couple of CPUs
>> > (compression/decompression as separate processes from the main PHP stream
>> > handler) is to break it into smaller linear pieces and either reassemble
>> > them, or require users to reassemble the pieces for linear processing.
>
> TBH users wouldn't have to reassemble the pieces I don't think; they
> might be annoyed at having 400 little (or not so little) files lying
> around but any processing they meant to do could, I would think, easily
> be wrapped in a loop that tossed in each piece in order as input.
>
>> > Within each of those linear processes, any bottleneck will slow everything
>> > down whether that's bzip2 or 7zip compression/decompression, fetching
>> > revisions from the wiki's complex storage systems, the XML parsing, or
>> > something in the middle.
>> >
>> > What I'd recommend looking at is ways to actually rearrange the data so a)
>> > there's less work that needs to be done to create a new dump and b) most of
>> > that work can be done independently of other work that's going on, so it's
>> > highly scalable.
>> >
>> > Ideally, anything that hasn't changed since the last dump shouldn't need
>> > *any* new data processing (right now it'll go through several stages of
>> > slurping from a DB, decompression and recompression, XML parsing and
>> > re-structuring, etc). A new dump should consist basically of running 
>> > through
>> > appending new data and removing deleted data, without touching the things
>> > that haven't changed.
>
> One assumption here is that there is a previous dump to work from;
> that's not always true, and we should be able to run a dump "from
> scratch" without it needing to take 3 months for en wiki.
>
> A second assumption is that the previous dump data is sound; we've also
> seen that fail to be true.  This means that we need to be able to check
> the contents against the database contents in some fashion.  Currently
> we look at revision length for each revision, but that's not foolproof
> (and it's also still too slow).
>
> However if verification meant just that, verification instead of
> rerwiting a new file with the additional costs that compression imposes
> on us, we would see some gains immediately.
>
>> > This may actually need a fancier structured data file format, or perhaps a
>> > sensible directory structure and subfile structure -- ideally one that's
>> > friendly to beed updated via simple things like rsync.
>>
>> I'm probably stating the obvious here...
>>
>> Breaking the dump up by article namespace might be a starting point --
>> have 1 controller process for each namespace. That leaves 85% of the
>> work in the default namespace, which could them be segmented by any
>> combination of factors, maybe as simple as block batches of X number
>> of articles.
>
> We already have the mechanism for running batches of arbitrary numbers
> of articles.   That's what the en history dumps do now.
>
> What we don't have is:
>
> * a way to run easily over multiple hosts
> * a way to recombine small pieces into larger files for download that
> isn't serial, *or* alternatively a format that relies on multiple small
> pieces so we can skip recombining
> * a way to check previous content for integrity *quickly* before folding
> it into the current dumps (we check each revision separately, much too
> slow)
> * a way to "fold previous content into the current dumps" that consists
> of making a straight copy of what's on disk with no processing.  (What
> do we do if something has been deleted or moved, or is corrupt?  The
> existing format isn't friendly to those cases.)
>
>> When I'm importing the XML dump to MySQL, I have one process that
>> reads the XML file, and X processes (10 usually) working in parallel
>> to parse each article block on a first-available queue system. My
>> current implementation is a bit cumbersome, but maybe the idea could
>> be used for building the dump as well?
>>
>> In general, I'm interested in pitching in some effort on anything
>> related to the dump/import processes.
>
> Glad to hear it!  Drop by irc please, I'm in the usual channels. :-)


Just a thought, wouldn't it be easier to generate dumps in parallel if
we did away with the assumption that the dump would be in database
order. The metadata in the dump provides the ordering info for the
people that require it.

Andrew Dunbar (hippietrail)

> Ariel
>> --------------------------------------
>> James Linden
>> kodekr...@gmail.com
>> --------------------------------------
>>
>> _______________________________________________
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Moving the Dump Process to another language

Reply via email to