>> So, thoughts on this? Is 'Move Dumping Process to another language' a
>> good idea at all?
>>
>
> I'd worry a lot less about what languages are used than whether the process
> itself is scalable.

I'm not a MediaWiki / Wikipedia developer, but as a developer / sysadmin
I'd think that adding another environment stack requirement to the
overall architecture (as C# or Java would require) would be a bad idea
in general.

> The current dump process (which I created in 2004-2005 when we had a LOT
> less data, and a LOT fewer computers) is very linear, which makes it awkward
> to scale up:
>
> * pull a list of all page revisions, in page/rev order
>   * as they go through, pump page/rev data to a linear XML stream
> * pull that linear XML stream back in again, as well as the last time's
>   completed linear XML stream
>   * while going through those, combine the original page text from the last
>     XML dump, or from the current database, and spit out a linear XML stream
>     containing both page/rev data and rev text
>   * and also stick compression on the end
>
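
To make the shape of that two-pass pipeline concrete, here is a deliberately
toy sketch in Python (the real dump code is PHP, and the XML, revision
storage and compression details are all faked as in-memory data):

    import bz2

    # Pass 1: walk pages/revisions in order, emitting a metadata-only stream.
    def stub_stream(pages):
        for page_id in sorted(pages):
            for rev_id in pages[page_id]:
                yield page_id, rev_id        # stand-in for <page>/<revision> XML

    # Pass 2: re-read the stub stream, pull each revision's text from the
    # previous dump if it is there, otherwise from the database, and write
    # the combined stream straight into a compressor.
    def write_full_dump(stubs, previous_text, database_text, out_path):
        with bz2.open(out_path, "wt") as out:
            for page_id, rev_id in stubs:
                text = previous_text.get(rev_id)
                if text is None:
                    text = database_text[rev_id]
                out.write("%d\t%d\t%s\n" % (page_id, rev_id, text))

    pages = {1: [10, 11], 2: [20]}                 # page_id -> revision ids
    previous_text = {10: "rev 10 text"}            # carried over from last dump
    database_text = {11: "rev 11 text", 20: "rev 20 text"}
    write_full_dump(stub_stream(pages), previous_text, database_text,
                    "full-dump.txt.bz2")

Every byte still flows through one linear stream, which is exactly the
property that makes this hard to spread across more CPUs or machines.
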
> About the only way we can scale it beyond a couple of CPUs
> (compression/decompression as separate processes from the main PHP stream
> handler) is to break it into smaller linear pieces and either reassemble
> them, or require users to reassemble the pieces for linear processing.
>
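
For what it's worth, the "smaller linear pieces" variant is easy to sketch
too: split the page-id space into ranges, dump each range in its own
process, then reassemble. Again a toy Python illustration, with page ids
standing in for the real per-page work:

    import bz2
    import multiprocessing

    def dump_range(id_range):
        start, end = id_range
        path = "piece-%09d-%09d.bz2" % (start, end)
        with bz2.open(path, "wt") as out:
            for page_id in range(start, end):      # real per-page work goes here
                out.write("page %d\n" % page_id)
        return path

    if __name__ == "__main__":
        ranges = [(i, i + 1000) for i in range(0, 10000, 1000)]
        with multiprocessing.Pool(processes=4) as pool:
            pieces = pool.map(dump_range, ranges)
        # reassemble: concatenated bz2 streams are still one readable bz2 file
        with open("reassembled.bz2", "wb") as out:
            for piece in pieces:
                with open(piece, "rb") as f:
                    out.write(f.read())
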
> Within each of those linear processes, any bottleneck will slow everything
> down, whether that's bzip2 or 7zip compression/decompression, fetching
> revisions from the wiki's complex storage systems, the XML parsing, or
> something in the middle.
>
> What I'd recommend looking at is ways to actually rearrange the data so a)
> there's less work that needs to be done to create a new dump and b) most of
> that work can be done independently of other work that's going on, so it's
> highly scalable.
>
> Ideally, anything that hasn't changed since the last dump shouldn't need
> *any* new data processing (right now it'll go through several stages of
> slurping from a DB, decompression and recompression, XML parsing and
> re-structuring, etc). A new dump should basically consist of running
> through, appending new data and removing deleted data, without touching
> the things that haven't changed.
>
> This may actually need a fancier structured data file format, or perhaps a
> sensible directory structure and subfile structure -- ideally one that's
> friendly to being updated via simple things like rsync.
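
One way that could look (layout and bucket size are purely hypothetical):
give each fixed-size bucket of page ids its own file under a predictable
path, and have a dump run rewrite only the buckets that contain changed or
deleted pages, so a plain rsync of the tree transfers just those files. A
rough Python sketch:

    import os

    BUCKET_SIZE = 1000   # hypothetical: 1000 page ids per file

    def bucket_path(root, bucket):
        return os.path.join(root, "%04d" % (bucket // 1000), "%07d.xml" % bucket)

    def update_dump(root, changed_page_ids, fetch_bucket):
        """Rewrite only the buckets containing changed (or deleted) pages.

        fetch_bucket(bucket) should yield (page_id, text) for every page
        currently in that bucket -- deleted pages simply stop appearing.
        """
        dirty = {page_id // BUCKET_SIZE for page_id in changed_page_ids}
        for bucket in sorted(dirty):
            path = bucket_path(root, bucket)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as out:
                for page_id, text in fetch_bucket(bucket):
                    out.write("<page id='%d'>%s</page>\n" % (page_id, text))

    # e.g. update_dump("dump/", {12, 34567}, lambda bucket: [])   # stub fetcher

Everything untouched stays byte-identical on disk, which is exactly what
makes rsync cheap.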

I'm probably stating the obvious here...

Breaking the dump up by article namespace might be a starting point --
have one controller process per namespace. That still leaves 85% of the
work in the main (default) namespace, which could then be segmented by
any combination of factors, maybe as simply as batches of X articles
each.
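
Something like this, as a rough sketch (the batch size and the
(page_id, namespace) input format are made up for illustration):

    from collections import defaultdict

    def partition(pages, batch_size=5000):
        """Split (page_id, namespace) pairs into independent work units:
        one unit per non-main namespace, fixed-size batches for ns 0."""
        by_ns = defaultdict(list)
        for page_id, ns in pages:
            by_ns[ns].append(page_id)
        units = []
        for ns, ids in sorted(by_ns.items()):
            ids.sort()
            if ns == 0:                          # main namespace: batch it
                for i in range(0, len(ids), batch_size):
                    units.append((ns, ids[i:i + batch_size]))
            else:                                # one unit per other namespace
                units.append((ns, ids))
        return units

    # each (namespace, [page ids]) unit could then go to its own dump worker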

When I'm importing the XML dump to MySQL, I have one process that reads
the XML file and X worker processes (usually 10) parsing article blocks
in parallel on a first-available queue. My current implementation is a
bit cumbersome, but maybe the same idea could be used for building the
dump as well?
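
The core of it is just one producer feeding a pool of consumers over a
shared queue; stripped of the XML and MySQL details it is roughly this
(parse_and_insert is a stub, and the worker count and block source are
placeholders):

    import multiprocessing

    def parse_and_insert(block):
        pass                                     # stub for the real parse + INSERT

    def worker(queue):
        while True:
            block = queue.get()                  # first available worker takes it
            if block is None:                    # sentinel: no more work
                break
            parse_and_insert(block)

    if __name__ == "__main__":
        queue = multiprocessing.Queue(maxsize=100)
        workers = [multiprocessing.Process(target=worker, args=(queue,))
                   for _ in range(10)]
        for w in workers:
            w.start()
        for block in ["<page>...</page>"] * 25:  # stand-in for blocks read from the file
            queue.put(block)
        for _ in workers:
            queue.put(None)
        for w in workers:
            w.join()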

In general, I'm interested in pitching in some effort on anything
related to the dump/import processes.

--------------------------------------
James Linden
kodekr...@gmail.com
--------------------------------------

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
