>> So, thoughts on this? Is 'Move Dumping Process to another language' a
>> good idea at all?
>
> I'd worry a lot less about what languages are used than whether the process
> itself is scalable.
I'm not a mediawiki / wikipedia developer, but as a developer / sysadmin,
I'd think that adding another environment stack requirement to the overall
architecture (as would be the case with C# or Java) would be a bad idea in
general.

> The current dump process (which I created in 2004-2005 when we had a LOT
> less data, and a LOT fewer computers) is very linear, which makes it
> awkward to scale up:
>
> * pull a list of all page revisions, in page/rev order
> * as they go through, pump page/rev data to a linear XML stream
> * pull that linear XML stream back in again, as well as the last time's
>   completed linear XML stream
> * while going through those, combine the original page text from the last
>   XML dump, or from the current database, and spit out a linear XML stream
>   containing both page/rev data and rev text
> * and also stick compression on the end
>
> About the only way we can scale it beyond a couple of CPUs
> (compression/decompression as separate processes from the main PHP stream
> handler) is to break it into smaller linear pieces and either reassemble
> them, or require users to reassemble the pieces for linear processing.
>
> Within each of those linear processes, any bottleneck will slow everything
> down, whether that's bzip2 or 7zip compression/decompression, fetching
> revisions from the wiki's complex storage systems, the XML parsing, or
> something in the middle.
>
> What I'd recommend looking at is ways to actually rearrange the data so
> a) there's less work that needs to be done to create a new dump and
> b) most of that work can be done independently of other work that's going
> on, so it's highly scalable.
>
> Ideally, anything that hasn't changed since the last dump shouldn't need
> *any* new data processing (right now it'll go through several stages of
> slurping from a DB, decompression and recompression, XML parsing and
> re-structuring, etc). A new dump should consist basically of running
> through, appending new data and removing deleted data, without touching
> the things that haven't changed.
>
> This may actually need a fancier structured data file format, or perhaps
> a sensible directory structure and subfile structure -- ideally one that's
> friendly to being updated via simple things like rsync.

I'm probably stating the obvious here...

Breaking the dump up by article namespace might be a starting point -- have
one controller process for each namespace. That leaves roughly 85% of the
work in the default namespace, which could then be segmented by any
combination of factors, maybe as simple as block batches of X number of
articles.

When I'm importing the XML dump to MySQL, I have one process that reads the
XML file, and X processes (10 usually) working in parallel to parse each
article block on a first-available queue system. My current implementation
is a bit cumbersome, but maybe the idea could be used for building the dump
as well?

In general, I'm interested in pitching in some effort on anything related
to the dump/import processes. I've appended a couple of rough sketches
below my signature to show what I have in mind.

--------------------------------------
James Linden
kodekr...@gmail.com
--------------------------------------
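Sketch 1: to make the quoted "don't touch what hasn't changed" idea a bit
more concrete, here's one very rough take on an rsync-friendly, per-page
layout. Everything in it -- the sharded paths, the manifest file, the
iter_changed_pages() helper -- is invented for illustration; it's not a
description of anything that exists today.

    # Sketch only: one compressed XML file per page, sharded into
    # subdirectories, plus a manifest of "page id -> last revision id
    # written". Each run rewrites just the pages whose latest revision
    # changed, so unchanged pages are never reprocessed and rsync only
    # has to move the files that actually differ.
    import bz2
    import json
    import os

    def page_path(root, page_id):
        # shard by page id so no single directory ends up with millions of files
        return os.path.join(root, str(page_id % 1000), '%d.xml.bz2' % page_id)

    def update_dump(root, manifest_file, iter_changed_pages):
        # manifest: {page id (as string): last revision id included in the dump}
        if os.path.exists(manifest_file):
            with open(manifest_file) as f:
                manifest = json.load(f)
        else:
            manifest = {}

        # iter_changed_pages(manifest) is a hypothetical helper that yields
        # (page_id, latest_rev_id, page_xml) only for pages that are new or
        # changed since the revision recorded in the manifest.
        for page_id, latest_rev, page_xml in iter_changed_pages(manifest):
            path = page_path(root, page_id)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with bz2.open(path, 'wt', encoding='utf-8') as f:
                f.write(page_xml)
            manifest[str(page_id)] = latest_rev
        # (removing files for deleted pages is left out of this sketch)

        with open(manifest_file, 'w') as f:
            json.dump(manifest, f)

The nice property would be that a re-run only rewrites files for pages that
actually changed, so both the dump run and a downstream rsync scale with
the amount of change rather than the size of the wiki.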
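Sketch 2: the general shape of the reader/worker queue I described for my
XML-to-MySQL import, heavily simplified. parse_and_insert() stands in for
the real per-block parsing and INSERT work, and a real reader would use a
streaming XML parser rather than the naive line matching shown here.

    # Sketch only: one reader process streams the dump and feeds raw
    # <page>...</page> blocks to a queue; N workers pull blocks on a
    # first-available basis and do the parse/insert work in parallel.
    import multiprocessing as mp

    NUM_WORKERS = 10

    def read_pages(dump_path, queue):
        # Naive line matching just for illustration; MediaWiki dumps put
        # <page> and </page> on their own lines, but real code should use
        # a streaming XML parser.
        block = []
        inside = False
        with open(dump_path, encoding='utf-8') as f:
            for line in f:
                if '<page>' in line:
                    inside = True
                if inside:
                    block.append(line)
                if '</page>' in line:
                    queue.put(''.join(block))
                    block = []
                    inside = False
        for _ in range(NUM_WORKERS):
            queue.put(None)          # one shutdown sentinel per worker

    def parse_and_insert(block):
        pass                         # placeholder for XML parsing + MySQL inserts

    def worker(queue):
        while True:
            block = queue.get()
            if block is None:
                break
            parse_and_insert(block)

    if __name__ == '__main__':
        q = mp.Queue(maxsize=100)    # bounded, so the reader can't run away
        workers = [mp.Process(target=worker, args=(q,)) for _ in range(NUM_WORKERS)]
        for w in workers:
            w.start()
        read_pages('enwiki-pages-articles.xml', q)
        for w in workers:
            w.join()

The bounded queue keeps the reader from getting too far ahead of the
workers, which is what makes the first-available scheduling work in
practice; the same arrangement could presumably run with one controller per
namespace, or per batch of X articles, when writing a dump instead of
importing one.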