On Mon, Feb 23, 2009 at 11:08 AM, Alex <mrzmanw...@gmail.com> wrote:
> Most of that hasn't been touched in years, and it seems to be mainly a
> Python wrapper around the dump scripts in /phase3/maintenance/ which
> also don't seem to have had significant changes recently. Has anything
> been done recently (in a very broad sense of the word)? Or at least, has
> anything been written down about what the plans are?

In a "very broad sense" (and not directly connected to main problems),
I wrote a compressor [1] that converts full-text history dumps into an
"edit syntax" that provides ~95% compression on the larger dumps while
keeping it in a plain text format that could still be searched and
processed without needing a full decompression.
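
To give a rough idea of what I mean by an "edit syntax" (a toy sketch
only, not the actual format the tool in [1] uses), something along
these lines stores each revision as a short list of edit operations
against its predecessor instead of repeating the full article text:

import difflib

def make_delta(prev_lines, curr_lines):
    # Encode curr_lines as edit operations against prev_lines.
    ops = []
    sm = difflib.SequenceMatcher(a=prev_lines, b=curr_lines,
                                 autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append("= %d %d" % (i1, i2))   # copy prev_lines[i1:i2]
        else:
            ops.append("+ %d" % (j2 - j1))     # next j2-j1 lines are literal
            ops.extend(curr_lines[j1:j2])
    return ops

def apply_delta(prev_lines, ops):
    # Rebuild a revision from its predecessor plus the edit operations.
    out, k = [], 0
    while k < len(ops):
        op = ops[k]; k += 1
        if op.startswith("= "):
            i1, i2 = map(int, op[2:].split())
            out.extend(prev_lines[i1:i2])
        else:
            n = int(op[2:])
            out.extend(ops[k:k + n])
            k += n
    return out

A real format would presumably checkpoint a full revision every so
often, so that reconstructing a recent revision doesn't mean replaying
the entire history; the point is simply that the result stays plain,
line-oriented text that tools like grep can work on directly.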

That's one of several ways to modify the way the dump process operates
in order to make its output easier to work with (if enwiki's full
history takes ~2 TB to expand, then it is not practical for most users
even if we solve the problem of generating it).  My particular
technology is not necessarily the right answer, but changes in
formatting that aid distribution, generation, and use are one of the
areas that ought to be considered when reimplementing the dump
process.

The largest gains are almost certainly going to be in parallelization
though.  A single monolithic dumper is impractical for enwiki.
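
For illustration only (dump_page_range() below is a hypothetical
stand-in for whatever per-range dumper a rewritten process would
expose), the split could look something like:

from multiprocessing import Pool

def dump_page_range(id_range):
    start_id, end_id = id_range
    outfile = "enwiki-history-%09d-%09d.xml" % (start_id, end_id - 1)
    # (hypothetical) invoke the per-range dump logic here,
    # writing the revisions for this page-ID slice to outfile
    return outfile

def parallel_dump(max_page_id, workers=16, chunk=100000):
    # carve the page-ID space into contiguous half-open ranges
    ranges = [(lo, min(lo + chunk, max_page_id + 1))
              for lo in range(1, max_page_id + 1, chunk)]
    with Pool(workers) as pool:
        return pool.map(dump_page_range, ranges)

Each worker writes its own part file, so a failed range can be re-run
without redoing the rest, and the parts can be concatenated or
distributed separately.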

-Robert Rohde

[1] http://svn.wikimedia.org/viewvc/mediawiki/trunk/tools/editsyntax/
