Re: [Wikitech-l] parallel bzip2 (de)compression of the dump

Anthony Thu, 26 Mar 2009 19:22:14 -0700

On Thu, Mar 26, 2009 at 3:09 PM, Ilmari Karonen <nos...@vyznev.net> wrote:

> ERSEK Laszlo wrote:
> > ** 4. Thanassis Tsiodras' offline reader, available under
> >
> > http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
> >
> > uses, according to section "Seeking in the dump file", bzip2recover to
> > split the bzip2 blocks out of the single bzip2 stream. The page states
> >
> >       This process is fast (since it involves almost no CPU calculations
> >
> > While this may be true relative to other dump-processing operations,
> > bzip2recover is, in fact, not much more than a huge single threaded
> > bit-shifter, which even makes two passes over the dump. (IIRC, the first
> > pass shifts over the whole dump to find bzip2 block delimiters, then the
> > second pass shifts the blocks found previously into byte-aligned,
> > separate bzip2 streams.)
>
> Hmm?  Admittedly, I don't know the bzip2 format very well, but as far as
> I understand it, there should be no bit-shifting involved: each block in
> the stream is a completely independent, self-contained sequence of bytes.
>

I'm not sure about the bit-shifting, but the second pass of bzip2recover
adds in the file headers.

The thing is, the second pass of bzip2recover is unnecessary if all you want
to do is build a file index.  And the source code of the first pass of
bzip2recover is really simple.  I've managed to hack it to make a single
pass outputting the starting bytes of each block, and have been able to use
the index to dd out the block I want to access (plus a few bytes padding on
each end because I was too lazy to find out the exact location), run the
real bzip2recover on just that block, and then uncompress the recovered
block.  That could be made into a lot less of a hack if I wanted to take the
time to figure out how exactly to rebuild the header, but I haven't
bothered.
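
For illustration, a single-pass index builder along these lines might look
like the Python sketch below.  It is not the hacked bzip2recover described
above.  It only assumes the 48-bit block-start marker 0x314159265359 used by
the bzip2 format, which can begin at any bit offset, and the function name
find_block_bit_offsets is made up for this example.  It works one bit at a
time and is slow in pure Python, so it shows the idea rather than a tool for
a full dump.

```python
import bz2

BLOCK_MAGIC = 0x314159265359   # 48-bit marker that starts every bzip2 block
MASK48 = (1 << 48) - 1


def find_block_bit_offsets(data):
    """Return the bit offset (from the start of data) of each block marker."""
    offsets = []
    window = 0      # the most recent 48 bits seen
    bitpos = 0      # number of bits consumed so far
    for byte in data:
        for shift in range(7, -1, -1):          # most significant bit first
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK48
            bitpos += 1
            if bitpos >= 48 and window == BLOCK_MAGIC:
                offsets.append(bitpos - 48)
    return offsets


if __name__ == "__main__":
    sample = bz2.compress(b"hello world" * 100)
    for off in find_block_bit_offsets(sample):
        # off // 8 is the byte to seek to; off % 8 is the bit within that byte
        print(off // 8, off % 8)
```

Dividing a bit offset by 8 gives the byte position to hand to dd.  The
remainder is the bit position within that byte, which is why some padding (or
a shift) is needed when cutting a block out.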

Anyway, the point being that it's not necessary to actually run bzip2recover
and deal with millions of little files, if you build an index.  That way you
don't need twice the space, and don't need to mess with reiserfs.
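
As a rough sketch of what rebuilding a standalone stream around one indexed
block could involve, without the real bzip2recover: write a "BZh9" stream
header, copy the block's bits starting at a byte boundary, then append the
end-of-stream marker 0x177245385090 and a stream CRC.  For a stream holding a
single block, the stream CRC equals the block's own CRC, which is stored in
the 32 bits right after the block marker.  The function name extract_block and
its parameters are hypothetical, and the start and end bit offsets are
assumed to come from an index of block markers.

```python
import bz2

BLOCK_MAGIC = 0x314159265359   # start-of-block marker (48 bits)
EOS_MAGIC = 0x177245385090     # end-of-stream marker (48 bits)


def extract_block(data, start_bit, end_bit, level=b"9"):
    """Rebuild a one-block bzip2 stream from bits [start_bit, end_bit) of data.

    start_bit is where the block marker begins; end_bit is where the next
    block marker (or the end-of-stream marker) begins.
    """
    nbits = end_bit - start_bit
    first = start_bit // 8
    last = (end_bit + 7) // 8
    chunk = int.from_bytes(data[first:last], "big")
    total = (last - first) * 8
    # Drop the bits after end_bit, then mask off the bits before start_bit.
    chunk >>= total - (start_bit % 8) - nbits
    chunk &= (1 << nbits) - 1
    # The block CRC is the 32 bits that follow the 48-bit block marker.
    block_crc = (chunk >> (nbits - 80)) & 0xFFFFFFFF
    # Append the end-of-stream marker and the stream CRC (80 bits in total).
    stream = (chunk << 80) | (EOS_MAGIC << 32) | block_crc
    nbits += 80
    pad = -nbits % 8                    # zero bits up to a byte boundary
    stream <<= pad
    body = stream.to_bytes((nbits + pad) // 8, "big")
    # A level-9 header is large enough for a block written at any level.
    return b"BZh" + level + body


if __name__ == "__main__":
    payload = b"hello wikipedia " * 1000
    data = bz2.compress(payload)
    bits = "".join(f"{b:08b}" for b in data)    # fine for a small demo only
    start = bits.find(f"{BLOCK_MAGIC:048b}")
    end = bits.find(f"{EOS_MAGIC:048b}")
    print(bz2.decompress(extract_block(data, start, end)) == payload)
```

With something like this, only the index and the original dump need to be
kept on disk, and each block is rebuilt in memory when it is wanted.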
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
