Re: [Xmldatadumps-l] [Fwd: Re: possible gsoc idea, comments?]

2013-05-06 Thread Randall Farmer
To wrap up what I started earlier, here's a slightly tweaked copy of the
last script I sent around, with basic changes to make it no longer completely
unusable: it uses rzip only, the container format always uses network byte
order, and there's a header indicating file type. It also has a (slow) Python
decompressor for the rzip format, used if the rzip binary isn't installed,
which I wrote while trying to understand rzip a bit better.
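
For anyone wondering what "network byte order plus a file-type header" means
in practice, here is a minimal sketch. It is not the actual blks2.py layout:
the magic string, version field, field widths, and function names are all
made up for illustration, and zlib stands in for rzip because rzip isn't in
Python's standard library.

```python
import struct
import zlib

# Hypothetical layout, NOT the real blks2.py format.
MAGIC = b"BLKS"                      # assumed file-type marker
VERSION = 1                          # assumed format version
FILE_HEADER = struct.Struct("!4sH")  # "!" = network (big-endian) byte order
CHUNK_HEADER = struct.Struct("!II")  # raw length, compressed length


def pack_chunks(chunks):
    """Serialize an iterable of bytes chunks into one container blob."""
    out = [FILE_HEADER.pack(MAGIC, VERSION)]
    for raw in chunks:
        comp = zlib.compress(raw)    # stand-in for rzip
        out.append(CHUNK_HEADER.pack(len(raw), len(comp)))
        out.append(comp)
    return b"".join(out)


def unpack_chunks(blob):
    """Parse a blob written by pack_chunks back into a list of chunks."""
    magic, version = FILE_HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a container file (bad magic)")
    if version != VERSION:
        raise ValueError("unsupported container version")
    pos = FILE_HEADER.size
    chunks = []
    while pos < len(blob):
        raw_len, comp_len = CHUNK_HEADER.unpack_from(blob, pos)
        pos += CHUNK_HEADER.size
        raw = zlib.decompress(blob[pos:pos + comp_len])
        if len(raw) != raw_len:
            raise ValueError("chunk length mismatch")
        chunks.append(raw)
        pos += comp_len
    return chunks
```

Because every integer is packed with "!", a file written on a little-endian
machine reads back the same way on a big-endian one, and the magic bytes let
a reader reject the wrong kind of file before it tries to decompress anything.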

Just compressing chunks of rev. history like this probably isn't a great
substitute for delta coding--even if more implementation quirks were ironed
out, there still isn't a way to add revs w/out recompressing unrelated
content; also, an efficient delta coder could definitely eat less memory
than this (it would only have to keep a couple of revisions in RAM at a
time, not 10MB).
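
To make the memory point concrete, here is a rough sketch of the "only keep
the previous revision around" idea. It is not a real delta coder and not what
blks2.py does: it just primes zlib with the previous revision as a preset
dictionary (the zdict argument), so each revision is stored mostly as
references to its predecessor. The function names are hypothetical, and as
far as I know zlib only makes use of roughly the last 32 KB of the
dictionary, so this would only help with small pages.

```python
import zlib


def encode_revisions(revisions):
    """Yield one compressed blob per revision, primed with the previous one.

    Only the previous revision is held in memory at any time.
    """
    prev = b""
    for rev in revisions:
        if prev:
            comp = zlib.compressobj(zdict=prev)
        else:
            comp = zlib.compressobj()   # first (or post-empty) revision
        yield comp.compress(rev) + comp.flush()
        prev = rev


def decode_revisions(blobs):
    """Inverse of encode_revisions; must be read in the same order."""
    prev = b""
    for blob in blobs:
        if prev:
            decomp = zlib.decompressobj(zdict=prev)
        else:
            decomp = zlib.decompressobj()
        rev = decomp.decompress(blob) + decomp.flush()
        yield rev
        prev = rev


if __name__ == "__main__":
    base = b"Some article text that changes only slightly. " * 100
    revs = [base, base + b"One added sentence.", base + b"One edited sentence."]
    blobs = list(encode_revisions(revs))
    assert list(decode_revisions(blobs)) == revs
    print([len(b) for b in blobs])
```

The catch is the same one a real delta chain has: revisions have to be decoded
in order, since each one depends on the one before it.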

This does at least show that some relatively blunt efforts to efficiently
compress redundancy between revisions can save time and space (vs. bzip2),
or save time spent on compression at relatively little space cost (vs.
7zip). If anybody has questions or ideas, happy to talk or try to help.
But, all that said, declaring blks2.py a (kinda fun to work on!) dead end.
:)

Randall


blks2.py
Description: Binary data
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] [Fwd: Re: possible gsoc idea, comments?]

2013-05-06 Thread Federico Leva (Nemo)

Randall Farmer, 06/05/2013 08:37:

> To wrap up what I started earlier, here's a slightly tweaked copy of the
> last script I sent around [...]  But, all that said, declaring blks2.py a (kinda
> fun to work on!) dead end. :)


If you're done with it, you may want to drop it on a Wikimedia repo like 
https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=toys;h=0974d59e573fd5bceb76ec93878471bc11f6430c;hb=119d99131f2cf692819422ad5e516c49d935a504 
or whatever, just so that it's not only a mail attachment.
I also copied some short info you sent earlier to 
https://www.mediawiki.org/wiki/Dbzip2#rzip_and_xdelta3 for lack of 
better existing pages (?).


Nemo
