To wrap up what I started earlier, here's a slightly tweaked copy of the last script I sent around, with basic changes so it's no longer completely unusable: it uses rzip only, the container format always uses network byte order, and there's a header indicating the file type. It also includes a (slow) Python decompressor for the rzip format, used if the rzip binary isn't installed; I wrote it while trying to understand rzip a bit better.
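For anyone curious what "header plus network byte order" can look like, here is a minimal sketch. It is not the actual layout blks2.py uses: the `BLKS` magic, the version field, and the 4-byte length prefix are all made up for illustration, and the compression step is left out entirely so the framing is easy to see.

```python
import io
import struct

MAGIC = b"BLKS"  # hypothetical 4-byte file-type tag, not blks2.py's real one
VERSION = 1      # hypothetical format version


def write_container(out, blocks):
    """Write a file-type header, then each block as length + payload.

    All integers use "!" (network / big-endian) so files are portable
    across machines with different native byte orders.
    """
    out.write(MAGIC)
    out.write(struct.pack("!H", VERSION))
    for block in blocks:
        out.write(struct.pack("!I", len(block)))
        out.write(block)


def read_container(inp):
    """Check the header, then yield each block's payload in order."""
    if inp.read(4) != MAGIC:
        raise ValueError("not a container file (bad magic)")
    (version,) = struct.unpack("!H", inp.read(2))
    if version != VERSION:
        raise ValueError("unsupported version: %d" % version)
    while True:
        prefix = inp.read(4)
        if not prefix:
            return  # clean end of file
        if len(prefix) < 4:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack("!I", prefix)
        payload = inp.read(length)
        if len(payload) != length:
            raise ValueError("truncated block")
        yield payload


def roundtrip(blocks):
    """Write blocks to an in-memory file and read them back."""
    buf = io.BytesIO()
    write_container(buf, blocks)
    buf.seek(0)
    return list(read_container(buf))
```

The header lets a reader reject the wrong kind of file up front instead of feeding garbage to the decompressor, and the fixed byte order means the length fields read the same everywhere.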
Just compressing chunks of revision history like this probably isn't a great substitute for delta coding. Even if more implementation quirks were ironed out, there still wouldn't be a way to add revisions without recompressing unrelated content. Also, an efficient delta coder could definitely use less memory than this (it would only have to keep a few revisions in RAM at a time, not 10MB). This does at least show that some relatively blunt efforts to compress redundancy between revisions can save time and space (vs. bzip2), or save compression time at relatively little space cost (vs. 7zip).

If anybody has questions or ideas, I'm happy to talk or try to help. But, all that said, I'm declaring blks2.py a (kinda fun to work on!) dead end. :)

Randall

Attachment: blks2.py (binary data)