0) Context

In the XML dump files, I get <text ...>plaintext</text>.
When I build a mirror using XML dump files, I get:

old_text: plaintext
old_flags: utf-8

However, when I then create a new page on my mirror, I get:

old_text: ciphertext
old_flags: utf-8,gzip

1) Objective

When I build a mirror, I would like to compress the
<text ...>plaintext</text> to get:

old_text: ciphertext
old_flags: utf-8,gzip

I would like this done for every text revision, so as to save both disk
space and communication bandwidth between web server and browser.

2) Problem

There is little relevant documentation on <https://www.mediawiki.org>. So I
have run a few experiments.

exp1) I pipe the plaintext through gzip, escape the result for MySQL, and
build the mirror. However, when I browse, I get the message:

``The revision #165770 of the page named "Main Page" does not exist''

When I look in the database, some kind of ciphertext does indeed exist.
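
The MySQL escaping step is one place where the compressed bytes could be
altered. A minimal sketch of an alternative, assuming the import step can
emit SQL text: write the blob as a hexadecimal literal (X'...'), which needs
no escaping at all. The helper names sql_hex_literal and insert_statement are
hypothetical; the table and column names are the ones shown in this message,
and leaving any other columns to their defaults is an assumption.

```python
def sql_hex_literal(blob: bytes) -> str:
    """Render arbitrary bytes as a MySQL hexadecimal literal, X'...'."""
    return "X'" + blob.hex().upper() + "'"


def insert_statement(old_id: int, blob: bytes, flags: str) -> str:
    """Build an INSERT for the text table (hypothetical helper).

    Only old_id, old_text and old_flags are set; that any other columns
    have usable defaults is an assumption.
    """
    # Restrict flags to the characters seen in values like "utf-8,gzip",
    # so the value can be placed between single quotes without escaping.
    if not all(c.isalnum() or c in "-," for c in flags):
        raise ValueError("unexpected character in flags: %r" % flags)
    return (
        "INSERT INTO text (old_id, old_text, old_flags) "
        "VALUES (%d, %s, '%s');" % (old_id, sql_hex_literal(blob), flags)
    )


if __name__ == "__main__":
    # Bytes that would normally need escaping: NUL, quote, backslash, newline.
    print(insert_statement(1, b"\x00'\\\n", "utf-8,gzip"))
```

Because the literal contains only hex digits, the bytes that reach old_text
are exactly the bytes the compressor produced.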

3) Variants

Many utilities compress plaintext using LZ77 and Huffman coding (DEFLATE),
but they differ in the header and trailer they wrap around the compressed
stream. Raw deflate has no header at all. So I try four more experiments:

exp2) gzip, but throw away the 10-byte header (to simulate deflate)

/bin/gzip | tail -c +11

exp3) perl compress

/usr/bin/perl -MCompress::Zlib -e 'undef $/; print compress(<>)'

exp4) python compress, then throw away the single-quotes

/usr/bin/python -c "import zlib,sys;print repr(zlib.compress(sys.stdin.read()))" |
    /bin/sed 's/^.//; s/.$//'

exp5) zlib-flate from the qpdf DEB package

/usr/bin/zlib-flate -compress

For all experiments, the browser gives the same error message.
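
To make the differences between these variants concrete, here is a minimal
sketch using Python's zlib module, whose wbits parameter selects the framing:
31 gives gzip (10-byte header, 8-byte CRC-32 and length trailer), 15 gives
zlib (2-byte header, 4-byte Adler-32 trailer), and -15 gives raw deflate with
neither. Note that exp2 removes only the gzip header and leaves the 8-byte
trailer in place. Which of these framings MediaWiki expects is exactly what I
do not know; the names deflate, framed and FRAMINGS are hypothetical.

```python
import zlib

# wbits values accepted by zlib.compressobj for each framing.
FRAMINGS = {"gzip": 31, "zlib": 15, "raw": -15}


def deflate(data: bytes, wbits: int) -> bytes:
    """Compress data with DEFLATE at level 9, framed according to wbits."""
    c = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return c.compress(data) + c.flush()


def framed(data: bytes) -> dict:
    """Return the same data compressed under each of the three framings."""
    return {name: deflate(data, wbits) for name, wbits in FRAMINGS.items()}


if __name__ == "__main__":
    for name, blob in framed(b"plaintext " * 20).items():
        # Show the size and the first two bytes, which identify the framing.
        print("%-4s %3d bytes, starts with %s" % (name, len(blob), blob[:2].hex()))
```

Loading one test revision per framing into the mirror would show which, if
any, the wiki accepts.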

4) Reading compressed old_text

It should be possible to read the old_text ciphertext using command-line
tools. I created a user page, which MediaWiki stored compressed. The browser
displays it correctly, but when I tried to read it directly from the
database, there were problems.

(shell) mysql --host=localhost --user=root --password simplewiki \
    --skip-column-names --silent \
    --execute 'select old_text from simplewiki.text where old_id=5146705' \
    | zlib-flate -uncompress
Enter password:
flate: inflate: data: incorrect data check
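
Two things may be going wrong here: the mysql client can escape special
bytes and append a newline when its output is piped, and zlib-flate accepts
only the zlib framing. A minimal sketch of a probe that avoids both, assuming
the row is fetched as hex text with MySQL's HEX() function: it decodes the
hex and tries raw deflate, zlib, and gzip in turn. The function name probe is
hypothetical, and the sketch does not establish which framing MediaWiki uses.

```python
import zlib

# (label, wbits) pairs understood by zlib.decompressobj.
FRAMINGS = (("raw deflate", -15), ("zlib", 15), ("gzip", 31))


def probe(hex_text: str):
    """Return (framing, plaintext) for the first framing that inflates
    the blob completely, or None if none of them does."""
    blob = bytes.fromhex(hex_text.strip())
    for name, wbits in FRAMINGS:
        try:
            d = zlib.decompressobj(wbits)
            out = d.decompress(blob) + d.flush()
        except zlib.error:
            continue  # wrong framing or corrupt data; try the next one
        if d.eof:  # the compressed stream ended cleanly
            return name, out
    return None


if __name__ == "__main__":
    import sys
    # Expects the output of a query such as SELECT HEX(old_text) ... on stdin.
    print(probe(sys.stdin.read()))
```

For example, the query above could be changed to 'select HEX(old_text) from
simplewiki.text where old_id=5146705' and its output piped into this script.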

5) Request

Please provide documentation as to how MediaWiki handles compressed
old_text.
a) How is plaintext compressed?
b) Is the ciphertext escaped for MySQL after compression?
c) How does MediaWiki handle old_flags=utf-8,gzip?
d) How are the contents of old_text unescaped and decompressed for rendering?
e) Where in the MediaWiki code should I be looking to understand this better?
Sincerely yours,
Kent
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
