0) Context

In the XML dump files, I get <text ...>plaintext</text>. When I build a mirror from these dump files, I get:
    old_text:  plaintext
    old_flags: utf-8

However, when I then create a new page on my mirror, I get:

    old_text:  ciphertext
    old_flags: utf-8,gzip

1) Objective

When I build a mirror, I would like to compress the <text ...>plaintext</text> to get:

    old_text:  ciphertext
    old_flags: utf-8,gzip

I would like this done for every text revision, to save both disk space and communication bandwidth between web server and browser.

2) Problem

There is little relevant documentation on <https://www.mediawiki.org>, so I have run a few experiments.

exp1) I pipe the plaintext through gzip, escape it for MySQL, and build the mirror. However, when I browse, I get the message:

    The revision #165770 of the page named "Main Page" does not exist.

When I look in the database, some kind of ciphertext does indeed exist.

3) Variants

Many utilities compress plaintext using LZ77 and Huffman coding, but they differ in their stream header and trailer. Some DEFLATE framings have no header at all. So I tried four more experiments:

exp2) gzip, but discard the 10-byte header (to simulate raw deflate):

    /bin/gzip | tail -c +11

exp3) Perl compress:

    /usr/bin/perl -MCompress::Zlib -e 'undef $/; print compress(<>)'

exp4) Python compress, then strip the surrounding single quotes:

    /usr/bin/python -c "import zlib,sys;print repr(zlib.compress(sys.stdin.read()))" | /bin/sed 's/^.//; s/.$//'

exp5) zlib-flate from the qpdf DEB package:

    /usr/bin/zlib-flate -compress

For all experiments, the browser gives the same error message.

4) Reading compressed old_text

It should be possible to read the old_text ciphertext with command-line tools. I created a user page, which MediaWiki stored compressed; it is displayed correctly by the browser. But when I tried to read it directly from the database, there were problems.
(shell)

    mysql --host=localhost --user=root --password simplewiki \
        --skip-column-names --silent \
        --execute 'select old_text from simplewiki.text where old_id=5146705' \
        | zlib-flate -uncompress
    Enter password:
    flate: inflate: data: incorrect data check

5) Request

Please provide documentation on how MediaWiki handles compressed old_text:

a) How is the plaintext compressed?
b) Is the ciphertext escaped for MySQL after compression?
c) How does MediaWiki handle old_flags=utf-8,gzip?
d) How are the contents of old_text unescaped and decompressed for rendering?
e) Where in the MediaWiki code should I be looking to understand this better?

Sincerely yours,
Kent

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l