On May 20, 8:24 am, [EMAIL PROTECTED] wrote:
> bearophile:
>
> > So you need to store only this 11 byte long string to be able to
> > decompress it.
>
> Note that maybe there is a header, that may contain changing things,
> like the length of the compressed text, etc.
>
> Bye,
> bearophile

I've read that military texts have different letter frequencies than standard English. If you use a non-QWERTY keyboard layout, that may change your frequency distribution too.

I worry that this is an impolite question, so I lean on peers for backup: will you be adding further entries to the zipped list? Will you rewrite the entire file on each update, or just append bytes?

If your sequence is 'ab, ab, ab, cd, ab', you might be at:

    00010

Add 'cd' again and you're at:

    000101

You didn't have to re-output the contents. But if you add 'bc', you have:

    0001012

which isn't binary any more. So you're left re-encoding every entry with a wider code:

    000 000 000 001 000 001 010

But five more and the byte overflows.

I'd say pickle the corpus, with new additions, and re-zip the entire contents each time.

Would you like to break across (coughdisksectorscough) multiple files, say, a different corpus-compression file pair for every thousand entries?
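
For what it's worth, a minimal sketch of the pickle-and-re-zip idea might look like the following. It assumes the corpus is a plain list of strings and uses zlib as a stand-in for whatever compressor you actually have in mind. The names pack, unpack and append_and_repack are made up for illustration.

```python
import pickle
import zlib


def pack(entries):
    """Pickle the whole corpus and compress it in one go."""
    return zlib.compress(pickle.dumps(list(entries), protocol=pickle.HIGHEST_PROTOCOL))


def unpack(blob):
    """Reverse of pack(): decompress, then unpickle back to a list."""
    return pickle.loads(zlib.decompress(blob))


def append_and_repack(blob, new_entries):
    """Add entries by rebuilding the full corpus and re-compressing all of it.

    Nothing is appended to the old compressed bytes; the old blob is
    simply replaced by the new one.
    """
    entries = unpack(blob)
    entries.extend(new_entries)
    return pack(entries)


if __name__ == "__main__":
    blob = pack(['ab', 'ab', 'ab', 'cd', 'ab'])
    blob = append_and_repack(blob, ['cd', 'bc'])
    print(unpack(blob))
```

Each update costs a full decompress and recompress, which is the price of never having to worry about code widths changing under you. Writing the returned bytes to a file opened in 'wb' mode gives the on-disk version.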