Re: compressing short strings?

castironpi Tue, 20 May 2008 07:00:47 -0700

On May 20, 8:24 am, [EMAIL PROTECTED] wrote:
> bearophile:
>
> > So you need to store only this 11 byte long string to be able to
> > decompress it.
>
> Note that maybe there is a header, that may contain changing things,
> like the length of the compressed text, etc.
>
> Bye,
> bearophile


I've read that military texts contain different letter frequencies
than standard English.  If you use a non-QWERTY keyset, it may change
your frequency distribution also.

I worry that this is an impolite question, so I'll lean on my peers
for backup:

Will you be adding further entries to the zipped list?

Will you be rewriting the entire file upon update, or just appending
bytes?

If your sequence is 'ab, ab, ab, cd, ab', you might be at:

00010.

Add 'cd' again and you're at:

000101.

You didn't have to re-output the contents.

But, if you add 'bc', you have:

0001012, which is no longer binary: a third distinct entry doesn't fit
in a one-bit code.  So you widen every code, and you're left at:

000 000 000 001 000 001 010

But a 3-bit code only has room for eight distinct entries: five more
new ones fill it, and the one after that overflows.
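
A rough sketch of that bookkeeping, in case it helps.  The names
encode and code_width are made up for illustration, and the sketch
assumes each distinct entry gets an index in first-seen order and every
index is written at one fixed width.  It picks the smallest width that
fits, so three distinct entries come out as 2-bit codes rather than the
3-bit groups above:

```python
def code_width(n_symbols):
    """Bits per entry for a fixed-width code over n_symbols distinct entries."""
    return max(1, (n_symbols - 1).bit_length())


def encode(entries):
    """Return (codes, width): one fixed-width binary string per entry."""
    table = {}
    for entry in entries:
        # First-seen order decides the index: 'ab' -> 0, 'cd' -> 1, ...
        table.setdefault(entry, len(table))
    width = code_width(len(table))
    codes = [format(table[entry], "0{}b".format(width)) for entry in entries]
    return codes, width


seq = ["ab", "ab", "ab", "cd", "ab"]

codes, width = encode(seq)
print("".join(codes))            # 00010

codes, width = encode(seq + ["cd"])
print("".join(codes))            # 000101

# A third distinct entry forces a wider code, so everything already
# written has to be re-emitted at the new width.
codes, width = encode(seq + ["cd", "bc"])
print(" ".join(codes))           # 00 00 00 01 00 01 10
```

Appending is only cheap while the number of distinct entries stays
within the current width; code_width(8) is 3 and code_width(9) is 4, so
the ninth distinct entry triggers another full rewrite.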

I'd say pickle the corpus, with new additions, and re-zip the entire
contents each time.  Would you like to break across
(coughdisksectorscough) multiple files, say, a different corpus-
compression file pair for every thousand entries?
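
Something like the following is what I have in mind for the
pickle-and-re-zip route.  It is only a sketch: save_corpus, load_corpus,
add_entry and chunk_path are hypothetical helper names, zlib is just
one choice of compressor, and the one-file-per-thousand-entries naming
scheme is an assumption, not anything from your setup.

```python
import pickle
import zlib


def save_corpus(path, entries):
    """Pickle the whole list and write it out as one zlib-compressed blob."""
    blob = zlib.compress(pickle.dumps(entries, pickle.HIGHEST_PROTOCOL))
    with open(path, "wb") as f:
        f.write(blob)


def load_corpus(path):
    """Read the blob back and unpickle it into a list."""
    with open(path, "rb") as f:
        return pickle.loads(zlib.decompress(f.read()))


def add_entry(path, entry):
    """Load the corpus, append one entry, and rewrite the entire file."""
    try:
        entries = load_corpus(path)
    except FileNotFoundError:
        entries = []          # first entry: start a fresh corpus
    entries.append(entry)
    save_corpus(path, entries)
    return entries


def chunk_path(base, entry_number, per_file=1000):
    """Hypothetical naming scheme: one file per block of per_file entries."""
    return "{}.{:04d}.z".format(base, entry_number // per_file)
```

Calling add_entry(chunk_path("corpus", n), text) for the n-th entry
would keep each rewrite bounded to at most a thousand entries.  Only
unpickle files you wrote yourself, since pickle.loads will execute
whatever a hostile file tells it to.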
--
http://mail.python.org/mailman/listinfo/python-list