Paul Rubin <http://[EMAIL PROTECTED]> writes:
> I have a lot of short English strings I'd like to compress in order to
> reduce the size of a database. That is, I'd like a compression
> function that takes a string like (for example) "George Washington"
> and returns a shorter string, with luck maybe 6 bytes or so. One
> obvious idea is to take the gzip function, compress some large text
> corpus with it in streaming mode and throw away the output (but
> setting up the internal state to model the statistics of English
> text), then put in "George Washington" and treat the additional output
> as the compressed string. Obviously to get reasonable speed there
> would have to be a way to save the internal state after initializing
> from the corpus.
>
> Anyone know if this has been done and if there's code around for it?
> Maybe I'm better off with freezing a dynamic Markov model? I think
> there's DMM code around but am not sure where to look.
>
> Thanks.

An out-of-the-blue idea, which is probably rubbish: create a tree of
your words, look at it as a Deterministic Finite Automaton (DFA),
minimise it, and you get what some call a DAWG. That works very well
when you have lots of small words.

E.g. for the words aabc, babc, bbbc, the minimal DFA is:

1 -a-> 2 -a-> 4 -b-> 5 -c-> 6
 \            ^
  b-> 3 -a----+
       \     /
        b---+

To compress a word, simply find its position in alphabetical order,
and call that its 'path'. The 'decompression algorithm' is then (in
pseudo-python):

def decompress(DFA, path):
    letters = []
    state = DFA.initial_state
    while not state.is_final():
        transitions = state.transitions
        path, choice = divmod(path, len(transitions))
        state = transitions[choice].new_state
        letters.append(transitions[choice].letter)
    return ''.join(letters)

I'm not too sure about this :)
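On reflection, I don't think the divmod trick quite works: it consumes
the path least-significant digit first, so unless every state has the
same number of transitions the decoded word won't match the alphabetical
rank (with the three words above, bbbc would decode from path 3, not 2,
and path 2 decodes wrongly). Storing a word count at each state and
ranking with cumulative counts does give exactly the alphabetical rank.
Here is a runnable sketch of that idea; all the names (Node, build_trie,
compress, decompress) are made up, and it uses a plain trie rather than
a minimised DAWG, but the ranking logic carries over:

class Node:
    def __init__(self):
        self.children = {}      # letter -> Node
        self.final = False      # does a word end here?
        self.count = 0          # number of words in this subtree

def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for letter in word:
            node = node.children.setdefault(letter, Node())
        node.final = True
    count_words(root)
    return root

def count_words(node):
    node.count = (1 if node.final else 0) + sum(
        count_words(child) for child in node.children.values())
    return node.count

def compress(root, word):
    # The compressed form of a word is its rank in the alphabetical
    # order of all words in the trie.
    path, node = 0, root
    for letter in word:
        if node.final:
            path += 1           # a word ending here sorts before us
        for other in sorted(node.children):
            if other == letter:
                break
            path += node.children[other].count
        node = node.children[letter]
    return path

def decompress(root, path):
    letters, node = [], root
    while True:
        if node.final:
            if path == 0:
                return ''.join(letters)
            path -= 1
        for letter in sorted(node.children):
            child = node.children[letter]
            if path < child.count:
                letters.append(letter)
                node = child
                break
            path -= child.count

words = ["aabc", "babc", "bbbc"]
root = build_trie(words)
for i, word in enumerate(words):
    assert compress(root, word) == i
    assert decompress(root, i) == word

Since the rank of one of N strings fits in about log2(N) bits, a million
known phrases compress to under three bytes each, though of course this
only works for strings that are already in the automaton.

--
Arnaud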