--- On Sat, 9/6/08, Pei Wang <[EMAIL PROTECTED]> wrote:

> As for "compression", yes every intelligent
> system needs to 'compress'
> its experience in the sense of "keeping the essence
> but using less
> space". However, it is clearly not loseless. It is
> even not what we
> usually call "loosy compression", because what to
> keep and in what
> form is highly context-sensitive. Consequently, this
> process is not
> reversible --- no decompression, though the result can be
> applied in
> various ways. Therefore I prefer not to call it compression
> to avoid
> confusing this process with the technical sense of
> "compression",
> which is reversible, at least approximately.

I think you misunderstand my use of compression. The goal is modeling or 
prediction: given a string, predict the next symbol. I use compression to 
estimate how accurate the model is. It is easy to show that if your model is 
accurate, then connecting it to an ideal coder (such as an arithmetic coder) 
gives optimal compression. You could actually skip the coding step, but it is 
cheap, so I include it so that there is no question of a mistake in the 
measurement: if a bug in the coder produces an output that is too small, the 
decompression step won't reproduce the original file.
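To make that measurement concrete, here is a rough Python sketch. It is my own 
illustration, not code from any actual compressor; the order-3 byte model and 
the file name are made up. An ideal arithmetic coder driven by a model that 
assigns probability p to each observed symbol outputs about -log2(p) bits per 
symbol, so the ideal compressed size is just the sum of those terms:

import math
from collections import defaultdict

class OrderNModel:
    """Toy adaptive order-N byte model with add-one smoothing.
    Purely illustrative; real compressors mix many such models."""
    def __init__(self, order=3):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))

    def predict_and_update(self, context, symbol):
        ctx = bytes(context[-self.order:])
        seen = self.counts[ctx]
        p = (seen[symbol] + 1) / (sum(seen.values()) + 256)  # add-one over 256 byte values
        seen[symbol] += 1
        return p

def ideal_compressed_bits(data, model):
    """Sum of -log2 P(next symbol): what an ideal arithmetic coder
    driven by `model` would output, ignoring rounding overhead."""
    bits = 0.0
    for i, symbol in enumerate(data):
        p = model.predict_and_update(data[max(0, i - model.order):i], symbol)
        bits -= math.log2(p)
    return bits

data = open("sample.txt", "rb").read()   # hypothetical test file
print(ideal_compressed_bits(data, OrderNModel(order=3)) / 8, "bytes (ideal)")

Skipping the coder, as above, measures the same quantity; running a real coder 
and then decompressing simply guards against bugs in the measurement.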

In fact, many speech recognition experiments do skip the coding step in their 
tests and merely calculate what the compressed size would be. (More precisely, 
they calculate word perplexity, which is an equivalent measure.) The goal of 
speech recognition is to find the text y that maximizes P(y|x) for utterance x. 
It is common to factor the model using Bayes' law: P(y|x) = P(x|y)P(y)/P(x). We 
can drop P(x) since it is constant, leaving the acoustic model P(x|y) and the 
language model P(y) to evaluate. We know from experiments that compression 
tests on P(y) correlate well with word error rates for the overall system.
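For anyone who wants the exact relationship: perplexity is two to the power of 
the cross-entropy in bits per word, so the two measures are interchangeable. A 
small Python illustration with made-up numbers (not from any real experiment):

import math

# Hypothetical per-word probabilities P(w_i | w_1 ... w_{i-1}) assigned by
# some language model P(y); the values are invented for illustration only.
word_probs = [0.05, 0.20, 0.01, 0.10, 0.08]

bits = sum(-math.log2(p) for p in word_probs)   # ideal compressed size in bits
bits_per_word = bits / len(word_probs)
perplexity = 2 ** bits_per_word                 # equivalent word perplexity

print(f"{bits:.1f} bits total, {bits_per_word:.2f} bits/word, perplexity {perplexity:.1f}")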

Internally, all lossless compressors use lossy compression or data reduction to 
make predictions. Most commonly, a context is truncated and possibly hashed 
before looking up the statistics for the next symbol. The top lossless 
compressors in my benchmark use more sophisticated forms of data reduction, 
such as mapping upper and lower case letters together, or mapping groups of 
semantically or syntactically related words to the same context.
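As a concrete (and much simplified) picture of that truncate-and-hash step, 
here is a sketch in Python. It is my own illustration of the general technique, 
not code from any compressor on the benchmark; the names and table size are 
invented. It predicts one bit at a time, the way most strong context-mixing 
compressors do: the last few bytes of history are hashed into a fixed-size 
table, and the counts stored there give the next-bit probability. Distinct 
contexts can collide in the table, which is exactly the lossy data reduction 
being described.

TABLE_SIZE = 1 << 16                            # illustrative bucket count
counts = [[0, 0] for _ in range(TABLE_SIZE)]    # [zero count, one count] per bucket

def context_hash(history, order=4):
    """Truncate history to its last `order` bytes, then hash into a
    table index. Collisions merge unrelated contexts (lossy)."""
    h = 0
    for b in history[-order:]:
        h = (h * 271 + b) & (TABLE_SIZE - 1)
    return h

def predict_bit(history):
    """P(next bit = 1) given the hashed, truncated context."""
    zeros, ones = counts[context_hash(history)]
    return (ones + 1) / (zeros + ones + 2)      # Laplace-smoothed estimate

def update(history, bit):
    """Record the observed bit under the same hashed context."""
    counts[context_hash(history)][bit] += 1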

As a test, lossless compression is only appropriate for text. For other hard AI 
problems such as vision, art, and music, incompressible noise would overwhelm 
the human-perceptible signal. Theoretically you could compress video to 2 bits 
per second (the rate of human long-term memory) by encoding it as a script. The 
decompressor would read the script and create a new movie. The proper test 
would be lossy compression, but that requires human judgment to evaluate how 
well the reconstructed data matches the original.


-- Matt Mahoney, [EMAIL PROTECTED]



