Re: Language modeling (was Re: [agi] draft for comment)

2008-09-05 Thread Pei Wang
On Fri, Sep 5, 2008 at 11:15 AM, Matt Mahoney <[EMAIL PROTECTED]> wrote:
> --- On Thu, 9/4/08, Pei Wang <[EMAIL PROTECTED]> wrote:
>
>> I guess you still see NARS as using model-theoretic
>> semantics, so you
>> call it "symbolic" and contrast it with system
>> with sensors. This is
>> not correct --- see
>> http://nars.wang.googlepages.com/wang.semantics.pdf and
>> http://nars.wang.googlepages.com/wang.AI_Misconceptions.pdf
>
> I mean NARS is symbolic in the sense that you write statements in Narsese 
> like "raven -> bird <0.97, 0.92>" (probability=0.97, confidence=0.92). I 
> realize that the meanings of "raven" and "bird" are determined by their 
> relations to other symbols in the knowledge base and that the probability and 
> confidence change with experience. But in practice you are still going to 
> write statements like this because it is the easiest way to build the 
> knowledge base.

Yes.

> You aren't going to specify the brightness of millions of pixels in a vision 
> system in Narsese, and there is no mechanism I am aware of to collect this 
> knowledge from a natural language text corpus.

Of course not. To have visual experience, there must be a device to
convert visual signals into an internal representation in Narsese. I
never suggested otherwise.

> There is no mechanism to add new symbols to the knowledge base through 
> experience. You have to explicitly add them.

"New symbols" either come from the outside in experience (experience
can be verbal), or composed by the concept-formation rules from
existing ones. The latter case is explained in my book.

> Natural language has evolved to be learnable on a massively parallel network 
> of slow computing elements. This should be apparent when we compare 
> successful language models with unsuccessful ones. Artificial language models 
> usually consist of tokenization, parsing, and semantic analysis phases. This 
> does not work on natural language because artificial languages have precise 
> specifications and natural languages do not.

It depends on which aspect of the language you are talking about.
Narsese has "precise specifications" in its syntax, but the meaning of
the terms is a function of experience, and changes from time to time.

> No two humans use exactly the same language, nor does the same human at two 
> points in time. Rather, language is learnable by example, so that each 
> message causes the language of the receiver to be a little more like that of 
> the sender.

The same is true in NARS --- if two implementations of NARS have
different experience, they will disagree on the meaning of a term. When
they begin to learn natural language, the same will hold for grammar.
Since I haven't done any concrete NLP yet, I don't expect you to
believe me on the second point, but you cannot rule out that
possibility just because no traditional system can do it.

> Children learn semantics before syntax, which is the opposite order from 
> which you would write an artificial language interpreter.

NARS indeed can learn semantics before syntax --- see
http://nars.wang.googlepages.com/wang.roadmap.pdf

I won't comment on the following detailed statements, since I agree
with your criticism of the traditional processing of formal languages,
but that is not how NARS handles language. Don't think of NARS as
another Cyc just because both use a "formal language". The same "ravens
are birds" statement is treated very differently in the two systems.

Pei


> An example of a successful language model is a search engine. We know that 
> most of the meaning of a text document depends only on the words it contains, 
> ignoring word order. A search engine matches the semantics of the query with 
> the semantics of a document mostly by matching words, but also by matching 
> semantically related words like "water" to "wet".
>
> Here is an example of a computationally intensive but biologically plausible 
> language model. A semantic model is a word-word matrix A such that A_ij is 
> the degree to which words i and j are related, which you can think of as the 
> probability of finding i and j together in a sliding window over a huge text 
> corpus. However, semantic relatedness is a fuzzy identity relation, meaning 
> it is reflexive, commutative, and transitive. If i is related to j and j to 
> k, then i is related to k. Deriving transitive relations in A, also known as 
> latent semantic analysis, is performed by singular value decomposition, 
> factoring A = USV where S is diagonal, then discarding the small terms of S, 
> which has the effect of lossy compression. Typically, A has about 10^6 
> elements and we keep only a few hundred elements of S. Fortunately there is a 
> parallel algorithm that incrementally updates the matrices as the system 
> learns: a 3 layer neural network where S is the hidden layer
>  (which can grow) and U and V are weight matrices. [1].
>
> Traditional language processing has failed because the task of converting 
> natural languag
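
For concreteness, here is a minimal sketch of the SVD step in the
semantic model quoted above, assuming numpy, a toy corpus, and an
illustrative window size (none of which come from the original post):

    import numpy as np

    # Toy corpus; the quoted model would instead use counts from a huge corpus.
    corpus = ("cats drink water dogs drink water water is wet "
              "rocks are dry sand is dry").split()
    words = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(words)}

    # A_ij = co-occurrence count of words i and j in a sliding window.
    window = 2
    A = np.zeros((len(words), len(words)))
    for i, w in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if i != j:
                A[idx[w], idx[corpus[j]]] += 1.0

    # Latent semantic analysis: factor A = U S V, keep only the k largest
    # singular values (lossy compression), and reconstruct.
    U, S, Vt = np.linalg.svd(A)
    k = 2
    A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

    # The direct entry for (cats, dogs) is zero -- they never share a window --
    # but the rank-k approximation typically makes it positive, because both
    # words co-occur with "drink" and "water": the transitivity described above.
    print(A[idx["cats"], idx["dogs"]], A_k[idx["cats"], idx["dogs"]])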

Re: Language modeling (was Re: [agi] draft for comment)

2008-09-05 Thread Matt Mahoney
--- On Fri, 9/5/08, Pei Wang <[EMAIL PROTECTED]> wrote:

> NARS indeed can learn semantics before syntax --- see
> http://nars.wang.googlepages.com/wang.roadmap.pdf

Yes, I see this corrects many of the problems with Cyc and with traditional 
language models. I didn't see a description of a mechanism for learning new 
terms in your other paper. Clearly this could be added, although I believe it 
should be a statistical process.

I am interested in determining the computational cost of language modeling. The 
evidence I have so far is that it is high. I believe the algorithmic complexity 
of a model is on the order of 10^9 bits. This is consistent with Turing's 1950 
prediction that AI would require this much memory, with Landauer's estimate of 
human long-term memory, and with roughly how much language a person processes 
by adulthood, assuming an information content of 1 bit per character as Shannon 
estimated in 1950. This is why I use a 1 GB data set in my compression benchmark.

However, there is a three-way tradeoff between CPU speed, memory, and model 
accuracy (as measured by compression ratio). I added two graphs to my benchmark 
at http://cs.fit.edu/~mmahoney/compression/text.html (below the main table) 
which show this clearly. In particular, the size-memory tradeoff is an almost 
perfectly straight line (with memory on a log scale) over tests of 104 
compressors. These tests suggest to me that CPU and memory are indeed 
bottlenecks to language modeling. The best models in my tests use simple 
semantic and grammatical models, well below adult human level. The three top 
programs on the memory graph map words to tokens using dictionaries that group 
semantically and syntactically related words together, but only one 
(paq8hp12any) uses a semantic space of more than one dimension. All have large 
vocabularies, although not implausibly large for an educated person. Other top 
programs like nanozipltcb and WinRK use smaller dictionaries and strictly 
lexical models. Lesser programs model only at the n-gram level.

I don't yet have an answer to my question, but I believe efficient human-level 
NLP will require hundreds of GB or perhaps 1 TB of memory. The slowest programs 
are already faster than real time, given that equivalent learning in humans 
would take over a decade. I think you could use existing hardware in a 
speed-memory tradeoff to get real time NLP, but it would not be practical for 
doing experiments where each source code change requires training the model 
from scratch. Model development typically requires thousands of tests.


-- Matt Mahoney, [EMAIL PROTECTED]





Re: Language modeling (was Re: [agi] draft for comment)

2008-09-05 Thread Pei Wang
On Fri, Sep 5, 2008 at 6:15 PM, Matt Mahoney <[EMAIL PROTECTED]> wrote:
> --- On Fri, 9/5/08, Pei Wang <[EMAIL PROTECTED]> wrote:
>
>> NARS indeed can learn semantics before syntax --- see
>> http://nars.wang.googlepages.com/wang.roadmap.pdf
>
> Yes, I see this corrects many of the problems with Cyc and with traditional 
> language models. I didn't see a description of a mechanism for learning new 
> terms in your other paper. Clearly this could be added, although I believe it 
> should be a statistical process.

I don't have a separate paper on term composition, so you'd have to
read my book. It is indeed a statistical process, in the sense that
most of the composed terms won't be useful, so they will gradually be
forgotten. Only the "useful patterns" will be kept for a long time in
the form of compound terms.

> I am interested in determining the computational cost of language modeling. 
> The evidence I have so far is that it is high. I believe the algorithmic 
> complexity of a model is 10^9 bits. This is consistent with Turing's 1950 
> prediction that AI would require this much memory, with Landauer's estimate 
> of human long term memory, and is about how much language a person processes 
> by adulthood assuming an information content of 1 bit per character as 
> Shannon estimated in 1950. This is why I use a 1 GB data set in my 
> compression benchmark.

I see your point, though I think analyzing this problem in terms of
computational complexity is not the right way to go, because this
process does not follow a predetermined algorithm. Instead, language
learning is an incremental process, without a well-defined beginning
and ending.

> However there is a 3 way tradeoff between CPU speed, memory, and model 
> accuracy (as measured by compression ratio). I added two graphs to my 
> benchmark at http://cs.fit.edu/~mmahoney/compression/text.html (below the 
> main table) which shows this clearly. In particular the size-memory tradeoff 
> is an almost perfectly straight line (with memory on a log scale) over tests 
> of 104 compressors. These tests suggest to me that CPU and memory are indeed 
> bottlenecks to language modeling. The best models in my tests use simple 
> semantic and grammatical models, well below adult human level. The 3 top 
> programs on the memory graph map words to tokens using dictionaries that 
> group semantically and syntactically related words together, but only one 
> (paq8hp12any) uses a semantic space of more than one dimension. All have 
> large vocabularies, although not implausibly large for an educated person. 
> Other top programs like nanozipltcb and WinRK use smaller dictionaries and
>  strictly lexical models. Lesser programs model only at the n-gram level.

As with many existing AI works, my disagreement with you is not so
much about the solution you proposed (I can see its value), but about
the problem you specified as the goal of AI. For example, I have no
doubt about the theoretical and practical value of compression, but I
don't think it has much to do with intelligence. I don't think this
kind of issue can be efficiently handled by an email discussion like
this one. I've been thinking about writing a paper comparing my ideas
with those represented by AIXI, which are closely related to yours,
though this project hasn't had enough priority on my to-do list.
Hopefully I'll find the time to make myself clear on this topic.

> I don't yet have an answer to my question, but I believe efficient 
> human-level NLP will require hundreds of GB or perhaps 1 TB of memory. The 
> slowest programs are already faster than real time, given that equivalent 
> learning in humans would take over a decade. I think you could use existing 
> hardware in a speed-memory tradeoff to get real time NLP, but it would not be 
> practical for doing experiments where each source code change requires 
> training the model from scratch. Model development typically requires 
> thousands of tests.

I guess we are exploring very different paths in NLP, and it is still
too early to tell which one will do better.

Pei




Re: Language modeling (was Re: [agi] draft for comment)

2008-09-05 Thread Matt Mahoney
--- On Fri, 9/5/08, Pei Wang <[EMAIL PROTECTED]> wrote:

> Like to many existing AI works, my disagreement with you is
> not that
> much on the solution you proposed (I can see the value),
> but on the
> problem you specified as the goal of AI. For example, I
> have no doubt
> about the theoretical and practical values of compression,
> but don't
> think it has much to do with intelligence.

In http://cs.fit.edu/~mmahoney/compression/rationale.html I explain why text 
compression is an AI problem. To summarize, if you know the probability 
distribution of text, then you can compute P(A|Q) for any question Q and answer 
A, which is what you need to pass the Turing test. Compression allows you to 
precisely measure the accuracy of your estimate of P. Compression (actually, 
word perplexity) has been used since the early 1990s to measure the quality of 
language models for speech recognition, since it correlates well with word 
error rate.
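
As a minimal sketch of how a text distribution yields P(A|Q) by the chain 
rule (the uniform p_next below is only a placeholder model, not anything from 
the benchmark):

    import math

    def p_next(context, ch):
        # Placeholder model: uniform over a tiny alphabet. A real language
        # model would condition on `context`; this is only a stand-in.
        alphabet = "abcdefghijklmnopqrstuvwxyz ?."
        return 1.0 / len(alphabet) if ch in alphabet else 1e-6

    def p_text(text):
        # Chain rule: P(c1..cn) = product over i of P(c_i | c_1..c_{i-1}).
        p = 1.0
        for i, ch in enumerate(text):
            p *= p_next(text[:i], ch)
        return p

    def p_answer_given_question(q, a):
        # P(A|Q) = P(Q followed by A) / P(Q).
        return p_text(q + a) / p_text(q)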

The purpose of this work is not to solve general intelligence, such as the 
universal intelligence proposed by Legg and Hutter [1]. That is not computable, 
so you have to make some arbitrary choice of test environments, i.e. of which 
problems you are going to solve. I believe the goal of AGI should be to do 
useful work for humans, so I am making a not-so-arbitrary choice to solve a 
problem that is central to what most people regard as useful intelligence.

I had hoped that my work would lead to an elegant theory of AI, but that hasn't 
been the case. Rather, the best compression programs were developed as a series 
of thousands of hacks and tweaks, e.g. change a 4 to a 5 because it gives 
0.002% better compression on the benchmark. The result is an opaque mess. I 
guess I should have seen it coming, since it is predicted by information theory 
(e.g. [2]).

Nevertheless the architectures of the best text compressors are consistent with 
cognitive development models, i.e. phoneme (or letter) sequences -> lexical -> 
semantics -> syntax, which are themselves consistent with layered neural 
architectures. I already described a neural semantic model in my last post. I 
also did work supporting Hutchens and Alder showing that lexical models can be 
learned from n-gram statistics, consistent with the observation that babies 
learn the rules for segmenting continuous speech before they learn any words 
[3].

I agree it should also be clear that semantics is learned before grammar, 
contrary to the way artificial languages are processed. Grammar requires 
semantics, but not the other way around. Search engines work using semantics 
only. Yet we cannot parse sentences like "I ate pizza with Bob", "I ate pizza 
with pepperoni", "I ate pizza with chopsticks", without semantics.

My benchmark does not prove that there aren't better language models, but it is 
strong evidence. It represents the work of about 100 researchers who have tried 
and failed to find more accurate, faster, or less memory-intensive models. The 
resource requirements seem to increase as we go up the chain from n-grams to 
grammar, contrary to what symbolic approaches assume. This is my argument for 
why I think AI is limited by lack of hardware, not lack of theory.

1. Legg, Shane, and Marcus Hutter (2006), A Formal Measure of Machine 
Intelligence, Proc. Annual machine learning conference of Belgium and The 
Netherlands (Benelearn-2006). Ghent, 2006.  
http://www.vetta.org/documents/ui_benelearn.pdf

2. Legg, Shane, (2006), Is There an Elegant Universal Theory of Prediction?,  
Technical Report IDSIA-12-06, IDSIA / USI-SUPSI, Dalle Molle Institute for 
Artificial Intelligence, Galleria 2, 6928 Manno, Switzerland.
http://www.vetta.org/documents/IDSIA-12-06-1.pdf

3. M. Mahoney (2000), A Note on Lexical Acquisition in Text without Spaces, 
http://cs.fit.edu/~mmahoney/dissertation/lex1.html


-- Matt Mahoney, [EMAIL PROTECTED]





Re: Language modeling (was Re: [agi] draft for comment)

2008-09-05 Thread Pei Wang
Matt,

Thanks for taking the time to explain your ideas in detail. As I said,
our different opinions on how to do AI come from our very different
understandings of "intelligence". I don't take "passing the Turing
Test" as my research goal (as explained in
http://nars.wang.googlepages.com/wang.logic_intelligence.pdf and
http://nars.wang.googlepages.com/wang.AI_Definitions.pdf). I disagree
with Hutter's approach, not because his SOLUTION is not computable,
but because his PROBLEM is too idealized and simplified to be relevant
to the actual problems of AI.

Even so, I'm glad that we can still agree on some things, such as that
semantics comes before syntax. In my plan for NLP, there won't be
separate 'parsing' and 'semantic mapping' stages. I'll say more when I
have concrete results to share.

Pei

On Fri, Sep 5, 2008 at 8:39 PM, Matt Mahoney <[EMAIL PROTECTED]> wrote:
> --- On Fri, 9/5/08, Pei Wang <[EMAIL PROTECTED]> wrote:
>
>> Like to many existing AI works, my disagreement with you is
>> not that
>> much on the solution you proposed (I can see the value),
>> but on the
>> problem you specified as the goal of AI. For example, I
>> have no doubt
>> about the theoretical and practical values of compression,
>> but don't
>> think it has much to do with intelligence.
>
> In http://cs.fit.edu/~mmahoney/compression/rationale.html I explain why text 
> compression is an AI problem. To summarize, if you know the probability 
> distribution of text, then you can compute P(A|Q) for any question Q and 
> answer A to pass the Turing test. Compression allows you to precisely measure 
> the accuracy of your estimate of P. Compression (actually, word perplexity) 
> has been used since the early 1990's to measure the quality of language 
> models for speech recognition, since it correlates well with word error rate.
>
> The purpose of this work is not to solve general intelligence, such as the 
> universal intelligence proposed by Legg and Hutter [1]. That is not 
> computable, so you have to make some arbitrary choice with regard to test 
> environments about what problems you are going to solve. I believe the goal 
> of AGI should be to do useful work for humans, so I am making a not so 
> arbitrary choice to solve a problem that is central to what most people 
> regard as useful intelligence.
>
> I had hoped that my work would lead to an elegant theory of AI, but that 
> hasn't been the case. Rather, the best compression programs were developed as 
> a series of thousands of hacks and tweaks, e.g. change a 4 to a 5 because it 
> gives 0.002% better compression on the benchmark. The result is an opaque 
> mess. I guess I should have seen it coming, since it is predicted by 
> information theory (e.g. [2]).
>
> Nevertheless the architectures of the best text compressors are consistent 
> with cognitive development models, i.e. phoneme (or letter) sequences -> 
> lexical -> semantics -> syntax, which are themselves consistent with layered 
> neural architectures. I already described a neural semantic model in my last 
> post. I also did work supporting Hutchens and Alder showing that lexical 
> models can be learned from n-gram statistics, consistent with the observation 
> that babies learn the rules for segmenting continuous speech before they 
> learn any words [3].
>
> I agree it should also be clear that semantics is learned before grammar, 
> contrary to the way artificial languages are processed. Grammar requires 
> semantics, but not the other way around. Search engines work using semantics 
> only. Yet we cannot parse sentences like "I ate pizza with Bob", "I ate pizza 
> with pepperoni", "I ate pizza with chopsticks", without semantics.
>
> My benchmark does not prove that there aren't better language models, but it 
> is strong evidence. It represents the work of about 100 researchers who have 
> tried and failed to find more accurate, faster, or less memory intensive 
> models. The resource requirements seem to increase as we go up the chain from 
> n-grams to grammar, contrary to symbolic approaches. This is my argument why 
> I think AI is bound by lack of hardware, not lack of theory.
>
> 1. Legg, Shane, and Marcus Hutter (2006), A Formal Measure of Machine 
> Intelligence, Proc. Annual machine learning conference of Belgium and The 
> Netherlands (Benelearn-2006). Ghent, 2006.  
> http://www.vetta.org/documents/ui_benelearn.pdf
>
> 2. Legg, Shane, (2006), Is There an Elegant Universal Theory of Prediction?,  
> Technical Report IDSIA-12-06, IDSIA / USI-SUPSI, Dalle Molle Institute for 
> Artificial Intelligence, Galleria 2, 6928 Manno, Switzerland.
> http://www.vetta.org/documents/IDSIA-12-06-1.pdf
>
> 3. M. Mahoney (2000), A Note on Lexical Acquisition in Text without Spaces, 
> http://cs.fit.edu/~mmahoney/dissertation/lex1.html
>
>
> -- Matt Mahoney, [EMAIL PROTECTED]
>
>
>

RE: Language modeling (was Re: [agi] draft for comment)

2008-09-06 Thread John G. Rose
Thinking out loud here as I find the relationship between compression and
intelligence interesting:

Compression in itself has the overriding goal of reducing storage bits.
Intelligence involves compression coincidentally; there is resource management
there. But I do think that it is not ONLY coincidental. Knowledge has
structure which can be organized and will naturally collapse into a
lower-complexity storage state. Things have order, based on physics and other
mathematical relationships. The relationship between compression, stored
knowledge, and intelligence is intriguing. But knowledge can be compressed
inefficiently, to the point where it inhibits extraction and other operations,
so there are differences between compression and intelligence related to
computational expense. Optimal intelligence would have a varied compression
structure; in other words, some material needs fast access with minimal
decompression cost, while other material has high storage priority but access
time and computational expense are not priorities.

And then, when you say the word "compression", there is a question of utility.
A compressor that has general intelligence still has the goal of reducing
storage bits. I think that compression can be a byproduct of the stored
knowledge created by a general intelligence. But if you have a compressor with
general intelligence built in and you assign it the goal of taking input data
and reducing the storage space, it may still result in a series of hacks,
because that may be the best way of accomplishing that goal.

Sure, there may be some new, undiscovered hacks that require general
intelligence to uncover. And a compressor that is generally intelligent may
produce richer lossily compressed data from varied sources. The best lossy
compressor is probably generally intelligent. They are very similar, as you
indicate... but when you start getting really lossy, when you start asking
questions of your lossy compressed data that are not related just to the
uncompressed input, there is a difference. Compression itself is
one-dimensional. Intelligence is multi-dimensional.

John 



> -----Original Message-----
> From: Matt Mahoney [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 05, 2008 6:39 PM
> To: agi@v2.listbox.com
> Subject: Re: Language modeling (was Re: [agi] draft for comment)
> 
> --- On Fri, 9/5/08, Pei Wang <[EMAIL PROTECTED]> wrote:
> 
> > Like to many existing AI works, my disagreement with you is
> > not that
> > much on the solution you proposed (I can see the value),
> > but on the
> > problem you specified as the goal of AI. For example, I
> > have no doubt
> > about the theoretical and practical values of compression,
> > but don't
> > think it has much to do with intelligence.
> 
> In http://cs.fit.edu/~mmahoney/compression/rationale.html I explain why
> text compression is an AI problem. To summarize, if you know the
> probability distribution of text, then you can compute P(A|Q) for any
> question Q and answer A to pass the Turing test. Compression allows you
> to precisely measure the accuracy of your estimate of P. Compression
> (actually, word perplexity) has been used since the early 1990's to
> measure the quality of language models for speech recognition, since it
> correlates well with word error rate.
> 
> The purpose of this work is not to solve general intelligence, such as
> the universal intelligence proposed by Legg and Hutter [1]. That is not
> computable, so you have to make some arbitrary choice with regard to
> test environments about what problems you are going to solve. I believe
> the goal of AGI should be to do useful work for humans, so I am making a
> not so arbitrary choice to solve a problem that is central to what most
> people regard as useful intelligence.
> 
> I had hoped that my work would lead to an elegant theory of AI, but that
> hasn't been the case. Rather, the best compression programs were
> developed as a series of thousands of hacks and tweaks, e.g. change a 4
> to a 5 because it gives 0.002% better compression on the benchmark. The
> result is an opaque mess. I guess I should have seen it coming, since it
> is predicted by information theory (e.g. [2]).
> 
> Nevertheless the architectures of the best text compressors are
> consistent with cognitive development models, i.e. phoneme (or letter)
> sequences -> lexical -> semantics -> syntax, which are themselves
> consistent with layered neural architectures. I already described a
> neural semantic model in my last post. I also did work supporting
> Hutchens and Alder showing that lexical models can be learned from n-
> gram statistics, consistent with the observation that babies learn the
> rules for segmenting continuous speech before they learn any words [3].
> 
> I agre

Re: Language modeling (was Re: [agi] draft for comment)

2008-09-06 Thread Matt Mahoney
--- On Fri, 9/5/08, Pei Wang <[EMAIL PROTECTED]> wrote:

> Thanks for taking the time to explain your ideas in detail.
> As I said,
> our different opinions on how to do AI come from our very
> different
> understanding of "intelligence". I don't take
> "passing Turing Test" as
> my research goal (as explained in
> http://nars.wang.googlepages.com/wang.logic_intelligence.pdf
> and
> http://nars.wang.googlepages.com/wang.AI_Definitions.pdf). 
> I disagree
> with Hutter's approach, not because his SOLUTION is not
> computable,
> but because his PROBLEM is too idealized and simplified to
> be relevant
> to the actual problems of AI.

I don't advocate the Turing test as the ideal test of intelligence. Turing 
himself was aware of the problem when he gave an example of a computer 
answering an arithmetic problem incorrectly in his famous 1950 paper:

Q: Please write me a sonnet on the subject of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34957 to 70764.
A: (Pause about 30 seconds and then give as answer) 105621.
Q: Do you play chess?
A: Yes.
Q: I have K at my K1, and no other pieces.  You have only K at K6 and R at R1.  
It is your move.  What do you play?
A: (After a pause of 15 seconds) R-R8 mate.

I prefer a "preference test", which a machine passes if you prefer to talk to 
it over a human. Such a machine would be too fast and make too few errors to 
pass a Turing test. For example, if you had to add two large numbers, I think 
you would prefer to use a calculator than ask someone. You could, I suppose, 
measure intelligence as the fraction of questions for which the machine gives 
the preferred answer, which would be 1/4 in Turing's example.

If you know the probability distribution P of text, and therefore know the 
distribution P(A|Q) for any question Q and answer A, then to pass the Turing 
test you would randomly choose answers from this distribution. But to pass the 
preference test for all Q, you would choose A that maximizes P(A|Q) because the 
most probable answer is usually the correct one. Text compression measures 
progress toward either test.
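
A small sketch of the distinction, assuming a set of candidate answers and an 
estimate p_a_given_q are available (both names are placeholders):

    import random

    def turing_test_answer(q, candidates, p_a_given_q):
        # Imitate the human distribution: sample an answer in proportion to
        # its probability, errors and all.
        weights = [p_a_given_q(a, q) for a in candidates]
        return random.choices(candidates, weights=weights, k=1)[0]

    def preference_test_answer(q, candidates, p_a_given_q):
        # Return the most probable answer, which is usually the correct one.
        return max(candidates, key=lambda a: p_a_given_q(a, q))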

I believe that compression measures your definition of intelligence, i.e. 
adaptation given insufficient knowledge and resources. In my benchmark, there 
are two parts: the size of the decompression program, which measures the 
initial knowledge, and the compressed size, which measures prediction errors 
that occur as the system adapts. Programs must also meet practical time and 
memory constraints to be listed in most benchmarks.
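
As a sketch, the benchmark score is simply the sum of those two parts (the 
file names below are hypothetical):

    import os

    def benchmark_score(decompressor_path, compressed_path):
        # Size of the decompression program measures initial knowledge;
        # size of the compressed output measures accumulated prediction errors.
        return (os.path.getsize(decompressor_path)
                + os.path.getsize(compressed_path))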

Compression is also consistent with Legg and Hutter's universal intelligence, 
i.e. expected reward of an AIXI universal agent in an environment simulated by 
a random program. Suppose you have a compression oracle that inputs any string 
x and outputs the shortest program that outputs a string with prefix x. Then 
this reduces the (uncomputable) AIXI problem to using the oracle to guess which 
environment is consistent with the interaction so far, and figuring out which 
future outputs by the agent will maximize reward.

Of course universal intelligence is also not testable because it requires an 
infinite number of environments. Instead, we have to choose a practical data 
set. I use Wikipedia text, which has fewer errors than average text, but I 
believe that is consistent with my goal of passing the preference test.


-- Matt Mahoney, [EMAIL PROTECTED]





RE: Language modeling (was Re: [agi] draft for comment)

2008-09-06 Thread Matt Mahoney
--- On Sat, 9/6/08, John G. Rose <[EMAIL PROTECTED]> wrote:

> Compression in itself has the overriding goal of reducing
> storage bits.

Not the way I use it. The goal is to predict what the environment will do next. 
Lossless compression is a way of measuring how well we are doing.

-- Matt Mahoney, [EMAIL PROTECTED]





Re: Language modeling (was Re: [agi] draft for comment)

2008-09-06 Thread Pei Wang
I won't argue against your  "preference test" here, since this is a
big topic, and I've already made my position clear in the papers I
mentioned.

As for "compression", yes every intelligent system needs to 'compress'
its experience in the sense of "keeping the essence but using less
space". However, it is clearly not loseless. It is even not what we
usually call "loosy compression", because what to keep and in what
form is highly context-sensitive. Consequently, this process is not
reversible --- no decompression, though the result can be applied in
various ways. Therefore I prefer not to call it compression to avoid
confusing this process with the technical sense of "compression",
which is reversible, at least approximately.

Legg and Hutter's "universal intelligence" definition is way too
narrow to cover the various attempts toward AI, even as an idealization.
Therefore, I don't take it as a goal to aim at and approach as closely
as possible. However, as I said before, I'd rather leave this topic for
the future, when I have enough time to give it a fair treatment.

Pei

On Sat, Sep 6, 2008 at 4:29 PM, Matt Mahoney <[EMAIL PROTECTED]> wrote:
> --- On Fri, 9/5/08, Pei Wang <[EMAIL PROTECTED]> wrote:
>
>> Thanks for taking the time to explain your ideas in detail.
>> As I said,
>> our different opinions on how to do AI come from our very
>> different
>> understanding of "intelligence". I don't take
>> "passing Turing Test" as
>> my research goal (as explained in
>> http://nars.wang.googlepages.com/wang.logic_intelligence.pdf
>> and
>> http://nars.wang.googlepages.com/wang.AI_Definitions.pdf).
>> I disagree
>> with Hutter's approach, not because his SOLUTION is not
>> computable,
>> but because his PROBLEM is too idealized and simplified to
>> be relevant
>> to the actual problems of AI.
>
> I don't advocate the Turing test as the ideal test of intelligence. Turing 
> himself was aware of the problem when he gave an example of a computer 
> answering an arithmetic problem incorrectly in his famous 1950 paper:
>
> Q: Please write me a sonnet on the subject of the Forth Bridge.
> A: Count me out on this one. I never could write poetry.
> Q: Add 34957 to 70764.
> A: (Pause about 30 seconds and then give as answer) 105621.
> Q: Do you play chess?
> A: Yes.
> Q: I have K at my K1, and no other pieces.  You have only K at K6 and R at 
> R1.  It is your move.  What do you play?
> A: (After a pause of 15 seconds) R-R8 mate.
>
> I prefer a "preference test", which a machine passes if you prefer to talk to 
> it over a human. Such a machine would be too fast and make too few errors to 
> pass a Turing test. For example, if you had to add two large numbers, I think 
> you would prefer to use a calculator than ask someone. You could, I suppose, 
> measure intelligence as the fraction of questions for which the machine gives 
> the preferred answer, which would be 1/4 in Turing's example.
>
> If you know the probability distribution P of text, and therefore know the 
> distribution P(A|Q) for any question Q and answer A, then to pass the Turing 
> test you would randomly choose answers from this distribution. But to pass 
> the preference test for all Q, you would choose A that maximizes P(A|Q) 
> because the most probable answer is usually the correct one. Text compression 
> measures progress toward either test.
>
> I believe that compression measures your definition of intelligence, i.e. 
> adaptation given insufficient knowledge and resources. In my benchmark, there 
> are two parts: the size of the decompression program, which measures the 
> initial knowledge, and the compressed size, which measures prediction errors 
> that occur as the system adapts. Programs must also meet practical time and 
> memory constraints to be listed in most benchmarks.
>
> Compression is also consistent with Legg and Hutter's universal intelligence, 
> i.e. expected reward of an AIXI universal agent in an environment simulated 
> by a random program. Suppose you have a compression oracle that inputs any 
> string x and outputs the shortest program that outputs a string with prefix 
> x. Then this reduces the (uncomputable) AIXI problem to using the oracle to 
> guess which environment is consistent with the interaction so far, and 
> figuring out which future outputs by the agent will maximize reward.
>
> Of course universal intelligence is also not testable because it requires an 
> infinite number of environments. Instead, we have to choose a practical data 
> set. I use Wikipedia text, which has fewer errors than average text, but I 
> believe that is consistent with my goal of passing the preference test.
>
>
> -- Matt Mahoney, [EMAIL PROTECTED]
>
>
>


Re: Language modeling (was Re: [agi] draft for comment)

2008-09-06 Thread Matt Mahoney
--- On Sat, 9/6/08, Pei Wang <[EMAIL PROTECTED]> wrote:

> As for "compression", yes every intelligent
> system needs to 'compress'
> its experience in the sense of "keeping the essence
> but using less
> space". However, it is clearly not loseless. It is
> even not what we
> usually call "loosy compression", because what to
> keep and in what
> form is highly context-sensitive. Consequently, this
> process is not
> reversible --- no decompression, though the result can be
> applied in
> various ways. Therefore I prefer not to call it compression
> to avoid
> confusing this process with the technical sense of
> "compression",
> which is reversible, at least approximately.

I think you misunderstand my use of compression. The goal is modeling or 
prediction: given a string, predict the next symbol. I use compression to 
estimate how accurate the model is. It is easy to show that if your model is 
accurate, then connecting it to an ideal coder (such as an arithmetic coder) 
yields optimal compression. You could actually skip the coding step, but 
coding is cheap, so I include it so that there is no question of making a 
mistake in the measurement. If a bug in the coder produces too small an 
output, the decompression step won't reproduce the original file.
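
A minimal sketch of what skipping the coding step amounts to, assuming the 
caller supplies a predictive model p_next(context, symbol) (the name is a 
placeholder):

    import math

    def ideal_compressed_bits(text, p_next):
        # The output size of an ideal arithmetic coder driven by this model
        # is, to within a few bits, the model's total log loss on the text.
        bits = 0.0
        for i, ch in enumerate(text):
            p = p_next(text[:i], ch)   # model's probability of the next symbol
            bits += -math.log2(p)      # ideal code length for that symbol
        return bits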

In fact, many speech recognition experiments do skip the coding step in their 
tests and merely calculate what the compressed size would be. (More precisely, 
they calculate word perplexity, which is equivalent.) The goal of speech 
recognition is to find the text y that maximizes P(y|x) for utterance x. It is 
common to factor the model using Bayes' law: P(y|x) = P(x|y)P(y)/P(x). We can 
drop P(x) since it is constant, leaving the acoustic model P(x|y) and the 
language model P(y) to evaluate. We know from experiments that compression 
tests on P(y) correlate well with word error rates for the overall system.

Internally, all lossless compressors use lossy compression or data reduction to 
make predictions. Most commonly, a context is truncated and possibly hashed 
before looking up the statistics for the next symbol. The top lossless 
compressors in my benchmark use more sophisticated forms of data reduction, 
such as mapping upper and lower case letters together, or mapping groups of 
semantically or syntactically related words to the same context.
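
A minimal sketch of that kind of data reduction (the order, table size, and 
count-based estimator are illustrative choices, not any particular 
compressor's design):

    class HashedContextModel:
        # Keep per-context symbol counts in a fixed-size hash table, where the
        # "context" is only the last `order` symbols of the history.
        def __init__(self, order=3, table_size=1 << 16):
            self.order = order
            self.table_size = table_size
            self.counts = [dict() for _ in range(table_size)]

        def _bucket(self, context):
            ctx = context[-self.order:]           # truncate the context ...
            return hash(ctx) % self.table_size    # ... then hash it

        def predict(self, context, symbol):
            c = self.counts[self._bucket(context)]
            total = sum(c.values())
            # Laplace-smoothed estimate over a 256-symbol alphabet.
            return (c.get(symbol, 0) + 1) / (total + 256)

        def update(self, context, symbol):
            c = self.counts[self._bucket(context)]
            c[symbol] = c.get(symbol, 0) + 1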

As a test, lossless compression is only appropriate for text. For other hard AI 
problems such as vision, art, and music, incompressible noise would overwhelm 
the human-perceptible signal. Theoretically you could compress video to 2 bits 
per second (the rate of human long term memory) by encoding it as a script. The 
decompressor would read the script and create a new movie. The proper test 
would be lossy compression, but this requires human judgment to evaluate how 
well the reconstructed data matches the original.


-- Matt Mahoney, [EMAIL PROTECTED]






RE: Language modeling (was Re: [agi] draft for comment)

2008-09-07 Thread John G. Rose
> From: Matt Mahoney [mailto:[EMAIL PROTECTED]
> 
> --- On Sat, 9/6/08, John G. Rose <[EMAIL PROTECTED]> wrote:
> 
> > Compression in itself has the overriding goal of reducing
> > storage bits.
> 
> Not the way I use it. The goal is to predict what the environment will
> do next. Lossless compression is a way of measuring how well we are
> doing.
> 

Predicting the environment in order to determine which data to pack where,
thus achieving a higher compression ratio? Or compression as an integral part
of prediction? Some types of prediction are inherently compressed, I suppose.

John





RE: Language modeling (was Re: [agi] draft for comment)

2008-09-07 Thread Matt Mahoney
--- On Sun, 9/7/08, John G. Rose <[EMAIL PROTECTED]> wrote:

> From: John G. Rose <[EMAIL PROTECTED]>
> Subject: RE: Language modeling (was Re: [agi] draft for comment)
> To: agi@v2.listbox.com
> Date: Sunday, September 7, 2008, 9:15 AM
> > From: Matt Mahoney [mailto:[EMAIL PROTECTED]
> >
> > --- On Sat, 9/6/08, John G. Rose <[EMAIL PROTECTED]> wrote:
> >
> > > Compression in itself has the overriding goal of reducing
> > > storage bits.
> >
> > Not the way I use it. The goal is to predict what the environment
> > will do next. Lossless compression is a way of measuring how well
> > we are doing.
>
> Predicting the environment in order to determine which data to pack
> where, thus achieving higher compression ratio. Or compression as an
> integral part of prediction? Some types of prediction are inherently
> compressed I suppose.

Predicting the environment to maximize reward. Hutter proved that universal 
intelligence is a compression problem. The optimal behavior of an AIXI agent is 
to guess the shortest program consistent with the observations so far. That's 
algorithmic compression.

-- Matt Mahoney, [EMAIL PROTECTED]





RE: Language modeling (was Re: [agi] draft for comment)

2008-09-08 Thread John G. Rose
> From: Matt Mahoney [mailto:[EMAIL PROTECTED]
> 
> --- On Sun, 9/7/08, John G. Rose <[EMAIL PROTECTED]> wrote:
> 
> > From: John G. Rose <[EMAIL PROTECTED]>
> > Subject: RE: Language modeling (was Re: [agi] draft for comment)
> > To: agi@v2.listbox.com
> > Date: Sunday, September 7, 2008, 9:15 AM
> > > From: Matt Mahoney [mailto:[EMAIL PROTECTED]
> > >
> > > --- On Sat, 9/6/08, John G. Rose <[EMAIL PROTECTED]> wrote:
> > >
> > > > Compression in itself has the overriding goal of reducing
> > > > storage bits.
> > >
> > > Not the way I use it. The goal is to predict what the environment
> > > will do next. Lossless compression is a way of measuring how well
> > > we are doing.
> >
> > Predicting the environment in order to determine which data to pack
> > where, thus achieving higher compression ratio. Or compression as an
> > integral part of prediction? Some types of prediction are inherently
> > compressed I suppose.
> 
> Predicting the environment to maximize reward. Hutter proved that
> universal intelligence is a compression problem. The optimal behavior of
> an AIXI agent is to guess the shortest program consistent with
> observation so far. That's algorithmic compression.
> 

Oh, I see. Guessing the shortest program = compression. OK, right. But yeah,
as Pei said, the word "compression" is misleading. It implies a reduction when
you are actually increasing understanding :)

John



