Re: Language modeling (was Re: [agi] draft for comment)
On Fri, Sep 5, 2008 at 11:15 AM, Matt Mahoney <[EMAIL PROTECTED]> wrote:

> --- On Thu, 9/4/08, Pei Wang <[EMAIL PROTECTED]> wrote:

>> I guess you still see NARS as using model-theoretic semantics, so you call it "symbolic" and contrast it with systems with sensors. This is not correct --- see http://nars.wang.googlepages.com/wang.semantics.pdf and http://nars.wang.googlepages.com/wang.AI_Misconceptions.pdf

> I mean NARS is symbolic in the sense that you write statements in Narsese like "raven -> bird <0.97, 0.92>" (probability=0.97, confidence=0.92). I realize that the meanings of "raven" and "bird" are determined by their relations to other symbols in the knowledge base and that the probability and confidence change with experience. But in practice you are still going to write statements like this because it is the easiest way to build the knowledge base.

Yes.

> You aren't going to specify the brightness of millions of pixels in a vision system in Narsese, and there is no mechanism I am aware of to collect this knowledge from a natural language text corpus.

Of course not. To have visual experience, there must be a device to convert visual signals into an internal representation in Narsese. I never suggested otherwise.

> There is no mechanism to add new symbols to the knowledge base through experience. You have to explicitly add them.

"New symbols" either come from the outside in experience (experience can be verbal), or are composed by the concept-formation rules from existing ones. The latter case is explained in my book.

> Natural language has evolved to be learnable on a massively parallel network of slow computing elements. This should be apparent when we compare successful language models with unsuccessful ones. Artificial language models usually consist of tokenization, parsing, and semantic analysis phases.
> This does not work on natural language because artificial languages have precise specifications and natural languages do not.

It depends on which aspect of the language you talk about. Narsese has "precise specifications" in syntax, but the meaning of the terms is a function of experience, and changes from time to time.

> No two humans use exactly the same language, nor does the same human at two points in time. Rather, language is learnable by example, so that each message causes the language of the receiver to be a little more like that of the sender.

Same thing in NARS --- if two implementations of NARS have different experience, they will disagree on the meaning of a term. When they begin to learn natural language, the same will be true for grammar. Since I haven't done any concrete NLP yet, I don't expect you to believe me on the second point, but you cannot rule out that possibility just because no traditional system can do it.

> Children learn semantics before syntax, which is the opposite order from which you would write an artificial language interpreter.

NARS indeed can learn semantics before syntax --- see http://nars.wang.googlepages.com/wang.roadmap.pdf

I won't comment on the following detailed statements, since I agree with your criticism of the traditional processing of formal languages, but that is not how NARS handles language. Don't think of NARS as another Cyc just because both use a "formal language". The same "ravens are birds" is treated very differently in the two systems.

Pei

> An example of a successful language model is a search engine. We know that most of the meaning of a text document depends only on the words it contains, ignoring word order. A search engine matches the semantics of the query with the semantics of a document mostly by matching words, but also by matching semantically related words like "water" to "wet".
> Here is an example of a computationally intensive but biologically plausible language model. A semantic model is a word-word matrix A such that A_ij is the degree to which words i and j are related, which you can think of as the probability of finding i and j together in a sliding window over a huge text corpus. However, semantic relatedness is a fuzzy identity relation, meaning it is reflexive, commutative, and transitive. If i is related to j and j to k, then i is related to k. Deriving transitive relations in A, also known as latent semantic analysis, is performed by singular value decomposition, factoring A = USV where S is diagonal, then discarding the small terms of S, which has the effect of lossy compression. Typically, A has about 10^6 elements and we keep only a few hundred elements of S. Fortunately there is a parallel algorithm that incrementally updates the matrices as the system learns: a 3-layer neural network where S is the hidden layer (which can grow) and U and V are weight matrices [1].

> Traditional language processing has failed because the task of converting natural languag
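[Editor's note: the latent semantic analysis recipe quoted above (factor A = USV, keep only the large singular values) can be sketched in a few lines of Python. The co-occurrence matrix below is an illustrative toy, not data from any real corpus:]

```python
import numpy as np

# Toy word-word matrix: A[i][j] is how strongly words i and j co-occur
# in a sliding window over a corpus. Values are made up for illustration.
words = ["water", "wet", "dry", "fire"]
A = np.array([
    [4.0, 3.0, 1.0, 0.0],   # water
    [3.0, 4.0, 1.0, 0.0],   # wet
    [1.0, 1.0, 4.0, 2.0],   # dry
    [0.0, 0.0, 2.0, 4.0],   # fire
])

# Factor A = U S V^T, then discard the small terms of S. Keeping only
# the k largest singular values is the lossy compression step that
# forces transitive (latent) relations to emerge.
U, S, Vt = np.linalg.svd(A)
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

print(np.round(A_k, 2))
```

After truncation, the "water"/"wet" association survives while unrelated pairs stay near zero, which is the sense in which the compressed matrix encodes semantic relatedness.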
Re: Language modeling (was Re: [agi] draft for comment)
--- On Fri, 9/5/08, Pei Wang <[EMAIL PROTECTED]> wrote:

> NARS indeed can learn semantics before syntax --- see http://nars.wang.googlepages.com/wang.roadmap.pdf

Yes, I see this corrects many of the problems with Cyc and with traditional language models. I didn't see a description of a mechanism for learning new terms in your other paper. Clearly this could be added, although I believe it should be a statistical process.

I am interested in determining the computational cost of language modeling. The evidence I have so far is that it is high. I believe the algorithmic complexity of a model is about 10^9 bits. This is consistent with Turing's 1950 prediction that AI would require this much memory, with Landauer's estimate of human long-term memory, and with how much language a person processes by adulthood, assuming an information content of 1 bit per character as Shannon estimated in 1950. This is why I use a 1 GB data set in my compression benchmark.

However, there is a 3-way tradeoff between CPU speed, memory, and model accuracy (as measured by compression ratio). I added two graphs to my benchmark at http://cs.fit.edu/~mmahoney/compression/text.html (below the main table) which show this clearly. In particular, the size-memory tradeoff is an almost perfectly straight line (with memory on a log scale) over tests of 104 compressors. These tests suggest to me that CPU and memory are indeed bottlenecks to language modeling.

The best models in my tests use simple semantic and grammatical models, well below adult human level. The 3 top programs on the memory graph map words to tokens using dictionaries that group semantically and syntactically related words together, but only one (paq8hp12any) uses a semantic space of more than one dimension. All have large vocabularies, although not implausibly large for an educated person. Other top programs like nanozipltcb and WinRK use smaller dictionaries and strictly lexical models. Lesser programs model only at the n-gram level.
I don't yet have an answer to my question, but I believe efficient human-level NLP will require hundreds of GB or perhaps 1 TB of memory. The slowest programs are already faster than real time, given that equivalent learning in humans would take over a decade. I think you could use existing hardware in a speed-memory tradeoff to get real-time NLP, but it would not be practical for doing experiments where each source code change requires training the model from scratch. Model development typically requires thousands of tests.

-- Matt Mahoney, [EMAIL PROTECTED]

---
agi
Archives: https://www.listbox.com/member/archive/303/=now
RSS Feed: https://www.listbox.com/member/archive/rss/303/
Modify Your Subscription: https://www.listbox.com/member/?member_id=8660244&id_secret=111637683-c8fa51
Powered by Listbox: http://www.listbox.com
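[Editor's note: the use of compression ratio as a proxy for model accuracy can be demonstrated with any off-the-shelf compressor. A minimal sketch using Python's zlib (the numbers are illustrative; the benchmarked models above do far better on a 1 GB corpus):]

```python
import zlib

# Score a "language model" (here, zlib's generic dictionary coder) by the
# number of bits it needs per character of text. Shannon estimated humans
# predict English at about 1 bit per character; the gap between that and
# the figure printed here is the model-accuracy gap discussed above.
text = (
    "No two humans use exactly the same language, nor does the same "
    "human at two points in time. Rather, language is learnable by "
    "example, so that each message causes the language of the receiver "
    "to be a little more like that of the sender."
).encode()

compressed = zlib.compress(text, level=9)
bits_per_char = 8 * len(compressed) / len(text)
print(f"{bits_per_char:.2f} bits per character")
```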
Re: Language modeling (was Re: [agi] draft for comment)
On Fri, Sep 5, 2008 at 6:15 PM, Matt Mahoney <[EMAIL PROTECTED]> wrote:

> --- On Fri, 9/5/08, Pei Wang <[EMAIL PROTECTED]> wrote:

>> NARS indeed can learn semantics before syntax --- see http://nars.wang.googlepages.com/wang.roadmap.pdf

> Yes, I see this corrects many of the problems with Cyc and with traditional language models. I didn't see a description of a mechanism for learning new terms in your other paper. Clearly this could be added, although I believe it should be a statistical process.

I don't have a separate paper on term composition, so you'd have to read my book. It is indeed a statistical process, in the sense that most of the composed terms won't be useful, so they will be forgotten gradually. Only the "useful patterns" will be kept for a long time, in the form of compound terms.

> I am interested in determining the computational cost of language modeling. The evidence I have so far is that it is high. I believe the algorithmic complexity of a model is 10^9 bits. This is consistent with Turing's 1950 prediction that AI would require this much memory, with Landauer's estimate of human long term memory, and is about how much language a person processes by adulthood assuming an information content of 1 bit per character as Shannon estimated in 1950. This is why I use a 1 GB data set in my compression benchmark.

I see your point, though I think analyzing this problem in terms of computational complexity is not the correct way to go, because this process does not follow a predetermined algorithm. Instead, language learning is an incremental process, without a well-defined beginning and ending.

> However there is a 3-way tradeoff between CPU speed, memory, and model accuracy (as measured by compression ratio). I added two graphs to my benchmark at http://cs.fit.edu/~mmahoney/compression/text.html (below the main table) which show this clearly. In particular the size-memory tradeoff is an almost perfectly straight line (with memory on a log scale) over tests of 104 compressors. These tests suggest to me that CPU and memory are indeed bottlenecks to language modeling. The best models in my tests use simple semantic and grammatical models, well below adult human level. The 3 top programs on the memory graph map words to tokens using dictionaries that group semantically and syntactically related words together, but only one (paq8hp12any) uses a semantic space of more than one dimension. All have large vocabularies, although not implausibly large for an educated person. Other top programs like nanozipltcb and WinRK use smaller dictionaries and strictly lexical models. Lesser programs model only at the n-gram level.

As with too many existing AI works, my disagreement with you is not so much about the solution you proposed (I can see its value), but about the problem you specified as the goal of AI. For example, I have no doubt about the theoretical and practical value of compression, but I don't think it has much to do with intelligence.

I don't think this kind of issue can be efficiently handled by an email discussion like this one. I've been thinking about writing a paper to compare my ideas with the ideas represented by AIXI, which is closely related to yours, though this project hasn't had enough priority on my to-do list. Hopefully I'll find the time to make myself clear on this topic.

> I don't yet have an answer to my question, but I believe efficient human-level NLP will require hundreds of GB or perhaps 1 TB of memory. The slowest programs are already faster than real time, given that equivalent learning in humans would take over a decade. I think you could use existing hardware in a speed-memory tradeoff to get real time NLP, but it would not be practical for doing experiments where each source code change requires training the model from scratch. Model development typically requires thousands of tests.

I guess we are exploring very different paths in NLP, and now it is too early to tell which one will do better.

Pei
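[Editor's note: for readers unfamiliar with the <0.97, 0.92> notation from earlier in the thread, NARS truth values are derived from accumulated evidence, which is also what makes term formation "statistical" in Pei's sense. A hedged sketch: the formulas follow the published NARS definitions, but the function name and the choice k=1 are mine, not Pei's code:]

```python
def truth_value(positive, total, k=1.0):
    """NARS-style truth value from evidence counts.

    frequency  = w+ / w      (proportion of positive evidence)
    confidence = w / (w + k) (stability of the frequency, given the
                              total amount of evidence relative to k)
    """
    frequency = positive / total
    confidence = total / (total + k)
    return frequency, confidence

# 97 positive observations out of 100 for "raven -> bird" gives a
# frequency of 0.97; with k = 1 the confidence is about 0.99
# (the 0.92 in the quoted example corresponds to a larger k).
f, c = truth_value(97, 100)
print(f"<{f:.2f}, {c:.2f}>")
```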
Re: Language modeling (was Re: [agi] draft for comment)
--- On Fri, 9/5/08, Pei Wang <[EMAIL PROTECTED]> wrote:

> Like too many existing AI works, my disagreement with you is not that much on the solution you proposed (I can see the value), but on the problem you specified as the goal of AI. For example, I have no doubt about the theoretical and practical values of compression, but don't think it has much to do with intelligence.

In http://cs.fit.edu/~mmahoney/compression/rationale.html I explain why text compression is an AI problem. To summarize, if you know the probability distribution of text, then you can compute P(A|Q) for any question Q and answer A to pass the Turing test. Compression allows you to precisely measure the accuracy of your estimate of P. Compression (actually, word perplexity) has been used since the early 1990's to measure the quality of language models for speech recognition, since it correlates well with word error rate.

The purpose of this work is not to solve general intelligence, such as the universal intelligence proposed by Legg and Hutter [1]. That is not computable, so you have to make some arbitrary choice with regard to test environments about what problems you are going to solve. I believe the goal of AGI should be to do useful work for humans, so I am making a not so arbitrary choice to solve a problem that is central to what most people regard as useful intelligence.

I had hoped that my work would lead to an elegant theory of AI, but that hasn't been the case. Rather, the best compression programs were developed as a series of thousands of hacks and tweaks, e.g. change a 4 to a 5 because it gives 0.002% better compression on the benchmark. The result is an opaque mess. I guess I should have seen it coming, since it is predicted by information theory (e.g. [2]).

Nevertheless the architectures of the best text compressors are consistent with cognitive development models, i.e. phoneme (or letter) sequences -> lexical -> semantics -> syntax, which are themselves consistent with layered neural architectures. I already described a neural semantic model in my last post. I also did work supporting Hutchens and Alder showing that lexical models can be learned from n-gram statistics, consistent with the observation that babies learn the rules for segmenting continuous speech before they learn any words [3].

I agree it should also be clear that semantics is learned before grammar, contrary to the way artificial languages are processed. Grammar requires semantics, but not the other way around. Search engines work using semantics only. Yet we cannot parse sentences like "I ate pizza with Bob", "I ate pizza with pepperoni", "I ate pizza with chopsticks", without semantics.

My benchmark does not prove that there aren't better language models, but it is strong evidence. It represents the work of about 100 researchers who have tried and failed to find more accurate, faster, or less memory intensive models. The resource requirements seem to increase as we go up the chain from n-grams to grammar, contrary to symbolic approaches. This is my argument why I think AI is bound by lack of hardware, not lack of theory.

1. Legg, Shane, and Marcus Hutter (2006), A Formal Measure of Machine Intelligence, Proc. Annual Machine Learning Conference of Belgium and The Netherlands (Benelearn-2006), Ghent, 2006. http://www.vetta.org/documents/ui_benelearn.pdf

2. Legg, Shane (2006), Is There an Elegant Universal Theory of Prediction?, Technical Report IDSIA-12-06, IDSIA / USI-SUPSI, Dalle Molle Institute for Artificial Intelligence, Galleria 2, 6928 Manno, Switzerland. http://www.vetta.org/documents/IDSIA-12-06-1.pdf

3. M. Mahoney (2000), A Note on Lexical Acquisition in Text without Spaces, http://cs.fit.edu/~mmahoney/dissertation/lex1.html

-- Matt Mahoney, [EMAIL PROTECTED]
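[Editor's note: the claim that compression "precisely measures the accuracy of your estimate of P" rests on the arithmetic-coding identity that a symbol assigned probability p costs -log2(p) bits. A toy order-0 model makes the link concrete; this is a sketch, not the benchmark code:]

```python
import math
from collections import Counter

# An arithmetic coder driven by a model that assigns probability p to the
# next character stores that character in -log2(p) bits, so the total
# compressed size equals the model's cross-entropy on the text. Here the
# "model" is just the character frequencies of the text itself.
text = "the cat sat on the mat"
counts = Counter(text)
total = sum(counts.values())

bits = -sum(math.log2(counts[ch] / total) for ch in text)
print(f"{bits / len(text):.2f} bits per character under a unigram model")
```

A better model assigns higher probability to what actually occurs, and its lower cross-entropy shows up directly as a smaller compressed file, which is why benchmark rank tracks model quality.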
Re: Language modeling (was Re: [agi] draft for comment)
Matt,

Thanks for taking the time to explain your ideas in detail. As I said, our different opinions on how to do AI come from our very different understandings of "intelligence". I don't take "passing the Turing Test" as my research goal (as explained in http://nars.wang.googlepages.com/wang.logic_intelligence.pdf and http://nars.wang.googlepages.com/wang.AI_Definitions.pdf). I disagree with Hutter's approach, not because his SOLUTION is not computable, but because his PROBLEM is too idealized and simplified to be relevant to the actual problems of AI.

Even so, I'm glad that we can still agree on some things, such as that semantics comes before syntax. In my plan for NLP, there won't be separate 'parsing' and 'semantic mapping' stages. I'll say more when I have concrete results to share.

Pei

On Fri, Sep 5, 2008 at 8:39 PM, Matt Mahoney <[EMAIL PROTECTED]> wrote:

> --- On Fri, 9/5/08, Pei Wang <[EMAIL PROTECTED]> wrote:

>> Like too many existing AI works, my disagreement with you is not that much on the solution you proposed (I can see the value), but on the problem you specified as the goal of AI. For example, I have no doubt about the theoretical and practical values of compression, but don't think it has much to do with intelligence.

> In http://cs.fit.edu/~mmahoney/compression/rationale.html I explain why text compression is an AI problem. To summarize, if you know the probability distribution of text, then you can compute P(A|Q) for any question Q and answer A to pass the Turing test. Compression allows you to precisely measure the accuracy of your estimate of P. Compression (actually, word perplexity) has been used since the early 1990's to measure the quality of language models for speech recognition, since it correlates well with word error rate.

> The purpose of this work is not to solve general intelligence, such as the universal intelligence proposed by Legg and Hutter [1]. That is not computable, so you have to make some arbitrary choice with regard to test environments about what problems you are going to solve. I believe the goal of AGI should be to do useful work for humans, so I am making a not so arbitrary choice to solve a problem that is central to what most people regard as useful intelligence.

> I had hoped that my work would lead to an elegant theory of AI, but that hasn't been the case. Rather, the best compression programs were developed as a series of thousands of hacks and tweaks, e.g. change a 4 to a 5 because it gives 0.002% better compression on the benchmark. The result is an opaque mess. I guess I should have seen it coming, since it is predicted by information theory (e.g. [2]).

> Nevertheless the architectures of the best text compressors are consistent with cognitive development models, i.e. phoneme (or letter) sequences -> lexical -> semantics -> syntax, which are themselves consistent with layered neural architectures. I already described a neural semantic model in my last post. I also did work supporting Hutchens and Alder showing that lexical models can be learned from n-gram statistics, consistent with the observation that babies learn the rules for segmenting continuous speech before they learn any words [3].

> I agree it should also be clear that semantics is learned before grammar, contrary to the way artificial languages are processed. Grammar requires semantics, but not the other way around. Search engines work using semantics only. Yet we cannot parse sentences like "I ate pizza with Bob", "I ate pizza with pepperoni", "I ate pizza with chopsticks", without semantics.

> My benchmark does not prove that there aren't better language models, but it is strong evidence. It represents the work of about 100 researchers who have tried and failed to find more accurate, faster, or less memory intensive models. The resource requirements seem to increase as we go up the chain from n-grams to grammar, contrary to symbolic approaches. This is my argument why I think AI is bound by lack of hardware, not lack of theory.

> 1. Legg, Shane, and Marcus Hutter (2006), A Formal Measure of Machine Intelligence, Proc. Annual Machine Learning Conference of Belgium and The Netherlands (Benelearn-2006), Ghent, 2006. http://www.vetta.org/documents/ui_benelearn.pdf

> 2. Legg, Shane (2006), Is There an Elegant Universal Theory of Prediction?, Technical Report IDSIA-12-06, IDSIA / USI-SUPSI, Dalle Molle Institute for Artificial Intelligence, Galleria 2, 6928 Manno, Switzerland. http://www.vetta.org/documents/IDSIA-12-06-1.pdf

> 3. M. Mahoney (2000), A Note on Lexical Acquisition in Text without Spaces, http://cs.fit.edu/~mmahoney/dissertation/lex1.html

> -- Matt Mahoney, [EMAIL PROTECTED]
RE: Language modeling (was Re: [agi] draft for comment)
Thinking out loud here, as I find the relationship between compression and intelligence interesting:

Compression in itself has the overriding goal of reducing storage bits. Intelligence has coincidental compression; there is resource management there. But I do think that it is not ONLY coincidental. Knowledge has structure which can be organized and naturally collapses into a lower-complexity storage state. Things have order, based on physics and other mathematical relationships. The relationship between compression, stored knowledge, and intelligence is intriguing.

But knowledge can be compressed inefficiently, to where it inhibits extraction and other operations, so there are differences between compression and intelligence related to computational expense. Optimal intelligence would have a variational compression structure; in other words, some stuff needs fast access time with minimal decompression resource expenditure, while other stuff has high storage priority but computational expense and access time are not a priority.

And then when you say the word "compression" there is a question of utility. The result of a compressor that has general intelligence still has the goal of reducing storage bits. I think that compression can be a byproduct of the stored knowledge created by a general intelligence. But if you have a compressor with general intelligence built in and you assign it the goal of taking input data and reducing the storage space, it may still result in a series of hacks, because that may be the best way of accomplishing that goal. Sure, there may be some new undiscovered hacks that require general intelligence to uncover. And a compressor that is generally intelligent may produce richer lossily compressed data from varied sources. The best lossy compressor is probably generally intelligent.

They are very similar, as you indicate... but when you start getting really lossy, when you start asking questions of your lossy compressed data that are not related to just the uncompressed input, there is a difference. Compression itself is just one-dimensional; intelligence is multi-dimensional.

John

> -----Original Message-----
> From: Matt Mahoney [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 05, 2008 6:39 PM
> To: agi@v2.listbox.com
> Subject: Re: Language modeling (was Re: [agi] draft for comment)

> --- On Fri, 9/5/08, Pei Wang <[EMAIL PROTECTED]> wrote:

>> Like too many existing AI works, my disagreement with you is not that much on the solution you proposed (I can see the value), but on the problem you specified as the goal of AI. For example, I have no doubt about the theoretical and practical values of compression, but don't think it has much to do with intelligence.

> In http://cs.fit.edu/~mmahoney/compression/rationale.html I explain why text compression is an AI problem. To summarize, if you know the probability distribution of text, then you can compute P(A|Q) for any question Q and answer A to pass the Turing test. Compression allows you to precisely measure the accuracy of your estimate of P. Compression (actually, word perplexity) has been used since the early 1990's to measure the quality of language models for speech recognition, since it correlates well with word error rate.

> The purpose of this work is not to solve general intelligence, such as the universal intelligence proposed by Legg and Hutter [1]. That is not computable, so you have to make some arbitrary choice with regard to test environments about what problems you are going to solve. I believe the goal of AGI should be to do useful work for humans, so I am making a not so arbitrary choice to solve a problem that is central to what most people regard as useful intelligence.

> I had hoped that my work would lead to an elegant theory of AI, but that hasn't been the case. Rather, the best compression programs were developed as a series of thousands of hacks and tweaks, e.g. change a 4 to a 5 because it gives 0.002% better compression on the benchmark. The result is an opaque mess. I guess I should have seen it coming, since it is predicted by information theory (e.g. [2]).

> Nevertheless the architectures of the best text compressors are consistent with cognitive development models, i.e. phoneme (or letter) sequences -> lexical -> semantics -> syntax, which are themselves consistent with layered neural architectures. I already described a neural semantic model in my last post. I also did work supporting Hutchens and Alder showing that lexical models can be learned from n-gram statistics, consistent with the observation that babies learn the rules for segmenting continuous speech before they learn any words [3].
Re: Language modeling (was Re: [agi] draft for comment)
--- On Fri, 9/5/08, Pei Wang <[EMAIL PROTECTED]> wrote:

> Thanks for taking the time to explain your ideas in detail. As I said, our different opinions on how to do AI come from our very different understanding of "intelligence". I don't take "passing Turing Test" as my research goal (as explained in http://nars.wang.googlepages.com/wang.logic_intelligence.pdf and http://nars.wang.googlepages.com/wang.AI_Definitions.pdf). I disagree with Hutter's approach, not because his SOLUTION is not computable, but because his PROBLEM is too idealized and simplified to be relevant to the actual problems of AI.

I don't advocate the Turing test as the ideal test of intelligence. Turing himself was aware of the problem when he gave an example of a computer answering an arithmetic problem incorrectly in his famous 1950 paper:

Q: Please write me a sonnet on the subject of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34957 to 70764.
A: (Pause about 30 seconds and then give as answer) 105621.
Q: Do you play chess?
A: Yes.
Q: I have K at my K1, and no other pieces. You have only K at K6 and R at R1. It is your move. What do you play?
A: (After a pause of 15 seconds) R-R8 mate.

I prefer a "preference test", which a machine passes if you prefer to talk to it over a human. Such a machine would be too fast and make too few errors to pass a Turing test. For example, if you had to add two large numbers, I think you would prefer to use a calculator rather than ask someone. You could, I suppose, measure intelligence as the fraction of questions for which the machine gives the preferred answer, which would be 1/4 in Turing's example.

If you know the probability distribution P of text, and therefore know the distribution P(A|Q) for any question Q and answer A, then to pass the Turing test you would randomly choose answers from this distribution. But to pass the preference test for all Q, you would choose the A that maximizes P(A|Q), because the most probable answer is usually the correct one. Text compression measures progress toward either test.

I believe that compression measures your definition of intelligence, i.e. adaptation given insufficient knowledge and resources. In my benchmark, there are two parts: the size of the decompression program, which measures the initial knowledge, and the compressed size, which measures prediction errors that occur as the system adapts. Programs must also meet practical time and memory constraints to be listed in most benchmarks.

Compression is also consistent with Legg and Hutter's universal intelligence, i.e. the expected reward of an AIXI universal agent in an environment simulated by a random program. Suppose you have a compression oracle that inputs any string x and outputs the shortest program that outputs a string with prefix x. Then this reduces the (uncomputable) AIXI problem to using the oracle to guess which environment is consistent with the interaction so far, and figuring out which future outputs by the agent will maximize reward.

Of course universal intelligence is also not testable because it requires an infinite number of environments. Instead, we have to choose a practical data set. I use Wikipedia text, which has fewer errors than average text, but I believe that is consistent with my goal of passing the preference test.

-- Matt Mahoney, [EMAIL PROTECTED]
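[Editor's note: the Turing-test-versus-preference-test distinction reduces to sampling versus maximizing over the same distribution P(A|Q). A toy sketch (the distribution is invented; 105721 is the correct sum of 34957 and 70764 from Turing's dialogue, and 105621 is the deliberate mistake):]

```python
import random

# Given P(A|Q), a machine imitating humans (Turing test) samples an
# answer, reproducing human error rates, while a preference-test machine
# returns the single most probable answer, which is usually the correct one.
P_answer = {"105721": 0.7, "105621": 0.2, "I never could add": 0.1}

def turing_answer(dist):
    """Sample an answer in proportion to its probability."""
    answers = list(dist)
    weights = list(dist.values())
    return random.choices(answers, weights=weights)[0]

def preferred_answer(dist):
    """Return the most probable answer."""
    return max(dist, key=dist.get)

print(preferred_answer(P_answer))   # the correct sum, every time
print(turing_answer(P_answer))      # sometimes the human-style mistake
```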
RE: Language modeling (was Re: [agi] draft for comment)
--- On Sat, 9/6/08, John G. Rose <[EMAIL PROTECTED]> wrote:

> Compression in itself has the overriding goal of reducing storage bits.

Not the way I use it. The goal is to predict what the environment will do next. Lossless compression is a way of measuring how well we are doing.

-- Matt Mahoney, [EMAIL PROTECTED]
Re: Language modeling (was Re: [agi] draft for comment)
I won't argue against your "preference test" here, since this is a big topic, and I've already made my position clear in the papers I mentioned. As for "compression", yes every intelligent system needs to 'compress' its experience in the sense of "keeping the essence but using less space". However, it is clearly not loseless. It is even not what we usually call "loosy compression", because what to keep and in what form is highly context-sensitive. Consequently, this process is not reversible --- no decompression, though the result can be applied in various ways. Therefore I prefer not to call it compression to avoid confusing this process with the technical sense of "compression", which is reversible, at least approximately. Legg and Hutter's "universal intelligence" definition is way too narrow to cover various attempts towards AI, even as an idealization. Therefore, I don't take it as a goal to aim at and to approach to as close as possible. However, as I said before, I'd rather leave this topic for the future, when I have enough time to give it a fair treatment. Pei On Sat, Sep 6, 2008 at 4:29 PM, Matt Mahoney <[EMAIL PROTECTED]> wrote: > --- On Fri, 9/5/08, Pei Wang <[EMAIL PROTECTED]> wrote: > >> Thanks for taking the time to explain your ideas in detail. >> As I said, >> our different opinions on how to do AI come from our very >> different >> understanding of "intelligence". I don't take >> "passing Turing Test" as >> my research goal (as explained in >> http://nars.wang.googlepages.com/wang.logic_intelligence.pdf >> and >> http://nars.wang.googlepages.com/wang.AI_Definitions.pdf). >> I disagree >> with Hutter's approach, not because his SOLUTION is not >> computable, >> but because his PROBLEM is too idealized and simplified to >> be relevant >> to the actual problems of AI. > > I don't advocate the Turing test as the ideal test of intelligence. 
> Turing himself was aware of the problem when he gave an example of a
> computer answering an arithmetic problem incorrectly in his famous 1950
> paper:
>
> Q: Please write me a sonnet on the subject of the Forth Bridge.
> A: Count me out on this one. I never could write poetry.
> Q: Add 34957 to 70764.
> A: (Pause about 30 seconds and then give as answer) 105621.
> Q: Do you play chess?
> A: Yes.
> Q: I have K at my K1, and no other pieces. You have only K at K6 and R at
> R1. It is your move. What do you play?
> A: (After a pause of 15 seconds) R-R8 mate.
>
> I prefer a "preference test", which a machine passes if you prefer to
> talk to it over a human. Such a machine would be too fast and make too
> few errors to pass a Turing test. For example, if you had to add two
> large numbers, I think you would prefer to use a calculator than ask
> someone. You could, I suppose, measure intelligence as the fraction of
> questions for which the machine gives the preferred answer, which would
> be 1/4 in Turing's example.
>
> If you know the probability distribution P of text, and therefore know
> the distribution P(A|Q) for any question Q and answer A, then to pass the
> Turing test you would randomly choose answers from this distribution. But
> to pass the preference test for all Q, you would choose A that maximizes
> P(A|Q) because the most probable answer is usually the correct one. Text
> compression measures progress toward either test.
>
> I believe that compression measures your definition of intelligence, i.e.
> adaptation given insufficient knowledge and resources. In my benchmark,
> there are two parts: the size of the decompression program, which
> measures the initial knowledge, and the compressed size, which measures
> prediction errors that occur as the system adapts. Programs must also
> meet practical time and memory constraints to be listed in most
> benchmarks.
>
> Compression is also consistent with Legg and Hutter's universal
> intelligence, i.e.
> expected reward of an AIXI universal agent in an environment simulated by
> a random program. Suppose you have a compression oracle that inputs any
> string x and outputs the shortest program that outputs a string with
> prefix x. Then this reduces the (uncomputable) AIXI problem to using the
> oracle to guess which environment is consistent with the interaction so
> far, and figuring out which future outputs by the agent will maximize
> reward.
>
> Of course universal intelligence is also not testable because it requires
> an infinite number of environments. Instead, we have to choose a
> practical data set. I use Wikipedia text, which has fewer errors than
> average text, but I believe that is consistent with my goal of passing
> the preference test.
>
> -- Matt Mahoney, [EMAIL PROTECTED]
Re: Language modeling (was Re: [agi] draft for comment)
--- On Sat, 9/6/08, Pei Wang <[EMAIL PROTECTED]> wrote:

> As for "compression", yes, every intelligent system needs to 'compress'
> its experience in the sense of "keeping the essence but using less
> space". However, it is clearly not lossless. It is not even what we
> usually call "lossy compression", because what to keep and in what form
> is highly context-sensitive. Consequently, this process is not reversible
> --- no decompression, though the result can be applied in various ways.
> Therefore I prefer not to call it compression, to avoid confusing this
> process with the technical sense of "compression", which is reversible,
> at least approximately.

I think you misunderstand my use of compression. The goal is modeling or prediction: given a string, predict the next symbol. I use compression to estimate how accurate the model is. It is easy to show that if your model is accurate, then when you connect it to an ideal coder (such as an arithmetic coder), compression will be optimal. You could actually skip the coding step, but it is cheap, so I use it so that there is no question of making a mistake in the measurement. If a bug in the coder produces too small an output, then the decompression step won't reproduce the original file.

In fact, many speech recognition experiments do skip the coding step in their tests and merely calculate what the compressed size would be. (More precisely, they calculate word perplexity, which is equivalent.) The goal of speech recognition is to find the text y that maximizes P(y|x) for utterance x. It is common to factor the model using Bayes' law: P(y|x) = P(x|y)P(y)/P(x). We can drop P(x) since it is constant, leaving the acoustic model P(x|y) and the language model P(y) to evaluate. We know from experiments that compression tests on P(y) correlate well with word error rates for the overall system.

Internally, all lossless compressors use lossy compression or data reduction to make predictions.
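The equivalence between compressed size and perplexity can be made concrete: under an ideal coder, a model that assigns probability p to each observed symbol spends -log2(p) bits on it, and perplexity is just that bit rate exponentiated. A minimal sketch (the per-symbol probabilities here are toy inputs, not output of any real language model):

```python
import math

def compressed_size_bits(probs):
    """Ideal code length: an arithmetic coder driven by a model that
    assigned probability p to each observed symbol uses -log2(p) bits."""
    return -sum(math.log2(p) for p in probs)

def perplexity(probs):
    """Perplexity = 2^(bits per symbol); measuring it is equivalent to
    measuring the compressed size, without running the coder."""
    return 2 ** (compressed_size_bits(probs) / len(probs))

# Toy example: a model that assigns probability 0.5 to each of 4 symbols
# compresses them to 4 bits total, i.e. perplexity 2.
probs = [0.5, 0.5, 0.5, 0.5]
print(compressed_size_bits(probs))  # 4.0
print(perplexity(probs))            # 2.0
```

This is why the speech-recognition experiments described above can skip the coding step: the model's probabilities alone determine what the compressed size would be.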
Most commonly, a context is truncated and possibly hashed before looking up the statistics for the next symbol. The top lossless compressors in my benchmark use more sophisticated forms of data reduction, such as mapping upper and lower case letters together, or mapping groups of semantically or syntactically related words to the same context.

As a test, lossless compression is only appropriate for text. For other hard AI problems such as vision, art, and music, incompressible noise would overwhelm the human-perceptible signal. Theoretically you could compress video to 2 bits per second (the rate of human long term memory) by encoding it as a script. The decompressor would read the script and create a new movie. The proper test would be lossy compression, but this requires human judgment to evaluate how well the reconstructed data matches the original.

-- Matt Mahoney, [EMAIL PROTECTED]
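The "truncate and possibly hash the context" step can be sketched as a tiny next-symbol predictor. This is an illustrative toy only, nothing like the model mixing in real benchmark compressors; the class name and table size are made up for the example:

```python
import zlib
from collections import Counter, defaultdict

class ContextModel:
    """Next-symbol predictor using a truncated, hashed context.

    Truncation and hashing are the lossy 'data reduction' step: a long
    history is cut to its last `order` symbols, and distinct contexts
    that hash to the same slot share one statistics table."""

    def __init__(self, order=2, table_size=1 << 20):
        self.order = order
        self.table_size = table_size
        self.stats = defaultdict(Counter)

    def _slot(self, history):
        context = history[-self.order:]                        # truncate
        return zlib.crc32(context.encode()) % self.table_size  # hash

    def update(self, history, symbol):
        self.stats[self._slot(history)][symbol] += 1

    def predict(self, history):
        counts = self.stats[self._slot(history)]
        total = sum(counts.values())
        return {s: c / total for s, c in counts.items()} if total else {}

# Train on a toy string, then ask what tends to follow the context "ab".
m = ContextModel()
text = "abababab"
for i, ch in enumerate(text):
    m.update(text[:i], ch)
print(m.predict("ab"))  # distribution over next symbols after "ab"
```

The lossy part is visible in `_slot`: two different long histories that end in the same two symbols, or that merely collide under the hash, are pooled into one set of counts, yet the overall compressor built on such a model can still be lossless.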
RE: Language modeling (was Re: [agi] draft for comment)
> From: Matt Mahoney [mailto:[EMAIL PROTECTED]
>
> --- On Sat, 9/6/08, John G. Rose <[EMAIL PROTECTED]> wrote:
>
>> Compression in itself has the overriding goal of reducing
>> storage bits.
>
> Not the way I use it. The goal is to predict what the environment will
> do next. Lossless compression is a way of measuring how well we are
> doing.

Predicting the environment in order to determine which data to pack where, thus achieving a higher compression ratio? Or compression as an integral part of prediction? Some types of prediction are inherently compressed, I suppose.

John
RE: Language modeling (was Re: [agi] draft for comment)
--- On Sun, 9/7/08, John G. Rose <[EMAIL PROTECTED]> wrote:

> From: John G. Rose <[EMAIL PROTECTED]>
> Subject: RE: Language modeling (was Re: [agi] draft for comment)
> To: agi@v2.listbox.com
> Date: Sunday, September 7, 2008, 9:15 AM
>
>> From: Matt Mahoney [mailto:[EMAIL PROTECTED]
>>
>> --- On Sat, 9/6/08, John G. Rose <[EMAIL PROTECTED]> wrote:
>>
>>> Compression in itself has the overriding goal of reducing
>>> storage bits.
>>
>> Not the way I use it. The goal is to predict what the environment will
>> do next. Lossless compression is a way of measuring how well we are
>> doing.
>
> Predicting the environment in order to determine which data to pack
> where, thus achieving higher compression ratio. Or compression as an
> integral part of prediction? Some types of prediction are inherently
> compressed I suppose.

Predicting the environment to maximize reward. Hutter proved that universal intelligence is a compression problem. The optimal behavior of an AIXI agent is to guess the shortest program consistent with observation so far. That's algorithmic compression.

-- Matt Mahoney, [EMAIL PROTECTED]
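"Guess the shortest program consistent with observation so far" can be illustrated with a deliberately tiny stand-in for the program class: treat a repeating seed string as the "program" and its repetition as the environment it generates. This is an editorial toy to show the shortest-consistent-hypothesis idea, not an implementation of AIXI:

```python
def shortest_repeating_seed(observed):
    """Return the shortest string whose repetition is consistent with
    (i.e. has as a prefix) the observed sequence. Seed length plays the
    role of program length in algorithmic compression."""
    for k in range(1, len(observed) + 1):
        seed = observed[:k]
        expanded = (seed * (len(observed) // k + 1))[:len(observed)]
        if expanded == observed:
            return seed
    return observed

def predict_next(observed):
    """Predict the next symbol by running the shortest consistent
    'program' one step past the observation."""
    seed = shortest_repeating_seed(observed)
    return seed[len(observed) % len(seed)]

print(shortest_repeating_seed("abcabcab"))  # 'abc'
print(predict_next("abcabcab"))             # 'c'
```

The shorter the consistent seed, the more the observation has been compressed, and the prediction simply continues the compressed description, which is the intuition behind treating prediction as algorithmic compression.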
RE: Language modeling (was Re: [agi] draft for comment)
> From: Matt Mahoney [mailto:[EMAIL PROTECTED]
>
> --- On Sun, 9/7/08, John G. Rose <[EMAIL PROTECTED]> wrote:
>
>> Predicting the environment in order to determine which data to pack
>> where, thus achieving higher compression ratio. Or compression as an
>> integral part of prediction? Some types of prediction are inherently
>> compressed I suppose.
>
> Predicting the environment to maximize reward. Hutter proved that
> universal intelligence is a compression problem. The optimal behavior of
> an AIXI agent is to guess the shortest program consistent with
> observation so far. That's algorithmic compression.

Oh I see. Guessing the shortest program = compression. OK, right. But yeah, like Pei said, the word "compression" is misleading. It implies a reduction, where you are actually increasing understanding :)

John