This seems like a quite interesting approach to qualitative comparison of AGI systems... (A couple of rough sketches of the compression-as-prediction scoring and the predictor/planner mix-and-match idea are appended below the quoted message.)
On Thu, Dec 27, 2012 at 8:34 PM, Abram Demski <abramdem...@gmail.com> wrote:
> Since testing seems to be a popular topic at the moment...
>
> I've been thinking on and off that it would be interesting to have a collection of AGI sub-problem tests. Individual tests would *not* be aimed at the whole AGI problem... instead, an individual test would look at a particular sub-problem. Different researchers would, of course, think different sub-problems were more or less important; most researchers might think most sub-problems have little or no importance. Still, the collection might provide some community value.
>
> I imagine two different uses for this. First, researchers who regard a particular sub-problem as a *component* which can be directly integrated into an architecture can look at which algorithms best solve that problem. Second, those who don't think of it as a component, but do agree that a viable (proto-)AGI system should be competitive with narrow AI on that task, can use the score on that task as an indication of progress for their (proto-)AGI system.
>
> I've also been speculating about ways to modify the tests to reduce the score of narrow AI approaches. For example, if one algorithm ends up being used as a component of the solution to several different tests, this should be significant! Unfortunately, considerations like this are not easily given a numerical score...
>
> 1) Shorter programs should be favored over longer programs, to some extent. We would tend to expect shorter solutions to generalize more. However, while this observation is fairly straightforward in *predictive* tasks, it is not nearly so good a principle for other tasks. Matt's compression benchmark already takes this into account in a reasonable way. However, a simple formula to balance program length and success in other domains would, I take it, be wrong.
>
> 2) If one program can solve a wide variety of problems with relatively little support code, that's good. However, it would seem silly to rank programs directly in this way, because most programs will only solve a subset of the problems, and it doesn't seem right to compare the scores of programs which work on differing problem subsets.
>
> 3) ... any other ideas?
>
> Matt's collection of benchmarks, foremost his compression benchmark, is a good start. A few words in defense of compression as a benchmark:
>
> As Matt has attempted to make clear, this is really just a prediction benchmark. Any (probabilistic) predictor can be hooked up to an arithmetic coder, and the length of the resulting compressed data (plus the program size) gives a measure of predictive success which has strong Bayesian foundations.
>
> There are certainly other measures of predictive accuracy, such as simply taking the difference between prediction and observation. These can be included if desired, but I don't think they are as principled.
>
> The ultimate test of a predictive system is how well it serves other tasks. In other words, an AGI-level predictive system should try to predict what is important for the task at hand, rather than predicting everything. But that can be measured in other tests...
>
> Now, I AM disappointed with one thing about Matt's benchmark, when viewed in this way: although a number of ideas from the compression world (and many combinations/permutations) have been tried, the benchmark does not test many ideas from the AI world.
> For example, I cannot say how well PAQ performs against a standard HMM implementation or the many HMM variants. As a result, there is no way to actually say anything about the impact of PAQ on (narrow) AI sequence prediction. This, of course, could be remedied by simply trying it out... however, performance would likely be poor *simply* because PAQ has been tuned to this problem set over a number of years.
>
> Therefore, for a benchmark useful to the research community, it would be good to test a number of simple algorithms against one another. Fine-tuning of algorithms would be explicitly discouraged, since we are looking for which general principles tend to work better, rather than for which specific piece of software works better!
>
> So, again, we see that the actual score (although important) is not necessarily the most important thing about the benchmark...
>
> A second test area might be reinforcement learning. Systems could compete against the recent AIXI approximation. Again, it would be good to focus on what components are being used... for example, different prediction algorithms and different planning algorithms could be mixed and matched to some extent. (Ultimately, one might expect a coupled algorithm to do best, but currently, the algorithms we know to try would be separate...)
>
> There are a number of RL test scenarios out there. Naturally, it would be good to design some specifically for AGI as well, i.e., include more realistic physical simulations with some degree of motor control and vision. This sort of AGI testing environment has been discussed extensively in the past, of course, and some are working towards making it happen.
>
> A third testing area might be automatic programming. There is a start on this website:
>
> http://www.inductive-programming.org/
>
> It would be possible, with some work, to try MOSES on those problems. (Notice that not all systems could be tested on every problem, so again, we can't really compute a global score.)
>
> A fourth testing area could be automated reasoning. The TPTP dataset could be used.
>
> A fifth area could be GGP (general game playing).
>
> Any other suggestions? Are there other existing benchmarks which could/should be included in such a list? Are there benchmarks which should exist, but don't?
>
> This is all, of course, extremely speculative. Compiling a list of existing benchmarks of potential relevance to AGI is somewhat easy; making it a really valuable resource is much more difficult. In particular, translating problem specifications for use with existing systems is difficult and annoying. Standards such as RL-glue could be very important for reducing that work.
>
> Best,
>
> --
> Abram Demski
> http://lo-tho.blogspot.com/

-- 
Ben Goertzel, PhD
http://goertzel.org

"My humanity is a constant self-overcoming" -- Friedrich Nietzsche
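
To make the compression-as-prediction point above concrete: the ideal arithmetic-coded length of a sequence under a probabilistic model is just its summed log-loss, so candidate predictors can be scored against each other without writing an actual coder. A minimal Python sketch of that scoring; the adaptive order-0 and order-1 Laplace models are placeholder stand-ins (not PAQ and not an HMM), and in a benchmark like Matt's the program's own size would be added to the total:

import math
from collections import defaultdict

def code_length_bits(data, model):
    """Ideal arithmetic-coded size of `data` (a bytes object) under `model`, in bits:
    the sum of -log2 p(symbol | history), updating the model online as a compressor would."""
    total, history = 0.0, []
    for symbol in data:
        total += -math.log2(model.prob(history, symbol))
        model.update(history, symbol)
        history.append(symbol)
    return total

class Order0:
    """Adaptive order-0 byte model with add-one (Laplace) smoothing."""
    def __init__(self):
        self.counts, self.total = defaultdict(int), 0
    def prob(self, history, symbol):
        return (self.counts[symbol] + 1) / (self.total + 256)
    def update(self, history, symbol):
        self.counts[symbol] += 1
        self.total += 1

class Order1:
    """Adaptive order-1 byte model: condition counts on the previous byte."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)
    def prob(self, history, symbol):
        ctx = history[-1] if history else -1
        return (self.counts[ctx][symbol] + 1) / (self.totals[ctx] + 256)
    def update(self, history, symbol):
        ctx = history[-1] if history else -1
        self.counts[ctx][symbol] += 1
        self.totals[ctx] += 1

if __name__ == "__main__":
    data = b"the quick brown fox jumps over the lazy dog " * 50
    for name, model in (("order-0", Order0()), ("order-1", Order1())):
        print(name, round(code_length_bits(data, model) / 8), "bytes vs", len(data), "raw")

Hooking either model up to a real arithmetic coder would add only a couple of bits to these totals, which is why the log-loss sum is a fair proxy for compressed size.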
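On the reinforcement-learning side, the mix-and-match idea amounts to fixing a thin interface between a world-model (prediction) component and a planning component, so that any predictor can be benchmarked with any planner. A rough, purely hypothetical interface sketch; none of these class or method names come from RL-glue or any existing framework:

from typing import Any, Protocol, Sequence, Tuple

class Predictor(Protocol):
    """World-model component: learns transitions and answers 'what if' queries."""
    def observe(self, state: Any, action: Any, reward: float, next_state: Any) -> None: ...
    def simulate(self, state: Any, action: Any) -> Tuple[Any, float]: ...

class Planner(Protocol):
    """Planning component: chooses an action by querying a Predictor."""
    def choose(self, state: Any, actions: Sequence[Any], model: Predictor) -> Any: ...

class Agent:
    """Glue object: any Predictor paired with any Planner, so the two components
    can be swapped independently and every combination scored on the same RL tasks."""
    def __init__(self, predictor: Predictor, planner: Planner, actions: Sequence[Any]):
        self.predictor, self.planner, self.actions = predictor, planner, actions
    def act(self, state: Any) -> Any:
        return self.planner.choose(state, self.actions, self.predictor)
    def learn(self, state: Any, action: Any, reward: float, next_state: Any) -> None:
        self.predictor.observe(state, action, reward, next_state)

# e.g. (hypothetical component names):
#   Agent(HMMPredictor(), OneStepGreedyPlanner(), actions)
#   Agent(ContextTreePredictor(), MonteCarloPlanner(), actions)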