This seems like quite an interesting approach to qualitative comparison of
AGI systems...

On Thu, Dec 27, 2012 at 8:34 PM, Abram Demski <abramdem...@gmail.com> wrote:

> Since testing seems to be a popular topic at the moment...
>
> I've been thinking on and off that it would be interesting to have a
> collection of AGI sub-problem tests. Individual tests would *not* be aimed
> at the whole AGI problem... instead, an individual test would look at a
> particular sub-problem. Different researchers would, of course, think
> different sub-problems were more or less important; most researchers might
> think most sub-problems have little or no importance. Still, the collection
> might provide some community value.
>
> I imagine two different uses for this. First, researchers who regard a
> particular sub-problem as a *component* which can be directly integrated
> into an architecture can look at what algorithms best solve that problem.
> Second, those who don't think of it as a component, but do agree that a
> viable (proto-)AGI system should be competitive with narrow AI on that
> task, can use the score on that task as an indication of progress for
> their (proto-)AGI system.
>
> I've also been speculating about ways to modify the test to reduce the
> score of narrow AI approaches. For example, if one algorithm ends up being
> used as a component of the solution to several different tests, this should
> be significant! Unfortunately, considerations like this are not easily
> given a numerical score...
>
> 1) Shorter programs should be favored over longer programs, to some
> extent. We would tend to expect shorter solutions to generalize more.
> While this observation is fairly straightforward for *predictive* tasks,
> it is not nearly so good a principle for other tasks. Matt's compression
> benchmark already takes this into account in a reasonable way (see the
> sketch after this list); however, a simple formula to balance program
> length and success in other domains would, I take it, be wrong.
>
> 2) If one program can solve a wide variety of problems with relatively
> little support code, that's good. However, it would seem silly to rank
> programs directly in this way, because most programs will only solve a
> subset of the problems, and it doesn't seem right to compare the scores of
> programs which work on differing problem subsets.
>
> 3) .. any other ideas?
>
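> To make the length-plus-success idea concrete, here is a minimal sketch of
> the kind of score Matt's benchmark computes (the command-line convention
> and names below are hypothetical, just for illustration): the size of the
> compressed output plus the size of the (de)compressor program itself, so
> that shorter programs are favored.
>
>   import os, subprocess
>
>   def benchmark_score(cmd, program_path, input_path, output_path):
>       # Run the candidate compressor (assumed convention: "cmd <in> <out>").
>       subprocess.run(cmd + [input_path, output_path], check=True)
>       # Lower is better: compressed size plus the size of the program
>       # itself, so shorter programs are favored.
>       return (os.path.getsize(output_path)
>               + os.path.getsize(program_path))
>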
> Matt's collection of benchmarks, foremost his compression benchmark, is a
> good start. A few words in defense of compression as a benchmark:
>
> As Matt has attempted to make clear, this is really just a prediction
> benchmark. Any (probabilistic) predictor can be hooked up to an arithmetic
> coder, and the length of the resulting compressed data (plus the program
> size) gives a measure of predictive success which has strong Bayesian
> foundations.
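>
> As a sketch of that equivalence (the predictor interface below is
> hypothetical, not any particular system's API): the ideal arithmetic-coded
> length of a sequence is just the sum of -log2 p(symbol | history) under
> the predictor, so log-loss and compressed size measure the same thing, up
> to a couple of bits of coder overhead.
>
>   import math
>
>   def ideal_code_length(predictor, sequence):
>       # Bits an arithmetic coder would need (ignoring ~2 bits of overhead)
>       # to encode `sequence` using the predictor's probabilities.
>       # predictor.prob(history, symbol) is an assumed interface returning
>       # P(symbol | history).
>       bits = 0.0
>       for i, symbol in enumerate(sequence):
>           p = predictor.prob(sequence[:i], symbol)
>           bits += -math.log2(p)   # log-loss == ideal code length
>       return bits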
>
> There are certainly other measures of predictive accuracy, such as simply
> taking the difference between prediction and observation. These can be
> included if desired, but I don't think they are as principled.
>
> The ultimate test of a predictive system is how well it serves other
> tasks. In other words, an AGI-level predictive system should try to predict
> what is important for the task at hand, rather than predicting everything.
> But that can be measured in other tests...
>
> Now, I AM disappointed with one thing about Matt's benchmark, when viewed
> in this way: although a number of ideas in the compression world (and many
> combinations/permutations) have been tried, the benchmark does not test
> many ideas from the AI world. For example, I cannot say how well PAQ
> performs against a standard HMM implementation or the many HMM variants. As
> a result, there is no way to actually say anything about the impact of PAQ
> on (narrow) AI sequence prediction. This, of course, could be remedied by
> simply trying it out... however, their performance would likely look poor
> *simply* because PAQ has been tuned to this problem set over a number of
> years.
>
> Therefore, for a benchmark useful to the research community, it would be
> good to test a number of simple algorithms against one another. Fine-tuning
> of algorithms would be explicitly discouraged, since we are looking for
> what general principles tend to work better, rather than looking for which
> specific software works better!
>
> So, again, we see that the actual score (although important) is not
> necessarily the most important thing about the benchmark...
>
> A second test area might be reinforcement learning. Systems could compete
> against the recent AIXI approximation. Again, it would be good to focus on
> what components are being used... for example, different prediction
> algorithms and different planning algorithms could be mixed-and-matched to
> some extent. (Ultimately, one might expect a coupled algorithm to do best,
> but currently, the algorithms we know to try would be separate...).
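>
> As a sketch of what mix-and-match could look like (the interfaces here are
> invented for illustration, not taken from any existing framework): keep
> the predictive model and the planner behind separate interfaces, so that
> any model/planner pair can be dropped into the same agent loop and scored.
>
>   class Agent:
>       # Minimal agent loop with pluggable model and planner
>       # (hypothetical interfaces, for illustration only).
>       def __init__(self, model, planner, actions):
>           self.model, self.planner, self.actions = model, planner, actions
>           self.history = []
>
>       def act(self, observation, reward):
>           self.history.append((observation, reward))
>           self.model.update(self.history)    # any prediction algorithm
>           action = self.planner.plan(self.model, self.actions)  # any planner
>           self.history.append(action)
>           return action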
>
> There are a number of RL test scenarios out there. Naturally it would be
> good to design some specifically for AGI as well, i.e., include more
> realistic physical simulations with some degree of motor control and
> vision. This sort of AGI testing environment has been discussed extensively
> in the past, of course, and some are working towards making it happen.
>
> A third testing area might be automatic programming. There is a start on
> this website:
>
> http://www.inductive-programming.org/
>
> It would be possible, with some work, to try MOSES on those problems.
> (Notice that not all systems could be tested on every problem, so again, we
> can't really compute a global score.)
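>
> For what it's worth, most of those problems boil down to fitting a program
> to input/output examples, so a shared scoring harness could be quite
> simple. A sketch, with a made-up toy task:
>
>   def score_candidate(program, examples):
>       # Fraction of I/O examples a candidate program reproduces.
>       # `program` is any callable; `examples` is a list of (input, output)
>       # pairs taken from the problem specification.
>       hits = sum(1 for x, y in examples if program(x) == y)
>       return hits / len(examples)
>
>   # e.g. a toy "double the input" task:
>   examples = [(0, 0), (1, 2), (3, 6), (10, 20)]
>   print(score_candidate(lambda x: 2 * x, examples))   # -> 1.0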
>
> A fourth testing area could be automated reasoning. The TPTP dataset could
> be used.
>
> A fifth area could be GGP (general game playing).
>
> Any other suggestions? Are there other existing benchmarks which
> could/should be included in such a list? Are there benchmarks which should
> exist, but don't?
>
> This is all, of course, extremely speculative. Compiling a list of
> existing benchmarks of potential relevance to AGI is somewhat easy. Making
> it a really valuable resource is much more difficult. In particular,
> translating problem specifications for use with existing systems is
> difficult and annoying. Standards such as RL-glue could be very important
> for reducing the work.
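>
> A sketch of why such a standard helps (the interface below is a made-up
> minimal one, not the actual RL-glue API): if every environment exposes the
> same few calls, each system needs only one adapter rather than one per
> benchmark.
>
>   class Environment:
>       # Minimal common environment interface (illustrative only; the real
>       # RL-glue protocol is richer than this).
>       def reset(self):
>           # Start an episode; return the first observation.
>           raise NotImplementedError
>       def step(self, action):
>           # Apply an action; return (observation, reward, done).
>           raise NotImplementedError
>
>   def run_episode(agent, env):
>       obs, reward, done = env.reset(), 0.0, False
>       total = 0.0
>       while not done:
>           action = agent.act(obs, reward)
>           obs, reward, done = env.step(action)
>           total += reward
>       return total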
>
> Best,
>
> --
> Abram Demski
> http://lo-tho.blogspot.com/
>



-- 
Ben Goertzel, PhD
http://goertzel.org

"My humanity is a constant self-overcoming" -- Friedrich Nietzsche


