This is getting long in embedded-reply format, but oh well....

On Wed, Jul 23, 2008 at 12:24 PM, Steve Richfield
<[EMAIL PROTECTED]> wrote:
> Abram,
>
> On 7/23/08, Abram Demski <[EMAIL PROTECTED]> wrote:
>>
>> Replying in reverse order....
>>
>> > Story: I once viewed being able to invert the Airy Disk transform (what
>> > makes a blur from a point of light in a microscope or telescope) as an
>> > EXTREMELY valuable thing to do to greatly increase their power, so I set
>> > about finding a transform function. Then, I wrote a program to test it,
>> > first making an Airy Disk blur and then transforming it back to the
>> > original
>> > point. It sorta worked, but there was lots of computational noise in the
>> > result, so I switched to double precision, whereupon it failed to work
>> > at
>> > all. After LOTS more work, I finally figured out that the Airy Disk
>> > function
>> > was a perfect spatial low-pass filter, so that two points that were too
>> > close to be resolved as separate points made EXACTLY the same perfectly
>> > circular pattern as did a single point of the same total brightness. In
>> > single precision, I was inverting the computational noise, and doing a
>> > pretty good job of it. However, for about a month, I thought that I had
>> > changed the world.
>>
>> Neat. I have a professor who is doing some stuff with a similar
>> transform, but with a circle (/sphere) rather than a disc (/ball).
>
>
> The "Airy Disk" is the name of the transform. In fact, it is the central
> maximum surrounded by faint rings of rapidly diminishing brightness, typical
> of what a star produces. Note that you can cut the radius of the first
> minimum to ~2/3 by stopping out all but a peripheral ring on the lens, which
> significantly increases the resolution - a well-known trick among
> experienced astronomers, but completely missed by the Hubble team! Just
> stopping out the middle of their mirror would make it equivalent to half
> again its present diameter, though its light-gathering ability would be
> greatly reduced. Of course, this could easily be switched in and out, just as
> they are already switching other optical systems in and out.
>
> Can you tell me a little more about what your professor is doing?

He came up with a fast way of doing the transform, which allows him to
quickly identify points that have spherical shapes around them (of a
given radius). He does the transform for a few different
radius-values, so he detects spheres of different sizes, and then he
uses the resulting information to help classify points. An example
application would be picking out important structures in X-ray images
or CAT scans: train the system on points that doctors pick out, then
use it to pick out points in a new image. Spheres may not be the best
feature to use, but they work, and since his algorithm allows them to
be calculated extremely quickly, it becomes a good choice.
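For concreteness, here is a naive Python sketch (with made-up names, and
nothing like his fast algorithm) of the kind of shell feature I understand
him to be computing: just average the voxel intensities on spherical shells
of a few radii around a candidate point, and use those averages as features.

import numpy as np

def shell_features(volume, center, radii, thickness=1.0):
    # Mean intensity on spherical shells around `center`, one value per
    # radius. A deliberately slow, brute-force stand-in for the fast
    # transform described above: it just averages the voxels whose distance
    # from the center falls within `thickness` of each radius.
    zz, yy, xx = np.indices(volume.shape)
    dist = np.sqrt((zz - center[0])**2 + (yy - center[1])**2 + (xx - center[2])**2)
    features = []
    for r in radii:
        shell = np.abs(dist - r) <= thickness / 2.0
        features.append(volume[shell].mean() if shell.any() else 0.0)
    return np.array(features)

# Features for one point in a random stand-in "scan":
volume = np.random.rand(32, 32, 32)
print(shell_features(volume, center=(16, 16, 16), radii=[3, 6, 9]))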

> Imagine a layer where the inputs represent probabilities of situations in
> the real world, and the present layer must recognize combinations that are
> important. This would seem to require ANDing (multiplication) rather than
> simple linear addition. However, if we first take the logarithms of the
> incoming probabilities, simple addition produces ANDed probabilities.
>
> OK, so let's make this a little more complicated by specifying that some of
> those inputs are correlated, and hence should receive reduced weighting. We
> can compute the weighted geometric mean of a group of inputs by simply
> multiplying each by its weight (synaptic efficacy), and adding the results
> together. Of course, the sum of these efficacies would be 1.0.
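In code, the trick you describe is nothing more than sums of logs; here is a
minimal Python sketch (my own illustration) of both the plain AND and the
weighted geometric mean with efficacies summing to 1.0:

import numpy as np

def and_probability(probs):
    # AND of independent probabilities: sum the logs, then exponentiate.
    return float(np.exp(np.sum(np.log(probs))))

def weighted_geometric_mean(probs, weights):
    # Weighted geometric mean via a weighted sum of logs. The weights play
    # the role of synaptic efficacies and are assumed to sum to 1.0, so
    # correlated inputs can be down-weighted instead of double-counted.
    probs, weights = np.asarray(probs, float), np.asarray(weights, float)
    assert abs(weights.sum() - 1.0) < 1e-9
    return float(np.exp(np.dot(weights, np.log(probs))))

p = [0.9, 0.8, 0.5]
print(and_probability(p))                         # 0.36 = 0.9 * 0.8 * 0.5
print(weighted_geometric_mean(p, [0.5, 0.3, 0.2]))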

If I understand correctly, you are saying that linear dependencies might
be squeezed out, but some nonlinear dependencies might become linear for
various reasons, including deliberately applying nonlinear functions
(log, sigmoid, ...) to the resulting variables.

It seems there are some standard ways of introducing nonlinearity:
http://en.wikipedia.org/wiki/Kernel_principal_component_analysis

On a related note, the standard classifier my professor applied to the
sphere-data worked by taking the data to a higher-dimensional space
that made nonlinear dependencies linear. It then found a plane that
cut between "yes" points and "no" points.
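For reference, here is a minimal sketch of that kind of pipeline using
scikit-learn (my choice of library here, not necessarily his): a kernel PCA
feature map, after which a plain linear separator is enough for two classes
that are not linearly separable in the original space.

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy "yes"/"no" points: "yes" on an inner disc, "no" on a surrounding ring,
# so no straight line separates them in the original 2-D space.
angles = rng.uniform(0, 2 * np.pi, 200)
radius = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
X = np.c_[radius * np.cos(angles), radius * np.sin(angles)]
y = np.array([1] * 100 + [0] * 100)

# Nonlinear feature map; in the mapped space the classes become (roughly)
# linearly separable, so a plane between "yes" and "no" points suffices.
X_mapped = KernelPCA(n_components=2, kernel="rbf", gamma=0.5).fit_transform(X)
clf = SVC(kernel="linear").fit(X_mapped, y)
print("training accuracy:", clf.score(X_mapped, y))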



> Agreed. Nonlinearities, time information, scope, memory, etc. BTW, have you
> looked at asynchronous logic -  where they have MEMORY elements sprinkled in
> with the logic?! Why? Because they look for some indication of a subsequent
> event, e.g. inputs going to FALSE, before re-evaluating the inputs. This is
> akin to pipelining - which OF COURSE you would expect in highly parallel
> systems like us. Asynchronous logic has many of the same design issues as
> our own brains, and some REALLY counter-intuitive techniques have been
> developed, like 2-wire logic, where TRUE and FALSE are transmitted on two
> different wires to eliminate the need for synchronicity. There are several
> such eye-popping methods that could well be working within us.

This sounds exactly like the invocation model (of Karl M. Fant). Fun
stuff. I hope it catches on.
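To make the 2-wire idea concrete, here is a toy Python simulation in the
NULL-Convention style (my own illustration, not Fant's actual gate library):
each bit rides on a FALSE rail and a TRUE rail, (0, 0) is the NULL spacer
between data waves, and the gate only evaluates on complete data and only
resets on complete NULL, which is exactly the memory-mixed-with-logic
behavior mentioned above.

NULL = (0, 0)
TRUE = (0, 1)    # (false_rail, true_rail)
FALSE = (1, 0)

def dual_rail_and(a, b, prev_out=NULL):
    # AND gate with hysteresis: evaluate only when both inputs carry data,
    # reset only when both inputs are NULL, otherwise hold the old output.
    a_is_data, b_is_data = a != NULL, b != NULL
    if a_is_data and b_is_data:             # complete data wave arrived
        return TRUE if (a == TRUE and b == TRUE) else FALSE
    if not a_is_data and not b_is_data:     # complete NULL wave arrived
        return NULL
    return prev_out                         # incomplete inputs: hold state

# One data wave framed by NULL waves, with the inputs arriving at
# slightly different times (no clock needed):
stream = [(NULL, NULL), (TRUE, NULL), (TRUE, TRUE), (NULL, TRUE), (NULL, NULL)]
out = NULL
for a, b in stream:
    out = dual_rail_and(a, b, out)
    print(a, b, "->", out)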

> (Taking a really long drag) That is what the computation of successive
> eigenvalues already accomplishes, as otherwise, all outputs would simply
> reflect the most significant parameter.

Ah! Cool.
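For my own benefit, a quick numpy check of that point: successive principal
components come out mutually orthogonal, so the later components cannot
simply restate the dominant direction.

import numpy as np

rng = np.random.default_rng(1)
# Toy data with one dominant direction plus weaker independent structure.
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])
X -= X.mean(axis=0)

# Successive principal components from the covariance eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("variance captured per component:", np.round(eigvals, 2))
# The components are orthonormal (identity matrix), which is what keeps
# component 2 from being a copy of component 1:
print(np.round(eigvecs.T @ eigvecs, 6))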

>>
>> For example, the unaltered algorithm might not support
>> location-invariance in a picture, so people might add "eye-movements"
>> to the algorithm, making it slide around taking many sub-picture
>> samples. Next, people might want size-invariance, then
>> rotation-invariance. These three together might seem to cover
>> everything, but they do not. First, we've thrown out possibly useful
>> information along the way; people can ignore size sometimes, but it is
>> sometimes important,
>
>
> Size is SO related to distance that I doubt that it has much value.

Yet, we absolutely need size information to know how far we need to
reach or walk.

>>
>> and even more so for rotation and location.
>
>
> Which are SO related to head position that again, I doubt that it has much
> value.

But again, both are absolutely needed to know where to reach our hand
and how to hold it to grasp properly. We simply cannot throw this info
away.

>>
>> Second, more complicated types of invariance can be learned; there is
>> really an infinite variety. This is why relational methods are
>> necessary: they can see things from the beginning as both in a
>> particular location, and as being in a relationship to surroundings
>> that is location-independent. The same holds for size if we add the
>> proper formulas.  (Hmm... I admit that current relational methods
>> can't so easily account for rotation invariance... it would be
>> possible but very expensive...)
>
>
> However, successive PCA layers should be able to do this with ease, by
> recognizing combinations of picture elements like "right angles", "straight
> lines", no "curved lines" or "unbounded shading", and "bounded shading" to
> recognize darkened geometric shapes regardless of size, distortion, or
> orientation.

My opinion is that these systems would need to memorize the picture
that formed each of these shapes in each location at each size and
rotation, in order to develop the desired invariance-- unless it was
preprogrammed, in which case we need some trick to save the
information for later use while allowing the system to ignore it.

However, this is not a totally educated opinion, I admit. It comes
from the notion that the system simply treats each new pixel as a new
variable in an ever-larger space, but neglects the relationships
between these variables. Extracted patterns tend to look spatial
because those are the variables that turn out to be correlated, but
the system really has no idea that two pixels are sitting next to
each other; it just treats them as variables. Am I correct?
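A quick way to check that claim: shuffle the pixel columns of a flattened
image dataset and the principal-component spectrum does not change at all,
because PCA never sees the 2-D grid, only a bag of variables. A small numpy
sketch:

import numpy as np

rng = np.random.default_rng(2)
# 200 tiny 8x8 "images", flattened into 64 pixel-variables each.
images = rng.normal(size=(200, 8, 8))
X = images.reshape(200, -1)

def pca_spectrum(data):
    data = data - data.mean(axis=0)
    return np.sort(np.linalg.eigvalsh(np.cov(data, rowvar=False)))[::-1]

# Destroy spatial adjacency by permuting the pixel columns; the
# eigenvalue spectrum is exactly the same either way.
perm = rng.permutation(X.shape[1])
print(np.allclose(pca_spectrum(X), pca_spectrum(X[:, perm])))   # True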

>>
>> >> Such systems might produce some good results, but the formalism cannot
>> >> represent complex relational ideas.
>
>
> Did I miss something in the example above?
>>
>> > All you need is a model, any model, capable of representing reality and
>> > its
>> > complex relationships. I would think that simple cause-and-effect might
>> > suffice, where events cause other events, that in turn cause still other
>> > events. With a neuron or PCA coordinate for each prospective event, I
>> > could
>> > see things coming together. The uninhibited neurons (or PCA coordinates)
>> > in
>> > the last layer would be the possible present courses of action.
>> > Stimulate
>> > them all and the best will inhibit the rest, and the best course of
>> > action
>> > will take place.
>> >>
>>
>> Hidden causes happen to be a Turing-complete formalism, so sure. And
>> if we add the temporal dimension, this becomes really relational,
>> since the temporal cause-effect chain can stretch backward and forward
>> and become an unbounded computation. But I am skeptical about teasing
>> this behavior from the current algorithms.
>
>
> Having wrestled with this a little in Dr. Eliza, I found that the easy and
> possibly the very best thing to do is to recognize the cause-and-effect
> chain links that you can, and simply ignore those that you can't recognize.
> Where you lack information, simply drop that term when computing
> probabilities and "fudge" the result slightly to lose when compared with
> other probabilities computed with better information. The only effect of
> this minor fudging was to sort the output with the major key being
> probability of condition, and the minor key being belief in correctness.
> This often resulted in producing very high or low probabilities of things
> based on really flimsy evidence, but so far in looking at specific cases,
> none of these were obviously wrong.
>
>
>
> Steve Richfield
> =============
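(If I follow the fudging scheme, it amounts to something like the following
hypothetical Python sketch: drop the links you cannot recognize, and multiply
in a small penalty per missing link so that, among near-equal probabilities,
the better-documented chain sorts first.)

import math

def chain_probability(link_probs, fudge=0.99):
    # link_probs holds the probability of each recognized cause-and-effect
    # link, with None for links that could not be recognized. Unknown links
    # are dropped, and the result is nudged down slightly per missing link.
    known = [p for p in link_probs if p is not None]
    missing = len(link_probs) - len(known)
    prob = math.exp(sum(math.log(p) for p in known)) if known else 0.5
    return prob * (fudge ** missing)

# Same known evidence, but the fully documented chain sorts ahead:
print(chain_probability([0.9, 0.8, 0.7]))         # 0.504
print(chain_probability([0.9, 0.8, 0.7, None]))   # 0.499 (slightly penalized)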
>>
>> On Tue, Jul 22, 2008 at 7:04 PM, Steve Richfield
>> <[EMAIL PROTECTED]> wrote:
>> > Abram,
>> >
>> > All good points. Detailed comments follow. First I must take a LONG
>> > drag,
>> > because I must now blow a lot of smoke...
>> >
>> > On 7/22/08, Abram Demski <[EMAIL PROTECTED]> wrote:
>> >>
>> >> On Tue, Jul 22, 2008 at 4:29 PM, Steve Richfield
>> >> <[EMAIL PROTECTED]> wrote:
>> >> > Abram,
>> >> >
>> >> > On 7/22/08, Abram Demski <[EMAIL PROTECTED]> wrote:
>> >> >>
>> >> >> From the paper you posted, and from wikipedia articles, the current
>> >> >> meaning of PCA is very different from your generalized version. I
>> >> >> doubt the current algorithms would even metaphorically apply...
>> >> >
>> >> >
>> >> > Just more input points that are time-displaced from the present
>> >> > points,
>> >> > or
>> >> > alternatively in simple cases, compute with the derivative of the
>> >> > inputs
>> >> > rather than with their static value.
>> >>
>> >> Such systems might produce some good results, but the formalism cannot
>> >> represent complex relational ideas.
>> >
>> >
>> > All you need is a model, any model, capable of representing reality and
>> > its
>> > complex relationships. I would think that simple cause-and-effect might
>> > suffice, where events cause other events, that in turn cause still other
>> > events. With a neuron or PCA coordinate for each prospective event, I
>> > could
>> > see things coming together. The uninhibited neurons (or PCA coordinates)
>> > in
>> > the last layer would be the possible present courses of action.
>> > Stimulate
>> > them all and the best will inhibit the rest, and the best course of
>> > action
>> > will take place.
>> >>
>> >> It is not even capable of
>> >> representing context-free patterns (for example, pictures of
>> >> fractals).
>> >
>> >
>> > Can people do this?
>> >>
>> >> Of course, I'm referring to PCA "as it is", not "as it
>> >> could be".
>> >>
>> >> >>
>> >> >> Also, what would "multiple layers" mean in the generalized version?
>> >> >
>> >> >
>> >> > Performing the PC-like analysis on the principal components derived
>> >> > in a
>> >> > preceding PC-like analysis.
>> >>
>> >> If this worked, it would be another way of trying to break up the task
>> >> into subtasks. It might help, I admit. It has an intuitive feel; it
>> >> fits the idea of there being levels of processing in the brain. But if
>> >> it helps, why?
>> >
>> >
>> > Maybe we are just large data reduction engines?
>> >>
>> >> What clean subtask-division is it relying on?
>> >
>> >
>> > As I have pointed out here many times before, we are MUCH shorter on
>> > knowledge of reality than we are on CS technology. With this approach,
>> > we
>> > might build AGIs without even knowing how they work.
>> >>
>> >> The idea
>> >> of iteratively compressing data by looking for the highest-information
>> >> variable repeatedly makes sense to me, it is a clear subgoal. But what
>> >> is the subgoal here?
>> >>
>> >> Hmm... the algorithm for a single level would need to "subtract" the
>> >> information encoded in the new variable each time, so that the next
>> >> iteration is working with only the still-unexplained properties of the
>> >> data.
>> >
>> >
>> > (Taking another puff) Unfortunately, PCA methods produce amplitude
>> > information but not phase information. This is a little like indefinite
>> > integration, where you know what is there, but not enough to recreate
>> > it.
>> >
>> > Further, maximum information channels would seem to be naturally
>> > orthogonal,
>> > so subtracting, even if it were possible, is probably unnecessary.
>> >>
>> >> The variables then should be independent, right?
>> >
>> >
>> > To the extent that they are not independent, they are not orthogonal,
>> > and
>> > less information is produced.
>> >>
>> >> Yet, if we take
>> >> the multilevel approach, the 2nd level will be trying to take
>> >> advantage of dependencies in those variables...
>> >
>> >
>> > Probably not linear dependencies because these should have been wrung
>> > out in
>> > the previous level. Hopefully, the next layer would look at time
>> > sequencing,
>> > various combinations, etc.
>> >>
>> >> Perhaps this will work due to inaccuracies in the algorithm, caused by
>> >> approximate methods. The task of the higher levels, then, is to
>> >> correct for the approximations.
>> >
>> >
>> > This isn't my (blurred?) vision.
>> >>
>> >> But if this is their usefulness, then
>> >> it needs to be shown that they are capable of it. After all, they will
>> >> be running the same sort of approximation. It is possible that they
>> >> will therefore miss the same sorts of things. So, we need to be
>> >> careful in defining multilevel systems.
>> >
>> >
>> > Story: I once viewed being able to invert the Airy Disk transform (what
>> > makes a blur from a point of light in a microscope or telescope) as an
>> > EXTREMELY valuable thing to do to greatly increase their power, so I set
>> > about finding a transform function. Then, I wrote a program to test it,
>> > first making an Airy Disk blur and then transforming it back to the
>> > original
>> > point. It sorta worked, but there was lots of computational noise in the
>> > result, so I switched to double precision, whereupon it failed to work
>> > at
>> > all. After LOTS more work, I finally figured out that the Airy Disk
>> > function
>> > was a perfect spatial low-pass filter, so that two points that were too
>> > close to be resolved as separate points made EXACTLY the same perfectly
>> > circular pattern as did a single point of the same total brightness. In
>> > single precision, I was inverting the computational noise, and doing a
>> > pretty good job of it. However, for about a month, I thought that I had
>> > changed the world.
>> >
>> > I also once had a proof for Fermat's Last Theorem that lasted about a
>> > week
>> > rattling around the math department of a major university.
>> >
>> > Hence, you are preaching to the choir regarding care in approach. I have
>> > already run down my fair share of blind alleys.
>> >
>> > Steve Richfield
>> > ==============
>> >>
>> >> >>> On Tue, Jul 22, 2008 at 2:58 PM, Steve Richfield
>> >> >> <[EMAIL PROTECTED]> wrote:
>> >> >> > Abram,
>> >> >> >
>> >> >> > On 7/22/08, Abram Demski <[EMAIL PROTECTED]> wrote:
>> >> >> >>
>> >> >> >> "Problem Statement: What are the optimal functions, derived from
>> >> >> >> real-world observations of past events, the timings of their
>> >> >> >> comings
>> >> >> >> and goings, and perhaps their physical association, to extract
>> >> >> >> each
>> >> >> >> successive parameter containing the maximum amount of information
>> >> >> >> (in
>> >> >> >> a Shannon sense) usable in reconstructing the observed inputs."
>> >> >> >>
>> >> >> >> I see it now! It is typically very useful to decompose a problem
>> >> >> >> into
>> >> >> >> sub-problems that can be solved either independently or with
>> >> >> >> simple
>> >> >> >> well-defined interaction. What you are proposing is such a
>> >> >> >> decomposition, for the very general problem of compression. "Find
>> >> >> >> an
>> >> >> >> encoding scheme for the data in dataset X that minimizes the
>> >> >> >> number
>> >> >> >> of
>> >> >> >> bits we need" can be split into subproblems of the form "find a
>> >> >> >> meaning for the next N bits of an encoding that maximizes the
>> >> >> >> information they carry". The general problem can be solved by
>> >> >> >> applying
>> >> >> >> a solution to the simpler problem until the data is completely
>> >> >> >> compressed.
>> >> >> >
>> >> >> >
>> >> >> > Yes, we do appear to be on the same page here. The challenge is
>> >> >> > that
>> >> >> > there
>> >> >> > seems to be a prevailing opinion that these don't "stack" into
>> >> >> > multi-level
>> >> >> > structures. The reason that this hasn't been tested seems obvious
>> >> >> > from
>> >> >> > the
>> >> >> > literature - computers are now just too damn slow, but people here
>> >> >> > seem
>> >> >> > to
>> >> >> > think that there is another more basic reason, like it doesn't
>> >> >> > work.
>> >> >> > I
>> >> >> > don't
>> >> >> > understand this argument either.
>> >> >> >
>> >> >> > Richard, perhaps you could explain?
>> >> >> >>
>> >> >> >> "However, it still fails to consider temporal clues, unless of
>> >> >> >> course
>> >> >> >> you just consider these to be another dimension."
>> >> >> >>
>> >> >> >> Why does this not count as a working solution?
>> >> >> >
>> >> >> >
>> >> >> > It might be. Note that delays from axonal transit times could
>> >> >> > quite
>> >> >> > easily
>> >> >> > and effectively present inputs "flat" with time presented as just
>> >> >> > another
>> >> >> > dimension. Now, the challenge of testing a theory with an
>> >> >> > additional
>> >> >> > dimension, that already clogs computers without the additional
>> >> >> > dimension.
>> >> >> > Ugh. Any thoughts?
>> >> >> >
>> >> >> > Perhaps I should write this up and send it to the various people
>> >> >> > working
>> >> >> > in
>> >> >> > this area. Perhaps people with the present test beds could find a
>> >> >> > way
>> >> >> > to
>> >> >> > test this, and the retired math professor would have a better idea
>> >> >> > as
>> >> >> > to
>> >> >> > exactly what needed to be optimized.
>> >> >> >
>> >> >> > Steve Richfield
>> >> >> > =================
>> >> >> >>
>> >> >> >> On Tue, Jul 22, 2008 at 1:48 PM, Steve Richfield
>> >> >> >> <[EMAIL PROTECTED]> wrote:
>> >> >> >> > Ben,
>> >> >> >> > On 7/22/08, Benjamin Johnston <[EMAIL PROTECTED]> wrote:
>> >> >> >> >>>
>> >> >> >> >>> You are confusing what PCA now is, and what it might become.
>> >> >> >> >>> I
>> >> >> >> >>> am
>> >> >> >> >>> more
>> >> >> >> >>> interested in the dream than in the present reality.
>> >> >> >> >>
>> >> >> >> >> That is like claiming that multiplication of two numbers is
>> >> >> >> >> the
>> >> >> >> >> answer
>> >> >> >> >> to
>> >> >> >> >> AGI, and then telling any critics that they're confusing what
>> >> >> >> >> multiplication
>> >> >> >> >> is now with what multiplication may become.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > Restating (not copying) my original posting, the challenge of
>> >> >> >> > effective
>> >> >> >> > unstructured learning is to utilize every clue and NOT just go
>> >> >> >> > with
>> >> >> >> > static
>> >> >> >> > clusters, etc. This includes temporal as well as positional
>> >> >> >> > clues,
>> >> >> >> > information content, etc. PCA does some but certainly not all
>> >> >> >> > of
>> >> >> >> > this,
>> >> >> >> > but
>> >> >> >> > considering that we were talking about clustering here just a
>> >> >> >> > couple
>> >> >> >> > of
>> >> >> >> > weeks ago, ratcheting up to PCA seems to be at least a step out
>> >> >> >> > of
>> >> >> >> > the
>> >> >> >> > basement.
>> >> >> >> >
>> >> >> >> > I think that perhaps I mis-stated or was misunderstood in my
>> >> >> >> > "position".
>> >> >> >> > No
>> >> >> >> > one has "the answer" yet, but given recent work, I think that
>> >> >> >> > perhaps
>> >> >> >> > the
>> >> >> >> > problem can now be stated. Given a problem statement, it
>> >> >> >> > (hopefully)
>> >> >> >> > should
>> >> >> >> > be "just some math" to zero in on the solution. OK...
>> >> >> >> >
>> >> >> >> > Problem Statement: What are the optimal functions, derived from
>> >> >> >> > real-world
>> >> >> >> > observations of past events, the timings of their comings and
>> >> >> >> > goings,
>> >> >> >> > and
>> >> >> >> > perhaps their physical association, to extract each successive
>> >> >> >> > parameter
>> >> >> >> > containing the maximum amount of information (in a Shannon
>> >> >> >> > sense)
>> >> >> >> > usable
>> >> >> >> > in
>> >> >> >> > reconstructing the observed inputs. IMHO these same functions
>> >> >> >> > will
>> >> >> >> > be
>> >> >> >> > exactly what you need to recognize what is happening in the
>> >> >> >> > world,
>> >> >> >> > what
>> >> >> >> > you
>> >> >> >> > need to act upon, which actions will have the most effect on
>> >> >> >> > the
>> >> >> >> > world,
>> >> >> >> > etc.
>> >> >> >> > PCA is clearly NOT there (e.g. it lacks temporal
>> >> >> >> > consideration),
>> >> >> >> > but
>> >> >> >> > seems
>> >> >> >> > to be a step closer than anything else on the horizon.
>> >> >> >> > Hopefully,
>> >> >> >> > given
>> >> >> >> > the
>> >> >> >> > "hint" of PCA, we can follow the path.
>> >> >> >> >
>> >> >> >> > You should find an explanation of PCA in any elementary linear
>> >> >> >> > algebra
>> >> >> >> > or
>> >> >> >> > statistics textbook. It has a range of applications (like any
>> >> >> >> > transform),
>> >> >> >> > but it might be best regarded as an/the elementary algorithm
>> >> >> >> > for
>> >> >> >> > unsupervised dimension reduction.
>> >> >> >> >
>> >> >> >> > Bingo! However, it still fails to consider temporal clues,
>> >> >> >> > unless
>> >> >> >> > of
>> >> >> >> > course
>> >> >> >> > you just consider these to be another dimension.
>> >> >> >> >
>> >> >> >> > When PCA works, it is more likely to be interpreted as a
>> >> >> >> > comment
>> >> >> >> > on
>> >> >> >> > the
>> >> >> >> > underlying simplicity of the original dataset, rather than the
>> >> >> >> > power
>> >> >> >> > of
>> >> >> >> > PCA
>> >> >> >> > itself.
>> >> >> >> >
>> >> >> >> > Agreed, but so far, I haven't seen any solid evidence that the
>> >> >> >> > world
>> >> >> >> > is
>> >> >> >> > NOT
>> >> >> >> > simple, though it appears pretty complex until you understand
>> >> >> >> > it.
>> >> >> >> >
>> >> >> >> > Thanks for making me clarify my thoughts.
>> >> >> >> >
>> >> >> >> > Steve Richfield
>> >> >> >> >