Kingma, D.P. wrote:
Too easy ;)

One of the points in patch-space corresponds to X=center, Y=center, Scale=huge, so this patch is a rescaled version (say 20x20) of the whole image (say 1000x1000). In this 20x20 patch, the letter 'A' emerges naturally and can be reconstructed by the NN, and therefore be recognized. It will probably be salient, since it's far away in patch-space from the small A's in the Scale dimension. Far-away points in patch-space don't battle for salience.
Your second example is solved analogously.

Okay, time for dinner now. Vision solved :)

Regards,
Durk
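
The claim that the large 'A' emerges in the coarse patch can be checked on a toy scale. The following is only a sketch under stated assumptions: block averaging stands in for whatever rescaling would really be used, and BIG_A, SMALL, render, block_average and to_ascii are hypothetical names invented for this illustration, not anything from the thread. SMALL is a 4x4 stand-in for a small glyph, not a trained letter model.

```python
# Toy check: small glyphs laid out along the strokes of a big letter
# collapse into that big letter when the image is block-averaged.

BIG_A = [          # hypothetical coarse layout of the large letter
    "..##..",
    ".#..#.",
    "######",
    "#....#",
    "#....#",
]
SMALL = [          # hypothetical 4x4 stand-in for one small glyph
    ".##.",
    "#..#",
    "####",
    "#..#",
]


def render(big, small):
    """Paste one copy of `small` into every '#' cell of `big`."""
    k = len(small)
    image = [[0.0] * (len(big[0]) * k) for _ in range(len(big) * k)]
    for big_r, row in enumerate(big):
        for big_c, cell in enumerate(row):
            if cell != "#":
                continue
            for i in range(k):
                for j in range(k):
                    if small[i][j] == "#":
                        image[big_r * k + i][big_c * k + j] = 1.0
    return image


def block_average(image, factor):
    """Downscale by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(image) // factor, len(image[0]) // factor
    area = factor * factor
    return [[sum(image[r * factor + i][c * factor + j]
                 for i in range(factor) for j in range(factor)) / area
             for c in range(cols)]
            for r in range(rows)]


def to_ascii(image, threshold=0.5):
    """Threshold grey values back into '#' / '.' rows."""
    return ["".join("#" if v >= threshold else "." for v in row)
            for row in image]


fine = render(BIG_A, SMALL)                    # 20 x 24 pixels
coarse = to_ascii(block_average(fine, len(SMALL)))
print("\n".join(coarse))
```

At the coarse scale each small glyph contributes only its average ink density, so its own shape is invisible and only the large arrangement remains. Whether a trained net would reconstruct a real downscaled scene this cleanly is a separate question that this toy does not answer.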

Yeah, modulo a few implementation details, that sounds about right.

We can probably do language the same way tomorrow morning.....

;-)


Richard Loosemore



On Mon, Mar 3, 2008 at 7:59 PM, Richard Loosemore <[EMAIL PROTECTED]> wrote:

    Kingma, D.P. wrote:
     > On Mon, Mar 3, 2008 at 6:39 PM, Richard Loosemore <[EMAIL PROTECTED]> wrote:
     >
     >     The problems with bolting together NN and GA are so numerous it
     >     is hard to know where to begin.  For one thing, you cannot
     >     represent structured information with NNs unless you go to some
     >     trouble to add extra architecture.  Most NNs can only cope with
     >     single concepts learned in isolation, so if you show a visual
     >     field containing 5,000 copies of the letter 'A', all that
     >     happens is that the 'A' neuron fires.
     >
     >     If you do find some way to get around this problem, your
     >     solution will end up being the tail that wags the dog:  the NN
     >     itself will fade into relative insignificance compared to your
     >     solution.
     >
     >
     > Well, you could achieve that (5000 registrations of the letter 'A'
     > with their corresponding positions in the image) by using a sliding
     > window over multiple rescaled (and maybe otherwise transformed)
     > versions of the input image. This way, you get image patches for
     > each window and scale (and maybe other transformations), and each
     > patch can be given a corresponding position in a multidimensional
     > space (e.g., an image patch with X and Y position and scale S is a
     > point in 3-dimensional space). For each of the produced points
     > (patches) in the space, run the neural net to produce a
     > lower-dimensional code and corresponding energy (= reconstruction
     > quality). Now filter this space by letting the points have local
     > battles for salience using some heuristic (e.g. lower energy means
     > higher salience) and filter out the low-salience points. This
     > produces a filtered space with fewer points than the previous one,
     > with each point containing a lower-dimensional code.
     >
     > In the example of the letter 'A', the above method would recognize
     > all 5000 versions while remembering their individual input
     > positions. This presumes the neural net is properly trained on the
     > letter 'A' and can properly reconstruct them (using Hinton's
     > method). This should produce 5000 registrations of the letter 'A',
     > while filtering out unimportant information.
     >
     > But you could take it a step further. For each image input, the
     > above method creates a filtered, 3-dimensional space with points
     > containing low-dimensional codes. This space can then again be
     > harvested by taking patches with each patch containing /n/ points,
     > each point containing an /m/-dimensional code, so each patch being
     > (/m/*/n/)-dimensional. A neural net can be trained on lowering the
     > dimension of these patches from (/m/*/n/) to something
     > lower-dimensional. This process is quite similar to the one in the
     > previous paragraph.
     >
     > What could /possibly/ go wrong? :)
     >
     > Regards,
     > Durk Kingma
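
The first stage described in the quoted message (sliding window over rescaled images, one point per patch, local battles for salience) can be sketched roughly as follows. This is a minimal sketch under several assumptions, not the poster's implementation: energy is read as reconstruction error (lower is better), rescaling is done by integer block averaging, battles are restricted to points at the same scale as a crude stand-in for "nearby in patch-space", and template_energy is a hypothetical placeholder for the trained net. All names (PatchPoint, block_average, enumerate_patches, filter_salient, template_energy) are invented for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]


@dataclass(frozen=True)
class PatchPoint:
    x: float       # patch centre, in original-image pixels
    y: float
    scale: int     # integer downscale factor of the pyramid level
    energy: float  # assumed: reconstruction error, lower is better


def block_average(image: Image, factor: int) -> Image:
    """Downscale by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(image) // factor, len(image[0]) // factor
    area = factor * factor
    return [[sum(image[r * factor + i][c * factor + j]
                 for i in range(factor) for j in range(factor)) / area
             for c in range(cols)]
            for r in range(rows)]


def enumerate_patches(image: Image, window: int, scales: List[int],
                      energy_fn: Callable[[Image], float],
                      stride: int = 1) -> List[PatchPoint]:
    """Slide a window over every pyramid level; one point per patch."""
    points = []
    for s in scales:
        level = block_average(image, s)
        for r in range(0, len(level) - window + 1, stride):
            for c in range(0, len(level[0]) - window + 1, stride):
                patch = [row[c:c + window] for row in level[r:r + window]]
                points.append(PatchPoint(x=(c + window / 2) * s,
                                         y=(r + window / 2) * s,
                                         scale=s,
                                         energy=energy_fn(patch)))
    return points


def filter_salient(points: List[PatchPoint], radius: float,
                   max_energy: float) -> List[PatchPoint]:
    """Local battles: a point survives if its energy is acceptable and
    no same-scale neighbour within `radius` patch pixels beats it.
    Ties are broken by list order so exactly one of them survives."""
    ranked = list(enumerate(points))
    kept = []
    for i, p in ranked:
        if p.energy > max_energy:
            continue
        reach = radius * p.scale
        beaten = any(q.scale == p.scale
                     and abs(q.x - p.x) <= reach
                     and abs(q.y - p.y) <= reach
                     and (q.energy, j) < (p.energy, i)
                     for j, q in ranked)
        if not beaten:
            kept.append(p)
    return kept


def template_energy(patch: Image) -> float:
    """Hypothetical stand-in for the net: squared distance to an
    all-ones template instead of an autoencoder reconstruction error."""
    return sum((v - 1.0) ** 2 for row in patch for v in row)
```

A call such as filter_salient(enumerate_patches(image, 2, [1, 4], template_energy), radius=2, max_energy=0.5) returns the surviving points together with their positions and scales. The pairwise comparison is quadratic in the number of points, which is fine for a sketch but would need a spatial index for real images.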

    Excellent!  Sounds like a perfect solution ;-).

    Oh, wait!

    What about......... if the scene is structured in such a way that the
    5,000 copies of the letter 'A' were actually scattered around in such a
    way that most (but not all) of them were arranged to form a huge
    letter 'A'?

    Would it then count 5,001 copies?

    Oh, and one more thing I forgot to mention that is in the same scene
    (how could I forget this one?):  there are also a couple of women
    standing side by side, leaning against each other with their shoulders
    touching and keeping their bodies stiff and straight, forming the two
    sides of a letter 'A', and holding a model of a horizontally reclining
    woman between them at waist height, to form the crossbar of a letter
    'A'.

    Could we get the NN to recognize, in the context of the overall scene,
    that here were actually 5,002 copies of the letter 'A'......?

    And if the scene had one single, rather small letter B over in the
    corner, would the NN find this funny?

    You have 30 minutes to devise an algorithm, Durk... :-).



    Richard Loosemore


    -------------------------------------------
    agi
    Archives: http://www.listbox.com/member/archive/303/=now
    RSS Feed: http://www.listbox.com/member/archive/rss/303/
    Modify Your Subscription: http://www.listbox.com/member/?&;
    <http://www.listbox.com/member/?&;>
    Powered by Listbox: http://www.listbox.com

