Kingma, D.P. wrote:
Too easy ;)
One of the points in patch-space corresponds to X=center, Y=center,
Scale=huge, so this patch is a rescaled version (say 20x20) of the whole
image (say 1000x1000). In this 20x20 patch, the letter 'A' emerges
naturally and can be reconstructed by the NN, and therefore be
recognized. It will probably be salient, since it's far away in
patch-space from the small A's in the Scale dimension. Far-away points
in patch-space don't battle for salience.
Your second example is solved analogously.
Okay, time for dinner now. Vision solved :)
Regards,
Durk
Yeah, modulo a few implementation details, that sounds about right.
We can probably do language the same way tomorrow morning.....
;-)
Richard Loosemore
On Mon, Mar 3, 2008 at 7:59 PM, Richard Loosemore <[EMAIL PROTECTED]> wrote:
Kingma, D.P. wrote:
> On Mon, Mar 3, 2008 at 6:39 PM, Richard Loosemore <[EMAIL PROTECTED]> wrote:
>
> The problems with bolting together NN and GA are so numerous it is hard
> to know where to begin. For one thing, you cannot represent structured
> information with NNs unless you go to some trouble to add extra
> architecture. Most NNs can only cope with single concepts learned in
> isolation, so if you show a visual field containing 5,000 copies of the
> letter 'A', all that happens is that the 'A' neuron fires.
>
> If you do find some way to get around this problem, your solution will
> end up being the tail that wags the dog: the NN itself will fade into
> relative insignificance compared to your solution.
>
>
> Well, you could achieve that (5,000 registrations of the letter 'A',
> each with its corresponding position in the image) by using a sliding
> window over multiple rescaled (and maybe otherwise transformed) versions
> of the input image. This way, you get an image patch for each window
> position and scale (and maybe other transformations), and each patch can
> be given a corresponding position in a multidimensional space (e.g., an
> image patch with X and Y position and scale S is a point in
> 3-dimensional space). For each of the produced points (patches) in the
> space, run the neural net to produce a lower-dimensional code and a
> corresponding energy (= reconstruction quality). Now filter this space
> by letting the points have local battles for salience using some
> heuristic (e.g., lower energy means higher salience) and filter out the
> low-salience points. This produces a filtered space with fewer points
> than the previous one, with each point containing a lower-dimensional
> code.
>
> In the example of the letter 'A', the above method would recognize all
> 5,000 versions while remembering their individual input positions. This
> presumes the neural net is properly trained on the letter 'A' and can
> properly reconstruct it (using Hinton's method). This should produce
> 5,000 registrations of the letter 'A', while filtering out unimportant
> information.
>
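The procedure Durk describes above (a multi-scale sliding window, one point per patch in (X, Y, scale) space, then local battles for salience) could be sketched roughly as follows. This is a minimal pure-Python sketch under stated assumptions, not his implementation: the function names, the block-averaging `downsample`, the distance metric with its `scale_weight`, the `max_energy` cutoff, and the `toy_encoder` that stands in for the trained neural net are all hypothetical choices made for illustration. It also assumes that lower energy means a better reconstruction.

```python
import math
from typing import Callable, List, Tuple

Image = List[List[float]]
# One point in patch-space: (x, y, log2(scale), energy, code)
Point = Tuple[float, float, float, float, List[float]]


def downsample(image: Image, factor: int) -> Image:
    """Shrink the image by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(image) // factor, len(image[0]) // factor
    return [
        [
            sum(image[r * factor + i][c * factor + j]
                for i in range(factor) for j in range(factor)) / factor ** 2
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def toy_encoder(patch: Image) -> Tuple[List[float], float]:
    """Hypothetical stand-in for the neural net: returns (code, energy).

    The code is the patch mean; the energy is the squared distance to an
    all-ones template, so a patch matching the template has energy 0.
    """
    flat = [v for row in patch for v in row]
    code = [sum(flat) / len(flat)]
    energy = sum((v - 1.0) ** 2 for v in flat)
    return code, energy


def patch_points(image: Image, window: int, scales: List[int], stride: int,
                 encode_fn: Callable[[Image], Tuple[List[float], float]]
                 ) -> List[Point]:
    """Slide a window over each rescaled image; emit one point per patch."""
    points: List[Point] = []
    for s in scales:
        small = downsample(image, s)
        for r in range(0, len(small) - window + 1, stride):
            for c in range(0, len(small[0]) - window + 1, stride):
                patch = [row[c:c + window] for row in small[r:r + window]]
                code, energy = encode_fn(patch)
                # Patch centre expressed in original-image coordinates.
                x = (c + window / 2) * s
                y = (r + window / 2) * s
                points.append((x, y, math.log2(s), energy, code))
    return points


def filter_salient(points: List[Point], radius: float, scale_weight: float,
                   max_energy: float) -> List[Point]:
    """Keep low-energy points that no nearby point beats (ties both survive)."""
    def dist(a: Point, b: Point) -> float:
        return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
                         + (scale_weight * (a[2] - b[2])) ** 2)

    kept = []
    for p in points:
        if p[3] > max_energy:
            continue  # reconstruction too poor to count as a registration
        beaten = any(q is not p and q[3] < p[3] and dist(p, q) <= radius
                     for q in points)
        if not beaten:
            kept.append(p)
    return kept
```

Because battles are local (limited by `radius`), two well-reconstructed patches that are far apart in position or in scale do not suppress each other, which is what lets many separate 'A's survive the filter.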
> But you could take it a step further. For each image input, the above
> method creates a filtered, 3-dimensional space with points containing
> low-dimensional codes. This space can then again be harvested by taking
> patches, with each patch containing /n/ points and each point containing
> an /m/-dimensional code, so each patch has dimension (/m/*/n/). A neural
> net can be trained on lowering the dimension of these patches from
> (/m/*/n/) to something lower-dimensional. This process is quite similar
> to the one in the previous paragraph.
>
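The second-level step above (patches of n points, each carrying an m-dimensional code, giving m*n-dimensional inputs) might look like the sketch below. The message does not say how the n points of a patch are chosen, so grouping each point with its nearest neighbours in (X, Y, scale) space is purely an assumption here, as are the names `CodedPoint` and `second_level_patches`.

```python
import math
from typing import List, Sequence, Tuple

# A filtered point: ((x, y, scale), m-dimensional code)
CodedPoint = Tuple[Tuple[float, float, float], Sequence[float]]


def second_level_patches(points: List[CodedPoint], n: int) -> List[List[float]]:
    """Build one (m*n)-dimensional vector per point.

    Each vector concatenates the codes of the n points nearest to that
    point (the point itself included), ordered by increasing distance.
    Nearest-neighbour grouping is an assumed choice, not from the source.
    """
    if n > len(points):
        return []  # not enough points to form a single patch
    patches = []
    for pos, _ in points:
        nearest = sorted(points, key=lambda q: math.dist(pos, q[0]))[:n]
        vector: List[float] = []
        for _, code in nearest:
            vector.extend(code)
        patches.append(vector)
    return patches
```

The resulting vectors would then be the training input for a second dimensionality-reducing net, in the same way image patches were the input for the first.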
> What could /possibly/ go wrong? :)
>
> Regards,
> Durk Kingma
Excellent! Sounds like a perfect solution ;-).
Oh, wait!
What about......... if the scene is structured in such a way that the
5,000 copies of the letter 'A' were actually scattered around in such a
way that most (but not all) of them were arranged to form a huge
letter 'A'?
Would it then count 5,001 copies?
Oh, and one more thing I forgot to mention that is in the same scene
(how could I forget this one?): there are also a couple of women
standing side by side, leaning against each other with their shoulders
touching and keeping their bodies stiff and straight, forming the two
sides of a letter 'A', and holding a model of a horizontally reclining
woman between them at waist height, to form the crossbar of a letter
'A'.
Could we get the NN to recognize, in the context of the overall scene,
that here were actually 5,002 copies of the letter 'A'......?
And if the scene had one single, rather small letter B over in the
corner, would the NN find this funny?
You have 30 minutes to devise an algorithm, Durk... :-).
Richard Loosemore
-------------------------------------------
agi
Archives: http://www.listbox.com/member/archive/303/=now
RSS Feed: http://www.listbox.com/member/archive/rss/303/
Modify Your Subscription: http://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com