Matt: It seems like the next logical step would be to model a fovea and
saccades to reduce the input complexity

Care to expand? Are there any computational/robotic approaches to
vision which involve both the sensing of a field by the retina AND
attention to objects (or parts of objects) within that field by the fovea?
-----Original Message-----
From: Matt Mahoney
Sent: Friday, April 05, 2013 5:36 PM
To: AGI
Subject: Re: Complexity of vision (was Re: [agi] Utilizing kickstarter.com?)
On Fri, Apr 5, 2013 at 2:27 AM, Ben Goertzel <[email protected]> wrote:
Anyway, I would like opinions on the computational complexity of human
vision. Specifically, how would you optimize Google's cat face
recognizer and bring it up to human level?
http://128.84.158.119/abs/1112.6209v3
I wouldn't try to optimize that algorithm; I would take a different
approach that couples a visual hierarchy with a structurally and
dynamically richer cognitive system...
But I'm not going to try to pack the details of my AGI thinking into an
email...
I assume it is based on DeSTIN, which is also a hierarchical neural network.
http://blog.opencog.org/2011/02/21/destin-vision-development/
http://www.aaai.org/ocs/index.php/FSS/FSS09/paper/viewFile/951/1268
I note from the 2009 paper that DeSTIN is able to distinguish between
32 x 32 x 1 images of A, B, and C, with translations and added noise,
using an 8-layer network (4 feature detection layers alternating with
4 belief layers) with layer sizes 64, 24, 16, 12, 4, 6, 1, 3, where
the first layer detects 4 x 4 non-overlapping patches. I'm not sure,
but I think there are about 20K connections, mostly in the lower
layers. I presume a single processor is sufficient. The paper did not
indicate the number of training cycles or CPU time, except to say
there were 300 cycles to learn the intermediate features before
training the belief nodes.
The blog post from 2011 notes a GPU port is planned. Are there any new
experimental results?
It seems like the next logical step would be to model a fovea and
saccades to reduce the input complexity, and then give it some harder
problems like reading text, interpreting captchas, recognizing faces,
or recognizing objects from ImageNet. That could be followed by adding
depth perception and motion-detection features, and then using it to
control robot navigation.
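To illustrate how a fovea cuts input complexity: sample at full
resolution only in a small central window and progressively coarser
toward the periphery, so a saccade is just moving that window. A
minimal sketch; the function and its parameters are hypothetical, not
taken from DeSTIN or the Google system:

```python
import numpy as np

def foveate(image, cx, cy, fovea=16, rings=3):
    """Return multi-resolution crops centered at (cx, cy): a
    full-resolution fovea plus progressively downsampled surrounding
    rings. Total pixels grow with the ring count, not the image size,
    so a saccade sequence covers a large field at a fraction of the
    raw input cost."""
    crops = []
    size = fovea
    for r in range(rings + 1):
        half = size // 2
        # Clip the window to the image bounds.
        y0, y1 = max(cy - half, 0), min(cy + half, image.shape[0])
        x0, x1 = max(cx - half, 0), min(cx + half, image.shape[1])
        patch = image[y0:y1, x0:x1]
        step = 2 ** r  # coarser sampling farther from the center
        crops.append(patch[::step, ::step])
        size *= 2
    return crops

# A 256x256 field reduced to four 16x16 crops.
img = np.zeros((256, 256))
crops = foveate(img, 128, 128)
print(sum(c.size for c in crops), "pixels instead of", img.size)
# 1024 pixels instead of 65536
```

Here a 64x reduction; the hierarchy then only ever sees the fixation
sequence, not the whole frame.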
I realize that DeSTIN and the Google system differ in size and
details, but they are both hierarchical neural networks with
unsupervised learning of intermediate features using winner-take-all
networks or something similar. In both cases, the computational
requirements depend on the size of the training set and the number of
features to be detected. The Google system has 10^5 times more
connections and I guess 10^5 times more training data, requiring 10^10
times as much computation.
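The 10^10 figure is just the product of the two ratios, assuming cost
scales as connections times training examples (both 10^5 factors are
my guesses from the paragraph above, not measured values):

```python
# Rough scaling model: compute ~ connections x training examples,
# assuming each example touches every connection once.
conn_ratio = 1e5                    # guessed: Google has 10^5x more connections
data_ratio = 1e5                    # guessed: 10^5x more training data
compute_ratio = conn_ratio * data_ratio
print("Google/DeSTIN compute ratio: %.0e" % compute_ratio)  # 1e+10
```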
I don't know of any good estimates of the number of features in human
level vision. We know there are 10^6 inputs from the optic nerve. I
suppose that we can distinguish among 10^6 visual objects at the top
layer. This is somewhat higher than our vocabulary. I think it has to
be larger than 10^5, because language is inadequate for describing
everything we can see. I can't describe a person's face in sufficient
detail that you would immediately recognize them. You would need a
picture.
Let's assume there are about 10 layers, all the same size. Then there
are about 10^13 connections. Over a few decades we receive 10^16 bits
from the optic nerve at a rate of 10 bits per second per nerve fiber x
10^6 fibers x 10^9 seconds. The processing rate would be 2 x 10^14 OPS
at a 50 ms cycle time. That seems about right because it takes about 0.5
seconds to recognize a face. You need 40 TB of RAM to store 10^13
connections as 32 bit integers or floats.
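Spelling out that arithmetic as a sanity check (same assumptions as
above: 10 equal, fully connected layers of 10^6 units, every
connection visited each 50 ms cycle):

```python
fibers  = 1e6                # optic nerve inputs, one unit per fiber
layers  = 10                 # assumed equal-size, fully connected layers
conns   = fibers**2 * (layers - 1)  # 9e12 connections, call it 1e13
bits_in = 10 * fibers * 1e9  # 10 bit/s x 1e6 fibers x 1e9 s = 1e16 bits
ops     = 1e13 / 0.05        # every connection each 50 ms -> 2e14 OPS
ram     = 1e13 * 4           # 32-bit weights -> 4e13 bytes = 40 TB
print(conns, bits_in, ops, ram)
```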
An NVIDIA Titan GPU has 2688 cores and runs at 4 TFLOPS (32 bit
floats) with 6 GB memory. It costs about $1000, uses 250 watts of
electricity, and plugs into a slot in a desktop PC. My simple math
tells me 50 of these would give you enough CPU power but leave you
short on RAM by a factor of 128. You would have to augment each card
with 1 TB of external memory, but the bus bandwidth would be far too
slow to access all of it every 50 ms even with a serial access
pattern. Alternatively, you could put together 6000 cards for $6
million plus the interconnect hardware, and 1.5 MW electricity. This
would allow you to run experiments 128 times faster than real time,
processing a decade's worth of training video in about a month. I
think this would be necessary in order to develop and tune the
algorithm in reasonable time.
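Checking the card counts (my arithmetic, using the Titan specs quoted
above; the email rounds the exact ratios down to 6000 cards, 128x,
$6M, and 1.5 MW):

```python
tflops, bytes_per_card = 4e12, 6e9   # Titan: 4 TFLOPS, 6 GB per card
need_ops, need_ram = 2e14, 4e13      # targets from the estimate above

cards_ops = need_ops / tflops                        # 50 cards cover the FLOPS
shortfall = need_ram / (cards_ops * bytes_per_card)  # ~133x short on RAM
cards_ram = need_ram / bytes_per_card                # ~6667 cards cover the RAM
speedup   = cards_ram / cards_ops                    # ~133x real time
days      = 10 * 365 / speedup                       # a decade of video in ~27 days
cost_usd  = cards_ram * 1000                         # ~$6.7M at $1000/card
power_mw  = cards_ram * 250 / 1e6                    # ~1.7 MW at 250 W/card
print(cards_ops, shortfall, cards_ram, days)
```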
I'm also assuming that RAM is accessed sequentially in large
blocks, as is typical in fully connected neural networks implemented
using vector processing. Random access through pointers or sparse
networks is about 50 times slower. There might be other
implementations using single bits or bytes to represent synapses to
save memory. I'm not sure what the speed impact would be.
Do you agree with my math? I realize my estimate of 10^13 connections
is 1/10 that of the cortex, but I am just estimating the vision
component.
--
-- Matt Mahoney, [email protected]
-------------------------------------------
AGI
Archives: https://www.listbox.com/member/archive/303/=now
RSS Feed: https://www.listbox.com/member/archive/rss/303/6952829-59a2eca5
Modify Your Subscription:
https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com