> I'm wondering how
> much further one could extend such architecture to coding of
> spatiotemporal (video) patterns, multimodal patterns (video + audio)
> and eventually coding of 3D objects. They are all 'just' extensions of
> such a model, 'just' about finding efficient ways of learning the
> joint probability distributions :) however I imagine that finding
> efficient ways of training such models (e.g. finding compact
> representations) should become increasingly hard.

This is true.  In principle, reconstructing a 3D model from
observations made by one or two cameras over time is just the reverse
of the ray tracing problem.  By finding correspondences, either
through structure from motion or through stereo matching (these are
basically the same problem), you can try to probabilistically model
the ray of light which traveled from the object to the image pixel.
There's no doubt that this is a hard problem, but I think it's one
which is solvable.
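
To make the "reverse ray tracing" idea a bit more concrete, here is a
minimal sketch of the geometric core: once a correspondence gives you
two rays (one per view), the 3D point can be estimated as the midpoint
of the shortest segment joining them.  The function name and the
choice of the midpoint method are my own assumptions for illustration,
not anything from the research being discussed.  The returned gap is
only a crude stand-in for a proper probabilistic model of the ray.

```python
import math

def triangulate_midpoint(o1, d1, o2, d2, eps=1e-12):
    """Estimate a 3D point from two rays o + s*d (hypothetical helper).

    Returns (point, gap): the midpoint of the shortest segment joining
    the two rays, and the length of that segment. A large gap suggests
    a bad correspondence or poor calibration.
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    w = [a - b for a, b in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        # Parallel rays carry no depth information.
        raise ValueError("rays are (nearly) parallel; depth is unobservable")
    # Ray parameters of the mutually closest points.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [o + s * k for o, k in zip(o1, d1)]
    p2 = [o + t * k for o, k in zip(o2, d2)]
    point = [(u + v) / 2.0 for u, v in zip(p1, p2)]
    return point, math.dist(p1, p2)

# Two camera centres 1 unit apart, both looking at the same point.
point, gap = triangulate_midpoint(
    (0.0, 0.0, 0.0), (0.5, 0.2, 4.0),
    (1.0, 0.0, 0.0), (-0.5, 0.2, 4.0),
)
print(point, gap)
```

Note that the sketch does not check whether the point lies in front
of both cameras, and it treats the rays as exact.  With real pixel
noise you would want to weight the estimate by the uncertainty of each
ray, which is where the probabilistic modeling comes in.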

The next logical step in that fellow's research is, as you say, to
extend the approach to matching features over time in video sequences.
This involves not only detecting the features themselves but also
making a forward prediction about where each feature will be next and
iteratively modeling its position uncertainty and local surface
orientation.  Andrew Davison's group is doing just this kind of thing,
applying information theory to vision with some success.
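
As a rough illustration of the predict-then-refine loop described
above, here is a minimal sketch of a constant-velocity Kalman filter
tracking one image coordinate of a feature.  This is a generic
textbook filter that I am swapping in for illustration; it is not
claimed to be the method used by Davison's group, and it ignores local
surface orientation entirely.  The class name, the noise values and
the 3-sigma search window are all assumptions.

```python
import math

class FeatureTrack1D:
    """Constant-velocity Kalman filter for one image coordinate
    (hypothetical class; state is position x and velocity v)."""

    def __init__(self, x0, pos_var=1.0, vel_var=100.0, q=0.01, r=1.0):
        self.x, self.v = float(x0), 0.0
        # 2x2 state covariance, stored element by element.
        self.p00, self.p01, self.p10, self.p11 = pos_var, 0.0, 0.0, vel_var
        self.q, self.r = q, r  # assumed process / measurement noise

    def predict(self, dt=1.0):
        """Forward prediction: move the estimate, grow the uncertainty."""
        self.x += self.v * dt
        p00 = self.p00 + dt * (self.p10 + self.p01) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11

    def search_window(self, n_sigma=3.0):
        """Interval in which to look for the feature in the next frame."""
        half = n_sigma * math.sqrt(self.p00)
        return self.x - half, self.x + half

    def update(self, z):
        """Fuse a measured position z; shrinks the uncertainty."""
        s = self.p00 + self.r              # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s  # Kalman gain
        y = z - self.x                     # innovation
        self.x += k0 * y
        self.v += k1 * y
        p00, p01 = self.p00, self.p01
        self.p00 = (1 - k0) * p00
        self.p01 = (1 - k0) * p01
        self.p10 = self.p10 - k1 * p00
        self.p11 = self.p11 - k1 * p01

# A feature drifting right at 2 pixels per frame, starting at x = 10.
track = FeatureTrack1D(10.0)
for frame in range(1, 6):
    track.predict()
    lo, hi = track.search_window()  # only search for a match in [lo, hi]
    track.update(10.0 + 2.0 * frame)
print(track.x, track.v, track.p00)
```

The point of the search window is that the feature matcher only has
to look where the model says the feature is likely to be, so the
uncertainty estimate directly controls how much image processing is
needed per frame.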

