(proposed R&D project for fall 2016 - 2017)

We are now pretty close (a month away, perhaps?) to having an initial,
reasonably reliable version of an OpenCog-controlled Hanson robot
head, carrying out basic verbal and nonverbal interactions.   This
will be able to serve as a platform for Hanson Robotics product
development, and also for ongoing OpenCog R&D aimed at increasing
levels of embodied intelligence.

This email makes a suggestion regarding the thrust of the R&D side of
the ongoing work, to be done once the initial version is ready.  This
R&D could start around the beginning of September, and is expected to
take 9-12 months…


GENERAL IDEA:
Initial experiment on using OpenCog for learning language from
experience, using the Hanson robot heads and associated tools

In other words, the idea is to use simple conversational English about
small groups of people observed by a robot head as a context in which
to experiment with the ideas about experience-based language learning
that we have already written up.

BASIC PERCEPTION:

I think we can do some interesting language-learning work without
dramatic extensions of our current perception framework.  Extending
the perception framework is valuable but can be done in parallel with
using the current framework to drive language learning work.

What I think we need initially, to drive the language learning work,
is for the robot to be able to tell, at each point in time (a minimal
sketch of how these percepts might be represented follows the list):

— where people’s faces are (assigning a persistent label to each person’s face)

— which people are talking

— whether an utterance is happy or unhappy (and maybe some additional sentiment)

— if person A’s face is pointed at person B’s face (so that if A is
talking, A is likely talking to B) [not yet implemented, but can be
done soon]

— the volume of a person’s voice

— via speech-to-text, what people are saying

— where a person’s hand is pointing [not yet implemented, but can be done soon]

— when a person is moving, leaving or arriving [not yet implemented,
but can be done soon]

— when a person sits down or stands up [not yet implemented, but can
be done soon]

— a person’s gender (woman/man), and maybe approximate age
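
As a very rough illustration of the granularity this implies (not a
spec for the actual perception code; the record fields and label names
here are hypothetical), the percept stream could be as simple as
timestamped predicate records, e.g. in Python:

    # A minimal, hypothetical sketch of the percept stream implied by the
    # Basic Perceptions above; field names are illustrative, not the
    # actual perception API.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Percept:
        timestamp: float               # seconds since session start
        predicate: str                 # e.g. "face-at", "talking", "looking-at", "says"
        subject: str                   # persistent face label, e.g. "person-1"
        target: Optional[str] = None   # another person's label, for relational percepts
        value: Optional[object] = None # e.g. (x, y, z) position, utterance text, volume

    # Example fragment: person-1 faces person-2 and says something unhappily.
    stream = [
        Percept(12.4, "face-at",    "person-1", value=(0.3, 1.1, 2.0)),
        Percept(12.4, "looking-at", "person-1", target="person-2"),
        Percept(12.6, "talking",    "person-1"),
        Percept(12.6, "says",       "person-1", value="I'm tired"),
        Percept(12.6, "volume",     "person-1", value=0.4),
        Percept(12.6, "sentiment",  "person-1", value="unhappy"),
    ]

In OpenCog proper these percepts would of course live in the AtomSpace
as Atoms (e.g. time-stamped EvaluationLinks) rather than Python
records; the point here is just the level of detail being assumed.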

EXAMPLES OF LANGUAGE ABOUT THESE BASIC PERCEPTIONS

While simple, this set of initial basic perceptions lets a wide
variety of linguistic constructs be uttered, e.g.

Bob is looking at Ben

Bob is telling Jane some bad news

Bob looked at Jane before walking away

Bob said he was tired and then sat down

People more often talk to the people they are next to

Men are generally taller than women

Jane is a woman

Do you think women tend to talk more quietly than men?

Do you think women are quieter than men?

etc. etc.

It seems clear that this limited domain nevertheless supports a large
amount of linguistic and communicative complexity.
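
To make the sentence-to-percept connection concrete, here is a
hypothetical grounding of the first example, continuing the
illustrative Percept/stream sketch from the Basic Perception section
(it reuses those definitions, and assumes "person-1" and "person-2"
have been introduced as Bob and Ben):

    # Hypothetical grounding of "Bob is looking at Ben": the sentence
    # holds at time t if a looking-at percept with the right subject and
    # target is active within a small window around t.
    def grounds_looking_at(stream, a, b, t, window=1.0):
        return any(p.predicate == "looking-at" and p.subject == a
                   and p.target == b and abs(p.timestamp - t) <= window
                   for p in stream)

    print(grounds_looking_at(stream, "person-1", "person-2", 12.5))  # True

The learning task, roughly, is to get from paired sentences and
percept streams to reusable grammatical and semantic knowledge, rather
than hand-coding groundings like this one.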

SECOND STAGE OF PERCEPTIONS

A second stage of perceptual sophistication, beyond the basic
perceptions, would be to recognize a closed class of objects, events
and properties (a sketch of such closed vocabularies follows the lists
below), e.g.:

Objects:
— Feet, hands, hair, arms, legs (we should be able to get a lot of
this from the skeleton tracker)
— Beard
— Glasses
— Head
— Bottle (e.g. water bottle), cup (e.g. coffee cup)
— Phone
— Tablet

Properties:
— Colors: a list of color values can be recognized, I guess
— Tall, short, fat, thin, bald — for people
— Big, small — for a person
— Big, small — for bottle or phone or tablet

Events:
— Handshake (between people)
— Kick (person A kicks person B)
— Punch
— Pat on the head
— Jump up and down
— Fall down
— Get up
— Drop (object)
— Pick up (object)
— Give (A gives object X to B)
— Put down (object) on table or floor
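
To pin down what "closed class" means in practice (purely
illustrative; the actual recognizer labels may differ), the idea is
that the Second Stage recognizers only ever emit labels drawn from
fixed vocabularies along these lines:

    # Illustrative closed-class vocabularies for the Second Stage
    # perceptions; the labels and spellings are assumptions, not a
    # decided format.
    OBJECTS = {"foot", "hand", "hair", "arm", "leg", "beard", "glasses",
               "head", "bottle", "cup", "phone", "tablet"}

    PROPERTIES = {"color", "tall", "short", "fat", "thin", "bald",
                  "big", "small"}

    EVENTS = {"handshake", "kick", "punch", "pat-on-head",
              "jump-up-and-down", "fall-down", "get-up", "drop",
              "pick-up", "give", "put-down"}

    def is_valid_event_label(label: str) -> bool:
        """Reject any event label outside the agreed closed class."""
        return label in EVENTS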


CORPUS PREPARATION

While the crux of the proposed project is learning via real-time
interaction between the robot and humans, in the early stages it will
also be useful to experiment with “batch learning” from videos of
human interactions, recorded from the robot’s point of view.

As one part of supporting this effort, I’d suggest that we

1) create a corpus of videos of 1-5 people interacting in front of the
robot, from the robot’s cameras

2) create a corpus of sentences describing the people, objects and
events in the videos, associating each sentence with a particular
time-interval in one of the videos (a possible record format is
sketched after this list)

3) translate the sentences to Lojban and add them to our parallel
Lojban corpus, so we can be sure we have good logical mappings of all
the sentences in the corpus
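
Just as a sketch of what the annotation records tying (1), (2) and (3)
together might look like (the JSON layout, field names and filename
are assumptions, not a decided format):

    # Hypothetical corpus annotation record: one sentence tied to a time
    # interval in one of the robot-camera videos, plus its Lojban
    # translation; stored e.g. as one JSON object per line.
    import json

    record = {
        "video": "robot-cam-session-07.mp4",   # hypothetical filename
        "start_sec": 83.0,
        "end_sec": 91.5,
        "english": "Bob said he was tired and then sat down",
        "lojban": None,   # filled in by the Lojban translation pass, step 3
    }

    with open("corpus.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")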

Obviously, including the Second Stage perceptions along with the Basic
Perceptions allows a much wider range of descriptions, e.g. …

A tall man with a hat is next to a short woman with long brown hair

The tall man is holding a briefcase in his left hand

The girl who just walked in is a midget with only one leg

Fred is bald

Vytas fell down, then Ruiting picked him up

Jim is pointing at her hat.

Jim pointing at her hat and smiling made her blush.

However, for initial work, I would say it’s best if at least 50% of
the descriptive sentences involve only Basic Perceptions … so we can
get language learning experimentation rolling right away, without
waiting for extended perception…

LANGUAGE LEARNING

What I then suggest is that we

1) Use the ideas from Linas & Ben’s “unsupervised language learning”
paper to learn a small “link grammar dictionary” from the corpus
mentioned above.  Critically, the features associated with each word
should include features from non-linguistic PERCEPTION, not just
features from language.  (The algorithms in the paper support this,
even though non-linguistic features are only mentioned there very
briefly.)  There are various ways to use PLN inference chaining and
Shujing’s information-theoretic Pattern Miner (both within OpenCog) in
the implementation of these ideas…
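
As one small, concrete illustration of what "features from
non-linguistic PERCEPTION" could mean for point (1) (this is a sketch
of a single ingredient, not the pipeline from the paper; the corpus
format and names are assumptions):

    # Sketch: pointwise mutual information between words and mixed
    # linguistic + perceptual features, computed from annotated
    # (sentence, percepts) pairs.  Illustrative only.
    import math
    from collections import Counter

    def pair_counts(corpus):
        """corpus: iterable of (word list, set of percept labels) pairs."""
        counts = Counter()
        for words, percepts in corpus:
            for i, w in enumerate(words):
                feats = set(percepts)                       # perceptual features
                feats |= {"L:" + x for x in words[:i]}      # words to the left
                feats |= {"R:" + x for x in words[i + 1:]}  # words to the right
                for f in feats:
                    counts[(w, f)] += 1
        return counts

    def pmi(w, f, counts):
        """Pointwise mutual information of a (word, feature) pair."""
        n = sum(counts.values())
        p_wf = counts[(w, f)] / n
        p_w = sum(c for (w2, _), c in counts.items() if w2 == w) / n
        p_f = sum(c for (_, f2), c in counts.items() if f2 == f) / n
        return math.log2(p_wf / (p_w * p_f)) if p_wf else float("-inf")

    # Toy usage with two annotated utterances:
    corpus = [
        ("bob is looking at ben".split(),  {"looking-at", "talking"}),
        ("bob is talking to jane".split(), {"talking"}),
    ]
    print(pmi("looking", "looking-at", pair_counts(corpus)))

The point is just that percept labels enter the statistics on the same
footing as word-context features, so the learned word categories and
disjuncts can reflect perceptual as well as distributional
regularities.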

2) Once (1) is done, we then have a parallel corpus of quintuples of the form

[audiovisual scene, English sentence, parse of sentence via link
grammar with learned dictionary, Lojban sentence, PLN-Atomese
interpretation of Lojban sentence]

We can take the pairs

[parse of sentence via link grammar with learned dictionary,
PLN-Atomese interpretation of Lojban sentence]

from this corpus and use them as the input to a pattern mining process
(maybe a suitably restricted version of the OpenCog Pattern Miner,
maybe a specialized implementation), which will mine ImplicationLinks
serving the function of current RelEx2Logic rules.

The above can be done for sentences about Basic Perceptions only, and
also for sentences about Second Stage Perceptions.
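
To make the mining step a bit more concrete, here is a rough sketch
over a deliberately flattened representation (just sets of
link-grammar link types and logical predicate types per sentence); the
real version would operate on Atomese via the Pattern Miner or a
specialized implementation, and would use surprisingness rather than
raw counts:

    # Hypothetical sketch of mining recurring link-type -> predicate-type
    # mappings from (parse, logical interpretation) pairs.  Candidate
    # mappings found this way would play the role of RelEx2Logic-style
    # rules (as ImplicationLinks, in the real Atomese version).
    from collections import Counter
    from itertools import product

    def mine_candidate_rules(pairs, min_count=2):
        """pairs: iterable of (set of link types, set of predicate types)."""
        cooc = Counter()
        for links, preds in pairs:
            for link, pred in product(links, preds):
                cooc[(link, pred)] += 1
        return [(link, pred, c) for (link, pred), c in cooc.items()
                if c >= min_count]

    # Toy usage:
    pairs = [
        ({"S", "O"}, {"agent", "patient"}),   # e.g. "Bob kicked Ben"
        ({"S"},      {"agent"}),              # e.g. "Jane talked"
        ({"S", "O"}, {"agent", "patient"}),   # e.g. "Jim dropped the cup"
    ]
    print(mine_candidate_rules(pairs))

Raw co-occurrence counts like this over-generate (here they would also
pair "O" with "agent"), which is exactly why an information-theoretic
surprisingness measure, as in the Pattern Miner, is the right tool for
the real version.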

NEXT STEPS FOR LANGUAGE LEARNING

The link grammar dictionary learned as described above will have
limited scope.  However, it can potentially be used as the SEED for a
larger link grammar dictionary to be learned from unsupervised
analysis of a larger text corpus, for which nonlinguistic correlates
of the linguistic constructs are not available.   This will be a next
step of experimentation.
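
One simple way to picture the seeding (just an assumption about how it
might be wired up, not a worked-out design): word-category assignments
from the small, perceptually grounded dictionary are kept fixed, or
heavily weighted, when the larger unsupervised run assigns categories:

    # Hypothetical seeding: words already categorized in the small
    # grounded dictionary keep their category; other words get whatever
    # category the larger-corpus clustering assigns them.
    seed_dictionary = {"look": "verb-class-3", "bob": "given-name-class"}

    def initial_category(word, cluster_from_big_corpus):
        return seed_dictionary.get(word, cluster_from_big_corpus(word))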

NEXT STEPS FOR INTEGRATION

Obviously, what can be done with simple perceptions can also be done
with more complex perceptions … the assumption of simple perceptions
is made because that’s what we have working or almost-working right
now … but Hanson Robotics will put significant effort into making
better visual perception for their robots, and as this becomes a
reality we will be able to use it within the above process…



-- 
Ben Goertzel, PhD
http://goertzel.org

Super-benevolent super-intelligence is the thought the Global Brain is
currently struggling to form...
