Speculating a little further on this...

In word2vec one trains a neural network to do the following. Given a
specific word in the middle of a sentence (the input word), one looks
at the words nearby and picks one at random.  The network is going to
tell us the probability -- for every word in our vocabulary -- of that
word being the “nearby word” that we chose.

Suppose we use word2vec on a vocabulary of 10K words and try to
project the words into vectors of 300 features.

Then the input layer has 10K neurons (one per word), only one of which
is active at a time; the hidden layer has 300 neurons; and the output
layer has 10K neurons... the vector for a word is then given by that
word's weights into the hidden layer...

(see http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
for a simple overview...)
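
As a minimal sketch of that architecture (numpy, toy random weights,
illustrative only -- real training uses negative sampling etc.):

import numpy as np

vocab_size, embed_dim = 10_000, 300

# Input->hidden weights: row i is the 300-feature vector for word i.
W_in = 0.01 * np.random.randn(vocab_size, embed_dim)
# Hidden->output weights.
W_out = 0.01 * np.random.randn(embed_dim, vocab_size)

def skipgram_forward(word_id):
    # The one-hot input just selects a single row of W_in.
    hidden = W_in[word_id]                 # shape (300,)
    logits = hidden @ W_out                # shape (10000,)
    e = np.exp(logits - logits.max())
    return e / e.sum()                     # P(context word | input word)

probs = skipgram_forward(42)       # distribution over the whole vocabulary
word_vector = W_in[42]             # "the vector for a word" = its input weights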

This is cool but not necessarily the best way to do this sort of thing, right?

An alternate approach in the spirit of InfoGAN would be to try to
learn a "generative" network that, given an input word W, outputs the
distribution of words surrounding W ....   There would also be an
"adversarial" network that would try to distinguish the distributions
produced by the generative network from the empirical distributions
observed around the actual word....  The generative network could have
some latent variables that are supposed to be informationally
correlated with the distribution produced...

One would then expect/hope that the latent variables of the generative
model would correspond to relevant linguistic features... so one would
get shorter and more interesting vectors than word2vec gives...
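
Very roughly, and just to make that concrete (a hedged PyTorch sketch;
the sizes, the way the latent code is injected, and the whole training
loop are placeholders, not a worked-out design):

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, LATENT, HIDDEN = 10_000, 32, 300

class Generator(nn.Module):
    """Given a word W and a latent code c, output a distribution over the
    vocabulary, meant to imitate W's true surrounding-word distribution."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.net = nn.Sequential(
            nn.Linear(HIDDEN + LATENT, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, VOCAB))
    def forward(self, word_ids, c):
        h = torch.cat([self.embed(word_ids), c], dim=-1)
        return F.softmax(self.net(h), dim=-1)

class Discriminator(nn.Module):
    """Tries to tell generated context distributions from empirical ones."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(VOCAB, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1))
    def forward(self, dist):
        return torch.sigmoid(self.net(dist))

# The InfoGAN-style part would add a head reconstructing c from the
# generated distribution, keeping c informationally correlated with it;
# c (alone or together with the embedding) is then the short vector for W.
gen, disc = Generator(), Discriminator()
words = torch.randint(0, VOCAB, (8,))
c = torch.randn(8, LATENT)
fake = gen(words, c)          # batch of generated context distributions
score = disc(fake)            # real/fake judgement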

Suppose that in such a network, for "words surrounding W", one used
"words linked to W in a dependency parse"....  Then the latent
variables of the generative model mentioned above should capture the
relevant syntactico-semantic aspects of the syntactic relationships
that W displays in the dependency parse....

Clustering these latent-variable vectors should give very nice
clusters, which can then be used to define new variables ("parts of
speech") for the next round of dependency parsing in our language
learning algorithm...
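
A sketch of that last step, assuming we have already extracted one
latent vector per word into a matrix (the numbers below are just
placeholders):

import numpy as np
from sklearn.cluster import KMeans

latents = np.random.randn(10_000, 32)   # stand-in for the learned latent codes
labels = KMeans(n_clusters=50, n_init=10).fit_predict(latents)
# labels[i] is the induced "part of speech" tag for word i, usable as a
# new category symbol in the next round of dependency parsing.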

-- Ben


On Sat, Apr 8, 2017 at 2:24 AM, Jesús López
<jesus.lopez.salva...@gmail.com> wrote:
> Hello Ben and Linas,
>
> Sorry for the delay, I was reading the papers. About additivity: in
> Coecke et al.'s program you turn a sentence into a *multilinear* map
> that goes from the vectors of the words having elementary syntactic
> category to a semantic vector space, the sentence meaning space. So
> yes, there is additivity in each of these arguments (which, by the
> way, should have consequences for those beautiful word2vec relations
> of France - Paris ~= Spain - Madrid, though I haven't seen a
> description).
>
> As I understand it, your goal is to go from plain text to logical
> forms in a probabilistic logic, and you have two stages: parsing from
> plain text to a pregroup grammar parse structure (I'm not sure that
> the parse trees I spoke of before are really trees, hence the change
> to 'parse structure'), and then going from that parse structure (via
> RelEx and RelEx2Logic, if that's ok) to a lambda calculus term bearing
> the meaning, with a kind of probability and another number attached
> extrinsically.
>
> How does Coecke's program (and from now on that unfairly includes all
> the et als.) fit into that picture? I think the key observation is
> when Coecke says that his framework can be interpreted, as a
> particular case, as Montague semantics. Though adorned by linguistic
> considerations, this semantics is well known to be amenable to
> computation, and a toy version is shown in chapter 10 of the NLTK
> book, where they show how lambda calculus represents a logic that has
> a model theory. That is important because all those lambda terms have
> to be actual functions with actual values.
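>
> For instance, in NLTK the lambda terms are concrete objects that you
> can build and beta-reduce (a minimal example in the style of ch. 10):
>
> from nltk.sem import Expression
> read_expr = Expression.fromstring
>
> expr = read_expr(r'\x.(walk(x) & chew_gum(x))(gerald)')
> print(expr.simplify())    # (walk(gerald) & chew_gum(gerald))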
>
> How exactly does Coecke's framework reduce to Montague semantics?
> That matters, because if we understand how Montague semantics is a
> particular case of Coecke's, we can think in the opposite direction
> and see Coecke's semantics as an extension.
>
> As a starting point we have the fact that Coecke semantics can be
> summarized as a monoidal functor that sends a morphism of a compact
> closed category in syntax-land (the pregroup grammar parse structure
> resulting from parsing the plain text of a sentence) to a morphism of
> a compact closed category in semantics-land, the category of real
> vector spaces, that morphism being a (multi)linear map.
>
> The definition of Coecke's semantic functor, however, hardly needs
> any modification if we use as target the compact closed category of
> modules over a fixed semiring. If the semiring is that of booleans, we
> are talking about the category of relations between sets, with the
> Peirce relational product (uncle = brother * father) expressed with
> the same matrix product formula as in linear algebra, and with the
> cartesian product as the tensor product that makes it monoidal.
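>
> For example (a toy sketch with an invented three-person universe),
> that relational product is literally a matrix product computed over
> the boolean semiring, with OR and AND in place of + and *:
>
> import numpy as np
>
> # Universe {0: bob, 1: dan, 2: eve}; entries are True/False facts.
> brother = np.array([[0, 1, 0],     # bob is a brother of dan
>                     [0, 0, 0],
>                     [0, 0, 0]], dtype=bool)
> father  = np.array([[0, 0, 0],
>                     [0, 0, 1],     # dan is the father of eve
>                     [0, 0, 0]], dtype=bool)
>
> # uncle[x, z] = OR_y (brother[x, y] AND father[y, z])
> uncle = (brother.astype(int) @ father.astype(int)).astype(bool)
> print(uncle[0, 2])     # True: bob is an uncle of eve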
>
> The idea is that when Coecke's semantic functor has as its codomain
> the category of relations, one obtains Montague semantics. More
> exactly, when one applies the semantic functor to a pregroup grammar
> parse structure of a sentence, one obtains the lambda term that
> Montague would have attached to it. Naturally the question is how
> exactly to unfold that abstract notion. The folk joke about 'abstract
> nonsense' forgets that there is a down button in the elevator.
>
> Well, this would be lengthy here, but the way I started to come to
> grips with it was by bringing the CCG linguistic formalism into the
> equation. A fast and good slide show of how one goes from plain text
> to CCG derivations, and from derivations to classic
> Montague-semantics lambda terms, can be found in [1].
>
> One important feature of CCG is that it is lexicalized, i.e., all the
> linguistic data necessary to do both syntactic and semantic parsing
> is attached to the words of the dictionary, in contrast with, say,
> NLTK book ch. 10, where the linguistic data is inside the production
> rules of an explicit grammar.
>
> Looking closer at the lexicon (dictionary), each word is supplemented
> with its syntactic category (N/N...) and also with a lambda term,
> compatible with the syntactic category, that is used in semantic
> parsing. Those lambda terms are not magical letters: for the lambda
> terms to have a true model-theoretic semantics they must correspond
> to specific functions.
>
> The good thing is that the work of porting Coecke semantics to CCG
> (instead of pregroup grammar) is already done, in [2]. The details
> are there, but the thing I want to highlight is that in this case,
> when one does Coecke semantics with CCG parsing, the structure of the
> lexicon changes. One retains the words and their associated syntactic
> categories. But now, instead of the lambda terms (with their
> corresponding interpretation as actual relations/functions), one has
> vectors and tensors for simple and compound syntactic categories (say
> N vs. N/N) respectively. When those tensors/vectors are over the
> booleans one recovers Montague semantics.
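>
> Concretely (a toy sketch, all names and entries invented): a word of
> category N gets a vector over the entities, a word of category N/N
> gets a matrix, and applying the N/N word is just a matrix-vector
> contraction; with boolean entries this is the crisp Montague-style
> reading, with real entries it is the graded Coecke-style one:
>
> import numpy as np
>
> # Entities {0: rex, 1: felix, 2: tweety}; "dog" : N is the set of dogs.
> dog = np.array([1, 0, 0], dtype=bool)
>
> # "old" : N/N is a matrix; applied to a noun vector it yields another
> # noun-type vector.  Entries purely illustrative.
> old = np.array([[1, 0, 0],
>                 [0, 0, 0],
>                 [0, 0, 0]], dtype=bool)
>
> old_dog = (old.astype(int) @ dog.astype(int)).astype(bool)  # again of type N
> print(old_dog)    # [ True False False]: rex is an old dog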
>
> In the Coecke general case, sentences mean vectors in a real vector
> space, and the benefits start with its inner product, and hence norm
> and metric, so you can measure sentence similarity quantitatively
> (with suitably normalized vectors...).
>
> CCG is very nice in practical terms. An open state-of-the-art parser
> implementation is [3], described in [4], to be compared with [5]
> ("The parser finds the optimal parse for 99.9% of held-out
> sentences"). OpenCCG is older but does both parsing and generation.
>
> One thing that I don't understand well in the above is that the
> category of vector spaces over a fixed field (or even the
> finite-dimensional ones) is *not* cartesian closed. While in the
> presentation of Montague semantics in NLTK book ch. 10 the lambda
> calculus appears to be untyped, more faithful presentations seem to
> require a simply typed or even more complex calculus/logic. In that
> case the semantic category would perhaps have to be cartesian closed,
> supporting in particular higher-order maps.
>
> That's all on the expository front, and now some speculation.
>
> Up to now the only tangible enhancement brought by Coecke semantics
> is the motivation of a metric among sentence meanings. What we really
> want is a mathematical motivation to probabilize the crisp, hard-fact
> character of the interpretation of sentences as Montague lambda
> terms. How do we attack the problem?
>
> One idea is to experiment with other kinds of semantic category as
> the target of the Coecke semantic functor. To be terse, this can be
> explored by means of a monad on a vanilla unstructured base category
> such as finite sets. One can make several choices of endofunctor to
> specify the corresponding monad. The proposed semantic category is
> then its Kleisli category. These categories are monoidal and have a
> revealing diagrammatic notation.
>
> 1.- Powerset endofunctor. This gives rise to the category of sets and
> relations, with cartesian product as the monoidal operation. Coecke
> semantics then results in Montagovian hard facts as described above.
> Coecke and Kissinger's new book [6] details the particulars of the
> diagrammatic language.
> 2.- Vector space monad (over the reals). Since the sets are finite,
> the Kleisli category is that of finite-dimensional real vector
> spaces. That is properly Coecke's framework for computing sentence
> similarity. Circuit diagrams are tensor networks where boxes are
> tensors and wires are contractions of specific indices.
> 3.- A monad in quantum computing is shown in [7], and quantumly
> motivated semantics is specifically addressed by Coecke. The whole
> book [8] discusses the connection, though I haven't read it. Circuit
> diagrams should be quantum circuits representing possibly unitary
> processes. Quantum amplitudes give rise, through measurement, to
> classical probabilities.
> 4.- The Giry monad here results from the functor that produces all
> formal convex linear combinations of the elements of a given set. The
> Kleisli category is very interesting, having as maps probabilistic
> mappings that under the hood are just conditional probabilities.
> These maps allow a more user-friendly understanding of Markov chains,
> Markov decision processes, HMMs, POMDPs, naive Bayes classifiers and
> Kalman filters. Circuit diagrams have to correspond to the
> factor-diagram notation of Bayesian networks [9], and the law of
> total probability generalizes, in Bayesian networks, to the
> linear-algebra tensor network calculations of the corresponding
> network (this can be shown in actual Bayesian network software; see
> also the small sketch after this list).
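>
> A minimal sketch of bullet 4 (toy numbers, invented variables): a
> Kleisli map A -> Dist(B) is just a row-stochastic matrix P[a, b] =
> P(b | a), and composing two of them is the law of total probability,
> i.e. an ordinary matrix product:
>
> import numpy as np
>
> P_weather_given_season = np.array([[0.7, 0.3],   # rows: summer, winter
>                                    [0.3, 0.7]])  # cols: sunny, rainy
> P_mood_given_weather   = np.array([[0.9, 0.1],   # rows: sunny, rainy
>                                    [0.4, 0.6]])  # cols: happy, sad
>
> # P(mood | season) = sum_w P(mood | w) * P(w | season)
> P_mood_given_season = P_weather_given_season @ P_mood_given_weather
>
> prior_season = np.array([0.5, 0.5])
> print(prior_season @ P_mood_given_season)  # marginal distribution over moods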
>
> A quote from mathematician Gian Carlo Rota [10]:
>
> "The first lecture by Jack [Schwartz] I listened to was given in the
> spring of 1954 in a seminar in functional analysis. A brilliant array
> of lecturers had been expounding throughout the spring term on their
> pet topics. Jack's lecture dealt with stochastic processes.
> Probability was still a mysterious subject cultivated by a few
> scattered mathematicians, and the expression "Markov chain" conveyed
> more than a hint of mystery. Jack started his lecture with the words,
> "A Markov chain is a generalization of a function." His perfect
> motivation of the Markov property put the audience at ease. Graduate
> students and instructors relaxed and followed his every word to the
> end."
>
> The thing I would research is to use as the semantic category that of
> the generalized functions of the quote above and of bullet 4. So
> basically you replace word2vec vectors by probability distributions
> of the words meaning something, build a Bayesian network from the CCG
> parse, and apply generalized total probability to obtain probabilized
> booleans, i.e. a number 0 <= x <= 1 (instead of just a boolean, as
> with Montague semantics). That is, the probability that a sentence
> holds depends on the distributions of its syntactically elementary
> constituents meaning something, and those distributions are combined
> by the factors of a Bayesian net whose conditional independence
> relations respect and reflect the sentence syntax and have the local
> Markov property. The factors are for words of complex syntactic
> category (such as N/N...), and their attached tensors are
> multivariate conditional probability distributions.
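>
> In the simplest adjective-noun case that would amount to something
> like the following (a toy sketch, all numbers invented):
>
> import numpy as np
>
> # Domain {0: rex, 1: felix}.
> p_dog = np.array([0.9, 0.1])          # P(the noun "dog" denotes entity e)
> p_old_given_e = np.array([0.8, 0.3])  # N/N factor: P(old(e)) for each e
>
> # Generalized law of total probability: a number in [0, 1] for the
> # sentence instead of a crisp boolean.
> p_sentence = float(np.sum(p_dog * p_old_given_e))
> print(p_sentence)    # 0.75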
>
> Hope this helps somehow. Kind regards,
> Jesus.
>
>
> [1] http://yoavartzi.com/pub/afz-tutorial.acl.2013.pdf
> [2] http://www.cl.cam.ac.uk/~sc609/pubs/eacl14types.pdf
> [3] http://homepages.inf.ed.ac.uk/s1049478/easyccg.html
> [4] http://www.aclweb.org/anthology/D14-1107
> [5] https://arxiv.org/abs/1607.01432
> [6] ISBN 1108107710
> [7] https://bram.westerbaan.name/kleisli.pdf
> [8] ISBN 9780199646296
> [9] http://helper.ipam.ucla.edu/publications/gss2012/gss2012_10799.pdf
> [10] Indiscrete Thoughts
>
> On 4/2/17, Linas Vepstas <linasveps...@gmail.com> wrote:
>> Hi Ben,
>>
>> On Sun, Apr 2, 2017 at 3:16 PM, Ben Goertzel <b...@goertzel.org> wrote:
>>
>>>  So e.g. if we find X+Y is roughly equal to Z in the domain
>>> of semantic vectors,
>>>
>>
>> But what Jesus is saying (and what we say in our paper, with all that
>> fiddle-faddle about categories) is precisely that, while the concept
>> of addition is kind-of-ish OK for meanings, we can do even better by
>> replacing it with the correct categorial generalization.
>>
>> That is, addition -- the plus sign -- is a certain specific morphism,
>> and this morphism, the addition of vectors, has the unfortunate
>> property of being commutative, whereas we know that language is
>> non-commutative. The stuff about pregroup grammars is all about
>> identifying exactly which morphism it is that correctly generalizes
>> the addition morphism.
>>
>> That addition is kind-of OK is why word2vec kind-of works. But I think we
>> can do better.
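>>
>> (Concretely -- a tiny illustration with random data: vector addition
>> forgets order, while composition of linear maps does not, which is
>> the kind of structure a categorial treatment can keep.)
>>
>> import numpy as np
>> v, w = np.random.randn(3), np.random.randn(3)
>> A, B = np.random.randn(3, 3), np.random.randn(3, 3)
>> print(np.allclose(v + w, w + v))    # True: addition is commutative
>> print(np.allclose(A @ B, B @ A))    # False (generically): composition is not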
>>
>> Unfortunately, the pressing needs of having to crunch data, and to write
>> the code to crunch that data, prevents me from devoting enough time to this
>> issue for at least a few more weeks or a month. I would very much like to
>> clarify the theoretical situation here, but need to find a chunk of time
>> that isn't taken up by email and various mundane tasks.
>>
>> --linas
>>
>



-- 
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin
