Hello Francisco,
Thank you very much for the elaborations! They have helped me a lot to
narrow down the issue that I previously called syntax, probably in a bit
too general a way. Traditional syntax in the purely grammatical sense
is, I now understand, something that other parts of the language
processing mechanism (in the brain/neocortex as well as in NLP) need
to deal with, using the word representations as input.
What I have actually been thinking about is that context may completely
change the meaning of a word, rather than merely modify it. Negation is
one example ("not eating" != "eating", but also != "drinking");
multi-word expressions (MWEs) are another: "New York" refers to
something very different from what the words "new" and "York" refer to
in isolation. I suppose this is especially hard for MWEs that occur
less frequently than "New York". To my understanding, semantic
fingerprinting tackles that issue mostly by exploiting topical context:
for instance, if "New York" co-occurs with "USA" (and/or other related
words), we can "subtract" meanings that refer to York, UK, etc. Am I on
the right track there?
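To make that "subtracting" intuition concrete, here is a minimal Python sketch. The fingerprints below are toy sets of active bit positions that I made up for illustration; they are not real Cortical.io retina output, and the set-difference step is just one naive way such context-based disambiguation could work:

```python
# Toy "semantic fingerprints" as sets of active bit positions (invented
# for illustration; real word-SDRs are much larger and sparser).
ny_city = {1, 2, 3, 10, 11}   # "New York" (US-city sense plus UK-related bits)
york_uk = {10, 11, 12, 13}    # "York" (UK-city sense)
usa = {1, 2, 4, 5}            # topical context word: "USA"

def overlap(a, b):
    """Number of shared active bits -- the basic SDR similarity measure."""
    return len(a & b)

# The context fingerprint overlaps the US-city sense far more than the
# UK one, so co-occurring words indicate which reading is meant ...
assert overlap(ny_city, usa) > overlap(york_uk, usa)

# ... and the bits shared with the competing "York, UK" sense can be
# "subtracted" away, leaving only the US-city part of the fingerprint.
disambiguated = ny_city - york_uk
```

Here the set intersection and difference stand in for the boolean aggregation the whitepaper describes; whether the actual system subtracts competing senses this directly is exactly what I am asking about.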
Anyway, I am currently not able to raise any new points on that matter.
I think the question I am pondering concerns the boundary between
semantics and syntax, which is often not that clear. I am not sure that
there exist ideal word representations that could serve as sufficient
semantic input for a subsequent syntax parser if the two components work
independently of each other. However, I am sure that you have worked on
related issues; I am just curious about opinions and potential solutions.
Carsten
On 29 Nov 2015, at 21:17, Francisco Webber wrote:
> Hello Carsten,
> Thank you for sharing your thoughts on this.
> As a first point, Semantic Folding is about the representation of word
> semantics. The theory has to accommodate two main constraints:
> - It has to represent the “aboutness” of words in a way that a real
> neocortex could use (SDRs), and it has to be based exclusively on real
> experiences (Special Case Experiences).
> - The resulting word-SDRs have to be compositional by nature, meaning
> that the “aboutness” of a text can be constructed by adding up the word
> “aboutness”es.
>
> So one could say that the sentence “fox eats rodent” is about foxes,
> eating and rodents. And the sentence “sheep do not eat rodent” is about
> sheep, not eating and rodents.
> If we want to become more specific, we need to identify the meaning of
> the “aboutness” by taking (among other hints) the sequence of words into
> account: that's where HTM comes into play.
> Encoding the meaning into the sequence depends on the algorithm used.
> This specific sequence encoding can then, retrospectively, be
> interpreted as syntax. I think it is important to realise that
> linguistic categories are “a posteriori” observations formulated as
> properties and rule sets. Some of these rules and properties (the fact
> that there are nouns, verbs and adjectives) seem to be stable across
> languages and are probably more closely bound to the HTM algorithm than
> others. Chomsky and Pinker even came to speak of an “inner language”
> (mentalese).
> With the increasing complexity of language, some of the sequence
> information is then transferred back to the “aboutness” level of words
> by creating morphological variants, such as modifying a noun to indicate
> a genitive form in a referral. This way, the morphological variant of a
> word gets its own word-SDR, allowing it to represent different notions
> of the same word in frequently recurring contexts. As a result, the
> word-SDR for the word “apple” is different from that for its plural form
> “apples”: the first is strongly ambiguous between fruits and computers,
> whereas the second isn't.
>
> Semantic Folding needs the sequence learning capabilities of HTM to get
> from “aboutness” to meaning and HTM needs the “aboutness” to effectively
> (avoiding the combinatorial explosion) encode and decode meaning through
> sequencing.
>
> As for your question about lemmatisation and tokenisation: there is no
> such preprocessing when the language definition corpus is processed.
> Every distinct word identified by simple delimiters (punctuation and
> white space) is treated on its own. As similar terms have similar
> word-SDRs, the word-SDR of “horse” ends up being very similar to the
> word-SDR of “horses” by Special Case Experiences alone. The concept of
> “plural” becomes an a posteriori observation.
>
> Hope this helps clarify.
>
> Francisco
>
>> On 27 Nov 2015, at 09:21, Carsten Schnober
>> <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>> Dear list,
>> Thanks for the good read. I am happy to (hopefully) start the discussion.
>>
>> The first issue that comes to my mind is syntax: this concerns every
>> bag-of-words approach, including Cortical.io's.
>> The most obvious source of "natural language misunderstanding" in this
>> context is negation, as easily demonstrated in this example:
>>
>> - fox eats rodent
>> - sheep do not eat rodent
>>
>> I suppose the presented algorithm would learn from this that both
>> foxes and sheep eat rodents, wouldn't it? This is probably more harmful
>> when classifying sentences than during learning, because a corpus of
>> reasonable size presumably contains a sufficient number of examples
>> that are phrased in a more straightforward way.
>> More complex examples, including more subtle negation, relative clauses
>> etc., will pose much larger challenges.
>>
>> I am quite sure that the syntax issue has been discussed in this
>> context. However, I couldn't find any references, either in the
>> theoretical or in the practical part of the whitepaper. I am very
>> interested in Cortical.io's experiences with that problem and what
>> possible (future) solutions might look like.
>>
>> In statistical NLP, this issue has been tackled (more or less
>> successfully) with methods such as recurrent neural networks or sliding
>> windows across multiple words, amongst others. Neither of these
>> approaches seems applicable here without taking away a fundamental and
>> very handy property of the SDRs: that they can be efficiently
>> aggregated with boolean operations.
>>
>> Although the syntax issue might be almost irrelevant for many practical
>> use cases such as document classification, I think it raises an
>> interesting theoretical question. How does the human brain process
>> syntax and, more interestingly, how can this be incorporated into the
>> presented theory?
>>
>>
>> A slightly more technical issue I've stumbled across is word inflection.
>> The whitepaper briefly mentions morphemes which are, according to
>> linguistic theory, "the smallest meaningful units" in language. I
>> understand that working on the word level is sufficient in most cases
>> and much easier for practical reasons (tokenization is relatively easy).
>> I wonder how this is handled in practice though, for instance when
>> learning a new "language definition corpus". Are the words
>> automatically lemmatized? What if a new language is learned for which
>> no lemmatizers are available? Is mere stemming applied in that case?
>> What happens if different word forms express different meanings?
>>
>> Thanks for any input on these issues!
>> Carsten
>>
>>
>>
>>
>> On 25 Nov 2015, at 19:13, Fergal Byrne wrote:
>>> Nice, Francisco, thanks for letting us know. I've read the paper, very
>>> well put together. Looking forward to discussions and questions on
>>> the list.
>>>
>>> --
>>>
>>> Fergal Byrne, Brenter IT
>>>
>>> Author, Real Machine Intelligence with Clortex and NuPIC
>>> https://leanpub.com/realsmartmachines
>>>
>>> Speaking on Clortex and HTM/CLA at euroClojure Krakow, June 2014:
>>> http://euroclojure.com/2014/
>>> and at LambdaJam Chicago, July 2014: http://www.lambdajam.com
>>>
>>> http://inbits.com - Better Living through Thoughtful Technology
>>> http://ie.linkedin.com/in/fergbyrne/ - https://github.com/fergalbyrne
>>>
>>> e:[email protected] <http://gmail.com> t:+353 83 4214179
>>> Join the quest for Machine Intelligence at http://numenta.org
>>> Formerly of Adnet [email protected]
>>> http://www.adnet.ie
>>>
>>>
>>> On Wed, Nov 25, 2015 at 4:33 PM, Chandan Maruthi
>>> <[email protected]
>>> <mailto:[email protected]> <mailto:[email protected]>>
>>> wrote:
>>>
>>> Francisco
>>>
>>> This is great , looking forward to read this today
>>>
>>> On Wednesday, November 25, 2015, cogmission (David Ray)
>>> <[email protected]
>>> <mailto:[email protected]> <mailto:[email protected]>>
>>> wrote:
>>>
>>> Hi Francisco,
>>>
>>> This will make for a very interesting and informative read!
>>> Can't wait!
>>>
>>> Cheers,
>>> David
>>>
>>> On Wed, Nov 25, 2015 at 8:38 AM, Pascal Weinberger
>>> <[email protected] <mailto:[email protected]>>
>>> wrote:
>>>
>>> Great!
>>> I was waiting for this a long time :D
>>> Will make my day! :)
>>>
>>> Thank you!
>>>
>>>
>>>
>>> Best,
>>>
>>> Pascal Weinberger
>>>
>>> ____________________________
>>>
>>> BE THE CHANGE YOU WANT TO SEE IN THE WORLD ...
>>>
>>>
>>>
>>>
>>>
>>> On 25 Nov 2015, at 14:49, Francisco Webber
>>> <[email protected] <mailto:[email protected]>> wrote:
>>>
>>>> Hello all,
>>>> For everyone interested in the theoretical background to
>>>> Cortical.io’s technology:
>>>>
>>>> The Semantic Folding white paper is out in its first
>>>> incarnation:
>>>>
>>>> Download full White Paper
>>>>
>>>> <http://www.cortical.io/static/downloads/semantic-folding-theory-white-paper.pdf>
>>>>
>>>> All the Best
>>>>
>>>> Francisco
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> /With kind regards,/
>>>
>>> David Ray
>>> Java Solutions Architect
>>>
>>> *Cortical.io <http://cortical.io/>*
>>> Sponsor of: HTM.java <https://github.com/numenta/htm.java>
>>>
>>> [email protected] <mailto:[email protected]>
>>> http://cortical.io <http://cortical.io/> <http://cortical.io/>
>>>
>>>
>>>
>>> --
>>> Regards
>>> Chandan Maruthi
>>>
>>>
>>>
>>
>
--
Carsten Schnober
Doctoral Researcher
Ubiquitous Knowledge Processing (UKP) Lab
FB 20 / Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
[email protected]
www.ukp.tu-darmstadt.de
Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources
(AIPHES): www.aiphes.tu-darmstadt.de
PhD program: Knowledge Discovery in Scientific Literature (KDSL)
www.kdsl.tu-darmstadt.de